Joint Correlation Alignment-Based Graph Neural Network for Domain Adaptation of Multitemporal Hyperspectral Remote Sensing Images

In this article, we propose a novel deep domain adaptation method based on graph neural network (GNN) for multitemporal hyperspectral remote sensing images. In GNN, graphs are constructed for source and target data, respectively. Then the graphs are utilized in each hidden layer to obtain features. GNN operates on graph structure and the relations between data samples can be exploited. It aggregates features and propagate information through graph nodes. Thus, the extracted features have an improved smoothness in each spectral neighborhood which is beneficial to classification. Furthermore, the domain-wise correlation alignment (CORAL) and class-wise CORAL are jointly embedded in GNN network to achieve a joint distribution adaptation performance. By introducing the joint CORAL strategy in GNN, the extracted features can not only be aligned between domains but also have a superior discriminability in each domain. This domain adaptation network is named as joint CORAL-based graph neural network. Experiments using multitemporal Hyperion and NSF-funded center for airborne laser mapping datasets demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
M ULTITEMPORAL hyperspectral remote sensing images are increasingly available for data analysis and applications [1]. For hyperspectral remote sensing image classification [2], it is cost and difficult to collect labels. If there are labeled data available from other related images and they can be reused to classify a new image, the classification can be achieved without further labeling cost. Multitemporal hyperspectral images are well related and thus it is appealing to reuse the labeled knowledge of a previously acquired image for classification of a new image. However, a classifier directly trained by the Manuscript  labeled data from one image would present poor performance on a new image, since there exists spectral shift between the two images. Spectral properties are affected by a lot of factors for hyperspectral images, such as the changed season, soil moisture, vegetation composition, topography, illumination, and the acquisition angle of the sun [3]- [5]. Domain adaptation is able to solve the problem. It aims to align distributions across domains and obtain an adaptative classifier for the target image by transferring knowledge from previous images. The previous image with abundant labeled data is called source domain, and the new image with few or no labeled data are called target domain.
Over the past few years, many shallow domain adaptation approaches have been proposed in the remote sensing community, and most of them can be classified into three major categories including instance-based methods, classifier-based methods and feature-based methods. The instance-based methods bridge source and target domains by reweighting data instances [6]- [8]. The classifier-based methods attempt to learn a classifier which has good performance on target domain by transferring labeled information from source domain [9]- [11]. The feature-based methods are the most popular strategy for shallow domain adaptation. They aim to learn a shared feature representation where spectral drift between domains can be minimized [12]- [16].
Recently, deep learning has been applied to domain adaptation because of its excellent feature representation ability. Compared with shallow domain adaptation approaches, end-toend deep domain adaptation approaches transfer more effective knowledge by embedding adaptation modules in the network architecture. Maximum mean discrepancy (MMD) is a popular distribution alignment strategy, which is utilized in many deep domain adaptation methods. Tzeng et al. [17] proposed the deep domain confusion method, which firstly regularizes the single adaptation layer of deep neural network (DNN) using linear-kernel MMD. Similar to MMD, correlation alignment (CORAL) could also measure the distribution distance between domains, and it attempts to align the second-order statistics of the source and target features. Sun et al. [18] extended the CORAL to deep architectures and proposed the CORAL for deep domain adaptation (D-CORAL) approach, so that a nonlinear transformation that aligns the correlations of features between the two domains is obtained. Besides, adversarial learning is This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ another increasingly popular deep domain adaptational strategy and it learns the domain-invariant features in a two-player game inspired by generative adversarial networks (GANs) [19]. Ganin et al. [20] first introduced adversarial learning into deep domain adaptation and proposed domain adversarial neural network (DANN), where a domain discriminator is used to obtain the domain-invariant features.
Deep learning has also been applied for domain adaptation of hyperspectral remote sensing images. Wang et al. [21] proposed a domain adaptation method by learning the manifold embedding and matching the discriminative distribution in source domain with neural network. Chen et al. [22] employed the criterion of cycle consistency to achieve features that are both domain-invariant and discriminative. Liu et al. [23] proposed a class-wise adversarial networks and achieved a superior feature-alignment performance. Gross et al. [24] proposed a nonlinear feature normalization alignment approach for domain adaptation of multitemporal hyperspectral images, which is able to mitigate nonlinear effects in hyperspectral data and transfer spectral features from one domain to another. Saha et al. [25] exploited a cycle-consistent GAN to learn deep transcoding between multisensory and multitemporal domains.
Deep domain adaptation methods usually utilize full connected network for feature extraction of hyperspectral data [21]- [23], [26], which cannot exploit the relations between data points. Some domain adaptation networks that utilize CNN for feature extraction are able to extract spatial spectral information [24], [25]. However, CNN is conducted on small patches, which are usually squared spatial windows, and thus the ability in modeling the relations between data points are still limited [27]. Moreover, the classification of pixels on the boundary between different classes may be affected. For the pixel on the boundary, the patch around this pixel contains signals from multiple classes. Thus, the CNN-extracted features cannot accurately represent the class of the central pixel, but are mixed features from multiple classes.
Recently, graph neural networks (GNNs) have been received significant attentions. GNN operates on graph structure and is able to aggregate features and propagate information through graph nodes. Compared to CNN that exploits the local spatial information, GNN utilizes graph to model the data relations in a larger range. The connected samples in the graph are spectral neighbors that have similar spectral properties and may not spatially adjacent, and thus GNN can model longer-range spatial relations than CNN, and the edge information can also be well preserved [27]- [29]. Moreover, GNN not only feeds pixelwise samples into network which is very suitable for pixel level hyperspectral image classification [28], but also feeds a graph into network to exploit the relations between data. The new feature of a node is a weighted sum of its neighboring nodes on the graph. Therefore, the nodes in a neighborhood are more likely to have similar features, and the smoothness of each spectral neighborhood is improved [29].
GNN was first proposed by Gori et al. [30], and the early study propagated the information of neighboring nodes via a recurrent neural architecture in an iterative manner, which is computationally expensive. To overcome these problems, many methods considered to re-define the GNN architecture by adopting the principle of convolution recently. Bruna et al. [31] proposed the definition of graph convolution based on spectral graph theory and updated the representation of graph by utilizing the graph Fourier transform. Defferrard et al. [32] adopted the Chebyshev polynomial to reduce the computational complexity of graph convolution in [31] and considered the K-localized spectral filter, which provided a faster forward propagation. Kipf et al. [33] proposed a first-order approximation of spectral graph filter and designed a simple layer-wise propagation rule, which achieved excellent performance in the semisupervised classification task of graph structure data.
Lately, some works applied GNN for hyperspectral remote sensing data analysis. Mou et al. [29] proposed a semisupervised nonlocal GNN for hyperspectral data classification. Tong et al. [35] introduced an attention-weighted graph into the GNN for hyperspectral image few-shot classification. Hong et al. [28] developed a minibatch GNN, and then combine the convolutional neural network and GNN with several fusion strategies. The proposed method allows us to train a large-scale network in a minibatch fashion and achieves out-of-sample extension without retraining network. Wan et al. [27] proposed a multiscale dynamic GNN, where the graph is dynamically updated during the training process. They further proposed a context-aware dynamic GNN that can capture the long-range contexture relations in hyperspectral data [36].
Due to the advantages of GNN [27]- [29], [34], [36], we applied GNN for feature extraction of multitemporal hyperspectral remote sensing images in this article, which is able to effectively exploit spectral relational information in images. However, GNN can extract features but is incapable of reducing domain shift or conducting domain adaptation. Therefore, domain adaptation strategy should be introduced to make the GNN network generate domain-invariant features. In the proposed domain adaptation network, we exploit GNN as a feature extractor to generate features for both source and target data, and employ the CORAL domain adaptation strategy for knowledge transfer.
It is worth noting that CORAL strategy only focuses on minimizing the domain-level distribution discrepancy, without considering the class-level information among source and target domain. For multitemporal hyperspectral remote sensing data, different land cover types may have different spectral shift. Adaptation of the domain-level distribution is not equivalent to aligning their class conditional distributions. Since the domainwise CORAL cannot explore the class-level relations between the source and target domains, the class-wise CORAL is utilized to further reduce the distribution discrepancy on a per-class basis, which is able to adapt the conditional distributions across domains. Combining the two alignment strategies can achieve a joint distribution adaptation performance. The CORAL conducted on each class is called class-wise CORAL. We jointly conduct domain-wise CORAL and class-wise CORAL to obtain marginal and conditional distribution adaptation results. Our transfer network that exploits joint CORAL-based GNN is denoted as JCGNN in this article.
The main contributions of the proposed JCGNN method are summarized as follows. 1) To our best knowledge, this is the first attempt to introduce GNN and class-wise CORAL into unsupervised domain adaptation of multitemporal hyperspectral remote sensing images. 2) We apply GNN as the feature extractor for domain adaptation, which considers not only the information among spectral bands, but also the relations among neighboring points. It is able to obtain an improved smoothness in each spectral neighborhood and is beneficial to classification.

3) We introduce joint CORAL domain adaptation strategy
to the GNN to achieve both domain-level and class-level domain adaptation performance. The remainder of this article is organized as follows. Section II provides introduction of unsupervised domain adaptation and GNNs. Section III describes the proposed JCGNN in detail. Section IV discusses the related works. Experimental results are shown in Section V and the conclusion is drawn in Section VI.

A. Unsupervised Domain Adaptation
Unsupervised domain adaptation approach assumes there are abundant labeled data in source domain, but no labeled data in target domain. Due to the domain shift across domains, a classifier that directly trained by source domain data would have an unsatisfactory classification performance for target domain data. In this article, we aim to align the data distribution across domains by jointly conducting domain-wise CORAL and classwise CORAL to obtain marginal and conditional distribution adaptation performance. The joint CORAL domain adaptation strategy is introduced to GNN to achieve unsupervised domain adaptation.
Supposed the source domain data is denoted as X s ∈ R N s ×d with labels Y s ∈ R N s ×1 and the target domain data is denoted as X t ∈ R N t ×d without labeled information, where N s and N t represent the number of source and target features, respectively, d is the feature dimensionality. X s and X t are distributed differently, and the domain adaptation aims to obtain the domain-invariant features of the two-domain data. Then the classifier trained on the source data in the comment feature space can be directly applied for classification of target data.

B. Graph Neural Networks
The network structure of GNN is illustrated in Fig. 1 [37], where three hidden layers are used. GNN employs graph to capture the relations between data points, and the graph is utilized in each hidden layer to obtain features. Softmax classifier is applied to the features from the last hidden layer to conduct classification.
In GNN, graph construction is first conducted before the network training [38]. Let X ∈ R N ×d represent a dataset, where N denotes the number of data points and d represents the dimensionality. The graph constructed on X is denoted as G = (X, E, A) with N nodes, where each node corresponds to a data point The graph construction process generally contains two steps: determination of graph adjacency relationships and calculation of graph edge weights. For graph adjacency construction, we utilized k nearest neighbors method to obtain a sparse graph. For each data point x i ࢠX, we calculated the Euclidean distances between x i all the other data points in X, and then selected its k nearest neighbors with the shortest Euclidean distances. In graph G, node x i is only connected to its k nearest neighbors with edges and a sparse graph is thus achieved. For edge weight calculation, Gaussian kernel function is adopted to measure the similarity between two connected nodes. Suppose x i and x j are connected neighbors, and the edge weight between them is defined as [39] where N(x i ) denotes the k nearest neighbors of x i , A ij denotes the element of adjacency matrix A in row i and column j. σ is the Gaussian diffusion kernel parameter, and dist(x i ,x j ) denotes the Euclidean distances between x i and x j which is calculated as It is worth noting that the adjacency matrix A represents the relational information of spectrum between pixels and the construction process do not need any labeled information.
Based on the constructed graph, the features extracted in the (l + 1)th layer of GNN is denoted as [33] where H (l) ∈ R N ×F l is the features in the lth layer and H (0) = X, F l is the number of nodes in the lth layer and F 0 = d, Θ (l) ∈ R F l ×F l+1 is the network parameters. g( · ) denotes the ReLU activation function.Â is the normalized graph adjacency matrix, which is defined as [33] A whereD is a diagonal matrix andD ii = 1+ j A ij . I N is the identity matrix. It can be seen from (3) that each GNN layer firstly generates a new feature representation by performingÂH (l) and then feed the features to a fully connected layer characterized by Θ (l) . The new feature is denoted by [29] For the ith node, the lth layer feature and the new feature are denoted as H i (l) and T i (l) respectively, and the indices of the ith node and its k nearest neighbors are denoted as N i . Its new feature T i (l) is calculated as whereÂ i denotes the normalized weights between the ith node and all the nodes in the graph. The new feature of the ith node is a weighted sum of its neighboring nodes on the graph. Therefore, the nodes in a neighborhood are more likely to have similar features, and then classified as the same category, which can reduce the within-class differences and thus improve the feature discriminability [29]. Supposing the features of the ith data point from the last hidden layer is denoted as z i , the softmax activation function is applied for classification [33] where z ic represents the cth feature of z i , i ∈ {1, . . . , N}, and p ic is the probability of z i that belongs to the cth class.

III. PROPOSED METHOD
The architecture of the proposed JCGNN is shown in Fig. 2, where three-layer GNN is employed for feature extraction. G f denotes the feature extractor. GNN conducts full-batch network learning. The input of GNN for domain adaptation contains the source labeled data X s ∈ R N s ×d with its normalized graph adjacent matrixÂ s ∈ R N s ×N s and the target data X t ∈ R N t ×d with its normalized adjacent matrixÂ t ∈ R N t ×N t . The source labeled data X s is used to train the classifier, and both X s and X t are employed to obtain the domain-invariant features. Two feature extractors with shared weights are applied to source data and target data, respectively, which can also be regarded as one feature extractor with two inputs. All the four graphs in the first row denote the source graph, where the orange nodes represent source data points and the spectrally similar points are connected by edges. Similarly, the four graphs with green nodes in the second row denote the target graph. The produced domain invariant features are denoted as Z s ∈ R N s ×C and Z t ∈ R N t ×C for source data and target data, respectively.
where Z s i denotes the source data of the ith class in the feature space. Target data can also be divided into C classes according to their pseudo labels.
, where Z t i denotes the target data that are predicted as the ith class. With the softmax classification, P s ∈ R N s ×C denote the predicted probability results of source domain and P t ∈ R N t ×C denote the predicted probability results of target domain. The domain-wise CORAL strategy and class-wise CORAL strategy are combined to constrain the output features between domains to be well aligned.

A. Feature Extraction and Classification by GNN
We apply GNN for feature extraction and classification of source data and target data respectively. For source data X s and target data X t , the graphs G s and G t are first constructed. Then two feature extractors with the shard weights are applied to the two-domain data. The three-layer-GNN-generated features are denoted as

Algorithm 1: JCGNN Approach
epochs_1 and epoch_2 are the numbers of iteration. Output: The target probability prediction results P t .

Initialization:
Randomly initializing the parameters of GNN.
Calculating the normalized adjacency matrix A s and A t by (1). Training: stage 1: pre-train and domain-level alignment: for i = 1 → epochs_1 do Generating embedding of both domains: Z s ← (8) Z t ← (9). Calculating loss function L DCGNN using (15). Updating the parameters of feature extractor G f by backpropagating through L DCGNN . end for stage 2: class-level alignment: for i = 1 → epochs_2 do Generating embedding of both domains: Z s ← (8) Z t ← (9). using predicted probability result P t to assign pseudo labels for target domain data. Calculating loss function L JCGNN using (16). Updating the parameters of feature extractor G f by backpropagating through L JCGNN . end for where Θ (0) ∈ R d×F 0 , Θ (1) ∈ R F 0 ×F 1 and Θ (2) ∈ R F 1 ×C are the parameters of the three GNN layers, F 0 , F 1 are the number of nodes in the first two hidden layers, and C is the number of classes. The softmax activation function in (7) is finally applied to the features Z s and Z t respectively for classification [33].
Source labeled data are used to calculate the classification loss in this article, which is obtained by the cross-entropy error over all source labeled data where y ic s is the one-hot encoding of the class label for the ith source data point, and p ic s is the cth output features of softmax activation function.

B. Feature Alignment by Joint CORAL
The domain-wise CORAL reduces the distribution discrepancy among domains by aligning the covariance of the source and target features [18]. The loss measured by domain-wise covariances is defined as where || · || 2 F is the squared matrix Frobenius norm, and C s and C t denote the covariance matrices of the two-domain data which can be estimated as where 1 is an all-ones column vector. Since different classes may have different spectral drifts, the marginal distribution adaptation cannot guarantee the distribution adaptation of each specific class. To achieve better alignment, CORAL is generalized on a per-class basis, which is called class-wise CORAL. The class-wise CORAL loss is defined as where C denotes the number of class. C i s and C i t denote the covariance of ith class in the source domain and target domain, respectively, which can be estimated similarly with (11) using class-specific samples in the corresponding domains.

C. Two-Stage Training Procedure
The class-wise CORAL requires the labeled information to calculate the covariance matrix of each class. However, there is no labels in target domain for unsupervised domain adaptation. Therefore, we utilized pseudo labels of target data to obtain covariance matrix of each class. Due to spectral drift, the network trained on source domain data without any domain adaptation would have a poor performance on target domain data, resulting in inferior predicted labels of target data. Since domain-wise CORAL does not require labels and is able to reduce the global distribution difference, it is used to obtain a more accurate pseudo labels of target data than the prediction method without using any domain adaptation strategy. Then the source data with labels and target data with pseudo-labels are used to compute the covariance matrix for each class, respectively.
In this article, we adopt a two-stage training procedure: Firstly, we train a domain-wise CORAL-based graph neural network (DCGNN), by defining the loss function as follows: Using DCGNN, the target data are predicted and the pseudo labels are utilized to obtain the covariance matrix of each target class. Second, with the initial pseudo labels obtained by DCGNN, the joint CORAL-based graph neural network (JCGNN) is trained with the loss function composed of joint CORAL loss and classification loss, which is expressed as where the first term denotes the classification loss on source labeled data, the second term denotes the domain-wise CORAL loss and the third term denotes the class-wise CORAL loss. λ 1 and λ 2 are trade-off parameters to balance the contributions of different losses.
Since the joint CORAL can further improve the prediction accuracy, the pseudo labels of target data are updated during each iteration of the training process. The training procedure of JCGNN is summarized in Algorithm 1.

IV. RELATED WORK AND DISCUSSION
The relationships between the proposed JCGNN and several related works are described as follows.
1) Deep Adaptation Network (DAN) [40]: The DAN achieves domain adaptation by exploring multikernel MMD for matching different distributions. In the proposed JCGNN, we achieve domain adaptation by using domain-wise CORAL strategy and class-wise CORAL strategy to extract domain invariant features. The MMD strategy and CORAL strategy are all distribution measurement methods of the data, where MMD utilizes first-order statistic and CORAL exploits the secondorder statistic. Moreover, DAN only considers the domain-level distribution discrepancy without exploring the class-level information. Our proposed JCGNN jointly takes domain-level and class-level distribution discrepancy into account.

2) Multiple Adversarial Domain Adaptation (MADA) [23]:
The MADA is a domain adaptation approach based on adversarial learning and it considers class-level information by using multiple domain discriminators. Our proposed JCGNN jointly embeds domain-wise CORAL strategy and class-wise CORAL strategy in GNN. Both the domain-level information and class-level information are considered.
3) Moving Semantic Transfer Network (MSTN) [41]: Both the MSTN and the proposed JCGNN consider the class-level information. However, MSTN aims to align exponential moving average centroids with the same label but different domains, while our proposed JCGNN adopts CORAL strategy and aims to align the per-class covariance matrix of features. [18]: The D-CORAL embeds CORAL strategy in the deep architecture and aims to align the covariance matrix of features that extracted by DNN. The CORAL adopts the second-order statistics as in our approach. However, it calculates the covariance matrix by utilizing all the input data in a domain, without considering the distribution discrepancy between classes. Moreover, The D-CORAL ignores the interdependence among the spectrum during feature extraction. The proposed JCGNN applies GNN for feature extraction, and adopts CORAL strategy on both domain-level and class-level.

A. Datasets Description
The performance of the proposed JCGNN is evaluated by comparing with several state-of-art domain adaptation methods on two real-world multitemporal hyperspectral remote sensing datasets, including Botswana (BOT) multitemporal Hyperion dataset and the Houston multitemporal dataset.

1) Botswana Multitemporal Hyperspectral Dataset:
The BOT dataset consists of three multitemporal images that were acquired by NASA EO-1 Hyperion instrument in May, June and July 2001 over the Okavango Delta, Botswana. All images contain 242-band data at a 30-m spatial resolution and cover the 357-2576 nm spectral portion in 10 nm spectral resolution. Removal of uncalibrated and noisy bands leaves 145 bands for experiments. The three images include nine identified classes and any two of them can be selected as the source and target data, and thus six data pairs are available. The images and labeled information are shown in Fig. 3. The class names and the number of samples of the image are given in Table I.
2) Houston Multitemporal Hyperspectral Dataset: The Houston images were collected by NSF-funded Center for Airborne Laser Mapping (NCALM). The dataset consists of two multitemporal images that were acquired over the University of Houston campus and its neighboring area in 2012 and 2017, respectively. The 2012 Houston data has 144 bands but the 2017 Houston data only has 48 bands. The spectral range of both images is 380-1050 nm. Every three bands in 2012 Houston data were averaged as a new band [45], and then the number of the spectral band in the 2012 Houston data becomes 48, which is consistent with the dimensionality of the 2017 Houston data. We used the 2012 Houston data and 2017 Houston data with four classes as source and target data respectively. The images and labeled information are shown in Fig. 4. The class names and the number of samples of the image are given in Table I.

B. Setup
We compared the proposed method with several state-of-theart deep learning domain adaptation methods, including DAN [40], DANN [20], MADA [23], MSTN [41], and D-CORAL [18]. In addition, the performance of the DNN, the GNN that were trained on source data without embedding domain adaptation strategy were also employed for comparison. JCGNN achieves distribution alignment by applying the joint CORAL to the GNN architecture. For a better understanding of the proposed JCGNN, we conducted experiments using DNN with joint CORAL (JCDNN), and GNN with only domain-wise CORAL (DCGNN).
The proposed JCGNN was implemented on pytorch framework and composed of two hidden layers (350 units for BOT and 128, 32 units for Houston dataset). The ReLU activation function was employed in the hidden layers and the softmax activation function is employed in the output layer. The dropout strategy was also used in each hidden layer to prevent overfitting. For a fair comparison, all the compared networks adopted the full-connected network with the same hidden units and activation function as JCGNN. GNN and DCGNN adopted the GNN with the same hidden units and activation function as JCGNN.
In the proposed JCGNN, we used Adam to optimize the network and the weight decay was set to be 5e-4. The learning rate was annealed by η = η 0 (1+α i ) p , where η 0 was the initial learning rate and i was the training step updating from 0 to 1. Decay parameters α and p were set to be 10 and 0.75, respectively. The dropout rate was set to be 0.9. We set the initial learning rate as 0.0008 in BOT "May-June," "June-May," 0.005 in BOT  "May-July," "July-May," "June-July," "July-June," and 0.001 in Houston dataset. We adopted a two-stage training procedure. In the first stage, the number of iterations was set to be 1000 for BOT data and 500 for Houston dataset. In the second stage, the number of iterations was set to be 4000 for BOT and 2000 for Houston dataset. It is worth noting that all the source and target data were directly forwarded into the network without mini-batch strategy in GNN, DCGNN and JCGNN. Besides, the number of adjacency nodes k was fixed as eight for all data pairs. More analysis about parameters would be provided in the sensitivity analysis section.
Before training the JCGNN, each spectral band of both source and target data was normalized to have a standard normal distribution N(01). Such data preprocessing method was also applied to all the compared methods. Since the initial network parameters were chosen randomly, all experiments were conducted ten times and the average classification was used for evaluation.

C. Results of JCGNN
The overall accuracy (OA) and kappa coefficients of all the compared methods and proposed JCGNN are given in Tables II  and III. The DNN without any adaptation strategy can achieve satisfactory performances on these data pairs which have a small spectral drift, such as BOT "June-July" and "July-June." On the contrary, if the data pairs have a big spectral drift, such as BOT "July-May", Houston dataset "2012-2017," the accuracies of DNN are low. Almost all the domain adaptation methods obtain higher accuracies than DNN, demonstrating their positive transfer learning ability. The GNN outperforms DNN on almost all the data pairs, which demonstrates that GNN could achieve more effective feature extraction ability by utilizing spectral neighborhood information. The D-CORAL can obtain higher accuracies than DAN, indicating the advantage of CORAL strategy compared to MMD. The MSTN and MADA performed better than DAN, DANN and D-CORAL, which demonstrates that the exploration of class-level knowledge would facilitate producing superior alignment performance. Compared to all the compared methods, JCGNN yields the best performances on all the data pairs. The JCGNN performed better than MSTN and MADA, which demonstrates the advantage of embedding class-wise CORAL strategy and domain-wise CORAL in GNN architecture. It is worth noting that the JCGNN improves the classification performance significantly on the "difficult" data pairs like BOT "July-May" and Houston dataset "2012-2017," and it also achieves comparable performance on the "easy" data pairs. Moreover, the accuracies of the JCGNN obtained almost 5%-10% improvement with respect to all the compared methods on BOT dataset. These observations demonstrate that the GNN, domain-wise CORAL and class-wise CORAL can cooperate well for domain adaptation of remote sensing images.
The JCGNN achieves distribution alignment by applying the domain-wise CORAL loss and class-wise CORAL loss to the GNN architecture, where domain-wise CORAL loss aims to reduce the global distribution discrepancy across domains and class-wise CORAL loss aims to align the distribution on a perclass basis. The GNN architecture is adopted rather than the DNN because the GNN makes use of spectral neighborhood information for better classification performance. For a better understanding of the proposed JCGNN, we also compared the results of JCGNN with JCDNN and DCGNN.
As given in Tables II and III, JCGNN performed better than JCDNN on almost all data pairs, which demonstrates GNN has a better performance than DNN in domain adaptation task. JCGNN outperformed DCGNN, indicating the necessity of distribution alignment on a class-level basis.

D. Analysis of JCGNN
The performances of the five algorithms were also reported for comparison on a per-class basis. As given in Table IV

E. Alignment Performance of JCGNN
To illustrate the effectiveness of the proposed JCGNN, we employed t-SNE embeddings [42] to set the high-dimensional features as two-dimensional (2-D) for visualization. Fig. 5, shows classes in "May-June" "June-May," and "July-June" data pair of the BOT dataset in different colors, and represents scatters from source and target domain by hollow circles and pentagrams, respectively. Compared with the visualizations obtained by DNN without any adaptation in Fig. 5(a), (d), and (h), the visualizations obtained by GNN without any adaptation in Fig. 5(b), (e), and (i) become more compact and better separated. For instance, the features of class 35,9 from "July-June" are gravely overlapped in Fig. 5(h), but they are tightly clustered in Fig. 5(i). This observation demonstrates that the GNN outperforms DNN in classification task. However, GNN is incapable of reducing the domain shift and conducting domain adaptation. Therefore, it is necessary to embed alignment strategy in the architecture of GNN to conduce domain-invariance features. To further illustrate the alignment performance of each class to verify the effect of domain-wise CORAL and class-wise CORAL. Fig. 6 plots the features from second and ninth output units of GNN for "June-May" in BOT datasets, where the red scatters represent one class data from source domain and blue  With the refined adaptation of class-wise CORAL, the covariance of same class from different domains have been well aligned. Class-wise CORAL can achieve more effective alignment performance than domain-wise CORAL due to considering the class-level information.

F. Classification Results of the Whole Image by the JCGNN
To classify the whole target image, we firstly divided all the target data into subsets, where each subset is with 5000 pixels. Then each subset and its graph adjacency matrix are fed into the network. By combining the classification results of all the subsets, the classification map of the whole target image can be obtained. Fig. 7 shows the classification results of the BOT images, where "June-May" and "July-June" data pairs are chosen for demonstration. Fig. 7(a) and (d) shows the DNN classification results without any adaptation for "June-May" and "July-June," respectively. Fig. 7(b) and (e) denotes the classification results of proposed JCGNN. Due to lack of ground truth on the whole images, the classification results of classifier that trained on the target labeled data are used as "reference," and the whole images of "reference" are shown in Fig. 7(c) and (f). In Fig. 7(a) and (c), we could observe that the results of DNN without any adaptation is significantly different from the results of "reference." After adopting the adaptation strategy, the results of JCGNN are highly similar to "reference," which could demonstrate the advantage of our proposed domain adaptation approach.
To better demonstrate the effect of our proposed methods, we chose two local regions for further analysis. All local regions are shown in the pink windows in Fig. 7 and enlarged in Fig. 8. The wetland and upland define the two major ecosystems included in BOT dataset [43]. For BOT "June-May" data pair, a wetland area is selected. The selected wetland local region mainly contains class 1 (water, black), class 2 (primary floodplain, yellow), and class 5 (island interior, dark blue). As shown in Fig. 8(a)-(c), we can clearly see that class 2 (primary floodplain, yellow) and class 5 (island interior, dark blue) are easily misclassified as class 6 (woodlands, purple) in the results of DNN without any adaptation. Fortunately, our proposed JCGNN could effectively prevent the occurrence of misclassification, so that the results of JCGNN is much similar to "reference". For BOT "July-June" data pair, an upland area is selected, which mainly contains class 6(woodlands, purple), class 7 (savanna, light blue) and class 9 (exposed soils, orange). As shown in Fig. 8(d)-(f), the results of DNN without any adaptation assigned many false predictions, and many pixels of class 3(Riparian, light green) were misclassified as class 2 (primary Floodplain, yellow) and class 3(riparian, light green). After using the JCGNN, most pixels belonging to class 2 (primary Floodplain, yellow) and class 3(riparian, light green) are correctly classified.

G. Parameter Sensitivity
We conducted sensitivity analysis for the three parameters in the JCGNN: λ 1 , λ 2 , σ. The parameters λ 1 and λ 2 are the tradeoff parameters where λ 1 controls the weight of domain-wise CORAL loss and λ 2 controls the weight of class-wise CORAL loss. The parameter σ is the Gaussian diffusion kernel. We used six BOT data pairs to show the results, and similar trends could also be obtained on the other data pairs. To find a best domain adaptation performance, each parameter was adjusted on the condition that the other parameters are fixed. For parameter λ 1 , ten different values (0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5) were tested. The optimal parameter values of λ 2 and σ were fixed as (1, 3, 1, 0.5, 1, 1) and (1, 1, 0.7, 1, 0.5, 1) for the six data pairs, respectively. The classification results were shown in Fig. 9(a). We can observe that the proposed approach was not sensitive to this parameter in this range, suggesting that this parameter should be set in this range. Fig. 9(b) shows the classification results with parameter λ 2 from (0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5). The optimal parameter values of λ 1 and σ were fixed as (1.5, 1.5, 1, 0.5, 1.5, 1) and (1, 1, 0.7, 1, 0.5, 1) for the six data pairs, respectively. For BOT "May-June," "June-May," "May-July," "June-July," "July-June," different values of this parameter yield similar classification performance. In addition, we can observe that almost all data pairs can achieve satisfactory performance when λ 2 was equal to 1. We suggested that the value of λ 2 should be set around 1. In Fig. 9(c), we tested different values of parameter σ from (0.1, 0.5, 0.8, 1, 3,5,7,9,15,20), with the parameter values of λ 1 and λ 2 being fixed as (1.5, 1.5, 1, 0.5, 1.5, 1) and (1, 3, 1, 0.5, 1, 1) for the six data pairs, respectively. When the value of σ increases from 0 to 1, the accuracies on these "easy" data pairs increase and reach the best at σ = 1. However, when the σ was too large, the accuracy would drop significantly. For these "difficult" data pairs, such as BOT "July-May," the accuracy would grow with an increasing value of σ. We suggested that the parameter σ should not more than 1 when data pairs with little spectral drifts. When there was large spectral drift across domains, the parameter σ should be greater than 10.

H. Computational Time
The computational time of all the compared domain adaptation approaches was given in Table V. All the experiments were implemented with the deep learning framework and were executed on NVIDIA GeForce RTX 2080 GPU with 16-GB memory. The GPU was used to accelerate the training process. As given in Table V, the training time of GNN is longer than DNN because the training process of GNN did not adapt mini-batch strategy. The MSTN and MADA were slower than DAN, DANN D-CORAL, and DCGNN, which indicated that achieve class-level distribution alignment was more timeconsuming than achieve domain-level distribution alignment. The JCGNN that combined both class-level and domain-level distribution alignment costed more time than most compared domain adaptation approach, but the computational time was still within acceptable limits. Fig. 7. Classification results of the target image in BOT "June-May" and "June-July" data pairs. (a) DNN result without any adaptation for "June-May" data pair (b)JCGNN result for "June-May" data pair. (c) Reference obtained by using target labeled data as training data for "June-May" data pair. (d) DNN result without any adaptation for "July-June" data pair. (e) JCGNN result for "July-June" data pair. (f) Reference obtained by using target labeled data as training data for "July-June" data pair. (g) Class legend.

VI. CONCLUSION
In this article, we proposed a new unsupervised domain adaptation method for multitemporal remote sensing images, which adopted GNN for feature extraction to utilize spectral neighborhood information. Features from two domains are then aligned by jointly exploiting CORAL on both domain-level and class-level. Experiments on two multitemporal hyperspectral remote sensing images datasets demonstrated that the proposed JCGNN method offers the classification performance superior to other state-of-the-art deep domain adaptation methods for multitemporal images, because GNN can be used to extract representative features from spectral neighbors in multitemporal images. Moreover, the JCGNN outperformed DCGNN, indicating the advantage of the class-level distribution alignment. Since the construction of adjacent matrix plays an important part in GNN [31] and [44], we will consider to update the adjacent matrix in the feature extracting process to learn more domain-invariant features in our future work.