Graph Embedding and Distribution Alignment for Domain Adaptation in Hyperspectral Image Classification

Abstract—Recent studies in cross-domain classification have shown that the discriminant information of both the source and target domains is very important. In this paper, we propose a new domain adaptation (DA) method for hyperspectral image (HSI) classification, called graph embedding and distribution alignment (GEDA). GEDA uses a graph embedding method and a pseudo-label learning method to learn the inter-class and intra-class divergence matrices of the source and target domains, which preserves the local discriminant information of both domains. Meanwhile, the spatial and spectral features of the HSI are used, and distribution alignment and subspace alignment are performed to minimize the spectral differences between domains. We perform DA tasks on the Yancheng, Botswana, University of Pavia and Center of Pavia, and Shanghai and Hangzhou data sets. Experimental results show that the classification performance of the proposed GEDA is better than that of existing DA methods.


I. INTRODUCTION
Hyperspectral remote sensing can simultaneously obtain the spatial and spectral information of ground objects, and has the ability to identify subtle differences between materials [1]-[3]. Nowadays, hyperspectral remote sensing has become a hot spot in the development of Earth observation. A large number of hyperspectral remote sensing images are available due to the launch of new satellites and the development of hyperspectral sensors [4]. Although many hyperspectral images (HSIs) exist, the collaborative processing of different HSIs is still very difficult due to differences in sensors and acquisition conditions [5]-[8]. HSIs differ in spectral coverage, spectral resolution, spatial resolution, number of bands, etc. In addition, the acquisition conditions of different HSIs are usually inconsistent, which brings obstacles to long time-series analysis and the collaborative processing of HSIs [5].
Recently, a new domain adaptation (DA) technique has emerged in the field of machine learning [5], [9]. The main idea of DA is to use the rich knowledge of a labeled source domain to improve model performance on a target domain with limited or no labels. DA recasts the differences in the imaging environment and hardware conditions of multisource HSIs as a data or feature transformation problem [5], [10]-[15]. By mining data correlations, it can realize the transfer of common knowledge between domains. DA thus provides theoretical feasibility for cross-domain classification problems caused by the inconsistent characteristics of HSIs. Existing DA methods can be roughly classified as sample-based methods, feature-based methods, and classifier-based methods [5], [16]. In this paper, we focus on feature-based DA methods, which either perform subspace learning by exploiting the subspace geometrical and statistical structures [16]-[19], or distribution alignment to reduce the marginal and/or conditional distribution divergence between domains [20]-[22].
In previous years, many feature-based DA algorithms have been proposed, such as subspace alignment (SA) [17], correlation alignment (CORAL) [23], transfer joint matching (TJM) [19], geodesic flow kernel (GFK) [24], transfer component analysis (TCA) [18], joint distribution alignment (JDA) [20], scatter component analysis (SCA) [21], joint geometric and statistical alignment (JGSA) [22], and locality preserving joint transfer (LPJT) [16]. SA learns a transformation matrix that maps the source and target domains into individual subspaces so that the distance between the resulting subspaces is reduced [17]. CORAL uses a linear transformation to align the second-order statistics of the source and target domains [23]. GFK learns domain-invariant features by integrating an infinite number of subspaces [7], [24]. TCA maps the data of the two domains into a high-dimensional reproducing kernel Hilbert space (RKHS) such that the distance between the two domains in the RKHS is minimized [18]. TJM mainly reduces the differences between domains and constructs new features through feature matching and instance weighting [19]. JDA is designed to learn a transformation such that the transformed data align both the marginal and conditional distributions [20]. SCA takes the between-class and within-class scatter matrices of the source domain into consideration [21]. The above transformation-based DA methods, such as TCA, TJM, JDA and SCA, learn only a single transformation to map the source and target domains into a shared subspace. When the distribution shift between the two domains is large, it is very difficult to adapt the distributions this way [22]. To reduce the shift both statistically and geometrically, JGSA learns two coupled mappings A and B for the source and target domains, respectively. It simultaneously performs distribution alignment and subspace alignment between the transformed domains, and considers the discriminant information of the source domain and the global information of the target domain [22].
However, JGSA does not consider the data manifold structure. To preserve the local manifold structure of data, LPJT jointly exploits feature adaptation with distribution matching and sample adaptation with landmark selection [16].
The aforementioned methods have shown good performance for DA in computer vision. However, directly applying these methods to HSI cross-domain classification usually produces poor results due to the existence of great spectral drifts between domains. The spectral drifts mainly come from differences in the imaging environment and hardware conditions and from the variation of materials. In the case of large domain differences, it is very difficult to select effective landmarks with high domain matching degrees for the sample-based DA methods, and invariant features or a shared latent feature subspace are also unlikely to exist for the feature-based DA methods.
For cross-scene remote sensing image classification, the characteristics of remote sensing images can be exploited to improve the performance of DA. Sun et al. constructed discriminative cross-view subspaces and applied the SA method to unsupervised cross-view remote sensing image classification [25]. Qin [7]. Recently, some deep-learning-based DA methods have been proposed for cross-scene HSI classification, such as a deep metric learning model [28], the class-wise distribution adaptation (CDA) network [29], and deep cross-domain few-shot learning (DCFSL) [30]. In Ref. [28], a deep metric learning-based feature embedding model was proposed for HSI classification. It projects input features into a well-defined metric space, where the mapped features have small intra-class distances and large inter-class distances [28]. In CDA, a class-wise adversarial adaptation network was constructed, and a probability-prediction-based maximum mean discrepancy (MMD) method was introduced to measure the distribution distance [29]. DCFSL incorporates few-shot learning (FSL) and DA in a unified framework for HSI classification, where a conditional adversarial DA strategy is utilized to overcome domain shift, and FSL is executed to discover transferable knowledge in the source classes and to learn a discriminative embedding model for the target classes [30].
From the above traditional and deep-learning-based DA methods, we can see that feature learning and distribution alignment are key factors for domain adaptation. Although it is very difficult to select effective landmarks, invariant features, or a shared latent feature subspace in the case of large domain differences, it is feasible to project the source and target domains into individual subspaces and then minimize the subspace distribution distance both statistically and geometrically to reduce the spectral drifts. Meanwhile, previous studies have shown that spatial-spectral features and the local manifold structure are useful for HSI DA tasks [10], [12], [13], [15], [31]. Therefore, in this paper we propose a graph embedding and distribution alignment (GEDA) method for the cross-domain classification of HSIs. By using the characteristics of the HSI, the domain relations, and the data label information, the proposed GEDA can simultaneously reduce the distributional shift and the geometrical shift between domains. In GEDA, spatial filtering is used to increase the spatial consistency of the HSI, and then two coupled projections A and B are learned to project the source and target domains into their respective subspaces. The projected data meet the following requirements: 1) the discriminant information of the source and target domains is maintained; 2) the marginal and conditional distribution differences between the source and target domains are minimized; 3) the subspace offset between domains is minimized. While keeping the discriminant information in the two domains, an easy transfer learning (EasyTL) method is used to predict the pseudo labels of the target domain [32], and a graph embedding method is used to learn the intra-class and inter-class scatter matrices of both domains. By alternately updating the subspace features and the target pseudo labels, the proposed GEDA can effectively align the source and target data. Our contributions are threefold.
(1) A unified DA framework called GEDA is proposed. It simultaneously considers subspace discriminative feature learning, pseudo label learning and distribution alignment.
(2) The local spatial and spectral discriminative information of the HSI is used by means of simple spatial mean filtering and graph embedding.
(3) An effective pseudo label learning method (i.e., EasyTL) is employed to generate target labels and to promote the minimization of statistical distribution difference between source and target domains.
The rest of the paper is organized as follows. Section II introduces the proposed GEDA method. Section III presents the experiments, which compare GEDA with existing DA methods for HSI cross-domain classification. Section IV concludes the paper.
II. THE PROPOSED METHOD

Definition 1. A domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$ is composed of a feature space $\mathcal{X}$ and a marginal probability distribution of inputs $P(X)$, where $X = \{x_1, \cdots, x_n\} \in \mathcal{X}$ is a set of learning samples.

Definition 2. A task $\mathcal{T} = \{Y, f(x)\}$ consists of classification results $Y$ and a classifier $f(x)$, where $f(x) = Q(y|x)$ can be interpreted as the conditional probability distribution [20].
For DA, there are two domains, i.e., a source domain $\mathcal{D}_s$ and a target domain $\mathcal{D}_t$. In general, the source domain has labels, and the target domain has few or no labels. HSI cross-domain classification usually focuses on the unsupervised DA problem, where the target domain has no labels. A labeled data set $\{(x_1^s, y_1^s), (x_2^s, y_2^s), \cdots, (x_{n_s}^s, y_{n_s}^s)\}$ is given from the source domain, with $x_i^s \in \mathbb{R}^d$ and $y_i^s \in \{1, 2, \cdots, C\}$. The sample and label sets of the source domain are $X_s = \{x_1^s, \cdots, x_{n_s}^s\}$ and $Y_s = \{y_1^s, \cdots, y_{n_s}^s\}$, respectively. Let $X_t = \{x_1^t, \cdots, x_{n_t}^t\}$ denote the unlabeled data set from the target domain. DA considers the following situation: the class spaces of the source and target domains are the same, $\mathcal{Y}_s = \mathcal{Y}_t$, but the marginal and conditional probability distributions of the two domains are inconsistent: $P_s(X_s) \neq P_t(X_t)$ and $Q_s(y_s|X_s) \neq Q_t(y_t|X_t)$. The aim of DA is to predict the target labels using the model trained on the source samples.
To alleviate the effect of spectral shifts between the source and target domains, the proposed GEDA method projects both domains into subspaces, and uses the potential shared features and intra-domain structure information of the two domains to reduce the domain differences both statistically and geometrically. With the aid of the pseudo labels of target samples predicted by EasyTL, GEDA preserves the local spatial and spectral discriminant information through spatial mean filtering and graph embedding, and meanwhile effectively reduces the statistical distribution difference and subspace difference between the source and target domains. The flowchart of GEDA is shown in Fig. 1. It mainly includes two modules: a subspace learning module and a pseudo-label learning module. These two modules have a coupled interaction. On the one hand, if the features of the source and target domains are aligned in the subspace learning module, the subsequent pseudo-label learning is more accurate. On the other hand, if the pseudo labels of the target samples are accurate, the discriminative information of the target domain is well preserved and the subspace learning is more effective.

A. Local spatial information preservation
Due to spatial correlation, spatial mean filtering can be used to maintain the similarity between neighboring pixels and preserve the local neighborhood consistency. Mean filtering is carried out in both domains.
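As a concrete illustration, the per-band spatial mean filtering described above can be sketched as follows. This is a hypothetical NumPy/SciPy implementation, not the authors' code; the function name and the (rows, cols, bands) cube layout are our own assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_mean_filter(cube, window=5):
    """Average each band over a window x window spatial neighborhood.

    cube: HSI array of shape (rows, cols, bands).
    The filter is spatial only, so size=1 along the band axis; border
    pixels are handled by edge replication ("nearest" mode).
    """
    return uniform_filter(cube.astype(float),
                          size=(window, window, 1), mode="nearest")
```

In the experiments below, window sizes of 5 × 5 and 7 × 7 are used for the Pavia and Botswana data, and such a filter would be applied to both the source and target cubes before feature extraction.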

B. Local spectral information preservation
Considering that the source domain $\mathcal{D}_s$ and target domain $\mathcal{D}_t$ may have great spectral differences, two projection matrices $A$ and $B$ are learned for the source and target domains, respectively. The graph embedding method is used to learn the intra-class and inter-class divergence matrices of each domain to preserve the intrinsic intra-class compactness and inter-class variation of the samples [33]. The preservation of local spectral information can be realized by solving the following Fisher-criterion-based optimization problems:

$$\max_{A} \frac{\operatorname{tr}\!\left(A^{\top} S_b^s A\right)}{\operatorname{tr}\!\left(A^{\top} S_w^s A\right)}, \quad S_w^s = X_s L_w^s X_s^{\top}, \; S_b^s = X_s L_b^s X_s^{\top} \tag{1}$$

$$\max_{B} \frac{\operatorname{tr}\!\left(B^{\top} S_b^t B\right)}{\operatorname{tr}\!\left(B^{\top} S_w^t B\right)}, \quad S_w^t = X_t L_w^t X_t^{\top}, \; S_b^t = X_t L_b^t X_t^{\top} \tag{2}$$

where $L_w^s$ ($L_w^t$) and $L_b^s$ ($L_b^t$) are the Laplacian matrices of the intrinsic graph and penalty graph introduced in the source (target) domain [33], respectively, and $S_w^s$ ($S_w^t$) and $S_b^s$ ($S_b^t$) are the intra-class and inter-class divergence matrices of the source (target) domain, respectively. The above Fisher criteria preserve the local spectral similarity and increase the class separability in both domains.
In the graph embedding framework [33]-[35], each sample is regarded as a node, and the relation between nodes is described by a weight, such as

$$W_{ij} = \begin{cases} \exp\!\left(-\|x_i - x_j\|^2 / t\right), & \text{if nodes } x_i \text{ and } x_j \text{ are connected} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $t$ is a parameter and is set to 2 as recommended in Ref. [16]. A graph Laplacian can be calculated by $L = D - W$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$. The graph Laplacian matrix $L$ can be used to characterize the intra-class compactness and inter-class variation of samples. To compute the graph Laplacian matrices in Eqs. (1) and (2), the weight matrix in each domain is first computed as follows:

(a) Constructing the intrinsic weight matrix $W_w$: for each sample $x_i$, connect $x_i$ to its $k_1$-nearest neighbors $v$ that have the same label as $x_i$.
(b) Constructing the penalty weight matrix $W_b$: for each sample $x_i$, connect the $k_2$-nearest vertex pairs in which the samples belong to different classes.
Following Ref. [16], both $k_1$ and $k_2$ are set to 5 for simplicity.
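Under the construction rules above, the intrinsic and penalty graphs for one domain might be built as follows. This is a hedged NumPy sketch: the function name, the dense-matrix layout, and the brute-force neighbor search are our own choices, not the paper's implementation.

```python
import numpy as np

def graph_laplacians(X, y, k1=5, k2=5, t=2.0):
    """Intrinsic and penalty graph Laplacians for one domain.

    X: (n, d) samples, y: (n,) labels. Connected pairs get heat-kernel
    weights exp(-||xi - xj||^2 / t); unconnected pairs get 0.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    heat = np.exp(-d2 / t)
    order = np.argsort(d2, axis=1)                       # neighbors by distance
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        # k1 nearest same-label neighbors (intrinsic graph)
        same = [j for j in order[i] if j != i and y[j] == y[i]][:k1]
        # k2 nearest different-label neighbors (penalty graph)
        diff = [j for j in order[i] if y[j] != y[i]][:k2]
        Ww[i, same] = Ww[same, i] = heat[i, same]
        Wb[i, diff] = Wb[diff, i] = heat[i, diff]
    Lw = np.diag(Ww.sum(1)) - Ww  # L = D - W
    Lb = np.diag(Wb.sum(1)) - Wb
    return Lw, Lb
```

The divergence matrices of Eqs. (1) and (2) would then follow as $X L_w X^{\top}$ and $X L_b X^{\top}$ with the samples stacked as columns of $X$.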
For constructing the graph Laplacian matrices in the target domain, the pseudo labels of the target samples are needed. In this paper, we use the EasyTL algorithm [32] to predict the target-domain pseudo labels. EasyTL aims to learn a probability annotation matrix $M \in \mathbb{R}^{C \times n_t}$, with element $M_{cj} \in [0, 1]$ denoting the probability that target sample $x_j^t$ belongs to class $c$, based on the following optimization problem:

$$\min_{M} \sum_{c=1}^{C} \sum_{j=1}^{n_t} D_{cj} M_{cj}, \quad \text{s.t.} \;\; \sum_{c=1}^{C} M_{cj} = 1, \; M_{cj} \geq 0 \tag{4}$$

where $D_{cj}$ measures the distance between the sample $x_j^t$ and the $c$-th class center of the source domain.
Based on the probability annotation matrix $M$ obtained by solving Eq. (4), the pseudo label of target sample $x_j^t$ is given by

$$\hat{y}_j^t = \arg\max_{c} \, M_{cj} \tag{5}$$

The advantages of EasyTL are that it requires no parameters and is easy to implement. There are also other algorithms for learning pseudo labels, such as the standard k-nearest neighbor (KNN) or support vector machine (SVM) classifiers [20], [22], [36], label propagation [16], label regression [37], etc. Taking into account both accuracy and simplicity, EasyTL is used in this paper.
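For illustration only, a much-simplified stand-in for this labeling step can be sketched as follows. With only the sum-to-one and nonnegativity constraints of Eq. (4), each column of $M$ is optimized by placing all probability mass on the class with the smallest distance, so the labeling reduces to nearest-class-center assignment; the full EasyTL exploits additional intra-domain structure, so this sketch is an approximation, not the actual algorithm.

```python
import numpy as np

def pseudo_labels_nearest_center(Xs, ys, Xt):
    """Assign each target sample the label of its nearest source class
    center (a simplified stand-in for EasyTL's labeling step)."""
    classes = np.unique(ys)
    centers = np.stack([Xs[ys == c].mean(axis=0) for c in classes])  # (C, d)
    # D[j, c]: squared distance of target sample j to class center c
    D = ((Xt[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return classes[np.argmin(D, axis=1)]
```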

C. Distribution difference minimization
The joint distribution can be approximately described by the marginal distribution and the conditional distribution [20]. The marginal distribution difference is described as the distance between the projected sample means [20], [38]:

$$D_m = \left\| \frac{1}{n_s} A^{\top} \sum_{i=1}^{n_s} x_i^s - \frac{1}{n_t} B^{\top} \sum_{j=1}^{n_t} x_j^t \right\|^2 \tag{6}$$

The conditional distribution difference is approximately expressed as the sum, over classes, of the differences between the projected class-mean samples of the two domains, using the EasyTL-generated pseudo labels of the target samples:

$$D_c = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} A^{\top} \sum_{x_i^s \in X_s^{(c)}} x_i^s - \frac{1}{n_t^{(c)}} B^{\top} \sum_{x_j^t \in X_t^{(c)}} x_j^t \right\|^2 \tag{7}$$

where $X_s^{(c)}$ and $X_t^{(c)}$ are the $c$-th class sample sets of the source and target domains, respectively, and $n_s^{(c)}$ and $n_t^{(c)}$ are the corresponding numbers of samples.
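Once projected features are available, these two quantities can be evaluated empirically. The NumPy sketch below is a hypothetical illustration (sample-per-row layout and function name are our assumptions); it computes the marginal and conditional differences as squared distances between overall and class-wise means.

```python
import numpy as np

def mmd_distances(ZA, ys, ZB, yt_pseudo):
    """Empirical marginal and conditional distribution distances
    between projected source features ZA (rows = A^T x_i^s) and
    projected target features ZB (rows = B^T x_j^t)."""
    marginal = float(((ZA.mean(0) - ZB.mean(0)) ** 2).sum())
    conditional = 0.0
    for c in np.unique(ys):
        if (yt_pseudo == c).any():  # skip classes absent from the target
            diff = ZA[ys == c].mean(0) - ZB[yt_pseudo == c].mean(0)
            conditional += float((diff ** 2).sum())
    return marginal, conditional
```

When the two projected domains coincide class by class, both quantities vanish, which is the alignment goal of Eqs. (6) and (7).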

D. Subspace distance minimization
The subspace difference is minimized based on the subspace alignment (SA) strategy [17]. Different from SA, which learns only one transformation matrix, here we learn two projection matrices $A$ and $B$ for the source and target domains. We then directly minimize the distance between the subspaces:

$$\min_{A, B} \|A - B\|_F^2 \tag{8}$$

E. Objective function
To simultaneously preserve the local spatial and spectral discriminant information and reduce the statistical distribution difference and subspace difference between the two domains, the objective function of the proposed GEDA can be formulated as

$$\max_{A,B} \; \frac{\{\text{Between Class}\}_{ST}}{\{\text{Distribution Difference}\} + \lambda \{\text{Subspace Difference}\} + \beta \{\text{Within Class}\}_{ST}}$$

where $\beta$ and $\lambda$ are trade-off parameters, and $\{\text{Within Class}\}_{ST}$ and $\{\text{Between Class}\}_{ST}$ represent the within-class and between-class divergences of the two domains, respectively.
To obtain the explicit form of the above objective function, we first combine Eqs. (6) and (7) as

$$\min_{A,B} \; \operatorname{tr}\!\left( \begin{bmatrix} A \\ B \end{bmatrix}^{\top} \begin{bmatrix} K_s & K_{st} \\ K_{ts} & K_t \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right) \tag{9}$$

where $K_s$, $K_t$, $K_{st}$ and $K_{ts}$ are the marginal and conditional MMD matrices constructed from the source samples, the target samples, and the (pseudo) labels, following [20], [22]. In Eq. (9), $K_{ts} = K_{st}$, and $\mathbf{1}_n$ is the column vector of all ones used in their construction.
Then, we combine Eqs. (9), (1), (2) and (8) to generate the optimization function:

$$\max_{U} \; \frac{\operatorname{tr}\!\left(U^{\top} \begin{bmatrix} S_b^s & 0 \\ 0 & S_b^t \end{bmatrix} U\right)}{\operatorname{tr}\!\left(U^{\top} \left( \begin{bmatrix} K_s & K_{st} \\ K_{ts} & K_t \end{bmatrix} + \lambda \begin{bmatrix} I & -I \\ -I & I \end{bmatrix} + \beta \begin{bmatrix} S_w^s & 0 \\ 0 & S_w^t \end{bmatrix} \right) U\right)} \tag{13}$$

where $U = [A; B]$ and $I \in \mathbb{R}^{d \times d}$ is the identity matrix; the $\lambda$ term expands to $\lambda \|A - B\|_F^2$. As shown in Eq. (13), GEDA maximizes the between-class divergence of the source and target domains, and meanwhile minimizes their distribution differences, subspace offsets, and within-class divergences.
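Because the optimization stacks the source and target terms over $U = [A; B]$, the numerator and denominator are $2d \times 2d$ block matrices. The following NumPy sketch of that assembly is our own illustration (function name and argument layout are assumptions, and we take $K_{ts} = K_{st}$ as stated in the text).

```python
import numpy as np

def assemble_blocks(Ks, Kt, Kst, Sws, Swt, Sbs, Sbt, lam, beta):
    """Assemble the numerator (between-class) and denominator
    (MMD + subspace offset + within-class) block matrices of the
    trace ratio, with block layout matching U = [A; B]."""
    d = Ks.shape[0]
    I = np.eye(d)
    Z = np.zeros((d, d))
    num = np.block([[Sbs, Z],
                    [Z, Sbt]])
    den = np.block([[Ks + lam * I + beta * Sws, Kst - lam * I],
                    [Kst - lam * I, Kt + lam * I + beta * Swt]])
    return num, den
```

The diagonal blocks of the denominator are exactly the $N_s = K_s + \lambda I + \beta S_w^s$ and $N_t = K_t + \lambda I + \beta S_w^t$ terms used in the optimization section that follows.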

F. Optimization
Let $N_s = K_s + \lambda I + \beta S_w^s$ and $N_t = K_t + \lambda I + \beta S_w^t$. Then Eq. (13) can be rewritten as

$$\max_{U} \; \frac{\operatorname{tr}\!\left(U^{\top} \begin{bmatrix} S_b^s & 0 \\ 0 & S_b^t \end{bmatrix} U\right)}{\operatorname{tr}\!\left(U^{\top} \begin{bmatrix} N_s & K_{st} - \lambda I \\ K_{ts} - \lambda I & N_t \end{bmatrix} U\right)} \tag{14}$$

The optimization problem (14) is equivalent to the following problem:

$$\max_{U} \; \operatorname{tr}\!\left(U^{\top} S_b U\right), \quad \text{s.t.} \;\; \operatorname{tr}\!\left(U^{\top} N U\right) = 1 \tag{15}$$

where $S_b$ and $N$ denote the numerator and denominator matrices in Eq. (14), respectively. To solve problem (15), a Lagrange function is constructed:

$$L = \operatorname{tr}\!\left(U^{\top} S_b U\right) + \operatorname{tr}\!\left(\left(I - U^{\top} N U\right) \Phi\right) \tag{16}$$

Setting the derivative of $L$ with respect to $U$ to zero, we get

$$S_b U = N U \Phi \tag{17}$$

where $\Phi = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$ contains the $k$ leading eigenvalues and $U = [U_1, \ldots, U_k]$ contains the corresponding eigenvectors, which can be solved analytically through generalized eigenvalue decomposition. Once the transformation matrix $U$ is obtained, the subspaces $A$ and $B$ follow directly. The pseudo code of GEDA is summarized in Algorithm 1. We adopt an iterative optimization strategy to alternately update the target pseudo labels and the subspace projection matrices $A$ and $B$ (i.e., $U$). It should be noted that the proposed GEDA method can be extended to solve nonlinear problems in the RKHS by using kernel functions.
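The generalized eigenproblem $S_b U = N U \Phi$ can be solved with an off-the-shelf symmetric solver. The sketch below uses `scipy.linalg.eigh`, which assumes $S_b$ is symmetric and $N$ is symmetric positive definite; the function name and the splitting of $U$ into $A$ and $B$ follow the stacked layout $U = [A; B]$ described in the text.

```python
import numpy as np
from scipy.linalg import eigh

def solve_projections(Sb, N, k):
    """Solve S_b u = lambda N u and keep the k leading eigenvectors.

    With U = [A; B] stacked, the top and bottom halves of U give the
    source and target projection matrices."""
    vals, vecs = eigh(Sb, N)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]  # take the k largest
    U = vecs[:, idx]
    d = Sb.shape[0] // 2
    return U[:d], U[d:]               # A, B
```

In the full algorithm this solve would sit inside the iterative loop: update the target pseudo labels, rebuild the MMD and Laplacian matrices, and re-solve for $A$ and $B$.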

III. EXPERIMENTS

A. Data sets
Six HSI data sets, i.e., University of Pavia, Center of Pavia, Yancheng, Botswana, Shanghai and Hangzhou, are used in the experiments.

Algorithm 1 GEDA
Input: Source data $X_s$ and labels $Y_s$, target data $X_t$; parameters: $\lambda$, $\beta$, subspace dimension $k$.
Output: Transformation matrices $A$ and $B$; predicted labels $Y_t$ for the target domain.
1. Perform spatial filtering on the source and target data.
2. Learn the initial pseudo labels of the target domain with EasyTL.
3. Repeat until the maximum number of iterations is reached:
   1) Construct the MMD matrices and graph Laplacian matrices using the current pseudo labels.
   2) Solve the generalized eigen-decomposition problem (17) and select the $k$ eigenvectors corresponding to the $k$ largest eigenvalues to form $U = [A; B]$.
   3) Map the original data to the respective subspaces to get the embeddings.
   4) Update the pseudo labels $Y_t$ of the target domain.

Shanghai-Hangzhou: The Shanghai and Hangzhou data sets were captured by the EO-1 Hyperion hyperspectral sensor [6], which retains 198 bands after removing the bad bands. The size of the Shanghai image is 1600 × 230; it includes roads, buildings, plants, and the water of the Yangtze and Huangpu Rivers. The Hangzhou image size is 590 × 230, including roads, buildings, plants, West Lake, and the Qiantang River basin.
For the DA problem, we need to construct source and target domains such that the data of source and target domains come from two different HSIs or different regions in one HSI. Based on the six HSI data sets, the following four DA tasks are constructed.
(1) Task 1 (Pavia University and Center task): the source and target domains are chosen from the Pavia University and Pavia Center images, respectively. To keep the dimensionality consistent, the first 102 bands of the University of Pavia data set are used for analysis. Six common classes in these two images, i.e., Asphalt, Meadows, Trees, Bare Soil, Bitumen, and Bricks, are used for the DA tasks. We randomly draw 400 samples per class from each image to form the source and target domains, respectively.
(2) Task 2 (Yancheng task): the Yancheng image is divided into two disjoint regions for DA. The selected two disjoint regions have similar materials, in which six classes (Off shore water, Aquaculture, Paddy, River, Fallow land, Dry land) are chosen for the classification tasks. The selected six classes in the two regions constitute source and target domains.
(3) Task 3 (Botswana task): similar to Yancheng task, the Botswana image is also divided into two disjoint regions for DA. Six common classes in these two regions are chosen to form source and target domains.
(4) Task 4 (Shanghai-Hangzhou task): the source and target domains are set as the Shanghai and Hangzhou images, respectively. Three common classes in these two images, i.e., Water, Land/Building, Plant, are used for the DA task.
The numbers of samples in each class for the above four DA tasks are shown in Tables I and II.

B. Comparison methods
We compare our proposed GEDA on the four DA tasks with the following DA methods: NA (1-nearest-neighbor classification with no adaptation), SA [17], CORAL [23], GFK [24], TCA [18], TJM [19], JDA [20], JGSA [22], and LPJT [16]. In contrast to these methods, the proposed GEDA uses spatial-spectral local discriminant information, intra-domain structures, and the data distributions to reduce the differences between domains both geometrically and statistically.
Using the above DA methods, the source and target data can be aligned to a certain extent, and then the classifier trained on the source features can be used to classify the target features. The final performance of each DA method is evaluated by the classification results. In the experiments, the 1-nearest neighbor (NN) classifier is chosen because it is simple and parameter-free. For the subspace-based DA methods, the dimension of the subspace is set to 20. In the cases with randomly selected samples, the experiment is run 20 times and the average results are reported. For JDA, JGSA, LPJT and GEDA, the number of iterations T is set to 5. For GEDA, the size of the spatial filter is determined by the characteristics of the data: the spatial windows of the Pavia and Botswana data sets are chosen as 5 × 5 and 7 × 7, respectively, while the Yancheng and Shanghai-Hangzhou data sets do not need to be filtered. For each method, the overall accuracy (OA) and kappa coefficient (κ) on the target domain are used to evaluate the performance.
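The evaluation protocol just described (1-NN classification of target features against source features, scored by OA and κ) can be sketched as follows. This is a hypothetical NumPy implementation of the standard metrics, not the authors' code.

```python
import numpy as np

def evaluate_1nn(Zs, ys, Zt, yt_true):
    """Classify target features by the label of the nearest source
    feature, and report overall accuracy and Cohen's kappa."""
    d2 = ((Zt[:, None, :] - Zs[None, :, :]) ** 2).sum(-1)
    pred = ys[np.argmin(d2, axis=1)]
    oa = float((pred == yt_true).mean())
    # Cohen's kappa from the confusion matrix
    classes = np.unique(np.concatenate([ys, yt_true]))
    C = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            C[i, j] = np.sum((yt_true == a) & (pred == b))
    n = C.sum()
    pe = (C.sum(0) * C.sum(1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return oa, kappa
```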

C. Experiments
1) Experiments on Pavia University and Center task:
In this task, the samples of the source and target domains come from different images (Pavia University and Pavia Center, respectively) whose acquisition times are inconsistent, so there is a significant spectral difference between the source and target domains and the DA task is challenging. Table III shows the classification results of the different DA methods on the Pavia University and Center task. It can be seen that the classical DA algorithms perform poorly in this case, with classification accuracies almost all lower than 60%. Even when the category information of the source domain and the global information of the target domain are used, as in JGSA and LPJT, the classification performance is still not satisfactory. The main reason is the great spectral drifts in this cross-scene classification problem. By using spatial filtering to increase spatial consistency and alleviate the spectral variation of the HSIs, and meanwhile employing the local discriminant information of the source and target domains, the proposed GEDA method achieves good classification results on this task.
Table IV provides the classification confusion matrices of NA, JGSA and GEDA. NA shows very poor results on Classes 5 ("Bitumen") and 6 ("Bricks"), misclassifying most of the samples in these two classes as Class 1 ("Asphalt"). "Bitumen" and "Asphalt" are similar materials, so their spectra are very similar; "Bricks" is also similar to "Asphalt". Due to this high spectral similarity, confusion among these classes is very likely (see the classification of Classes 1, 5 and 6 by NA). After the JGSA transformations, the classification performance on Classes 1, 5 and 6 is improved to a certain extent. However, JGSA still confuses most of the samples of the "Bitumen" and "Asphalt" classes. Compared with JGSA, our proposed GEDA dramatically improves the classification performance on Classes 1, 5 and 6 and shows better overall results. It can discriminate subtle differences between similar classes by using both the local spatial-spectral information and the distribution information between and within the domains.
2) Experiments on Yancheng task: The classification results on the Yancheng data are shown in Table V. All DA methods obtain acceptable results with OA of about 90%. As the source and target domains come from two disjoint regions of the same HSI and the spectral resolution of the Yancheng GF-5 image is very high, the spectral difference between the domains is relatively small. In particular, Class 3 ("Paddy") is obviously different from the other five classes, and all DA methods classify it correctly. However, most DA methods completely misclassify Class 6. Comparing the three geometrical and statistical distribution alignment methods (JGSA, LPJT and GEDA): JGSA preserves only the source global discriminant information, through the intra-class and inter-class divergence matrices of the source domain, and cannot accurately classify Class 6. Different from JGSA, LPJT considers the local discriminant information in both the source and target domains, so it produces relatively better results on Class 6. Nevertheless, LPJT ignores the subspace difference and uses label propagation to learn the pseudo labels of the target domain, which may be inaccurate when the spectra of different classes are similar. Our proposed GEDA method uses EasyTL to learn the pseudo labels of the target samples and simultaneously uses the structural information of the two domains; it yields excellent results on the Yancheng data. Table VI provides the classification confusion matrices of NA, JGSA and GEDA and the per-class classification accuracies. Although NA shows good classification results on Classes 1, 3 and 6, it mistakenly classifies nearly half of the samples of Class 2 ("Aquaculture") as Class 4 ("River"), and likewise nearly half of the samples of Class 5 ("Fallow land") as Class 6 ("Dry land").
Clearly, the classes "Aquaculture" and "River" are both related to water, and "Fallow land" and "Dry land" are subclasses of land. Classes 2 and 4, and Classes 5 and 6, are spectrally similar, so they are difficult to classify. By performing DA, JGSA improves the classification accuracy of Classes 2 and 5. However, it misclassifies all the samples of Class 6 into Classes 5 and 3; Classes 6 and 5 are subgroups with similar spectral characteristics and are difficult to distinguish. By adding the learned target-domain pseudo labels, GEDA improves on JGSA for the sixth class, whose classification accuracy increases from 0% to 98.36%. This indicates that it is necessary to make proper use of the class information of the target domain.
To visually show the feature transformation results, we display the first two principal components of the original data and the first two dimensions of the JGSA- and GEDA-transformed data in Fig. 2. It can be clearly seen that, in the original data, the same class of the source and target domains is distributed in different regions; that is, there are obvious distribution differences between the disjoint source and target domains. JGSA improves this situation, but the within-class scatter remains very large, especially for the class "Off shore water" in red. The proposed GEDA not only reduces the distribution difference between the source and target domains, but also draws sample points of the same class closer together and pushes samples of different classes apart.

3) Experiments on Botswana task: Table VII shows the classification results on the Botswana data set. The first six DA methods, which do not utilize the source labels and structural information, provide similar results with OA below 80%. The last four methods, which perform distribution alignment using the source labels and target pseudo labels, produce relatively better results. JGSA improves on JDA because it also uses the source discriminative information and the target variance. By further considering the discriminative information of the target domain and sample weight relations, LPJT improves on JGSA slightly. The improvements of GEDA over LPJT lie in pseudo-label learning, subspace alignment, and spatial filtering.

Table VIII provides the classification confusion matrices of NA, JGSA and GEDA. For the DA problem, the distributions of the source and target domains are different, so target samples are very likely to be misclassified without DA. Here, NA wrongly classifies many samples of Class 3 ("Riparian") as Class 5 ("Acacia woodlands"), and some samples of Class 6 ("Acacia shrublands") as Class 1 ("Floodplain grasses 1"). JGSA dramatically improves the results by performing DA with two coupled mappings and the source label information. By using EasyTL to learn the pseudo labels of the target domain and the graph embedding method to learn the intra-class and inter-class divergences, the proposed GEDA further improves on JGSA and classifies almost all the samples correctly. Fig. 3 illustrates the scatterplots of NA, JGSA and GEDA in the first two principal component spaces. Fig. 3(a) shows differences in the data distribution, and the mixed sample points are difficult to distinguish. After the JGSA transform in Fig. 3(b), the distribution difference is significantly reduced, the distance between classes is large, and samples in the same class are more clustered. Compared with JGSA, our proposed GEDA in Fig. 3(c) further reduces the distribution difference and increases the class separability. Even though only two dimensions are used, there is little overlap between different classes. It should be noted that the green class ("Floodplain grasses 2") has three clusters in the 2-D projection space because the original source and target samples of this class are distributed in three subregions. Nevertheless, this class can be well classified in a higher-dimensional space, i.e., k = 20, as shown in Table VIII.

4) Experiments on Shanghai-Hangzhou task: As the Shanghai-Hangzhou data contains a large number of samples [36], we divide the data into ten parts for the experiment, giving ten data sets in each of the source and target domains.
For each data set in the target domain, we conduct an experiment with every data set in the source domain and report the average accuracy; the final classification accuracy is obtained once all ten target data sets have been classified. Table IX shows the classification results of the different DA methods. Without DA, NA shows very poor results with κ of only 0.471. The subspace learning methods, such as SA, GFK and TCA, show no improvement in this case: due to the scene difference, there is a large distribution difference between the two domains that cannot be reduced by learning a subspace alone. Rather than learning a subspace, CORAL directly aligns the second-order statistics of the two domains, which yields relatively better results than the subspace learning methods; however, it does not use the label information of the source domain. Although JDA considers both the source labels and a subspace learning strategy, it learns only one transformation matrix, which cannot reduce the great distribution difference between the domains. To overcome this limitation, JGSA learns two linear transformations for the source and target domains, respectively, so that the transformed domains have a smaller distribution difference in the subspace; JGSA dramatically improves on JDA. The proposed GEDA shows the best results on this task. Fig. 4 displays the ground truth map and the classification maps of the target scene produced by NA, JGSA and GEDA. Although the classification OA of JGSA is 82%, it cannot effectively separate "Water" (blue) and "Land/building" (cyan), as shown in the white rectangular region. Compared with JGSA, GEDA has higher classification accuracy on "Land/building" (cyan) and "Plant" (yellow).

D. Algorithm analysis and discussion
1) Ablation analysis: To analyze the effect of each module in the proposed GEDA, we perform ablation experiments on the Pavia University and Center cross-scene DA task. Table X provides the classification OA. As GEDA builds on the original JGSA method, we first list the result of JGSA, whose classification accuracy is 0.6737. Then, we incorporate EasyTL-based pseudo-label learning and graph embedding into JGSA separately; the resulting OAs increase to a certain extent. If EasyTL-based pseudo-label learning and graph embedding are added to JGSA simultaneously, the accuracy is further improved to an OA of 0.7775. Finally, we combine all the above strategies and add spatial filtering to the model, and the classification accuracy reaches 0.8550. The results show that every module of the proposed GEDA plays a role in improving classification. To further compare the performance of the different DA methods when spatial mean filtering is performed, Table XI provides the classification accuracies of the various algorithms on the spatially filtered data. Comparing the results in Tables III and XI, we can see that the performance of the different DA methods improves to a certain extent when spatial filtering is used, and the proposed GEDA provides the best results whether or not spatial filtering is used.
2) Parameter analysis: In the proposed GEDA method, the subspace dimension k and the regularization parameters λ and β are the key parameters. We analyze their effect on the Pavia University and Center data. The relationship between the subspace dimension k and OA is shown in Fig. 5. It can be observed that the performance of GEDA is stable when the dimension of the subspace is between 7 and 30; in the experiments, it is fixed at 20. The OA versus the parameters β and λ is provided in Fig. 6, which shows that the model is insensitive to both parameters. In the experiments, λ and β are set to 1 and 0.3, respectively.
3) Running time: The running times of the different algorithms on the Pavia University and Center task are shown in Table XII. Due to the iterative updating of the target pseudo labels and subspace features, the proposed GEDA method is computationally less efficient than SA, CORAL and GFK. However, it is much more efficient than TJM and LPJT, and comparable with JGSA.

IV. CONCLUSION

In this paper, we have proposed a new DA method based on graph embedding and distribution alignment (GEDA). For the cross-domain classification of HSIs, GEDA learns two coupled mappings using the discriminant information of the source and target domains, so that the mapped distributions of the two domains are close to each other and the local discriminant information is maintained. The results of the four DA tasks in the experiments demonstrate that GEDA can effectively perform cross-domain classification of HSIs.