Adaptive Local Discriminant Analysis and Distribution Matching for Domain Adaptation in Hyperspectral Image Classification

Abstract—Multimodally distributed data are very common in remote sensing images, such as hyperspectral images (HSIs). It is important to capture the local manifold structure while preserving the global discriminant information in such multimodal data. In this article, an adaptive local discriminant analysis and distribution matching (ALDADM) method is designed for domain adaptation (DA) in HSI classification. ALDADM uses adaptive local discriminant analysis to extract features that are discriminative and robust to multimodally distributed data. Meanwhile, considering the spectral properties of HSIs, the domain shift is reduced by distribution matching and subspace alignment. During this learning process, ALDADM selects only density peak samples to reduce the influence of interfering samples. Four DA tasks are designed using the University of Pavia, Center of Pavia, Yancheng, and Botswana HSI datasets. Compared with existing DA methods, the experimental results demonstrate that our ALDADM offers superior performance.

I. INTRODUCTION

An automatic and efficient method is desired to complete image labeling [1]. An intuitive idea is to train a classifier with already labeled samples and use it directly on new images. However, spectral features may be very different in a new image due to changed soil properties, environmental conditions, and the incident angle of sunlight [2], [3].
In such a situation, domain adaptation (DA) techniques allow us to transfer the knowledge learned from a labeled source domain to a target domain with limited or no labeled samples. This classification problem is called cross-domain classification [4]-[6]. In cross-domain classification, the source and target pixels may exhibit large spectral shifts. For example, even if the data come from the same scene and the same sensor, the distribution of samples can change due to different collection times. When the data come from different scenes, the pixels of the two domains show large spectral shifts even for the same land cover class. When the sensors are different, the domains are associated with different feature spaces [7], [8]. Although a distribution shift exists between the source and target domains, the domains are mostly related (e.g., same scene, same sensor, similar materials). Therefore, it is possible and necessary to reduce their distribution differences for appropriate knowledge transfer.
In the past few years, many DA methods have been proposed. Traditional DA methods can be instance-based, classifier-based, or feature-based [9]-[11]. Instance-based methods align images by reweighting or picking out important labeled source samples [12]-[14]. Classifier-based methods aim to adapt the classifier trained on the source domain to classify the target domain [15], [16]. Feature-based methods intend to reduce the geometrical or statistical distribution differences between domains in the feature space [17]. The geometrical difference is usually measured as the distance between the subspaces of the two domains. Statistical distribution differences represent marginal and/or conditional distribution differences between the two domains [18]. Meanwhile, feature-based methods can also be used in conjunction with the first two.
Feature-based DA methods are very common. Subspace alignment (SA) learns a transformation matrix that brings the subspaces of the two domains closer together [19], [20]. Correlation alignment (CORAL) learns a linear transformation of the source data such that the covariances of the transformed source data and the target data are close [21]. In [22], a CORAL-based graph neural network is proposed for DA of HSIs, where a domain-wise and a class-wise CORAL are involved. Geodesic flow kernel (GFK) constructs a geodesic flow between the source and target manifold subspaces to characterize the geometric and statistical changes between them [23]. An unsupervised manifold alignment method is proposed for DA in [24]. Transfer component analysis (TCA) maps the source and target domains into a reproducing kernel Hilbert space, where the two domain distributions become similar [25]. Transfer joint matching performs feature matching and source sample weighting across domains in a principled dimensionality reduction process [7], [26]. Based on TCA, joint distribution adaptation (JDA) performs both marginal and conditional distribution adaptation by using pseudolabels of the target domain [27]. Similarly, a discriminative cooperative alignment method is developed to align the subspaces and distributions of domains [28]. The adaptive local neighbors for transfer discriminative feature learning (ALN-TDFL) method reduces marginal and conditional distribution differences and considers both global discriminative consistency and local information preservation when learning domain-invariant features [29]. Scatter component analysis finds a transformation that maximizes the match between domains to improve class separability [30].
These traditional DA methods all learn a unified map that projects the source and target domains into a shared subspace. However, when the domain shift is too large, a common subspace may not exist. Therefore, joint geometrical and statistical alignment (JGSA) learns two coupled maps for the source and target domains, considering the discriminative information of the source domain and the global information preservation of the target domain in addition to distribution matching and SA [31], [32]. However, it ignores the manifold structure of the data. If we force the points of the same class to be closer together, the local data structure may be lost, the transferability of features will be reduced, and it will be difficult to deal with multimodally distributed data. Based on linear discriminant analysis, locality adaptive discriminant analysis learns transformation matrices that pull similar points together and push dissimilar points apart, thereby capturing the local manifold structure of the data [33]. In addition, many DA strategies involve all source samples and pseudolabeled target samples to learn transformations and iteratively update the pseudolabels based on the previous ones. Mislabeled samples will disturb this learning process. The discriminant geometrical and statistical alignment method selects density peak landmarks to reduce the interference from outliers in the source domain and mispredicted instances in the target domain [34].
To align the statistical distributions and subspaces between domains and preserve the local manifold structure with the use of landmarks, we propose adaptive local discriminant analysis and distribution matching (ALDADM) for unsupervised DA in this article. It learns two coupled projections for the source and target domains by performing the following steps, i.e., landmark selection, feature alignment, and target pseudolabel learning. The density-peak-based method is first introduced to select landmarks in the source and target domains [34], respectively. Then, the features of the source and target domains are aligned by simultaneously considering the marginal and conditional distribution alignment, SA, and the local manifold structure and data variance preservation of both domains. In the pseudolabel learning process, graph-based label propagation (GLP) is used to predict labels of the target domain [35]. Overall, the main contributions of this work are summarized as follows.
1) A novel unsupervised DA learning framework is proposed, which integrates instance-based, feature-based, and classifier-based methods, i.e., instance-based landmark selection, local feature alignment, and pseudolabel learning.
2) A GLP method is used to predict labels of the target domain, and density peak landmark selection (DPLs) is introduced to select the more informative samples in the source domain and high-confidence pseudolabeled samples in the target domain.
3) A local adaptive feature learning module for the source and target domains is proposed to handle multimodally distributed data in HSIs, making the learned features discriminative and robust.
The rest of this article is organized as follows. Section II details our proposed ALDADM method. Experimental results and analysis are shown in Section III. Finally, Section IV concludes this article.
II. PROPOSED METHOD

In the DA problem, the marginal and/or conditional distributions between domains are inconsistent. Due to the nature of HSIs, spatial mean filtering is applied to locally preserve the spatial information. In the following, the data samples of the two domains are all filtered and denoted as $X_s \in \mathbb{R}^{d \times n_s}$ and $X_t \in \mathbb{R}^{d \times n_t}$. Fig. 1 shows the framework of ALDADM. It learns two coupled projection matrices $A$ and $B$ for the source and target domains, respectively, by performing landmark selection, feature learning, and target pseudolabel learning.
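As a rough illustration of this preprocessing step, spatial mean filtering replaces each pixel's spectrum with the average over its spatial neighborhood. The window size and the truncated handling of border pixels are assumptions here; the paper does not specify them:

```python
import numpy as np

def spatial_mean_filter(cube, w=3):
    """Average each pixel's spectrum with its (w x w) spatial neighbors.

    cube: (H, W, d) hyperspectral image. Border pixels use the available
    (truncated) neighborhood; the window size w is an assumption.
    """
    H, W, d = cube.shape
    out = np.empty_like(cube, dtype=float)
    r = w // 2
    for i in range(H):
        for j in range(W):
            patch = cube[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1, :]
            out[i, j] = patch.reshape(-1, d).mean(axis=0)
    return out
```

Each filtered pixel then serves as one column of $X_s$ or $X_t$ after flattening.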

A. Landmark Selection
Compared to using all instances, selecting the most relevant ones (landmarks) benefits the subsequent feature learning. With the DPLs method [34], informative instances in D_s and pseudolabeled instances with high confidence in D_t are likely to be selected. DPLs assumes that, within the same class, the main instances (cores or landmarks) are clustered together, while some abnormal instances (outliers) are far away from the cluster center.
To implement DPLs, two quantities are computed for each instance $x \in C_k$, i.e., the intraclass density $\rho_a(x)$ and the interclass density $\rho_b(x)$, defined as [34]
$$\rho_a(x) = \frac{1}{n_k} \sum_{x_i \in C_k} K(x, x_i), \qquad \rho_b(x) = \frac{1}{\bar{n}_k} \sum_{x_i \in \bar{C}_k} K(x, x_i)$$
where $C_k$ denotes the sample set of the $k$th class, $\bar{C}_k$ is the sample set of all other classes except the $k$th class, $n_k$ and $\bar{n}_k$ denote the number of instances in $C_k$ and $\bar{C}_k$, respectively, and $K(x, x_i) = \exp(-\|x - x_i\|^2 / 2\sigma^2)$ is the RBF kernel measuring the similarity between $x$ and $x_i$. The closer $x$ and $x_i$ are, the larger $K(x, x_i)$ is. For simplicity, the parameter $\sigma$ is set to the median of the pairwise distances between source samples or target samples. $\rho_a(x)$ is the intraclass density, which measures the similarity between $x$ and the other samples in the same class. Similarly, $\rho_b(x)$ is the interclass density, which measures the similarity between $x$ and the samples in different classes. Finally, a relative density is defined as
$$\Delta\rho(x) = \rho_a(x) - \rho_b(x).$$
Clearly, a larger $\Delta\rho(x)$ means that $x$ is distributed around instances with the same label and far away from instances of different classes. We calculate $\Delta\rho$ for all instances, sort them in descending order, and select the top $r\%$ as DPLs. In the experiments, we compute the relative densities $\Delta\rho_s(x)$ and $\Delta\rho_t(x)$ for the source and target domains, respectively. Because the target domain $D_t$ has no labels, the pseudolabels of the target domain are used to compute $\Delta\rho_t(x)$.
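As an illustrative NumPy sketch of the DPLs selection above (not the authors' code; whether a sample's self-similarity is excluded from ρ_a is an assumption, since the paper does not specify it):

```python
import numpy as np

def density_peak_landmarks(X, y, r=0.95, sigma=None):
    """Select the top-r fraction of samples by relative density
    delta(x) = rho_a(x) - rho_b(x).  X: (n, d) samples, y: (n,) labels."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    if sigma is None:
        sigma = np.median(D)           # median pairwise distance, as in the paper
    K = np.exp(-D ** 2 / (2 * sigma ** 2))
    delta = np.empty(n)
    for i in range(n):
        same = (y == y[i])
        same[i] = False                # exclude the point itself (assumption)
        diff = (y != y[i])
        rho_a = K[i, same].mean() if same.any() else 0.0
        rho_b = K[i, diff].mean() if diff.any() else 0.0
        delta[i] = rho_a - rho_b
    keep = np.argsort(-delta)[: int(np.ceil(r * n))]
    return np.sort(keep)
```

A point sitting inside the opposite class's cluster gets a negative Δρ and is discarded first, which is exactly the filtering effect described above.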

B. Feature Learning
The domain-invariant features are learned by performing distribution alignment and SA with local discriminative information and target variance preservation.
1) Minimizing Distribution Differences: The maximum mean discrepancy (MMD) is used to measure the marginal and conditional probability distribution differences [36]. Considering the large spectral difference between domains in reality, two coupled mappings $A$ and $B$ for the source and target domains are learned to first minimize the marginal distribution difference as [31]
$$\min_{A,B}\ \left\| \frac{1}{n_s} A^{\top} X_s \mathbf{1}_s - \frac{1}{n_t} B^{\top} X_t \mathbf{1}_t \right\|_F^2.$$
For the conditional distribution difference, the following classwise MMD is computed:
$$\min_{A,B}\ \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}} A^{\top} X_s^{(c)} \mathbf{1}_s^{(c)} - \frac{1}{n_t^{(c)}} B^{\top} X_t^{(c)} \mathbf{1}_t^{(c)} \right\|_F^2$$
where $X_s^{(c)}$ and $X_t^{(c)}$ denote the source samples of class $c$ and the target samples pseudolabeled as class $c$, containing $n_s^{(c)}$ and $n_t^{(c)}$ instances, respectively. The final minimization term combines the marginal and conditional parts, where $\mathbf{1}_{s/t}$ is the column vector of all ones.
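Under this formulation, each MMD term reduces to a squared distance between projected domain (or class) means. A minimal sketch, taking the already projected matrices $A^\top X_s$ and $B^\top X_t$ as plain arrays:

```python
import numpy as np

def marginal_mmd(Xs_proj, Xt_proj):
    """Squared distance between the projected domain means
    (the empirical marginal MMD term).
    Xs_proj: (k, n_s) = A^T X_s,  Xt_proj: (k, n_t) = B^T X_t."""
    return np.sum((Xs_proj.mean(axis=1) - Xt_proj.mean(axis=1)) ** 2)

def conditional_mmd(Xs_proj, ys, Xt_proj, yt_pseudo):
    """Classwise MMD summed over the classes present in both domains,
    using pseudolabels yt_pseudo for the target side."""
    total = 0.0
    for c in np.intersect1d(ys, yt_pseudo):
        ms = Xs_proj[:, ys == c].mean(axis=1)
        mt = Xt_proj[:, yt_pseudo == c].mean(axis=1)
        total += np.sum((ms - mt) ** 2)
    return total
```

In the actual optimization these quantities are folded into a quadratic form in $A$ and $B$ rather than evaluated directly.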
2) Minimizing Subspace Differences: To make the two domains more geometrically aligned, we bring the source and target subspaces closer together by minimizing
$$\min_{A,B}\ \|A - B\|_F^2.$$
3) Maximizing Target Variance: To make the projected features retain more of the original information, we maximize the variance of the target domain as
$$\max_{B}\ \operatorname{Tr}\left(B^{\top} X_t H_t X_t^{\top} B\right)$$
where $H_t = I - \frac{1}{n_t} \mathbf{1}_t \mathbf{1}_t^{\top}$ is the centering matrix with $\mathbf{1}_t \in \mathbb{R}^{n_t}$ being the column vector of all ones.
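Both quantities are direct to compute; a small sketch of the subspace-difference and target-variance terms exactly as defined above:

```python
import numpy as np

def subspace_difference(A, B):
    """||A - B||_F^2: the geometric gap between the two learned projections."""
    return np.sum((A - B) ** 2)

def target_variance(B, Xt):
    """Tr(B^T X_t H_t X_t^T B) with H_t = I - (1/n_t) 1 1^T (centering)."""
    n_t = Xt.shape[1]
    H = np.eye(n_t) - np.ones((n_t, n_t)) / n_t
    return np.trace(B.T @ Xt @ H @ Xt.T @ B)
```

With $B$ the identity and centered data, the variance term equals the total scatter of $X_t$, which is the quantity the maximization preserves.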

4) Local Discriminative Feature Learning:
Multimodally distributed data means that the samples in the same class are distributed in several groups, which is most likely the case in HSIs. Therefore, while maintaining the global relationships of data points, it is equally important to discover the local manifold structure [29]. To address this issue, we propose to incorporate the preservation of local manifold structure into the process of domain-invariant feature learning.
To preserve the local manifold structure, a similarity weight matrix is constructed for each class such that similar points from the same class are pulled as close as possible. For the two domains, the formulas are
$$\min_{A}\ \sum_{c=1}^{C} \sum_{x_i, x_j \in C_c^{(s)}} W_{ij}^{(s)} \left\| A^{\top} x_i - A^{\top} x_j \right\|^2, \qquad \min_{B}\ \sum_{c=1}^{C} \sum_{x_i, x_j \in C_c^{(t)}} W_{ij}^{(t)} \left\| B^{\top} x_i - B^{\top} x_j \right\|^2$$
where the matrix $W^{(s/t)} \in \mathbb{R}^{n_{s/t} \times n_{s/t}}$ is used to capture the local relationships between data points, and $W^{(s/t)}$ is updated adaptively during the optimization so that only similar points within the same class receive large weights.

C. Target Pseudolabel Learning
In the target landmark selection [i.e., the calculation of the relative density in (4)], the conditional distribution alignment in (6), and the local discriminative feature learning in (15), the pseudolabels of the target domain are needed. The learning of pseudolabels is critical: once some instances' pseudolabels are predicted incorrectly, they will inevitably interfere with the subsequent iterative learning process. Different from the commonly used k-nearest neighbor classifier, we use the GLP strategy to predict the labels of the target domain [35]. Given the whole data matrix $X = [X_s, X_t] \in \mathbb{R}^{d \times (n_s + n_t)}$ and a label set $L = \{1, 2, \ldots, C\}$, define the initial $(n_s + n_t) \times C$ label matrix $Y$ as $Y_{ij} = 1$ if $x_i$ is labeled as $y_i = j$, and $Y_{ij} = 0$ otherwise. The objective of GLP is to learn an $(n_s + n_t) \times C$ label matrix $F$ that assigns the label of $x_i$ as $y_i = \arg\max_{j \leq C} F_{ij}$. In detail, it first constructs an affinity matrix $N$ according to the similarity between data points and then calculates the matrix $S = E^{-1/2} N E^{-1/2}$, in which $E = \operatorname{diag}(e_1, \ldots, e_{n_s + n_t})$ is a diagonal matrix with $e_i = \sum_j N_{ij}$. Then, it updates the label matrix according to [35]
$$F(t+1) = \lambda S F(t) + (1 - \lambda) Y$$
where $\lambda \in (0, 1)$ trades off the propagated information against the initial labels.
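The GLP procedure above can be sketched as follows. The RBF affinity, the trade-off constant `lam`, and the fixed iteration count are assumptions, since the paper defers these details to [35]:

```python
import numpy as np

def graph_label_propagation(X, Y, lam=0.99, n_iter=50, sigma=1.0):
    """Propagate labels over an RBF affinity graph, following [35].

    X: (d, n) all samples (source + target); Y: (n, C) one-hot rows for
    labeled samples and zero rows for unlabeled ones.  lam, n_iter, and
    sigma are illustrative choices, not values from the paper.
    """
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    N = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(N, 0.0)                 # no self-affinity
    e = N.sum(axis=1)
    S = N / np.sqrt(np.outer(e, e))          # E^{-1/2} N E^{-1/2}
    F = Y.astype(float)
    for _ in range(n_iter):
        F = lam * S @ F + (1 - lam) * Y      # propagation update
    return F.argmax(axis=1)
```

Unlabeled target samples inherit the class whose labeled samples dominate their graph neighborhood, and the soft scores in F double as confidence values for the subsequent DPLs step.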

D. Objective Function
ALDADM combines (7) and (12)-(15) to obtain the objective function. Writing $P = [A; B]$, it can be expressed in the generalized trace-ratio form
$$\max_{P}\ \frac{\operatorname{Tr}\left(P^{\top} S_v P\right)}{\operatorname{Tr}\left(P^{\top} \left(S_d + \alpha S_g + \beta S_l + \mu S_r\right) P\right)}$$
where $\alpha$ and $\beta$ are weighting parameters, $S_v$ encodes the target variance term, $S_d$ the marginal and conditional distribution matching terms, $S_g$ the subspace alignment term, $S_l$ the source and target local discriminative term $\{\mathrm{local}\}_{ST}$, and $S_r$, built from the identity matrix $I \in \mathbb{R}^{d \times d}$, regularizes the scale of $B$. As recommended in [31], the parameter $\mu$ controlling the scale of $B$ is set to 0.01 in the experiments.

E. Optimization
An alternating iteration strategy can be used to solve the optimization problem (19). The weight matrix $W^{(s/t)}$ is first initialized so that only points in the same class $c$ have the weight $1/n_s^{(c)}$ or $1/n_t^{(c)}$, while all other entries are 0. Then, (19) can be solved by updating $P$ (i.e., $A$, $B$) and $W^{(s/t)}$ iteratively.
1) Optimizing Projection $P$: When $W^{(s/t)}$ is fixed, (19) reduces to a generalized eigendecomposition problem, and $P$ (i.e., $A$, $B$) is obtained from its leading eigenvectors in (22).
2) Optimizing Weight Matrix $W^{(s/t)}$: When $P$ is fixed, (19) becomes (23). Note that (23) is independent across different $c$ and $i$, so each subproblem (24) can be solved individually, yielding the closed-form optimal solutions (25) and (26).
In summary, each iteration updates $P$ (i.e., $A$, $B$) via (22), transforms the data into the subspaces, updates $W^{(s/t)}$ via (25) and (26), and updates the target pseudolabels $Y_t$ via (17); these steps are repeated until convergence. By optimizing $P$ and $W^{(s/t)}$ iteratively, ALDADM continuously reduces the differences between domains and makes the domain-invariant features more discriminative, while being more robust to multimodally distributed data. The pseudocode of ALDADM is summarized in Algorithm 1.

III. EXPERIMENTAL RESULTS AND ANALYSIS

A. Datasets
University of Pavia (PU) and Center of Pavia (PC): The two scenes were acquired by the ROSIS-03 hyperspectral sensor over the city of Pavia, Italy. The numbers of spectral bands in the two images are 103 and 102, respectively, and the images contain 610 × 340 and 1096 × 715 pixels. In the experiments, to maintain dimensional consistency, we select the first 102 bands of PU. Next, we select seven common classes of the two images and randomly select 800 samples for each class of the corresponding images to form the source and target domains, respectively. For these data, we construct the following two DA tasks: 1) PCPU, with PC and PU as source and target domains, respectively; and 2) PUPC, with PU and PC as source and target domains, respectively. The datasets and selected classes are shown in Fig. 2.
Yancheng: The dataset was acquired by the visible-shortwave infrared advanced hyperspectral imager of the China GaoFen-5 satellite over the port of Yancheng City, China, on April 4, 2019. The image scene contains 1175 × 585 pixels and 267 spectral bands. Similarly, the Yancheng data are also divided into two disjoint regions in the experiments. We treat them as the source and target domains and pick six common classes, as shown in Fig. 3.
Botswana: The data were acquired by the NASA EO-1 satellite over the Okavango Delta, Botswana. The sensor collects data at 30 m pixel resolution. After a series of processing steps, 145 bands remain. The image scene has a size of 256 × 1476 pixels. In the experiment, the Botswana image is divided into two disjoint regions with similar land cover, which serve as the source and target domains, respectively. For the classification task, we select six classes, as shown in Fig. 4.
The details are given in Table I. In this article, we focus on the homogeneous DA problem, where the source and target domains are from the same sensor. In the abovementioned four DA tasks in Table I, the source and target domains of PCPU and PUPC tasks are from different scenes although they are acquired by the same sensor. In the Yancheng and Botswana tasks, the source and target domains are from the same scene but disjoint regions.

B. Comparison Methods
For comparison, the following DA methods are considered.

1) No adaptation (NA): Uses the source classifier to directly classify target samples.
3) SA [19]: Aligns the source and target subspaces.
4) GFK [23]: Learns new feature representations through geodesic flows constructed between domains.
5) TCA [25]: Learns transfer components to minimize the marginal distribution difference.
6) JDA [27]: Jointly aligns both the marginal and conditional distributions.
7) JGSA [31]: Learns two coupled projections for the source and target domains to achieve geometrical and statistical alignment simultaneously.
8) ALN-TDFL [29]: Considers local manifold structure and global discriminative consistency in the process of feature learning.
9) DTJM [7]: Aligns features by minimizing the empirical MMD and performs sample reweighting on the embedding matrix while maximizing the dependence between the embedding and the labels.
10) ALDADM: Selects density peak landmarks while maintaining global information and the local manifold structure, and aligns the domains both statistically and geometrically.
In the abovementioned comparison algorithms, 1-NN is the basic classifier. For subspace-based methods, the subspace dimension is fixed as k = 20. In the proposed method, we set the percentage of density peaks for both domains as r_s = r_t = 0.95 and fix the number of iterations as T = 10. For the regularization parameters, we use the same values α = 1, β = 1 on the Botswana and Yancheng tasks. For the Pavia tasks, due to the large spectral shift between domains, the subspace difference is large and the parameters are set as α = 20, β = 0.1. The overall accuracy (OA) and the kappa coefficient (κ) are used to evaluate the classification performance.
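For reference, OA and κ can be computed from the predicted labels as follows (a standard sketch, not tied to the authors' evaluation code):

```python
import numpy as np

def overall_accuracy_and_kappa(y_true, y_pred):
    """OA = fraction correct; kappa corrects OA for chance agreement."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    oa = np.mean(y_true == y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    n = len(y_true)
    # chance agreement p_e from the row/column marginals of the confusion matrix
    p_e = sum((np.sum(y_true == c) / n) * (np.sum(y_pred == c) / n)
              for c in classes)
    kappa = (oa - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return oa, kappa
```

κ is the more conservative of the two metrics: a classifier that always predicts the majority class can score a high OA but a κ near zero.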

1) Experiments on PCPU Task:
The DA task is more difficult when the source domain is PC and the target domain is PU. As can be seen from Table II, the accuracies of the classical DA methods are mostly below 70%, and the highest is only about 75%. However, ALDADM learns initial pseudolabels for the target domain by GLP and then selects samples with high confidence to learn domain-invariant features, so its classification accuracy reaches 83.5%, an improvement of at least 8%. Table III provides the confusion matrices of NA, JGSA, and ALDADM. In particular, Class 7 (Shadow) achieves good results with all DA methods, almost all above 95%. NA is obviously very poor on Classes 2 (Meadows), 5 (Bitumen), and 6 (Bricks): it misclassifies 40% of the samples in Class 2 into Class 4, and most of the samples in Classes 5 and 6 are misclassified into Class 1. JGSA does a poor job on Classes 2 and 4, confusing them with less than 60% classification accuracy. ALDADM improves Class 2 to 66%, Class 5 to 95%, and Class 6 to 85%.
2) Experiments on PUPC Task: Table IV shows the classification performance of several DA methods on the PUPC task. Obviously, the classification accuracies of the general DA methods are below 80%, while JGSA and ALN-TDFL reach about 82% due to the preservation of global discriminative information. ALDADM performs DPLs and then exploits the advantages of global discriminative consistency and the local manifold structure simultaneously, making the extracted features more discriminative and robust to multimodally distributed data; its classification accuracy is also significantly improved, approaching 88%.
The classification confusion matrices of NA, JGSA, and ALDADM are shown in Table V. The NA method performs poorly on Class 6 (Bricks), misclassifying many samples into Class 1 (Asphalt) because their spectral characteristics are very close. When using JGSA, the classification accuracy on Class 6 is improved by about 40%. Compared with NA and JGSA, our method improves on Classes 2 and 5-7, and the accuracy is as high as 98% on Classes 5 and 6.
3) Experiments on Yancheng Task: As the source and target domains of Yancheng are adjacent regions, this task is relatively simple. It is obvious from Table VI that the OA of all DA methods can reach over 90%. The proposed ALDADM has the best classification effect, reaching 99%. However, TCA, JDA, and JGSA completely misclassify Class 6 (Dry land), basically dividing its samples into Class 5 (Fallow land), as shown in Table VII. In fact, TCA and JDA only bring the distributions of the two domains closer, while JGSA further considers global discriminative information preservation. Even so, it is difficult to distinguish between Classes 5 and 6, which are both land classes. However, ALDADM also considers the local manifold structure, so it can distinguish these two classes and increase the classification accuracy from 0% to 100%.
4) Experiments on Botswana Task: Table VIII shows the classification effects of different DA algorithms on the Botswana task. Without a DA strategy, the accuracy can reach 85% after the data are filtered. If a classical DA algorithm is used, the classification accuracy increases by 1%-2%. Based on JDA, ALN-TDFL considers the local manifold structure, and its classification accuracy increases by 5%. JGSA achieves 92% by considering discriminant information preservation and two coupled mappings. Finally, ALDADM learns soft labels through the GLP method and selects high-confidence samples. By using two mappings, it simultaneously considers the local manifold structure and reduces the distribution shift between the projected data of the two domains. A perfect result of 100% is obtained. The classification maps of the different methods on the target domain are shown in Fig. 5, where the target region is rotated 90° for display.
As shown in the central region, many comparison methods misclassify the class "Floodplain grasses1" (in red) as the class "Floodplain grasses2" (in green) because both classes belong to grass and are difficult to distinguish due to their high spectral similarity. Nevertheless, our ALDADM can discriminate the subtle spectral differences between them and produces a perfect result.

D. Ablation Analysis
In order to analyze the effect of each module in ALDADM, we conduct ablation experiments on the PCPU task. We randomly select 800 samples for each of the seven classes to form the source and target domains and conduct a total of eight experiments, as shown in Table IX. We regard JGSA as the baseline. Experiment 2 adds DPLs on the basis of JGSA to test the effect of DPLs. In experiment 3, the source discriminative information preservation of the JGSA algorithm is replaced with Local DFL. Experiment 4 replaces the 1-NN pseudolabel learning with the GLP strategy [35]. Comparing experiments 2, 3, and 4 with experiment 1, the addition of each part is effective under the framework of JGSA. DPLs filters out some misclassified target samples, thereby reducing the interference in the subsequent iterative learning process and improving the classification accuracy. When Local DFL is added, the OA is improved by 6% compared to the original JGSA. When the GLP method is used for pseudolabel learning, the classification performance improves by 9% over the original JGSA, which means that more target samples obtain correct labels.
Experiments 5-7 show that adding any two parts together is better than adding a single part; for example, the accuracy of experiment 5 is higher than those of experiments 2 and 3 because it not only reduces the interference of some misclassified samples but also considers global information preservation and the local manifold structure. Experiment 8 is ALDADM, which learns soft labels for the target domain through the GLP strategy, then uses DPLs to reduce the interference from outliers in D_s and mispredicted instances in D_t, and reduces the domain discrepancy both geometrically and statistically. Therefore, its performance is the best.

E. Landmark Visualization
DPLs is applied to both domains, and the instances with higher intraclass and lower interclass densities are selected. To verify that DPLs can filter out interfering samples in the two domains, we select two classes (i.e., "trees" and "bare soil") for visualization on the PCPU task, giving an intuitive display in Fig. 6. It can be seen from the figure that DPLs selects landmarks with high relative density.

F. Feature Visualization
The results of feature visualization for the original data, JGSA, and our ALDADM are illustrated in Fig. 7, where four classes (i.e., "Asphalt," "Meadows," "Bitumen," and "Bricks") are selected from the PCPU task and t-SNE is used to display the features. In the original features, samples from the same class are obviously clustered into several clusters. Taking the yellow "Meadows" class as an example, it has at least three clusters in both the source and target domains. This almost coincides with the original sample distribution in the ground-truth map, where the "Meadows" class is distributed mainly in three disjoint regions in PU and many regions in PC. JGSA forces the samples of the same class to be closer to a certain extent. As can be seen in Fig. 7(c), our ALDADM largely solves the multimodally distributed data problem, that is, it makes samples from the same class gather more closely while also preserving the local manifold structure of the data.
In essence, JGSA only pays attention to the global relationships of the data, whereas the similarity weight matrix constructed in the local DFL relaxes the global discriminative consistency and captures the local relationships between samples, since only similar points from the same class are required to be drawn closer. Therefore, ALDADM can make the extracted features more discriminative and deal with multimodally distributed data simultaneously.

G. Parameter Analysis
In this section, we investigate the effect of the parameters α and β on the PCPU task. The OA versus α and β is shown in Fig. 8, where the model shows relatively good results over a wide range of parameter values. Based on these results, α = 20 and β = 0.1 are chosen for the PCPU and PUPC tasks. Due to the large spectral shifts between domains in the PCPU and PUPC tasks, a relatively larger α and smaller β force the subspace difference to decrease at the cost of loosening the requirement on local discriminative preservation, as shown in (18).

IV. CONCLUSION
In this article, we propose the ALDADM method for DA in HSI classification. Its idea is to first perform DPLs on the labeled source samples and the pseudolabeled target samples, thereby reducing the interference of outliers in D_s and mispredicted instances in D_t. Then, it learns two mappings to bring the mapped subspaces of the source and target domains closer, while addressing the challenge of multimodally distributed data by respecting the local manifold structure. Extensive experiments on four tasks demonstrate the effectiveness and robustness of ALDADM for cross-domain classification.
In this article, we only focus on the homogeneous DA problem, where the source and target domains are from the same sensor. Nevertheless, there are still large spectral differences between the two domains. Changed soil properties, environmental conditions, and the incident angle of sunlight can cause differences in the spectra of similar materials across geographical regions or collection times. In addition, some materials have very similar spectral characteristics, such as Asphalt and Bitumen in the PUPC task, and it is easy to confuse them. Therefore, applying DA methods to HSIs is challenging due to spectral differences and distribution shifts. When the data come from different sensors, knowledge transfer and classification are even more difficult and prone to poor results. In the future, we will try to extend the homogeneous DA method to multisource or heterogeneous DA scenarios.