Domain Adaptation in Remote Sensing Image Classification: A Survey

Traditional remote sensing (RS) image classification methods rely heavily on labeled samples for model training. When labeled samples are unavailable, or when the labeled samples have distributions different from those of the samples to be classified, the classification model may fail. Cross-domain or cross-scene remote sensing image classification addresses this case, where an existing image is used for training and an unknown image from a different scene or domain is to be classified. The distribution inconsistency may be caused by differences in acquisition environment conditions, acquisition scene, acquisition time, and/or sensors. To cope with the cross-domain RS image classification problem, many domain adaptation (DA) techniques have been developed. In this article, we review DA methods in the field of RS, especially hyperspectral image classification, and categorize them into traditional shallow DA methods (e.g., instance-based, feature-based, and classifier-based adaptations) and recently developed deep DA methods (e.g., discrepancy-based and adversarial-based adaptations).


Remote sensing (RS) provides Earth observations, which can be used for large-scale and long-term applications.
Although different types of RS images are available, they are difficult to process collaboratively due to the large distribution differences (e.g., different sensors and acquisition conditions) among them. In particular, for RS image classification, we usually want to build a model on a known image to classify an unknown one. If these two images have different data distributions, traditional classification methods may not provide satisfying results. Fortunately, if the distributions of the two images are related, we can use a domain adaptation (DA) technique to build a connection between the images and transfer knowledge from one image to the other. For RS image processing, there are various cases where two images have different but related distributions [2], [3]: 1) difference in sensors: two images are acquired by two different sensors on the same scene; 2) difference in spatial locations (sampling bias): two images correspond to two disjoint regions in a large scene; 3) difference in scenes: two images correspond to two different scenes with similar materials; 4) difference in acquisition conditions (atmospheric, illumination, or acquisition angle): two images are acquired under different imaging conditions; 5) difference in acquisition times: two images are acquired at different times and the ground materials have changed. In all these cases, the two images share some related characteristics, such as the same or similar scenes, the same sensor, or similar acquisition conditions. Thanks to such correlation, the data inconsistency problem can be solved by DA techniques, which recast the differences in sensors or imaging environmental conditions as a data or feature transfer problem.
DA aims to resolve the distribution discrepancy between domains [4], [5]. Depending on the availability of target labels, DA methods can be categorized into unsupervised, semisupervised, and supervised methods; unsupervised DA, which uses no labels in the target domain, is a research hotspot because it matches many practical situations. Early research on unsupervised DA mainly focused on traditional methods that align distributions at the level of instances, features, or classifiers. Instance-based methods mainly adjust the marginal distribution of source or target samples. Feature-based methods align the subspace features of different domains to minimize their distribution differences. Classifier-based methods mainly adapt a classifier trained on the source domain to the target domain. In recent years, many deep learning-based DA methods have been proposed [6]. By means of deep network architectures, deep DA methods can automatically extract deep features from the domains and further learn transferable features by adding feature adaptation layers to an original deep network architecture or by constructing feature learning modules (e.g., adversarial learning). Considering the different network architectures, deep DA methods are mainly categorized into discrepancy-based methods and adversarial-based methods [5]. Although these traditional and deep DA methods are widely applied to computer vision tasks, their applicability to RS images is not clear. In this article, we review these unsupervised DA methods and test their performance on cross-domain RS image classification.
The rest of this article is organized as follows. Section II introduces notations. Sections III, IV, and V describe the shallow DA methods, i.e., instance-based, feature-based, and classifier-based methods, respectively. Section VI presents the deep DA methods, i.e., discrepancy-based and adversarial-based methods. Section VII provides experimental results. Finally, Section VIII concludes this article.

DA considers classification problems where the class spaces of the source and target domains are the same but their distributions are different yet related. The objective of DA is to classify target samples using a model built on source samples.
DA methods can be categorized into traditional shallow DA methods and recent deep DA methods. The traditional methods include instance-based, feature-based, and classifier-based DA methods. They are introduced in detail in the following sections. Table I shows the taxonomy of the DA methods discussed in this article.

III. INSTANCE-BASED METHODS
Instance-based DA methods mainly adjust the marginal distribution of the source or target samples such that the distributions of the domains are aligned. Let p_s(x) and p_t(x) be the marginal density distributions of the source and target samples, respectively. An importance weight can then be defined as [5]

w(x) = p_t(x) / p_s(x).

By adjusting the reweighting factor w(x), the sample selection bias and covariate shift problems can be alleviated to a certain extent [146], [147]. As shown in Fig. 1, the instance reweighting strategy reweights the source data [i.e., the solid points in Fig. 1(c) have large weights] to minimize the marginal distribution difference between domains, and then a classifier built on the reweighted source data [i.e., the black line in Fig. 1(c)] can be used to classify target samples.
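As a minimal illustration of the reweighting idea (not any specific method from the literature), the following sketch assumes the two marginal densities are known Gaussians and checks that weighting source samples by w(x) recovers a target-domain statistic; in practice, w(x) must be estimated from the samples themselves, e.g., by kernel mean matching:

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariate shift: source and target are Gaussians with different means.
mu_s, mu_t, sigma = 0.0, 1.0, 1.0
x_s = rng.normal(mu_s, sigma, 20000)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Importance weight w(x) = p_t(x) / p_s(x); the densities are known here only
# for illustration -- real DA methods estimate this ratio without them.
w = gauss_pdf(x_s, mu_t, sigma) / gauss_pdf(x_s, mu_s, sigma)

# The w-weighted source mean approximates the target mean mu_t = 1.0.
weighted_mean = np.sum(w * x_s) / np.sum(w)
print(round(weighted_mean, 2))
```

The same weights would multiply the per-sample loss terms when training a classifier on the source data, which is how the reweighted classifier in Fig. 1(c) is obtained.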
To solve the sample selection bias, Huang et al. [7] presented a nonparametric kernel mean matching (KMM) method to directly produce resampling weights without distribution estimation. Yaras et al. proposed a randomized histogram matching (RHM) method to augment training data to describe domain shifts of satellite images. In detail, they analyzed different causes of domain shift, such as changing sensors, illumination variations, and imaging conditions, modeled these factors as nonlinear pixelwise transformations, and then employed training data augmentation with deep neural networks to increase model robustness to these transformations [8]. Cui et al. [9] proposed an iterative weighted active transfer learning framework (IWATL) for hyperspectral image (HSI) classification. It weights the source samples by considering both the distance between the samples and the classification hyperplane and the similarity between the source and target distributions. Li et al. [10] proposed a cost-sensitive self-paced learning (CSSPL) framework for the classification of multitemporal images, which automatically assigns sample weights via a mixture weight regularizer. To reuse the large number of existing labeled images, a historical and target training data weighting strategy was proposed in an extreme learning machine (ELM)-based RS image transfer classification framework [11].
In instance-based DA methods, source and/or target sample reweighting and landmark selection are widely used strategies [146], [147]. These strategies can also be embedded into the feature-based and classifier-based adaptations for domain-invariant feature learning or classifier refinement [25], [46], [148], respectively.

IV. FEATURE-BASED METHODS
Feature-based DA methods transform source and target data into a feature space in which the data distributions of both domains are similar. The source features and corresponding labels can then be used to train a classifier to predict the labels of target samples. Feature-based adaptation is usually realized by joint feature extraction; typical methods are subspace-based and transformation-based adaptation methods.

A. Subspace-Based Adaptation
Subspace-based DA methods usually project the source and target samples into individual subspaces according to subspace learning or dimensionality reduction methods, and then align the subspaces [5].
Gopalan et al. [12] proposed a sampling geodesic flow (SGF) method for DA, which learns intermediate representations of source and target samples via Grassmann manifolds to describe domain shift. However, the SGF approach has several limitations, such as the difficulty of its sampling strategy and the high dimensionality of the new representations. To solve these problems, the geodesic flow kernel (GFK) method was proposed [13]. It constructs a GFK to model domain shift and provides a simple solution to compute the kernel, making the GFK method easier to implement than SGF [13], [41]. Banerjee et al. [14] proposed a coclustering-based method for DA in the absence of source samples. The samples from both domains are projected into a shared space by the GFK-based projection, followed by a probabilistic support vector machine (SVM)-based iterative coclustering method.
Fernando et al. [15] proposed a subspace alignment (SA) method, which first employs principal component analysis (PCA) to generate individual subspaces for the source and target domains and then learns a linear transformation M to align these subspaces:

M* = argmin_M || P_s M - P_t ||_F^2

where ||·||_F is the Frobenius norm and P_s, P_t ∈ R^(D×d) are the low-dimensional representations (i.e., basis vectors) of the source and target data, respectively. This problem has the closed-form solution M* = P_s^T P_t. The procedure of SA is illustrated in Fig. 2.
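A compact NumPy sketch of the SA procedure (PCA bases via SVD, closed-form M = P_s^T P_t; the function name is ours, not from [15]):

```python
import numpy as np

def subspace_alignment(Xs, Xt, d):
    """Subspace Alignment: project each domain onto its own PCA basis,
    then align the source basis to the target basis with M = Ps^T Pt."""
    Xs_c, Xt_c = Xs - Xs.mean(0), Xt - Xt.mean(0)
    Ps = np.linalg.svd(Xs_c, full_matrices=False)[2][:d].T  # D x d source basis
    Pt = np.linalg.svd(Xt_c, full_matrices=False)[2][:d].T  # D x d target basis
    M = Ps.T @ Pt                     # closed-form alignment matrix
    Zs = Xs_c @ (Ps @ M)              # aligned source features (n_s x d)
    Zt = Xt_c @ Pt                    # target features (n_t x d)
    return Zs, Zt
```

A classifier (e.g., 1-NN or SVM) trained on Zs with the source labels is then applied directly to Zt.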
Sun et al. [16] directly applied the SA for cross-view RS scene classification, where the partial least squares (PLS) method was used to generate discriminative subspace of source domain. They further proposed a transfer sparse subspace analysis (TSSA) algorithm for unsupervised cross-view RS scene classification [18]. It minimized the maximum mean discrepancy (MMD) distance between domains and preserved the self-expressiveness property of the data in a reproducing kernel Hilbert space (RKHS) according to sparse subspace clustering. Wei et al. [17] proposed a robust DA method for HSIs which employed SA to perform subspace feature-level alignment. The SA was extended to a tensor alignment (TA) for HSI classification [19], where tensors of source and target domains were constructed and SA was performed on the tensors. Gao et al. [20] proposed an unsupervised tensorized principal component alignment framework for multimodal RS image classification. Gui et al. [21] proposed a statistical scattering component-based SA for cross-domain polarimetric synthetic aperture radar (PolSAR) image classification.
It can be seen that SA only aligns the subspace bases without considering the distributions of subspaces. To incorporate the distribution alignment into SA, a subspace distribution alignment (SDA) method was proposed to align both subspace bases and subspace distributions [22]. For the cross-scene classification of HSIs, a discriminative cooperative alignment (DCA) method was proposed to alleviate spectral shift [23]. In the DCA, SA and distribution alignment work cooperatively through the subspace correlation constraint and MMD [23]. Zhang et al. [24] proposed a correlation subspace dynamic distribution alignment (CS-DDA) method for RS scene classification, which maximizes the correlation between source and target subspaces and meanwhile dynamically minimizes the statistical distribution difference between domains.
To handle nonlinearity, Aljundi et al. [25] extended the linear SA to a kernel-based SA (KSA). The kernels of the source and target domains are first constructed on the selected landmarks, and then SA is performed on the source and target kernels to align the kernel-based subspaces [25]. To further exploit the source labels and multiple kernel representations, an ideal regularized discriminative multiple kernel subspace alignment (IRDMKSA) was proposed for HSI classification [26]. It performs SA in composite-kernel-based spaces to reduce the distribution differences between domains.
Traditional subspace learning-based strategies usually assume the existence of a single subspace for both domains. However, such an assumption may not be true in many scenarios due to the diversity in the statistical properties of the underlying classes [149]. Banerjee et al. [149] proposed a hierarchical subspace learning-based unsupervised DA technique for multitemporal RS image classification, where node-specific subspaces are learned from a binary-tree. Shen et al. [27] presented a hyperspectral feature adaptation and augmentation (HFAA) method for cross-scene HSI classification, which iteratively learns a common subspace by introducing two separate projection matrices and augments it with a feature selection strategy. Li et al. [28] proposed an iterative reweighting heterogeneous transfer learning (IRHTL) framework, which iteratively learns a shared space of source and target data based on a weighted SVM and conducts an iterative reweighting strategy to reweight the source samples.
Invariant feature-based methods can be regarded as a special case of subspace-based adaptation. They aim to select a set of features that are not affected by shifting factors; the selected features form a new subspace. Bruzzone et al. [29] proposed a multiobjective optimization framework to select spatially invariant features for the classification of spatially disjoint scenes. The multiobjective framework ensures that the selected features have both high discrimination ability and high spatial invariance. The invariant feature selection can also be performed in an RKHS [30]. Paris et al. [31] presented an invariant-feature-based sensor-driven hierarchical DA method. Yan et al. [32] proposed a TrAdaBoost method based on an improved particle swarm optimization (PSO) for cross-domain scene classification, which can select an optimal feature subspace for classifying "harder" and "easier" instances.
There are other subspace-based adaptation methods. Ye et al. [33] proposed a dictionary learning-based feature-level DA technique, which learns a common dictionary to represent source and target data and then aligns their representation coefficient features to reduce the spectral shifts between domains. Wang et al. [34] proposed a pairwise constraint discriminant analysis and nonnegative sparse divergence (PCDA-NSD) method for HSI classification. The PCDA learns potential discriminant information of the sample sets in the source and target domains by using pairwise constraints, and the NSD measures the divergence between the different distributions. Lin et al. [35] proposed a dual space unsupervised structure preserving transfer learning (DSTL) framework for HSI classification. It first transfers the data of both domains to a specific subspace, on which initial classification results for the target HSI are obtained. Then, the initial results on the original target data space are optimized by applying the Markov random field (MRF) approach. Chen et al. [36] proposed a semisupervised dual-dictionary nonnegative matrix factorization (SS-DDNMF) method for heterogeneous transfer learning on cross-scene HSIs, where two different dictionaries are designed for the source and target scenes to project two different feature spaces into a shared subspace. Gui et al. [37] proposed a general feature paradigm (GFP) for PolSAR image classification, where scattering information and statistical information are used to reduce the domain shifts.

B. Transformation-Based Adaptation
The transformation-based DA methods transform the original data into new representations to minimize the statistical distribution (i.e., marginal and conditional distributions) discrepancy and geometrical divergence between domains while preserving the underlying structure of original data [5], as shown in Fig. 3.
The MMD, Kullback-Leibler divergence (KL-divergence), or Bregman divergence is usually used to measure the distribution discrepancy between domains [38]. The MMD is defined as

MMD(X_s, X_t) = || (1/n_s) Σ_{i=1}^{n_s} φ(x_i^s) - (1/n_t) Σ_{j=1}^{n_t} φ(x_j^t) ||_H^2

where φ is a nonlinear map induced by a universal kernel. Pan et al. [38] introduced the MMD to measure the distribution discrepancy and proposed a transfer component analysis (TCA) method. TCA intends to learn a set of transfer components in an RKHS using the MMD such that the marginal distribution differences between domains are reduced and the data variance is maximized [38]. The transformation matrix W ∈ R^((n_s+n_t)×d) in TCA can be solved by the following optimization problem:

min_W  tr(W^T K L K W) + μ tr(W^T W)   s.t.   W^T K H K W = I_d

where μ is a regularization parameter, I_d ∈ R^(d×d) is an identity matrix, H is the centering matrix, K is the kernel matrix defined on all source and target data, and L is the MMD matrix with L_ij = 1/n_s^2 if x_i and x_j are both source samples, 1/n_t^2 if both are target samples, and -1/(n_s n_t) otherwise. The unsupervised TCA can be extended to semisupervised TCA (SSTCA) by using the source labels [38]. Matasci et al. [39] directly applied TCA and SSTCA to the DA of RS image classification. Long et al. [40] proposed a transfer joint matching (TJM) method, which performs feature matching and instance reweighting simultaneously in a unified optimization framework to reduce the marginal distribution differences between domains. Peng et al. [41] proposed a discriminative transfer joint matching (DTJM) for HSI classification by considering the label information of the source domain.
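The empirical MMD above can be computed entirely from kernel evaluations. A small NumPy sketch with an RBF kernel (the bandwidth γ is fixed by hand here; TCA, TJM, and the deep discrepancy-based methods of Section VI all build on this same quantity):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    """Squared empirical MMD between source and target sample sets."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())
```

mmd2 is zero when the two sample sets coincide and grows as the domains drift apart, which is exactly the discrepancy that transformation-based methods drive toward zero.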
To align both the marginal and conditional distributions between domains, Long et al. [42] further proposed a joint distribution adaptation (JDA) method. JDA finds a linear transformation A ∈ R^(D×d) that aligns the marginal distribution based on the MMD and the conditional distribution based on a classwise (conditional) MMD:

min_A  Σ_{c=0}^{C} tr(A^T X M_c X^T A) + λ ||A||_F^2   s.t.   A^T X H X^T A = I

where X contains all source and target samples, H is the centering matrix, λ is a regularization parameter, and M_0 and M_c (c = 1, ..., C) are the marginal and conditional MMD matrices, respectively [42].
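The MMD matrices M_0 and M_c in the JDA objective have a simple outer-product form. A sketch (the helper name is ours; the target pseudo-labels would come from a base classifier trained on the source domain):

```python
import numpy as np

def mmd_matrix(ys, yt_pseudo, c=None):
    """Marginal (c=None) or classwise conditional MMD matrix for JDA.
    M = e e^T with e_i = 1/n_s for source rows and -1/n_t for target rows,
    restricted to samples of class c in the conditional case."""
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros(ns + nt)
    if c is None:
        e[:ns] = 1.0 / ns
        e[ns:] = -1.0 / nt
    else:
        src = np.flatnonzero(ys == c)
        tgt = ns + np.flatnonzero(yt_pseudo == c)
        if src.size:
            e[src] = 1.0 / src.size
        if tgt.size:
            e[tgt] = -1.0 / tgt.size
    return np.outer(e, e)
```

With this construction, tr(A^T X M_c X^T A) equals the squared distance between the projected class-c means of the two domains, so summing over c aligns the conditional distributions class by class.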
The above TCA, TJM, and JDA methods assume that there exists a unified transformation A that maps the source and target samples into a common space. However, if the domain shift is large, it is very difficult to find such a common transformation [43]. Zhang et al. [43] proposed a joint geometrical and statistical alignment (JGSA) method to learn two coupled mappings A and B for the source and target domains, respectively. The distribution divergence minimization in JGSA can be represented as

min_{A,B}  || (1/n_s) Σ_{i=1}^{n_s} A^T x_i^s - (1/n_t) Σ_{j=1}^{n_t} B^T x_j^t ||^2

i.e., the MMD between the projected source and target samples (together with its classwise conditional counterpart). Inspired by JGSA, Zhou et al. [44], [45] proposed a DA technique based on transformation learning (DATL) for HSI classification. It learns two different transformations using the idea of linear discriminant analysis (LDA) to minimize the ratio of within-class distance to between-class distance. A distance-based objective function is designed to optimize the transformations while preserving the stochastic neighborhood and discriminative information of the domains in a latent space [44], [45]. Li et al. [46] proposed a locality preserving joint transfer (LPJT) method to improve JGSA by considering local discriminative information preservation and landmark selection in a unified optimization framework. Huang et al. [47] proposed a graph embedding and distribution alignment (GEDA) method for HSI classification, which uses graph embedding to preserve the discriminative information of the source and target domains and a pseudo-label learning method to refine the target pseudo-labels [47]. Similarly, they further proposed a distribution alignment and discriminative feature learning (DADFL) method [48], which performs classwise discriminative information preservation and uses a structural prediction method to learn the pseudo-labels of target samples.
Sun et al. [49] proposed a correlation alignment (CORAL) method, which finds a linear transformation A for the source data such that the distance between the covariances of the transformed source data and the target data is minimized:

min_A  || C_Ŝ - C_T ||_F^2

where C_Ŝ is the covariance of the transformed source features A^T X_s and C_T is the covariance of the target features. This objective has the closed-form solution A* = C_S^(-1/2) C_T^(1/2), i.e., whitening the source data and then recoloring it with the target covariance [49]. Peng et al. [50] proposed a sparse matrix transform-based CORAL method for HSI classification. Zhu et al. [51] proposed a class centroid alignment method, which aligns the class centroids by moving the target domain samples toward the source domain. To consider both first- and second-order statistical alignment, a class centroid and covariance alignment (CCCA) method was developed for the classification of RS images [52]. The method includes three main steps: 1) spatial filtering preprocessing, 2) overall centroid alignment-based coarse adaptation, and 3) CCCA-based refined adaptation.
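CORAL's closed form takes only a few lines of NumPy (matrix square roots via eigendecomposition; a small jitter eps keeps the source covariance invertible):

```python
import numpy as np

def coral(Xs, Xt, eps=1e-6):
    """CORAL: whiten the source features with Cs^(-1/2), then
    re-color them with the target covariance via Ct^(1/2)."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # Power of a symmetric PSD matrix via eigendecomposition.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(np.maximum(vals, eps) ** p) @ vecs.T

    A = mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)   # closed-form transform
    return Xs @ A
```

After the transform, the covariance of the returned source features matches the target covariance up to the jitter, so a classifier trained on them sees target-like second-order statistics.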
Many canonical correlation analysis (CCA)-based transformation methods were proposed for DA. Qin et al. proposed a cross-domain collaborative learning (CDCL) method for heterogeneous DA of HSIs. It consisted of three parts, i.e., random walker (RW)-based pseudo-labeling, cross-domain learning via cluster canonical correlation analysis (C-CCA), and final classification based on extended RW (ERW) algorithm [53]. Samat et al. [54] proposed a supervised and semisupervised multiview CCA ensemble method for heterogeneous DA in RS image classification. Li et al. [55] proposed a sparse subspace correlation analysis-based supervised classification (SSCA-SC) method for HSI classification, which integrated the idea of CCA into a sparse representation subspace learning framework and directly classified the target samples based on the sparse representation reconstruction residuals. Volpi et al. [56] proposed a kernel CCA transformation (kCCA) method to align spectral characteristics of multitemporal cross-sensor images for change detection.
To correct nonlinear variation between domains, Tuia et al. constructed a nonlinear transform based on vector quantization and graph matching to describe the data changes under different acquisition conditions [57]. They further proposed a semisupervised manifold alignment (SS-MA) method to align the manifolds of RS images [58] by solving a standard Rayleigh quotient (a generalized eigenvalue problem) in which the affinity matrix V is used to maximize the distances between samples of different classes, while U enhances the class similarity between labeled instances across domains. The matrices U and V can be constructed based on the graph Laplacian. The SS-MA method can be used for multitemporal, multisource, and multiangular classification. Yang et al. [59] proposed a global aligned local manifold (GALM) method to align two globally similar manifolds and to minimize the impact of spectral changes at the local scale. They further extended the MA and proposed a spectral and spatial proximity-based MA for multitemporal HSI classification [60]. Hong et al. [61] proposed a learnable MA (LeMA) for semisupervised cross-modality hyperspectral-multispectral classification. Ma et al. [62] proposed an unsupervised MA method for cross-domain classification of RS images, which uses an SVM-prediction-based cross-domain similarity matrix and a per-class MMD constraint. To exploit the manifold structure of the data, Luo et al. [63] proposed a manifold regularized distribution adaptation (MRDA) algorithm to minimize the per-class MMD while preserving the manifold structure of the source and target data in the subspace. Wang et al. [64] proposed a DA broad learning (DABL) method for HSI classification, which combines DA and the broad learning system (BLS) to perform MMD-based distribution alignment and manifold structure preservation. Dong et al.
proposed a spectral-spatial weighted kernel manifold embedded distribution alignment (SSWK-MEDA) method for RS image classification [65], which applies a spatial filter to preprocess the hyperspectral data and constructs a spatial-spectral composite kernel for kernel-based adaptation. Gross et al. [66] proposed a nonlinear feature normalization alignment (NFNalign) transformation to mitigate nonlinear effects in hyperspectral data.
There are also some other transform-based methods. Chakraborty et al. [150] proposed an artificial neural network-based DA strategy, which unifies common data transformation and transfer learning methods. Tardy et al. [67] applied optimal transport (OT) to land-cover mapping of high-resolution satellite image time series. Jia et al. [68] applied the 3-D Gabor transformation to extract spatial-spectral features of HSIs for DA.

V. CLASSIFIER-BASED ADAPTATION
The classifier-based DA methods adapt a classifier trained on a source domain to a target domain by considering unlabeled samples of the target domain.
The classifier-based DA can be performed by adapting the classifier parameters [69], [70], [71], [72], [151]. Bruzzone et al. [69] proposed a classifier-based DA method to solve the data distribution difference between multitemporal RS images by updating the parameters of a trained maximum-likelihood (ML) classifier on the basis of the distribution of a new image to be classified. The ML-based DA technique was further extended to the Bayesian cascade classifier, multiple classifiers, and multiple cascade classifiers [70], [71], [72]. Zhong et al. [152] proposed a classifier updating method that considers spectral features and guided-filter-based posterior spatial features. Izquierdo-Verdiguier et al. [73] updated the SVM classifier by adding virtual support vectors (VSVs) for training, where the VSVs encode invariances to rotations, reflections, and object scale. An SVM-based sequential classifier training (SCT-SVM) approach was proposed for multitemporal RS image classification [74]. By casting DA as a multitask or multiple-kernel learning problem, many multiple-kernel learning-based DA methods were proposed [75], [76], [77], [78]. Xu et al. [79] proposed a DA method that transfers the parameters of an ELM. Considering the simplicity of the ELM, many ELM-based classifier adaptation methods were proposed, such as cross-domain ELM (CDELM) [80], ELM-based heterogeneous DA [81], interpretable rule-based fuzzy ELM (IRF-ELM) [82], and ensemble transfer learning based on ELM (TL-ELM) [83]. Wei et al. investigated the combination of multiple classifiers for DA of RS image classification; the multiple domain adaptation fusion (MDAF) method and the multiple base classifier fusion (MBCF) method were proposed to obtain more stable classification performance [84]. Zhang et al. [85] considered the open-set DA problem for RS scene classification by updating the classifier to explore transferability and discriminability. Wang et al.
[86] proposed an easy transfer learning (EasyTL) approach by exploiting intradomain structures to learn both nonparametric transfer features and classifiers.
Semisupervised learning (SSL) and active learning (AL) techniques can also be used to solve the domain shift problem by updating a source classifier with unlabeled target samples [153], [154]. Rajan et al. [87] proposed a binary hierarchical classifier (BHC) framework for knowledge transfer from an existing labeled source domain to a spatially separate, multitemporal target domain, where an SSL technique is used to update the BHC to reflect the characteristics of the new data. The classical supervised SVM was also extended to the semisupervised case to solve the DA problem [88], [155]. A typical semisupervised SVM is the domain adaptation SVM (DASVM) [88], which builds a standard SVM on the source domain and then iteratively adjusts the SVM model using unlabeled target samples. Kim et al. proposed an adaptive manifold classifier (MRC) in a semisupervised setting, where a kernel machine is first trained with labeled data and then iteratively adapted to new data using manifold regularization [89]. The AL technique has also been adopted to update existing classifiers [90], [91], [92], [93], [94], [95], [96], [97]. As shown in Fig. 4, the AL method first builds a classifier on the source data and then classifies target samples. Some candidates with the highest uncertainty are selected and labeled by the user; these samples then expand the training set used to update the classifier. Deng et al. [140] proposed an active multikernel DA method for HSI classification, which combines AL with multikernel learning for DA. Kalita et al. [156] proposed a standard deviation (SD)-based AL technique to exploit the labeled source images to generate the "most-informative" target samples. Saboori et al. [157] proposed an active multiple kernel Fredholm learning (AMKFL) method, where a Fredholm kernel regularized model is presented to label samples.
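The candidate-selection step in Fig. 4 is often implemented as margin-based uncertainty sampling; a minimal, generic sketch (not tied to any cited method):

```python
import numpy as np

def select_uncertain(probs, k):
    """Return indices of the k target samples with the smallest margin
    between the top-2 class probabilities (highest classifier uncertainty).
    probs: (n_samples, n_classes) array of predicted class probabilities."""
    sorted_p = np.sort(probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margin)[:k]
```

The selected samples are shown to the user for labeling and appended to the training set, after which the classifier is retrained; the loop repeats until the labeling budget is spent.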

VI. DEEP DOMAIN ADAPTATION
Unlike hand-crafted features in traditional DA methods, deep learning methods can automatically learn features using deep neural networks (DNNs) [158]. Most of the current deep DA methods add adaptation layers to an original deep network architecture to realize the source-to-target adaptation or adopt an adversarial learning strategy to minimize the cross-domain discrepancy. Deep DA methods are mainly divided into discrepancy-based methods, adversarial-based methods, and others [5], [159].

A. Discrepancy-Based Adaptation
The discrepancy-based deep DA methods mainly aim to match the marginal and/or conditional distributions between domains by adding adaptation layers (e.g., MMD-based metrics) to DNNs for task-specific representations, as shown in Fig. 5. Long et al. [98] proposed the deep adaptation network (DAN), which was the first to utilize DNNs to learn transferable features across domains for DA. In DAN, three adaptation layers based on the multiple kernel variant of MMD (MK-MMD) are designed to align the marginal distributions between domains. To reduce both marginal and conditional distribution differences, they further proposed a joint adaptation network (JAN), which uses source labels and target pseudo-labels to construct the MMD and conditional MMD and aligns the joint distribution according to an adversarial training strategy. Based on DAN, Zhu et al. [99] proposed a multirepresentation adaptation network (MRAN) [100], which performs cross-domain classification tasks through multirepresentation alignment. They further proposed a deep subdomain adaptation network (DSAN) using the idea of subdomain adaptation [101]. Zhu et al. [160] developed a weakly pseudo-supervised decorrelated subdomain adaptation (WPS-DSA) network for cross-domain land-use classification. Sun et al. [102] proposed the DeepCORAL method, which extends CORAL to deep learning.
For HSI classification, Ma et al. [103] designed a class centroid alignment module in the DNN for cross-domain HSI classification. Garea et al. [104] proposed a TCA-based network (TCANet) for DA of HSIs, which uses TCA to construct an adaptation layer. Wang et al. [105] proposed a deep DA method with MMD-based classwise distribution alignment and manifold structure preservation in the target domain. Ma et al. [106] proposed a deep DA network (DDA-Net) for cross-dataset HSI classification, which minimizes the domain discrepancy and transfers the task-relevant knowledge from source to target in an unsupervised way. Li et al. [107] proposed a two-stage deep DA (TDDA) method, where in the first stage, the distribution distance between domains is minimized based on the MMD to learn a deep embedding space, and in the second stage, a spatial-spectral Siamese network is constructed to learn discriminative spatial-spectral features to further decrease the distribution discrepancy. Zhang et al. [108] proposed a topological structure and semantic information transfer network (TSTnet). It employs the graph structure to characterize topological relationships and combines the graph convolutional network (GCN) and CNN for cross-scene HSI classification; the optimal transport (OT)-based graph alignment and MMD-based distribution alignment work cooperatively. Wang et al. [109] proposed a graph neural network (GNN) DA method for multitemporal HSIs, which incorporates the domainwise and classwise CORAL into the GNN to align the joint distributions of the domains. Liang et al. [110] proposed an attention multisource fusion-based deep few-shot learning (AMF-FSL) method for HSI classification with small sample sizes, which contains three modules, namely, target-based class alignment, domain attention assignment, and multisource data fusion. It can transfer the learned classification ability from multiple source data to the target data.
Othman et al. [161] proposed a DA network for cross-scene classification, which uses a pretraining and fine-tuning strategy to ensure that the network can correctly classify the source samples, align the source and target distributions, and preserve the geometrical structure of the target data. Similarly, Lu et al. [111] proposed a multisource compensation network (MSCN) for the cross-scene classification task. In the network, a cross-domain alignment module and a classifier complement module are designed to reduce the domain shift and to align categories in multiple sources, respectively. Zhu et al. [112] proposed an attention-based multiscale residual adaptation network (AMRAN) for cross-scene classification, which contains a residual adaptation module for marginal distribution alignment, an attention module for robust feature extraction, and a multiscale adaptation module for multiscale feature extraction and conditional distribution alignment. Geng et al. [113] proposed deep joint distribution adaptation networks (DJDANs) for transfer learning in SAR image classification, where marginal and conditional distribution adaptation networks are developed.

B. Adversarial-Based Adaptation
Inspired by generative adversarial nets (GAN) [114], [115], adversarial DA approaches learn transferable and domain-invariant features through adversarial learning. A GAN contains a generator model G and a discriminator model D. The generator aims to produce samples similar to the source domain and to confuse the discriminator into making a wrong decision, while the discriminator tends to discriminate between the true source data and the counterfeits generated by G [5]. The training objective can be summarized as

$\min_{G,C} \mathcal{L}_{cls} + \mathcal{L}_{adv}^{G}, \qquad \min_{D} \mathcal{L}_{adv}^{D},$

where C is the classifier, $\mathcal{L}_{cls}$ is the classification loss on labeled source data, and $\mathcal{L}_{adv}^{G}$ and $\mathcal{L}_{adv}^{D}$ are the loss functions of the adversarial training of G and D, respectively. After GAN-based adaptation, a task-specific classifier built on the source domain can be used to classify target samples. Fig. 6 illustrates the GAN-based DA.
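The adversarial objective can be made concrete with binary cross-entropy losses: D is trained to label source features as 1 and target features as 0, while G is trained to fool D on the target features. A minimal NumPy sketch of the two loss terms (the non-saturating generator loss used here is one common choice, shown for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_losses(d_src_logits, d_tgt_logits, eps=1e-12):
    """Adversarial losses of GAN-based DA.

    L_adv_D trains the discriminator to output 1 for source features and
    0 for target features; L_adv_G trains the generator to fool D into
    labeling target features as source (non-saturating form).
    """
    p_src = sigmoid(d_src_logits)
    p_tgt = sigmoid(d_tgt_logits)
    L_adv_D = -np.mean(np.log(p_src + eps)) - np.mean(np.log(1 - p_tgt + eps))
    L_adv_G = -np.mean(np.log(p_tgt + eps))
    return L_adv_D, L_adv_G
```

When D separates the domains perfectly, L_adv_D is near zero and L_adv_G is large; at full domain confusion (all logits zero) L_adv_D equals 2 ln 2, the equilibrium the generator drives the game toward.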
Tzeng et al. [116] introduced an adversarial CNN-based architecture that aligns distributions between domains by minimizing the classification loss, soft label loss, domain classifier loss, and domain confusion loss. Pei et al. [117] proposed a multiadversarial DA (MADA) method that constructs multiple classwise domain discriminators to reduce the joint distribution difference between domains. Yu et al. [118] proposed a dynamic adversarial adaptation network (DAAN), which can dynamically assess the relative importance of the global and local domain distributions. Saito et al. [119] proposed a deep DA method based on the maximum classifier discrepancy (MCD), which leverages task-specific decision boundaries and adversarial learning to align the distributions of the source and target domains. The MCD method aims to learn domain-invariant features, which, however, may have lower discriminative ability. To solve this problem, a dynamic weighted learning (DWL) method was proposed to adjust the weights of domain alignment learning and class discrimination learning in the MCD framework [120].
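The classifier discrepancy at the heart of MCD is simply the mean L1 distance between the softmax outputs of two task classifiers evaluated on the same (target) samples; a brief sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

def mcd_discrepancy(logits1, logits2):
    """Mean L1 distance between two classifiers' softmax outputs (MCD)."""
    return np.mean(np.abs(softmax(logits1) - softmax(logits2)))
```

In MCD training, the two classifiers are first updated to maximize this discrepancy on target samples (exposing target samples near the decision boundaries), and the feature generator is then updated to minimize it.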
Recently, many adversarial-based DA approaches have been developed for HSI classification. Ma et al. [121] proposed an adversarial learning-based DA method for the classification of HSIs, which includes a variational autoencoder (VAE)-based generator and a multiclassifier-based discriminator. The generator learns features such that the source classification error is minimized and the classification disagreement on the target dataset is maximized; the discriminator deceives the generator by adjusting the classifiers such that the classification disagreement on the target dataset is minimized. Miao et al. [122] also used a VAE module to construct a generative model and further designed a joint distribution alignment module to perform coarse-to-fine joint distribution alignment for HSI classification. Yu et al. [123] proposed a contentwise alignment method within an adversarial learning framework. Pande et al. [124] proposed a class reconstruction-driven adversarial DA method, which incorporates an additional class-level cross-sample reconstruction loss to make the learned space classwise compact and an additional orthogonality constraint over the source domain to avoid redundancy within the encoded features. Liu et al. [125] proposed a classwise adversarial adaptation network for HSI classification, which performs classwise adversarial learning. Saboori et al. [126] proposed an adversarial discriminative active deep learning (ADADL) method for HSI classification. Similar to MCD, it incorporates two different land-cover classifiers as a discriminator to consider class boundaries when aligning feature distributions, and combines an entropy measure with the cross-entropy loss during training to exploit the information in unlabeled target data. Wang et al. [127] proposed a domain adversarial broad adaptation network (DABAN) for HSI classification.
It includes a domain adversarial adaptation network (DAAN) and a conditional adaptation broad network (CBAN), which can align the statistical distributions between domains and also enhance the representation ability of domain-invariant features. Yu et al. [123] proposed an unsupervised DA architecture with dense-based compaction (UDAD) for cross-scene HSI classification, which incorporates spectral-spatial feature compaction, unsupervised DA, and classifier training into an integrated framework and utilizes adversarial domain learning to reduce the domain discrepancy. Deng et al. [128] proposed a deep metric learning-based feature embedding method for HSI classification, which uses an adversarial learning strategy to align source and target features and to preserve the similar clustering structure of source and target features. Fang et al. [162] developed a confident learning (CL)-based DA (CLDA) method for HSI classification, where the CL module is designed to select high-confidence pseudo-labeled target samples. Li et al. [129] proposed a deep cross-domain few-shot learning (DCFSL) method for HSI classification, which combines FSL and DA: a conditional adversarial DA is employed to reduce domain shift, and FSL is used to learn transferable knowledge from source to target for classification.
Adversarial-based DA approaches have also been used for the scene classification of RS images. Teng et al. [130] presented a classifier-constrained deep adversarial domain adaptation (CDADA) method exploiting the idea of MCD for cross-domain semisupervised classification of RS scene images, where a deep convolutional neural network (DCNN) is used to build feature representations and adversarial DA is used to align the feature distributions of the domains. Zhang et al. [131] proposed a domain feature enhancement network (DFENet) to enhance the discriminative ability of the learned features for dealing with the domain variances of scene classification. Specifically, a context-aware feature refinement module is first designed to recalibrate global and local features by explicitly modeling interdependencies between the channel and spatial dimensions for each domain. Then, a multilevel adversarial dropout module is further designed to strengthen the generalization capability of the network. Yan et al. [132] proposed a triplet adversarial domain adaptation (TriADA) method for pixel-level classification of very high resolution (VHR) RS images, which learns a domain-invariant classifier via a domain similarity discriminator. Zhu et al. [133] proposed a semisupervised center-based discriminative adversarial learning (SCDAL) method for cross-domain scene classification of aerial images using adversarial learning with a center loss. Liu et al. [163] proposed an unsupervised adversarial DA network for remotely sensed scene classification, where a GAN-based feature extractor makes the source and target distributions closer, and a transferred classifier trained on transferred source-domain features achieves better classification accuracy on the target domain. Zheng et al.
[134] proposed a two-stage adaptation network (TSAN) for RS scene classification considering a single source domain and multiple target domains, which utilizes adversarial learning to align the single-source features with the mixed multiple-target features and self-supervised learning to distinguish the mixed multiple-target domain. Adayel et al. [164] developed a deep open-set DA method for cross-scene classification using adversarial learning and Pareto ranking. To exploit the classification information in the target domain, Zheng et al. [165] proposed a DA via task-specific classifier (DATSNET) method for RS scene classification, where an adversarial learning strategy is used to adjust task-specific classification decision boundaries.
Adversarial-based approaches have also been applied to other DA tasks of RS images. Bejiga et al. [135] proposed a domain adversarial neural network (DANN) for large-scale land cover classification of multispectral images, where the network consists of a feature extractor, a class predictor, and a domain classifier block. Rahhal et al. [166] proposed an adversarial learning method for DA from multiple remote sensing sources, which aligns the source and target distributions using a min-max entropy optimization method. Elshamli et al. [136] employed denoising autoencoders (DAE) and domain-adversarial neural networks (DANN) to tackle the DA problem for multispatial and multitemporal RS images. Martini et al. [3] developed self-attention-based domain-adversarial networks for land cover classification using multitemporal satellite images, where the deep adversarial network can reduce the domain discrepancy between distinct geographical zones. Ji et al. [167] proposed an end-to-end GAN-based DA method for land cover classification from multiple-source RS images, where the source images are translated to the style of the target images through adversarial learning to train a fully convolutional network (FCN) for semantic segmentation of the target images. Tasar et al. [137] proposed a multisource DA method (i.e., StandardGAN) for semantic segmentation of VHR satellite images. They further designed an unsupervised, multisource, multitarget, and life-long DA method for semantic segmentation of satellite images [168]. Wittich et al. [169] deployed a deep adversarial DA network using semantically consistent appearance adaptation for the classification of aerial images. A color mapping generative adversarial network (ColorMapGAN) was built for DA of RS image semantic segmentation [170]. Makkar et al. [171] adopted adversarial learning to extract discriminative target-domain features that are aligned with the source domain for geospatial image analysis. Mateo et al.
[172] investigated a cross-sensor adversarial DA method for cloud detection in Landsat-8 and Proba-V images.
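The DANN architecture used in several of the works above hinges on a gradient reversal layer (GRL) between the feature extractor and the domain classifier: it is the identity in the forward pass and flips (and scales) the gradient in the backward pass, so that training the domain classifier simultaneously pushes the features toward domain confusion. A minimal framework-agnostic sketch of that behavior:

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer (DANN): identity forward, -lam * grad backward."""
    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off coefficient, often annealed during training

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated,
        # turning the domain classifier's minimization into adversarial
        # maximization for the features.
        return -self.lam * grad_output
```

In an autodiff framework this is implemented as a custom operation with an overridden backward pass; the sketch above only illustrates the two rules.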

C. Others
There are some other deep DA methods. Yang et al. [138] proposed a transfer learning-based two-branch CNN model for HSI classification, where spatial and spectral CNNs are used to extract joint spectral-spatial features from HSIs, followed by target network training using transfer learning with limited labeled samples of the target domain. Zhou and Prasad [139] proposed a deep feature alignment neural network (FANN) for HSI classification, where discriminative features for both domains are extracted using deep convolutional recurrent neural networks (CRNN) and then aligned layer by layer according to the transformation learning-based domain adaptation (DATL) method. Deng et al. [140] proposed an active transfer learning network for HSI classification, which exploits a hierarchical stacked sparse autoencoder (SSAE) network to extract deep joint spectral-spatial features and an active TL strategy to transfer the pretrained SSAE network and the limited training samples from the source to the target domain. Song et al. [173] added an SA layer into CNN models for DA, which can fulfill domain alignment in the feature subspace by fine-tuning the modified CNN models. Liu et al. [174] combined transfer learning and virtual samples in a 3D-CNN model to solve the problem of insufficient samples. Zhong et al. [141] proposed a cross-scene deep transfer learning network with spectral feature adaptation (SFA) for HSI classification, which designs a multiscale spectral-spatial unified network (MSSN) with a two-branch architecture and a multiscale bank to extract discriminating features of HSIs. Chen et al. [142] proposed an augmented associative learning-based DA (AALDA) method for HSI classification, which employs the criterion of cycle consistency to generate features that are domain-invariant and discriminative. Mdrafi et al. [143] proposed an attention-based DA using a residual network for HSI classification, which considers different levels of attention. Saha et al.
[175] developed a graph neural network for multitarget DA in RS classification. Lasloum et al. [176] presented a multisource semisupervised DA method using a pretrained CNN for RS scene classification.
Othman et al. [144] designed a three-layer convex network termed 3CN for DA in multitemporal VHR RS images. It is composed of three main layers: 1) mapping source training samples to the target domain via ELM; 2) target image classification via ELM; and 3) spatial regularization via the random-walker algorithm. Kellenberger et al. [177] combined CNNs with AL for animal detection in UAV images, using OT to find corresponding regions between the source and target datasets in the space of CNN activations. Kalita et al. [178] investigated the DA problem for land cover classification by utilizing an ensemble decision approach of deep neural networks to address the extra- and missing-class problem. Chakraborty et al. [179] proposed a multilevel weighted-transformation-based neurofuzzy DA method using a stacked autoencoder for land-cover classification. Lucas et al. [145] proposed a Bayesian-inspired CNN-based semisupervised DA method to produce land cover maps from satellite image time series data. Tong et al. [180] proposed a transferable deep model for land-cover classification of multisource high-resolution RS images, which uses a pseudo-label learning strategy to automatically select training samples from the target domain and extracts multiscale contextual information of RS images for classification.

VII. EXPERIMENTAL RESULTS AND ANALYSIS
Two images from the 2013 and 2018 IEEE GRSS data fusion contests, i.e., Houston2013 and Houston2018, are used in the experiment. The two images were acquired by the ITRES Compact Airborne Spectrographic Imager (CASI)-1500 sensor over the University of Houston campus and the neighboring urban area on June 23, 2012 and February 16, 2017, respectively [181], [182]. Houston2013 has a size of 349 × 1905 pixels, 144 spectral bands, and 15 categories. Houston2018 has a size of 4172 × 1202 pixels, 48 spectral bands, and 20 categories. For consistency, 48 spectral bands of Houston2013 are selected, and seven common classes in the two images are considered for the DA task [108]. The Houston2013 and Houston2018 images are set as the source and target domains, respectively. The RGB composite images and ground-truth maps of the two images are shown in Fig. 7. The numbers of samples are shown in Table II. In the experiments, we compare some traditional shallow methods and recent deep DA methods, listed above and below the horizontal line in Table III, respectively. The 1-nearest neighbor (1-NN) classifier is chosen as the base classifier. The NA (no adaptation) baseline uses the 1-NN classifier built on the source domain to directly classify target samples. The GFK [13], SA [15], and KSA [25] are subspace-based adaptation methods. The TCA [38], [39], JDA [42], JGSA [43], LPJT [46], DADFL [48], DSAN [101], DeepCORAL [102], and TSTnet [108] are discrepancy-based methods. The DAAN [118], MCD [119], and DWL [120] are adversarial-based methods. For subspace-based DA algorithms, the dimensionality of the subspace is set to 20. The optimal learning rate lr of all deep learning algorithms is chosen from {0.0001, 0.001, 0.01, 0.1}, the batch size is set to 128 for all networks, and the number of training iterations is 100. For DAN, DAAN, MRAN, and DSAN, there is a regularization parameter λ whose value is chosen from {0.001, 0.01, 0.1}.
For all compared deep learning algorithms, to ensure the fairness of the comparison, the backbone network is ResNet18. The classification evaluation indicators are the overall accuracy (OA) and kappa coefficient (κ).
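Both evaluation indicators follow directly from the confusion matrix; for reference, a short sketch of OA and Cohen's kappa:

```python
import numpy as np

def oa_kappa(y_true, y_pred, n_classes):
    """Overall accuracy (OA) and kappa coefficient from predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # confusion matrix: rows true, cols predicted
    n = cm.sum()
    oa = np.trace(cm) / n                  # fraction of correctly classified samples
    pe = (cm.sum(0) @ cm.sum(1)) / n**2    # expected agreement by chance
    kappa = (oa - pe) / (1 - pe)           # chance-corrected agreement
    return oa, kappa
```

Kappa discounts the agreement expected from the class marginals alone, so it is more informative than OA when the class distribution is imbalanced, as it is for the seven shared Houston classes.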
From Table III, we can see that some traditional DA methods provide poor adaptation performance, and their OAs are even worse than that of NA. This demonstrates that not all DA methods can reduce the distribution discrepancy, and negative transfer is likely to occur when there is a significant spectral difference between domains. The JGSA, LPJT, and DADFL provide relatively better results than NA because these methods align the statistical and geometrical discrepancies between domains by learning two projections for the source and target domains, respectively, and by taking into account the local or global discriminative information of the domains. The Houston2013 and Houston2018 datasets have great spectral differences, so a shared subspace generated by a unified transformation may not exist. In addition, the discriminative information of the source and/or target domain can be used to improve the DA performance.
For deep DA methods, the adversarial methods, such as DAAN, MCD, and DWL, show better results than NA. Through adversarial learning, the abilities of the generator and discriminator are simultaneously improved: the feature generator is likely to produce target features that are highly similar to the source features, and then a task-specific classifier built on the source domain can be used to classify target samples. Among all methods, the recently proposed TSTnet produces the best results. In the feature extraction part of TSTnet, the GCN and CNN are used to extract convolutional and topological structure features. In the adaptation part, the optimal transport-based graph alignment and MMD-based distribution alignment work cooperatively. By exploiting the topological structure and semantic information of HSIs and considering both the distribution alignment and the topological relationship alignment, TSTnet generates excellent results.
The classification maps of different methods on the target domain are shown in Fig. 8. It can be seen that some methods, such as TCA, CORAL, and EasyTL, misclassify the class "Grass stressed" in green color as the class "Grass healthy" in purple color due to the high spectral similarity between these two classes. In addition, many methods misclassify the class "Nonresidential buildings" as the class "Road." Some methods iteratively update the transformation matrix and pseudo labels, so their running times are relatively long.

VIII. CONCLUSION
The early DA methods focus on instance reweighting, subspace adaptation, or transformation-based adaptation.
For RS image classification, due to the existence of large spectral drift between domains, it is usually necessary to simultaneously consider instance reweighting, subspace learning, and feature transformation. For traditional DA methods, we can incorporate landmark selection or feature weighting, target pseudo-label learning, and local discriminative structure preservation into a subspace-based transformation framework to improve the discriminative ability of DA models.
For deep DA methods, the feature extraction module can be further improved by considering the data characteristics of RS images. The adversarial learning strategy can be combined with discrepancy-based adaptation. In addition, target pseudo-label learning can be used in deep DA methods to iteratively update the network and improve the discriminative ability.
Currently, many existing RS DA methods focus on the general homogeneous unsupervised DA problem, where the source and target domains have feature spaces of the same or similar dimensionality and there are no labeled instances in the target domain. In real situations, the RS classification problem may be more complex: the feature spaces and class spaces of the source and target domains may be different. The classical DA problem can be extended to the following cases.
1) Heterogeneous DA [183]: The dimensionalities of the source and target domains are different, and the features of the two domains are disjoint. For example, due to differences in hyperspectral sensors, different HSIs usually have different spectral bands.
2) Multisource DA [184]: There are multiple source domains. The challenges lie in the unavailability of target labels and the complex composition of multiple source domains [185]. For long-term RS image series analysis, there may exist multiple historical labeled images as sources. Compared with a single source domain, the joint use of multiple sources is likely to improve the DA performance.
3) Open-set DA [186]: Only a few categories of interest are shared between the source and target data. That is, the class spaces of the source and target domains are different but intersect. Due to changes of ground materials and acquisition regions, the source and target domains usually have some different classes, especially for large-scale RS classification.
4) Partial DA [187]: It is assumed that the target label space is a subset of the source label space. For example, if we only focus on some special classes in the target domain, the rich information of the source domain can be used to perform partial DA.
5) Few-shot DA [188], [189]: The combination of DA with few-shot learning to use very few labeled target samples in training. When the source and target domains have great distribution differences and the number of classes is large, unsupervised DA will fail. In this case, the limited labeled target samples can play a great role in building a connection between source and target classes.
6) Domain generalization [190]: It aims to achieve out-of-distribution generalization by using only source data for model learning. There are many RS images obtained by different sensors or acquired under different conditions. It is possible to learn a model with high generalization ability from the available RS images and then apply the model to classify other images in real-time RS analysis.
The DA techniques can be used for large-scene and long-term RS image processing. The labeling process for a large scene is costly and time-consuming, and the DA technique can help to transfer labels from a small region to the whole scene.
For long-term image processing, historical images can be used to predict unseen images, and change analysis among RS images of different times can be performed.