Universal Domain Adaptation for Remote Sensing Image Scene Classification

The domain adaptation (DA) approaches available to date are usually not well suited for practical DA scenarios of remote sensing image classification since these methods (such as unsupervised DA) rely on rich prior knowledge about the relationship between label sets of source and target domains, and source data are often not accessible due to privacy or confidentiality issues. To this end, we propose a practical universal DA (UniDA) setting for remote sensing image scene classification that requires no prior knowledge on the label sets. Furthermore, a novel UniDA method without source data is proposed for cases when the source data are unavailable. The architecture of the model is divided into two parts: the source data generation stage and the model adaptation stage. The first stage estimates the conditional distribution of source data from the pretrained model using the knowledge of class separability in the source domain and then synthesizes the source data. With this synthetic source data in hand, it becomes a UniDA task to classify a target sample correctly if it belongs to any category in the source label set or mark it as “unknown” otherwise. In the second stage, a novel transferable weight that distinguishes the shared and private label sets in each domain promotes the adaptation in the automatically discovered shared label set and recognizes the “unknown” samples successfully. Empirical results show that the proposed model is effective and practical for remote sensing image scene classification, regardless of whether the source data are available or not. The code is available at https://github.com/zhu-xlab/UniDA.

Remote sensing image scene classification is a procedure for assigning semantic labels according to the content of remote sensing scenes [1], which is beneficial to traffic analysis, urban area monitoring and planning [2], [3], land-use and land-cover [4], and hazard detection and avoidance [5], among other applications. In recent years, many deep learning approaches have been proposed for scene classification of remote sensing images [6], [7], such as autoencoders [8], convolutional neural networks (CNNs) [9], generative adversarial networks (GANs) [10], prototype-based memory networks [11], and transformers [12]. These methods usually assume that the training and testing data share the same distribution. However, in a real application, due to the influence of sensors, geographic locations, imaging conditions, and other factors, the distributions of training and testing data may differ. This phenomenon is referred to as the domain gap [13]. To address the domain gap among different datasets, domain adaptation (DA) algorithms have been proposed. DA aims to leverage a source domain to learn a model that performs well on a different but related target domain [14].
In remote sensing scene classification, most existing DA approaches [15], [16] tackle the domain gap between different domains by learning a domain-invariant feature representation. Based on the knowledge of the relationship between the source and target label spaces (the category gap), DA can be divided into closed-set DA, partial DA, and open-set DA. Specifically, closed-set DA usually addresses the domain adaptation problem by leveraging the adversarial learning behavior of GANs to perform distribution alignment in the pixel, feature, and output spaces [5], [15], [17], [18]; it assumes a shared label set between the source and target domains, as shown in Fig. 1(a). To relax this assumption, two alternatives have been proposed: partial DA [19], in which the target label space is considered a subset of the source label space, as shown in Fig. 1(b), and open-set DA [20], in which the source label space is considered a subset of the target label space, as shown in Fig. 1(c). For example, an open-set domain adaptation algorithm [21] explores transferability and discriminability for remote sensing image scene classification. However, these DA methods face two major bottlenecks in the domain adaptation of remote sensing scene classification in the wild.
• In a general scenario, we cannot select the proper domain adaptation methods (closed-set DA, partial DA, or open-set DA) because no prior knowledge about the target domain label set is given.
• The source dataset is not available in many practical application scenarios of remote sensing. For example, many satellite companies and users will only provide pre-trained models instead of their source data due to data privacy and security issues. In addition, the source datasets, like high-resolution remote sensing images, may be so large that it is not practical or convenient to transfer them to, or retain them on, different platforms.
To address the first challenge, a novel scenario of universal domain adaptation (UniDA) is proposed. As shown in Fig. 1(d), UniDA removes all constraints and includes all the above adaptation settings [22]. For a given source label set and target label set, UniDA may involve a shared label set and a private label set in each domain. Two challenges are exposed in a UniDA setting. (1) If we naively match the entire source domain with the entire target domain, the mismatch of different label sets will deteriorate the model. Thus, the samples coming from the shared label set between the source and target domains should be automatically detected and matched. (2) The target samples from private label sets should be marked as "unknown" since there are no labeled training data for these classes. Currently, different transferability criteria (such as entropy in [22], the pseudo-margin vector in [23], and the mixture of entropy, confidence, and consistency in [24]) have been proposed in the field of computer vision to distinguish samples from shared label sets and those in private label sets.
To address the second challenge, source-free domain adaptation is under continuous exploration in computer vision [25]-[28]. For example, [29] proposes the universal source-free domain adaptation setting for natural image classification. However, existing UniDA methods [22]-[24], [29] in computer vision normally assume that the source dataset is available when building the classifier platform. This assumption is neither valid nor practical under the second challenge. Thus, developing a universal domain adaptation method without source data (Fig. 1(e)) has practical value and is desired in real application scenarios of remote sensing image classification.
In UniDA without source data, pre-trained models can be available. Pre-trained models not only serve as strong baselines for the original dataset, but also contain knowledge of the original dataset. Therefore, generating synthetic source domain data from the pre-trained model is the first problem to be solved. There are recent works on distilling a network's knowledge with a small dataset [30] or with no observable data [31]. It is worth noting that we cannot use generative adversarial networks to directly generate artificial data (similar to [32]), because the core of UniDA without source data is to restore the category distribution (including the shared label set and the private label set) from the pre-trained model.
Bearing these concerns in mind, we propose UniDA without source data in order to introduce the UniDA setting into remote sensing datasets. In this case, we merely have access to the pre-trained model from the source domain. We have no information about the source data distribution that was used to train it. UniDA without source data poses two major technical challenges for designing the corresponding models in the wild. (1) Distilling the knowledge of source data from the pre-trained model. The knowledge is consistent with the source in the category distribution (including the shared label set and the private label set).
To address these two challenges, our proposed UniDA without source data for remote sensing images consists of a source data generation (SDG) stage and a model adaptation (MA) stage. In the SDG stage, we reformulate the goal as estimating the conditional distribution rather than the distribution of the source data, since the size of the source data space grows exponentially with the dimensionality of the data. After the conditional distribution of the source data is obtained, a well-defined criterion can be used to distinguish different degrees of uncertainty in order to separate the target samples from the shared label set and those from the private label set. However, uncertainty is usually measured by entropy [22], [33], which lacks discriminability when the categorical distributions are relatively uniform [24]. Thus, a novel transferable weight is defined by considering confidence and domain similarity.
In a nutshell, our contributions are as follows:
• We introduce a more practical and challenging UniDA setting for remote sensing image scene classification.
• We propose a new UniDA model (SDG-MA), which is composed of a source data generation stage and a model adaptation stage.
• In order to generate reliable source domain samples, a novel conditional probability recovery method of the source domain is designed to distill category knowledge.
• A novel transferable weight is utilized to distinguish the shared label sets and the private label sets in each domain.
• Experimental results on four UniDA settings for remote sensing image scene classification demonstrate that the proposed model is effective and practical, regardless of whether the source domain is available or not.

II. RELATED WORK
Most existing DA settings for remote sensing image scene classification can be summarized as closed-set, partial, and open-set DA based on the label set relationship. Closed-set DA is a scenario where the source and target domains share the same label set. The main challenge in this scenario is to overcome the domain gap that arises because the samples are drawn from different distributions. Among the recent work on closed-set DA for remote sensing, adversarial learning frameworks have attracted significant interest because they improve the quality of alignment between distributions by adapting the representations of different domains. GANs are commonly applied to feature maps generated by CNNs, where a domain discriminator is trained to correctly classify the domain of each input feature. For example, domain-adversarial neural networks (DANN) [34], Siamese GAN [35], Attention GAN [10], and the domain adaptation via a task-specific classifier (DATSNET) framework [36] are presented for the classification of remote sensing images by learning an invariant representation. Recently, a multitude of closed-set DA algorithms for remote sensing image scene classification [37]-[43] have been designed to reduce the global or local distribution differences between domains. In addition, closed-set DA with multiple source domains [44] is proposed for remote sensing image classification. However, it is difficult to ensure that the source domain and the target domain have exactly the same classes. Thus, partial DA and open-set DA are proposed to relax this limitation. Partial DA handles the case where the target classes are a subset of the source classes. This task is solved by performing importance weighting on source examples that are similar to samples in the target domain [19], [45], [46]. Open-set DA is a more realistic setting, in which new classes appear in the target domain; that is, the target domain contains unknown classes that are not present in the source domain. In remote sensing image scene classification, an open-set DA algorithm via exploring transferability and discriminability (OSDA-ETD) [21] is proposed to reduce the distribution discrepancy of the same classes in different domains and enlarge the distribution discrepancy of different classes in different domains. In addition, some open-set DA networks based on adversarial learning [47]-[50] and graph convolutional networks [51], [52] are presented for remote sensing image scene classification. However, almost all these methods rely on prior knowledge about the relationship between the label sets of the source and target domains and assume the co-existence of source and target data. Thus, in order to promote the development of DA methods, we propose a general setting (UniDA) for remote sensing image scene classification.

III. METHODOLOGY
In this section, we elaborate on the problem of the UniDA setting without source data and address it with a novel dual-stage framework (SDG-MA), shown in Fig. 2.

A. Problem Setting
For the UniDA setting without source data (SDG-MA), we merely have access to the pre-trained model $M$, including the feature extractor $F$ and the classifier $C$. We have no information about the source data distribution $p(x)$ that is used to train $M$. Thus, considering the MA in the second stage, our first goal is to generate reliable source data $x_f$ from the pre-trained model $M$. The synthetic distribution is consistent with the source data distribution $p(x)$ in the category distribution (including the shared label set and the private label set), and is as close as possible to the target domain in style. However, it is impracticable to estimate $p(x)$ directly since the size of the source data space grows exponentially with the dimensionality of the data. Thus, as shown in the source data generation stage of Fig. 2, we generate the set by modeling a conditional probability of $x$ given two random vectors $y$ and $z$. Here, $y$ ($y \sim p_y(y)$) is a probability vector that represents a label, where $p_y(y)$ is an estimate of the true label distribution $p(y_s)$ of the source domain, and $z$ ($z \sim p_z(z)$) is a low-dimensional noise vector, where $p_z(z)$ is a random distribution describing the source data points. Thus, we reformulate the goal as estimating the conditional distribution of source data $p(x \mid y, z)$ instead of the distribution $p(x)$.
After obtaining the conditional distribution of source data $p(x \mid y, z)$ from the SDG stage, the problem becomes a UniDA task, but now with a synthetic source domain. Our second goal is to align the distributions of the synthetic source domain and the target domain while addressing both the domain gap and the category gap. The synthetic source domain $D_f$ consists of labeled samples drawn from $p(x \mid y, z)$, and the target domain consists of unlabeled samples drawn from the target distribution $q(x)$. We denote by $Y_f$ ($Y_t$) the label set of the synthetic source (target) domain.
The shared label set is denoted by $Y = Y_f \cap Y_t$, and the private label sets of the synthetic source and target domains by $\bar{Y}_f = Y_f \setminus Y$ and $\bar{Y}_t = Y_t \setminus Y$, respectively.
For the UniDA setting with source data (MA), the real source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ is available. Thus, only the MA stage is used to align the distributions of the real source domain and the target domain under the technical challenges of UniDA.
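The relationship between the label sets can be illustrated with plain set operations. The following is a minimal sketch; the function name `split_label_sets` and the toy class names are hypothetical and chosen only for illustration, not taken from the datasets used later.

```python
def split_label_sets(source_labels, target_labels):
    """Return (shared, source_private, target_private) label sets."""
    source, target = set(source_labels), set(target_labels)
    shared = source & target            # classes present in both domains
    return shared, source - shared, target - shared


# Hypothetical toy label sets, used only to illustrate the definitions.
source = ["farmland", "forest", "river", "grassland"]
target = ["farmland", "forest", "river", "airport", "beach"]

shared, source_private, target_private = split_label_sets(source, target)
print(sorted(shared))
print(sorted(source_private))
print(sorted(target_private))
```

In a UniDA task the sizes of these three sets are unknown in advance, so closed-set, partial, and open-set DA all appear as special cases in which one or both private sets are empty.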

B. Source Data Generation
Source data generation includes two modules: a conditional probability generation module and a data diversity module. First, the conditional probability generation module shows that the conditional distribution of source data $p(x \mid y, z)$ can be estimated by estimating the categorical likelihood $p(y \mid x)$ and the property likelihood $p(z \mid x)$.
Second, in order to generate a reliable source domain for UniDA, the generated data $x_f$ must meet two conditions: 1) in data content, all category distributions in the pre-trained model $M$ can be restored, including the source-shared and source-private category distributions, and 2) in data style, the generated data remain similar to the style distribution of the target domain. To meet these two conditions, a data diversity module is proposed to ensure the data diversity of the generated source domain. In addition, different schemes of data diversity generation are compared in Section IV-C.
1) Conditional Probability Generation Module: Recall that $y$ ($y \sim p_y(y)$) and $z$ ($z \sim p_z(z)$) are a probability vector of a source distribution and a low-dimensional noise vector, respectively. The variables $y$ and $z$ are conditionally independent of each other given the source data $x$, since they both depend on $x$ but have no direct interaction. In order to generate a reliable and balanced source domain $D_f$, the probability of each sampled point $x$ is $1/|D_f|$, and the probability at any other point is zero. Thus, $D_f = \{\arg\max_x p(x \mid y, z)\}$. Based on Bayesian theory [31], [53], $\arg\max_x p(x \mid y, z)$ can be expressed as follows:
$$\arg\max_x p(x \mid y, z) = \arg\max_x \big( \log p(y \mid x) + \log p(z \mid x) \big).$$
In this way, the distribution $p(x \mid y, z)$ can be estimated by estimating the categorical likelihood $p(y \mid x)$ of the variable $y$ given $x$ and the property likelihood $p(z \mid x)$ of the variable $z$ given $x$. Thus, as shown in the 'Source Data Generation' module in Fig. 2, a generator $G$ is designed to obtain the empirical distribution $p(x \mid y, z)$ by combining $y$ and $z$ randomly sampled from the distributions $p_y(y)$ and $p_z(z)$.
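The Bayesian step in this module can be spelled out as a short derivation. It is a sketch under two assumptions taken from the text: $y$ and $z$ are conditionally independent given $x$, and the prior $p(x)$ is uniform over the sampled points, so it does not change the maximizer. The denominator $p(y, z)$ does not depend on $x$ and is dropped as well.

```latex
\begin{align*}
\arg\max_x p(x \mid y, z)
  &= \arg\max_x \frac{p(y, z \mid x)\, p(x)}{p(y, z)} \\
  &= \arg\max_x p(y \mid x)\, p(z \mid x)\, p(x) \\
  &= \arg\max_x \big( \log p(y \mid x) + \log p(z \mid x) \big)
\end{align*}
```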
In our experiments, we set $p_y(y)$ to the random categorical distribution of the source domain, which produces one-hot vectors as $y$, and $p_z(z)$ to the multivariate Gaussian distribution, which produces standard normal vectors as $z$.
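Sampling the two generator inputs can be sketched as follows. This is an illustration only: the helper names are hypothetical, and drawing the class index uniformly is an assumption, since the text only states that a random categorical distribution is used.

```python
import random


def sample_label(num_classes, rng):
    """Draw a one-hot label vector y (class index chosen uniformly; assumed)."""
    index = rng.randrange(num_classes)
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def sample_noise(dim, rng):
    """Draw a noise vector z with independent standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)                     # fixed seed for reproducibility
y = sample_label(num_classes=7, rng=rng)   # e.g. seven source classes
z = sample_noise(dim=10, rng=rng)          # noise length 10, as in the experiments
print(y)
print(len(z))
```

The pair $(y, z)$ would then be passed to the generator $G$, which is not modeled here.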
2) Data Diversity Module: First, in order to recover the data content from the pre-trained model $M$, a classifier loss $\ell_{cls}$ is designed. Specifically, given a sampled class vector $y$ and a sampled noise vector $z$ as inputs, $G$ is trained to produce a synthetic source domain sample $x_f$ that $M$ is likely to classify as $y$, where $\bar{y} = M(x_f)$ denotes the prediction of $M$. The classifier loss forces the generated data to follow a class distribution similar to that of model $M$ by minimizing the distance between $y$ and $\bar{y}$, which can be formulated as follows:
$$\ell_{cls}(y, \bar{y}) = -\sum_{i=1}^{|Y_f|} y_i \log \bar{y}_i.$$
Notably, $y$ and $\bar{y}$ are not scalars but probability vectors of length $|Y_f|$. Thus, the cross-entropy between two probability distributions is utilized to measure the distance between $y$ and $\bar{y}$.
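As a small numerical illustration of the cross-entropy between a sampled label vector and a model prediction, consider the sketch below. The function name and the toy probability vectors are hypothetical, and the small constant `eps` is an added numerical safeguard rather than part of the loss described in the text.

```python
import math


def cross_entropy(y, y_bar, eps=1e-12):
    """Cross-entropy between a target vector y and a predicted vector y_bar."""
    return -sum(t * math.log(p + eps) for t, p in zip(y, y_bar))


y = [0.0, 1.0, 0.0]               # sampled one-hot label
confident = [0.05, 0.90, 0.05]    # prediction that agrees with y
unsure = [0.40, 0.30, 0.30]       # prediction that mostly disagrees with y

loss_confident = cross_entropy(y, confident)
loss_unsure = cross_entropy(y, unsure)
print(loss_confident < loss_unsure)
```

A generated sample that the pre-trained model assigns to the sampled class with high probability yields a small loss, which is the behavior the generator is trained toward.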
However, the classifier loss $\ell_{cls}$ easily leads to the generation of similar data points for each class in the synthetic source domain. Furthermore, for domain adaptation it is necessary to transfer synthetic source images to the target style. A style loss $\ell_{style}$ is therefore presented to measure the differences in style between a synthetic source image $x_f$ and a target image $x_t$. The style of remote sensing images covers colors, textures, edges, common patterns, and other image style descriptions. Concretely, we make use of a 16-layer VGG network pre-trained on ImageNet [54] to measure multi-scale feature style differences between images, which can be described as:
$$\ell_{style}(x_f, x_t) = \sum_{j} \left\| G_j^{\phi}(x_f) - G_j^{\phi}(x_t) \right\|_F^2,$$
where $\phi_j(x)$ is the activation at the $j$th layer of the style loss network and is a feature map of shape $C_j \times H_j \times W_j$. $G_j^{\phi}(x)$ denotes a Gram matrix that is equal to the average value of the product of the feature and the transposition of the feature. The Gram matrix captures the general style of the entire image. The style loss $\ell_{style}(x_f, x_t)$ is the squared Frobenius norm of the difference between the Gram matrices of the synthetic source image $x_f$ and the target image $x_t$. In addition, different layers have different feature styles in the VGG network. Therefore, we sum the Gram matrix differences over the four activation layers used in VGG-16.
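The Gram-matrix style loss can be sketched in plain Python on toy feature maps. This is an illustration under stated assumptions: each layer's activation is given as a list of $C$ channels flattened to $H \cdot W$ values, the Gram matrix is normalized by $C \cdot H \cdot W$ (one common choice; the exact normalization is not specified above), and the feature values are made up rather than produced by VGG-16.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels of H*W activations."""
    channels = len(features)
    positions = len(features[0])
    scale = 1.0 / (channels * positions)   # assumed normalization by C*H*W
    return [
        [scale * sum(a * b for a, b in zip(features[i], features[j]))
         for j in range(channels)]
        for i in range(channels)
    ]


def style_loss(layers_f, layers_t):
    """Sum over layers of the squared Frobenius norm of Gram differences."""
    total = 0.0
    for feat_f, feat_t in zip(layers_f, layers_t):
        gram_f, gram_t = gram_matrix(feat_f), gram_matrix(feat_t)
        total += sum(
            (gf - gt) ** 2
            for row_f, row_t in zip(gram_f, gram_t)
            for gf, gt in zip(row_f, row_t)
        )
    return total


# Toy single-layer features: 2 channels, 4 spatial positions each.
layer_a = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
layer_b = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]

same = style_loss([layer_a], [layer_a])
different = style_loss([layer_a], [layer_b])
print(same, different)
```

Because the Gram matrix sums over spatial positions, it discards where a pattern occurs and keeps only which channels co-activate, which is why it is used as a style descriptor.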

C. Model Adaptation
The objective of MA is to update the pre-trained model $M$ so that it distinguishes target samples in the shared label set $Y$ from those in the target private label set $\bar{Y}_t$. One important challenge for UniDA is detecting transferable samples. In order to address this challenge, the sample transferable weight $w_f(x_f)$ or $w_t(x_t)$ is utilized during the training stage to estimate the confidence that $x_f$ or $x_t$ is from the shared label set. Furthermore, during the testing stage, we compare the transferable weight with a decision threshold $w_0$ to decide whether we should predict a class or mark the sample as "Unknown," a designation that represents all labels unseen during training. This is expressed as:
$$y(x_t) = \begin{cases} \arg\max \bar{y}(x_t), & w_t(x_t) \ge w_0 \\ \text{unknown}, & \text{otherwise} \end{cases} \qquad (5)$$
1) The Transferable Weight: The transferable weight is derived from uncertainty and domain similarity. Similar to [22], [55], the domain similarity $d(x)$ is obtained by the non-adversarial domain discriminator $D'$. The term $d(x)$ quantifies the similarity of a sample to the synthetic source domain. In particular, a smaller $d(x_f)$ for a synthetic source sample and a larger $d(x_t)$ for a target sample mean that they are more likely to be in the shared label set.
On the other hand, we adopt the assumption that the target data in $Y$ have a lower uncertainty than the target data in $\bar{Y}_t$. Thus, in order to further separate target samples from the shared label set and those from the private label set, a well-defined criterion can be used to distinguish different degrees of uncertainty. However, uncertainty is usually measured by entropy [22], [33], which lacks discriminability when the categorical distributions are relatively uniform [24]. The confidence of the predicted probabilities $\bar{y}(x)$ is a better measure when the generated categories of source samples are relatively uniform. Looking at the confidence more closely, since the private label set of the synthetic source domain $\bar{Y}_f$ has no intersection with the shared label set $Y$, samples from $p(x_f, y_f \mid y_f \in \bar{Y}_f)$ are not influenced by the target data and keep the highest certainty. In addition, the target samples that are more similar to the source domain samples are more likely to be in the shared label set. Different schemes of the transferable weight are further compared and analyzed in Section IV-C.
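With the above analysis, it is reasonable to expect that target samples from the shared label set have higher confidence and higher domain similarity than target-private samples, whereas synthetic source samples from the shared label set have lower confidence and lower domain similarity than source-private samples. Thus, the sample-level transferable weights $w_f(x_f)$ and $w_t(x_t)$ for synthetic source data points and target data points are defined from the domain similarity $d(x)$ and the confidence $\max \bar{y}(x)$. Note that $d(x) \in [0, 1]$ and $\max \bar{y}(x) \in [0, 1]$ by max-min normalization. The weights are also normalized into the interval $[0, 1]$ during training.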
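To make the roles of confidence and domain similarity concrete, the following sketch combines them in a simple additive form with max-min normalization. The additive combination, the function names, and the toy values are assumptions made for illustration; they follow the monotonic relationships described in this subsection but are not the exact weight definition of the method.

```python
def min_max(values):
    """Max-min normalization of a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def target_weights(similarity, confidence):
    """Assumed form: larger for source-like, confidently classified targets."""
    return min_max([d + c for d, c in zip(similarity, confidence)])


def source_weights(similarity, confidence):
    """Assumed form: larger for target-like, less certain source samples."""
    return min_max([-(d + c) for d, c in zip(similarity, confidence)])


# Hypothetical per-sample values of d(x) and max y_bar(x) for three targets.
d_t = [0.9, 0.2, 0.6]
conf_t = [0.95, 0.40, 0.70]
w_t = target_weights(d_t, conf_t)
print(w_t)
```

With such weights, the second toy sample would be the first candidate for the "unknown" label, since it is both dissimilar to the source domain and classified with low confidence.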
2) Domain Adaptation: To perform domain adaptation during the training stage, the objective function aims to move the target samples with higher transferable weight towards the shared source categories $Y$. To achieve this, input $x$ from either domain is fed into the feature extractor $F$, as shown in Fig. 2. The extracted feature $F(x)$ is forwarded into the label classifier $C$ and the non-adversarial domain discriminator $D'$ to obtain the transferable weights $w_f$ and $w_t$. The extracted feature $F(x)$ is also forwarded into the adversarial domain discriminator $D$ to adversarially align the feature distributions of the generated source and target data falling in the shared label set, and an adversarial loss is defined on $D$ for this purpose. Adversarially, the feature extractor $F$ strives to confuse $D$. Thus, domain-invariant features in the shared label set are obtained. In order to train the classifier $C$ on the synthetic source domain with labels, the standard cross-entropy loss $L$ is used. Furthermore, to better reflect domain similarity, we predict samples from the synthetic source domain as 1 and samples from the target domain as 0. Thus, similar to [22], [55], a binary cross-entropy loss is used to train the non-adversarial domain discriminator $D'$.

Algorithm 1 (excerpt, SDG stage):
for each mini-batch do
    Generate source data $x_f$ by $G$, which combines categorical vectors $y$ ($y \sim p_y(y)$) and standard normal vectors $z$ ($z \sim p_z(z)$);
    Train $G$ by $\min_{\theta_g} \big( \ell_{cls}(y, M(x_f)) + \ell_{style}(x_f, x_t) \big)$;
end for

D. Optimization
Algorithm 1 depicts the optimization flow of the UniDA without source data procedure, which consists of two independent stages. $\theta_g$, $\theta_f$, $\theta_c$, $\theta_d$, and $\theta_{d'}$ are the parameters of $G$, $F$, $C$, $D$, and $D'$, respectively. First, the SDG stage estimates the conditional distribution $p(x \mid y, z)$ of source data from the pre-trained model $M$. Thus, we train the generator $G$ via the data diversity module and combine the two losses into a single objective function:
$$\min_{\theta_g} \; \ell_{cls}(y, M(x_f)) + \ell_{style}(x_f, x_t).$$
Second, the training of the MA stage can be written as a minimax game between the feature extractor $F$ and the adversarial domain discriminator $D$. The gradient reversal layer [56] is used to reverse the gradient between $F$ and $D$ to optimize the MA stage in an end-to-end training framework.

IV. EXPERIMENTS
A. Experimental Setup
1) Datasets: To verify our algorithm, we select the RSSCN7, UC Merced, AID, and NWPU-RESISC45 datasets to build the cross-domain remote sensing image scene datasets. Specifically, the RSSCN7 dataset [57] contains 2800 remote sensing scene images from seven typical scene categories. There are 400 images in each scene type, and each image has a size of 400×400 pixels. The UC Merced dataset [58] is widely used for remote sensing image scene classification. It consists of 2100 remote sensing images from 21 scene classes. Each scene class contains 100 RGB images with an image size of 256×256 pixels. The AID dataset [59] is a large-scale aerial image dataset acquired from Google Earth. It contains 10,000 images with a size of 600×600 pixels, which are divided into 30 classes. The NWPU-RESISC45 dataset [60] consists of 31,500 remote sensing images divided into 45 scene classes. Each class includes 700 images with a size of 256×256 pixels. The spatial resolution varies from about 30 m to 0.2 m for most of the scene classes.
As shown in Table I, four UniDA tasks for remote sensing scene classification are established. Specifically, the RSSCN7 dataset is suitable as the source domain because of its small number of categories. Thus, three cross-domain scenarios are conducted: RSSCN7 → UCM, RSSCN7 → AID, and RSSCN7 → NWPU. For RSSCN7 → UCM, we use the five public categories (farmland, forests, dense residential areas, rivers, and parking lot) as the shared label set, the remaining two RSSCN7 categories as the private source label set, and the remaining sixteen UC Merced categories as the private target label set. For RSSCN7 → AID and RSSCN7 → NWPU, we use the six public categories as the shared label set (the five previously enumerated plus industries). In addition, a fourth, more complex UniDA task with a higher Jaccard index, AID → NWPU, is carried out. In this setting, we use the twenty public categories as the shared label set, and the remaining categories of the AID and NWPU datasets as the private source and private target label sets, respectively. Some sample images of shared label sets from these four datasets are shown in Fig. 3.
2) Evaluation Protocol: The model is tested only on samples from the target domain; all the target-private classes are grouped into a single "Unknown" class. Specifically, during the testing stage of MA, if the target sample's transferable weight is lower than a predetermined threshold $w_0$, the input image is classified as "Unknown." The final result is the average of the per-class accuracy over all classes, including the shared classes and the "Unknown" class. Note that we run each experiment three times and report the average results.
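The evaluation protocol above can be sketched as a thresholded prediction followed by a mean per-class accuracy in which "unknown" counts as one class. The function names and the four toy samples are hypothetical and serve only to illustrate the computation.

```python
def predict(probabilities, weight, threshold):
    """Return the arg-max class index, or "unknown" if the weight is too low."""
    if weight < threshold:
        return "unknown"
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def mean_per_class_accuracy(true_labels, predictions):
    """Average of per-class accuracies, with "unknown" treated as one class."""
    correct, total = {}, {}
    for truth, pred in zip(true_labels, predictions):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy target samples: (class probabilities, transferable weight, true label).
samples = [
    ([0.8, 0.2], 0.90, 0),
    ([0.3, 0.7], 0.85, 1),
    ([0.6, 0.4], 0.50, "unknown"),
    ([0.55, 0.45], 0.90, "unknown"),   # high weight, so it is misclassified
]
predictions = [predict(p, w, threshold=0.8) for p, w, _ in samples]
truth = [label for _, _, label in samples]
accuracy = mean_per_class_accuracy(truth, predictions)
print(predictions, accuracy)
```

Averaging per class rather than per sample keeps the "Unknown" class from dominating the score when target-private images are numerous.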
3) Implementation Details: All experiments are implemented in PyTorch [61]. In the SDG stage, we use a standard normal vector $z$ of length 10 in all experiments. The generator $G$ is similar to that of ACGAN [62] and consists of two fully connected layers followed by seven transposed convolutional layers (each with four convolution kernels), with batch normalization after each layer. The size of the generated image $x_f$ is 3 × 256 × 256. Adam [63] with a learning rate of 0.001 is used for the generator. In addition, we compute the style reconstruction loss at layers relu1_2, relu2_2, relu3_3, and relu4_3 of the VGG-16 style loss network. The model pre-trained on source data consists of a feature extractor $F$ and a classifier network $C$. A ResNet-50 model with initial weights trained on ImageNet [64] is used as the backbone of the feature extractor. The classifier network is a fully connected network with a single layer. The cross-entropy loss is utilized to pre-train the model on source data. Stochastic gradient descent (SGD) with a learning rate of 0.001 and a momentum of 0.9 is used for the model pre-trained on source data. Furthermore, the classification accuracy between the predictions for the generated data $x_f$ and the given label $y$ is used to measure the recoverability of the categories in the pre-trained model. After the SDG stage, the generator $G$ with the highest classification accuracy is utilized for the MA stage.
In the MA stage, the model pre-trained on source data is used to initialize the feature extractor $F$ and the classifier network $C$. In addition, the discriminators $D$ and $D'$ consist of three fully connected layers with ReLU between the first two. We train $F$, $C$, $D$, and $D'$ for 40,000 iterations with Nesterov momentum SGD. The initial learning rate is set to 0.001 and is decayed using the same schedule as [56]. During the testing stage, when the Jaccard index $\xi \ge 0.2$ (AID → NWPU), the decision threshold is $w_0 = 0.6$. Otherwise, $w_0$ is set to 0.8.
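The learning-rate decay can be sketched as follows, assuming the annealing form $\mu_p = \mu_0 / (1 + \alpha p)^{\beta}$ with $\alpha = 10$ and $\beta = 0.75$ that is commonly associated with [56], where $p$ is the training progress in $[0, 1]$. These constants are an assumption here, since the text only refers to the schedule by citation; the function name is hypothetical.

```python
def annealed_lr(progress, initial_lr=0.001, alpha=10.0, beta=0.75):
    """Learning rate at a given training progress in [0, 1] (assumed constants)."""
    return initial_lr / (1.0 + alpha * progress) ** beta


total_iterations = 40000
checkpoints = (0, 10000, 20000, 40000)
schedule = [annealed_lr(step / total_iterations) for step in checkpoints]
for step, lr in zip(checkpoints, schedule):
    print(step, lr)
```

Under these assumed constants the rate starts at 0.001 and decreases monotonically, ending at roughly one sixth of its initial value after 40,000 iterations.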
4) Methods to Be Compared: We compare the performance of the proposed UniDA (with and without source data) with the following methods.
• Source-only: source-only is only trained on the real source data, and directly tested on the target domain based on the trained model and the target transferable weight.
• UDA [22]: UDA first introduced the universal DA setting in computer vision and is a method that uses source data. To discover the shared label set and the private label sets of each domain, the transferable weights are defined based on domain similarity and entropy.
• I-UAN [23]: an improved universal adaptation network (I-UAN) is a UniDA method with source data. In I-UAN, the transferable weight of the source domain is defined based on a pseudo-margin vector (maximum predicted probability minus second highest predicted probability) to distinguish the shared label set. The sample-wise transferable weight of the target domain is defined based on the confidence to distinguish the shared and private label sets in the target domain.
• CMU [24]: calibrated multiple uncertainties (CMU) is proposed, with a novel approach in which transferable weights are estimated by a mixture of complementary uncertainty quantities: entropy, confidence, and consistency.CMU is a UniDA method with real source data.
• MA-only: MA-only uses the initialized generator G to generate source data.The generator is initialized randomly.Then, MA is performed between the synthetic source data and the target data.

B. Experimental Results
1) Results on RSSCN7 → UCM: Our first experiment is conducted on RSSCN7 → UCM and includes two cases: UniDA with source data and UniDA without source data. The results are listed in Table II. In the UniDA setting with source data, the accuracy of all methods improves compared with source-only. This indicates a domain shift between the RSSCN7 and UCM datasets. Furthermore, in the UniDA with source data setting, our proposed MA achieves the best performance among all the compared methods, with an average accuracy of 75.11%. In particular, the average accuracy of shared label sets and all label sets improves by 0.4% and 0.73%, respectively, compared with the best baseline, CMU [24]. These findings suggest that, for remote sensing image scene classification, the proposed sample-level transferable weight in MA, which combines confidence and domain similarity, is more effective than the entropy in UDA [22], the pseudo-margin vector in I-UAN [23], and the mixture of entropy, confidence, and consistency in CMU [24].
In the second case of UniDA without source data, we observe that our proposed SDG-MA framework outperforms the MA-only method by 18.95% on the average accuracy of all label sets. This shows that the proposed source data generation is effective and practical in the UniDA setting without source data for remote sensing images. Notably, compared with MA, our SDG-MA performs better on the "Unknown" class. This suggests that the data points generated by the SDG effectively cover the distribution of the source data.
2) Results on RSSCN7 → AID: Our second experiment is conducted on RSSCN7 → AID; results are shown in Table III. A similar tendency is observed in the UniDA with source data setting for remote sensing images. The proposed MA outperforms all the compared methods. Again, the experiment suggests that the proposed sample-level transferable weight separates data coming from the shared and private label sets during feature alignment and provides a better criterion for "unknown" class detection than the existing methods.
In addition, the MA achieves the best accuracy outcomes compared with the baselines, for some shared categories (such as "Farmland," "Forests," and "Parking").
In the UniDA setting without source data, our proposed method has improved by 15.26% compared with the MA-only method.This phenomenon once again verifies the reliability of the proposed SDG.For identifying unknown classes, the SDG-MA yields excellent performance.However, the average accuracy of all categories is lower than the source-only method.The reason for this finding is that the difference between RSSCN7 and AID in the shared label sets is relatively smaller than other UniDA tasks.
3) Results on RSSCN7 → NWPU: Our third experiment is conducted on RSSCN7 → NWPU, and the results are provided in Table IV. In the UniDA setting with source data, our proposed source-based UniDA (MA) is 10.83 percentage points higher than the source-only method and achieves the highest average accuracy among all methods for recognizing all target samples, which verifies the effectiveness of the proposed MA.
In the UniDA setting without source data, our proposal improves by 42.06% compared with the MA-only method. Notably, SDG-MA performs particularly well on the "Unknown" category.
4) Results on AID → NWPU: The experimental results on AID → NWPU are reported in Table V. In the UniDA setting with source data, the proposed MA improves by 25.27% compared with the source-only method, which confirms the effectiveness and practicality of the proposed MA in a complex UniDA task with a higher Jaccard index. In addition, our proposed method achieves superior performance among all methods on most shared categories and the highest classification accuracy among all methods for recognizing private label sets in target samples (the "Unknown" category). However, I-UAN [23] is better than MA at identifying shared categories, because the accuracy of MA drops significantly on some shared categories, such as "Stadium."
In the UniDA setting without source data, the MA-only method achieves poor results (only 7.16%) due to the lack of reliable source domain data. Conversely, SDG-MA achieves superior performance (64.97%) because reliable source data are generated by SDG. Furthermore, SDG-MA achieves the highest accuracy for identifying the "Unknown" category, which further indicates that the proposed SDG module can generate a uniform distribution that approximates the real source domain.

C. Model Analysis
1) Feature Distribution Analysis: To fully understand the proposed UniDA with source data and UniDA without source data, we provide the feature distributions of RSSCN7 → UCM and AID → NWPU in Figs. 4 and 5, respectively. t-SNE [65] is used to visualize the learned source and target features with the corresponding domain labels and category labels. As shown in Fig. 4(a), before adaptation (source only), there are domain shifts between the real source domain (blue) and the target domain (red) according to the domain distribution. From the category labels, the distributions of the shared categories are fragmented and most target private samples lie close to the shared samples. After applying MA (Fig. 4(b)) and SDG-MA (Fig. 4(c)), the domain shifts are effectively alleviated. In addition, the separability between shared categories is increased and most target private samples are separated from the shared samples. These phenomena demonstrate that our proposed MA strategy is effective for feature alignment. Furthermore, comparing MA (Fig. 4(b)) and SDG-MA (Fig. 4(c)), we can observe that the synthetic source data distribution (green) and the real source data distribution (blue) show a high degree of consistency in class distribution and data diversity. This demonstrates that the synthetic source data generated by SDG are effective and reliable.
In Fig. 5(a), we can observe that the real source domain and the target domain have larger data shifts in the UniDA task AID → NWPU than in the task RSSCN7 → UCM. After applying MA (Fig. 5(b)), it is clear that MA alleviates the distribution discrepancy in domain labels. Again, this finding demonstrates that the proposed MA is practical and effective. After applying SDG-MA (Fig. 5(c)), intra-class compactness and inter-class separability are significantly improved compared with source only. In addition, the intra-class compactness of the "Unknown" category (Fig. 5(c)) is improved compared with MA (Fig. 5(b)). This phenomenon further verifies the validity of the generated source data.
2) Ablation Study of Model Adaptation: To verify the efficacy of the proposed sample-level transferability weight, we perform ablation studies that evaluate variants of SDG-MA, which are listed in Tables II, III, IV, and V. SDG-MA w/o d is the variant that does not integrate the domain similarity into the sample-level transferability weight. SDG-MA w/o y is the variant that does not integrate the confidence into the sample-level transferability weight. As shown in Tables II, III, IV, and V, SDG-MA outperforms SDG-MA w/o d and SDG-MA w/o y, which indicates that both the domain similarity and the confidence in the transferability weight are necessary and important for UniDA tasks.
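To make the two ablated variants concrete, the following sketch computes a sample-level weight from a confidence term and a domain-similarity term. The additive combination, the use of the maximum class probability as the confidence, and all names are assumptions made for illustration; this section does not give the exact formula.

```python
def transferability_weight(probs, domain_similarity,
                           use_confidence=True, use_domain=True):
    """Illustrative sample-level weight (assumed additive form).

    probs: class probabilities from the classifier for one sample.
    domain_similarity: a score in [0, 1] from a domain discriminator.
    """
    confidence = max(probs) if use_confidence else 0.0    # dropped in "w/o y"
    similarity = domain_similarity if use_domain else 0.0  # dropped in "w/o d"
    return confidence + similarity


probs = [0.7, 0.2, 0.1]
full = transferability_weight(probs, 0.6)                        # full weight
no_d = transferability_weight(probs, 0.6, use_domain=False)      # "w/o d"
no_y = transferability_weight(probs, 0.6, use_confidence=False)  # "w/o y"
```

Under this assumed form the weight lies in [0, 2], and removing either term leaves a weight that can no longer separate a confident but domain-dissimilar sample from an uncertain but domain-similar one.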
3) Decision Threshold Analysis: The hyperparameter $w_0$ is used to decide whether the model labels a sample as "Unknown" or uses the predicted label. We analyze the two cases ξ < 0.2 (RSSCN7 → UCM) and ξ > 0.2 (AID → NWPU), which are shown in Fig. 6(a) and Fig. 6(b), respectively.
In Fig. 6, "Target domain" denotes the average accuracy of all classes, which measures the generation ability of the source domain space and the domain adaptation ability of the model. "Target-unknown" is the target accuracy of the "Unknown" class, which is a crucial metric for evaluating the vulnerability and robustness of the model. Note that the results vary considerably as the threshold ranges from 0 to 2.0. When ξ is less than 0.2 (Fig. 6(a)), the average accuracy of the target domain remains high and stable between 0 and 0.8, and target-unknown rises significantly after 0.4. Thus, $w_0$ can be set to 0.8 for the case of ξ < 0.2. When ξ is greater than 0.2 (Fig. 6(b)), the average accuracy of the target domain remains high with little variance over a wide range between 0 and 1, whereas the accuracy of target-unknown increases significantly after the threshold exceeds 0.8. Thus, to ensure a good overall accuracy in the case of ξ > 0.2, $w_0$ can be set to 1.
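The role of the threshold can be sketched as a simple decision rule. The sketch assumes that a sample whose transferability weight falls below $w_0$ is labeled "Unknown", which matches the observation that target-unknown accuracy rises as the threshold grows; the function name, class names, and sample values are hypothetical.

```python
def label_target_sample(probs, weight, w0, class_names):
    """Return "Unknown" if the weight is below w0, else the arg-max class."""
    if weight < w0:
        return "Unknown"
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best]


classes = ["Forest", "Farmland", "Parking"]
# (class probabilities, transferability weight) for three toy target samples
samples = [
    ([0.5, 0.3, 0.2], 0.3),
    ([0.1, 0.8, 0.1], 0.9),
    ([0.2, 0.2, 0.6], 1.5),
]

labels = [label_target_sample(p, w, 0.8, classes) for p, w in samples]

# Sweep the threshold: a larger w0 sends more samples to "Unknown".
unknown_counts = [
    sum(label_target_sample(p, w, w0, classes) == "Unknown" for p, w in samples)
    for w0 in (0.0, 0.8, 2.0)
]
```

At the two extremes of the sweep, no sample or every sample is rejected, so a useful threshold has to sit where shared-class accuracy is still stable and unknown detection has already improved.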
4) Varying Size of Shared Label Sets: We explore the effect of the percentages of shared and private label sets on SDG-MA by varying the size of $Y$. This is done on AID → NWPU. Fig. 6(c) shows the accuracy of SDG-MA with different $|Y|$. When $|Y| = 0$, the source domain and the target domain have no overlap in label sets, i.e., $Y_f \cap Y_t = \emptyset$. In this case, SDG-MA classifies all categories as "Unknown". Furthermore, as $|Y|$ keeps increasing, the performance of SDG-MA remains stable and high. This demonstrates that SDG-MA is robust to different percentages of shared and private label sets.

5) Ablation Study of Source Data Generation: We examine the efficacy of the proposed SDG in more depth by performing an ablation study that evaluates the data diversity module. The results on RSSCN7 → UCM and AID → NWPU are shown in Tables VI and VII, respectively. Our proposed SDG-MA performs better than SDG-MA without classifier loss and SDG-MA without style loss, indicating that both the classifier loss and the style loss in the data diversity module are crucial and necessary for synthetic source data generation. More specifically, when the classifier loss and the style loss are not considered for SDG-MA, both category and overall accuracy are relatively poor. This phenomenon indicates that restoring the data content and ensuring data diversity are paramount to the generation of reliable data. In addition, SDG-MA without style loss outperforms SDG-MA without classifier loss, meaning that the classifier loss (recovering the data content) is even more crucial.

6) Comparison of Different Data Diversity Generation Schemes: In the SDG stage, data diversity is the key to successfully generating source data distributions. Recently, two mainstream approaches have been used to ensure the diversity of generated data. The first, GAN-based methods (such as 3C-GAN [32]), produce target-style training samples. Specifically, a discriminator is introduced to match the distributions of the target samples and the generated source samples through adversarial training. The second, a decoder loss in KEGNET [31], produces similar data points for each class and increases the pairwise distance between sampled data points. We compare the proposed style loss in the data diversity module with these two methods on RSSCN7 → UCM and AID → NWPU. The results are presented in Tables VI and VII, respectively. The generating ability of our proposed style loss is significantly better than that of 3C-GAN [32] and KEGNET [31] for source domain generation in UniDA without source data. In addition, our style loss maintains a higher and more uniform per-class accuracy, which suggests that data points generated with the style loss have better intra-class and inter-class diversity.

In the ablation study of the pre-trained model in SDG (Table VIII), it is worth noting that, for the model pre-trained on ImageNet [64], the feature extractor comes from the ResNet-50 pre-trained on ImageNet and the classifier comes from the ResNet-50 pre-trained on RSSCN7, in order to satisfy the conditions of the UniDA setting. The overall average of our proposed SDG-MA, which uses source data generated from the model pre-trained on RSSCN7 with initial weights trained on ImageNet, is clearly better than that of SDG-MA using the model pre-trained on ImageNet. Furthermore, by comparing the pre-trained model without initial weights trained on ImageNet and the pre-trained model with initial weights trained on ImageNet, we observe that the ImageNet initial weights have a relatively large impact on the proposed source domain generation module. Thus, we can conclude that both the pre-trained model based on ImageNet (natural images) and the pre-trained model based on RSSCN7 (remote sensing images) have an impact on the proposed source data generation. The pre-trained model based on ImageNet provides reasonable initial weights for the feature extractor, and the pre-trained model based on RSSCN7 provides an effective category distribution of remote sensing images for the proposed SDG.

V. CONCLUSIONS
We have introduced a novel universal domain adaptation setting for remote sensing image scene classification, including UniDA with source data (MA) and UniDA without source data (SDG-MA). UniDA removes all constraints on the relationship between the label sets of the source and target domains, which has high practical value and promotes the development of DA in remote sensing. To realize universal domain adaptation with or without source data, a dual-stage framework is proposed, consisting of a source data generation stage and a model adaptation stage. The purpose of the source data generation stage is to estimate the conditional distribution of the source data and generate reliable synthetic source images in terms of both data content and data style when the source data are not available. The model adaptation stage aims to distinguish samples in the target shared label sets from those in the target private label sets using the proposed transferable weight. This work can serve as a starting point for the challenging UniDA setting for remote sensing images. However, it is difficult to tune a single optimal threshold for the transferable weight in model adaptation that applies to all UniDA tasks of remote sensing images. Thus, in the future, we will focus on adaptively learning the threshold through the use of an open-set classifier.

Fig. 1. Different domain adaptation scenarios. (a) Closed-set DA, which assumes that the source domain and the target domain have shared label sets. (b) Partial DA, which assumes that target label sets are considered a subset of source label sets. (c) Open-set DA, which assumes that source label sets are considered a subset of target label sets. (d) Universal DA, which imposes no prior knowledge on the label sets. Label sets are divided into shared and private label sets in each domain. (e) Universal DA without source data. The source dataset is not available in the practical universal DA scenarios of remote sensing.
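The label-set relations in panels (a) to (d) can be expressed with ordinary set comparisons. The sketch below is only an illustration of the taxonomy: the function name and return strings are hypothetical, and in a real UniDA task the target label set is not known in advance, so this check could not be run before adaptation.

```python
def da_scenario(source_labels, target_labels):
    """Name the label-set relation between a source and a target domain."""
    src, tgt = set(source_labels), set(target_labels)
    if src == tgt:
        return "closed-set"   # (a) identical label sets
    if tgt < src:
        return "partial"      # (b) target labels are a proper subset of source
    if src < tgt:
        return "open-set"     # (c) source labels are a proper subset of target
    # (d) neither contains the other: both domains hold private classes
    return "private classes in both domains"


closed = da_scenario({"Forest", "River"}, {"River", "Forest"})
partial = da_scenario({"Forest", "River", "Field"}, {"Forest"})
open_set = da_scenario({"Forest"}, {"Forest", "Airport"})
both = da_scenario({"Forest", "Field"}, {"Forest", "Airport"})
```

A universal DA method has to cope with all four outcomes without being told which one applies.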

Fig. 2. Overview of the proposed UniDA without source data (SDG-MA).The model consists of a source data generation stage and a model adaptation stage.

Algorithm 1: Optimization of UniDA without source data
Require: Pre-trained model $M$ on the source domain, unlabeled data $X_t$ in the target domain, batch size $B$;
Ensure: Classification model $M$ of the shared classes $Y$ and the unknown class in the target domain;
I. Source data generation stage:
1: for $epoch_{SDG} = 1$ to $epoch_{SDG,max}$ do
2:   Fix the pre-trained model $M$ and the style loss network (VGG-16);
     Randomly sample $x_t$ of size $B$ from $X_t$;
3: end for
II. Model adaptation stage:
8: if starting adaptation then
9:   for $epoch_{MA} = 1$ to $epoch_{MA,max}$ do
10:    Randomly sample $x_t$ of size $B$ from $X_t$ and generate $x_f$ of size $B$ by $G$;
11:    for each mini-batch do
12:      $w_f$ and $w_t$ are obtained by $C(F(x))$ and $D(F(x))$;

Fig. 3. Sample images of five shared categories from the four datasets. From the top row to the bottom row: the RSSCN7, UC Merced, AID, and NWPU-RESISC45 datasets.

Fig. 4. Feature visualization on RSSCN7 → UC Merced. For the domain labels, blue and green represent the real source domain and the synthetic source domain, respectively, and red refers to the target domain. For the category labels, yellow points are "unknown" samples and the others are "known" samples.

Fig. 5. Feature visualization on AID → NWPU. For the domain labels, blue and green represent the real source domain and the synthetic source domain, respectively, and red refers to the target domain. For the category labels, yellow points are "unknown" samples and the others are "known" samples.

7) Ablation Study of the Pre-trained Model in SDG: An ablation study of the pre-trained model is conducted to investigate the effect of pre-trained models with different datasets and different initializations on the source data generation. The results on RSSCN7 → UC Merced are presented in Table VIII.

The private label sets of the source and target domains are represented by $\bar{Y}_f = Y_f \setminus Y$ and $\bar{Y}_t = Y_t \setminus Y$, respectively. The Jaccard index of the label sets of the two domains, $\xi = \frac{|Y|}{|Y_f \cup Y_t|}$, is used to measure the overlap in classes.
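As a worked illustration of these definitions, the following sketch computes the shared set, the two private sets, and the Jaccard index ξ from two label sets. The function name and the example class names are hypothetical and do not reproduce the label splits used in the experiments.

```python
def label_set_overlap(source_labels, target_labels):
    """Split two label sets into shared/private parts and compute xi."""
    src, tgt = set(source_labels), set(target_labels)
    shared = src & tgt                 # Y = Y_f ∩ Y_t
    union = src | tgt                  # Y_f ∪ Y_t
    xi = len(shared) / len(union) if union else 0.0
    return {
        "shared": shared,
        "source_private": src - shared,   # Y_f \ Y
        "target_private": tgt - shared,   # Y_t \ Y
        "jaccard": xi,
    }


# Hypothetical label sets: 3 shared classes, 1 source-private, 3 target-private.
source = {"Forest", "Farmland", "Parking", "Grass"}
target = {"Forest", "Farmland", "Parking", "Airport", "Beach", "Harbor"}
overlap = label_set_overlap(source, target)
# Here xi = 3 / 7, which falls in the xi > 0.2 regime discussed in the text.
```

A small ξ means that most classes are private to one domain, so more target samples must be rejected as "Unknown".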

TABLE II
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON RSSCN7 → UC MERCED (%).

TABLE III
CLASSIFICATION ACCURACY OF DIFFERENT METHODS ON RSSCN7 → AID (%).

TABLE VI
ANALYSIS OF SOURCE DATA GENERATION ON RSSCN7 → UC MERCED.