Semi-Supervised Heterogeneous Domain Adaptation: Theory and Algorithms

Semi-supervised heterogeneous domain adaptation (SsHeDA) aims to train a classifier for the target domain, in which only unlabeled data and a small number of labeled data are available, by leveraging knowledge acquired from a heterogeneous source domain. Several algorithmic methods have been proposed to solve the SsHeDA problem, yet there is still no theoretical foundation that explains the nature of the problem or guides new and better solutions. Motivated by the compatibility condition in semi-supervised probably approximately correct (PAC) theory, we explain the SsHeDA problem by proving its generalization error: that is, why labeled heterogeneous source data and unlabeled target data help to reduce the target risk. Guided by our theory, we devise two algorithms as proof of concept. One, kernel heterogeneous domain alignment (KHDA), is a kernel-based algorithm; the other, joint mean embedding alignment (JMEA), is a neural network-based algorithm. When a dataset is small, KHDA's training time is less than JMEA's; when a dataset is large, JMEA is more accurate in the target domain. Comprehensive experiments on image/text classification tasks show KHDA to be the most accurate among all non-neural network baselines, and JMEA to be the most accurate among all baselines.


INTRODUCTION
Traditional supervised learning theories [1] are based on two assumptions: 1) the training and test data are from the same distribution [2], [3]; and 2) sufficient labeled training data are available [4], [5]. To relax these assumptions, researchers have studied the domain adaptation (DA) problem [6], [7], [8], [9]. In DA, there are two different domains: the source domain, which contains sufficient labeled data (training data), and the target domain, which contains only unlabeled data or a few labeled data (test data). Current DA learning theories [10], [11], [12], [13], [14], [15] show that, when the source and target domains are from the same feature space (i.e., homogeneous DA (HoDA)), DA can be solved under proper assumptions [16].
In reality, however, it is not easy to find a source domain with the same feature space as the target domain of interest [17], [18], [19], [20], [21], [22]; specifically, the source and target domains might be from different feature spaces. To tackle this issue, researchers have proposed a challenging problem: semi-supervised heterogeneous DA (SsHeDA) [23], [24], where the source and target domains have different feature spaces, and only unlabeled data and a few labeled data are available in the target domain. Though many practical SsHeDA algorithms have been proposed [25], [26], [27], [28], very little theoretical groundwork has been undertaken to reveal the nature of the SsHeDA problem or why the current solutions work as they do [14].
One of our main purposes in this paper is to develop an SsHeDA theory that explains why labeled source and unlabeled target data can help to reduce the need for labeled target data. We first discuss whether we can simply extend semi-supervised HoDA (SsHoDA) theory to the heterogeneous situation by introducing feature transformations to adapt the heterogeneous source and target domains. Existing SsHoDA theory [13], [29] is based on the theoretical analysis of a weighted sum of source and target risks (weighted risk). Researchers have provided a uniform bound on the target risk of a classifier trained to minimize the weighted risk. This shows that the need for labeled target data can be lowered by reducing the weight of the target risk. However, an obstacle appears in the heterogeneous situation: the combined risk [13], [29], a constant term in the SsHoDA uniform bound, becomes a function of the feature transformations. This means that the target risk estimation might be impracticable without sufficient labeled target data.
Motivated by the compatibility condition introduced by semi-supervised probably approximately correct (PAC) theory [30], we devise a novel theory for SsHeDA from a perspective quite different from previous domain adaptation theories [10], [11], [12], [13], [14], [31]. Our strategy is to explain why the SsHeDA problem can be addressed by proving a novel generalization error for SsHeDA. By reducing the size of the target feature transformation space, the generalization error illustrates how the labeled source and unlabeled target data, together with a suitable compatibility condition, can reduce the need for labeled target data.
Guided by our SsHeDA theory, we devise two SsHeDA algorithms to bring the proposed theory to reality. Kernel heterogeneous domain alignment (KHDA) is a kernel method designed for small datasets. Joint mean embedding alignment (JMEA) is a network method designed for large-scale data. Both algorithms maintain two main branches: the first transfers knowledge from the source domain to the target domain, and the second transfers knowledge from the labeled target data to the unlabeled target data.
Both algorithms are shown to perform well in a set of experiments comprising seven representative SsHeDA baselines, 30 text classification tasks, and 74 image classification tasks. Extensive experiments demonstrate that KHDA achieves competitive performance compared with non-neural network baselines, and that JMEA achieves better performance than all of the baselines. Our contributions are summarized as follows:
1) We introduce the concepts of compatibility, transfer error rate, and uniform sample complexity as new tools for estimating the need for labeled target data. Co-opting these concepts gives a thoroughly new perspective for theoretically analyzing domain adaptation problems.
2) We propose a generalization error estimation for the target risk in SsHeDA. This is the first work on SsHeDA to explain why combining labeled source data with unlabeled target data can reduce the need for labeled target data.
3) We develop two SsHeDA algorithms based on our theoretical work: KHDA and JMEA. KHDA is a kernel-based algorithm that takes less training time than JMEA when datasets are small. JMEA is a neural network algorithm that is more flexible and suitable for handling massive data.
This paper is organized as follows. Section 2 reviews the current literature on SsHeDA; Section 3 introduces the problem setting and important notations; Section 4 sets out our fundamental theory of SsHeDA; Section 5 details how to design algorithms based on our theory; Sections 6 and 7 describe the KHDA and JMEA algorithms, respectively; Section 8 details our experiments; and Section 9 concludes the paper and outlines future work.

RELATED WORK
Here we briefly discuss the domain adaptation theories and representative SsHeDA algorithms.

Domain Adaptation Theory
Pioneering theoretical work was proposed by Ben-David et al. [10], which shows that the target risk is upper bounded by three terms: source risk, marginal distribution discrepancy, and combined risk. This learning bound has been extended from many perspectives, such as considering different loss functions [32], different distribution distances [33], [34], [35] or the PAC-Bayes framework [36], [37]. According to the survey [14], most works focus on proving tighter bounds by constructing a new distribution distance.
For example, Zhang et al. [15] recently developed a new distribution distance termed margin disparity discrepancy.
Almost all the aforementioned works focus on the homogeneous and unsupervised situation. Only Blitzer et al. [13], Ben-David et al. [29] and Zhou et al. [31] investigated the semi-supervised situation. Blitzer et al. [13] and Ben-David et al. [29] mainly focused on the homogeneous situation. These works are based on the weighted sum of the source and target risks and show that a decrease in the target weight results in a reduced need for labeled target data. Zhou et al. [31] discussed the heterogeneous situation; however, their theoretical work is designed specially for their algorithm SHFR and is difficult to extend to more general situations.

SsHeDA Algorithms
The mainstream strategy to address SsHeDA is to align the source and target domains by constructing heterogeneous feature transformations [38]. Representative SsHeDA algorithms can be roughly separated into four main types: geometric/statistical alignment, instance reweighting, pseudo label strategy and feature augmentation.
Geometric or Statistical Alignment. Domain adaptation with manifold alignment (DAMA) [26] and domain adaptation by covariance matching (DACoM) [39] utilize the manifold alignment technique [40] and covariance alignment, respectively. DAMA learns source and target linear feature transformations to ensure that the geometric structures of the transformed domains are consistent. DACoM learns kernel/linear transformations to ensure that the transformed domains are matched with higher-order moments.
Instance Reweighting. Cross domain landmarks selection (CDLS) [41] does not regard data as being of equal importance during the domain matching process. CDLS learns two linear feature transformations and estimates the weights for the source and target data at the same time. To estimate the discrepancy between the transformed domains, CDLS utilizes the maximum mean discrepancy (MMD) [42].
Pseudo Label Strategy. Generalized joint distribution adaptation (G-JDA) [43] and soft transfer network (STN) [44] both learn feature transformations to project the source and target data into a latent space, where the marginal and class-conditional distributions are matched. Lastly, the pseudo-label iteration technique is used to update the target labels.
Feature Augmentation. Semi-supervised heterogeneous feature augmentation (SHFA) [23] utilizes augmented feature transformations, which are special linear projections mapping the source and target data to a higher-dimensional space. By incorporating the original features into the augmented features, SHFA enhances the similarities between domains.
In addition to these four strategies, others have also been explored. Transfer neural trees (TNT) [24] effectively adapts domains by using the decision forest technique. Semi-supervised entropic Gromov-Wasserstein discrepancy (SGW), proposed in [45], relies on optimal transport theory.

Comparison With Existing Studies
Theoretical Perspective. There is only one prior SsHeDA theoretical work [31]. This work limits the loss function to the hinge loss and the feature transformation to a linear mapping, whereas our work eases the restrictions on both the loss functions and the feature transformations. Hence, our theory can be used in more general situations. Besides, the generalization error proposed in [31] does not explain why labeled source and unlabeled target data can help reduce the need for labeled target data; providing such an explanation is our main purpose.
Algorithmic and Experimental Perspectives. Compared with the SsHeDA algorithms mentioned in Section 2.2, our algorithms KHDA and JMEA are theory-guided. This ensures that our algorithms have good generalization ability under proper conditions. Additionally, we construct two new datasets and introduce a new dataset for validating the effectiveness of SsHeDA algorithms. These new datasets are challenging and practical, and they increase the diversity of benchmark datasets in the field of SsHeDA.

PROBLEM SETTING AND CONCEPTS
This section presents the problem setting and related concepts. The main notations are summarized in Table 1 of Appendix I, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2022.3146234.

Problem Setting
Let $\mathcal{X}_s$ and $\mathcal{X}_t$ be the source and target feature spaces, and let $\mathcal{Y} = \{1, \dots, K\}$ be a label space. The source domain and target domain are two different joint distributions $P_{X_sY_s}$ and $P_{X_tY_t}$, where $X_s \in \mathcal{X}_s$, $X_t \in \mathcal{X}_t$ and $Y_s, Y_t \in \mathcal{Y}$ are random variables. Then, the SsHeDA problem is defined as follows:

Problem 1 (SsHeDA). Given the labeled source data $S = \{(\mathbf{x}_s^i, y_s^i)\}_{i=1}^{n_s} \overset{\mathrm{i.i.d.}}{\sim} P_{X_sY_s}$, the labeled target data $T_l = \{(\mathbf{x}_l^i, y_l^i)\}_{i=1}^{n_l} \overset{\mathrm{i.i.d.}}{\sim} P_{X_tY_t}$ and the unlabeled target data $T_u = \{\mathbf{x}_u^i\}_{i=1}^{n_u} \overset{\mathrm{i.i.d.}}{\sim} P_{X_t}$, where $n_l \ll n_u$ and $n_l \ll n_s$, the aim of semi-supervised heterogeneous domain adaptation is to train a classifier $g: \mathcal{X}_t \to \mathcal{Y}$ such that $g$ can classify the unlabeled target data by using $S$, $T_l$ and $T_u$.

Concepts
Data. $\mathbf{x}_s^i$, $\mathbf{x}_l^i$ and $\mathbf{x}_u^i$ represent the $i$th labeled source, labeled target, and unlabeled target data, respectively. We use $\{\mathbf{x}_t^i\}_{i=1}^{n_l + n_u}$ to denote the union of labeled and unlabeled target data: $\mathbf{x}_t^i = \mathbf{x}_l^i$ if $i \le n_l$, and $\mathbf{x}_t^i = \mathbf{x}_u^{i - n_l}$ if $i > n_l$. We also use $\{\mathbf{x}_{st}^i\}_{i=1}^{n_s + n_l + n_u}$ to denote the union of source and target data: $\mathbf{x}_{st}^i = \mathbf{x}_s^i$ if $i \le n_s$, and $\mathbf{x}_{st}^i = \mathbf{x}_t^{i - n_s}$ if $i > n_s$.
Data Matrices. For simplicity, we set $X_s = [\mathbf{x}_s^1, \dots, \mathbf{x}_s^{n_s}]$, $X_l = [\mathbf{x}_l^1, \dots, \mathbf{x}_l^{n_l}]$ and $X_u = [\mathbf{x}_u^1, \dots, \mathbf{x}_u^{n_u}]$. Let $X_s^c \in \mathbb{R}^{d_s \times n_s^c}$ and $X_l^c \in \mathbb{R}^{d_t \times n_l^c}$ be the sub-matrices of $X_s$ and $X_l$ whose column vectors are data with label $c$. Let $X_t \in \mathbb{R}^{d_t \times n_t}$ be $[X_l, X_u]$. Here, $n_s^c$ and $n_l^c$ are the numbers of labeled source and target data with label $c$, and $n_t = n_l + n_u$. Given any data matrix $X$, $\mathbf{x} \in X$ means that $\mathbf{x}$ is a column vector of $X$.
Empirical Distributions and Feature Transformations. We denote by $\widehat{P}_X = \frac{1}{n} \sum_{i=1}^{n} \delta_{\mathbf{x}^i}$ the empirical distribution over any data matrix $X = [\mathbf{x}^1, \dots, \mathbf{x}^n]$, where $\delta_{\mathbf{x}^i}$ is the Dirac measure at $\mathbf{x}^i$. For example, $\widehat{P}_{X_s}$ is the empirical distribution corresponding to $X_s$. Given a latent space $\mathcal{X}$, we denote by $\mathcal{F}_s$ and $\mathcal{F}_t$ the source and target transformation spaces, whose elements map $\mathcal{X}_s$ and $\mathcal{X}_t$, respectively, into $\mathcal{X}$. Given a transformation $T$, we define the transformed data matrix as $T(X) = [T(\mathbf{x}^1), \dots, T(\mathbf{x}^n)]$.
Hypothesis Space and Risks. In this paper, we consider a multi-class classification task with a hypothesis space $\mathcal{H}$ consisting of scoring functions (hypothesis functions) $\mathbf{x} \mapsto [h_1(\mathbf{x}), \dots, h_K(\mathbf{x})]$, where $h_c(\mathbf{x})$ $(c = 1, \dots, K)$ indicates the confidence in the prediction of label $c$. Given a symmetric loss function $\ell: \mathbb{R}^K \times \mathbb{R}^K \to \mathbb{R}_{\ge 0}$, the risks of $\mathbf{h} \in \mathcal{H}$ w.r.t. $\ell$ under $P_{T_s(X_s)Y_s}$ and $P_{T_t(X_t)Y_t}$ are
$$R_s(\mathbf{h} \circ T_s) = \mathbb{E}\,\ell\big(\mathbf{h} \circ T_s(X_s), f(Y_s)\big), \qquad R_t(\mathbf{h} \circ T_t) = \mathbb{E}\,\ell\big(\mathbf{h} \circ T_t(X_t), f(Y_t)\big),$$
where $f$ maps a label to the corresponding one-hot vector. It is convenient to use the notations $\widehat{R}_s(\mathbf{h} \circ T_s)$ and $\widehat{R}_t(\mathbf{h} \circ T_t)$ for the empirical risks corresponding to $R_s(\mathbf{h} \circ T_s)$ and $R_t(\mathbf{h} \circ T_t)$, respectively.
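As a concrete illustration of the empirical risk, here is a minimal numpy sketch (the squared loss, used later in KHDA, is chosen here for concreteness; the scoring function is assumed to have been applied already):

```python
import numpy as np

def one_hot(y, K):
    """The map f in the text: send a label c in {1, ..., K} to its one-hot vector."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), np.asarray(y) - 1] = 1.0
    return Y

def empirical_risk(scores, y, K):
    """Empirical risk of a scoring function: mean squared loss between the
    score vectors h(T(x)) (rows of `scores`) and the one-hot labels f(y)."""
    return np.mean(np.sum((scores - one_hot(y, K)) ** 2, axis=1))
```

A scorer that outputs exactly the one-hot labels attains empirical risk 0, matching the intuition that the risk measures average disagreement with the labels.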
Domain Distance. To estimate the discrepancy between domains, we use the following well-known measurements.
Definition 1 (Disparity Distance [15]). Let the hypothesis space $\mathcal{H}$ be a set of functions defined on a feature space $\mathcal{X}$, let $\ell$ be a loss function, let $P_1, P_2$ be distributions on $\mathcal{X}$, and let $\mathbf{h}$ be any element of $\mathcal{H}$. The disparity distance $d^{\ell}_{\mathbf{h}, \mathcal{H}}(P_1, P_2)$ between the distributions $P_1$ and $P_2$ over $\mathcal{X}$ is
$$d^{\ell}_{\mathbf{h}, \mathcal{H}}(P_1, P_2) = \sup_{\mathbf{h}' \in \mathcal{H}} \big| \mathbb{E}_{P_1} \ell(\mathbf{h}', \mathbf{h}) - \mathbb{E}_{P_2} \ell(\mathbf{h}', \mathbf{h}) \big|.$$

Definition 2 (Maximum Mean Discrepancy [42]). Given a function class $\mathcal{F} \subset \{f: \mathcal{X} \to \mathbb{R}\}$, the MMD between distributions $P_1$ and $P_2$ is
$$D_{\mathcal{F}}(P_1, P_2) = \sup_{f \in \mathcal{F}} \big| \mathbb{E}_{P_1} f(X) - \mathbb{E}_{P_2} f(X) \big|.$$
Gretton et al. [42] propose the unit ball in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ [34] (the subscript $k$ represents the reproducing kernel) as the MMD function class $\mathcal{F}$.
Though the MMD distance is powerful with well-chosen kernels [46], [47], it is not convenient to optimize as a regularization term in shallow domain adaptation algorithms. The projected MMD [9], [34], [48] has been proposed to transform the MMD distance into a proper regularization term. Given a scoring function $\mathbf{h} = [h_1, \dots, h_K] \in \mathcal{H}$, if $h_c \in \mathcal{H}_k$, $c = 1, \dots, K$, the projected MMD is defined as follows: let $\|\cdot\|_2$ be the $\ell_2$ norm; then
$$D_{\mathbf{h}}(P_1, P_2) = \big\| \mathbb{E}_{P_1} \mathbf{h}(X) - \mathbb{E}_{P_2} \mathbf{h}(X) \big\|_2. \qquad (2)$$
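To make the two distances concrete, here is a small numpy sketch (an illustrative estimator under our own simplifying choices, with an RBF kernel and the biased V-statistic estimate; it is not the paper's implementation):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd2(X1, X2, sigma=1.0):
    """Biased (V-statistic) estimate of the squared kernel MMD, i.e., the
    RKHS distance between the kernel mean embeddings of the two samples."""
    return (rbf_kernel(X1, X1, sigma).mean()
            + rbf_kernel(X2, X2, sigma).mean()
            - 2.0 * rbf_kernel(X1, X2, sigma).mean())

def projected_mmd(h, X1, X2):
    """Projected MMD: the l2 distance between the mean score vectors h(X),
    with the scorer h applied row-wise to each sample."""
    return np.linalg.norm(h(X1).mean(axis=0) - h(X2).mean(axis=0), 2)
```

Identical samples give zero under both quantities, and the projected MMD is a single vector norm, which is why it is far easier to use as a regularizer than the kernel supremum.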

THEORETICAL FOUNDATION FOR SSHEDA
This section presents two novel concepts, then reviews the existing SsHoDA theory and discusses the main obstacle to extending the SsHoDA theory to the heterogeneous situation. Lastly, we introduce our main theoretical results. To better convey the structure of our theorems and their connections with our algorithms, Fig. 1 illustrates the organization of the remainder of the paper.

Uniform Sample Complexity
Let $c$ be a function from $\mathcal{H} \times \mathcal{F}_s \times \mathcal{F}_t \times \mathcal{P}_s \times \mathcal{P}_t \times \mathcal{P}_t(X)$ to $\mathbb{R}$, where $\mathcal{P}_s$, $\mathcal{P}_t$, $\mathcal{P}_t(X)$ are probability spaces over $\mathcal{X}_s \times \mathcal{Y}$, $\mathcal{X}_t \times \mathcal{Y}$ and $\mathcal{X}_t$, respectively. Given any distributions $P_{X_sY_s} \in \mathcal{P}_s$, $P_{X_lY_l} \in \mathcal{P}_t$ and $P_{X_t} \in \mathcal{P}_t(X)$ (in SsHeDA, $P_{X_lY_l} = P_{X_tY_t}$), suppose random data $S$ of size $n_s$, $T_l$ of size $n_l$ and $T_u$ of size $n_u$ are drawn i.i.d. from $P_{X_sY_s}$, $P_{X_lY_l}$ and $P_{X_t}$, respectively. For any $0 < \delta < 1$ and $\epsilon > 0$, we denote by $m^c_s(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$, $m^c_l(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$ and $m^c_u(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$ the smallest numbers of data (detailed definitions can be seen in Appendix III-A, available in the online supplemental material) such that, if $n_s \ge m^c_s$, $n_l \ge m^c_l$ and $n_u \ge m^c_u$, then with a probability of at least $1 - \delta > 0$, for any $\mathbf{h} \in \mathcal{H}$, $T_s \in \mathcal{F}_s$ and $T_t \in \mathcal{F}_t$, we have $|c - \widehat{c}| < \epsilon$, where $c = c(\mathbf{h}, T_s, T_t, P_{X_sY_s}, P_{X_lY_l}, P_{X_t})$ and $\widehat{c}$ is its empirical counterpart computed from $S$, $T_l$ and $T_u$.
The above inequality shows that we can reduce the smallest sample number $m^c_l$ for the labeled target data by decreasing the sizes of the hypothesis space and the target transformation space. Next, we provide an example to aid understanding of the uniform sample complexity.
Example 1. Let $c(\mathbf{h}, T_s, T_t, P_{X_sY_s}, P_{X_lY_l}, P_{X_t})$ be the quantity in Eq. (4), whose uniform sample complexities can be bounded in terms of a uniform constant $C$ depending on $K$ and the loss $\ell$. Example 1 shows that, if the hypothesis space and transformation spaces satisfy appropriate conditions, Eq. (4) can be estimated from finite samples. To achieve a more accurate estimation of Eq. (4), more labeled source data and unlabeled target data are required.
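For intuition about where such smallest sample numbers come from, a textbook uniform-convergence calculation (an illustration under simplifying assumptions, not the paper's Appendix III-A definition) goes as follows: for a finite class $\mathcal{G}$ of functions bounded in $[0, B]$, Hoeffding's inequality with a union bound gives

```latex
\Pr\Big[\,\sup_{g \in \mathcal{G}} \Big| \mathbb{E}\,g - \tfrac{1}{n} \sum_{i=1}^{n} g(x_i) \Big| \ge \epsilon \Big]
\le 2\,|\mathcal{G}|\,\exp\!\Big( -\tfrac{2 n \epsilon^{2}}{B^{2}} \Big),
\qquad \text{so} \qquad
n \;\ge\; \frac{B^{2}}{2 \epsilon^{2}} \ln \frac{2 |\mathcal{G}|}{\delta}
```

suffices for uniform $\epsilon$-accuracy with probability $1 - \delta$. Shrinking the class (here, the hypothesis and transformation spaces) directly shrinks the required sample size, which is the mechanism the uniform sample complexity captures.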

Compatibility and Transfer Error Rate
Compatibility was proposed by [30] to develop a PAC-model-style framework for semi-supervised learning. Using the notion of compatibility, the semi-supervised PAC theoretical model provides a unified framework for analyzing why unlabeled data can help to reduce the need for labeled data.
To investigate how the source data and unlabeled target data reduce the need for labeled target data in the SsHeDA problem, we define the SsHeDA compatibility.
Definition 3. Given a hypothesis space $\mathcal{H}$, transformation spaces $\mathcal{F}_s, \mathcal{F}_t$ and probability spaces $\mathcal{P}_s$ and $\mathcal{P}_t(X)$ over $\mathcal{X}_s \times \mathcal{Y}$ and $\mathcal{X}_t$, respectively, the heterogeneous domain adaptation compatibility is a function
$$\chi: \mathcal{H} \times \mathcal{F}_s \times \mathcal{F}_t \times \mathcal{P}_s \times \mathcal{P}_t(X) \to [0, 1].$$
Definition 4 (Transfer Error Rate). Given the HeDA compatibility $\chi$, the incompatibility of $\mathbf{h}, T_s, T_t$ with distributions $P_{X_sY_s}$ and $P_{X_t}$ is $1 - \chi(\mathbf{h}, T_s, T_t, P_{X_sY_s}, P_{X_t})$, which is also called the transfer error rate $\mathrm{err}(\mathbf{h}, T_s, T_t)$ when $\chi$, $P_{X_sY_s}$ and $P_{X_t}$ are clear from the context. For given data $S \sim P_{X_sY_s}$ and $T_u \sim P_{X_t}$, we use $\widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t)$ to denote $1 - \chi(\mathbf{h}, T_s, T_t, \widehat{P}_S, \widehat{P}_{T_u})$, the empirical form of $\mathrm{err}(\mathbf{h}, T_s, T_t)$.
The transfer error rate $\mathrm{err}(\mathbf{h}, T_s, T_t)$ measures the degree of incompatibility, a kind of "error" that quantifies how unreasonable we believe some proposed hypothesis functions and feature transformations to be. We then define the hypothesis functions and target transformations whose incompatibility is at most a given value $a$:
Definition 5. Given a threshold $a \ge 0$, $\mathcal{H}(a)$ and $\mathcal{F}_t(a)$ are
$$\mathcal{H}(a) = \{\mathbf{h} \in \mathcal{H} \mid \exists\, T_s \in \mathcal{F}_s,\, T_t \in \mathcal{F}_t\ \mathrm{s.t.}\ \mathrm{err}(\mathbf{h}, T_s, T_t) \le a\},$$
$$\mathcal{F}_t(a) = \{T_t \in \mathcal{F}_t \mid \exists\, T_s \in \mathcal{F}_s,\, \mathbf{h} \in \mathcal{H}\ \mathrm{s.t.}\ \mathrm{err}(\mathbf{h}, T_s, T_t) \le a\}.$$
Next, we need an assumption to estimate the difference between $\widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t)$ and $\mathrm{err}(\mathbf{h}, T_s, T_t)$.
Assumption 1. Given spaces $\mathcal{H}$, $\mathcal{F}_s$, $\mathcal{F}_t$ and a transfer error rate $\mathrm{err}(\cdot)$, the uniform sample complexities $m^{\mathrm{err}}_s(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$ and $m^{\mathrm{err}}_u(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$ are finite, with no labeled target data required.
If Assumption 1 holds, the transfer error rate can be estimated using finite source data and finite unlabeled target data. Lastly, we introduce an example of a transfer error rate.
Example 2. One can set $\mathrm{err}(\mathbf{h}, T_s, T_t)$ to a normalized weighted combination of the source risk and the discrepancy between the transformed source and target distributions, where $\tau$ is the weight and $B$ is the supremum of $\ell$. If $\mathcal{H} \circ \mathcal{F}_s$ and $\mathcal{H} \circ \mathcal{F}_t$ have finite Natarajan dimension, then this $\mathrm{err}(\mathbf{h}, T_s, T_t)$ satisfies Assumption 1 (see Appendix III-B, available in the online supplemental material).

Obstacle to Extend SsHoDA Theory
Here we introduce the obstacle to extending SsHoDA theory to SsHeDA. All proofs are given in Appendix IV, available in the online supplemental material. In SsHoDA theory [13], [29], the weighted risk is defined as
$$R_\beta(\mathbf{h}) = \beta\, R_t(\mathbf{h}) + (1 - \beta)\, R_s(\mathbf{h}),$$
where $\beta$ $(0 < \beta < 1)$ is the weight. Utilizing the weighted risk, the following theorem shows that the number of labeled target data can be reduced in the homogeneous situation.
Theorem 1 (SsHoDA Learning Bound). Let $\ell$ be a symmetric loss satisfying the triangle inequality, let the feature spaces $\mathcal{X}_s$ and $\mathcal{X}_t$ both be a common space $\mathcal{X}$, and let $\mathcal{F}_s = \mathcal{F}_t = \{I\}$, where $I$ is the identity mapping from $\mathcal{X}$ to $\mathcal{X}$. Given labeled source data $S$ of size $n_s$, labeled target data $T_l$ of size $n_l$, and unlabeled target data $T_u$ of size $n_u$, for any $0 < \delta < 1$, $0 < \gamma_1, \gamma_2 < 1$ and $\epsilon > 0$, with a probability of at least $1 - \delta > 0$, the target risk of the minimizer of the empirical weighted risk is bounded by the optimal target risk, the discrepancy distance, the combined risk and estimation-error terms (the full statement is given in Appendix IV, available in the online supplemental material).
The SsHoDA learning bound in Theorem 1 mainly contains three terms: the optimal target risk, the discrepancy distance and the combined risk. When $\beta \to 1$, the bound degenerates into the standard learning bound (that is, only labeled target data are used). Note that, by choosing different values of $\beta$, the bound allows us to effectively trade off the number of labeled target data against the numbers of labeled source data and unlabeled target data.
It is natural to extend the above theorem to the heterogeneous situation by using the weighted risk and transformations $T_s$ and $T_t$. However, the combined risk $\Lambda$ presents the main obstacle. In the heterogeneous situation, $\Lambda$ is
$$\Lambda(T_s, T_t) = \min_{\mathbf{h} \in \mathcal{H}} \big( R_s(\mathbf{h} \circ T_s) + R_t(\mathbf{h} \circ T_t) \big). \qquad (5)$$
Eq. (5) shows that $\Lambda$ is a function of the variables $T_s, T_t$; hence, it is not a fixed value. To estimate $\Lambda$ using finite samples, labeled target data are indispensable. The following theorem provides a reason why $\Lambda$ is the main obstacle.
Theorem 2. There exist a hypothesis space $\mathcal{H}$ and non-trivial transformation spaces $\mathcal{F}_s$, $\mathcal{F}_t$ such that, for any $\epsilon > 0$ and $0 < \delta < 1$, estimating the combined risk $\Lambda$ to accuracy $\epsilon$ requires a minimum number of labeled target data (the precise statement is given in Appendix IV, available in the online supplemental material).
Note that there is a coefficient $2(1 - \beta)$ on $\Lambda$ in the bound of Theorem 1. Hence, to estimate $2(1 - \beta)\Lambda$ given $\epsilon$ and $\delta$, the number of labeled target data needed is at least the quantity in Eq. (6). Combining Theorems 1 and 2 with Eq. (6), we know that, when $|\mathcal{X}_s|, |\mathcal{X}_t| > 1$, there exist a hypothesis space $\mathcal{H}$ and non-trivial transformation spaces $\mathcal{F}_s$, $\mathcal{F}_t$ such that the number of labeled target data $n_l$ required to obtain a bound similar to that of Theorem 1 is greater than or equal to $m^{R_t}_l(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$ (a detailed discussion can be found in Appendix IV-D, available in the online supplemental material).
Remark 3. According to Theorem 1, in the homogeneous situation, for any hypothesis space $\mathcal{H}$, when $\beta \to 0$, the need for labeled target data approaches 0. However, in the heterogeneous situation, there exist a hypothesis space $\mathcal{H}$ and non-trivial transformation spaces $\mathcal{F}_s, \mathcal{F}_t$ such that, for any weight $\beta \in (0, 1)$, the number of labeled target data may be larger than $m^{R_t}_l(\epsilon, \delta, \mathcal{H}, \mathcal{F}_s, \mathcal{F}_t)$.

Theoretical Analysis
We start this section with a basic theorem, which contains our main idea about SsHeDA theory. More extensive discussions and all proofs are given in Appendix V, available in the online supplemental material.
Theorem 3. Let $\mathcal{H}$ be the hypothesis space, $\mathcal{F}_s, \mathcal{F}_t$ be the source and target transformation spaces, and $\mathrm{err}(\cdot)$ be a transfer error rate satisfying Assumption 1. Given labeled source data $S$ of size $n_s$, labeled target data $T_l$ of size $n_l$, and unlabeled target data $T_u$ of size $n_u$, for any $0 < \delta < 1$ and $a, \epsilon > 0$, if the sample sizes exceed the corresponding uniform sample complexities, then, with a probability of at least $1 - \delta$, for any $\mathbf{h} \in \mathcal{H}$, $T_s \in \mathcal{F}_s$ and $T_t \in \mathcal{F}_t$ with $\widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t) \le a$, the target risk $R_t(\mathbf{h} \circ T_t)$ can be uniformly estimated over $\mathcal{H}$ and $\mathcal{F}_t(a + \epsilon)$ (the full statement is given in Appendix V, available in the online supplemental material).
Observing Theorem 3, when $\gamma_1$ is close to 0, the number of labeled target data required is less than that needed to estimate $R_t(\mathbf{h} \circ T_t)$ directly. The crucial reason is that the space $\mathcal{F}_t$ is replaced by the smaller space $\mathcal{F}_t(a + \epsilon)$.
To further reduce the number of labeled target data, we replace the condition on $\mathcal{H}$ with one on $\mathcal{H}(a + \epsilon)$; then, the number of labeled target data can be reduced to $m^{R_t}_l(\epsilon, (1 - \gamma_1)\delta, \mathcal{H}(a + \epsilon), \mathcal{F}_s, \mathcal{F}_t(a + \epsilon))$.
Though Theorem 3 provides an explanation of the uniform sample complexity for the labeled target data, we still cannot explain the representative algorithms [43], [44] by constructing different transfer error rates. This is because the transfer error rate is not related to the labeled target data, which can be used to help align the heterogeneous spaces and control the approximation error. Hence, we add an additional constraint, called the heterogeneous space alignment $d(\mathbf{h}, T_s, T_t)$, which can be estimated with labeled source data and labeled target data. Motivated by previous work [11], [44], [50], we can set the heterogeneous space alignment to the class-conditional distribution alignment (Eq. (8)), the class-wise projected MMD alignment (Eq. (9)), or the class-wise MMD alignment (Eq. (10)).
Theorem 4. Given the same conditions and assumption as in Theorem 3, for any $0 < \delta < 1$ and $a, \epsilon > 0$, the analogous bound holds under the additional constraint $\widehat{d}(\mathbf{h}, T_s, T_t) \le a$, where $\widehat{d}(\mathbf{h}, T_s, T_t)$ is the empirical form of the heterogeneous space alignment defined in Eqs. (8), (9) or (10).
Theorem 4 is an extension of Theorem 3. By introducing the heterogeneous space alignment in Theorem 4, we can better align the heterogeneous spaces of the source and target domains. Using Theorem 4, we can provide an explanation for representative algorithms, such as STN [44]. Reducing the space size implies that the estimation error decreases while the approximation error may increase. Theorem 5 provides a way to estimate the approximation error.
Theorem 5. Let $\ell$ be a loss satisfying the triangle inequality. Then, for $a_1, a_2 > a_{\min}$, the approximation error over $\mathcal{H}(a_1)$ and $\mathcal{F}_t(a_2)$ can be bounded (the full statement is given in Appendix V, available in the online supplemental material).
We use the combined error $\Lambda(\mathbf{h}, T_s, T_t)/B$ as the heterogeneous space alignment term in the above theorem. The combined error is deeply related to the conditional distribution discrepancy (see Appendix VI-B, available in the online supplemental material); hence, it can be used to align heterogeneous spaces. In addition, if we impose the analogous empirical constraint, we can also obtain a result similar to the above theorem. Furthermore, we provide an explanation for the representative algorithm STN [44] using our theory (see Appendix VI-C, available in the online supplemental material).

BRINGING SSHEDA THEORY INTO REALITY
This section shows how to design a loss function according to Theorem 4, i.e., how to bring Theorem 4 into reality. As discussed in Theorem 4, we should consider the following optimization problem:
$$\min_{\mathbf{h}, T_s, T_t}\ \widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t) \quad \mathrm{s.t.}\ \ \widehat{d}(\mathbf{h}, T_s, T_t) \le a. \qquad (11)$$
However, such a constrained optimization problem cannot be easily solved. Following [51], we replace the constraint in problem (11) with a penalty and obtain the revised problem
$$\min_{\mathbf{h}, T_s, T_t}\ \widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t) + \lambda\, \widehat{d}(\mathbf{h}, T_s, T_t), \qquad (12)$$
where $\lambda$ $(\lambda > 0)$ is a free parameter. It is important to construct the transfer error rate $\mathrm{err}(\mathbf{h}, T_s, T_t)$ and the heterogeneous space alignment $d(\mathbf{h}, T_s, T_t)$. Motivated by [50], [52], in our kernel-based algorithm we set
$$\mathrm{err}(\mathbf{h}, T_s, T_t) = R_s(\mathbf{h} \circ T_s) + \rho\, D^2_{\mathbf{h}}\big(P_{T_t(X_t)}, P_{T_s(X_s)}\big). \qquad (13)$$
Motivated by [44], in our neural network-based algorithm we set
$$\mathrm{err}(\mathbf{h}, T_s, T_t) = R_s(\mathbf{h} \circ T_s) + \rho\, D^2_{\mathcal{F}}\big(P_{T_t(X_t)}, P_{T_s(X_s)}\big), \qquad (14)$$
where $\rho$ $(\rho > 0)$ is a free parameter, and $\mathcal{F}$ is the unit ball of the linear kernel Hilbert space. Besides, we omit the coefficient that ensures the transfer error rate is not larger than 1.
Target Distribution Alignment. Note that selection bias may exist [53], [54], since the number of labeled target data might be small. Thus, to mitigate the selection bias, the target distribution alignment $\widehat{d}_t(\mathbf{h}, T_t, T_t)$ is considered; an analysis of the target distribution alignment is given in Appendix VI-D, available in the online supplemental material. In our kernel-based algorithm, we set $\widehat{d}_t(\mathbf{h}, T_t, T_t)$ to a class-wise projected MMD between the labeled and unlabeled target data (Eq. (15)); in our neural network-based algorithm, $\widehat{d}_t(\mathbf{h}, T_t, T_t)$ is the corresponding class-wise MMD (Eq. (16)), where $X_u^c$ is the unlabeled data matrix with pseudo label $c$.
Overall Loss Function. Inspired by the above discussions, to solve the SsHeDA problem well, we need to address the optimization problem (17) of minimizing, over $\mathbf{h}, \mathbf{h}^*, T_s, T_t$, the loss
$$\widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t) + \lambda\, \widehat{d}(\mathbf{h}^*, T_s, T_t) + \widehat{d}_t(\mathbf{h}, T_t, T_t) + \widehat{R}_t(\mathbf{h}^* \circ T_t), \qquad (18)$$
where $\widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t)$ and $\widehat{d}(\mathbf{h}^*, T_s, T_t)$ are the empirical forms of the quantities above, and $\widehat{d}_t(\mathbf{h}, T_t, T_t)$ is defined in Eq. (15) or (16). Note that we have added $\widehat{R}_t(\mathbf{h}^* \circ T_t)$ in Eq. (18), because we need to guarantee that the labeled target data can be classified accurately.
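To make the composition of such an overall loss concrete, here is a toy numpy sketch. All weights, the linear transformations, and the use of a single scorer standing in for both $\mathbf{h}$ and $\mathbf{h}^*$ are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heterogeneous data: source in R^5, target in R^3, latent space R^2, K = 2.
Xs, ys = rng.normal(size=(20, 5)), rng.integers(0, 2, size=20)
Xl, yl = rng.normal(size=(4, 3)), np.array([0, 1, 0, 1])
Xu = rng.normal(size=(30, 3))

Ts = rng.normal(size=(5, 2))   # source transformation (linear here, for illustration)
Tt = rng.normal(size=(3, 2))   # target transformation
G = rng.normal(size=(2, 2))    # linear scoring function

def one_hot(y, K=2):
    Y = np.zeros((len(y), K)); Y[np.arange(len(y)), y] = 1.0; return Y

def sq_risk(X, T, y):
    """Empirical squared-loss risk of the scorer composed with T on (X, y)."""
    return np.mean(np.sum((X @ T @ G - one_hot(y)) ** 2, axis=1))

def proj_mmd2(X1, T1, X2, T2):
    """Squared projected MMD between the transformed, scored samples."""
    return np.sum((np.mean(X1 @ T1 @ G, 0) - np.mean(X2 @ T2 @ G, 0)) ** 2)

# Schematic composition of the overall objective (weights lam, rho are hypothetical):
lam, rho = 1.0, 1.0
err_hat = sq_risk(Xs, Ts, ys) + rho * proj_mmd2(Xu, Tt, Xs, Ts)  # transfer error rate
d_hat = proj_mmd2(Xl, Tt, Xs, Ts)                                 # space alignment
dt_hat = proj_mmd2(Xl, Tt, Xu, Tt)                                # target alignment
loss = err_hat + lam * d_hat + dt_hat + sq_risk(Xl, Tt, yl)       # plus labeled-target risk
```

In a real optimizer, the transformations and scorer would be trained to minimize `loss`; the sketch only shows how the four ingredients add up.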

KERNEL-BASED ALGORITHM FOR SSHEDA
This section presents the kernel heterogeneous domain alignment (KHDA) algorithm, in which the spaces $\mathcal{F}_s$, $\mathcal{F}_t$ and $\mathcal{H}$ in problem (17) are defined as follows: the transformations are taken from RKHSs and the hypotheses are linear, where $d$ is the dimension of the latent space $\mathcal{X}$, $\mathcal{H}_{k_s}$ is the RKHS with kernel $k_s(\cdot, \cdot)$ defined on $\mathcal{X}_s \times \mathcal{X}_s$, and $\mathcal{H}_{k_t}$ is the RKHS with kernel $k_t(\cdot, \cdot)$ defined on $\mathcal{X}_t \times \mathcal{X}_t$. Each $h_c$ is a linear function; thus, $h_c \circ T_s \in \mathcal{H}_{k_s}$ and $h_c \circ T_t \in \mathcal{H}_{k_t}$.

Loss Function in KHDA
As introduced in Eq. (13), the transfer error rate is $R_s(\mathbf{h} \circ T_s) + \rho\, D^2_{\mathbf{h}}(P_{T_t(X_t)}, P_{T_s(X_s)})$; its empirical form can then be written as
$$\widehat{\mathrm{err}}(\mathbf{h}, T_s, T_t) = \widehat{R}_s(\mathbf{h} \circ T_s) + \rho\, D^2_{\mathbf{h}}\big(\widehat{P}_{T_t(X_t)}, \widehat{P}_{T_s(X_s)}\big).$$
The heterogeneous space alignment $d(\mathbf{h}, T_s, T_t)$ is set to the projected MMD alignment. The empirical class-wise projected MMD alignment is
$$\widehat{d}(\mathbf{h}^*, T_s, T_t) = \sum_{c=1}^{K} \Big\| \frac{1}{n_s^c} \sum_{\mathbf{x} \in X_s^c} \mathbf{h}^* \circ T_s(\mathbf{x}) - \frac{1}{n_l^c} \sum_{\mathbf{x} \in X_l^c} \mathbf{h}^* \circ T_t(\mathbf{x}) \Big\|_2^2.$$
Motivated by [50], [55], pseudo labels are used to further improve the classification performance; hence, we replace the above equation by
$$\widehat{d}(\mathbf{h}^*, T_s, T_t) = \sum_{c=1}^{K} \Big\| \frac{1}{n_s^c} \sum_{\mathbf{x} \in X_s^c} \mathbf{h}^* \circ T_s(\mathbf{x}) - \frac{1}{n_t^c} \sum_{\mathbf{x} \in X_t^c} \mathbf{h}^* \circ T_t(\mathbf{x}) \Big\|_2^2,$$
where $X_t^c = [X_l^c, X_u^c]$ and $X_u^c$ is the unlabeled data matrix with pseudo label $c$.
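The class-wise projected MMD alignment with pseudo labels can be sketched as follows (an illustrative numpy version, where `Hs` and `Ht` stand for precomputed score matrices of the transformed source and target data, one row per sample):

```python
import numpy as np

def classwise_projected_mmd2(Hs, ys, Ht, yt, K):
    """Sum over classes c of the squared l2 distance between the class-c mean
    score vectors of the two domains; ys, yt hold (pseudo) labels in {0..K-1}."""
    total = 0.0
    for c in range(K):
        ms, mt = (np.asarray(ys) == c), (np.asarray(yt) == c)
        if ms.any() and mt.any():   # skip classes missing from either side
            total += np.sum((Hs[ms].mean(0) - Ht[mt].mean(0)) ** 2)
    return total
```

When the two domains' class-conditional score means coincide, the alignment term is exactly zero, which is what the optimization drives toward.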
Then, to preserve the domains' geometric structures, such as the manifold structure and clustering structure, manifold regularization [40] is considered in KHDA. Many kernel-based DA algorithms [50], [52], [56] have studied manifold regularization and shown that it helps to improve transfer performance. One can write the manifold regularizations as
$$\widehat{M}_1(\mathbf{h}^*, T_s, T_t) = \sum_{\mathbf{x}, \mathbf{x}'} \big\| \mathbf{h}^* \circ T(\mathbf{x}) - \mathbf{h}^* \circ T(\mathbf{x}') \big\|_2^2\, W^*(\mathbf{x}, \mathbf{x}'), \qquad
\widehat{M}_2(\mathbf{h}, T_t, T_t) = \sum_{\mathbf{x}, \mathbf{x}'} \big\| \mathbf{h} \circ T_t(\mathbf{x}) - \mathbf{h} \circ T_t(\mathbf{x}') \big\|_2^2\, W(\mathbf{x}, \mathbf{x}'),$$
where $T(\mathbf{x}) = T_s(\mathbf{x})$ if $\mathbf{x} \in X_s$, otherwise $T(\mathbf{x}) = T_t(\mathbf{x})$; $W^*(\mathbf{x}, \mathbf{x}')$ and $W(\mathbf{x}, \mathbf{x}')$ are pair-wise affinity functions that estimate the similarity of $\mathbf{x}$ and $\mathbf{x}'$. Additionally, when $\mathbf{x}$ and $\mathbf{x}'$ are from different domains, we set $W^*(\mathbf{x}, \mathbf{x}') = 0$.
Summarizing the above discussion, the loss function (18) can be rewritten as the KHDA objective in Eq. (20), which combines the empirical transfer error rate, the class-wise projected MMD alignment, the manifold regularizations $\widehat{M}_1$ and $\widehat{M}_2$, the empirical labeled-target risk, and RKHS-norm regularizers, where $\|\cdot\|_s^2$ and $\|\cdot\|_t^2$ are the squared norms in the RKHSs with kernels $k_s$ and $k_t$, respectively; $\|\mathbf{h} \circ T_s\|_s^2 + \|\mathbf{h} \circ T_t\|_t^2$ and $\|\mathbf{h}^* \circ T_s\|_s^2 + \|\mathbf{h}^* \circ T_t\|_t^2$ are used to avoid over-fitting; and $\sigma$ $(\sigma \ge 0)$ is a free parameter. In addition, we set the loss $\ell$ to the squared loss $\ell(\mathbf{y}, \mathbf{y}') = \|\mathbf{y} - \mathbf{y}'\|_2^2$ in KHDA.

Reformulation of the KHDA Loss Function
This section shows how to reformulate Eq. (20). Following the representer theorem [57], $T_s$ and $T_t$ can be written as
$$T_s(\mathbf{x}) = \sum_{i=1}^{n_s} \boldsymbol{\alpha}_i\, k_s(\mathbf{x}, \mathbf{x}_s^i),\ \forall \mathbf{x} \in \mathcal{X}_s; \qquad T_t(\mathbf{x}) = \sum_{i=1}^{n_t} \boldsymbol{\beta}_i\, k_t(\mathbf{x}, \mathbf{x}_t^i),\ \forall \mathbf{x} \in \mathcal{X}_t,$$
where $\boldsymbol{\alpha}_i, \boldsymbol{\beta}_i \in \mathbb{R}^{1 \times d}$ are the parameters. We define matrices $\boldsymbol{\alpha} \in \mathbb{R}^{n_s \times d}$, $\boldsymbol{\beta} \in \mathbb{R}^{n_t \times d}$ and $Q \in \mathbb{R}^{(n_s + n_t) \times d}$ by stacking $\boldsymbol{\alpha}_i$, $\boldsymbol{\beta}_i$ as rows and setting $Q = [\boldsymbol{\alpha}; \boldsymbol{\beta}]$. We also define the kernel matrix
$$K = \begin{bmatrix} K_{ss} & 0 \\ 0 & K_{tt} \end{bmatrix},$$
where $K_{ss} = [k_s(\mathbf{x}_s^i, \mathbf{x}_s^j)] \in \mathbb{R}^{n_s \times n_s}$ and $K_{tt} = [k_t(\mathbf{x}_t^i, \mathbf{x}_t^j)] \in \mathbb{R}^{n_t \times n_t}$ are the source and target kernel matrices, respectively.
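Concretely, under the representer parameterization above, the transformed data are just kernel-matrix products (a small numpy sketch with an RBF kernel chosen for illustration):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix [k(x^i, x^j)] for an RBF kernel (an illustrative kernel choice)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
ns, nt, d = 5, 4, 2
Xs, Xt = rng.normal(size=(ns, 3)), rng.normal(size=(nt, 6))   # heterogeneous features
alpha, beta = rng.normal(size=(ns, d)), rng.normal(size=(nt, d))

# T_s(X_s) and T_t(X_t) under the representer theorem: Gram matrix times parameters.
Ts_Xs = rbf_gram(Xs) @ alpha       # (ns, d) transformed source data
Tt_Xt = rbf_gram(Xt) @ beta        # (nt, d) transformed target data

# The block kernel matrix couples both coefficient blocks via Q = [alpha; beta].
K = np.block([[rbf_gram(Xs), np.zeros((ns, nt))],
              [np.zeros((nt, ns)), rbf_gram(Xt)]])
Q = np.vstack([alpha, beta])
```

The product `K @ Q` reproduces the stacked transformed data, which is exactly why the objective can be written purely in terms of $K$, $Q$ and the linear scorers.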
Additionally, the hypothesis space $\mathcal{H}$ is linear; thus, we can write $\mathbf{h}$ and $\mathbf{h}^*$ as $\mathbf{h}(\mathbf{z}) = \mathbf{z} G$ and $\mathbf{h}^*(\mathbf{z}) = \mathbf{z} G^*$, where $G, G^* \in \mathbb{R}^{d \times K}$ are the parameters.
Empirical Risks. Here we rewrite the empirical risks in matrix form. Let the label matrix $Y \in \mathbb{R}^{(n_s + n_t) \times K}$ collect the one-hot label vectors of $\{\mathbf{x}_{st}^i\}$; the empirical risks then take the form $\|A(KQG - Y)\|_F^2$ and $\|A^*(KQG^* - Y)\|_F^2$, where $A$ and $A^*$ are $(n_s + n_t) \times (n_s + n_t)$ diagonal weighting matrices, with $A^*_{ii} \neq 0$ only if $\mathbf{x}_{st}^i \in X_l$, otherwise $A^*_{ii} = 0$; and $\|\cdot\|_F$ is the Frobenius norm.
Distribution Alignment. Using the representer theorem [57] and the kernel trick [52], we rewrite the distribution-alignment terms in matrix form. The pair-wise source affinity matrix $W_s$ is defined as
$$(W_s)_{ij} = \mathrm{sim}(\mathbf{x}_s^i, \mathbf{x}_s^j)\ \ \text{if } \mathbf{x}_s^j \in N_p(\mathbf{x}_s^i), \qquad (W_s)_{ij} = 0\ \ \text{otherwise},$$
where $\mathrm{sim}(\mathbf{x}_s^i, \mathbf{x}_s^j)$ is a similarity function such as cosine similarity, $N_p(\mathbf{x}_s^i)$ denotes the set of $p$-nearest neighbors of $\mathbf{x}_s^i$, and $p$ is a free parameter. The pair-wise target affinity matrix $W_t$ is defined analogously. Using $W_s$ and $W_t$, we obtain the affinity matrices $W^*$ and $W$. Using the representer theorem and the kernel trick, we can formulate $\widehat{M}_1(\mathbf{h}^*, T_s, T_t)$ and $\widehat{M}_2(\mathbf{h}, T_t, T_t)$ as quadratic forms in $L^*$ and $L$ (Eqs. (26) and (27)), where $L^*$ and $L$ are the Laplacian matrices, which can be written as $D^* - W^*$ and $D - W$; here, $D^*$ and $D$ are diagonal matrices with $D^*_{ii} = \sum_j W^*_{ij}$ and $D_{ii} = \sum_j W_{ij}$. Combining Eqs. (26) and (27) with Eq. (20), the optimization problem is written as
$$\min_{G, G^*, Q}\ L(G, G^*, Q), \qquad (28)$$
where $L(G, G^*, Q)$ is defined in Eq. (29).
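The affinity and Laplacian construction can be sketched as follows (an illustrative numpy version; the one-way neighbor rule, the clipping of negative cosine similarities, and the symmetrization are our own simplifications):

```python
import numpy as np

def affinity(X, p=3):
    """Pair-wise affinity W: cosine similarity, kept only for each point's
    p-nearest neighbors; negative similarities are clipped to keep W >= 0."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n, W = len(X), np.zeros((len(X), len(X)))
    for i in range(n):
        nbrs = np.argsort(-S[i])[1:p + 1]          # p most similar points, excluding self
        W[i, nbrs] = np.maximum(S[i, nbrs], 0.0)
    return np.maximum(W, W.T)                      # symmetrize

def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal degree matrix."""
    return np.diag(W.sum(axis=1)) - W
```

With nonnegative $W$, the Laplacian is positive semi-definite, so the manifold term penalizes predictions that differ across strongly connected (similar) points.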

Analytical Solution
We theoretically analyze the optimization problem (28), and the following theorem tells us that problem (28) has countless solutions. All proofs are in Appendix VI, available in the online supplemental material.
Theorem 6. If the optimization problem $\min_{G, G^*, Q} L(G, G^*, Q)$ has a solution, then it has countless solutions, where $L(G, G^*, Q)$ is defined in Eq. (29).
Although the optimization problem (28) has countless solutions, we are only interested in $QG$ and $QG^*$. Next, in Theorem 7, we investigate whether problem (28) can be transformed into an optimization problem with respect to $QG$ and $QG^*$.
Theorem 7 implies an important result: although the solutions of problem (28) are not unique, $Q\Gamma$ and $Q\Gamma^*$ are fixed and form the unique solution of problem (30). We then present the solution to problem (30) in Theorem 8.
Theorem 8. If the kernels $k_s$ and $k_t$ are universal, then optimization problem (30) has a unique solution.

Based on Theorem 8, $h \circ T_t$ and $h^* \circ T_t$ can be written as $h \circ T_t(x) = \sum_{i=1}^{n_t} k_t(x, x_t^i)\, Z_i$ and $h^* \circ T_t(x) = \sum_{i=1}^{n_t} k_t(x, x_t^i)\, Z_i^*$ (Eqs. (31) and (32)), where $x \in \mathcal{X}_t$, and $Z_i$, $Z_i^*$ denote the $(i + n_s)$th rows of the matrices $Z$ and $Z^*$, respectively.

KHDA Algorithm
To compute Eqs. (31) and (32), the labels of the unlabeled target data are required; however, no label information is available for these data. A simple and effective remedy is the pseudo-label iterative strategy [12], [55], [58], [59]. Motivated by this strategy, Algorithm 1 iteratively improves the quality of Eqs. (31) and (32) and gives a final kernel-based solution to the SsHeDA problem, as explained below.
Step 1 (Initialize pseudo labels, lines 2-3). We train an SVM on the labeled target data $T_l$ and use it to predict pseudo labels $Y_u$ for the unlabeled target data $T_u$. Then we set $Y_u^* = Y_u$.
Step 2 (Construct classifiers, lines 4-5). Using Eqs. (31) and (32) with the pseudo labels $Y_u^*$ and $Y_u$, we obtain $h^* \circ T_t$ and $h \circ T_t$.
Step 3 (Bridge classifiers, lines 6-9). To link $h^* \circ T_t$ and $h \circ T_t$, we update the pseudo labels $Y_u^*$ with the classifier $h^* \circ T_t$, then use Eq. (31) with the pseudo labels $Y_u^*$ to learn a third classifier $\tilde{h} \circ T_t$. Next, we take advantage of the complementarity of the three classifiers by combining them into an ensemble $f_c$, where $h_c$, $h_c^*$ and $\tilde{h}_c$ are the $c$th coordinates of $h$, $h^*$ and $\tilde{h}$. As a result, the pseudo label of a given target sample $x$ can be predicted by $\arg\max_{c \in \mathcal{Y}} [f_1(x), \ldots, f_c(x), \ldots, f_K(x)]$. Using the classifier $f = [f_1, \ldots, f_c, \ldots, f_K]$, we obtain the pseudo labels $Y_u$.
Step 4 (Update, line 10). We repeat Steps 2 and 3 until convergence and choose $f$ as the final classifier.
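The four steps can be sketched as follows. This is a minimal illustration of the pseudo-label iteration only: a toy nearest-centroid classifier (`nearest_centroid_fit_predict`, our stand-in) replaces the SVM initializer and the kernel classifiers of Eqs. (31) and (32), and the three-classifier ensemble of Step 3 is omitted.

```python
import numpy as np

def nearest_centroid_fit_predict(X_train, y_train, X_test):
    """Toy stand-in classifier: assign each test point to the class
    whose training centroid is nearest."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[np.argmin(d, axis=1)]

def khda_style_iteration(X_l, y_l, X_u, T=5):
    """Pseudo-label iterative strategy in the spirit of Algorithm 1:
    initialize pseudo labels from the labeled target data (Step 1),
    then repeatedly refit on labeled + pseudo-labeled data and relabel
    (Steps 2-3) until the labels stop changing or T iterations elapse
    (Step 4)."""
    # Step 1: initialize pseudo labels from the labeled target data.
    y_u = nearest_centroid_fit_predict(X_l, y_l, X_u)
    for _ in range(T):
        # Steps 2-3: rebuild the classifier on labeled + pseudo-labeled data.
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        y_new = nearest_centroid_fit_predict(X_all, y_all, X_u)
        if np.array_equal(y_new, y_u):  # Step 4: stop at convergence
            break
        y_u = y_new
    return y_u
```

The point of the loop is that each round of refitting uses the previous round's pseudo labels, so label quality can improve across iterations.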

NETWORK-BASED ALGORITHM FOR SSHEDA
To address the SsHeDA problem on large-scale datasets, this section presents joint mean embedding alignment (JMEA), which trains a network to classify target data. In JMEA, we use fully-connected neural networks to construct the spaces $\mathcal{F}_s$, $\mathcal{F}_t$ and $\mathcal{H}$. Since JMEA does not need to compute the kernel matrix of the whole training set, its computational cost is lower than that of KHDA.

Network Structure in JMEA
According to Eqs. (14), (16) and (18), we obtain a loss function in which a trade-off parameter re-weights the labeled-data loss $\hat{R}_s(h^* \circ T_s) + \hat{R}_t(h^* \circ T_t)$; this parameter is always set to 2 in this paper. Since we need to minimize this loss using two different classifiers $h^*$ and $h$ over the representations of the source and target data, the network used in JMEA contains two branches (see Fig. 2). The first branch takes the labeled source, labeled target and unlabeled target data as inputs and aims to train a classifier (i.e., $h^*$) that classifies both the source representations (i.e., $T_s(X_s)$) and the target representations (i.e., $T_t(X_t)$) well. The second branch takes only the labeled and unlabeled target data as inputs and aims to train a classifier (i.e., $h$) for the target representations (i.e., $T_t(X_t)$). Note that the second branch serves as the final target classifier.
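The two-branch structure can be illustrated at the shape level. This is a sketch under assumed toy dimensions: plain linear maps stand in for the fully-connected networks $T_s$, $T_t$ and the classifier heads $h$, $h^*$.

```python
import numpy as np

# Toy dimensions: source/target input dims, shared representation dim, #classes.
d_s, d_t, d, K = 300, 100, 64, 10
rng = np.random.default_rng(0)

# Linear maps stand in for the fully-connected networks T_s and T_t.
Ws, Wt = rng.standard_normal((d_s, d)), rng.standard_normal((d_t, d))
T_s = lambda X: X @ Ws
T_t = lambda X: X @ Wt

# Two classifier heads over the shared d-dimensional representation space.
W_hstar = rng.standard_normal((d, K))  # h*: Branch I (source + target)
W_h = rng.standard_normal((d, K))      # h : Branch II (target only, final classifier)

X_s = rng.standard_normal((32, d_s))   # a labeled-source mini-batch
X_t = rng.standard_normal((32, d_t))   # a target mini-batch

# Branch I classifies both source and target representations with h*.
logits_I_src = T_s(X_s) @ W_hstar
logits_I_tgt = T_t(X_t) @ W_hstar
# Branch II classifies target representations only with h.
logits_II = T_t(X_t) @ W_h
```

The key structural point is that both domains are mapped into the same $d$-dimensional space, so a single head $h^*$ can score source and target samples with one weight matrix.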

Loss Function in JMEA
In the first branch, we need to train a classifier that classifies both source and target representations well; the corresponding loss function is used to optimize the parameters of Branch I, where $X_t^c$ is $[X_l^c; X_u^c]$. In the second branch, we need to train a classifier that classifies the target representations well; the corresponding loss function is used to optimize the parameters of Branch II. Since pseudo labels are needed to compute Eqs. (38) and (40), we apply high-confidence pseudo labels and the soft-label trick [44] to ensure their quality. Hence, we revise Eqs. (38) and (40) accordingly.
Here $I_c^k = I^k \cap I_c$, where $I^k$ collects the indices of the top-$k$ high-confidence target samples (annotated by $h \circ T_t$), $I_c$ collects the indices of target samples with pseudo label $c$, and $y_u^i$ is the soft label of $x_u^i$. Similarly, $\eta$ is a small constant ($10^{-6}$) that avoids numerical problems when $|I_c^{*k}|$ gets close to 0, and $I_c^{*k} = I^{*k} \cap I_c^*$, where $I^{*k}$ collects the indices of the top-$k$ high-confidence target samples (annotated by $h^* \circ T_t$) and $I_c^*$ collects the indices of target samples with pseudo label $c$. Note that we apply co-teaching to avoid accumulating errors in the pseudo target labels [60], [61]: the pseudo-labeled target data used in Branch I are annotated by Branch II (i.e., $y_u^i$), and the pseudo-labeled target data used in Branch II are annotated by Branch I (i.e., $y_u^{*i}$). Because the two branches annotate the unlabeled target data from different views, they can teach each other and thereby avoid accumulating pseudo-label errors [61], [62]. Finally, the overall loss function of JMEA consists of four parts, beginning with the labeled-data loss.
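The top-$k$ high-confidence selection and the co-teaching exchange can be sketched as follows. The helper name `topk_confident` is ours, and the soft-label trick of [44] is omitted for brevity.

```python
import numpy as np

def topk_confident(probs, k):
    """Return the indices and hard pseudo labels of the k most confident
    predictions (confidence = highest class probability per sample)."""
    conf = probs.max(axis=1)        # confidence of each prediction
    idx = np.argsort(-conf)[:k]     # k most confident samples
    return idx, probs.argmax(axis=1)[idx]

# Co-teaching: each branch is trained on pseudo labels produced by the
# *other* branch, so the two views do not reinforce each other's mistakes.
rng = np.random.default_rng(0)
probs_I = rng.dirichlet(np.ones(5), size=100)   # Branch I's target predictions
probs_II = rng.dirichlet(np.ones(5), size=100)  # Branch II's target predictions
idx_II, y_for_II = topk_confident(probs_I, k=20)   # labels fed to Branch II
idx_I, y_for_I = topk_confident(probs_II, k=20)    # labels fed to Branch I
```

Swapping the annotations between branches is the co-teaching step; each branch only ever sees the other branch's most confident labels.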

JMEA Algorithm
Algorithm 2 presents how JMEA trains a network to classify data from the target domain. First, we initialize the parameters of $T_s$, $T_t$, $h$ and $h^*$ (line 2). Then we shuffle the data $S$, $T_u$, $T_l$ and update the value of $k$, which decides how many high-confidence pseudo-labeled target samples should be selected (lines 3 and 4). After a mini-batch is fetched, we obtain the pseudo labels of the unlabeled target data via $h^* \circ T_t$ and $h \circ T_t$ (lines 5 and 6), respectively. Based on the confidence of each pseudo label, we select the top-$k$ high-confidence pseudo-labeled target samples (lines 7 and 8). We then compute the overall loss (lines 9-11) and update the parameters of $T_s$, $T_t$, $h$ and $h^*$ by minimizing it.
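The selection schedule updated on line 4 of Algorithm 2 can be written as a small helper. The function name `selection_size` is ours; `T_max` denotes the total number of epochs as in the algorithm.

```python
def selection_size(i, T_max, n_b, r):
    """Line 4 of Algorithm 2: k = min{floor(n_b * i / T_max + r), n_b}.
    The number k of selected high-confidence pseudo-labeled target
    samples grows linearly with the epoch index i, starting from the
    lowest selection rate r and capped at the mini-batch size n_b."""
    return min(int(n_b * i / T_max + r), n_b)
```

Early in training only the few most confident pseudo labels are used; by the final epochs the whole mini-batch is eligible.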

EXPERIMENTS AND EVALUATIONS
This section empirically evaluates the proposed algorithms KHDA and JMEA on different SsHeDA tasks and then analyzes their sensitivity to hyperparameters. More experiments are given in Appendix VIII, available in the online supplemental material.

Algorithm 2. JMEA Algorithm for SsHeDA
1: Input: data $S$, $T_u$, $T_l$; #epochs $T$; lowest selection rate $r$; mini-batch size $n_b$; parameters;
2: Initialize $T_s$, $T_t$, $h$, $h^*$;
for $i = 1, 2, \ldots, T$ do
3:   Shuffle datasets $S$, $T_u$, $T_l$;
4:   Update $k = \min\{\lfloor n_b \cdot i / T_{\max} + r \rfloor, n_b\}$; // set the number of high-confidence pseudo-labeled target samples
  for $N = 1, \ldots, N_{\max}$ do
5:     Fetch mini-batches from $S$ and $T_u$; // we use the full batch for $T_l$
6:     Compute the labeled-data loss; …
Text↔Image. The UPMC Food-101 dataset [66] contains text and image data and consists of about 100,000 recipes across 101 food categories. For images (I), we use Big Transfer-M (BiT-M) with ResNet-50 [65] to extract features. For text (T), we adopt the NLP model BERT [67] to extract features [68]. We then randomly select 30 samples per class as the source data, 30 samples per class as the unlabeled target data, and 1, 3 and 5 samples per class as labeled target data. There are 6 SsHeDA tasks. The average accuracy and standard error over 10 random trials are shown in Table 3.
Text↔Image. The Wikipedia dataset [69], [70] is extracted from Wikipedia feature articles and consists of 2,866 image-text pairs in 10 semantic classes. For images (I), we use Big Transfer-M (BiT-M) with ResNet-101 [65] to extract features. For text (T), since most of Wikipedia's texts are long sequences, we adopt the NLP model Big Bird [71] to extract features. All data in the source domain are used as labeled source data. For the target domain, we randomly choose 3, 5 and 7 samples per class as labeled target data, and randomly choose 50 samples per class from the remaining data as unlabeled target data. There are 6 SsHeDA tasks. The average accuracy and standard error over 10 random trials are shown in Table 3.
Text↔Text. The Multilingual Reuters Collection (MRC) [72], [73] is a text dataset for multilingual text categorization; it consists of 11,000 articles from six categories in five languages: English, French, German, Italian, and Spanish. Following the settings of previous work [43], we describe each article with bag-of-words (BoW) features weighted by TF-IDF, and then apply PCA to the BoW features to preserve 60% of the energy [72], [73]. We set English, French, Italian and German as the source domains and Spanish as the target domain. 100 samples per class in the source domain are randomly selected as labeled source data. For the target domain, we randomly choose 10, 15 and 20 samples per class as labeled target data, and randomly choose 500 samples per class as unlabeled target data. There are 12 SsHeDA tasks. The average accuracy and standard error over 20 random trials are shown in Table 4.
Image→Text (end-to-end). The Road-View dataset is constructed from the natural language-based vehicle retrieval (NLVR) dataset [74] for end-to-end learning tasks. In the result tables, the underline indicates the best accuracy among all non-neural network algorithms, and bold indicates the best accuracy among all neural network algorithms.

Baseline Algorithms. 1NN, SVMt, DAMA [26], SHFA [23], G-JDA [43], CDLS [41], DACoM [39], TNT [24] and STN [44] are used as the baselines for the non-end-to-end tasks. Except for 1NN and SVMt, details of these baselines are given in Section 2. In the end-to-end task, we consider the following baselines: 1) Target-ERM, where we fine-tune the RoBERTa-large model [76] using only labeled target data; 2) ST-ERM, where we train JMEA using only the labeled information in both domains; 3) JMEA-BII, where we fine-tune the RoBERTa-large model [76] using labeled and unlabeled target data (i.e., training only Branch II of JMEA); and 4) the end-to-end version of STN [44].

Experimental Setup
Before detailing the evaluation results, we explain how the parameters of KHDA and JMEA are set.
Parameters for KHDA. KHDA has several parameters: 1) the kernels $k_s$ and $k_t$; 2) the number of iterations $T$; and 3) $s$, $r$, and the number of neighbors $p$.
As suggested in [42], [50], we choose the Gaussian kernel $k_n(x_n, x'_n) = \exp\left(-\frac{\|x_n - x'_n\|_2^2}{2 r_n^2}\right)$, where $n \in \{s, t\}$, $x_n, x'_n \in \mathcal{X}_n$, and the kernel bandwidth $r_n$ is $\mathrm{median}(\|x_n - x'_n\|_2)$ over all $x_n, x'_n \in \mathcal{X}_n$. When $n = s$, $\mathcal{X}_n = \mathcal{X}_s$; when $n = t$, $\mathcal{X}_n = \mathcal{X}_t$. The details of the parameters are shown in Table 1.
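A sketch of this Gaussian kernel with the median-heuristic bandwidth. We compute the median over off-diagonal pairs, which we assume is the intended reading, since including the zero self-distances would bias the median downward.

```python
import numpy as np

def gaussian_kernel_median(X):
    """Gaussian kernel matrix for the rows of X, with bandwidth set to
    the median of the pairwise Euclidean distances (median heuristic)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    dists = np.sqrt(sq)
    # Median over the strictly upper-triangular (off-diagonal) pairs.
    r = np.median(dists[np.triu_indices_from(dists, k=1)])
    return np.exp(-sq / (2.0 * r ** 2))
```

The median heuristic makes the bandwidth scale with the data, so the same setting can be reused across the source and target feature spaces.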
Parameters for JMEA. JMEA has several parameters: 1) the number of epochs $T$; 2) the re-weighting parameters; and 3) the lowest selection rate $r$. Except for the end-to-end task, the details of these parameters are shown in Table 1. $T_s$ and $T_t$ are three-layer fully-connected neural networks; $h$ and $h^*$ are two-layer fully-connected neural networks. In the end-to-end task, the backbone of Branch I of JMEA is a ResNet-50 model [75], the backbone of Branch II is a RoBERTa-large model [76], and JMEA is implemented in an end-to-end manner. Owing to the complexity of the Road-View task, we detail JMEA's parameter settings for the end-to-end task in Appendix VIII, available in the online supplemental material.
Metric. The classification accuracy [55] on the test data is $\mathrm{Accuracy} = |\{x : x \in X_u \wedge g(x) = y(x)\}| \,/\, |X_u|$, where $g$ is the learned classifier, $y(x)$ is the true label of $x$, and $X_u^c$ is the unlabeled target data matrix with true label $c$.
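The metric can be written as a small helper, assuming `g` returns hard labels.

```python
import numpy as np

def accuracy(g, X_u, y_true):
    """Classification accuracy on the unlabeled target data: the fraction
    of samples whose predicted label g(x) matches the true label."""
    return float(np.mean(g(X_u) == y_true))
```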

Experimental Results
The classification accuracy and standard error on different tasks are shown in Tables 2, 3, and 4.
Image↔Image. 1) In Table 2, compared with all non-neural network baselines, KHDA performs best on all tasks (12/12), with a mean-accuracy improvement of at least 2.5%. Compared with all neural network baselines, JMEA performs best on almost all tasks (11/12), and its average accuracy improves on the best baseline by at least 2.5%. Both KHDA and JMEA outperform all baseline algorithms. 2) Notably, the accuracy of all algorithms increases as more labeled target data per class are used. In addition, KHDA is better than JMEA when there is 1 labeled target sample per class, but becomes worse than JMEA with 3 or 5 labeled target samples per class. 3) Except for DAMA, the baselines SHFA, G-JDA, CDLS, DACoM, TNT and STN achieve better mean performance than 1NN and SVMt, indicating that they can transfer knowledge from the source data to the target data.
Text↔Image. The results for the Wikipedia and Food-101 datasets are reported in Table 3. 1) JMEA performs best on all tasks (12/12), with a mean improvement of at least 0.5% over all baselines. 2) Among all non-neural network algorithms, KHDA performs best on all tasks (12/12), with a mean improvement of at least 2.0%. 3) In some tasks, STN is slightly better than KHDA (by 0.1%-0.6%); however, in the T→I Wiki tasks, KHDA is better than STN (by 0.9%-3.3%). 4) DACoM and TNT are worse than 1NN and SVMt in the I→T and T→I Food tasks; the reason may be that the number of classes in Food-101 (101) is beyond the capacity of DACoM and TNT.
Text↔Text. Table 4 shows the means and standard errors of the classification accuracy of all algorithms on MRC. 1) Among all non-neural network algorithms, KHDA performs best on 11 tasks (11/12). 2) JMEA achieves the best performance of all algorithms and generally outperforms the other baselines by at least 0.9%, 0.3% and 0.3% in average accuracy for 10, 15 and 20 labeled target samples per class, respectively. 3) According to Table 4, the accuracy of all algorithms increases as more labeled target data per class are used.
Image!Text (end-to-end). Table 5 shows the means and standard errors of classification accuracy of JMEA and baselines on the Road-View task. In Table 5, JMEA outperforms all baselines. In particular, JMEA has higher accuracy than the state-of-the-art network-based SsHeDA algorithm STN.

Parameter Sensitivity
We conduct experiments on three different tasks: CIFAR-8, CIFAR-59 and Food-101 (3 labeled target samples per class) to evaluate the mean-accuracy variations of KHDA and JMEA using different parameters.
Parameter r in KHDA. We run KHDA with varying values of r; Fig. 3a plots the classification accuracy w.r.t. different values of r. From this figure, we observe that 1) the performance is generally best when r is in [1.0, 5.0] and worst when r = 0.01; 2) as r increases from 0.01 to 1.0, the accuracy increases; 3) as r increases from 1.0 to 100, the accuracy decreases slowly. KHDA achieves satisfactory performance when r is in [1.0, 10.0].
Parameter p in KHDA. We run KHDA with varying values of p; Fig. 3b plots the classification accuracy w.r.t. different values of p. From this figure, we observe that as p increases from 2 to 64, the accuracy on CIFAR-8 is quite stable, while the accuracy on CIFAR-59 and Food-101 decreases slowly. In particular, KHDA achieves satisfactory performance with p in the range [2, 10].
Parameter s in KHDA. We run KHDA with varying values of s; Fig. 3c plots the classification accuracy w.r.t. different values of s. From this figure, we observe that 1) the performance is best when s = 0.001 and worst when s = 10.0; 2) as s increases from 0.001 to 10.0, the accuracy decreases gradually. Specifically, with s in the range [0.001, 0.05], the mean accuracy of KHDA remains higher than that of the non-neural network baselines.
Parameter T in KHDA. The results of the convergence analysis are provided in Fig. 3d, which shows that KHDA achieves steady performance in a few iterations (T < 5).
Parameter r in JMEA. We run JMEA with varying values of r; Fig. 3e plots the classification accuracy w.r.t. different values of r. From this figure, we observe that 1) the performance of JMEA is very steady when r is in [0.0001, 0.001] on the CIFAR-8 and Food-101 datasets; 2) as r increases from 0.0001 to 0.001, the accuracy on CIFAR-59 increases and peaks at r = 0.001. Thus, we recommend selecting r in [0.0005, 0.001].
Parameter r in JMEA. We run JMEA with varying values of r; Fig. 3f plots the classification accuracy w.r.t. different values of r. From this figure, we observe that 1) the mean accuracy of JMEA drops significantly when r is greater than 0.001, meaning that a small value of r is better when the number of classes is large; 2) as r increases from 0.0001 to 0.02, the accuracy on CIFAR-8 is quite stable. Overall, selecting r in the range [0.0005, 0.001] gives satisfactory performance.
Parameter r in JMEA. We run JMEA with varying values of r; Fig. 3g plots the classification accuracy w.r.t. different values of r. From this figure, we observe that the accuracy on all datasets is quite stable as r increases from 50 to 250. We recommend selecting r in the range [100, 200].

Ablation Study
Here we present the ablation study for KHDA and JMEA, conducted on the CIFAR-8, CIFAR-59 and Food-101 tasks (3 labeled target samples per class). We report the average accuracy for each dataset.
Ablation Study for KHDA. Table 6 shows comprehensive experiments on the contribution of the individual components of KHDA. We consider the following baseline: 1) w/o D: KHDA trained without the distribution alignment terms.

Guangquan Zhang received the PhD degree in applied mathematics from the Curtin University of Technology, Perth, Australia, in 2001. He is currently a professor and director of the Decision Systems, Australian Artificial Intelligence Institute, and e-Service Intelligent (DeSI) Research Laboratory, Faculty of Engineering and Information Technology, University of Technology Sydney, Australia. His research interests include fuzzy machine learning, fuzzy optimization, and machine learning and data analytics. He has authored four monographs, five textbooks, and 350 papers including 160 refereed international journal papers. He has won seven Australian Research Council (ARC) Discovery Project grants and many other research grants. He was awarded an ARC QEII Fellowship in 2005. He has served as a member of the editorial boards of several international journals, as a guest editor of eight special issues for IEEE Transactions and other international journals, and has co-chaired several international conferences and workshops in the area of fuzzy decision-making and knowledge engineering.