Semi-Supervised Domain Adaptation Using Explicit Class-Wise Matching for Domain-Invariant and Class-Discriminative Feature Learning

Semi-supervised domain adaptation (SSDA) is a promising technique for various applications. It can transfer knowledge learned from a source domain having high-density labeled samples to a target domain having limited labeled samples. Several previous works have attempted to reduce the distribution discrepancy between source domain and target domain by using adversarial-based or entropy-based methods. These works have improved the performance of SSDA. However, there are still lacunae in producing class-wise domain-invariant features, which impair the improvement of the classification accuracy in the target domain. We propose a novel mapping function using explicit class-wise matching that can make a better decision boundary in the embedding space for superior classification accuracy in the target domain. In general, in a target domain with low-density label samples, it is more challenging to create a well-organized distribution for the classification than in a source domain where rich label information is available. In our mapping function, a representative vector of each class in the embedding spaces of the source and target domains is derived and aligned by using class-wise matching. It is observed that the distribution in the embedding space of the source domain can be effectively reproduced in the target domain. Our method achieves outstanding accuracy of classification in the target domain compared with previous works on the Office-31, Office-Home, Visda2017 and DomainNet datasets.


I. INTRODUCTION
Traditional supervised learning approaches are quite effective, but they require sufficient labeled samples to successfully train a model. However, collecting labeled data is often expensive and time-consuming. Domain adaptation has emerged as a machine learning strategy in which the model is built using a large amount of labeled data from a source domain and a small amount of labeled data (or even none) from the target domain. In general, domain adaptation can reduce the labor cost of re-labeling by utilizing the knowledge learned in the primary domain (source domain) and then transferring that experience to the target domain, which shares common features but has a different distribution. The key issue of domain adaptation is how to approximate the joint distribution of the source domain and target domain, i.e., to predict the labels of unlabeled target data with the minimum prediction error. Domain adaptation is widely used in various real-world applications such as image classification [1]-[4], object detection [5]-[8], semantic segmentation [9]-[11], and person re-identification [12]-[14]. Depending on whether the label information of the target domain can be used for training, domain adaptation methods are categorized into two subgroups: unsupervised domain adaptation (UDA) and semi-supervised domain adaptation (SSDA).
The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.
In UDA [15]-[23], the source samples and unlabeled target samples are integrated for training. The knowledge obtained from supervised training on the source domain is transferred to the target domain. However, unlabeled target samples that are less correlated with the source samples are less affected by the supervision in the source domain, which leads to inter-domain discrepancy in UDA. By contrast, SSDA [24]-[28] uses extra information by adding a few labeled target samples and enforcing the corresponding target features to be attracted toward the source feature clusters, which guarantees partial alignment between the two domain distributions.
These methods help to improve network performance in the target domain. However, they still generalize poorly for two main reasons. First, some models, such as S+T [24], are trained in a supervised manner on the labeled source and target samples without using any information from the unlabeled target data, so the information in the target domain is exploited inefficiently. Moreover, the number of labeled target samples is much lower than the number of labeled source samples, which leads to a biased selection problem. In fact, MME [26] shows that the estimated prototypes of the labeled samples are biased toward the source domain when the large majority of labeled source samples is combined with the small minority of labeled target samples. Second, the knowledge transferred from the source domain to the target domain also contains source representations that are only weakly related to the target domain, which can harm the target performance. This phenomenon is known as negative transfer. Some previous works [15]-[17] use adversarial learning inspired by generative adversarial networks (GAN) [40] to mitigate negative transfer. These methods perform well by confusing a domain classifier, which reduces the discrepancy between the source and target domains. However, they ignore class discriminability, which limits the target classification performance. Therefore, in this paper, we propose a novel SSDA method that can reconstruct the well-organized source distribution in the target domain via a proposed mapping function while using limited labeled target data. The concept of the proposed mapping function is shown in Fig. 1. Because the source model is trained on large-scale labeled data, the source distribution is almost perfectly organized, so that the classes can be separated easily.
The distribution of the source domain is then reproduced by minimizing the distance between the class centroids in the target domain and the source samples within the same class. The proposed mapping function allows the target domain to actively select the feasible source features to be reproduced. In addition, dual feature extractors are used to capture the features of the source and target domains separately. This avoids the accumulation of errors in a single network and mitigates the negative effect of noisy labels in the source domain.
Our contributions are summarized as follows:
• First, two feature extractors are trained separately on the source domain and the target domain. Only the source features that are closely related to the target features are transferred to the target domain. This mitigates the negative transfer problem and the accumulated errors caused by biased learning.
• Second, a new mapping function is proposed to reconstruct the well-organized distribution of the source domain on the target domain by using explicit class-wise matching for domain-invariant and class-discriminative feature learning.
• Finally, we conduct extensive experiments on the Office-31, Office-Home, Visda2017 and DomainNet benchmark datasets to demonstrate the superiority of our proposed method.

II. RELATED WORK
In this section, we review the existing methods for UDA and SSDA.

A. UNSUPERVISED DOMAIN ADAPTATION
Domain-adversarial training of neural networks (DANN) [15] is a popular training method in UDA. It maps the data of both domains to a common feature space in an adversarial manner and shares weights between the source and target domains. DANN proposes a gradient reversal layer to reduce the discrepancy between the source and target domains. The adversarial discriminative domain adaptation (ADDA) [16] method uses two convolutional neural networks, one to extract the image features of the source and one for the target. The discrepancy between the source and target representations is minimized by adversarial adaptation. Unsupervised Domain Adaptation with Deep Metric Learning (M-ADDA) [18] is an improved version of ADDA that solves UDA tasks with a metric-learning-based method. In this method, the source model is first trained on the source dataset by using a triplet loss. It then serves as a reference for training the target model on the target samples through adversarial training, aiming to achieve domain invariance. In general, the framework of M-ADDA is similar to that of ADDA except for the triplet loss added to the source training term.

B. SEMI-SUPERVISED DOMAIN ADAPTATION
In SSDA [24]-[28], a few target labels are added, and they work as a bridge that pulls the target distribution toward the source distribution. Semi-supervised domain adaptation via minimax entropy (MME) [26], which uses the minimax entropy technique, is the most popular method. Specifically, each class in the source domain is represented by a prototype. Then, the classifier is trained to produce a domain-invariant prototype for each class by maximizing the entropy of the softmax prediction output of unlabeled samples in the target domain. The feature extractor is updated by minimizing the entropy on unlabeled samples in the target domain to reduce the distance between them and the class prototypes. However, only unlabeled target samples that are closely related to the labeled target samples move to the class prototypes. Other unlabeled target samples stay unaligned, which leads to an intra-domain discrepancy problem in the target domain [27]. APE [27] is one of the earlier methods to analyze the target intra-domain discrepancy issue and attempts to resolve it via three schemes, i.e., attraction, perturbation, and exploration. However, APE cannot remove the bias of the decision boundary, which is dominated by the source domain. Bidirectional Adversarial Training (BiAT) [28] exploits the advantages of adversarial learning to enforce a mutual exchange of knowledge between the source and target domains. In this method, a bidirectional strategy is created using two opposing adversarial learning methods. One uses adaptive adversarial training to transfer knowledge from the source domain to the target domain. The other uses entropy-penalized virtual adversarial training to transfer target knowledge to the source domain. The main concept and the pros and cons of each existing SSDA method are listed in Table 1.

A. PROBLEM FORMULATION
In semi-supervised domain adaptation, we are given labeled data from the source domain and a few labeled samples from the target domain. The set of labeled source samples is denoted as $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, where $x^s$ is the set of source samples, $x_i^s$ is the i-th element in this set, $y^s$ is the label vector, $y_i^s$ is its i-th component, and $n_s$ is the number of source images. The labeled target samples are denoted as $D_t^l = \{(x_i^{t_l}, y_i^{t_l})\}_{i=1}^{n_t^l}$, where $x^{t_l}$ is the group of labeled target samples, $x_i^{t_l}$ is the i-th labeled target sample, $y_i^{t_l}$ is its category label, and $n_t^l$ is the number of labeled target samples. An unlabeled target set is denoted as $D_t^u = \{x_i^{t_u}\}_{i=1}^{n_t^u}$, where $x_i^{t_u}$ is the i-th unlabeled target sample, and $n_t^u$ is the number of unlabeled target samples. The total target dataset is denoted as $D_t = D_t^l \cup D_t^u$. All target images are denoted as $x^t = x^{t_l} \cup x^{t_u}$. Table 2 summarizes the important symbols used to explain the proposed method.
The core idea of our proposed method is to establish the mapping function that can reproduce a well-organized source distribution on the target domain with few labeled target samples by using class-wise matching (shown in Fig. 1) for domain-invariant and class-discriminative feature learning.

B. STEP 1: SUPERVISED TRAINING ON THE SOURCE DOMAIN
In step 1, we train the source feature extractor $E_1$ and classifier $C$, as shown in Fig. 2, by minimizing the standard cross-entropy loss with K classes on the source samples ($x^s$) and their corresponding labels $y^s$ in a supervised manner as follows:

$$L_s = -\mathbb{E}_{(x^s, y^s) \sim D_s} \sum_{k=1}^{K} \mathbb{1}[k = y^s] \, \log \sigma\big(C(E_1(x^s))\big)_k \tag{1}$$

where $\mathbb{1}[\cdot]$ is an indicator function whose value is 1 if its argument is true and 0 otherwise, and $\sigma$ is the softmax function. At the end of this step, the source distribution in the embedding space is well-organized for classification because the training can utilize the rich labeled samples.
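The supervised objective of step 1 can be illustrated with a small numeric sketch. This is a minimal pure-Python illustration of a K-class cross-entropy with an indicator on the true class, not the authors' implementation; the function names `softmax` and `cross_entropy` and the example logits are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean over the batch of -sum_k 1[k == y] * log(softmax(logits)_k)."""
    losses = []
    for logits, y in zip(batch_logits, labels):
        probs = softmax(logits)
        # The indicator keeps only the log-probability of the true class.
        losses.append(-math.log(probs[y]))
    return sum(losses) / len(losses)

# Hypothetical classifier outputs C(E_1(x^s)) for two source samples, K = 3.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
loss = cross_entropy(logits, labels)
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])  # uniform prediction
```

A uniform prediction over K classes gives a loss of log K, which is a convenient sanity check for an untrained classifier.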

1) Extraction of domain-invariant representations:
During step 2, the pre-trained feature extractor $E_1$ is fixed and used to extract the domain representation from the source domain. Similar to DANN [15], the feature extractor $E_2$ captures the target-domain features from the target samples, and its parameters are then updated to minimize the domain discrepancy between the source and target domains by fooling the domain classifier $D$, i.e., by minimizing the domain loss $L_d$ of (2). Overall, (2) aligns only the domain-level features of the source and target domains and does not take the information specific to each class into account. Thus, this process improves the target classification performance only slightly. There is still a shortcoming in producing class-wise domain-invariant features, which prevents the classification accuracy in the target domain from improving. Therefore, we propose a novel mapping function that uses explicit class-wise matching to establish a better decision boundary in the embedding space of the target domain based on the well-organized source distribution, where rich label information is available. In the proposed mapping function, a representative vector of each class in the embedding spaces of the source and target domains is derived and aligned by using the proposed class-wise matching.
2) Class-wise matching: The feature extractor $E_2$ and classifier $C$ are trained in a supervised manner on the few labeled target samples as follows:

$$L_t = -\mathbb{E}_{(x^{t_l}, y^{t_l}) \sim D_t^l} \sum_{k=1}^{K} \mathbb{1}[k = y^{t_l}] \, \log \sigma\big(C(E_2(x^{t_l}))\big)_k \tag{3}$$

Using this, the feature extractor $E_2$ can correctly extract the unique characteristics of the target domain. Then, we compute the centroid $c_k^t$ of the k-th class of the target domain in the embedding space, as indicated in Fig. 1. Each class centroid is the mean of the feature vectors that belong to the same class:

$$c_k^t = \frac{1}{n_t^{l,k}} \sum_{x_i \in D_t^{l,k}} f(x_i) \tag{4}$$

where $D_t^{l,k}$ and $n_t^{l,k}$ denote the set of labeled target images and the number of labeled target samples with class k, respectively, and $f(x_i)$ is the feature vector of $x_i$. The class centroids represent the features of each class in the target domain. For each class centroid in (4), we compute its distance from $x_i^s$ and then convert the cross-domain sample-to-centroid distances over the K classes into probabilities via a softmax function:

$$P_{s \to t}(y = k \mid x_i^s) = \frac{\exp\big(-d(f(x_i^s), c_k^t)\big)}{\sum_{k'=1}^{K} \exp\big(-d(f(x_i^s), c_{k'}^t)\big)} \tag{5}$$

where $d(\cdot, \cdot)$ is the Euclidean distance between source samples and class centroids of the target data, and $P_{s \to t}(y = k \mid x_i^s)$ is the probability that $x_i^s$ belongs to class k in the target domain. This procedure is shown in Fig. 1 and implemented as shown in Fig. 2. The parameters of the target feature extractor $E_2$ are optimized to minimize the distance between the location of each sample in the source domain and its corresponding $c_k^t$ by minimizing the following cross-entropy loss:

$$L_{s \to t} = -\mathbb{E}_{(x_i^s, y_i^s) \sim D_s} \sum_{k=1}^{K} \mathbb{1}[k = y_i^s] \, \log P_{s \to t}(y = k \mid x_i^s) \tag{6}$$

FIGURE 3. Diagram for pseudo labeling and the consistency regularization on the unlabeled target samples [31].
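The class-wise matching procedure described above (per-class centroids of labeled target features, a softmax over source-sample-to-centroid Euclidean distances, and a cross-entropy on the source labels) can be sketched in a few lines. This is an illustrative pure-Python sketch under stated assumptions: the softmax is taken over negative distances so that closer centroids receive higher probability, every class has at least one labeled target sample, and the helper names and toy 2-D features are hypothetical.

```python
import math

def class_centroids(features, labels, num_classes):
    """Mean feature vector per class (assumes every class has >= 1 sample)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(features, labels):
        counts[y] += 1
        for j, v in enumerate(f):
            sums[y][j] += v
    return [[s / counts[k] for s in sums[k]] for k in range(num_classes)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matching_probs(source_feature, centroids):
    """Softmax over negative distances to the target centroids (assumed sign)."""
    neg = [-euclidean(source_feature, c) for c in centroids]
    m = max(neg)
    exps = [math.exp(v - m) for v in neg]
    total = sum(exps)
    return [e / total for e in exps]

def matching_loss(source_features, source_labels, centroids):
    """Cross-entropy between source labels and cross-domain probabilities."""
    total = 0.0
    for f, y in zip(source_features, source_labels):
        total -= math.log(matching_probs(f, centroids)[y])
    return total / len(source_features)

# Toy labeled target features (2-D embedding, K = 2, two shots per class).
target_feats = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
target_labels = [0, 0, 1, 1]
centroids = class_centroids(target_feats, target_labels, num_classes=2)

# One source feature that sits exactly on the class-0 target centroid.
probs = matching_probs([0.0, 1.0], centroids)
loss_correct = matching_loss([[0.0, 1.0]], [0], centroids)
loss_wrong = matching_loss([[0.0, 1.0]], [1], centroids)
```

In this toy case the loss is small when the source sample lies near the target centroid of its own class and large otherwise, which is the pressure that pulls same-class source samples and target centroids together during training.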
ways to exploit the information from the unlabeled target samples that closely correlate with the labeled target sample. They showed their effectiveness by using data augmentation [32] and consistency regularization with pseudo-labeling [31]. Inspired by the current SOTA method [31], as shown in Fig. 3, weak augmentation and strong augmentation are applied to the unlabeled images before feeding them to the feature extractor $E_2$, which was trained in step 2 with limited labeled samples. While weak augmentation is a simple transformation such as flipping and blurring of images, strong augmentation is borrowed from RandAugment [32], which applies random augmentation techniques including rotation, polarization, brightness, and color variations to an input image. The prediction vectors of a weakly augmented image and a strongly augmented image can be defined as follows:

$$p_{weak}(x_i^{t_u}) = P(y \mid x_i^{t_u} + \sigma), \qquad p_{strong}(x_i^{t_u}) = P(y \mid x_i^{t_u} + \delta) \tag{7}$$

where $x + \sigma$ and $x + \delta$ denote the weakly and strongly augmented versions of an image $x$, respectively. The pseudo labels of the unlabeled samples are generated from the predicted probabilities of the weakly augmented image $x + \sigma$. Then, consistency regularization is conducted by minimizing the cross-entropy between the prediction for the strongly augmented image $x + \delta$ and its pseudo label. At this stage, the model is very sensitive to incorrect pseudo labels. Therefore, only predictions $p_{weak}(x_i^{t_u})$ whose maximum exceeds a given threshold ($\max p_{weak}(x_i^{t_u}) > \tau$) are selected in order to filter out incorrect pseudo labels, where $\tau$ is the threshold value (the detailed process to select the optimal $\tau$ is shown in Section IV-D). Then, the regularization cost for an unlabeled target sample with a highly confident pseudo label is computed as follows:

$$L_u(x_i^{t_u}) = \mathbb{1}\big[\max p_{weak}(x_i^{t_u}) > \tau\big] \, H\big(\hat{y}_i^{t_u}, \, p_{strong}(x_i^{t_u})\big) \tag{8}$$

where $\hat{y}_i^{t_u} = \arg\max p_{weak}(x_i^{t_u})$ is the pseudo label, $\mathbb{1}[\cdot]$ is an indicator function, and $H(\cdot, \cdot)$ is the cross-entropy.
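The thresholded pseudo-labeling with consistency regularization described above can be sketched as follows. This is a minimal pure-Python illustration, assuming a hard pseudo label (the arg max of the weak prediction) and averaging over the whole batch, including masked samples; the function name `consistency_loss` and the example probability vectors are hypothetical, and tau = 0.9 mirrors the threshold reported later in the paper.

```python
import math

def consistency_loss(p_weak, p_strong, tau=0.9):
    """Mean over the batch of 1[max p_weak > tau] * H(pseudo_label, p_strong).

    p_weak / p_strong: lists of per-sample class-probability lists predicted
    for the weakly / strongly augmented view of the same unlabeled image.
    """
    total = 0.0
    for pw, ps in zip(p_weak, p_strong):
        confidence = max(pw)
        if confidence > tau:
            # Hard pseudo label taken from the weak view.
            pseudo = pw.index(confidence)
            # Cross-entropy with a one-hot target reduces to -log p_strong[pseudo].
            total += -math.log(ps[pseudo])
    return total / len(p_weak)

# Two hypothetical unlabeled target samples, K = 3.
p_weak = [[0.95, 0.03, 0.02],   # confident: kept, pseudo label 0
          [0.50, 0.30, 0.20]]   # not confident: masked out
p_strong = [[0.70, 0.20, 0.10],
            [0.10, 0.80, 0.10]]
loss = consistency_loss(p_weak, p_strong, tau=0.9)
```

Only the first sample contributes here; the second is ignored because its weak-view confidence is below the threshold, which is how low-confidence pseudo labels are kept out of training.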
[26] and [41] show a way to successfully cluster the features of the unlabeled target data. They minimize the conditional entropy measured using the similarity between the weight vector of the classifier, which represents a certain class, and the unlabeled target features. This is calculated as follows:

$$H = -\mathbb{E}_{x^{t_u} \sim D_t^u} \sum_{k=1}^{K} P(y = k \mid x^{t_u}) \, \log P(y = k \mid x^{t_u}) \tag{9}$$

where $P(y = k \mid x^{t_u})$ represents the probability that $x^{t_u}$ belongs to class k, namely the k-th dimension of the softmax score vector $\sigma(C(E_2(x^{t_u})))$. The classifier is trained to update its weight vectors by maximizing the entropy on the unlabeled target data, while the feature extractor is trained to make the unlabeled target features more similar to the updated weight vectors by minimizing the entropy. Following this, the total cost functions $L_{E_2}$ and $L_C$ used for training the feature extractor $E_2$ and classifier $C$ are computed as in (10), which combines the loss terms described below with the entropy term weighted by $\lambda$. Here, $\lambda$ is a hyper-parameter used to balance the minimax entropy and supervision losses; its value is given in Section IV. The components in (10) are summarized as follows. $L_{E_2}$ and $L_C$ are the costs used to train the feature extractor $E_2$ and classifier $C$, respectively. $L_d$ is the domain loss used to minimize the discrepancy between the source and target domains. $L_t$ is the classification loss on the labeled target samples, which is computed by standard cross-entropy minimization. $L_{s \to t}$, described in (6), is the mapping function loss, which is used to minimize the distance between the source samples and the class centroid of the target domain within the same class. $L_u$ is the consistency regularization loss explained in (8). $H$ is the conditional entropy described in (9).
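The entropy term that the classifier maximizes and the feature extractor minimizes can be computed directly from the softmax outputs. The following is a small pure-Python sketch of the mean prediction entropy over a batch of unlabeled target samples; the function name `mean_entropy` and the example probability vectors are assumptions made for illustration, and the weight 0.1 is the value of λ reported in the implementation details.

```python
import math

def mean_entropy(batch_probs):
    """Mean of -sum_k p_k * log(p_k) over a batch of probability vectors."""
    total = 0.0
    for probs in batch_probs:
        # Terms with p == 0 contribute 0 by the usual convention.
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_probs)

# Hypothetical softmax outputs for unlabeled target samples, K = 4.
uncertain = [[0.25, 0.25, 0.25, 0.25]]   # maximal entropy: log(4)
confident = [[0.97, 0.01, 0.01, 0.01]]   # low entropy

h_uncertain = mean_entropy(uncertain)
h_confident = mean_entropy(confident)

lam = 0.1  # entropy weight, as set in the implementation details
weighted_term = lam * h_uncertain
```

Minimizing this quantity with respect to the feature extractor pushes unlabeled target features toward confident, cluster-like predictions, while maximizing it with respect to the classifier moves the class weight vectors toward the target features.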

E. INFERENCE ON THE TARGET DATASET
In this step, by using the feature extractor $E_2$ and classifier $C$, the class prediction $y_{predict}$ on the target domain is given as:

$$y_{predict} = \arg\max_{k} \, \sigma\big(C(E_2(x^t))\big)_k$$

IV. EXPERIMENTS
In this section, first, benchmark datasets for experiments are described. Then, baseline and implementation details, results, and comparison are provided. Finally, we analyze the effectiveness of the proposed method based on some ablation studies.

FIGURE 5.
Visualization of source and target features with t-SNE [39]. We plot the features of ten classes on the source and target domains for (a) S+T [24], (b) ENT [25], (c) MME [26], and (d) our method on the DomainNet dataset in the Painting to Real scenario. Each class is represented by a different color. The left column illustrates the source distribution. The middle column shows the output features on the target domain. On the right, the features of the source and target domains are overlaid to measure the gap between them and evaluate the efficiency of the adaptation methods. The features of the proposed method are better aligned across the two domains than those of the S+T, ENT, and MME methods.
2) Office-Home [34] is a standard benchmark dataset for domain adaptation containing 15,500 images belonging to 65 categories, forming four domains: Artistic (Art), Clipart (Cl), Product (Pr), and Real-World (Rw), which represent artistic depictions for object images, picture collection of clipart, object images with a clear background, and object images collected with a regular camera, respectively.
3) DomainNet [35] is a benchmark dataset for large-scale domain adaptation, which consists of six domains of 345 categories. For a fair comparison with the previous SSDA methods, we selected Real (R), Clipart (C), Painting (P), and Sketch (S) as the four evaluation domains and performed the following cross-domain evaluations with 126 classes: R→C (adaptation from source Real to target Clipart), R→P, P→C, C→S, S→P, R→S, and P→R. For each set of cross-domain experiments, we evaluated the classification accuracy in the target domain in 1-shot, 3-shot, 5-shot, and 10-shot settings, where 1, 3, 5, and 10 are the numbers of available labeled target samples per class, respectively.
4) Visda2017 [42] consists of 55,388 Real images (R) and 152,397 Synthetic (Syn) images from 12 categories. The Synthetic samples served as the source domain, and the Real samples were used as the target domain. We randomly selected three Real images in each of the 12 categories for the 3-shot setting to conduct SSDA experiments.
All comparison results on the Office-31, Office-Home, and DomainNet datasets were collected from previous works [26], [27], and [28] based on the ResNet-34 backbone. For the Visda2017 dataset, we ran the methods ourselves by using the code released by the authors (footnotes 1 and 2). The domains and classes of the benchmark datasets used in our experiments are listed in Table 3, and example images of the datasets are shown in Fig. 4.
1 https://github.com/VisionLearningGroup/SSDA_MME
2 https://github.com/TKKim93/APE
5) Implementation details: We adopted AlexNet [36] and ResNet-34 [37] as the backbone networks for SSDA. The number of images in each mini-batch was computed as N × (m + k), where N is the number of classes, m is the number of samples in each selected class of the source domain, and k is the number of labeled target samples per class in the target domain. For example, in our experiments, we set N = 10, m = 10, and k = 3 to implement a 3-shot setting. The indexes of the 10 classes were selected randomly, and each class contained ten labeled images from the source domain and three labeled images from the target domain. N and m were kept fixed, and k was adjusted depending on the shot setting. For instance, k was set to 1, 3, 5, or 10 for the 1-shot, 3-shot, 5-shot, and 10-shot settings, respectively. In addition, the k labeled target images in each class were fixed during training. We used the Stochastic Gradient Descent (SGD) optimizer. The learning rate was computed by using the formula $\eta = \eta_0 / (1 + 10p)^{0.75}$, where $\eta_0 = 0.01$ is the initial learning rate and $p \in [0, 1]$ is the training progress. It was adjusted during SGD following the strategy used in [15]. The weight decay was set to 0.0005 and the momentum to 0.9. λ in (10) was set to 0.1. All implementations were done in PyTorch [38] on a GeForce RTX3090 GPU.
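The mini-batch size and the learning-rate schedule stated in the implementation details can be checked with a few lines of arithmetic. The sketch below is a plain-Python illustration, not the authors' training code; the function names `batch_size` and `learning_rate` are hypothetical, while the constants (N = 10, m = 10, k = 3, initial rate 0.01, factor 10, exponent 0.75) come from the text.

```python
def batch_size(num_classes, source_per_class, target_shots):
    """Images per mini-batch: N * (m + k)."""
    return num_classes * (source_per_class + target_shots)

def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta0 / (1 + alpha * p) ** beta for progress p in [0, 1]."""
    return eta0 / (1.0 + alpha * p) ** beta

# 3-shot setting from the text: 10 classes, 10 source and 3 target images each.
images_per_batch = batch_size(10, 10, 3)

# Learning rate at the start, middle, and end of training.
schedule = [learning_rate(p) for p in (0.0, 0.5, 1.0)]
```

With these settings a 3-shot mini-batch holds 130 images, and the learning rate decays monotonically from 0.01 at the start of training to roughly one sixth of that value at the end.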
6) Comparison: We compared our proposed method with seven recent approaches: S+T [24], DANN [15], CDAN [20], ENT [25], MME [26], APE [27], and BiAT [28]. For a fair comparison, DANN and CDAN were modified to train on the labeled source, limited labeled target, and unlabeled target samples.
1) Results on Office-31 and Office-Home: Tables 4 and 5 present the classification accuracy on Office-31 and Office-Home, respectively. The proposed method showed the best performance in all the scenarios. On the Office-31 dataset, considering the results in the 1-shot as well as 3-shot settings, our method also reported outstanding performance when using the AlexNet backbone. On the Office-Home dataset, the average classification accuracy of our method on the target domain in the 3-shot setting was higher than that of MME and APE by 2.7% and 2.3% with the AlexNet backbone and by 2.8% and 1.9% with the ResNet-34 backbone, respectively.
VOLUME 9, 2021
2) Results on DomainNet: Table 6 presents the classification accuracy of the proposed and benchmark methods on the DomainNet dataset for the 1-shot and 3-shot settings. In experiments using ResNet-34 as the backbone network, the mean accuracy of our method in the 1-shot and 3-shot settings was higher than that of S+T by 16.2% and 14.9%, respectively. Compared with APE, our method also obtained notable accuracy improvements in the 1-shot and 3-shot settings. In experiments using the AlexNet backbone, the average classification accuracy of our method on the target domain exceeded that of BiAT by up to 7.2% and 7.6% in the 1-shot and 3-shot settings, respectively. To demonstrate the efficiency of our proposed method for various few-shot cases, we additionally conducted experiments in 5-shot and 10-shot settings with the ResNet-34 backbone. As can be observed in Table 7, compared with APE, which is the SOTA method for SSDA, our method improved performance by 2.5% and 2.3% in the 5-shot and 10-shot cases. With the same settings, the proposed method provided a higher mean accuracy than S+T by up to 12.8% and 11.8%, respectively.
3) Results on Visda2017: We also evaluated the proposed method on the Visda2017 dataset. The detailed comparison of our method with the state-of-the-art SSDA methods is listed in Table 8. The proposed method achieved the best mean accuracy, 86.7%, which is 8.3% higher than that of APE [27]. S+T [24] showed the lowest results among the existing SSDA methods because its model was trained without using the unlabeled target data. In contrast, the other methods tried to exploit the information from the target domain via the unlabeled target data. The classification performance of ENT [25] was lower than that of MME [26] because MME operates with both entropy minimization and entropy maximization terms on the unlabeled target data, while ENT uses only minimum entropy regularization on the unlabeled target data.

C. FEATURE VISUALIZATION
In Fig. 5, we show the features of the source and target domains extracted with the ResNet-34 backbone and visualized with t-SNE [39] on the DomainNet dataset in the P to R scenario. The left-side figures visualize the distribution of the source features, with a different color denoting each class. The middle images show the output features on the target data. The right-side figures overlay the features of the source and target domains to measure the gap between them and evaluate the efficiency of the adaptation methods. There, red represents the features of the source domain and blue indicates the features of the target domain. This figure shows that the features of both domains extracted by our method are better aligned than those of the other benchmark methods. The embedding spaces produced by S+T and ENT are relatively similar because both train their models with the same strategy, namely the cross-entropy loss on the mixed labeled data of source and target. While our method was successful at reproducing a well-organized source distribution on the target domain through the proposed mapping function (MP), MME, which uses a minimax entropy-based approach, provided a worse alignment than the proposed method. Figure 6 (a) shows the well-clustered features of the source distribution extracted by the feature extractor $E_1$ on the source data. Figure 6 (b) illustrates the reproduced version of the source distribution on the target domain when the same data are passed through the feature extractor $E_2$. Overall, these results indicate that the proposed mapping function works effectively.

1) Sensitiveness of threshold value for pseudo labeling:
During the pseudo labeling in step 2, correct pseudo labels for the unlabeled target samples can considerably increase the classification accuracy. Figures 7 (a) and (b) show the results of the adaptation from R to C on the DomainNet dataset under the 3-shot setting, which we use to analyze the sensitivity of the network performance to the threshold τ in (8) for the pseudo labeling. Figure 7 (a) shows how the inference accuracy varies during training for different values of τ. Figure 7 (b) shows the inference accuracy at the final training step, denoted by a blue dashed line in Fig. 7 (a), which is used to find the optimal τ. In Fig. 7 (a), when τ is too small, such as τ = 0.3 ∼ 0.6, the inference accuracy changed negligibly or even decreased, as the model suffered from the negative effect of incorrect pseudo labels. When τ was set to 0.94 or 0.98, the pseudo labels used for training were chosen very strictly, which led the classifier to ignore useful information. As shown in Fig. 7 (b), the final τ value was set to 0.9, indicated by a red dashed line. Furthermore, in Fig. 7 (a), the inference accuracy increased steadily with τ = 0.9, while it started to decrease in the other cases. However, we limited the number of iterations in the ablation studies for a fair comparison with the previous works.
2) Impact of each component on the target learner: In this portion, we analyzed the impact of the components applied in our framework, including the domain adaptation module (DA), the pseudo-labeling component based on (8) (denoted SL or ST below), and the proposed mapping function (MP). The baseline (BL) was built by adding a feature extractor to the MME [26] architecture without the above three components. The results of the various scenarios are displayed in Fig. 8, in which we studied how the target classification accuracy changes, so that the impact of each component in the proposed framework could be evaluated. First, we analyzed the impact of DA on the baseline framework. In this case, the cost function was computed as the sum of (1), (2), (3), and (9). In this figure, the baseline with the domain adaptation module showed a poor classification accuracy of just over 65%, which is lower than the result reported in MME. When we additionally used SL, the loss function consisted of (1), (2), (3), (8), and (9), and the classification accuracy on the target domain improved to over 69%. This is because SL helps the proposed network exploit the useful information in the unlabeled target samples effectively by establishing the relationship between labeled and unlabeled target samples. Only when the MP was applied did the classification performance improve significantly. This is because the proposed MP successfully reproduced in the target domain the well-established source distribution built on the high-density labeled samples, as illustrated in Fig. 5 (d) and Fig. 6 (a). Thus, the inference accuracy on DomainNet when the baseline was combined with (ST+MP, cost function calculated by (1), (3), (8), (6), and (9)) and with (DA+ST+MP, cost function computed by (1), (2), (3), (8), (6), and (9)) reached over 76% and over 77%, respectively, after 20,000 iterations.
In this case, the classification accuracy of the baseline integrated with (DA+ST+MP) was only slightly higher than that of the baseline integrated with (ST+MP). This indicates that the proposed method can achieve high performance even without an adaptation module. The detailed results are provided in Table 9, which shows that the classification accuracy in the target domain increased steadily with the proposed MP.
FIGURE 9. Confusion matrices (a)-(e) of S+T [24], ENT [25], MME [26], APE [27], and our method, respectively.
3) The confusion matrix visualization analysis: Figures 9 (a)-(e) display the confusion matrices of the different SSDA methods. As shown in Fig. 9, the existing SSDA methods suffered seriously from the intra-class variation problem. Specifically, in the ENT [25] and APE [27] methods, the inference accuracy on the Truck class was very low. The models of these methods confused the Truck, Bus, and Car classes because these classes share many similar representations.
In contrast, the inference accuracy on the Truck class was higher for the S+T [24] and MME [26] methods, because S+T was proposed to reduce the intra-class variation problem, while MME utilizes the advantages of the minimax strategy on the entropy of unlabeled target samples. However, their target classification accuracy remained limited, because S+T ignores the unlabeled target information during training and MME suffers from the biased learning problem. The proposed method achieved the highest classification performance, which demonstrates that the proposed mapping function and dual feature extractors work effectively to mitigate the biased learning and the accumulated error of a single network.
The feature visualization in Fig. 5 and the confusion matrix visualization in Fig. 9 demonstrate that the proposed method achieves class feature discriminability on the target domain. By using dual feature extractors, it can mitigate the errors that accumulate in a single network due to biased selection. In addition, the proposed mapping function was successful in reproducing the well-organized source distribution on the target domain. Together, these properties boosted the target classification results.

V. CONCLUSION
In this work, we developed a new structure with dual feature extractors to capture discriminative features on the source and target domains, respectively. Specifically, one feature extractor was trained with high-density labeled samples on the source domain to establish a well-organized distribution. It was then connected, through the class-wise mapping function, to the target feature extractor, which is trained using a few labeled target samples, in order to reconstruct the well-organized source distribution in the embedding space of the target domain. Experimental results on the cross-domain datasets verified that the embedding spaces of the source and target domains generated by our proposed method were better aligned than those of several previous domain adaptation methods. Furthermore, the inference accuracy in the target domain was improved considerably compared with the benchmark methods.