Multi-view Collaborative Learning for Semi-supervised Domain Adaptation

Recently, Semi-supervised Domain Adaptation (SSDA) has become more practical because a small number of labeled target samples can significantly boost the empirical target performance. Several current methods focus on prototype-based alignment to achieve cross-domain invariance, in which the labeled samples from the source and target domains are concatenated to estimate the prototypes. The model is then trained to assign the unlabeled target data to the prototype of the same class. However, such methods fail to exploit the advantage of the few labeled target data because the labeled source data dominate the prototypes during supervision. Moreover, a recent method [1] showed that concatenating source and target samples for training can damage the semantic information of the representations, which degrades the trained model's ability to generate discriminative features. To solve these problems, in this paper, we divide the labeled source and target samples into two subgroups for training: one group contains a large number of labeled source samples, and the other contains a few labeled target samples. We then propose a novel SSDA framework that consists of two models. The model trained on the labeled source samples, which provides an "inter-view" on the unlabeled target data, is called the inter-view model. The model trained on the few labeled target samples, which provides an "intra-view" on the unlabeled target data, is called the intra-view model. Finally, the two models collaborate to fully exploit the information in the unlabeled target data. By utilizing the advantages of multiple views and collaborative training, our proposed method achieves, to the best of our knowledge, state-of-the-art SSDA classification performance in extensive experiments conducted on several visual domain adaptation benchmark datasets.


I. INTRODUCTION
With large-scale labeled data and the growth of computing power, supervised learning methods have shown impressive empirical results in various computer vision applications such as image classification [2]-[4], image semantic segmentation [5]-[7], and object detection [8]-[10]. These methods assume that the training and test sets come from the same distribution; however, in most real-world applications, the training data (a source domain) and test data (a target domain) are related but follow different distributions. Therefore, a model trained on the source domain and applied directly to the target domain generalizes poorly due to domain shift, which degrades the accuracy of the target application (most representatively, image classification) on the target domain. To solve this problem, DANN [11] introduced an adversarial learning strategy that minimizes the difference between the source and target distributions to achieve domain-invariant knowledge across domains. In the domain adaptation (DA) setting, a model is typically trained on plentiful labeled data from the source domain so that it can perform the target task on a target domain that has little-to-no labeled data and a different distribution. Depending on the availability of labeled target samples during training, DA can be categorized as unsupervised domain adaptation (UDA) [11]-[16] or semi-supervised domain adaptation (SSDA) [17]-[25].
VOLUME 4, 2016
FIGURE 1. Inter-view and intra-view models to consider the correlation between labeled source data and labeled target data with unlabeled target data.

UDA [11]-[16] attempts to achieve domain-invariant representations for image classification by minimizing the distribution discrepancy between the source and target domains; however, these works still leave room for improvement because they focus on domain-invariant features without considering class-specific representations. SSDA [17]-[25] for image classification has therefore received significant attention recently; in this setting, the model is trained with large amounts of labeled source data and has access to a few labeled target samples. Similar to MME [17], UODA [20] is an SSDA method built on prototypes estimated from the given labeled source and target samples. The models in these methods are then trained to encourage the unlabeled target features to cluster around the estimated prototypes. However, these methods cannot fully alleviate the domain gap because the source data, far more plentiful than the target data, dominate the estimation of the class prototypes. In addition, [1] argued that, in many real-world applications, samples of different classes are often expressed similarly in the feature space. Therefore, when a model is trained on integrated source and target samples, its discriminative feature representation ability can be reduced because the semantic information of the representations can be damaged. Hence, in this paper, instead of integrating all labeled source and target samples for training, as in MME [17], UODA [20], and STar [24], we divide the labeled samples into two subsets to train two different models. The model trained on the labeled source samples is called the "inter-view model," and the model trained on the labeled target samples is called the "intra-view model." These models provide different views with which to predict the unlabeled target data.
Although the inter-view model is trained on large amounts of labeled data from the source domain, it may not provide satisfactory image classification accuracy on the unlabeled target data due to the domain shift problem. The intra-view model suffers from overfitting on the target classification task because it is trained on only a few labeled target samples and thus cannot generalize to the unlabeled target data. To solve these problems, we propose a novel framework called Multi-view Collaborative Learning (MVCL). As shown in Figure 1, training in our approach progresses in three stages. First, the inter- and intra-view models are trained on the labeled source and labeled target samples, respectively; they then act as an inter-view and an intra-view to extract information from the unlabeled target samples in the second stage. In this stage, the inter- and intra-view models alternate in offering pseudo labels, selected from their highest-confidence predictions, to teach each other. This process, called collaborative learning (Co-learning), allows both models to exchange mutually complementary information so as to make consistent predictions on the unlabeled target data. Finally, in the third stage, we use adversarial learning via the minimax entropy strategy [17] to encourage the unlabeled target features to cluster around the prototypes.
Our main contributions are summarized as follows:
• To solve the biased-learning problem in the supervision phase of previous SSDA works, we divide the labeled samples into two subsets, labeled source samples and labeled target samples, instead of integrating them into one set as in [17], [20], and [24]. We use two models to simultaneously extract the unique features of these two subsets. The first model, trained on the labeled source samples, is called the inter-view model because it provides an inter-view on the unlabeled target data. Similarly, the second model, trained on the labeled target samples, is called the intra-view model because it provides an intra-view on the unlabeled target data.
• We unify collaborative learning and domain adaptation into a single framework for SSDA. Each individual model obtains partial information about the target data; the inter- and intra-view models then exchange their knowledge via collaborative learning to represent the target data comprehensively. Furthermore, MVCL can extract the intrinsic information of each view and adaptively balance the views so that they complement each other and reach consistency. It thereby alleviates the problem of feature degeneration and strengthens the case for using a consensus representation over multiple views.
• We conducted extensive MVCL experiments on several domain adaptation datasets, including Office-31, Office-Home, VisDA2017, and DomainNet, showing that our method achieves SOTA classification performance in SSDA.

II. RELATED WORK
In this section, we review the related works in SSDA and multiple views.

A. SEMI-SUPERVISED DOMAIN ADAPTATION
Recently, the SSDA approach has received much attention [17]-[25]; in SSDA, a few labeled target samples serve as leverage to improve domain adaptation performance for image classification. SSDA via minimax entropy (MME) [17] is the first method to align the representations of the source and target domains using an adversarial learning scheme called minimax entropy. Specifically, this approach trains the framework on concatenated labeled source and target samples to create prototypes represented by the weight vectors of a classifier. In the first adversarial training step, these vectors are re-weighted by maximizing the entropy of the unlabeled target samples to estimate domain-invariant prototypes. In the second step, a minimizing-entropy strategy on the unlabeled target samples updates the feature extractor and encourages the target features to cluster around the prototypes. Classification accuracy in the target domain increases significantly when applying MME. However, recent studies [18] and [19] show that this method still has room for improvement in target-domain classification accuracy. APE [18] argued that only the unlabeled target instances whose features are closely related to those of the labeled target samples move toward the estimated prototypes.
Other unlabeled targets are misaligned, which leads to an intra-domain discrepancy issue in the target domain. APE proposed a method that includes attention, perturbation, and exploration schemes to solve this problem; however, their method does not provide a solution for when the feature representations of the trained model are dominated by a large number of labeled source samples in the supervision process.
Another method [19] significantly improved the classification accuracy in the target domain by developing a novel mapping function (MAP-F) that addresses the biased learning of MME. To alleviate the bias caused by the unbalanced numbers of available source and target samples, MAP-F divides the labeled samples into two subgroups: labeled source and labeled target samples. A model is first trained with the labeled source samples to obtain a well-organized source distribution. A novel mapping function then minimizes the distance between the target class centroids and the source samples within a class, reproducing the well-clustered features of the source domain in the target domain. Although MAP-F showed outstanding classification accuracy in the target domain, it uses two feature extractors simultaneously, which requires substantial training time.

B. MULTI-VIEW DOMAIN ADAPTATION
Multi-view representation learning has emerged as a promising direction in machine learning for classification. The key idea in this approach is to extract multiple elements of knowledge from the input data by taking multiple views and then integrating them to find a good representation of the input data. Thus, a few domain adaptation methods [26]-[29] have successfully improved the learning performance.

TABLE 1. Important symbols: the source/target domain (S/T), the common feature extractor, the domain alignment loss, and the i-th labeled source/target sample and its label.

[26] introduces multiple views into a DA framework. This approach uses two views to exploit the source and target information: one view extracts the labeled source information and is then used to construct another view for the target samples by minimizing a consistency loss between the two views. Multi-view Discriminant Transfer (MDT) [27] is a multi-view-based approach for DA. The main concept of this method is to determine the optimal discriminative weight vectors for each view by maximizing the correlation between the two views while simultaneously minimizing the domain discrepancy loss. The Maximum Classifier Discrepancy (MCD) method [28] uses two classifiers trained on labeled source samples, both of which classify the unlabeled target data. A discrepancy loss maximizes the disagreement between the predictions of these classifiers so as to detect target samples that lie far from the support of the source domain; in turn, the feature extractor learns to extract target representations near the support of the source samples, minimizing the discrepancy. [29] provides a short survey that discusses and analyzes frameworks unifying multi-view learning and domain adaptation.

III. PROPOSED METHOD
In this section, we define the SSDA problem and notation for our method. We then present the training process and the loss functions of the proposed method.

A. DEFINITION OF THE PROBLEM AND NOTATIONS
In the SSDA setting, the labeled source domain (S) is denoted as S = {(x_i^S, y_i^S)}_{i=1}^{N_S}, where N_S is the number of labeled source samples, y_i^S ∈ R^K is the label of sample x_i^S, and K is the number of classes. In addition, the set of labeled target samples (T_l) is denoted as T_l = {(x_i^{T_l}, y_i^{T_l})}_{i=1}^{N_{T_l}}, and the set of unlabeled target samples (T_u) as T_u = {x_i^{T_u}}_{i=1}^{N_{T_u}}, where N_{T_u} is the number of unlabeled target samples and N_{T_u} ≫ N_{T_l}. Table 1 lists all important symbols.
The goal of SSDA is to design a framework that not only reduces the domain shift between the two domains but also achieves class-wise matching by leveraging the few labeled samples in each class of the target domain.
As shown in Figure 2, the inter-view model M_S(E, C_S) is indicated by a dotted red box and the intra-view model M_T(E, C_T) by a dotted blue box. The two models have classifiers C_S and C_T and share the feature extractor E. The training process of the proposed method has three stages: the first uses supervised learning, the second uses collaborative learning (Co-learning) on the unlabeled target domain, and the third uses the minimax entropy strategy, as shown in Figure 1. Each stage is explained in detail in the following subsections.

B. SUPERVISED TRAINING
In the first stage, the two models M_S(E, C_S) and M_T(E, C_T) are trained using the labeled data. Specifically, the labeled source samples are fed into the inter-view model M_S(E, C_S), whose shared CNN-based feature extractor E obtains their corresponding representations. These features are then categorized by the task-specific classifier C_S by minimizing the standard cross-entropy loss on the ground-truth labels:

L_S = -(1/N_S) Σ_{i=1}^{N_S} Σ_{k=1}^{K} 1[y_i^S = k] log p_S(y = k | x_i^S),   (1)

where 1[·] is the indicator function, which takes the value 1 or 0 when its argument is true or false, respectively, and the i-th source image x_i^S has label y_i^S = k ∈ {1, ..., K}. Similarly, the feature extractor E and target classifier C_T of the intra-view model M_T(E, C_T) are trained using the standard cross-entropy loss over the limited labeled target samples:

L_T = -(1/N_{T_l}) Σ_{i=1}^{N_{T_l}} Σ_{k=1}^{K} 1[y_i^{T_l} = k] log p_T(y = k | x_i^{T_l}).   (2)

Consequently, the parameters of the shared feature extractor E are optimized by minimizing

L_sup = L_S + L_T.   (3)

The parameters of the classifiers C_S and C_T are updated using Eqs. (1) and (2), respectively. The model M_S, trained on the labeled source data, provides the inter-view aspect, while the model M_T, trained on a few labeled target samples, works as the intra-view when they are used to extract information from the unlabeled target data.
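As an illustrative sketch (not the authors' implementation; pure Python operating on pre-computed softmax probabilities, with hypothetical helper names), the supervised stage reduces to ordinary cross-entropy terms, one per classifier:

```python
import math

def cross_entropy(probs, label):
    """-log p(y = label | x) for one sample's softmax output."""
    return -math.log(probs[label])

def supervised_loss(batch_probs, labels):
    """Mean cross-entropy over a labeled batch; applied once with the
    source classifier's outputs (Eq. (1)) and once with the target
    classifier's outputs (Eq. (2))."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)
```

The shared feature extractor would then be updated with the sum of the two per-classifier losses, while each classifier sees only its own term.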

C. DOMAIN ALIGNMENT
Due to domain shift [14], the inter-view model trained with (1), which has only source label information, cannot provide satisfactory classification accuracy on the target domain. To address this, we use the maximum mean discrepancy (MMD) [30] to minimize the distance between the source and target distributions. Notably, MMD is a useful metric that compares the source and target data distributions by mapping the data to a high-dimensional embedding in a Reproducing Kernel Hilbert Space (RKHS). Given the source distribution P and target distribution Q, the MMD between P and Q equals the distance between the means of the source and target samples in the RKHS under a mapping function φ. Through the MMD, the inter-view model M_S is trained to minimize the distance between P and Q so that it achieves domain-invariant feature representations. The MMD loss between source and target is estimated as

L_MMD = || (1/N_S) Σ_{i=1}^{N_S} φ(x_i^S) − (1/N_T) Σ_{j=1}^{N_T} φ(x_j^T) ||_H^2,   (4)

where H is the RKHS and φ(·) ∈ H maps X into the RKHS. The MMD loss can be expressed via the kernel trick; we use a Gaussian kernel, which satisfies the conditions of the MMD, to compare samples of the source and target domains instead of an explicit mapping function. The impact of L_MMD on classification performance in the target domain is reported in the ablation study in Section IV.E.
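The Gaussian-kernel form of the MMD estimate can be sketched as follows (an illustrative pure-Python version with hypothetical helper names and a fixed bandwidth `sigma`; the actual implementation would operate on batches of extracted features):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased empirical estimate of MMD^2 between two sample sets:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    n, m = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (n * n)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (m * m)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (n * m)
    return k_xx + k_yy - 2.0 * k_xy
```

Identical source and target samples give an estimate of zero, and the estimate grows as the two feature distributions drift apart.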

D. CONSISTENCY CLASS ALIGNMENT WITH MULTI-VIEW CO-LEARNING
After stage 1, the inter-view model M_S holds rich information from the source domain to transfer to the target domain but lacks information about the target domain itself. The intra-view model M_T generalizes poorly on the target domain because it has been trained on only limited labeled target samples. We therefore address these shortcomings in the second stage using multiple views with co-learning: the multi-view strategy lets us fully exploit the information in the unlabeled target data, and co-learning encourages the two models to exchange knowledge to alleviate each other's weaknesses. Concretely, the inter-view model M_S generates two predictions for two augmented versions of an unlabeled image: a weakly augmented version x_i^{T_u} + σ and a strongly augmented version x_i^{T_u} + δ. The weak augmentation uses simple transformations such as random cropping or flipping. The strong augmentation uses RandAugment [31], which randomly selects from a list of 14 augmentation schemes such as rotations, translations, and color/brightness enhancements. The two predictions of M_S over the weakly and strongly augmented images are denoted p_S^w(x_i^{T_u}) and p_S^str(x_i^{T_u}), respectively. Similarly, M_T provides its two predictions p_T^w(x_i^{T_u}) and p_T^str(x_i^{T_u}) on the same input image.
The co-learning process then enforces consistency regularization by minimizing the cross-entropy between the pseudo label selected from the inter-view prediction, ŷ_i^S = argmax_k p_S^w(y = k | x_i^{T_u}), and the corresponding intra-view prediction p_T^str(x_i^{T_u}) on the strongly augmented image. The entire process proceeds as follows: the inter-view model M_S offers the pseudo label selected from its most confident prediction on the weakly augmented version of an unlabeled target image, x_i^{T_u} + σ. This label is converted into a one-hot encoding and used to compute the cross-entropy with the prediction of the model M_T on the strongly augmented version x_i^{T_u} + δ of the same image. Simultaneously, the model M_T provides pseudo labels generated on the weakly augmented image to match the predictions of the model M_S on the strongly augmented transformation of the same unlabeled target image. Finally, generalization performance on the target domain improves by integrating the complementary information of both views.
Incorrect pseudo labels can negatively impact the models' performance in the target domain; therefore, an unlabeled target sample is only assigned a pseudo label when its predicted probability exceeds a threshold, max p_j^w(x_i^{T_u}) > τ, where τ is the threshold value and j indexes the view. The consistency losses between M_S and M_T are calculated as

L_co^{T→S} = -(1/N_{T_u}) Σ_i 1[max p_T^w(x_i^{T_u}) > τ_intra] log p_S^str(y = ŷ_i^T | x_i^{T_u}),   (6)

L_co^{S→T} = -(1/N_{T_u}) Σ_i 1[max p_S^w(x_i^{T_u}) > τ_inter] log p_T^str(y = ŷ_i^S | x_i^{T_u}),   (7)

where 1[·] is an indicator function and the pseudo labels ŷ_i^T and ŷ_i^S, offered by M_T and M_S, are converted into one-hot encodings for supervised learning. τ_inter and τ_intra are the threshold values used to select the pseudo labels of M_S and M_T; they are studied in the ablation study in Section IV.E. The loss for the multi-view co-learning process, used to update the parameters of the feature extractor E, is

L_co = L_co^{T→S} + L_co^{S→T}.   (8)

The classifiers C_S and C_T are trained using (6) and (7), respectively.
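A minimal sketch of the thresholded pseudo-label selection and the resulting consistency term might look as follows (illustrative pure Python over pre-computed softmax outputs; the function names and batch handling are assumptions, not the paper's code):

```python
import math

def pseudo_label(probs, tau):
    """Return (arg-max class, keep-flag): the sample only contributes a
    pseudo label when its top confidence clears the threshold tau."""
    conf = max(probs)
    return probs.index(conf), conf > tau

def consistency_loss(teacher_weak, student_strong, tau):
    """Cross-entropy between the teacher's thresholded one-hot pseudo
    label (weak view) and the student's prediction on the strong view,
    averaged over the retained samples."""
    total, used = 0.0, 0
    for p_w, p_str in zip(teacher_weak, student_strong):
        label, keep = pseudo_label(p_w, tau)
        if keep:
            total += -math.log(p_str[label])
            used += 1
    return total / used if used else 0.0
```

Calling it once with M_S as teacher and once with M_T as teacher yields the two symmetric co-learning terms; a higher `tau` retains fewer but cleaner pseudo labels.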
In the third stage, the parameters of the feature extractor E and the classifiers C_S and C_T are updated using the minimax strategy following [17]. Deep convolutional neural networks (CNNs), such as Alexnet [32], VGG16 [33], or ResNet-34 [34], are used as the backbone of the feature extractor E. Each classifier consists of two fully connected layers (FCs). The last linear layer is replaced by a K-way linear classifier (K is the number of classes) that exploits a cosine similarity-based classifier architecture; these are therefore also called "cosine classifiers." Each cosine classifier is parameterized by K class-specific weight vectors W = [w_1, w_2, ..., w_K], where each weight vector w_i represents the i-th class prototype. The output probability of each cosine classifier is σ(W^T f / T), where σ is the softmax function, W^T f is the output of the classifier, T is a fixed temperature (0.05), and f is the normalized input feature. The minimax entropy strategy proceeds as follows: in an entropy-maximization step, each cosine classifier is trained so that each w_i becomes similar to the generated target features f, achieving domain-invariant prototypes. Then, in an entropy-minimization step, the feature extractor E is trained to obtain discriminative features on the unlabeled target data by assigning the extracted features of the unlabeled target samples to a certain prototype.
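A cosine classifier of this kind can be sketched as follows (an illustrative pure-Python version; the feature normalization and the per-class prototype dot products follow the description above, while the helper names are assumptions):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cosine_classifier(feature, prototypes, temperature=0.05):
    """Probabilities from a cosine-similarity classifier: L2-normalize
    the feature, score it against each class prototype w_k, and divide
    the similarities by a small temperature before softmax."""
    norm = math.sqrt(sum(v * v for v in feature)) or 1.0
    f = [v / norm for v in feature]
    logits = [sum(a * b for a, b in zip(f, w)) / temperature
              for w in prototypes]
    return softmax(logits)
```

The small temperature (0.05) sharpens the distribution, so a feature close in direction to one prototype receives almost all of the probability mass.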
The minimax strategy is applied to both the inter- and intra-view models. For the inter-view model, the conditional entropy loss of the unlabeled target samples with respect to the classifier C_S is

H_inter = -(1/N_{T_u}) Σ_i Σ_{k=1}^{K} p_inter(y = k | x_i^{T_u}) log p_inter(y = k | x_i^{T_u}),   (9)

where p_inter(y = k | x_i^{T_u}) represents the probability of x_i^{T_u} belonging to class k as predicted by the inter-view model. Similarly, for the intra-view model, the conditional entropy loss of the unlabeled target samples with respect to the classifier C_T is

H_intra = -(1/N_{T_u}) Σ_i Σ_{k=1}^{K} p_intra(y = k | x_i^{T_u}) log p_intra(y = k | x_i^{T_u}),   (10)

where p_intra(y = k | x_i^{T_u}) is the probability that the intra-view model predicts x_i^{T_u} belongs to class k. The total loss for the feature extractor E over the three stages is

L_E = L_sup + L_MMD + L_co + λ(H_inter + H_intra),   (11)

where λ is a balancing parameter set as in [17]. The total losses for the classifiers C_S and C_T are

L_{C_S} = L_S + L_co^{T→S} − λ H_inter,   (12)

L_{C_T} = L_T + L_co^{S→T} − λ H_intra.   (13)
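The entropy quantity that both players of the minimax game act on can be sketched as follows (illustrative pure Python over softmax outputs; the classifier would ascend this value while the feature extractor descends it, each weighted by λ; function names are assumptions):

```python
import math

def prediction_entropy(probs):
    """Shannon entropy H(p) = -sum_k p_k log p_k of one prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(batch_probs):
    """Average entropy over a batch of unlabeled target predictions,
    as used in the conditional entropy losses of the minimax stage."""
    return sum(prediction_entropy(p) for p in batch_probs) / len(batch_probs)
```

A uniform prediction over K classes attains the maximum entropy log K, while a confident one-hot prediction attains zero, which is exactly what the feature extractor's minimization step encourages.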

E. INFERENCE ON UNLABELED TARGET DATA
The prediction on the unlabeled target data is calculated by taking an ensemble of the softmax outputs of the two models:

p(y | x^{T_u}) = (1/2) (p_S(y | x^{T_u}) + p_T(y | x^{T_u})).   (14)
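This ensemble rule can be sketched as follows (illustrative pure Python; it assumes both models' softmax outputs are over the same K classes, and the helper name is hypothetical):

```python
def ensemble_predict(probs_inter, probs_intra):
    """Average the softmax outputs of the inter- and intra-view models
    and return (arg-max class, averaged distribution)."""
    avg = [(a + b) / 2.0 for a, b in zip(probs_inter, probs_intra)]
    return avg.index(max(avg)), avg
```

When the two views disagree, the averaged distribution lets the more confident view dominate the final class decision.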

IV. EXPERIMENT
In this section, we present the experimental details of our investigation into the proposed method's efficiency in the SSDA setting. We implemented the one-/three-shot settings as in [17] on four commonly used domain adaptation benchmark datasets, namely Office-31 [35], Office-Home [36], VisDA2017 [37], and DomainNet [38]. Transfer tasks are denoted as source domain→target domain.

A. DATASETS
• The Office-31 dataset has three domains: Amazon (A), Webcam (W), and DSLR (D), containing approximately 2,800, 800, and 500 images, respectively, across 31 classes. Following [17], we evaluated our method on two tasks, D→A and W→A.
• Office-Home is a visual domain adaptation dataset that contains images from four domains: Real (R), Product (P), Clipart (C), and Art (A), each with 65 categories. We used the 1-shot and 3-shot splits and evaluated the adaptation performance on the target domain for 12 (source→target) pairs, as in [17].
• For the DomainNet dataset, we evaluated seven transfer tasks as in [17]: R→C (Real is selected as the source domain, adapting to the target domain Clipart), R→P, P→C, C→S, S→P, R→S, and P→R. For each task, we evaluated our method in the 1-, 3-, 5-, and 10-shot settings, where one, three, five, and ten labeled samples were randomly selected from the target domain, respectively.
Table 2 summarizes the dataset descriptions.

B. EXPERIMENT SETTINGS
Similar to previous SSDA approaches [17], [18], [24], we used Alexnet, VGG16, and ResNet-34 as backbones for the shared feature extractor; they were pre-trained on the ImageNet dataset [39]. The two classifiers of M_S and M_T have the same architecture as in [17]. We used a Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of η_0 = 0.01, weight decay of 0.0005, and momentum of 0.9. The learning rate η was updated following [11]: η = η_0 / (1 + 10p)^0.75, where p is the training progress changing linearly from 0 to 1. The threshold values for selecting the pseudo labels of the inter- and intra-view models were set to 0.96, as detailed in Section IV.E. The batch size b was set to 128 in the experiments. We conducted all experiments using the widely used PyTorch [40] framework.
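The annealed learning-rate schedule of [11] can be sketched as follows (illustrative pure Python; the parameter names `alpha` and `beta` are assumptions standing in for the constants 10 and 0.75):

```python
def annealed_lr(step, total_steps, eta0=0.01, alpha=10.0, beta=0.75):
    """DANN-style schedule: eta = eta0 / (1 + alpha * p) ** beta,
    where p in [0, 1] is the training progress."""
    p = step / total_steps
    return eta0 / (1.0 + alpha * p) ** beta
```

The schedule starts at η_0 and decays smoothly, reaching roughly a sixth of the initial rate by the end of training.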
All results of the Office-31, Office-Home, and DomainNet datasets for the benchmark methods were collected from previous works [17], [22], [23] based on Alexnet, VGG16, and ResNet34. For the results on the VisDA2017 dataset, we implemented the benchmark methods ourselves using the code released by their authors. For the experiments on the Office-31 and Office-Home datasets, the model was trained for 10,000 training steps to collect the best target validation accuracy. For the experiments on the VisDA2017 and DomainNet datasets, we trained all models for 50,000 training steps to observe the best classification accuracy on the target domain.

C. COMPARISON WITH STATE-OF-THE-ART APPROACHES
We compared the proposed method with previous SOTA SSDA approaches, including minimax entropy (MME) [17], APE [18], MAP-F [19], UODA [20], CDAC [22], and STar [24].
Additionally, we compared against S+T [41], which is trained with the labeled source and target samples without using the unlabeled target samples. DANN [11] and ENT [42] are methods widely used in UDA. DANN is a domain adversarial learning method that employs a domain classifier to match the feature distributions of the source and target domains. ENT is trained on labeled data using the standard cross-entropy loss and on unlabeled data using entropy minimization. We modified these methods to suit the SSDA setting, as in [17].

D. ANALYSIS OF RESULTS
Results on the VisDA2017 dataset: Table 3 compares the results of our method and SOTA SSDA approaches on the VisDA2017 dataset in the 3-shot setting. The proposed method shows remarkable improvement in target-domain classification accuracy on almost all tasks. Our method achieved the best average classification accuracy in the target domain, 88.9%, which is 2.2% better than the current SOTA method MAP-F [19] and represents improvements of 11.8% and 11.1% over MME [17] and APE [18], respectively.
Results on the DomainNet dataset: Our method achieved the best mean classification accuracy on the DomainNet dataset with both Alexnet and ResNet34 as the backbone network. The detailed comparison results are listed in Table 4. Specifically, compared to the most popular SSDA method, MME [17], the average target-domain classification results of our method improved by 9.2% and 8.5% in the 1-shot and 3-shot settings, respectively, when using Alexnet as the backbone. With ResNet34 as the backbone, the proposed method achieved the best target-domain accuracy in all tasks and surpassed the current best results, obtained by CDAC [22], by 2.6% and 2.1% in the 1-shot and 3-shot settings, respectively.
We extended the experiments on the DomainNet dataset to evaluate the proposed method in the 5-shot and 10-shot settings. Compared to the existing SOTA SSDA counterparts, our method showed outstanding results in almost all adaptation scenarios.
Results on the Office-Home and Office-31 datasets: The average classification results of our method were the best for Alexnet, VGG16, and ResNet34 on the target domain under the 3-shot setting for the Office-Home dataset, as presented in Table 6, outperforming the current SOTA method by 0.3%, 1.1%, and 1.4%, respectively. The target-domain classification accuracy of various methods on the Office-31 dataset is listed in Table 7; our approach achieved remarkable classification results in the target domain when using Alexnet and VGG16 as backbone networks in both the 1-shot and 3-shot settings.

E. ABLATION STUDIES
The contribution of each component of the proposed method was evaluated with the baseline (BL), domain alignment (DA), self-learning (SL), and co-learning (CL). We used MME [17] as the baseline for the models M_S and M_T. The baseline results on the target data were computed by taking an ensemble of the softmax outputs of both models. For DA, we added a domain loss to match the feature distributions of the source and target domains extracted by the inter-view model M_S. For the SL process, each model exploited the target information by generating pseudo labels on the unlabeled target samples to train itself. In the CL process, the inter-view and intra-view models exchanged their knowledge by alternately providing pseudo labels, selected from their highest-confidence predictions, to teach each other. Consequently, the shortcomings of each view are alleviated, improving classification performance in the target domain. Table 8 records three adaptation tasks (R→P, P→C, S→P) on DomainNet in the 3-shot setting using ResNet34 as the backbone network. With BL only, the average target-domain classification results of the inter-view and intra-view models differed significantly, with a gap of 8.3%. The average accuracy of the inference results on the target domain improved when we added DA (BL+DA) to reduce the domain discrepancy between the source and target domains in the feature space, as described in Section III.C. Because models M_S and M_T share the feature extractor E, the target-domain classification accuracy of the intra-view model M_T also increased slightly. Compared to the baseline, in the BL+DA+SL scenario, the average target-domain classification results of the inter- and intra-view models improved by 5.8% and 6.8%, respectively, as each was complemented with target information from the unlabeled samples via SL.
However, as shown in Table 8, the target-domain prediction results of the inter-view model under BL, BL+DA, and BL+DA+SL were significantly higher than those of the intra-view model. This is because the inter-view model was trained on a large amount of labeled source samples, while the intra-view model was trained on a small amount of labeled target samples; the intra-view model therefore generalized poorly to the target domain compared to the inter-view model.
In the BL+DA+CL case, the prediction bias of both models was removed by CL, which encourages the inter-view and intra-view models to make similar predictions on an unlabeled target sample by mutually exchanging their knowledge. To demonstrate the efficiency of CL in the SSDA setting, we extended it to all adaptation tasks on the DomainNet dataset; the results are listed in Table 9. Both models provided similar prediction accuracies on all tasks. The average accuracy of the ensemble result in the BL+DA+CL case was nearly 10.0% higher than in the BL+DA+SL case. The intra-view model M_T inherited abundant ground-truth information from the inter-view model M_S via CL; simultaneously, the inter-view model M_S was supplemented with labeled target information during training. The experiments showed that the target-domain classification accuracy could be significantly improved by the proposed method for two reasons. First, the target information was extracted efficiently via multiple views. Second, CL successfully distilled the useful class information of each model and transferred it to the other, thereby maximizing the within-class correlation while minimizing the between-class correlation. We provide empirical evidence from a feature visualization analysis in a later section.
Sensitivity of the proposed method to varying threshold values: In this section, we explain how to select an appropriate threshold value. As mentioned in Section III.D, each model generated pseudo labels by choosing its highest-confidence predictions on the unlabeled target samples, while its weights were updated using the pseudo labels from the other model. The two models may require different threshold values because they were trained on differently labeled datasets; therefore, the quality and quantity of their pseudo labels were quite different. To investigate the sensitivity of the proposed method to the pseudo labels generated by each model, we proceeded as follows: one model provided its pseudo labels with a fixed threshold value while we observed the optimal threshold value of the other model. We conducted the P→R task on DomainNet using ResNet34 as the backbone network under the 3-shot setting. We observed the inference results of the proposed method in all experiments over 30,000 training steps.
A small threshold value yields many pseudo labels; however, these pseudo labels contain noisy labels, which degrades the classification performance on the target data. Conversely, a high threshold value yields high-quality pseudo labels; nevertheless, useful information about the target data may be discarded, which also decreases the accuracy. Therefore, the threshold value must be selected to control the trade-off between the quantity and quality of the pseudo labels. Reference [43] suggested that the quality of pseudo labels should be weighted more heavily than their quantity to obtain better performance. Thus, we fixed τ_inter = 0.92 and varied τ_intra from 0.4 to 1.0 to evaluate the impact of the intra-view threshold value on the classification performance on the target domain. Similarly, we varied τ_inter from 0.4 to 1.0 and fixed τ_intra = 0.92 to investigate the sensitivity of the classification results on the target domain to the inter-view threshold value. Figure 3 shows the classification accuracies on the target domain of the proposed method for each pair (τ_inter, τ_intra). With τ_inter fixed and τ_intra varied, the classification accuracy on the target domain exceeded 83.0% when τ_intra was 0.94, as indicated by the green dashed line. With τ_intra fixed and τ_inter varied, the highest classification accuracy of our method on the unlabeled target data reached nearly 84.0% with τ_inter = 0.98, as indicated by the red dashed line. As shown in this figure, the classification result on the target domain provided by the inter-view model was higher than that of the intra-view model under the same fixed threshold value, 0.92.
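The quantity-quality trade-off above can be made concrete with a small diagnostic. The sketch below is our own NumPy illustration, not part of the paper's pipeline; given predicted class probabilities and ground-truth labels for a validation set, it measures how many pseudo labels survive a given threshold (quantity) and how accurate the survivors are (quality):

```python
import numpy as np

def pseudo_label_stats(probs, true_labels, threshold):
    # probs: (N, C) class probabilities; true_labels: (N,) ground truth.
    conf = probs.max(axis=1)          # confidence of each prediction
    preds = probs.argmax(axis=1)      # predicted class
    mask = conf >= threshold          # which pseudo labels are retained
    coverage = mask.mean()            # quantity: fraction retained
    # Quality: accuracy among the retained pseudo labels.
    accuracy = (preds[mask] == true_labels[mask]).mean() if mask.any() else 0.0
    return coverage, accuracy
```

Sweeping the threshold from 0.4 to 1.0, as done in the experiments, traces this trade-off: coverage falls monotonically while accuracy typically rises until too few labels remain.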
This indicates that the inter-view model, trained on a large amount of labeled source samples, generated pseudo labels more accurately than the intra-view model, trained on a small amount of labeled target samples, as the threshold value varied from 0.4 to 0.9. These results are concordant with the classification accuracies on the target domain shown in Table 8.
As shown in Figure 3, the optimal threshold value lay in the interval from 0.94 to 0.98. To decide the optimal threshold value for the proposed method, we observed the variation of the classification performance on the target domain for three pairs: (τ_inter = 0.94, τ_intra = 0.98), (τ_inter = 0.96, τ_intra = 0.96), and (τ_inter = 0.98, τ_intra = 0.94). Figure 4 displays the test accuracies for these three settings. As shown in this figure, with (τ_inter = 0.96, τ_intra = 0.96), the prediction results on the unlabeled target data continued to increase and reached 84.0% after 30,000 training iterations. In contrast, with (τ_inter = 0.94, τ_intra = 0.98) and (τ_inter = 0.98, τ_intra = 0.94), the classification results on the target domain showed almost no change after 20,000 training iterations. That both models of the proposed framework obtained the best prediction results with the same threshold value is intuitive: with CL, the inter-view and intra-view models performed similarly on the unlabeled target data, which is also demonstrated by the results listed in Table 9, where the predictions of both models were the same for all tasks.

F. FEATURE VISUALIZATION ANALYSIS
Visualization Analysis of Classification Features: Figure 5 shows the t-SNE visualizations [44] of the source and target features produced by the different methods for the P→R task on DomainNet under the 3-shot setting using the ResNet-34 backbone network. The figures in the left and middle columns show representations of the source and target domains, respectively; each color denotes a different class. The figures in the right column present the effect of distribution matching between the source and target domains by visualizing the domain representations in the shared feature space, with red denoting the source domain and blue the target domain. As shown in the right column, the distribution matching of the S+T method was less efficient than that of APE and our method. Moreover, as shown in the middle column, the target features in each class were discriminated more clearly by our method than by APE.
Confusion Matrix Visualization: The confusion matrices of APE, UODA, and our method are shown in Figures 6 (a)-(c), respectively. These experiments were conducted on the VisDa2017 dataset with the ResNet-34 backbone network. As shown in Figure 6, neither APE nor UODA could discriminate well among the Bus, Car, Train, and Truck classes because these classes share common features. In contrast, our approach alleviated this problem, as shown in Figure 6 (c).
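A confusion matrix of this kind can be computed as follows. This is a generic NumPy sketch for illustration (the paper's figures were produced from its own evaluation code); confusions such as Bus/Car/Train/Truck would appear as off-diagonal mass in the corresponding rows:

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, num_classes):
    # cm[i, j] counts samples of true class i predicted as class j;
    # a well-discriminating model concentrates mass on the diagonal.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm
```

Rows are typically normalized to per-class recall before plotting, so each row of the heatmap sums to 1.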
Attention Map Visualization: The Grad-CAM results [45] of APE, UODA, and our method are displayed in Figure 7, which shows the attention maps extracted from the last convolutional layer of ResNet-34 for inputs randomly selected from the Bicycle and Truck classes of the VisDa2017 dataset. For both classes, the model trained with our method focused on the main object, whereas the other models were quite sensitive to the background or noise. As shown in Figure 7, our model performed better than the UODA model on the Bicycle class and was more robust to noise than the APE and UODA models on the Truck class. These results are concordant with the confusion matrix visualization in Figure 6 and the classification accuracies on the target domain in Table 3.

V. CONCLUSION
In this paper, we successfully integrated a multiple-views strategy and collaborative training into a single framework for SSDA. Specifically, the multiple-views strategy is responsible for examining the unlabeled target data from the different views provided by the separately labeled source and target samples. Collaborative learning encourages the different models to exchange their knowledge to alleviate the shortcomings of each model. We conducted extensive experiments on four visual benchmark domain adaptation datasets. The experimental results show that MVCL outperforms other state-of-the-art SSDA approaches. The success of our method indicates the importance of preserving the discriminative information of each class when learning domain-invariant representations for domain adaptation.