Adaptative Balanced Distribution for Domain Adaptation With Strong Alignment

Aligning and balancing the marginal and conditional feature distributions are two critical procedures for unsupervised domain adaptation (UDA). However, existing methods usually consider the former while ignoring the latter. To address the resulting instability and imbalance, we propose the Adaptative Joint Distribution Adaptation Network (AJDAN), which analyzes the multi-modal interactions between the two types of distributions and adds a self-learning network to balance them simultaneously. Furthermore, we give higher weights to samples that are far away from the domain boundary (easy-to-classify samples) using Strong Binary Cross-Entropy (SBCE). This strong alignment strategy is adjustable and allows the network to train easy-to-classify samples better than traditional Binary Cross-Entropy (BCE) in various scenarios. Experiments show that AJDAN with SBCE (AJDAN+S) achieves an average accuracy of 68.3% on the Office-Home dataset and 89.1% on the Office31 dataset, exceeding existing state-of-the-art methods by 2-3 percentage points.


I. INTRODUCTION
In recent years, deep learning models have achieved great success in the field of computer vision, such as object recognition [1]-[4], object detection [5], [6], and object tracking [7], [8], due to abundant training datasets. However, due to the phenomenon known as dataset bias or domain shift [9]-[14], the performance of existing models trained on source datasets tends to degrade heavily on different but related target datasets [15]. Under these circumstances, transfer learning, which aims to alleviate this phenomenon, has become popular in recent years. Transfer learning usually assumes that the source and target data come from similar but different distributions [13]. For instance, the same zebra would have different distributions in a realistic photograph and in an oil painting. Therefore, the critical procedure for transfer learning is to minimize the distribution divergence among different domains [17]-[21].
Domain adaptation approaches can effectively minimize distribution divergence. Existing methods mainly realize domain adaptation in three ways: adapting the marginal distribution alone [22], the conditional distribution alone [23], or both [17], [20]. Experiments have shown that aligning the two distributions together is far more effective than aligning either alone [24], and mainstream solutions have currently adopted the idea of aligning the two distributions together. In this context, aligning the two distributions together while simultaneously maintaining their balance has become more significant.
Existing methods such as Joint Distribution Adaptation (JDA) [17] or Joint Adaptation Networks (JAN) [25] align multiple convolutional network layers simultaneously. However, different distributions are often treated equally in existing methods, which is not consistent with the actual situation. As shown in Fig.1, in real applications, the marginal and conditional distributions contribute together to the overall distribution, and their effects are not always the same. For example, when the similarity between the two domains is fairly low (source → target I in Fig.1), the marginal distribution becomes more critical. When the marginal distributions are close (source → target II in Fig.1), the conditional distribution should be taken more seriously [24]. Unfortunately, in practice, it is usually challenging to determine which distribution is more critical (the unknown Target in Fig.1). To solve this imbalance problem, existing methods such as Balanced Distribution Adaptation (BDA) [26], Dynamic Adversarial Adaptation Network (DAAN) [24], and Manifold Embedded Distribution Alignment (MEDA) [27] give the two distributions different weights by calculating the MMD [28] distance or the Proxy A-distance [9]. However, the two distributions remain independent in these methods, which therefore fail to fully capture their underlying multi-modal interactions. In addition, the parameters in these methods need to be set manually, which causes instability across multiple datasets. Under these circumstances, we combine the two distributions and let the network maintain their balance through self-learning. Compared with existing methods, the main characteristic of this paper is replacing the previous manual parameters with self-learned parameters. To the best of our knowledge, this paper is the first to propose this idea.
In addition to the distribution alignment strategy mentioned above, a suitable loss function can also improve the performance of the final network. Existing methods adopt Cross-Entropy (CE) or Binary Cross-Entropy (BCE) [29] loss functions for the discriminator [30], which leads to the problem of vanishing gradients. As shown in Fig.2, if the input samples are far from the domain boundary (easy-to-classify samples), the training curves of these samples fall to the dotted line position. These easy-to-classify samples produce only small losses because they are already on the correct side of the boundary. However, these samples are still far from the domain boundary, which means the features extracted from them are not transferable. Eventually, the domain classifier network ''ignores'' these samples, and the convolutional network cannot learn the actual transferable features because it does not obtain enough loss gradient. Existing methods such as Least Squares Generative Adversarial Networks (LSGAN) [31] adopt the least squares loss function for the discriminator. However, the loss function of LSGAN is fixed, while the characteristics of datasets vary, so LSGAN fails to generalize to multiple datasets. Therefore, this paper designs a novel loss function, Strong Binary Cross-Entropy (SBCE), which gives higher weights to samples far away from the domain boundary. Furthermore, SBCE can be adjusted flexibly according to different problems. As far as we know, SBCE is the first loss function in the field of transfer learning that aims to tackle easy-to-classify samples and is adjustable across scenarios. This paper proposes two novel methods to tackle the above two issues. For distribution adaptation, we present the Adaptative Joint Domain Adaptation Network (AJDAN). Compared with existing techniques, AJDAN combines the marginal and conditional distributions and balances them through self-learning, which significantly enhances the robustness of the adaptation ability.
For the shortcoming of the traditional loss function, we propose Strong Binary Cross-Entropy (SBCE), which can dynamically increase the weight of these easy-to-classify samples. To sum up, our main contributions are as follows:
• We propose AJDAN, which combines marginal and conditional feature distributions and at the same time maintains their balance through self-learning.
• We design a new loss function called SBCE. By dynamically increasing the weight of samples far away from the domain boundary, SBCE can realize strong alignment between the source and target domains.

II. RELATED WORK

A. UNSUPERVISED DOMAIN ADAPTATION (UDA)
Unsupervised domain adaptation (UDA) is a sub-direction of transfer learning and remains an open theoretical and practical problem. Defining datasets with different distributions as different domains, the purpose of UDA is to train a network on the labeled source domain so that it achieves good performance on the unlabeled target domain. The key to UDA is to project features from different domains into the same subspace, such as a Reproducing Kernel Hilbert Space (RKHS), and reduce the distribution discrepancy there.
Compared to traditional methods, deep networks can learn more transferable features for domain adaptation [32], [33]. Therefore, domain adaptation methods based on deep learning have become popular in recent years. For instance, Maximum Mean Discrepancy (MMD) [25], [34], Correlation Alignment (CORAL) [35], and Central Moment Discrepancy (CMD) [36] aim to reduce the distribution divergence among domains. Such divergence measures were previously limited to a single layer and gradually expanded to multiple layers [25]. For example, Pan et al. proposed Transfer Component Analysis (TCA) [22] to align the marginal distributions between domains. Based on TCA, Joint Distribution Adaptation (JDA) [17] proposes matching both marginal and conditional distributions. Subsequent researchers [20], [37] extended JDA by adding structural consistency and domain-invariant clustering. More layers mean more structural information, and how to manage multiple sources of information became a new problem. Existing methods such as Balanced Distribution Adaptation (BDA) [26], Dynamic Adversarial Adaptation Network (DAAN) [24], and Manifold Embedded Distribution Alignment (MEDA) [27] all give their solutions. They assign different distributions different weights by calculating the MMD [28] distance or the Proxy A-distance [9]. However, these methods do not perform consistently on multiple datasets because of artificial parameters. Therefore, reducing artificial parameters so that the network realizes self-learning is a promising direction for improving network performance.
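For instance, the MMD distance used by these methods can be estimated from finite samples with a Gaussian kernel; below is a minimal numpy sketch (the fixed bandwidth is an illustrative assumption, whereas methods like JAN set it from median pairwise distances):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    # Pairwise Gaussian kernel values between rows of a and rows of b
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(xs, xt, bandwidth=1.0):
    # Biased sample estimate of the squared MMD between source samples xs
    # and target samples xt: E[k(s,s)] + E[k(t,t)] - 2 E[k(s,t)]
    k_ss = gaussian_kernel(xs, xs, bandwidth).mean()
    k_tt = gaussian_kernel(xt, xt, bandwidth).mean()
    k_st = gaussian_kernel(xs, xt, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st
```

Identical sample sets give an MMD of zero, and the estimate grows as the two domains drift apart, which is exactly what the divergence-based methods above minimize.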

B. DOMAIN-ADVERSARIAL LEARNING
As a particular case of deep domain adaptation, domain-adversarial learning refers to methods that learn transferable features and minimize distribution discrepancy using adversarial learning [38]. They integrate adversarial learning and domain adaptation into a two-player game, as in Generative Adversarial Networks (GANs) [30]. Specifically, they exhibit an architecture whose feature extraction layers are shared by a task classifier and a domain classifier. The first classifier aims to correctly predict task-specific class labels on the source data, while the second aims to predict the domain labels of input samples. The feature extraction layers learn domain-invariant features by deceiving the domain classifier and eventually achieve the goal of domain adaptation.
Recently, we have witnessed a considerable amount of research [39]-[41] on domain-adversarial learning. For example, DANN [42] proposed a domain-adversarial training method that generates domain-invariant features by deceiving a domain discriminator. As a pioneering work, its domain-adversarial learning idea was adopted by many subsequent researchers [43], [44].
However, when data distributions embody complex multi-modal structures, existing methods based on a single domain classifier may fail to capture those structures. Therefore, they cannot realize discriminative distribution alignment without mode mismatch. Some researchers have offered solutions. JDA [17] attempted to tackle this by aligning distributions via separate domain discriminators. MADA [40] enabled fine-grained alignment of different data distributions based on multiple domain discriminators. Co-DA [41] constructed multiple diverse feature spaces and aligned distributions individually. Unfortunately, these methods do not work very well, since separating label predictions and feature distributions makes it hard to capture the multi-modal information between them. In CDAN [45], the authors attempted to capture the multi-modal structures underlying the data with a multilinear conditioning scheme. However, experiments show that the multilinear conditioning scheme cannot achieve complete joint feature alignment because it ignores the marginal distribution information. We try to solve these problems by creating a new joint distribution feature and thereby improve the accuracy by 2-3%.

III. METHOD
This section elaborates on our proposed algorithm. First, we introduce the problem definition. Then, we cover the basics and present the Adaptative Joint Domain Adaptation Network (AJDAN). Finally, we present the Strong AJDAN (AJDAN+S) method.

A. PROBLEM DEFINITION
Transfer learning applies to many scenarios, such as object recognition, object detection, and object tracking. This paper mainly studies object recognition problems. Assume a labeled source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and an unlabeled target domain $D_t = \{x_i^t\}_{i=1}^{n_t}$, where $n_s$ is the number of labeled samples, $n_t$ is the number of unlabeled samples, $x_i^s$ and $x_i^t$ are the source and target input images respectively, and $y_i^s$ is the corresponding class label. $D_s$ and $D_t$ are sampled from the joint distributions $P(x^s, y^s)$ and $Q(x^t, y^t)$ respectively, which are similar but not identical ($P \neq Q$).
Suppose an image classification network is composed of a feature extractor $G$ and a task label classifier $C$. The objective function of the network can be summarized as:

$$\min_{\theta_G, \theta_C} \frac{1}{n_s} \sum_{i=1}^{n_s} L\big(C(G(x_i^s)), y_i^s\big) \tag{1}$$

where $\theta_C$ and $\theta_G$ are the parameters of $C$ and $G$, and $L$ is the loss function. To simplify the following analysis, we denote $G(x)$ by $f_m$ and $C(G(x))$ by $\hat{y}$. The goal of transfer learning is to use the feature extractor $G$ and task classifier $C$ trained on $D_s$ to obtain the labels $y^t$ of $D_t$.

B. ADVERSARIAL LEARNING FOR DOMAIN ADAPTATION
Inspired by GAN [30], adversarial learning has been effectively used to minimize the discrepancy between domains. It uses a domain classifier $D$ to distinguish the domain label of the input sample, whose objective function is:

$$L(\theta_D) = -\frac{1}{n_s + n_t} \sum_{i=1}^{n_s + n_t} \big[z_i \log D(G(x_i)) + (1 - z_i) \log(1 - D(G(x_i)))\big] \tag{2}$$

where $z_i$ is the domain label of the input sample. After training converges, $D$ fails to distinguish input samples from the two different domains, meaning the discrepancy between the two domains is small and they are fully fused. This training process can be formalized as (3).
$$\hat{\theta}_D = \arg\min_{\theta_D} L(\theta_D), \qquad \hat{\theta}_G = \arg\max_{\theta_G} L(\theta_D) \tag{3}$$

where $\hat{\theta}_D$ and $\hat{\theta}_G$ are the parameters of networks $D$ and $G$ when training fully converges. Here, $\theta_D$ aims to reduce $L(\theta_D)$, while $\theta_G$ aims to increase it. After convergence, the two distributions are completely fused and the distance between them approaches zero.
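As a minimal sketch of the domain-classifier objective above (assuming the predictions $D(G(x))$ are already computed probabilities, and the convention $z = 1$ for source samples and $z = 0$ for target samples):

```python
import math

def domain_loss(preds, labels):
    # Average binary cross-entropy of the domain classifier:
    # preds are D(G(x)) in (0, 1); labels z_i are 1 (source) or 0 (target).
    # The discriminator minimizes this; the feature extractor maximizes it.
    eps = 1e-12  # guard against log(0)
    return -sum(z * math.log(p + eps) + (1 - z) * math.log(1 - p + eps)
                for p, z in zip(preds, labels)) / len(preds)
```

When the discriminator cannot tell the domains apart (all predictions near 0.5), the loss sits at log 2, which is exactly the converged state described above.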

C. ADAPTATIVE JOINT DOMAIN ADAPTATION NETWORK (AJDAN)
The practical procedure of transfer learning is to minimize both the marginal and conditional feature distribution divergence [17], [20] between domains. Specifically, this refers to minimizing the distance shown in (4).

$$\text{Distance}(D_s, D_t) \approx (1 - \mu)\,\text{Distance}\big(P(x^s), P(x^t)\big) + \mu\,\text{Distance}\big(P(y^s|x^s), P(y^t|x^t)\big) \tag{4}$$
Based on the theories in DANN [42], $f_m$ contains the global feature information, and thus we use $f_m$ to represent $P(x)$. Because the target domain $D_t$ has no labels, it is not feasible to calculate the conditional feature distribution $P(y^t|x^t)$ directly. According to quantitative studies [33], [45], the output of the task classifier $\hat{y}$ potentially reveals the multi-modal structure. Therefore, many existing methods combine $\hat{y}$ and $f_m$ to simulate a conditional feature representation. Research [46] has shown that a multilinear map of infinite-dimensional nonlinear feature maps can successfully embed a conditional distribution and model multiplicative interactions between different variables. Inspired by this, and different from most existing methods that concatenate $\hat{y}$ and $f_m$ roughly, we adopt the tensor product between $\hat{y}$ and $f_m$ to perform lossless multilinear fusion. This operation can be represented by the formula:

$$f_m \otimes \hat{y} \tag{5}$$

To facilitate the following analysis, we use $f_c$ to denote this simulation of $P(y^t|x^t)$. Equation (5) can be simplified as (6).

$$f_c = f_m \otimes \hat{y} \tag{6}$$
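In code, this tensor-product fusion is simply an outer product of the feature vector and the classifier output, flattened into a single conditional feature vector; the dimensions below are illustrative, not the network's actual sizes:

```python
import numpy as np

def multilinear_fusion(f_m, y_hat):
    # f_c = f_m (x) y_hat: outer product of the marginal feature vector
    # f_m and the classifier output y_hat, flattened to one vector.
    # Every (feature, class-probability) pair gets its own entry,
    # so the multiplicative interactions are kept without loss.
    return np.outer(f_m, y_hat).ravel()
```

For a feature vector of length $d$ and $K$ classes, the fused vector has length $d \times K$, which is why CDAN-style methods sometimes fall back to random projections for large $d \times K$.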
The equilibrium parameter µ in (4) controls the weights of the marginal and conditional distributions. When µ → 0, the datasets are dissimilar, so the marginal distribution P(x) is more dominant. When µ → 1, the datasets are similar, so the conditional distribution P(y|x) is more important. Existing methods such as MEDA [27] evaluate the weight using the Proxy A-distance by learning a domain-invariant classifier, and DAAN [24] obtains µ from the loss values of different domain classifiers. However, MEDA requires 1 + C binary classifiers to calculate its adaptative factor, and DAAN [24] contains too many manual parameters and lacks stability over multiple datasets. Apart from reducing the number of manual parameters, experiments have also shown that it is necessary to analyze both the status of the feature distributions and their relationship to enhance performance. Existing methods usually consider the former and ignore the latter, which leads to a decline in their performance. Therefore, this paper tries to reduce the number of artificial parameters and capture the relationship between the distributions.
To avoid using the artificial parameter µ, we form the marginal feature representation $f_m$ and the conditional feature representation $f_c$ into a new joint variable $f_j$, and then equation (4) can be transformed into (7).
$$\text{Distance}(D_s, D_t) \approx \text{Distance}\big(f_j(P(x^s), P(y^s|x^s)),\ f_j(P(x^t), P(y^t|x^t))\big) \tag{7}$$

Determining the way to obtain $f_j$ requires a specific analysis of the status of the two distributions. However, the marginal and conditional features are distributed in an RKHS, which makes it arduous to analyze them manually. A recent advance named Conditional Generative Adversarial Networks (CGANs) [47] discovered that different distributions can be matched better by conditioning the generator and discriminator on relevant information, such as associated labels or affiliated modalities. This discovery shows that a GAN can be extended to a conditional model using specific auxiliary information, such as class labels or data from other modalities.
Motivated by conditional GANs, we observe that the two distributions can serve as auxiliary information for each other. As shown in (8), domain variances in both feature representations $f_m$ and $f_c$ can then be modeled simultaneously.
$$f_j = f_m \oplus f_c \tag{8}$$

Here $\oplus$ means concatenating the marginal feature representation $f_m$ and the conditional feature representation $f_c$ into a new joint variable $f_j$. We name this process joint combination.
The new variable $f_j$ contains the joint information of the two original feature distributions. This is a lossless combination because the original feature information is unchanged, so no information is lost during the transformation. Furthermore, the two kinds of feature information serve as mutual auxiliary information, which enables the network to extract deeper relationships between them [47]. We set a joint domain discriminator $D_j$ to analyze the joint information and predict the domain labels of input samples.
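Since the joint combination is plain concatenation, it can be sketched in one line; the helper name and dimensions are ours, for illustration only:

```python
import numpy as np

def joint_combination(f_m, f_c):
    # f_j = f_m (+) f_c: concatenate the marginal and conditional feature
    # representations; both inputs survive unchanged inside f_j,
    # which is why the combination is lossless.
    return np.concatenate([f_m, f_c])
```

The joint discriminator $D_j$ then sees one vector of length $|f_m| + |f_c|$ and can weight the two halves itself during training.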
The way to align the joint variable $f_j$ is then the same as that for aligning a single distribution. The entire network architecture is shown in Fig.3. We name this method Adaptative Joint Domain Adaptation Network (AJDAN). Experiments show that AJDAN achieves good results with a simplified network. The objective function of AJDAN can be written as:

$$L(\theta_G, \theta_C, \theta_J) = \frac{1}{n_s}\sum_{i=1}^{n_s} L\big(C(G(x_i^s)), y_i^s\big) - \frac{\lambda}{n_s + n_t}\sum_{i=1}^{n_s + n_t} L_d\big(D_j(f_j^i), z_i\big) \tag{9}$$

where $\theta_J$ denotes the parameters of $D_j$, $L_d$ is the domain classification loss, and λ balances the classification and domain alignment tasks.
After training converges, the parameters $\theta_G$, $\theta_C$, $\theta_J$ deliver a saddle point:

$$(\hat{\theta}_G, \hat{\theta}_C) = \arg\min_{\theta_G, \theta_C} L(\theta_G, \theta_C, \hat{\theta}_J), \qquad \hat{\theta}_J = \arg\max_{\theta_J} L(\hat{\theta}_G, \hat{\theta}_C, \theta_J) \tag{10}$$
AJDAN can also be viewed as an updated version of existing methods. The difference between them lies in AJDAN replacing manually calculated weight parameters with network self-training weight parameters. Experimental results show that adaptive learning makes it possible to achieve better results over multiple datasets without adding another large network.
We can theoretically demonstrate that AJDAN is an upgraded version of existing methods. Take DANN [42] as an example, which measures the distance between feature distributions through the loss of its domain discriminator $L(\theta_D)$. Set two domain discriminators $D_m$ and $D_c$, used for the marginal and conditional features respectively. Then equation (4) can be converted to (13).

$$\min \text{Distance}(D_s, D_t) \approx \min\big[(1-\mu)\,\text{Distance}(P(x^s), P(x^t)) + \mu\,\text{Distance}(P(y^s|x^s), P(y^t|x^t))\big] = \min\big[(1-\mu)L(\theta_{D_m}) + \mu L(\theta_{D_c})\big] \tag{13}$$

where $L(\theta_{D_m})$ and $L(\theta_{D_c})$ are the loss values of $D_m$ and $D_c$ respectively. To keep the network from inclining too much toward the marginal or conditional feature distribution alone, the two terms need to satisfy (14).

$$(1-\mu)L(\theta_{D_m}) \approx \mu L(\theta_{D_c}) \tag{14}$$

Existing methods usually adjust the calculation formula of µ manually to achieve (14), whereas AJDAN hides the balancing process in the neural network training. We can transform both formulations into the same matrix form to prove the point. For existing methods, equation (13) can be expressed in matrix form as (16).

$$\min\big[(1-\mu)L(M_{D_m}F_M, Z) + \mu L(M_{D_c}F_C, Z)\big] \tag{16}$$

For AJDAN, the objective function can be formalized as (17).

$$\min L(M_{D_j}F_J, Z) \tag{17}$$

where $F_M$, $F_C$, and $F_J$ are the matrix forms of $f_m$, $f_c$, and $f_j$, and $M_{D_m}$, $M_{D_c}$, and $M_{D_j}$ are the matrix forms of $\theta_{D_m}$, $\theta_{D_c}$, and $\theta_{D_j}$ respectively. $Z$ is the one-hot encoding of the domain label $z$, which is $[0, 1]^T$ or $[1, 0]^T$. Since $F_J$ is the concatenation of $F_M$ and $F_C$, $M_{D_j}$ can be split column-wise into blocks acting on each part, so both equation (16) and equation (17) can be converted into (18).

$$\min L\big(M_1 F_M + M_2 F_C, Z\big) \tag{18}$$

As shown in (18), equations (16) and (17) turn out to be equivalent in form. The difference is that $D_j$ in AJDAN replaces the functions of $D_m$, $D_c$, and µ. Therefore, compared with existing methods, AJDAN reduces the number of parameters that need to be trained and the number of artificial variables.
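The column-splitting step behind this equivalence can be checked numerically: a single linear discriminator acting on the concatenated feature gives exactly the sum of its two blocks acting on the parts (the dimensions below are illustrative, not the network's actual sizes):

```python
import numpy as np

def joint_equals_split(seed=0):
    # A joint discriminator matrix M_Dj acting on f_j = [f_m; f_c]
    # decomposes into a marginal block and a conditional block,
    # so D_j subsumes D_m, D_c, and the balance weight mu.
    rng = np.random.default_rng(seed)
    f_m = rng.normal(size=3)                 # marginal feature (toy size)
    f_c = rng.normal(size=4)                 # conditional feature (toy size)
    m_dj = rng.normal(size=(2, 3 + 4))       # joint discriminator weights
    f_j = np.concatenate([f_m, f_c])
    joint = m_dj @ f_j                       # response of D_j on f_j
    split = m_dj[:, :3] @ f_m + m_dj[:, 3:] @ f_c
    return np.allclose(joint, split)
```

The relative scale of the two weight blocks is learned by backpropagation, which is how the balancing role of µ ends up inside the network.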
To make the above demonstration more intuitive, as shown in Fig.4, the green lines are parameters related to the marginal feature distribution, and the red lines are parameters related to the conditional feature distribution. These neuron nodes adjust the weights of the two distributions, which is the same role µ plays. Because these parameters are self-learned by the network, they are more effective and robust over multiple datasets.

D. STRONG ALIGNMENT FOR EASY-TO-CLASSIFY SAMPLES
To achieve adversarial learning, existing methods [42], [45] utilize one or more domain classifiers to distinguish the target features from the source features. The loss function affects the performance of the domain classifiers. Existing methods have widely used traditional Cross-Entropy (CE) in various deep learning tasks due to its outstanding performance. The formula of Binary Cross-Entropy (BCE), a special case of CE for binary problems, is shown in (19):

$$l(p, z) = -\big[z \log(p) + (1 - z)\log(1 - p)\big] \tag{19}$$

where $p \in [0, 1]$ is the output of the domain classifier, $z$ is the domain label of the input sample, and $l(p, z)$ is the value of the BCE loss function. Fig.2(b) shows the function curve of BCE when $z$ equals 0 or 1. As shown in Fig.5, easy-to-classify samples are those far from the domain boundary, while hard-to-classify samples are those near the domain boundary. In other words, easy-to-classify samples are those whose predicted domain labels are close to their real domain labels. However, the gradient of the BCE function for easy-to-classify samples (the dotted line position in Fig.2(b)) is gentle, and therefore the training weight is inadequate. This makes it challenging for the network to train easy-to-classify samples effectively. In practice, however, easy-to-classify samples occupy a considerable proportion of the data and need to be emphasized. Existing methods that adopt traditional BCE ignore these samples, and thus their performance is reduced. Focusing on easy-to-classify samples can therefore achieve strong alignment and obtain better results on multiple datasets. Some researchers have used the least squares loss function to relieve this problem [31], but the least squares loss is fixed and cannot be adjusted to different datasets. To tackle this, we propose a new loss function, Strong Binary Cross-Entropy (SBCE), by adding a modulating factor $f(p_t)$ to the traditional BCE loss function.
We can select the most suitable SBCE for the current dataset by adjusting $f(p_t)$, so SBCE is more practical than the least squares loss function. The general formula of SBCE is shown in (20):

$$\text{SBCE}: \quad l(p_t, z) = -f(p_t)\log(p_t) \tag{20}$$

where $p_t = p$ if $z = 1$ and $p_t = 1 - p$ if $z = 0$, so that $p_t$ measures how close the prediction is to the true domain label. Here, we choose $f(p_t)$ to be a monotonically increasing function:

$$f(p_t) = 1 + \gamma p_t, \quad 0 < \gamma < e^2 \tag{22}$$

$$\text{SBCE}: \quad l(p_t, z) = -(1 + \gamma p_t)\log(p_t), \quad 0 < \gamma < e^2 \tag{23}$$

where γ is an adjustable variable controlling the size of the weight on easy-to-classify samples. Fig.6 shows the loss function curve when the domain label $z = 1$. As the black and red lines in Fig.6 show, when γ is small, the loss function using the strong alignment strategy differs little from traditional BCE. Increasing the value of γ augments the effect of the strong alignment strategy, but there are limits to this increase. For example, when γ is above $e^2$, as for the purple and yellow lines in Fig.6, the loss function can no longer meet the basic classification requirements. Thus, we limit the value of γ between 0 and $e^2$.
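The effect of the modulating factor can be illustrated in a few lines ($\gamma = 1$ as in our experiments; $p_t$ denotes the predicted probability of the true domain label). SBCE yields both a larger loss and a steeper gradient than BCE on easy samples ($p_t$ close to 1):

```python
import math

def bce(p_t):
    # Traditional BCE written on p_t, the predicted probability
    # of the true domain label
    return -math.log(p_t)

def sbce(p_t, gamma=1.0):
    # Strong BCE: the modulating factor (1 + gamma * p_t) up-weights
    # easy-to-classify samples (p_t close to 1); requires 0 < gamma < e**2
    return -(1.0 + gamma * p_t) * math.log(p_t)
```

With $\gamma = 0$ the factor collapses to 1 and SBCE reduces to BCE, which is why small γ values behave almost identically to the traditional loss.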

IV. EXPERIMENTS
In this section, we thoroughly evaluate the performance of our algorithm on several benchmark datasets. Specifically, we first introduce the benchmark datasets, then describe the experimental setup, report the statistical test of the results, and finally make an empirical analysis of the proposed method.

A. DATASET SETUP
The following is an introduction to the datasets used in this paper.
Office31 [48] is a standard benchmark for UDA. It contains 4,652 images of 31 object categories, all office data collected from three domains: A (Amazon), W (Webcam), and D (DSLR).
Digit contains two digit datasets: MNIST and Street View House Numbers (SVHN). MNIST includes a total of 70,000 images in 10 categories, each of size 28 × 28. SVHN contains 630,420 images of 10 digit classes, each of size 32 × 32.
Office-Home [49] is developed mainly for the study of domain adaptation. It includes four different domains: A (Artistic), C (Clipart), P (Product), and R (Real-World), and each domain contains 65 object classes.
VisDA2017 [50] is a challenging testbed for UDA with a domain shift from synthetic data to real imagery. In total there are ∼280k images from 12 categories.
ImageCLEF-DA [51] is a dataset organized by selecting 12 common classes shared by three public datasets (domains): Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). The organizers selected 50 images per class, 600 images in total for each domain.

B. TECHNICAL DETAILS
The methods compared in this paper are Domain Adversarial Neural Network (DANN) [42], Conditional Domain Adversarial Network (CDAN) [45], Joint Adaptation Network (JAN) [25], Deep Adaptation Network (DAN) [34], and Dynamic Adversarial Adaptation Network (DAAN) [24]. Among them, DANN targets the marginal feature distribution and CDAN targets the conditional feature distribution. JAN, DAN, and DAAN consider both, but they combine the two features crudely, without network self-learning. The AJDAN proposed in this paper not only combines the two but also uses self-learned weights. Table 4 shows the differences among these methods.
We follow the standard protocols for unsupervised domain adaptation [42], [25]. We use all labeled source examples and all unlabeled target examples and compare the average classification accuracy based on three random experiments. We conduct importance-weighted cross-validation (IWCV) [54] to select hyper-parameters for all methods. For some key parameters in existing methods, we set λ = 1 for all experiments in CDAN. For MMD-based methods such as JAN, we use Gaussian kernel with the bandwidth set to median pairwise distances on training data [34].
We use ResNet50 [52] pre-trained on the ImageNet [53] dataset as our backbone network and replace its last layer for the current tasks. The labeled source domain data and unlabeled target domain data are used to fine-tune the model, and they share all network parameters. Because the value of λ affects the result on the target domain, we analyze its specific influence in a later section. Here, we use mini-batch stochastic gradient descent (SGD) with momentum 0.9 and set λ and γ to 1. The learning rate was set as $\eta = \eta_0 / (1 + a p)^b$, where $p$ goes from zero to one linearly with the number of iterations. To ensure reliable results, we take the average of three repeated experiments as the final result.
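The annealing schedule above can be sketched as follows; the constants $a = 10$ and $b = 0.75$ are assumptions borrowed from common DANN-style training, and the base rate $\eta_0$ here is illustrative:

```python
def lr_schedule(iteration, max_iter, eta0=0.01, a=10.0, b=0.75):
    # eta = eta0 / (1 + a * p) ** b, where p is the training progress
    # in [0, 1]; the rate starts at eta0 and decays smoothly.
    p = iteration / max_iter
    return eta0 / (1.0 + a * p) ** b
```

The schedule decays monotonically, so early iterations take large steps while late iterations fine-tune with a rate several times smaller.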

C. EXPERIMENTAL RESULT
In this section, we compare the results of various methods on the standard datasets; the full results are shown in Table 1, Table 2, and Table 3. Table 1 shows the classification accuracy of 12 tasks on the Office-Home dataset. Compared with DANN, which aligns the marginal distribution alone, and CDAN+E, which aligns the conditional distribution alone, the accuracy of the newly proposed AJDAN+S is increased by 3-5 percentage points. As shown in Table 1, the average accuracy of AJDAN+S is 68.3%, an improvement of 3.5% over CDAN+E and 11.7% over DANN. Table 2 shows the classification accuracy of 6 tasks on the Office31 dataset. AJDAN+S is again better than the existing methods: its average accuracy is 89.0%, an improvement of 2.3% over CDAN+E and 6.8% over DANN. Table 3 shows the classification accuracy of 6 tasks on the ImageCLEF-DA dataset, where the average accuracy of AJDAN+S is 88.1%, better than those of the other methods. To sum up, across the above three experiments, AJDAN+S offers better performance than JAN, DAN, and DAAN. This demonstrates the superiority of network self-learned parameters over manually set parameters.
When the source and target domains are similar, such as in the D→W and W→D tasks in Office31 or the I→C task in ImageCLEF-DA, AJDAN+S shows little improvement (less than 0.1%) over CDAN+E. The reason is that their marginal distributions are close by nature, so strengthening the alignment of the marginal feature distribution does not work effectively.

D. EMPIRICAL ANALYSIS 1) ABLATION STUDY
To compare the different influences of aligning marginal, conditional, and joint distributions, we test and visualize the performance of DANN, CDAN+E, and AJDAN. To compare the impact of strong alignment, we test the performance of AJDAN and AJDAN+S. The backbone network is ResNet50, and λ is set to 1. Fig. 7(a) shows the results. As shown in Fig. 7(a), the joint alignment strategy performs better on most transfer tasks of the Office-Home dataset. The average accuracy of AJDAN increases by 9% relative to DANN and 1% relative to CDAN+E. After adopting the strong alignment strategy, the average accuracy of AJDAN+S increases by 11% relative to DANN, 3% relative to CDAN+E, and 1.5% relative to AJDAN.
It is worth noting that the strong alignment strategy is not effective in all cases. For example, we test CDAN, AJDAN, and AJDAN+S on VisDA-2017 and Digit datasets, and the results are shown in Table 5. For the SVHN→MNIST task in the Digit dataset, AJDAN+S performs better than CDAN+E and AJDAN. However, in the VisDA-2017 dataset (Synthetic → Real), AJDAN performs better than AJDAN+S. The use of the strong alignment strategy does not give the anticipated improvement. Instead, it decreases the accuracy by 3%.
The reason is that aligning feature distributions means that not only object categories but also backgrounds and scene layouts must be similar across domains. With strong alignment, feature alignment is performed both at the global image scale and at the instance (object) scale, so strict matching may work well for small domain shifts that only affect the appearance or texture of objects (e.g., weather-related shifts), but it is likely to hurt performance for larger shifts that affect the layout of the scene. Fortunately, the formula we give is flexible. In the above experiments, γ is fixed to 1 to keep the experiments consistent, but it can be adjusted to make AJDAN+S adaptive to different tasks in practical applications.

2) BALANCE FACTOR λ
Factor λ controls the balance between the classification task and the domain alignment task. To test its influence, we adjusted its value and tested the accuracy of the corresponding AJDAN+S on the A→W task of Office31. As shown in Fig.7(b), the overall accuracy first increases and then decreases. For example, in the A→W task the accuracy peaks at 0.6, while in the D→A task it peaks at 1.0. In general, a value of λ between 0.6 and 1 is suitable.

3) CONVERGENCE AND COMPUTATION COST

Fig.7(c) shows the change in accuracy over iterations for the A→W task on the Office31 dataset. During training, ResNet50 converges rapidly within 1000 iterations, but its accuracy on the target domain stops increasing after that and instead decreases. The reason is that as training proceeds, the network overfits the source domain, so performance on the target domain drops instead of rising. With the transfer learning strategy, this overfitting is avoided in DANN, CDAN, and AJDAN. In terms of convergence speed, CDAN and AJDAN are similar and slightly better than DANN. In terms of accuracy, AJDAN is slightly more accurate than CDAN and clearly more accurate than DANN and ResNet50. Table 6 shows the computation costs of various methods and their corresponding accuracy on the Office-Home dataset. Since the transfer network is a small part compared with the overall network, there is little difference in overall computation cost among the methods. As shown in Table 6, DAAN has the largest computation cost because it sets a transfer network separately for each category. In general, AJDAN achieves the best results without additional computation.

4) VISUALIZATION
To compare the existing methods and AJDAN, we use t-SNE [48] to visualize the feature distribution of ResNet50, DANN, CDAN, and AJDAN. Fig.8 shows the feature distribution of A→W tasks in the Office31 dataset and A→C tasks in the Office-Home dataset. The two tasks correspond to remote and near domain adaptation problems respectively. The red points represent source domain and the green points represent target domain.
As shown in Fig.8, in ResNet50 the target domain features are scattered around the source domain features, and the overall distribution is far away from the source domain, clustered in small areas. In DANN, the two types of features are mixed to a certain extent, which is better than ResNet50. In CDAN, the features of the source and target domains are further fused, but there are still some unsatisfactory phenomena, such as some target domain features failing to find corresponding source domain features. Finally, AJDAN combines the advantages of DANN and CDAN and achieves the best results among them.

V. CONCLUSION
Within the field of UDA, this paper proposes a new domain adaptation method named Adaptative Joint Distribution Adaptation Network (AJDAN). Experimental results show that, compared with DANN based on marginal feature alignment, CDAN based on conditional feature alignment, and JAN based on joint alignment with manual parameters, AJDAN based on joint alignment with self-learned parameters offers results about 2% above the best existing methods. Moreover, we replace traditional BCE with SBCE. Experimental results show that AJDAN+S can further improve the accuracy by approximately 3%. Regarding future improvements of AJDAN, the values of γ and λ currently need to be determined empirically, which is inconvenient in practical use. In future work, a better approach to obtain the values of γ and λ is needed, and the performance is then expected to improve further.