Metric Information Matrix for Maximum Mean Discrepancy for Domain Adaptation

In this paper, we focus on the problem of unsupervised domain adaptation, which transfers knowledge from a well-labeled source domain to an unlabeled target domain with a different distribution. Based on Gromov-Hausdorff theory, we propose two kinds of feature mappings for the joint distribution adaptation model that embed the original feature subspace into a common subspace. These mappings can be seen as a feature-embedding component for models based on feature alignment. Our experiments show that the constructed mappings alleviate the feature discrepancy and mitigate the distribution shift between the source and target domains.


I. INTRODUCTION
In some application scenarios, one has access only to information from source domains, and no direct information about a target domain is available. Domain adaptation (DA) aims at training a model on a source domain and applying it to a new target domain. The key to most domain adaptation methods is to learn a transformation of the features that reduces the discrepancy between the distributions of the source and target datasets. Real-world problems present different situations: 1) the marginal distributions are different, while the conditional distributions are similar; 2) the marginal distributions are similar, while the conditional distributions are different; 3) both the marginal and the conditional distributions are different. Domain adaptation approaches fall into two major categories: those based on optimization and those based on deep learning.
Maximum mean discrepancy (MMD) is a widely adopted optimization-based tool for measuring the discrepancy between the source and target domain distributions. A popular family of domain adaptation methods is feature-based: it projects data from different domains into a shared subspace to minimize their discrepancy, usually measured by MMD [1]. Li et al. proposed a heterogeneous domain adaptation method that optimizes both feature discrepancy and distribution divergence in a unified objective function, and introduced the notion of progressive alignment, which learns a new transferable feature space by dictionary-sharing coding and then aligns the distribution gaps on the new space [2]. Joint distribution adaptation (JDA) [5] and its generalization, discriminative joint probability MMD (DJP-MMD) [4], were proposed to measure the distribution shift between domains.
The associate editor coordinating the review of this manuscript and approving it for publication was Seifedine Kadry.
Deep domain adaptation methods based on neural networks currently provide new solutions to domain adaptation. Hoffman et al. derived new normalized solutions with strong theoretical guarantees for the cross-entropy loss and other similar losses, and presented a number of novel contributions to the multiple-source adaptation problem [6]. To characterize the common feature subspace in more detail, an Adversarial Tight Match method with a novel distance loss, named Maximum Density Divergence, was proposed for adversarial domain adaptation; it enjoys the benefits of both adversarial training and metric learning [7]. Moreover, a quintessential deep learning framework for domain adaptation is the Deep Transfer Network (DTN), in which highly flexible deep neural networks implement the distribution matching process [8].
In this paper we focus on MMD-based approaches to domain adaptation. We introduce a graph structure, named the metric information matrix, which can be used to characterize the geometric information of a set or a feature space. We then use its invariance property to characterize the local metric information of a feature space for domain adaptation. We also provide some additional notions, such as Gromov-Wasserstein barycenters and the k-nearest neighbour graph, to improve MMD-like algorithms based on metric matrices. The main contributions of the paper are:
(1) We introduce the notion of a metric information matrix constructed from metric matrices, Toeplitz matrices and Markov matrices. It can be used as a basic building block to represent the geometric structural information of the source domain and the target domain.
(2) We provide a strategy for finding a common subspace in which the between-domain transferability and the between-class discriminability can be optimized easily with MMD-like methods. Moreover, combined with MMD, it can be seen as a feature-embedding component in adversarial domain adaptation models. An experiment is given to verify the effectiveness of the hybrid adversarial model.
(3) The proposed approach can be combined with MMD-like approaches in a straightforward way by transforming the original domains to a common subspace. We show by extensive experiments that the proposed method outperforms the baseline methods in accuracy on different datasets.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. MATHEMATICAL FOUNDATIONS
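Let $D_s = \{(x_{s,i}, y_{s,i})\}_{i=1}^{n_s}$ be the source domain with $n_s$ labeled samples and $D_t = \{x_{t,j}\}_{j=1}^{n_t}$ be the target domain with $n_t$ unlabeled samples, where $x$ is the feature tensor and $y$ is the label, with $y \in \{1, \ldots, C\}$ for $C$-class classification. In this paper we only consider homogeneous transfer learning, i.e., the feature spaces and label spaces coincide: $\mathcal{X}_s = \mathcal{X}_t$ and $\mathcal{Y}_s = \mathcal{Y}_t$. Domain adaptation aims to find a mapping $H$ that brings $H(X_s)$ and $H(X_t)$ together in some sense, so that the well-trained classifier works well on both $H(X_s)$ and $H(X_t)$.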
Consider a feature mapping $h$ that maps $x$ to a feature space. The general objective function of domain adaptation can be defined as follows:
$$\min_{h} \; d_{S,T} + \lambda R(h),$$
where $d_{S,T} = d(P(X_s, Y_s), P(X_t, Y_t))$ is a discrepancy metric between the source and target domain distributions, $R(h) = \|h\|_F^2$ controls the mapping complexity, and $\lambda$ is a regularization parameter.

A. MATHEMATICAL FOUNDATIONS
Metric matrix theory is an important topic in metric measure geometry and a tool for studying Gromov's theory of convergence and concentration of metrics and measures. By the definition of a distance matrix, any translation, rotation, or reflection of the data leads to the same metric matrix, for example the same Euclidean distance matrix. We use this property to characterize the local metric information of a set (or a domain).
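The invariance property can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated points; the helper name `euclidean_distance_matrix` and the sample sizes are illustrative choices, not part of the original method.

```python
import numpy as np

def euclidean_distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X (shape: n x d)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # six illustrative points in R^3

# A random orthogonal matrix (a rotation or a reflection) from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
shift = rng.normal(size=3)             # an arbitrary translation

X_moved = X @ Q.T + shift              # apply the rigid motion to every point

D_original = euclidean_distance_matrix(X)
D_moved = euclidean_distance_matrix(X_moved)

# The two distance matrices agree up to floating-point error.
invariant = np.allclose(D_original, D_moved)
```

The distance matrix therefore depends only on the relative positions of the samples, not on the coordinate frame in which they are expressed.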
Let $(X, d)$ be a metric space. We introduce the following definition to characterize the metric information of domain data.

1) METRIC INFORMATION MATRIX
For a metric space $X$ and a natural number $N$, the distance matrix set $M_N(X)$ of $X$ of order $N$ is defined to be the set of symmetric matrices $(d(x_i, x_j))_{i,j=1}^{N}$, where $x_i$, $i = 1, \ldots, N$, run over all points in $X$. By this definition, we obtain a set of $N \times N$ matrices for each order $N$. For a feature space $(X, D)$ of a domain, we can then define a metric information matrix (MIM) as the matrix of pairwise distances $D(x_i, x_j)$ between its samples, where $D$ is a metric. If $X$ is compact, then $M_N(X)$ is compact. The MIM can be used to characterize the geometric features of a set, and the Gromov-Hausdorff (or Gromov-Wasserstein) distance can be used to measure the distance between two metric information matrices.

2) TOEPLITZ MATRIX
A Toeplitz matrix or diagonal-constant matrix is a matrix in which each descending diagonal from left to right is constant.
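As a small illustration of the definition, the sketch below builds a Toeplitz matrix from its first column and first row. The helper name `toeplitz_from_diagonals` and the numeric values are hypothetical and chosen only for the example.

```python
import numpy as np

def toeplitz_from_diagonals(first_col, first_row):
    """Build T with T[i, j] = first_col[i - j] if i >= j, else first_row[j - i].

    first_col[0] and first_row[0] both describe the main diagonal and
    are expected to be equal.
    """
    first_col = np.asarray(first_col)
    first_row = np.asarray(first_row)
    n, m = len(first_col), len(first_row)
    T = np.empty((n, m), dtype=np.result_type(first_col, first_row))
    for i in range(n):
        for j in range(m):
            T[i, j] = first_col[i - j] if i >= j else first_row[j - i]
    return T

T = toeplitz_from_diagonals([1, 2, 3, 4], [1, 5, 6, 7])
# T =
# [[1 5 6 7]
#  [2 1 5 6]
#  [3 2 1 5]
#  [4 3 2 1]]
```

Every descending diagonal of `T` holds a single value, so an n x n Toeplitz matrix is determined by 2n - 1 numbers rather than n^2.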

3) MARKOV MATRIX
If the probability of moving from state $i$ to state $j$ in one time step is $\Pr(j \mid i)$, the stochastic matrix $P$ is the matrix whose element in the $i$-th row and $j$-th column is $P_{i,j} = \Pr(j \mid i)$.
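One common way to obtain such a matrix from nonnegative pairwise weights is to normalize each row so that it sums to one. The sketch below assumes this row normalization; the paper does not state how its Markov matrix is built, so the function name `row_stochastic` and the weights are illustrative.

```python
import numpy as np

def row_stochastic(W):
    """Normalize a nonnegative weight matrix so that every row sums to 1."""
    W = np.asarray(W, dtype=float)
    if np.any(W < 0):
        raise ValueError("weights must be nonnegative")
    row_sums = W.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every row needs at least one positive weight")
    return W / row_sums

# Illustrative weights between three states.
W = np.array([[0.0, 2.0, 2.0],
              [1.0, 0.0, 3.0],
              [5.0, 5.0, 0.0]])

P = row_stochastic(W)     # P[i, j] plays the role of Pr(j | i)
two_step = P @ P          # transition probabilities over two time steps
```

Products of row-stochastic matrices are again row-stochastic, which is why `two_step` can also be read as a transition matrix.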

4) GROMOV-WASSERSTEIN BARYCENTERS
For Gromov-Wasserstein barycenters of measured similarity matrices and a large class of losses (the KL loss or the $L^2$ loss), one can compute the global minimizer by Proposition 3 in [3].

5) MAXIMUM MEAN DISCREPANCY
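Maximum mean discrepancy is a distance between embeddings of probability distributions in a reproducing kernel Hilbert space (RKHS). For the embeddings $\mu_P$ and $\mu_Q$ of $P$ and $Q$ in the RKHS $\mathcal{H}_k$ with kernel $k$, the maximum mean discrepancy $\mathrm{MMD}_k(P, Q)$ is defined as the distance between the corresponding mean elements, i.e.,
$$\mathrm{MMD}_k(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}_k}.$$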
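In practice the mean elements are estimated from samples. The following is a minimal sketch of the standard biased empirical estimate of the squared MMD, assuming NumPy; the function names, the RBF bandwidth and the synthetic data are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def linear_kernel(A, B):
    """k(a, b) = <a, b> for every pair of rows of A and B."""
    return A @ B.T

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd_squared(X, Y, kernel=linear_kernel):
    """Biased empirical estimate of MMD_k(P, Q)^2 from samples X ~ P and Y ~ Q."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
Xs = rng.normal(loc=0.0, scale=1.0, size=(200, 5))   # synthetic "source" samples
Xt = rng.normal(loc=0.5, scale=1.0, size=(150, 5))   # synthetic "target" samples

# With the linear kernel the estimate reduces to the squared distance
# between the two sample means.
mmd_lin = mmd_squared(Xs, Xt)
mean_gap = np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2)

mmd_rbf_shifted = mmd_squared(Xs, Xt, kernel=rbf_kernel)
mmd_rbf_same = mmd_squared(Xs, Xs, kernel=rbf_kernel)  # identical samples
```

The linear-kernel case makes the "distance between mean elements" reading concrete, while the RBF kernel also reacts to differences beyond the first moment.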

B. DOMAIN ADAPTATION
Domain adaptation aims at learning a model in the source domain and applying it to the target domain. Many domain adaptation methods try to reduce the distribution shift between the source and the target datasets. A discriminative joint probability maximum mean discrepancy (DJP-MMD) for domain adaptation was proposed in [4]. The objective function of DJP-MMD is defined there as a constrained arg min problem, in which $H = I - \mathbf{1}_n$ is the centering matrix, $n = n_s + n_t$, and $\mathbf{1}_n \in \mathbb{R}^{n \times n}$ is the matrix whose elements all equal $1/n$.
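The role of the centering matrix can be illustrated with a short sketch. It assumes that samples are stored as the columns of a d x n data matrix, which is a convention chosen for this example; the sizes and random data are likewise illustrative.

```python
import numpy as np

n_s, n_t = 4, 3                      # illustrative source and target sample counts
n = n_s + n_t

ones_n = np.full((n, n), 1.0 / n)    # n x n matrix with every element equal to 1/n
H = np.eye(n) - ones_n               # centering matrix H = I - 1_n

rng = np.random.default_rng(0)
X = rng.normal(size=(5, n))          # assumed layout: d x n, one sample per column

# Right-multiplying by H subtracts the mean sample from every column.
X_centered = X @ H
```

Because `H` is symmetric and idempotent, `X @ H @ X.T` equals the scatter matrix of the centered data.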

III. DOMAIN ADAPTATION WITH INFORMATION MATRIX
To facilitate domain adaptation, we improve MMD-like models by minimizing the objective function on a common subspace. Generally speaking, for any $N \le |X|$, $X$ and $M_N(X)$ cannot be placed in one-to-one correspondence. If $N \neq |X|$, then MMD algorithms based on labels and pseudo-labels do not work on this subspace. If $N = |X|$, $M_N(X)$ is also called a similarity matrix, which is a symmetric nonnegative adjacency matrix with zero diagonal. In this paper we construct a neighbour graph of the domains and apply MMD-like models for domain adaptation on the graph.
The adjacency matrix of the neighbour graph is constructed from $NN_k$, the set of $k$ nearest neighbours in $X$, with a parameter $\gamma$ following recent work on manifold-based search [9]. Because this matrix is constructed from a metric function, we call it the metric information matrix (i-MIM).
A linear mapping is then defined from this matrix, where $\gamma$ is a trade-off parameter.
For clarity, we summarize the learning algorithm in Algorithm 1.

2) i-MIM
We introduce a modified algorithm for domain adaptation, summarized as follows. We construct a nearest neighbour graph on the source domain and on the target domain, and then apply the discriminative joint probability maximum mean discrepancy for domain adaptation on the two graphs. The idea is developed below, and a graphical overview of the proposed approach is shown in Figure 1. The modified objective function of the discriminative joint probability maximum mean discrepancy is then defined as the corresponding arg min problem on the two graphs.
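The graph construction step can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distance, raw distances as edge weights, and symmetrization by keeping an edge when either endpoint selects it. The weighting with the parameter γ used in the paper is not reproduced here, and the function name `knn_metric_matrix` is hypothetical.

```python
import numpy as np

def knn_metric_matrix(X, k):
    """Distance matrix of X (n x d) restricted to each sample's k nearest neighbours."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # full pairwise distance matrix
    n = D.shape[0]
    M = np.zeros_like(D)
    for i in range(n):
        order = np.argsort(D[i])
        neighbours = [j for j in order if j != i][:k]   # skip the sample itself
        M[i, neighbours] = D[i, neighbours]
    # Keep an edge if either endpoint selected it, so the result is symmetric.
    return np.maximum(M, M.T)

rng = np.random.default_rng(0)
X_s = rng.normal(size=(8, 4))      # illustrative source features
X_t = rng.normal(size=(10, 4))     # illustrative target features

M_s = knn_metric_matrix(X_s, k=3)  # graph on the source domain
M_t = knn_metric_matrix(X_t, k=3)  # graph on the target domain
```

Each resulting matrix is symmetric, nonnegative and has a zero diagonal, matching the description of the similarity matrix given for the case N = |X|.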

B. II-MIM METHOD
The i-MIM only characterizes the metric relationships between points within each domain and therefore loses information from the original data. We provide a strategy for mapping the source and target data into a common subspace. We construct an information matrix that can be built as a nearest neighbour graph or as a matrix with special properties, such as a Toeplitz matrix. We introduce the definition of convolutional filter masks proposed in [10].

1) CONVOLUTIONAL FILTER MASKS
A convolutional filter mask $W$ is a sequence with a prescribed support; see [10] for the precise definition. The associated linear mapping involves a trade-off parameter $\gamma$. The matrix $S \cdot M$ is named the metric information matrix (ii-MIM). We then apply the discriminative joint probability maximum mean discrepancy for domain adaptation on the two graphs. These ideas are developed below, and a graphical overview of the proposed approach is shown in Figure 2. The modified objective function of the discriminative joint probability maximum mean discrepancy is then defined as the corresponding arg min problem. The optimization problems can be solved by the algorithm provided in [4].
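The link between a filter mask and a Toeplitz matrix can be made concrete with a short sketch. It assumes that S is the banded Toeplitz matrix generated by an odd-length mask and that M is a symmetric matrix with zero diagonal; both are assumptions made for illustration, since the exact definitions of S, M and the mask support are given in the cited works. The function name `mask_to_toeplitz` and the mask values are hypothetical.

```python
import numpy as np

def mask_to_toeplitz(w, n):
    """n x n banded Toeplitz matrix S with S @ x == np.convolve(x, w, mode="same")."""
    w = np.asarray(w, dtype=float)
    if len(w) % 2 == 0:
        raise ValueError("this sketch assumes an odd-length mask")
    m = len(w) // 2
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            tap = i - j + m                 # index into the mask for entry (i, j)
            if 0 <= tap < len(w):
                S[i, j] = w[tap]
    return S

w = [0.25, 0.5, 0.25]                       # illustrative smoothing mask
n = 6

rng = np.random.default_rng(0)
M = rng.random((n, n))
M = (M + M.T) / 2.0                         # stand-in for a metric matrix
np.fill_diagonal(M, 0.0)

S = mask_to_toeplitz(w, n)
SM = S @ M                                  # product of the form S . M

# Multiplying by S is the same as filtering every column of M with the mask.
col_conv = np.column_stack(
    [np.convolve(M[:, j], w, mode="same") for j in range(n)]
)
```

Under these assumptions the product S · M is a linear transform of the metric matrix that mixes each sample's distances with those of its index neighbours.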

IV. EXPERIMENTS
In this section, some experiments are given to show the effectiveness of the modified algorithms.

A. DATASETS

1) OBJECT RECOGNITION
Office+Caltech includes four domains: Caltech (C), Amazon (A), Webcam (W), and DSLR (D). In this paper we adopt the SURF features and obtain 4 × 3 = 12 different domain adaptation tasks by taking every ordered pair of distinct domains as source and target.
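The count 4 × 3 = 12 corresponds to all ordered (source, target) pairs of distinct domains. A minimal sketch enumerating them, with illustrative variable names:

```python
from itertools import permutations

domains = ["C", "A", "W", "D"]   # Caltech, Amazon, Webcam, DSLR

# Every ordered pair of distinct domains defines one adaptation task.
tasks = [f"{source} -> {target}" for source, target in permutations(domains, 2)]
```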

2) COIL
COIL includes 20 objects with 1,440 images. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, so each object has 72 images of 32 × 32 pixels. The dataset is partitioned into two equal subsets (COIL1 and COIL2) with different distributions.

3) DIGIT RECOGNITION
USPS and MNIST are two public digit recognition datasets with different resolutions. Our experiments used the public USPS and MNIST subsets released by Long et al. [5], who randomly sampled 1,800 images from USPS and 2,000 images from MNIST. Both have 10 classes of digits, with different distributions.

4) MODULATION RECOGNITION
Experiments are conducted on the public dataset RadioML [11], which is commonly used for modulation recognition under domain adaptation. It contains 11 modulation classes: 8PSK, BPSK, QPSK, QAM16, QAM64, AM-DSB, CPFSK, GFSK, LFM, WBFM and PAM4. Each example waveform is of size 2 × 128. GNU Radio models for time-varying multi-path fading of the channel impulse response, random-walk drifting of the carrier frequency oscillator and additive white Gaussian noise are applied, and the signals are passed through harsh channel models that introduce unknown scale, translation and dilation. RadioML consists of two parts, named RadioML2016.10a (10a) and RadioML2016.04c (04c). Each class contains 10,000 samples over 20 signal-to-noise ratio levels.

B. SETTING
In our experiments we compared the following models on unsupervised domain adaptation tasks: JPDA [4], TCA [12], JDA [5], DICE [13], PACET [14], BDA (which used the A-distance [15] to compute the weight, instead of the grid search in [16]), adversarial discriminative domain adaptation (ADDA) [17] and adversarial unsupervised domain adaptation [18]. We chose the k-nearest neighbour model as the classifier. The parameter settings in [12] were used for TCA, JDA and BDA. We fixed p = 100 and T = 20 in all experiments, and set the regularization parameter λ = 1 with the linear kernel for the Office+Caltech dataset and λ = 0.1 with the primal kernel for the other datasets. In the deep domain adaptation experiment we used an improved neural network model from our previous work [19] for feature extraction.
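The evaluation protocol, training a k-nearest neighbour classifier on labeled source features and measuring accuracy on target features, can be sketched as follows. The text does not state the value of k; the sketch assumes k = 1, Euclidean distance and nonnegative integer labels, and the toy data are purely illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Predict labels for X_test by majority vote among the k nearest training samples."""
    diff = X_test[:, None, :] - X_train[None, :, :]
    dist = (diff ** 2).sum(axis=-1)                 # squared Euclidean distances
    nearest = np.argsort(dist, axis=1)[:, :k]       # indices of the k nearest samples
    votes = y_train[nearest]                        # shape: (n_test, k)
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy "source" data with labels and toy "target" data used only for evaluation.
X_source = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_source = np.array([0, 0, 1, 1])
X_target = np.array([[0.2, 0.1], [4.8, 5.2]])
y_target = np.array([0, 1])

y_pred = knn_predict(X_source, y_source, X_target, k=1)
accuracy = float((y_pred == y_target).mean())
```

In the unsupervised setting the target labels are used only to report accuracy, never to fit the classifier.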

1) DOMAIN CLASSIFICATION ACCURACY
The classification accuracies of the algorithms are given in Table 1 and Table 2. The accuracies of the proposed i-MIM and ii-MIM algorithms are 56.87% and 56.50%, respectively, and both outperform the baseline methods. This demonstrates the effectiveness of the proposed methods. Furthermore, we observe the following:
(1) The i-MIM method always performs better in the feature subspace constructed by the nearest neighbour graph, which shows that the geometric feature subspace can replace the original domain as a latent subspace for MMD-like methods.
(2) As can be seen from the results of the ii-MIM method, convolutional filter masks can be used to construct a linear transform that maps the original feature space to a common subspace in which MMD-based domain adaptation approaches are easier to optimize.
(3) In the framework of deep domain adaptation, the proposed method based on MIM-like feature embedding still outperforms the baseline methods (ADDA, GR-AUDA or FA-AUDA) in a 1-D feature vector space extracted by a deep neural network.
(4) The feature embeddings associated with the metric information matrix yield a more transferable and more discriminative feature mapping for cross-domain visual adaptation.

2) FEATURE VISUALIZATION
We adopt the t-SNE method [20] to visualize the learned features on the C-W, W-C, D-C and C-D tasks on the Office+Caltech dataset. The visualization results are plotted in Figures 3-6. From the results, we make the following observations:
(1) Figure 3 shows the data distributions of the first five classes when transferring Caltech (source) to Amazon (target), before and after different distribution adaptation approaches, where RAW denotes the raw data distribution. For the raw distribution, the samples of the source and the target domains are mixed together. After domain adaptation, the proposed approaches bring the data distributions of the source and the target domains together and also keep samples from different classes well separated.
(2) Compared with JPDA, the proposed MIM-based methods not only align the distributions of the source and target data but also enhance the discriminative power of the target data (see $S_4 \to T_4$).
(3) i-MIM and ii-MIM can both align the distributions of the source and target data, and i-MIM gives a better feature representation than ii-MIM and JPDA (see $S_3 \to T_3$ or $S_5 \to T_5$ in Figure 3).

3) PARAMETERS SENSITIVITY AND CONVERGENCE
We consider the parameter sensitivity and convergence of the proposed models on different datasets. The main adjustable parameter γ was studied for the two classes of MIM-based approaches. The result is shown in Figure 7. The ii-MIM method is robust when γ ∈ [0.1, 0.6]. Another parameter is the metric used in the metric information matrix. Here we compare 12 different kinds of distances: euclidean, squaredeuclidean, seuclidean, mahalanobis, cityblock, minkowski, chebychev, cosine, correlation, hamming, jaccard and spearman. The results show that the choice of distance in the MIM is important for domain adaptation, but there is no clear evidence as to which distance is best. In terms of average accuracy, the cosine distance and the Euclidean distance are more robust.
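To make the role of the distance concrete, the sketch below computes the pairwise distance matrix of a small point set under five of the twelve distances listed above. The function name `pairwise`, the selection of distances and the toy points are illustrative assumptions; the remaining distances and the paper's experimental pipeline are not reproduced.

```python
import numpy as np

def pairwise(X, metric):
    """Pairwise distance matrix between the rows of X under the named distance."""
    diff = X[:, None, :] - X[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "squaredeuclidean":
        return (diff ** 2).sum(axis=-1)
    if metric == "cityblock":
        return np.abs(diff).sum(axis=-1)
    if metric == "chebychev":
        return np.abs(diff).max(axis=-1)
    if metric == "cosine":
        unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # rows must be nonzero
        return 1.0 - unit @ unit.T
    raise ValueError(f"unsupported metric: {metric}")

METRICS = ["euclidean", "squaredeuclidean", "cityblock", "chebychev", "cosine"]

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# One candidate metric information matrix per distance.
mims = {name: pairwise(X, name) for name in METRICS}
```

The same three points already give visibly different matrices, for example a distance of about 1.414 between the first two points under euclidean but 2 under cityblock, which is consistent with the sensitivity reported above.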

4) ABLATION STUDY
We also conducted an ablation study to check whether MMD based on i-MIM indeed improves domain adaptation. DJP-MMD was used as the baseline. When embedded in DA, the classification accuracies of the approaches based on different metric matrices are shown in Figure 8. Compared with the baseline method, the proposed algorithms improve the classification accuracies on the tasks COIL2-COIL1 and COIL1-COIL2.

V. CONCLUSION
In this paper, we constructed two metric information matrices that can serve as basic components of feature mappings in domain adaptation models. Combined with an MMD-like model, the feature space induced by a metric information matrix provides a simpler and more effective way to measure the discrepancy between different domains. The feature mapping can also serve as a basic block of a deep domain adaptation neural network such as Adversarial Discriminative Domain Adaptation. Experiments on different classification datasets verified the superiority of the proposed methods over the baselines.
However, the proposed models are not state-of-the-art models. For example, DICE [13] and PACET [14] are both more effective than ours for domain adaptation in the experiments. In future work, we will try to borrow ideas from PACET and DICE to modify our models.