Parallel Transport on the Cone Manifold of SPD Matrices for Domain Adaptation

In this paper, we consider the problem of domain adaptation. We propose to view the data through the lens of covariance matrices and present a method for domain adaptation using parallel transport on the cone manifold of symmetric positive-definite matrices. We provide rigorous analysis using Riemanninan geometry, illuminating the theoretical guarantees and benefits of the presented method. In addition, we demonstrate these benefits using experimental results on simulations and real-measured data.


I. INTRODUCTION
T HE increasing technological sophistication of current data acquisition systems gives rise to complex, multimodal datasets in high-dimension. As a result, the acquired data do not live in a Euclidean space, and applying analysis and learning algorithms directly to the data often leads to subpar performance.
To facilitate the analysis and processing of such data, one approach is to observe complex high-dimensional data through the lens of objects with a known non-Euclidean geometry. Notable examples of such objects are Symmetric and Positive Definite (SPD) matrices, which live on a cone manifold with a Riemannian metric. One of the most common forms of SPD matrices is a covariance matrix, which captures the linear relations between the different data coordinates. These relations are typically simple to compute, and therefore, recently, have become popular features in many applications in computer vision, medical imaging, and machine learning [1]- [4]. In particular, in [5], [6], the Riemannian geometry of covariance matrices was studied and exploited for medical imaging and physiological signal analysis.
Typically, Riemannian geometry is used to map objects from the non-Euclidean manifold to a linear Euclidean space by projection onto a tangent plane of the manifold. In existing work, the use of Riemannian geometry is usually limited to a single tangent plane. This indicates a hidden assumption that the SPD matrices corresponding to the data are confined to a local region of the manifold. However, the SPD matrices of the data often do not live in a small neighborhood on the manifold, and thus the resulting calculations may be inaccurate.
One such particular scenario, in which SPD matrices span a large portion of the cone manifold, occurs when the data comprise multiple domains corresponding to multiple sessions, subjects, batches, etc. For example, we will show that in a The authors are with the Viterbi Faculty of Electrical Engineering, Technion-Israel Institute of Technology, Haifa 32000, Israel (e-mail: oryair@campus.technion.ac.il; ronen@ee.technion.ac.il).
Brain-Computer-Interface (BCI) experiment, the covariance matrices of data acquired from a single subject in a specific session capture well the overall geometric structure of the data. Conversely, when the data consist of measurements from several subjects or several sessions, then the covariance matrices do not live in the same region of the manifold.
Often, multi-domain data pose significant challenges to learning approaches. For example, in the BCI experiment, it is challenging to train a classifier based on data from one subject (session) and apply it to data from another subject (session). This problem is largely referred to as domain adaptation or transfer learning, and it has attracted a significant research effort in recent years [7], [8].
Broadly, in domain adaptation, the main idea is to adapt a given model that is well performing on a particular domain, to a different yet related domain [7], [8]. Specifically, in the context of the cone manifold of SPD matrices, previous work proposed (geometric) mean subtraction as a simple method for domain adaption of BCI data [6]. Although this approach provided reasonable results for overcoming the differences between multiple sessions of a single subject, we show here that it fails to overcome the differences between multiple subjects. In [3], a Parallel Transport (PT) approach was proposed, which can be applied either directly to the data, or to a generative model of the data to reduce the computational load for large datasets. However, their approach considers a general Riemannian manifold. Since there is no closed-form expression of PT on Riemannian manifolds, besides the sphere manifold and the manifold of all SPD matrices [1], [4], no specific scheme or algorithm was provided. We note that PT can be approximated using Schild's Ladder [9], an approach that has been used extensively on the manifold of imaging data [4], [5], [9], [10].
In this paper, we propose a domain adaptation method using the analytic expression of PT on the cone manifold of SPD matrices. We claim that this is a natural and efficient solution for domain adaptation, which enjoys several important benefits. First, the solution is especially designed for SPD matrices, which have proven to be good features of data in a gamut of previous work [5], [6], [10]. Second, the analytic form of PT on the cone manifold circumvents approximations. Third, PT can be efficiently implemented, in contrast to the computationally demanding Schild's Ladder approximation. We establish the mathematical foundation of the proposed domain adaptation method. To this end, we provide new results in the geometry of SPD matrices. In addition, we show applications to both simulation and real recorded data, obtaining improved performance compared to the competing methods.
In parallel to our study, recent work [11] has proposed a scheme for transfer learning using the Riemannian geometry of SPD matrices, with a tight connection to the present work. We will show that the affine transformation proposed in [11] can be recast as PT. In this paper we provide the mathematical foundation to analyze this transport, we discuss the advantage of our solution compared to [11], and we point out the special case in which the two methods coincide. This paper is organized as follows. In Section II, we present preliminaries on the Riemannian geometry of SPD matrices. In Section III, we formulate the problem, present the proposed domain adaptation method, and provide mathematical analysis and justification. Section IV shows experimental results on both simulation and real data. Finally, we conclude the paper in Section V.
II. PRELIMINARIES ON RIEMANNIAN GEOMETRY OF SPD MATRICES In this section we provide the preliminaries regarding SPD matrices, and we refer the reader to the book [12] for a detailed exposition of this topic. We note that in this paper we focus on covariance matrices, however the statements also hold for general SPD matrices. By definition, an SPD matrix P ∈ R n×n has only strictly positive eigenvalues. An alternative definition is that for any vector v = 0 the quadratic form is strictly positive, i.e., v T P v > 0.

A. Metric and distance
The definition of an SPD matrix entails that the collection of all SPD matrices constitutes a convex half-cone in the vector space of real n×n symmetric matrices. This cone forms a differentiable Riemannian manifold M equipped with the following inner product where T P M is the tangent space at the point P ∈ M, S 1 , S 2 ∈ T P M, and ·, · is the standard Euclidean inner product operation. The symmetric matrices S ∈ T P M in the tangent plane live in a linear space, and therefore, we can view them as vectors (with a proper representation). Throughout this paper, we interchangeably use the terms vectors and symmetric matrices when referring to S ∈ T P M. This Riemannian manifold is a Hadamard manifold, namely, it is simply connected and it is a complete Riemannian manifold of non-positive sectional curvature. Manifolds with nonpositive curvature have a unique geodesic curve between any two points, a property that will later be exploited. Specifically, the unique geodesic curve between any two SPD matrices P 1 , P 2 ∈ M is given by [12,Thm 6.1.6] The arc-length of the geodesic curve defines the following Riemannian distance on the manifold [12]: Fig. 1. The cone manifold of 2 × 2 SPD matrices. The black dots mark the boundary of the cone (i.e., matrices with eigenvalue zero). Each magenta curve is the geodesic between pairs of matrices (blue circles and green squares). All the geodesic curves are of the same length (i.e., the Riemannian distance between all the pairs is equal).
where P 1 , P 2 ∈ M, · F is the Frobenius norm, log(P ) is the matrix logarithm, and λ i (P ) is the i-th eigenvalue of P . We additionally denote by d dt ϕ(t) = ϕ (t) ∈ T ϕ(t) M the velocity vector of the geodesic at t ∈ [0, 1]. Figure 1 presents an illustration of the geodesic curve and the Riemannian distance. The cone manifold of 2 × 2 SPD matrices can be displayed in R 3 , since any symmetric matrix P = x y y z is positive if and only if x > 0, z > 0 and y 2 < xz.

B. Exponential and Logarithm maps
The Logarithm map, which projects an SPD matrix P i ∈ M to the tangent plane T P M at P ∈ M, is given by The Exponential map, which projects a vector S i ∈ T P M back to the manifold M is given by An important property relates the Logarithm and Exponential maps to the geodesic curve. Formally, let P 1 , P 2 ∈ M, and consider the (unique) geodesic ϕ(t) from P 1 to P 2 . The initial velocity ϕ (0) ∈ T P 1 M is given by the Logarithm map ϕ (0) = Log P 1 (P 2 ). Similarly, the Exponential map projects the initial velocity vector ϕ (0) back to P 2 , namely, P 2 = Exp P 1 (ϕ (0)).

C. Riemannian mean
The Riemannian mean P of a set {P i |P i ∈ M} is defined using the Fréchet mean: A special case is the Riemannian mean P of two SPD matrices P 1 , P 2 ∈ M, which has a closed-form expression, and is located at the midpoint of the geodesic curve: 1 . Generally, for more than two matrices, the solution of the optimization problem (4) can be obtained by an iterative procedure. Barachant et al [6] presented an algorithm based on [13] for estimating the Riemannian mean. For completeness, we include their algorithm in Appendix F.
Given a set {P i |P i ∈ M} and its Riemannian mean P , there is a commonly used approximation of the Riemannian distances on M in the neighborhood of P . Specifically, the approximation of the Riemannian distance d 2 R is given by: . For more details on the accuracy of this approximation, see [2].  i (t) ∈ R D . Suppose each subset lives in a particular domain, which could be related to the acquisition modality, session, deployment, and set of environmental conditions. In our notation, the superscript k denotes the index of the subset, the subscript i denotes the index of the timeseries within each subset, and t represents the time axis of each time-series.

III. DOMAIN ADAPTATION WITH PARALLEL TRANSPORT
Our exposition focuses only on two subsets, and the generalization for any number of subsets is discussed at the end of this section. In addition, we consider here time-series, but our derivation does not take the temporal order into account, and therefore, the extension to other types of data, where t is merely a sample index, e.g., images, is straight-forward.
Analyzing such data typically raises many challenges. For example, a long-standing problem is how to efficiently compare between high-dimensional point clouds, and particularly, time-series. When the data are measured signals, sample comparisons become even more challenging, since such highdimensional measured data usually contain high levels of noise.
In particular, in our setting, we face an additional challenge, since the data is given in different domains; comparing timeseries from the same subset is a difficult task by itself, even more so is comparing time-series from two subsets from different domains.
Our goal is to find a new joint representation of the two subsets in an unsupervised manner. Broadly, we aim to devise a low-dimensional representation in a Euclidean space that facilitates efficient and meaningful comparisons. For the purpose of evaluation, we associate the time-series x (k) i (t) with labels y (k) i and define "meaningful" comparisons with respect to these labels. More concretely, we evaluate the joint representation by the Euclidean distance between the new representation of any two time-series with similar corresponding labels, independently of the time-series respective domain. We emphasize that the proposed approach is unsupervised and it does not depend on the labels, which are only considered for the purpose of evaluation.
Devising such a new representation will facilitate efficient and accurate domain adaptation schemes. Specifically, given a subset x , we could train a classifier based on the new derived representation of the subset. Then, when another unlabeled subset x becomes available, we could apply the trained classifier to the derived (joint) representation of the latter subset.

B. Illustrative example
To put the problem setting and our proposed solution in context, throughout the paper, we will follow an illustrative example, taken from the brain computer interface (BCI) competition IV (dataset IIa) [14]. Consider data from a BCI experiment of motor imagery comprising of recordings from D = 22 Electroencephalography (EEG) electrodes. The dataset contains several subjects, where each subject was asked repeatedly to perform one out of four motor imagery tasks: movement of the right hand, the left hand, both feet, and the tongue. Let i=1 be a subset of recordings acquired from a single subject, indexed (1), where the time-series x (1) i (t) consists of the signals, recorded simultaneously from the D EEG channels during the i-th repetition/trial. Each time series x (1) i (t) is attached with a label y (1) i , denoting the imagery task performed at the the i-th trial. Common practice is to train a classifier based on X (1) , so that the imagery task could be identified from new EEG recordings. This capability could then be the basis for devising brain computer interfaces, for example, to control prosthetics.
Suppose a new subset of recordings acquired from another subject, indexed (2), becomes available. Applying the classifier, trained based on data from subject (1), to the new subset of recordings from subject (2) yields poor results, as we will demonstrate in Section IV-B. Indeed, most methods addressing this particular challenge, as well as related problems, exclusively analyze data from each individual subject separately. By constructing a joint representation for both X (1) and X (2) , which is oblivious to the specific subject, we develop a classifier that is trained on data from one subject and applied to data from another subject without any calibration, i.e., without any labeled data from the new (test) subject.

C. Covariance matrices as data features
As described before, we suggest looking at the data through the lens of covariance matrices. We denote the covariance matrices by: Typically, since the statistics of the data is unknown, we use estimates of the covariance, such as the sample covariance. We note that our approach is applicable to any kind of input data given as SPD matrices. For example, in machine learning, common practice is to use kernels which represent an inner product between features after some nonlinear transformation [15].
By using covariance matrices as data features we enjoy a few key benefits. First, since covariance matrices are computed from data by averaging over time, they tend to be robust to noise. Second, covariance matrices can be seen as a low dimensional representation. Third, they have useful geometric properties and a well-developed Riemannian framework, as described in Section II. Particularly, they have a Riemannian metric (3), facilitating appropriate data samples comparisons, which is a basic ingredient of many analysis and learning techniques. In this work, we build on and extend the latter.
Recently, the usefulness of covariance matrices has been demonstrated in the context of the BCI problem [6]. There, Barachant et al. considered data from a single subject and proposed to project the covariance matrices {P i } of the recordings from each trial (after some whitening) into the tangent plane of the Riemannian mean P , namely compute S i = Log P (P i ). Then, a classifier was trained on the set {S i }. Using this approach, state of the art results for motor imagery task classification were obtained. However, when considering several subsets from multiple domains, such as different sessions or subjects, as reported in [6], the covariance matrices convey a domain-specific content, which in turn poses limitations on task classification. For multiple sessions on different days, Barachant et al. proposed to subtract the Riemannian mean from each subset, namely, to project each subset P (k) = P (k) i to the tangent space at its own mean. Indeed, when the train set and the test set were obtained on different days, this mean normalization improved the task classification rate. However, in the case of multiple subjects, this approach is inadequate. As mentioned before, given recordings from one subject as a train set and recordings from another subject as a test set, the classification of the different mental tasks based on covariance matrices fails completely.
This illuminates the primary challenge addressed in this work -how to build a representation so that any two covariance matrices associated with the same mental task, but from possibly different sessions or subjects, will be given a similar representation. Importantly, since the task labels are unknown, this objective cannot be directly imposed. In the sequel, we exploit the Riemannian geometry of covariance matrices, and devise such a representation in an unsupervised manner by preserving local geometric structures.

D. Formulation
Consider two subsets P (1) and P (2) from two different domains consisting of N 1 and N 2 covariance matrices, respectively. Let P (1) and P (2) be their respective Riemannian means. Let ϕ(t), given explicitly in (2), denote the unique geodesic from P (2) to P (1) such that ϕ(0) = P (2) and be the symmetric matrix (or equivalently, the vector) in the tangent space T P (k) M, obtained by projecting P Our goal now is to derive a new representation Γ S given by the map Γ : live in the same space. This allows us to relate samples from the two subsets, and compute quantities such as S . In addition, we require that the new representation will fulfill the following properties: 1) Zero mean: 2) Inner product preservation: Properties (1) and (2) imply that the new representation Γ preserves inter-sample relations, defined by the inner product. Note that a map Γ satisfying properties (1) and (2) is not unique; for any Γ admitting to properties (1) and (2), the composition R • Γ, where R is an arbitrary rotation within the subspace T P (1) M, satisfies properties (1) and (2) as well. To resolve this arbitrary degree of freedom, we use the geodesic between two points on the SPD manifold, which is unique [12]. Concretely, in property (3), the two intrinsic symmetric matrices (vectors) ϕ (0) ∈ T P (2) M and ϕ (1) ∈ T P (1) M, induced by the velocity of the unique geodesic at the source and destination, are used to fix a rotation and to align the subset Γ S We remark that the above properties imply that the subset Γ S (2) i is embedded in the ·, · P (1) inner product space. In the sequel, we will describe how to circumvent the dependence of the inner product space on P (1) and make the new representation truly Euclidean by pre-whitening the data. Additionally, note that the mean subtraction presented in [6] admits only properties (1)-(2).

E. Domain adaptation
First, we explicitly provide the expression for parallel transport on the SPD cone manifold, and then we use it to define the map Γ.
Lemma 1 (Parallel Transport). Let A, B ∈ M. The PT from B to A of any S ∈ T B M is given by: where E = AB −1 1 2 . This lemma was presented in [1,Eq. 3.4]. The proof of the lemma is given in Appendix A and it is based on [16]. An illustration of the PT on the SPD manifold is presented in Figure 2. Note that the inner products between the three vectors in the figure are preserved under the parallel transport Γ B→A and the appearance could be misleading since the space is not Euclidean.
The proof is given in Appendix B. Theorem 1 sets the stage for domain adaptation. We propose a map Ψ : M → M that adapts the domain of the subset of SPD matrices P (2) to the domain of the subset P (1) . For any P To enhance the geometric insight, we explicitly describe the three steps comprising the construction of Ψ: The implementation of Ψ can be simplified and made more efficient by using the following theorem. In words, the "parallel transport" of an SPD matrix P ∈ M from B to A is given the same transformation applied to S = Log B (P ). Namely, the "parallel transport" of the SPD matrix P from B to A is equal to projecting P to the tangent plane at B, parallel transporting the projection to the tangent plane at A, and then projecting back to the SPD manifold. As a consequence, we show in the sequel that the map Ψ in (8) can be written simply in terms of Γ. The proof of Theorem 2 is given in Appendix C. We note that we present the theorem in a general context, since we did not find such a result in the literature and believe it might be of independent interest.
Theorem 2 enables us to efficiently compute Ψ P (2) i , since it circumvents the computation of the Logarithm and Exponential maps of the SPD matrix in steps 1 and 3 above. Instead, the transformation defined by E is computed only once for the entire set, and (8) can be recast as: where E P (1) P (2) −1 1 2 . Note that this equality is well defined since any tangent plane to the SPD manifold M is the entire space of symmetric matrices [16]. Thus far in the exposition, only the uniqueness of the geodesic curve on the manifold of SPD matrices was exploited, such that the PT along the geodesic admits the property in (6), namely: Γ (ϕ (0)) = ϕ (1). Importantly, PT specifically along the unique geodesic curve exhibits important invariance to the "relative" location on the manifold.

We denote this relation by
Lemma 2. The relation ∼ is an equivalence relation.
The proof is straight-forward as we show in the following.
• Reflexivity is satisfied by setting E to be the identity matrix.
In other words, two pairs are equivalent if the relation of the two matrices in the pair is given by the same transformation Γ. We interpret such equivalent pairs as matrices with equivalent intra-relations (e.g., if (A 1 , (A 2 , B 2 )), but with a different global position on the manifold. For example, each two pairs in Figure 1 are equivalent pairs. Proposition 1. Let (A 1 , B 1 ) be a pair of SPD matrices A 1 , B 1 ∈ M, and let (A 1 , B 1 ) denote the equivalence class of all matrix pairs that are equivalent to (A 1 , B 1 ). Then, for any (A 2 , B 2 ) ∈ (A 1 , B 1 ) : where Γ (P ) = EP E T and E is the transformation defined in Definition 1.
The proof is given in Appendix D. An immediate consequence of Proposition 1 is that the domain adaptation via the representation Ψ is invariant to the relative position of P (1) and P (2) on the manifold, and is constructed equivalently for every pair in the equivalence class (P (1) , P (2) ) .
To demonstrate the importance of the property above, we revisit the illustrating BCI problem. Suppose (P A1 , P B1 ) are the Riemannian means of the covariance matrices of Subject A and Subject B recorded in Session 1, and suppose (P A2 , P B2 ) are the Riemannian means of the covariance matrices of Subject A and Subject B recorded in Session 2. If (P A1 , P B1 ) ∼ (P A2 , P B2 ), then there exists a transformation Γ such that Γ encodes the relation between Session 1 and Session 2 whereas the relation of the two subjects is encoded by Γ P B1 →P A1 or by Γ P B2 →P A2 , depending on the session. Proposition 1 guarantees the consistence of the relation between Subject A and Subject B. Namely, the Riemannian mean of Subject B in Session 2 can be related to the Riemannian mean of Subject A in Session 1 using the relation between the sessions (given by Γ) and the relation between the two subjects (given either by Γ P B1 →P A1 or by Γ P B2 →P A2 ), independently of the relative location of the means on the manifold.

F. Extension to K subsets
Overall, by Theorem 2, for a general number of subsets K ≥ 2, we can apply PT using Ψ (9) directly to the SPD matrices P (k) = P (k) i without projections to and from the tangent plane. LetP denote the Riemannian mean of Riemannian means (centroids) P Each subset P (k) is then parallel transported from its corresponding centroid P (k) toP . Formally, let Γ after applying PT, which is given by This projection, which is further discussed in [6], can be interpreted as (i) data whitening byΓ Algorithm 1 Domain adaptation using Parallel Transport for SPD matrices Input: P is the SPD matrix associated with the i-th element (e.g., highdimensional time-series) in the k-th subset.
is the new representation of P 2) ComputeP , the Riemannian mean of P (k) K k=1 . 3) For all k and all i, apply Parallel Transport using (7):

4)
For all k and all i, project the transported matrix to the tangent space via: We conclude this section with two remarks. First, since the matricesS (k) i are symmetric, only their upper (or lower) triangular part with a gain factor of √ 2 applied to all non-diagonal elements could be taken into account. Second, alternative choices ofP could also be used, for example, the identity matrix. Indeed, recently [11] proposed to align datasets for transfer learning in a similar context using the identity matrix asP . However in [11], the alignment appeared as an empirical affine transformation, whereas in this work, we provide the geometrical justification and rigorous mathematical analysis.
Specifically, in the case of two subsets with means P A ∈ M and P B ∈ M, the affine transformation presented in [11] can be interpreted as two consecutive applications of PT: from P B to I and then from I to P A . The arbitrary choice of I as an intermediate point introduces dependence of the algorithm on the global position on the manifold. Indeed, such a procedure, which can be expressed by Γ I→P A • Γ P B →I does not admit the invariance property specified in Proposition 1.
Interestingly, the method proposed in [11] coincides with the present work, namely, Γ P B →P A = Γ I→P A •Γ P B →I when the identity matrix I is on the geodesic ϕ between P B and P A . In this case, the matrices P A and P B commute and they have the same eigenvectors (see Appendix E). From a data analysis perspective, when P A and P B are two covariance matrices of two subsets, this implies that the subsets have the same principal components.
We setP as the Riemannian mean of the centroids so that the overall transport applied to the covariance matrices is minimal. This choice is motivated by the assumption that transporting accumulates distortions. This is a straight-forward generalization of the two subsets case, where the parallel transport is carried out along the shortest path (unique geodesic curve).

IV. EXPERIMENTAL RESULTS
In this section we show the results of Algorithm 1 for both a synthetic example and for real data.

A. Toy Problem
We generate time series in R 2 , so that their covariance matrices are in R 2×2 . Since the covariance matrices are symmetric, this particular choice enables us to visualize them in R 3 . Concretely, any 2 × 2 symmetric matrix A = x y y z can be visualized in R 3 using (x, y, z) ∈ R 3 . A is positivedefinite if and only if: x, z > 0 and y 2 < xz. These conditions establish the cone manifold of 2×2 SPD matrices.
Consider the set of hidden multi-dimensional times series which depends only on φ i , and therefore, when presenting the population covariances of the time-series s i [n] in R 3 , two coordinates are fixed and only one varies with i. We generate two observable subsets, where M (1) is randomly chosen, and M (2) = 1.5 −1 0 0 1 M (1) . The two subsets X (1) and X (2) can be viewed as two different observations of s i [n] through two unknown observation functions M (1) and M (2) . For example, X (1) and X (2) can represent two different batches, and M (1) and M (2) can represent the discrepancy between two different sessions of a particular experiment. For each x (k) i [n], we compute its sample covariance matrix by where P si denotes the inaccessible sample covariance of s i [n], which is given by: Our goal is to obtain a new representation of the observed data both in X (1) and X (2) , which circumvents the effect of M (1) and M (2) . Moreover, in the new representation, we aspire to associate two observations from possibly different subsets which have a similar initial phase φ i . In Figure 3 we plot the 2×2 SPD matrices in R 3 , where the black points mark the boundaries of the cone manifold. The red line marks the center of the cone, given by αI for α ∈ [0, 2], and the blue point on the red line indicates the identity matrix I, namely, where α = 1. Figure 3(a) presents the two subsets P (k) = P (k) i , k = 1, 2 of accessible sample covariance matrices, colored by φ i (left) and by k (right). We observe that the two subsets P (1) and P (2) are completely separated, while each subset has a similar structure governed by the values of φ i . We apply Steps (1)-(3) of Algorithm 1 to the subsets P (k) to obtain Γ (k) i . Figure 3 , colored by φ i (left) and k (right). Now we observe that in the new representation, the two subsets are aligned, namely the discrepancy caused by M (1) and M (2) is removed while the intrinsic structure given by φ i is preserved. As a result, we can associate covariance matrices from different batches but with similar underlying φ i values. Note that this was accomplished by Algorithm 1 in a completely unsupervised manner, without access to the hidden "labels" φ i . obtained by Algorithm 1 colored by φ i (left) and by k (right). Note that in the new representation (b), the two subsets are aligned, namely the discrepancy caused by M (1) and M (2) is removed while the intrinsic structure given by φ i is preserved.

B. BCI -Motor Imagery
As described in Section III, we use data from the BCI competition IV [17]. The dataset contains EEG recordings acquired by 22 EEG electrodes from 9 subjects, where the data from each subject was recorded on 2 different days of experiments. The experiment protocol consists of repeated trials, where in each trial the subject was asked to imagine performing one out of four possible movements: (i) right hand, (ii) left hand, (iii) both feet, and (iv) tongue. Overall, in a single day, each movement was repeated 72 times by each subject, and therefore, the dataset contains 288 trials from each subject in each day of experiments.
We remark that all the algorithms participating in the competition reported on poor classification results for particular four subjects. Since our goal is not to improve the classification of the data from each subject, we excluded these four subjects (indexed 2,4,5,6).
Initially, focusing on the data from a single subject, we show that Algorithm 1 builds a representation of the data which enables us to train a classifier with data from one day of experiments and apply it to data from the other day of experiments. Then, we further show that Algorithm 1 builds a representation that allows us to train a classifier based on data from one subject and apply it to data from a different subject without any additional labeled trials. Finally, we extend the latter result and show the performance on multiple subjects.
1) Single Subject -Different Days: In the first experiment, we process the recordings of Subject 8 (arbitrarily chosen) from the two days of experiments. We report that the results for the other 4 subjects were similar. We denote the subsets of trial recordings from day k = 1, 2 by From the recordings of each trial i, we compute the sample covariance matrix P To highlight the challenge, we first computeP , the Riemannian mean of all covariance matrices (from both subsets). Then, we project the matrices onto TP M by computing B . For visualization purpose, we apply tSNE [18] to the vectors B  = 1, 2), and on the right, the points are colored according to the mental task. We observe that, similarly to the toy problem, the recordings from the different days are completely separated. This implies that one cannot train a classifier from the recordings from day 1 and apply it to the recordings from day 2.
We apply Algorithm 1 to the subsets P (k) covariance matrices and obtain the subsetsS (k) = S (k) i . Figure4 (b) presents the two dimensional representation of the vectorsS (k) obtained by the tSNE algorithm. On the left, the points are colored according to the different days, and on the right, the points are colored according to the mental task. We observe that the difference between the different days of experiments is completely removed. More importantly, we further observe that the new representations of the two subsets are aligned, i.e., we obtain similar representations of two recordings associated with the same mental task, regardless of their respective sessions (days of experiments).
2) Two Subjects: We repeat the evaluation, but now with the two subsets X (k) , k = 1, 2 which were recorded from two subjects, specifically, Subject 3 and Subject 8. We repeat the steps from the previous examination. We computeP , the Riemannian mean of all covariance matrices (from both subsets). Then, we project the covariance matrices onto TP M and apply tSNE to obtain two dimensional representations. Figure 5 (a) presents the two dimensional representations obtained by the tSNE algorithm. On the left, the points are colored by the subject index, and on the right the points are colored according to the mental task. Similarly to the singlesubject two-sessions case, we observe that the recordings from different subjects are completely separated in the obtained representation.
We also apply the mean transport approach presented in [6]. The mean transport is obtained by projecting each subset of covariance matrices P (k) to its own tangent plane T P (k) M, where P (k) is the Riemannian mean of the k-th subset. In other words, we compute S (k) . Figure 5 (b) presents the two dimensional representation obtained by the tSNE algorithm. We observe that indeed the two subsets are not separated as in Figure 5(a), however, the inner structure of each subset was not preserved. Thus, this scheme is insufficient and does not support training a classifier based on data from one subject and applying it to data from another subject.
Finally, we apply Algorithm 1 to the subsets P (k) , and obtain the subsetsS (k) = S (k) i . Figure 5(c) presents the two dimensional representation of the subsetsS (k) obtained by the tSNE algorithm. We observe that in this representation, the subsets are not separated. Moreover, the two subsets are aligned according to the mental tasks, and indeed points that correspond to the same mental task assumed a similar value in the new representation. This new representation allows us to train a classifier using recordings from one subject and apply it to recordings from another subject.
3) Multiple Subjects: In the third experiment, we apply Algorithm 1 to multiple subjects. We processed data from five subjects (indexed 1,3,7,8 and 9) and from all trials corresponding to only three of the mental tasks: left hand, foot, and tongue. Namely, we omit one class for visualization purposes. We denote the subsets of the recordings by 1, 3, 7, 8, 9. As before, we compute the covariance matrices P (k) = P (k) i . We computeP , the Riemannian mean of all covariance matrices. Then, we project the covariance matrices onto TP M and apply tSNE to obtain a two dimensional representation. Figure 6 (a) presents the two dimensional representation obtained by the tSNE algorithm. On the left, the points are colored by the subject index, and on the right the points are colored according to the mental task. As before, we observe that the recordings from different subjects are completely separated.
Next, we apply Algorithm 1 to the five subsets of covariance matrices P (k) and obtain five subsets of new representations S (k) = S (k) i . Figure 5 (b) presents the two dimensional representation of the subsetsS (k) obtained by the tSNE algorithm. We observe that also in the multiple subjects scenario, Algorithm 1 was able to center and align the subsets.
To provide quantitative results, we apply a leave-onesubject-out cross validation, namely, we trained a classifier based on 4 out of the 5 subjects and evaluated the classification accuracy for each one of the three methods mentioned in Section IV-B2. We compare the classification accuracy of Algorithm 1 to the two other approaches denoted by: (i) "Baseline (No Transport)", and (ii) the "Mean Transport" approach proposed in [6]. Figure 7 presents the classification accuracy obtained using the three competing methods as a function of the evaluated subject. In all cases, applying Algorithm 1 to the data dramatically improved the classification accuracy. In addition, Figure  8 presents the confusion matrices when Subject 3 was tested. Figure 8 (left) presents the confusion matrix obtained from the data without applying any transportation ("Baseline"). Figure  8 (center) presents the confusion matrix obtained by the "Mean Transport" approach. Figure 8 (right) presents the confusion matrix obtained by Algorithm 1. We observe that Algorithm 1 obtains significantly better classification results compared with Fig. 7. The classification accuracy obtained by the three competing methods.  the "Baseline" and "Mean Transport" algorithms. In addition, the confusion matrices highlight the challenge in training a classifier from multiple subject data. For example, in both confusion matrices in Figure 8 (left) and (center), the "foot" mental task falsely dominated the prediction (it was predicted 183 and 164 times, respectively, whereas is was performed only 72 times). This implies that the decision regions of the classifiers are completely misaligned with the data from a new unseen subject.

C. Sleep Stage Identification
Here, we demonstrate the applicability of Algorithm 1 to real medical signals. Specifically, we address the problem of sleep stage identification. Typically, for this purpose, data are collected in sleep clinics with multiple multimodal sensors, and then, analyzed by a human expert. There are six different sleep stages: awake, REM, and sleep stages 1-4, indicating shallow to deep sleep.
The data we used is available online in [19] and described in detail in [20]. A single patient's night recording contains several measurements including two EEG channels and one electrooculography (EOG) channel sampled at 100[Hz]. We used recordings from three subjects. We split each subject's night into non-overlapping 30 seconds windows. We omit the awake and sleep stage 4 windows due to too few occurrences. For visual purposes, we kept only windows corresponding to REM and stage 3.
We denote the i-th window of the k-th subject by x i (t) with its corresponding covariance matrix P (k) i ∈ R 3×3 . We first computeP , the Riemannian mean of all covariance matrices (from the three subsets). Then, we project the matrices onto TP M by computing S . For visualization purposes, we apply PCA to the vectors S (k) i and present the first two principle components 1 . Figure 9 (a) presents the two dimensional representation of the vectors obtained by PCA. Namely, each point in the figure 1 Since the covariance matrices in this experiment are of size 3 × 3, dimension reduction using PCA was sufficient. It was preferred here over tSNE since it preserves the global geometry of the data.
is the representation of a vector S    subjects. We apply Algorithm 1 to the three subsets P (k) i of covariance matrices and obtain the subsets S (k) i . Figure 9 (b) presents the two dimensional representation of the vectors S k) i obtained by PCA. On the left, the points are colored according to the different subjects, and on the right, the points are colored according to the sleep stage. Now we observe that the data is clustered according to the sleep stage while the difference between the three subjects is completely removed.
As in the Subsection IV-B3, to provide quantitative results, we train a classifier based on Subject 1 and Subject 2 and evaluate the classification accuracy on Subject 3. Figure 10 presents the obtained confusion matrices. Figure 10 (left) presents the confusion matrix obtained from the data without applying any adaptation ("Baseline"). Figure 10 (center) presents the confusion matrix obtained by the "Mean Transport" approach. Figure 10 (right) presents the confusion matrix obtained by Algorithm 1. We observe that using Algorithm 1 demonstrates better classification results compared with the "Baseline" and the "Mean Transport" algorithms.
V. CONCLUSIONS Analyzing complex data in high-dimension is challenging, since such data do not live in a Euclidean space. Therefore, basic operations such as comparisons, additions, and subtractions, which are the basis of any analysis and learning technique, do not necessarily exist and are not appropriately defined. In this work, we propose to view the complex data through the lens of SPD matrices, which reside on an analytic Riemannian manifold. Using the Riemannian geometry of SPD matrices, we presented an approach for multi-domain data representation. Based on this new representation, we proposed an algorithm for domain adaptation. We extend the existing results in the Riemannian geomery of SPD matrices and establish a framework for the justification and analysis of the proposed solution. We demonstrated the usefulness of the presented domain adaptation method in applications to simulation and real recorded data. Proof. The PT of S along the geodesic between B and A is given by [16]: For better readability we denote A = P (1) and B = P (2) .
First we remark that Γ B→A is well defined since T P M is the space of all symmetric matrices regardless the matrix P 2 , and if the input S is symmetric then by definition Γ B→A (S) is symmetric as well.
Proof of Theorem 1. Condition (1) is immediate since Γ is a linear operation, and therefore we have Conditions (2) and (3) Namely, A −1 E is a symmetric matrix. Thus, we get that and finally, we obtain Proof of condition (3): Let B ∈ M be an SPD matrix with the following spectral decomposition: Consider the geodesic ϕ(t) from B to A: Thus, its velocity at t = 0 is given by and similarly, the velocity at t = 1 is given by where in ( * ) we pull out V = B from the log , since it is a scalar function: log V P V −1 = V log (P ) V −1 . Let U be the following unitary matrix Thus, B has the same eigenvectors as A and they commute

APPENDIX F RIEMANNIAN MEAN ALGORITHM
Algorithm 2 Riemannian mean for SPD matrices as presented in [6] Input: a set of SPD matrices {P i ∈ M} N i=1 . Output: the Riemannian mean matrix P .
1) Compute the initial term P = 1 N N i=1 P i 2) do a) Compute the Euclidean mean in the tangent space: where · F is the Frobenius norm.