Joint Embedding of Graphs

Feature extraction and dimension reduction for networks are critical in a wide variety of domains. Efficiently and accurately learning features for multiple graphs has important applications in statistical inference on graphs. We propose a method to jointly embed multiple undirected graphs. Given a set of graphs, the joint embedding method identifies a linear subspace spanned by rank one symmetric matrices and projects the adjacency matrices of the graphs into this subspace. The projection coefficients can be treated as features of the graphs, while the embedding components can represent vertex features. We also propose a random graph model for multiple graphs that generalizes other classical models for graphs. We show through theory and numerical experiments that, under the model, the joint embedding method produces estimates of parameters with small errors. Via simulation experiments, we demonstrate that the joint embedding method produces features which lead to state-of-the-art performance in classifying graphs. Applying the joint embedding method to human brain graphs, we find it extracts interpretable features with good prediction accuracy in different tasks.


INTRODUCTION
In many problems arising in science and engineering, graphs occur naturally as a data structure that captures complex relationships between a set of objects. Graphs have been used in application domains as diverse as social networks [1], internet mapping [2], brain connectomics [3], political voting networks [4], and many others. Graphs are naturally high-dimensional objects with complicated topological structure, which makes graph clustering and classification challenging for traditional machine learning algorithms. Therefore, feature extraction and dimension reduction techniques are helpful in applications that learn from graph data. In this paper, we propose an algorithm to jointly embed multiple graphs into a low-dimensional space. We demonstrate through theory and experiments that the joint embedding algorithm produces features which lead to state-of-the-art performance for subsequent inference tasks on graphs.
There exist a few unsupervised approaches to extract features from graphs. First, classical Principal Component Analysis can be applied by treating each edge of a graph as a raw feature [5]. This approach produces features which are linear combinations of edges, but it ignores the topological structure of graphs, and the features extracted are not easily interpretable. Second, features can be extracted by computing summary topological and label statistics from graphs [6], [7]. These statistics commonly include the number of edges, number of triangles, average clustering coefficient, maximum effective eccentricity, etc. In general, it is hard to know a priori which intrinsic statistics to compute, and computing some statistics can be computationally expensive. Third, many frequent subgraph mining algorithms have been developed [8]. For example, the fast frequent subgraph mining algorithm can identify all connected subgraphs that occur in a large fraction of graphs in a graph data set [9]. Finally, spectral feature selection can also be applied to graphs. It treats each graph as a node and constructs an object graph based on a similarity measure; features are computed through the spectral decomposition of this object graph [10]. Adjacency Spectral Embedding (ASE) and Laplacian Eigenmap (LE) have been proposed to embed a single graph observation [11], [12]. The inference task considered in these papers is learning the block structure of the graph or clustering vertices. Given a set of graphs {G_i = (V_i, E_i)}_{i=1}^m, ASE and LE need to embed the adjacency matrix or Laplacian matrix of each G_i individually, and there is no easy way to combine the multiple embeddings. The joint embedding method considers the set of graphs together. It takes a matrix factorization approach to extract features for multiple graphs. The algorithm manages to simultaneously identify a set of rank one matrices and project the adjacency matrices into the linear subspace spanned by this set of matrices.
The joint embedding can be understood as a generalization of ASE to multiple graphs. We demonstrate through simulation experiments that the joint embedding algorithm extracts features which lead to good performance for a variety of inference tasks. In the next section, we review some random graph models and present a model for generating multiple random graphs. In Section 3, we define the joint embedding of graphs and present an algorithm to compute it. In Section 4, we perform some theoretical analyses of the joint embedding. Simulation and real data experiments are presented in Section 5. We conclude the paper with a brief discussion of implications and possible future work.

SETTING
We focus on embedding unweighted and undirected graphs for simplicity, although the joint embedding algorithm works on weighted graphs and, with some modifications, on directed graphs.
Let {G_i = (V_i, E_i)}_{i=1}^m be m graphs, each with n vertices, and let A_i be the adjacency matrix of graph G_i. The vertices of these graphs should be matched, which means that all the graphs have a common vertex set V. The joint embedding algorithm embeds all G_i simultaneously into R^d and represents G_i by a vector λ_i ∈ R^d. Before discussing the joint embedding algorithm, we need a random graph model on multiple graphs, on which the theoretical analysis is based. Let us first recall a model on a single graph: the Random Dot Product Graph [13].
Definition Random Dot Product Graph (RDPG). Let X ⊂ R^d be a set such that x^T y ∈ [0, 1] for all x, y ∈ X. Let X = [x_1, ..., x_n]^T ∈ X^n be an n × d matrix, and given X, suppose that A is a random symmetric n × n adjacency matrix such that the entries A[u, v] ∼ Bernoulli(x_u^T x_v) independently for u < v. Also, define P := XX^T to be the edge probability matrix. We write A ∼ RDPG(X) to denote the distribution of a random dot product graph with latent positions X. When the rows of X are not fixed, but instead are random variables with a distribution F on X, (X, A) ∼ RDPG(F) denotes the distribution of a random dot product graph with latent positions distributed according to F.
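For concreteness, sampling from this model can be sketched in a few lines of NumPy. The helper `sample_rdpg` and the two-block example are ours, not from the paper; the code assumes the latent positions give valid probabilities x_u^T x_v ∈ [0, 1].

```python
import numpy as np

def sample_rdpg(X, rng=None):
    """Sample an undirected RDPG adjacency matrix from latent positions X (n x d).

    Hypothetical helper: assumes P = X X^T has entries in [0, 1].
    Self-loops are excluded here.
    """
    rng = np.random.default_rng(rng)
    P = X @ X.T                                   # edge probability matrix
    U = rng.random(P.shape)
    A = (np.triu(U, 1) < np.triu(P, 1)).astype(int)  # upper triangle only
    return A + A.T                                # symmetrize, zero diagonal

# Example: two groups of vertices with two distinct latent positions
X = np.vstack([np.tile([0.8, 0.1], (5, 1)),
               np.tile([0.1, 0.8], (5, 1))])
A = sample_rdpg(X, rng=0)
```

With these positions, within-group edge probabilities are 0.65 and between-group probabilities are 0.16.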
The RDPG is a convenient model designed to capture the relationships between the vertices of a graph using latent positions, and some extensions have been proposed to capture more general connectivity structures [14]. Moreover, other popular models, including the stochastic block model (SBM) [15] and the mixed membership SBM [16], are special cases of the RDPG and its generalizations. The RDPG can be further generalized to a Latent Position Graph by replacing the inner product with a kernel [17]. The Adjacency Spectral Embedding of an RDPG adjacency matrix is well studied [18]. Next, we propose a new random graph model which generalizes the RDPG to multiple graphs.
Definition Multiple Random Eigen Graphs (MREG). Let h_1, ..., h_d ∈ R^n be unit vectors and λ_1, ..., λ_m ∈ X ⊂ R^d. We say {A_i}_{i=1}^m ∼ MREG(λ_1, ..., λ_m, h_1, ..., h_d) if the entries of A_i are independent Bernoulli random variables, A_i[u, v] ∼ Bernoulli(P_i[u, v]), where

P_i = Σ_{k=1}^d λ_i[k] h_k h_k^T

is defined to be the edge probability matrix for graph i. In cases where {λ_i}_{i=1}^m are of primary interest, they are treated as parameters. When the {λ_i}_{i=1}^m are random variables, F denotes their distribution on X, and the model is expressed as (λ_i, A_i) ∼ MREG(F, h_1, ..., h_d). When the entries of λ_i are nonnegative, X_i = HΛ_i^{1/2}, with H = [h_1, ..., h_d] and Λ_i = diag(λ_i), gives the latent positions of graph i; hence, on a single graph, RDPG and MREG are equivalent whenever the edge probability matrix is positive semidefinite. In MREG, we allow self-loops, mainly for theoretical convenience. Next, we introduce another random graph model, the Stochastic Block Model [15], which generalizes the Erdos-Renyi (ER) model [19]; the ER model corresponds to an SBM with only one block. The SBM is a widely used model for studying the community structure of a graph [20], [21].
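Sampling from MREG differs from RDPG only in that each graph gets its own loading vector. The sketch below is our own illustration, assuming each P_i has entries in [0, 1]; self-loops are kept, matching the definition.

```python
import numpy as np

def sample_mreg(Lambda, H, rng=None):
    """Sample m graphs from MREG(lambda_1..lambda_m, h_1..h_d).

    Lambda: m x d loadings; H: n x d with unit-norm columns h_k.
    A sketch only: assumes P_i = sum_k Lambda[i, k] h_k h_k^T is in [0, 1].
    """
    rng = np.random.default_rng(rng)
    graphs = []
    for lam in Lambda:
        P = (H * lam) @ H.T                       # sum_k lam[k] h_k h_k^T
        U = rng.random(P.shape)
        A = np.triu((U < P).astype(int))          # upper triangle incl. diagonal
        graphs.append(A + A.T - np.diag(np.diag(A)))  # symmetric, loops allowed
    return graphs

# Two graphs sharing components h_1, h_2 but with different loadings
H = np.column_stack([np.ones(6), [1, 1, 1, -1, -1, -1]]) / np.sqrt(6)
Lambda = np.array([[0.9, 0.6],
                   [0.5, 0.3]])
graphs = sample_mreg(Lambda, H, rng=1)
```

Here the two graphs differ in expected density while sharing the same latent structure, which is exactly the flexibility MREG adds over i.i.d. RDPG sampling.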
Definition Stochastic Block Model (SBM). Let π be a prior probability vector for block membership, lying in the unit (K−1)-simplex. Denote by τ = (τ_1, τ_2, ..., τ_n) ∈ [K]^n the block membership vector, where τ is a multinomial sequence with probability vector π. Denote by B ∈ [0, 1]^{K×K} the block connectivity probability matrix. Suppose A is a random symmetric adjacency matrix given by A[u, v] ∼ Bernoulli(B[τ_u, τ_v]) independently for u < v. Then A is an adjacency matrix of a K-block stochastic block model graph, and the notation is A ∼ SBM(π, B). Sometimes τ may also be treated as the parameter of interest; in this case, the notation is A ∼ SBM(τ, B).
The top panel of Figure 1 shows the relationships between the three random graph models defined above and the ER model on 1 graph. The models considered are those conditioned on latent positions; that is, τ, X and λ in SBM, RDPG and MREG respectively are treated as parameters; furthermore, loops are ignored in MREG. If an adjacency matrix A ∼ SBM(τ, B) and the block connectivity matrix B is positive semidefinite, A can also be written as an RDPG(X) with X having at most K distinct rows. If an adjacency matrix A ∼ RDPG(X), then it is also a 1-graph MREG(λ_1, h_1, ..., h_d) with h_k being the normalized kth column of X and λ_1 being the vector containing the squared norms of the columns of X. However, a 1-graph MREG(λ_1, h_1, ..., h_d) is not necessarily an RDPG graph, since λ_1 could contain negative entries, which may result in an indefinite edge probability matrix.
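The SBM-to-RDPG construction can be checked numerically: factor a positive semidefinite B as L L^T and assign each vertex the row of L indexed by its block, giving at most K distinct latent positions. The matrix B and memberships τ below are our own toy values.

```python
import numpy as np

# Writing a K-block SBM with psd B as an RDPG: x_u = L[tau_u] where B = L L^T.
B = np.array([[0.5, 0.2],
              [0.2, 0.4]])              # positive semidefinite block matrix
w, V = np.linalg.eigh(B)
L = V * np.sqrt(np.clip(w, 0, None))    # B = L L^T
tau = np.array([0, 0, 0, 1, 1])         # block memberships
X = L[tau]                              # latent positions, at most K distinct rows
P = X @ X.T                             # recovers B[tau_u, tau_v] entrywise
```

Every pair of vertices then has edge probability B[τ_u, τ_v], so the SBM and the constructed RDPG coincide.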
The bottom panel of Figure 1 shows the relationships between the models on multiple graphs. For RDPG, the graphs are sampled i.i.d. with the same parameters. MREG has the flexibility to let λ differ across graphs, which leads to a more general model for multiple graphs. A d-dimensional MREG can represent any SBM with K ≤ d blocks in which the block memberships are the same across all the graphs in the population but the connectivity matrices B_i possibly differ across graphs, a common assumption in modeling multilayer and time-varying networks [22], [23]. Other models in this setting also allow some vertices of the SBM to change their block memberships over time [22], [24]; in this case, the MREG can still represent those models, but the dimension d may need to increase. In fact, it turns out that if d is allowed to be as large as n(n−1)/2, MREG can represent any distribution on binary graphs, including distributions in which edges are not independent (Theorem 2.1). This implies that MREG is really a semi-parametric model which can capture any distribution on graphs. One can model any set of graphs by MREG with the guarantee that the true distribution is in the model for d large enough. However, in practice, a smaller d may lead to better inference performance due to the reduction in dimensionality.
In the next section, we consider the joint embedding algorithm which can be understood as a parameter estimation procedure for MREG.

Joint Embedding of Graphs
The joint embedding method considers a collection of vertex-aligned graphs, and estimates a common embedding space across all graphs together with a loading for each graph. Specifically, it simultaneously identifies a subspace spanned by a set of rank one symmetric matrices and projects each adjacency matrix A_i into this subspace. The coefficients obtained by projecting A_i are denoted by λ̂_i ∈ R^d, which is called the loading for graph i. To estimate the rank one symmetric matrices and the loadings for the graphs, the algorithm minimizes the sum of squared Frobenius distances between the adjacency matrices and their projections, as described below.
Definition Joint Embedding of Graphs (JE). Given m graphs {G_i}_{i=1}^m with A_i being the corresponding adjacency matrix, the d-dimensional joint embedding of graphs {G_i}_{i=1}^m is given by

(λ̂_1, ..., λ̂_m, ĥ_1, ..., ĥ_d) = argmin_{λ_i; ||h_k||=1} Σ_{i=1}^m ||A_i − Σ_{k=1}^d λ_i[k] h_k h_k^T||².   (1)

Figure 1: Relationships between random graph models on 1 graph and multiple graphs. The top panel shows the relationships between the random graph models on 1 graph. The models considered are those conditioned on latent positions; that is, τ, X and λ in SBM, RDPG and MREG respectively are treated as parameters. ER is a 1-block SBM. If a graph follows an SBM with a positive semidefinite edge probability matrix, it also follows the RDPG model. Any SBM or RDPG graph can be represented by a d-dimensional MREG model with d less than or equal to the number of blocks or the dimension of the RDPG. On one graph, the inhomogeneous ER (IER), n-dimensional MREG and n-block SBM are equivalent. The bottom panel shows the relationships between the random graph models on multiple graphs. The models considered are those conditioned on latent positions, and for ER, SBM and RDPG the graphs are sampled i.i.d. with the same parameters. In this case, MREG has the flexibility to let λ differ across graphs, which leads to a more general model for multiple graphs.
Here, || · || denotes the Frobenius norm and λ_i[k] is the kth entry of the vector λ_i.
To make sure that the model is identifiable and to avoid an arbitrary scaling factor, h_k is required to have norm 1. In addition, {h_k h_k^T}_{k=1}^d must be linearly independent to avoid identifiability issues in estimating λ_i; however, {h_k}_{k=1}^d need not be linearly independent or orthogonal. To ease notation, let us introduce two matrices Λ ∈ R^{m×d} and H ∈ R^{n×d}, where λ_i is the ith row of Λ and h_k is the kth column of H. Equation (1) can then be rewritten using Λ and H as

min_{Λ, H} Σ_{i=1}^m ||A_i − Σ_{k=1}^d Λ_{ik} h_k h_k^T||².

Denote the objective by f(Λ, H), which is explicitly a function of the λ_i and h_k. There are several alternative ways to formulate the problem. If the vector λ_i is converted into a diagonal matrix D_i ∈ R^{d×d} by putting the entries of λ_i on the diagonal of D_i, then solving equation (1) is equivalent to solving

min Σ_{i=1}^m ||A_i − H D_i H^T||².

Equation (1) can also be viewed as a tensor factorization problem. If {A_i}_{i=1}^m are stacked into a 3-D array A ∈ R^{m×n×n}, then solving equation (1) is also equivalent to

min ||A − Σ_{k=1}^d Λ_{*k} ⊗ h_k ⊗ h_k||²,

where ⊗ denotes the tensor (outer) product and Λ_{*k} is the kth column of Λ. It is well known in the tensor factorization community that the solution to equation (1) may not exist for d ≥ 2. This phenomenon was first found by Bini et al. [25], and de Silva and Lim give a characterization of all such tensors in the order-3 rank-2 case [26]. Although a global minimum may not exist, finding a local solution in a compact region still provides significant insights into the data. We design an algorithm which is guaranteed to converge, and provide analysis in the d = 1 case. The joint embedding algorithm assumes the graphs are vertex-aligned and undirected. Vertex-aligned graphs are common in many applications of interest such as neuroimaging [27], multilayer networks [28], and time-varying graphs [7]. In case the graphs are not aligned, graph matching should be performed before the joint embedding [29], [30].
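The criterion in equation (1) is compact enough to state directly in code. Below is a minimal NumPy sketch of the objective f(Λ, H); the function name `je_objective` and the toy example are ours, not the paper's implementation.

```python
import numpy as np

def je_objective(As, Lambda, H):
    """f(Lambda, H): sum over graphs of the squared Frobenius distance
    between A_i and its projection sum_k Lambda[i, k] h_k h_k^T.
    Lambda is m x d; the columns of H (n x d) are the components h_k."""
    return sum(np.linalg.norm(A - (H * lam) @ H.T) ** 2
               for A, lam in zip(As, Lambda))

# Toy check: a graph that is exactly rank one is reconstructed perfectly
h = np.ones(4) / 2                      # unit vector
A = 2.0 * np.outer(h, h)
H = h.reshape(-1, 1)
Lambda = np.array([[2.0]])
```

With the correct loading the objective is 0; with loading 1 instead of 2 it equals ||h h^T||² = 1, since h has unit norm.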
Mis-alignment of some vertices will have adverse effects on estimating the corresponding latent positions in H; however, a small number of mis-aligned vertices should not have a big impact on estimating Λ. If the graphs have weighted edges, the joint embedding can still be applied. Also, the MREG model can be easily extended to weighted graphs by replacing the Bernoulli distribution with another proper distribution. In fact, in the experiment of Section 5.3, the graphs are weighted, with edge weights given by the log of fiber counts across brain regions. In the case of directed graphs, to apply the joint embedding one can symmetrize the graphs by removing the direction of edges. Alternatively, h_k h_k^T in equation (1) can be replaced by h_k g_k^T, with h_k and g_k representing the out and in latent positions respectively. With this modification, equation (1) becomes the tensor factorization problem considered in [31].
The optimization problem in equation (1) is similar to Principal Component Analysis (PCA) in the sense of minimizing squared reconstruction error to recover loadings and components [5]. However, there are extra symmetry and rank constraints on the components. Specifically, if h_k h_k^T is replaced by an unconstrained symmetric matrix S_k in equation (1), the resulting problem (2) can be solved by applying PCA to the vectorized adjacency matrices. In that case, there is an S_k to estimate for each latent dimension, which has n(n+1)/2 parameters. Compared to PCA, the joint embedding estimates a rank one matrix h_k h_k^T for each latent dimension, which has n parameters, and h_k can be treated as latent positions for the vertices; the trade-off is that the joint embedding yields a larger approximation error due to the extra constraints. Similar optimization problems have also been considered in the simultaneous diagonalization literature [32], [33]. The difference is that the joint embedding estimates an n-by-d matrix H by minimizing reconstruction error, instead of finding an n-by-n non-singular matrix that attempts to simultaneously diagonalize all matrices. The problem in equation (1) has considerably fewer parameters to optimize, which makes it more stable and applicable with n moderately large. In the case of embedding only one graph, the joint embedding is equivalent to the Adjacency Spectral Embedding solved by singular value decomposition [11]. Next, we describe an algorithm to optimize the objective function f(Λ, H).

Alternating Descent Algorithm
The joint embedding of {G_i}_{i=1}^m is estimated by solving the optimization problem in equation (1). A few methods have been proposed to solve similar problems. Alternating least squares (ALS) is a popular one [31], [34], but it often ignores the symmetry constraints. Gradient approaches have also been considered for similar problems [35], [36]. We develop an alternating descent algorithm to minimize f(Λ, H) that combines ideas from both approaches [37]. The algorithm can also be understood as a block coordinate descent method with Λ and H being the two blocks [38], [39]. The algorithm iteratively updates one of Λ and H while treating the other parameter as fixed. Optimizing Λ with H fixed is straightforward, since it is essentially a least squares problem. However, optimizing H with Λ fixed is hard, because the problem is non-convex and no closed-form solution is available. In this case, the joint embedding algorithm utilizes gradient information and takes an Armijo backtracking line search strategy to update H [40].
Instead of optimizing all columns of Λ and H simultaneously, we consider a greedy algorithm which solves the optimization problem by only considering one column of H at a time. Specifically, the algorithm fixes all estimates for the first k_0 − 1 columns of Λ and H at iteration k_0, and then the objective function is minimized by searching only over the k_0th column of Λ and H. That is,

(Λ̂_{*k_0}, ĥ_{k_0}) = argmin_{Λ_{*k_0}; ||h_{k_0}||=1} Σ_{i=1}^m ||R_{ik_0} − Λ_{ik_0} h_{k_0} h_{k_0}^T||²,   (3)

where R_{ik_0} = A_i − Σ_{k=1}^{k_0−1} Λ̂_{ik} ĥ_k ĥ_k^T denotes the residual of graph i after removing the first k_0 − 1 components. When m = 1, by the Eckart-Young theorem, the solution (Λ̂_{1k_0}, ĥ_{k_0}) is given by the leading eigenvalue and eigenvector of the residual, and hence the solution obtained by our greedy approach coincides with the eigendecomposition of A_1. If m > 1 there is no closed-form solution in general, so we develop an optimization method next.
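The m = 1 claim can be verified numerically: for a symmetric matrix, the best rank-one approximation in Frobenius norm is given by the eigenpair of largest absolute eigenvalue (Eckart-Young). The random symmetric test matrix below is our own example.

```python
import numpy as np

# Best rank-one symmetric fit of a symmetric A is lam * h h^T with
# (lam, h) the eigenpair of largest |eigenvalue|.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2                       # symmetric test matrix
w, V = np.linalg.eigh(A)
top = np.argmax(np.abs(w))
lam, h = w[top], V[:, top]
err_top = np.linalg.norm(A - lam * np.outer(h, h))

# Any other eigenpair gives a worse rank-one fit
other = np.argmin(np.abs(w))
err_other = np.linalg.norm(A - w[other] * np.outer(V[:, other], V[:, other]))
```

In the eigenbasis, the squared error of the top eigenpair equals the sum of the squares of the remaining eigenvalues, which is the smallest achievable among rank-one symmetric fits.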
Let f(Λ_{*k_0}, h_{k_0}) denote the sum being minimized in equation (3). To compute a d-dimensional joint embedding (Λ̂, Ĥ), the algorithm iteratively finds one component at a time by solving the one-dimensional embedding problem (3). Once a new component k_0 is found, the matrix of loadings Λ̂ is updated by minimizing the objective of equation (1) with the first k_0 components fixed at ĥ_1, ..., ĥ_{k_0}, resulting in the least squares problem

Λ̂ = argmin_Λ Σ_{i=1}^m ||A_i − Σ_{k=1}^{k_0} Λ_{ik} ĥ_k ĥ_k^T||².   (4)

This step finds the projection of the graphs onto the space spanned by the first k_0 components {ĥ_k ĥ_k^T}_{k=1}^{k_0}. The algorithm solves the d-dimensional embedding problem by iteratively updating Ĥ and Λ̂ using equations (3) and (4).
There are a few advantages to iteratively solving one-dimensional embedding problems. First, there are fewer parameters to fit at each iteration, since the algorithm is only allowed to vary h_{k_0} at iteration k_0. This makes the initialization and optimization steps much easier compared to optimizing all columns of H simultaneously. Second, it implicitly enforces an ordering on the columns of H. This ordering allows us to select the top few columns of Λ and H in cases where model selection is needed after the joint embedding. Third, it allows incremental computation: if d- and d′-dimensional joint embeddings are both computed, the first min(d, d′) columns of Ĥ will be the same. Fourth, although the global rank-d tensor approximation problem does not always have a solution for d ≥ 2, a one-dimensional embedding, which corresponds to a rank-1 tensor approximation, always has a solution [26]. Finally, based on numerical experiments, the difference between optimizing iteratively and optimizing all the parameters at once is negligible when d is small, and the iterative algorithm yields a slightly smaller objective function when d is large. The disadvantage of optimizing each column separately is that the algorithm is more likely to end up at a local minimum when the objective function is structured in a way that disfavors embedding iteratively. In practice, this problem can be mitigated by running the joint embedding algorithm several times with random initializations.
To find Λ_{*k_0} and h_{k_0} in equation (3), the algorithm needs to evaluate two derivatives: ∂f/∂h_{k_0} and ∂f/∂Λ_{ik_0}. The gradient of the objective function with respect to h_{k_0} is given by

∂f/∂h_{k_0} = −4 Σ_{i=1}^m Λ_{ik_0} (R_{ik_0} − Λ_{ik_0} h_{k_0} h_{k_0}^T) h_{k_0}.   (5)

The derivative of the objective function with respect to Λ_{ik_0} is given by

∂f/∂Λ_{ik_0} = −2 ⟨R_{ik_0} − Λ_{ik_0} h_{k_0} h_{k_0}^T, h_{k_0} h_{k_0}^T⟩.

Setting this derivative to 0 yields

Λ̂_{ik_0} = ⟨R_{ik_0}, h_{k_0} h_{k_0}^T⟩,   (6)

where ⟨·, ·⟩ denotes the (Frobenius) inner product and we use ||h_{k_0}|| = 1.
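As a quick numerical sanity check of the closed-form update (6): for a unit vector h, the loading ⟨R, h h^T⟩ = h^T R h minimizes the one-graph objective. The residual matrix R and vector h below are arbitrary test values of our own choosing.

```python
import numpy as np

# Check that lam* = <R, h h^T> = h^T R h minimizes ||R - lam * h h^T||_F^2.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
R = (M + M.T) / 2                       # symmetric residual matrix
h = rng.standard_normal(5)
h /= np.linalg.norm(h)                  # unit sphere constraint
lam_star = h @ R @ h                    # closed-form update (6)
err = lambda lam: np.linalg.norm(R - lam * np.outer(h, h)) ** 2
```

The objective is a quadratic in lam with leading coefficient ||h h^T||⁴ = 1, so lam_star is its unique minimizer.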
Once a new component ĥ_{k_0} ĥ_{k_0}^T is obtained, the algorithm proceeds to update Λ by solving the optimization problem (4). Note that the problem can be split into m least squares subproblems, one for each graph A_i, of the form

min_{Λ_{i*}} ||A_i − Σ_{k=1}^{k_0} Λ_{ik} ĥ_k ĥ_k^T||²

for each i = 1, ..., m. By taking the derivatives with respect to Λ_{ik} for all i = 1, ..., m and setting them equal to zero, Λ is updated by solving m systems of linear equations with k_0 variables each. Let Γ be a k_0 × k_0 matrix and Ψ be a k_0 × m matrix such that Γ_{kj} = (ĥ_j^T ĥ_k)² and Ψ_{ki} = ĥ_k^T A_i ĥ_k. The algorithm updates the first k_0 columns of Λ by solving

Γ Λ^T = Ψ.   (7)

The joint embedding algorithm alternates between updating Λ̂_{*k_0} and ĥ_{k_0} according to equations (5) and (6), after which the values of all the loadings (Λ̂_{*1}, ..., Λ̂_{*k_0}) are updated using equation (7). Algorithm 1 describes the general procedure to compute the d-dimensional joint embedding of graphs. Direct evaluations of equations (5) and (6) are not practical, due to h_k h_k^T and R_{ik} being large and dense. However, they can be rearranged to avoid explicit computation of h_k h_k^T and R_{ik}. Equation (5) becomes

∂f/∂h_{k_0} = −4 Σ_{i=1}^m Λ_{ik_0} (A_i h_{k_0} − Σ_{k=1}^{k_0−1} Λ̂_{ik} (ĥ_k^T h_{k_0}) ĥ_k − Λ_{ik_0} h_{k_0}).

Similarly, equation (6) can be rewritten as

Λ̂_{ik_0} = h_{k_0}^T A_i h_{k_0} − Σ_{k=1}^{k_0−1} Λ̂_{ik} (ĥ_k^T h_{k_0})².

Based on the rearranged equations, efficiently evaluating the matrix-vector product A_i h_{k_0} is all that is needed to calculate the derivatives. This can be done efficiently for a variety of matrices, in particular sparse matrices [41]. Algorithm 1 is guaranteed to converge to a stationary point. Specifically, at the termination of iteration k_0, ∂f/∂Λ_{ik_0} = 0 is ensured due to the exact update in equation (6). Second, notice that updating according to equations (5) and (6) always decreases the objective function. Due to the fact that h_{k_0} lies on the unit sphere and the objective is twice continuously differentiable, ∂f/∂h_{k_0} is Lipschitz continuous. This, along with Armijo backtracking line search, guarantees a "sufficient" decrease of c ||∂f/∂h_{k_0}||² in the objective function each time the algorithm updates h_{k_0} [40], where c is a constant independent of h_{k_0}.
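The alternating steps above can be sketched in NumPy. The code below is an illustrative simplification, not the paper's implementation: it uses the closed-form loading update (6), a plain gradient step on h with simple step halving in place of the full Armijo rule, SVD initialization from the average residual, and it omits the joint refit of equation (7).

```python
import numpy as np

def joint_embedding(As, d, n_iter=200, tol=1e-8):
    """Greedy alternating-descent sketch of Algorithm 1 (illustrative only)."""
    m, n = len(As), As[0].shape[0]
    H, Lam = np.zeros((n, d)), np.zeros((m, d))
    R = [A.astype(float).copy() for A in As]        # residual matrices R_ik
    for k in range(d):
        h = np.linalg.svd(sum(R) / m)[0][:, 0]      # SVD initialization
        lam = np.array([h @ Ri @ h for Ri in R])    # closed-form update (6)
        for _ in range(n_iter):
            # gradient (5); (R_i - l h h^T) h = R_i h - l h since ||h|| = 1
            grad = -4 * sum(l * (Ri @ h - l * h) for l, Ri in zip(lam, R))
            obj = sum(np.linalg.norm(Ri - l * np.outer(h, h)) ** 2
                      for l, Ri in zip(lam, R))
            step = 1.0
            while step > 1e-12:                     # crude backtracking
                h_new = h - step * grad
                h_new /= np.linalg.norm(h_new)      # project to unit sphere
                lam_new = np.array([h_new @ Ri @ h_new for Ri in R])
                obj_new = sum(np.linalg.norm(Ri - l * np.outer(h_new, h_new)) ** 2
                              for l, Ri in zip(lam_new, R))
                if obj_new < obj:
                    break
                step /= 2
            h, lam = h_new, lam_new
            if abs(obj - obj_new) < tol:
                break
        H[:, k], Lam[:, k] = h, lam
        R = [Ri - l * np.outer(h, h) for Ri, l in zip(R, lam)]
    return Lam, H

# Two exactly rank-one "graphs" sharing the component h0 = (1/2, ..., 1/2)
h0 = np.ones(4) / 2
As = [1.6 * np.outer(h0, h0), 0.8 * np.outer(h0, h0)]
Lam, H = joint_embedding(As, d=1)
```

On this noiseless example the embedding recovers the shared component (up to sign) and the per-graph loadings 1.6 and 0.8.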
Since the objective function is bounded below by 0, this implies convergence of the gradient, that is, ∂f/∂h_{k_0} → 0. For identifiability, the joint embedding model requires that the basis of the embedding space {h_k h_k^T}_{k=1}^d be linearly independent. This condition also ensures that equation (4) has a unique solution. Although the optimization problem (3) does not directly enforce this constraint, the next theorem guarantees that the components obtained by our method are linearly independent. The proof is given in the appendix.

Theorem 3.1. Suppose {ĥ_k ĥ_k^T}_{k=1}^{k_0−1} are linearly independent. Then either {ĥ_k ĥ_k^T}_{k=1}^{k_0} are also linearly independent, or Λ̂_{ik_0} = 0 for all i = 1, ..., m.
In general, the objective function may have multiple stationary points due to non-convexity. Therefore, the joint embedding algorithm is sensitive to initialization. In fact, like many problems in tensor factorization, finding the global minimum of the joint embedding is NP-hard [42]. When time permits, we recommend running the joint embedding several times, including many random initializations. In Section 5.1, we study the effects of different initialization approaches through a numerical simulation experiment. For the other simulation and real data experiments, we initialize ĥ_{k_0} using the top left singular vector of the average residual matrix Σ_{i=1}^m R_{ik_0}/m. Computing this vector is computationally cheap compared to the cost of the joint embedding when the number of graphs m is larger than 1. When m = 1, the SVD initialization and the joint embedding solution coincide, and when m > 1 the SVD initialization gives a good approximation to the solution if all the graphs are identically distributed. The optimization algorithm described above may not be the fastest approach to solving the problem; however, numerical optimization is not the focus of this paper. Based on results from numerical applications, our approach works well in estimating parameters and extracting features for subsequent statistical inference. Next, we discuss some variations of the joint embedding algorithm.

Algorithm 1 Joint Embedding of Graphs
1: Input: adjacency matrices A_1, ..., A_m and embedding dimension d
2: Initialize residuals R_ik = A_i for i = 1, ..., m
3: for k = 1 : d do
4:   Initialize h_k and Λ_{*k}
5:   while not convergent do
6:     Fixing Λ_{*k}, update h_k by gradient descent (5)
7:     Project h_k back to the unit sphere
8:     Fixing h_k, update Λ_{*k} by (6)
9:   end while
10:  Update Λ by solving (7)
11:  Update residuals R_{ik}
12: end for

Variations
The joint embedding algorithm described in the previous section can be modified to accommodate several different settings.

Variation 1. When all graphs come from the same distribution, we can force the estimated loadings λ̂_i to be equal across all graphs. This is useful when the primary inference task is to extract features for vertices. Since all graphs share a common loading vector λ, the optimization problem becomes

min_{λ; ||h_k||=1} Σ_{i=1}^m ||A_i − Σ_{k=1}^d λ[k] h_k h_k^T||²,

which is equivalent to

min_{λ; ||h_k||=1} ||Ā − Σ_{k=1}^d λ[k] h_k h_k^T||², with Ā = (1/m) Σ_{i=1}^m A_i.

Therefore, the optimization problem can be solved exactly by finding the singular value decomposition of the average adjacency matrix Ā.
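Variation 1's reduction to a spectral problem can be checked numerically. Below, random symmetric matrices stand in for adjacency matrices (our own toy data), and the shared rank-2 fit is read off from the eigendecomposition of the average; the residual error equals the sum of the squared discarded eigenvalues, as Eckart-Young predicts.

```python
import numpy as np

# Variation 1 sketch: with a common loading vector, the problem reduces to
# the best rank-d symmetric approximation of the average matrix A_bar.
rng = np.random.default_rng(3)
Ms = [rng.standard_normal((8, 8)) for _ in range(5)]
As = [(M + M.T) / 2 for M in Ms]        # stand-ins for adjacency matrices
A_bar = sum(As) / len(As)
w, V = np.linalg.eigh(A_bar)
top = np.argsort(-np.abs(w))[:2]        # d = 2 eigenpairs, largest |eigenvalue|
lam, H = w[top], V[:, top]
recon = (H * lam) @ H.T                 # shared rank-2 fit for all graphs
```

No alternating descent is needed in this variation, which is why it is useful when only vertex features are of interest.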

Variation 2.
When there is a discrete label y_i ∈ Y associated with graph G_i, we may require all loadings λ̂_i to be equal within a class. With Λ ∈ R^{|Y|×d}, the optimization problem becomes

min_{Λ; ||h_k||=1} Σ_{i=1}^m ||A_i − Σ_{k=1}^d Λ_{y_i k} h_k h_k^T||².

In this case, when updating Λ as in equation (6), the algorithm should average the loadings within each class, that is,

Λ̂_{yk} = (1/|{i : y_i = y}|) Σ_{i : y_i = y} h_k^T R_{ik} h_k.

Variation 3.
In some applications, we may require all Λ_{ik} to be nonnegative, as in non-negative matrix factorization. One advantage of this constraint is that graph G_i may be automatically clustered based on the largest entry of λ̂_i. In this case, the optimization problem is

min_{Λ; ||h_k||=1} Σ_{i=1}^m ||A_i − Σ_{k=1}^d Λ_{ik} h_k h_k^T||² subject to Λ_{ik} ≥ 0.

To guarantee nonnegativity, the algorithm should use nonnegative least squares when updating Λ [43]. Furthermore, a constraint on the number of non-zero elements in the ith row of Λ can be added as in K-SVD [44], and a basis pursuit algorithm could be used to update Λ [45], [46]. Next, we discuss some theoretical properties of the MREG model and of the joint embedding when treated as a parameter estimation procedure for the model.
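For intuition, here is a sketch of the nonnegative update in the special case where the components h_k are orthonormal, so that nonnegative least squares decouples into clipped one-dimensional problems; the general case requires a full NNLS solver such as Lawson-Hanson [43]. The helper name and the example values are ours.

```python
import numpy as np

def nonnegative_loadings_orthogonal(A, H):
    """Variation 3 sketch, special case: when the matrices h_k h_k^T are
    mutually orthogonal, NNLS decouples and the solution is the clipped
    projection lam_k = max(0, <A, h_k h_k^T>)."""
    lam = np.array([H[:, k] @ A @ H[:, k] for k in range(H.shape[1])])
    return np.clip(lam, 0.0, None)

# Orthonormal components on disjoint supports; the second true loading
# is negative, so the constrained solution clips it to zero.
H = np.column_stack([[1, 1, 0, 0], [0, 0, 1, 1]]) / np.sqrt(2)
A = 0.7 * np.outer(H[:, 0], H[:, 0]) - 0.5 * np.outer(H[:, 1], H[:, 1])
```

Clipping is exact here because the objective separates across components when ⟨h_j h_j^T, h_k h_k^T⟩ = 0 for j ≠ k.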

THEORETICAL RESULTS
In this section, we consider a simple setting where the graphs follow a 1-dimensional MREG model, that is, {A_i}_{i=1}^m ∼ MREG(λ_1, ..., λ_m, h_1). The 1-dimensional joint embedding is well defined in this case; that is, the λ̂_i and ĥ_1 defined in equation (1) are guaranteed to exist. Under this MREG model, the joint embedding of graphs can be understood as an estimator of the parameters of the model. Specifically, λ̂_i and ĥ_1 are estimates of λ_i and h_1. We prove two theorems concerning the asymptotic behavior of the estimator ĥ_1 produced by the joint embedding.
Let ĥ_1^m denote the estimate based on m graphs, and define the sample and population approximation errors D_m and D as

D_m(h, h_1) = (1/m) Σ_{i=1}^m min_{λ_i ∈ R} ||A_i − λ_i h h^T||²,    D(h, h_1) = E[ min_{λ ∈ R} ||A − λ h h^T||² ].

By equation (1), ĥ_1^m minimizes D_m(h, h_1) over unit vectors h; by equation (6), the inner minimization is attained at λ̂_i = h^T A_i h. The first theorem states that ĥ_1^m converges almost surely to a global minimum of D(h, h_1); in other words, the sample minimizer converges to the population minimizer. Theorem 4.1 ensures that, in the limit, ĥ_1^m is arbitrarily close to the set of global minimizers of D(h, h_1). However, the global minimizer is definitely not unique, due to the symmetry of h up to a sign flip: D(h, h_1) = D(−h, h_1) for any h. This problem can be addressed by forcing an orientation on ĥ_1^m or by stating that the convergence holds up to a sign flip. If the global minimizer is not unique beyond sign flips, however, this fix does not suffice; we are currently only certain that when all graphs are from the Erdos-Renyi random graph model, the global minimizer is unique up to a sign flip. The next theorem concerns the asymptotic bias of ĥ_1^m. It gives a bound on the difference between a population minimizer h and the truth h_1.

Theorem 4.2. If h is a minimizer of D(h, h_1), then
To see an application of Theorem 4.2, let us consider the case in which all graphs are Erdos-Renyi graphs with 100 vertices and edge probability 0.5. Under this setting, the bound confines h_1^T h to two intervals. The second interval is disturbing; it is due to the fact that when h_1^T h is small, the bound is useless. We provide some insight into why the second interval is there and how we can remove it with additional assumptions. In the proof of Theorem 4.2, we show that the global optimizer h maximizes E((h^T A_i h)²) over unit vectors.

Taking a closer look at this quantity,

E((h^T A_i h)²) = E(λ_i²)(h_1^T h)⁴ + E((h^T (A_i − P_i) h)²),

we can see that it is generally not maximized at h = h_1. If n is large, we can apply a concentration inequality to (h^T (A_i − P_i) h)² and obtain an upper bound on E((h^T (A_i − P_i) h)²). If we further assume that A_i is not too sparse, that is, E(λ_i²) grows with n fast enough, then the sum of these two terms is dominated by the first term. This provides a way to obtain a lower bound on h_1^T h, and we may then replace the denominator of the bound in Theorem 4.2 by this lower bound. In general, if n is small, the noise term may cause h to differ from h_1 by a significant amount. In this paper, we focus on the case where n is fixed; the case where n goes to infinity for the Random Dot Product Graph is considered in [47].
The two theorems above concern only the estimation of h_1, not of λ_i. Based on equation (6), the joint embedding estimates λ_i by λ̂_i^m = (ĥ_1^m)^T A_i ĥ_1^m. When m goes to infinity, we can apply Theorem 4.1, so that ĥ_1^m is arbitrarily close to a population minimizer h. Then, applying the bound on ||h − h_1|| derived in Theorem 4.2 and using the fact that h^T A_i h is continuous in h, we can obtain an upper bound on |λ̂_i^m − h_1^T A_i h_1|. When A_i is large, h_1^T A_i h_1 is concentrated around λ_i with high probability. As a consequence, with high probability |λ̂_i^m − λ_i| is small. In the next section, we demonstrate properties and utilities of the joint embedding algorithm through experiments.
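The concentration of the "oracle" loading estimate h_1^T A_i h_1 around λ_i can be seen in a small simulation; the values n = 400 and λ_i = 80 (an Erdos-Renyi graph with edge probability 0.2) are our own toy choices, and self-loops are omitted for simplicity.

```python
import numpy as np

# With h_1 = 1/sqrt(n) constant, P = lam * h_1 h_1^T is an ER graph with
# edge probability lam / n; h_1^T A h_1 should concentrate around lam.
rng = np.random.default_rng(4)
n, lam_true = 400, 80.0
h1 = np.ones(n) / np.sqrt(n)
P = lam_true * np.outer(h1, h1)         # constant 0.2 edge probability
U = np.triu(rng.random((n, n)), 1)
A = (U < np.triu(P, 1)).astype(float)
A = A + A.T                             # symmetric, no loops
lam_hat = h1 @ A @ h1                   # oracle loading estimate
```

Here h_1^T A h_1 is just the average row sum, whose standard deviation is well under 1 at this size, so the estimate lands close to 80.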

EXPERIMENTS
Before going into the details of our experiments, we want to discuss how to select the dimensionality d of the joint embedding. Estimating d is an important model selection question that has been studied for years under various settings [48]. Model selection is not the focus of this paper, but we still face this problem in numerical experiments. In the simulation experiments of this section, we assume d is known and simply set the dimensionality estimate d̂ equal to d. In the real data experiments, we recommend two approaches to determine d̂. Both approaches require first running the d′-dimensional joint embedding algorithm, where d′ is sufficiently large. We then plot the objective function versus dimension, and determine d̂ to be the point where the objective starts to flatten out. Alternatively, we can plot {Λ̂_{ik}}_{i=1}^m for k = 1, ..., d′, and select d̂ to be the dimension at which the loadings start to look like noise with mean 0. These two approaches should yield a similar dimensionality estimate d̂.
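The first ("elbow") approach can be sketched as a simple rule on the objective-versus-dimension curve. The helper below and its 5% threshold are our own heuristic illustration, not a procedure from the paper.

```python
import numpy as np

def elbow_dimension(objectives, rel_tol=0.05):
    """Pick d-hat from the objective values of a d'-dimensional joint
    embedding: return the first dimension after which the relative
    decrease in the objective falls below rel_tol (assumed threshold)."""
    obj = np.asarray(objectives, dtype=float)
    drops = (obj[:-1] - obj[1:]) / obj[:-1]   # relative decrease per added dim
    flat = np.nonzero(drops < rel_tol)[0]
    return int(flat[0]) + 1 if flat.size else len(obj)

# Example curve: large gains up to 3 dimensions, then it flattens out
d_hat = elbow_dimension([100.0, 60.0, 35.0, 33.5, 33.0, 32.8])
```

On this example curve the rule selects d̂ = 3, where the objective stops decreasing appreciably.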

Simulation Experiment 1: Joint Embedding Under a Simple Model
In the first experiment, we present a simple numerical example to demonstrate some properties of the joint embedding procedure as the number of graphs grows. We repeatedly generate graphs with 20 vertices from a 3-dimensional MREG model, and keep doubling the number of graphs m from 2^4 to 2^12. At each value of m, we compute the 3-dimensional joint embedding of the graphs. Let the estimated parameters based on m graphs be denoted by λ̂_i^m and ĥ_k^m. Two quantities based on ĥ_k^m are calculated. The first is the norm of the difference between the current estimate of h_k and the previous estimate, namely ||ĥ_k^m − ĥ_k^{m/2}||; this provides numerical evidence for the convergence of our estimation procedure. The second quantity is ||ĥ_k^m − h_k||, which investigates whether ĥ_k^m is an unbiased estimator of h_k. The procedure described above is repeated 20 times. Figure 2 presents the result.
Based on the plot, the norm differences $\|\hat h_k^m - \hat h_k^{m/2}\|$ seem to converge to 0 as $m$ increases, which suggests the convergence of $\hat h_k^m$. Second, we notice that the biases $\|\hat h_2^m - h_2\|$ and $\|\hat h_3^m - h_3\|$ do not converge to 0; instead, they stop decreasing at around 0.1 and 0.2 respectively. This suggests that $\hat h_k^m$ is an asymptotically biased estimator of $h_k$. Actually, this is to be expected: when infinitely many nuisance parameters are present, Neyman and Scott demonstrated that the maximum likelihood estimator can be inconsistent [49]. In our case, the number of nuisance parameters $\lambda_i$ grows with $m$; therefore, we do not expect the joint embedding to provide an asymptotically consistent estimate of $h_k$.
In applications such as clustering or classifying multiple graphs, we may not be interested in $\hat h_k$ itself; $\hat\lambda_i$ is of primary interest, since it provides information specific to the graph $G_i$. Here, we consider two approaches to estimate $\lambda_i[1]$. The first estimates $\lambda_i[1]$ through the joint embedding, that is, $\hat\lambda_i[1] = \hat h_1^T A_i \hat h_1$. The second estimates $\lambda_i[1]$ by assuming $h_1$ is known; in this case, equation (6) gives $\hat\lambda_i[1] = h_1^T A_i h_1$, which can be thought of as the 'oracle' estimate. Figure 3 shows the differences between the estimates provided by the two approaches. Not surprisingly, the differences are small, due to the fact that $\hat h_1^m$ and $h_1$ are close.
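Both estimates have the same quadratic form and differ only in which component vector is plugged in; a minimal sketch (the helper name `lambda_hat` is ours):

```python
import numpy as np

def lambda_hat(A, h):
    # <A, h h^T> = h^T A h when ||h|| = 1; the plug-in estimate uses
    # the estimated component h-hat, the 'oracle' uses the true h_1
    return h @ A @ h

# toy check on a 2-vertex graph with one edge
A = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(lambda_hat(A, h))  # approximately 1.0
```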
Next, we investigate the effects of four different initialization approaches. The first utilizes an SVD of the average residual matrix to initialize $h_k$ at each iteration. The second randomly samples independent Gaussian variables for each entry of $h_k$. The third takes the best among 10 random initializations. The fourth initializes $h_k$ with the truth. To compare these approaches, we generate 16 graphs from the MREG model and jointly embed them with the four different initializations. Then, another 16 graphs are generated, and the objective function on these graphs is evaluated using the $\hat H$ estimated by the joint embedding. This procedure is repeated 100 times. The mean objective function and total running time, with standard errors, for the four approaches are shown in Table 1. Based on the Wilcoxon signed-rank test, the SVD and truth initializations are significantly better than random initializations in this scenario. For the remaining experiments, initialization is performed by SVD.
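The first and third initialization schemes can be sketched as follows; `svd_init` and `best_of_random` are hypothetical helper names, and the objective passed to `best_of_random` stands in for the joint-embedding objective:

```python
import numpy as np

def svd_init(residuals):
    # SVD/eigen initialization: leading eigenvector of the average
    # residual matrix (what remains after the components fit so far)
    vals, vecs = np.linalg.eigh(np.mean(residuals, axis=0))
    return vecs[:, np.argmax(np.abs(vals))]

def best_of_random(objective, n, tries=10, seed=0):
    # best-of-10 random initialization: draw i.i.d. Gaussian vectors,
    # normalize them, and keep the one with the smallest objective
    rng = np.random.default_rng(seed)
    cands = rng.standard_normal((tries, n))
    cands /= np.linalg.norm(cands, axis=1, keepdims=True)
    return min(cands, key=objective)
```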

Simulation Experiment 2: Joint Embedding to Classify Graphs
In this experiment, we consider the inference task of classifying graphs. We have $m$ pairs $\{(A_i, y_i)\}_{i=1}^m$, where each pair consists of an adjacency matrix $A_i \in \{0, 1\}^{n \times n}$ and a label $y_i \in [K]$. Furthermore, all pairs are assumed to be independent and identically distributed according to an unknown distribution $F_{A,y}$. The goal is to find a classifier $g: \{0, 1\}^{n \times n} \to [K]$ with a small classification error $L_g = P(g(A) \neq y)$.
We consider a binary classification problem where $y$ takes value 1 or 2. Here, $h_2$ has $-0.1$ as its first 50 entries and $0.1$ as its last 50 entries. Graphs are generated according to the MREG model, where $F$ is a mixture of two point masses with equal probability, and the class label $y_i$ indicates which point mass $\lambda_i$ is sampled from. In terms of the SBM, this graph generation scheme is equivalent to a stochastic block model with block membership vector $(1, \ldots, 1, 2, \ldots, 2)$. To classify graphs, we first jointly embed 200 graphs. The first two-dimensional loadings are shown in Figure 4; the two classes are well separated after the joint embedding. A 1-nearest-neighbor (1-NN) classifier is then applied to the loadings. We compare the classification performance of using the joint embedding to extract features with five other feature extraction approaches: Adjacency Spectral Embedding, Laplacian Eigenmap, Graph Statistics, Graph Spectral Statistics, and PCA. For Adjacency Spectral Embedding (ASE) and Laplacian Eigenmap (LE), we first embed each adjacency matrix or normalized Laplacian matrix and then compute the Procrustes distance between embeddings. For Graph Statistics (GS), we compute the topological statistics of graphs considered by Park et al. in [7]. For Graph Spectral Statistics (GSS), we compute the eigenvalues of the adjacency matrices and treat them as features [50]. For PCA, we vectorize the adjacency matrices and compute the factors through SVD. After the feature extraction step, we again apply a 1-NN rule to classify graphs. We let the number of graphs $m$ increase from 4 to 200; for each value of $m$, we repeat the simulation 100 times. Figure 5 shows the result. The joint embedding takes advantage of the increasing sample size and outperforms the other approaches when given more than 10 graphs.
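A compressed version of this simulation is sketched below, under two stated assumptions: the point-mass loading values $(30, \pm 5)$ are our own hypothetical choice (the paper's exact values are not reproduced here), and for brevity the features project onto the true components rather than re-running the joint embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
h1 = np.ones(n) / np.sqrt(n)
h2 = np.concatenate([-0.1 * np.ones(50), 0.1 * np.ones(50)])  # unit norm

def sample_graph(y):
    # class y in {1, 2} selects one of two point masses for lambda_i;
    # the loading values (30, +/-5) are hypothetical
    lam2 = 5.0 if y == 1 else -5.0
    P = 30.0 * np.outer(h1, h1) + lam2 * np.outer(h2, h2)
    A = np.triu(rng.random((n, n)) < P, 1).astype(float)
    return A + A.T

m = 200
ys = rng.integers(1, 3, size=m)
# loadings <A_i, h_k h_k^T>, projecting on the true components
X = np.array([[h1 @ A @ h1, h2 @ A @ h2]
              for A in map(sample_graph, ys)])

def knn1_loo_accuracy(X, ys):
    # leave-one-out 1-nearest-neighbor accuracy on the loadings
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return np.mean(ys[D.argmin(axis=1)] == ys)

print(knn1_loo_accuracy(X, ys))  # near-perfect for this separation
```

The second loading carries essentially all of the class signal here, mirroring the separation visible in Figure 4.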

Real Data Experiment 1: Subject classification on HNU1 data
In this section, we use the HNU1 data [51] to classify graphs from different subjects. These data consist of diffusion tensor imaging (DTI) records of the brains of 30 different healthy subjects, each of whom was scanned 10 times over a period of one month. Each scan was processed with the NeuroData's MRI to Graphs (NDMG) pipeline [52] using the Talairach atlas [53] to register the vertices, and we obtained a sample of 300 graphs (10 graphs per subject), each of which is composed of 1105 vertices.
It has been suggested that the information encoded in patterns of brain connectivity can uniquely identify different subjects [54], [55], and there is some evidence of low-rank structure in those differences [56], [57]. We study how low-dimensional representations can capture inter-individual variability by using the HNU1 data to classify subject scans.
The joint embedding of graphs is an ideal method for this task, since it provides a low-rank representation of each of the graphs which is jointly learned from the sample.
The information encoded in the adjacency matrices can be used to accurately discriminate between individuals. This can be confirmed by a 1-NN classifier on the vectorized adjacency matrices, which gives almost perfect classification accuracy. However, our goal here is to study the accuracy of low-dimensional representations of the graphs in discriminating between subjects. Thus, we measure how our joint embedding method performs as the number of embedding dimensions grows. We compare the performance of our method with PCA as an alternative dimensionality reduction method. Note that the joint embedding of graphs directly imposes a rank-one restriction on the components, enforcing a low-rank representation of the graphs. PCA instead represents the adjacency matrices using components that are usually full rank, and requires many more parameters to represent those components.
Before computing the embedding, all graphs are centered by the mean, that is, we compute $\tilde A_i = A_i - \frac{1}{300}\sum_{j=1}^{300} A_j$. Using 5-fold cross-validation, the data of each subject are divided into training and test sets. We measure the effect of the sample size using two different scenarios: the first uses 30 graphs in the training set (one scan per subject) and the second uses 240 graphs (eight scans per subject). The remaining graphs are used in the test set to evaluate the accuracy, and the average over the 5 folds is computed. We follow the same procedure as in experiment 1. Using the training set, we compute an embedding of the graphs into $d$ dimensions, either by JEG or by PCA on the vectorized adjacency matrices. After that, all the data are projected onto the $d$-dimensional embedding, and we use 1-NN to estimate the labels of the test data. Figure 6 shows a comparison of the average classification accuracy and model complexity for both methods. We measure model complexity by the number of embedding dimensions $d$ and by the total number of parameters used in each instance ($d(n+m)$ for JEG and $d(n^2+m)$ for PCA). In general, both methods are able to perform very accurate classification when $d$ is large enough, but JEG uses far fewer parameters in the components for the same number of dimensions (see middle plot). The advantage of JEG is especially remarkable when the sample size is small (first row). In this case, JEG shows better performance than PCA when $d$ is not large. For values close to 30, PCA shows almost perfect performance, which is not surprising: the maximum number of principal components that can be obtained by PCA is $m = 30$, so for $d$ close to this value PCA is not really performing any dimensionality reduction, but rather projecting onto the training data itself.
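The two parameter counts quoted above are easy to compare directly; the numbers below use the HNU1 sizes from the small-sample scenario of this experiment:

```python
# JEG stores d components of length n plus d loadings per graph;
# PCA stores d full n x n components plus d scores per graph
def jeg_params(d, n, m):
    return d * (n + m)

def pca_params(d, n, m):
    return d * (n ** 2 + m)

n, m = 1105, 30   # HNU1: 1105 vertices, 30 training graphs (one per subject)
print(jeg_params(10, n, m))  # 11350
print(pca_params(10, n, m))  # 12210550
```

The roughly thousand-fold gap in component parameters is what the middle plot of Figure 6 reflects.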
JEG, on the other hand, is able to provide very accurate classification in all scenarios using low-rank representations of the graphs with fewer, interpretable parameters. This can be observed in Figure 7, which shows the latent positions of the vertices obtained by JEG for the first seven dimensions. To construct these plots, only the vertices labeled as left (507 of them) or right (525) hemisphere were considered. As can be observed, several dimensions of the latent positions show a structure that is clearly related to the hemisphere side. This structure is especially pronounced in the 7th dimension of the embedding.

Real Data Experiment 2: Joint Embedding to Cluster Vertices
In the previous experiments, we focused on feature extraction for graphs. Here, we consider a different task: spectral clustering through the joint embedding. In general, spectral clustering first computes (generalized) eigenvalues and eigenvectors of the adjacency or Laplacian matrix, then clusters the latent positions into groups [11], [12]. The cluster identities of the latent positions become the cluster identities of the vertices of the original graph. Adjacency Spectral Embedding (ASE) is one of the spectral clustering algorithms used to find the latent positions of the vertices under the SBM and RDPG [58], which are special cases of our MREG model. When applied to one graph, the joint embedding is equivalent to ASE. When given multiple graphs, the joint embedding can estimate latent positions for graph $i$ from the components $\hat H$ and the loadings $\hat\lambda_i$. We apply these spectral clustering approaches to two Wikipedia graphs [59]. The vertices of these graphs represent Wikipedia article pages; two vertices are connected by an edge if either of the associated pages hyperlinks to the other. Two graphs are constructed, based on the English and the French webpages respectively. The full graph has 1382 vertices, representing articles within the 2-neighborhood of "Algebraic Geometry". Based on the content of the associated articles, they are grouped by hand into 6 categories: people, places, dates, things, math terms, and categories.
We consider a subset of vertices from 3 categories: people, things, and math terms. After taking the induced subgraph on these vertices and removing isolated vertices, $n = 704$ vertices remain; specifically, 326, 181, and 197 vertices are from people, things, and math terms respectively. We consider several approaches to embed the graphs and obtain latent positions. First, we cluster each graph separately by doing ASE on the English graph $A_{en}$ (ASE+EN), or equivalently JE on $A_{en}$, and ASE on the French graph $A_{fr}$ (ASE+FR), and compare with the per-graph latent positions obtained from the joint embedding of both graphs, for $A_{en}$ (JE+EN) and for $A_{fr}$ (JE+FR). We also consider methods to estimate joint latent positions, by doing ASE on the mean of both graphs $\bar A = (A_{en} + A_{fr})/2$ (ASE+(EN+FR)) [56], and using the matrix $\hat H$ obtained by JE on both graphs (JE+(EN,FR)). The dimension $d$ is set to 3 for all approaches, and the latent positions are scaled to have norm 1 for degree correction. Then, we apply the 3-means algorithm to the latent positions.
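The ASE-plus-k-means pipeline, with the norm-1 degree correction used here, can be sketched on a toy two-block SBM; the tiny `kmeans` with farthest-point seeding is a stand-in for a standard k-means implementation, and the block probabilities are our own illustrative choice:

```python
import numpy as np

def ase(A, d):
    # adjacency spectral embedding: top-d eigenpairs by |eigenvalue|,
    # with eigenvectors scaled by the root of the eigenvalue magnitudes
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(-np.abs(vals))[:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def degree_correct(X):
    # scale each latent position to norm 1, as in this experiment
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

def kmeans(X, k, iters=50, seed=0):
    # minimal Lloyd's algorithm with farthest-point seeding
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.linalg.norm(X - c, axis=1) for c in C], axis=0)
        C.append(X[np.argmax(d2)])
    C = np.array(C)
    for _ in range(iters):
        z = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(axis=1)
        C = np.array([X[z == j].mean(axis=0) if np.any(z == j) else C[j]
                      for j in range(k)])
    return z

# toy two-block SBM to sanity-check the pipeline
rng = np.random.default_rng(0)
n = 60
z_true = np.repeat([0, 1], 30)
P = np.where(z_true[:, None] == z_true[None, :], 0.5, 0.05)
A = np.triu(rng.random((n, n)) < P, 1).astype(float)
A = A + A.T
z_hat = kmeans(degree_correct(ase(A, 2)), 2)
```

For the Wikipedia experiment, the same clustering step is applied to the latent positions produced by each of the embedding approaches listed above.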
The joint latent positions of the graphs, $\hat H$, estimated by the joint embedding are shown in Figure 8. The latent positions of math terms are separated from the other two clusters; however, the latent positions of people and things are mixed. Table 2 shows the clustering results, measured by the adjusted Rand index and the purity of the clustering [60], [61]. The standard error is estimated by repeatedly clustering bootstrapped latent positions. All methods yield clustering results which are significantly better than random. The English graph exhibits clearer community structure than the French graph. The joint embedding produces latent positions that combine the information in both graphs, leading to better clustering performance. JE and ASE give similar results on the English graph, but JE significantly improves the clustering performance on the French graph. JE also outperforms ASE on $\bar A$, which likewise uses the information of both graphs. These results show that our method provides interpretable representations for the vertices, with an accuracy comparable to state-of-the-art methods for spectral clustering.
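Purity is straightforward to compute from the contingency of true and estimated labels; a minimal sketch (the adjusted Rand index can be obtained analogously, e.g. via `sklearn.metrics.adjusted_rand_score`):

```python
from collections import Counter

def purity(labels_true, labels_pred):
    # each estimated cluster is credited with its most common true
    # label; purity is the fraction of vertices so accounted for
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    hits = sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values())
    return hits / len(labels_true)

print(purity([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (labels merely permuted)
print(purity([0, 0, 1, 1], [0, 0, 0, 0]))  # 0.5
```

Unlike raw accuracy, purity is invariant to relabeling of the clusters, which is why it is suitable for unsupervised comparisons like Table 2.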

CONCLUSION
In summary, we developed a joint embedding method that simultaneously embeds multiple graphs into a low dimensional space. The joint embedding can be utilized to estimate features for inference problems on multiple vertex-matched graphs. Learning on multiple graphs has significant applications in diverse fields, and our results have both theoretical and practical implications for the problem. As the real data experiments illustrate, the joint embedding is a practically viable inference procedure. We also proposed the Multiple Random Eigen Graphs (MREG) model, which can be understood as a generalization of the Random Dot Product Graph model and the Stochastic Block Model to multiple random graphs. We analyzed the performance of the joint embedding on this model under simple settings and demonstrated that it provides estimates with bounded error. Our approach is intimately related to other matrix and tensor factorization approaches such as the singular value decomposition and the CP decomposition; indeed, the joint embedding and these algorithms all estimate a low dimensional representation of high dimensional objects by minimizing a reconstruction error. We are currently investigating the utility of the joint embedding with more or less regularization on the parameters and under different setups. We are optimistic that our method provides a viable tool for analyzing multiple graphs and can contribute to a deeper understanding of the joint structure of networks.

APPENDIX A
Proof of Theorem 2.1

Denote the probability of observing a particular adjacency matrix $A_i$ under the distribution $F$ by $p_i$. It suffices to show that there is a set of parameters for the MREG model such that the probability of observing $A_i$ under MREG is also $p_i$.
For undirected graphs with loops on $n$ vertices, there are $\binom{n+1}{2}$ possible edges. Let $A_1, A_2, \ldots, A_{2^{\binom{n+1}{2}}}$ be all the possible adjacency matrices. Since a real symmetric matrix of size $n$ has $\binom{n+1}{2}$ free entries, such matrices lie in a linear space of dimension $\binom{n+1}{2}$; hence, if there exist $\binom{n+1}{2}$ linearly independent rank-one symmetric matrices, they form a basis for this space. It turns out that the required rank-one symmetric matrices can be generated from a suitable collection of vectors, analogous to the standard basis of $n$-dimensional Euclidean space.
Next, we construct parameters for the MREG model. Let $d = \binom{n+1}{2}$. Since $\{h_k h_k^T\}_{k=1}^d$ forms a basis for the real symmetric matrices, for each adjacency matrix $A_i$ there exists a vector $\lambda_i$ such that $A_i = \sum_{k=1}^{d} \lambda_i[k]\, h_k h_k^T$. Let $F$ be a finite mixture distribution on these points.

Proof of Theorem 4.1

From equation (3), it can be noted that the solution for $\hat\Lambda^*_{k0}$ is available in closed form, and equation (4) then implies a corresponding identity for every $i = 1, \ldots, m$. To prove the claim of the theorem, we use a technique similar to that employed by Bickel and Doksum [63]. Let $\hat h$ be a global minimizer of $D(h, h_1)$; by definition, we must have $D(\hat h, h_1) \le D(h, h_1)$ for every unit vector $h$.

Proof of Theorem 4.2
The proof of the theorem relies on two lemmas. The first lemma shows that $h^\star$ is the eigenvector corresponding to the largest eigenvalue of $E(\langle A_i, h^\star h^{\star T}\rangle A_i)$. The second lemma bounds, in Frobenius norm, the difference between this matrix and $E(\lambda_i^2)(h_1^T h^\star)^2 h_1 h_1^T$. We then apply the Davis-Kahan theorem [64] to establish the result of the theorem. We notice that the population objective is equivalent to maximizing $E(\langle A_i, hh^T\rangle^2)$ subject to $h^T h = 1$. Introducing a Lagrange multiplier $c$ and taking the derivative of $E(\langle A_i, hh^T\rangle^2) + c(h^T h - 1)$ with respect to $h$, setting this expression to 0 yields $2E(\langle A_i, hh^T\rangle A_i h) + c h = 0$. Using the fact that $\|h\| = 1$, we can solve for $c$: $c = -2E(\langle A_i, hh^T\rangle h^T A_i h) = -2E(\langle A_i, hh^T\rangle^2)$. Then, substituting for $c$, $E(\langle A_i, h^\star h^{\star T}\rangle A_i)\, h^\star = E(\langle A_i, h^\star h^{\star T}\rangle^2)\, h^\star$. Therefore, we see that $h^\star$ is an eigenvector of $E(\langle A_i, h^\star h^{\star T}\rangle A_i)$ with corresponding eigenvalue $E(\langle A_i, h^\star h^{\star T}\rangle^2)$. Furthermore, $E(\langle A_i, h^\star h^{\star T}\rangle^2)$ must be the eigenvalue with the largest magnitude. For if not, there exists an $h$ with norm 1 such that $E(\langle A_i, h^\star h^{\star T}\rangle\langle A_i, h h^T\rangle) > E(\langle A_i, h^\star h^{\star T}\rangle^2)$; however, by the Cauchy-Schwarz inequality we must have $E(\langle A_i, h^\star h^{\star T}\rangle\langle A_i, h h^T\rangle) \le E(\langle A_i, h^\star h^{\star T}\rangle^2)^{1/2} E(\langle A_i, h h^T\rangle^2)^{1/2}$, implying $E(\langle A_i, h h^T\rangle^2) > E(\langle A_i, h^\star h^{\star T}\rangle^2)$, which contradicts equation (9) of this appendix. Thus, we conclude that $h^\star$ is the eigenvector corresponding to the largest eigenvalue of $E(\langle A_i, h^\star h^{\star T}\rangle A_i)$.
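As a numerical sanity check of the first lemma: in the noiseless case $A_i = P_i = \lambda_i h_1 h_1^T$, the average of $\langle A_i, hh^T\rangle A_i$ equals $E(\lambda_i^2)(h_1^T h)^2 h_1 h_1^T$ exactly, so $h_1$ is its eigenvector with eigenvalue $E(\lambda_i^2)(h_1^T h)^2$. The vectors and the distribution of $\lambda_i$ below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
h1 = rng.standard_normal(n)
h1 /= np.linalg.norm(h1)
h = rng.standard_normal(n)
h /= np.linalg.norm(h)
lams = rng.uniform(1.0, 2.0, size=500)

# average <A_i, h h^T> A_i over the sample, with A_i = lambda_i h1 h1^T
M = np.mean([np.sum(A * np.outer(h, h)) * A
             for A in (lam * np.outer(h1, h1) for lam in lams)], axis=0)
ev = np.mean(lams ** 2) * (h1 @ h) ** 2

print(np.allclose(M @ h1, ev * h1))  # True
```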
We compute $E(\langle A_i, h^\star h^{\star T}\rangle A_i)$ by conditioning on $P_i$.
Here, $\mathrm{diag}(\cdot)$ keeps only the diagonal of a matrix, $*$ denotes the Hadamard product, and $J$ is the matrix of all ones. Using the fact that $P_i = \lambda_i h_1 h_1^T$, and considering the norm difference between the conditional expectation and its rank-one approximation, the claimed bound follows. This finishes the proof of the lemma.
Notice that the only nonzero eigenvector of $E(\lambda_i^2)(h_1^T h^\star)^2 h_1 h_1^T$ is $h_1$, with corresponding eigenvalue $E(\lambda_i^2)(h_1^T h^\star)^2$. We apply the Davis-Kahan theorem [64] to the eigenvectors corresponding to the largest eigenvalues of the matrices $E(\langle A_i, h^\star h^{\star T}\rangle A_i)$ and $E(\lambda_i^2)(h_1^T h^\star)^2 h_1 h_1^T$.