Sparse Quadratic Approximation for Graph Learning

Learning graphs represented by <inline-formula><tex-math notation="LaTeX">$M$</tex-math><alternatives><mml:math><mml:mi>M</mml:mi></mml:math><inline-graphic xlink:href="schenk-ieq1-3263969.gif"/></alternatives></inline-formula>-matrices via an <inline-formula><tex-math notation="LaTeX">$\ell _{1}$</tex-math><alternatives><mml:math><mml:msub><mml:mi>ℓ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math><inline-graphic xlink:href="schenk-ieq2-3263969.gif"/></alternatives></inline-formula>-regularized Gaussian maximum-likelihood method is a popular approach, but also one that poses computational challenges for large scale datasets. Recently proposed methods cast this problem as a constrained optimization variant of precision matrix estimation. In this paper, we build on a state-of-the-art sparse precision matrix estimation method and introduce two algorithms that learn <inline-formula><tex-math notation="LaTeX">$M$</tex-math><alternatives><mml:math><mml:mi>M</mml:mi></mml:math><inline-graphic xlink:href="schenk-ieq3-3263969.gif"/></alternatives></inline-formula>-matrices, that can be subsequently used for the estimation of graph Laplacian matrices. In the first one, we propose an unconstrained method that follows a post processing approach in order to learn an <inline-formula><tex-math notation="LaTeX">$M$</tex-math><alternatives><mml:math><mml:mi>M</mml:mi></mml:math><inline-graphic xlink:href="schenk-ieq4-3263969.gif"/></alternatives></inline-formula>-matrix, and in the second one, we implement a constrained approach based on sequential quadratic programming. We also demonstrate the effectiveness, accuracy, and performance of both algorithms. Our numerical examples and comparative results with modern open-source packages reveal that the proposed methods can accelerate the learning of graphs by up to 3 orders of magnitude, while accurately retrieving the latent graphical structure of the data. Furthermore, we conduct large scale case studies for the clustering of COVID-19 daily cases and the classification of image datasets to highlight the applicability in real-world scenarios.


I. INTRODUCTION AND RELATED WORK
T HE representation of data in the form of graphs is a ubiquitous task in every scientific domain that deals with interacting or interconnected data. Graphs are fundamental mathematical entities with nodes (or vertices) and edges connecting them. The relationship between two connected nodes is usually captured by the scalar value of the weight of the edge that links them. In many domains data is generally available in the form of an unstructured list of samples or variables, with no available relational information among them. The construction of the latent graphical structure of such a dataset often offers an intuitive representation of the data. It can also result in a dimensionality reduction of the problem through the utilization of prior knowledge about the underlying graph (e.g., the level of sparsity or a priori information about the connectivity of the nodes). Overviews of various recent graph learning approaches can be found in [1], [2], [3]. Undirected weighted graphs, with edges representing the conditional dependence among the variables, are typically constructed with a Gaussian graphical modeling (GGM) approach [4]. In this context, each vertex corresponds to a variable, with edges being present between the vertices only if the vertices are conditionally dependent. These dependencies among the data points can be both positive and negative, and are encoded in a matrix that represents the graphical structure. The non-zero entries of this matrix correspond to the dependencies between two variables. This matrix is the inverse of the covariance matrix, also known as the precision matrix, which encodes the graphical structure of a Gaussian Markov random field (GMRF). A common prior imposed on the estimation of the precision matrix is that the conditional correlations among the random variables are sparse [5], i.e., there is a limited number of conditional correlations between the variables. This prior corresponds to imposing a degree of sparsity on the estimated precision matrix. A widely used approach for the estimation of sparse precision matrices is the 1 -regularized maximum likelihood estimation (MLE), commonly referred to as the "graphical LASSO" problem [6]. A popular second-order solution method for the graphical LASSO problem with superlinear convergence is the QUadratic approximation of Inverse Covariance matrices (QUIC) algorithm [7]. The Sparse QUIC (SQUIC) algorithm [8] continues the progress on large-scale, second-order methods by exploiting the underlying sparse linear algebra operations. In [9] and [10] it has been shown that SQUIC is equivalently accurate and significantly faster than other state-of-the-art precision matrix estimation routines (e.g. [11], [12], [13]) in both, synthetic and real-world datasets.
More recently, GGMs under the constraint that all partial correlations are non-negative have received significant attention. The problem of finding variables that are only non-negatively correlated corresponds to enforcing an M -matrix structure on the precision matrix [14], [15]. Symmetric M -matrices are positive definite with non-positive off-diagonal elements, i.e., they are part of the set (1) where , denote positive (semi-)definiteness. Slawski and Hein [16] estimate matrices from the set (1) with a sign-constrained log-determinant divergence minimization algorithm without regularization, thus limiting the applicability of their algorithm to smaller datasets. They also establish that an a posteriori thresholding of the off-diagonal entries of the precision matrix successfully retrieves matrices that encapsulate only the positively correlated variables. In [17] an algorithm that does not require any tuning parameters is proposed that estimates only the graphical structure without the weights. In [18] the optimization problem is solved with an alternating direction method of multipliers (ADMM) algorithm with LASSO and adaptive LASSO penalties.
A tightly connected research direction is concerned with the estimation of the combinatorial graph Laplacian, a symmetric, positive semidefinite and weakly diagonally dominant matrix. If we allow Θ 0 in (1), then graph Laplacians were part of the subset of (1) defined as The matrices in the set (2) are singular, with their off-diagonal entries capturing the weight of the edges of the graph in reversed sign. Here, the initial work of Lake and Tenenbaum [19] focused in the estimation of graph Laplacians through the optimization of an 1 -regularized MLE problem by adding positive constant values to the diagonal entries of the graph Laplacian to account for its singularity. In [20], Egilmez et al. build upon previous of their work in the field [21], and propose an optimization framework for the estimation of graph Laplacian matrices by introducing new problem formulations with sign and structural (i.e., connectivity) constraints, and developing tailored algorithms for these problems using again an 1 -regularization term to enforce sparsity in the graph. Similarly, in [22] the authors convert combinatorial structural constraints into spectral ones on graph matrices, and develop an optimization framework based on block majorization-minimization to solve the graph learning problem. In [23] nonconvex regularization terms are proposed in order to enforce sparsity in the retrieved matrices. Additionally, various M -matrix learning algorithms have been proposed based on the assumption that the graph structure emerges from a set of smooth signals. The authors in [24] adopted a factor analysis model and imposed a Gaussian probabilistic prior on the latent variables that control these signals to obtain a graphical representation. In [25] the same problem is formulated as a weighted 1 -minimization, and in [26] a scalable variant is proposed that utilizes approximate nearest neighbors techniques to reduce the dimensionality of the problem.

Contributions and outline
The focus of our work is centered around the fact that the learning of M -matrices belonging to the set (1) via an 1 -regularized Gaussian maximum-likelihood method is currently prohibitive for high dimensional data. Motivated by the effectiveness of SQUIC [8], [9], [10] in learning precision matrices of very large dimensions we introduce hereby two algorithms that learn graphs of non-negatively correlated random variables.
The first one, SQUIC-fit, performs two consecutive unconstrained precision matrix estimations with an 1 -regularized minimization. It utilizes the positively correlated variables identified in the first run as graphical bias for the retrieval of the second precision matrix, which is subsequently thresholded in order to retrieve the final M -matrix. The second one, SQUIC-sqp, is a constrained method that effectively enforces the non-positivity in the off-diagonal entries of the estimated precision matrix. The constrained minimization is achieved by means of a sequential quadratic programming (SQP) algorithm, and the corresponding projected Karush-Kuhn-Tucker (KKT) system is solved by a preconditioned conjugate gradient method (PCG).
Extensive numerical experiments are provided for both introduced algorithms. We begin with a performance and accuracy comparison with several state-of-the-art M -matrix estimation packages for synthetic datasets with up to 10 4 random variables. Then we proceed with a study on the recovery accuracy of SQUIC-fit and SQUIC-sqp when prior graphical information is available and incorporated in the optimization procedure. Following the synthetic tests, we present two didactic case studies where we highlight the applicability of the introduced algorithms in real-world datasets. For the first case study we perform the spectral clustering of p = 3·10 3 US counties based on the number of daily COVID-19 cases they reported for a window of 671 days. Finally, we classify image datasets with up to p = 7·10 4 dimensions based on the eigenvectors of the M -matrices estimated by the proposed algorithms.
The remainder of this paper is organized as follows. In Section II, we briefly recap the learning of graphs in the form of precision and M -matrices when assuming that data samples are drawn from a GMRF field. In Section III we initially present at a high level the plain SQUIC method for large-scale sparse precision matrix estimation. We then proceed with introducing the SQUIC-fit and SQUIC-sqp algorithms for the learning of graphs in the set (1). In Section IV, we perform numerical experiments on synthetic datasets and compare with state-of-the-art methods in order to validate our proposed routines. In Section V, we present the case studies on real-world datasets, and finally in Section VI we draw conclusions from this work.

Notation
In what follows, we denote scalar quantities with lowercase, vectors with lowercase bold, and matrices with uppercase bold characters. The (i, j)th entry of a matrix A is symbolized by A ij and all entries in row i or column j by A i: and A :j , respectively. Sets are denoted by capital calligraphic characters, for example, A, the identity matrix as I and the vector of all ones as e.

II. GRAPH LEARNING BACKGROUND
A common approach for various graph learning approaches consists of assuming that the data samples are drawn from a GMRF field. In subsection II-A we describe the problem of estimating precision matrices from GMRFs. Subsequently, in subsection II-B we show how this optimization procedure can be formulated in order to learn graph M -matrices in the set (1).

A. Sparse Precision Matrix Estimation
The retrieval of the graphical structure of a GMRF model corresponds to the estimation of the precision matrix (inverse covariance matrix) by means of the MLE problem. The basic assumption on the given data Y ∈ R p×n is that one reads its columns as a set n independently and identically distributed (i.i.d.) samples of a p-variate Gaussian distribution N (µ * , Σ * ), where Σ * ∈ R p×p and µ * ∈ R p are the true covariance matrix and mean, respectively. Even if the assumption i.i.d. is not fulfilled, determining Σ * or its inverse Θ * := (Σ * ) −1 yields sufficient information that could be used for graph learning purposes. More specifically, in a setting with non-Gaussian distribution, the estimation of positive definite precision matrices can be related to the Bregman divergence regularized optimization problem [32]. The entries Θ * ij of the precision matrix describe the conditional dependence of components i, j provided that all other components are fixed and the associated graph leads to the GMRF. In statistics, the MLE method is employed to approximate Θ * . To do so, the negative log-likelihood objective function is minimized, where S ∈ R p×p is the sample covariance matrix. The graph of Θ * is essential to describe the GMRF, thus one reformulates the minimization as a LASSO problem by adding an additional 1 regularization term. This term enforces sparsity in the graphical representation of Θ * and explains the term graphical LASSO (GLASSO). Given a sparsity parameter matrix Λ ∈ R p,p with Λ ij > 0, we aim to solve the following convex 1 -regularized negative log-likelihood problem where denotes the element-wise Hadamard product and Θ 0 denotes positive-definiteness. The regularization term can be expanded as Λ Θ 1 = p i,j=1 Λ ij |Θ ij |. Typically, small entries in Λ result in reduced sparsity in the estimated precision matrix Θ. Besides enforcing sparsity for the computed Θ, the minimization in (4) is also suitable when the number of dimensions p is significantly larger than the number of samples n, thus making (4) an appealing problem formulation for large-scale data science applications. The selection of the optimal regularization parameter is a challenging problem in itself, with cross-validation [33] and stability based approaches [34] commonly adopted.
There is a plethora of methods available for solving (4), see, e.g. [6], [7], [8], [35], [36], [37] for a selection of them. Recently some of these methods have gained increased attention thanks to a quadratic approximation, briefly outlined hereby, with the purpose of accelerating convergence.
Let f : Θ → R be the non-regularized negative loglikelihood function in (3). Up to a constant, the second-order Taylor expansion of f around Θ iŝ where W = Θ −1 denotes the inverse of the computed approximation Θ. The Newton direction Δ ∈ R p×p of the approximate objective functionf can now be written as the solution of the following problem The quadratic approximation approach solves (4) as a sequence of optimization problems of the form (6). Since (6) is an 1regularized quadratic convex minimization problem, one can find a "closed form solution" if Δ is restricted to the scalar case with either one single diagonal entry Δ ii or two single identical off-diagonal entries Δ ij = Δ ji . A natural solution strategy is a coordinate descent update where one successively minimizes over all indices {i, j}. Looking at the subgradient of f (Θ), it suffices to only consider {i, j} from the free set I free , where with the cardinality of the free set I free typically expected to be much less than p 2 . This way an update matrix Δ is obtained and in order to ensure that the next iterate Θ is positive definite and meets an Armijo-type criterion, we update our current estimate of the optimizer Θ with αΔ for an appropriate step size α ∈ [0, 1). For more details we refer the reader to [7] and references therein.
Alternatively to optimizing f (Θ) by a sequence of quadratic problems such as (6), some methods directly substitute f (Θ) by a single quadratic loss surrogate function, which is then required to be minimized. Among these methods are, e.g. [11], [38].

B. Sparse M -Matrix Estimation
M -matrices in the set (1) can be considered as precision matrices Θ whose partial correlations −Θ ij / Θ ii Θ jj , i = j are all non-negative [14]. The GMRF corresponding to a precision matrix of that form is referred to as attractive [39]. The constrained GLASSO estimator is defined as The off-diagonal elements of Θ are now constrained to nonpositive values Θ ij ≤ 0 and correspond to the weights of the resulting graph of the GMRF. In Section III-C we describe how (8) can be approximated by a sequence of quadratic and differentiable functions with linear inequality constraints.

Connection to graph Laplacians
The combinatorial graph Laplacian L ∈ R p×p is a symmetric positive semidefinite matrix in the set (2) with off-diagonal elements non-positive, thus it is considered as singular M -matrix. Such matrices can be expressed in the form L = sI − B, where s = ρ(B) is the spectral radius of the nonnegative matrix B ≥ 0. We refer to [40] for the theoretical background of singular M -matrices and their pseudoinverses. The constant vector of ones e is in its nullspace, i.e., L · e = 0, because the row and column sums of L are zero, i.e., L ii + i =j L ij = 0. A very important property of the spectrum of L is that the multiplicity of the zero eigenvalue corresponds to the number of connected components k of the graph [30]. This also implies that the Laplacian matrix is singular, with rank p − k > 0. Note that L encodes an improper GMRF (IGMRF) [22], [41] of rank p − k, as opposed to Θ in (8) which is of full rank.
An undirected weighted graph G(V, E, A), as illustrated in Fig. 1, is defined by its node set V = {1, 2, . . . , p} representing the data points, and the similarity between the edges E which is encoded in the elements of the weighted adjacency matrix A ∈ R p×p . Its combinatorial graph Laplacian L can be understood in terms of the weighted adjacency matrix A, that encodes the weights A ij ≥ 0 of the edges, and the diagonal degree matrix T ∈ R p×p , which captures the degree of each node The positive entries of A, or equivalently the negative off-diagonal entries of L, represent the edge weights of a graph, while zero entries A ij = 0, i = j, imply that there is no connection between nodes i and j.
Different variants of graph Laplacian matrices have also been extensively studied. The normalized symmetric L sym = T −1/2 LT −1/2 and random walk L rw = T −1 L Laplacians [42] are both scaled by the degree of the edges and have been successfully used for clustering tasks [28], [29], [43]. Additionally, nonlinear reformulations of the graph Laplacian from the traditional 2-norm to the p-norm for p ∈ (1, 2] have proven to lead to a sharp approximation of balanced cut metrics and improved clustering assignments [44], [45], [46]. All aforementioned graph Laplacian variants can be constructed after obtaining the weights of the graph's edges, encoded in the adjacency matrix A. In what follows we estimate non-negatively correlated variables in the form of an M -matrix Θ from an optimization problem of the form (8), and then set A = − Θ. The appropriate type of graph Laplacian is subsequently built according to the application at hand.

III. ESTIMATING M -MATRICES WITH SQUIC
In this section we present two algorithms developed for the MLE of M -matrices emerging from high dimensional datasets. Our contributions build on top of the existing SQUIC library for large-scale precision matrix estimation, thus we begin in III-A with a short overview of the method [8], [9] and its latest development as demonstrated in [10]. In III-B we present the SQUIC-fit algorithm, an unconstrained approach to M -matrix estimation based on two consecutive 1 -regularized optimization problems. Then, in III-C we introduce SQUIC-sqp, a constrained sequential quadratic programming approach for the solution of problems of the form (8).

A. MLE for Large Dimensions
The SQUIC algorithm extends the original QUIC algorithm [7] for large-scale applications and is effective for problems that exhibit a high degree of sparsity in both Θ and the intermediary computations. The critical components of the MLE method based on quadratic approximations can be summarized in five tasks, namely 1) efficient data structures for the matrices S, Θ, W and Δ, 2) computation of the sparse sample covariance matrix S, 3) Cholesky decomposition of Θ to check its positive definiteness and to compute log det Θ, 4) computation of the inverse of the computed approximation W ≈ Θ −1 , 5) efficiently solving the quadratic approximation problem (6). The SQUIC method addresses these challenges by using compressed sparse column storage, which is common when working with sparse matrices, and by replacing several dense matrix operations by state-of-the-art sparse matrix computations. We note that both algorithms that will be introduced in the following subsections, SQUIC-fit and SQUIC-sqp, share the need to efficiently address tasks 1 − 4. Their key difference lies in the fact that SQUIC-sqp solves a constrained version of the quadratic surrogate problem (6), with the aim of learning M -matrices in the set (1).
Though the sample covariance matrix S is approximated as being sparse, the computation is dense due to the undetermined sparsity pattern. Initially, the off-diagonal values |S ij | < Λ ij are discarded. During the overall iteration, any values of S which have not been computed yet and which have a corresponding nonzero entry in W are computed on the fly. The kernel operation in computing the matrix S is matrix-matrix multiplication, which is highly parallelizable.
To efficiently compute the Cholesky decomposition, SQUIC uses the algorithm CHOLMOD [47] which is part of the SuiteSparse Matrix Collection. 1 CHOLMOD is based on the supernodal approach, which successively detects dense block structures during the factorization and produces a matrix in a hybrid format. In this format, several consecutive columns with the same nonzero pattern are treated as one dense block. These dense blocks can be efficiently handled with high-performance libraries such as the Intel(R) Math Kernel Library (MKL). For details we refer to [48]. Once the Cholesky decomposition of Θ is successfully computed, log det Θ can be easily obtained as a by-product.
The Cholesky decomposition returns pointwise lower diagonal factors, which we convert into block triangular using the structure of the supernodes. Then, the precision matrix Θ and its inverse Θ −1 are equipped with these block structures, leading to benefits in both storage and computation. The factorization reads Θ = PBDB P , where P is a suitably chosen 1.https://sparse.tamu.edu/ permutation matrix in order to reduce the fill-in, B is block lower triangular with unit diagonal and D is block diagonal. This factorization can be employed to approximately compute with −E being the strictly lower triangular part of B. The series B −1 = p−1 k=0 E k requires at most k = p − 1 terms as E p = 0, and entries of small magnitude are being dropped. Similarly, the final product (B inv ) D −1 B inv also drops entries of small magnitude. We note that the computation of B inv as well as the final product P(B inv ) D −1 B inv P are also efficiently parallelized in SQUIC. The convergence of the algorithm is determined by requiring that the relative difference between the objective function at the updated Θ and the previous Θ prev is below a threshold τ , i.e.
The quadratic optimization problem (6) also uses block structures which leads to block coordinate descent updates by efficiently recycling as much information as possible from previous descent steps. We are not going into the details of this approach and kindly refer the reader to [10].

B. A Post-Processing Approach for M -Matrix Estimation
The first learning algorithm of M -matrices in the set (1) that we present hereby can be considered an unconstrained 1 -regularized technique. In SQUIC-fit we do not enforce additional sign constraints in the estimation of the precision matrix, but instead follow a post processing approach coupled with the utilization of a matrix sparsity parameter in order to obtain the graphical structure of non-negatively correlated variables. Our approach consists of two consecutive estimations of precision matrices Θ (1) , Θ (2) that are solutions of a problem of the form (4). The first precision matrix Θ (1) is computed with the aid of a scalar regularization parameter λ, and is utilized in order to estimate the binary graphical structure of the non-negatively correlated variables in the data Y ∈ R p×n under question. The second precision matrix Θ (2) is estimated with a matrix sparsity parameter Λ that encodes this graphical structure. Finally, the M -matrix in the set (1) is extracted by post-processing the entries of Θ (2) . An outline of the algorithmic scheme for SQUIC-fit is presented in Algorithm 1. In step 1 we aim to solve the 1regularized negative log-likelihood problem, that is The scalar tuning parameter λ is set such that the resulting graph is sparse, and its values usually adjust the regularization according to the number of variables p and the number of features n. Then in step 2 we estimate the structure of the negative off-diagonal entries of Θ (1) as with I denoting the p × p identity matrix. The thresholding parameter κ 0 is chosen sufficiently small so that all negative offdiagonal elements in Θ (1) ij are detected. These values correspond to an attractive GMRF, and capture the notion of positive correlation between two nodes (variables) i, j of the graph. We consider κ = 0 in all our numerical experiments. Larger values can be expressed as the ratio κ = ij | denoting the cardinality of the negative off-diagonal entries. This threshold selection sparsifies G further, and, as a consequence, also the final estimated M -matrix, and removes potentially noisy entries.
In steps 3 − 4 we subsequently utilize the graphical structure of G ∈ R p×p in the composition of the matrix tuning parameter Λ for solving The matrix sparsity parameter is composed as where η < λ ∈ R, thus the regularization matrix Λ effectively uses the sparsity pattern of G as a graphical bias in the estimation of the structure of Θ (2) . The parameter η is always expressed as a ratio of λ in our numerical experiments. Smaller values of η indicate higher confidence that G accurately captures the structure of the positively dependent variables, as the entries corresponding to its nonzero sparsity pattern will be penalized less in the second precision matrix estimation in step 4. The final step 5 of SQUIC-fit involves a post-processing procedure to construct the M -matrix from the entries of Θ (2) . The final matrix Θ is formed by selecting the structure and the weights of the non-positive off-diagonal entries of the estimated precision matrix Θ (2) as In both steps 1 and 4 the SQUIC algorithm is executed up to a convergence tolerance τ . These key operations of SQUIC-fit are illustrated in Fig. 2, where we attempt to retrieve a grid structure with p = 16 nodes from n = 48 samples drawn from the zeromean multivariate Gaussian distribution Y ∼ N (0, Θ −1 * ), with Θ * being the true underlying M -matrix of the grid.
Incorporating available connectivity information for the graphical structure of non-negatively correlated variables in the data Y is also possible in Algorithm 1. In this case the structure of G is part of the input, and the algorithm is reduced to steps 3 − 5.

C. An SQP Approach for M -Matrix Estimation
The second learning algorithm that we introduce, SQUIC-sqp, is a constrained approach for the estimation of M -matrices based on sequential quadratic programming. The minimization of the subject to Θ ij 0 for all i = j.
In order to approximate (14) locally, we employ again the second-order Taylor expansion of f around Θ as in (5). According to I free in (7), problem (14) is restricted to entries Θ ij which are potentially nonzero and for this reason they can be assumed to have either positive sign (i = j) or negative sign (i = j). Thus the local regularized quadratic objective function can be rewritten as Similarly to SQUIC-fit, prior knowledge on the latent graphical structure can be incorporated in the objective function through a matrix sparsity parameter of the form (12). The constrained M -matrix estimation problem (14) is substituted by Now the local approximate function (15) is quadratic and differentiable with linear inequality constraints (16b). Therefore it can be easily solved by sequential quadratic programming [49], [50]. The SQP method distinguishes successively between active constraints (which refer to Δ ij such that Δ ij ≈ −Θ ij ) and inactive constraints (i.e., Δ ij −Θ ij ). The gradient of the unconstrained quadratic function q(Δ) in (15) reads Note that with respect to differential calculus, (17) should be read as vectorized equation 0 = ∇ vec(q(Δ)) = (W ⊗ W) · vec(Δ) + vec(S − W + Λ (2I − ee )). We prefer to keep the concept of matrix equations since this allows to maintain the additional structure of symmetric matrices. The matrix associated with the term WΔW in (17) is equivalent to W ⊗ W, where ⊗ refers to the Kronecker product (i.e., the size of the resulting matrix is squared compared with W). Therefore the Hessian of q is given by Since W is positive definite, so is W ⊗ W. As (16b) corresponds to box constraints, the KKT system for the unconstrained variables can be solved using a straight projection, i.e., systems (16) and (17) are restricted to the diagonal entries Δ ii (which are always unconstrained) and the inactive off-diagonal entries Δ ij , whereas for active constraints, the entries Δ ij enter as inhomogeneity. We denote the affiliated index subsets of I free by I u for the unconstrained indices and by I c for the active constraints. The minimization problem is therefore reduced to solving the projected system ∇q(Δ) = 0 restricted to I u . This requires solving systems with the submatrix of W ⊗ W belonging to the unconstrained indices {i, j} ∈ I u . With respect to the linear system ∇q(Δ) = 0, the sought matrix Δ ∈ R p×p is formally treated as the vector vec(Δ) in a subspace of R p 2 defined via I u . Because of the sheer size of this system, even when projected to the set of unconstrained variables, the only viable option is to use an iterative method. In our case we use the preconditioned conjugate gradient (PCG) method [51], [52] with diagonal preconditioning, implemented in a parallel fashion. Given the tolerance τ , the PCG method checks the relative residual, the relative change of the sought quantity Δ as well as the relative change of the l 1 -regularized quadratic objective function q(Δ). In the case that all conditions are de-activate a promising constraint and update I u , I c 10: else 11: Compute ν ∈ (0, 1] such that Θ ij + νΔ ij 0 for all {i, j} ∈ I u , i = j 12: Θ ← Θ + νΔ 13: end if 14: end while 15: Δ ← Θ − Θ old output Δ fulfilled, PCG is stopped. It should be noted that the vectors used in the PCG method are sparse symmetric matrices stored in compressed column storage format, and as a result, the data is naturally partitioned. The parallelization is then performed along the set of columns of the underlying matrices. As an example consider tr[XY] of two sparse symmetric matrices which takes over the role of the scalar product of two vectors. To do so, perform scalar products of the columns of X and Y in parallel and finally accumulate these independent scalar products to a single number. The SQP method successively evaluates the computed Δ ij , checks the constraints, activates and de-activates constraints until eventually the solution is computed. We sketch the SQP part in Algorithm 2. Due to the simplicity of box constraints in (16b), the Lagrange multiplier M ij of the augmented system q(Δ) If there exist negative components in M, then there must be further descent directions and the index pair {i, j} associated with the most negative entry of M is a promising candidate to be de-activated. We do not go into further details of the SQP method but kindly refer to the literature [49]. In what follows we denote the SQP-based variant of SQUIC as SQUIC-sqp.

IV. ANALYSIS AND VALIDATION ON SYNTHETIC DATA
In this section, we present experimental results on synthetic data for an extensive evaluation of the accuracy and efficacy of the proposed M -matrix estimation routines SQUIC-fit, as outlined in Section III-B, and SQUIC-sqp as summarized in Section III-C. We will compare our methods against the following state-of-the-art graph learning packages: 2 1) Combinatorial Graph Laplacian (CGL) [20]: Graph Laplacian estimation via an iterative block-coordinate descent algorithm. The authors here decompose the original problem into a series of lower-dimensional subproblems. We use a cycle of 100 row/column updates for the minimization of the objective function. 2) Structured Graph Learning (SGL) [22]: The graph Laplacian is estimated by converting combinatorial structural constraints into spectral constraints, and the resulting optimization problem is solved with an algorithm based on quadratic methods. The parameter controlling the quadratic approximation term is set at β = 20, as suggested by the authors. The following Subsection IV-A summarizes our experimental setup, and then IV-B is devoted in a comparison of the accuracy of the methods under consideration. Then, in Subsection IV-C we present timing comparisons between the methods for datasets of an increasing size. Last, in Subsection IV-D, we shift our attention to the way incorporating prior knowledge of the graphical structure of a dataset influences the accuracy of the M -matrix retrieval.

A. Experimental Setup
Since both external algorithms directly estimate the combinatorial graph Laplacian in the set (2), we compare the accuracy of our proposed methods by estimating the M -matrix Θ and then building the combinatorial graph Laplacian as The accuracy in the estimation of L is measured in terms of F-score and relative error (RE). The F-score is defined as where precision is defined as precision = T P/(T P + F P ) and recall as recall = T P/(T P + F N). T P stands for true positive entries, i.e., actual edges that are detected by the algorithm; F P corresponds to the false positives, i.e., edges that are falsely detected, and F N stands for the edges that the algorithm failed to detect. A score of F = 1 suggests that the matrix has been fully recovered, while smaller values of F suggest worse recovery success. The relative error is defined as where L is the estimated matrix and L true the true reference graph Laplacian matrix.
2.The CGL code is available at: https://github.com/STAC-USC/Graph_ Learning. The code for SGL is available as an R package at: https://cran.rproject.org/web/packages/spectralGraphTopology/index.html Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply. Fig. 3. Accuracy comparisons between the different combinatorial graph Laplacian estimation methods measured in terms of F-score (20) and relative error (21). a) F-scores for the lattice grid graph G (64) grid . b) F-scores for the random clusters graph G (60) clust . c) RE for the lattice grid graph G (64) grid . d) RE for the random clusters graph G (60) clust .
We base our results on two synthetic datasets generated from Gaussian distributions with a mean of zero and the following types of predefined graphical structures: r A grid graph structure denoted as G (p) grid , where p is the number of nodes. Each node is connected to its four nearest neighbors (except the nodes at the boundaries).
r A random structured matrix denoted as G (p) clust representing a graphical structure of p/100 balanced clusters with an average node degree of 20 and with 90% of the edges contained within the clusters [53]. Edge weights are then randomly selected based on a uniform distribution from the interval [0. 1,3]. From these structures we generate an IGMRF model parametrized by the true graph Laplacian L true . From this IGMRF model n samples are drawn from the degenerate zero-mean multivariate Gaussian distribution x i ∼ N (0, L † true ), where L † true is the Moore-Penrose pseudoinverse of L true . The sample covariance matrix Σ is computed as We follow the approach of [54] and define the regularization parameter as where the scaling term log (p)/n adjusts the regularization according to p and n, and c ∈ R is based on experimental results and is selected such that the F-score of the resulting graph is maximized. The convergence tolerance for all the methods is set to τ = 10 −4 , the threshold parameter for SQUIC-fit to κ = 0, and all the results reported hereby correspond to their mean value after 10 runs.

B. Accuracy Estimation
Our first round of numerical experiments is designed to evaluate and compare the accuracy of the two proposed algorithms in the retrieval of the synthetic combinatorial graph Laplacian matrices emerging from the graphical structure of G (64) grid and G (60) clust .
We generate 10 instances of each synthetic graph, and present in Fig. 3 the mean accuracy results in terms of F-score (20) and the associated RE (21). The performance of the algorithms is compared for different ratios of sample sizes n/p = {0.1, 0.25, 0.5, 1, 5, 10, 25, 50, 100}. The parameter c in (23) is selected independently at each level for each method, and corresponds to the one that maximizes the F-score. For SQUIC-fit we set in (12) For the lattice grid graphs (Fig 3(a) and (c)) SQUIC-fit achieves very high F-scores and low RE for higher sampling ratios n/p > 1, while still remaining competitive in the low sampling regimes n/p ≤ 1 in terms of F-score. The accuracy of our post processing approach is similar to that of SGL both in terms of F-score and RE. For the random clusters graphs (Fig. 3(b) and (d)) SQUIC-fit achieves the highest F-scores for the sampling ratios n/p > 1, with the RE reported being comparable with that of SGL. Our constrained approach SQUIC-sqp reports here similar F-score and RE with CGL for all sampling regimes.
Our constrained method, SQUIC-sqp, performs better than SQUIC-fit in terms of accuracy in the very low sampling regimes in both synthetic cases. In particular, SQUIC-sqp outperforms SQUIC-fit both in terms of F-score and relative error when the ratio of the available data samples and the dimensionality of the problem is n/p ≤ 0.5. Similar findings for the difference in the accuracy between constrained and unconstrained methods have been reported in several related works [20], [55]. These advantages are expected, as exploiting the M -matrix constraint fulfills the model assumptions of attractive GMRFs.

C. Timing Comparisons
We proceed with a comparison of the runtimes of the methods under question when learning M -matrices. To this end, we consider a sequence of 6 true combinatorial graph Laplacian matrices L true of increasing size. In particular, for the lattice grid graph G  solution in terms of F-score is reported at each p level. We report these timing results in Fig. 4.
The timing results for CGL and SGL are excluded if the runtimes exceed 10 4 seconds. For the lattice grid graph (Fig. 4(a)) SGL exceeds this time limit for p ≥ 4096 and CGL for p = 16384. For the random clusters graph (Fig. 4(b)) the time limit is exceeded by SGL at p ≥ 5000 and by CGL at p = 10 4 . In Fig. 4 we additionally observe that SQUIC-fit outperforms all competing algorithms across all dimensions for both graphical structures. This is an expected behaviour, as SQUIC-fit is the only unconstrained method included in the comparisons. We additionally note that the final post-processing step of SQUIC-fit removes on average 11% of the edges of the second estimated precision matrix for G

D. Incorporating Graphical Bias
In this unit test we study the recovery accuracy of the two introduced SQUIC algorithms when using prior graphical knowledge in the estimation of the sparse M -matrix Θ. The graphical structure of the bias G is defined as a corrupted version of the structure of the true graph Laplacian matrix L true . We control the degree of this corruption with a random symmetric sparse matrix Z ∈ R p×p with δ · |L true |/p number of nonzeros per row. The structure of G is then defined as Notice that for δ = 0 we retrieve the exact structure of L true , while for an increasing δ > 0 the structure of G has an increasing number of noisy entries. Then the matrix tuning parameter is composed in similar fashion to (12) as with λ opt being the scalar regularization parameter resulting in the highest F-score, and b ∈ R a scalar parameter controlling the effect of the matrix bias on the regularization. Larger values of b in (25) result in the matrix bias G being more strictly enforced. We select b = 2 for a moderate influence of G on the estimated L.
We consider two true graph Laplacian matrices emerging from the graphical structure of G 1024 grid and G 1000 clust with n = 500 number of samples. In Fig. 5 we present the effect that an increasing noise factor δ = {0, 1, . . . , 70} has on the retrieval accuracy of both SQUIC-fit and SQUIC-sqp in terms of F-score, and compare it with the retrieval accuracy achieved by the algorithms when using a scalar regularization parameter λ with no graphical bias. Note that the corruption matrix Z has no effect on the retrieval accuracy when using a scalar regularization parameter, as no graphical bias is utilized in the composition of the matrix penalty term Λ.
The best F-scores achieved at the optimal scalar λ opt are represented with the horizontal dashed lines. The performance of the SQUIC algorithms when taking into account a noisy graphical bias G in the matrix sparsity parameter Λ (solid lines) greatly outperforms the scalar counterparts. In particular, for the lattice grid graph G  Fig. 5(b)) the baseline of SQUIC-fit is F − score = 0.59, and is improved for δ ≤ 30, while for SQUIC-sqp the best F-score of 0.56 is improved when considering a graphical bias with δ ≤ 45.

V. EXPERIMENTS WITH REAL-WORLD DATA
In this section, we illustrate the applicability and efficiency of SQUIC-sqp and SQUIC-fit in the estimation of sparse Mmatrices emerging from real-world problems. In Section V-A we identify the largest connected components of a graph emerging from the COVID-19 daily cases in the USA, and perform spectral clustering with the M -Matrix of the largest component. Subsequently, in Section V-B we classify image datasets of up to p = 7·10 4 dimensions based on the eigenvectors of the estimated M -matrices.

A. Clustering of COVID-19 Daily Cases
We consider the publicly available data for the US confirmed daily cases, reported at the county level [56]. 3 We emphasize that the case study presented here is intended to highlight the capabilities of the proposed algorithms and not propose any course of COVID-19 related actions.
The dataset under consideration consists of p = 3342 counties and reports the number of daily COVID-19 cases C for n = 671 days for the time window 22 January 2020 to 23 November 2021. Counties with a total number of cases n C < 100 are discarded, resulting in p = 3209, and these cases are normalized by the number of residents per county in order to obtain information on the infection rate per capita. 4 Subsequently, the M -matrix Θ of the positively correlated counties is constructed with SQUIC-fit in 72 seconds and in 211 seconds with SQUIC-sqp, and the largest connected components of the resulting graphical structure are identified. For the SQUIC-sqp variant we use a scalar regularization parameter of λ = 0.7, and for the SQUIC-fit algorithm we set in (9) λ = 0.7 and in (12) η = 2λ/3. The matrices retrieved from both algorithms are almost identical, thus in what follows we report the results obtained with SQUIC-fit. 3.The COVID-19 Data Repository is provided by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University at https://github. com/CSSEGISandData/COVID-19. Puerto Rico municipalities are included.
4.Demographic information of the USA at the county level is available at https://www.census.gov/data/datasets/time-series/demo/popest/2010scounties-total.html.
We illustrate in Fig. 6 the six largest connected components of Θ. The largest component ( Fig. 6(a)) includes 1774 counties from the entire USA, the second one ( Fig. 6(b)) captures 165 counties from the states of Oklahoma and Iowa, the third one ( Fig. 6(c)) 113 counties from Missouri, the fourth one ( Fig. 6(d)) 81 counties from Michigan, the fifth one ( Fig. 6(e)) 79 counties from Nebraska, and the sixth largest connected component of Θ ( Fig. 6(f)) includes 66 counties from the state of Florida. The clear geographic partition of the components 2 − 6 demonstrates that SQUIC-fit successfully captures the positively correlated variables of the dataset.
We proceed with an analysis of the clusters present in the largest connected component of Θ. This component is denoted as Θ a , and is used to build the random-walk normalized graph Laplacian L rw = T −1 L, where L is defined as in (19) and T is the diagonal degree matrix satisfying T ii = L ii for all i. After computing the eigenvalues λ k of L rw the number of natural clusters present in the dataset is estimated with the relative eigengap [57] A high value of γ k indicates that there is a gap between the kth and the (k + 1)st eigenvalue, and thus, a partition of Θ a into at least k clusters results in a minimization of the balanced graph cut criteria [28]. In order to obtain discrete partitions, the eigenvectors corresponding to the k smallest eigenvalues of L rw are clustered with the k-means algorithm with 20 orthogonal and 10 random initializations [58]. We present the clustering results using Θ a in Fig. 7. According to the relative eigengap, 8 distinct clusters are present in the subgraph. The locations of the counties present at each cluster are illustrated in Fig. 7(a), and the cardinality of the respective clusters is presented in Fig. 7

B. Image Classification
In this case study we demonstrate the applicability of the introduced algorithms in the estimation of M -matrices emerging from image applications. We study the problem of classifying facial images and handwritten characters according to their labels by applying a spectral clustering routine on the eigenvectors of the estimated random walk Laplacian L rw . Classification accuracy is measured in terms of the unsupervised clustering accuracy (ACC ∈ [0, 1]), and the normalized mutual information (NMI ∈ [0, 1]) [59]. For both classification metrics a value of 1 suggests a perfect grouping of the nodes according to the true labels. We consider the following publicly available datasets  The high dimensionality p of these datasets renders them computationally unfavorable for the CGL [20] and SGL [22] methods, thus here we compare our proposed algorithms against the traditional approach of building adjacency matrices A. This approach consists of initially creating the connectivity matrix G ∈ R p×p from a k-nearest neighbors routine, with the number of nearest neighbors (NN) set such that the resulting graph is connected. For these datasets the number of nearest neighbors needed for a connected graph is NN = 12 for YaleA and Olivetti and NN = 11 for both USPS and KMNIST. Subsequently, the similarity matrix H ∈ R p×p between the data points is defined similarly to [64] as H ij = max{H i (j), H j (i)} with H i (j) = exp(−4 , with σ i standing for the euclidean distance between the i-th data point and its k-th nearest neighbor. The adjacency matrix A is then created as We utilize the kNN connectivity matrix G as graphical bias for SQUIC-fit and SQUIC-sqp and find the optimal scalar tuning parameter λ = λ opt for each case such that the discrete objective of spectral clustering, the balanced graph cut [28], is minimized. The matrix tuning parameter Λ in (12) is then set with η = λ opt / √ p for both SQUIC-fit and SQUIC-sqp. Our strategy is thus penalizing the graphical bias G with a decreasing rate for an increasing number of dimensions p. The goal is to obtain within a reasonable amount of time graphical representations of the datasets that are sparser than G and more accurate, and which will therefore lead to an increase in the classification accuracy metrics after applying a spectral clustering routine. Sparsity in the graph is measured in term of edge density, defined as = |E|/(|V | * (|V | − 1)), which is a ratio reflecting how close  I  CLASSIFICATION RESULTS FOR THE IMAGE DATASETS OF SUBSECTION V-B. WE REPORT THE EDGE DENSITY , AND THE ACCURACY METRICS ACC, NMI. THE  BEST VALUE ACHIEVED IS PRESENTED IN BOLD, AND THE PERCENTAGES SHOW HOW INFERIORLY THE OTHER METHODS FARED AGAINST IT the graph is to a complete graph, with = 1 for a complete graph.
The M -matrices of the 4 datasets under consideration are retrieved in t = 0.9, 6.7, 46.3 and 1255.6 seconds with SQUICfit, and in t = 2.4, 3.6, 31.5 and 2024 seconds with SQUIC-sqp respectively. We summarize the rest of our results in Table I. For each dataset we report the edge density, the ACC and the NMI achieved by the best method, and the percentage the remaining methods are inferior to that value. Inferiority in percentage values is defined as I = 100 · γ · (e ref − e best )/e best , where e best is the best value, e ref the value it is compared against, and γ = −1 for minimization scenarios ( ) and γ = 1 for maximization ones (ACC, NMI). Both SQUIC-fit and SQUIC-sqp improve the classification accuracy of the traditional kNN graph for all the datasets considered. In particular, SQUIC-fit achieves the highest accuracy metrics for all cases, and the lowest edge density for all cases except USPS. The reduction of the edge density is more evident for YaleA and Olivetti, as the tuning parameter η = λ opt / √ p applied on the entries of the graphical bias G has a less impact for graphs of low dimensions p. In Fig. 8 we illustrate this reduction in for the YaleA dataset. For visual clarity we select a subset (variables 100 to 155) of the image dataset YaleA, organized in five distinct classes, denoted by letters A to E, with each class composed by eleven variables. We order the variables in a circular layout and compare the graphical structure obtained by SQUIC-fit (398 gray edges in Fig. 8(a)) and SQUIC-sqp (432 gray edges in Fig. 8(b)). The red edges in both figures represent the edges that were removed from the graphical bias G, estimated with a kNN routine, after applying SQUIC-fit (68 edges) and SQUIC-sqp (28 edges). Multiple edges that connect variables belonging to different classes are removed in both cases, thus reducing the interclass connectivity of the graph. The advantages of these sparser graphical structures, with edge weights assigned by solving the MLE problem, are verified by the increased classification scores of Table I.

VI. CONCLUSION
In this work, motivated by the effectiveness of the SQUIC package in learning precision matrices of very large dimensions, we developed two algorithms for learning M -matrices that represent graphs whose nodes are non-negatively correlated random variables. Both algorithms are based on the 1 -regularized minimization of the MLE problem, and are able to incorporate available information about the latent graphical structure of the data under question in the form of a matrix regularization parameter. The first one, SQUIC-fit, is an unconstrained approach that performs two consecutive precision matrix estimations, and utilizes the positively correlated variables identified in the first run as graphical bias for the retrieval of the second precision matrix. Subsequent post-processing on its entries guarantees that the resulting matrix is positive definite and an M -matrix. The second one, SQUIC-sqp, is a constrained method that enforces the M -matrix structure during the MLE optimization procedure. The constrained minimization is achieved by means of a sequential quadratic programming algorithm, with the corresponding KKT system being solved with a preconditioned conjugate gradient method.
Our methods are compared against various state-of-the-art methods in a series of synthetic tests, showing that the introduced algorithms offers significant gains in terms of time-to-solution, while accurately retrieving the underlying M -matrix structure. In particular, for artificial graphs emerging from a grid and a randomly clustered structure the two introduced algorithms attain equivalent retrieval accuracy scores, and are up to 3 orders of magnitude faster for graph dimensions p ∈ [256, 10 4 ]. Additionally, we see that for these synthetic cases incorporating in the matrix regularization parameter available information regarding the latent graphical structure of the data greatly improves the retrieval accuracy of both SQUIC-fit and SQUIC-sqp.
Furthermore, we provide two case studies that demonstrate the applicability of the introduced algorithms in real-world scenarios. In the first one we identify the largest connected components of the M -matrix emerging from daily 671 COVID-19 cases for 3209 US counties, and observe that these components correspond to clear geographical patterns. Subsequently, we perform spectral clustering on the largest connected component and report that the resulting clusters are also revealing distinct geographic partitions. For the second case study we classify with image datasets with up to p = 7·10 4 variables based on the eigenvectors of the estimated M -matrices, and report increases in the classification accuracy over the traditional approach of building adjacency matrices for spectral methods.
The consistency of our results, from the artificial tests to the real-world cases, highlights the effectiveness of the introduced graph learning algorithms and the broad applicability of the presented work.