Sparse Cholesky Covariance Parametrization for Recovering Latent Structure in Ordered Data

The sparse Cholesky parametrization of the inverse covariance matrix is directly related to Gaussian Bayesian networks. Its counterpart, the covariance Cholesky factorization model, has a natural interpretation as a hidden variable model for ordered signal data. Despite this, it has received little attention so far, with a few notable exceptions. To fill this gap, in this paper we focus on arbitrary zero patterns in the Cholesky factor of a covariance matrix. We discuss how these models can also be extended, in analogy with Gaussian Bayesian networks, to data where no apparent order is available. For the ordered scenario, we propose a novel estimation method based on matrix loss penalization, as opposed to the existing regression-based approaches. The performance of this sparse model for the Cholesky factor, together with our novel estimator, is assessed in a simulation setting, as well as over spatial and temporal real data where a natural ordering arises among the variables. We give guidelines, based on the empirical results, about which of the methods analysed is most appropriate for each setting.


Introduction
The multivariate Gaussian distribution is central in both statistics and machine learning because of its wide applicability and good theoretical behaviour. Whenever it is necessary to reduce model dimension, such as in a high-dimensional setting [1] or in graphical models [2], sparsity is imposed on either the covariance matrix Σ or its inverse Ω = Σ^{−1}.
The inverse covariance matrix Ω directly encodes information about partial correlations; therefore, when a zero pattern is present in Ω, it represents absent edges in the undirected graph of a Gaussian Markov network [3]. Furthermore, letting Ω = WW^t be its Cholesky decomposition, a zero pattern in the lower triangular matrix W yields the acyclic digraph associated with a Gaussian Bayesian network (GBN) [4] model, up to a permutation of the variables [5]. As a result, much of the academic focus has been on sparsity in either the inverse covariance matrix or its Cholesky decomposition [6][7][8][9][10].
Conversely, a zero pattern in the covariance matrix Σ represents missing edges from the undirected graph of a covariance graph model [11][12][13][14]. However, a structured zero pattern in the Cholesky decomposition Σ = TT^t of the covariance matrix has only been addressed by a few works: [15] and [16]. In [15] the authors briefly analyse zeros in T, mainly as a tool for better understanding a higher-level graphical model called the covariance chain, which is the main focus of their work. More directly related to our interests is the work of Rothman et al. [16], who directly explore a new regression interpretation of the Cholesky factor T. However, the authors do not address a structured zero pattern directly; instead, they focus on a banding structure for T, inspired by the popularity of this estimator for the covariance matrix. In fact, a significant amount of the paper is devoted to analysing the relationship between the covariance matrix, or its inverse, and the banded Cholesky factorization.
To fill this gap, in this paper we focus on arbitrary zero patterns in the Cholesky factor T of the covariance matrix Σ.
This model is especially suited for data with naturally ordered variables, as also occurs with the sparse Cholesky factorization of the inverse covariance [9]. However, the regression interpretation in this case, as pointed out by [16], is over the error variables. Therefore it is especially suited for scenarios where our interest is in hidden signal sources that have been recorded with measurement errors. In this work we discuss how this model could be extended to unordered variables by defining a new Gaussian graphical model over the Cholesky factorization of the covariance matrix. This new graphical model can be thought of as the analogue of GBNs, which are defined via sparsity patterns in the Cholesky factorization of the inverse covariance matrix. For the ordered scenario, we propose a novel learning method that directly penalises a matrix loss, in contrast with the existing regression-based approaches in the literature [16]. This new estimator is feasible to compute thanks to a simplification of the loss gradient computation similar to the one proposed in [17]. Finally, we empirically assess the model performance in a broad range of experiments, using both our proposed estimators and regression-based ones. We explore multiple simulation scenarios, as well as real data where a natural spatial or temporal order arises between the variables.
The rest of the paper is organized as follows. In Section 2 we introduce the theoretical preliminaries necessary to follow the exposition. Then we detail the proposed sparse model for the Cholesky factor T of the covariance matrix Σ, as well as introduce its extension to a Gaussian graphical model, in Section 3. Afterwards, we discuss existing state-of-the-art estimation methods in Section 4, where we also detail our novel penalized matrix loss proposal. We assess the previously explained estimation methods in both simulated and real experiments, whose results are shown in Section 5. Finally, we close the paper in Section 6, discussing the conclusions that can be drawn from the presented work, and also outlining the planned lines of future research. Appendices A.1, A.2 and A.3 are referenced throughout the paper and contain additional material for the interested reader.

Theoretical preliminaries
We will denote as N(µ, Σ) the multivariate Gaussian distribution with mean vector µ and covariance matrix Σ. Since we focus on the zero pattern in Σ, in the following we will set µ = 0 for notational simplicity, without losing generality. As in the previous section, we will denote the inverse covariance matrix as Ω = Σ^{−1}.

The inverse covariance matrix and systems of recursive regressions
As we stated previously, the inverse covariance matrix and its decompositions have been thoroughly researched. If X is a p-variate Gaussian random vector, that is, if X ∼ N(0, Σ), then the upper Cholesky factorization of Ω can be used to model ordered sequences of regressions [4,6], as follows.
For J ⊆ {1, . . ., p}, i ∉ J and j ∈ J, let β_i|J = (Σ_iJ Σ_JJ^{−1})^t denote the vector of regression coefficients of X_i on X_J, with its j-th entry being β_ij|J, the coefficient corresponding to variable X_j. We may equivalently write the statistical model for X as a system of recursive linear regression equations (also called a linear structural equation model [18]),

X_i = β_i1|1,...,i−1 X_1 + · · · + β_ii−1|1,...,i−1 X_i−1 + E_i for each i ∈ {1, . . ., p},    (1)

with E_i independent Gaussian random variables of zero mean and var(E_i) = var(X_i | X_1, . . ., X_i−1). The above regression system can also be expressed as X = BX + E, where B is a strictly lower triangular matrix with entries b_ij = β_ij|1,...,i−1. Rearranging the previous equation, we obtain

X = (I_p − B)^{−1} E = LE,    (2)

where L = (I_p − B)^{−1}, with I_p the p-dimensional identity matrix. Now if we take variances and then inverses in Equation (2), we arrive at the upper Cholesky decomposition of the inverse covariance matrix,

Ω = UD^{−1}U^t = WW^t,    (3)

where U = L^{−t} = (I_p − B)^t and W = U√(D^{−1}) are upper triangular matrices, and D is a diagonal matrix containing the variances of E.
Equations (1) and (3) are intimately related: all the parameters of the recursive regression system are encoded in the upper Cholesky factorization. Indeed, we have that D = var(E) and that the (j, i) entry of matrix U, u_ji, is equal to −β_ij|1,...,i−1; therefore all the regression coefficients can also be recovered, and U is explicitly written as

U =
⎛ 1  −β_21|1  −β_31|1,2  · · ·  −β_p1|1,...,p−1 ⎞
⎜ 0     1     −β_32|1,2  · · ·  −β_p2|1,...,p−1 ⎟
⎜ ⋮                  ⋱                ⋮         ⎟
⎝ 0     0         0      · · ·        1         ⎠    (4)
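As a concrete check of these identities, the following NumPy sketch (an illustration of ours, not taken from the paper) builds a random recursive regression system from B and D and verifies that Ω = WW^t with W = (I_p − B)^t √(D^{−1}):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

# Strictly lower triangular regression coefficients b_ij = beta_{ij|1..i-1}
B = np.tril(rng.normal(scale=0.5, size=(p, p)), k=-1)
D = np.diag(rng.uniform(0.5, 2.0, size=p))   # error variances var(E_i)

L = np.linalg.inv(np.eye(p) - B)             # X = L E, Equation (2)
Sigma = L @ D @ L.T                          # var(X)

# Upper Cholesky factor of the inverse covariance: Omega = W W^t, Equation (3)
U = (np.eye(p) - B).T
W = U @ np.diag(1.0 / np.sqrt(np.diag(D)))

assert np.allclose(W @ W.T, np.linalg.inv(Sigma))
```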

The Cholesky decomposition of a covariance matrix
From Equation (2) we could have proceeded differently. Instead of taking variances and then inverses, which is most suited for analysing ordered sequences of variable regressions and the subsequent GBNs (as explained previously), we could have just taken variances, arriving at the Cholesky decomposition of the covariance matrix [16],

Σ = LDL^t = TT^t,    (5)

where now L = (I_p − B)^{−1} and T = L√D are lower triangular. Observe that Equation (5) is a direct analogue of Equation (3), although now there is no interpretation of the entries of L as coefficients of the recursive regressions. However, we will now show that the entries of L are in fact regression coefficients, just over a different conditioning set. Alternative derivations of this result can be found in [19, p. 158] and [15, p. 846]. The one in [19] is computational, based on the sweep matrix operator [20], whereas [15] provides a sketch based on a recursive expression for regression coefficients. We use instead simple identities over partitioned matrices.

Proposition 1. For i ∈ {1, . . ., p} and j < i, the (i, j) entry of matrix L, denoted as l_ij, is equal to β_ij|1,...,j.
Proof. For each i ∈ {1, . . ., p}, j < i and J = {1, . . ., j}, the following partitioned identities hold [19, Equation (4.2.18)]:

Σ_JJ = L_JJ D_JJ L_JJ^t,  Σ_iJ = L_iJ D_JJ L_JJ^t.

Therefore, Σ_iJ Σ_JJ^{−1} = L_iJ L_JJ^{−1}. Furthermore, observe that, since L_JJ is lower triangular with ones along the diagonal, the last column of L_JJ^{−1} is always a vector of zero entries except the last entry, which is 1. This means, in particular, that for each i ∈ {1, . . ., p}, j < i and J = {1, . . ., j}, the j-th element of the row vector L_iJ L_JJ^{−1} is equal to l_ij, which in turn is equal to the j-th entry of β_i|J, that is, β_ij|J.
The explicit expression of L in terms of regression coefficients is therefore

L =
⎛    1        0          0       · · ·  0 ⎞
⎜ β_21|1      1          0       · · ·  0 ⎟
⎜ β_31|1   β_32|1,2      1       · · ·  0 ⎟
⎜   ⋮          ⋮               ⋱        ⋮ ⎟
⎝ β_p1|1   β_p2|1,2  β_p3|1,2,3  · · ·  1 ⎠    (6)

From Equations (4) and (6) useful relationships can be determined; see, for example, [15, p. 847] and Appendix A.1.
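Proposition 1 can be verified numerically. Below is a small NumPy sketch (illustrative, not from the paper) that computes L from the Cholesky factor of a generic covariance matrix and checks each l_ij against the corresponding regression coefficient β_ij|1,...,j:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)              # a generic SPD covariance matrix

T = np.linalg.cholesky(Sigma)                # Sigma = T T^t, T lower triangular
d = np.diag(T)
L = T / d                                    # column scaling: L = T D^{-1/2}, unit diagonal

for i in range(1, p):
    for j in range(i):
        J = slice(0, j + 1)                  # conditioning set J = {1, ..., j}
        # beta_{i|J} = Sigma_{JJ}^{-1} Sigma_{Ji} (Sigma symmetric)
        beta = np.linalg.solve(Sigma[J, J], Sigma[i, J])
        assert np.isclose(L[i, j], beta[-1]) # l_ij = beta_{ij|1,...,j}
```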

The Gaussian Bayesian network model and sparse Cholesky factorizations of Ω
The upper Cholesky factorization in Equation (3) is typically used as a parametrization for GBNs; see, for example, [10] or [5]. In the following we explain why, since this is closely related to the extension of our proposal to a graphical model. Let G = (V, E) be an acyclic digraph, where V = {1, . . ., p} is the vertex set and E ⊆ V × V is the edge set. Assume that 1 ≺ · · · ≺ p is a topological order of V, which means that for all i ∈ V, denoting as pa(i) the parent set in G of node i, it holds that pa(i) ⊆ {1, . . ., i − 1}. The ordered Markov property of Bayesian networks states that [4]

X_i ⊥⊥ X_j | X_pa(i) for all j < i and j ∉ pa(i),    (7)

where X_i ⊥⊥ X_j | X_pa(i) stands for conditional independence [21], that is, X_i and X_j are independent given X_pa(i) = x_pa(i), for any value of x_pa(i). In the multivariate Gaussian distribution, the above conditional independence is equivalent to the regression coefficient β_ij|1,...,i−1 being zero (see for example [22]). Since β_ij|1,...,i−1 = 0 if and only if u_ji = 0, the zero pattern encoding absent edges in G can be read off from U in Equation (3). Now allow the ancestral order to be arbitrary, and denote as U(G) the set of matrices that have positive diagonal and zeros compatible with a given acyclic digraph G = (V, E); that is, such that if (j, i) ∉ E, j ≠ i, then m_ji = 0 for all M ∈ U(G). The GBN model can thus be expressed as

X ∼ N(0, Σ) with Ω = Σ^{−1} = WW^t, W ∈ U(G).    (8)

Remark 1. Observe that if X = (X_1, . . ., X_p)^t ∼ N(0, Σ) follows a GBN model with graph G, the parameter matrix W of Equation (8) is not the Cholesky factor of Ω = Σ^{−1}. This only occurs when the ancestral order of the nodes in G is 1 ≺ · · · ≺ p; that is, when the variables follow a natural order (the direct analogue of our model).
However, if we denote as κ(M) the permutation of rows and columns in M following the ancestral order of G, then κ(W) is the upper Cholesky factor of κ(Ω).

Sparse Cholesky decomposition of the covariance matrix
We will hereby introduce the sparse Cholesky decomposition model Σ = TT^t for the covariance matrix, which consists of allowing an arbitrary zero pattern in T. The entries of T are in correspondence with those in L and D (see Equation (5)), and therefore we will sometimes refer to each of these matrices interchangeably.

The regression interpretations for the covariance Cholesky factor
In terms of model estimation, each column of U (Equation (4)) corresponds to the parameters of a single recursive regression, whereas each entry of matrix L (Equation (6)) corresponds to a different regression model.
Fortunately, [16] gave an alternative regression interpretation for matrix L: recalling Equation (2), where the original variables X are written as a linear function of the error terms E, and unfolding this matrix equation, we obtain

X_i = l_i1 E_1 + · · · + l_ii−1 E_i−1 + E_i for each i ∈ {1, . . ., p},    (9)

thus obtaining an analogue of Equation (1); but now, instead of regressing on the original variables, the regression is performed over the error terms of the ordered recursive regressions in Equation (1).

Remark 2. The (i, j) entry of matrix L for j < i, l_ij, has therefore two interpretations as a regression coefficient:
1. It is the coefficient of error variable E_j in the regression of X_i over E_1, . . ., E_i−1.
2. It is the coefficient of variable X_j in the regression of X_i over X_1, . . ., X_j.
Furthermore, from Equations (1) and (9) we have another dual interpretation for variable E_i: it is the error of the regression of X_i onto X_1, . . ., X_i−1, but also of X_i onto E_1, . . ., E_i−1.
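The first interpretation in Remark 2 can be illustrated empirically. The following NumPy sketch (an illustration of ours, assuming unit error variances for simplicity) simulates data from X = LE and recovers the entries of L by regressing each X_i on the previous error variables:

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 4, 200_000

# A unit lower triangular L, as in X = L E (error variances set to 1 here)
L_true = np.eye(p) + np.tril(rng.normal(scale=0.5, size=(p, p)), k=-1)
E = rng.normal(size=(N, p))                  # independent latent error signals
X = E @ L_true.T                             # observed ordered variables

# Regressing X_i on E_1, ..., E_{i-1} recovers row i of L (Equation (9)),
# since the leftover term E_i is independent of the predictors
for i in range(1, p):
    coef, *_ = np.linalg.lstsq(E[:, :i], X[:, i], rcond=None)
    assert np.allclose(coef, L_true[i, :i], atol=0.02)
```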

A hidden variable model interpretation
The sparse Cholesky parametrization of the covariance matrix naturally models a hidden variable structure over ordered Gaussian observables (Equation (2)). Interpreting the error terms E as latent signal sources, the model is a sort of restricted GBN. This interpretation is naturally associated with the first regression interpretation outlined in Remark 2.
The constraints on the GBN that represents our model are:
• All arcs go from the hidden variables E to the observed ones X.
• There is always an arc from E i to X i , for all i ∈ {1, . . ., p}.
Figure 1 represents one such restricted GBN compatible with our model. The sparse Cholesky factor for Figure 1 would have a lower triangular pattern in which an asterisk marks a non-zero entry. Observe that the zero value in entry (3, 1) corresponds to the missing arc from E_1 to X_3 in Figure 1.

A graphical model extension for unordered variables
In direct analogy with GBNs and Equation (8), we can define a graphical model which is parametrized by a Cholesky factorization, up to a permutation. Let G = (V, E) be an arbitrary given acyclic digraph, and denote with ≺ its ancestral order. Let pr≺(i) = {j ∈ V : j ≺ i} be the predecessor set of a node i ∈ V. Analogously to U(G) in GBNs, denote as L(G) the set of matrices which have a zero pattern compatible with G; by this we mean that if (j, i) ∉ E, j ≠ i, then t_ij = 0. We may define the Gaussian graphical model (comparable to Equation (8))

X ∼ N(0, Σ) with Σ = TT^t, T ∈ L(G).    (11)

Remark 3. As in the case of GBNs (Remark 1), the parameter matrix T in Equation (11) is lower triangular, and thus coincides with the Cholesky factor of Σ, only when the variables are already ancestrally ordered. Otherwise, the sparse Cholesky factorization model applies when the ancestral order ≺ is known: denoting as κ(M) the permutation of rows and columns of M following ≺, κ(T) is the Cholesky factor of κ(Σ).
This extension has a more natural correspondence with the second regression interpretation of Remark 2, which holds after reordering rows and columns of T to comply with ≺, as follows. First note that Equation (6) holds for an unordered version of L, and thus we have that t_ij ∝ β_ij|pr≺(j) for j ≺ i. Therefore, we retrieve a sort of ordered Markov property (comparable to Equation (7)),

X_i ⊥⊥ X_j | X_pr≺(j) for all j ≺ i with (j, i) ∉ E.    (12)

A simple example of an arbitrary graph is that in Figure 2. In this model, factor T would be lower triangular after reordering its rows and columns following the ancestral ordering of the graph, 2 ≺ 1 ≺ 3.

Figure 2: Graph with ancestral order 2 ≺ 1 ≺ 3. We have omitted the index set notation for vertices and directly used variables for clarity.
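The role of the permutation κ in Remark 3 can be sketched numerically as follows (illustrative NumPy code of ours; the order used corresponds to the example 2 ≺ 1 ≺ 3):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 3
order = [1, 0, 2]                            # ancestral order 2 < 1 < 3 (0-based indices)

A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)              # a generic SPD covariance matrix

# kappa: permute rows and columns following the ancestral order
Sigma_perm = Sigma[np.ix_(order, order)]
T_perm = np.linalg.cholesky(Sigma_perm)      # kappa(T): lower triangular after reordering

# Undo the permutation to recover the (in general non-triangular) factor T of Sigma
inv = np.argsort(order)
T = T_perm[np.ix_(inv, inv)]
assert np.allclose(T @ T.T, Sigma)           # T still factorizes Sigma
```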
In the example of Figure 1, where the variables already exhibit a natural order, the graph representing such interactions would be that in Figure 3, whose parameter matrix T is already lower triangular (see also Remark 3). A further study of this graphical model extension is outside the scope of this work, since we focus on naturally ordered variables from hidden signal sources, but we have discussed it here for completeness and as a line of future work.

Model estimation
We will first review two existing regression-based estimators for this model, which can be found in [16], and then detail our proposed penalized matrix loss estimation method.

Existing work: Banding and lasso
Throughout this section denote as x i a sample of size N corresponding to variable X i , where X = (X 1 , . . ., X p ) is assumed to follow the regression model of Equation (9).
The banding estimate for T builds upon the respective estimate for L. The idea is to estimate by standard least squares only the first k sub-diagonals of L and set the rest to zero. Specifically, if b(k) = max(1, i − k) denotes the starting index, with respect to the band parameter k, of the i-th row vector l_i = (l_ib(k), . . ., l_ii−1)^t in matrix L, then, letting ε_1 = x_1, the rows of L are estimated recursively as

l_i = arg min_{l_i} ‖x_i − Σ_{j=b(k)}^{i−1} l_ij ε_j‖_2^2, with residuals ε_i = x_i − Σ_{j=b(k)}^{i−1} l_ij ε_j,    (13)

where ‖·‖_2 is the Euclidean or l_2-norm. In order to ensure positive definiteness of all matrices involved in the computations, k must be smaller than min(N − 1, p) [16].
Using the estimates obtained from Equation (13) we can then recover T = L√D, which inherits the band structure from L. Rothman et al. [16] show that the resulting estimated covariance Σ = TT^t is not of maximum likelihood, although interestingly both estimators do coincide for the Cholesky factor of the inverse covariance. The main drawback of this banding estimator is the restrictive zero pattern that it imposes. Note also that this method requires prior selection of the parameter k.
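The banding scheme of Equation (13) can be sketched as follows (a minimal NumPy illustration assuming centered data; the helper name `banded_cholesky` is ours, not from [16] or the covchol package):

```python
import numpy as np

def banded_cholesky(X, k):
    """Banded least-squares sketch of L and D in Sigma = L D L^t.

    Sequentially regresses each variable on the previous residuals,
    keeping only the k sub-diagonals of L. X is an (N, p) matrix of
    centered data with columns in their natural order.
    """
    N, p = X.shape
    L = np.eye(p)
    eps = np.empty_like(X)
    eps[:, 0] = X[:, 0]                      # eps_1 = x_1
    for i in range(1, p):
        lo = max(0, i - k)                   # b(k): first retained predictor
        Z = eps[:, lo:i]                     # previous residuals inside the band
        coef, *_ = np.linalg.lstsq(Z, X[:, i], rcond=None)
        L[i, lo:i] = coef
        eps[:, i] = X[:, i] - Z @ coef       # residual for the next regressions
    D = np.diag(eps.var(axis=0))             # estimated error variances
    return L, D

# T = L sqrt(D) then inherits the band structure from L
```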
An alternative to banding which gives more flexibility over the zero pattern, also discussed in [16], is to use lasso regression over Equation (9),

l_i = arg min_{l_i} { ‖x_i − Σ_{j=1}^{i−1} l_ij ε_j‖_2^2 + λ ‖l_i‖_1 },    (14)

where this time ε_1 = x_1, l_i = (l_i1, . . ., l_ii−1)^t, ‖·‖_1 is the l_1-norm and λ > 0 is the penalisation parameter. Observe that this penalty could be replaced with any other sparsity-inducing penalty over l_i; for example, [16] also discussed the nested lasso [23], because its sparsity pattern is preserved in the covariance matrix. We have discarded such a penalty because we are primarily interested in an arbitrary zero pattern in T, and not worried about the one induced in Σ.

Penalized gradient-based learning of the covariance sparse Cholesky factor
The above approaches are based on the regression interpretation of the sparse Cholesky factor model, Equation (9). By contrast, we propose to directly estimate all of the parameters, that is, the entries of matrix T, by solving one optimization problem. This allows, for example, recovering maximum likelihood estimates, and has the potential to be easily extended to the graphical model interpretation (Equation (11)) following an approach similar to [24]. Denote as Σ(T) the parametrization of a covariance matrix Σ with its Cholesky factor T (Equation (5)). We propose to learn a sparse model for T by solving the following optimization problem,

arg min_T { φ(Σ(T)) + λ ‖T‖_1 },    (15)

where φ(·) is a differentiable loss function over covariance matrices, λ > 0 is the penalisation parameter, and ‖·‖_1 is the l_1-norm for matrices, which induces sparsity on T [25]. Note that, as in the regression case, the l_1 penalty could be replaced with any other sparsity-inducing matrix norm.
Solving Equation (15) can be done via proximal gradient algorithms, which have optimal convergence rates among first-order methods [25], and which are tailored for a convex φ but also competitive in the non-convex case [17]. In this work we have used two such smooth loss functions: the negative Gaussian log-likelihood and the triangular Frobenius norm.
The negative Gaussian log-likelihood for a sample x_1, . . ., x_N when µ is assumed to be zero is [22] proportional to

φ_NLL(Σ(T)) = log det Σ(T) + tr(Σ Σ(T)^{−1}),    (16)

where Σ = (1/N) Σ_{n=1}^N x_n x_n^t is the maximum likelihood estimator for Σ. On the other hand, the Frobenius norm loss that we also consider is

φ_FR(Σ(T)) = ‖Σ(T) − Σ‖_F^2.    (17)

Both φ_NLL and φ_FR are smooth; in general φ_NLL renders the optimization problem of Equation (15) non-convex [26], whereas φ_FR is a convex function of Σ(T). In Appendix A.2 we provide the details on the gradient computations for φ_NLL and φ_FR, as well as the proximal algorithm pseudocode.
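A minimal proximal gradient sketch for Equation (15) with a Frobenius-type loss follows (the helper name `prox_grad_chol`, the fixed step size and the warm start are our illustrative choices; the actual covchol implementation may differ, e.g. in its loss, step-size rule and stopping criterion):

```python
import numpy as np

def prox_grad_chol(S, lam=0.1, step=1e-3, iters=500):
    """Proximal gradient sketch for a sparse lower-triangular factor T.

    Minimizes ||S - T T^t||_F^2 + lam * ||T||_1 (l1 on the off-diagonal),
    alternating a gradient step on the smooth loss with soft-thresholding,
    the proximal operator of the l1 penalty.
    """
    p = S.shape[0]
    T = np.linalg.cholesky(S + 1e-6 * np.eye(p))   # warm start at the dense factor
    offdiag = np.tril(np.ones((p, p), dtype=bool), k=-1)
    for _ in range(iters):
        grad = -4.0 * (S - T @ T.T) @ T            # gradient of ||S - T T^t||_F^2
        T = np.tril(T - step * grad)               # descend, keep T lower triangular
        # proximal step: soft-threshold the off-diagonal entries
        T[offdiag] = np.sign(T[offdiag]) * np.maximum(
            np.abs(T[offdiag]) - step * lam, 0.0)
        np.fill_diagonal(T, np.maximum(np.diag(T), 1e-8))  # positive diagonal
    return T
```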

Experiments
In all of the experiments we compare the four estimation methods outlined in the previous section: banding T (Equation (13)), lasso regressions (Equation (14)), and our two proposed penalized losses φ_NLL (Equation (16)) and φ_FR (Equation (17)). These four methods will be denoted in the remainder as band, lasso, grad_lik and grad_frob, respectively. All data was standardized, and therefore in our proposed losses the sample correlation matrix was used instead of the sample covariance. The implementation of our loss optimization methods grad_frob and grad_lik can be found in the R [27] package covchol.

Simulation
We have conducted simulation experiments in two different scenarios. First, because the work of Rothman et al. [16] is the most directly related to our model, we have replicated their simulation setting for completeness. Therein, three fixed covariance matrices are selected, with either a fixed known banded sparsity pattern or no zeros at all. By contrast, in the second experiment we explore how the methods perform when the sparsity pattern is arbitrary.
In both experiments we have measured two statistics, in order to assess both model selection (how precisely the zero pattern is recovered) and estimation (how numerically close the retrieved matrix is to the original one). These metrics are evaluated over the estimated and true Σ in the first experiment, instead of T, for better comparability with [16]. Specifically, we use the F1 score for evaluating the zero pattern,

F1 = 2 · TPR · TDR / (TPR + TDR),    (18)

where TPR is the true positive rate (the proportion of truly non-zero entries estimated as non-zero) and TDR is the true discovery rate (the proportion of estimated non-zero entries that are truly non-zero). The estimation error is measured as the norm difference between the estimated and true matrices (denoted NORM in the figures).
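The support-recovery F1 score of Equation (18) can be computed as in the following sketch (the helper `f1_pattern` is ours; TDR plays the role of precision and TPR of recall):

```python
import numpy as np

def f1_pattern(T_hat, T_true, tol=1e-8):
    """F1 score for support recovery of a lower-triangular factor.

    Treats non-zero off-diagonal entries as 'positives'; TDR is the
    precision and TPR the recall of the estimated support.
    """
    mask = np.tril(np.ones_like(T_true, dtype=bool), k=-1)
    est = np.abs(T_hat[mask]) > tol          # estimated non-zero pattern
    true = np.abs(T_true[mask]) > tol        # true non-zero pattern
    tp = np.sum(est & true)
    if tp == 0:
        return 0.0
    tdr = tp / est.sum()                     # true discovery rate (precision)
    tpr = tp / true.sum()                    # true positive rate (recall)
    return 2 * tdr * tpr / (tdr + tpr)
```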

Fixed covariance matrices
The fixed covariance matrices used in the simulations by Rothman et al. [16] are:
• The autoregressive model of order 1, where the true covariance matrix Σ_1 has entries σ_ij = ρ^|i−j|, with ρ = 0.7.
• The dense correlation matrix Σ 3 with 0.5 in all of its entries except for the diagonal.
Similarly to [16], we use for the matrix dimension p values ranging from 30 to 500, and draw from the respective distribution N(0, Σ_i), i ∈ {1, 2, 3}, a sample of size 200, which allows visualizing both the p > N and p < N scenarios. With this experiment we measure how sparsity-inducing methods for learning T behave in scenarios which are not especially suited for them, except for band and Σ_2.
Figure 4 shows the results of the aforementioned simulation scenario. The norms are shown in logarithmic scale for a better comparison between the methods, since there were significant disparities.

Figure 4: Results of the simulation experiment set out in [16]. Metric NORM is in logarithmic scale.

Σ_1 and Σ_3 are both dense matrices, with Σ_1 having entries that decay when moving away from the diagonal. We observe that the non-existent sparsity pattern is best approximated by grad_lik and lasso, but interestingly grad_frob and band achieve competitive norm results, sometimes even outperforming the rest. Matrix Σ_2 is banded; therefore, as expected, band achieves both the highest F1 measure and the lowest norm difference.

Arbitrary sparsity pattern in the Cholesky factor T
In this experiment the sparse Cholesky factor T is simulated using essentially the method of [28, Algorithm 3], with a random acyclic digraph representing the zero pattern, that is, the latent structure (see Figure 1). Observe that in general this does not yield a uniformly sampled Cholesky factor, but it is more flexible than the standard diagonal dominance procedure; see [28] for further discussion on this issue. We generate three Cholesky factors T_i with a proportion of i/p non-zero entries (the density/average edge number of the corresponding acyclic digraph), where i ∈ {1, 2, 3}.
Then we draw a sample of size 200 from N (0, T i T t i ) for p ranging between 30 and 500, as in the previous experiment.
Figure 5 depicts the results. Note that as the density decreases, the F1 score and matrix norm results slightly worsen, but in general the behaviour of the methods is maintained. The band estimator exhibits a performance similar to the previous experiment: although achieving a small F1 score, it has a relatively small matrix norm difference. This behaviour is shared in this case with grad_lik, which has in general poor performance. However, the worst performing method is lasso, which is neither able to recover the sparsity pattern nor numerically close to the original Cholesky factor (it has a significantly high value for the norm difference). Conversely, method grad_frob has the best performance, with a significantly higher F1 score than the rest and competitive or best norm difference results.

Real data
In this section we have selected two data sets from the UCI machine learning repository [29] where a natural order arises among the variables, and which are therefore suitable for our sparse covariance Cholesky factorization model. Both of them are labelled with a class variable; therefore, after estimating the respective Cholesky factors with each method, we assess classification performance.
For classifying a sample x we use quadratic discriminant analysis [16], where x is classified in the class value c that maximizes

f(c) f(x|c),    (19)

where f(c) is the proportion of observations for class c, and we have expressed f(x|c) in terms of the estimated Cholesky factor,

f(x|c) ∝ det(T_c)^{−1} exp(−½ ‖T_c^{−1}(x − µ_c)‖_2^2),    (20)

with µ_c, Σ_c and T_c the respective estimates from training samples belonging to class c. Finally, for evaluating classification performance we have used:
• The F1 score, already defined in Equation (18) in terms of TDR and TPR, but adapted to classification instead of matrix entries.
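The classification rule can be implemented directly from the Cholesky factor, avoiding an explicit inversion of Σ_c: the quadratic form equals ‖T_c^{−1}(x − µ_c)‖_2^2 and log det Σ_c = 2 Σ_i log t_ii. A minimal sketch (the helper name `qda_score` is ours):

```python
import numpy as np

def qda_score(x, mu, T, prior):
    """Quadratic discriminant score using the Cholesky factor T of Sigma.

    Returns log f(c) + log f(x|c) up to an additive constant common to
    all classes; the quadratic form is obtained via a triangular solve
    instead of inverting Sigma.
    """
    z = np.linalg.solve(T, x - mu)             # z = T^{-1} (x - mu)
    logdet = 2.0 * np.sum(np.log(np.diag(T)))  # log det(T T^t)
    return np.log(prior) - 0.5 * (logdet + z @ z)

# Classification rule: argmax over classes c of qda_score(x, mu_c, T_c, f(c))
```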
• The true negative rate, TNR, that is, the proportion of observations that have been correctly not classified as class c; we include it since it is not contained in the F1 score.
• The accuracy, ACC, which measures the proportion of observations that have been assigned the correct class. Observe that this last metric, unlike the other two, is not class-dependent but global.

Sonar: Mine vs. Rocks
The first real data set we explore is the Connectionist Bench (Sonar, Mines vs. Rocks) data set, which contains numeric observations from a sonar signal bounced off a metal cylinder (mine) and rocks. It contains 60 variables and 208 observations. Each variable corresponds to the energy within a certain frequency band, integrated over a period of time, with the variables arranged in increasing frequency order. Each observation represents a different beam angle for the same object under detection. The objective over this data set is to classify a sample as rock or mine. This data set was also analysed in [16], but without the expression of Equation (19) in terms of T, and only using method band for T.
As a first exploratory step, we have applied each of the methods for learning the Cholesky factor T to all the instances labelled as M (mines) and R (rocks), which we show as heatmaps in Figure 6. The Cholesky factors for mines retrieved by grad_lik and lasso look fairly similar, whereas the one for rocks estimated by lasso is nearly diagonal. Bands can be clearly observed in the heatmaps for band, and all methods impose zero values for variables with index near or above 50, which could be motivated by the problem characteristics and hints at high sonar frequencies being nearly noiseless. This latter behaviour is inherited by the covariance matrices, whose heatmaps are shown in Appendix A.3, Figure 9. The entries in the Cholesky factor estimated by grad_frob are the most extreme: most of them are zero, and those which are not take the highest and lowest values among all the estimates recovered. Despite the outlined differences among the Cholesky factors, the induced covariance matrices exhibit rather similar heatmaps and eigenvalues (Figures 9 and 10 in Appendix A.3).
For the quadratic discriminant analysis we have used leave-one-out cross-validation, since the sample size was small enough to make it feasible. Table 1 contains the results thus obtained. We see that lasso is the method that performs the poorest overall. Conversely, band is arguably the best for this problem, except for the TNR of rock samples, which is highest for grad_frob. However, observing the rest of the statistics for grad_frob, it can be deduced that this method over-classifies samples as mines: it has the lowest TNR for them. On the other hand, grad_lik performs competitively for this problem, but is in general outperformed or matched by band. Since the sonar behaviour hints at a band structure for the covariance (frequency patterns being related to those close to them), and therefore for its Cholesky factor, the good performance of band could be expected.

Wall-Following Robot Navigation
The other real data set we use is the Wall-Following Robot Navigation data set, in which a robot moves around a room following the wall clockwise. It contains 5456 observations and 24 variables. Each variable corresponds to the reading of an ultrasound sensor; the sensors are arranged circularly around the robot's body, and the increasing order reflects the reference angle where each sensor is located. Since the robot is moving clockwise, the classification task here is between four possible class values: Move-Forward, Sharp-Right-Turn, Slight-Left-Turn or Slight-Right-Turn.
As in the previous problem, we obtain the Cholesky factors for each of the movements, depicted in Figure 7. We notice that grad_frob outputs matrices similar (except for the Slight-Left-Turn movement) to those of the other three methods, which suggests that the extreme behaviour we observed in the sonar experiment was problem-related. By contrast, the Cholesky factor for Slight-Left-Turn is nearly diagonal. The other matrices are rather similar across the methods, with band notably choosing in general a high banding parameter k (few to no zeroed sub-diagonals). Here we have a structure similar to the sonar problem: for all the movements except Slight-Left-Turn, most entries close to the diagonal are positive, whereas distant ones are frequently negative. Regarding Slight-Left-Turn, these matrices are the sparsest and have near-zero values on the diagonal. Since the robot is moving clockwise, this movement is related to obstacles, so it could hint that the sensor readings are correctly identifying them.
In this problem we have a larger sample size, and therefore we split the data into train and test sets, with half of the samples in each. The classification results are shown in Table 2. We observe that all of the methods perform arguably well; in fact, they achieve nearly identical accuracy. It is noticeable how competitive lasso and grad_lik are, since they performed much worse in the sonar problem. We also notice that arguably the best results are obtained for the Slight-Left-Turn movement, which confirms our previous intuition from the heatmaps about the sensors correctly identifying obstacles. The worst performance over all methods is for the Slight-Right-Turn movement, but the gap with the rest of the movements (except for Slight-Left-Turn) is not noteworthy.

Discussion
We can draw several conclusions from both the simulated and real experiments. First, we report in Figure 8 the execution time for each method, where it can be seen that grad_lik is the slowest and band the fastest. Whenever there is a clear dependence between variables that are close in the ordering, as in the sonar example, the band method could be preferred, because it most naturally approximates the structure induced in the Cholesky factor.
Our newly proposed method grad frob has shown itself to be competitive in both execution time and recovery results: when the interest is in model selection, that is, in how accurately the zeros in the Cholesky factor are estimated, it yields the best results. Conversely, grad lik has proven to be the most robust: in simulations it achieved reasonable performance even when the true covariance matrix was dense, and it also performed competitively in the sonar example, which, as discussed, was mostly suited to band.
Finally, lasso has achieved poor results overall, except on the wall-following robot navigation data set. In particular, in simulations it failed to correctly recover the zero pattern in the Cholesky factor and was numerically the farthest from the true matrix. Despite this, it is the second fastest of the four methods, so when neither model selection nor robustness is a concern it is a good alternative to band, since it provides more flexibility in the zero pattern of the Cholesky factor.

Conclusions and future work
In this paper we have proposed a sparse model for the Cholesky factor of the covariance matrix of a multivariate Gaussian distribution. Few other works in the literature have previously addressed this problem, and we extend them in several ways. We have formalised the extension of an arbitrary zero pattern in the Cholesky factorization to a Gaussian graphical model. We have proposed a novel estimation method that directly optimizes a penalized matrix loss, and we have compared it with the existing regression-based approaches in various simulated and real experiments. Finally, we have discussed which estimation method is preferable depending on the problem characteristics, and argued why our proposal is preferable to the existing ones in many scenarios.
As future research, the most direct and interesting derivative work would be to further analyse, both theoretically and empirically, the Gaussian graphical model extension to unordered variables of the sparse covariance Cholesky factorization model. Also in this direction, its relationship with the already established covariance graph [11][12][13][14], an undirected graphical model over the covariance matrix, could yield interesting results. Regarding our novel estimation method, we could explore other alternatives to solve the optimization problems that arise from the two losses that we compute in Appendix A.

A.2 Proximal gradient algorithm and gradient computations for φ_NLL and φ_FR

We will show how to obtain a simplified expression for the gradient of φ(Σ(T)) with respect to T as a function of the one with respect to Σ. We will denote these gradients as ∇_T φ and ∇_Σ φ, respectively.
Proposition. For any differentiable loss function φ(Σ(T)), with Σ(T) = TT^t,

∇_T φ = 2 [(∇_Σ φ) T]_L,

where [·]_L denotes the lower triangular part.

Proof. This proof follows some ideas from [17, Proposition 2.1]. By matrix calculus [30], we have the following gradient expression for j < i,

∂φ(Σ(T))/∂t_ij = tr((∇_Σ φ) ∂Σ(T)/∂t_ij),     (21)

where we have used that ∇_Σ φ is symmetric. Furthermore, note that [30]

∂Σ(T)/∂t_ij = E_ij T^t + T E_ji,

where E_ij (E_ji) has its (i, j) ((j, i)) entry equal to one and zero elsewhere. Then from Equation (21) we have

∂φ(Σ(T))/∂t_ij = tr((∇_Σ φ) E_ij T^t) + tr((∇_Σ φ) T E_ji) = 2 tr(E_ji (∇_Σ φ) T).

Since a_ji = tr(E_ji A) for any matrix A, we have the desired result.
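As an illustration of how this gradient expression plugs into a proximal gradient scheme, the following is a minimal NumPy sketch for the Frobenius loss. We assume here that Σ(T) = TT^t, that φ_FR(T) = ||TT^t − S||²_F for a sample covariance S, and that the penalty is a lasso term on the off-diagonal entries of T handled by soft-thresholding; the function name, step size, and warm start are our own choices, not the paper's exact algorithm.

```python
import numpy as np

def prox_grad_chol(S, lam=0.1, step=1e-3, iters=500):
    """Sketch of proximal gradient descent on the penalized Frobenius
    loss ||T T^t - S||_F^2 + lam * sum |t_ij| (off-diagonal), using the
    chain rule grad_T = 2 * tril(grad_Sigma @ T) from the proposition."""
    p = S.shape[0]
    # warm start: Cholesky factor of the diagonal of S
    T = np.linalg.cholesky(np.diag(np.diag(S)))
    off = np.tril(np.ones((p, p)), -1).astype(bool)  # strict lower triangle
    for _ in range(iters):
        Sigma = T @ T.T
        grad_Sigma = 2.0 * (Sigma - S)           # gradient of ||Sigma - S||_F^2
        grad_T = 2.0 * np.tril(grad_Sigma @ T)   # chain rule, lower triangular
        T = T - step * grad_T                    # gradient step
        # proximal step: soft-threshold the off-diagonal entries
        T[off] = np.sign(T[off]) * np.maximum(np.abs(T[off]) - step * lam, 0.0)
    return T
```

The soft-thresholding step is the proximal operator of the lasso penalty, so small off-diagonal entries are driven exactly to zero, yielding a sparse estimated Cholesky factor.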

Figure 1: Hidden variable model interpretation. We have omitted the index set notation for vertices and directly used variables for clarity.

Figure 3: Graph corresponding to the model in Figure 1. We have omitted the index set notation for vertices and directly used variables for clarity.

Figure 5: Results of the simulation experiment for an arbitrary sparsity pattern in T. Metric NORM is in logarithmic scale. Density indicates the average proportion of lower triangular non-zero entries in the simulated T Cholesky factors.

Figure 6: Heatmaps of the Cholesky factors of rock and mine samples. M: Mines; R: Rocks.

Figure 7: Heatmaps of the Cholesky factors of the wall-following robot navigation samples.

Table 1: Statistics for the sonar problem. M: mines; R: rocks.

Table 2: Statistics for the robot problem.