Linear pooling of sample covariance matrices

We consider the problem of estimating high-dimensional covariance matrices of $K$ populations (classes) in the setting where the sample sizes are comparable to the data dimension. We propose estimating each class covariance matrix as a distinct linear combination of all class sample covariance matrices. This approach is shown to reduce the estimation error when the sample sizes are limited and the true class covariance matrices share a somewhat similar structure. We develop an effective method for estimating the coefficients in the linear combination that minimize the mean squared error under the general assumption that the samples are drawn from (unspecified) elliptically symmetric distributions possessing finite fourth-order moments. To this end, we utilize the spatial sign covariance matrix, which we show (under rather general conditions) to be an asymptotically unbiased estimator of the normalized covariance matrix as the dimension grows to infinity. We also show how the proposed method can be used in choosing the regularization parameters for multiple target matrices in a single class covariance matrix estimation problem. We assess the proposed method via numerical simulation studies, including an application in global minimum variance portfolio optimization using real stock data.


I. INTRODUCTION
HIGH-DIMENSIONAL covariance matrix estimation is a challenging problem, as the dimension p of the observations can be much larger than the sample size n. Such problems are increasingly common, for example, in finance [1], genomics [2], graphical models [3], chemometrics [4], wireless sensor networks [5], and adaptive filtering in array signal processing [6]. This paper considers a high-dimensional problem where there are K distinct classes (populations). Since the population variables are in general the same, but are measured under different population conditions, it is reasonable to presume that the K distinct covariance matrices share some common features or structure. In the small sample size setting, it is then advantageous to leverage this presumption in the estimation of the population covariance matrices.
Consider K mutually independent classes, where each class k ∈ {1, . . ., K} consists of independent and identically distributed (i.i.d.) p-dimensional observations X_k = {x_{1,k}, . . ., x_{n_k,k}} of size n_k with mean µ_k = E[x_{i,k}] and positive definite symmetric covariance matrix Σ_k = E[(x_{i,k} − µ_k)(x_{i,k} − µ_k)ᵀ]. The ordinary estimators for the covariance matrix and the mean are the sample covariance matrix (SCM), S_k = (1/(n_k − 1)) Σ_{i=1}^{n_k} (x_{i,k} − x̄_k)(x_{i,k} − x̄_k)ᵀ, and the sample mean x̄_k = (1/n_k) Σ_{i=1}^{n_k} x_{i,k}. When the sample size n_k and the data dimensionality p are comparable in size, the SCM can be highly variable, resulting in an unstable estimate of the population covariance matrix. Also, the SCM is positive definite only if n_k > p and X_k spans R^p. Due to these problems, a commonly used approach in high-dimensional covariance matrix estimation is to use regularization (shrinkage).
In the one population case (K = 1), linear regularization usually refers to estimating the covariance matrix by a linear or convex combination of the SCM (or some other primary estimator) and a (usually positive definite) target matrix. Multiple examples can be found in finance [1], [7]-[9], genomics [2], and signal processing [10]-[15]. The target matrix is chosen based on prior knowledge or assumptions about the true population covariance matrix. Sometimes using more than one target matrix can further reduce the estimation error. In the double shrinkage approach of [16] and [17], there are two convex regularization steps: the SCM is first regularized toward a diagonal matrix consisting of the sample variances, after which the resulting estimator is further regularized toward a scaled identity matrix. Recently, multi-target shrinkage methods have also been proposed that are able to incorporate a larger number of simultaneous target matrices [18]-[23].
In the multiple population setting, regularization via pooling the information in the different class samples is also possible. For example, [24] considered covariance matrix estimation from two independent data sets, whose covariance matrices are different but close to each other. This type of problem is encountered in radar processing as well as in hyperspectral imaging applications, where additional data sets may have been acquired with slightly different measurement configurations [24]. In the context of wireless sensor networks, [25] considered linear parameter estimation from independent and non-identically distributed scalar sample statistics. In discriminant analysis classification, the pooled SCM, S_pool = (1/n) Σ_{k=1}^K n_k S_k, where n = Σ_{k=1}^K n_k, is often used as a shrinkage target, and the class covariance matrices are estimated via a convex combination Σ̂_k = a S_k + (1 − a) S_pool, where a ∈ [0, 1]. This was studied in a Bayesian framework in [26] and [27], and in the Regularized Discriminant Analysis (RDA) framework in [28]. Furthermore, the optimal tuning parameter for this setting under elliptically distributed data was derived in our earlier work [29], [30]. For applications using RDA see, e.g., [31], [32].

As noted previously, at least some of the K population covariance matrices can be similar (close to each other in terms of a suitable distance metric), and so it would be beneficial to use regularization to reduce the variance of the final estimates of the covariance matrices. Following this idea, we propose to estimate each class covariance matrix as a nonnegative linear combination of the SCMs of all classes. For a ≥ 0, define

S(a) = Σ_{j=1}^K a_j S_j. (3)

Restricting the coefficients to be nonnegative ensures that the estimator (3) is positive semidefinite. The aim is to find coefficient vectors a_k minimizing the total MSE,

minimize_{A ≥ 0} Σ_{k=1}^K E[ ‖S(a_k) − Σ_k‖²_F ], (4)

where A = (a_1 · · · a_K), or equivalently, for each class k,

a_k = arg min_{a ≥ 0} E[ ‖S(a) − Σ_k‖²_F ], (5)

with the estimate of Σ_k then taken to be Σ̂_k = S(a_k). The equivalence of (4) and (5) is evident from the fact that the optimization problems for each class are separable. The solution to this problem is given in
Section II. It is easy to see that the RDA based estimators form a subset of the more general form (3), which permits using individual weights for each SCM in the sum.
Below we summarize the main contributions of the paper.
• We propose covariance matrix estimators for multiclass problems, based on linearly pooling the class SCMs. Several aspects and properties of the estimator are discussed, including possible modifications and an extension for complex-valued data.
• We show how the optimal linear coefficients can be estimated by assuming that the data is elliptically distributed. To this end, we use the spatial sign covariance matrix (SSCM), which we show under rather general assumptions to be asymptotically unbiased with respect to growing dimension.
• We show how the estimator can be used as a multi-target shrinkage estimator in a single class problem.
• Numerical simulations are conducted, including a portfolio optimization problem using real stock data. The simulations show promising performance of the proposed estimator compared to competing estimators.
• Code is available at github.com/EliasRaninen, which works for both real and complex-valued data sets.
The rest of the paper is organized as follows. In Section II, we derive the optimal coefficients for the linear pooling estimator and study some of its properties. In Section III, we propose methods for estimating the statistical parameters needed for the estimation of the optimal coefficients. This section also presents the results regarding the SSCM. In Section IV, we discuss possible extensions and modifications to the estimator. In Section V, we discuss the similarities and differences between the proposed method and closely related multi-target shrinkage covariance matrix estimation methods. Furthermore, we show how our proposed method can be used as a multi-target shrinkage covariance matrix estimator in a single class problem with arbitrary positive semidefinite target matrices. Section VI provides numerical simulation studies and Section VII provides an application to investment portfolio selection using historical stock data. Lastly, Section VIII concludes.
Notation: Matrices are denoted by upper case boldface letters (A or ∆), vectors are denoted by lower case boldface letters (a or δ), and scalars are denoted by lower case letters (a or δ). For a matrix A = (a_ij), the notation A ≥ 0 means that the matrix is nonnegative, that is, a_ij ≥ 0 for all i and j. Similarly, for a vector a = (a_i), the notation a ≥ 0 means that a_i ≥ 0 for all i. The notation A ≻ 0 (A ⪰ 0) means that A is positive definite (positive semidefinite). The notation diag(a) denotes a diagonal matrix with the entries of a on the main diagonal. The identity matrix is denoted by I and the vector of all ones is denoted by 1. The notation e_i denotes the ith Euclidean basis vector, i.e., a vector whose ith coordinate is 1 and all other coordinates are 0. For real sequences a

II. LINEAR POOLING OF SCMS
In this section, we address solving for the coefficients of the linear combination of SCMs in (3). First, define the scaled MSE of the SCM S_j as

δ_j = (1/p) E[ ‖S_j − Σ_j‖²_F ]

and the scaled inner products of the covariance matrices as

c_ij = (1/p) tr(Σ_i Σ_j),

where p is the dimension of the data. Then define ∆ = diag(δ_1, . . ., δ_K) and C = (c_ij). We can then state the following result.

Theorem 1. (The MSE of the linearly pooled estimator) For class k, the MSE in (5) can be written as the strictly convex quadratic function

E[ ‖S(a) − Σ_k‖²_F ] = p ( aᵀ(∆ + C)a − 2 c_kᵀ a + c_kk ),

where ∆ + C is a positive definite symmetric matrix.

Proof. See Appendix A.
As a consequence of Theorem 1, the optimal coefficients can be computed in the following way.

Proposition 1. (Optimal nonnegative coefficients) The solution to (5), a_k, is found by solving the strictly convex quadratic programming (QP) problem

minimize (1/2) aᵀ(∆ + C)a − c_kᵀ a subject to a ≥ 0. (6)

Proof. Follows from Theorem 1.
Many efficient algorithms exist for solving constrained convex QPs [33]. The optimization problem (6) requires knowledge of the matrices C and ∆, which depend on the unknown population parameters. We can nonetheless estimate the solution by using estimates Ĉ and ∆̂, which can be computed from the data as explained in Section III.
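As a concrete illustration, the nonnegativity-constrained QP (6) is small (K variables) and can be solved with standard tools. The following sketch, assuming SciPy is available, converts the QP into a nonnegative least-squares problem via a Cholesky factorization; the function name and test matrices are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import cholesky
from scipy.optimize import nnls

def optimal_coefficients(B, c):
    """Solve min_a 0.5*a'Ba - c'a subject to a >= 0, where B = Delta + C
    is positive definite.

    With the Cholesky factorization B = U'U (U upper triangular), the
    objective equals 0.5*||Ua - b||^2 up to a constant, where b solves
    U'b = c, so the QP reduces to a nonnegative least-squares problem.
    """
    U = cholesky(B)                 # upper triangular, B = U.T @ U
    b = np.linalg.solve(U.T, c)
    a, _ = nnls(U, b)
    return a
```

For example, with B = [[2, 0.5], [0.5, 1]] and c = (1, −1), the unconstrained solution has a negative second coefficient, and the nonnegative solution clamps it to zero, giving a = (0.5, 0).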
It is instructive to consider an unconstrained version of the optimization problem (6), where the weights are allowed to take negative values.For this case, we have the following closed form solution.
Proposition 2. (Optimal unconstrained coefficients) The unconstrained solution, which minimizes the MSE in (4), is

a_k = (∆ + C)^{-1} c_k. (7)

Proof. Follows from Theorem 1.
Note that if the closed-form solution (7) satisfies a_k ≥ 0, then it is also the solution to (6).
Consider the single class case, for which the problem reduces to finding an optimal scaling parameter a_1 such that E[ ‖a_1 S_1 − Σ_1‖²_F ] is minimized. Proposition 2 above then states that the optimal parameter that minimizes the MSE is

a_1 = 1/(1 + NMSE(S_1)),

where NMSE(S_1) = MSE(S_1)/‖Σ_1‖²_F is referred to as the normalized MSE (NMSE). It can easily be shown that MSE(a_1 S_1) = a_1 MSE(S_1) < MSE(S_1), since a_1 < 1. Therefore, the (oracle) estimator Σ̂_1 = a_1 S_1 is always more efficient than S_1. For the univariate normal case, one obtains a_1 = (n_1 − 1)/(n_1 + 1). This result was first obtained in [35]. A corresponding result for the general (non-normal) univariate case was obtained in [36], and it can be written as a_1 = ((n_1 + 1)/(n_1 − 1) + 3κ_1/n_1)^{-1}, where κ_1 is the kurtosis parameter of the population (see (11)).
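The single-class special case is easy to sanity check numerically; the helper below is hypothetical (not from the paper's code) and simply encodes the optimal scaling a_1 = 1/(1 + NMSE(S_1)).

```python
def optimal_scaling(nmse):
    """Optimal scaling a1 = 1/(1 + NMSE(S1)), which minimizes
    E||a1*S1 - Sigma1||_F^2 over a1."""
    return 1.0 / (1.0 + nmse)

# For a univariate Gaussian, the unbiased sample variance has
# NMSE = 2/(n - 1), which recovers the classical a1 = (n - 1)/(n + 1).
n = 10
a1 = optimal_scaling(2.0 / (n - 1))
print(a1)  # 9/11 = 0.8181...
```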
Consider next the special case when all population covariance matrices are equal, i.e., Σ_1 = · · · = Σ_K = Σ. Writing c = ‖Σ‖²_F/p, we obtain the solution (7) as a_k = µ ∆^{-1} 1, where µ = c/(1 + c Σ_{i=1}^K 1/δ_i). Hence, all columns of A coincide, and the coefficients in each column are a_jk = µ/δ_j. These coefficients can be written in an equivalent form

a_jk = (1/NMSE(S_j)) / (1 + Σ_{i=1}^K 1/NMSE(S_i)).

Thus, the weights are positive and proportional to the inverses of the NMSEs of the SCMs. If S_j has a large NMSE relative to the others, which occurs for example when the sample size n_j is small relative to the others, then the weight a_jk assigned to S_j is small. This implies that the contribution of S_j to the linear combination S(a_k) is small. If one further assumes that all populations have the same distribution and the sample sizes are equal, then the coefficients are all equal, a_jk = a, and Σ̂_k = Ka S_pool. That is, the pooled SCM is shrunk by the factor Ka < 1.
Due to the positivity of the coefficients, the conclusions of these special cases naturally also hold for the constrained case, where the coefficients are constrained to be nonnegative.

III. ESTIMATION
In this section, we address the estimation of ∆ and C. We review elliptically symmetric distributions, introduce the relevant statistical parameters, and show how to estimate them. For the estimate of the sphericity parameter, we use the SSCM, for which we prove an asymptotic unbiasedness result in Theorem 2.

A. Elliptically symmetric distributions
We will assume that the samples are generated from unspecified elliptically symmetric distributions with finite fourth-order moments. That is, an absolutely continuous random vector x ∈ R^p from the kth population is assumed to have a density function up to a constant of the form

g_k( (x − µ_k)ᵀ Σ_k^{-1} (x − µ_k) ),

where g_k : R_{≥0} → R_{>0} is called the density generator [37].
Here, we let g_k be defined so that Σ_k represents the covariance matrix of x. We write x ∼ E_p(µ_k, Σ_k, g_k) to denote this case. For example, the multivariate normal (MVN) distribution is a particular instance of the elliptical distribution, for which g_k(t) = exp(−t/2). We write x ∼ N_p(µ_k, Σ_k) to denote this case. An elliptically distributed random vector x ∼ E_p(µ_k, Σ_k, g_k) can be expressed by the stochastic representation

x = µ_k + r_k Σ_k^{1/2} u, (8)

where r_k ≥ 0 is a random variable called the modular variate, verifying E[r_k²] = p, and u is a random vector distributed uniformly on the unit sphere, i.e., u ∈ {z ∈ R^p : ‖z‖ = 1}. Furthermore, u and r_k are independent. More generally, we note that any random vector x which satisfies (8) is said to have an elliptical distribution, even if it is not absolutely continuous, i.e., does not have a density. The relationship between the modular variate r_k and x is readily seen from (8) to be r_k² = (x − µ_k)ᵀ Σ_k^{-1} (x − µ_k). Sometimes we are only interested in the covariance matrix up to a scaling constant. Hence, we define the shape matrix

Λ_k = p Σ_k / tr(Σ_k),

which verifies tr(Λ_k) = p. Additionally, we define three statistical parameters that describe the elliptical distribution. First, we define the scale

η_k = tr(Σ_k)/p, (9)

which is equal to the mean of the eigenvalues of Σ_k. Note that Σ_k = η_k Λ_k. Second, we define the sphericity

γ_k = p tr(Σ_k²)/tr(Σ_k)² = tr(Λ_k²)/p, (10)

which equals the ratio of the mean of the squared eigenvalues relative to the mean of the eigenvalues squared. The sphericity parameter takes values in the range [1, p] and attains its minimum for the scaled identity matrix and its maximum for a rank one matrix. Third, we define the elliptical kurtosis

κ_k = E[r_k⁴] / (p(p + 2)) − 1, (11)

where r_k is the modular variate in (8). For example, if the sample is from a MVN distribution, then κ_k = 0. The kurtosis parameter also satisfies κ_k = E[(x_i − µ_i)⁴] / (3 (E[(x_i − µ_i)²])²) − 1, and hence, not only does κ_k represent the kurtosis of each of the variables x_i, but it also represents the kurtosis of any univariate linear combination bᵀx, where b ∈ R^p \ {0}. The lower bound for the kurtosis parameter is κ_LB = −2/(p + 2) [38].
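The scale (9) and sphericity (10) are simple functionals of the covariance matrix; the following sketch (helper names are illustrative) computes them directly from their definitions.

```python
import numpy as np

def scale(Sigma):
    """Scale eta = tr(Sigma)/p, the mean of the eigenvalues."""
    return np.trace(Sigma) / Sigma.shape[0]

def sphericity(Sigma):
    """Sphericity gamma = p*tr(Sigma^2)/tr(Sigma)^2, which lies in [1, p]."""
    p = Sigma.shape[0]
    return p * np.trace(Sigma @ Sigma) / np.trace(Sigma) ** 2
```

As a check of the stated extremes, a scaled identity matrix attains sphericity 1, while a rank-one matrix attains the maximum value p.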
For elliptical populations, the scaled MSE δ_k of the SCM obtains an explicit form [9, Lemma 1], which depends on the known sample size n_k as well as the unknown scale η_k (9), the unknown sphericity γ_k (10), and the unknown elliptical kurtosis κ_k (11).

B. Estimates of the scale and the elliptical kurtosis
We estimate the scale using the SCM via η̂_k = tr(S_k)/p. The kurtosis κ̂_k is estimated via the (bias-corrected) average sample kurtosis of the marginal variables, given in (14) [39]. In case (14) is less than κ_LB, we set κ̂_k = 0.99 κ_LB. Note that, although κ_k is invariant under affine transformations of x, the estimator κ̂_k in (14) is not. An alternative affine equivariant estimator of κ_k exists, but it requires n_k > p, and hence we use κ̂_k.
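A minimal sketch of the kurtosis estimate: for a real elliptical distribution, each marginal has excess kurtosis equal to 3κ_k, so κ_k can be estimated as one third of the average sample excess kurtosis. This sketch omits the finite-sample bias correction of (14); the function name is illustrative.

```python
import numpy as np

def elliptical_kurtosis(X):
    """Estimate the elliptical kurtosis kappa from an n x p data matrix X.

    Averages the per-variable sample excess kurtosis and divides by 3
    (no finite-sample bias correction), then clips at 0.99 times the
    theoretical lower bound -2/(p + 2).
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    m2 = np.mean(Xc ** 2, axis=0)
    m4 = np.mean(Xc ** 4, axis=0)
    g2 = m4 / m2 ** 2 - 3.0          # per-variable excess kurtosis
    kappa = np.mean(g2) / 3.0
    kappa_lb = -2.0 / (p + 2)
    return max(kappa, 0.99 * kappa_lb)
```

For Gaussian data the estimate should be close to zero, matching κ_k = 0 for the MVN distribution.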
C. Estimate of the sphericity using the SSCM

Regarding the sphericity, it would be natural to develop an estimator using the SCM as well. However, a simple and particularly well performing estimator of the sphericity is based on the robust spatial sign covariance matrix (SSCM) (16), which has been used, e.g., in [40]-[42], and [9]. In particular, in [9], both an SCM and an SSCM based estimator of the sphericity were compared and, except for the case where the samples were MVN, the simulations suggested the superiority of the SSCM based sphericity estimator. Before introducing the estimator for the sphericity, we will discuss the properties of the SSCM.
Let Λ_sgn = p E[ (x − µ)(x − µ)ᵀ / ‖x − µ‖² ] denote the scaled population SSCM. It is known that Λ_sgn and Λ have the same eigenvectors as well as the same multiplicities and ordering of the eigenvalues, but the eigenvalues themselves are different [43]. The sample SSCM of the kth population and its corresponding shape estimator are defined as

Λ̂_k = (p/n_k) Σ_{i=1}^{n_k} (x_{i,k} − µ_k)(x_{i,k} − µ_k)ᵀ / ‖x_{i,k} − µ_k‖², (16)

where the mean µ_k is replaced with the sample spatial median [44], μ̂_k = arg min_m Σ_{i=1}^{n_k} ‖x_{i,k} − m‖, when it is unknown. When the mean is known, E[Λ̂_k] = Λ_sgn,k, and Λ̂_k is distribution-free over the class of elliptical distributions. The latter statement can be proved by writing (16) in terms of the stochastic representation (8) and observing that the modular variate r_k cancels out in each summand.
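A direct numerical sketch of the shape estimator (16), with the spatial median computed by standard Weiszfeld iterations (function names and iteration settings are illustrative, not the paper's implementation):

```python
import numpy as np

def spatial_median(X, n_iter=200, tol=1e-10):
    """Sample spatial median via Weiszfeld iterations (a common sketch)."""
    m = np.median(X, axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - m, axis=1), 1e-12)
        m_new = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(m_new - m) < tol:
            m = m_new
            break
        m = m_new
    return m

def sscm_shape(X):
    """SSCM-based shape estimator (16): p times the sample spatial sign
    covariance matrix, centered at the spatial median. Its trace is p."""
    n, p = X.shape
    U = X - spatial_median(X)
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # spatial signs
    return p * (U.T @ U) / n
```

By construction tr(Λ̂) = p, since each spatial sign is a unit vector.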
Since the eigenvalues of Λ_sgn and Λ are different, Λ̂_k is biased [45]. Surprisingly, as shown in Theorem 2 below, this bias becomes negligible in higher dimensions provided the sequence of covariance matrix structures being considered with increasing p satisfies

γ/p → 0 as p → ∞. (17)

Theorem 2. Let x ∼ E_p(µ, Σ, g). Then the difference between Λ_sgn and Λ can be bounded explicitly, and if (17) holds, the relative error ‖Λ_sgn − Λ‖_F/‖Λ‖_F → 0 as p → ∞.

The central condition (17) holds for many common sequences of covariance models, as shown in Proposition 3 below. We first present, in the following lemma, a simple general condition under which (17) holds. This lemma is seen to hold in particular for the case when the eigenvalues of Λ are bounded as p → ∞.

Lemma 1. If Λ is a shape matrix for which λ_max(Λ) = O(p^{τ/2}), where τ < 1, as p → ∞, then (17) holds.
Proof. See Appendix C.

Proposition 3. The following sequences, in p, of covariance models satisfy (17).
• First order autoregressive (AR(1)) covariance matrices: (Σ)_ij = σ² ρ^{|i−j|}, where |ρ| < 1 and σ² > 0 are both fixed, i.e., they do not depend on the dimension p.
The restrictions on ρ for the covariance models in Proposition 3 are needed to ensure that Σ is positive definite.
Let us illustrate this result in the case that Σ has an AR(1) covariance structure. In this case γ → (1 + ρ²)/(1 − ρ²) (see Appendix C-A). For ρ = 0.5, γ → 5/3 = O(1). From Theorem 2, we then have that the relative error vanishes as p grows. We also have the following proposition about the normalized mean squared distance between Λ̂ and a scaled SCM.

Proposition 4. Let X = (x_i) ∼ N_p(0, Σ) and assume that (17) holds. Then the normalized mean squared distance between Λ̂ and a suitably scaled SCM converges to zero as p → ∞.

Proof. See Appendix D.
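The AR(1) sphericity limit above is easy to verify numerically; the small self-contained check below computes γ of an AR(1) matrix directly from definition (10) and compares it with the limit (1 + ρ²)/(1 − ρ²).

```python
import numpy as np

# Sphericity gamma = p*tr(Sigma^2)/tr(Sigma)^2 of an AR(1) matrix
# (Sigma)_ij = rho**|i-j| approaches (1 + rho**2)/(1 - rho**2) as p grows.
p, rho = 400, 0.5
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
gamma = p * np.trace(Sigma @ Sigma) / np.trace(Sigma) ** 2
print(gamma, (1 + rho ** 2) / (1 - rho ** 2))  # approx 1.664 vs 5/3
```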
We may now focus on the sphericity estimator in (20). It can be shown by straightforward calculation (see Appendix D) that

E[ tr(Λ̂_k²) ]/p = p/n_k + (1 − 1/n_k) tr(Λ_sgn,k²)/p. (18)

If (17) holds, then by Theorem 2, one has that

tr(Λ_sgn,k²)/p → γ_k. (19)

By (18) and (19), a natural estimator for the sphericity of the kth class is then

γ̂_k = (n_k/(n_k − 1)) ( tr(Λ̂_k²)/p − p/n_k ). (20)

In a high-dimensional setting, using the spatial median μ̂ in (16) results in a nonnegligible error in the sphericity estimate. This was shown in [40], which considered SSCM based hypothesis testing of sphericity of the covariance matrix (i.e., H_0 : Σ ∝ I), and which used a similar sphericity statistic as in (20). They also proposed a method for estimating and correcting for this error. Hence, we use the corrected estimator of the sphericity from [40], which subtracts an estimate of the error induced by the spatial median and is often a good approximation (see [40]).
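A minimal sketch of the basic estimator (20), assuming the bias-corrected form γ̂ = (n/(n−1))(tr(Λ̂²)/p − p/n) and omitting the spatial-median correction of [40]; the function name is illustrative.

```python
import numpy as np

def sphericity_estimate(shape_sscm, n):
    """Sphericity estimate (20) from the SSCM-based shape estimator.

    shape_sscm is the p x p shape estimator with trace p, computed from
    n samples. The result is clipped to the valid range [1, p].
    """
    p = shape_sscm.shape[0]
    t2 = np.trace(shape_sscm @ shape_sscm)
    gamma = (n / (n - 1.0)) * (t2 / p - p / n)
    return float(np.clip(gamma, 1.0, p))
```

For spherical data (Σ ∝ I) the estimate should be close to the minimum value 1.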

IV. EXTENSIONS AND MODIFICATIONS
In this section, we first show how to incorporate regularization towards the identity matrix in the estimator.Then, we show how the optimal coefficients can alternatively be solved via a semidefinite optimization problem, which enables relaxing the nonnegativity constraint.Lastly, we extend the estimator to complex elliptically symmetric distributions.

A. Additional regularization towards the identity matrix
It is often beneficial to incorporate regularization towards the identity matrix. For example, if p > n = Σ_k n_k, then all of the SCMs are singular. Regularization towards the identity can easily be added by using the estimator

S(a) = Σ_{k=1}^K a_k S_k + a_I I, (23)

where a = (a_1, . . ., a_K, a_I)ᵀ ∈ R^{K+1}. By constraining a_j ≥ 0, 1 ≤ j ≤ K, and a_I ≥ ε, where ε > 0 is a chosen lower bound for the identity regularization, the estimator (23) will be positive definite. The coefficients can then be solved via a strictly convex QP problem (24), in which the matrices ∆ and C of (25) are augmented to account for the identity target, and η = (η_k) ∈ R^K is a vector consisting of the scales (9). The unconstrained optimal solution for this case is A = (∆ + C)^{-1} C_η, in terms of the augmented matrices. The positive definiteness of ∆ + C is shown in Appendix A. Algorithm 1 summarizes the procedure for computing the linearly pooled estimates of the class covariance matrices.
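The extended QP with the identity lower bound can be sketched as a bounded least-squares problem, again via a Cholesky factorization; this assumes the augmented matrices have already been formed, and the function name and toy example are illustrative, not the paper's Algorithm 1.

```python
import numpy as np
from scipy.linalg import cholesky
from scipy.optimize import lsq_linear

def pooled_with_identity(B_ext, c_ext, K, eps=1e-8):
    """Solve min_a 0.5*a'B_ext a - c_ext'a subject to
    a_1, ..., a_K >= 0 and a_{K+1} >= eps (identity coefficient),
    where B_ext is the (K+1) x (K+1) augmented Delta + C matrix.
    """
    U = cholesky(B_ext)                      # B_ext = U.T @ U
    b = np.linalg.solve(U.T, c_ext)
    lb = np.zeros(K + 1)
    lb[K] = eps                              # lower bound for a_I
    res = lsq_linear(U, b, bounds=(lb, np.inf))
    return res.x
```

The lower bound on the last coefficient guarantees a positive definite estimate even when all SCMs are singular.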
In particular, step 2 of Algorithm 1 computes Ĉ and ∆̂ of (25) (estimated as in Section III).
Remark 1. The QP formulation of the problem makes it easy to incorporate additional constraints if needed. For example, in order to find a convex combination of the SCMs, the equality constraint 1ᵀa = 1 should be added to the QP (24).

B. Semidefinite programming formulation
Constraining the coefficients in (6) to be nonnegative is sufficient to ensure positive semidefiniteness of the estimator (3). In some cases this approach can be suboptimal, since it may be possible to obtain a positive semidefinite estimator with a lower MSE by allowing some of the coefficients to be negative. Using semidefinite programming (SDP), the nonnegativity constraint can be replaced with a positive semidefiniteness constraint, as in [46], as follows. The objective function in (6) can be rewritten as (1/2) aᵀBa − c_kᵀa, where B = ∆ + C. By introducing an auxiliary variable t and the constraint (1/2) aᵀBa − c_kᵀa ≤ t, the problem is converted into a minimization of t. Using the Schur complement, the problem is reformulated as a semidefinite program (SDP) in the variables a and t:

minimize t subject to [ I , B^{1/2}a ; (B^{1/2}a)ᵀ , 2(t + c_kᵀa) ] ⪰ 0, S(a) ⪰ 0. (26)

This is a convex optimization problem, which can be solved in polynomial time with software such as CVX, which is a package for specifying and solving convex programs [47], [48].
There are cases where the SDP (26), the nonnegativity constrained QP (6), and the unconstrained problem (7) all give different solutions. In these cases, only the SDP and QP solutions yield positive semidefinite estimators. In theory (when using the true C and ∆), the SDP solution will have a lower MSE. However, with estimation error in Ĉ and ∆̂, this is not necessarily the case. The computational complexity of the SDP is also significantly greater than that of the QP: the QP can be solved 2-3 orders of magnitude faster (depending on the problem dimension). Due to the above reasons, we recommend the QP approach.

C. Extension to complex-valued data
Extending the results to complex-valued data requires using the complex-valued definitions of the elliptical kurtosis and the scaled MSE of the SCM. For a review of complex elliptically symmetric (CES) distributions, see, e.g., [49]. A CES distributed (absolutely continuous) random vector x ∈ C^p from the kth population has a density function up to a constant of the form

h_k( (x − µ_k)ᴴ Σ_k^{-1} (x − µ_k) ),

where µ_k ∈ C^p is the mean vector and Σ_k = E[(x − µ_k)(x − µ_k)ᴴ] denotes the Hermitian positive definite covariance matrix of x. Above, (·)ᴴ denotes the Hermitian (complex-conjugate) transpose. We denote this case by x ∼ CE_p(µ_k, Σ_k, h_k). The definitions of the SCM (2) and the SSCM (16) stay unchanged, except that (·)ᵀ is replaced with (·)ᴴ. Furthermore, the inner products between Hermitian matrices (A = Aᴴ) remain unchanged since tr(AAᴴ) = tr(A²). Hence, the definitions of the sphericity parameter γ_k as well as the scale η_k remain unchanged. The elliptical kurtosis κ_k of class k is estimated via (27), the average sample kurtosis of the complex marginal variables. The theoretical lower bound of the kurtosis in the complex-valued case is κ_LB = −1/(p + 1) [49]. In case (27) is less than κ_LB, we set κ̂_k = 0.99 κ_LB.
The matrix ∆ is estimated via (28), given in the next lemma, which is an extension of [9, Lemma 1] to the complex case.

Lemma 2. [50, Theorem 3] Let X_k = {x_1, . . ., x_n} ⊂ C^p be an i.i.d. random sample from CE_p(µ_k, Σ_k, h_k) with finite fourth-order moments. Then, the scaled MSE of the SCM is given by (28).

V. MULTI-TARGET SHRINKAGE ESTIMATORS
Multi-target shrinkage covariance matrix estimators are capable of simultaneously regularizing towards several target matrices. They are mainly designed for the single population setting. However, some of the multi-target shrinkage estimators can also be used in a multiclass setting for pooling SCMs. Therefore, we give a brief review of the existing methods. Lastly, we show how our proposed method can be applied to a single class multi-target shrinkage covariance matrix estimation problem.

A. Overview of multi-target shrinkage estimators
There exist multi-target shrinkage covariance matrix estimators that can be used with user-defined target matrices and therefore also for pooling SCMs. In this section, we discuss these estimators, and in Section VI we compare their performance to our proposed method by simulations.
Since the multi-target shrinkage estimators are developed for the single class covariance matrix estimation setting, we define the data set as X = {x_1, . . ., x_n}, where (x_i)_j = x_ij. Let Σ denote the covariance matrix and let S denote the SCM computed from X. Multi-target shrinkage covariance matrix estimators are often defined by

Σ̂ = a_0 S + Σ_{k=1}^K a_k T_k, (29)

where T_k, k = 1, . . ., K, are linearly independent target matrices and a_j, j = 0, . . ., K, are the regularization coefficients. In [18] and [19], convex multi-target shrinkage covariance matrix estimators were proposed, where a_k ≥ 0 and the coefficients sum to one; the optimal coefficients minimize a quadratic objective determined by a matrix Q and a vector b. As the elements in Q and b depend on the unknown covariance matrix Σ, they have to be estimated. In [19], the proposed estimates were q̂_ij = tr( (T_i − S)(T_j − S) ) and a corresponding b̂_i, where the latter was obtained by approximating E[ tr( (S − Σ)(T_i − E[T_i]) ) ] ≈ 0. We compare our method with this method in the simulations of Section VI, where it is denoted by BARTZ. Regarding [18], a specific structural condition [18, (10) and (21)] was imposed on the target matrices, which prevents using it for pooling SCMs.
A leave-one-out cross-validation (LOOCV) approach was considered in [20] for different scenarios. Their proposition for SCM based linear shrinkage estimation used a LOOCV loss function defined by a matrix Q_CV and a vector b_CV, where x_i is assumed to have zero mean, S_{−i} = (1/n) Σ_{j≠i} x_j x_jᵀ is the SCM computed without the ith sample, and a = (a_0, a_1, . . ., a_K) ≥ 0. They showed how the elements in Q_CV and b_CV can be computed analytically. They also proposed a corresponding convex SCM based shrinkage estimator, which requires that the targets have the same trace as the estimated SCM. In the simulations of Section VI, we denote the linear estimator by LOOCV.
Multi-target shrinkage estimators have also been proposed, for example in [22] from a Bayesian perspective and in [23] by regularizing the Gaussian likelihood function.However, in both methods the target matrices are assumed to be positive definite, and hence, SCMs cannot be used as targets when n k < p.In [21] a multi-target shrinkage estimator of the form (29) was proposed for space-time adaptive processing (STAP) problems, where the reliable estimation of the loss function required additional prior information.

B. Multi-target shrinkage estimation via pooling SCMs
As discussed above, some of the multi-target shrinkage estimators can be used in order to linearly pool SCMs in a multi-class setting.Conversely, the proposed method of linearly pooling SCMs can be used in single class covariance matrix estimation as a multi-target shrinkage estimator as explained below.
Suppose that there is only a single data set X, its corresponding SCM S, and multiple positive definite symmetric target matrices {T_m}_{m=1}^M. Our approach for multi-target shrinkage estimation is detailed in Algorithm 2. The idea is to generate artificial data sets {Y_m}_{m=1}^M with covariance matrices {T_m}_{m=1}^M so that X and the generated data sets {Y_m}_{m=1}^M are approximately mutually independent. Specifically, Y_m is conditionally independent of X given T_m. The corresponding SCMs S_{T_m} are then computed from the artificial data sets. We use the zero mean MVN distribution to generate the data sets, i.e., Y_m ∼ N_p(0, T_m), m = 1, . . ., M, each consisting of L samples. We can then apply the proposed method for pooling S and {S_{T_m}}_{m=1}^M. We illustrate the usefulness of this method in a portfolio optimization problem in Section VII, where it compares well with the other multi-target methods.
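The artificial-data step can be sketched as follows (a simplified illustration of the idea, not the paper's Algorithm 2; the function name is hypothetical): each target matrix is turned into an SCM computed from synthetic Gaussian samples, which can then be pooled with the real SCM.

```python
import numpy as np

def target_scms(targets, L, rng):
    """Generate artificial data sets Y_m ~ N_p(0, T_m) with L samples
    each and return their SCMs, to be pooled with the real SCM.

    `targets` is a list of p x p positive definite target matrices.
    """
    scms = []
    for T in targets:
        p = T.shape[0]
        Y = rng.multivariate_normal(np.zeros(p), T, size=L)
        scms.append(np.cov(Y, rowvar=False))
    return scms
```

Since E[S_{T_m}] = T_m, pooling these SCMs effectively shrinks the estimate towards the targets, with the number of samples L controlling the variability of each artificial SCM.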

VI. SIMULATION STUDY
This section provides several simulation studies in order to assess the MSE performance of the proposed estimator as well as the accuracy of the plugin estimates of ∆ and C and the estimated coefficients Â. We denote the proposed linear pooling estimator (23), which includes shrinkage towards the identity, by LINPOOL. The LINPOOL estimator with an additional convexity constraint on the coefficients (1ᵀa = 1) is denoted by LINPOOL-C. For LINPOOL and LINPOOL-C, we set the lower bound for the identity target in all of the simulations to ε_k = 10^{−8}. In addition to the proposed methods, the results are reported for the multi-target shrinkage method LOOCV from [20], which uses a nonnegative linear combination of the SCMs and the identity matrix, and BARTZ from [19], which uses a convex combination of the SCMs and the identity matrix (see Section V for details).

A. Three different setups
In the first simulation, we considered two different covariance matrix structures: the AR(1) structure, (Σ_k)_ij = σ_k² ρ_k^{|i−j|}, and the compound symmetry (CS) structure, (Σ_k)_ij = σ_k² ρ_k for i ≠ j and (Σ_k)_ii = σ_k². We generated four p = 100 dimensional random samples of sizes (n_1 = 20, n_2 = 100, n_3 = 20, n_4 = 100) from four independent multivariate t-distributions with ν = 8 degrees of freedom. The means of the classes were generated from the standard MVN distribution and held constant over the repetitions. We simulated three different setups. In the first setup, all covariance matrices had an AR(1) structure. In the second setup, all covariance matrices had a CS structure. In the last mixed setup, classes 1 and 2 had an AR(1) structure and classes 3 and 4 had a CS structure. For all setups, the parameters σ_k² and ρ_k were held fixed. Table I tabulates the NMSE and the total NMSE (sum of the NMSEs of the classes) of the covariance matrix estimates for the three different setups over 1000 repetitions. Table I shows that the proposed method LINPOOL performed best in the AR(1) setup and the mixed setup, whereas LINPOOL-C performed best (slightly better than LINPOOL) in the CS setup.
Figure 1 displays boxplots of the estimates δ̂_k and ĉ_ij both for the AR(1) case and the CS case. The estimates of the optimal coefficients of LINPOOL for the fourth class, â_i4, are also shown. As can be seen from the boxplots, for the AR(1) case, the medians of the estimates were mostly correctly placed over the true values, which are denoted by red triangles. For the CS case, there was significant bias in the estimation of c_ij, which is due to the fact that the assumption (17) does not hold in this case, as explained in Section III-C. Despite this, the final coefficients â_ij were reasonably well estimated.
Here, it is good to note that, for the CS case, the coefficient corresponding to identity shrinkage, â_I4, is close to zero. When this happens, there is a possibility that (despite having a low MSE) the estimate is not well-conditioned, resulting in high error when inverting the estimate. Therefore, in these cases (depending on the conditioning of the estimate) it can be useful to increase the lower bound ε_4. Generally, for class k, one could use ε_k = α η_k, where α ∈ [0, 1].

B. Increasing the number of classes
Next, we examine the NMSE of the LINPOOL estimator of class 1 as the number of classes K increases from 2 to 16. The setup is as follows. The first class has an AR(1) covariance matrix structure with a fixed parameter ρ_1 = 0.5. The other class covariance matrices also have an AR(1) structure, except for the classes k = 4, 8, 12, 16, which have a CS structure. The parameter ρ_k, for k ≥ 2, is chosen uniformly at random from the interval [0.1, 0.6] for each Monte Carlo trial. The means of the classes were generated from the standard MVN distribution for each Monte Carlo trial. The sample sizes are equal, with n_k = 40 for all classes k. The dimension is p = 100 and the data is multivariate t-distributed with ν = 8 degrees of freedom.
Figure 3 depicts the results averaged over 1000 Monte Carlo trials for each value of K. The red vertical lines mark the spots where the added class covariance matrix has a CS structure (i.e., does not share the same structure as class 1). One can observe that every time an AR(1) structured class is added, the NMSE decreases. When the added class has a different covariance structure (the CS structure), the NMSE does not decrease. An exception to this is K = 4. A reason for this may be that the total number of observations is still relatively low, so including the fourth class helps in reducing the variance of the estimate.

C. Complex-valued case
In the next simulation, the classes have an AR(1) covariance matrix structure, (Σ_k)_ij = σ_k² ϱ_k^(j−i) for i ≤ j and (Σ_k)_ji = (Σ_k)_ij^*, for i > j, where (·)^* denotes complex conjugation. The used parameters were σ_k² = k, ϱ₁ = 0.3e^(i2π·0.3), ϱ₂ = 0.4e^(i2π·0.4), ϱ₃ = 0.5e^(i2π·0.5), and ϱ₄ = 0.6e^(i2π·0.6). We simulated the NMSE as a function of the sample size for n_k ∈ {10, 15, …, 40} for all k. The data was generated from the complex multivariate t-distribution with ν = 8 degrees of freedom and dimension p = 100. The results were averaged over 1000 Monte Carlo trials for each sample size and are shown in Figure 2. It can be observed that, especially for small sample sizes (n_k < 30), LINPOOL performed better than LOOCV.
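The Hermitian AR(1) structure used in this simulation can be constructed as follows (a sketch; the function name is ours):

```python
import numpy as np

def complex_ar1_cov(p, rho, sigma2=1.0):
    """Hermitian AR(1) covariance: (Sigma)_ij = sigma2 * rho**(j - i) for
    i <= j, and (Sigma)_ji = conj((Sigma)_ij) for i > j, where rho is a
    complex parameter with |rho| < 1."""
    d = np.arange(p)[None, :] - np.arange(p)[:, None]  # d[i, j] = j - i
    vals = np.where(d >= 0, rho ** np.abs(d), np.conj(rho) ** np.abs(d))
    return sigma2 * vals
```

By construction the matrix equals its own conjugate transpose, so it is a valid (Hermitian) covariance matrix.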

VII. PORTFOLIO OPTIMIZATION
We studied the performance of the proposed method in a portfolio optimization problem using dividend adjusted daily closing prices. Portfolio optimization is a central topic in investment theory; see, e.g., [51]-[55]. A focus in portfolio optimization has been on the estimation of the covariance matrix of the stock returns, commonly using shrinkage regularization techniques or random matrix theory; see, e.g., [1], [7], [56]-[59]. In a portfolio optimization problem, a particular investment portfolio is determined by a weight or allocation vector w ∈ R^p (satisfying the constraint 1⊤w = 1), whose elements describe the fraction of the total wealth invested in each of the p stocks. We considered two different portfolios. First, we considered the global minimum variance portfolio (GMVP), in which one seeks a portfolio w that minimizes the risk (variance). The optimization problem is thus

minimize_{w ∈ R^p} w⊤Σw subject to 1⊤w = 1. (30)

The well-known solution is w = Σ⁻¹1 / (1⊤Σ⁻¹1), where Σ is the covariance matrix of the stock returns. We also considered a constrained portfolio, where the coefficients are constrained to be within the range 0 ≤ w_i ≤ 0.1, for all i, i.e., shorting (negative weights) is not allowed and the portfolio manager is not allowed to put more than 10% of the wealth in one stock. The optimization problem for this case is the same as in (30) but with the additional constraint 0 ≤ w ≤ 0.1 · 1, which results in a quadratic program (QP).
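The closed-form GMVP solution above can be computed as follows (a minimal sketch; the box-constrained variant additionally requires a QP solver, e.g. scipy.optimize.minimize with bounds and an equality constraint, and is omitted here):

```python
import numpy as np

def gmvp_weights(Sigma):
    """Global minimum variance portfolio: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
    Uses a linear solve instead of an explicit inverse for stability."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)
    return w / w.sum()
```

The weights automatically sum to one; with an estimated covariance matrix in place of Sigma, this yields the plug-in GMVP used in the experiments.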
In the simulation, the covariance matrix Σ was estimated via a sliding window method, so that at day t it was estimated using the daily net returns of the previous n days, from t − n to t − 1. The portfolio weights were then computed via (30), with and without the additional weight constraints. These yielded the portfolio returns for the next 20 (trading) days. Then, the sliding window was shifted 20 days forward and the procedure was repeated. Denoting the total number of days in the data by T, we obtained T − n daily returns, from which we computed the realized risk as the empirical standard deviation of the daily portfolio returns. We applied the proposed method (explained in Section V-B and Algorithm 2) for single class covariance matrix estimation, using the same target matrices for regularization as in [1] and [8]. The target matrices were the single factor market index model T_F of [1] and the constant correlation model T_C of [8]. Their computation is explained in [1] and [8], respectively. At day t, we used the n previous days to compute the SCM. However, due to the nature of our method, we were able to freely choose the amount of data used for computing the regularization target matrices. Hence, we chose to use the previous 40 days (t − 40 to t − 1, corresponding to the previous two months) for computing T_F and T_C, regardless of the window size n (n > 40) used for estimating the SCM. This can be justified by the fact that the trend of the market is better captured by the most recent net returns. After computing T_F and T_C, we generated 1000 i.i.d. samples both from y_C ∼ N(0, T_C) and y_F ∼ N(0, T_F) and estimated the coefficients for the proposed LINPOOL estimator Ŝ(a) = a₁S + a₂S_F + a₃S_C + a_I I, where S_F and S_C denote the SCMs computed from the samples of y_F and y_C, respectively. We also report the performance of the LINPOOL estimator (23) using a convex combination, as explained in Remark 1. For both methods, we used the lower bound constraint a_I ≥ ε = 10⁻⁸. We also
report the performance of the multi-target shrinkage estimation methods LOOCV of [20] and BARTZ of [19], using the same target matrices (T_F, T_C, and I) as for our proposed method. Additionally, we report the performance of the following methods specifically tailored for portfolio optimization: LW-well of [7], LW-improved of [1], LW-honey of [8], and the random matrix theory based estimator LW-analytical of [56].
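The sliding-window evaluation described above can be sketched as follows (a simplified sketch: `estimate_cov` is a placeholder for any of the compared covariance estimators, the rebalancing interval is the 20-day holding period used in the text, and annualization of the reported risk, e.g. multiplying by √250, is omitted):

```python
import numpy as np

def realized_risk(returns, window, estimate_cov, holding=20):
    """Sliding-window GMVP backtest: at day t, estimate the covariance from
    returns[t - window : t], hold the resulting GMVP weights for `holding`
    days, collect the out-of-sample portfolio returns, and report their
    empirical standard deviation (the realized risk)."""
    T = returns.shape[0]
    oos = []  # out-of-sample daily portfolio returns
    t = window
    while t + holding <= T:
        Sigma = estimate_cov(returns[t - window:t])
        ones = np.ones(Sigma.shape[0])
        w = np.linalg.solve(Sigma, ones)
        w /= w.sum()  # unconstrained GMVP weights (30)
        oos.extend(returns[t:t + holding] @ w)
        t += holding
    return np.std(oos, ddof=1)
```

Swapping `estimate_cov` between the SCM, the shrinkage estimators, and the proposed pooled estimator reproduces the kind of comparison reported in Figure 4.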
Figure 4 shows the annualized risk obtained by the different estimators as a function of the training window length n. Regarding the S&P 500 2016-2018 data, there were large differences in the performance of the methods. LW-improved performed best and LW-well the worst. The differences between the methods were smaller in the constrained case, where BARTZ performed best. Regarding the HSI 2010-2011 data set, the proposed method LINPOOL achieved the lowest risk for the unconstrained portfolio with all window sizes n. For the constrained portfolio, all of the methods performed nearly equally well, with LW-improved having the lowest risk at the window length n = 100. Regarding the HSI 2016-2017 data, the proposed methods (LINPOOL and LINPOOL-C) achieved the lowest risk for both the constrained and unconstrained portfolios for all window sizes n > 120.

VIII. CONCLUSIONS
This paper proposed a regularized sample covariance matrix estimation method for high-dimensional multiclass problems. The proposed estimator was formed as a linear combination of the class SCMs. We derived the theoretically optimal coefficients that minimize the mean squared error. The optimal coefficients depend on unknown parameters, and their estimation was addressed under the assumption that the samples are generated from unspecified elliptically symmetric populations with finite fourth-order moments. In constructing estimators for the unknown parameters, we utilized the sample spatial sign covariance matrix, which we showed in Theorem 2 to be an asymptotically unbiased estimator of the normalized covariance matrix when the sphericity parameter of the distribution grows slower than the dimension. The effectiveness and usefulness of the proposed method were demonstrated via simulations and a portfolio optimization problem using real stock data. Codes are available at github.com/EliasRaninen.
Taking the expectation and scaling by 1/p gives the stated quadratic form, where c_k corresponds to the kth column of C.

Using that E[tr(S_i²)] = MSE(S_i) + tr(Σ_i²), and that E[tr(S_i S_j)] = tr(Σ_i Σ_j) for i ≠ j, we get B = ∆ + C. By definition, ∆ is symmetric and positive definite. Also, C is symmetric and positive semidefinite. This follows since, for any m ∈ R^K, m⊤Cm = (1/p)||Σ_i m_i Σ_i||_F² ≥ 0. Regarding the extension discussed in Section IV, we now show the positive definiteness of ∆ + C. Since we know that ∆ + C ≻ 0, due to the properties of the Schur complement [60, A.5.5], the claim follows by showing the positive definiteness of the corresponding Schur complement.

APPENDIX B
PROOF OF THEOREM 2

Let x ∼ E_p(μ, Σ, g) and assume Λ̂ is computed with a known mean, μ = 0, in (16). Then x = Σ^(1/2)z, where z ∼ E_p(0, I, g), so u = z/||z|| is uniformly distributed on the unit sphere {u ∈ R^p : ||u|| = 1}. Denote Λ = (Λ_ij). We need the following lemma.

Lemma 3.
Note that (31), (32), and (33) are valid more generally for any symmetric positive semidefinite matrix, not only for the shape matrix.
We are now ready to prove Theorem 2. Since x⁻¹ ≥ 2 − x for any x > 0, we have (34). Scaling both sides of (34) by ||Λ||_F⁻¹ = 1/√(pγ), the first term will have unit norm. Let us then consider the second term on the right-hand side. Its trace is

2 tr(Λ² − Λ) / (p + 2) = (2p/(p + 2))(γ − 1),

and its norm can be bounded using the triangle inequality and the submultiplicativity of the Frobenius norm. By moving all the terms of (34) to the left-hand side and scaling them by ||Λ||_F⁻¹, we can use (35) and (36) as well as the property that tr(A) ≥ ||A||_F for any A ⪰ 0. In the following, we assume that ϱ is a fixed parameter that does not depend on the dimension p. We also use γ = tr(Λ²)/p = (1/p) Σ_{i,j} Λ_ij², where Λ = (Λ_ij).
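The estimator treated in Theorem 2 can be sketched as follows (an illustrative sketch: the spatial sign covariance matrix is scaled by p so that its trace equals p, matching the normalization of the shape matrix; centering by the sample mean is a simplification of the known-mean assumption used in the proof):

```python
import numpy as np

def spatial_sign_cov(X, mu=None):
    """Scaled sample spatial sign covariance matrix,
    (p/n) * sum_i u_i u_i', with u_i = (x_i - mu) / ||x_i - mu||.
    X is an (n, p) data matrix; mu defaults to the sample mean."""
    if mu is None:
        mu = X.mean(axis=0)
    n, p = X.shape
    U = X - mu
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # project rows to the unit sphere
    return p * (U.T @ U) / n
```

Because every row of `U` has unit norm, the result always has trace exactly p, and by Theorem 2 it is asymptotically unbiased for the shape matrix when the sphericity grows slower than the dimension.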

A. Sphericity of the AR(1) covariance matrix
The shape matrix of the AR(1) covariance matrix with parameter ϱ (|ϱ| < 1) has p ones on the main diagonal, 2(p − 1) entries equal to ϱ on the first diagonals above and below the main diagonal, 2(p − 2) entries equal to ϱ² on the second diagonals above and below the main diagonal, and so on. This gives

||Λ||_F² = p + 2 Σ_{m=1}^{p−1} (p − m)ϱ^{2m} = p + 2p Σ_{m=1}^{p−1} ϱ^{2m} − 2 Σ_{m=1}^{p−1} m ϱ^{2m}. (37)

The first sum is the geometric series, and the second sum is also well known; its solution can be obtained by differentiating the geometric series. Hence, we get

Σ_{m=1}^{p−1} ϱ^{2m} = (ϱ² − ϱ^{2p}) / (1 − ϱ²) (38)

and

Σ_{m=1}^{p−1} m ϱ^{2m} = ϱ²(1 − pϱ^{2(p−1)} + (p − 1)ϱ^{2p}) / (1 − ϱ²)². (39)

Substituting (38) and (39) into (37) and dividing by p gives

γ = ||Λ||_F²/p → 1 + 2ϱ²/(1 − ϱ²) = (1 + ϱ²)/(1 − ϱ²), as p → ∞, (40)

so the sphericity remains bounded as the dimension grows.

E. Raninen and E. Ollila are with the Department of Signal Processing and Acoustics, Aalto University, P.O. Box 15400, FI-00076 Aalto, Finland. D. E. Tyler is with the Department of Statistics, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, U.S.A. The research work by E. Raninen and E. Ollila was supported in part by the Academy of Finland under Grant 298118. Research work by D. E. Tyler was supported in part by the National Science Foundation under Grant DMS-1812198.
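The limit derived above can be checked numerically (a small sketch; `ar1_sphericity` evaluates γ = ||Λ||_F²/p directly from the off-diagonal structure of the AR(1) shape matrix):

```python
import numpy as np

def ar1_sphericity(p, rho):
    """gamma_p = ||Lambda||_F^2 / p for the p x p AR(1) shape matrix:
    ones on the diagonal and rho**m on the m-th off-diagonals."""
    m = np.arange(1, p)
    frob2 = p + 2.0 * np.sum((p - m) * rho ** (2 * m))
    return frob2 / p
```

For ϱ = 0.5 and p = 2000, the computed γ is within 0.01 of the limit (1 + ϱ²)/(1 − ϱ²) = 5/3; since γ stays bounded as p grows, the AR(1) model satisfies the sphericity condition required by Theorem 2.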
For two real sequences a_p and b_p, as p → ∞, the notation a_p = o(b_p) means that the sequence a_p/b_p → 0, and the notation a_p = O(b_p) means that the sequence a_p/b_p is bounded. For a matrix-valued sequence A_p, we write A_p = o(b_p) and A_p = O(b_p) if and only if ||A_p||_F = o(b_p) and ||A_p||_F = O(b_p), respectively. The Frobenius norm is defined as ||A||_F = √(tr(A⊤A)), while ||·|| denotes the Euclidean norm for vectors. The largest eigenvalue of A is λ_max(A). The determinant of A is denoted by |A|.

Fig. 2: NMSE as a function of the sample size in the complex-valued AR(1) case with 4 classes. Top: NMSE of the individual classes. Bottom: total combined NMSE.

Fig. 3: NMSE of Σ̂₁ as the number of classes increases.

Fig. 4: Annualized realized GMVP risk achieved out-of-sample for different covariance matrix estimators and different training window lengths n. Left: unconstrained portfolio. Right: constrained portfolio (nonnegative weights and maximum single asset weight 0.1).

Write f(a) = E[||Ŝ(a) − Σ_k||_F²] for the objective function. By expanding the expression for the squared error, we get
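The expansion can be reconstructed from the quantities used elsewhere in this appendix (a sketch; the scaling follows the 1/p normalization of B = ∆ + C and c_k):

```latex
f(\mathbf{a})
= \sum_{i,j} a_i a_j\, \mathbb{E}[\operatorname{tr}(\mathbf{S}_i\mathbf{S}_j)]
  - 2\sum_{i} a_i \operatorname{tr}(\boldsymbol{\Sigma}_i\boldsymbol{\Sigma}_k)
  + \operatorname{tr}(\boldsymbol{\Sigma}_k^2),
```

where, by independence of the classes, $\mathbb{E}[\operatorname{tr}(\mathbf{S}_i\mathbf{S}_j)] = \operatorname{tr}(\boldsymbol{\Sigma}_i\boldsymbol{\Sigma}_j)$ for $i \neq j$ and $\mathbb{E}[\operatorname{tr}(\mathbf{S}_i^2)] = \mathrm{MSE}(\mathbf{S}_i) + \operatorname{tr}(\boldsymbol{\Sigma}_i^2)$, so that after taking the expectation and scaling by $1/p$,

```latex
f(\mathbf{a})/p
= \mathbf{a}^{\top}(\boldsymbol{\Delta} + \mathbf{C})\,\mathbf{a}
  - 2\,\mathbf{a}^{\top}\mathbf{c}_k
  + \operatorname{tr}(\boldsymbol{\Sigma}_k^2)/p .
```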

TABLE I: The normalized mean squared error over 1000 repetitions, with the standard deviation in parentheses.