Regularized tapered sample covariance matrix

Covariance matrix tapers have a long history in signal processing and related fields. Examples of applications include autoregressive models (promoting a banded structure) or beamforming (widening the spectral null width associated with an interferer). In this paper, the focus is on the high-dimensional setting, where the dimension $p$ is large while the data aspect ratio $n/p$ is low. We propose an estimator called Tabasco (TApered or BAnded Shrinkage COvariance matrix) that shrinks the tapered sample covariance matrix towards a scaled identity matrix. We derive optimal and estimated (data-adaptive) regularization parameters that are designed to minimize the mean squared error (MSE) between the proposed shrinkage estimator and the true covariance matrix. These parameters are derived under the general assumption that the data is sampled from an unspecified elliptically symmetric distribution with finite 4th-order moments (both real- and complex-valued cases are addressed). Simulation studies show that the proposed Tabasco outperforms all competing tapering covariance matrix estimators in diverse setups. A space-time adaptive processing (STAP) application also illustrates the benefit of the proposed estimator in a practical signal processing setup.


I. INTRODUCTION
Consider a set of p-dimensional (real-valued) vectors {x_i}_{i=1}^n sampled from a distribution of a random vector x with unknown mean vector µ = E[x] and unknown positive definite symmetric p × p covariance matrix Σ ≡ cov(x) = E[(x − µ)(x − µ)^⊤]. In the high-dimensional case, when the sample size n is of the same order as p (p = O(n)) or p ≫ n, one is required to use regularization (shrinkage) in order to improve the estimation accuracy of the SCM and to obtain a positive definite matrix estimate. A popular estimate of Σ in such a setting is the regularized sample covariance matrix (RSCM), defined by

S_β = β S + (1 − β) (tr(S)/p) I,  (1)

where β ∈ [0, 1] is the regularization (or shrinkage) parameter, and where

S = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^⊤  (2)

denotes the unbiased sample covariance matrix (SCM), i.e., E[S] = Σ. Note also that in (2), x̄ = (1/n) Σ_{i=1}^n x_i denotes the sample mean vector. Automatic data-adaptive computation of the optimal (oracle) parameter β for which S_β in (1) attains the minimum mean squared error (MMSE) in the Frobenius norm has been an active area of research; see, for example, [1], [2], [3], [4], to name only a few.
In many applications, the estimation accuracy (or another performance criterion) can alternatively be improved by using a so-called tapered SCM. Such an estimate is defined as W • S, where • denotes the Hadamard (or Schur) element-wise product, and where W is a tapering matrix (also referred to as a covariance matrix taper), i.e., a template that imposes some additional structure on the SCM. Note that above, (W • S)_ij = w_ij s_ij for (W)_ij = w_ij and (S)_ij = s_ij.
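In NumPy terms, the SCM and its tapered version are one Hadamard product apart. A minimal sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def sample_covariance(X):
    """Unbiased SCM of an (n, p) data matrix X (rows are observations)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center with the sample mean
    return Xc.T @ Xc / (n - 1)

def tapered_scm(X, W):
    """Tapered SCM W • S: element-wise product of the template W and S."""
    return W * sample_covariance(X)
```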
Covariance matrix tapers have been used in many applications in diverse fields. A first main example, from statistics, concerns covariance matrices with a diagonally dominant structure (e.g., in autoregressive models). This means that the variables have a natural order, in the sense that a large |i − j| implies that the correlation between the ith and the jth variables is close to zero. In this setting, a popular estimation approach is to use banding-type tapering matrices such as the hard-thresholding template [5], [6]

(W)_ij = 1 if |i − j| ≤ k, and 0 otherwise,  (3)

for some integer k ∈ [[1, p]] (called the bandwidth parameter), or softer thresholding variants. Notably, the strong theoretical merits of a linear-decay taper of the form (4) (equal to one up to lag k/2 and decaying linearly to zero at lag k) were studied in [7]. A second major example concerns the signal processing literature, in which tapering matrices have been developed in order to improve several spectral properties of adaptive beamformers, or to compensate for subspace leakage and calibration issues [8]. Most notably, tapering matrices of the form

(W)_ij = sinc((i − j)∆/π),  (5)

where ∆ ∈ R₊, attracted interest as a null-broadening technique for fluctuating interference [9], [10], [11], [12], [13].
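The three template families mentioned above can be sketched as follows; the exact linear-decay parameterization is an assumption in the spirit of [7] (flat up to lag k/2, linearly decaying to zero at lag k):

```python
import numpy as np

def banding_template(p, k):
    """Hard banding template (cf. (3)): w_ij = 1 if |i - j| <= k, else 0."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return (d <= k).astype(float)

def linear_taper_template(p, k):
    """Linear-decay taper (assumed form of (4)): flat up to lag k/2,
    then linearly decaying to zero at lag k."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    kh = max(k / 2.0, 1e-12)
    return np.clip(2.0 - d / kh, 0.0, 1.0)

def sinc_template(p, delta):
    """Null-broadening taper (cf. (5)): (W)_ij = sinc((i - j) delta / pi),
    i.e., sin((i - j) delta) / ((i - j) delta) off the diagonal."""
    d = np.subtract.outer(np.arange(p), np.arange(p))
    return np.sinc(d * delta / np.pi)   # np.sinc(x) = sin(pi x)/(pi x)
```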
A first approach combining regularization with tapering was proposed in [14] with the shrinkage-to-tapering (ST) estimator, defined as the convex combination of the SCM and the tapered SCM,

S_{ST,β} = β S + (1 − β) (W • S),  (6)

where β ∈ [0, 1] is a shrinkage parameter. The authors then derived an optimal oracle parameter β_o minimizing the MSE E[‖S_{ST,β} − Σ‖²_F] and proposed a shrinkage-to-tapering oracle approximating (STOA) estimator β̂_o of β_o under the assumption of Gaussian data. The authors of [15] also studied the ST estimator and derived an alternative oracle estimator of the shrinkage parameter under both Gaussian and non-Gaussian data. Data-adaptive selection of the bandwidth k in (3) was also addressed, with cross-validation [14] or oracle estimation [15]. A possible issue with the ST estimate is that it inherently destroys the tapering template structure (e.g., sparsity for banded matrices), since it can be expressed as the modified tapered SCM S_{ST,β} = (β 11^⊤ + (1 − β) W) • S.
Hence, shrinkage is applied to the tapering matrix itself rather than to the SCM. In the high-dimensional case, it should also be noted that both W • S and S_{ST,β} are not necessarily positive semidefinite matrices, i.e., they can have negative or zero eigenvalues. A possible solution to this problem is to compute their eigenvalue decomposition (EVD) and then replace the invalid eigenvalues by small positive constants. However, such a post-processing step further deteriorates the template pattern of the covariance matrix estimator, and is computationally restrictive when dealing with high-dimensional data.
In this paper we provide a solution to the aforementioned problems by jointly leveraging shrinkage to identity and tapering. Let W = {W(k)}_{k=1}^K be a finite set of possible tapering matrices¹ drawn from the set W⁺ defined in (7), with R^{p×p}_Sym denoting the set of all symmetric p × p matrices and [[1, p]] = {1, . . ., p}. We propose an estimator, referred to as TABASCO (TApered or BAnded Shrinkage COvariance matrix), defined as

Σ̂_{β,k} = β (W(k) • S) + (1 − β) (tr(S)/p) I,  (8)

which benefits both from shrinkage (as the classic estimator in (1)) and from the exploitation of structure via tapering. Note that it also preserves the original scale of the SCM, since tr(W • S) = tr(S) ∀W ∈ W⁺. Obviously, the success of banding and/or tapering depends on one's ability to choose the parameters β and k correctly. In this scope, we derive a fully automatic data-adaptive evaluation of the optimal parameters that jointly minimize the mean squared error E[‖Σ̂_{β,k} − Σ‖²_F] under the general assumption that the data is sampled from an unspecified elliptically symmetric (ES) distribution with finite 4th-order moments. A main interest of considering the general ES model is that it encompasses the standard Gaussian one while still accounting for possibly heavy-tailed distributions; this assumption thus yields robustness to a large class of possible underlying data distributions. Our empirical experiments evidence that the proposed approach offers near-optimal regularization parameter selection, which outperforms cross-validation schemes (especially at low sample support).

¹ In this paper, we mostly focus on k implying a notion of bandwidth (or model order), for which W can be constructed from (3) or (4) with k ∈ [[1, p]]. However, the proposed methodology applies to the general setting where W corresponds to any finite collection of possibly envisioned templates. Notably, we will also consider an application where k indexes a set of possible {∆_k}_{k=1}^K used for the template model in (5).

Since both the RSCM in (1) (if 11^⊤ ∈ W) and
the tapered SCM (β = 1) appear as special cases of (8), TABASCO never performs worse than these two estimators in terms of MSE, independently of the underlying structure of the true covariance matrix. The paper is structured as follows. In Section II, expressions for the oracle regularization parameters β and k that minimize the MSE are derived in the general case of sampling from an unspecified distribution with finite 4th-order moments. In Section III, we provide useful intermediate theoretical results about the tapered SCM when the data is sampled from an unspecified ES distribution with finite 4th-order moments. In Section IV, practical closed-form expressions for the optimal regularization parameters are derived when sampling from an ES distribution, and an adaptive, fully automatic procedure for their computation is proposed. As it is shown that the optimal parameters depend on the sphericity of the tapered covariance matrix W • Σ, we address the estimation of this quantity in Section V. Section VI extends our results to the special cases of known location (µ = 0) and/or complex-valued observations. Section VII provides simulation studies, while in Section VIII the estimator is applied to STAP data. Finally, Section IX concludes. The Appendix contains the more technical proofs.

II. ORACLE TABASCO PARAMETERS β AND k
First, recall that the TABASCO estimator Σ̂_{β,k} is defined by (8) for a set W = {W(k)}_{k=1}^K of envisioned tapering matrices (cf. footnote 1 for examples) and a regularization parameter β ∈ [0, 1]. In this section, we derive the expressions of the oracle parameters β and k that minimize the MSE in the general case of sampling from an unspecified p-variate distribution with finite 4th-order moments.
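For a fixed template W and parameter β, the estimator can be sketched as follows; the convex-combination form is an assumption reconstructed from the text (it preserves the trace and recovers the RSCM for W = 11^⊤ and the tapered SCM for β = 1):

```python
import numpy as np

def tabasco(S, W, beta):
    """Assumed TABASCO form: beta * (W • S) + (1 - beta) * (tr(S)/p) * I.
    Reconstructed from the text; not necessarily the paper's exact display."""
    p = S.shape[0]
    eta = np.trace(S) / p                 # scale estimate tr(S)/p
    return beta * (W * S) + (1.0 - beta) * eta * np.eye(p)
```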
Before doing so, let us introduce some notation and statistical parameters that are elemental in the proposed method. The scale and the sphericity of Σ [16], [17] are denoted by

η = tr(Σ)/p  (9)  and  γ = p ‖Σ‖²_F / tr(Σ)²,  (10)

respectively. The scale corresponds to the mean of the eigenvalues of Σ, while the sphericity measures how close Σ is to a scaled identity matrix: γ ∈ [1, p], where γ = 1 if and only if Σ ∝ I, and γ = p if and only if Σ has rank equal to 1.
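For concreteness, the scale and sphericity in (9)-(10) can be computed as follows (`scale` and `sphericity` are illustrative names):

```python
import numpy as np

def scale(Sigma):
    """Scale eta = tr(Sigma)/p, the mean of the eigenvalues (eq. (9))."""
    return np.trace(Sigma) / Sigma.shape[0]

def sphericity(Sigma):
    """Sphericity gamma = p ||Sigma||_F^2 / tr(Sigma)^2, in [1, p] (eq. (10))."""
    p = Sigma.shape[0]
    return p * np.sum(Sigma**2) / np.trace(Sigma)**2
```

The two extremes are easy to check: a scaled identity gives γ = 1, and a rank-1 matrix gives γ = p.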
For any W ∈ W⁺ as in (7), the matrix W • Σ is called the tapered covariance matrix, and we denote by

γ_W = p ‖W • Σ‖²_F / tr(W • Σ)²

the sphericity parameter of the tapered covariance matrix. When W = 11^⊤, we write γ_{11^⊤} ≡ γ for brevity.

A. Oracle shrinkage parameter β for fixed k
We start by assuming that the index k is fixed. This allows us to simply denote the fixed tapering matrix W ≡ W(k) and TABASCO as Σ̂_β ≡ Σ̂_{β,k}. To find the oracle MMSE shrinkage parameter β ∈ [0, 1] of Σ̂_β, the aim is thus to solve the minimization problem in (11), where ‖·‖_F denotes the Frobenius matrix norm, i.e., ‖A‖²_F = tr(A^⊤A), and tr(·) denotes the matrix trace, i.e., tr(A) = Σ_i a_ii for all square matrices A = (a_ij). Notice that the MSE of the tapered SCM, MSE(W • S), is given by (12), where V is defined in (13). By normalized MSE (NMSE) we refer to NMSE(W • S) = MSE(W • S)/‖Σ‖²_F. We are now ready to state the main result of this section.

Theorem 1. Let {x_i}_{i=1}^n be an i.i.d. random sample from any p-variate distribution with finite 4th-order moments. For any fixed W ∈ W⁺, the oracle parameter β_o in (11) is given by the equivalent expressions (14)-(16), where η = tr(S)/p. Furthermore, the value of the MSE at the optimum is given by (17). The proof is postponed to Appendix A.
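The oracle parameter can also be illustrated numerically without the closed form: the sketch below grid-searches β to minimize the empirical MSE against a known Σ, under the assumed estimator form β(W • S) + (1 − β)(tr(S)/p)I (all names are illustrative):

```python
import numpy as np

def empirical_oracle_beta(Sigma, sampler, W, n_mc=100, grid=None):
    """Monte Carlo approximation of the oracle beta in (11): minimize the
    empirical MSE E||beta (W • S) + (1 - beta) eta I - Sigma||_F^2 over a
    grid of beta values. The convex-combination estimator form is an
    assumption; `sampler(rng)` must return one (n, p) data matrix."""
    p = Sigma.shape[0]
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    rng = np.random.default_rng(1)
    mse = np.zeros(len(grid))
    for _ in range(n_mc):
        X = sampler(rng)
        S = np.cov(X, rowvar=False)
        A = W * S                       # tapered SCM
        eta = np.trace(S) / p           # scale estimate tr(S)/p
        for j, b in enumerate(grid):
            E = b * A + (1.0 - b) * eta * np.eye(p) - Sigma
            mse[j] += np.sum(E * E)
    return grid[int(np.argmin(mse))]
```

At low sample support (n smaller than p), the minimizing β is strictly below 1, i.e., pure (tapered) SCM estimation is never optimal there.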
Notice that Theorem 1 also provides the optimal MMSE shrinkage parameter β_o for the RSCM S_β in (1), since Σ̂_β = S_β when W = 11^⊤. For the RSCM, the optimal parameter is given by (18), which follows from (16) and the facts that γ = γ_V and W • S = S for W = 11^⊤. The minimum MSE of the RSCM utilizing the optimal shrinkage parameter in (18) then follows from (17) and the fact that V • Σ = Σ for V = 11^⊤.

B. Oracle index k
Notice that MSE(Σ̂_{β_o}) in (17) implicitly depends on k through W ≡ W(k) and V defined in (13). We further have the relation (19), where C is a constant that does not depend on k. Equation (19) then implies that minimizing the MSE with respect to k is equivalent to choosing k as in (20), where β_o(k) is given by any of the expressions in (14)-(16) and γ_V(k) is defined via (10). Note that we have made the dependence of β_o and γ_V on k explicit in (20) for clarity of exposition.
Of course, the oracles β_o and k_o depend here on the true underlying data distribution and covariance matrix through various unknown quantities. A practical implementation of TABASCO thus requires their adaptive evaluation. Rather than resorting to potentially inaccurate cross-validation, we will consider the general case where the data is sampled from an unspecified ES distribution [18], [19]. In this setting, we show that the oracle parameters eventually depend on a few parameters that can be accurately evaluated, even at low sample support.

III. TAPERED SCM UNDER ES DISTRIBUTIONS
In this section we recall some definitions and key results concerning ES distributions [18], [19]. We then derive useful results (expectations and consistent estimates) related to functions of the tapered SCM W • S, which will be needed in the later developments of the oracle TABASCO parameters.

A. ES distributions
The probability density function of an elliptically distributed random vector, denoted by x ∼ E_p(µ, Σ, g), is given by

f(x) = C_{p,g} |Σ|^{−1/2} g((x − µ)^⊤ Σ^{−1} (x − µ)),

where Σ denotes the positive definite symmetric covariance matrix parameter, µ is the mean vector, g : [0, ∞) → [0, ∞) is the density generator, which is a fixed function that is independent of x, µ and Σ, and C_{p,g} is a normalizing constant ensuring that f(x) integrates to 1. Note that here we define g such that the "scatter matrix" parameter Σ coincides with the covariance matrix. This can always be assumed (under the assumption of finite 2nd-order moments) without any loss of generality [18], [19]. For example, the multivariate normal (MVN) distribution, denoted by N_p(µ, Σ), is obtained when g(t) = exp(−t/2). The flexibility regarding the density generator g allows for modeling a large class of distributions, including heavy-tailed ones such as the multivariate t-distribution (MVT) with ν > 2 degrees of freedom (d.o.f.), denoted by x ∼ t_ν(µ, Σ), where ν > 2 needs to be assumed for finite 2nd-order moments. The elliptical kurtosis [20] parameter κ is defined as

κ = E[r⁴] / (p(p + 2)) − 1 = kurt(x_i)/3,

where the expectation is over the distribution of the random variable r = ‖Σ^{−1/2}(x − µ)‖, and kurt(x_i) denotes the excess kurtosis of any (e.g., the ith) marginal variable of x. Furthermore, observe that E[r²] = p. The elliptical kurtosis parameter vanishes (so κ = 0) when x has an MVN distribution.
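As a concrete illustration of the relation κ = kurt(x_i)/3, a plug-in estimate of the elliptical kurtosis can average the sample excess kurtosis over the p marginals (a sketch, not necessarily the exact estimator used in [4]):

```python
import numpy as np

def elliptical_kurtosis_estimate(X):
    """Plug-in estimate of kappa: average the sample excess kurtosis of
    the p marginal variables of the (n, p) data X and divide by 3
    (real-valued case, where kappa = kurt(x_i)/3)."""
    Xc = X - X.mean(axis=0)
    m2 = np.mean(Xc**2, axis=0)
    m4 = np.mean(Xc**4, axis=0)
    excess = m4 / m2**2 - 3.0     # excess kurtosis per marginal
    return float(np.mean(excess) / 3.0)
```

For Gaussian data the estimate is close to 0, while heavy-tailed data yields a positive value.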
We also recall from [4, Lemma 2] the closed-form expressions (23) and (24) for E[‖S‖²_F] and E[tr(S)²], in which the scalars τ₁ and τ₂ depend on the elliptical distribution (and hence on the density generator g) only via its kurtosis parameter.

B. Useful intermediate results about tapered SCM
We now derive an extension of [4, Lemma 2] for the tapered SCM W • S. Write diag(A) ≡ diag(a₁₁, . . ., a_pp) for any matrix A = (a_ij)_{p×p}, where diag(a) denotes a diagonal matrix with the entries of the vector a on the main diagonal.

Lemma 1. Let {x_i}_{i=1}^n be an i.i.d. random sample from E_p(µ, Σ, g) with finite 4th-order moments. Then, for any W ∈ W⁺, E[‖W • S‖²_F] admits a closed-form expression in terms of τ₁, τ₂, W • Σ, and diag(Σ).

Proof. The proof is postponed to Appendix B.
Interestingly, the knowledge of E[‖W • S‖²_F] from Lemma 1 allows for a direct computation of the MSE of W • S via (12). Lemma 1 also shows that the obvious plug-in estimate ‖W • S‖²_F/p of the parameter ϑ_W = ‖W • Σ‖²_F/p is biased. Next, we derive a proper estimator ϑ̂_W of ϑ_W, which extends [4, Theorem 4] and provides an unbiased estimator of ϑ_W provided that the elliptical kurtosis parameter κ is known.
Theorem 2. Let {x_i}_{i=1}^n be an i.i.d. random sample from a p-variate elliptical distribution E_p(µ, Σ, g) with finite 4th-order moments. Then, an unbiased estimator ϑ̂_W of ϑ_W = ‖W • Σ‖²_F/p, valid for any finite n and p and any W ∈ W⁺, is obtained with the constants a_n and b_n given in (27) and (28).

Proof. Note that a_n in (27) can be written as a function a_n ≡ a_n(τ₁, τ₂), where the definitions of τ₁ and τ₂ are given by (25), while b_n in (28) can likewise be expressed as b_n ≡ b_n(τ₁, τ₂). The expressions (27) and (28) are obtained by substituting the values of τ₁ and τ₂ given in (25) into a_n(τ₁, τ₂) and b_n(τ₁, τ₂) and simplifying the resulting expressions.
This result will notably be used later in subsection V-B to construct an estimator of the sphericity parameter γ W .

IV. ORACLE PARAMETERS ESTIMATION IN ES DISTRIBUTIONS
Using Lemma 1, we may now derive a simple closed-form expression for the optimal shrinkage parameter β_o given in Theorem 1 that depends only on a few summary (scalar-valued) statistics which can be estimated from the data. Let us introduce the quantity defined in (29), where d_Σ = (σ²₁, . . ., σ²_p)^⊤ contains the variances of the variables, i.e., the diagonal elements of Σ. The 2nd equality in (29) follows from [21, Lemma 7.5.2]. The main result of this section is derived next.

Theorem 3. Let {x_i}_{i=1}^n be an i.i.d. random sample from an ES distribution E_p(µ, Σ, g) with finite 4th-order moments. For any W ∈ W⁺, the oracle parameter β_o in (11) is given by (30), where t = n(γ_V − 1).

Proof. Follows from Theorem 1 after substituting the value of E[‖W • S‖²_F] given in Lemma 1 and that of E[tr(S)²] given in (24) into the denominator of β_o in (15), and simplifying the expression.
Following Theorem 3, the proposed data-adaptive implementation of TABASCO consists in applying the oracle procedure of Section II with each of the unknown parameters {η, ϑ_W, κ, γ, γ_W, γ_V} in (30) replaced by carefully chosen estimates (detailed below). This yields an estimate β̂_o(k) for each of the templates in the set W = {W(k)}_{k=1}^K. Similarly, the index k is estimated based on (20) by replacing the unknown β_o(k) and γ_V(k) by their estimates and solving (31). The pseudocode of the proposed estimation algorithm is summarized in Algorithm 1.
Estimators of the parameters {η, ϑ_W, κ, γ, γ_W, γ_V}, together with additional remarks, are detailed in the following:
• For η and ϑ_W, we use the empirical estimates η̂ = tr(S)/p and ϑ̂_W computed from (32).
• For κ, we use the sample average of the excess kurtosis estimates of the marginal variables scaled by 1/3. Also note that if the data is assumed to follow the MVN distribution, we can set κ̂ = 0, and the last term involving κ can be ignored in the denominator.

Algorithm 1 (summary): for each k, compute ϑ̂_W from (32), compute γ̂_W(k) and γ̂_V(k) (options in Section V), and compute β̂_o(k) from (30) using the plug-in estimates; finally, select the optimal k̂_o as in (31).
• The estimation of the three sphericity statistics γ, γ_W, and γ_V is addressed in detail in Section V. Also notice that V = (√w_ij)_{p×p}, so if W is a selection matrix (i.e., one that has only 0s or 1s as its off-diagonal elements), as for example in (3), then W = V and only γ_W needs to be estimated.

V. ESTIMATORS OF SPHERICITY
In this section, we detail two new alternative estimators of the sphericity of the tapered covariance matrix W • Σ, which are extensions of the sphericity estimators proposed in [4]. First, define the shape matrix (or normalized covariance matrix) as Λ = p Σ/tr(Σ) and note that tr(Λ) = p. The sphericity measures γ and γ_W for any W ∈ W⁺ can then be expressed simply in terms of Λ via the formulas γ = ‖Λ‖²_F/p and γ_W = ‖W • Λ‖²_F/p.

A. Ell1-estimator of sphericity
The Ell1-estimator is based on the spatial sign covariance matrix (SSCM), which has been popular for constructing robust estimates of the sphericity [22], [23]. The SSCM-based sphericity estimator was theoretically studied in [24], and we propose here its adaptation to the sphericity of the tapered covariance matrix W • Σ.
The (scaled) SSCM is defined by

S_sgn = (p/n) Σ_{i=1}^n (x_i − μ̂)(x_i − μ̂)^⊤ / ‖x_i − μ̂‖²,  (33)

where μ̂ = arg min_µ Σ_{i=1}^n ‖x_i − µ‖ is the sample spatial median [25]. When µ is known (and, without loss of generality, assuming µ = 0), the SSCM is defined with μ̂ replaced by 0. Recently, it was shown in [24] that a sphericity estimate γ̂ based on the SSCM (when µ is known) is asymptotically (as p → ∞) unbiased when sampling from elliptical distributions under the following assumption:

(A) The sequence of covariance matrix structures being considered with increasing p satisfies γ = o(p) as p → ∞.

In other words, E[γ̂] → γ as p → ∞ when (A) holds. We note that Assumption (A) is sufficiently general and holds for many covariance matrix models, as shown in [24, Prop. 3]. The following theorem presents a modification of the Ell1-estimator [4] for the sphericity of W • Σ with equivalent asymptotic guarantees.

Theorem 4. Let {x_i}_{i=1}^n be an i.i.d. random sample from an ES distribution E_p(µ, Σ, g) with known µ = 0. Then, for any W ∈ W⁺ and under Assumption (A), the corresponding Ell1 statistic γ̂_W satisfies E[γ̂_W] → γ_W as p → ∞.

Proof. The proof is postponed to Appendix C.
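The spatial median and the scaled SSCM can be sketched as follows; the Weiszfeld-type iteration for the spatial median is our implementation choice, and the scaling by p (so that the SSCM has trace p) is an assumption consistent with the term "scaled SSCM":

```python
import numpy as np

def spatial_median(X, n_iter=200, tol=1e-9):
    """Sample spatial median: argmin_mu sum_i ||x_i - mu|| via
    Weiszfeld iterations (an assumed numerical scheme)."""
    mu = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), 1e-12)
        w = 1.0 / d
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

def sscm(X, mu=None):
    """(Scaled) SSCM: (p/n) * sum_i u_i u_i^T with u_i the unit-norm
    centered observations; mu = None uses the sample spatial median."""
    n, p = X.shape
    if mu is None:
        mu = spatial_median(X)
    U = X - mu
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return (p / n) * U.T @ U
```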

B. Ell2-estimator of sphericity
The Ell2-estimator of sphericity was proposed in [4] and we derive here its adaptation to the sphericity of the tapered covariance matrix W • Σ thanks to Theorem 2.
First, note that the sphericity of the tapered covariance matrix can also be written as γ_W = ϑ_W/η², where ϑ_W and η are defined in (26) and (9), respectively. Using this expression, we consider the estimate γ̂_W = ϑ̂_W/η̂², where ϑ̂_W is computed from Theorem 2 and η̂² is obtained from (32). This yields the estimator in (36), where â_n ≡ a_n(κ̂) and b̂_n ≡ b_n(κ̂) are obtained by replacing the unknown κ in (27) and (28) by its estimate κ̂ [4, Sect. IV].
We refer to (36) as the Ell2-estimator of sphericity γ_W. Also note that, if n is reasonably large, then b̂_n ≈ 1 and n/(n + κ̂) ≈ 1, so its expression can be simplified accordingly. In the non-tapered case (W = 11^⊤), the estimator in (36) reduces to the Ell2-estimator of sphericity in [4]. Although the Ell2-estimator of sphericity does not require knowledge of the underlying elliptically symmetric distribution of the data, it is not a robust estimator. Thus we overall favour the Ell1-estimator, due to the robustness of the SSCM, and recommend the Ell2-estimator when dealing with data that is not heavy-tailed, i.e., which can be approximated by a Gaussian distribution. In practice, we also always apply the thresholding γ̂ ← min(p, max(1, γ̂)) for either option, in order to guarantee that the final estimate remains in the valid interval 1 ≤ γ̂ ≤ p.

VI. EXTENSIONS

A. Known location µ
In some applications, the mean vector µ = E[x] is known and can be assumed to be µ = 0 without loss of generality. In this case, the covariance matrix Σ = E[xx^⊤] is estimated by the SCM defined by

S = (1/n) Σ_{i=1}^n x_i x_i^⊤,

which is also an unbiased estimator of Σ, i.e., E[S] = Σ.
The known location case implies only small changes in our estimation procedure since Theorem 1 holds for both known and unknown location cases.
When the location is known, the expectations E[‖S‖²_F] and E[tr(S)²] are of the form (23) and (24) with τ₁ and τ₂ given by (39). This result follows as a special case of [26, Lemma 1] for a Gaussian weight function. Similarly, Lemma 1 holds when using the τ₁ and τ₂ in (39). The change to the optimal β_o parameter is also minimal: one may ignore the factor n/(n − 1) that appears as the multiplier of the second-to-last term ϑ_W/η² + γ_W − 2γ/p in the denominator of β_o. Theorem 2 also holds, with a_n and b_n modified accordingly.

B. Complex-valued data
Extending the results to complex-valued data also requires only minor adaptations, since Theorem 1 holds for complex-valued observations as well. First, we recall some notation specific to the complex-valued case. By ‖x‖² = x^H x we denote the usual Euclidean norm in complex vector spaces. We now assume that the data {x_i}_{i=1}^n is a random sample from a circular complex elliptically symmetric (CES) distribution, denoted x ∼ CE_p(µ, Σ, g) (cf. [19] for a detailed review). Similarly to the real-valued case, the probability density function of a CES-distributed random vector x ∈ C^p is given by

f(x) = C_{p,g} |Σ|^{−1} g((x − µ)^H Σ^{−1} (x − µ)),

where Σ denotes the positive definite Hermitian covariance matrix, µ = E[x] is the mean vector, g : R_{≥0} → R_{>0} is the density generator, and C_{p,g} is a normalizing constant. Again, we normalize g so that the scatter matrix parameter coincides with the covariance matrix. The definitions of the scale and sphericity parameters in (9) and (10) remain unchanged. The elliptical kurtosis is however redefined as

κ = E[r⁴] / (p(p + 1)) − 1,

where the expectation is over r = ‖Σ^{−1/2}(x − µ)‖, and kurt(x_i) denotes the excess kurtosis of any (e.g., the ith) marginal variable of x, defined by kurt(x_i) = E[|x_i − µ_i|⁴]/σ_i⁴ − 2, where µ_i and σ_i² denote the mean and variance of x_i. The theoretical lower bound of the kurtosis in the complex-valued case is κ_LB = −1/(p + 1) [19]. Again, κ = 0 if x has a circular complex multivariate normal distribution (x ∼ CN_p(µ, Σ)). The SCM (2) of complex-valued observations is defined by

S = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^H,  (40)

and the TABASCO estimator Σ̂_β is still defined as in (8).
The next result provides the complex-valued extension of Lemma 1.
Lemma 2. Let {x_i}_{i=1}^n be an i.i.d. random sample from CE_p(µ, Σ, g) with finite 4th-order moments. Then, for any W ∈ W⁺ and for the SCM in (40), E[‖W • S‖²_F] admits a closed-form expression in terms of D_Σ = diag(Σ), D_S = diag(S), and the constants τ₁ and τ₂ defined in (25).
Proof. The proof is postponed to Appendix D.

This result allows us to derive the complex-valued counterpart of Theorem 3 for the optimal shrinkage parameter β_o.

Theorem 5. Let {x_i}_{i=1}^n be an i.i.d. random sample from a complex elliptical distribution CE_p(µ, Σ, g) with finite 4th-order moments. Then the oracle parameter β_o in (11) is given by an expression analogous to (30), where t = n(γ_V − 1).

With similar arguments as in the real-valued case, it follows that Theorem 2 holds with a_n as in (27) and with a suitably modified b_n.
This means that the Ell2-sphericity estimator can be defined as earlier, with changes only in the equations for a_n and b_n. Similarly, the only change to the SSCM in (33) for complex-valued observations is that the transpose (·)^⊤ is replaced with the Hermitian transpose (·)^H.

VII. SIMULATION STUDIES
We generate samples from (real-valued) ES distributions with a scatter matrix Σ having a diagonally dominant structure (Model 1 and Model 2, detailed below). The mean µ is generated randomly as N_p(10 · 1, I), and the number of Monte Carlo trials is 5000.
The estimators included in the study are:
i) the Ledoit-Wolf estimator (LWE) [2], defined by (1), where β̂ is an estimate of the (oracle) MMSE parameter β_o;
ii) the shrinkage to tapering oracle approximate (STOA) estimator [14], defined by (6), where β̂ is an estimate of the oracle parameter computed using an iterative procedure, and the bandwidth k is selected using a cross-validation scheme with a 60%-to-40% split for training and testing;
iii) the shrinkage to tapering (ST) estimators of [15], defined by (6), where both β̂ and k̂ are estimates of the oracle MMSE parameters; the estimator ST-gaus assumes Gaussian data, while ST-nong assumes non-Gaussian (ES) data;
iv) TABASCO (computed via Algorithm 1) using the Ell1-estimator of sphericity.

A. Model 1
In Model 1, Σ possesses an auto-regressive AR(1) structure:

(Σ)_ij = η ϱ^{|i−j|}, where |ϱ| ∈ [0, 1).

When ϱ ↓ 0, Σ is close to an identity matrix scaled by η, and when ϱ ↑ 1, Σ tends to a singular matrix of rank 1. As illustrated in Figure 1, banding matrices allow for a good approximation, so all tapering-type estimators are computed with W(k) as in (3) in this subsection. The optimal bandwidth k̂_o is chosen by considering a set of banding matrices indexed by the bandwidth. Figure 2 provides a validation of the theoretical results: it displays the theoretical normalized MSE (NMSE) curves as a function of the shrinkage parameter β for TABASCO estimators using fixed bandwidths k ∈ [[1, 5]] and k = p (i.e., W = 11^⊤). In this setup, the data is generated from the MVN distribution N_p(µ, Σ) with p = 100 and n = 50 (similar results were obtained for other ES distributions and dimension setups). The black bullet (•) displays the theoretical minimum NMSE in (17), while the average NMSE attained by TABASCO is displayed using a red triangle, whose location on the β axis corresponds to the empirical average of β̂_o. As can be noted from Figure 2, TABASCO estimates the oracle shrinkage parameter β_o very accurately, since the black bullets and red triangles mostly overlap for each bandwidth. The dashed horizontal line shows the average NMSE obtained by TABASCO when using the estimated optimal bandwidth k̂_o. One can notice that the optimal bandwidth selection using (31) is also accurate. For example, in the case of ϱ = 0.4, the optimal bandwidth is k = 3, and the TABASCO estimator attains an average NMSE that is very close to the theoretical minimum NMSE.
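Model 1's covariance can be generated as follows, assuming the AR(1) parameterization (Σ)_ij = η ϱ^{|i−j|} implied by the text:

```python
import numpy as np

def ar1_covariance(p, rho, eta=1.0):
    """Model 1: AR(1) covariance with (Sigma)_ij = eta * rho**|i - j|."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return eta * rho**d
```

The two limiting behaviors described above (scaled identity as ϱ ↓ 0, near-singular as ϱ ↑ 1) are easy to verify numerically.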
Figure 3 compares the performance of TABASCO with the state of the art in various setups. The upper panel displays the NMSE curves as a function of the sample size n for four choices of the correlation parameter when the data follows an MVN distribution. The lower panel displays the same results when the data follows an MVT distribution with ν = 5, which is heavy-tailed with marginal kurtosis kurt(x_i) = 6 and elliptical kurtosis κ = kurt(x_i)/3 = 2. In the Gaussian case, all banding-type estimators outperform LWE thanks to the exploitation of the diagonally dominant structure of the covariance matrix. In the heavy-tailed case, this is no longer true for STOA and ST-gaus, while ST-nong and TABASCO remain robust. In all scenarios, TABASCO offers the lowest NMSE, and it especially improves the performance when n ≪ p. Figure 4 displays the (average) estimated shrinkage parameter β̂_o of TABASCO and LWE as a function of n. The average shrinkage parameter of TABASCO is generally much larger than that of LWE. This means that it assigns overall more weight to the banded SCM W • S than LWE, which uses W(p) = 11^⊤. This behavior is expected, since banding the SCM should naturally improve the MSE when the true covariance matrix has a diagonally dominant structure.
Figure 5 presents a comparison similar to Figure 3 in which the variables are permuted at random for each Monte Carlo trial, thus destroying the diagonally dominant structure of the AR(1) covariance matrix². The hypothesis is that any banding estimator with optimal bandwidth selection should be able to select the bandwidth k = p accordingly. Note that LWE is invariant to variable permutations, and hence its results stay the same in both scenarios. In this setup, TABASCO performs better than LWE for n ≪ p and equally well as LWE for n large enough. This result implies that the bandwidth selection of TABASCO is consistent: it chooses k = p, since the true covariance matrix does not have a diagonally dominant structure. The improvement brought at low sample support can be explained by the fact that TABASCO assumes an ES distribution, which allows for a better estimation of the oracle parameter (LWE only assumes finite 4th-order moments). This example confirms that TABASCO always benefits from banding and bandwidth selection: it offers significantly improved NMSE compared to the RSCM when a banded structure is present in the covariance matrix, while it does not perform worse when such structure does not exist, thanks to its robust and efficient bandwidth selection.

B. Model 2
In Model 2 [7], Σ is defined in terms of a decay parameter α and a correlation parameter ρ.
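A sketch of a Model 2 covariance generator, assuming the polynomial-decay form used in [7] (unit diagonal, off-diagonal entries ρ |i − j|^{−(α+1)}):

```python
import numpy as np

def model2_covariance(p, alpha, rho=0.6):
    """Model 2 (assumed Cai-Zhang-Zhou-type form): unit diagonal and
    off-diagonal entries rho * |i - j|**(-(alpha + 1))."""
    d = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    Sigma = np.zeros((p, p))
    off = d > 0
    Sigma[off] = rho * d[off] ** (-(alpha + 1.0))
    np.fill_diagonal(Sigma, 1.0)
    return Sigma
```

Smaller α yields a slower decay of the off-diagonal correlations, i.e., a less banded matrix.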
As in the study of [7], we set ρ = 0.6, and Figure 1 illustrates the effect of the decay parameter α in the case p = 100. Figure 6 presents a comparison similar to Figure 3, where we also included the minimax risk tapering (MnMx-Taper) estimator W(k*) • S, where k* = n^{1/(2(α+1))} is the optimal (oracle) bandwidth [7, Section 6]. The dimension is p = 250. It should be noted that MnMx-Taper has an advantage over the other estimators, since it uses the true decay parameter α, which is unknown in practice. TABASCO also uses tapering matrices W(k) as in (4), but ST-gaus and ST-nong are restricted to tapering matrices whose off-diagonal elements are 0s or 1s; hence, these are still computed with banding matrices W(k) as in (3). In either case, the optimal bandwidth k̂_o is chosen by considering the set of tapering matrices W = {W(k) : k ∈ [[1, 30]] ∪ [[p − 30, p]]}. As can be noted, TABASCO again outperforms the other estimators for all values of n and α and for both sampling distributions. In the MVN case (top panel), TABASCO outperforms MnMx-Taper with a clear margin when n is very small. This can be attributed to its ability to optimally shrink the tapered SCM towards a scaled identity matrix when n/p < 1. However, for n ≥ p, TABASCO and the MnMx-Taper estimator have similar performance, especially when α = 0.3.

² Prominent algorithms for recovering a hidden ordering structure in the variables are the Best Permutation Analysis (BPA) [27] or Isoband [28]. The perspective of their joint use with TABASCO is left for further studies.
In the MVT case (lower panel of Figure 6), the performance differences are clearer: TABASCO outperforms MnMx-Taper by a large margin. The ST-gaus estimator completely fails due to the impulsive nature of the underlying sampling distribution. The results also illustrate that the performance of the tapered SCM estimator depends on the underlying sampling distribution more heavily than that of TABASCO. This is illustrated further in Figure 7, where we compare the true theoretical NMSE curves of the tapered SCM W • S and the TABASCO estimator Σ̂_{β_o} as a function of the bandwidth k in the case where n = 100, when sampling from an MVN distribution (left panel) and an MVT distribution with ν = 5 d.o.f. (right panel), following Model 2 with α = 0.1. Figure 7 makes two important points. First, the performance differences between the tapered SCM and TABASCO are larger when the distribution is heavier-tailed; this was evident already in Figure 6. Second, TABASCO with optimal bandwidth selection is able to estimate the optimal bandwidth rather accurately, since the average (empirical) NMSE value seen in Figure 6 at n = 100 is close to the minimum true (theoretical) NMSE value.

VIII. APPLICATION TO SPACE-TIME ADAPTIVE PROCESSING
Space-time adaptive processing (STAP) is a technique used in airborne phased array radar to detect moving targets embedded in an interference background such as jamming or strong clutter [29]. The radar receiver consists of an array of Q antenna elements processing P pulses in a coherent processing interval. Within the tested sample x_0 ∈ C^p, with p = P · Q, the received signal is composed of i) possible unknown target responses; ii) unknown interference (ground clutter) plus thermal noise. A detection problem for a given steering vector p is classically formalized as a binary hypothesis test: under H_0, x_0 contains only the interference plus noise, while under H_1, x_0 additionally contains a scaled observation of p, i.e.: where x_i ∈ C^p, i = 1, . . ., n, is a secondary data set assumed to contain i.i.d. and target-free realizations of the interference plus noise. Usually, this disturbance n_i is modeled as centered complex Gaussian (or elliptically) distributed with covariance matrix Σ. In this context, efficient adaptive detection statistics can be built from the expression of the adaptive coherence estimator (ACE) detector [30]: where Σ̂ is a plug-in estimate of Σ computed from {x_i}_{i=1}^n. More specifically, in STAP the steering vector p follows the model of [29], which is a function of the target angle of arrival (AoA) θ and velocity v. The statistic (43) can thus be computed for a dictionary of steering vectors covering a 2D grid on θ and v, yielding an adaptive detection map.
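The display equation for (43) was lost in this excerpt; the ACE statistic has the standard normalized matched-filter form, which can be sketched in Python as follows (the function name is ours):

```python
import numpy as np

def ace_statistic(Sigma_hat, p_steer, x0):
    """ACE statistic in the standard form:
    |p^H Sigma^{-1} x0|^2 / ((p^H Sigma^{-1} p) (x0^H Sigma^{-1} x0)),
    where Sigma_hat is a plug-in estimate of the disturbance covariance."""
    Si = np.linalg.inv(Sigma_hat)
    num = np.abs(p_steer.conj() @ Si @ x0) ** 2
    den = np.real(p_steer.conj() @ Si @ p_steer) * np.real(x0.conj() @ Si @ x0)
    return num / den
```

By the Cauchy–Schwarz inequality in the Σ̂^{-1}-weighted inner product, the statistic lies in [0, 1], and it is invariant to a rescaling of either Σ̂ or x_0, which is what makes ACE attractive with plug-in covariance estimates.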
Using the SCM as the estimate in (43) yields a generalized likelihood ratio test (GLRT) [31]; however, plug-in detectors can benefit from refined estimation processes in order to improve robustness or to deal with limited sample support. For example, shrinkage towards the identity (also referred to as diagonal loading or robust beamforming [32]) is a common procedure to improve several properties of the detector's output. In the context of interference cancellation, tapering templates have been considered as a spectrum notch-widening technique [11], or to deal with modulation effects [8]. This section presents an experimental validation of TABASCO to illustrate the interest of both approaches on real data. The STAP data is provided by the French agency DGA/MI: the clutter is real but the targets are synthetic. The tapering matrix is constructed as proposed in [11], i.e.,
Note that the index k is here a "null width" parameter in R_+ and not a bandwidth parameter in [[1, p]] as in (3) or (4). Figure 8 presents the detection maps of Λ(Σ̂) constructed with: i) the SCM; ii) the tapered SCM W(k) • S using bandwidth k = 0.05 (selected manually to obtain the best visual results); iii) TABASCO with the proposed adaptive selection of β for k = 0 (equivalent to the RSCM, yielding β = 0.9324); iv) TABASCO with the proposed adaptive selection of β and k, allowing k ∈ [10^{−3}, 10^{−1}] (TABASCO, yielding k = 0.0143 and β = 0.9929). First, we can notice that the SCM provides an unreliable detection map, which is due to insufficient sample support in this configuration. As observed in [11] on another dataset, covariance matrix tapering can widen the clutter notch (anti-diagonal of the detection map), which makes it possible to clearly distinguish several targets. However, this improvement comes at the cost of canceling the response of slower targets (which are close to the canceled clutter ridge). The shrinkage towards the identity of the RSCM also greatly improves the detection process, as it allows us to detect the 10 targets, but it still presents some false alarms on the clutter ridge. Finally, TABASCO appears as an interesting trade-off combining the two effects, and illustrates that the proposed NMSE-driven method still allows for a reasonable selection of the regularization parameters (both β and k) in this detection application.
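The construction of a detection map can be sketched as follows. The steering vector below is a deliberately simplified Kronecker model (spatial times temporal complex exponentials) standing in for the model of [29]; the normalization, grid, and function names are ours:

```python
import numpy as np

def steering(theta, v, Q=4, P=4):
    """Simplified space-time steering vector p(theta, v) = b(v) ⊗ a(theta);
    a stand-in for the steering model of [29]."""
    a = np.exp(1j * np.pi * np.arange(Q) * np.sin(theta))  # spatial phase ramp
    b = np.exp(1j * np.pi * np.arange(P) * v)              # normalized Doppler ramp
    return np.kron(b, a)

def detection_map(Sigma_hat, x0, thetas, vs):
    """Evaluate the ACE statistic over a 2D grid of (theta, v)."""
    Si = np.linalg.inv(Sigma_hat)
    Six0 = Si @ x0
    x0Six0 = np.real(x0.conj() @ Six0)
    out = np.empty((len(thetas), len(vs)))
    for i, th in enumerate(thetas):
        for j, v in enumerate(vs):
            s = steering(th, v)
            out[i, j] = np.abs(s.conj() @ Six0) ** 2 / (
                np.real(s.conj() @ Si @ s) * x0Six0)
    return out
```

Swapping Σ̂ between the SCM, the tapered SCM, the RSCM, and TABASCO in this loop reproduces the four maps compared in Figure 8.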

IX. CONCLUSIONS AND PERSPECTIVES
We proposed TABASCO: a new covariance matrix estimator that jointly benefits from shrinkage towards a scaled identity matrix and tapering of the SCM. By assuming the samples to be generated from an unspecified ES distribution, we also derived an efficient and robust estimation method for the oracle regularization parameters that minimize the MSE. Simulation studies illustrated that TABASCO outperforms existing regularized and tapered estimators in numerous setups. Interestingly, if W = 11^⊤ belongs to the set of tapering matrices W considered, the estimator can avoid applying tapering when this option does not provide a reduction in MSE. Thus TABASCO performs similarly to the regularized SCM proposed in [4] in this case, while significantly outperforming it when the tapering templates are valid. We also proposed two novel estimators that measure the sphericity of the tapered covariance matrix.

APPENDIX
where η = tr(Σ)/p. Note that L(β) is a convex quadratic function of β with a unique minimum at β_o. Substituting the expressions for the constants a_1, a_2 and a_3 into β_o yields the stated expressions in (14) and (15). In this regard, it is useful to notice that (16) can be deduced from (15) by using (12) and then simplifying the expression.
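Since the display equations of this step were lost in this excerpt, the minimization can be sketched schematically (the exact definitions of a_1, a_2, a_3 are those given in the omitted displays):

```latex
% Schematic form: L(beta) is a convex quadratic, so it has a unique
% minimizer obtained by setting L'(beta) = 0.
L(\beta) = a_1\,\beta^2 - 2\,a_2\,\beta + a_3, \qquad a_1 > 0,
\\
L'(\beta_o) = 2\,a_1\,\beta_o - 2\,a_2 = 0
\;\;\Longrightarrow\;\;
\beta_o = \frac{a_2}{a_1},
\\
L(\beta_o) = a_3 - \frac{a_2^2}{a_1}.
```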
The expression for the MSE of Σ̂_{β_o} follows by substituting β_o into the expression for L(β) in (45) and using the stated relation; this gives the MSE expression after simplification.

B. Proof of Lemma 1
Before proceeding with the proof we introduce some definitions and results that are used in the sequel. First, we let K_p denote the p² × p² commutation matrix, defined as a block matrix whose ij-th block is equal to the p × p matrix that has a 1 at element ji and zeros elsewhere, i.e., K_p = Σ_{i,j} e_i e_j^⊤ ⊗ e_j e_i^⊤. It has the following important properties [33]: K_p vec(A) = vec(A^⊤) and K_p(A ⊗ B)K_p = B ⊗ A for any p × p matrices A and B, where vec(A) vectorizes the matrix A by stacking its columns on top of each other. We then have the following identities.

Lemma 3. The following holds: where τ_1 and τ_2 are constants defined in (25).

Inserting (50) into (47) and invoking the identities in Lemma 3 then proves the first identity.
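The two stated properties of the commutation matrix can be verified numerically; the following sketch builds K_p directly from its definition:

```python
import numpy as np

def commutation_matrix(p):
    """Commutation matrix K_p = sum_{i,j} (e_i e_j^T) ⊗ (e_j e_i^T),
    satisfying K_p vec(A) = vec(A^T) for any p x p matrix A
    (vec stacks columns, i.e. column-major order)."""
    K = np.zeros((p * p, p * p))
    for r in range(p):
        for c in range(p):
            # entry A[c, r] of vec(A) goes to position (r, c) of vec(A^T)
            K[r + c * p, c + r * p] = 1.0
    return K
```

Both identities, K_p vec(A) = vec(A^⊤) and K_p(A ⊗ B)K_p = B ⊗ A, follow from this permutation structure.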
Next, evaluating E[tr((D_S W)²)] in the same manner proves the latter claim.

C. Proof of Theorem 4
Let us express the SSCM as above. Then, since the v_i are i.i.d. and E[Λ̂] = E[v_i v_i^⊤] for all i, the expectation of the second term follows accordingly. Using (53) and (54) we then obtain the stated bound. Next, note that Λ_sgn = E[Λ̂] = Λ + o(‖Λ‖_F) when (A) holds, by [24, Theorem 2]. This fact together with (55) and (56) implies the claim as p → ∞ under assumption (A), which completes the proof.
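The spatial sign covariance matrix (SSCM) used in this proof can be sketched as follows; this version centers with the sample mean, whereas the paper's definition may use a different location estimate:

```python
import numpy as np

def sscm(X):
    """Spatial sign covariance matrix: the average of v_i v_i^H over the
    centered, unit-norm samples v_i = (x_i - mean) / ||x_i - mean||.
    X has shape (n, p); the result is p x p with unit trace."""
    Xc = X - X.mean(axis=0, keepdims=True)
    V = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    return (V.T @ V.conj()) / X.shape[0]
```

Since each v_i has unit norm, the SSCM is Hermitian positive semi-definite with trace exactly 1, which is why it is a natural building block for sphericity estimation.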

D. Proof of Lemma 2: complex case
In our proof we will use the following identities.
We then recall the (variance-)covariance matrix of Ŝ when sampling from a complex elliptically symmetric distribution, where τ_1 and τ_2 are constants defined in (25). Inserting (60) into (57) and invoking the identities in Lemma 4 then proves the first identity. The proof of the latter part, E[tr((D_S W)²)], is as in the real-valued case in Subsection B.

Since the v_i are i.i.d., E‖W • v_i v_i^⊤‖²_F = E‖W • vv^⊤‖²_F, where v =_d v_i and "=_d" reads "has the same distribution as". The expectation of the first term is E[d^⊤(W • W)d], where d = (v_1², . . ., v_p²)^⊤ contains the diagonal elements of vv^⊤. Furthermore, write D = diag(vv^⊤). Then D_Λ̂ = diag(Λ̂) can be written as D_Λ̂ = (1/n)(D_1 + · · · + D_n), where D_i = diag(v_i v_i^⊤). Furthermore, let d_i = (v_{i1}², . . ., v_{ip}²)^⊤ denote the random vector containing the diagonal elements of v_i v_i^⊤. Then we get

tr((D_Λ̂ W)²) = (1/n²) [ Σ_i tr((D_i W)²) + Σ_{i≠j} tr((D_i W)(D_j W)) ] = (1/n²) [ Σ_i d_i^⊤(W • W)d_i + Σ_{i≠j} tr((D_i W)(D_j W)) ],

and for i ≠ j, E[tr((D_i W)(D_j W))] = E[d]^⊤(W • W)E[d].

Lemma 4. The following holds:
a) ‖A • B‖²_F = tr( (vec(A)vec(A)^H) • (vec(B)vec(B)^H) ) for all A, B ∈ C^{m×n}.
b) d_B^⊤(A • A)d_B = tr( (vec(A)vec(A)^⊤) • (B* ⊗ B) ) for all A ∈ C^{m×m} and B ∈ C^{m×m}_Sym.
c) tr((D_B A)²) = d_B^⊤(A • A)d_B for all A ∈ R^{m×m}_Sym and B ∈ C^{m×m}.

Proof. The proofs of a) and b) are as the proofs of Lemma 3a) and b); c) follows directly from [21, Lemma 7.5.2].

Write w = vec(W). Using Lemma 4a) we first notice that E‖W • S‖²_F = tr( (ww^H) • E[vec(S)vec(S)^H] ).
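Identity a) of Lemma 4 is easy to check numerically, since both sides reduce to Σ_k |a_k|²|b_k|² over the vectorized entries:

```python
import numpy as np

# Numerical check of Lemma 4a):
#   ||A • B||_F^2 = tr( (vec(A)vec(A)^H) • (vec(B)vec(B)^H) )
# for complex rectangular A, B, where • is the Hadamard product.
rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
B = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

a = A.flatten(order='F')[:, None]  # vec(A) as a column (column-major)
b = B.flatten(order='F')[:, None]  # vec(B) as a column
lhs = np.linalg.norm(A * B, 'fro') ** 2
rhs = np.trace((a @ a.conj().T) * (b @ b.conj().T)).real
```

Both sides equal the sum of |A_ij|²|B_ij|² over all entries, which is the content of the identity.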