New Risk Bounds for 2D Total Variation Denoising

2D Total Variation Denoising (TVD) is a widely used technique for image denoising. It is also an important nonparametric regression method for estimating functions with heterogeneous smoothness. Recent results have shown the TVD estimator to be nearly minimax rate optimal for the class of functions with bounded variation. In this paper, we complement these worst case guarantees by investigating the adaptivity of the TVD estimator to functions which are piecewise constant on axis aligned rectangles. We rigorously show that, when the truth is piecewise constant with few pieces, the ideally tuned TVD estimator performs better than in the worst case. We also study the issue of choosing the tuning parameter. In particular, we propose a fully data driven version of the TVD estimator which enjoys similar worst case risk guarantees as the ideally tuned TVD estimator.

1. Introduction. Total variation denoising (TVD) is a standard technique for noise removal in images. This technique was first proposed in Rudin et al. (1992) and has since been heavily used in the image processing community. It is well known that TVD removes unwanted noise while also preserving edges in the image (see Strong and Chan (2003)). For a survey of this technique from an image analysis point of view, see Chambolle et al. (2010) and references therein.
The success of the TVD technique as a denoising mechanism motivates us to revisit this problem from a statistical perspective. In this paper, we are interested in the following statistical estimation problem. Consider observing y = θ* + σZ, where y ∈ R^{n×n} is a noisy matrix/image, θ* is the true underlying matrix/image, Z is a noise matrix consisting of independent standard Gaussian entries, and σ is the unknown standard deviation of the noise entries. Thus, in this setting, the image denoising problem is cast as a Gaussian mean estimation problem. Before defining the TVD estimator in this context, let us define the total variation of an arbitrary matrix.
Let us denote the n × n two dimensional grid graph by L_n and its edge set by E_n. Then, thinking of θ ∈ R^{n×n} as a function on L_n, we define

(1.1)   TV(θ) := (1/n) Σ_{(u,v) ∈ E_n} |θ_u − θ_v| = (1/n) ‖Dθ‖_1,

where D is the usual edge–vertex incidence matrix of the grid, of size 2n(n − 1) × n². The 1/n factor is just a normalizing factor: if θ_{ij} = f(i/n, j/n) for some differentiable function f on the unit square, then TV(θ) is precisely the discretized Riemann approximation of ∫_{[0,1]²} (|∂f(x, y)/∂x| + |∂f(x, y)/∂y|) dx dy. This notion of total variation extends the definition of variation from differentiable functions on the unit square to arbitrary matrices. We can now define the TVD estimator, which is our main object of study:

θ̂_V := argmin_{θ ∈ R^{n×n}: TV(θ) ≤ V} ‖y − θ‖²,

where ‖·‖ throughout this paper denotes the usual Frobenius norm for matrices. The TVD estimator is actually a family of estimators indexed by the tuning parameter V > 0. As per tradition, we measure the performance of an estimator θ̂ in terms of its normalized mean squared error (MSE), defined as

MSE(θ̂, θ*) := (1/N) E ‖θ̂ − θ*‖²,

where throughout Sections 1 and 2 we denote N = n², in keeping with the tradition of denoting the sample size by N and to help in the interpretation of our risk bounds.
We defined the TVD estimator in its constrained form; the penalized version, which is also popular in the literature, is defined as

θ̂_λ := argmin_{θ ∈ R^{n×n}} { ‖y − θ‖² + λ TV(θ) },

where λ > 0 is a tuning parameter. In this paper, we focus on the analysis of the constrained version.
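To fix ideas, here is a minimal sketch of both forms of the estimator, written with the cvxpy convex optimization library. This is only an illustration under our own conventions (function names, solver defaults, and the explicit anisotropic TV expression are ours), not the implementation used later in the simulations.

```python
# A minimal sketch of the 2D TVD estimators (constrained and penalized).
# TV below is the anisotropic discrete total variation (1/n) * ||D theta||_1 from (1.1).
import numpy as np
import cvxpy as cp

def tv_normalized(theta):
    """(1/n) times the sum of absolute differences across vertical and horizontal edges."""
    n = theta.shape[0]
    return (cp.sum(cp.abs(theta[1:, :] - theta[:-1, :]))
            + cp.sum(cp.abs(theta[:, 1:] - theta[:, :-1]))) / n

def tvd_constrained(y, V):
    """Constrained TVD: minimize ||y - theta||_F^2 subject to TV(theta) <= V."""
    theta = cp.Variable(y.shape)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - theta)),
                      [tv_normalized(theta) <= V])
    prob.solve()
    return theta.value

def tvd_penalized(y, lam):
    """Penalized TVD: minimize ||y - theta||_F^2 + lam * TV(theta)."""
    theta = cp.Variable(y.shape)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - theta) + lam * tv_normalized(theta)))
    prob.solve()
    return theta.value
```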
1.1. Background and Motivation. The 1D version of this problem is well studied in nonparametric regression (Tibshirani et al. (2005)). In this setting, we again observe y = θ* + σz as before, where y, θ*, z are now vectors instead of matrices. The total variation of a vector v ∈ R^n is defined as

TV(v) := Σ_{i=1}^{n−1} |v_{i+1} − v_i|.

Again, the above definition can be seen as a discrete Riemann approximation to ∫_{[0,1]} |f′(x)| dx when v_i = f(i/n) for some differentiable function f. The constrained and penalized versions of the TVD estimator can now be defined analogously. The penalized form seems to be more popular in the existing literature; in this case the TVD estimator is often referred to as the fused lasso (see Tibshirani et al. (2005), Rinaldo et al. (2009)). In this 1D setting, it is known (Donoho and Johnstone (1998), Mammen and van de Geer (1997)) that the TVD estimator is minimax rate optimal over the class of all bounded variation signals {θ : TV(θ) ≤ V} for V > 0. It is also shown in Donoho and Johnstone (1998) that no estimator which is a linear function of y can attain this minimax rate.
It is also worthwhile to mention here that TV denoising in the 1D setting has been studied as part of a general family of estimators which penalize discrete derivatives of different orders. These estimators have been studied in Steidl et al. (2006), Tibshirani (2014) and by Kim et al. (2009) who coined the name trend filtering. A continuous version of these estimators, where discrete derivatives are replaced by continuous derivatives, was proposed much earlier in the statistics literature by Mammen and van de Geer (1997) under the name locally adaptive regression splines.
Total variation of a signal can actually be defined over an arbitrary graph as the sum of absolute values of the differences of the signal across the edges of the graph. Trend filtering on graphs has been a popular research topic in the recent past; see Wang et al. (2016), Lin et al. (2016). A very recent paper, Ortelli and van de Geer (2018), studies TV denoising on tree graphs. The 1D setting corresponds to the chain graph on n vertices, whereas the 2D setting corresponds to the 2D lattice graph on n² vertices.
The 2D TV denoising problem, while much less studied than its 1D counterpart, has enjoyed a recent surge of interest. Worst case performance of the TVD estimator has been studied in Hütter and Rigollet (2016), Sadhanala et al. (2016). These results show that, as in the 1D setting, the 2D TVD estimator is nearly minimax rate optimal over the class {θ ∈ R^{n×n} : TV(θ) ≤ V} of bounded variation signals. In fact, Sadhanala et al. (2016) also generalize the result of Donoho and Johnstone (1998) and prove that no linear function of y can attain the minimax rate in the 2D setting either. State of the art risk bounds for the TVD estimator in the 2D setting are due to Hütter and Rigollet (2016). They studied the penalized form of the TVD estimator and proved that there exist universal constants C, c > 0 such that, by setting λ = cσ log n, one gets the bound recorded in Theorem 1.1 (Hütter and Rigollet).
We will use the usual O notation to compare sequences. We write a_n = O(b_n) to mean that there exists a constant c > 0 such that a_n ≤ c b_n for all sufficiently large n. Additionally, we use the Õ notation when ignoring polylogarithmic factors.
In words, the bound in Theorem 1.1 is a minimum of two terms. The first term gives the ℓ1 rate, scaling like O(1/√N) for bounded variation functions. The second one is the ℓ0 rate, which can be much faster than the O(1/√N) rate if ‖Dθ*‖_0 is small enough. In spite of the above works, there are still a couple of unexplored aspects of 2D TVD that are the focus of this paper. We discuss them now.
1.1.1. Adaptivity to Piecewise Constant Signals. Observe that the total variation seminorm is a convex relaxation of the number of times the true signal θ* changes value across neighbouring vertices. This fact suggests that the TVD estimator might perform very well if the true signal is indeed piecewise constant. This phenomenon is now fairly well understood in the 1D setting. Suppose that the true vector θ* is piecewise constant with k + 1 contiguous blocks. Given data y ∼ N_n(θ*, σ² I_n), an ideal (oracle) estimator, which knows the locations of the jumps, would simply estimate θ* by the mean of the data vector y within each block. It can be easily checked that the oracle estimator has MSE bounded by σ²(k + 1)/n. Recent works (see Dalalyan et al. (2017), Lin et al. (2016)) studied the penalized TVD estimator and showed that if the minimum length of the blocks where θ* is constant is not too small (scales like O(n/k)) and if the tuning parameter λ is set equal to an appropriate function of the unknown σ and n, then an oracle risk O(k/n) can be achieved up to additional logarithmic factors in k and n. In Guntuboyina et al. (2017), this adaptive behaviour was established for the ideally tuned constrained form of the estimator with slightly better log factors. Thus, we can say that in the 1D setting, the TVD estimator is optimally adaptive to piecewise constant signals.
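To make the oracle benchmark concrete, the following is a small sketch of the block-wise averaging estimator (the helper names and the toy signal are ours). Each block mean has variance σ²/(block length), so the normalized MSE is σ²(k + 1)/n in expectation.

```python
# Sketch of the 1D oracle estimator: average y within each known constant block.
import numpy as np

def oracle_blockwise_mean(y, breakpoints):
    """breakpoints: sorted indices where a new constant block begins (0 included)."""
    est = np.empty_like(y, dtype=float)
    bounds = list(breakpoints) + [len(y)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        est[start:end] = y[start:end].mean()
    return est

# toy check with k + 1 = 3 blocks
rng = np.random.default_rng(0)
n, sigma = 300, 1.0
theta_star = np.repeat([0.0, 2.0, -1.0], n // 3)
y = theta_star + sigma * rng.standard_normal(n)
est = oracle_blockwise_mean(y, breakpoints=[0, n // 3, 2 * n // 3])
print(np.mean((est - theta_star) ** 2))  # roughly sigma^2 * 3 / n
```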
This motivates us to wonder whether similar adaptivity holds in the 2D setting. In this paper, we investigate adaptivity to signals/matrices which are piecewise constant on k ≪ N axis aligned rectangles. Such adaptivity of the 2D TVD estimator has not been explored at all in the literature. Estimation of functions which are piecewise constant on axis aligned rectangles is naturally motivated by methodologies such as CART (Breiman (2017)) which produce outputs of the same form. Recently, adaptation to piecewise constant structure on rectangles has also been of interest in the nonparametric shape constrained function estimation literature (see Theorem 2.3 in Chatterjee et al. (2018) and Theorems 2 and 5 in Han et al. (2017)). Here is the main question that we address in this paper. Q1: If the underlying θ* is piecewise constant on at most k ≪ N axis aligned rectangles, can the ideally tuned TVD estimator attain a faster rate of convergence than the Õ(1/√N) rate?
Basically, we are asking whether the ideally tuned TVD estimator adapts to truths which are piecewise constant on a few axis aligned rectangles; this is a different notion of sparsity from the constraint that ‖Dθ*‖_0 be small. As a simple instance of θ* being piecewise constant on rectangles, consider θ* of the form θ* = [0_{n×n/2}  1_{n×n/2}]. In this case, we have ‖Dθ*‖_0 = O(√N) and TV(θ*) = O(1). Note that the ℓ0 bound of Hütter and Rigollet (2016) then gives an upper bound on the MSE scaling like Õ(1/√N), which is already given by the ℓ1 bound. Thus, the result of Hütter and Rigollet (2016) does not help in answering our question and suggests there is no adaptation. In Theorem 2.3 of this paper, we show that the ideally tuned TVD estimator indeed adapts to matrices which are piecewise constant on axis aligned rectangles and provably attains a rate of convergence scaling like Õ(1/N^{3/4}), which is strictly faster than the ℓ1 rate Õ(1/√N). However, we also show that this Õ(1/N^{3/4}) rate is tight, and thus the TVD estimator is not able to attain the Õ(1/N) parametric rate that would be achieved by an oracle estimator. This is the main contribution of this paper and, as far as we are aware, the first result of its type in the literature.

Sadhanala et al. (2016) show that the Õ(V/√N) rate attained by the penalized TVD estimator is nearly minimax rate optimal. Thus we can say that the penalized TVD estimator is nearly minimax rate optimal over the parameter space {θ ∈ R^{n×n} : TV(θ) ≤ V}, simultaneously over V and N. However, this penalized TVD estimator needs a tuning parameter λ which depends on the unknown σ and can be potentially difficult to set in practice. This naturally raises a question which, as far as we are aware, is unresolved in the literature. Q2: Does there exist a completely data driven estimator which does not depend on any unknown parameters of the problem and yet achieves MSE scaling like Õ(V/√N), thus being simultaneously minimax rate optimal over V and n?
In Theorem 2.2 of this paper we answer this question in the affirmative by constructing such a fully data driven estimator.
The rest of the paper is organised as follows. In Section 2, we state our main theorems. Then in Section 3, we discuss connections of our results with some recent works and also demonstrate simulation results which support and verify our main theorems. The next three sections describe and outline the proofs of our main theorems and state some intermediate results precisely. Many proofs including several key calculations and ancillary results are deferred until Section 7.
2. Main Results.

2.1. Tuned TVD. Our first result states a risk bound for θ̂_V under the bounded variation constraint. Let ⟨·, ·⟩ denote the usual inner product between two matrices after vectorization. Also, for any matrix θ ∈ R^{n×n}, we denote its mean by θ̄ := (1/n²) Σ_{i=1}^n Σ_{j=1}^n θ_{ij}. In all the theorems in this section, V* generically stands for TV(θ*), where θ* is the underlying true matrix.
Theorem 2.1. Let θ* be an arbitrary n × n matrix and N = n². Suppose the tuning parameter is chosen such that V ≥ V*. Then the following risk bound is true for a universal constant C > 0 and all n ≥ 2 and σ > 0: Remark 2.1. The above result is similar to the ℓ1 bound of Hütter and Rigollet (2016), the difference being that the above risk bound holds for the constrained TVD estimator while the existing result of Hütter and Rigollet (2016) holds for the penalized estimator. For any sequence of V > 0 (possibly growing with n), the minimax lower bounds (mentioned earlier) of Sadhanala et al. (2016) now imply the minimax rate optimality (up to log factors) of the constrained TVD estimator θ̂_V over the parameter space {θ ∈ R^{n×n} : TV(θ) ≤ V}.
Remark 2.2. As is made clear in Section 4, our technique for proving Theorem 2.1 is completely different from the technique used to prove the result of Hütter and Rigollet (2016). While they analyze the properties of the pseudo-inverse of the edge incidence matrix D, our proof takes the Empirical Process approach and goes via computing Gaussian widths and metric entropies. Moreover, ingredients of this proof are used in the proof of our next result about a TVD estimator without any tuning.

2.2. No Tuning TVD. We now state our next result, which relates to the question we posed about removing the tuning parameter while retaining a risk bound essentially the same as in Theorem 2.1. Choosing the tuning parameter is an important issue in applying the TVD methodology for denoising. The usual way out is to do some form of cross validation. There are some proposals available in the literature; see Solo (1999), Osadebey et al. (2014), Langer (2017). However, to the best of our knowledge, there is no tuning parameter free method which provably achieves the optimal worst case Õ(V/√N) rate of convergence.
Our goal here is to construct a tuning parameter free estimator of θ* which adapts to the true value of TV(θ*). The inspiration for this task comes from Chatterjee (2015), where the author gives a general recipe for constructing tuning parameter free estimators in Gaussian mean estimation problems when the truth is known to have a small value of some known norm. Even though the total variation functional is not a norm but a seminorm, the general idea in Chatterjee (2015) can be extended, as we will show. Also, the estimator of Chatterjee (2015) is a randomized estimator, whereas in our case we construct a non randomized version. The following is a description of a completely data driven estimator which attains the desired risk bound.
The intuition behind the estimator defined above is as follows. The estimation of θ* is done by estimating the two orthogonal parts θ̄*1 and θ* − θ̄*1 separately. The first part is easy and is estimated by ȳ1. To estimate θ* − θ̄*1, we use a Dantzig selector type (see Candes et al. (2007)) version of the TVD estimator, which computes a zero mean matrix with the least total variation subject to being within a Euclidean ball of a suitable radius around the centered data matrix y − ȳ1. A good choice of this radius depends on the true σ, and hence, as an intermediate step, we estimate σ; this estimate is denoted by σ̂.
The main idea behind our construction of σ̂ is the fact that TV(θ*) is small compared to TV(Z), and hence TV(y) = TV(θ* + σZ) approximately equals σ TV(Z). We can then use concentration properties of the statistic TV(Z) to show that TV(Z)/E[TV(Z)] is approximately equal to 1.
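The sketch below is one plausible reading of this construction (not a verbatim transcription of the estimator's definition): the normalization of σ̂ as TV(y) divided by a Monte Carlo estimate of E TV(Z), and the radius (n² − 1)σ̂² for the Euclidean ball, are assumptions based on the description above and on the proof sketch in Section 5.

```python
# Sketch of the tuning-free TVD estimator: mean part plus a Dantzig-selector-type
# projection of the centered data. Radius and sigma-hat conventions are assumptions.
import numpy as np
import cvxpy as cp

def tv_unnorm(theta):
    if isinstance(theta, np.ndarray):
        return np.abs(np.diff(theta, axis=0)).sum() + np.abs(np.diff(theta, axis=1)).sum()
    return cp.sum(cp.abs(theta[1:, :] - theta[:-1, :])) + cp.sum(cp.abs(theta[:, 1:] - theta[:, :-1]))

def tvd_no_tuning(y, n_mc=20, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = y.shape[0]
    # sigma-hat: TV(y) / E TV(Z), with E TV(Z) estimated by Monte Carlo
    e_tv_z = np.mean([tv_unnorm(rng.standard_normal((n, n))) for _ in range(n_mc)])
    sigma_hat = tv_unnorm(y) / e_tv_z
    w = y - y.mean()                       # centered data matrix
    theta = cp.Variable((n, n))
    prob = cp.Problem(cp.Minimize(tv_unnorm(theta)),
                      [cp.sum_squares(theta - w) <= (n ** 2 - 1) * sigma_hat ** 2,
                       cp.sum(theta) == 0])
    prob.solve()
    return y.mean() + theta.value          # add back the estimated mean part
```

The following theorem supplies a risk bound for θ̂_notuning.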
Theorem 2.2. We have the following risk bound for our tuning free estimator: Remark 2.3. Note that the above bound is meaningful only when lim_{N→∞} V*/√N = 0. In this regime, (V*)²/N is a lower order term. Thus, Theorem 2.2 basically says that the MSE of θ̂_notuning, up to multiplicative log factors and an additive σ/√N term, scales like V*/√N. In light of Remark 2.1, we can say that θ̂_notuning is minimax rate optimal over {θ ∈ R^{n×n} : TV(θ) ≤ V}, simultaneously for any sequence of V (depending on n) which is bounded below by a constant and above by √N. To the best of our knowledge, this is the first result demonstrating such an estimator which is completely tuning free.
2.3. Adaptive Risk Bound. We now come to the main result of this paper, which concerns adaptive risk bounds for θ* which are piecewise constant on at most k axis aligned rectangles, where k is a positive integer much smaller than n. We call a subset R ⊆ L_n an (axis aligned) rectangle if it is a product of two discrete intervals. For a generic rectangle R = ([a, b] ∩ N) × ([c, d] ∩ N), we define n_row(R) and n_col(R) to be the cardinalities of [c, d] ∩ N and [a, b] ∩ N respectively. In words, n_row(R) and n_col(R) are simply the numbers of rows and columns of R if one views R as a two dimensional array of points. We then define its aspect ratio to be A(R) := max{n_row(R)/n_col(R), n_col(R)/n_row(R)}. For a given matrix θ ∈ R^{n×n}, we define k(θ) to be the cardinality of the minimal partition of L_n into rectangles R_1, …, R_{k(θ)} such that θ is constant on each of the rectangles. Next we state our main result for the 2D TVD estimator.
Theorem 2.3. Let θ* ∈ R^{n×n} be the underlying true matrix and write k = k(θ*). Let R_1, …, R_k be the rectangular level sets of θ*, which form a partition of the 2D grid L_n. In addition, assume that the rectangles R_i have bounded aspect ratio, that is, there exists a constant c > 0 such that max_{i∈[k]} A(R_i) ≤ c. Then we have the following risk bound for the ideally tuned TVD estimator: Here C is a constant that depends only on c.
Remark 2.4. One consequence of the above theorem is that when k(θ * ) = O(1) then the TVD estimator attains a O(N −3/4 ) rate, up to log factors. This rate is faster than the O(N −1/2 ) rate that is available in the literature. Our main focus here has been to attain the right exponent for N . The exponent of k and log n may not be optimal. Since the current proof of this theorem is fairly involved technically, obtaining the best possible exponents of k and log n is left for future research endeavors. See Section 3 for more discussions about the proof of the above theorem and comparisons with existing results.
Remark 2.5. The bounded aspect ratio condition is necessary for the O(N −3/4 ) rate to hold in the above theorem. This condition says that the rectangular level sets of θ * should not be too skinny or too long. The bounded aspect ratio is needed for similar reasons as a minimum length condition is needed for the length of the constant pieces in the 1D setting; see Guntuboyina et al. (2017), Dalalyan et al. (2017).
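As a concrete illustration of the bounded aspect ratio condition, the following tiny sketch (helper names are ours) computes A(R) for each rectangle of a given partition and checks the condition of Theorem 2.3.

```python
# Aspect ratio A(R) = max(n_row/n_col, n_col/n_row); Theorem 2.3 assumes
# max_i A(R_i) <= c for a fixed constant c as n grows.
def aspect_ratio(n_row, n_col):
    return max(n_row / n_col, n_col / n_row)

def satisfies_bounded_aspect_ratio(rectangles, c):
    """rectangles: list of (n_row, n_col) pairs describing a rectangular partition."""
    return max(aspect_ratio(r, s) for r, s in rectangles) <= c

# e.g. the partition of an n x n grid into four n/2 x n/2 squares
n = 8
print(satisfies_bounded_aspect_ratio([(n // 2, n // 2)] * 4, c=2))  # True
```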
A natural question is whether our upper bound in Theorem 2.3 is tight. Our next theorem says that, in the low σ limit, the N −3/4 rate is not improvable even if k(θ * ) = 2.
Theorem 2.4. Let θ*_{ij} = 1 if j > n/2 and θ*_{ij} = 0 otherwise; thus θ* = [0_{n×n/2}  1_{n×n/2}] and clearly k(θ*) = 2. In this case, we have the following lower bound on the risk of the ideally tuned constrained TVD estimator:

lim_{σ→0} MSE(θ̂_{V*}, θ*)/σ² ≥ c/N^{3/4}.

Here c > 0 is a universal constant.

3. Comparison with existing results and simulation studies. To place our theorems in context, it is worthwhile to compare and relate our results with a couple of recent papers.
3.1. Comparison with Hütter and Rigollet (2016). Let us compare our risk bound in Theorem 2.3 to the adaptive risk bound (Theorem 1.1) of Hütter and Rigollet (2016) when the truth θ* is piecewise constant on a few axis aligned rectangles. Both of these theorems prove statements about tuned TVD estimators. Considering the very simple case θ* = [0_{n×n/2}  1_{n×n/2}], we have already mentioned in Section 1 that the bound of Theorem 1.1 yields a rate no faster than Õ(1/√N) in this case. Thus, in the small k regime when k < N^{1/3}, Theorem 2.3 provides a much faster rate of convergence. This is one of the main contributions of this paper and, to the best of our knowledge, is the first of its kind in the literature.

3.2. Comparison with Guntuboyina et al. (2017). As mentioned in Section 1, one of our motivating factors behind investigating adaptivity of the 2D TVD estimator was its success in optimally estimating piecewise constant vectors in the 1D setting. Theorem 2.2 in Guntuboyina et al. (2017) gives an Õ(k/n) rate for the ideally tuned constrained 1D TVD estimator when the truth θ* is piecewise constant with k pieces and each piece satisfies a certain minimum length condition. In a sense, our Theorem 2.3 is a natural successor, giving the corresponding result in the 2D setting. Our bounded aspect ratio condition is the 2D version of the minimum length condition. A consequence of Theorem 2.3 and Theorem 2.4 is that, in contrast to the 1D setting, the ideally tuned constrained TVD estimator can no longer attain the oracle rate of convergence Õ(k/n) in the 2D setting.

The proof of Theorem 2.2 in Guntuboyina et al. (2017) was done by bounding the Gaussian widths of certain tangent cones. Our proof of Theorem 2.3 adopts the same strategy: it precisely characterizes the tangent cone T_{K(V*)}(θ*) (defined in Section 6.1) for piecewise constant θ* and then bounds its Gaussian width. The main idea in Guntuboyina et al. (2017) was to observe that any unit norm element of the tangent cone is nearly made up of two monotonic pieces within each constant piece of θ*. The available metric entropy bounds for monotone vectors were then used to bound the Gaussian width. A crucial ingredient in that proof is the well-known fact that any univariate function of bounded variation has a canonical representation as a difference of two monotonic functions. However, it is not at all clear how to adapt such a strategy to the 2D setting; in particular, it is not nearly as natural or convenient to express a matrix of bounded variation as a difference of two bi-monotone matrices. Our computation of the metric entropy of the tangent cone is therefore essentially two dimensional and involves judicious recursive partitioning in both dimensions. We believe that our metric entropy computations, especially the proof of Proposition 6.10, consist of new techniques and are potentially useful for problems of a similar flavor.
3.3. A Natural Question. In light of Theorem 2.3 and Theorem 2.4, we can say the following. When the truth θ* is piecewise constant on k axis aligned rectangles, the TVD estimator cannot attain the oracle rate of convergence scaling like Õ(k/N). The question that now arises is whether there exists any estimator which attains the Õ(k/N) rate of convergence for all such piecewise constant truths, and furthermore whether this estimator can be chosen to be computationally efficient. In a forthcoming manuscript we answer both of these questions in the affirmative.
3.4. Simulation studies. We consider three distinct sequences of matrices to facilitate comparison. First, we consider the simplest piecewise constant matrix θ_two ∈ R^{n×n} with (θ_two)_{ij} = I{j > n/2}; hence θ_two takes just two distinct values. The next matrix θ_four is a block matrix with four constant pieces:

θ_four = [ 1_{n/2×n/2}  2_{n/2×n/2} ; 0_{n/2×n/2}  1_{n/2×n/2} ].

Finally, we also consider the n × n matrix θ_worst with (θ_worst)_{ij} = I{i + j > n}. Clearly, θ_worst does not have a block constant structure. We expect θ_worst to achieve the worst case rate Õ(N^{−1/2}); hence the name.
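For concreteness, here are direct numpy constructions of the three matrices (the 1-based indexing of the definitions above is preserved; the assumption that n is even for θ_four is implicit).

```python
# The three test matrices used in the simulation study.
import numpy as np

def theta_two(n):
    j = np.arange(1, n + 1)                                   # 1-based column index
    return np.tile((j > n / 2).astype(float), (n, 1))         # two vertical blocks

def theta_four(n):
    top = np.hstack([np.ones((n // 2, n // 2)), 2 * np.ones((n // 2, n // 2))])
    bottom = np.hstack([np.zeros((n // 2, n // 2)), np.ones((n // 2, n // 2))])
    return np.vstack([top, bottom])                           # four constant blocks

def theta_worst(n):
    i, j = np.indices((n, n))
    return ((i + 1) + (j + 1) > n).astype(float)              # diagonal step, not block constant
```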
The dependence of the MSE on N = n² can be checked experimentally as follows. We estimate the MSE for a fixed n by Monte Carlo repetitions and then repeat this for a grid of n values. We then plot the log of the estimated MSE against log N and fit a least squares line to the plot. The slope of the least squares line then gives an indication of the correct exponent of N in the MSE. Figure 1 is such a plot for the ideally tuned constrained TVD estimator.
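The following is a sketch of this procedure; the estimator argument can be any denoiser y ↦ θ̂, for instance the constrained TVD solver sketched earlier in the introduction (that choice, and the helper names, are ours).

```python
# Monte Carlo MSE estimation and least squares fit of log(MSE) against log(N).
import numpy as np

def estimate_mse(estimator, theta_star, sigma, n_reps, rng):
    n = theta_star.shape[0]
    errs = []
    for _ in range(n_reps):
        y = theta_star + sigma * rng.standard_normal((n, n))
        errs.append(np.mean((estimator(y) - theta_star) ** 2))
    return np.mean(errs)

def fit_rate_exponent(ns, mses):
    """Least squares slope of log(MSE) against log(N), where N = n^2."""
    logN = np.log(np.array(ns, dtype=float) ** 2)
    slope, _ = np.polyfit(logN, np.log(mses), deg=1)
    return slope  # roughly -0.75 for theta_two / theta_four, roughly -0.5 for theta_worst
```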
In Figure 1, the risk is seen to be smallest for θ_two, followed by θ_four and then θ_worst. The slopes for θ_two and θ_four came out to be −0.73 and −0.68 respectively. This agrees well with Theorem 2.3 and Theorem 2.4, which say that the MSE decays at the rate N^{−0.75} up to log factors. For the matrix θ_worst, however, the slope turned out to be −0.52, which is in agreement with the worst case Õ(N^{−1/2}) rate given in Theorem 2.1.
To assess the risk of our fully data driven estimator θ̂_notuning, we again consider the three matrices θ_two, θ_four and θ_worst. Figure 2 is a plot of log MSE versus log N.
The simulations in Figure 2 strongly suggest that our estimator has MSE decaying at an O(1/√N) rate for all three matrices: the slopes of all three least squares lines are extremely close to −0.5, which matches the rate given in Theorem 2.2. Also note that the estimated risk of our estimator is consistently higher for θ_four than for both θ_two and θ_worst. We believe this is because TV(θ_four) = 2 is higher than TV(θ_worst) = 1 − 1/n and TV(θ_two) = 1. This indicates that our estimator performs better for matrices with smaller total variation, which is an adaptive property we desire. However, our estimator does not seem to be adaptive to piecewise constant structure in the way that the ideally tuned constrained TVD estimator is. This seems to be the price we have to pay in order to be completely tuning parameter free.

Figure 1: The MSE of the ideally tuned TVD estimator is estimated with 50 Monte Carlo repetitions for a grid of n = √N ranging from 500 to 700 in increments of 50. The true matrices were taken to be θ_two (blue), θ_four (red) and θ_worst (green). In each case, we have chosen the ideal tuning parameter to allow fair comparison. We plot the log of the estimated MSE versus log N, where the log is taken in base e. The circular points are the estimated log MSE values and the dashed lines are least squares lines fitted to the points. The least squares slope for θ_two is −0.73 and for θ_four is −0.68, which is considerably lower than the slope for the matrix θ_worst, which is −0.52.

4. Proof of Theorem 2.1. We sketch the main ideas of the proof here. The proof involves several crucial intermediate results, most of which are proved in Section 7.2. We first set up some notation which will henceforth be used throughout the paper.

4.1. Some useful notations. For a subset A ⊆ R^n, define its Gaussian width to be

GW(A) := E sup_{a ∈ A} ⟨a, Z⟩,

where Z ∈ R^n is a random vector with independent standard normal entries and ⟨·, ·⟩ denotes the standard inner product between two vectors. The r-covering number N(A, r) of A is the minimum number of Euclidean balls of radius r needed to cover A. For a positive integer n, we denote the set of positive integers {1, …, n} by [n]. Also, in all the proofs of our results, we will use TV_unnorm to denote the unnormalized version of (1.1); specifically, for an n × n matrix θ we set TV_unnorm(θ) := n TV(θ) = ‖Dθ‖_1. We do this because we believe it is easier to read and interpret the proofs with the unnormalized definition, while it is instructive to use the normalized version when stating our theorems, so that rates of convergence can be interpreted as functions of the sample size N = n². Also, we will generically use V to denote the unnormalized total variation, whereas previously we used bold V to denote the normalized total variation.

4.2. Proof of Theorem 2.1. We first use the standard approach of using the basic inequality to reduce our problem to controlling Gaussian suprema. The following lemma serves this purpose.
Lemma 4.1. Under the same conditions as in the statement of Theorem 2.1, we have where the last inequality follows because θ̄̂_V = ȳ, and 1 refers to the all ones matrix. Now taking expectations on both sides of the above display and noting that E⟨ȳ1 − θ̄*1, Z⟩ = (σ/n²) E⟨1, Z⟩² = σ finishes the proof. Let us define K^0_n(V) := {θ ∈ R^{n×n} : TV_unnorm(θ) ≤ V, θ̄ = 0}. We now need to evaluate the Gaussian width of the set K^0_n(2V). Since Gaussian widths are connected to metric entropies via Dudley's entropy integral inequality (Dudley (1967)), our goal henceforth is to compute the metric entropy of the set K^0_n(2V).
4.2.1. Metric Entropy of K^0_n(V). Let us define the set of matrices T_{m,n,V,L} := {θ ∈ R^{m×n} : TV_unnorm(θ) ≤ V, ‖θ‖_∞ ≤ L}. We claim that K^0_n(V) ⊆ T_{n,n,V,V}. To see this, suppose some entry of θ ∈ K^0_n(V) exceeds V. Since θ̄ = 0, there must exist a negative element in θ, which then implies TV_unnorm(θ) > V, producing a contradiction; the case of an entry below −V is symmetric. Thus we have K^0_n(V) ⊆ T_{n,n,V,V}. Henceforth, we will bound the metric entropy of T_{n,n,V,V}, which is a compact set and therefore has finite metric entropy. To cover T_{n,n,V,V} up to radius r > 0, we need to construct a finite net F_n such that for any θ ∈ T_{n,n,V,V} there exists θ̃ ∈ F_n with ‖θ − θ̃‖ ≤ r. Our main covering strategy is to come up with a finite net consisting of matrices which are piecewise constant on axis aligned rectangles. Specifically, for a given θ ∈ T_{n,n,V,V} and any given radius r > 0, we first construct a fine enough (but no finer than required) rectangular partition of L_n and set θ̃ to equal the mean of θ within each rectangular block of the partition, in such a way that in the end ‖θ − θ̃‖ ≤ r. Our covering strategy has two main ingredients.
• For any given θ ∈ T_{n,n,V,V} and radius r > 0, we construct a rectangular partition of L_n by a greedy partitioning scheme. As we will see, the cardinality of the net F_n essentially corresponds to the number of distinct partitions of L_n obtained as θ varies over T_{n,n,V,V}. This counting is done in Lemma 4.2.
• After creating a θ specific partition, we set θ̃ to equal the mean of θ within each rectangular block of the partition. Thus, to control ‖θ − θ̃‖, we need a bound on the approximation error incurred within each rectangular block; this is provided by Proposition 4.3 below.

We first describe the following very simple scheme for subdividing a matrix based on the value of its total variation. For any ε > 0, the (TV_unnorm, ε) scheme subdivides a matrix θ ∈ R^{n×n} into several rectangular submatrices θ_i such that TV_unnorm(θ_i) ≤ ε for all i. This is achieved in several steps of dyadic division (a sketch is given below). For convenience, we will assume that n is a power of 2 without any loss of generality (see Lemma 7.5).
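The following is a minimal sketch of such a scheme. The text specifies dyadic division but not its exact form, so here we assume (our choice) that every block whose unnormalized total variation exceeds ε is split in half along each dimension of length at least 2.

```python
# Sketch of a (TV_unnorm, eps) dyadic subdivision scheme (splitting rule assumed).
import numpy as np

def tv_unnorm(theta):
    # unnormalized total variation: sum of absolute differences across grid edges
    return np.abs(np.diff(theta, axis=0)).sum() + np.abs(np.diff(theta, axis=1)).sum()

def dyadic_partition(theta, eps):
    """Partition the index set of theta into rectangles, each with TV_unnorm <= eps."""
    blocks = []
    stack = [(0, theta.shape[0], 0, theta.shape[1])]   # (row_lo, row_hi, col_lo, col_hi)
    while stack:
        r0, r1, c0, c1 = stack.pop()
        if tv_unnorm(theta[r0:r1, c0:c1]) <= eps or (r1 - r0 == 1 and c1 - c0 == 1):
            blocks.append((r0, r1, c0, c1))
            continue
        rows = [(r0, (r0 + r1) // 2), ((r0 + r1) // 2, r1)] if r1 - r0 > 1 else [(r0, r1)]
        cols = [(c0, (c0 + c1) // 2), ((c0 + c1) // 2, c1)] if c1 - c0 > 1 else [(c0, c1)]
        stack.extend((a, b, c, d) for a, b in rows for c, d in cols)
    return blocks
```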
Suppose that θ ∈ R^{n×n}. The final subdivision of θ produced by the (TV_unnorm, ε) scheme corresponds to a partition of L_n into rectangular blocks, say P_{θ,ε}. Let |P_{θ,ε}| denote the number of rectangular blocks of the partition P_{θ,ε}. Now, for V > 0, let P(V, n, ε) denote the set of partitions {P_{θ,ε} : θ ∈ R^{n×n}, TV_unnorm(θ) ≤ V}. A key ingredient in our argument is the following universal upper bound on the cardinality of P(V, n, ε).
Lemma 4.2. Let θ ∈ R^{n×n}. Then, for any ε > 0 and the (TV_unnorm, ε) division scheme, there exists a universal constant C > 0 such that the following cardinality bounds hold: Remark 4.1. It is clear that both |P_{θ,ε}| and log |P(V, n, ε)| should increase as ε decreases. The basic idea of the proof is simple and uses superadditivity of the TV_unnorm functional. The reader should read the above lemma as saying that both |P_{θ,ε}| and log |P(V, n, ε)| scale like V/ε, up to additive and multiplicative log factors. The log factors are not terribly important, but the V/ε scaling is.
Our next result concerns the approximation of a generic matrix θ with TV unnorm (θ) ≤ V by a constant matrix. This result is a crucial ingredient of the proof and is a discrete analogue of the Gagliardo-Nirenberg-Sobolev inequality for compactly supported smooth functions.
So, in particular, when m = n we have the corresponding bound. A special case of the so-called Gagliardo–Nirenberg interpolation inequalities in analysis (see, e.g., Leoni (2017)) states that there exists a universal constant C > 0 such that for a differentiable function f : [0, 1]² → R the inequality (4.2) holds. If θ_{ij} = f(i/m, j/m), then ‖θ‖/m is a Riemann approximation for the integral on the left side of (4.2), whereas TV_unnorm(θ)/m is a Riemann approximation for the integral on the right side of (4.2). Thus one can see that Proposition 4.3 is the discrete analogue of the classical inequality in (4.2). However, note that differentiability of f is required for that inequality, whereas our inequality holds for arbitrary matrices with no assumptions. We are not aware of a discrete version of the Gagliardo–Nirenberg–Sobolev inequality having appeared in the literature, which is why we state it here. The proof is again given in Section 7.
Remark 4.2. It must be mentioned here that metric entropy bounds for the class of multivariate bounded variation functions exist in the functional analysis literature. The classical Gagliardo–Nirenberg–Sobolev inequality plays a similar role in those proofs as well. Our proof can be seen as extending these classical results to the matrix setting, where we do not use any differentiability assumptions. See Giné and Nickl (2016) for a general reference.
With Lemma 4.2 and Proposition 4.3 in hand, we are now in a position to upper bound the metric entropy of the set T n,n,V,V .
Proposition 4.4. For any positive integer n ≥ 2 and positive numbers V and L, we have the following covering number upper bound for a universal constant C > 0: Since our definition of total variation here is unnormalized, the canonical scaling of V is O(n). Thus, in almost all regimes of interest we have V² log n / ε² > 1, and it is perhaps beneficial for the reader to read the above bound as scaling like V²/ε², neglecting the log and non dominant factors. The proof of the above covering number bound relies on our dyadic partitioning scheme and on Proposition 4.3. A brief sketch of the proof, neglecting log factors, is as follows. Given θ ∈ T_{n,n,V,V}, we first use the (TV_unnorm, η) scheme to partition L_n so that the total variation of θ is at most η within every rectangle of the partition. By Lemma 4.2, the number of rectangles in the partition is bounded above by O(V/η). Now θ̃ is defined as the mean of θ within every rectangle. The squared ℓ2 distance between θ and θ̃ in each block is at most O(η²) by an application of Proposition 4.3, so the overall squared distance between θ and θ̃ is O(Vη). As θ varies over T_{n,n,V,V}, the log cardinality of the set of distinct θ̃ depends directly on the log cardinality of the set of different partitions obtained, and scales like O(V/η) by a further application of Lemma 4.2. Choosing Vη = ε² now gives the O(V²/ε²) scaling of the metric entropy. Full details of the proof are given in Subsection 7.2. We can now evaluate the Gaussian width of K^0_n(V) using the above metric entropy result. This is the content of the next lemma; its proof involves invoking Dudley's entropy integral inequality and some basic integral calculus.
Lemma 4.5. There exists a universal constant C > 0 such that we have the following bound on the Gaussian width of K^0_n(V) for any n ≥ 2 and V > 0: Finally, we can use the above lemma in conjunction with Lemma 4.1 to complete the proof of Theorem 2.1.

5. Proof of Theorem 2.2.
We give the high level idea of the proof here; the full proof is given in Subsection 7.3. To prove Theorem 2.2, we apply the general machinery developed in Chatterjee (2015) with suitable modifications. Let us define w = y − ȳ1 to be the centered data matrix and w* = θ* − θ̄*1 to be the centered ground truth matrix, and let ŵ denote the minimizer of the optimization problem (5.1) which defines the centered part of θ̂_notuning. Also, for any V ≥ 0, let ŵ_V denote the Euclidean projection of w onto the convex set K^0_n(V).
To show that θ̂_notuning is a good estimator of θ*, it clearly suffices to show that ŵ is a good estimator of w*. If we knew TV_unnorm(θ*) = TV_unnorm(w*) = V*, a similar argument as in the proof of Theorem 2.1 would tell us that ŵ_{V*} attains the Õ(V*/√N) rate that we desire. Of course, the aim here is to get the same rate without knowing V* or σ. One part of our proof shows that using σ̂ in the definition of our estimator is not much worse than using the true σ: we establish that σ̂ ∼ σ by a concentration of measure argument, where ∼ is a somewhat informal notation conveying approximate equality.
To analyze the risk of ŵ, a natural first step is to decompose the risk as follows: Here we used the elementary inequality ‖a + b‖² ≤ 2‖a‖² + 2‖b‖². The above decomposition has a natural interpretation as twice the sum of the ideal risk (achievable when V* is known) and an excess risk due to not knowing V* and σ. The main task therefore is to upper bound the excess risk term ‖ŵ − ŵ_{V*}‖².
We now need to consider two different cases. The first case is when ŵ ≠ 0. In this case, we first show that the minimum of the optimization problem defined in (5.1) is attained on the boundary, which means that ‖ŵ − w‖² = (n² − 1)σ̂² ∼ (n² − 1)σ². Letting V̂ = TV_unnorm(ŵ), a simple geometric argument also shows that ŵ_{V̂} = ŵ. Thus, both ŵ_{V*} and ŵ are Euclidean projections onto K^0_n(V) for two possibly different choices of V. We can therefore use standard characterizations of Euclidean projections onto convex sets (the content of Lemma 7.7) for both ŵ_{V*} and ŵ to obtain a bound on the excess risk as follows: Further, since ŵ_{V*} is known to be a good estimator of w*, we can write where the last approximation is again by a simple concentration of measure argument. The last three displays then suggest that ŵ is a good estimator of ŵ_{V*}; quantifying them gives us the desired upper bound on the excess risk.

The second case is when ŵ = 0.
The rest of the proof then follows similarly as in the previous case.
6. Proofs of Theorem 2.3 and Theorem 2.4. The constrained TVD estimator θ̂_V is a least squares estimator constrained to lie in the convex set K(V) := {θ ∈ R^{n×n} : TV_unnorm(θ) ≤ V}. There are general results on the accuracy of least squares estimators under convex constraints; see, e.g., Chatterjee (2014), Hjort and Pollard (1993), Van der Vaart (2000), Van der Vaart and Wellner (1996). In particular, we plan to use available techniques connecting the risk of the constrained TVD estimator θ̂_V to expected Gaussian suprema over tangent cones of the set K(V). Of course, here we are interested in the setting of ideal tuning V = V*, i.e., in the risk of the estimator θ̂_{V*}.
6.1. General Theory. We now describe existing results in the literature that we use directly in this proof. To describe these results, we need some notation and terminology. For any A ⊆ R^{n×n}, we denote the smallest cone containing A by Cone(A) and the closure of A by Closure(A). The tangent cone at θ* with respect to the convex set K(V*) is defined as T_{K(V*)}(θ*) := Closure(Cone{θ − θ* : θ ∈ K(V*)}); it represents all directions in which one can move infinitesimally from θ* and still remain in K(V*). For any set A ⊆ R^{n×n}, recall that its Gaussian width, denoted GW(A), is defined as GW(A) := E sup_{θ∈A} Σ_{i,j} Z_{ij} θ_{ij}, where the Z_{ij} are independent standard normal random variables. Let us denote the Euclidean ball in R^{m×n} of radius t by B_{m,n}(t). The following result, Corollary 2 in Bellec et al. (2018) together with an application of Proposition 10.2 in Amelunxen et al. (2014), connects the risk of θ̂_{V*} to the Gaussian width of the tangent cone T_{K(V*)}(θ*).
Another result that is of use to us is the following result of Oymak and Hassibi (2013) (Theorem 3.1). It says that the upper bound provided in Theorem 6.1 is essentially tight.
Remark 6.1. To clarify, Theorem 3.1 in Oymak and Hassibi (2013) is stated in terms of Dist(Z, Polar(T_{K(V*)}(θ*))). Here Z, as usual, refers to a matrix of independent N(0, 1) entries, Polar(T_{K(V*)}(θ*)) refers to the polar cone of T_{K(V*)}(θ*), and Dist refers to the Euclidean distance between a point and a set. Letting C denote a general cone and Π_C the Euclidean projection operator onto C, the standard Pythagoras theorem for cones implies ‖Π_C(Z)‖ = Dist(Z, Polar(C)). Also, it holds that ‖Π_C(Z)‖ = sup_{θ∈C: ‖θ‖≤1} ⟨Z, θ⟩; a proof of this fact is available in Lemma A.3 in Chatterjee et al. (2019). Theorem 6.2 now follows from applying these facts to Theorem 3.1 in Oymak and Hassibi (2013) and then using the elementary inequality EX² ≥ (EX)².
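As a numerical illustration of the Gaussian width quantity appearing in Theorems 6.1 and 6.2, the following sketch estimates GW(A) = E sup_{θ∈A} ⟨Z, θ⟩ by Monte Carlo for a finite set A of matrices (a finite set is used purely for illustration; the tangent cones studied here are of course infinite, and the function name is ours).

```python
# Monte Carlo estimate of the Gaussian width of a finite set of n x n matrices.
import numpy as np

def gaussian_width_mc(candidates, n_mc=2000, rng=None):
    """candidates: array of shape (num_points, n, n) listing the elements of A."""
    rng = np.random.default_rng() if rng is None else rng
    flat = candidates.reshape(len(candidates), -1)
    sups = []
    for _ in range(n_mc):
        z = rng.standard_normal(flat.shape[1])
        sups.append((flat @ z).max())       # sup over A of <Z, theta>
    return float(np.mean(sups))
```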
In light of the above two theorems, our problem reduces to understanding the Gaussian width of T_{K(V*)}(θ*) intersected with B_{n,n}(1) when θ* is a matrix which is piecewise constant on rectangles. This, in turn, requires us to understand what these tangent cones look like in the first place. In the next subsection, we characterize the tangent cone of a piecewise constant matrix.

6.2. Tangent Cone Characterization. We fix θ* ∈ R^{n×n} and proceed to investigate the tangent cone T_{K(V*)}(θ*). Let R* = (R_1, R_2, …, R_k) be a partition of [n] × [n] into k rectangles, where k = k(θ*). Recall that the vertices of the grid graph L_n correspond to the pairs (i, j) ∈ [n] × [n], and its edge set E_n consists of all pairs of vertices that differ by one in exactly one coordinate. For any edge e ∈ E_n, we denote by e⁺ and e⁻ the vertices associated with e with respect to the natural partial order. For any θ ∈ R^{L_n}, we use Δ_e θ as shorthand for the (discrete) edge gradient θ(e⁺) − θ(e⁻); thus TV_unnorm(θ) = Σ_{e∈E_n} |Δ_e θ|. For a general rectangle R ⊆ L_n, we define its right boundary ∂_right(R) as the set of vertices of R lying in its last column. While defining this set, we are using the matrix convention for indexing the vertices of L_n: the top left vertex of the two dimensional array L_n is indexed by (1, 1) and the bottom right vertex by (n, n). Similarly, we define the left, top and bottom boundaries of R and denote them by ∂_left(R), ∂_top(R) and ∂_bottom(R) respectively. The boundary of R, denoted by ∂R, is defined as the union of these four sets. Let A denote the set of edges e ∈ E_n for which Δ_e θ* ≠ 0. Observe that |Δ_e(θ* + θ)| − |Δ_e(θ*)| = |Δ_e θ| − 0 = |Δ_e θ| for every edge e in A^c. Thus, in order for θ* + θ to lie in K(V*), the increase in the absolute edge gradients of θ* + θ coming from the edges in A^c must be compensated by an equal or greater decrease in the absolute edge gradients over the edges in A. The precise statement is the content of the following lemma. Lemma 6.3. We have the following set equality: Here, sgn(x) := I{x > 0} − I{x < 0} is the usual sign function.
Proof. Let T be the set on the right side of (6.1). Let us first prove that T_{K(V*)}(θ*) ⊆ T. An important feature of T is that it is a closed convex cone; hence it suffices to show that θ ∈ T whenever θ* + θ ∈ K(V*). To this end, let θ be such that TV_unnorm(θ* + θ) ≤ TV_unnorm(θ*). Since K(V*) is a convex set, we have θ* + cθ ∈ K(V*) for any 0 ≤ c ≤ 1. Now, observing that we can write, whenever c is small enough that sgn(Δ_e θ* + cΔ_e θ) = sgn(Δ_e θ*) for all e ∈ A, and using the fact that, by definition, TV_unnorm(θ*) = Σ_{e∈A} sgn(Δ_e θ*) Δ_e θ*, we obtain (6.2), which gives us θ ∈ T.
It remains to show that T ⊆ T K(V * ) (θ * ). It suffices to show that for any θ ∈ T there exists a small enough c > 0 such that TV unnorm (θ * + cθ) ≤ V * . This can be shown using the same reasoning given after (6.2).
With the above characterization of the tangent cone, we are now ready to prove our lower bound to the risk given in Theorem 2.4.
6.3. Proof of Theorem 2.4. Recall that here we consider θ* which is piecewise constant on two rectangles, namely θ*_{ij} = 1 if j > n/2 and 0 otherwise, i.e., θ* = [0_{n×n/2}  1_{n×n/2}]. In view of Theorem 6.2, it suffices to prove the following lemma.
Lemma 6.4. Let θ* be as above. Then, there exists a universal constant c > 0 such that we have the following lower bound on the Gaussian width: GW(T_{K(V*)}(θ*) ∩ B_{n,n}(1)) ≥ c n^{1/4}.
Proof. For simplicity of exposition, consider n to be even and a perfect square (i.e., √n is an integer). For a generic n × n matrix θ, we will denote by θ^{(1)} the submatrix formed by the first n/2 − 1 columns, by v(θ) the (n/2)-th column, and by θ^{(2)} the submatrix formed by the last n/2 columns. Also, for two matrices θ and θ′ with the same number of rows, we will denote by [θ : θ′] the matrix obtained by concatenating the columns of θ and θ′.
We can now use Lemma 6.3 to characterize the tangent cone T K(V * ) (θ * ).
In this proof, we will actually lower bound the Gaussian width of a convenient subset S of T_{K(V*)}(θ*). To this end, for constants c_1, c_2 ∈ (0, 1) to be specified later, let us define S as follows. In words, for θ ∈ S, the first n/2 − 1 columns of θ are all identically equal to c_1/n, the last n/2 columns of θ are 0, and each entry of the (n/2)-th column takes one of two values, either c_2/√n or c_1/n.
Before going further, let us define the sets of indices B_1, …, B_{√n}: we divide [n] into √n equal contiguous blocks, and B_j refers to the j-th block. Now, for any realization of the random Gaussian matrix Z, let us define the matrix ν so that ν^{(1)} = (c_1/n) 1_{n×(n/2−1)} and ν^{(2)} = 0_{n×n/2}. Moreover, we define v(ν) so that it is constant on each of the blocks B_j: if Σ_{i∈B_j} Z[i, n/2] > 0, the value on B_j is c_2/√n; otherwise the value is c_1/n. Now we claim that the following two facts are true: a) ν ∈ S for every realization of Z. b) ‖ν‖ ≤ 1.
Taking the above claims to be true, we can write where we used the fact that ν^{(1)} and ν^{(2)} are constant matrices and Z has mean zero entries. Now let us denote Z_j := Σ_{i∈B_j} Z[i, n/2]. Note that (Z_1, …, Z_{√n}) are independent mean zero Gaussians with standard deviation n^{1/4}. Therefore where, for a standard Gaussian random variable z, we denote φ = E[z I{z > 0}].
It remains to prove the two claims. To prove the first claim, first note that it suffices to show that (6.3) holds for ν, i.e., that the following is true. Now the entries of v(ν) can take two values, either c_2/√n or c_1/n. In either case, it can be checked that for each row index i ∈ [n] we have the required inequality, which, along with the fact stated below, completes the proof of the claims.

In order to obtain a "matching" upper bound on the Gaussian width, which would eventually lead to the proof of Theorem 2.3 in view of Theorem 6.1, we first need to split θ into submatrices each of which satisfies a constraint like (6.3). Our next two subsections are devoted to this goal.
6.4. Towards simplifying the tangent cone. Let us revisit Lemma 6.3. Since θ* is constant on each rectangle R_i ∈ R*, it follows that A ⊆ {e ∈ E_n : e⁺ ∈ R_i and e⁻ ∈ R_j for some i ≠ j ∈ [k]}.
As a consequence, we get the following corollary. Corollary 6.5. Fix θ* ∈ R^{n×n}. We have The first step towards obtaining a decomposition where each submatrix satisfies a constraint like (6.3) is to separate the constraints for the R_i's; more precisely, we would like the constraint (6.6) to hold for each i ∈ [k]. As we will see below, this is "almost" the truth when we consider matrices in the tangent cone which are of unit norm.
Let us make precise the notion of an "almost" version of (6.6). To this end, we introduce, for any δ, t > 0, the class M_fourbdry(m′, n′, δ, t). In plain words, M_fourbdry(m′, n′, δ, t) consists of m′ × n′ matrices of norm at most t whose total variation is bounded by the total ℓ1 norm of their four boundaries plus an extra wiggle room δ > 0. In our next result, we show that for any θ in T_{K(V*)}(θ*) intersected with the unit Euclidean ball B_{n,n}(1), the restriction θ_{R_i} of θ to R_i lies in M_fourbdry(m_i, n_i, δ_i, t_i) for each i ∈ [k], with m_i := n_row(R_i), n_i := n_col(R_i), and with the t_i's and δ_i's satisfying certain upper bounds on their ℓ2 and ℓ1 norms respectively.
Lemma 6.6. We have the following set inclusion: Remark 6.2. By virtue of Lemma 6.6, we achieve our objective of obtaining a characterization of T_{K(V*)}(θ*) in which there is a separate constraint for each R_i ∈ R*. The constraints are now coupled together only through the wiggle room vector δ ∈ S_{k,Δ(θ*)} and the ℓ2 norm vector t.
With the help of Lemma 6.6 we can now deduce the following lemma.
Lemma 6.7. With the notation described in this section, we have the following upper bound: The proofs of Lemma 6.6 and Lemma 6.7 are given in Subsections 7.4 and 7.5. Operationally, the above lemma reduces the task of upper bounding GW(T_{K(V*)}(θ*) ∩ B_{n,n}(1)) to that of upper bounding the Gaussian width of M_fourbdry with appropriate parameters. However, it will be convenient for us to bound the Gaussian width when the number of boundaries involved in the constraint is at most one instead of four. The results in the next subsection make this possible.
6.5. Further simplification: from four boundaries to one. We now proceed to the second step, i.e., reducing the number of boundaries involved in the constraints from four to one (or zero). Thus, we will keep on subdividing each θ_{R_i} until we obtain submatrices satisfying constraints similar to (6.7), albeit with the ℓ1 norm of at most one boundary vector appearing on the right hand side of the bound on total variation. This is the content of this subsection.
Taking the cue from the previous subsection, let us define the class M_top(m′, n′, δ, t); we define M_bottom(m′, n′, δ, t), M_left(m′, n′, δ, t) and M_right(m′, n′, δ, t) in a similar fashion. Notice that the constraint satisfied by the total variation of the members of M_right(m′, n′, δ, t) is "almost" identical to (6.3). By abuse of notation, we will refer to any of the four families of matrices described above by the generic notation M_onebdry(m′, n′, δ, t). The reason is that our ultimate concern is the Gaussian widths of these families, which, for m′ and n′ close enough to each other, are expected to be of the same order by symmetry; using a single notation thus minimizes notational clutter. In a similar vein, we define M_nobdry(m′, n′, δ, t) := {θ ∈ R^{m′×n′} : TV_unnorm(θ) ≤ δ, ‖θ‖ ≤ t}.
Having defined the relevant families of matrices, we can now state our main result for this subsection.
Lemma 6.8. Fix positive integers m, n and positive numbers δ, t, and for each integer j ≥ 1 define the quantities appearing below. Then, for a universal constant C, we have the following bound on GW(M_fourbdry(m, n, δ, t)): Here, to simplify notation, we use m/2^j, for m, j ∈ N, to denote any (but fixed in any given context) integer m′ between m2^{−(j+1)} and m2^{−(j−1)}; the notation n/2^j is interpreted similarly. K equals the number of binary divisions of [m] × [n] along both axes that are possible, and equals min{log_2 m, log_2 n} up to a universal constant.
The above lemma bounds the Gaussian width of M_fourbdry in terms of the Gaussian widths of the simpler classes of matrices M_onebdry and M_nobdry. The proof proceeds by following a particular scheme of recursive binary partitioning in both dimensions and is given in Subsection 7.6.

6.6. Upper bounds on Gaussian widths and the proof of Theorem 2.3. Now that we have reduced the problem of bounding the Gaussian width of T_{K(V*)}(θ*) ∩ B_{n,n}(1) to that of bounding the Gaussian widths of M_nobdry(m, n, δ, t) and M_onebdry(m, n, δ, t), we need upper bounds on these quantities in order to conclude the proof of Theorem 2.3. Our next lemma provides an upper bound on the Gaussian width of M_nobdry(m, n, δ, t), which we henceforth denote by GW_nobdry(m, n, δ, t).
Lemma 6.9. Fix δ > 0 and t ∈ (0, 1]. For positive integers m, n such that n ≥ 2 and max{m/n, n/m} ≤ c for some universal constant c > 0, we have the following upper bound on the Gaussian width: Proof. Note that we have M nobdry (m, n, δ, t) ⊆ T m,n,δ,t . When m and n are of the same order, the proof follows from exactly the same arguments as in the proof of Lemma 4.5.
In our next proposition, we provide an upper bound on GW_onebdry(m, n, δ, t), i.e., the Gaussian width of M_onebdry(m, n, δ, t). This is the main result of this subsection and one of the main technical contributions of this paper.
Here x_{2↓} := x + x² and C > 0 is a universal constant.
We will prove the above proposition slightly later. Lemma 6.9 and Proposition 6.10 together with Lemma 6.8 now imply Lemma 6.11. Under the same condition as in the previous proposition, we have GW(M fourbdry (m, n, δ, t)) ≤ C(log n) 4.5 n 1/4 ( The detailed proof of Lemma 6.11 is given in Section 7.7. With the help of this lemma we can now conclude the proof of Theorem 2.3.
Throughout this proof, we will use the notation C to denote a positive universal constant whose exact value may change from one line to the next. Also, we will write "a ≲ b" to mean "a ≤ Cb". Recall that, by Lemma 6.6, GW(T_{K(V*)}(θ*) ∩ B_{n,n}(1)) is bounded by the maximum, over the vectors δ and t as in Lemma 6.6, of Σ_{i∈[k]} GW(M_fourbdry(m_i, n_i, Δ(θ*)δ_i, t_i)) + C k log n. (6.10) Now we plug in the bound from Lemma 6.11 to obtain a bound on the sum inside the two maximums in the above display, whose leading contributions involve the t_i together with a kn^{−9} log n term.
Since the aspect ratio of each of the rectangular level sets of θ* is bounded by a constant, we have Σ_{i=1}^k n_i² ≲ n². Therefore, we can repeatedly apply the Cauchy–Schwarz inequality to deduce, for δ, t² ∈ S_{k,2}, the corresponding bounds. Also, because of the constant aspect ratio, we have a matching estimate. Combining the last two displays, we see that (log n)^{4.5} k^{5/8} n^{1/4} emerges as the dominant term, and hence Σ_{i∈[k]} GW(M_fourbdry(m_i, n_i, Δ(θ*)δ_i, t_i)) ≲ (log n)^{4.5} k^{5/8} n^{1/4}.
Together with (6.10) this finishes the proof.
All that remains is the proof of Proposition 6.10. The proof of this proposition is fairly involved. The rest of this section is devoted to its proof. 6.7. Proof of Proposition 6.10. By symmetry, it is enough to bound GW(M right (m, n, δ, t)).

To this end, let us introduce a new class of matrices as follows:
A(m, n, u, v, t) := {θ ∈ R^{m×n} : TV_row(θ) ≤ u, TV_col(θ) ≤ v, ‖θ‖ ≤ t}, where the total variation along rows is defined as TV_row(θ) := Σ_{i=1}^m Σ_{j=1}^{n−1} |θ_{i,j+1} − θ_{i,j}| and TV_col(θ) := TV_row(θ^T).
The following lemma gives an upper bound of GW(M right ) in terms of the Gaussian widths of A with appropriate parameters.
The proof of this lemma is done by dividing the n columns into blocks of geometrically increasing length and showing that for any θ ∈ M right (m, n, δ, t) the submatrices defined by the blocks live in A with appropriate parameters. The proof is given in Subsection 7.8.1.
It therefore suffices, in view of the previous lemma, to bound from above the Gaussian width of each A(m, n_j, 2t√(m/n_j) + δ, t√(m/n) + δ, t) in order to bound GW_right(m, n, δ, t).
Let us write A_a := A(m, m/a², 2ta + δ, t√(m/n) + δ, t); notice that we have suppressed the dependence on m, n, δ and t, which henceforth refer to the corresponding parameters in Lemma 6.12.
Our next result provides an upper bound on the metric entropy of A a for any radius τ between 1/m and 1.
Lemma 6.13. Let t ≤ 1 and c > 0 be a universal constant. Also let a ≥ c be such that m/a² is a positive integer between 1 and n. Then there exists a constant C > 0, depending solely on c, such that for any τ ∈ (1/m, 1] we have where L(x) := x log(e log(em)² x) and C_{m,n,δ,t} := log(em) (recall that x_{2↓} := x + x²).
Remark 6.3. Notice that L(x) is linear in x up to log factors. Thus it is helpful to read the above bound as scaling like √m/τ², up to log factors and lower order terms. This √m scaling is crucial for us in order to derive the exponent 1/4 of n in Proposition 6.10 and subsequently the correct exponent of n in Theorem 2.3.
Remark 6.4. The reason for assuming a polynomial (in m) lower bound on τ is that we want log(1/τ) to be at most O(log m). Hence the bounds of Lemma 6.13 remain valid, with appropriate changes in C, as long as τ ≥ 1/m^c for some universal constant c > 0.
An important feature of the bound in Lemma 6.13 is that it does not depend on a. Hence an application of Dudley's entropy integral inequality (see Theorem 7.6) yields the same bound on each gaussian width appearing inside the summation in the statement of Lemma 6.12. From this we can deduce Proposition 6.10 in a straightforward manner. The details are given in Section 7.8.
The thing that remains to do is the proof of Lemma 6.13. An important ingredient is the following upper bound for the general case.
Lemma 6.14. Let t ≤ 1. Then there exists a universal constant C > 0 such that for any τ > 0 and 1 ≤ k < m we have the following bound. Remark 6.5. It is worthwhile to emphasize here that Lemma 6.14 is a stand-alone result and the bounds hold for any m, n, u, v and t ≤ 1. In particular, the notations m, n in the statement of Lemma 6.14 are generic and should not be confused with the parameters in Lemmata 6.12 and 6.13.
Remark 6.6. Lemma 6.14, by itself, is not sufficient to prove Lemma 6.13. To see this, let us plug in n = m/a 2 and u = 2ta in the expression for J k . One can easily check that while this makes J k free from a, the principal term in the bound on the metric entropy does not attain the required √ m-scaling for any choice of k.
The key idea in the proof of Lemma 6.14 is to partition any θ ∈ A(m, n, u, v, t) into the fewest possible submatrices such that the matrix θ̃ obtained by replacing each submatrix with its mean satisfies ‖θ − θ̃‖ ≤ τ/2. The proof then follows by producing a τ/2 covering set, in a rather straightforward way, for the family of matrices θ̃ obtained in this fashion. Towards this end, we will repeatedly use a subdivision scheme based on the value of either TV_row or TV_col. We will also use it in the proof of Lemma 6.13 and therefore describe it here in a general setting. Let us point out that a very similar scheme was described in Subsection 4.2.1 in the context of proving Theorem 2.1.
Consider a set S and a function T : ∪_{n∈ℕ} S^n → ℝ_{≥0} satisfying T(AB) ≥ T(A) + T(B) for all A, B ∈ ∪_{n∈ℕ} S^n, where AB denotes the concatenation of A and B. Also suppose that T(s) = 0 for any singleton s ∈ S. To relate this to a concrete example, the reader may consider the case where S = ℝ^m, so that S^n ≡ ℝ^{m×n}, and T is the function TV_row. Now, for any ε > 0, the (T, ε) scheme subdivides an element U of ∪_{n∈ℕ} S^n as U_1 U_2 ⋯ U_K such that T(U_i) ≤ ε for all i ∈ [K]. This is achieved in several steps of binary division as follows. In the first step, we check whether T(U) ≤ ε. If so, we stop and output U. Otherwise, we divide U as U_1 U_2 into two almost equal parts, meaning |U_1| = ⌈|U|/2⌉ and |U_2| = |U| − |U_1|. At each subsequent step, we have a representation of U of the form U_1 U_2 ⋯ U_{K′}. We consider each i ∈ [K′] such that T(U_i) > ε and subdivide U_i into two almost equal parts. We repeat this procedure until each part U_i in the current representation satisfies T(U_i) ≤ ε.
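To make the scheme concrete, here is a minimal Python sketch (our own illustration, not taken from the paper; the function names and the running example with S = ℝ and T the unnormalized total variation of a scalar sequence are ours). It repeatedly halves every block whose T-value exceeds ε and returns the resulting partition into contiguous index blocks.

```python
from typing import Callable, List, Sequence, Tuple

def division_scheme(U: Sequence, T: Callable[[Sequence], float],
                    eps: float) -> List[Tuple[int, int]]:
    """(T, eps) scheme: halve every block whose T-value exceeds eps.

    Returns the partition P_{U;T,eps} of range(len(U)) as (start, end) pairs.
    """
    blocks = [(0, len(U))]
    while True:
        refined, changed = [], False
        for lo, hi in blocks:
            if hi - lo > 1 and T(U[lo:hi]) > eps:
                mid = lo + (hi - lo) // 2          # split into two almost equal parts
                refined += [(lo, mid), (mid, hi)]
                changed = True
            else:
                refined.append((lo, hi))
        blocks = refined
        if not changed:
            return blocks

# A concrete superadditive T: unnormalized total variation of a scalar sequence.
# T of a singleton is 0 and T(AB) >= T(A) + T(B) under concatenation.
def tv(v: Sequence[float]) -> float:
    return sum(abs(v[i + 1] - v[i]) for i in range(len(v) - 1))

# Example: a sequence with a single unit jump is split until the jump is isolated.
print(division_scheme([0.0] * 8 + [1.0] * 8, tv, eps=0.5))
```

Each round splits at most T(U)/ε blocks, which is exactly the counting used in the proof of Lemma 6.15 below.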
Suppose that |U| = n. The subdivision of U produced by the (T, ε) scheme corresponds to a partition of [n] into contiguous blocks, say P_{U;T,ε}. Let |P_{U;T,ε}| denote the number of blocks of the partition P_{U;T,ε}. Now, for t > 0, let P(t, n, ε, T) denote the set of partitions {P_{U;T,ε} : U ∈ S^n, T(U) ≤ t}. A key ingredient in the proof of Lemma 6.14 (and subsequently Lemma 6.13) is the following universal upper bound on the cardinality of P_{U;T,ε}.
Lemma 6.15. For the (T, ε) division scheme described above, we have max_{P ∈ P(t,n,ε,T)} |P| ≤ log₂(4n)(1 + t/ε).

We defer the proof of Lemma 6.14 to Section 7.8.4, but let us now finish the proof of Lemma 6.13. The main idea of the proof is, for any θ ∈ A_a, to rearrange its rows into several blocks in an optimal and judicious way and then apply Lemma 6.14 to each submatrix formed by these blocks. As will be clear from the proof below, these submatrices do not necessarily contain consecutive rows.
Proof of Lemma 6.13. Take any θ ∈ A_a and fix ε ∈ (0, 1), whose precise value (depending on τ) will be chosen at the end, and subdivide θ row-wise as in (6.11), where TV_col(θ_i) ≤ ε for all i ∈ [K] and K ≤ log₂(4m)(1 + TV_col(θ) ε^{−1}). We achieve this by applying the (TV_col, ε) division scheme to the rows of θ (see Lemma 6.15).
Having obtained the subdivision, we now replace each row of θ_i with the mean of its rows, for all i ∈ [K], and get a new matrix θ̃ in which the rows of each θ̃_i are all identical. By repeated application of Lemma 7.3, we can deduce that Also note that, by the Cauchy-Schwarz inequality, ‖θ̃‖ ≤ ‖θ‖ ≤ t. We further claim that θ̃ ∈ A_a ≡ A(m, m/a², 2ta + δ, t√(m/n) + δ, t). Hence, to establish this claim we only need to show that TV_row(θ̃) ≤ TV_row(θ) and TV_col(θ̃) ≤ TV_col(θ). We can obtain the first inequality as follows: For the second inequality we just apply Lemma 7.4 to each column of θ.
Next we will regroup the rows of θ̃ into several submatrices. For any positive integer ℓ such that 2^ℓ ≤ 2m, define the set S_ℓ = {i ∈ [K] : 2^{ℓ−1} ≤ n_row(θ̃_i) < 2^ℓ} and let B_ℓ be the vector which is the sorted version of S_ℓ. Now consider the submatrix
where K_ℓ := |B_ℓ|. In words, θ̃^ℓ contains the submatrices θ̃_i, in order, whose number of rows lies between 2^{ℓ−1} and 2^ℓ. It is clear that θ̃^1, θ̃^2, …, θ̃^L are disjoint submatrices of θ̃, where L ≤ log₂(2m). Notice that if the matrices θ̄^1, θ̄^2, …, θ̄^L satisfy ‖θ̃^ℓ − θ̄^ℓ‖ ≤ ε√m for all ℓ ∈ [L], and the matrix θ̄ comprises θ̄^1, θ̄^2, …, θ̄^L in the same order as θ̃ comprises θ̃^1, θ̃^2, …, θ̃^L, then we have (6.14) We now choose ε by requiring this approximation error to be τ, i.e., by setting ε = τ/√(2m log₂(4m)) (notice that 1/(4m²) ≤ ε ≤ 1/√m when τ ∈ [1/m, 1]). Therefore we can bound the covering number of A_a as: where P is the set of all possible horizontal subdivisions of θ as in (6.11) and A*_{ℓ,P} is the family of matrices θ̃^ℓ corresponding to P ∈ P. Since the number of subdivisions is at most log₂(4m)(1 + TV_col(θ) ε^{−1}), a naive upper bound on |P| can be obtained as This enables us to rewrite (6.15) as

Now fix a P ∈ P and let Θ_ℓ denote the matrix formed by the first (or any) rows of θ̃_{B_ℓ(1)}, θ̃_{B_ℓ(2)}, …, θ̃_{B_ℓ(K_ℓ)}, in order, i.e., the rows of θ̃^ℓ that are potentially distinct. We claim that The constraints on the number of rows and columns of Θ_ℓ are clear. For the remaining constraints, first observe that θ̃^ℓ ∈ A(n_row(θ̃^ℓ), m/a², 2ta + δ, t√(m/n) + δ, t) (the only non-obvious part is the bound on TV_col(θ̃^ℓ), which follows from the triangle inequality). From the definition of Θ_ℓ it is immediate that TV_col(Θ_ℓ) = TV_col(θ̃^ℓ) and TV_row(Θ_ℓ) ≤ TV_row(θ̃^ℓ). Therefore the bounds on TV_col(Θ_ℓ) and TV_row(Θ_ℓ) follow from the corresponding bounds for θ̃^ℓ and the fact that n_row(θ̃_{B_ℓ(i)}) ≥ 2^{ℓ−1} for each i ∈ [K_ℓ].
Further notice that, since n_row(θ̃_{B_ℓ(i)}) < 2^ℓ for each i ∈ [K_ℓ], we have ‖θ̃^ℓ − θ̄^ℓ‖ ≤ 2^{ℓ/2} ‖Θ_ℓ − Θ̄_ℓ‖, where θ̄^ℓ comprises repetitions of the rows of Θ̄_ℓ in the same way as θ̃^ℓ comprises repetitions of the rows of Θ_ℓ. Therefore a 2^{−ℓ/2} ε√m covering set for A_{a,ℓ} is also an ε√m covering set for A*_{ℓ,P}. Our next claim gives a uniform upper bound on N(A_{a,ℓ}, 2^{−ℓ/2} ε√m).
Claim 6.1. There exists an absolute constant C > 0 such that for any ℓ ∈ ℕ_{>0} and ε ∈ [1/(4m²), 1/√m], we have where we recall from the statement of Lemma 6.13 the definitions of L(x) = x log(e log(em)² x) and C_{m,n,δ,t}.

Claim 6.1 follows directly from Lemma 6.14 when we choose k in an appropriate manner. The complete proof is given in the appendix.
Concluding the proof. In the remainder of the proof we will use C to denote a positive absolute constant whose exact value may change from one line to the next. Using Claim 6.1 we can bound the second term on the right hand side of display (6.16) as follows: (recall the choice of ε after (6.14) and also the definition of C_{m,n,δ,t} from the statement of Lemma 6.13). On the other hand, since ε ≤ 1/√m, we can bound the first term on the right hand side of (6.16) as Since L(x) ≥ x for all x ≥ 1, we deduce, by combining the previous two displays and subsequently plugging them into (6.16):
are nonnegative real numbers satisfying the following inequality for each i ∈ [n], be some other nonnegative numbers. In addition, suppose also that the following inequality holds for some δ > 0, Then the following is true: where a_+ = max{a, 0} for any a ∈ ℝ.

Proof. The first equation in the above proposition basically says where a_− = (−a)_+ for any a ∈ ℝ. Therefore we can write which finishes the proof of the lemma.
The following lemma appears as Lemma D.1 in Guntuboyina et al. (2017).
Lemma 7.4. Let α ∈ ℝ^n and let B_1, B_2, …, B_k be a partition of [n] into contiguous blocks. Let α_{B_j} denote the restriction of α to the block B_j. Also let ᾱ ∈ ℝ^n be defined so that In other words, ᾱ is the best Euclidean approximation to α within the subspace of all vectors which are constant on each block B_j. We then have the following inequality:

Proof. For any set of indices i_1 ∈ B_1, …, i_k ∈ B_k, we have the following inequality: Now averaging over the indices i_j ∈ B_j and using Jensen's inequality gives us The last two displays finish the proof of the proposition.

Proof. Consider a θ ∈ T_{n,n,V,V} and partition it as Now define f(θ) to be the matrix where M⃗, for any matrix M, denotes the matrix obtained by reversing the order of its columns, whereas M↓ is obtained by reversing the order of its rows. It is easy to check that f(θ) ∈ T_{n′,n′,4V,V}. Since θ is a submatrix of f(θ), we also have ‖f(θ) − f(θ′)‖ ≥ ‖θ − θ′‖.
We state below the standard chaining result known as Dudley's entropy integral inequality, adapted to our particular situation. The original reference is Dudley (1967); a version can be found in van Handel (2014) as Corollary 5.25.
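For orientation, one standard formulation of this inequality (the generic statement from the chaining literature, not necessarily the exact form of the display below; constants may differ) reads: if Z is an m × n matrix of independent standard Gaussian entries and N(A, τ) denotes the τ-covering number of A in the Frobenius norm, then
\[
\mathbb{E}\sup_{\theta \in A}\langle Z, \theta\rangle \;\le\; C\int_{0}^{\infty} \sqrt{\log N(A,\tau)}\, d\tau
\]
for a universal constant C > 0.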
Theorem 7.6 (Dudley's Entropy Bound). Let A ⊆ ℝ^{m×n}. Then we have

7.2. Proof of Theorem 2.1.

Proof of Lemma 4.2.
Proof. Let P_0 = L_n be the initial partition and let P_i be the partition after round i of our division scheme. Our division scheme can naturally be thought of as the construction of a quaternary tree, one level at a time, with the nodes being the rectangular blocks partitioning L_n. Since P_0 = L_n is the initial partition, L_n is the root of the tree. At round i, we take the leaves of the tree at depth i − 1, which correspond to rectangular blocks B_j ∈ P_{i−1}, and check for which of these we have TV_unnorm(θ_{B_j}) > ε. For such a rectangular block B_j, we perform a dyadic partition of B_j to obtain four new equal sized blocks or leaves. Let n_i be the number of blocks of the partition P_i and let s_i equal the number of blocks B_j in P_i that are divided to obtain P_{i+1}. Then n_0 = s_0 = 1. Importantly, we have n_{i+1} = n_i + 3s_i. Note that, due to super-additivity of the TV_unnorm functional, at every round i we must have s_i ≤ TV_unnorm(θ)/ε. This implies in particular that n_i ≤ 1 + 3i TV_unnorm(θ)/ε. Now the division scheme can go on for at most N = log₂ n rounds (recall that we assume n to be a power of 2), thereby giving us |P_{θ,ε}| ≤ 1 + 3 log₂ n · TV_unnorm(θ)/ε.
Next we upper bound the number of possible partitions P_{θ,ε} when TV_unnorm(θ) ≤ V. In the (i + 1)-th round, the number of distinct ways of partitioning is at most n_i^{s_i}, provided that s_i > 0. Therefore the number of possible partitions that can be obtained is bounded above by Thus we can conclude that log |P(V, n, ε)| ≤ C V ε^{−1} log n for some universal constant C > 0.
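The quaternary division scheme used in this proof can be sketched in a few lines of Python (again our own illustration under the proof's assumptions: θ is n × n with n a power of 2, and TV_unnorm of a block is its unnormalized total variation, i.e., the sum of absolute differences over the grid edges inside the block):

```python
import numpy as np

def tv_unnorm(block: np.ndarray) -> float:
    """Unnormalized total variation: sum of |differences| over grid edges."""
    return np.abs(np.diff(block, axis=0)).sum() + np.abs(np.diff(block, axis=1)).sum()

def quad_division_scheme(theta: np.ndarray, eps: float):
    """Dyadically split every block whose unnormalized TV exceeds eps.

    Returns the partition P_{theta,eps} as a list of (row_slice, col_slice) pairs.
    """
    n = theta.shape[0]                              # n is assumed to be a power of 2
    stack = [(slice(0, n), slice(0, n))]
    partition = []
    while stack:
        rs, cs = stack.pop()
        if (rs.stop - rs.start) > 1 and tv_unnorm(theta[rs, cs]) > eps:
            rmid = (rs.start + rs.stop) // 2        # dyadic split into four equal blocks
            cmid = (cs.start + cs.stop) // 2
            stack += [(slice(rs.start, rmid), slice(cs.start, cmid)),
                      (slice(rs.start, rmid), slice(cmid, cs.stop)),
                      (slice(rmid, rs.stop), slice(cs.start, cmid)),
                      (slice(rmid, rs.stop), slice(cmid, cs.stop))]
        else:
            partition.append((rs, cs))
    return partition
```

In each round, at most TV_unnorm(θ)/ε blocks can be split, which is exactly the counting used above.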

Proof of Proposition 4.4.
Proof. We will use our (TV, η) greedy division scheme to subdivide any θ ∈ T_{n,n,V,V} into several (square and equal sized) submatrices such that the total variation of each submatrix is at most η, where the precise value of η, depending on n, V and ε, will be decided later. We will achieve this by progressively dyadically partitioning θ until each of the remaining submatrices has total variation at most η. These submatrices then correspond to a partition P_{θ,η} of [n] × [n] into adjacent, square blocks. By using Lemma 4.2, we get, for a universal constant C, max_{θ∈T_{n,n,V,V}} |P_{θ,η}| ≤ C log n (V η^{−1} + 1), and log |{P_{θ,η} : θ ∈ T_{n,n,V,V}}| ≤ C log n V η^{−1}. (7.2) Let us replace the submatrices of θ corresponding to the blocks in P_{θ,η} by their respective means, and call the resulting matrix θ_η. Since the blocks in P_{θ,η} are square, Proposition 4.3 gives us Therefore if we choose we get ‖θ − θ_η‖² ≤ ε². Together with (7.2), this almost yields the lemma, except that the means within each possible block in P_{θ,η} need to come from a finite set.
Notice that θ ∈ T_{n,n,V,L} implies ‖θ‖_∞ ≤ L and consequently the mean of any collection of entries of θ must lie within [−L, L]. Hence, we can take a grid within [−L, L] with spacing ε/n and set the value of θ_η in a rectangular block to be the mean of θ within that block, rounded off to the nearest value in the grid. Firstly, since the matrix with the exact block means is within ε of θ and the rounding moves each entry by at most ε/n, we now have ‖θ − θ_η‖² ≤ 4ε². Secondly, the cardinality of this grid is clearly at most 1 + 2Ln/ε. Hence, for any fixed θ ∈ T_{n,n,V,L}, the cardinality of the set of possible matrices θ_η is at most (1 + 2Ln/ε)^{|P_{θ,η}|}. As θ ∈ T_{n,n,V,L} varies, the cardinality of the set of all possible matrices θ_η is therefore bounded above by (1 + 2Ln/ε)^{max_{θ∈T_{n,n,V,L}} |P_{θ,η}|} · |{P_{θ,η} : θ ∈ T_{n,n,V,L}}|.
Recalling (7.2) and substituting the value of η from (7.3), we obtain log N(T_{n,n,V,V}, ε) ≤ C log n log(1 + 2Ln/ε) for some universal constant C, finishing the proof of the proposition.
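The discretization step used in the proof above can be illustrated by the following Python fragment (a sketch under the stated assumptions: blocks are supplied as index slices, e.g., the output of the quad_division_scheme sketch given earlier, the entries of θ lie in [−L, L], and the grid spacing is ε/n; the function name is ours):

```python
import numpy as np

def quantized_block_means(theta: np.ndarray, blocks, L: float, eps: float) -> np.ndarray:
    """Replace theta on each block by its mean, rounded to a grid on [-L, L].

    The grid has spacing eps / n and hence at most 1 + 2 L n / eps points,
    which is the cardinality entering the entropy bound of Proposition 4.4.
    """
    n = theta.shape[0]
    spacing = eps / n
    grid = np.arange(-L, L + spacing, spacing)
    theta_eta = np.empty_like(theta, dtype=float)
    for rs, cs in blocks:
        block_mean = theta[rs, cs].mean()
        theta_eta[rs, cs] = grid[np.argmin(np.abs(grid - block_mean))]  # nearest grid value
    return theta_eta
```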

Proof of Lemma 4.5.
Proof. By an application of Proposition 4.3, the ℓ₂ diameter of K^0_n(V) is at most 2V. Also, since K^0_n(V) ⊆ T(n, n, V, V), we can use the metric entropy bound from Proposition 4.4 after setting L = V. Thus, after applying Dudley's entropy integral inequality (Theorem 7.6), we get for some universal constant C > 0. For ε ≥ 1/n, Proposition 4.4 gives us

Now using the elementary inequality
and doing some integral calculus, we obtain This finishes the proof of the lemma.
7.3. Proof of Theorem 2.2. While proving Theorem 2.2 we will prove a few intermediate results.
Our first lemma is a basic fact about Euclidean projections onto K 0 n (V ) for two different choices of V. This also appears as Lemma 5.1 in Chatterjee (2015).
This finishes the proof of the lemma.
Our next lemma is the following pointwise inequality.
Lemma 7.8. Let w = y − ȳ1 be the centered version of y. For any V ≥ 0, let ŵ_V denote the projection of w onto the convex set K^0_n(V). Let Then we have the following pointwise inequality for any V ≥ 0:

Proof. Let us first consider the case when ŵ ≠ 0. Define V̂ = TV_unnorm(ŵ). We claim that ŵ_V̂ = ŵ and further To prove the above claim, suppose ŵ_V̂ ≠ ŵ. Then we have ‖w − ŵ_V̂‖² < ‖w − ŵ‖² ≤ (n² − 1)σ̂², because of the uniqueness of Euclidean projections onto convex sets. Therefore, we have ‖w − ŵ_V̂‖ The first term is just 8σ² times the Gaussian width of K^0_n(2V*) and we can use Lemma 4.5 to upper bound it. As for the second term, it is clear that Also we observe that ‖w − w*‖² This is a standard fact about standard normal random variables. Therefore we can write where the first inequality follows from the Cauchy-Schwarz inequality and the last equality follows because Var(χ²_k) = 2k for any positive integer k.
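The Cauchy–Schwarz step alluded to here is of the following standard form: if W ∼ χ²_k, then
\[
\mathbb{E}\,|W - k| \;\le\; \sqrt{\mathbb{E}(W-k)^2} \;=\; \sqrt{\operatorname{Var}(\chi^2_k)} \;=\; \sqrt{2k}.
\]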
Next we bound E|σ̂² − σ²|. We can write Recalling the definition of σ̂ we have
Taking expectation on both sides of (7.11) we obtain Using (7.10), the last display and the Cauchy-Schwarz inequality to bound E|σ̂ − σ|, we can deduce Collecting the bounds we have obtained in this proof for the four terms comprising the upper bound given in Proposition 7.9, we can conclude that This finishes the proof of Theorem 2.2.
It only remains to prove the following lemma.
Proof. Expanding Var(TV_unnorm(Z)) we get Var(TV_unnorm(Z)) = where in the second step we used the observation that Cov(|Δ_e Z|, |Δ_{e′} Z|) = 0 for all nonadjacent e, e′, i.e., e, e′ which do not share any vertex. Since each edge e is adjacent to finitely many edges (including e itself), we get from (7.12) that Var(TV_unnorm(Z)) ≤ C|E_n| for some universal constant C > 0. The lemma now follows by noting that |E_n| = 2n(n − 1).
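Schematically, the expansion used in this proof reads (writing e ∼ e′ when the edges e and e′ share a vertex, and noting that the variables |Δ_e Z| are identically distributed):
\[
\operatorname{Var}\big(\mathrm{TV}_{\mathrm{unnorm}}(Z)\big)
= \sum_{e, e' \in E_n} \operatorname{Cov}\big(|\Delta_e Z|, |\Delta_{e'} Z|\big)
= \sum_{e \sim e'} \operatorname{Cov}\big(|\Delta_e Z|, |\Delta_{e'} Z|\big)
\;\le\; \sum_{e \sim e'} \operatorname{Var}\big(|\Delta_{e} Z|\big)
\;\le\; C\,|E_n|,
\]
since each edge has only a bounded number of neighbouring edges and Var(|Δ_e Z|) is an absolute constant.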
Proof. To prove the assertion in the lemma, we will first need an intermediate fact. Then for each i ∈ [k] and any fixed choice of rows and columns r_i, c_i in R_i, we have θ_{R_i} ∈ M_fourbdry(m_i, n_i, δ_i, t_i), where Σ_{i∈[k]} t_i² ≤ 1 and (δ_1, …, δ_k) =: δ ∈ ℝ^k_+ satisfies

Let us first show how we can use the above fact to deduce the lemma. To this end, notice that the fact is valid for any choice of rows and columns r_i, c_i in R_i, and hence a natural step would be to optimize this choice by minimizing their ℓ_1 norms. Now consider a θ ∈ T_{K(V*)}(θ*) such that ‖θ‖ ≤ 1. Then for each i ∈ [k], we have (here we are treating θ_c as the corresponding column vector). The first inequality is an application of the Cauchy-Schwarz inequality and the second inequality follows from the minimum-is-less-than-the-average principle. Similarly, one can obtain the row version of these inequalities and together they give us min_{c: c is a column of R_i, r: r is a row of R_i}

Summing the above inequality over all the rectangles R_i ∈ R* we get where we have again used the Cauchy-Schwarz inequality and also the fact that ‖θ‖ ≤ 1. Thus the second assertion of the lemma follows from the first one if we choose r_i and c_i that achieve the minimum in the summands above for each i ∈ [k].

Now let us prove the intermediate fact. Fix i ∈ [k] and consider a generic row r_i of R_i. Treating r_i as a horizontal path in the graph L_n, let us denote its two end-vertices by u and w, with u ∈ ∂_left(R_i) and w ∈ ∂_right(R_i). Now, denoting the vertex in r_i ∩ c_i by v, we see that v occurs between the vertices u and w in the row r_i. Therefore we can write Summing the above inequality over every row in the rectangle R_i gives us By a similar argument applied to the columns of R_i we obtain Summing the previous two displays we get the following inequality: Now, if θ ∈ T_{K(V*)}(θ*) then, as a consequence of Corollary 6.5, we also have An application of Lemma 7.1 (stated and proved in the appendix), together with the last two displays, then directly yields the first assertion of the lemma.
Proof. Using Lemma 6.6 we can write E sup_θ (7.13) where, by a slight abuse of notation, Z always refers to a matrix of independent standard normals with an appropriate number of rows and columns.
At this point, we would like to convert the supremum over δ, t in the nonnegative simplex to a maximum over a finite net of δ, t. We can accomplish this by the following trick. Fix any δ ∈ S_{k,∆(θ*)}. Then we can define a vector q ∈ ℝ^k such that It is clear that q ∈ H_k and q ∈ S_{k,2}. It is also clear that δ_i ≤ q_i ∆(θ*). For a similar reason, for any v ∈ S_{k,1} there exists w ∈ H_k ∩ S_{k,2} such that v_i ≤ w_i for every i ∈ [k].
By the above logic, we can write ⟨Z, θ⟩ (7.14) Since Z is a matrix with i.i.d. N(0, σ²) entries, the first two maxima on the right hand side of the above display can actually be taken outside the expectation up to an additive term. This follows from the well known concentration properties of suprema of Gaussian random variables. In particular, we now apply Lemma 7.2 (stated in Section 7), valid for suprema of Gaussians, to obtain for a universal constant C, To bound the log cardinality log |H_k ∩ S_{k,2}|, note that for any positive integer k, the cardinality |H_k ∩ S_{k,2}| is the same as the number of k-tuples of positive integers summing up to at most 2k. By standard combinatorics, we have for all s ∈ {k, …, 2k − 1}, it follows that for some positive absolute constant C.
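The counting behind this cardinality bound is the standard stars-and-bars computation (stated here in generic form): the number of k-tuples of positive integers summing to exactly j is \binom{j-1}{k-1}, so the number of such tuples with sum at most s is
\[
\sum_{j=k}^{s} \binom{j-1}{k-1} \;=\; \binom{s}{k} \;\le\; \Big(\frac{es}{k}\Big)^{k},
\]
and taking logarithms with s of order k yields a bound of order k.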
Using (7.13), (7.14), (7.15) and the above cardinality bound, we can finally finish the proof by writing

7.6. Proof of Lemma 6.8. We need some intermediate lemmas. We start with the following lemma.
It is worthwhile to stress here that this rectangular partition R * of [m] × [n] is the same for all θ ∈ M fourbdry (m, n, δ, t).
Proof of Lemma 6.8. The proof of Lemma 6.8 follows directly from Lemma 7.11 and the sub-additivity of the Gaussian width functional.
The task now is to prove Lemma 7.11. The proof of Lemma 7.11 is divided into two steps where we state and prove two intermediate lemmas.
In the first step we reduce the number of "active" boundaries, i.e., the number of boundary vectors involved in the bound on total variation, from four to two, and in the second step we reduce them from two to one or zero. The main idea of the proofs is essentially the same as that of Lemma 6.6.
Remark 7.1. While Lemma 7.11 is true for any integers m, n, the reader can safely read on as if m, n are powers of 2. The essential aspects of the proof of Lemma 7.11 all go through in this case, and writing out the general case would make the notation messy. For the sake of clean exposition, we thus write the entire proof when m and n are powers of 2. At the end, we mention the modifications needed when m, n are not powers of 2.
Four to two boundaries.
Lemma 7.12. Take any θ ∈ M_fourbdry(m, n, δ, t). Let us denote the four submatrices obtained by an equal dyadic partitioning of θ (i.e., each submatrix lies in ℝ^{m/2×n/2} and is formed by adjacent rows and columns of θ) as θ_topleft, θ_topright, θ_bottomleft and θ_bottomright, in the obvious order. Then the submatrix θ_ab, where a ∈ {top, bottom} and b ∈ {left, right}, itself satisfies In words, if a matrix θ ∈ M_fourbdry(m, n, δ, t) is dyadically partitioned into four equal sized submatrices, each of these four submatrices lies in M_twobdry(m/2, n/2, δ′, t) where δ′ := δ + 16t(√(m/n) + √(n/m)); furthermore, the boundaries that are active for these submatrices are the ones that they share with θ.
Proof. Since ‖θ‖ ≤ t, there exist 1 ≤ i ≤ m/2 < i′ ≤ m and 1 ≤ j ≤ n/2 < j′ ≤ n such that An application of Lemma 7.1 now finishes the proof of the lemma from the previous two displays.
Two to one or zero boundary.
Let us start by stating the following lemma which one can think of as a version of Lemma 7.12 applied to an element of M twobdry (m, n, δ, t). The proof is very similar and we leave it to the reader to verify.
Lemma 7.13. Let θ ∈ M_ab(m, n, δ, t) for some a ∈ {top, bottom} and b ∈ {left, right}. We can partition θ into four equal sized submatrices θ_topleft, θ_topright, θ_bottomleft and θ_bottomright in the obvious manner such that the submatrix θ_cd, where c ∈ {top, bottom} and d ∈ {left, right}, satisfies In words, if a matrix θ ∈ M_twobdry(m, n, δ, t) is dyadically partitioned into four equal sized submatrices, then each of these four submatrices has at most two active boundaries and a wiggle room of at most δ + 4t(√(m/n) + √(n/m)); furthermore, the active boundaries are the ones that they share with one of the active boundaries of θ.
We are now ready to conclude the proof of Lemma 7.11.
Proof of Lemma 7.11. Recall that we are assuming m, n are powers of 2 for simplicity of exposition.
Step 0: Partition [m] × [n] dyadically into four equal rectangles so that, by Lemma 7.12, θ_S ∈ M_twobdry(m/2, n/2, δ^(0), ‖θ_S‖) for any such rectangle S, where

Step 1: Let S (there are four of them) be a generic rectangle obtained from the previous step. Using Lemma 7.13, we now partition θ_S into four equal parts (rectangles). We then get two matrices in M_onebdry(m/4, n/4, δ^(1), t), one matrix in M_nobdry(m/4, n/4, δ^(1), t) and the remaining one from M_twobdry(m/4, n/4, δ^(1), t). Here,

Steps j ≥ 2: From the last step we get exactly one matrix in M_twobdry(m/4, n/4, δ^(1), t) for each of the 4 rectangles S. For each S, we now recursively use Lemma 7.13 by partitioning this matrix again into four exactly equal parts in a dyadic fashion, and we continue the same procedure with the matrix obtained in each step with two active boundaries until we end up with matrices with only 0 or 1 active boundary.
For each j ≥ 1, define R*_{j,1} as the collection of rectangles R obtained in step j such that θ_R has exactly 1 active boundary. From Lemma 7.13, we know that there are exactly two such rectangles for any given S (from step 0) and therefore |R*_{j,1}| ≤ 8. For any j ≥ 1 and any rectangle R ∈ R*_{j,1}, repeated application of Lemma 7.13 yields that θ_R ∈ M_onebdry(m/2^j, n/2^j, δ^(j), ‖θ_R‖) where Now, defining R*_{j,2} as the collection of rectangles R obtained in step j such that θ_R has no active boundary, we can deduce in a similar way that |R*_{j,2}| ≤ 4. Also, for such rectangles R and j ≥ 1, we have θ_R ∈ M_nobdry(m/2^j, n/2^j, δ^(j), ‖θ_R‖). Finally, notice that Thus the collection of rectangles {R*_{j,k} : j ≥ 1, k ∈ [2]} satisfies all the conditions of Lemma 7.11.
Remark 7.2. For the statement of Lemma 6.8 to hold, the important thing in the proof of Lemma 7.11 is that in every step 1 ≤ j ≤ K, the aspect ratio of the submatrices does not change significantly. The reader can check that at every step, both the number of rows and the number of columns halve, thus keeping the aspect ratio constant. At every step, the dimensions of the submatrices halve and thus decrease geometrically, while the allowable wiggle room increases additively by the factor (which does not change with j) 16t(√(m/n) + √(n/m)).

Remark 7.3. Let us discuss the case when m, n are not necessarily powers of 2 in the proof of Lemma 7.11. The first step of reducing the number of active boundaries from four to two, by applying Lemma 7.12, can be carried out in the same way by splitting at the points ⌈m/2⌉ and ⌈n/2⌉. Next, we come to the stage where we apply Lemma 7.13 to reduce the number of active boundaries from two to one, on the four submatrices obtained from the previous step. Let us denote the dimensions of these 4 submatrices generically by m′, n′. Recall that, in the first step of subdivision, we get exactly one submatrix with 2 active boundaries; the others have 1 or 0 active boundaries. At this step, we can subdivide so that the submatrix with two active boundaries has dimensions which are exactly powers of 2. For instance, we can split at the unique power of 2 between m′/4 and m′/2 in one dimension and do the exact same thing for the other dimension. Once this submatrix with two active boundaries has dimensions which are exactly powers of 2, we can carry out the rest of the steps as in the proof of Lemma 7.11. It can be checked that, in this case, all the inequalities we deduce while proving Lemma 6.8 go through with the possible multiplication by a universal constant.

7.7. Proof of Lemma 6.11.
Proof. In this proof, we abuse notation and write a ≲ b to mean a ≤ C b for some positive universal constant C whose exact value can change from line to line. Since δ ≤ n, Proposition 6.10 implies

7.8. Proof of Proposition 6.10.
Proof. The proof proceeds by the following chain of reasoning. Lemma 6.13 gives an upper bound on log N(A_a, τ) which is free of a. This in turn, in conjunction with Dudley's entropy bound, gives a bound on GW(A_a) which is free of a. This upper bound times log n is thus an upper bound for GW_onebdry(m, n, δ, t), in view of Lemma 6.12. We now do the calculations.
Proof. Let P_0 = {[n]} be the initial (trivial) partition. At every step we take the blocks b_i ∈ P_i for which T(b_i) > ε and divide each such b_i into two equal parts. Let n_i be the number of blocks of the partition P_i and let s_i equal the number of blocks B_i in P_i that are divided to obtain P_{i+1}. Define s_0 = 0. Therefore we have n_{i+1} = n_i + s_i. Note that, due to the superadditivity of T, we must have s_i ≤ t/ε. This implies in particular that n_i ≤ 1 + i t/ε. Now the division scheme can go on for at most N = log₂ n rounds. Therefore we have max_{P ∈ P(t,n,ε,T)} |P_{U;T,ε}| ≤ 1 + (log₂ n) t/ε ≤ log₂(4n)(1 + t/ε).
Further noticing that ε ≤ 1/√m, so that 1 ≤ 1/(√m ε²), we obtain

(7.24) J_{k,ε} ≤ C(log(em))² (1 + 1/(√m ε²)) (t√(m/n) + t + δ)^{2↓}

(recall that x^{2↓} = x + x²). On the other hand we have log(e K_ℓ (m/a²) / (2^{−ℓ/2} ε√m)) ≤ log(Ce K_ℓ √m / (2^{−ℓ/2} ε)) ≤ C log(em), where in the first step we used a ≥ c and in the second step we used 2^ℓ ≤ m and K_ℓ ≤ m (recall their definitions from the proof of Lemma 6.13) and also ε ≥ 1/(4m²). Plugging these bounds into the right hand side of (7.21) and rewriting the expression in terms of L_m(x) = x log(e log(em)² x), we obtain where we used the fact that log(Ce log(em)² x) ≤ C log(e log(em)² x) for all x ≥ 1 and large enough C.
Case 2: K_ℓ ≥ 2^ℓ m ε². Notice that in this case we can choose k = 2^ℓ m ε² and Lemma 6.14 gives us (7.26) log We will show below that the right hand side of (7.24) also serves as an upper bound for J*_{k,ε}, and consequently the upper bound in (7.25) holds in this case as well, thus proving the claim. To this end we will use the bounds (7.19) and (7.20). First observe that the bound on J_{k,ε} is the same as in the previous case, since the only bounds we used there were k ≤ K_ℓ and k ≤ 2^ℓ m ε², both of which are valid in this case. On the other hand, C_{ℓ,k,ε} can be bounded by Since K_ℓ ≥ 2^ℓ m ε² and ε ≤ 1/√m, Similarly we can bound Plugging these bounds into (7.27) we get C_{ℓ,k,ε} ≤ C(log(em))² (1 + 1/(√m ε²)) (t√(m/n) + t + δ)^{2↓}.
where we used the simple fact that x^{3/2} ≤ x^{2↓}. Combined with (7.24) and the discussion preceding display (7.27), this yields a similar upper bound for J*_{k,ε}.
Proof of Lemma 6.14. The proof is split into three parts. In the first part we construct, for any given θ ∈ A(m, n, u, v, t), another matrix θ̃ satisfying ‖θ − θ̃‖ ≤ τ/2 such that θ̃ is piecewise constant on rectangles with as few pieces as possible (recall the brief discussion following the statement of Lemma 6.14). In the second part we compute an upper bound on the number of all possible partitions that one can obtain from any θ ∈ A(m, n, u, v, t) by the construction described in the first part. Finally, in the third part we construct a τ/2 covering set for the family of matrices θ̃, thus forming a τ covering set for A(m, n, u, v, t). We bound the total number of rectangular level sets of θ̃ as well as the total number of possible partitions of [m] × [n] obtained from the first two parts. These two bounds lead to the desired upper bound on N(A(m, n, u, v, t), τ).
Approximating θ by a piecewise constant matrix. This part consists of three steps. In the "zeroth" step, we divide θ equally into k submatrices by horizontal divisions. We do not choose, a priori, any specific value of k, which is the reason why our final bound depends on k. Then, in step 1, each of these submatrices is divided into submatrices by vertical divisions, which are again subdivided in step 2 by horizontal divisions. The rectangles corresponding to these submatrices will be the final level sets of θ̃. We now elaborate on the steps.
Step 0: Horizontal Divisions. Fix a positive integer 1 ≤ k ≤ m and divide θ into k submatrices as follows: where each θ_i has either ⌈m/k⌉ or ⌊m/k⌋ many rows. We want to stress that we use the same partitioning for every θ in this step.
Step 1: Vertical Divisions. Next we want to subdivide each θ_i (where i ∈ [k]) by making j_i many vertical divisions: where TV_row(θ_{i,j}) ≤ τ_k for all j ∈ [j_i] and some τ_k > 0 to be chosen shortly. We can do this by the (TV_row, τ_k) scheme applied to the columns of θ_i, so that Lemma 6.15 gives us the bounds (7.28) j_i ≤ log₂(4n)(1 + TV_row(θ_i)/τ_k). Replacing each element in every row of θ_{i,j} with the corresponding row mean, we then obtain a new matrix By construction, each θ̃_{i,j} has identical columns. Finally, let us define From the Cauchy-Schwarz inequality, it is clear that ‖θ̃‖ ≤ ‖θ‖. One important observation we need to make at this point is that, while this averaging procedure might increase the value of TV_col(θ̃), it does not increase the value of TV_col(θ̃_{i,j}) for any i and j. Indeed, by a computation exactly similar to that performed in (6.13), we get TV_col(θ̃_{i,j}) ≤ TV_col(θ_{i,j}), and the approximation error can be bounded in terms of the quantities TV_unnorm(θ_{i,j}[i′, ·])², as in (7.30), where in the final step we used Lemma 7.3. Since TV_row(θ_{i,j}) ≤ τ_k, we can then deduce that ‖θ − θ̃‖² is at most Σ_{i,j} n_col(θ_{i,j}) τ_k² = nk τ_k². (7.31) Setting τ_k = τ/(4√(nk)), we get ‖θ − θ̃‖ ≤ τ/4.
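For concreteness, the choice τ_k = τ/(4√(nk)) in (7.31) gives
\[
nk\,\tau_k^2 \;=\; nk\cdot\frac{\tau^2}{16\,nk} \;=\; \frac{\tau^2}{16},
\]
whence ‖θ − θ̃‖ ≤ τ/4, as claimed.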
Step 2: Horizontal Divisions. In this step, we are going to make horizontal divisions within each submatrix θ̃_{i,j} obtained from step 1, so that the total variation of the columns of each subdivision is smaller than some fixed, small number. To this end, fix τ_k > 0 whose exact

Henceforth we will denote this number as N_k.
Constructing a τ covering set for A(m, n, u, v, t). It suffices to construct a τ/2 covering set for the family of matrices θ̃ obtained from the first part. Now, ‖θ̃‖ ≤ t implies ‖θ̃‖_∞ ≤ t. Thus, we can construct the covering set C_k in the same way as is done in the proof of Proposition 4.4. In words, we construct a grid with spacing τ/√(mn) in [−t, t] and then round off the values of θ̃ on each rectangle to the nearest point in the grid. It immediately follows from the construction that |C_k| ≤ max_{θ ∈ A(m,n,u,v,t)} (max(t√(2mn)/τ, 1))^{n_piece(θ̃)} · N_k.
Along with (7.35), (7.37) and (7.38), this leads to the first bound on the covering number of A(m, n, u, v, t) as stated in Lemma 6.14. For the second bound, that is, when k = m, recall that the second summand on the right hand side of (7.34) comes from the horizontal division conducted in step 2. Since this step becomes void for k = m, the required bound follows in an exactly similar fashion with J_k replacing J