Conjugate Gradient Iterative Hard Thresholding for Structured Sparsity

Greedy sparse recovery algorithms are studied in the structured sparsity (sparsity in levels) framework. Recovery guarantees are provided for Normalized Iterative Hard Thresholding and Conjugate Gradient Iterative Hard Thresholding in the form of restricted isometry properties for sparsity in levels. Empirical results indicate that CGIHT is comparable to CoSaMP in recovery capability in the structured setting, while maintaining the computational complexity of NIHT. While exploiting structured sparsity improves recovery performance, the pessimism of the theoretical guarantees masks when practitioners should use these algorithms; the empirical results offer guidance on when to prefer the original greedy algorithms over their sparsity in levels counterparts such as CGIHT in Levels.


A. COMPRESSED SENSING AND SPARSE APPROXIMATION
In the simplest noise-free settings of compressed sensing [2], [3] and sparse approximation [4], [5], we start with only the measurements, y ∈ R^m, and knowledge of the measurement operator, A ∈ R^{m×n}. Given no other information, but assuming the measured vector is sparse, we seek the sparsest vector x ∈ R^n which could produce the measurements y. In other words, the compressed sensing problem can be stated as the combinatorial optimization

x = arg min_{z ∈ R^n} ‖z‖_0 subject to y = Az,   (1)

where ‖z‖_0 counts the number of nonzero entries in z. The closely related sparse approximation problem assumes the measured vector is well approximated by a sparse vector, and therefore seeks the best approximation of the measurements when the sensed vector is given a fixed sparsity allowance, namely

x = arg min_{z ∈ R^n} ‖y − Az‖_2 subject to ‖z‖_0 ≤ k.   (2)
Although both problems are NP-hard in general [6], much work has been done to find tractable alternatives to (1) and (2). Under the right conditions [7], [8], [9], [10], one can obtain the solution to (1) from its convex relaxation,

x = arg min_{z ∈ R^n} ‖z‖_1 subject to y = Az.   (3)

Alternatively, many iterative algorithms have been designed to directly conquer (2) [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. One important tool in the analysis of compressed sensing and sparse approximation is the ubiquitous restricted isometry property [9]. A measurement matrix A has the asymmetric restricted isometry constants L_{ck}, U_{ck} when these are the smallest constants so that A acts like a near-isometry on all ck-sparse vectors in the sense that

(1 − L_{ck}) ‖z‖_2^2 ≤ ‖Az‖_2^2 ≤ (1 + U_{ck}) ‖z‖_2^2 for all ck-sparse z ∈ R^n.

The large majority of recovery guarantees come in the form of some restricted isometry property (RIP) where the matrix A must have restricted isometry constants (RICs) satisfying a particular bound. For example, a necessary condition for A to be injective on the set of all k-sparse vectors, and therefore ensure a unique solution to (1) whenever x is k-sparse, is the restricted isometry property L_{2k} < 1.

B. OUTLINE AND CONTRIBUTIONS
The remainder of this introduction briefly outlines the main contributions of the article and some notation used throughout. In Section II, we first recall the general advances in models of structure for compressed sensing (Section II-A).
In Section II-B we describe three specific models and detail how they are essentially equivalent. Greedy algorithms designed to exploit the sparsity in levels model are described in Section II-C; in particular we introduce Conjugate Gradient Iterative Hard Thresholding for sparsity in levels.
The main results of the paper are presented in Section III. In this section, Normalized Iterative Hard Thresholding (NIHT) and Conjugate Gradient Iterative Hard Thresholding (CGIHT) are analyzed using asymmetric restricted isometry constants (aRICs). Theorems 1 and 2 provide guarantees for uniform recovery of signals with the sparsity in levels structure for both NIHT and CGIHT when the measurement operator has appropriately bounded aRICs.
Section IV presents several empirical investigations to complement the theoretical results from Section III. Experiments indicate that the structure exploiting algorithms are bolstered not just by signals having significant structure, but also by the quality of structure information given to the algorithms. Standard and sparsity in levels adapted versions of the algorithms are compared when given incorrect, minimal, partial, or exact structure information on signals with varying degrees of sparsity in levels structure. For signals with a highly structured, dyadic sparsity pattern, the algorithms are compared and recommendations are made on which algorithms are likely to have the most preferable average case behavior.
The proofs of the main results from Section III are provided in Appendix A. These proofs are both an extension of the existing proofs to the sparsity in levels setting and an improvement on the bounds on the aRICs of the measurement operator required to guarantee uniform recovery.

C. NOTATION
For the remainder, index sets will be denoted by a capital letter, such as Q, S, T, while most Greek and script letters will indicate other types of sets. Standard lower case letters will indicate integers, for example c, k, m, n, r. Vectors will be written in bold, x, y, w, e, with some special bold face letters k = (k_1, . . . , k_r), n = (n_1, . . . , n_q) representing ordered tuples of integers. When a vector is partitioned into smaller portions of the vector, these portions are indicated with a subscript, e.g. x = [x_1 | x_2 | · · · | x_r]. Superscript indices distinguish vectors from one another so that x^1 and x^2 are two different vectors, often with the superscript identifying the vector associated with an iteration of an algorithm. Matrices will be given a capital letter, such as A, and it will be clear from context that this is not a set. For a vector w ∈ R^n and an index set T, the hard threshold of w onto T is defined by ((w)_T)_i = w_i for i ∈ T and ((w)_T)_i = 0 for i ∉ T. For a matrix A, the restriction of A to the columns indexed by T will be denoted A_T. Also clear from context, |α| is the absolute value of the scalar α, |w| is the vector where the absolute value is taken componentwise, and |Q| is the number of elements in a discrete set Q.
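For concreteness, the two restriction operations above can be written in a few lines of numpy; this is a small illustrative sketch with our own function names, not code from the paper's software.

```python
import numpy as np

def hard_threshold(w, T):
    """(w)_T : keep the entries of w indexed by T and set the rest to zero."""
    out = np.zeros_like(w)
    out[T] = w[T]
    return out

def restrict_columns(A, T):
    """A_T : the submatrix of A containing only the columns indexed by T."""
    return A[:, T]
```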

II. STRUCTURED SPARSITY
In many applications, we know more than just the measurements y and the operator A; we also know some structure of the information. For example, we may know that x is the vector of wavelet coefficients of a signal s and therefore understand a dyadic structure of the important information in x. This additional structure information can be exploited to improve the reconstruction process.

A. UNIONS OF SUBSPACES
A general model of sparsity was put forward in [4] where the notion of sparsity was presented as having a sparse representation in a union of subspaces. Within the compressed sensing framework, the union of subspaces model was formalized and advanced in a large body of work including [21], [22], [23], [24].

Definition 1 (Union of Subspaces Model): A union of subspaces model is a set Σ = ∪_i χ_i ⊂ R^n formed as a union of subspaces χ_i ⊂ R^n; a vector x is sparse with respect to the model if x ∈ Σ.
In [21], Lu and Do provide several examples of such signal models including streams of Diracs in finite rate of innovation, piecewise polynomials, sparse representations in a dictionary, overlapping echoes, and signals with unknown spectral support. The authors study when a sensing operation is invertible and stable by analyzing a Gram matrix associated with the sampling and reconstruction schemes. The stability of the operators on a union of subspaces model is detailed in language equivalent to an asymmetric restricted isometry property.
In [22], Blumensath and Davies give an explicit definition for restricted isometry constants on a finite union of subspaces. They employ these RICs to describe both the injectivity of a linear mapping and a Lipschitz continuity condition (stability) on unions of k-dimensional subspaces. Importantly, [22] establishes that, with high probability, matrices with subgaussian entries have bounded RICs for the union of subspaces models.
In [23], the iterative greedy algorithms Iterative Hard Thresholding (IHT) [13], [14] and Compressive Sampling Matching Pursuit (CoSaMP) [16] were analyzed by Baraniuk et al. for the union of subspaces model. In this very general setting, recovery guarantees are established for both algorithms in terms of a restricted isometry property. The generality of the setting raises some complications those authors handle with additional conditions on the measurement operator. For example, since the union of subspaces may not be closed under addition, the RICs must be extended to sums of vectors from the union of subspaces. In [24], Hegde et al. establish the need for an exact projection method onto the model and further analyze IHT and CoSaMP in this setting.
One obviously important example of a union of subspaces is the standard compressed sensing regime (1). Consider the k-sparse vectors Σ_{k,n} = ∪_i χ_i, where each subspace χ_i = {x ∈ R^n : supp(x) ⊂ Q_i} is determined by a fixed index set Q_i ⊂ {1, . . . , n} with |Q_i| = k. A second example, relevant to the following discussion, is the well studied block sparsity. Suppose x ∈ R^n is partitioned into r blocks x_i each of length n_i, so that x = [x_1 | x_2 | · · · | x_r] ∈ R^n with n = Σ_{i=1}^r n_i. The vector x is s-block sparse if at most s of the blocks are nonzero and the remaining r − s blocks are zero. Typically, in this setting, n_i = n/r is constant. This model, too, can be expressed as a union of subspaces. Fix an s-block sparse vector x̄ whose support Q̄ ⊂ {1, . . . , n} is no larger than k̄ = Σ_{i∈Ω} n_i, where Ω = {j : x̄_j ≠ 0} indexes the nonzero blocks. When the blocks have equal sizes, we have k̄ = (s/r) n, and when the sizes vary there is clearly a maximum k̄ for the model. In either case, the vector x̄ lies in the k̄-dimensional subspace χ_{Q̄} = {x ∈ R^n : supp(x) ⊂ Q̄}, and the set of all s-block sparse vectors is the union of all possible such k̄-dimensional subspaces.

B. SPARSITY IN LEVELS
Recently, much interest has been given to a few highly-related, specific classes of union of subspaces models. For example, the iterative greedy algorithms IHT, CoSaMP, and Hard Thresholding Pursuit (HTP) [18] have recently been studied in the following models. With some notational and organizational choices, we present these models as defined in the respective articles.

1) SPARSITY IN LEVELS [25], [26]
Let z ∈ R^n. For r levels, define the sparsity levels M = (M_1, . . . , M_r) with 0 = M_0 < M_1 < · · · < M_r = n and the local sparsities κ = (κ_1, . . . , κ_r). Then z is (κ, M)-sparse in levels if z has at most κ_i nonzero entries in each level {M_{i−1} + 1, . . . , M_i}.
The sparsity in levels model provides theoretical foundations for the use of practical compressed sensing processes such as using Fourier measurements of images sparse in the wavelet domain [25], [27]. By capturing the structure of these imaging processes, the sparsity in levels model permits theory that explains high quality sparse approximation performance even for imaging techniques that fail to satisfy assumptions from the standard sparsity model. There is now a robust set of theoretical advances along these lines [25], [26], [27], [28], [29].
2) HIERARCHICAL SPARSITY
When κ_i = κ is fixed, the hierarchical sparsity model has also been studied as bilevel feature selection in [33]; iterative thresholding algorithms for bilevel feature selection were studied in [34], [35].
3) SPARSITY WITH STRUCTURED SUPPORT [36], [37]
Let z ∈ R^n and let Q_1, . . . , Q_r ⊂ {1, . . . , n} be disjoint index sets. Then z is κ-sparse with structured support if there exist local sparsities κ_1, . . . , κ_r with κ = Σ_{i=1}^r κ_i and z has at most κ_i nonzeros when restricted to the index set Q_i.
While the different points of view can be helpful, these three sparsity models are essentially equivalent. Clearly the sparsity in levels model is a specific instance of a structured support. Also, knowledge of the structured support allows us to consider a vector as being sparse in levels after applying a known permutation matrix. Finally, there is a one-to-one mapping of hierarchically sparse vectors z = (z_1, z_2, . . . , z_r) ∈ R^{n_1} × · · · × R^{n_r} with ‖z_i‖_0 ≤ κ_i to sparse in levels vectors x = [x_1 | x_2 | · · · | x_r] ∈ R^n with n = Σ_{i=1}^r n_i. To require that only s blocks x_i of a sparse in levels vector x be nonzero, we simply set κ_j = 0 for all but s values of j ∈ {1, . . . , r}.
A helpful and common property linking these models together is the ability to explicitly define the index sets for the support from the structured support formulation.
Definition 2 (Structured Support Model): We say a union of k-dimensional subspaces Σ = ∪_i χ_i ⊂ R^n is a structured support model if for any vector x ∈ Σ there exists an index set Q ⊂ {1, . . . , n} such that x = (x)_Q and |Q| ≤ k.
One advantage of restricting our attention to structured support models is the simple way we can deal with sums of elements from such models. While the more general models of [21], [22], [23], [24] require careful tracking of direct sums of subspaces, direct sums are simply captured by a union of index sets in a structured support model.
Lemma 1: Let Σ be a structured support model and let z^1, . . . , z^c ∈ Σ with z^i = (z^i)_{Q_i} and |Q_i| ≤ k for each i. Then z^1 + z^2 + · · · + z^c = (z^1 + z^2 + · · · + z^c)_Q for Q = ∪_i Q_i and |Q| ≤ ck.
This observation allows us to generalize the definitions of restricted isometry properties and mimic the analysis of algorithms from the standard k-sparse model Σ_{k,n} with essentially no additional effort. Now, in order to avoid introducing yet another name for these equivalent models, we adopt the sparsity in levels language to support further understanding of the variety of sensing and sparsity combinations that perform well in practice.
Definition 3 (Sparsity in Levels Model [25], [26]): Let x ∈ R^n. For each i ∈ {1, . . . , r} define the index sets S_i = {n_{i−1} + 1, . . . , n_i} ⊂ {1, . . . , n} where 0 = n_0 < n_1 < n_2 < · · · < n_r = n, and let k_i ∈ N. Define the levels n = (n_1, . . . , n_r) and the local sparsities k = (k_1, . . . , k_r). Then x is (k, n)-sparse in levels if x has at most k_i nonzero entries on each level S_i. We refer to k = Σ_{i=1}^r k_i as the (total) sparsity of x.

The ability to view each of these models as structured support models allows us to consider them as a union of subspaces. Fix a set of local sparsities and sparsity levels (k, n), let k = Σ_{i=1}^r k_i, and let Σ_{(k,n)} ⊂ R^n denote the set of all (k, n)-sparse in levels vectors. For each vector x ∈ Σ_{(k,n)}, its support is denoted by a set Q ⊂ {1, . . . , n} with k_i indices in each of the partitions S_i = {n_{i−1} + 1, . . . , n_i}; we refer to any such Q as a (k, n)-sparse support set. These models benefit from the structure of the support sets within the union of subspaces. Suppose we have fixed the dimension n, the levels n, and the local sparsities k for the union of subspaces Σ_{(k,n)}. For any two (k, n)-sparse support sets Q and S, the union Q ∪ S also has structure; in each partition S_i the union contains at most 2k_i indices, and therefore Q ∪ S is contained in a (2k, n)-sparse support set. Notice that (k, n)-sparse vectors are k-sparse vectors, i.e. Σ_{(k,n)} ⊂ Σ_{k,n} for k = Σ_{i=1}^r k_i. As a result, the theory from standard compressed sensing on sparse vectors and from model based compressed sensing applies directly to the sparsity in levels setting. By exploiting the structure of sparse in levels vectors, algorithms are able to recover vectors with a higher total sparsity using the same number of measurements [26], [32], [37], while the model also provides the theoretical foundation for recovery success in Fourier imaging and other coherent measurement processes [25], [27], [28].
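As a concrete illustration of Definition 3, the following minimal numpy sketch (with illustrative names of our own) tests membership in Σ_{(k,n)} by counting nonzeros level by level; the example values in the comment are taken from the experiments in Section IV.

```python
import numpy as np

def is_sparse_in_levels(x, levels, local_k):
    """Return True if x has at most local_k[i] nonzeros on each level, where
    levels = (n_1, ..., n_r) are the right endpoints of the partitions
    {n_{i-1}+1, ..., n_i} and local_k = (k_1, ..., k_r)."""
    start = 0
    for end, k_i in zip(levels, local_k):
        if np.count_nonzero(x[start:end]) > k_i:
            return False
        start = end
    return True

# Example: four equal levels of length 512 in R^2048 with local sparsities (192, 64, 192, 64)
# is_sparse_in_levels(x, (512, 1024, 1536, 2048), (192, 64, 192, 64))
```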

C. GREEDY ALGORITHMS FOR STRUCTURED SPARSITY
For A ∈ R^{m×n} and x ∈ Σ_{k,n}, the measurements y = Ax have at least an (n − m)-dimensional preimage {z ∈ R^n : y = Az}. Therefore, the sparsity cost function in (1) or the constraint in (2) helps limit the search for the vector x. In fact, once the support of x is known, the reconstruction problem is overdetermined and can be solved by projection. As such, the greedy approach to solving (1) and (2) is to find the supporting subspace of x.
For the standard k-sparse model, the best supporting subspace of a vector w ∈ R^n is the space indexed by the largest magnitudes of the entries in w. Similarly, to approximate w with a vector with local sparsities k and sparsity levels n, we need the indices of the largest magnitudes of w on each level. We define DetectSupport_k(w) to be the index set of the k largest entries of |w|. Furthermore, to find the best supporting set from the (k, n) model, we let DetectSupport_{(k,n)}(w) be the index set which, for each i = 1, . . . , r, contains the indices of the k_i largest magnitudes of w on the level S_i = {n_{i−1} + 1, . . . , n_i}. In other words, we have

DetectSupport_{(k,n)}(w) = ∪_{i=1}^r DetectSupport_{k_i}((w)_{S_i}).

The basic outline of an iterative greedy algorithm is to conduct a line search update, identify the best support set after the update, and then restrict the update to this supporting space either via hard threshold or projection. For example, if the current k-sparse approximation is x^{l−1}, the most basic iterative hard thresholding uses the residual r^{l−1} = A*(y − Ax^{l−1}) as the search direction, so that the next support set and iterate are

T^l = DetectSupport_k(x^{l−1} + r^{l−1}),   (9)
x^l = (x^{l−1} + r^{l−1})_{T^l}.

A two-stage algorithm, such as CoSaMP, introduces an intermediate optimization problem over a (possibly larger) subspace:

T^l = supp(x^{l−1}) ∪ DetectSupport_{2k}(r^{l−1}),   (11)
w^l = arg min { ‖y − Az‖_2 : supp(z) ⊆ T^l },
x^l = (w^l)_{DetectSupport_k(w^l)}.   (13)

Both of these algorithms have been extended to the sparsity in levels setting [26], [37] by simply utilizing DetectSupport_{(k,n)}(x^{l−1} + r^{l−1}) in (9), DetectSupport_{(2k,n)}(r^{l−1}) in (11), and DetectSupport_{(k,n)}(w^l) in (13). In fact, in [26], the study includes empirical performance comparisons with normalized iterative hard thresholding [15], whereby the line search uses a step size selected in each iteration so that, if T^{l−1} is the correct support set, then NIHT will take the optimal steepest descent step on that space. The authors pose as a future problem the need to analyze NIHT in the sparsity in levels paradigm.
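The following numpy sketch illustrates the level-aware support detection and a simplified NIHT-style iteration; it uses our own function names and is an illustration of the ideas above rather than a reproduction of Algorithm 1.

```python
import numpy as np

def detect_support_levels(w, levels, local_k):
    """DetectSupport_(k,n): for each level, keep the indices of the local_k[i]
    largest magnitudes of w on that level.

    levels  : right endpoints n_1 < ... < n_r = len(w)
    local_k : local sparsities k_1, ..., k_r
    """
    support = []
    start = 0
    for end, k_i in zip(levels, local_k):
        idx = np.arange(start, end)
        top = idx[np.argsort(np.abs(w[idx]))[::-1][:k_i]]  # k_i largest magnitudes
        support.extend(top.tolist())
        start = end
    return np.array(sorted(support), dtype=int)

def niht_levels_step(A, y, x, levels, local_k):
    """One simplified NIHT-style iteration adapted to sparsity in levels:
    steepest-descent step with a normalized step size, then hard threshold
    onto the best (k, n)-sparse support of the update."""
    r = A.T @ (y - A @ x)                       # residual search direction
    T = np.flatnonzero(x) if np.any(x) else detect_support_levels(r, levels, local_k)
    rT = np.zeros_like(r); rT[T] = r[T]
    denom = np.linalg.norm(A @ rT) ** 2
    alpha = np.dot(rT, rT) / denom if denom > 0 else 1.0   # normalized step size
    w = x + alpha * r
    T_new = detect_support_levels(w, levels, local_k)
    x_new = np.zeros_like(x); x_new[T_new] = w[T_new]      # hard threshold onto T_new
    return x_new
```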
Given that NIHT was seen to be competitive on k-sparse vectors [15], [38] and was then observed to also perform admirably in the sparsity in levels setting [26], we provide the extension of the theoretical recovery guarantees for NIHT.
Algorithm 1: NIHT.

Moreover, as the large scale empirical testing in [38] led to the development of CGIHT and CGIHT Restarted by exploiting the advantages of multiple iterative greedy algorithms, we also provide the extension of CGIHT Restarted to the sparsity in levels model. Similar to the theory in the standard compressed sensing model, the theoretical recovery guarantee has only been proven for CGIHT Restarted, wherein the conjugate gradient structure is restarted when the support set changes. However, the empirical performances of both CGIHT and CGIHT Restarted are competitive with CoSaMP in terms of which vectors are recovered, while both have a significantly reduced computational complexity comparable to that of NIHT. The algorithm CGIHT is the same as Algorithm 2, except the orthogonalization weight is never set to zero in Step 2.
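In the same spirit, the restarting logic of CGIHT Restarted can be sketched as follows, reusing detect_support_levels from the previous sketch. The orthogonalization weight and step size below follow standard conjugate gradient conventions (conjugacy of successive search directions and an exact line search on the current support); they are stand-ins for, not reproductions of, the exact formulas of Algorithm 2.

```python
import numpy as np

def cgiht_restarted_step(A, y, x, p_prev, T_prev, levels, local_k):
    """One CGIHT Restarted-style iteration for sparsity in levels (sketch).
    When the detected support changes, the orthogonalization weight is set
    to zero, restarting conjugate gradient with the residual direction."""
    r = A.T @ (y - A @ x)                                   # residual
    T = detect_support_levels(x, levels, local_k) if np.any(x) \
        else detect_support_levels(r, levels, local_k)      # current support set

    restart = (p_prev is None) or (not np.array_equal(T, T_prev))
    if restart:
        beta, p = 0.0, r.copy()                 # support changed: restart with residual
    else:
        Ap, Ar = A[:, T] @ p_prev[T], A[:, T] @ r[T]
        beta = -np.dot(Ar, Ap) / np.dot(Ap, Ap) # keep successive directions conjugate
        p = r + beta * p_prev

    Ap_T = A[:, T] @ p[T]
    alpha = np.dot(r[T], p[T]) / max(np.dot(Ap_T, Ap_T), 1e-30)  # line search on T
    w = x + alpha * p                                            # update vector
    T_new = detect_support_levels(w, levels, local_k)
    x_new = np.zeros_like(x)
    x_new[T_new] = w[T_new]                                      # hard threshold
    return x_new, p, T_new
```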

III. RECOVERY GUARANTEES FOR SPARSITY IN LEVELS
As mentioned in Section I, most recovery guarantees in compressed sensing and sparse approximation come in the form of a restricted isometry property. The analysis and results typically look cleaner when a single, symmetric restricted isometry constant is used, though these cleaner results often mask the important asymmetric scaling impacts of a matrix. In this paper, we utilize the more general asymmetric restricted isometry constants.
Definition 4 (Asymmetric Restricted Isometry Constants [21], [29], [39]): Let A be an m × n matrix and fix levels n and local sparsities k. The asymmetric (k, n)-restricted isometry constants L((k, n), A) and U((k, n), A) are the smallest constants such that

(1 − L((k, n), A)) ‖x‖_2^2 ≤ ‖Ax‖_2^2 ≤ (1 + U((k, n), A)) ‖x‖_2^2

for all (k, n)-sparse vectors x ∈ Σ_{(k,n)}.

In [26], Adcock et al. pose as a future problem the need to analyze NIHT for the sparsity in levels problem. The first step in doing so is to bound the step sizes from each iteration. Fortunately, any matrix with appropriate RICs will admit stable behavior of the step sizes.
Lemma 2 ([40]): Let k and n define local sparsities and sparsity levels and fix the sparse in levels model Σ_{(k,n)}. Suppose A ∈ R^{m×n} has (k, n)-restricted isometry constants L_k := L((k, n), A) and U_k := U((k, n), A). Let α^l be the step size generated in iteration l + 1 of NIHT, Algorithm 1, Step 2. Then

1 / (1 + U_k) ≤ α^l ≤ 1 / (1 − L_k).   (17)

Proof: Fix the iteration l + 1 in Algorithm 1 and let r^l = A*(y − Ax^l), where y is the input measurement vector and x^l ∈ Σ_{(k,n)} is the approximation from the previous iteration. Let T^l be the (k, n)-sparse support set of x^l. Then, restricting to T^l, we see that (r^l)_{T^l} ∈ Σ_{(k,n)}. From Definition 4, we have

(1 − L_k) ‖(r^l)_{T^l}‖_2^2 ≤ ‖A (r^l)_{T^l}‖_2^2 ≤ (1 + U_k) ‖(r^l)_{T^l}‖_2^2,

which is equivalent to (17) since α^l = ‖(r^l)_{T^l}‖_2^2 / ‖A (r^l)_{T^l}‖_2^2. The proof of Lemma 2 is nearly identical to that in [40], with the only change being the RICs on the structured support model Σ_{(k,n)}. In the appendix, we combine the proofs of [41, Thm. 6.18] and [40, Thm. 2] to obtain the guarantee for NIHT. This proof for the sparsity in levels model includes the standard sparsity result when considering only a single, k-sparse level.
Theorem 1 (NIHT [40]): Let k and n define local sparsities and sparsity levels and fix the sparse in levels model Σ_{(k,n)}. Suppose A ∈ R^{m×n} has (ck, n)-restricted isometry constants L_{ck} = L((ck, n), A) and U_{ck} = U((ck, n), A) for c = 1, 2, 3. Let y = Ax + e for some x ∈ Σ_{(k,n)} and e ∈ R^m an additive misfit. Define

μ = √3 (L_{3k} + U_{3k}) / (1 − L_k)

and let ξ denote the corresponding stability constant given in the proof in Appendix A. If

L_k + √3 (L_{3k} + U_{3k}) < 1,   (20)

then (20) ensures μ < 1 and

‖x − x^l‖_2 ≤ μ^l ‖x − x^0‖_2 + (ξ / (1 − μ)) ‖e‖_2,   (21)

where x^l is the lth iterate of NIHT.

CGIHT performs a conjugate gradient projection on a given support set. When the algorithm decides the support should change, CGIHT Restarted uses the residual as the search direction, thereby restarting the conjugate gradient method. This assures the conjugate orthogonality of the search directions while on a fixed support. This orthogonality of the search directions also admits an analysis of the algorithm which calls upon several conjugate gradient properties. The first such situation relates the residual to the conjugate search direction and allows us to bound the step size and orthogonalization weights, as we did with NIHT.
Lemma 3 ([19]): Let k and n define local sparsities and sparsity levels and fix the sparse in levels model Σ_{(k,n)}. Suppose A ∈ R^{m×n} has (k, n)-restricted isometry constants L_k := L((k, n), A) and U_k := U((k, n), A). Let α^l be the step size and β^l be the orthogonalization weight generated in Steps 4 and 2, respectively, of iteration l + 1 of CGIHT Restarted, Algorithm 2. Then α^l satisfies 1/(1 + U_k) ≤ α^l ≤ 1/(1 − L_k) and β^l is bounded in terms of L_k and U_k, (22).

Proof: Similar to the proof bounding the step sizes in NIHT, this proof directly follows the existing argument in [19, Lemma 4.2], but with the support sets and RICs having the sparsity in levels structure. When the support changes from one iteration to the next, i.e. T^l ≠ T^{l−1}, the step size α^l is the same as in NIHT and is bounded directly as in the proof of Lemma 2. When the support set is unchanged from one iteration to the next, T^l = T^{l−1}, CGIHT Restarted uses step sizes and search directions identical to those in a conjugate gradient method restricted to T^l. Therefore, the various properties of the conjugate gradient search directions yield two important inequalities, which provide a straightforward path to the bounds on α^l in (22).

Now, the orthogonalization weight β^l requires a little more finesse and the use of Lemma 5, stated and proved in the appendix. To begin with, we note that the definition of the orthogonalization weight β^l has an equivalent form [42, p. 35], (26), whose numerator involves the restricted residual (r^l)_{T^l} since r^l = A*(y − Ax^l). From the definition of the search direction, we see that r^{l−1} = p^{l−1} − β^{l−1} p^{l−2}. Using this description of r^{l−1}, we can use (27) to express the current residual (r^l)_{T^l} in terms of the two prior search directions, (28). The entire point of the conjugate gradient search is that the search directions are conjugate orthogonal, (29). Therefore, we can use (28) and (29) to update the numerator of (26) as in (30). Finally, applying the Cauchy-Schwarz inequality and Lemma 5 to the numerator in (30), while invoking the definition of the lower asymmetric RICs in the denominator, we obtain the bound on β^l in (22).

Having bounded the step sizes and orthogonalization weights using restricted isometry constants, we can provide a sufficient restricted isometry property for guaranteed recovery of sparse in levels vectors using CGIHT Restarted.
Theorem 2 (CGIHT [19]): Let k and n define local sparsities and sparsity levels and fix the sparse in levels model Σ_{(k,n)}. Suppose A ∈ R^{m×n} has (ck, n)-restricted isometry constants L_{ck} = L((ck, n), A) and U_{ck} = U((ck, n), A) for c = 1, 2, 3. Let y = Ax + e for some x ∈ Σ_{(k,n)} and e ∈ R^m an additive misfit. Define the convergence factor μ and the stability factor ξ in terms of these restricted isometry constants as in the proof in Appendix A. If the restricted isometry condition (34) on L_{ck} and U_{ck} is satisfied, then μ < 1 and

‖x − x^l‖_2 ≤ μ^l ‖x − x^0‖_2 + (ξ / (1 − μ)) ‖e‖_2,

where x^l is the lth iterate of CGIHT Restarted.

Theorem 2, proven in the appendix, provides a worst case recovery guarantee for CGIHT Restarted on the sparsity in levels model. Since many families of random matrices have well-behaved restricted isometry constants, this result and the analogous results for NIHT (Thm. 1) and CoSaMP ([26]) give us confidence in using the algorithms with random measurement operators A. As pointed out in [19], when the support set changes, restarting the search direction with the residual (Algorithm 2, Step 2) permits us to exploit conjugate gradient properties in the proof. However, the algorithm CGIHT (Algorithm 2 without the restarting in Step 2) is highly effective in practice even though it lacks a worst case recovery guarantee.
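Because the aRICs of Definition 4 are defined through an extremal property over combinatorially many support sets, they cannot be computed exactly at realistic sizes. The following numpy sketch (with illustrative names of our own, not from the paper's software) samples random (k, n)-sparse support sets to produce empirical lower bounds on L((k, n), A) and U((k, n), A), which can lend informal confidence that a given random measurement operator behaves well on the model.

```python
import numpy as np

def sample_aric_bounds(A, levels, local_k, samples=200, rng=None):
    """Monte Carlo lower bounds on the asymmetric (k, n)-RICs of A.
    Samples random (k, n)-sparse support sets S and tracks the extreme
    eigenvalues of A_S^* A_S; the true constants are at least this large."""
    rng = rng or np.random.default_rng(0)
    L_est, U_est = 0.0, 0.0
    for _ in range(samples):
        S, start = [], 0
        for end, k_i in zip(levels, local_k):
            S.extend(rng.choice(np.arange(start, end), size=k_i, replace=False))
            start = end
        if not S:
            continue
        G = A[:, S].T @ A[:, S]            # Gram matrix on the sampled support
        eigs = np.linalg.eigvalsh(G)       # ascending eigenvalues
        L_est = max(L_est, 1.0 - eigs[0])
        U_est = max(U_est, eigs[-1] - 1.0)
    return L_est, U_est
```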

IV. PERFORMANCE COMPARISONS FOR SPARSITY IN LEVELS
As is now well-known, the average case performance of greedy algorithms for compressed sensing and sparse approximation far exceeds the performance guaranteed by the theory based on restricted isometry constants. In this section, we examine the performance of CGIHT, CGIHT Restarted, CSMPSP (an implementation of CoSaMP [38], [43]), and NIHT; we examine both the standard algorithm and its sparsity in levels adaptation (indicated by appending "L" to the name, such as CGIHTL or CGIHTL Restarted). With some reorganization, we perform experiments set up similarly to those conducted by Adcock et al. [26], but now including the CGIHT algorithms to compare with NIHT and CSMPSP, the two top performing algorithms from [26].
The following empirical performance comparisons show that CGIHTL and CGIHTL Restarted have an average case performance akin to CSMPSPL. While Theorems 1 and 2 with other results from the literature, e.g. [26], [32], [37], ensure that these sparsity in levels algorithms can recover vectors with structured support, the theory does not provide insight into how the degree of structure or the quality of structure information provided to the algorithms impact performance. The empirical average case performance presented in this section provides such insight to enhance our understanding of the algorithms.
We then use phase transitions and an algorithm selection map based on recovery times to identify which algorithm to use when facing a particular problem; with computational complexity similar to NIHT(L) but recovery performance similar to CSMPSP(L), a variant of CGIHT is almost always the algorithm of choice. Moreover, we observe that for problems where both the standard algorithm and sparsity in levels version of the algorithm will succeed, the additional cost of the levels aware algorithms makes them less desirable and the standard algorithm should be used. While the clear goal of incorporating structure into the algorithms is to improve recovery performance, the pessimism of uniform recovery guarantees masks when practitioners should switch to the more computationally expensive levels adapted algorithms. The algorithm selection map, timings, and timing ratios of Section IV-E provide example information for making such a choice.

A. EMPIRICAL TESTING ENVIRONMENT
These experiments were conducted on a Linux machine with Intel XEON E5-2643 CPUs @ 3.30 GHz running Matlab R2018a. In all experiments the measurement matrix A is an m × n Gaussian matrix with entries drawn from the normal distribution N (0, 1/ √ m). The observed vectors are generated by first fixing the levels structure n = (n 1 , . . . , n r ) and local sparsities k = (k 1 , . . . , k r ). For each i = 1, . . . , r, a random support of size k i is chosen from the index set S i = {n i−1 + 1, . . . , n i } where n 0 = 0. The random support is then filled with entries chosen with equal probability from {−1, 1}. Similar experiments with similar results are obtained when selecting the entries from other distributions, but the equal magnitude entries are known to provide the greatest challenge for iterative greedy algorithms [38], [44].
The algorithms are implemented using the non-parallelized, Matlab versions of the implementations from the software GAGA [43], adapted to utilize sparsity in levels structure information. The algorithms have several optimized stopping conditions as outlined in [43], though most stopping conditions are enacted only when the algorithm fails. When the algorithm succeeds, the stopping condition typically invoked is that the relative error per iteration is below 10^{−4}. In any case, when the algorithm terminates, it returns an approximation x̂ to the observed vector x. The vector x̂ is considered a successful recovery of x when ‖x − x̂‖_2 < 10^{−3} ‖x‖_2.
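A minimal numpy sketch of the problem instance generation and success criterion described above follows; the names are illustrative, and we interpret N(0, 1/√m) as a Gaussian with standard deviation 1/√m so that the columns of A have unit norm in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_problem(m, n, levels, local_k):
    """One problem instance: Gaussian A scaled by 1/sqrt(m) and a sparse in
    levels vector with +/-1 entries on a random support within each level."""
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    x = np.zeros(n)
    start = 0
    for end, k_i in zip(levels, local_k):
        support = rng.choice(np.arange(start, end), size=k_i, replace=False)
        x[support] = rng.choice([-1.0, 1.0], size=k_i)
        start = end
    return A, x, A @ x

def is_success(x, x_hat, tol=1e-3):
    """Recovery criterion used in the experiments."""
    return np.linalg.norm(x - x_hat) < tol * np.linalg.norm(x)
```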
B. SIGNIFICANCE OF STRUCTURE
As indicated above, each problem instance of an experiment creates a vector with a sparsity in levels structure indicated by the levels n and the local sparsities k. The algorithms are provided a sparsity in levels structure with levels ñ and local sparsities k̃. In [26] the results of their experiments are organized by algorithm and include multiple experiments on a single plot. Here, the results will be presented by experiment with multiple algorithms included on the same plot. The data presented here is for n = 2048, and for Figs. 1-4 we have k = 512. Empirical testing with various values of n and k yields results painting the same overall picture.

FIGURE 1. The levels are n^1 = (512, 1024, 1536, 2048). The local sparsities are (a) k^1 = (192, 64, 192, 64), (b) k^2 = (256, 0, 256, 0), (c) k^2 = (256, 0, 256, 0). The levels algorithms receive (a) exact sparsity in levels structure (k^1, n^1), (b) partial sparsity in levels information ñ = (512, 2048), k̃ = (256, 256), and (c) exact sparsity in levels structure (k^2, n^1).

In Fig. 1(a) the vector has a moderate structure (k^1, n^1) and the algorithms are given the exact structure information with k̃ = k^1, ñ = n^1. Even with only a moderate structure in the vectors, the algorithms recover the vectors with fewer measurements when given the exact information. In Fig. 1(b) and (c), the input vectors are created with highly structured local sparsities k^2. In (b), the levels aware algorithms are given partial information: a two level structure ñ = (512, 2048) with k̃ = (256, 256). Even with missing information, the levels aware algorithms still outperform the standard algorithms, although modestly. In (c), when given the exact sparsity, k̃ = k^2 and ñ = n^1, the levels aware algorithms significantly outperform the standard algorithms. These three experiments indicate that both the significance of the structure in the vectors and the accuracy of the information provided to the algorithms impact the performance of the sparsity in levels versions of the algorithms. While this matches the observations in [26], here we also see that CGIHTL and CGIHTL Restarted are competitive with CSMPSPL and clearly outperform NIHTL.

C. QUALITY OF ALGORITHM KNOWLEDGE
In the following experiments, we demonstrate the impact of the quality of information provided to the algorithms. First, we observe that when a sparsity in levels adapted algorithm is provided incorrect information about the structure of the vector, the algorithm fails to recover the sparse in levels vector. The experiment depicted in Fig. 2 has vectors from Σ_{(k^1, n^1)} where n^1 = (512, 1024, 1536, 2048) has four equal sized levels and the local sparsities are k^1 = (192, 64, 192, 64) as before. However, in this experiment, the algorithms are given the incorrect sparsity in levels structure ñ = (512, 2048) with k̃ = (256, 256). Since the vectors drawn from Σ_{(k^1, n^1)} cannot have 256 nonzeros in the first 512 entries, the levels aware algorithms CGIHTL, CGIHTL Restarted, NIHTL, and CSMPSPL all fail to recover the vectors. Note that the standard algorithms are unencumbered by the incorrect information and recover the (k^1, n^1)-sparse in levels vectors.

In Fig. 3(a), only minimal sparsity in levels information is provided to the algorithms.
In the next experiment, we still provide the algorithms with only a four level structure, but now the structure more closely matches the actual structure of the input vectors. In Fig. 3(b), the algorithms are provided with the levelsñ = (256, 1024, 1280, 2048) and local sparsitiesk = (128, 128, 128, 128). With this limited information, the levels aware algorithms use the knowledge to modestly outperform their standard counterparts.
Finally, when the algorithms are provided the exact eight level sparsity in levels structure ñ = n^3 and k̃ = k^3, we see in Fig. 3(c) that the algorithms are able to exploit this knowledge to considerably outperform the naive standard versions of the algorithms.
These four experiments show that the sparsity in levels adapted algorithms are highly dependent on the quality of the structure information provided to the algorithm. Moreover, we see that for all experiments, the CGIHT family of algorithms has performance comparable to CSMPSP.

D. DYADIC STRUCTURE
One important idea behind structured sparsity [23] and the motivation for work on the sparsity in levels model [25], [27] is the use of algorithms for sparse recovery of wavelet coefficients. In Fig. 4, the vectors are created with a dyadic levels structure n^4 = (128, 256, 512, 1024, 2048) where the five levels have proportions (1/16, 1/16, 1/8, 1/4, 1/2) of the total length. We conduct two experiments where the algorithms are provided the exact sparsity in levels structure. In Fig. 4(a) the vector has limited decay in coefficients per level with k^4 = (128, 64, 128, 128, 64). On the other hand, a more significant coefficient decay modeling a wavelet compression of a signal is shown in (b) where k^5 = (128, 128, 112, 96, 48). In both experiments, the levels aware algorithms are given the exact sparsity in levels information. Again, we see here that the standard algorithms have a recovery phase transition at roughly the same number of measurements m. However, the levels aware algorithms gain a significant advantage with the exact information and perform much better in (b) on the more highly structured sparsity in levels pattern; this observation matches the theoretical discussions of asymptotic sparsity in [25], [27]. Here again, the CGIHT family of algorithms significantly outperforms NIHT and matches the performance of CSMPSP.

FIGURE 5. The levels are n^3 = (256, 512, 768, 1024, 1280, 1536, 1792, 2048) with local sparsities k^3 = (128, 0, 128, 0, 128, 0, 128, 0). Exact sparsity in levels information (k^3, n^3) is provided to the levels algorithms; the levels algorithms show substantial advantage over the standard algorithms. In all cases, CGIHT and CGIHT Restarted are competitive with CSMPSP and outperform NIHT. (a) 50% phase transition curves; (b) algorithm selection map for CGIHT, CGIHT Restarted, and CSMPSP and the levels aware adaptations. Note that in the region below the standard algorithms' phase transition from (a), the standard algorithm is faster than its levels version as indicated by the algorithm selection map (b).

E. PHASE TRANSITIONS AND ALGORITHM SELECTION
So far our empirical investigation has made a case that the standard and levels aware versions of CGIHT and CGIHT Restarted are comparable in performance to CSMPSP. In this section we will more clearly indicate which algorithm should be used for a given problem. First, we recall the notion of an empirical phase transition for sparse approximation algorithms. Given a problem class with an m × n matrix A and a vector x ∈ R^n with ‖x‖_0 = k, we define the undersampling and oversampling ratios δ = m/n and ρ = k/m. Then, since k ≤ m ≤ n, the unit square provides a (δ, ρ) phase space. A phase transition curve for an algorithm is a function ρ(δ) that defines a curve where the algorithm successfully solves problems with sampling ratios below the curve, but fails to solve problems with sampling ratios above the curve. An empirical 50% phase transition curve for an algorithm is a function ρ(δ) where the algorithm is observed to solve roughly half of all problems with sampling ratios along the curve. Below the curve the algorithm succeeds with high probability, and above the curve it fails with high probability. The region around the curve where the algorithm transitions from always succeeding to always failing has a width proportional to √m.

To investigate a problem where the sparsity in levels aware variants of the algorithms have the greatest advantage, our final two experiments return to the highly structured sparsity in levels experiment from Fig. 3(c), namely n^3 = (256, 512, 768, 1024, 1280, 1536, 1792, 2048) with local sparsities k^3 = (128, 0, 128, 0, 128, 0, 128, 0). For Fig. 5 we present results where the sparsity in levels variants of the algorithms are given the exact information ñ = n^3, k̃ = k^3. Fig. 5(a) presents the 50% phase transition curves for CGIHT, CGIHT Restarted, CSMPSP, and NIHT along with their sparsity in levels variants. The phase transition curves for the standard algorithms match those previously reported in [19], [45] and indicate that CGIHT and CGIHT Restarted can typically solve problems with the same sampling ratios as CSMPSP. The phase transition for NIHT is clearly below those of the other algorithms. When given the sparsity in levels information and the ability to utilize that knowledge, the phase transitions increase significantly. Here again we observe that CGIHTL and CGIHTL Restarted are competitive with CSMPSPL; all three algorithms have nearly identical phase transition curves when δ ≤ 0.25.

When attempting to solve a problem, it is usually desirable to employ the algorithm that will find an accurate solution with the least computational cost. In Fig. 5(b) we provide an algorithm selection map [38] which indicates which algorithm recovers the measured vector in the least amount of time. More specifically, for each ordered pair of sampling ratios (δ, ρ), the algorithm with the smallest average recovery time is indicated on the map. There are two important observations in Fig. 5(b). First, for the overwhelming majority of the phase space, an algorithm from the CGIHT family is the algorithm of choice, with CSMPSP being selected either when it is the only algorithm that is likely to solve the problem (e.g. (δ, ρ) = (0.5, 0.5)) or when the problem will only require one iteration (e.g. the smallest values of ρ = 0.02). Second, when the standard algorithm will succeed, i.e.
when the sampling ratios fall below the phase transition curves of CGIHT Restarted, the standard algorithm will solve the problem in less time and should be selected.
The best run time (ms) among all algorithms is given in Fig. 6. The transition from using the standard algorithm to using a sparsity in levels aware algorithm is very clearly delineated. In Fig. 7, we see how each of CGIHT(L) Restarted, CGIHT(L), and CSMPSP(L) compares to the best run time. These plots show the ratio of the average recovery time for each algorithm to the best average recovery time. In order to compare plots while still distinguishing between small ratios near 1, the colorbar is fixed to a minimum ratio of 1 and a maximum of 3; all ratios greater than 3 appear as the bright yellow top of the colorbar. Based on the combination of the algorithm selection map (Fig. 5) and the ratio of recovery times to the best time (Fig. 7), CGIHT Restarted and CGIHTL Restarted appear to be the algorithms of choice.
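To make the phase transition machinery of this section concrete, the sketch below estimates, for a fixed undersampling ratio δ, the largest oversampling ratio ρ at which at least half of the random trials succeed. It reuses generate_problem and is_success from the sketch in Section IV-A; recover and levels_of are placeholders for a recovery algorithm and for a rule assigning a levels pattern to a total sparsity, and the grids and trial counts are illustrative rather than those used in the reported experiments.

```python
import numpy as np

def success_fraction(recover, m, n, levels, local_k, trials=20):
    """Fraction of random problem instances recovered to the 1e-3 relative
    error criterion."""
    wins = 0
    for _ in range(trials):
        A, x, y = generate_problem(m, n, levels, local_k)
        x_hat = recover(A, y, levels, local_k)     # placeholder solver
        wins += is_success(x, x_hat)
    return wins / trials

def empirical_transition_rho(recover, delta, n, levels_of, trials=20):
    """Scan rho = k/m for fixed delta = m/n and return the largest rho at
    which at least half of the trials succeed (a 50% transition estimate)."""
    m = int(delta * n)
    rho_grid = np.linspace(0.05, 0.95, 19)
    last_good = 0.0
    for rho in rho_grid:
        k = max(int(rho * m), 1)
        levels, local_k = levels_of(n, k)          # map total sparsity to a pattern
        if success_fraction(recover, m, n, levels, local_k, trials) >= 0.5:
            last_good = rho
    return last_good
```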

V. CONCLUSION
When a priori knowledge of additional structure is available for solving a compressed sensing or sparse approximation problem, adapting greedy algorithms to utilize this information can be advantageous. In Section III, we provided a theoretical recovery guarantee for the sparsity in levels model for both NIHT and CGIHT Restarted. Since the standard compressed sensing model is simply a trivial instance of the sparsity in levels model, Theorems 1 and 2 also apply to the standard setting and slightly improve upon existing restricted isometry properties for guaranteed recovery for both algorithms. From the empirical investigation in Section IV, we see that the average case performance of CGIHTL and CGIHTL Restarted is highly competitive. With the worst case recovery guarantees and the observed average case performance, we make the following recommendations when solving a compressed sensing problem with structured sparsity.
Given the backing of Theorem 2, the empirical observations similar to those in Figs. 5-7, and its stability to noise [45], we recommend the use of CGIHT Restarted and CGIHT Restarted for Sparsity in Levels. In the problem regime where CGIHT Restarted is likely to succeed, use CGIHT Restarted for compressed sensing regardless of the known structure information. When given a problem whose sampling ratios are close to or above the 50% phase transition curve for CGIHT Restarted, use the sparsity in levels variant of CGIHT Restarted. For other problems in the smaller region of the phase space where Fig. 5(b) indicates using CSMPSPL, use a sparsity in levels version of CoSaMP.

APPENDIX A PROOFS
We prove Theorems 1 and 2, which extend similar results to the sparsity in levels setting. These proofs combine the existing proofs of [19], [40], [41] to obtain recovery guarantees for algorithms designed to utilize the structured support sets. Furthermore, since the sparsity in levels model includes the standard compressed sensing model Σ_{k,n}, the results in this section also improve over those existing in the literature in that the relevant constants μ and ξ have been reduced.
While the proofs in this appendix appear on the surface to focus on an analysis for the compressed sensing problem (1), they extend naturally to the sparse approximation setting (2). Suppose x ∈ R^n and T is the (k, n)-sparse support set of the largest entries of x with sparsity levels n and local sparsities k. Now assume y = Ax + e for some misfit vector e. Simply define x̃ = (x)_T and ẽ = A(x)_{T^c} + e. Then we have y = Ax̃ + ẽ, where x̃ ∈ Σ_{(k,n)} is sparse in levels and ẽ is an additive misfit vector. Further analysis would provide bounds on errors in other norms similar to those in [26] and [41, Section 6.3].

A. TECHNICAL LEMMAS
This first lemma is based on the structure of the algorithms, relies on the format for finding the support set of the update, and is not concerned with the matrix A or its RICs. This version is based on the beginning of [41, Thm. 6.18].
Lemma 4: Let x^{j+1}, x^j, x ∈ Σ_{(k,n)} have support sets T^{j+1}, T^j, T, respectively. Let α^j ∈ R and d^j ∈ R^n be a step size and direction vector from one of the iterative algorithms. Then the bound (37) holds.

Proof: In each of the algorithms, with w^j = x^j + α^j d^j, the index set T^{j+1} has the form T^{j+1} = DetectSupport_{(k,n)}(w^j) and contains the indices of the k_i largest magnitude entries of w^j on each level {n_{i−1} + 1, . . . , n_i}. Since both T^{j+1} and T are (k, n)-sparse support sets, we have ‖(w^j)_T‖_2^2 ≤ ‖(w^j)_{T^{j+1}}‖_2^2, and eliminating shared values gives (38). Now, since x is zero on the index set T^{j+1} \ T, we obtain (39). On the other hand, x^{j+1} has no contribution on the index set T \ T^{j+1}, so that (40) holds. Applying (39) to the right-hand side of (38), and (40) to the left, yields (41), where T △ T^{j+1} = (T \ T^{j+1}) ∪ (T^{j+1} \ T) is the symmetric difference of the sets. Now notice that x^{j+1} = (w^j)_{T^{j+1}}, so that (42) holds. Thus, applying (41) to (42) gives (43).

As usual with proofs involving RICs, we will use several bounds on matrices derived from the measurement matrix A. These asymmetric restricted isometry constant (aRIC) bounds appear in various forms throughout the literature involving restricted isometry properties.
Lemma 5: Let S be a (ck, n)-sparse support set and suppose A ∈ R^{m×n} has asymmetric RICs L_k, L_{ck}, U_{ck}. Suppose z ∈ R^n, e ∈ R^m, and the scalar α ∈ R satisfies the step size bounds 1/(1 + U_k) ≤ α ≤ 1/(1 − L_k) of Lemma 2. Then the bounds (44)-(47) hold; in particular, (47) bounds ‖α A_S^* e‖_2 in terms of |α|, U_{ck}, and ‖e‖_2.

Proof: As defined in [39], the aRICs of A are determined by the extreme eigenvalues of the Gram matrices formed from ck columns of A indexed by a (ck, n)-sparse support set S, namely the eigenvalues of A_S^* A_S. In other words, 1 − L_{ck} ≤ λ(A_S^* A_S) ≤ 1 + U_{ck} for every eigenvalue of A_S^* A_S. Thus (44) follows from applying the bound on α and the standard operator norm inequality.
The eigenvalues of I − αA_S^* A_S are bounded using the extreme eigenvalues of A_S^* A_S. Applying the bound on the scalar α, rearranging, and utilizing some basic inequalities on the aRICs, L_k ≤ L_{ck}, U_k ≤ U_{ck}, 1 − L_k ≤ 1 + U_k, we obtain (45). Then (46) follows directly from (45) by recognizing that restricting the matrix A to the columns indexed by S is equivalent to restricting the result of the matrix-vector product to the index set S. To obtain (47), we recognize that A_S A_S^* shares the same nonzero eigenvalues as A_S^* A_S, so that ‖A_S A_S^*‖_2 ≤ 1 + U_{ck}. Therefore

‖α A_S^* e‖_2^2 = ⟨α A_S^* e, α A_S^* e⟩ = |α|^2 ⟨e, A_S A_S^* e⟩ ≤ |α|^2 ‖e‖_2 ‖A_S A_S^* e‖_2 ≤ |α|^2 ‖A_S A_S^*‖_2 ‖e‖_2^2.

B. NIHT
We put the technical lemmas together to prove Theorem 1 for NIHT.

Proof of Thm. 1 (NIHT):
We assume that x ∈ R^n has a (k, n)-sparse structured support set T. The algorithm is given y and A, where y = Ax + e.
Let r^j = A*(y − Ax^j) be the residual in iteration j + 1 of NIHT, Algorithm 1, Step 1, and observe that the update in Step 3 uses r^j as the search direction, so that

x^j + α^j r^j = x^j + α^j A*(y − Ax^j) = x^j + α^j A*A(x − x^j) + α^j A*e.   (51)

NIHT finds a (k, n)-sparse support set T^{j+1}, and tracking the support restrictions in this iteration, we obtain (52), since T ∪ T^j contains the support of x^j − x. Now, let S = T ∪ T^j ∪ T^{j+1}, which is a (3k, n)-sparse support set, and note that T ∪ T^{j+1} is a (2k, n)-sparse support set. Combining Lemma 4 with (52), applying the triangle inequality, and observing (T ∪ T^j), (T ∪ T^{j+1}) ⊂ S, we obtain (53). Therefore, we can apply (46) and (47) from Lemma 5 to (53) to bound the error of a single iteration by

‖x − x^{j+1}‖_2 ≤ μ ‖x − x^j‖_2 + ξ ‖e‖_2,

where μ = √3 (U_{3k} + L_{3k}) / (1 − L_k) and ξ is the corresponding constant, with the same √3 prefactor, multiplying ‖e‖_2. When μ < 1, we arrive at the desired result through an induction argument and by bounding a finite geometric sum of positive terms by its limit.

C. CGIHT
The proof of Theorem 2 closely follows the proof from [19]. By using Lemma 4, the constants in the proof are improved.
Of course the proof also assumes the sparsity in levels model. Lemma 6 provides the outline for the strategy, where we attempt to establish a three-term recurrence relation on the errors from each iteration. By doing so, Lemma 6 ensures the approximations converge. The proof of this lemma is omitted.

Lemma 6 ([19, Lemma 4.1]): Suppose c_0, η, τ_1, τ_2 ≥ 0 and let μ = (1/2)(τ_1 + √(τ_1^2 + 4τ_2)). Assume c_1 ≤ μ c_0 + η and define c_l = τ_1 c_{l−1} + τ_2 c_{l−2} + η for l ≥ 2. If τ_1 + τ_2 < 1, then μ < 1 and

c_l ≤ μ^l c_0 + η Σ_{j=0}^{l−1} μ^j.   (57)

Next, in the spirit of the conjugate gradient method, we isolate a portion of the proof bounding the error from one iteration in terms of the previous iterations.
Lemma 7 ([19]): For any j ∈ N, let x^j ∈ Σ_{(k,n)} have support set T^j, and let β^j ∈ R be an orthogonalization weight from CGIHT. Now suppose x ∈ Σ_{(k,n)} with support set T, and let α^{l−1} ∈ R be a step size from CGIHT. Suppose A is an m × n matrix with m < n and let y = Ax + e for some additive misfit e ∈ R^m. If w^{l−1} is the update vector from Algorithm 2, Step 5, then the bound (58) holds.

Proof: First, iteratively applying the definition of the search directions allows us to write them entirely in terms of the residuals, (59). Considering the residuals in terms of the iterates, we obtain (60), and therefore (61). Now, restricting (61) to T^l ∪ T, taking the norm, and applying the triangle inequality yields (58). We now use these two lemmas along with the technical lemmas from above to prove Theorem 2.
Proof of Thm. 2 (CGIHT): From Lemma 4, we can immediately see that