Compressed Sensing Using Binary Matrices of Nearly Optimal Dimensions

In this paper, we study the problem of compressed sensing using binary measurement matrices and $\ell _1$-norm minimization (basis pursuit) as the recovery algorithm. We derive new upper and lower bounds on the number of measurements to achieve robust sparse recovery with binary matrices. We establish sufficient conditions for a column-regular binary matrix to satisfy the robust null space property (RNSP) and show that the associated sufficient conditions for robust sparse recovery obtained using the RNSP are better by a factor of $(3 \sqrt{3})/2 \approx 2.6$ compared to the sufficient conditions obtained using the restricted isometry property (RIP). Next we derive universal lower bounds on the number of measurements that any binary matrix needs to have in order to satisfy the weaker sufficient condition based on the RNSP and show that bipartite graphs of girth six are optimal. Then we display two classes of binary matrices, namely parity check matrices of array codes and Euler squares, which have girth six and are nearly optimal in the sense of almost satisfying the lower bound. In principle, randomly generated Gaussian measurement matrices are "order-optimal." So we compare the phase transition behavior of the basis pursuit formulation using binary array codes and Gaussian matrices and show that (i) there is essentially no difference between the phase transition boundaries in the two cases and (ii) the CPU time of basis pursuit with binary matrices is hundreds of times smaller than with Gaussian matrices, and the storage requirements are also smaller. Therefore it is suggested that binary matrices are a viable alternative to Gaussian matrices for compressed sensing using basis pursuit.


I. INTRODUCTION
COMPRESSED sensing refers to the recovery of high-dimensional but low-complexity entities from a limited number of measurements. The specific problem studied in this paper is to recover a vector $x \in \mathbb{R}^n$, where only $k \ll n$ components are significant and the rest are either zero or small, based on a set of linear measurements $y = Ax$, where $A \in \mathbb{R}^{m \times n}$.
A variant is when $y = Ax + \eta$, where $\eta$ denotes measurement noise and a prior bound of the form $\|\eta\| \leq \epsilon$ is available. By far the most popular solution methodology for this problem is basis pursuit, in which an approximation $\hat{x}$ to the unknown vector $x$ is constructed via
$$\hat{x} := \arg\min_{z \in \mathbb{R}^n} \|z\|_1 \ \text{ s.t. } \ \|y - Az\| \leq \epsilon. \qquad (1)$$
The basis pursuit approach (with $\eta = 0$, so that the constraint in (1) becomes $y = Az$) was proposed in [1], [2], but without guarantees on its performance. Much of the subsequent research in compressed sensing has been focused on the case where A consists of mn independent samples of a zero-mean, unit-variance Gaussian or sub-Gaussian random variable, normalized by $1/\sqrt{m}$. With this choice, it is shown in [3] that, with high probability with respect to the process of generating A, $m = O(k \ln(n/k))$ measurements suffice to ensure that $\hat{x}$ defined in (1) equals x, provided x is sufficiently sparse. It is also known that any compressed sensing algorithm requires $m = \Omega(k \ln(n/k))$ samples; see [4] for an early result and [5] for a simpler and more explicit version of this bound. Thus random Gaussian matrices are "order optimal" in the sense that the number of measurements is within a fixed universal constant of the minimum required.
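The following is a minimal sketch of how (1) can be solved in practice in the noiseless case, by recasting $\ell_1$-norm minimization as a linear program via the standard splitting $z = z^+ - z^-$; the use of SciPy and all names here are illustrative choices, not the authors' implementation.

```python
# Basis pursuit, eq. (1) with eta = 0, as a linear program:
# minimize 1^T (zp + zn) subject to A (zp - zn) = y, zp, zn >= 0.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    m, n = A.shape
    c = np.ones(2 * n)                       # objective: sum of zp + zn
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

# quick check on a random sparse instance
rng = np.random.default_rng(0)
n, m, k = 200, 60, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = basis_pursuit(A, A @ x)
print(np.max(np.abs(x_hat - x)))             # should be ~0 for small k
```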
In recent times, there has been a lot of interest in the use of sparse binary measurement matrices for compressed sensing. One of the main advantages of this approach is that it allows one to connect compressed sensing to fields such as graph theory and algebraic coding theory. There are also some computational advantages. At present, a popular alternative is to choose the measurement matrix A to consist of mn independent samples of a Gaussian random variable. A Gaussian random variable is nonzero with probability one; therefore every element of A will be nonzero with probability one. Moreover, in solving the minimization problem in (1), each element of A needs to be stored to high precision. In contrast, sparse binary matrices require less storage, both because they are sparse and because every nonzero element equals one; for this reason, binary matrices are also said to be "multiplication-free." As a result, popular compressed sensing approaches such as (1) can be applied effectively for far larger values of m and n, and with greatly reduced CPU time, when A is a sparse binary matrix instead of a random Gaussian matrix. Of course, the previous discussion assumes that the unknown vector is sparse in the canonical basis. There are situations where the unknown vector is sparse with respect to some other basis, such as the Fourier basis; our remarks would not apply in such a situation.

At present, the best available bounds for the number of measurements required by a binary matrix are $m = O(\max\{k^2, \sqrt{n}\})$. This contrasts with $m = O(k \ln(n/k))$ for random Gaussian matrices. However, in the latter case, the O symbol hides a very large constant. It is shown in this paper that, for values of $n < 10^5$, the known bounds with binary matrices are in fact smaller than those with random Gaussian matrices.

The preceding discussion refers to the case where a particular matrix A is guaranteed to recover all sufficiently sparse vectors. A parallel approach is to study conditions under which "most" sparse vectors are recovered. Specifically, in this approach, n and m are fixed and k is varied from 1 to m. For each choice of k, a large number of vectors with exactly k nonzero components are generated at random, and the fraction that is recovered accurately is computed. Clearly, as k is increased, this fraction decreases. One might expect that the fraction of recovered randomly generated vectors equals 1 when k is sufficiently small and decreases gradually to 0 as k approaches m. In reality, there is a sharp boundary below which almost all k-sparse vectors are recovered and above which almost none are. This phenomenon is known as phase transition, and it has been established theoretically for the case where A consists of random samples from a Gaussian distribution in [6]-[8]. A very general theory is derived in [9], where the measurement matrix still consists of random Gaussians, but the objective function is changed from the $\ell_1$-norm to an arbitrary convex function. In a recent paper [10], phase transitions are studied empirically for several classes of deterministic measurement matrices, and it is verified that there is essentially no difference between the phase transitions of deterministic measurement matrices and those of random Gaussian measurement matrices.
Here we describe the organization of the paper, as well as its contributions. Section II contains background material, but also includes some improvements over known results. In particular, we review the current literature on the construction of binary matrices for compressed sensing. The original contributions of the paper begin with Section III. In this section we derive a sufficient condition for a binary matrix to satisfy the robust null space property (RNSP). In turn, this leads to a new upper bound on the sparsity count k for which robust sparse recovery can be guaranteed using a column-regular binary matrix.¹ In Section IV we derive a lower bound on the number of measurements m as a function of the girth of the bipartite graph associated with the measurement matrix; it is shown that graphs of girth six are optimal in terms of minimizing the number of measurements. In Section V, we construct binary matrices of girth six for which the number of measurements is nearly equal to the lower bound derived in Section IV; this explains the title of the paper. In Section VI, we attempt to reconcile two seemingly conflicting observations, namely that for compressed sensing, graphs of girth six are optimal, whereas in coding, graphs of high girth are preferred. In Section VII, we carry out some numerical experiments and establish that the basis pursuit approach together with our binary matrices exhibits a phase transition. The paper concludes with some discussion in Section VIII.

¹ This term is defined in Section III.

A. Definition of Compressed Sensing
Let $\Sigma_k \subseteq \mathbb{R}^n$ denote the set of k-sparse vectors in $\mathbb{R}^n$; i.e.,
$$\Sigma_k := \{x \in \mathbb{R}^n : \|x\|_0 \leq k\},$$
where, as is customary, $\|x\|_0$ denotes the number of nonzero components of x. Given a norm $\|\cdot\|$ on $\mathbb{R}^n$, the k-sparsity index of x with respect to that norm is defined by
$$\sigma_k(x, \|\cdot\|) := \min_{z \in \Sigma_k} \|x - z\|.$$
Now we are in a position to define the compressed sensing problem precisely. Note that $A \in \mathbb{R}^{m \times n}$ is called the measurement matrix and $\Delta : \mathbb{R}^m \to \mathbb{R}^n$ is called the "decoder map."

Definition 1: The pair $(A, \Delta)$ is said to achieve stable sparse recovery of order k and indices p, q if there exists a constant C such that
$$\|\Delta(Ax) - x\|_p \leq C \, \sigma_k(x, \|\cdot\|_q), \quad \forall x \in \mathbb{R}^n.$$
The pair $(A, \Delta)$ is said to achieve robust sparse recovery of order k and indices p, q (and norm $\|\cdot\|$) if there exist constants C and D such that, for all $\eta \in \mathbb{R}^m$ with $\|\eta\| \leq \epsilon$, it is the case that
$$\|\Delta(Ax + \eta) - x\|_p \leq C \, \sigma_k(x, \|\cdot\|_q) + D\epsilon, \quad \forall x \in \mathbb{R}^n.$$
The above definitions apply to general norms. In this paper, and indeed in much of the compressed sensing literature, the emphasis is on the case where q = 1 and p ∈ [1, 2]. However, the norm on η is still arbitrary.

B. Approaches to Compressed Sensing -I: RIP
Next we present some sufficient conditions for basis pursuit as defined in (1) to achieve robust or stable sparse recovery. There are two widely used sufficient conditions, namely the restricted isometry property (RIP) and the stable (or robust) null space property (SNSP or RNSP). We begin by discussing the RIP.
Definition 2: A matrix $A \in \mathbb{R}^{m \times n}$ is said to satisfy the restricted isometry property (RIP) of order k with constant δ if
$$(1 - \delta)\|u\|_2^2 \leq \|Au\|_2^2 \leq (1 + \delta)\|u\|_2^2, \quad \forall u \in \Sigma_k.$$
The RIP is formulated in [3]. It is shown in a series of papers [3], [11], [12] that the RIP of A is sufficient for $(A, \Delta_{BP})$ to achieve robust sparse recovery. The best known, and indeed the best possible, result relating the RIP and robust recovery is given below.

Theorem 1: If A satisfies the RIP of order tk with constant $\delta_{tk} < \sqrt{(t-1)/t}$ for $t \geq 4/3$, or $\delta_{tk} < t/(4-t)$ for $t \in (0, 4/3)$, then $(A, \Delta_{BP})$ achieves robust sparse recovery of order k. Moreover, both bounds are tight.
The first bound is proved in [13] while the second bound is proved in [14]. Note that both bounds are equal when t = 4/3. Hence the theorem provides a continuous tight bound on δ tk for all t > 0.
This theorem raises the question as to how one may go about designing measurement matrices that satisfy the RIP. There are two popular approaches, one probabilistic and one deterministic. In the probabilistic method, the measurement matrix A equals $(1/\sqrt{m})\Phi$, where Φ consists of mn independent samples of a Gaussian random variable or, more generally, a sub-Gaussian random variable. In this paper we restrict our attention to the case where A consists of random samples from a Gaussian distribution, and refer the reader to [15] for the more general case of sub-Gaussian samples. The relevant bound on m to ensure that A satisfies the RIP with high probability is given next; it is a fairly straightforward modification of [15, Theorem 9.27].
Theorem 2: Suppose an integer k and real numbers δ, ξ ∈ (0, 1) are specified, and that $A = (1/\sqrt{m})\Phi$, where $\Phi \in \mathbb{R}^{m \times n}$ consists of independent samples of a normal Gaussian random variable X. Then A satisfies the RIP of order k with constant δ, with probability at least 1 − ξ, provided m satisfies the explicit bound (6).

Proof: We start with [15, Theorem 9.27], where it is shown that if the measurement matrix $A \in \mathbb{R}^{m \times n}$ consists of independent samples of Gaussian random variables, then A satisfies the RIP of order k with constant δ, with probability at least 1 − ξ, whenever m exceeds an explicit threshold. Rewriting that threshold and rearranging the resulting inequality leads to (5), and in turn to (6).

Equation (6) leads to an upper bound of the form $m = O(k \ln(n/k))$ for the number of measurements that suffice for the random matrix to satisfy the RIP with high probability. It is shown in [5, Theorem 3.1] that any algorithm that achieves stable sparse recovery requires $m = \Omega(k \ln(n/k))$ measurements; see [4, Theorem 5.1] for an earlier version. For the convenience of the reader, we restate the theorem from [5]. Note that it is assumed in [5] that p = q = 1, but the proof requires only that p = q. In order to state the theorem, we introduce the entropy with respect to an arbitrary integer θ. Suppose θ ≥ 2 is an integer. Then the θ-ary entropy $H_\theta : (0, 1) \to (0, 1]$ is defined by
$$H_\theta(x) := x \log_\theta(\theta - 1) - x \log_\theta x - (1 - x)\log_\theta(1 - x). \qquad (7)$$
Theorem 3: Suppose $A \in \mathbb{R}^{m \times n}$ and that, for some map $\Delta : \mathbb{R}^m \to \mathbb{R}^n$, the pair (A, Δ) achieves stable k-sparse recovery with constant C. Define $\theta = \lfloor n/k \rfloor$. Then m satisfies the lower bound (8), which is of the form $m = \Omega(k \ln(n/k))$. Because robust k-sparse recovery implies stable k-sparse recovery, the bound in (8) applies also to robust k-sparse recovery.
Comparing Theorems 2 and 3 shows that measurements of order $k \ln(n/k)$ are both necessary and sufficient for robust k-sparse recovery. For this reason, probabilistically generated measurement matrices are considered to be "order-optimal." However, this statement is misleading, because the O symbol in the upper bound hides a very large constant, as shown next.
Example 1: Suppose $n = 22{,}201 = 149^2$ and k = 69, a problem instance studied later in Section VII. Then the upper bound from Theorem 2 comes to m = 44,345 (see Section VII-A), which exceeds the dimension n, while the lower bound from Theorem 3 is smaller by more than three orders of magnitude. Thus the spread between the upper and lower bounds is more than three orders of magnitude, and the upper bound for the number of measurements is more than the dimension n.
There is another factor as well. As can be seen from Theorem 2, probabilistic methods lead to measurement matrices that satisfy the RIP only with high probability, a probability that can be made close to one but never exactly equal to one. Moreover, as shown in [16], once a matrix has been generated, it is NP-hard to test whether that particular matrix satisfies the RIP.
These observations have led the research community to explore deterministic methods to construct matrices that satisfy the RIP. A popular approach is based on the coherence of a matrix.
Definition 3: Suppose $A \in \mathbb{R}^{m \times n}$ is column-normalized, so that $\|a_j\|_2 = 1$ for all $j \in [n]$, where $a_j$ denotes the j-th column of A. Then the coherence of A is denoted by μ(A) and is defined as
$$\mu(A) := \max_{i \neq j} |\langle a_i, a_j \rangle|.$$
The following result is an easy consequence of the Gerschgorin circle theorem.

Lemma 1: Suppose $A \in \mathbb{R}^{m \times n}$ has coherence μ. Then A satisfies the RIP of order k with constant $\delta_k \leq (k - 1)\mu$.
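The coherence is straightforward to compute; the following sketch (with illustrative names, not the authors' code) implements Definition 3 directly.

```python
import numpy as np

def coherence(A):
    An = A / np.linalg.norm(A, axis=0)   # normalize each column
    G = np.abs(An.T @ An)                # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)             # ignore the diagonal (all ones)
    return G.max()

# By Lemma 1, A then satisfies the RIP of order k with
# delta_k <= (k - 1) * coherence(A).
```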

C. Approaches to Compressed Sensing -II: RNSP
An alternative to the RIP approach to compressed sensing is provided by the stable (and robust) null space property. The SNSP is formulated in [17], while, to the best of the authors' knowledge, the RNSP is formulated for the first time in [18]; see also [15,Definition 4.17].
Definition 4: Suppose $A \in \mathbb{R}^{m \times n}$ and let $\mathcal{N}(A)$ denote the null space of A. Then A is said to satisfy the stable null space property (SNSP) of order k with constant ρ < 1 if, for every set $S \subseteq [n]$ with $|S| \leq k$, we have
$$\|v_S\|_1 \leq \rho \|v_{S^c}\|_1, \quad \forall v \in \mathcal{N}(A).$$
The matrix A is said to satisfy the robust null space property (RNSP) of order k for the norm $\|\cdot\|$ with constants ρ < 1 and τ > 0 if, for every set $S \subseteq [n]$ with $|S| \leq k$, we have
$$\|h_S\|_1 \leq \rho \|h_{S^c}\|_1 + \tau \|Ah\|, \quad \forall h \in \mathbb{R}^n. \qquad (12)$$
It is obvious that the RNSP implies the SNSP. The utility of these definitions is brought out in the following theorems.
Theorem 4: (See [15, Theorem 4.12].) Suppose A satisfies the stable null space property of order k with constant ρ. Then the pair $(A, \Delta_{BP})$ achieves stable k-sparse recovery with
$$C = \frac{2(1 + \rho)}{1 - \rho}. \qquad (13)$$
Theorem 5: (See [15, Theorem 4.22].) Suppose A satisfies the robust null space property of order k for the norm $\|\cdot\|$ with constants ρ and τ. Then the pair $(A, \Delta_{BP})$ achieves robust k-sparse recovery with
$$C = \frac{2(1 + \rho)}{1 - \rho}, \qquad D = \frac{4\tau}{1 - \rho}. \qquad (14)$$

D. Best Bounds on the Sparsity Count Using the RIP

Until recently, the twin approaches of RIP and RNSP had proceeded along parallel tracks. However, it is shown in [19, Theorem 9] that if A satisfies the RIP of order tk with constant $\delta_{tk} < \sqrt{(t-1)/t}$ for some t > 1, then it satisfies the RNSP of order k. Note that if A has coherence μ, then by Lemma 1 we have $\delta_{tk} \leq (tk - 1)\mu$ for all t. Next, by [19, Theorem 9], basis pursuit achieves robust k-sparse recovery whenever
$$(tk - 1)\mu < \sqrt{\frac{t-1}{t}}$$
for any t > 1. So let us ask: What is an "optimal" choice of t?
To answer this question, we neglect the 1 in comparison to tk and rewrite the above inequality as
$$k < \frac{1}{\mu} \cdot \frac{1}{t}\sqrt{\frac{t-1}{t}}.$$
Thus we get the best bound by maximizing the right side with respect to t. It is an easy exercise in calculus to show that the maximum is achieved at t = 3/2, where the corresponding value of $\sqrt{(t-1)/t}$ is $1/\sqrt{3}$. Hence, by combining with Lemma 1, we can derive the following bound.
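For completeness, here is the underlying calculus step (a worked verification, not reproduced from the paper). Writing $f(t) := (1/t)\sqrt{(t-1)/t} = (t-1)^{1/2} t^{-3/2}$ and differentiating logarithmically,
$$\frac{d}{dt}\ln f(t) = \frac{1}{2(t-1)} - \frac{3}{2t} = 0 \;\Longleftrightarrow\; t = 3(t-1) \;\Longleftrightarrow\; t = \frac{3}{2},$$
and
$$f\!\left(\frac{3}{2}\right) = \left(\frac{1}{2}\right)^{1/2}\left(\frac{3}{2}\right)^{-3/2} = \frac{2}{3\sqrt{3}} \approx 0.385.$$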
Theorem 6: Suppose $A \in \mathbb{R}^{m \times n}$ has coherence μ. Then $(A, \Delta_{BP})$ achieves robust k-sparse recovery whenever
$$k < \frac{2}{3}\left(1 + \frac{1}{\sqrt{3}\,\mu}\right), \qquad (15)$$
or, in simpler form, whenever
$$k \leq \frac{2}{3\sqrt{3}\,\mu} \approx \frac{0.385}{\mu}. \qquad (16)$$
Moreover, the bound is nearly optimal when applying [19, Theorem 9]. If we retain the term tk − 1 instead of replacing it by tk, we would get a more complicated expression for the optimal value of t. However, it can be verified that if (16) is satisfied, then so is (15).

E. Binary Matrices for Compressed Sensing: A Review
In this section we present a brief review of the use of binary matrices as measurement matrices in compressed sensing. The first construction of a binary matrix that satisfies the RIP is due to DeVore and is given in [20]. The DeVore matrix has dimensions $q^2 \times q^{r+1}$, where q is a power of a prime number and r ≥ 2 is an integer; it has exactly q ones in each column and has coherence μ ≤ r/q. This construction is generalized to algebraic curves in [21], but does not seem to offer much of an advantage over that in [20]. A construction that leads to matrices of order $2^m \times 2^{m(m+1)/2}$ based on Reed-Muller codes is proposed in [22]. Because the number of measurements is restricted to be a power of 2, this is not a very practical method. A construction in [23] is based on a method to generate Euler squares from nearly a century ago [24]. The resulting binary matrix has dimensions $lq \times q^2$, where q is an arbitrary integer, making this perhaps the most versatile construction. The integer l is bounded as follows: let $q = 2^{r_0} p_1^{r_1} \cdots p_s^{r_s}$ be the prime decomposition of q; then $l + 1 \leq \min\{2^{r_0}, p_1^{r_1}, \ldots, p_s^{r_s}\}$. In particular, if q is itself a power of a prime, we can have l = q − 1. Each column of the resulting binary matrix has exactly l ones, and the matrix has coherence 1/l. All of these matrices can be used to achieve robust k-sparse recovery via the basis pursuit formulation, by combining Lemma 1 with Theorem 1. Another method, found in [25], constructs binary matrices using the Chinese remainder theorem and achieves probabilistic recovery.
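For concreteness, here is a sketch of DeVore's construction for prime q, following the description above: rows are indexed by points (a, b) of $\mathbb{Z}_q \times \mathbb{Z}_q$, columns by polynomials of degree at most r over $\mathbb{Z}_q$, with a 1 wherever the graph of the polynomial passes through the point. The code and its names are illustrative.

```python
import itertools
import numpy as np

def devore_matrix(q, r=2):
    # q^2 rows (points), q^(r+1) columns (polynomial coefficient vectors)
    A = np.zeros((q * q, q ** (r + 1)), dtype=np.uint8)
    for j, coeffs in enumerate(itertools.product(range(q), repeat=r + 1)):
        for a in range(q):
            b = sum(c * a**i for i, c in enumerate(coeffs)) % q
            A[a * q + b, j] = 1
    return A

A = devore_matrix(5, 2)                  # 25 x 125, q = 5 ones per column
assert (A.sum(axis=0) == 5).all()
G = A.astype(int).T @ A.astype(int)
np.fill_diagonal(G, 0)
assert G.max() <= 2                      # column inner products <= r = 2
```

Two distinct polynomials of degree at most r over a field agree in at most r points, which is why the column inner products, and hence the (unnormalized) coherence, are bounded by r.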
There is another property, sometimes referred to as the $\ell_1$-RIP, introduced in [26], which makes a connection between expander graphs and compressed sensing. However, while this approach readily leads to stable k-sparse recovery, it does not lend itself readily to robust k-sparse recovery. One of the main contributions of [27] is to show that the construction of [20] can also be viewed as a special case of an expander graph construction proposed in [28]. Yet another direction is initiated in [29], in which a general approach is presented for generating binary matrices for compressed sensing using algebraic coding theory. In particular, it is shown that binary matrices which, when viewed as matrices over the binary field $\mathbb{F}_2$, have good decoding properties will also be good measurement matrices when viewed as matrices of real numbers. Specifically, several notions of "pseudo-weights" are introduced, and it is shown that these pseudo-weights can be related to the satisfaction of the stable (but not robust) null space property of binary matrices. These bounds are improved in [30], which proves the stable null space property under weaker conditions than in [29].

III. ROBUST NULL SPACE PROPERTY OF BINARY MATRICES
In this section we commence presenting the new results of this paper on identifying a class of binary matrices for compressed sensing that have a nearly optimal number of measurements.
Suppose $A \in \{0, 1\}^{m \times n}$ with m < n. Then A can be viewed as the bi-adjacency matrix of a bipartite graph with n input (or "left") nodes and m output (or "right") nodes. Such a graph is said to be left-regular if each input node has the same degree, say $d_L$. This is equivalent to saying that each column of A contains exactly $d_L$ ones. Given a bipartite graph with E edges, n input nodes and m output nodes, define the "average left degree" and "average right degree" of the graph as $\bar{d}_L = E/n$ and $\bar{d}_R = E/m$. Note that these average degrees need not be integers. Then it is clear that $n\bar{d}_L = m\bar{d}_R$. The girth of a graph is defined as the length of its shortest cycle. Note that the girth of a bipartite graph is always an even number, and in so-called simple graphs (not more than one edge between any pair of vertices), the girth is at least four.
Hereafter, we will not make a distinction between a binary matrix and the bipartite graph associated with the matrix. Specifically, the columns correspond to the "left" nodes while the rows correspond to the "right" nodes. So an expression such as "A is a left-regular binary matrix of degree d L " means that the associated bipartite graph is left-regular with degree d L . This usage will permit us to avoid some tortuous sentences.
Theorems 7 and 8 are the starting point for the contents of this section.
Theorem 7: Suppose $A \in \{0,1\}^{m \times n}$ is left-regular with left degree $d_L$, and suppose that the maximum inner product between any two columns of A is λ. Then for every $v \in \mathcal{N}(A)$, we have
$$|v_i| \leq \frac{\lambda}{2 d_L}\,\|v\|_1, \quad \forall i \in [n], \qquad (18)$$
where [n] denotes {1, . . . , n}.
If the matrix A has girth six or more, then the maximum inner product between any two columns of A is at most equal to one. Therefore (18) gives the bound
$$\|v\|_\infty \leq \frac{1}{2 d_L}\,\|v\|_1, \quad \forall v \in \mathcal{N}(A).$$
However, if the girth is equal to 10 or more, then it is possible to improve the bound (18).
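As a numerical sanity check of the bound (18) as reconstructed above, one can sample vectors from the null space of a randomly generated left-regular binary matrix; this sketch uses SciPy's null_space and is purely illustrative.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
m, n, dL = 30, 60, 5
A = np.zeros((m, n))
for j in range(n):                       # random left-regular binary matrix
    A[rng.choice(m, dL, replace=False), j] = 1

G = A.T @ A
np.fill_diagonal(G, 0)
lam = G.max()                            # max column inner product

V = null_space(A)                        # orthonormal basis of N(A)
for _ in range(100):
    v = V @ rng.standard_normal(V.shape[1])
    # check ||v||_inf <= (lam / (2 dL)) ||v||_1, cf. (18)
    assert np.abs(v).max() <= lam / (2 * dL) * np.abs(v).sum() + 1e-9
```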
Theorem 8: Suppose $A \in \{0,1\}^{m \times n}$ is left-regular with left degree $d_L$ and that A has girth g ≥ 6. Then for every $v \in \mathcal{N}(A)$, we have
$$\|v\|_\infty \leq \frac{1}{C'}\,\|v\|_1, \qquad (19)$$
where, if g = 4t + 2, then
$$C' = 2\sum_{i=0}^{t}(d_L - 1)^i, \qquad (20)$$
and if g = 4t, then
$$C' = 2\sum_{i=0}^{t-1}(d_L - 1)^i. \qquad (21)$$
Note that if the girth of the graph equals 6, then C′ as defined in (20) becomes $C' = 2d_L$, and the bound in (19) becomes the same as that in (18) after noting that λ = 1. Similarly, if g = 8, then C′ in (21) also becomes just $2d_L$. Therefore Theorem 8 is an improvement over Theorem 7 only when the girth of the graph is at least equal to 10.
In [30], the bounds (18) and (19) are used to derive sufficient conditions for the matrix A to satisfy the stable null space property. However, it is now shown that the same two bounds can be used to infer the robust null space property of A. This is a substantial improvement because, with such a matrix A, basis pursuit is robust against measurement noise, which is not guaranteed under the SNSP. We establish our results through a series of preliminary lemmas.
Lemma 2: Suppose $A \in \mathbb{R}^{m \times n}$ and let $\|\cdot\|$ be any norm on $\mathbb{R}^m$. Suppose there exist constants α > 2 and β > 0 such that
$$\|h\|_\infty \leq \frac{1}{\alpha}\|h\|_1 + \beta\|Ah\|, \quad \forall h \in \mathbb{R}^n. \qquad (22)$$
Then, for all k < α/2, the matrix A satisfies the RNSP of order k. Specifically, whenever $S \subseteq [n]$ with $|S| \leq k$, Equation (12) holds with
$$\rho = \frac{k}{\alpha - k}, \qquad \tau = \frac{\alpha \beta k}{\alpha - k}. \qquad (23)$$
Proof: Let $S \subseteq [n]$ with $|S| \leq k$ be arbitrary. Then, for every $h \in \mathbb{R}^n$,
$$\|h_S\|_1 \leq k\|h\|_\infty \leq \frac{k}{\alpha}\|h\|_1 + k\beta\|Ah\| = \frac{k}{\alpha}\|h_S\|_1 + \frac{k}{\alpha}\|h_{S^c}\|_1 + k\beta\|Ah\|.$$
Rearranging gives
$$\|h_S\|_1 \leq \frac{k}{\alpha - k}\|h_{S^c}\|_1 + \frac{\alpha\beta k}{\alpha - k}\|Ah\|,$$
which is the desired conclusion; note that ρ < 1 precisely when k < α/2.

Next, let $A \in \mathbb{R}^{m \times n}$ be arbitrary and let $\|\cdot\|$ be any norm on $\mathbb{R}^m$. Recall that $\mathcal{N}(A) \subseteq \mathbb{R}^n$ denotes the null space of A, and let $\mathcal{N}^\perp := [\mathcal{N}(A)]^\perp$ denote the orthogonal complement of $\mathcal{N}(A)$ in $\mathbb{R}^n$. Then for all $u \in \mathcal{N}^\perp$, it is easy to see that
$$\|Au\|_2 \geq \sigma_{\min}\|u\|_2,$$
where $\sigma_{\min}$ is the smallest nonzero singular value of A. Because all norms on a finite-dimensional space are equivalent, there exists a constant c that depends only on the norm $\|\cdot\|$ on $\mathbb{R}^m$ such that
$$\|y\|_2 \leq c\|y\|, \quad \forall y \in \mathbb{R}^m. \qquad (24)$$
(In particular, $\|y\|_2 \leq \|y\|_1$, so we can take c = 1 in this case.) Therefore, by Schwarz' inequality, we get
$$\|u\|_1 \leq \sqrt{n}\,\|u\|_2 \leq \frac{\sqrt{n}}{\sigma_{\min}}\|Au\|_2 \leq \frac{c\sqrt{n}}{\sigma_{\min}}\|Au\|, \quad \forall u \in \mathcal{N}^\perp.$$
At this point, we can state the main result of this section.

Theorem 9: Suppose $A \in \{0,1\}^{m \times n}$ is left-regular with left degree $d_L$, and let λ denote the maximum inner product between any two columns of A (and observe that λ ≤ $d_L$). Next, let $\sigma_{\min}$ denote the smallest nonzero singular value of A, and for an arbitrary norm $\|\cdot\|$ on $\mathbb{R}^m$, choose the constant c such that (24) holds. Then A satisfies (22) with
$$\alpha = \frac{2 d_L}{\lambda}, \qquad \beta = \left(1 + \frac{\lambda}{2 d_L}\right)\frac{c\sqrt{n}}{\sigma_{\min}}; \qquad (25)$$
that is,
$$\|h\|_\infty \leq \frac{\lambda}{2 d_L}\|h\|_1 + \left(1 + \frac{\lambda}{2 d_L}\right)\frac{c\sqrt{n}}{\sigma_{\min}}\|Ah\|, \quad \forall h \in \mathbb{R}^n. \qquad (26)$$
Consequently, for all $k < \alpha/2 = d_L/\lambda$, A satisfies the RNSP of order k with
$$\rho = \frac{k\lambda}{2 d_L - k\lambda}, \qquad \tau = \frac{2 d_L \beta k}{2 d_L - k\lambda}. \qquad (27)$$
Proof: Let $h \in \mathbb{R}^n$ be arbitrary and express h as h = v + u, where $v \in \mathcal{N}(A)$ and $u \in \mathcal{N}^\perp$. Then clearly $\|h\|_\infty \leq \|v\|_\infty + \|u\|_\infty$, and we bound each term separately. As shown in Theorem 7, we have
$$\|v\|_\infty \leq \frac{\lambda}{2 d_L}\|v\|_1 \leq \frac{\lambda}{2 d_L}\left(\|h\|_1 + \|u\|_1\right),$$
while, from the preceding discussion,
$$\|u\|_1 \leq \frac{c\sqrt{n}}{\sigma_{\min}}\|Au\| = \frac{c\sqrt{n}}{\sigma_{\min}}\|Ah\|,$$
where the last step follows from the fact that Ah = Au because Av = 0. Next, $\|u\|_\infty \leq \|u\|_1$. Combining these inequalities shows that
$$\|h\|_\infty \leq \frac{\lambda}{2 d_L}\|h\|_1 + \left(1 + \frac{\lambda}{2 d_L}\right)\frac{c\sqrt{n}}{\sigma_{\min}}\|Ah\|.$$
This establishes (26). The bound (27) then follows from Lemma 2, specifically (23).

Remarks:
• In the above proof, we make use of the inequality $|u_i| \leq \|u\|_1$. At a certain level, this estimate is conservative. However, if we wish to have a bound on $|u_i|$ in terms of $\|u\|_1$ that is applicable to all vectors u, then the bound is tight.
• Note that the bound $|u_i| \leq \|u\|_1$ is used only to derive a bound on the constant β. In turn, the bound on β leads to a bound on the constant τ in the definition of the robust null space property. It can be seen from Theorem 5 and (14) that robust k-sparse recovery occurs whenever ρ < 1; the only appearance of τ is in the constant D in (14), which gives the amplification factor of the noise.

Theorem 10: Suppose $A \in \{0,1\}^{m \times n}$ is left-regular with left degree $d_L$ and has girth at least six. Define the constant C′ as in (20) or (21), as appropriate. Then for all k < C′/2, the matrix A satisfies the RNSP of order k, with constants
$$\rho = \frac{k}{C' - k}, \qquad \tau = \frac{C' \beta k}{C' - k}, \qquad (28)$$
where β is defined as in the proof of Theorem 9. The proof of Theorem 10 is entirely analogous to that of Theorem 9, with the bound in Theorem 8 replacing that in Theorem 7; therefore the proof is omitted.
The results in Theorem 9 lead to sharper bounds for the sparsity count compared to using RIP and coherence bounds. This is illustrated next.
Example 2: Suppose $A \in \{0,1\}^{m \times n}$ is left-regular with degree $d_L$, with the inner product between any two columns bounded above by λ. Then it is easy to see that the coherence of A satisfies $\mu \leq \lambda/d_L$. Therefore, if we use Theorem 6, it follows that $(A, \Delta_{BP})$ achieves robust k-sparse recovery whenever
$$k \leq \frac{2}{3\sqrt{3}} \cdot \frac{d_L}{\lambda} \approx 0.385\,\frac{d_L}{\lambda}.$$
In contrast, if we use Theorem 9, it follows that $(A, \Delta_{BP})$ achieves robust sparse recovery whenever $k < d_L/\lambda$, which is an improvement by a factor of roughly $3\sqrt{3}/2 \approx 2.6$.

IV. LOWER BOUNDS ON THE NUMBER OF MEASUREMENTS
Theorem 8 shows that, for a fixed left degree $d_L$, as the girth of the graph corresponding to A becomes larger, so does the constant C′. Therefore, as the girth of A increases, so does the upper bound on k obtained from Theorem 10. This suggests that, for a given left degree $d_L$ and number of input nodes n, it is better to choose graphs of large girth. However, as shown next, as the girth of a graph is increased, the number of measurements m also increases. As shown below, the "optimal" choice for the girth is actually six.
Observe from Theorem 10, and specifically (28), that the pair $(A, \Delta_{BP})$ achieves robust k-sparse recovery whenever ρ < 1, or equivalently k < C′/2. From the definition of C′, this bound on the sparsity count for which robust k-sparse recovery is guaranteed can be written as
$$k < \sum_{i=0}^{t}(d_L - 1)^i \qquad (29)$$
if g = 4t + 2, and
$$k < \sum_{i=0}^{t-1}(d_L - 1)^i \qquad (30)$$
if g = 4t. Let us define
$$\bar{k} := (d_L - 1)^t \text{ if } g = 4t + 2, \qquad \bar{k} := (d_L - 1)^{t-1} \text{ if } g = 4t. \qquad (31)$$
It is recognized that k̄ is just the last term in the summations in (29) and (30). Moreover, unless $d_L$ is quite small, the difference between k̄ and the summations in (29) and (30) will be rather small. Thus we use k < k̄ as an easily analyzable, and quite reasonable, approximation to the actual upper bounds on the sparsity count k given in (29) and (30). It is clear that if we choose the matrix A to have higher and higher girth, the bound k̄ also becomes higher. So the question therefore becomes: What happens to m, the number of measurements, as the girth is increased? The answer is given next.
Theorem 11: Suppose $A \in \{0,1\}^{m \times n}$ is $d_L$-left-regular with m ≤ n, and that every row and every column of A contains at least two ones. If the girth g of A equals 4t + 2, then
$$m \geq \bar{k}^{2/(t+1)}\, n^{t/(t+1)} = (d_L - 1)^{2t/(t+1)}\, n^{t/(t+1)}, \qquad (32)$$
whereas if g = 4t for t ≥ 2, then
$$m \geq (d_L - 1)^{(2t-1)/t}\, n^{(t-1)/t}. \qquad (33)$$
The proof of Theorem 11 is based on the following result [31, Equations (1) and (2)]:

Theorem 12: Suppose $A \in \{0,1\}^{m \times n}$ with m < n. Suppose further that, in the bipartite graph associated with A, every node has degree ≥ 2.² Let E denote the total number of edges of the graph, and define $\bar{d}_L = E/n$ and $\bar{d}_R = E/m$ to be the average left-node degree and average right-node degree, respectively. Suppose finally that the graph has girth g = 2r. Then
$$m \geq \sum_{i=0}^{r-1}(\bar{d}_L - 1)^{\lceil i/2 \rceil}(\bar{d}_R - 1)^{\lfloor i/2 \rfloor}. \qquad (34)$$
It is important to note that the above theorem does not require any assumptions about the underlying graph (e.g., regularity). The only assumption is that every node has degree two or more, so as to rule out trivial cases. Usually such theorems are used to find upper bounds on the girth of a bipartite graph in terms of the numbers of its nodes and edges (as in Theorem 13 below). However, we turn it around here and use the theorem to find a lower bound on m, given the integers n and g. Note that if g = 4, then r = 2 and the bound (34) becomes $m \geq \bar{d}_L$, which is trivial; in fact m has to exceed the maximum degree of any left node. However, for g ≥ 6, the bound in (34) is meaningful.
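Under this form of (34), the bound is a one-line computation; the helper below is an illustrative sketch, not the authors' code.

```python
from math import ceil, floor

def girth_lower_bound_m(dL_bar, dR_bar, g):
    # Moore-type bound (34) for a bipartite graph of girth g = 2r:
    # m >= sum_{i=0}^{r-1} (dL-1)^ceil(i/2) * (dR-1)^floor(i/2)
    r = g // 2
    return sum((dL_bar - 1) ** ceil(i / 2) * (dR_bar - 1) ** floor(i / 2)
               for i in range(r))

# girth six with l = 15, q = 149 (the array-code setting of Section VII):
print(girth_lower_bound_m(15, 149, 6))   # 1 + 14 + 14*148 = 2087 = q(l-1)+1
```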
Proof (of Theorem 11): The bound (34) implies that m is no smaller than the last term in the summation; that is,
$$m \geq (\bar{d}_L - 1)^{\lceil (r-1)/2 \rceil}(\bar{d}_R - 1)^{\lfloor (r-1)/2 \rfloor}. \qquad (35)$$
Because A is assumed to be left-regular, we in fact have $\bar{d}_L = d_L$, and we carry the symbol $d_L$ throughout. By definition, $\bar{d}_R = (n d_L)/m$. Therefore, if n ≥ m, it follows that
$$\bar{d}_R - 1 = \frac{n d_L}{m} - 1 \geq \frac{n}{m}(d_L - 1),$$
and therefore (35) implies that
$$m \geq (d_L - 1)^{\alpha}\left(\frac{n}{m}\right)^{\lfloor (r-1)/2 \rfloor}, \qquad (36)$$
where $\alpha = \lceil (r-1)/2 \rceil + \lfloor (r-1)/2 \rfloor = r - 1$.

In the case g = 4t + 2, we have r = 2t + 1, so that α = 2t and $\lfloor (r-1)/2 \rfloor = t$. Therefore (36) becomes
$$m \geq (d_L - 1)^{2t}\left(\frac{n}{m}\right)^{t} = \bar{k}^{2}\left(\frac{n}{m}\right)^{t}.$$
This can be rearranged as $m^{t+1} \geq \bar{k}^2 n^t$, or $m \geq \bar{k}^{2/(t+1)} n^{t/(t+1)}$, which is (32). In the case g = 4t, the proof proceeds along entirely parallel lines and is omitted.

² This is equivalent to the requirement that every row and every column of A contains at least two ones.
It is obvious from (32) that the lower bound is minimized (for a fixed choice of n and k̄) with t = 1, or g = 6. Similarly, the lower bound in (33) is minimized when t = 2, or g = 8. Higher values of g would lead to more measurements being required. We can also compare g = 6 with g = 8 and show that g = 6 is better. Substituting t = 1 in (32) and t = 2 in (33) gives
$$m \geq \bar{k}\sqrt{n} \ \text{ if } g = 6, \qquad m \geq \bar{k}^{3/2}\sqrt{n} \ \text{ if } g = 8. \qquad (37)$$
If we wish to have fewer measurements than the dimension of the unknown vector, we can set m < n. Substituting this requirement into (37) leads to
$$\bar{k} < n^{1/2} \ \text{ if } g = 6, \qquad \bar{k} < n^{1/3} \ \text{ if } g = 8.$$
Hence graphs of girth 6 are preferable to graphs of girth 8, because the upper limit on the recoverable sparsity countk is higher with a graph of girth 6 than with a graph of girth 8.

V. CONSTRUCTION OF NEARLY OPTIMAL GRAPHS OF GIRTH SIX
The discussion of the preceding section suggests that we must look for bipartite graphs of girth six for which the integer m satisfies the bound (34) with the ≥ replaced by equality, or at least comes close to it. In this section it is shown that a certain class of binary matrices has girth six. Then we give two specific constructions. The first of these is based on array codes, which are a class of low density parity check (LDPC) codes, and the second is based on Euler squares. The first construction is easier to explain, but the second one gives far more flexibility in terms of the number of measurements. Here is the general theorem.
Theorem 13: Suppose q ≥ 5 and 4 ≤ l < q are integers, and suppose that $A \in \{0,1\}^{lq \times q^2}$ satisfies the following conditions:
1) A is left-regular with left degree l (so that the average right degree is q).
2) The maximum inner product between any two columns of A is one.
3) Every row and every column of A have at least two ones.
Then the girth of A is six.
Remark: Before proving the theorem, let us see how closely such a matrix satisfies the inequality (34). In the constructions below we have $\bar{d}_L = d_L = l$, g = 6 and r = 3. Therefore the bound in (34) becomes
$$m \geq 1 + (l - 1) + (l - 1)(q - 1) = q(l - 1) + 1.$$
Since m = lq, we see that the actual value of m exceeds the lower bound by a factor of l/(l − 1) (after neglecting the final term of 1 on the right side). Note that there is no guarantee that the lower bound in Theorem 12 is actually achievable. So the class of matrices proposed above (if they can actually be constructed) can be said to be "nearly optimal." In applying this theorem, we would choose q such that $n \leq q^2$ and choose any desired l ≤ q − 1. With such a measurement matrix, basis pursuit will achieve robust k-sparse recovery up to $k < \lceil\sqrt{n}\rceil_p$, where $\lceil x \rceil_p$ denotes the smallest prime number larger than or equal to x.
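A small helper for this parameter choice, assuming sympy's nextprime as an implementation convenience (not part of the paper):

```python
from math import isqrt
from sympy import nextprime

def array_matrix_params(n, k):
    q = nextprime(isqrt(n - 1))        # smallest prime >= ceil(sqrt(n))
    l = k + 1                          # left degree; recovery up to k < l
    assert l <= q - 1, "sparsity too large for this q"
    return q, l, l * q                 # (q, l, number of measurements m)

print(array_matrix_params(22201, 14))  # (149, 15, 2235), as in Section VII
```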
Proof (of Theorem 13): Let g = 2r denote the girth of A. Note that g ≥ 6, so that r ≥ 3: by Condition (2) the graph contains no cycle of length four, while Condition (3) guarantees that a cycle exists. With m = lq, $n = q^2$, $\bar{d}_L = l$ and $\bar{d}_R = q$, the inequality (34) implies that
$$lq \geq \sum_{i=0}^{r-1}(l - 1)^{\lceil i/2 \rceil}(q - 1)^{\lfloor i/2 \rfloor}. \qquad (38)$$
We study two cases separately.
Case (1): g = 4t for some t ≥ 2. In this case r = 2t, and retaining only the last two terms (i = 2t − 2 and i = 2t − 1) of the summation in (38) gives
$$lq \geq (l-1)^{t-1}(q-1)^{t-1} + (l-1)^{t}(q-1)^{t-1} = l(l-1)^{t-1}(q-1)^{t-1},$$
or
$$q \geq \left[(l-1)(q-1)\right]^{t-1}. \qquad (39)$$
Since t ≥ 2, (39) implies in particular that q ≥ (l − 1)(q − 1). However, q ≥ 5 and l − 1 ≥ 3, so this would require 3(q − 1) ≤ q, i.e., 2q ≤ 3, which is impossible. (For g = 8, i.e., t = 2, the inequality (39) reads q ≥ (l − 1)(q − 1), which can hold only for l = 1, 2, 3 and not if l ≥ 4.) Hence A cannot have girth 4t for any t ≥ 2.

Case (2): g = 4t + 2 for some t ≥ 1. In this case r = 2t + 1, and retaining only the last two terms (i = 2t − 1 and i = 2t) of the summation, (38) becomes
$$lq \geq (l-1)^{t}(q-1)^{t-1} + (l-1)^{t}(q-1)^{t} = q(l-1)^{t}(q-1)^{t-1}.$$
As before, this can be rewritten as
$$l \geq (l-1)^{t}(q-1)^{t-1}. \qquad (42)$$
This inequality can hold if t = 1, because then the right side equals l − 1. However, if t > 1, then (42) implies that
$$l \geq (l-1)^2(q-1) \geq 4(l-1)^2 > l,$$
which is impossible. Hence (42) implies that t = 1, or that g = 6.
In what follows, we present two explicit constructions of binary matrices that satisfy the conditions of Theorem 13. The first construction is taken from the theory of low density parity check (LDPC) codes and is a generalization of [32]; this type of construction for LDPC codes was first introduced in [33]. Let q be a prime number and let $P \in \{0,1\}^{q \times q}$ be any cyclic permutation matrix on [q]. In [32], P is taken as the shift permutation matrix defined by $P_{i,i-1} = 1$ and the rest zeros, where i − 1 is interpreted modulo q; then $P^q = I$, the identity matrix. Let l < q be any integer and define the matrix $H(q, l) \in \{0,1\}^{lq \times q^2}$ as the block-partitioned matrix
$$H(q, l) = \begin{bmatrix} I & I & \cdots & I \\ I & P & \cdots & P^{q-1} \\ \vdots & \vdots & & \vdots \\ I & P^{l-1} & \cdots & P^{(l-1)(q-1)} \end{bmatrix}.$$
The matrix H(q, l) is bi-regular, with left (column) degree l and right (row) degree q. It is rank-deficient, having rank (q − 1)l + 1. In principle we could drop the redundant rows, but that would destroy the left-regularity of the matrix, thus rendering the theory in this paper inapplicable. (However, the resulting matrix would still be right-regular.) Moreover, due to the cyclic nature of P, it follows that the inner product between any two columns of H(q, l) is at most equal to one. It is shown in [32, Proposition 1] that H(q, l) has girth six, but here that statement follows from Theorem 13.
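Here is an illustrative sketch of H(q, l), with P taken as the shift permutation matrix as in [32], together with checks of the hypotheses of Theorem 13; the code is not the authors' implementation.

```python
import numpy as np

def array_code_matrix(q, l):
    # l x q grid of q x q blocks; block (i, j) is P^(i*j), with P^q = I
    P = np.roll(np.eye(q, dtype=np.uint8), 1, axis=0)   # P[i, i-1] = 1 (mod q)
    return np.block([[np.linalg.matrix_power(P, (i * j) % q)
                      for j in range(q)] for i in range(l)])

H = array_code_matrix(7, 4)              # 28 x 49
assert (H.sum(axis=0) == 4).all()        # left degree l
assert (H.sum(axis=1) == 7).all()        # right degree q
G = H.astype(int).T @ H.astype(int)
np.fill_diagonal(G, 0)
assert G.max() <= 1                      # no 4-cycles, so girth six by Theorem 13
```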
The second construction is based on Euler squares. In [24], a general recipe is given for constructing generalized Euler squares. This is used in [23] to construct an associated binary matrix of order $lq \times q^2$, where q is an arbitrary integer (in contrast with the construction of [32], which requires q to be a prime number), such that the maximum inner product between any two columns is at most equal to one. Again, by Theorem 13, such matrices have girth six and are thus nearly optimal for compressed sensing. The upper bound on l is as follows: let $q = 2^{r_0} p_1^{r_1} \cdots p_s^{r_s}$ be the prime decomposition of q; then $l + 1 \leq \min\{2^{r_0}, p_1^{r_1}, \ldots, p_s^{r_s}\}$. In particular, if q is a prime or a power of a prime, then we can have l = q − 1. It is easy to verify that, if q is a prime, then the construction in [23] is the same as the array code construction of [32] with permuted columns. For the case where q is a prime power, the construction is more elaborate and is not pursued further here.
Example 3: In this example we compare the number of samples required when using the DeVore construction of [20] and a matrix that satisfies the hypotheses of Theorem 13, such as the array code matrix or the Euler square matrix. The conclusions are: (i) when $k < \sqrt{n}/4$, the DeVore construction requires fewer measurements than the array code, whereas when $\sqrt{n}/4 < k < \sqrt{n}$, the array code type of matrix requires fewer measurements; (ii) when $k > \sqrt{n}/2$, the DeVore construction requires more measurements than n, the dimension of the unknown vector, whereas the array code construction has m < n whenever $k < \sqrt{n}$. To see this, recall that the DeVore construction produces a matrix of dimensions $q^2 \times q^{r+1}$ with the maximum inner product between columns equal to r, and with each column containing q ones. So if we choose r = 2, then λ in Theorem 9 equals 2, while $d_L = q$. Consequently the DeVore matrix satisfies the RNSP of order k whenever k < q/2, and the number of measurements $m_D$ equals $q^2 = 4k^2$ (choosing q ≈ 2k). Thus $m_D < n$ requires $4k^2 < n$, or $k < \sqrt{n}/2$. In contrast, a matrix of the type discussed in Theorem 13 has dimensions $lq \times q^2$, where $n = q^2$ and l = k + 1. For this class of matrices, we have λ = 1 and left degree l, so the matrix satisfies the RNSP of order k = l − 1 < q, and the number of measurements equals $m_A = lq = (k+1)q$. Now $4k^2 < kq$ if and only if $k < q/4 = \sqrt{n}/4$. Also $m_A = (k+1)q < n = q^2$ whenever $k + 1 < q = \sqrt{n}$. Here, in the interests of simplicity, we ignore the fact that q has to be a prime number in both cases, as well as various rounding-up operations.

VI. LOW GIRTH IN COMPRESSED SENSING VS. HIGH GIRTH IN CODING THEORY
As shown in the previous section, in compressed sensing left-regular bipartite graphs of girth six are preferable to graphs with higher girths. It is easy to understand why graphs of girth four are undesirable. For left-regular graphs of column degree d L and girth four, recovery is guaranteed only for k < (d L − 1)/2, whereas for left-regular graphs of column degree d L and girth six, recovery is guaranteed for k < d L , or twice as large a bound. However, it is counter-intuitive that graphs of still higher girth are also inferior to graphs of girth six when it comes to compressed sensing, because in LDPC coding, the higher the girth, the better the decoding performance.
In order to explain this disparity, we quote verbatim a comment by one of the reviewers, who said: Although it is correct that in the area of LDPC codes large girth helps in the limit n → ∞, in practice people use mostly parity-check matrices with girth six. The reason for this is that most of the gain is by going from girth four to girth six. Going to larger girth is mostly not worthwhile because of the loss of flexibility in designing parity-check matrices for the typical values of n of interest.
Intuitively, it is clear that for a given code length, given variable node degree distribution and given check node distribution, the larger the required girth, the fewer Tanner graphs there will be. (Clearly, if the girth requirement is beyond some bound, there will be no Tanner graph.) Writing down the relevant constraints is particularly convenient for the popular class of quasi-cyclic LDPC codes. See, for example [34], [35].
Many papers have empirically observed that going from girth four to girth six brings the most benefit, with limited payoff beyond that. A mathematical approach to understand this can be found in [36,Section 8.3], which is the extended version of [37].
The fact is that, while both coding and compressed sensing use binary matrices, there are some significant differences between them. In coding, the number of bit-flipping errors k (which is analogous to the sparsity count in compressed sensing) is a linear multiple of n, say k = αn for some α ∈ (0, 1). In this case the universal lower bound from Theorem 3 becomes $m = \Omega(n\alpha \ln(1/\alpha))$, and the challenge is to design codes for which the number m of parity check bits grows linearly with n. In contrast, in compressed sensing, the emphasis is on the case where k grows sub-linearly with respect to n, and the objective is to ensure that the number of measurements m also grows more slowly than n, though faster than k. In this setting, the rate of the code, defined as 1 − m/n, approaches 1 as n grows. For this setting, as shown here, the optimal girth of the bipartite graph is six.

VII. NUMERICAL EXPERIMENTS
In this section we carry out various numerical experiments to illustrate the use of the array code binary matrices proposed in this paper. The experiments include a comparison of the array code binary matrix and the DeVore construction of binary matrices from [20] with random Gaussian matrices. In Section VII-A, we compute the number of measurements that suffice to guarantee the recovery of k-sparse n-dimensional vectors for each of these classes of measurement matrices. We also compute the CPU time for $\ell_1$-norm minimization with each class of matrices. While the absolute CPU time is not meaningful, the relative values are indeed meaningful. In Section VII-B we study the phenomenon of "phase transition" in $\ell_1$-norm minimization, whereby, for fixed n and m and increasing values of k, the probability of success on randomly generated k-sparse n-vectors suddenly drops from 100% to 0%. We compare numerical results for array code binary matrices and DeVore binary matrices with randomly generated Gaussian matrices, for which a formal theory is available.

A. Guaranteed Recovery
In this subsection, we compare the number of measurements m and the CPU time for $\ell_1$-norm minimization when $n = 149^2 = 22{,}201$, for two different values of k, namely k = 14 and k = 69. Note that both values of k are smaller than $\sqrt{n}$. For each of the array code matrix, the DeVore matrix, and a random Gaussian matrix, the number of measurements m is chosen so as to guarantee robust k-sparse recovery using basis pursuit. In the case of the random Gaussian matrix, the failure probability ξ is chosen as $10^{-9}$ and the number of samples m is chosen in accordance with Theorem 2, specifically (6). When $n = 149^2$ and k = 14, with the array code matrix we choose $q = \sqrt{n} = 149$ and $d_L = k + 1 = 15$, which leads to $m = d_L\sqrt{n} = 2{,}235$ measurements. With DeVore's construction, we choose q to be the smallest prime exceeding 2k, namely q = 29, so that $m = 29^2 = 841$. Because $k < \sqrt{n}/4$, the DeVore construction requires fewer measurements than the array code matrix, as shown in Example 3. When k = 69, with the array code matrix we choose $d_L = k + 1 = 70$ and $m = d_L\sqrt{n} = 10{,}430$ measurements. In contrast, with the DeVore construction, we choose q to be the smallest prime exceeding 2k, namely 139, which leads to $m = q^2 = 19{,}321$. Because $k > \sqrt{n}/4$, the DeVore construction requires more measurements than the array code matrix, as shown in Example 3. For the random Gaussian matrix, when k = 14, Equation (6) gives m = 11,683; when k = 69, Equation (6) gives m = 44,345, that is, more than n. Therefore using random Gaussian matrices is not meaningful in this case.
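The binary-matrix measurement counts quoted above can be reproduced with a few lines; sympy's nextprime is an implementation convenience, not the paper's code.

```python
from sympy import nextprime

n, q_A = 149**2, 149
for k in (14, 69):
    q_D = nextprime(2 * k)            # DeVore: smallest prime exceeding 2k
    m_D = q_D**2                      # DeVore: m = q^2
    m_A = (k + 1) * q_A               # array code: m = (k+1) q
    print(f"k={k}: m_D={m_D}, m_A={m_A}")
# k=14: m_D=841,   m_A=2235   (DeVore better, since k < sqrt(n)/4)
# k=69: m_D=19321, m_A=10430  (array code better, since k > sqrt(n)/4)
```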
The results are shown in Table I. From this Table it can be seen that both classes of binary matrices (DeVore and array code) require significantly less CPU time compared to random Gaussian matrices. As shown in Example 3, the DeVore matrix is to be preferred when k < √ n/4 while the array code matrix is to be preferred when k > √ n/4. But in either case, both classes of matrices are preferable to random Gaussian matrices.

B. Phase Transition Study
In this subsection we compare the phase transition behavior of the basis pursuit formulation with both classes of binary matrices (DeVore and array code) and random Gaussian matrices.
Suppose we choose integers n, m < n, together with a matrix A and use basis pursuit as the decoder. If a k-sparse vector is chosen at random, we can ask: What is the probability that (A, Δ BP ) recovers the vector and how does it change as k is increased? We would naturally expect that the probability of success would be 100% for k sufficiently small (because various sufficient conditions for guaranteed recovery would be satisfied) and 0% for k sufficiently large. Further, we would expect a gradual drop-off for in-between values of k. The reality however is quite different. There is a sharp transition between success and failure, which is known as a phase transition.
To make the discussion precise, let us define two quantities: $\theta := m/n$, which is known as the undersampling ratio, and $\phi := k/m$, which is known as the oversampling ratio. For fixed m and n, let us vary k and make a plot of θ versus φ. We can compute three quantities: $\phi_{95}$, which is the value at which the probability of recovering a random k-sparse vector is 95%, and $\phi_{50}$ and $\phi_{5}$, with obvious definitions. The difference $\phi_{5} - \phi_{95}$ is called the transition width and is denoted by w.
The phase transition phenomenon is analyzed theoretically in a series of papers, for the case where the measurement matrix A consists of mn independent samples of Gaussian random variables, using convex polytope theory [6], [38]. A formula is derived for $\phi_{50}$ as a function of θ, which might be referred to as the "transition boundary." However, this is not a closed-form formula. It is further shown that the transition width is roughly equal to $C/\sqrt{n}$, where C is a constant that does not depend on n. In addition, it is shown through numerical simulations in [10], [38], [39] that a large class of random and deterministic measurement matrices display the same phase transition behavior as Gaussian matrices, even though there is as yet no theoretical analysis for anything other than random Gaussian matrices.
Against this background, it is of interest to study whether the two classes of binary matrices studied here, namely the array code matrix and the DeVore construction, also display the same phase transition behavior as Gaussian matrices. Specifically, we study the following questions through numerical simulations: 1) For a given θ, is the 50% recovery value $\phi_{50}$ more or less the same for all three types of matrices? 2) Is the phase transition width w more or less the same for all three types of matrices? 3) As n is varied, does the phase transition width vary as $C/\sqrt{n}$ for some constant C that is independent of the method used to generate the measurement matrix? 4) What is the CPU time with each type of binary matrix? Here we give details of the study. For the Gaussian measurement matrices and the DeVore measurement matrices, the dimension n of the vector is chosen to be 1,024, to match the previous literature on the topic. The phase transition boundary for the Gaussian case is computed using the software provided by Prof. David Donoho. For the array code class, we chose $n = 961 = 31^2$, which is the nearest square of a prime to 1,024. Once n is chosen, for the Gaussian matrices every value of m (the number of measurements) is permissible. However, for each class of binary matrix, only certain values of m are permissible. For the DeVore class, m equals the square of a prime power q with $m = q^2 < n$; thus the permissible choices for q are {11, 13, 16, 17, 19, 23, 25, 29, 31}.
Note that we omitted the possibility q = 8 as being too small. In the case of array matrices, $n = 31^2 = q^2$, and the permissible values of m are lq as l ranges from 2 to q − 1 = 30, that is, {62, 93, . . . , 930}. For each permissible choice of m, an appropriate measurement matrix A is generated. Once this is done, 100 random k-sparse vectors are generated, and $\ell_1$-norm minimization (basis pursuit) is applied to each random k-sparse vector with the measurement matrix of each class. The optimization is carried out using the CVX package in MATLAB.
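The following sketch outlines the same experiment with SciPy in place of CVX/MATLAB; parameter names, the solver choice, and the success tolerance are illustrative assumptions, not the exact experimental configuration.

```python
import numpy as np
from scipy.optimize import linprog

def bp(A, y):
    # basis pursuit, eq. (1) with eta = 0, as a linear program in (z+, z-)
    n = A.shape[1]
    res = linprog(np.ones(2 * n), A_eq=np.hstack([A, -A]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]

def success_fraction(A, k, trials=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    wins = 0
    for _ in range(trials):
        x = np.zeros(n)
        # "random signed vector" model: nonzero entries equal to +/- 1
        x[rng.choice(n, k, replace=False)] = rng.choice([-1.0, 1.0], size=k)
        wins += np.max(np.abs(bp(A, A @ x) - x)) < tol
    return wins / trials

# phi_50 for a given matrix: sweep k = 1, 2, ..., m and locate where the
# success fraction crosses 0.5; phi_95 and phi_5 are found analogously.
```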
Since there is a great deal of information to be presented, we first show the results for the DeVore construction of [20] in Fig. 1, and then the results for the array code construction in Fig. 2. These figures show $\phi_{5}$, $\phi_{50}$ and $\phi_{95}$ for the two methods.
Then in Fig. 3, we plot the numerically determined median values $\phi_{50}$ for the two classes of binary matrices (DeVore and array code), together with the theoretically determined values from [40, Fig. 1]. Note that there are two theoretical curves here, corresponding to the case where the unknown vector x is k-sparse with each nonzero value equal to ±1 (blue curve), and the case where each nonzero value is uniformly distributed over [−1, 1] (magenta curve). The first case is known as a "random signed vector" and the second as a "random bounded vector." From Fig. 3, it can be seen that the observed transition boundary for each of the two binary matrices closely matches the theoretical transition boundary with Gaussian matrices and random signed vectors. In contrast, the transition boundary value of φ (at which the success ratio is 50%) with random vectors taking arbitrary values in [−1, 1] is much lower with Gaussian matrices than with either of the two binary matrices.

TABLE V: Comparison of the number of measurements for the DeVore binary matrix, the array code binary matrix, and the random Gaussian matrix. Note that $m_D = q_D^2$ and $m_A = (k+1)q_A$; the quantity $m_G$ is computed according to (6).

Next we analyze the results shown in these figures. To make the comparisons between methods readable, we display the results in two separate tables. Table II gives a comparison between the DeVore binary matrices and random Gaussian matrices. Table III gives a comparison between the array code binary matrices and random Gaussian matrices.
Next, we compute the transition width ($\phi_{5} - \phi_{95}$) for various values of θ, for three different values of n, namely 256, 512 and 1,024, using the DeVore binary matrix. The objective is to determine whether the transition width varies as $C_1/\sqrt{n}$ for some constant $C_1$ that is independent of n. For a fixed choice of n, for each (permissible) value of θ, we compute the transition width w and see how constant it is with respect to θ. It can be seen from the table that w is indeed relatively constant even as θ varies. We then averaged the various values of w over θ for each fixed n, to arrive at an average transition width, shown as $\bar{w}$ in the table. Finally, we computed $C_1 := \bar{w}\sqrt{n}$ for the three values of n. The expectation is that this constant $C_1$ should be independent of n. In reality, the values of $C_1$ for n = 256 and n = 512 are quite close, while that for n = 1,024 is noticeably higher.

VIII. DISCUSSION
In this paper we have built upon previously proven sufficient conditions for stable k-sparse recovery and shown that they actually guarantee robust k-sparse recovery; that is, they enable basis pursuit to achieve k-sparse recovery in the presence of measurement noise. We then derived a universal lower bound on the number of measurements required for a binary matrix to satisfy this sufficient condition. Ideally, we would like to prove a universal necessary condition along the following lines: if a left-regular binary measurement matrix A achieves robust k-sparse recovery of order k, then $d_L \geq \phi(k)$, where φ(·) is some function that is waiting to be discovered. In such a case, the bounds in Theorem 10 would truly be universal. At present, there are no known universal necessary conditions for binary measurement matrices, other than Theorem 3, which is applicable to all matrices, not just binary ones.
Note that, as shown in [15, Problem 13.6], a binary matrix does not satisfy the RIP of order k with constant δ unless $m = \Omega(\min\{k^2, n\})$. This negative result has often been used to suggest that binary matrices are not suitable for compressed sensing. However, the RIP is only a sufficient condition for robust sparse recovery, and as shown here, it is possible to provide far weaker sufficient conditions for robust sparse recovery in terms of the RNSP when the measurement matrix is binary. This is consistent with the results of [19], which show that the RIP implies the RNSP; hence any sufficient condition that is derived using the RIP can also be derived using the RNSP. The present paper goes further by deriving a sufficient condition based on the RNSP that is strictly weaker than the best available condition based on the RIP. Moreover, it is possible to compare the sample complexities implied by (6) for random Gaussian matrices with those corresponding to the DeVore class and the array code class, to see that when $n < 10^5$ and $k < \sqrt{n}$, binary matrices in fact require fewer measurements, as shown in Table V.
One might argue that the bound in (6) is only a sufficient condition for the number of measurements, and that in actual examples far fewer measurements suffice. This is precisely the motivation behind studying the phase transition of basis pursuit with binary matrices. As shown in Section VII-B, there is in fact no difference between the phase transition behavior of random Gaussian matrices and that of binary matrices. This observation reinforces earlier observations in [10]. In other words, the fraction of randomly generated k-sparse vectors that can be recovered using m measurements is the same whether one uses Gaussian matrices or binary matrices. Given that basis pursuit can be implemented much more efficiently with binary measurement matrices than with random Gaussian matrices, and that both classes of matrices exhibit similar phase transition properties, there appears to be a very strong case for preferring binary measurement matrices over random Gaussian matrices, notwithstanding the "order-optimality" of the latter class. In this connection, it would be worthwhile to explore whether other classes of measurement matrices also exhibit phase transition behavior that is quantitatively similar to that of Gaussian and binary matrices.
There is one final point that we wish to make. Theorem 11 suggests that, in order to use binary matrices for compressed sensing, it is better to use graphs with small girth, in fact of girth six. This runs counter to the intuition in LDPC decoding, where one wishes to design binary matrices with large girth. Indeed, in [41], the authors build on an earlier paper [42] and develop a message-passing type of decoder that achieves order-optimality using a binary matrix. The binary matrices used in [41] all have large girth, Ω(ln n), which is of the order of the theoretical maximum. One possible explanation for this discrepancy is that the model for compressed sensing used in [41] is different from the one used here and in most of the compressed sensing literature. Specifically (to paraphrase a little), in [41] each component of the unknown vector is binary, and the probability that a component equals one is k/n. Thus the expected number of nonzero bits is k, but it could be larger or smaller; the actual sparsity count is a random number that could exceed k. The recovery results proved in [41] are also probabilistic in nature. It is worth further study to determine whether this difference is sufficient to explain why, in compressed sensing, graphs of low girth are to be preferred.

Mahsa Lotfi received the B.Sc. and M.Sc. degrees in electrical engineering from Isfahan University of Technology, Isfahan, Iran, in 2012 and 2015, respectively, and the Ph.D. degree in electrical engineering from the University of Texas at Dallas, TX, USA, in 2018. Her doctoral thesis focused on developing recovery algorithms for compressive sensing using deterministic measurement matrices with near-optimal compression rates. Besides compressive sensing, she has carried out several research projects on classification and regression problems in machine learning, as well as in biomedical image processing. Currently, she is a Postdoctoral Scholar with Professor David Donoho in the Statistics Department at Stanford University, where she works on data science projects including the development of super-resolution algorithms for biomedical imaging.
Mathukumalli Vidyasagar was born in Guntur, India on September 29, 1947. He received the B.S., M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin in Madison, in 1965, 1967 and 1969, respectively. Between 1969 and 1989, he was a Professor of electrical engineering at Marquette University, Concordia University, and the University of Waterloo. In 1989, he returned to India as the Director of the newly created Centre for Artificial Intelligence and Robotics (CAIR) in Bangalore. In 2000, he moved to the Indian private sector and joined India's largest software company, Tata Consultancy Services, as an Executive Vice President, posted in Hyderabad. In 2009, he retired from TCS and joined the Erik Jonsson School of Engineering & Computer Science at the University of Texas at Dallas as the Cecil & Ida Green Chair in Systems Biology Science. He retired from UT Dallas at the end of 2017 and joined the Indian Institute of Technology Hyderabad, where he is a National Science Chair and a Distinguished Professor. His current research interests are in the areas of machine learning and compressed sensing. On the applications front, he is interested in applying ideas from machine learning to problems in computational biology, with emphasis on cancer. Vidyasagar has received a number of awards in recognition of his research contributions, including Fellowship of The Royal Society, the world's oldest scientific academy in continuous existence, the IEEE Control Systems (Technical Field) Award, the Rufus Oldenburger Medal of ASME, the John R. Ragazzini Education Award from AACC, and others. He is the author of 13 books and about 150 papers in peer-reviewed journals.