An Armijo-Type Hard Thresholding Algorithm for Joint Sparse Recovery

Joint sparse recovery (JSR) in compressed sensing simultaneously recovers sparse signals that share a common sparsity structure from multiple measurement vectors obtained through a common sensing matrix. In this paper, we present an Armijo-type hard thresholding (AHT) algorithm for joint sparse recovery. Under the restricted isometry property (RIP), we show that AHT converges to a local minimizer of the optimization problem for JSR, and we derive its rate of convergence under the same conditions. Numerical experiments demonstrate the good performance of the new algorithm for JSR.


I. INTRODUCTION
Sparse recovery in signal processing [1], [2] addresses the problem of recovering sparse signals from underdetermined linear measurements. That is, one aims to recover a sparse vector x ∈ R^n from the measurement y = Ax ∈ R^m with m < n. A vector x is sparse if most of its entries are zero. This kind of problem is referred to as the single measurement vector (SMV) problem, which is a significant problem in the era of big data. Mathematically, it can be solved by the following optimization problem:

    min_{x ∈ R^n} (1/2) ||y − Ax||^2  subject to  ||x||_0 ≤ s,    (1)

where ||x||_0 counts the number of nonzero entries of x, s < n, and ||·|| is the ℓ2 norm of a vector.
In some applications, such as magnetoencephalography (MEG), there are multiple measurement vectors, each corresponding to a different unknown vector, with the property that all unknown vectors share a common support. This kind of sparse recovery problem is called the multiple measurement vector (MMV) problem or joint sparse recovery (JSR) problem. It recovers the source matrix X ∈ R^{n×p} with joint sparsity from the matrix Y ∈ R^{m×p} of multiple measurement vectors given by Y = AX. Here, the joint sparsity of the matrix X is called row sparsity, defined as the number of nonzero rows in X. Usually, the JSR problem can be written as

    min_{X ∈ R^{n×p}} (1/2) ||Y − AX||_F^2  subject to  ||X||_{r,0} ≤ s,    (2)

(The associate editor coordinating the review of this manuscript and approving it for publication was Chih-Yu Hsu.)
where ||·||_F is the Frobenius norm of a matrix, ||X||_{r,0} counts the number of rows of X containing nonzero entries, and s < n. Denote by X^{(i)} the ith row of X and by Γ_r(X) = {i = 1, . . . , n : X^{(i)} ≠ 0} the row support of X. Then ||X||_{r,0} = |Γ_r(X)|, where |J| is the cardinality of an index set J. In recent years, a substantial body of work has addressed the JSR problem via relaxation methods [3]-[8]. In addition, many greedy methods for a single measurement vector have been extended to the JSR problem. Tropp et al. [9], [10] introduced simultaneous recovery algorithms based on orthogonal matching pursuit (OMP), called simultaneous orthogonal matching pursuit (SOMP) [11]. In [12], Foucart extended the hard thresholding pursuit (HTP) [13] to the joint sparse recovery problem. The authors in [14] provided a comprehensive investigation of the extension to the joint sparse recovery problem of five known greedy algorithms designed for the single measurement vector problem: iterative hard thresholding (IHT) [15], normalized IHT (NIHT) [16], HTP [13], normalized HTP (NHTP) [13], and compressive sampling matching pursuit (CoSaMP) [17]. By considering the rank of the unknown signal matrix X or the subspace spanned by the measurement vectors Y, various joint sparse recovery algorithms have been proposed, such as rank-aware order recursive matching pursuit (RA-ORMP) [22], subspace-augmented multiple signal classification (SA-MUSIC) [23], and signal space matching pursuit (SSMP) [24].
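To illustrate the notation, the row support Γ_r(X) and the row sparsity ||X||_{r,0} can be computed as in the following minimal NumPy sketch (the function names are ours, not the paper's):

```python
import numpy as np

def row_support(X, tol=0.0):
    """Gamma_r(X): indices of rows of X with nonzero Euclidean norm."""
    row_norms = np.linalg.norm(X, axis=1)
    return np.flatnonzero(row_norms > tol)

def row_sparsity(X, tol=0.0):
    """||X||_{r,0}: the number of nonzero rows of X."""
    return row_support(X, tol).size
```

For example, a 5 × 3 matrix whose only nonzero rows are rows 1 and 4 has row support {1, 4} and row sparsity 2.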
In this paper, we pay particular attention to the IHT algorithm for problem (1), which was proposed by Blumensath and Davies in [15], since IHT has good recovery properties and its framework is simple and easy to implement. The convergence of IHT to stationary points was established in [18]. In [16], the authors showed that the numerical behavior of IHT is not always promising and that the algorithm often fails to converge when its conditions fail. They then proposed NIHT with an adaptive stepsize and line search and proved that it converges to a local minimizer if the matrix A is row full-rank and s-regular, where A being s-regular means that any s columns of A are linearly independent [18]. Cartis and Thompson [19] showed that NIHT converges to a local minimizer if A is 2s-regular.
To guarantee the convergence of the algorithm, several line search rules have been proposed. Among these rules, the Armijo-type line search rule [25] is the most popular and is easier to implement on computers than other line searches [26]. Inspired by these works, we propose an Armijo-type hard thresholding (AHT) algorithm to solve problem (2).
Since problem (2) is, in general, a nonconvex program, we can only expect to find a local minimum rather than the global minimum. We first show that any accumulation point of the AHT iterates is a stationary point. Under the restricted isometry property (RIP) of the matrix A, we prove that this stationary point is a local minimizer. Moreover, we establish the full convergence of AHT and its linear rate of convergence under the RIP. Finally, we report several experiments that demonstrate the effectiveness of the algorithm. This paper is organized as follows. Section II presents the AHT algorithm for problem (2) and proves its convergence properties. Numerical results are given in Section III. The last section makes concluding remarks.

II. ALGORITHM AND CONVERGENCE
In this section, we present the Armijo-type hard thresholding (AHT) algorithm for problem (2) and then analyze its convergence properties. Denote f(X) := (1/2) ||AX − Y||_F^2. Then the gradient of f is ∇f(X) := A^T (AX − Y). Denote the row sparse set

    S := {X ∈ R^{n×p} : ||X||_{r,0} ≤ s}.

For a matrix A ∈ R^{m×n}, the restricted isometry constants (RIC) of order s, denoted L_s and U_s, are defined as follows: for all x ∈ R^n with ||x||_0 ≤ s, it holds that

    L_s ||x||^2 ≤ ||Ax||^2 ≤ U_s ||x||^2.    (3)

It is easy to verify that if A has RIP constants L_s and U_s, then for any X ∈ S, we have

    L_s ||X||_F^2 ≤ ||AX||_F^2 ≤ U_s ||X||_F^2.    (4)

For X ∈ R^{n×p} and a set Ω ⊆ R^{n×p}, the projection of X onto Ω is

    Π_Ω(X) := argmin_{Z ∈ Ω} ||X − Z||_F.    (5)

Note that the projection onto the sparse set S, written Π_S(·), sets all but the s rows of X with largest ℓ2 norms to zero.
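The projection Π_S(·) amounts to hard thresholding of row norms; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def project_row_sparse(X, s):
    """Project X onto S: keep the s rows with largest 2-norms, zero the rest."""
    row_norms = np.linalg.norm(X, axis=1)
    keep = np.argsort(row_norms)[-s:]   # indices of the s largest row norms
    Z = np.zeros_like(X)
    Z[keep] = X[keep]
    return Z
```

For instance, projecting a matrix with row norms (1, 5, 2) onto S with s = 1 keeps only the second row.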
The definition of α-stationarity was proposed in [18] based on the notion of a fixed-point equation. Here, we extend it to problem (2).
Definition 1: The point X* ∈ S is called an α-stationary point of problem (2) if it holds that

    X* ∈ Π_S(X* − α ∇f(X*)).

Note that X* ∈ S is an α-stationary point of problem (2) if and only if [18]

    (A^T (AX* − Y))^{(i)} = 0                      if i ∈ Γ*_r,
    ||(A^T (AX* − Y))^{(i)}|| ≤ (1/α) M_s(X*)      if i ∉ Γ*_r,    (6)

where M_i(X) denotes the ith largest ℓ2 norm of the rows of X and Γ*_r := Γ_r(X*).
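The characterization (6) can be tested numerically. The sketch below (function name and tolerance are our own choices, not the paper's) checks both conditions for a candidate point X:

```python
import numpy as np

def is_alpha_stationary(X, A, Y, alpha, s, tol=1e-8):
    """Check characterization (6): gradient rows of f vanish on the row
    support of X, and elsewhere their 2-norms are at most M_s(X)/alpha,
    where M_s(X) is the s-th largest row norm of X."""
    G = A.T @ (A @ X - Y)                 # gradient of f at X
    g_norms = np.linalg.norm(G, axis=1)
    x_norms = np.linalg.norm(X, axis=1)
    support = x_norms > tol
    M_s = np.sort(x_norms)[-s]            # s-th largest row norm of X
    on_ok = np.all(g_norms[support] <= tol)
    off_ok = np.all(g_norms[~support] <= M_s / alpha + tol)
    return bool(on_ok and off_ok)
```

When X fits the measurements exactly (AX = Y), the gradient vanishes and the check trivially succeeds.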

A. ALGORITHM
In NIHT for CS [16], to guarantee that the objective function decreases sufficiently per iteration, the authors added a stepsize strategy based on the RIP of the matrix A [1]. In AHT for (2), to achieve a sufficient reduction in the objective function, we use an Armijo-type stepsize that directly enforces a sufficient decrease of the objective function under the RIP of the matrix A. Here, X_Γ (A_Γ) denotes the submatrix of X (of A) obtained by discarding all but the rows (columns) indexed by Γ. The AHT framework is described as follows.
Step 1. Choose X^0 ∈ S, parameters β ∈ (0, 1) and σ > 0, and set k = 0.
Step 2. Let Γ^k_r = Γ_r(X^k) and compute the initial stepsize

    α^0_k = ||∇f(X^k)_{Γ^k_r}||_F^2 / ||A_{Γ^k_r} ∇f(X^k)_{Γ^k_r}||_F^2.

Step 3. Compute

    X^{k+1} = Π_S(X^k − α_k ∇f(X^k)),

where α_k = α^0_k β^{m_k} and m_k is the smallest positive integer m such that, with α = α^0_k β^m,

    f(Π_S(X^k − α ∇f(X^k))) ≤ f(X^k) − (σ/2) ||Π_S(X^k − α ∇f(X^k)) − X^k||_F^2.    (7)

Step 4. If the stopping criterion is met, then stop; otherwise, set k = k + 1 and return to Step 2.
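The iteration above can be sketched in NumPy as follows; this is a minimal illustration of the scheme, in which the initialization, the backtracking cap, and the stopping tolerance are our own illustrative choices rather than the paper's tuned settings:

```python
import numpy as np

def aht(A, Y, s, beta=0.5, sigma=1e-4, max_iter=500, tol=1e-10):
    """Armijo-type hard thresholding sketch for
    min 0.5*||AX - Y||_F^2  s.t.  ||X||_{r,0} <= s."""
    m, n = A.shape
    X = np.zeros((n, Y.shape[1]))

    def f(Z):
        return 0.5 * np.linalg.norm(A @ Z - Y, 'fro') ** 2

    def project(Z):
        # Keep the s rows of Z with largest 2-norms, zero the rest.
        keep = np.argsort(np.linalg.norm(Z, axis=1))[-s:]
        P = np.zeros_like(Z)
        P[keep] = Z[keep]
        return P

    for _ in range(max_iter):
        G = A.T @ (A @ X - Y)
        row_norms = np.linalg.norm(X, axis=1)
        # Row support of X^k; fall back to the s largest gradient rows at X = 0.
        sup = (np.flatnonzero(row_norms) if row_norms.any()
               else np.argsort(np.linalg.norm(G, axis=1))[-s:])
        denom = np.linalg.norm(A[:, sup] @ G[sup], 'fro') ** 2
        alpha = np.linalg.norm(G[sup], 'fro') ** 2 / denom if denom > 0 else 1.0
        X_new = project(X - alpha * G)
        # Armijo backtracking: shrink alpha until sufficient decrease (7) holds.
        for _ in range(50):
            if f(X_new) <= f(X) - 0.5 * sigma * np.linalg.norm(X_new - X, 'fro') ** 2:
                break
            alpha *= beta
            X_new = project(X - alpha * G)
        if np.linalg.norm(X_new - X, 'fro') <= tol:
            return X_new
        X = X_new
    return X
```

On an easy instance (orthogonal A and a consistent row-sparse signal), the iteration recovers the signal in very few steps.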

Remark 1:
We now briefly illustrate the algorithm.
• In Step 2, we use the optimal stepsize as the initial stepsize for computing X^{k+1}, i.e., a stepsize that maximally reduces the error in each iteration when the row support is fixed. It can be obtained by solving

    α^0_k = argmin_α f(X^k − α ∇f(X^k)_{Γ^k_r}) = ||∇f(X^k)_{Γ^k_r}||_F^2 / ||A_{Γ^k_r} ∇f(X^k)_{Γ^k_r}||_F^2.    (8)

This stepsize is in accordance with the optimal stepsize in NIHT for CS in [16]. It should be noted that the maximal reduction in the objective function is guaranteed only when the row supports of X^k and X^{k+1} coincide. When the row support of X^{k+1} differs from that of X^k, we use an Armijo-type stepsize in Step 3 to decrease the objective function sufficiently. The strategy of the Armijo-type stepsize is as follows: we first compute X^{k+1} with a larger stepsize and check whether the objective function decreases sufficiently, that is, whether (7) holds. If (7) holds, we keep this stepsize; otherwise, we decrease the stepsize. Lemma 1 shows that such a stepsize exists. Combining (4) and (8), we can see that the optimal stepsize satisfies α^0_k ≤ 1/L_s. By the following Lemma 1, the total number k̄ of stepsize reductions needed to reach 0 < α = α^0_k β^{k̄} ≤ 1/(U_{2s} + σ) is at most ln(L_s/(U_{2s} + σ))/ln β + 1.
• We adopt the same stopping criterion as in [27], that is, ||X^{k+1} − X^k||_F falls below a prescribed tolerance, or the number of iterations exceeds the maximum number of iterations.

The following result shows that the Armijo stepsize is well defined under the RIP.
Lemma 1: Assume that the matrix A satisfies the RIP of order 2s and that X^k is an iterate generated by AHT. Then, for any 0 < α ≤ 1/(U_{2s} + σ), it holds that

    f(Π_S(X^k − α ∇f(X^k))) ≤ f(X^k) − (σ/2) ||Π_S(X^k − α ∇f(X^k)) − X^k||_F^2.

Therefore, α_k is well defined.
Proof: Write X^{k+1}(α) := Π_S(X^k − α ∇f(X^k)). According to the computation in Step 3 and the definition (5) of the projection, we have

    ||X^{k+1}(α) − (X^k − α ∇f(X^k))||_F^2 ≤ ||X^k − (X^k − α ∇f(X^k))||_F^2.

It implies that

    ⟨∇f(X^k), X^{k+1}(α) − X^k⟩ ≤ −(1/(2α)) ||X^{k+1}(α) − X^k||_F^2.    (9)

Then we obtain

    f(X^{k+1}(α)) = f(X^k) + ⟨∇f(X^k), X^{k+1}(α) − X^k⟩ + (1/2) ||A(X^{k+1}(α) − X^k)||_F^2
                  ≤ f(X^k) + ⟨∇f(X^k), X^{k+1}(α) − X^k⟩ + (U_{2s}/2) ||X^{k+1}(α) − X^k||_F^2
                  ≤ f(X^k) − (1/(2α) − U_{2s}/2) ||X^{k+1}(α) − X^k||_F^2,

where the first inequality comes from the RIP (4) of order 2s and the second inequality comes from (9). By letting 1/α − U_{2s} ≥ σ, we obtain the desired result from the definition of α.

B. ANALYSIS OF CONVERGENCE
In this subsection, we establish the convergence of AHT using the RIP of A and Lemma 1.
Theorem 1: Let the RIP of the matrix A hold and let {X^k} be generated by AHT. Then,
(i) lim_{k→∞} ||X^{k+1} − X^k||_F = 0;
(ii) any accumulation point of {X^k} is an α-stationary point of (2).
Proof: (i) It follows from the RIP (4) that

    ||A_{Γ^k_r} ∇f(X^k)_{Γ^k_r}||_F^2 ≤ U_{2s} ||∇f(X^k)_{Γ^k_r}||_F^2,

and then α^0_k ≥ 1/U_{2s}. By Lemma 1 and the definition of α_k in the algorithm, we have

    α_k ≥ β/(U_{2s} + σ).

Therefore, α_k is bounded from below by a positive constant. Summing the sufficient decrease condition (7) over k yields

    (σ/2) Σ_k ||X^{k+1} − X^k||_F^2 ≤ f(X^0) < ∞,

and we conclude that lim_{k→∞} ||X^{k+1} − X^k||_F = 0.

(ii) Assume that X* is an accumulation point of the sequence {X^k}; then there exists a subsequence {X^{k_j}} converging to X*, and lim_{j→∞} X^{k_j+1} = X* by (i). To simplify the notation, denote α_k by α, using its boundedness, and let Γ*_r = Γ_r(X*). Based on

    X^{k_j+1} = Π_S(X^{k_j} − α ∇f(X^{k_j}))

in Step 3, we consider two cases.

Case 1: i ∈ Γ*_r. The convergence of {X^{k_j}} and {X^{k_j+1}} guarantees that for some n_1 > 0, ||X^{k_j}_{(i)}|| > 0 and ||X^{k_j+1}_{(i)}|| > 0 for all j > n_1. The definition of the projection onto S shows that

    X^{k_j+1}_{(i)} = (X^{k_j} − α ∇f(X^{k_j}))_{(i)}.

Letting j → ∞ gives (A^T (AX* − Y))^{(i)} = 0.

Case 2: i ∉ Γ*_r. If there exists an n_2 > 0 such that X^{k_j+1}_{(i)} = 0 for all j > n_2, the projection implies that

    ||(X^{k_j} − α ∇f(X^{k_j}))_{(i)}|| ≤ M_s(X^{k_j} − α ∇f(X^{k_j})).

Letting j → ∞ and using the continuity of the function M_s, we obtain that

    α ||(A^T (AX* − Y))^{(i)}|| ≤ M_s(X* − α ∇f(X*)) = M_s(X*).

If there exists an infinite number of indices k_j with ||X^{k_j+1}_{(i)}|| > 0, then, by the same argument as in Case 1, it follows that (A^T (AX* − Y))^{(i)} = 0. Since α_k is bounded from below by a positive constant, we have

    ||(A^T (AX* − Y))^{(i)}|| ≤ (1/α) M_s(X*)  for all i ∉ Γ*_r,

which means that X* is an α-stationary point of (2).

Theorem 2: Let the RIP of the matrix A hold and let {X^k} be generated by AHT. Then any accumulation point of {X^k} is a local minimizer of (2).
Proof: From Theorem 1, we know that any accumulation point of {X^k} is an α-stationary point of (2). Hence we only need to show that an α-stationary point X* of (2) is a local minimizer of (2). We first prove that

    ⟨A^T (AX* − Y), X − X*⟩ = 0

for all X ∈ S sufficiently close to X*, by separating two cases.

Case 1: ||X*||_{r,0} < s. Then M_s(X*) = 0, which together with (6) indicates that A^T (AX* − Y) = 0. It holds that ⟨A^T (AX* − Y), X − X*⟩ = 0.

Case 2: ||X*||_{r,0} = s. For any X ∈ S sufficiently close to X*, we have Γ_r(X) ⊇ Γ*_r. Then ||X||_{r,0} ≥ ||X*||_{r,0} = s. Together with ||X||_{r,0} ≤ s, we obtain Γ_r(X) = Γ*_r and ||X||_{r,0} = s. Therefore,

    ⟨A^T (AX* − Y), X − X*⟩ = ⟨(A^T (AX* − Y))_{Γ*_r}, (X − X*)_{Γ*_r}⟩ + ⟨(A^T (AX* − Y))_{Γ̄*_r}, (X − X*)_{Γ̄*_r}⟩ = 0,

where the last equality comes from (6) and from Γ_r(X) = Γ*_r, with Γ̄*_r the complementary set of Γ*_r. Since the objective function of problem (2) is convex, we have

    f(X) ≥ f(X*) + ⟨∇f(X*), X − X*⟩ = f(X*)

for any X ∈ S sufficiently close to X*. This completes the proof.

C. RATE OF CONVERGENCE
To show the rate of convergence, we need the next lemma.

Lemma 2: Let the RIP (4) of order s hold. Then, for any α > 0, any index set Γ with |Γ| ≤ s, and any X with Γ_r(X) ⊆ Γ, we have

    ||((I − α A^T A) X)_Γ||_F ≤ Δ_s(α) ||X||_F,    (11)

where

    Δ_s(α) := max{|1 − α L_s|, |1 − α U_s|}.    (12)

Proof: Notice that the RIP (4) is equivalent to

    L_s ||X||_F^2 ≤ ||A_Γ X_Γ||_F^2 ≤ U_s ||X||_F^2

for all |Γ| ≤ s and X with Γ_r(X) ⊆ Γ. This means that the eigenvalues of A_Γ^T A_Γ lie in [L_s, U_s], and hence the eigenvalues of I − α A_Γ^T A_Γ lie in [1 − α U_s, 1 − α L_s]. Then we have

    max_{|Γ| ≤ s} ||I − α A_Γ^T A_Γ||_2 ≤ max{|1 − α L_s|, |1 − α U_s|} = Δ_s(α).

This completes the proof.

The next theorem shows the linear convergence rate of AHT. Generally, for a first-order optimization method, we can only expect to obtain a linear convergence rate. The sufficient condition is in accordance with those for the methods based on IHT. For completeness, we give the proof.
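As a sanity check on Lemma 2, the operator norm of I − α A_Γ^T A_Γ can be evaluated from the extreme eigenvalues of A_Γ^T A_Γ, as in the sketch below (a helper of our own, not from the paper):

```python
import numpy as np

def contraction_factor(A_sub, alpha):
    """For a submatrix A_Gamma, ||I - alpha * A_Gamma^T A_Gamma||_2 equals
    max(|1 - alpha*lmin|, |1 - alpha*lmax|), where lmin and lmax are the
    extreme eigenvalues of A_Gamma^T A_Gamma."""
    eig = np.linalg.eigvalsh(A_sub.T @ A_sub)  # ascending eigenvalues
    return max(abs(1.0 - alpha * eig[0]), abs(1.0 - alpha * eig[-1]))
```

For a submatrix with orthonormal columns, all eigenvalues equal 1, so the factor reduces to |1 − α|; in particular it vanishes at α = 1.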
Theorem 3: Assume that the matrix A satisfies the RIP (4) of order 3s and that Δ_{3s}(α) < 1/2 (defined in (12)). If X* is an s-row sparse matrix with Y = AX*, then the sequence {X^k} generated by AHT converges to X*, i.e.,

    ||X^{k+1} − X*||_F ≤ ρ ||X^k − X*||_F,

where ρ := 2 Δ_{3s}(α), with 0 < ρ < 1.

Proof: We use the fact that the s-row sparse matrix X^{k+1} is a better s-term approximation to W^k := X^k − α ∇f(X^k) than X*. This yields

    ||X^{k+1} − W^k||_F^2 ≤ ||X* − W^k||_F^2.

Inserting the term X*, it follows that

    ||(X^{k+1} − X*) + (X* − W^k)||_F^2 ≤ ||X* − W^k||_F^2.

Expanding the left-hand side and eliminating ||X* − W^k||_F^2, we have

    ||X^{k+1} − X*||_F^2 ≤ 2 ⟨X^{k+1} − X*, W^k − X*⟩.

Since Y = AX*, we have W^k − X* = (I − α A^T A)(X^k − X*). Let Γ be the union of the row supports of X^k, X^{k+1} and X*, so that |Γ| ≤ 3s. By Lemma 2 and the Cauchy-Schwarz inequality,

    ⟨X^{k+1} − X*, (I − α A^T A)(X^k − X*)⟩ ≤ Δ_{3s}(α) ||X^{k+1} − X*||_F ||X^k − X*||_F.

Dividing both sides by ||X^{k+1} − X*||_F, we derive

    ||X^{k+1} − X*||_F ≤ 2 Δ_{3s}(α) ||X^k − X*||_F = ρ ||X^k − X*||_F.

Therefore, the sequence {X^k} converges to X* whenever ρ < 1, i.e., Δ_{3s}(α) < 1/2.

III. NUMERICAL EXPERIMENTS
In this section, we conduct numerous experiments to illustrate the performance of AHT. The codes were written in MATLAB, and the experiments were performed in MATLAB R2019a on a desktop computer with a 3.40 GHz CPU and 16 GB RAM. For all the simulations, the data are generated as follows. Two types of sensing matrices A are used, namely, the random Gaussian matrix and the random partial discrete cosine transform (pDCT) matrix:

    Gaussian: A_j ∼ N(0, I/m), j = 1, 2, . . . , n;
    pDCT:     A_ij = m^{−1/2} cos(2π(j − 1)ψ_i), i = 1, 2, . . . , m, j = 1, 2, . . . , n,

where A_j denotes the jth column of A and I is the m-dimensional identity matrix. After generating the matrices, we orthogonalize them to satisfy AA^T = I. We generate the 'true' original signal X_true as follows: first produce an index set T with s indices randomly selected from {1, . . . , n}; then, for each row of X_true with index in T, draw the entries uniformly from [0, 10]. In MATLAB format:

    X_true = zeros(n, p);
    T = randperm(n);
    X_true(T(1:s), :) = 10 * rand(s, p);
    Y = A * X_true + sigma_0 * randn(m, p);
The case σ_0 = 0 corresponds to exact recovery, and σ_0 = 0.01 corresponds to stable recovery.
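For readers working outside MATLAB, the same data generation can be sketched in NumPy; the QR-based row orthogonalization and the seeding interface below are our own choices, and the sketch covers only the Gaussian matrix:

```python
import numpy as np

def generate_data(m, n, p, s, sigma0=0.0, rng=None):
    """Gaussian sensing matrix with orthogonalized rows (A A^T = I),
    an s-row-sparse signal with entries uniform on [0, 10], and
    measurements with optional Gaussian noise of level sigma0."""
    rng = np.random.default_rng(rng)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    # Orthogonalize the rows so that A A^T = I, as in the experiments.
    Q, _ = np.linalg.qr(A.T)          # Q has orthonormal columns
    A = Q[:, :m].T
    T = rng.permutation(n)[:s]        # random row support of size s
    X_true = np.zeros((n, p))
    X_true[T] = 10.0 * rng.random((s, p))
    Y = A @ X_true + sigma0 * rng.normal(size=(m, p))
    return A, X_true, Y
```

With sigma0 = 0 the measurements are exactly consistent, Y = A X_true.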
To assess the accuracy of the solutions and the speed of the seven methods, we run 100 trials for each kind of matrix with n = 512, m = 128 and p = 4. Specific results produced by the seven methods are recorded, where the exact reconstruction ratio (ERR) is defined as [20]

    ERR := (number of exact reconstructions) / (number of total trials),

and the relative error (RE) is computed by

    RE := ||X̂ − X_true||_F / ||X_true||_F,

where X̂ is the recovered matrix. The CPU time (in seconds) reported here does not include the time for data initialization.
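The two metrics can be computed as in the following sketch; the exact-reconstruction threshold is our own illustrative choice, since the paper's threshold is not specified here:

```python
import numpy as np

def relative_error(X_hat, X_true):
    """RE = ||X_hat - X_true||_F / ||X_true||_F."""
    return np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)

def exact_reconstruction_ratio(pairs, thresh=1e-4):
    """ERR: fraction of (X_hat, X_true) trials whose RE is below thresh."""
    hits = sum(relative_error(Xh, Xt) < thresh for Xh, Xt in pairs)
    return hits / len(pairs)
```

For example, over two trials where one recovery is exact and the other is off by a factor of two, the ERR is 0.5.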
For the Gaussian matrix with σ_0 = 0, when the sparsity is no more than 25 (i.e., s/n ≤ 4.9%), all seven methods achieve exact joint sparse recovery. As the sparsity increases, the recovery quality of NIHT, ORMP, MUSIC and SP decreases, while our AHT, CoSaMP and SSMP still recover the signal accurately, with relative errors smaller than those of the other three methods. For σ_0 = 0.01, our AHT and SSMP achieve exact recovery and also behave better than the other five methods in terms of relative error; in terms of CPU time, our AHT runs faster than the other six methods. For the pDCT matrix with σ_0 = 0, when the sparsity is no more than 35 (i.e., s/n ≤ 6.8%), all seven methods achieve exact joint sparse recovery. For σ_0 = 0.01, our AHT, SSMP and ORMP achieve exact joint sparse recovery and their relative errors are small. Our AHT also has an advantage in terms of CPU time.
In conclusion, for exact recovery (σ_0 = 0) with the two kinds of matrices, almost all the methods obtain exact reconstruction. For stable recovery (σ_0 = 0.01), our AHT and SSMP behave better than the other five methods. Notably, our AHT possesses a clear advantage in terms of CPU time.

IV. CONCLUSION
In this paper, we proposed an Armijo-type hard thresholding (AHT) algorithm to solve the joint sparse recovery problem. We showed that the sequence generated by AHT converges to a local minimizer of the optimization problem for JSR, and we also derived the convergence rate of AHT. Numerical experiments illustrated the efficiency of the AHT algorithm for JSR.