A Second Order Algorithm for MCP Regularized Optimization

In this article, we provide two optimality properties of MCP-regularized optimization. One shows that the support set of a local minimizer corresponds to linearly independent columns of $A$; the other provides two sufficient conditions for a stationary point to be a local minimizer. An active-set subspace second-order algorithm for MCP-regularized optimization is proposed. The active sets are estimated by an identification technique that can accurately identify the zero components in a neighbourhood of a stationary point. The search direction consists of two parts: some of the components are simply defined, while the others are determined by a second-order method. A nonmonotone line search strategy that guarantees global convergence is used. Numerical comparisons with several state-of-the-art methods demonstrate the efficiency of the proposed method.


I. INTRODUCTION
In this article, we focus on the MCP-regularized optimization problem
$$\min_{x\in\mathbb{R}^n}\ \varphi(x) := f(x) + \sum_{i=1}^n r_i(x_i), \qquad (1)$$
where $f$ is continuously differentiable and $r_i$ is the minimax concave penalty (MCP)
$$r_i(t) = \begin{cases} \lambda|t| - \dfrac{t^2}{2\theta}, & |t| \le \lambda\theta,\\[2pt] \dfrac{\lambda^2\theta}{2}, & |t| > \lambda\theta, \end{cases}$$
with $\lambda > 0$ and $\theta > 1$. The MCP is a fast, continuous, nearly unbiased and sparsity-promoting penalty function [43]. An important application of (1), found in the signal-processing literature, is the least-squares instance
$$\min_{x\in\mathbb{R}^n}\ \frac{1}{2}\|Ax - b\|_2^2 + \sum_{i=1}^n r_i(x_i), \qquad (2)$$
which approximates the well-known $\ell_2$-$\ell_0$ problem
$$\min_{x\in\mathbb{R}^n}\ \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_0, \qquad (3)$$
where $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, and the $\ell_0$ norm counts the nonzero entries of $x$. Problems (1)-(3) arise in many applications, such as compressive sensing, high-dimensional variable selection, data fitting and image reconstruction ([5], [28], [40]).
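For concreteness, here is a minimal Python sketch of the MCP penalty and the objective of problem (2), assuming the standard parameterization of [43] written out above; the function names are ours.

```python
import numpy as np

def mcp_penalty(x, lam, theta):
    """Elementwise MCP: r(t) = lam*|t| - t^2/(2*theta) for |t| <= lam*theta,
    and the constant lam^2*theta/2 beyond (the unbiased, flat region)."""
    t = np.abs(x)
    inner = lam * t - t**2 / (2.0 * theta)   # concave region
    outer = 0.5 * lam**2 * theta             # flat region
    return np.where(t <= lam * theta, inner, outer)

def mcp_objective(x, A, b, lam, theta):
    """phi(x) = 0.5*||Ax - b||^2 + sum_i r(x_i), i.e., problem (2)."""
    residual = A @ x - b
    return 0.5 * residual @ residual + mcp_penalty(x, lam, theta).sum()
```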

A. PREVIOUS WORKS
Problem (3) is known to be NP-hard ([12], [30]) since it involves the $\ell_0$ norm and hence is a nonconvex, discontinuous optimization problem. Several classes of approaches have been proposed to solve problem (3). Convex relaxation is the first class of methods, in which the $\ell_0$ norm is replaced by the convex $\ell_1$ norm. Among the most popular methods in this class are the iterative shrinkage-thresholding algorithms (ISTA) and their variants ([2], [13], [25], [26], [40], [41]), where each iteration requires a matrix-vector multiplication involving $A$ and $A^T$ followed by a shrinkage/soft-thresholding step. Other algorithms for $\ell_1$ minimization include the alternating direction method of multipliers SALSA [1], the coordinate-wise descent method [39], the projected gradient method [17], a reduced-space algorithm [10], and active-set methods ([11], [26], [33]). We refer to the papers ([10], [26]) for recent advances in this area.
Another class of methods for the $\ell_0$ optimization problem are greedy methods such as matching pursuit [29], orthogonal matching pursuit (OMP) [32], CoSaMP [31] and single best replacement [35]. These greedy methods update the support of $x$ by exploiting the residual and then solve a least-squares problem on the support at each iteration, as illustrated in the sketch below. These types of algorithms are widely used in practice due to their relatively fast computational speed. Under a sufficient incoherence assumption, Tropp [38] proved that OMP recovers the sparsest representation of the input signal. We refer to the survey of greedy algorithms [37] for recent advances in this area.
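The following Python sketch shows the basic OMP loop just described (grow the support from residual correlations, refit by least squares); it is a textbook version, not the exact implementation of [32].

```python
import numpy as np

def omp(A, b, sparsity):
    """Orthogonal matching pursuit: greedily grow the support using
    residual correlations, then refit by least squares on the support."""
    m, n = A.shape
    support, residual = [], b.copy()
    coef = np.zeros(0)
    for _ in range(sparsity):
        # pick the column most correlated with the current residual
        i = int(np.argmax(np.abs(A.T @ residual)))
        if i not in support:
            support.append(i)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    x = np.zeros(n)
    x[support] = coef
    return x
```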
The convexity of the $\ell_1$ penalty allows the design of fast, globally convergent algorithms. However, it has a drawback: more restrictive conditions on the matrix $A$ are required to establish the equivalence to, and uniqueness of, the global solution of model (3). In addition, it tends to produce biased estimates for larger coefficients and lacks the oracle property [13], [14]. To overcome the drawbacks of the $\ell_1$ penalty, many continuous nonconvex approximations of the $\ell_0$ norm have been proposed. Among the variety of $\ell_0$-like continuous penalties, one can find the nonnegative garrote [6], the log-sum penalty [7] and the capped-$\ell_1$ penalty [42]. Fan and Li [15] defined necessary conditions for a 'good' penalty function (unbiasedness, continuity in data and sparsity) and proposed the smoothly clipped absolute deviation (SCAD) penalty. Fan and Li [16] discussed a variable splitting and penalty decomposition minimization technique for (2), along with other approximations of the $\ell_0$ norm. Lv and Fan [27] used concave functions on $\mathbb{R}_+$ to approximate the $\ell_0$ norm and proved a nonasymptotic nearly oracle property of the resulting estimator. Zhang [43] introduced the notion of sparse convexity to compare penalties and proposed the minimax concave penalty (MCP). Foucart and Lai [19] investigated the use of the $\ell_p$ norm, and an interesting work on the equivalence between minimal $\ell_0$ and $\ell_p$ ($0 < p < 1$) solutions of linear systems of equations (equalities and inequalities) was given in [20]. A general iterative shrinkage-thresholding algorithm was proposed in [21] and illustrated on the SCAD, MCP and capped-$\ell_1$ penalties; the authors also proved that the algorithm converges to a critical point. Fornasier and Ward [18] proposed an iterative thresholding algorithm for minimizing (2) in which the $\ell_0$ norm is replaced by a reasonable sparsity-promoting relaxation, and convergence to a local minimizer is established. In [44], Zhang et al. proved that the transformed $\ell_1$ penalty can recover the exact solution of the constrained $\ell_0$ problem under the null space property (NSP) and the restricted isometry property (RIP). They also presented two difference-of-convex algorithms for the transformed $\ell_1$ optimization which guarantee convergence to a stationary point. Recently, Jiao et al. [24] developed a unified primal-dual active set algorithm for a class of nonconvex sparsity penalties which includes $\ell_0$, $\ell_p$, SCAD, capped-$\ell_1$ and MCP. Under the restricted isometry property, they proved that the proposed algorithms converge to the underlying regression target. We refer to the papers ([3], [4], [9], [24], [28], [34], [36], [45]) for recent advances in this area.

B. OUR WORK
In this article, we propose a second-order algorithm to solve problem (1). The main contributions of our paper can be summarized as follows.
1) We provide two optimality properties of problem (1). One shows that the support set of a local minimizer corresponds to linearly independent columns of $A$; the other provides two sufficient conditions for a stationary point to be a local minimizer.
2) We introduce an active set identification technique. An attractive advantage of this technique is that it can accurately identify the zero components in a neighbourhood of a stationary point of problem (1).
3) By means of the identification technique, we propose a second-order algorithm for solving (1). To accelerate the convergence of the algorithm, the nonmonotone line search [22] is used. We prove that every accumulation point of the sequence generated by the new algorithm is a stationary point of problem (1).
4) We carry out comprehensive numerical experiments to show the efficiency of the proposed algorithm. Numerical results demonstrate that the proposed algorithm is competitive with several known methods.
The remainder of the paper is organized as follows. Two optimality properties related to (1) are given in Section II. The active set estimate is presented in Section III. In Section IV, we propose the algorithm and establish its global convergence. We report numerical results in Section V and draw conclusions in Section VI.

II. OPTIMAL PROPERTIES
In this section, we provide two optimality properties of problem (1). We begin with some notation. A point $\bar x \in \mathbb{R}^n$ is a stationary point of problem (1) if it satisfies the system
$$\begin{cases} g_i(\bar x) + \lambda\,\mathrm{sgn}(\bar x_i) - \bar x_i/\theta = 0, & 0 < |\bar x_i| < \lambda\theta,\\ g_i(\bar x) = 0, & |\bar x_i| \ge \lambda\theta,\\ |g_i(\bar x)| \le \lambda, & \bar x_i = 0, \end{cases} \qquad (4)$$
where $g_i(x)$ is the $i$th component of the gradient of $f(x)$ and $\mathrm{sgn}(t): \mathbb{R} \to \mathbb{R}$ is the sign function, equal to $1$ if $t > 0$, $0$ if $t = 0$, and $-1$ if $t < 0$. We use the set $T(\bar x)$ to denote the support of $\bar x$ and the set $G(\bar x)$ to denote the indices of the zero components of $\bar x$; namely, $G(\bar x) = \{i : \bar x_i = 0\}$ and $T(\bar x) = \{i : \bar x_i \neq 0\}$. The set $G(\bar x)$ can be subdivided into the two sets
$$G_+(\bar x) = \{i \in G(\bar x) : |g_i(\bar x)| < \lambda\}, \qquad G_-(\bar x) = \{i \in G(\bar x) : |g_i(\bar x)| = \lambda\}.$$
The support $T(\bar x)$ can be subdivided into the three sets
$$T_1(\bar x) = \{i : 0 < |\bar x_i| < \lambda\theta\}, \quad T_2(\bar x) = \{i : |\bar x_i| = \lambda\theta\}, \quad T_3(\bar x) = \{i : |\bar x_i| > \lambda\theta\}.$$
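A small Python sketch of the componentwise test behind (4) may help fix notation; it assumes the standard MCP derivative $r'(t) = \lambda\,\mathrm{sgn}(t) - t/\theta$ on $0 < |t| < \lambda\theta$ and is a reconstruction, not the paper's exact display.

```python
import numpy as np

def is_stationary(x, g, lam, theta, tol=1e-8):
    """Check the first-order system (4) componentwise."""
    for xi, gi in zip(x, g):
        if xi == 0.0:
            ok = abs(gi) <= lam + tol                          # i in G(x)
        elif abs(xi) < lam * theta:
            ok = abs(gi + lam * np.sign(xi) - xi / theta) <= tol
        else:
            ok = abs(gi) <= tol                                # |x_i| >= lam*theta
        if not ok:
            return False
    return True
```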
The following property shows that if $\bar x$ is a local minimizer of problem (2), then the columns of $A_{T(\bar x)}$ are linearly independent; in particular, the sparsity of a local minimizer is at most $\mathrm{rank}(A)$.
Theorem 1: Suppose that $\bar x$ is a local minimizer of problem (2). Then $A_{T(\bar x)}$ has full column rank, i.e., the columns of $A_{T(\bar x)}$ are linearly independent.
Proof: Since $\bar x$ is a local minimizer of problem (2), there exists a neighbourhood $N(\bar x, r)$ of $\bar x$ such that $\varphi(\bar x) \le \varphi(x)$ for all $x \in N(\bar x, r)$. Assume that the columns of $A_{T(\bar x)}$ are not linearly independent. Then there exists a vector $0 \neq p = (p_{T(\bar x)}, 0)$ such that $Ap = 0$.

Consequently, we have $f(\bar x + p) = f(\bar x - p) = f(\bar x)$.
Further, we can scale $p$ so that (5) holds, i.e., $|p_i| < |\bar x_i|$ for every $i \in T(\bar x)$. Let $\xi^1 = \bar x + p$ and $\xi^2 = \bar x - p$. By (5) and the definition of $p$, we have $\mathrm{sgn}(\bar x_i) = \mathrm{sgn}(\xi^1_i) = \mathrm{sgn}(\xi^2_i)$ for every index $i$, which means that $\xi^1$ and $\xi^2$ share the support of $\bar x$. Since $r_i(t)$ is strictly concave for $t > 0$, we have $r_i(\xi^1_i) + r_i(\xi^2_i) < 2 r_i(\bar x_i)$ for $i \in T(\bar x)$. Thus we have $\min\{\varphi(\xi^1), \varphi(\xi^2)\} \le \tfrac12\bigl(\varphi(\xi^1) + \varphi(\xi^2)\bigr) < \varphi(\bar x)$, which contradicts the assumption that $\bar x$ is a local minimizer of problem (2). The proof is completed.

Chen et al. [8] gave an affine-scaled second-order sufficient condition for local minimizers of nonconvex penalties, including the MCP penalty; however, the condition is difficult to verify. Jiao et al. [24] gave a second-order sufficient condition for a stationary point of (2) to be a local minimizer. Below we provide two sufficient conditions for a stationary point of (1) to be a local minimizer. Condition (6) is an extension of the result of [24].
Theorem 2: Suppose that $f$ is twice continuously differentiable and $\bar x$ is a stationary point of problem (1). If
$$\bar\lambda_{\min} > \frac{1}{\theta} \qquad (6)$$
or $\lambda_{\min} > \frac{1}{\theta}$, where $\bar\lambda_{\min}$ denotes the minimum eigenvalue of the matrix $\nabla^2 f(\bar x)_{T(\bar x), T(\bar x)}$ and $\lambda_{\min}$ denotes the minimum eigenvalue of the matrix $\nabla^2 f(\bar x)$, then $\bar x$ is a local minimizer of problem (1).
Proof: Suppose condition (6) holds. If $T(\bar x) = \emptyset$, the conclusion clearly holds. Suppose then that $T(\bar x) \neq \emptyset$. We shall prove that $\varphi(\bar x + d) \ge \varphi(\bar x)$ for any sufficiently small $0 \neq d \in \mathbb{R}^n$, considering two cases. Note the expansion of $\varphi(\bar x + d) - \varphi(\bar x)$, in which the fifth equality uses (4) and the fourth equality uses (7). If $\lambda_{\min} > \frac{1}{\theta}$, then by (7) and (8) we have $\varphi(\bar x + d) > \varphi(\bar x)$ for any sufficiently small $d \neq 0$. The proof is completed.
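To make the role of the threshold $1/\theta$ concrete, the following LaTeX sketch records one way to argue the case $\lambda_{\min} > 1/\theta$; it assumes the standard MCP form $r_i(t) = \lambda|t| - t^2/(2\theta)$ on $|t| \le \lambda\theta$ and is a reconstruction of the idea, not the paper's exact display.

```latex
% Key fact: h(t) := r_i(t) + t^2/(2\theta) is convex: it equals \lambda|t|
% for |t| \le \lambda\theta and \lambda^2\theta/2 + t^2/(2\theta) beyond,
% with matching values and slopes at |t| = \lambda\theta.
\begin{aligned}
\varphi(\bar x+d)-\varphi(\bar x)
&= f(\bar x+d)-f(\bar x)
  +\sum_i\bigl[h(\bar x_i+d_i)-h(\bar x_i)\bigr]
  -\frac{1}{\theta}\sum_i \bar x_i d_i-\frac{\|d\|^2}{2\theta}\\
&\ge \sum_i\Bigl(g_i(\bar x)+s_i-\frac{\bar x_i}{\theta}\Bigr)d_i
  +\frac{1}{2}\Bigl(\lambda_{\min}-\frac{1}{\theta}-\varepsilon\Bigr)\|d\|^2,
\qquad s_i\in\partial h(\bar x_i),
\end{aligned}
```

where the Taylor bound on $f$ holds near $\bar x$ for any small $\varepsilon > 0$. The stationarity system (4) allows one to pick $s_i$ with $g_i(\bar x) + s_i - \bar x_i/\theta = 0$ for every $i$ (take $s_i = -g_i(\bar x) \in [-\lambda, \lambda]$ when $\bar x_i = 0$), so the linear term vanishes and $\varphi(\bar x + d) > \varphi(\bar x)$ for all sufficiently small $d \neq 0$.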

III. ACTIVE SET ESTIMATE
In this section, we propose the active set estimate and give some of its properties. We use the sets $Z(x)$ and $S(x)$ to approximate $G(\bar x)$ and $T(\bar x)$, respectively. Furthermore, we partition $Z(x)$ into two parts, $Z_1(x)$ and $Z_2(x)$. The theorem below shows that when the point $x$ is sufficiently close to a stationary point $\bar x$, the estimated set actually coincides with the set $G(\bar x)$.
Theorem 3: Assume that $\bar x$ is a stationary point of problem (1) and $g(x)$ is continuous on $\mathbb{R}^n$. Then the following statements hold.
(1) There exists a neighbourhood $N(\bar x)$ of $\bar x$ such that the identification holds. In particular, if $G_-(\bar x) = \emptyset$, then we have the stronger identity. (2) There exists a neighbourhood $N(\bar x)$ of $\bar x$ such that the corresponding identification holds. (3) There exists a neighbourhood $N(\bar x)$ of $\bar x$ such that the corresponding identification holds.
Proof: If $G_-(\bar x) = \emptyset$ holds at $\bar x$, then the second statement in (1) follows from the first; thus, it suffices to prove the first statement of (1). For each $i \in G_+(\bar x)$, we have $\bar x_i = 0$ and, by the continuity of $g(x)$, the defining inequality continues to hold near $\bar x$. We claim that it must hold that $x_i = 0$; in fact, by (4) we would otherwise obtain a contradiction with (9). This gives statement (1).

For each $i \in T_2(\bar x)$, by the continuity of $g(x)$, the required inclusion holds. On the other hand, for each $i \in S(x)$, the claim follows from (4). This gives statement (2).
By the continuity of $g(x)$, we have $i \in Z_2(x)$ as $x \to \bar x$. On the other hand, for each $i \in Z_2(x)$ with $x$ near $\bar x$, the continuity of $g(x)$ yields the reverse inclusion. This gives statement (3). The proof is completed.
To obtain a better descent direction in the new algorithm, we further partition $Z_2(x)$ into $Z_{21}(x)$ and $Z_{22}(x)$, and then partition each of $Z_{21}(x)$ and $Z_{22}(x)$ into two parts. The following theorem shows that $Z_{21f}(x) \cup Z_{22f}(x)$ is a subset of the set $G_-(\bar x)$ when $x$ is sufficiently close to $\bar x$.
Theorem 4: Assume that $\bar x$ is a stationary point of problem (1) and $g(x)$ is continuous on $\mathbb{R}^n$. Then there exists a neighbourhood $N(\bar x)$ of $\bar x$ such that, for all $x \in N(\bar x)$, the statements (10)-(12) hold.
Proof: It is easy to see that (10) and (11) give (12). In the following, we prove only (10), since (11) follows by a similar analysis. For each $i \in Z_{21f}(x)$, the continuity of $g(x)$ yields the corresponding limit relation. Then we must have $\bar x_i = 0$ and $g_i(\bar x) = \lambda$; otherwise we obtain a contradiction with (13). The proof is completed.
In a similar way to the proof of (12), we can prove the following conclusion.
Theorem 5: Assume that $\bar x$ is a stationary point of problem (1). Then we have $Z_2(\bar x) = \emptyset$.

IV. THE NEW ALGORITHM
In this section, using the active set identification technique of Section III, we propose a second-order algorithm for solving problem (1). The following assumption is made on the objective function.
Assumption 1: In some neighbourhood $N$, $f$ is continuously differentiable and its gradient is Lipschitz continuous, i.e., there exists a constant $L > 0$ such that $\|g(x) - g(y)\| \le L\|x - y\|$ for all $x, y \in N$.

A. DEFINITION AND SOME PROPERTIES OF THE SEARCH DIRECTION
We now describe how to compute the search direction. Let $x^k$ be the $k$-th iterate. For simplicity, we write $Z_1^k = Z_1(x^k)$, $I_1^k = I_1(x^k)$, $I_2^k = I_2(x^k)$ and $I^k = I_1^k \cup I_2^k$. We obtain $d^k_{I^k}$ by solving the linear system (17) with the CG algorithm of [23]. An attractive property of this CG algorithm is that it seeks a good approximation of the full Newton direction even when $\nabla^2 f_{I^k, I^k}(x^k)$ is not positive definite. Theorem 2.2(c) of [23] ensures that there exist positive constants $c_1$ and $c_2$ such that the direction $d^k_{I^k}$ produced by the CG algorithm satisfies conditions (18) and (19), from which (20) follows; for each $i \in I_2^k$, the corresponding bound (21) holds. From Theorem 3, we know that $Z_1^k$ approximates the set $G(\bar x)$ as $x^k \to \bar x$; it is therefore a natural choice to fix the variables with indices in $Z_1^k$ to zero. Accordingly, we set $d^k_i = -x^k_i$ for all $i \in Z_1^k$, so that $x^{k+1}_i = 0$ for all $i \in Z_1^k$ whenever the unit step length is accepted (see Algorithm 1).
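The exact system (17) and conditions (18)-(21) did not survive in display form, so the following Python sketch only illustrates the structure of the direction described above: fix the $Z_1^k$ components to $-x_i^k$ and apply a truncated CG (in the spirit of [23], with a negative-curvature exit) on the remaining working set. The reduced right-hand side `-g[I]` is an assumption, a stand-in for the paper's system (17).

```python
import numpy as np

def subspace_direction(x, g, H, I, Z1, max_cg=10, tol=1e-1):
    """Two-part direction: d_i = -x_i on the estimated zero set Z1;
    truncated CG on the reduced Newton system over the working set I."""
    d = np.zeros(x.size)
    d[Z1] = -x[Z1]                           # fix estimated zeros to zero
    HII, rhs = H[np.ix_(I, I)], -g[I]
    dI = np.zeros(len(I)); r = rhs.copy(); p = r.copy()
    for _ in range(max_cg):
        Hp = HII @ p
        curv = p @ Hp
        if curv <= 1e-12 * (p @ p):          # negative/zero curvature: stop
            if not dI.any():
                dI = rhs                     # fall back to steepest descent
            break
        alpha = (r @ r) / curv
        dI += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) <= max(tol * np.linalg.norm(rhs), 1e-12):
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    d[I] = dI
    return d
```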
The theorem below establishes that $d^k = 0$ if and only if $x^k$ is a stationary point of problem (1).
Theorem 6: Assume that $d^k$ is determined by (16)-(19). Then $d^k = 0$ if and only if $x^k$ is a stationary point of problem (1).
Proof: Suppose $d^k = 0$. By (20), we get $|x^k_i| > \lambda\theta$ and $g_i(x^k) = 0$ for $i \in S(x^k)$. By (21) and the assumption $d^k = 0$, we note that $Z_2^k = \emptyset$. For each $i \in Z_1^k$, by the definition of $d^k_i$ and the assumption $d^k = 0$, we get $x^k_i = 0$. Hence $x^k$ satisfies (4).
Conversely, suppose that $x^k$ is a stationary point of problem (1). By Theorem 5, we get $Z_2^k = \emptyset$. For each $i \in S(x^k)$, by (4) we get $g_i(x^k) = 0$, and by (20) we get $d^k_i = 0$; otherwise we would obtain a contradiction with the definition of $Z_1(x^k)$. For each $i \in Z_1^k$, by (4) we get $x^k_i = 0$, hence $d^k_i = -x^k_i = 0$. The proof is completed.
The following theorem, together with Theorem 6, demonstrates that $d^k$ is a descent direction of $\varphi$ at $x^k$ whenever $x^k$ is not a stationary point of problem (1). Consequently, it guarantees that Algorithm 1 is well defined.
[Algorithm 1 appears here in the original layout; only a fragment survives: Step 1. If $x^k$ satisfies the stopping condition, then stop. Otherwise, go to Step 2.]
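Because the Algorithm 1 box survives only as a fragment, here is a hedged Python sketch of one iteration built from the pieces the text does describe: the two-part direction and a GLL-type nonmonotone backtracking test with window $M$. The model-decrease term `Delta = -||d||^2` is a hypothetical stand-in for the right-hand side of (22); parameter values $M = 10$, $\delta = 10^{-2}$, $\eta = 0.6$ follow Subsection V-A.

```python
import numpy as np

def nonmonotone_step(phi, x, d, recent_phis, delta=1e-2, eta=0.6):
    """GLL-type nonmonotone backtracking: accept the largest beta = eta^j
    with phi(x + beta*d) <= max(recent phi values) + delta*beta*Delta."""
    phi_max = max(recent_phis)       # window of the last min(k, M)+1 values
    Delta = -float(d @ d)            # hypothetical stand-in for (22)'s term
    beta = 1.0
    while phi(x + beta * d) > phi_max + delta * beta * Delta and beta > 1e-16:
        beta *= eta
    return x + beta * d, beta
```

A driver would keep a deque of the last $M + 1$ values of $\varphi$ and combine this step with `subspace_direction` above.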

Theorem 7: Suppose that $x^k$ is not a stationary point of problem (1) and $d^k$ is determined by (16)-(19). Then $d^k$ is a descent direction of $\varphi$ at $x^k$.
Proof: By the convexity of $\|\cdot\|_1$, the averaging inequality (23) holds for any $\beta \in [0, 1]$. For each $i \in I_2^k$, the required bound follows from (21). For each $i \in I_1^k$, without loss of generality, we assume $0 \le x^k_i \le \lambda\theta$; by the definition of $Z_1^k$, we obtain the corresponding estimate.

By the definition of $d^k$, we obtain the desired bound, where the first inequality uses (23), the second inequality uses (20), and the last equality uses (24) and (25). Together with Theorem 6, this gives the statement. The proof is completed.

B. THE ALGORITHM AND ITS GLOBAL CONVERGENCE
We now state the new algorithm for solving (1). The nonmonotone line search condition (22) was introduced by Grippo, Lampariello and Lucidi (see [22]); its classical form is recalled in the sketch below. It has been successfully incorporated into a variety of optimization algorithms.
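For reference, the classical Grippo-Lampariello-Lucidi acceptance test, of which (22) is an instance, reads as follows; the precise decrease term $\Delta_k$ used in (22) did not survive extraction, so it is left abstract here.

```latex
\varphi\bigl(x^k+\beta_k d^k\bigr)\;\le\;
\max_{0\le j\le \min\{k,\,M\}}\varphi\bigl(x^{k-j}\bigr)
\;+\;\delta\,\beta_k\,\Delta_k,
\qquad \Delta_k<0,\quad \delta\in(0,1).
```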
We need the following two lemmas to establish the convergence of Algorithm 1; using a proof similar to the one in [22], we obtain Lemma 1.
Lemma 1: Assume that $f$ satisfies Assumption 1 and $\{x^k\}$ is generated by Algorithm 1. Then the relation (26) holds.
The following lemma shows that if $x^k \to x^*$ and $d^k \to 0$, then $x^*$ is a stationary point of problem (1).

Lemma 2:
Assume that $x^k \to x^*$ and $d^k \to 0$, where $d^k$ is determined by (16)-(19). Then $x^*$ is a stationary point of problem (1).
Proof: Since the number of distinct sets $Z_1^k$, $I_1^k$, $I_2^k$, $I^k$ is finite, there exists a subsequence (which, without loss of generality, we again label $\{x^k\}$) along which these index sets are constant; hence we can write $Z_1^k = Z_1$, $I_1^k = I_1$, $I_2^k = I_2$ and $I^k = I$.
For each $i \in Z_1$, by the definition of $Z_1$, the continuity of $g(x)$ and $d^k_i = -x^k_i \to 0$, we get $x^*_i = 0$ and $|g_i(x^*)| \le \lambda$. For each $i \in I_1$, by the definition of $I_1$ and $d^k_i \to 0$, we have $x^*_i = 0$ and $|g_i(x^*)| = \lambda$. For each $i \in I_2$, by (20) we get $|x^*_i| \ge \lambda\theta$ and $g_i(x^*) = 0$. Thus, $x^*$ is a stationary point of problem (1). The proof is completed.
By Lemma 2, we now prove that every accumulation point of $\{x^k\}$ is a stationary point of problem (1). Since the number of distinct sets $Z_1^k$, $I^k$, $I_1^k$ and $I_2^k$ is finite, there exists a subsequence of $\{x^k\}_{k \in K}$ (which, without loss of generality, we again label $\{x^k\}_{k \in K}$) along which the index sets are constant; hence we can write $Z_1^k = Z_1$, $I_1^k = I_1$, $I_2^k = I_2$ and $I^k = I$ for any $k \in K$. We consider two cases. Case (1): $\lim_{k \in K} \beta_k \neq 0$. By (26), we obtain $\lim_{k \in K} d^k = 0$.
By Lemma 2, we get the conclusion. Case (2): $\lim_{k \in K} \beta_k = 0$. By the line search condition (22), the failed-step inequality holds for all $k \ge 1$. By the mean-value theorem, for sufficiently large $k \in K$ we get the corresponding estimate, where $\theta_k \in (0, 1)$; the first inequality uses (15) and the convexity of $\|\cdot\|_1$, and the last equality uses (24) and (25). Substituting (27) into the last inequality gives a bound involving $-c_1\|d^k\|^2$. By the boundedness of $\{d^k\}_{k \in K}$, the last inequality and (26), we get $\lim_{k \in K} d^k = 0$.
By Lemma 2, we get the conclusion. The proof is completed.

V. NUMERICAL EXPERIMENTS
In this section, we present computational experiments testing the performance of the proposed algorithm on the compressed sensing problem (2). We compare the proposed algorithm with three existing solvers: GIST [21], PDASC [24] and MSCR [42]. The MATLAB code (package Unified-PDASC) is available at http://www0.cs.ucl.ac.uk/staff/b.jin/software/updasc.zip; we note that the package Unified-PDASC contains the codes GIST [21] and MSCR [42]. The ''MCP'' variants of the three codes are run with default parameters. All codes are written in MATLAB R2013a, and all tests described in this section were performed on a PC with an Intel i5-3230 2.6 GHz CPU and 16 GB of RAM running Windows. In Subsection V-A, parameter settings and data generation are presented. In Subsection V-B, the effect of different concavity parameters θ on the proposed algorithm is examined. We compare our method with GIST [21], PDASC [24] and MSCR [42] by solving medium-scale instances of problem (2) in Subsection V-C and large-scale instances in Subsection V-D. The results in Subsections V-C and V-D show that our method outperforms GIST [21], PDASC [24] and MSCR [42] both in solution quality and in running time for almost all tested problems.

A. PARAMETERS SETTING AND DATA GENERATION
We now briefly describe the implementation details of Algorithm 1; for convenience, we abbreviate Algorithm 1 as AMCP. We implemented AMCP with the parameters $M = 10$, $\delta = 10^{-2}$ and $\eta = 0.6$. For all tests, the computation is initialized with the zero vector. To obtain a good initial guess, we embed AMCP in a continuation procedure: instead of solving problem (1) directly from scratch, we solve a sequence of problems with parameters $\lambda_0 > \lambda_1 > \dots > \lambda_N$ and use the solution (or an approximate solution) $x^*_{\lambda_{k-1}}$ as the initial estimate of the solution of the next problem. We set $\lambda_0 = 0.5\|A^T b\|_\infty$ and $\lambda_N = \lambda = 10^{-10}\lambda_0$. The interval $[\lambda_N, \lambda_0]$ is divided into $N$ equally distributed subintervals on a logarithmic scale, and the sequence $\{\lambda_k\}$ is chosen as the endpoints of the subintervals. We set $N = 30$ in all tested algorithms. In our implementation, at the end of iteration $k$, the next parameter $\lambda_k$ is set to a value smaller than $\lambda_{k-1}$ if the point $x^k$ satisfies a decrease condition with tolerance 0.01. We implemented the CG algorithm of [23] to solve (17) with parameters $c = \sigma = 10^{-8}$. In many applications, it is unnecessary and wasteful to compute an exact solution of (17) at each iteration when $|I^k|$ is large. Thus we use only an approximate solution of system (17) and also terminate the CG algorithm if one of the following holds: (1) $\|r^j\| \le \max\{10^{-1}\|r^0\|, 10^{-12}\}$; (2) $\|d^j\| \ge \Delta := \max\{0.05, \min\{10^3, 10\|x^k - x^{k-1}\|\}\}$; (3) the number of CG iterations exceeds 10. Here $d^j$ denotes the direction and $r^j$ the residual at the $j$th CG iteration. Theorem 2.2(c) of [23] ensures that there exist positive constants $c_1$ and $c_2$ such that the direction $d^k_{I^k}$ produced by the CG algorithm satisfies conditions (18)-(20).
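The continuation loop above is easy to express directly; the sketch below assumes a hypothetical single-$\lambda$ solver `solve_one(x0, lam)` standing in for one AMCP run.

```python
import numpy as np

def continuation_path(A, b, solve_one, N=30):
    """Warm-started continuation in lambda, as described above: start at
    lambda_0 = 0.5*||A^T b||_inf, end at 1e-10*lambda_0, N log-spaced steps."""
    lam0 = 0.5 * np.linalg.norm(A.T @ b, np.inf)
    lams = np.logspace(np.log10(lam0), np.log10(1e-10 * lam0), N + 1)
    x = np.zeros(A.shape[1])
    for lam in lams:
        x = solve_one(x, lam)    # previous solution warm-starts the next
    return x
```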
In the following subsections, we demonstrate the viability of our approach on problems from compressed sensing. Namely, we consider problem (2), where the goal is to reconstruct a length-$n$ sparse signal from $m$ observations, with $m < n$. Given the dimension $n$ of the signal and the number $m$ of observations, we generated a random matrix $A$ of one of the following types: (1) DCT matrix;
(2) Gaussian matrix with orthogonalized rows; (3) sparse normally distributed random matrix. Types (1) and (2) of the matrix $A$ were generated by the code [A, b, xs, xsn, picks] = getData(m, n, k, Ameth, xmeth, varargin) of [40]. For each test problem, we stop all tested methods once the condition $\|Ax^*_{\lambda_k} - b\| \le \epsilon$ holds, where $\epsilon = \|e\|$ and $e$ is the noise in the data $b$.
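For concreteness, the following sketch generates type (2) data in the way described above (a Gaussian matrix with orthonormalized rows, a $T$-sparse sign signal, noisy observations); it mirrors the description, not the getData code of [40].

```python
import numpy as np

def gen_problem(m, n, T, noise_std=0.1, rng=np.random.default_rng(0)):
    """Type (2) test data: orthonormal-row Gaussian A, T-sparse sign x*."""
    G = rng.standard_normal((n, m))
    Q, _ = np.linalg.qr(G)             # Q: n x m with orthonormal columns
    A = Q.T                            # rows of A are orthonormal
    x_true = np.zeros(n)
    support = rng.choice(n, size=T, replace=False)
    x_true[support] = np.sign(rng.standard_normal(T))
    e = noise_std * rng.standard_normal(m)
    b = A @ x_true + e                 # noisy observations
    return A, b, x_true
```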

B. EFFECT OF THE CONCAVE PARAMETER θ
In this experiment, we examine the sensitivity of the algorithm with respect to the concavity parameter θ. The measurement matrix $A$ is a DCT matrix. The observation $b$ is generated by $b = Ax^* + e$, where $e$ is drawn from a Gaussian distribution with zero mean and variance $10^{-2}$. To generate the true signal $x^*$, we first generated the support $A^* = \{i : x^*_i \neq 0\}$ by randomly selecting $T$ indices between 1 and $n$, and then assigned the sign of a normally distributed random variable to $x^*_i$ for each $i \in A^*$. The performance is evaluated in terms of the CPU time in seconds (time) and the error (error $= \|x^k - x^*\|_\infty$), each averaged over 10 independent realizations of the problem. Table 1 lists the average CPU time and average error for each problem. From Table 1, we see that the CPU time is fairly robust with respect to the concavity parameter θ. Further, the average error varies little with θ, indicating the robustness of AMCP. In all the following tests, we set θ = 2.7.

C. RECONSTRUCTION FROM NOISY OBSERVATIONS FOR MEDIUM-SCALE PROBLEMS
In this experiment, the measurement matrix $A$ is an orthogonalized Gaussian matrix whose rows are orthogonalized via a QR factorization. The true signal $x^*$ is a random $T$-sparse vector with support $A^*$ and dynamic range $R$; the observation is $b = Ax^* + e$, where $e$ is drawn from a Gaussian distribution with zero mean and variance $10^{-2}$. We consider a range of sparsity levels: the number $T$ of nonzero spikes in $x^*$ ranges from 1 to 100. We set $n = 10^4$ and $m = \mathrm{round}(0.2n)$. In the following figures, ''CPU time'' denotes the CPU time in seconds and ''error'' denotes $\|x^k - x^*\|_\infty$; both are averaged over 10 independent realizations of the problem. To gain more insight into the numerical results, we use the performance profiles of Dolan and Moré [14] to evaluate ''CPU time'' and ''error'' in Figures 1-4. From Figures 1-4, we observe that AMCP is faster than UPDASC, and UPDASC is faster than GIST and MSCR. Moreover, for most problems AMCP obtains errors similar to those of UPDASC, and the errors obtained by AMCP and UPDASC are better than those obtained by GIST and MSCR.

D. RECONSTRUCTION FROM NOISY OBSERVATIONS FOR LARGE-SCALE PROBLEMS
In this experiment, we use the MATLAB command sprandn(m, n, density) to generate the random matrices $A$, and we rescale each column of $A$ to unit norm. The true signal $x^*$ and the observation $b$ were generated in the same way as in the second experiment. We set ''density'' $= 0.01$ for $n = 2^{16}, 2^{17}, 2^{18}$ and ''density'' $= 0.001$ for $n = 2^{19}$. Table 2 lists the CPU time (time) and the error (error $= \|x^k - x^*\|_\infty$). From Table 2, we see that AMCP, PDASC and GIST obtain similar errors when $n \le 2^{17}$, and AMCP and PDASC obtain similar errors for all tested problems. In summary, AMCP achieves the best CPU time and the best error among all tested algorithms for most problems.
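A Python analogue of the sprandn construction with unit-norm columns might look as follows (a sketch; the experiments above used MATLAB's sprandn directly).

```python
import numpy as np
import scipy.sparse as sp

def sparse_measurement_matrix(m, n, density, rng=np.random.default_rng(0)):
    """Analogue of MATLAB's sprandn(m, n, density), columns rescaled to
    unit norm as in the large-scale tests above."""
    A = sp.random(m, n, density=density, format="csc",
                  data_rvs=rng.standard_normal)
    norms = np.sqrt(A.multiply(A).sum(axis=0)).A.ravel()
    norms[norms == 0] = 1.0            # guard against empty columns
    return A @ sp.diags(1.0 / norms)   # rescale each column to unit norm
```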

VI. CONCLUSION
In this article, an active set second-order algorithm for MCP-regularized optimization has been proposed. The active sets are estimated by an identification technique that can accurately identify the zero components in a neighbourhood of a stationary point. The search direction consists of two parts: some of the components are simply defined, while the others are determined by a second-order method. A nonmonotone line search strategy that guarantees global convergence is used. Numerical comparisons with several state-of-the-art methods demonstrate the efficiency of the proposed method.