Bregman Proximal Gradient Algorithm with Extrapolation for a class of Nonconvex Nonsmooth Minimization Problems

In this paper, we consider an accelerated method for solving nonconvex and nonsmooth minimization problems. We propose a Bregman Proximal Gradient algorithm with extrapolation (BPGe). This algorithm extends and accelerates the Bregman Proximal Gradient algorithm (BPG), which circumvents the restrictive global Lipschitz gradient continuity assumption needed in Proximal Gradient algorithms (PG). The BPGe algorithm is more general than the recently introduced Proximal Gradient algorithm with extrapolation (PGe) and, thanks to the extrapolation step, converges faster than the BPG algorithm. Analyzing the convergence, we prove that any limit point of the sequence generated by BPGe is a stationary point of the problem, provided the parameters are chosen properly. Moreover, assuming the Kurdyka-{\L}ojasiewicz property, we prove that the whole sequence generated by BPGe converges to a stationary point. Finally, to illustrate the potential of the new method, we apply BPGe to two important practical problems that arise in many fundamental applications and that do not satisfy the global Lipschitz gradient continuity assumption: Poisson linear inverse problems and quadratic inverse problems. In the tests the accelerated BPGe algorithm shows faster convergence than BPG, making it an interesting new algorithm.


Introduction
In the last few years different numerical methods have been devised to solve large-scale minimization problems, but still Cauchy's gradient method is at the kernel of most of the schemes (for instance, see the recent books [1,2]), and it is usually assumed that the gradient of the objective function is globally Lipschitz continuous. This assumption is quite restrictive in some real applications, and therefore new families of methods have recently been designed in order to solve more general cases. Along this line, the remarkable paper of Bauschke, Bolte and Teboulle [3] introduced a new method based on the Bregman distance paradigm (the BPG algorithm), able to deal with non-globally Lipschitz continuous gradient problems in the convex case, and Bolte, Sabach, Teboulle and Vaisbourd [4] extended it to the nonconvex case.
On the other hand, much effort has been devoted to accelerating the proximal gradient algorithm in order to reduce the number of iterations. Several techniques have been introduced, like the fast iterative shrinkage-thresholding algorithm (FISTA) proposed in [5] and the use of Nesterov's extrapolation techniques [6,7]; quite recently, a version of the proximal gradient algorithm with extrapolation for some nonconvex nonsmooth minimization problems was introduced in [8] (but assuming that the gradient of the objective function is globally Lipschitz continuous).
The main goal of this paper is to introduce a scheme, and analyze its convergence, that combines the power of the method developed in [4], able to solve non-globally Lipschitz continuous gradient problems in the convex and nonconvex cases, with the extrapolation techniques of [8] that accelerate the method.
In this paper, we consider the following minimization problem:

(P)   min { Ψ(u) := f (u) + g(u) : u ∈ R d },

where f is a nonconvex continuously differentiable function and g is a proper lower-semicontinuous (l.s.c.) convex function. We assume that the optimal value of (P) is finite, that is, Ψ * := inf{Ψ(u) : u ∈ R d } > −∞. Problem (P) arises in many applications, including compressed sensing [9], signal recovery [10] and the phase retrieval problem [11]. One classical algorithm for solving this problem is the Proximal Gradient (PG) method [12]:

x k+1 = arg min u∈R d { g(u) + ⟨∇f (x k ), u − x k ⟩ + (1/(2λ k ))‖u − x k ‖ 2 },

where λ k is the stepsize at each iteration. The proximal gradient method and its variants [13,14,15,16,17,18] have long been a hot topic in the optimization field due to their simple form and low computational complexity.
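As a concrete illustration of the PG iteration, the sketch below solves a tiny LASSO instance, where the proximal map of g = θ‖·‖ 1 is the classical soft-thresholding operator. This is a minimal example of ours, not taken from the paper; the problem data and all names are hypothetical.

```python
import numpy as np

def prox_l1(v, t):
    # proximal map of t*||.||_1: componentwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_f, prox_g, x0, lam, iters=500):
    # PG iteration: x_{k+1} = prox_{lam*g}(x_k - lam*grad_f(x_k))
    x = x0.copy()
    for _ in range(iters):
        x = prox_g(x - lam * grad_f(x), lam)
    return x

# toy LASSO: f(x) = 0.5*||Ax - b||^2 (globally Lipschitz gradient), g = theta*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
b = A @ x_true
theta = 0.1
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
x_pg = proximal_gradient(lambda x: A.T @ (A @ x - b),
                         lambda v, t: prox_l1(v, theta * t),
                         np.zeros(5), 1.0 / L)
```

With the stepsize λ = 1/L the objective Ψ(x k ) = f (x k ) + g(x k ) decreases monotonically; it is this behaviour that the theory below generalizes to Bregman geometries.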
One branch of development of new PG methods was devoted to acceleration. Accelerated proximal algorithms [5,19] have been shown to be quite efficient on convex problems. They are also useful for solving nonconvex problems [8,20,21,22]. For solving the nonconvex problem (P), one simple and efficient strategy is to perform extrapolation, for each k ∈ N, in the following form:

y k = x k + β k (x k − x k−1 ),
x k+1 = arg min u∈R d { g(u) + ⟨∇f (y k ), u − y k ⟩ + (1/(2λ k ))‖u − y k ‖ 2 },

where λ k is the stepsize at each iteration and β k (x k − x k−1 ) is an extrapolation term. This iteration is called the Proximal Gradient algorithm with Extrapolation (PGe), which was shown in [8] to converge and perform quite well when the parameters β k are set properly. However, PGe has one restriction when solving problem (P): it requires the continuously differentiable part f to have a globally Lipschitz continuous gradient on R d . In fact, this requirement often cannot be satisfied for many practical problems, such as the quadratic inverse problem in phase retrieval [11] and Poisson linear inverse problems [23], that arise in many real world applications.
In this paper, we propose a new algorithm, the Bregman Proximal Gradient algorithm with Extrapolation (BPGe), to solve problem (P) without requiring global Lipschitz gradient continuity of f. For each k ∈ N, starting from x −1 = x 0 :

y k = x k + β k (x k − x k−1 ),
x k+1 ∈ arg min u∈R d { g(u) + ⟨∇f (y k ), u − y k ⟩ + (1/λ k ) D h (u, y k ) },

where D h is a Bregman distance defined in Section 2. On the basis of Bregman distance theory, we utilize the smooth adaptable condition introduced in [4], which generalizes the Lipschitz gradient continuity condition. This smooth adaptable condition was originally proposed to analyze the Bregman Proximal Gradient (BPG) algorithm in [4]. It can also be used to analyze the convergence of BPGe, since the BPGe algorithm extends BPG by performing extrapolation. In particular, we have that: (i) when D h (x, y) = ½‖x − y‖ 2 and β k = 0, BPGe reduces to PG; (ii) when D h (x, y) = ½‖x − y‖ 2 , BPGe reduces to PGe; (iii) when β k = 0 for all k ≥ 0, BPGe reduces to BPG (no extrapolation). Therefore, PG, PGe and BPG are particular cases of the BPGe algorithm.
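For intuition, when g ≡ 0 the Bregman step above has the closed form ∇h(x k+1 ) = ∇h(y k ) − λ k ∇f (y k ), a mirror-descent-type update. The sketch below is our own minimal illustration under that assumption (the function names and the constant extrapolation coefficient are ours); with the Euclidean kernel h(x) = ½‖x‖ 2 , for which ∇h is the identity, it reduces to PGe applied to a smooth quadratic, in line with reduction (ii).

```python
import numpy as np

def bpge(grad_f, grad_h, grad_h_inv, x0, lam, beta, iters=200):
    """Sketch of BPGe with g == 0: the Bregman proximal step solves
    grad_h(x_{k+1}) = grad_h(y_k) - lam * grad_f(y_k)."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)                      # extrapolation step
        x_prev, x = x, grad_h_inv(grad_h(y) - lam * grad_f(y))
    return x

# Euclidean kernel h(x) = 0.5*||x||^2: grad_h = grad_h_inv = identity
Q = np.diag([1.0, 10.0])                                 # f(x) = 0.5*x^T Q x
x_end = bpge(lambda x: Q @ x, lambda v: v, lambda v: v,
             np.array([5.0, 5.0]), lam=0.09, beta=0.3)
```

Here the iterates contract towards the minimizer x = 0 of the quadratic, showing the basic mechanics before the Bregman machinery is needed.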
Recently, other acceleration algorithms for BPG have been proposed in the literature, like combining it with inertial methods [24] (a different methodology from ours), or combining it with Nesterov's acceleration method [25], but the latter requires the Bregman distance function to satisfy an extra crucial triangle scaling property.
From the convergence analysis (Section 4), the BPGe algorithm has to satisfy the condition D h (x k , y k ) ≤ ρ C k D h (x k−1 , x k ) (where C k ∈ (0, 1] and ρ ∈ (0, 1) are two parameters) to guarantee convergence. In the Lipschitz gradient continuous case, where D h (x, y) = ½‖x − y‖ 2 , this condition is easily satisfied just by choosing sup k∈N β k ≤ √(ρ C). But when D h is general, computing a threshold for β k directly may be hard and expensive. Therefore, we achieve this condition through a line search method (Algorithm 2, introduced in Section 3).
In the convergence analysis, we prove that any limit point of the sequence generated by BPGe is a stationary point under very general conditions. Moreover, by adding some slightly stronger assumptions and the Kurdyka-Lojasiewicz property, we can guarantee that the whole sequence generated by BPGe converges to a stationary point.
The paper is organized as follows. In Section 2 we first introduce some basic definitions in optimization: the smooth adaptable condition, relative weak convexity, and the Kurdyka-Lojasiewicz property. In Section 3 we introduce the new BPGe algorithm. The convergence analysis is done in Section 4, where, under assumptions of smooth adaptability and relative weak convexity for problem (P), we first show a descent-type lemma, from which it follows that any limit point of the sequence generated by BPGe is a critical point. Later, we prove that the whole sequence generated by BPGe converges to a critical point under the Kurdyka-Lojasiewicz property and a stronger assumption. Several numerical experiments comparing the performance of the BPGe method with the BPG one are presented in Section 5.

Preliminaries
Throughout the paper we will use the following basic notations. Let N := {0, 1, 2, . . . } be the set of nonnegative integers. We will always work in the Euclidean space R d , and the standard Euclidean inner product and the induced norm on R d are denoted by ⟨·, ·⟩ and ‖·‖, respectively. We denote by B ρ (x̄) := {x ∈ R d : ‖x − x̄‖ ≤ ρ} the ball of radius ρ > 0 around x̄ ∈ R d , and by dist(x, S) := inf y∈S ‖x − y‖ the distance from a point x ∈ R d to a nonempty set S ⊂ R d . The domain of a function f : R d → (−∞, +∞] is dom f := {x ∈ R d : f (x) < +∞}. For other generalized notions and definitions we refer to [4,26,27].

Smooth Adaptable Functions and Relative Weak Convexity
In this subsection, we recall the smooth adaptable condition for nonconvex f proposed in [4]. This property extends the recent work [3], in which the differentiable function needs to be convex. The condition is similar to the relative smoothness condition introduced in [28], but relative smoothness assumes that f is convex. As we also want to deal with nonconvex functions, in this paper we use the smooth adaptable condition to generalize Lipschitz gradient continuity and to derive the related convergence results for BPGe.
We first introduce the concept of Bregman distance needed in the definition of smooth adaptable condition.
Definition 1. Let S ⊂ R d be a nonempty, convex and open set. A function h : R d → (−∞, +∞] associated with S is called a kernel generating distance if: (i) h is proper, lower-semicontinuous and convex, with dom h ⊂ cl S and dom ∂h = S; (ii) h is continuously differentiable on int dom h ≡ S.
We denote the class of kernel generating distances by G(S). Given h ∈ G(S), the Bregman distance [29] is defined by

D h (x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.

Note that the Bregman distance is, obviously, a proximity measure between x and y. Next, we list some basic properties of the Bregman distance [30,31]: (i) Nonnegativity: since h is convex, D h (x, y) ≥ 0 for any x ∈ dom h and y ∈ int dom h, with equality if and only if x = y whenever h is strictly convex. (ii) The three point identity: for any y, z ∈ int dom h and x ∈ dom h,

D h (x, z) − D h (x, y) − D h (y, z) = ⟨∇h(y) − ∇h(z), x − y⟩.

(iii) Linear additivity: for any α, β ∈ R and any functions h 1 and h 2 ,

D αh 1 +βh 2 (x, y) = α D h 1 (x, y) + β D h 2 (x, y),

for every couple (x, y) ∈ (dom h 1 ∩ dom h 2 ) 2 such that both h 1 and h 2 are differentiable at y.
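The definition and the three point identity are easy to sanity-check numerically. The snippet below uses the Euclidean kernel h(x) = ½‖x‖ 2 (for which D h (x, y) = ½‖x − y‖ 2 ) purely as a convenient example; the identity checked is D h (x, z) − D h (x, y) − D h (y, z) = ⟨∇h(y) − ∇h(z), x − y⟩.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # D_h(x, y) = h(x) - h(y) - <grad_h(y), x - y>
    return h(x) - h(y) - grad_h(y) @ (x - y)

h = lambda v: 0.5 * (v @ v)          # Euclidean kernel
gh = lambda v: v                     # its gradient

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
d_xy = bregman(h, gh, x, y)
# three point identity: both sides should coincide
lhs = bregman(h, gh, x, z) - bregman(h, gh, x, y) - bregman(h, gh, y, z)
rhs = (gh(y) - gh(z)) @ (x - y)
```

The same helper works for any differentiable kernel h, e.g. the entropy or quartic kernels used in Section 5.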
Definition 2. The pair of functions (f, h) is said to be L-smooth adaptable (L-smad) on S if there exists L > 0 such that Lh − f and Lh + f are convex on S.
When h(x) = ½‖x‖ 2 , and consequently D h (x, y) = ½‖x − y‖ 2 , the L-smad condition on f reduces to Lipschitz gradient continuity: |f (x) − f (y) − ⟨∇f (y), x − y⟩| ≤ (L/2)‖x − y‖ 2 for any (x, y) ∈ dom h. Next we introduce the definition of a µ-relative weakly convex function, given in [32]. This definition extends the definition of weak convexity [33], which was employed in the analysis of nonconvex optimization methods.
Definition 3. f is called µ-relative weakly convex to h on S if there exists µ > 0 such that f + µh is convex on S.
When f is convex, µ = 0 works. When (f, h) is L-smad on S, obviously f is L-relative weakly convex to h; so, by default, µ ≤ L.

Kurdyka-Lojasiewicz Property
Finally, we introduce the definition of the Kurdyka-Lojasiewicz property proposed in [34]. We need this property to prove the global convergence of the whole sequence generated by BPGe for solving (P).
Definition 4. (i) A proper l.s.c. function f is said to satisfy the Kurdyka-Lojasiewicz (KL) property at x̄ ∈ dom ∂f if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function ϕ : [0, η) → [0, +∞) such that ϕ(0) = 0, ϕ is continuously differentiable on (0, η) with ϕ′ > 0, and for all x ∈ U ∩ {u ∈ R d : f (x̄) < f (u) < f (x̄) + η} the following inequality holds:

ϕ′(f (x) − f (x̄)) dist(0, ∂f (x)) ≥ 1.

(ii) If f satisfies the KL property at each point of dom ∂f , then f is called a KL function.

Bregman Proximal Gradient Algorithm with Extrapolation (BPGe)
Throughout this paper, we focus on the nonconvex problem (P) of Section 1 under the following assumptions on f and on the kernel generating distance function h: h ∈ G(R d ), (f, h) is L-smad and f is µ-relative weakly convex to h (see Definitions 2 and 3). We also make the following general Assumptions 1 and 2 throughout.
Assumption 1 is a quite standard condition [4] that guarantees the existence of a solution to the subproblem solved at each step of Proximal Gradient (PG) algorithms.
Assumption 2 is a general assumption used in the analysis of Bregman proximal-type algorithms [3,30].
We are now ready to introduce our BPGe algorithm, divided into two parts, Algorithm 1 and Algorithm 2. Algorithm 1 is the whole framework for solving problem (P), and Algorithm 2 is a line search step, used to search for a proper parameter β k at every iteration of Algorithm 1. Throughout the whole paper, we use the following notation for the stepsize bounds: by default, 0 < λ ≤ λ < ∞.
Algorithm 1: BPGe-Bregman Proximal Gradient algorithm with Extrapolation. Data: A function h defined in Definition 1 such that (f, h) is L-smad holds and f is µ-weakly convex relative to h on R d . Error tolerance: TOL. Initialization: x 0 = x −1 ∈ int dom h and 0 < λ k ≤ 1/L.

General step:
For k = 0, 1, 2, . . . , k max repeat Take where β k is searched according to Line Search in Algorithm 2.
Then compute x k+1 by (2), until EXIT(TOL) is received.
Algorithm 2: Line Search for Algorithm 1 at the k-th iteration.
We remark that an important point in any iterative process is to define suitable error control techniques. In this paper we consider a quite simple strategy to determine the EXIT conditions. On the one hand, we fix a maximum number of iterations k max (in most of our tests 5000 iterations), and EXIT(TOL)=true if ‖x k − x k−1 ‖ / max{1, ‖x k ‖} ≤ TOL (in our tests TOL = 10 −6 , as in [8]). Another option is to check convergence using the objective function instead of the solution itself, that is, |Ψ(x k ) − Ψ(x k−1 )| / max{1, |Ψ(x k )|} ≤ TOL.
We first verify that (2) is well-defined using the following Proposition 1. For all y ∈ int dom h and stepsize 0 < λ ≤ 1/L, we define the Bregman proximal gradient mapping as T λ (y) := arg min u∈R d { g(u) + ⟨∇f (y), u − y⟩ + (1/λ) D h (u, y) }. In Proposition 1 we prove that T λ (y) is well posed. Thus, by Proposition 1, x k+1 ∈ T λ k (y k ), and fixing inf{λ k } > 0, Step (2) in the BPGe algorithm is well-defined.
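Both EXIT rules described above are one-line checks in practice; the helper names below are our own.

```python
import numpy as np

def exit_by_iterates(x_new, x_old, tol=1e-6):
    # EXIT(TOL): ||x_k - x_{k-1}|| / max(1, ||x_k||) <= TOL
    return np.linalg.norm(x_new - x_old) / max(1.0, np.linalg.norm(x_new)) <= tol

def exit_by_objective(psi_new, psi_old, tol=1e-6):
    # alternative rule based on the objective values:
    # |Psi(x_k) - Psi(x_{k-1})| / max(1, |Psi(x_k)|) <= TOL
    return abs(psi_new - psi_old) / max(1.0, abs(psi_new)) <= tol
```

Either rule can be combined with the iteration cap k max used in the experiments of Section 5.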
Proposition 1. Suppose that Assumption 1 holds, let y ∈ int dom h and 0 < λ ≤ 1/L. Then, the set T λ (y) is a nonempty and compact set.
Proof. Fix any y ∈ int dom h and 0 < λ ≤ 1/L. For any u ∈ R d , we define Ψ h (u) := g(u) + ⟨∇f (y), u − y⟩ + (1/λ) D h (u, y), so that T λ (y) = arg min u∈R d Ψ h (u). It can also be represented as where the second inequality is obtained by taking into account λ −1 ≥ L, and the last inequality follows from Assumption 1. Since Ψ h is also proper and lower-semicontinuous, invoking the modern form of Weierstrass' theorem (see, e.g., [26, Theorem 1.9, page 11]), it follows that the value inf R d Ψ h is finite, and the set arg min u∈R d Ψ h (u) ≡ T λ (y) is nonempty and compact.
Secondly, we add an extrapolation step to the BPGe algorithm, choosing a suitable β k at each iteration through the line search Algorithm 2. With this step it is hard to guarantee directly the decrease of the function values Ψ(x k ). Therefore, in the convergence analysis we focus on guaranteeing sufficient decrease of the Lyapunov sequences defined in Section 4. However, it still requires the extra condition D h (x k , y k ) ≤ ρ C k D h (x k−1 , x k ). When D h (x, y) = ½‖x − y‖ 2 , BPGe is reduced to the PGe algorithm [8] and this condition is easily satisfied by setting 0 ≤ β k ≤ ρ L/(L + µ). But when h is more general and complex, computing the threshold of β k directly may be hard and expensive. So, we try to reach this condition by the line search method introduced in Algorithm 2. Thus, our next step is to verify that Algorithm 2 is well-defined, as the following Proposition 2 shows.
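Since the details of Algorithm 2 are not reproduced here, the following backtracking sketch only illustrates one plausible way to enforce the condition D h (x k , y k ) ≤ ρ C k D h (x k−1 , x k ): start from a large trial β and shrink it until the extrapolated point satisfies the inequality. The initial value β₀, the shrink factor and all names are our own assumptions, not the paper's exact procedure.

```python
import numpy as np

def line_search_beta(D_h, x_k, x_prev, rho, C_k, beta0=0.99, shrink=0.5, max_tries=50):
    """Backtracking sketch: shrink beta until the extrapolated point y_k
    satisfies D_h(x_k, y_k) <= rho * C_k * D_h(x_prev, x_k)."""
    rhs = rho * C_k * D_h(x_prev, x_k)
    beta = beta0
    for _ in range(max_tries):
        y = x_k + beta * (x_k - x_prev)
        if D_h(x_k, y) <= rhs:
            return beta, y
    # beta = 0 always satisfies the condition since D_h(x_k, x_k) = 0
        beta *= shrink
    return 0.0, x_k.copy()
```

Note that the search always terminates: β = 0 gives y = x k and D h (x k , x k ) = 0, which trivially satisfies the inequality.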
Proposition 2 (Finite termination of Algorithm 2). Consider Algorithm 1 and fix k ∈ N. Then the line search of Algorithm 2 terminates in a finite number of steps.
Proof. This result is proved by contradiction. Suppose that the line search condition fails to hold for every j ∈ N.
Since x k ≠ x k−1 , and due to the strict convexity of h in Assumption 2(i), we get, for j ≥ J, which is a contradiction.

Convergence Analysis of BPGe
In this section we provide the main convergence results for the BPGe algorithm. First of all, following the analysis of Remark 4.1(ii) in [4], we obtain Lemma 1 below. After adding an extrapolation term, it is hard to establish monotonicity of the objective function Ψ directly; however, for a special auxiliary sequence, defined by H k,M := Ψ(x k ) + M D h (x k−1 , x k ), monotonicity can be established in our setting.

Lemma 1.
For any x ∈ int dom h, let {x k } k∈N be a sequence produced by BPGe; then the following holds. Moreover, assuming there exists some M such that ρ λ −1 ≤ M ≤ λ −1 , the sequence {H k,M } is nonincreasing and convergent for this fixed M .
Proof. (i) According to the first order optimality condition of (2), we get Combining this with the convexity of g, we have Together with the three point identity of the Bregman distance, If we use the µ-relative weak convexity and the L-smad property of (f, h) (see Definitions 2 and 3), then Thus Combining these two cases, we obtain From the definition of H k,M , we see that Furthermore, assuming there exists some M such that and fixing one such value of M , we find that The next corollary is an obvious consequence of Lemma 1. We analyze the boundedness of the sequences produced by the BPGe algorithm. Since H k,M is nonincreasing according to Lemma 1(ii), it is easy to verify that the sequence {x k } k∈N generated by BPGe is bounded according to Assumption 1. This boundedness will act as a tool in the following analysis, so we present this result as the auxiliary Corollary 1.
If the stepsize λ k and the parameter ρ in Algorithm 2 satisfy ρ < λ −1 /λ −1 = λ/λ, then we obtain sufficient decrease of the auxiliary sequence {H k,M } k∈N for the fixed M given in Lemma 1. As a consequence, we can bound the sum of Bregman distances between consecutive iterates generated by BPGe. Moreover, adding assumptions stronger than Assumption 2 on the kernel generating distance h, such as strong convexity, we can obtain lim k→∞ ‖x k − x k−1 ‖ = 0 for the sequence {x k } k∈N generated by BPGe in R d . In this paper, we just consider the set of weaker blanket Assumptions 1 and 2, which permit us to prove that any limit point of the sequence {x k } k∈N generated by BPGe, if it exists, is a stationary point of the objective function Ψ.
Assume that {x k } k∈N is generated from a starting point x 0 . The set of all limit points of {x k } k∈N is denoted by ω(x 0 ). The next technical lemma shows, among other results, that any such limit point is a critical point. Lemma 2. Suppose ρ < λ/λ and let {x k } k∈N be a sequence generated from x 0 by BPGe. Then the following statements hold: which implies, ∀K ∈ N, by summing both sides of (10) from 0 to K, that Since {H k,M } is convergent by Lemma 1(ii), letting K → ∞, we conclude that the infinite sum exists and is finite, i.e., we obtain On the one hand, we prove that there exists v k i ∈ ∂Ψ(x k i ) such that v k i → 0. By using the first-order optimality condition of the minimization problem (2), we obtain Therefore, we observe that Taking limits on the left-hand side of (13), we have that where the limit follows from (12) and the continuity of ∇f and ∇h. Thus, we get that On the other hand, we derive that Ψ(x k i ) → Ψ(x). According to the iteration step (2) of BPGe, for k i ≥ 1, we have Adding λ k i −1 f (x k i ) to both sides, we have After rearranging terms, for all k i ∈ N, it follows that The L-smad property and µ-relative weak convexity of (f, h) imply that for all Plugging (18) into (17) and passing to the limit, together with the relationship λ ≤ λ k i ≤ λ, we have where the second inequality is based on L ≤ λ −1 ≤ λ −1 in BPGe. From (12), together with the continuity of ∇h, we obtain Combining (15) and (20), and according to these results and the closedness of ∂Ψ (see Exercise 8 on page 80 of [2]), we have 0 ∈ ∂Ψ(x).
(iii) In view of Lemma 1 and (i), the sequence {H k,M } is convergent and D h (x k−1 , x k ) → 0; together with the definition of H k,M , these imply that lim k→∞ Ψ(x k ) exists, denoted by ζ. According to the last part of the proof of (ii), taking x ∈ ω(x 0 ) with a convergent subsequence, we obtain Ψ(x) = ζ. Thus the proof is complete, since x is arbitrary.
Next, we prove a global O(1/K) sublinear convergence rate for the sequence min 1≤k≤K D h (x k−1 , x k ) of the algorithm. In fact, a linear convergence rate can also be obtained if we add further assumptions, like the KL property with a concrete KL exponent (we refer to [35]), based on deductions similar to [4, Theorem 6.3].
Corollary 2. Suppose ρ < λ/λ and let {x k } k∈N be a sequence generated from x 0 by BPGe. Then, for all K ≥ 1, min 1≤k≤K D h (x k−1 , x k ) converges to zero with a sublinear rate O(1/K).
Proof. Set M = λ −1 and recall (11); then, for K ≥ 1, Hence we obtain the claimed O(1/K) rate.
Next, we focus on performing a global convergence analysis. We aim to prove that the sequence {x k } k∈N generated by BPGe converges to a critical point of the objective function Ψ defined in (P). In order to prove global convergence, we use the well-known proof methodology summarized in the following abstract result.
Theorem 1. Let F be a proper l.s.c. function and let (z k ) k∈N be a sequence satisfying the following three conditions:
(H1) (Sufficient decrease) For each k ∈ N, there exists a positive constant a such that
(H2) (Relative error) For each k ∈ N, there exists a positive constant b such that for some v k+1 ∈ ∂F (z k+1 ) we have
(H3) (Continuity) There exists a subsequence (z k j ) j∈N such that z k j → z̄ and F (z k j ) → F (z̄).
Moreover, suppose F has the Kurdyka-Lojasiewicz property at the limit point z̄ = (x̄, x̄) specified in (H3). Then the sequence {x k } k∈N has finite length, i.e., Σ ∞ k=1 ‖x k − x k−1 ‖ < ∞, and converges to x̄ as k → ∞, where z̄ = (x̄, x̄) is a critical point of F .
In our setting, we need to verify that the conditions given in Theorem 1 are satisfied for F (x, y) = Ψ(x) + M D h (y, x) and the sequence (x k , x k−1 ) k∈N ⊂ R 2d generated by the BPGe algorithm.
In order to guarantee that the three conditions hold, we need another assumption. The first two requirements of this assumption were also required in [4, see Assumption D(ii)], and the third condition is easily verified.
Assumption 3. (i) h is strongly convex on R d .
(ii) ∇h, ∇f are Lipschitz continuous on any bounded subset of R d .
(iii) There exists a bounded u such that u ∈ ∂(∇h) on any bounded subset of R d .

In fact, Assumption 3(i-ii) guarantees that Assumption 1(ii-iii) holds for the bounded sequence {x k } k∈N .
The next task is to verify the three conditions one by one. Then, together with Theorem 1, we obtain the result that, under proper parameter selection, the whole sequence generated by BPGe converges to a critical point of the objective function.
Theorem 2. Suppose that F (x, y) = Ψ(x) + M D h (y, x) satisfies the Kurdyka-Lojasiewicz property at some limit point z̄ = (x̄, x̄) ∈ R 2d and that Assumption 3 holds. Then: (i) the sequence {x k } k∈N has finite length; (ii) x k → x̄ as k → ∞, and x̄ is a critical point of Ψ.
Proof. We first verify the three conditions of Theorem 1 for the function F and the BPGe algorithm.
(H1) According to Assumption 3, h is strongly convex; assume that h is σ-strongly convex, that is, D h (x, y) ≥ (σ/2)‖x − y‖ 2 for any x, y ∈ R d . We denote a = (σ/2)(M − ρ λ −1 ). For any k ∈ N, where the first inequality is based on the strong convexity of h, the second inequality is based on λ ≤ λ k , the third and the last equalities come from the definitions of H k,M and F , the fourth inequality comes from Lemma 1(ii), and the fifth inequality comes from the nonnegativity of (M − λ −1 k ) D h (x k , x k+1 ). Thus (H1) is verified.
According to Theorem 1, combining the three verified conditions with the KL property at z̄ guarantees that conclusion (i) holds. Conclusion (ii) follows from conclusion (i): {x k } k∈N is a Cauchy sequence in R d and converges to its limit point x̄, which, by Theorem 1, is a critical point.

Numerical Results
In this section we perform several numerical tests in order to show the behaviour of the BPGe algorithm and the convergence speed-up it obtains. We consider two important optimization problems in which the differentiable part of the objective does not admit a globally Lipschitz continuous gradient: a convex Poisson linear inverse problem and a nonconvex quadratic inverse problem (so the PG and PGe algorithms cannot be applied to these problems). It is important to remark that in cases where the differentiable part of the objective admits a globally Lipschitz continuous gradient, the BPG and BPGe algorithms become the PG and PGe algorithms, respectively. That is, the BPG and BPGe methods can be applied, but their performance in those cases was already shown in [8].
The main parameters in the BPGe algorithm are the stepsizes λ k in Algorithm 1 and the parameter ρ that controls the extrapolation coefficients β k in the line search method of Algorithm 2. In our tests we consider fixed stepsizes λ k = λ. The influence of both parameters {λ, ρ} is studied below in the tests in order to fix suitable values.
All the numerical experiments have been performed in Matlab 2013a on a PC Intel(R) Xeon(R) CPU E5-2697 (2.6 GHz).

Application to Poisson Linear Inverse Problems (PLIP)
Poisson Linear Inverse Problems (that is, linear inverse problems in the presence of Poisson noise) arise in many fields, like astronomy, nuclear medicine (e.g., Positron Emission Tomography) and fluorescence microscopy [3,23,39]. Therefore, the design of methods and estimators for such problems has been studied intensively over the last two decades (for a review, see [23,39]). Often these problems can be represented as a minimization problem of the form

min x { d(b, Ax) + θ g(x) },

where θ > 0 is used to weigh the data fidelity criterion against its regularizer g, and d(·, ·) denotes a convex proximity measure between two vectors.
A very well-known measure of proximity between two nonnegative vectors Ax and b is based on the Kullback-Leibler divergence:

d(b, Ax) := Σ m i=1 ( b i log( b i /(Ax) i ) + (Ax) i − b i ),

which corresponds to the negative log-likelihood function of the Poisson noise model. It is easy to see that f := d(b, Ax) has no globally Lipschitz continuous gradient [3], but it satisfies the L-smad condition with the Burg entropy kernel h(x) = −Σ d j=1 log x j , and so now the Bregman distance is given by

D h (x, y) = Σ d j=1 ( x j /y j − log(x j /y j ) − 1 ).

Therefore, we have that: (i) (f, h) is L-smad for any L ≥ ‖b‖ 1 (Lemma 7 in [3]), and f is 0-relative weakly convex to h since f is convex; (ii) Assumptions 1 and 2 hold, but Assumption 3 does not hold.
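Under the Burg entropy kernel, ∇h(x) = −1/x, so the Bregman step (with g ≡ 0) has an elementwise closed form: from −1/x k+1 = −1/x k − λ∇f (x k ) one gets x k+1 = x k /(1 + λ x k ⊙ ∇f (x k )). The sketch below runs plain BPG (β k = 0, extrapolation omitted) on synthetic data matching the setup of this section; the data sizes, iteration count and names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 50, 10
A = rng.uniform(0.0, 1.0, (m, d))            # nonnegative model matrix
x_star = rng.uniform(0.1, 1.0, d)
b = A @ x_star                               # noiseless measurements

def kl(p, q):
    # Kullback-Leibler divergence d(p, q) = sum p*log(p/q) - p + q
    return float(np.sum(p * np.log(p / q) - p + q))

grad_f = lambda x: A.T @ (1.0 - b / (A @ x))   # gradient of x -> kl(b, Ax)
lam = 1.0 / np.sum(b)                          # stepsize 1/L with L = ||b||_1

x0 = np.full(d, 0.5)
x = x0.copy()
for _ in range(2000):
    # Burg-entropy Bregman step, elementwise closed form
    x = x / (1.0 + lam * x * grad_f(x))
```

Extrapolation (β k > 0 chosen via Algorithm 2) would be added on top of this basic BPG step to obtain BPGe.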
So, from the convergence analysis in Section 4, we can solve this problem using the BPGe algorithm, and it is guaranteed that any limit point of the sequence generated by BPGe is a stationary point of the objective function Ψ.
An important point in any iterative method is to define suitable error control techniques. As discussed in Section 3, the EXIT conditions of the experiments are met when the number of iterations exceeds 5000 or when ‖x k − x k−1 ‖ / max{1, ‖x k ‖} ≤ 10 −6 (as in [8]).
In the tests, the entries of A ∈ R m×d + and x ∈ R d + are generated following independent uniform distributions over the interval [0, 1]. We consider the case g(x) ≡ 0, i.e., we solve the inverse problem without regularization, so the minimization problem is the standard Poisson-type maximum likelihood estimation problem (modulo a change of sign to pass to a minimization problem).
As these methods (BPG and BPGe) can be applied to both overdetermined (m > d) and underdetermined (m < d) problems, we have performed numerical tests on both cases. First, we present the results obtained in the overdetermined case. As commented before, the main parameters in the BPGe algorithm are the stepsize λ and the parameter ρ. In order to briefly determine the most suitable set of parameters, we analyze the influence of both parameters {λ, ρ} in Figure 1. In all the pictures we show the evolution of Ψ(x k ) − Ψ(x * ) (x * being the approximate solution obtained at termination of each respective algorithm) with respect to the iteration number k. With this figure we can study the influence of the parameters with respect to the size of the problem (number of measurements m) with fixed dimension d = 100. Globally, we observe that the value ρ = 0.99 gives the best results, even if in some cases the set of initial conditions gives rise to very fast convergence (as in the cases of using λ = 1/(2L) for m = 5000 and ρ = 0.95, where we have fast linear convergence instead of sublinear). Note that this kind of difference can be observed in other situations, but the average behaviour tells us that the best performance is obtained with ρ = 0.99. On the other hand, similar comments apply to the stepsize parameter λ: the general situation also recommends taking the highest value λ = 1/L (for both algorithms, BPGe and BPG).
In Figure 2, now with the fixed parameter values {λ = 1/L, ρ = 0.99} and for the overdetermined (m > d) case, we show the evolution of the objective function Ψ(x k ) vs. the iteration number for several problem sizes (number of measurements m) with fixed vector dimension d = 100. We observe that the BPGe algorithm is always much faster than the BPG one. In order to observe the faster convergence more clearly, we present many more simulations in Figure 3, now showing the evolution of Ψ(x k ) − Ψ(x * ). We note that the differences between both methods are bigger for low-dimension d problems, in fact for the most overdetermined problems m ≫ d.
In the underdetermined case we also analyze the influence of both parameters {λ, ρ} in Figure 4 with respect to the size of the problem (number of measurements m) with fixed dimension d = 5000. Now, we observe that the value of the parameter ρ does not seem to affect the global performance of the method too much, so we take the value ρ = 0.99 when we fix this parameter. Similar comments as in the overdetermined case apply to the stepsize parameter λ. Now the behaviour is quite regular, no cases of very fast convergence have been observed, and the fastest convergence is obtained for the highest value λ = 1/L (for both algorithms, BPGe and BPG). Therefore, in the rest of the tests in this paper we fix the parameter values {λ = 1/L, ρ = 0.99}.
In Figure 5, now with the fixed parameter values {λ = 1/L, ρ = 0.99} and for the underdetermined (m < d) case, we observe that the BPGe algorithm is always much faster than the BPG one. But, similarly to the overdetermined case, the differences are bigger for larger ratios d/m, that is, for the most underdetermined problems m ≪ d.
Finally, in Table 1 we give the CPU-time and number of iterations for different problem sizes (number of data m and dimension d) and for two values of the parameter λ (λ = 1/L and 1/(3L)), for the overdetermined (top) and underdetermined (bottom) cases. From the simulations we observe that when the problem size is not very big (probably because in the other cases longer simulations are needed) the ratios between both methods provide an interesting speed-up, and in most cases the EXIT strategy stops the BPGe algorithm before the maximum number of iterations is reached.
On the other hand, we observe that the CPU-time and iteration-number ratios are quite similar, so there is little difference between them. Note that although the BPGe algorithm has an extra step, the line search method of Algorithm 2, it adds very little to the final CPU-time.
In the table we have marked three discordant cases (superscript a) related to fast linear convergence, instead of sublinear. This is illustrated, for example, in the bottom-left plot of Figure 1 (ρ = 0.99, m = 1000), where the green curve, corresponding to λ = 1/(3L), converges faster than the other colours (as also occurs in other plots of the same figure). Note that for an overdetermined problem with random data, some initial conditions and data may lead to faster convergence. For the underdetermined problem the behaviour is regular in all the simulations.
Therefore, in the Poisson linear inverse problem tests the BPGe algorithm presents faster performance than the BPG algorithm, making it an interesting option for real problems.

Application to Quadratic Inverse Problems
In the second test (taken from [4]) we show that the BPGe algorithm can deal with a nonconvex Quadratic Inverse Problem (QIP), in which the differentiable term has no globally Lipschitz continuous gradient. This problem is a natural extension of the linear inverse problem, now using quadratic measurements. It appears in many popular applications, such as signal recovery [10] and phase retrieval [11] from the knowledge of the amplitude of complex signals.
A general description of the Quadratic Inverse Problem is to find the vector x ∈ R d that solves the system

x T A i x = b i , i = 1, . . . , m,

where {A i ∈ R d×d | i = 1, . . . , m} is a set of symmetric matrices that describes the model, and b = (b 1 , . . . , b m ) ∈ R m is a vector of usually noisy measurements.
Following the formalism given in [4, Section 5.1], this problem can be formulated as the nonconvex minimization problem

min x∈R d { f (x) + θ g(x) }, with f (x) = ¼ Σ m i=1 (x T A i x − b i ) 2 ,

where θ > 0 is used to weigh the data fidelity criterion against its regularizer g. In our experiments, we take the convex l 1 -norm regularization function g(x) = ‖x‖ 1 . Note that f (x) is a nonconvex differentiable function that does not admit a globally Lipschitz continuous gradient.
The main quality of the BPG and BPGe algorithms (as noted for BPG in [4]) is that these methods can solve the broad class of problems (QIP). To apply BPG and BPGe to the QIP model properly, we first need to identify a suitable kernel function h (Definition 1). In [4], a proper choice was given as

h(x) = ¼‖x‖ 4 + ½‖x‖ 2 ,

with its corresponding Bregman distance D h . When L is chosen such that L ≥ Σ m i=1 ( 3‖A i ‖ 2 + ‖A i ‖ |b i | ), then by [4, Lemma 5.1] the L-smad condition (Definition 2) holds for the selected functions f (x), g(x) and h(x). Besides, following the same analysis as in [4, Lemma 5.1], we can derive the corresponding relative weak convexity parameter µ. In conclusion, we have that: (i) (f, h) is L-smad and f is µ-relative weakly convex to h; (ii) Assumptions 1 and 2 are easily verified.
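To make these ingredients concrete, the snippet below assembles f , ∇f , the smoothness constant of [4, Lemma 5.1] and ∇h for the quartic kernel; the random rank-1 data mimic the experimental setup of this section, while the sizes and names are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 8, 4
As = [np.outer(a, a) for a in rng.standard_normal((m, d))]   # symmetric rank-1 A_i
x_star = np.zeros(d); x_star[0] = 1.0                        # sparse ground truth
b = np.array([x_star @ A @ x_star for A in As])

def f(x):
    # f(x) = (1/4) * sum_i (x^T A_i x - b_i)^2
    return 0.25 * sum((x @ A @ x - bi) ** 2 for A, bi in zip(As, b))

def grad_f(x):
    # for symmetric A_i: grad f(x) = sum_i (x^T A_i x - b_i) * A_i x
    return sum((x @ A @ x - bi) * (A @ x) for A, bi in zip(As, b))

# smoothness constant L >= sum_i (3*||A_i||^2 + ||A_i||*|b_i|), spectral norms
L = sum(3.0 * np.linalg.norm(A, 2) ** 2 + np.linalg.norm(A, 2) * abs(bi)
        for A, bi in zip(As, b))

grad_h = lambda x: (x @ x + 1.0) * x   # kernel h(x) = 0.25*||x||^4 + 0.5*||x||^2
```

The BPG/BPGe subproblem for this kernel reduces to a one-dimensional cubic equation in [4]; we omit that solver here and only set up the model pieces.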
(iii) f, g and D_h are all semi-algebraic (see for example [34]). One can show inductively that H_M(x, y) = Ψ(x) + M D_h(x, y) is semi-algebraic, thus it has the KL property (Definition 4) at any point (x, x). Besides, we can verify that Assumption 3 holds.
This means, from the convergence analysis of Section 4, that the sequence generated by the BPGe algorithm converges to a critical point of the objective function Ψ.
Here, we perform several numerical tests to compare the behaviour of the BPGe and BPG algorithms. As we did with the previous problem (PLIP), we have designed two main families of experiments, considering the overdetermined (m > d) and underdetermined (m < d) cases. To that goal we set different values of m and d, and we generate m random rank-1 matrices $A_i = a_i a_i^T$ in $\mathbb{R}^{d \times d}$, where the entries of the vectors $a_i$ follow independent Gaussian distributions with zero mean and unit variance. The exact solution $x^* := \arg\min\{\Psi(x) : x \in \mathbb{R}^d\}$ is chosen as a sparse vector (the sparsity is 5%) and $b_i = (x^*)^T A_i x^*$, $i = 1, \ldots, m$. We set the weight parameter θ = 1 as default.
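The experimental data described above can be generated along the following lines (a Python/NumPy sketch with illustrative sizes; the seed and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 20                                 # overdetermined case (m > d)
a = rng.standard_normal((m, d))                # Gaussian vectors a_i (zero mean, unit variance)
A_list = [np.outer(ai, ai) for ai in a]        # rank-1 symmetric matrices A_i = a_i a_i^T
x_star = np.zeros(d)
support = rng.choice(d, size=max(1, d // 20), replace=False)  # ~5% sparsity
x_star[support] = rng.standard_normal(support.size)
b = np.array([x_star @ A @ x_star for A in A_list])  # noiseless measurements b_i = (x*)^T A_i x*
```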
As a first performance comparison, in Table 2 we give the CPU-time and number of iterations for different sizes of problems (number of data m and dimension d) and for two values of the λ parameter (λ = 1/L and λ = 1/(3L)) in the overdetermined case. The values T_BPGe and T_BPG denote the CPU-times of the BPGe and BPG algorithms, and N_BPGe and N_BPG the corresponding numbers of iterations needed to reach the EXIT criterion. From the simulations we observe that the ratios between both methods provide an interesting speed-up, and that the EXIT strategy stops the BPGe algorithm before the maximum number of iterations (k_max = 5000 in this case) is reached. On the other hand, we observe that the CPU-time and iteration-number ratios are quite similar, so there is little difference between them. Therefore, we note again that although the BPGe algorithm has an extra step (the line-search method of Algorithm 2), it increases the final CPU-time only slightly. Also, from the data we observe that although the ratio between the BPGe and BPG algorithms for λ = 1/(3L) is quite good, the option BPGe with λ = 1/L performs many fewer iterations, and so it is the recommended option.

In Figure 6, with the fixed parameter values {λ = 1/L, ρ = 0.99} and for the overdetermined (m > d) and underdetermined (m < d) cases, we show the evolution of Ψ(x^k) − Ψ(x^*). In this problem we observe that the performance of the accelerated BPGe algorithm in the overdetermined case is quite good, giving linear convergence. In the underdetermined case the behaviour seems to be sublinear, and more iterations are needed to reach the desired value (in these simulations k_max = 20000). In both cases the BPGe algorithm performs much better than the BPG one. For the underdetermined case we also show the evolution of the objective function Ψ(x^k) vs. iteration number, to see that in this case the objective function takes large values, and therefore, when applying the EXIT strategy the required precision is obtained (a relative error < 10^{-6}) giving not-too-small absolute values.
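To make the iteration concrete, the following Python/NumPy sketch implements a simplified BPGe loop for this QIP model, using the closed-form Bregman proximal step of [4] (soft-thresholding followed by a scalar cubic equation). As a simplifying assumption, it uses a fixed extrapolation weight beta instead of the line search of Algorithm 2, and the function names are ours:

```python
import numpy as np

def soft_threshold(p, tau):
    """Componentwise soft-thresholding operator."""
    return np.sign(p) * np.maximum(np.abs(p) - tau, 0.0)

def bpg_step(y, A_list, b, lam, theta):
    """One Bregman proximal step for QIP with h(x) = 1/4||x||^4 + 1/2||x||^2.

    Solves argmin_x { theta*lam*||x||_1 + lam*<grad f(y), x> + D_h(x, y) }:
    soft-threshold p = grad h(y) - lam*grad f(y), then rescale by the unique
    positive root of t^3 ||v||^2 + t - 1 = 0 (cf. [4, Section 5.1]).
    """
    grad_f = sum((y @ A @ y - bi) * (A @ y) for A, bi in zip(A_list, b))
    grad_h = (y @ y + 1.0) * y
    p = grad_h - lam * grad_f
    v = soft_threshold(p, theta * lam)
    nv2 = v @ v
    if nv2 == 0.0:
        return np.zeros_like(y)
    roots = np.roots([nv2, 0.0, 1.0, -1.0])
    t = roots[np.argmin(np.abs(roots.imag))].real  # the single real (positive) root
    return t * v

def bpge(x0, A_list, b, lam, theta, beta=0.3, iters=200):
    """BPGe sketch: extrapolate y^k = x^k + beta (x^k - x^{k-1}), then one Bregman step."""
    x_prev, x = x0, x0
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x_prev, x = x, bpg_step(y, A_list, b, lam, theta)
    return x
```

With beta = 0 this reduces to the plain BPG iteration; the role of the line search in Algorithm 2 is precisely to choose the extrapolation weight safely at each step.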
Therefore, again in the Quadratic Inverse Problem tests the BPGe algorithm performs faster than the BPG algorithm, making it an interesting option for real problems.

Conclusions
This work has joined two powerful methods for solving large-scale minimization problems, and we have proposed a new accelerated Bregman proximal gradient algorithm (BPGe) useful for nonconvex and nonsmooth minimization problems. On the one hand, we have taken the BPG algorithm, able to deal with problems whose gradient is not globally Lipschitz continuous, first defined for the convex case [3] and later extended to the nonconvex case in [4]. On the other hand, we have used the accelerated extrapolation technique (used for instance in the PG algorithm [8]). The use of the Bregman distance paradigm enlarges the class of problems we can work with, because we do not need the assumption of global Lipschitz gradient continuity, and with the extrapolation technique the convergence of the method is accelerated.
In this paper we have studied the convergence of the new method, and we have proven that, by choosing the parameters properly, any limit point of the sequence generated by the BPGe algorithm is a stationary point of the problem. Besides, assuming the Kurdyka-Łojasiewicz property, we have proven that the whole sequence generated by BPGe converges to a stationary point.
Finally, we have applied it to two important practical problems that arise in many fundamental applications (and that do not satisfy the global Lipschitz gradient continuity assumption): Poisson linear inverse problems and quadratic inverse problems, in both the overdetermined and underdetermined cases. In these tests the BPGe algorithm has shown faster convergence than the BPG algorithm, and so the new BPGe algorithm seems to be an interesting methodology.