An Efficient Single-Parameter Scaling Memoryless Broyden-Fletcher-Goldfarb-Shanno Algorithm for Solving Large Scale Unconstrained Optimization Problems

In this paper, a new spectral scaling memoryless Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is developed for solving large scale unconstrained optimization problems, where the scaling parameter is chosen so as to minimize all the eigenvalues of search direction matrices. The search directions in this algorithm are proved to satisfy the approximate Dai-Liao conjugate condition. With this advantage of the search directions, a scaling memoryless BFGS update formula is constructed and an algorithm is developed by incorporating acceleration strategy of line search and restart criterion. Under mild assumptions, global convergence of the algorithm is proved. Numerical tests demonstrate that the developed algorithm is more robust and efﬁcient in solving large scale benchmark test problems than the similar ones in the literature.


I. INTRODUCTION
Optimization models have found wider applications in the fields of engineering and management sciences [1], [2]. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is one of the most efficient quasi-Newton algorithms for solving medium scale unconstrained optimization models [3]. Owing to its good local and global convergence and self-correcting quality [4]- [6], it is particularly suitable for solving the small-sized and medium-sized unconstrained optimization problems [7], [8]. Specifically, a mathematical model of unconstrained optimization problems can be written as where f : R n → R is continuously differentiable. The BFGS method produces an iterate format for solving (1) by generating a sequence {x k }, specified by The associate editor coordinating the review of this manuscript and approving it for publication was Tachun Lin . where and α k is a step size computed by a line search. It has been shown that B k ∈ R n×n in formula (3) is an approximation of the Hessian matrix ∇ 2 f (x k ) for k ≥ 1. Most importantly, if y T k s k > 0, then for all k, B k ∈ R n×n in formula (3) is symmetric and positive definite. Therefore, d k in (3) is a descent direction of the objective function f , as well as being close to the Newton direction. In general, the condition y T k s k > 0 is guaranteed by the Wolfe-Powell line search. However, some numerical experiences [7]- [9] have showed that the BFGS algorithm may be not efficient enough due to a poor approximation to the Hessian matrix at the initial point, or due to the ill-conditioning of the approximate Hessian matrices in the iterate process. To overcome the drawbacks of the BFGS algorithm, some modified versions have been proposed in the literature to improve its numerical efficiency and robustness [10]- [14]. In order to obtain a more effective algorithm, two aspects are generally studied. The first focuses on innovation of line search rules. In addition to the classical Wolfe line search and Armijo search rules, many effective line search rules have been proposed to find suitable step lengths in recent years [15], [16]. Another is focused on determination of efficient search directions. Exactly due to meticulous choices of step lengths and search directions, this type of algorithms are often more efficient than the heuristic algorithms [17]- [20]. Particularly, as shown in the existing results [7], [9], [13], [14], [21], the scaling factor in the update formula of BFGS plays an important role in approximating the Hessian matrix and improving efficiency of algorithms. In this paper, we mainly investigate how to determine search directions by this scaling strategy.
Note that the scaling strategy in the BFGS update formula has been regarded as one of the main approaches to avoid an ill-conditional B k [9] in (3). It includes two ways: one is to multiply the approximate Hessian matrix by an appropriate scalar before it is updated in the BFGS method. Another is by properly scaling one or two terms in the BFGS update formula. Specifically, if the third term in the second equality in (3) is multiplied by a scaling parameter γ k , then it is called one-parameter scaling BFGS update formula: In this case, the inverse of B k+1 , H k+1 , reads Clearly, if γ k = 1, then (4) is exactly the standard BFGS update formula. One of our goals in this paper is to find a better scaling parameter such that the corresponding algorithm is more efficient and robust. For many large-scale practical optimization models [22], [23], the (scaling) BFGS algorithms like (3) and (4) are often powerless because they are associated with solution of a large-scale system of linear equations B k d k = −g k , as well as computation and storage of matrices B k with large sizes. Therefore, another goal in this paper is to modify the scaling BFGS algorithms such that only gradient information in the optimization problem (1) is employed to compute the search directions and the step sizes without needs of computing and storing matrices [15]. In summary, this paper intends to study a single-parameter scaling memoryless BFGS algorithm for solving large scale unconstrained optimization problems.
The rest of this paper is organized as follows. In next section, we review the literature related to this study. Section III is devoted to development of a scaling memoryless BGFS algorithm based on acceleration scheme and restart criterion. In Section IV, global convergence of the algorithm is established. Numerical tests and discussion are conducted in Section V to show advantages of our algorithm. Along with suggestions for future research, some conclusions are drawn in the last section.

II. LITERATURE REVIEW
The BFGS algorithm was first proposed in 1970 by Broyden, Fletcher, Goldfarb and Shanno [24]. Owing to its fast convergence speed, a great number of its variants have been studied. Biggs [13] proposed a modified BFGS algorithm by introducing a scaling parameter such that an improved estimate of the second directional derivative is obtained. Yuan [14] presented another modified BFGS algorithm such that the updated approximate Hessian matrix satisfies the most recent quasi-Newton condition: the gradient value of the local quadratic model matches that of the objective function at the previous iterate. Cheng and Li [7] proposed a spectral scaling BFGS method by scaling the quasi-Newton equation, which has a self correcting property such that its numerical behavior is improved. However, global convergence was proved in [7] only for uniformly convex optimization problems.
To make the BFGS methods applicable to large scale optimization models, many memoryless BFGS updating formulas have been proposed. For more details, one can see [3], [25]- [32] and the references therein. For example, Livieris et al. [25] presented a new hybrid conjugate gradient method based on convex hybridization of the conjugate parameters of DY and HS+ by adapting the quasi-Newton philosophy. The computation of the hybrization parameter is obtained by minimizing the distance between the hybrid conjugate gradient direction and the self-scaling memoryless BFGS direction. Babaie-Kafaki and Ghanbari [26] proposed a nonlinear conjugate gradient method by minimizing the distance between the search direction matrix of the Dai-Liao method and the scaled memoryless BFGS update matrix in the Frobenius norm. Andrei [27] presented an accelerated scaled memoryless BFGS preconditioned conjugate gradient algorithm for solving unconstrained optimization problems by using the Powell's nonnegative restriction of the conjugate gradient. The basic idea was combination of the scaled memoryless BFGS method with the preconditioning technique in the frame of the conjugate gradient method. Andrei [28] and Yao and Ning [29] proposed two conjugate gradient methods, where the search direction was given by a symmetrical Perry matrix and was associated with a positive parameter determined by minimizing the distance of this matrix and the self-scaling memoryless BFGS matrix in the Frobenius norm. Apostolopoulou et al. [30] presented a curvilinear algorithmic model for training neural networks which was based on modifications of the memoryless BFGS method by incorporating a curvilinear search. In summary, the above-mentioned algorithms have global convergence and satisfactory numerical efficiency, but stability of the algorithms cannot be guaranteed. In this paper, we will develop a more stable algorithm by clustering all the eigenvalues of the approximate Hessian matrices to obtain a better scaling parameter in the memoryless BFGS update formula.
It is noted that Babaie-Kafaki [3], [31] started with controlling the condition number of the BFGS update matrix to develop more efficient algorithms. Actually, it was shown that the algorithms in [3], [31] are more stable compared with the similar ones. Very recently, Babaie-Kafaki and Ghanbari [32] further suggested a linear combination of the search direction by the memoryless BFGS technique with that by the Hestenes-Stiefel method. As a result, a one-parameter extension of the Hestenes-Stiefel method was proposed in [32]. However, unlike the analysis of the condition number of matrices in [3], [31], [32], we will use a measure function in this paper to conduct clustering analysis of all the eigenvalues of the approximate Hessian matrix, not only the maximum and minimum eigenvalues.
In summary, as a new scaling and memoryless BFGS algorithm, our algorithm has advantages of lower storage, lower computational cost and more stable numerical performance. Especially, we will prove that the used scaling parameter in our method can minimize all the eigenvalues of search direction matrices. Then, such an advantage of this parameter will be incorporated in developing an efficient algorithm for solving large scale optimization problems. We will also prove that our algorithm is globally convergent and show by numerical tests that it is more efficient and stable than the similar ones available in the literature.

III. DEVELOPMENT OF ALGORITHM
In this section, our aim is to develop a single-parameter scaling memoryless BFGS algorithm to solve large scale optimization problems.
In order to analyze properties of the BFGS methods, Byrd and Nocedal [33] introduced a measure function: where A is a symmetric positive definite search direction matrix in the quasi-Newton method, tr denotes the trace of this matrix, and det represents its determinant. From the definition of ϕ, we know that for a given matrix A, ϕ(A) is involved with all the eigenvalues of A, not only the smallest or the largest ones. In this paper, we are concerned with how to choose the parameter γ k in (4) such that ϕ(B k+1 ) is minimized for any k. We first answer whether B k+1 in (4) is positive definite or not in the case that B k is positive definite. Proposition 1: Suppose that the stepsize α k is computed by the Wolfe line search (16). If B k is symmetric positive definite and γ k > 0, then B k+1 in (4) is symmetric positive definite.
Proof: Similar to the proof of Proposition 2.1 in [8]. By Proposition 1, in order to minimize all the eigenvalues of B k+1 , we can choose a parameter γ k > 0 such that We can prove the following result. Proposition 2: Denote Then, γ k in (8) solves Problem (7). Proof: Notice that and Consequently, The first-order optimality condition of Problem (7) yields Clearly, γ k > 0. The proof has been completed. Remark 1: Cheng and Li [7] also obtained the same scaling parameter as in (8) by minimizing s k − γ k y k 2 . Proposition 2 shows that such a γ k also minimizes ϕ(B k+1 ). Thus, its condition number is also minimized.
Clearly, a smaller condition number of search direction matrices can theoretically ensure stability of algorithms.
With the advantages of γ k stated in Remark 1, we modify the scaling BFGS update formula (4) as: Compared with (4), (13), (4) is more applicable to solve large scale optimization problems since it no longer needs to solve a large scale system of linear equations to compute a search direction.
Specifically, let γ k be defined by (8). We set H k = I in (5) and rewrite the inverse matrix of B k+1 as Consequently, at the iterate point x k+1 , we obtain a search direction: Since the Wolfe line search can ensure that H k+1 is positive definite, it is used to develop an algorithm together with the search direction being computed by (15). Specifically, at the k-th iteration, we choose a step size α k such that it satisfies the following conditions: Furthermore, if the second inequality in (16) is replaced by then the step size α k satisfies the strong Wolfe conditions [34].
With the above preparation, we are in a position to state the overall framework of our algorithm.
Step 1 (Termination). Test a criterion for stopping the iterations. If g k < ε, then the algorithm stops; Otherwise, go to Step 2.
Remark 2: Among the existing memoryless scaling methods, the scaling parameters, are obtained by the secant equation or by directly minimizing the condition number of the inverse Hessian matrix [3], [31], [35], [36]. In Algorithm 1, the scaling parameter is computed by (8), whose intrinsic features are given by Byrd and Nocedal's measure function. Since such a measure can achieve better eigenvalue clustering than the other methods, Algorithm 1 can be more stable than the similar ones.
Remark 3: Steps 3 and 4 are the same accelerating scheme as in [37].
In the end of this section, we further prove that the search direction sequence { d k+1 } generated by Algorithm 1 has the following properties.
Proposition 3: Suppose that the line search satisfies the Wolfe line search conditions (16). Then, d k+1 given by (15) is descent for any k.
Proof: Since the step length α k satisfies the Wolfe line search conditions, it follows that y T k s k > 0. From (15), we have Therefore, We have proved that d k+1 is a sufficiently descent direction.
Proof: Actually, where t k = y k 2 y T k s k > 0. We have proved the result.
The properties of Algorithm 1 in Propositions 3 and 4 are useful to its convergence analysis in next section.

IV. CONVERGENCE ANALYSIS
In this section, we will prove global convergence of Algorithm 1.
Since global convergence of the spectral scaling BFGS method was only proved for uniformly convex optimization problems in [7], we first simply prove that the scaling BFGS algorithm corresponding to the scaling parameter γ k in (8) is globally convergent for any general smooth non-convex objective function, rather than a uniformly convex one. For readability, we state this spectral scaling BFGS algorithm as follows.
Step 1 (Termination). Test a criterion for stopping the iterations. If g k < ε, then the algorithm stops. Otherwise, go to Step 2.
Step 2 (Line search). Compute a step size α k > 0 such that the Wolfe line search conditions (16) is satisfied.
As done in [7], we also need the following assumption. Under Assumption 1, it follows from the first Wolfe condition (16) that the sequences {f (x k )} is not increasing. Thus, exists. Before statement of convergence result, we first prove the following results.

Lemma 1: Suppose that the inverse approximate Hessian matrix is computed by (5). Then, the search direction d k+1 = −H k g k+1 in Step 5 of Algorithm 2 is descent.
Proof: Left-multiplying g T k+1 on both sides of d k+1 = −H k g k+1 yields g T k+1 d k+1 = −g T k+1 H k g k+1 . By Proposition 1, B k is positive definite. Since H k is the inverse of B k , H k is positive definite. It is seen that d k is a descent direction.
Lemma 2: Suppose that the scaled B k+1 is determined by (4), where γ k is computed by (8). Then, Proof: From (9), we have Since B k+1 is positive definite and tr(B k+1 ) > 0, we have Remark 5: Note that the inequality (19) reveals that the largest eigenvalue of B k+1 is strictly less than tr(B 0 )+(k +1). Therefore, the spectral scaling BFGS method with γ k in (8) Proof: Similar to the proof of Lemma 3.3 in [8].
85668 VOLUME 8, 2020 With the results in Lemmas 1, 2 and 3, we can prove global convergence of Algorithm 2.
Theorem 1: Let {x k } be any sequence generated by Algorithm 2. Under Assumption 1, it holds that lim inf k→∞ g k = 0.
Proof: Assume that for all k, g k > γ > 0. Note that f is bounded from below and B k s k = α k B k d k = −α k g k . Thus, α k = B k s k g k . From the first inequality in the Wolfe conditions (16), it follows that The following proof is similar to that of Theorem 3.1 in [38]. For completeness, we present this proof in detail.
By geometric arithmetic mean inequality, for any ζ > 0, there exists an integer k 0 > 0 such that for any positive integer q, it holds that Thus, where the last inequality follows from Lemma 2. As q → ∞, the last part of (23) converges to zero. Therefore, which contradicts the result in Lemma 3. The proof of global convergence for a general non-convex problem is completed. We now come back to establish the global convergence of Algorithm 1, the core algorithm developed in this paper.
We first make the following mild assumption. Assumption 2: In some neighborhood N of , f is continuously differentiable and its gradient is Lipschitz continuous, namely, there exists a constant L > 0 such that Under Assumptions 1 and 2, there exist constants B > 0 and ≥ 0 such that (16), then for all k > 0, the following inequality holds:

Lemma 4: Under Assumptions 1 and 2, if the line search satisfies the Wolfe conditions
Proof: From the Wolfe conditions (16), it follows that Since d k is a descent direction and σ < 1, the inequality (25) has been proved.

Lemma 5: Let {d k } be any sequence generated by Algorithm 1. Under Assumptions 1 and 2, it holds that
Proof: From the Wolfe condition (16) and Lemma 4, we get Therefore, from Assumption 2, we get the Zoutendijk condition (26) [39].
Proof: Similar to the proof of Lemma 3.1 in [40], we can prove Lemma 6. VOLUME 8, 2020 Proof: By Assumptions 1 and 2, we have Suppose that g k = 0 for all k ≥ 1. Otherwise, a stationary point is obtained. From Assumptions 1 and 2, it follows that Consequently, the condition (27) is true. By Lemma 6, we know that (29) is true and the global convergence is proved.

V. NUMERICAL TESTS AND DISCUSSION
In this section, we report the numerical performance of Algorithm 1 (SM-BFGS), in comparison with similar algorithms available in the literature. We test all these algorithms by using them to solve 750 large-scale test problems from [41], where the dimension of each benchmark problem changes from 1000 to 10000 with a step length 1000. To further validate global convergence of the algorithms, the same initial point, as given in the literature, is used. To show the advantages of our search directions in Algorithm 1, the step length in all the tested algorithms are determined by the Wolfe line search for a fair comparison, where we take ρ = 0.0001 and σ = 0.8. Each algorithm stops if the condition g k ≤ 10 −6 is satisfied or if the number of iterations exceeds 10 4 . It is noted that in Algorithm 1, a strategy of the line search acceleration is used to improve its numerical performance.
All the numerical results are presented by a frequently-used approach to comparison of algorithm's performance [42]. Specifically, let f A i and f B i be the optimal value found by Algorithms A and B for the i-th test problem (i = 1, 2, . . . , p), respectively. For the i-th test problem, we say the performance of Algorithm A is better than that of Algorithm B if: , and the number of iterations, or the number of functiongradient evaluations, or the CPU time of Algorithm A is less than that of Algorithm B. To intuitively display numerical performance of all the tested algorithms, we use the Dolan and Moré performance profile graphs to analyze their numerical results, including the number of iterations, the total number of evaluating the objective function and its gradient, and the consumed CPU time when each algorithm stops. Specifically, in these performance profile graphs (see, for example, Figures 1 and 2), the horizontal axis represents the performance analysis factor τ , which can reflect efficiency of each algorithm in solving all the 750 test problems. The vertical axis is the probability P of each algorithm (measured by the proportion of the test problems) that a certain performance ratio is within a factor τ ∈ R of the best possible ratio.
All codes of the computer procedures are written in Fortran 90, and are implemented in Windows system with 2.4 GHz CPU processor, 4 GB RAM memory. In the first round of tests, since Algorithm 1 can be regarded as an extended BFGS algorithm, we compare our algorithm with the standard BFGS, B-BFGS in [13] and Y-BFGS in [14] with the memoryless technique. For readability, we present the scaling parameters in [13], [14] as follows.
In all the four algorithms, the initial matrix H 0 = I for solving any test problem. In Figure 1, we show the efficiency comparison among SM-BFGS and the other three types of BFGS algorithms. From the numerical results in Figure 1, it is clear that: (1) In terms of the number of iteration shown in Figure 1(a), our algorithm SM-BFGS can solve about 71% of the test problems with the least number of iterations. In contrast, BFGS, B-BFGS and Y-BFGS only solve about 56%, 60% and 55% of these problems, respectively.
(2) In terms of the number of function-gradient evaluations shown in Figure 1(b), SM-BFGS performs the best in solving about 63% of the test problems, and for the other three algorithms, the proportions of the test problems with the least number of evaluating functions and gradients are 53%, 52% and 58%, respectively.
(3) With respect to the consumed CPU time shown in Figure 1(c), our algorithm spends the shortest CPU time in solving about 75% of the problems, while the other three algorithms have a shorter running time on about 64% of these problems.
In one word, our algorithm (SM-BFGS) can solve the test problems as many as possible within less number of iteration, less number of function-gradient evaluations and less CPU time. Therefore, it is concluded that SM-BFGS outperforms the similar three types of BFGS algorithms.
In the second round of tests, since Algorithm 1 can be regarded as a modified conjugate gradient algorithm, we compare Algorithm 1 (SM-BFGS) with the similar conjugate gradient algorithms available in the literature. Particularly, recall that in Proposition 4, we have proved that the directions in our algorithm also satisfy the approximate Dai-Liao conjugate condition (18). In the following, SM-BFGS is compared with other three Dai-Liao types of conjugate gradient algorithms, which were recently published in [26], [43], [44], respectively. For simplicity, we denote them DL1+ [43], DL3+ [44] and DL4+ [26], respectively. Efficiency comparison of the four algorithms is presented in Figure 2.
From the numerical results in Figure 2, it is easy to see that: (1) With regard to the number of iterations shown in Figure 2  of the test problems with the least number of iterations, while DL1+, DL3+ and DL4+ solve only about 18%, 20% and 10% of these problems, respectively.
(2) With regard to the number of function-gradient evaluation shown in Figure 2(b), the algorithms SM-BFGS, DL1+, DL3+, and DL4+ respectively solve about 37%, 47%, 21%, and 47% of the test problems with the least number of evaluating the objective function and its gradient.
(3) In terms of performance profile of the consumed CPU time shown in Figures 2(c), SM-BFGS solves about 83% of the test problems with the least CPU time, while DL1+, DL3+ and DL4+ can solve about 60%, 62% and 40% of these problems with the least CPU time, respectively.
In summary, Figures 2(a) and 2(c) indicate that Algorithm 1 (SM-BFGS) can solves the test problems as many as possible with less number of iterations and less CPU time than the other three algorithm.
In order to further justify the advantages of Algorithm 1, we make pairwise comparison between Algorithm 1 and any one of the other six algorithms. The pairwise comparison results are presented in Table 1, where we denote NI, NE and CT the number of iteration, the number of evaluating the objective function and its gradient and the consumed CPU time after termination, respectively. More intuitively, Figure 3 displays the differences of numerical performance in Table 1 between Algorithm 1 and another algorithm. In Figure 3, each figure contains three groups of bars, which represent the three types of numerical performances: the number of iterations, the number of function-gradient evaluations and the consumed CPU time. Each group consists of the three bars with different colors: the blue bar represents the number of the test problems solved by Algorithm 1 with better numerical performance, the red bar represents the number of these problems solved by another algorithm with better numerical performance, and the orange bar represents the number of the test problems when the numerical performance of the compared algorithms is the same. Figure 3 clearly demonstrates that in the pairwise comparison, Algorithm 1 (SM-BFGS) is better than any one of the other six algorithms. For example, between SM-BFGS and DL4+, it can be seen that: (1) In terms of the number of iterations, SM-BFGS achieves less number of iterations for the 617 problems, DL4+ is better than SM-BFGS for the 14 problems, and for the rest of 48 problems, SM-BFGS and DL4+ have the same number of iterations. (2) In terms of the number of evaluating the objective function and its gradient, SM-BFGS wins for the 456 problems, DL4+ performs better for the 195 problems, and for the rest of 28 problems, SM-BFGS and DL4+ have the same numerical performance.
(3) With regard to the consumed CPU time, SM-BFGS runs faster than DL4+ for the 436 problems, DL4+ is faster only for the 61 problems, and for the rest of 182 problems, they perform the same.
On the whole, the second round of tests also shows Algorithm 1 is more stable and more efficient in solving the large scale benchmark test problems.
In the last round of tests, we report the numerical results in Table 2 as all the seven algorithms are used to solve the benchmark test problems with dimension of over 10000 (DIM).
The underlined results in Table 2 are the best ones among the seven algorithms. From the viewpoint of less number of iterations, less number of evaluating the objective function and its gradient, or less consumed CPU time after termination, our algorithm (SM-BFGS) also outperforms all the other six ones as they are used to solve large scale optimization problems with dimension of over 10000.

VI. CONCLUSION AND FUTURE RESEARCH
In this paper, we have proposed a new scaling memoryless BFGS algorithm for solving large scale unconstrained optimization problems. We have proved that the used scaling parameter can minimize all the eigenvalues of search direction matrices and the corresponding search directions satisfy the approximate Dai-Liao conjugate condition. In addition, a strategy of the line search acceleration is employed to improve numerical performance of this algorithm. Under mild assumptions, we have proved that the developed algorithm is globally convergent.
By numerical tests, we have demonstrated that our algorithm outperforms the similar ones available in the literature for solving large scale optimization problems, either as an extension of the BFGS-type algorithms or as a modified conjugate gradient algorithm.
In future research, it is valuable to study new twoparameter scaling memoryless BFGS algorithms such that the used scaling parameters can minimize all the eigenvalues of search direction matrices.
It is also interesting to further validate the proposed method in this paper by applying it in solving more practical optimization models from the fields of medical, engineering and management. For example, as done in [23], the developed algorithm may be useful to deeply mine the transcriptomic profile of the sub-genomes in hybrid fish lineage. Since a nonlinear system of equations is closely related with an optimization model, it is significant to extend the developed algorithm into solving large scale nonlinear system of equations from engineering fields. Actually, it has been shown [20], [22] that recovering sparse signals and restoring blurred images can be formulated as a system of equations, and efficient optimization algorithms can be modified to solve these practical engineering problems.