A Batch Variable Learning Rate Gradient Descent Algorithm With the Smoothing L1/2 Regularization for Takagi-Sugeno Models

A batch variable learning rate gradient descent algorithm is proposed to efficiently train a neuro-fuzzy network of zero-order Takagi-Sugeno inference systems. To exploit the advantages of regularization, the smoothing $L_{1/2}$ regularization is utilized to find a more appropriately sparse network. By combining the second-order information of the smoothing error function, a variable learning rate is chosen along the steepest descent direction, which avoids a line search procedure and may reduce the computational cost. In order to appropriately adjust the Lipschitz constant of the smoothing error function in the learning rate, a new scheme is proposed by introducing a hyper-parameter. The article also applies a modified secant equation for estimating the Lipschitz constant, which greatly reduces the oscillating phenomenon and improves the robustness of the algorithm. Under appropriate assumptions, a convergence result for the proposed algorithm is also given. Simulation results for an identification problem and a classification problem show that the proposed algorithm has better numerical performance and promotes the sparsity capability of the network, compared with the common batch gradient descent algorithm and a variable learning rate gradient-based algorithm.


I. INTRODUCTION
Regularization is one of the important approaches to the overfitting problem. A regularization term is generally added to the empirical risk being minimized in order to limit the model capacity, so that the empirical risk is not minimized excessively. Regularization has been applied successfully in neural networks to induce sparsity of the weights and hidden layer units [1]-[5]. The $L_p$ regularization is commonly used, leading to the following minimization problem

$$\min_{w} \ \bar{E}(w) + \lambda \|w\|_p^p, \qquad (1)$$

where $\bar{E}(w)$ is an error function depending on the weights $w$ of the network, $\|w\|_p = \left(\sum_{i=1}^{n} |w_i|^p\right)^{1/p}$, and $\lambda$ is the regularization parameter, which is used to control the strength of regularization.
We know that better network sparsity often leads to better generalization performance. The $L_{1/2}$ regularization [6] is a typical regularization that is capable of obtaining even sparser solutions and achieves better generalization performance than the $L_2$ and $L_1$ regularizations. However, the $L_{1/2}$ regularization is nonconvex, nonsmooth, and non-Lipschitz, so gradient-based learning algorithms cannot be used directly to train neural networks. By introducing a smoothing technique, Wu et al. [1], [2] proposed standard gradient descent algorithms with the smoothing $L_{1/2}$ regularization for training BP feedforward neural networks, and showed that a better pruning of the network is obtained than with the $L_1$ and $L_2$ regularizations. In particular, a good review of the smoothing $L_{1/2}$ regularization in different types of networks, such as BP feedforward neural networks, high-order neural networks, parallel feedforward neural networks and Takagi-Sugeno (T-S) fuzzy systems, can be found in [4].

Neuro-fuzzy systems are important intelligent systems, which combine artificial neural networks and fuzzy systems, and have been applied successfully in various fields such as pattern recognition, image processing and feature extraction, soft sensors, control systems and system identification [7]-[10]. Many common batch gradient descent algorithms have been proposed to train neuro-fuzzy networks [11]-[16]. Training the network requires minimizing a highly nonlinear and ill-conditioned problem; the common batch gradient descent algorithm is simple, but it usually converges slowly and exhibits an oscillating phenomenon. These drawbacks have been discussed in [17]-[19].

To address these drawbacks, some gradient-based algorithms have been developed. One class uses smoothing techniques to overcome serious numerical oscillations. Recently, by using the smoothing $L_{1/2}$ regularization, a constant learning rate gradient descent smoothing algorithm [3] was proposed to reduce the adverse effects of the high nonlinearity of the error function for T-S fuzzy models, in which the learning rate is usually chosen to be a small value. Another class of typical algorithms uses the second-order information of the error function to overcome the adverse effects of high nonlinearity and ill-conditioning. By employing a modified secant equation to generate the search direction, which approximates the second-order curvature of the error function with high accuracy, efficient conjugate gradient algorithms have been proposed to train BP networks [5], [20]. In the optimization community, variants of the common gradient descent algorithm [21], [22] compute the learning rate without a line search by estimating part of the second-order Hessian information of the error function, but figuring out how to ingeniously adjust the estimate of the Lipschitz constant or to choose suitable positive definite matrices is not easy. Therefore, for neuro-fuzzy systems, in order to reduce the adverse effects of the high nonlinearity and ill-conditioning of the error function and to improve the numerical performance of the algorithm and the generalization capacity of the network, we use the second-order information from the iterations to define a learning rate without a line search, and further propose a batch gradient-based algorithm with the smoothing $L_{1/2}$ regularization.
The advantages of the proposed algorithm and the main contributions of the paper are as follows: (1) We propose a new scheme for adjusting the Lipschitz constant used to update the learning rate by introducing a hyper-parameter. In addition, the modified secant equation [25], which guarantees the hereditary positive definiteness of the approximation of the Hessian matrix, is used for estimating the Lipschitz constant. The algorithm with the modified secant equation makes better use of second-order information than algorithms with the common secant equation, which greatly reduces the oscillating phenomenon and enhances the robustness.
(2) Compared with the common batch gradient descent neuro-fuzzy learning algorithm [3], the proposed algorithm can obtain lower errors, effectively reduce the oscillation phenomenon and find an appropriately sparse network. Thus, by appropriately applying the second-order information and the smoothing $L_{1/2}$ regularization, the proposed algorithm shows better sparsity and generalization capacity. The numerical simulations demonstrate the performance of the algorithm. Moreover, under suitable assumptions, a global convergence result is given.
The rest of the paper is organized as follows. In Section II, we describe the zero-order T-S fuzzy system. In Section III, we propose a variable learning rate gradient algorithm with the smoothing $L_{1/2}$ regularization. In Section IV, a convergence result is given. In Section V, simulation results are presented. Some conclusions are summarized in Section VI.

II. ZERO-ORDER TAKAGI-SUGENO INFERENCE SYSTEM
In this section, we use a zero-order T-S inference system based on a fuzzy neural network. The topological structure of the zero-order T-S neuro-fuzzy network is shown in Figure 1. We consider a four-layer network, which includes the input layer, the linguistic variable layer, the fuzzy rule layer and the output layer with a single output node. The case with more output nodes can be handled in a similar way and is not discussed here.
The zero-order T-S inference system [23], [24] is a common fuzzy system. Its fuzzy rule base is composed of a series of IF-THEN fuzzy rules. The $i$-th rule, denoted as $R_i$, $i = 1, 2, \cdots, n$, is represented in the following form:

$$R_i: \text{IF } x_1 \text{ is } A_{1i} \text{ and } x_2 \text{ is } A_{2i} \text{ and } \cdots \text{ and } x_m \text{ is } A_{mi}, \text{ THEN } z \text{ is } y_i, \qquad (2)$$

where $x_1, x_2, \cdots, x_m$ are the inputs to the system, $A_{li}$ denotes a fuzzy set of $x_l$, $l = 1, 2, \cdots, m$, $z$ is the system output variable, and the value $y_i$ is a real number. Here we select the bell function with Gaussian distribution as the membership function to compute $A_{li}(x_l)$, $i = 1, 2, \cdots, n$; $l = 1, 2, \ldots, m$, which is defined as follows:

$$A_{li}(x_l) = \exp\left(-\frac{(x_l - a_{li})^2}{\sigma_{li}^2}\right), \qquad (3)$$

where $a_{li}$ and $\sigma_{li}$ represent the center and width of $A_{li}(x_l)$. As considered in [15], the reciprocals of the widths, denoted as $b_{li} = 1/\sigma_{li}$, are also used here in order to avoid numerical difficulty; the corresponding formula then becomes

$$A_{li}(x_l) = \exp\left(-b_{li}^2 (x_l - a_{li})^2\right).$$

Next, combining the IF-THEN fuzzy rules (2) and the Gaussian functions (3), we clarify the meaning of each layer in Figure 1 as follows:

Layer 1: Each node in this layer represents one input variable of an observation datum.

Layer 2: In this layer, each node represents the membership function of a linguistic variable, and the Gaussian membership function (3) is used to calculate the value of the membership function of each input component belonging to the fuzzy set of the linguistic variable. The connecting weights between Layer 1 and Layer 2 are regarded as the centers and the widths of the Gaussian membership functions, respectively.
Layer 3: In this layer of the network, each node represents a fuzzy rule, and its main function is to match the antecedent part of the fuzzy rule. The agreement of the $i$-th antecedent part is usually computed by using the product T-norm, as follows:

$$h_i = \prod_{l=1}^{m} A_{li}(x_l), \qquad (4)$$

where $i = 1, 2, \cdots, n$. The connecting weights between Layer 2 and Layer 3 are all set to 1.
Layer 4: The output layer has a single output node, which forms a linear combination of the results from Layer 3, and the output $z$ is computed as follows:

$$z = \sum_{i=1}^{n} y_i h_i, \qquad (5)$$

where $y_i$, $i = 1, 2, \cdots, n$, are the connecting weights between Layer 3 and Layer 4.
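To make the forward computation concrete, the following is a minimal sketch in Python/NumPy (a language not used in the paper's experiments) of the four-layer inference defined by (3)-(5); the array names `a0`, `A` and `B` for the consequent weights, centers and reciprocal widths are illustrative choices, not notation from the paper.

```python
import numpy as np

def ts_forward(x, a0, A, B):
    """Zero-order T-S forward pass through Layers 1-4 (sketch).

    x  : (m,)   input vector (Layer 1)
    a0 : (n,)   consequent weights y_i connecting Layer 3 and Layer 4
    A  : (m, n) centers a_li of the Gaussian memberships
    B  : (m, n) reciprocal widths b_li = 1/sigma_li
    Returns the rule firing strengths h (Layer 3) and the output z (Layer 4).
    """
    d = x[:, None] - A                          # x_l - a_li for every pair (l, i)
    h = np.exp(-np.sum((B * d) ** 2, axis=0))   # product of Gaussian memberships, eq. (4)
    z = a0 @ h                                  # linear combination, eq. (5)
    return h, z
```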

III. BATCH VARIABLE LEARNING RATE GRADIENT NEURO-FUZZY LEARNING ALGORITHM WITH SMOOTHING L1/2 REGULARIZATION
In order to train the neuro-fuzzy network, one usually needs to solve an unconstrained optimization problem on the error function. The common batch gradient descent method with a small constant learning rate is usually used to solve this problem; however, a larger constant learning rate does not work well. Therefore, we propose an adaptive learning rate related to the second-order information of the error function, without a line search, to improve the learning efficiency.
In addition, the smoothing $L_{1/2}$ regularization term is added to the empirical risk being minimized to promote the generalization performance. Next, we propose the variable learning rate gradient algorithm. For the sake of convenience, the Hadamard product, which is the componentwise product of matrices, is used in the following and denoted by ''·''.
Suppose that the number of input nodes is $m$. The set of training patterns is $\{x^j, O^j\}_{j=1}^{J} \subset R^m \times R$, where $O^j$ is the desired output for the $j$-th training pattern $x^j$ and $J$ is the number of training patterns. The fuzzy rule base is provided by (2). Let $w = (a_0^T, a_1^T, \cdots, a_n^T, b_1^T, \cdots, b_n^T)^T \in R^{n(1+2m)}$ be the vector of all weights relevant to the network except the constant weights, where $a_0 = (y_1, y_2, \cdots, y_n)^T \in R^n$ is the weight vector connecting Layer 3 and Layer 4, and the vectors $a_i = (a_{1i}, a_{2i}, \cdots, a_{mi})^T \in R^m$ and $b_i = (b_{1i}, b_{2i}, \cdots, b_{mi})^T \in R^m$, $i = 1, 2, \cdots, n$, are the centers and the reciprocals of the widths of the corresponding Gaussian membership functions, respectively.
For simplicity, the following vector-valued function is introduced:

$$H(x) = \left(h_1(x), h_2(x), \cdots, h_n(x)\right)^T \in R^n,$$

where $h_i(x)$ is the firing strength (4) of the $i$-th rule for an input $x$. Then the error function $\bar{E}(w)$ in (1) is given by

$$\bar{E}(w) = \frac{1}{2} \sum_{j=1}^{J} \left(z^j - O^j\right)^2,$$

where $z^j = a_0^T H(x^j)$ is the fuzzy reasoning result for the $j$-th training pattern $x^j$. The gradients of the error function $\bar{E}(w)$ with respect to $a_0$, $a_i$, $b_i$, $i = 1, 2, \cdots, n$, are given respectively by

$$\bar{E}_{a_0}(w) = \sum_{j=1}^{J} \left(z^j - O^j\right) H(x^j),$$
$$\bar{E}_{a_i}(w) = 2 \sum_{j=1}^{J} \left(z^j - O^j\right) y_i h_i(x^j)\, b_i \cdot b_i \cdot \left(x^j - a_i\right),$$
$$\bar{E}_{b_i}(w) = -2 \sum_{j=1}^{J} \left(z^j - O^j\right) y_i h_i(x^j)\, b_i \cdot \left(x^j - a_i\right) \cdot \left(x^j - a_i\right).$$

Therefore, the gradient of the error function $\bar{E}(w)$ with respect to $w$ is given by $\nabla \bar{E}(w) = \left(\bar{E}_{a_0}(w)^T, \bar{E}_{a_1}(w)^T, \cdots, \bar{E}_{a_n}(w)^T, \bar{E}_{b_1}(w)^T, \cdots, \bar{E}_{b_n}(w)^T\right)^T$.
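As a sanity check on these formulas, the batch error and its gradient can be accumulated pattern by pattern. The sketch below continues the earlier forward-pass sketch (it reuses the illustrative `ts_forward` helper) and follows the expressions just given; it is an illustration, not the implementation used in the paper.

```python
import numpy as np

def batch_error_and_gradient(X, O, a0, A, B):
    """Batch error Ē(w) and its gradient for the zero-order T-S network (sketch).

    X : (J, m) training inputs, O : (J,) desired outputs.
    Returns Ē and the gradients with respect to a0, A (centers) and B (reciprocal widths).
    """
    E = 0.0
    g_a0, g_A, g_B = np.zeros_like(a0), np.zeros_like(A), np.zeros_like(B)
    for j in range(X.shape[0]):
        h, z = ts_forward(X[j], a0, A, B)
        e = z - O[j]                              # z^j - O^j
        E += 0.5 * e ** 2
        g_a0 += e * h                             # gradient w.r.t. the consequent weights
        d = X[j][:, None] - A                     # x^j - a_i, componentwise
        g_A += e * a0 * h * 2.0 * B ** 2 * d      # gradient w.r.t. the centers
        g_B += e * a0 * h * (-2.0) * B * d ** 2   # gradient w.r.t. the reciprocal widths
    return E, g_a0, g_A, g_B
```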
Consider a polynomial smoothing function of $|x|$; the approximating function is given by

$$f(x) = \begin{cases} |x|, & |x| \ge \nu, \\ -\dfrac{x^4}{8\nu^3} + \dfrac{3x^2}{4\nu} + \dfrac{3\nu}{8}, & |x| < \nu, \end{cases} \qquad (9)$$

where $\nu$ is a small positive constant. The smoothing function has the following properties:

$$f \in C^2(R), \quad |x| \le f(x) \le |x| + \frac{3\nu}{8}, \quad f'(x) \in [-1, 1]. \qquad (10)$$

Using the above smoothing approximation (9), we can obtain the smoothing approximation of the $L_{1/2}$ regularization $\|w\|_{1/2}^{1/2}$, so that the original error function in (1) is approximated by the following function:

$$E(w) = \bar{E}(w) + \lambda \sum_{i=1}^{n(1+2m)} \sqrt{f(w_i)}, \qquad (11)$$

where $\lambda$ is a regularization parameter. Correspondingly, the gradient of the smoothing approximating function $E(w)$ with respect to $w$ is

$$\nabla E(w) = \nabla \bar{E}(w) + \lambda \left(\frac{f'(w_1)}{2\sqrt{f(w_1)}}, \frac{f'(w_2)}{2\sqrt{f(w_2)}}, \cdots, \frac{f'(w_{n(1+2m)})}{2\sqrt{f(w_{n(1+2m)})}}\right)^T. \qquad (12)$$

From [22], we know that figuring out how to adjust the estimate of the Lipschitz constant of $\nabla E(w)$ is not easy. Next, a hyper-parameter $c$ is introduced to appropriately adjust the estimate of the Lipschitz constant in the proposed algorithm, and we first propose the variable learning rate gradient algorithm with the smoothing $L_{1/2}$ regularization, denoted as VLRGSL$_{1/2}$. For convenience, we denote $E_k = E(w^k)$ and $\nabla E_k = \nabla E(w^k)$.

Step 1. Randomly choose the initial weights $w^0 \in R^{n(1+2m)}$, and set the termination criterion Tol > 0, the maximum number of epochs $K_{\max}$, the constants $\eta > 0$, $L_0 > 0$, $c \in (0, 1)$, and the regularization parameter $\lambda$. Set $k = 0$.

Step 2. Compute the error function value $E_k$ in (11) and its gradient $\nabla E_k$ in (12).
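Step 2 requires evaluating the smoothed objective (11) and its gradient (12). The following minimal sketch shows the smoothing function (9), its derivative, and the resulting regularizer term, assuming the piecewise-polynomial form written above; it is illustrative only.

```python
import numpy as np

def smooth_abs(x, nu):
    """Polynomial smoothing f(x) of |x| as in (9): exact outside (-nu, nu), C^2 inside."""
    x = np.asarray(x, dtype=float)
    inner = -x**4 / (8 * nu**3) + 3 * x**2 / (4 * nu) + 3 * nu / 8
    return np.where(np.abs(x) >= nu, np.abs(x), inner)

def smooth_abs_grad(x, nu):
    """Derivative f'(x) of the smoothing function (9)."""
    x = np.asarray(x, dtype=float)
    inner = -x**3 / (2 * nu**3) + 3 * x / (2 * nu)
    return np.where(np.abs(x) >= nu, np.sign(x), inner)

def smoothed_l12(w, nu):
    """Smoothed L_{1/2} term sum_i sqrt(f(w_i)) and its gradient, as used in (11)-(12).

    Since f(w_i) >= 3*nu/8 > 0, the square root and the division are always well defined."""
    f = smooth_abs(w, nu)
    return np.sum(np.sqrt(f)), smooth_abs_grad(w, nu) / (2.0 * np.sqrt(f))
```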
In order to adjust $L_k$, we introduce the hyper-parameter $c$ in (14). It can be observed that $L_k$ is not $c$ times $L_k^o$, where $L_k^o$ denotes the estimate of the Lipschitz constant in [22]. For example, when $k = 2$, it is easy to see that $L_2$ is not equal to $cL_2^o$ for $c \in (0, 1)$, and equality holds only when $c = 1$. The new adjustment scheme has been shown to be effective in the numerical experiments.
Using the second-order information, based on modified secant equations, the conjugate gradient algorithms in [5], [20] have been confirmed to improve the numerical performance of neural network training. Following the idea of the modified Newton method, a modified BFGS method [25] was proposed with a modified secant equation, which is given as follows:

$$B_k s_{k-1} = \tilde{y}_{k-1}, \quad \tilde{y}_{k-1} = y_{k-1} + \left(r\|\nabla E_{k-1}\|^{\mu} + \max\left\{0, -\frac{s_{k-1}^T y_{k-1}}{\|s_{k-1}\|^2}\right\}\right) s_{k-1}, \qquad (15)$$

where $s_{k-1} = w^k - w^{k-1}$, $y_{k-1} = \nabla E_k - \nabla E_{k-1}$, the matrix $B_k$ is an approximation of the Hessian of the smoothing error function $E(w)$ at $w^k$, and $r > 0$ and $\mu \ge 0$ are constants. It is easy to see that

$$s_{k-1}^T \tilde{y}_{k-1} \ge r\|\nabla E_{k-1}\|^{\mu} \|s_{k-1}\|^2 > 0 \qquad (16)$$

whenever $\nabla E_{k-1} \ne 0$, which means that the hereditary positive definiteness of $B_k$ is always guaranteed. Thus, in order to make better use of the second-order information, combining the modified secant equation (MSE) (15), $L_k$ in (14) is modified by using the modified secant vector $\tilde{y}_{k-1}$ in place of the gradient difference $y_{k-1}$, yielding the estimate (17); the corresponding algorithm is denoted as MSE+VLRGSL$_{1/2}$.
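To illustrate how such a variable learning rate can be organized in code, the following sketch uses the classical local estimate $L_k \approx \|\tilde{y}_{k-1}\| / \|s_{k-1}\|$ built from the modified secant vector (15) and a step of the form $\eta / L_k$ along the negative gradient. The hyper-parameter $c$ and the precise adjustment scheme (14) of the paper are not reproduced here, so the loop below is only an assumed simplification, not Algorithm 1 itself.

```python
import numpy as np

def train_mse_vlrg(w0, error_and_grad, eta, L0=1.0, r=50.0, mu=0.5,
                   tol=1e-6, k_max=10000):
    """Hedged sketch of a variable-learning-rate gradient loop with an MSE-based
    Lipschitz estimate. error_and_grad(w) must return (E(w), grad E(w)) for the
    smoothed error (11)-(12). The paper's c-based adjustment (14) is omitted."""
    w = np.asarray(w0, dtype=float).copy()
    E, g = error_and_grad(w)
    L = L0
    for _ in range(k_max):
        if np.linalg.norm(g) <= tol:
            break
        w_new = w - (eta / L) * g                         # gradient step, variable learning rate
        E_new, g_new = error_and_grad(w_new)
        s = w_new - w                                     # s_{k-1} = w^k - w^{k-1}
        y = g_new - g                                     # y_{k-1} = grad E_k - grad E_{k-1}
        # modified secant vector (15): ensures s^T y_tilde > 0
        t = r * np.linalg.norm(g) ** mu + max(0.0, -float(s @ y) / float(s @ s))
        y_tilde = y + t * s
        L = np.linalg.norm(y_tilde) / np.linalg.norm(s)   # MSE-based Lipschitz estimate
        w, E, g = w_new, E_new, g_new
    return w
```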

IV. CONVERGENCE RESULTS
In this section, we present a convergence result for Algorithm 1. The result follows easily from Theorem 3.1 in [22], and we omit its proof here. Some assumptions are needed for the convergence result; in particular, the gradient of the smoothing error function is assumed to be Lipschitz continuous on an open set $N$ containing the iterates, i.e., there exists a constant $L > 0$ such that

$$\|\nabla E(w_1) - \nabla E(w_2)\| \le L \|w_1 - w_2\| \qquad (18)$$

for any $w_1, w_2 \in N$.

V. SIMULATION RESULTS
In this section, in order to evaluate the performance of Algorithm 1, which includes VLRGSL$_{1/2}$ and MSE+VLRGSL$_{1/2}$, it is compared with the gradient-based algorithm with the smoothing $L_{1/2}$ regularization [3], denoted as GSL$_{1/2}$, and with Algorithm A3, which has the best numerical performance in [22], on the same examples: identification of the function $y = \sin(\pi x)$, and the Sonar classification problem from the UCI repository [26]. All simulations are performed with Matlab 2018b on a computer (Intel(R) Core(TM) i5-8500 CPU @ 3.00 GHz, 8.00 GB RAM). For all algorithms, the initial fuzzy parameters are chosen stochastically in [0, 1], the maximum number of epochs $K_{\max}$ is 10,000, multiple values of the learning rate $\eta$ and the penalty parameter $\lambda$ are tested, 10 trials are carried out, and 10 fuzzy rules are used. In order to estimate $L_k$, we set $L_0 = 1$ and $c = 0.1$ in (14), and $r = 50$ and $\mu = 0.5$ in (17).
In the following tables, ACAN (defined in [6]) denotes the average number of coefficients identified as zero or below a certain threshold, where the threshold is set to 0.0001, Std denotes the standard deviation of the errors over multiple runs, and the best performance of the algorithms is highlighted in boldface, except for the average time.
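For reference, counting the coefficients at or below this threshold for a single run can be done as in the short snippet below (an illustrative helper, not part of the compared algorithms).

```python
import numpy as np

def acan_count(w, threshold=1e-4):
    """Number of coefficients identified as zero or below the threshold (one run)."""
    return int(np.sum(np.abs(w) < threshold))
```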

A. IDENTIFICATION PROBLEM
Identification of the nonlinear function $y = \sin(\pi x)$, $x \in [-1, 1]$. In this example, training patterns are evenly selected from the interval $x \in [-1, 1]$ with 0.02 as the discretization size. Test patterns are similarly selected with $\frac{1}{15}$ as the discretization size.
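A minimal sketch of how the training and test patterns described here can be generated (the variable names are illustrative):

```python
import numpy as np

# Training and test patterns for y = sin(pi * x) on [-1, 1] (sketch).
x_train = np.arange(-1.0, 1.0 + 1e-12, 0.02)        # discretization size 0.02
y_train = np.sin(np.pi * x_train)
x_test = np.arange(-1.0, 1.0 + 1e-12, 1.0 / 15.0)   # discretization size 1/15
y_test = np.sin(np.pi * x_test)
```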
The numerical results are presented in Table 1. It shows, from the 4th-6th columns, that the overall performance of VLRGSL$_{1/2}$ and MSE+VLRGSL$_{1/2}$ is superior to that of GSL$_{1/2}$ and Algorithm A3, and that MSE+VLRGSL$_{1/2}$ usually obtains the lowest error and has the most robust performance. GSL$_{1/2}$ usually costs the least time due to its constant learning rate. Algorithm A3 with the smaller constant ($\eta = 0.001$) appears to have poor performance, as is also shown in Figure 3.
In Figure 2, when $\lambda$ is somewhat large, the curve of the norm of the gradient for GSL$_{1/2}$ shows serious oscillation, and the curve for VLRGSL$_{1/2}$ also oscillates. However, by using the modified secant equation (15) in (17), MSE+VLRGSL$_{1/2}$ greatly reduces the oscillation compared with VLRGSL$_{1/2}$.

B. THE SONAR CLASSIFICATION PROBLEM
The Sonar benchmark problem is a well-known binary classification problem. It requires classifying reflected sonar signals into two categories (metal cylinders and rocks). The data set is composed of 208 input vectors, and each vector has 60 components. We stochastically select 128 input vectors from the data set as training patterns, and the initial connecting weights are chosen in [0, 1]. The correct identification of training patterns is determined according to Fahlman's ''40-20-40'' criterion [27].
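A common reading of Fahlman's ''40-20-40'' criterion is sketched below under the assumption of 0/1 class targets: an output in the lowest 40% of the output range counts as class 0, one in the highest 40% as class 1, and the middle 20% band is counted as incorrect.

```python
def correct_40_20_40(output, target):
    """Fahlman's '40-20-40' criterion (sketch, assuming targets in {0, 1})."""
    if output < 0.4:
        predicted = 0
    elif output > 0.6:
        predicted = 1
    else:
        return False        # ambiguous middle band: counted as a misclassification
    return predicted == target
```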
In Table 2, VLRGSL$_{1/2}$ and MSE+VLRGSL$_{1/2}$ obtain the lower error rates on training and test patterns and show robust performance. Regarding the values of ACAN, although GSL$_{1/2}$ obtains the sparsest architecture of the network in several cases, it has the worst performance in the other aspects, except for time, from the 4th-8th columns. In Figures 4 and 6, VLRGSL$_{1/2}$ and MSE+VLRGSL$_{1/2}$ are superior to GSL$_{1/2}$ and Algorithm A3, and MSE+VLRGSL$_{1/2}$ also greatly reduces the oscillation. VLRGSL$_{1/2}$ and MSE+VLRGSL$_{1/2}$ show better generalization capacity in Figures 5 and 7, with MSE+VLRGSL$_{1/2}$ being slightly better.

VI. CONCLUSION
A variant of the common batch gradient descent algorithm with the smoothing $L_{1/2}$ regularization is proposed to train the zero-order T-S inference network. Utilizing a new adjustment scheme for the estimate of the Lipschitz constant together with the modified secant equation, the proposed algorithm makes better use of the second-order information to reduce the adverse effects of the high nonlinearity and ill-conditioning of the error function. Simulation results show that the proposed algorithm has better numerical performance, sparsity and generalization capacity. Moreover, under suitable assumptions, the convergence result is proved. We will also investigate the corresponding stochastic algorithm for large-scale fuzzy systems in future research.