Adaptive Learning-Rate Backpropagation Neural Network Algorithm Based on the Minimization of Mean-Square Deviation for Impulsive Noises

This paper presents a novel adaptive learning-rate backpropagation neural network (ALR-BPNN) algorithm based on the minimization of the mean-square deviation (MSD) to achieve a fast convergence rate and robustness to impulsive noises. The learning rates of the weights in each hidden layer are derived to minimize the upper bound of the MSD obtained by the analysis, which guarantees a fast convergence rate within a stable range. Moreover, by estimating the variance of the measurement-noise-like perturbation in each layer through the variance of the error signals, the proposed scheme provides robustness to impulsive noises. The performance of the proposed algorithm is evaluated on various sequential signals and industrial data including impulsive noise and compared with conventional ALR-BPNN algorithms. Simulation results indicate that the proposed algorithm outperforms the existing algorithms.


I. INTRODUCTION
Neural networks are widely used to train various models, and the field of neural network training is inundated with algorithms that focus on the application and real-time implementation of various problems [1], [2]. This paper is also concerned with the training algorithm of a multilayered feedforward neural network. A low steady-state error and a fast convergence rate are important issues for a training algorithm [3], [4]. The backpropagation (BP) algorithm is a representative training algorithm [5]-[7]. However, the main drawback of the BP algorithm is its slow convergence rate. To achieve a faster convergence rate, some heuristic techniques, such as novel BP algorithms with a momentum term and numerical optimization algorithms with the quasi-Newton method, were introduced [8], [9]. The limitation of the quasi-Newton methods is that memory proportional to the square of the network size must be secured. The conjugate-gradient and Newton methods have been proposed to achieve a fast convergence rate, but these algorithms require too much computational complexity [10]-[12]. Other algorithms such as recursive least-squares [13], Levenberg-Marquardt [14], [15], and extended Kalman filtering [16] have also been developed to achieve fast convergence, but these improvements deteriorate the simplicity and ease of implementation of the BP algorithm.
Another issue is robustness to impulsive noises [17], [18]. Specifically, impulsive noises generated in the real world, such as the sound of a door suddenly shutting, are a major challenge to be solved in training algorithms. To overcome this challenge, some robust adaptive algorithms have been developed, such as adaptive learning-rate saturation and sign algorithms [19], [20], but these algorithms cannot effectively train nonlinear models. To suppress the impulsive noises in nonlinear models, various detection techniques for impulsive noises using neural networks and other nonlinear filters have been introduced [21]-[23]. These algorithms detect impulsive noises in nonlinear models well, but applying these detection techniques to training algorithms is another complex process. To mitigate this drawback, functional link neural network (FLNN) architectures have been proposed [24]-[26]. Specifically, a neural network based on sparse representations of functional links has been proposed to be robust in impulsive noise environments [27]. This algorithm not only solved the problems of basic FLNN architectures, such as a slow convergence rate and high computational complexity, but also improved the robustness to impulsive noises. However, as the algorithm contains trigonometric functions and exponential calculations, implementing it on a DSP board or memory chip is not easy in real environments.
In order to address the aforementioned issues of fast convergence rate and robustness to impulsive noises, this paper proposes a novel adaptive learning-rate backpropagation neural network (ALR-BPNN) algorithm based on the minimization of the mean-square deviation (MSD). Although the advantages of training algorithms based on the minimization of the MSD, such as a fast convergence rate and a low steady-state error, are well known in adaptive filtering problems [28], [29], this approach has not been used to train neural networks because the exact value of the MSD of the weights in each hidden layer cannot be known. To bring these advantages into the neural network, the upper bound of the MSD of the weights in each hidden layer is analyzed, and the learning rates of the weights are set to minimize this upper bound, ensuring a fast convergence rate within a stable range. In addition, the proposed algorithm provides robustness to impulsive noises by estimating the variance of the measurement-noise-like perturbation in each hidden layer through the variance of the error signals.
The rest of this paper is organized as follows. In Section II, the basic BP algorithm and notations are presented. Details of the proposed algorithm are described in Section III. Simulation results are discussed in Section IV, and the conclusion is provided in Section V.

II. PRELIMINARY
Consider feedforward neural networks (FNNs) with $L_n$ neurons in the $n$th layer, for $n = 1, 2, \ldots, N$. The neural network is represented by the following equations:

$$u_j^{N-1}(k) = f\left(\sum_{i=1}^{L_{N-2}} w_{i,j}^{N-2,N-1}(k)\, u_i^{N-2}(k)\right),$$

where $w_{i,j}^{N-2,N-1}(k)$ represents the weight from the $i$th neuron at the $(N-2)$th layer to the $j$th neuron at the $(N-1)$th layer, and $u_j^{N-1}(k)$ represents the output of the $j$th neuron that belongs to the $(N-1)$th layer. $f(\cdot)$ is a rectified linear unit (ReLU) activation function, defined as

$$f(x) = \max(0, x).$$

The output of the neural network is expressed as

$$y(k) = \sum_{j=1}^{L_{N-1}} w_{j,1}^{N-1,N}(k)\, u_j^{N-1}(k).$$

To derive the backpropagation method, the cost function $J(k)$ is defined as

$$J(k) = \frac{1}{2} e^2(k), \qquad e(k) = d(k) - y(k),$$

where $d(k)$ and $e(k)$ are the target and error signals, respectively. Using the gradient-descent method, the update equation of the weight $w_{i,j}^{n-1,n}(k)$ is expressed as

$$w_{i,j}^{n-1,n}(k+1) = w_{i,j}^{n-1,n}(k) - \mu \frac{\partial J(k)}{\partial w_{i,j}^{n-1,n}(k)},$$

where $\mu$ represents the learning rate. The backpropagation method is summarized in the following equations [6]:

$$w_{i,j}^{n-1,n}(k+1) = w_{i,j}^{n-1,n}(k) + \mu\, \delta_j^n(k)\, u_i^{n-1}(k),$$

where $\delta_j^N(k) = e(k)$ at the output layer and, for $n = N-1, \ldots, 2$,

$$\delta_j^n(k) = f'\big(z_j^n(k)\big) \sum_{m=1}^{L_{n+1}} w_{j,m}^{n,n+1}(k)\, \delta_m^{n+1}(k),$$

with $z_j^n(k)$ the pre-activation of the $j$th neuron at the $n$th layer. The update equation of the weight vector $\mathbf{w}_j^{n-1,n}(k)$ from all neurons at the $(n-1)$th layer to the $j$th neuron at the $n$th layer is derived as

$$\mathbf{w}_j^{n-1,n}(k+1) = \mathbf{w}_j^{n-1,n}(k) + \mu\, \delta_j^n(k)\, \mathbf{u}^{n-1}(k), \quad (11)$$

where $\mathbf{w}_j^{n-1,n}(k) = \big[w_{1,j}^{n-1,n}(k), \ldots, w_{L_{n-1},j}^{n-1,n}(k)\big]^T$ and $\mathbf{u}^{n-1}(k) = \big[u_1^{n-1}(k), \ldots, u_{L_{n-1}}^{n-1}(k)\big]^T$.
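For concreteness, the following is a minimal NumPy sketch of the forward pass and the BP update of Eq. (11) for a three-layer network with a ReLU hidden layer and a linear output; the dimensions, initialization, and function names are illustrative rather than taken from the paper.

```python
import numpy as np

def relu(x):
    # ReLU activation f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for x > 0, 0 otherwise
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(20, 20))  # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(20, 1))   # hidden -> output weights
mu = 0.01                                  # fixed learning rate of basic BP

def bp_step(x, d):
    """One BP update of Eq. (11) for a single sample (x, d)."""
    global W1, W2
    z = W1.T @ x                  # pre-activations of the hidden layer
    u = relu(z)                   # hidden outputs u^2(k)
    y = float(W2.T @ u)           # network output y(k)
    e = d - y                     # error signal e(k) = d(k) - y(k)
    delta2 = e                                    # output-layer BP error
    delta1 = relu_grad(z) * (W2[:, 0] * delta2)   # hidden-layer BP error
    W2 += mu * delta2 * u[:, None]                # weight-vector update, Eq. (11)
    W1 += mu * np.outer(x, delta1)                # weight-vector update, Eq. (11)
    return e
```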

III. PROPOSED ALR-BPNN ALGORITHM

A. ADAPTIVE LEARNING-RATE EQUATION OF WEIGHT VECTOR AT HIDDEN LAYER
The update equations of the weights are modified into a normalized form to analyze the MSD of each weight vector. This normalized update equation has advantages in defining the stability range of the learning rate over the original update equation (11):

$$\mathbf{w}_j^{n-1,n}(k+1) = \mathbf{w}_j^{n-1,n}(k) + \mu_j^n(k)\, \frac{\delta_j^n(k)\, \mathbf{u}^{n-1}(k)}{\|\mathbf{u}^{n-1}(k)\|^2}, \quad (12)$$

where $\delta_j^n(k)$ represents the backpropagation error of the $j$th neuron at the $n$th hidden layer [6], and $\mu_j^n(k)$ is the adaptive learning rate of the corresponding weight vector.
To derive the MSD of the weight vector from all neurons at the $(n-1)$th layer to the $j$th neuron at the $n$th hidden layer, in this paper, $\delta_j^n(k)$ is defined in terms of the deviation of the weight vector as follows:

$$\delta_j^n(k) = \tilde{\mathbf{w}}_j^{n-1,n\,T}(k)\, \mathbf{u}^{n-1}(k) + r^n(k),$$

where $\tilde{\mathbf{w}}_j^{n-1,n}(k)$ denotes the deviation of $\mathbf{w}_j^{n-1,n}(k)$ from its optimal value, and $r^n(k)$, called the perturbation, is a kind of measurement noise at the hidden layer that is independent of $\mathbf{u}^{n-1}(k)$ and assumed to be stationary and zero-mean. Fig. 2 graphically shows that the backpropagation error $\delta_j^n(k)$ is defined in terms of $\tilde{\mathbf{w}}_j^{n-1,n}(k)$.
Using Eq. (12), the deviation of $\mathbf{w}_j^{n-1,n}(k)$ can be rewritten as

$$\tilde{\mathbf{w}}_j^{n-1,n}(k+1) = \tilde{\mathbf{w}}_j^{n-1,n}(k) - \mu_j^n(k)\, \frac{\delta_j^n(k)\, \mathbf{u}^{n-1}(k)}{\|\mathbf{u}^{n-1}(k)\|^2}. \quad (19)$$

The MSD is defined as $\mathrm{Tr}\{P_j^{n-1,n}(k)\}$ with $P_j^{n-1,n}(k) = E\{\tilde{\mathbf{w}}_j^{n-1,n}(k)\, \tilde{\mathbf{w}}_j^{n-1,n\,T}(k)\}$, where $E(\cdot)$ and $\mathrm{Tr}(\cdot)$ represent expectation and trace, respectively. Using Eq. (19) and the assumption that $\tilde{\mathbf{w}}_j^{n-1,n}(k)$ and $r^n(k)$ are uncorrelated, the recursive upper bound of $\mathrm{Tr}\{P_j^{n-1,n}(k)\}$ is derived as

$$\mathrm{Tr}\{P_j^{n-1,n}(k+1)\} \le \mathrm{Tr}\{P_j^{n-1,n}(k)\} - \frac{2\mu_j^n(k)}{\alpha^{n-1}}\, \mathrm{Tr}\{P_j^{n-1,n}(k)\} + \big(\mu_j^n(k)\big)^2 \left(\frac{\mathrm{Tr}\{P_j^{n-1,n}(k)\}}{\alpha^{n-1}} + \frac{\sigma_{r^n}^2(k)}{\|\mathbf{u}^{n-1}(k)\|^2}\right),$$

where $\alpha^{n-1} \ge 1$ accounts for the correlation of the layer input through $\|\mathbf{u}^{n-1}(k)\|^2$, and $\sigma_{r^n}^2(k)$ is the variance of the perturbation. By setting the partial derivative of this bound with respect to $\mu_j^n(k)$ to zero, the proposed learning rate of the weight vector from all neurons at the $(n-1)$th layer to the $j$th neuron at the $n$th layer is obtained as

$$\mu_j^n(k) = \frac{\mathrm{Tr}\{P_j^{n-1,n}(k)\}}{\mathrm{Tr}\{P_j^{n-1,n}(k)\} + \alpha^{n-1}\, \sigma_{r^n}^2(k)/\|\mathbf{u}^{n-1}(k)\|^2 + \epsilon}, \quad (27)$$

where $\epsilon$ is set to a small value to prevent the denominator from becoming zero. Substituting Eq. (27) back into the bound yields the recursion used in the algorithm,

$$\mathrm{Tr}\{P_j^{n-1,n}(k+1)\} = \left(1 - \frac{\mu_j^n(k)}{\alpha^{n-1}}\right) \mathrm{Tr}\{P_j^{n-1,n}(k)\}. \quad (28)$$

The specific procedure is shown in Algorithm 1.
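To make the core step of Algorithm 1 concrete, the following is a minimal sketch of one layer's learning-rate computation, implementing Eqs. (27) and (28) as reconstructed above; the function and variable names are ours, not the paper's, and `u_prev` is assumed to be a NumPy vector.

```python
def alr_learning_rate(trP, u_prev, sigma2_r, alpha, eps=1e-8):
    """Proposed learning rate, Eq. (27), and MSD upper-bound recursion, Eq. (28).

    trP      -- current Tr{P_j^{n-1,n}(k)}, the MSD upper bound of the weight vector
    u_prev   -- layer input vector u^{n-1}(k)
    sigma2_r -- estimated perturbation variance sigma^2_{r^n}(k)
    alpha    -- layer-dependent constant alpha^{n-1} >= 1
    """
    u_norm2 = float(u_prev @ u_prev)
    # Eq. (27): minimizer of the MSD upper bound; eps guards the denominator.
    mu = trP / (trP + alpha * sigma2_r / (u_norm2 + eps) + eps)
    # Eq. (28): MSD upper bound after the update. Since 0 < mu/alpha < 1,
    # trP stays positive, which keeps the stability condition satisfied.
    trP_next = (1.0 - mu / alpha) * trP
    return mu, trP_next
```

The weight vector is then updated with Eq. (12) using the returned `mu`.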
As $\delta_{j,\mathrm{post}}^n(k)$ represents the backpropagation error obtained through the newly updated weights, its magnitude should be less than that of $\delta_j^n(k)$. Substituting Eq. (12) into the definition of $\delta_{j,\mathrm{post}}^n(k)$ gives $\delta_{j,\mathrm{post}}^n(k) = \big(1 - \mu_j^n(k)\big)\, \delta_j^n(k)$, so $\mu_j^n(k)$ must satisfy

$$0 < \mu_j^n(k) < 2. \quad (31)$$

Using Eqs. (27) and (31), the stability condition of the proposed ALR-BPNN can be derived as

$$0 < \mu_j^n(k) = \frac{\mathrm{Tr}\{P_j^{n-1,n}(k)\}}{\mathrm{Tr}\{P_j^{n-1,n}(k)\} + \alpha^{n-1}\, \sigma_{r^n}^2(k)/\|\mathbf{u}^{n-1}(k)\|^2 + \epsilon} < 1 < 2.$$

From the above equation, it can be seen that the proposed ALR-BPNN algorithm always satisfies the stability condition if $\mathrm{Tr}\{P_j^{n-1,n}(k)\} > 0$, which can be easily confirmed by Eq. (28).
As the variance of the perturbation $\sigma_{r^n}^2(k)$ is not a measurable quantity in Eqs. (27) and (28), it is estimated through the variance of the error signals, which is obtained using a moving-average method as follows:

$$\sigma_{r^n}^2(k) = \lambda\, \sigma_{r^n}^2(k-1) + (1 - \lambda)\, \beta^n e^2(k),$$

where $\lambda$ and $\beta^n$ were set to 0.99 and $[0.01, 0.5]$, respectively. By estimating the variance of the perturbation in each hidden layer through the variance of the error signals, the proposed algorithm can be robustly updated even when the impulsive noises are suddenly generated. Specifically, the variance of the perturbation is rapidly increased by error signals containing the impulsive noises, which makes the learning rate small and prevents erroneous updates of the weights.
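A matching sketch of the perturbation-variance tracker, assuming the first-order moving-average form given above (the default `beta` is one of the values quoted in the text):

```python
def update_perturbation_variance(sigma2_r, e, lam=0.99, beta=0.01):
    # First-order moving average of the squared error signal; an impulsive
    # error inflates sigma2_r, which in turn shrinks the learning rate of
    # Eq. (27) and suppresses erroneous weight updates.
    return lam * sigma2_r + (1.0 - lam) * beta * e * e
```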

2) SETTING OF $\alpha^n$
To choose an appropriate $\alpha^n$ in the proposed algorithm, the normalized mean-square error (NMSE) curves according to $\alpha^1$ and $\alpha^2$ were compared for two types of input. As can be seen in Figs. 4 and 5, when multi-tonal sinusoidal signals were used as input signals, $\alpha^1$, which is closer to the input layer, had a greater effect on performance than $\alpha^2$; accordingly, it should always be adjusted to be larger than 1. However, this tendency depends on the characteristics of the inputs. For example, when highly correlated data features, such as exchange rates by country and the SHANGHAI index, were used as input data, $\alpha^2$, which is closer to the output layer, had a greater effect on performance than $\alpha^1$, as can be seen in Figs. 6 and 7. In this paper, simulations were performed by setting $\alpha^1$ and $\alpha^2$ based on this analysis.

IV. SIMULATION
In order to evaluate the performance of the proposed algorithm, two simulation scenarios were performed in this paper.
In the first case, multi-tonal sinusoidal signals composed of various frequencies were sequentially used as input signals. Target signals were set by passing the input signals through a specific nonlinear model. This simulation was performed to confirm how quickly and robustly the output signals generated by the proposed algorithm converge to the target signals in impulsive noise environments. The second simulation was conducted to predict the NASDAQ index. Target and input data were set to NASDAQ index data and 18 data features covering about 10 years. This simulation also evaluated how robustly the proposed algorithm trains the neural network in an environment where bad training target data are randomly generated.

A. CASE 1
A sinusoidal wave was used as the original signal $s(k)$ and defined as

$$s(k) = \sin\!\left(\frac{2\pi f k}{F_s}\right),$$

where $f$ and $F_s$ are the center frequency and sampling frequency of the original signal, respectively. A multi-tonal signal was generated as the sum of sinusoidal waves $s(k)$ having center frequencies of 200, 400, . . . , 1200 Hz and a sampling frequency of 2000 Hz. Correlated input signals $x(k)$ were obtained by passing the multi-tonal signal through the succeeding filters, and the nonlinear target signals to be estimated were generated through a specific nonlinear model. The impulsive noises $\psi(k)$ were generated as $\psi(k) = \omega(k) G(k)$, where $\omega(k)$ is a Bernoulli process with $\Pr(\omega(k) = 1) = p$ and $p$ is set to 0.005 in this paper. $G(k)$ is zero-mean Gaussian with power $\sigma_G^2 = 1000\,\sigma_y^2$. Basic BPNN algorithms based on the sigmoid and ReLU activation functions, the Adagrad-BPNN [32], the Adam-BPNN [33], and the proposed ALR-BPNN algorithm were simulated to compare performance. The parameters $\beta_1$ and $\beta_2$ used in the Adam-BPNN algorithm were set to 0.99. The number of layers, including the input and output layers, was set to 3, and the number of neurons at the input and second layers was set to 20 in all algorithms. All simulation results were obtained by averaging 50 independent simulations.
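For reference, a sketch of the input-signal and impulsive-noise generation under the stated settings; the correlation filters and the nonlinear target model are omitted because they are not reproduced here, and `sigma2_y` stands in for the target-signal power.

```python
import numpy as np

rng = np.random.default_rng(1)
Fs = 2000                                  # sampling frequency [Hz]
freqs = np.arange(200, 1201, 200)          # tones at 200, 400, ..., 1200 Hz
k = np.arange(4000)                        # sample indices

# Multi-tonal signal: sum of sinusoids s(k) = sin(2*pi*f*k/Fs)
s = sum(np.sin(2.0 * np.pi * f * k / Fs) for f in freqs)

# Impulsive noise psi(k) = omega(k) * G(k): Bernoulli gate times strong Gaussian
p = 0.005                                  # Pr(omega(k) = 1)
sigma2_y = 1.0                             # placeholder for target-signal power
omega = (rng.random(k.size) < p).astype(float)
G = rng.normal(scale=np.sqrt(1000.0 * sigma2_y), size=k.size)
psi = omega * G
```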
In this simulation, the proposed ALR-BPNN algorithm performed very well for not only the uncorrelated inputs but also the correlated inputs. As the proposed algorithm has a larger learning rate in the beginning than the other algorithms because of the MSD analysis, its initial normalized error was larger than those of the other algorithms. However, this is intended to ensure that the proposed algorithm has a fast convergence rate within a stable range, so it does not negatively affect the overall performance. As can be seen in Figs. 10 and 11, the proposed ALR-BPNN algorithm had a faster convergence rate than the compared algorithms even in environments where the impulsive noises were not generated. The value of the proposed algorithm is even more evident in environments where the impulsive noises are generated. As can be seen in Figs. 12 and 13, regardless of whether the inputs were the uncorrelated or correlated signals, the other comparison algorithms failed to maintain low steady-state errors in the impulsive noise environments. On the other hand, the proposed algorithm showed good performance for all inputs mixed with the impulsive noises, as the learning rate automatically decreases when the error signals rapidly increase due to the impulsive noises.

B. CASE 2
2000 training data and 500 test data were used to train the FNNs predicting the NASDAQ index. The number of input features, including exchange rates by country, the SHANGHAI index, the Goldman index, etc., was 18. Five simulations were performed according to the ratio of bad data among the training target data, with the ratio ranging from 0 to 40%. As can be seen in Fig. 14, bad data were randomly set to 2 to 6 times larger or smaller than the original training target data. The prediction accuracy of each algorithm was calculated by the root-mean-square error (RMSE) on the test data. The number of layers was also set to 3, and the numbers of neurons at the input and second layers were set to 18 and 20, respectively. The comparison algorithms were the Adagrad-BPNN and Adam-BPNN algorithms. The parameters $\beta_1$ and $\beta_2$ used in the Adam-BPNN algorithm were also set to 0.99. All simulation results were obtained by averaging 30 independent simulations.
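A sketch of the Case 2 corruption and scoring; the exact corruption rule (a uniform scale factor in [2, 6], larger or smaller with equal probability) is an assumption based on the description above.

```python
import numpy as np

def inject_bad_data(d, ratio, rng):
    """Replace a `ratio` fraction of training targets with values 2 to 6
    times larger or smaller than the originals (equal probability)."""
    d_bad = d.copy()
    idx = rng.random(d.size) < ratio
    scale = rng.uniform(2.0, 6.0, size=d.size)
    larger = rng.random(d.size) < 0.5
    d_bad[idx] = np.where(larger[idx], d[idx] * scale[idx], d[idx] / scale[idx])
    return d_bad

def rmse(y_pred, y_true):
    # Root-mean-square error used as the prediction-accuracy measure
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```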
In this simulation case, the proposed algorithm showed good prediction accuracy regardless of the ratio of bad data among the training data. As can be seen in Figs. 15, 16, and 17, the proposed ALR-BPNN algorithm not only provided good prediction accuracy in the absence of bad data, but also maintained good accuracy in the presence of bad data. This prediction accuracy is detailed in TABLE 2. While the comparison algorithms became less accurate as the ratio of bad data increased, the proposed algorithm maintained good accuracy regardless of the ratio of bad data. Moreover, TABLE 1, which shows the number of multiplications of the Adagrad-BPNN, Adam-BPNN, and proposed ALR-BPNN algorithms, indicates that the proposed algorithm does not increase the computational complexity compared to the Adagrad and Adam algorithms commonly used in adaptive learning-rate methods.

V. CONCLUSION
This paper proposed a novel ALR-BPNN algorithm that updates the learning rate in the direction of minimizing the MSD at each hidden layer and showed for the first time that minimizing the MSD can effectively reduce the overall error of the neural network. The problem that the exact value of the MSD at each hidden layer is not obtainable was solved by minimizing the upper bound of the MSD. In addition, to be robust to the impulsive noises, the proposed algorithm estimated the variance of the perturbation at each hidden layer through the variance of the error signals. The results of the two simulations, estimating nonlinear-model target signals from sequential input signals and predicting the actual NASDAQ data, showed how robust and effective the proposed ALR-BPNN algorithm was even in the presence of the impulsive noise. The proposed BPNN algorithm based on the MSD analysis can be utilized in various BPNN algorithms in the future.