Extreme Learning Machine Under Minimum Information Divergence Criterion

In recent years, the extreme learning machine (ELM) and its improved variants have been successfully applied to various classification and regression tasks. These algorithms commonly use the mean square error (MSE) criterion to control the training error. However, the MSE criterion is not well suited to handling outliers, which can occur in general regression or classification tasks. In this paper, a novel extreme learning machine under the minimum information divergence criterion (ELM-MinID) is proposed to deal with noisy training sets. In the minimum information divergence criterion, the Gaussian kernel function and the Euclidean information divergence are used in place of the MSE criterion to enhance the noise resistance of ELM. Experimental results on two synthetic datasets and eleven benchmark datasets show that this method is superior to traditional ELMs.


I. INTRODUCTION
The extreme learning machine (ELM) is a single-hidden-layer feedforward neural network (SLFN) with universal approximation capability [1], [2]. In ELM, the weights linking the input layer to the hidden layer and the hidden bias terms can be randomly initialized. Then, the weights linking the hidden layer to the output layer can be directly determined by the least squares method based on the Moore-Penrose generalized inverse [3]. Unlike full parameter determination algorithms such as the back propagation (BP) algorithm, the random initialization of the hidden nodes' parameters combined with an analytical solution for the output weights reduces the computational complexity [1]. Therefore, an important advantage of ELM is its fast training speed. ELM has been widely used in many engineering applications, such as stock market forecasting [4], [5], image processing [6], [7], face recognition [8], and nonlinear model identification [9].
In recent years, several improved versions of ELM have been proposed. In general, the performance of ELM is improved in two ways. One is to optimize the network structure (as in evolutionary ELM (E-ELM) [10] and ELM-kernel (KELM) [11]), and the other is to improve the error statistics method (as in regularized ELM (RELM) [12] and outlier robust ELM (OR-ELM) [13]). In KELM, the term $I/\lambda$ is added to the hidden layer matrix to address the randomness of the learning machine. RELM, proposed by Deng et al. [12], seeks a trade-off between the empirical risk $\|\varepsilon\|^2$ and the structural risk $\|\beta\|^2$ by introducing regularization parameters, which improves the generalization performance of the model. In OR-ELM [13], the $\ell_1$ norm of the prediction error is used as the objective function, which can give better results when there are outliers in the regression task.
The associate editor coordinating the review of this manuscript and approving it for publication was Guangdeng Zong.
However, in essence, these ELMs use the mean square error (MSE) criterion to measure the error. MSE constrains only the second-order statistics and optimizes poorly in nonlinear and non-Gaussian situations (e.g., finite-range or heavy-tailed distributions). MSE mainly focuses on the scatter of the error distribution and cannot capture all the probabilistic information of the error, such as the shape (kurtosis, tails, peaks, etc.) of its probability density function. To address this issue, Chen et al. [14]-[16] proposed a novel minimum information divergence (MinID) criterion, in which the Kullback-Leibler divergence between the actual error and the desired error is selected as the objective function of the adaptation algorithm. This criterion has been successfully used in adaptive filtering.
In order to overcome the defects of the above ELMs and improve the noise resistance of ELM, a novel ELM-MinID algorithm is developed in this paper. In this algorithm, the MinID criterion based on the Euclidean information divergence is applied to the extreme learning machine (ELM). The main contributions of this paper are as follows: 1) We propose a new method of error control: the minimum information divergence criterion based on the Euclidean information divergence. 2) We propose a new ELM-MinID algorithm. Compared with traditional ELMs, this algorithm uses the MinID criterion in place of the MSE criterion, which makes ELM-MinID more resistant to noise. 3) We simulate function fitting with synthetic data sets and regression with benchmark data sets to verify our method.
The structure of this paper is as follows. In part A of Section II, we provide a brief review of ELM. After that, the MinID criterion based on the Euclidean information divergence is given in part B of Section II. In Section III, ELM under the minimum information divergence criterion is developed. Subsequently, the performance of the algorithm is tested on synthetic data sets and benchmark data sets in Section IV. Finally, the conclusion is given in Section V.

A. EXTREME LEARNING MACHINE (ELM)
For ELM, the input weights (connecting the input layer and the hidden layer) and the hidden bias terms are randomly initialized, and the output weights (connecting the hidden layer and the output layer) are obtained using the Moore-Penrose generalized inverse.
Consider training a single-hidden-layer feedforward neural network with $\tilde{N}$ hidden neurons and activation function $g(x)$ to learn $N$ arbitrary distinct samples $\{(x_k, t_k)\}_{k=1}^{N}$, where $x_k = [x_{k1}, x_{k2}, \ldots, x_{kn}]^T \in R^n$ is the kth input vector and $t_k = [t_{k1}, t_{k2}, \ldots, t_{km}]^T \in R^m$ is the associated desired value. In ELM, the SLFN is mathematically modeled as

$$y_k = \sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_k + b_i), \quad k = 1, 2, \ldots, N \tag{2}$$

where $y_k$ is the output vector of the SLFN for the kth input vector $x_k$, $w_i$ is the weight vector linking the ith hidden unit to all the input units, $b_i$ is the hidden bias of the ith hidden unit, and $\beta_i$ denotes the output weight vector linking the ith hidden unit to all the output units. In this way, the nonlinear system can be transformed into a linear system:

$$H\beta = Y$$

where

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_{\tilde{N}}^T \end{bmatrix}, \quad Y = \begin{bmatrix} y_1^T \\ \vdots \\ y_N^T \end{bmatrix}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}$$

and $\beta$ is the matrix of the weights linking the hidden layer to the output layer, $H$ denotes the output matrix of the hidden layer, $Y$ is the output of the output layer, and $T$ is the matrix of desired outputs. The output weights $\beta$ can be determined by minimizing the mean square error (MSE)

$$\min_{\beta} E\left[ \| e_j \|^2 \right] \tag{7}$$

where $E$ denotes the expectation operator and $e_j = t_j - \sum_{p=1}^{\tilde{N}} g(w_p \cdot x_j + b_p)\beta_p$ is the estimation error. Usually, the solution of (7) can be determined by

$$\beta = H^{\dagger} T \tag{8}$$

where $H^{\dagger}$ denotes the Moore-Penrose generalized inverse. When $H^T H$ is nonsingular, the orthogonal projection method can be used to calculate $H^{\dagger}$ [1]:

$$H^{\dagger} = (H^T H)^{-1} H^T \tag{9}$$

However, the above ELM still has some shortcomings; for example, the solution of the MSE problem (7) is sensitive to non-Gaussian noise. The reason is that the MSE criterion captures only the second-order statistics of the residual and may perform poorly in nonlinear and non-Gaussian cases. In order to improve robustness in realistic situations, an alternative optimality criterion beyond second-order statistics is adopted in this study.
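The ELM training procedure reviewed above can be illustrated with a minimal sketch. It assumes scalar inputs and outputs, sigmoid hidden nodes, input weights and biases drawn uniformly from [-1, 1], and a small hand-written Gaussian elimination routine in place of a linear algebra library; the helper names (`hidden_matrix`, `solve`, `train_elm`, `predict`) and the toy sine target are hypothetical and are not taken from the paper. The output weights are obtained from the normal equations, which corresponds to the orthogonal projection formula for the generalized inverse when $H^T H$ is nonsingular.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_matrix(xs, w, b):
    """Hidden-layer output matrix H: one row per sample, one column per hidden node."""
    return [[sigmoid(wi * x + bi) for wi, bi in zip(w, b)] for x in xs]

def solve(a, y):
    """Solve the square system a * beta = y by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * beta[c] for c in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

def train_elm(xs, ts, n_hidden, rng):
    # Random input weights and biases (the sampling range is an assumption).
    w = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    h = hidden_matrix(xs, w, b)
    # Normal equations: (H^T H) beta = H^T T.
    hth = [[sum(row[i] * row[j] for row in h) for j in range(n_hidden)]
           for i in range(n_hidden)]
    htt = [sum(row[i] * t for row, t in zip(h, ts)) for i in range(n_hidden)]
    return w, b, solve(hth, htt)

def predict(xs, w, b, beta):
    return [sum(hk * bk for hk, bk in zip(row, beta))
            for row in hidden_matrix(xs, w, b)]

rng = random.Random(0)
xs = [-3.0 + 6.0 * k / 39 for k in range(40)]
ts = [math.sin(x) for x in xs]            # toy target, for illustration only
w, b, beta = train_elm(xs, ts, 5, rng)
ys = predict(xs, w, b, beta)
rmse = math.sqrt(sum((t - y) ** 2 for t, y in zip(ts, ys)) / len(ts))
baseline = math.sqrt(sum(t * t for t in ts) / len(ts))  # RMSE of the all-zero predictor
print(rmse, baseline)
```

Only the output weights are fitted; the hidden layer stays at its random initialization, which is what makes the training step a single linear solve.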

B. INFORMATION DIVERGENCE
Information divergence is a measure of the distance between two distributions. Based on the Euclidean distance, a symmetric information divergence, called the Euclidean information divergence, can be defined. For two probability density functions $p(x)$ and $q(x)$, the Euclidean information divergence is given by

$$D(p \| q) = \int \left( p(x) - q(x) \right)^2 dx \tag{10}$$

which is always non-negative and equal to zero only if $p(x) = q(x)$. The Euclidean information divergence is clearly symmetric, i.e., $D(p \| q) = D(q \| p)$. In this work, the symmetric divergence (10) is used to measure the distance between two distributions.
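As a small numerical illustration of these properties, the sketch below approximates the integral of the squared difference of two densities with the midpoint rule. The two Gaussian densities, the integration range, and the function names are assumptions chosen for the example only.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def euclidean_divergence(p, q, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of the integral of (p(x) - q(x))^2 over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += (p(x) - q(x)) ** 2
    return total * dx

def p(x):
    return gaussian_pdf(x, 0.0, 1.0)

def q(x):
    return gaussian_pdf(x, 1.0, 2.0)

d_pq = euclidean_divergence(p, q)
d_qp = euclidean_divergence(q, p)
d_pp = euclidean_divergence(p, p)
print(d_pq, d_qp, d_pp)
```

The first two values coincide (symmetry) and the third is zero, in line with the properties stated above.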
In practice, the probability density functions $p(x)$ and $q(x)$ of the samples are unknown. In the present paper, we adopt the kernel method [17] to estimate them. With the kernel approach, the estimated PDF is differentiable, which is a prerequisite for the gradient calculation. The one-dimensional probability density estimator with kernel $K(\cdot)$ is given by

$$\hat{p}(x) = \frac{1}{|S_p| \, \sigma} \sum_{x_i \in S_p} K\left( \frac{x - x_i}{\sigma} \right) \tag{11}$$

where $\sigma$ is the kernel width, $S_p$ denotes the sample sequence drawn independently from the probability density function $p(x)$, and $|S_p|$ is the total number of samples in $S_p$. Usually $K(\cdot)$ is a radially symmetric unimodal probability density function. The kernel function $K(x)$ satisfies $\int_R K(x)\,dx = 1$.
In this work, we choose the standard Gaussian kernel function

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) \tag{12}$$

Minimizing the information divergence function (10) is called the minimum information divergence (MinID) criterion. Since the divergence is less sensitive to noise than the MSE, it performs better than the MSE, especially when there is impulse noise in the samples [14].
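The kernel estimator can be sketched in a few lines. The code assumes the estimator sums standard Gaussian kernels scaled by the kernel width and divides by the sample count times the width; the sample values (including one artificial outlier at 4.0), the width 0.5, and the function names are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, sigma):
    """Kernel density estimate at x with kernel width sigma."""
    return sum(gaussian_kernel((x - s) / sigma) for s in samples) / (len(samples) * sigma)

samples = [-0.3, -0.1, 0.0, 0.05, 0.2, 4.0]  # hypothetical error samples, 4.0 is an outlier
sigma = 0.5

# Midpoint-rule check that the estimate integrates to (approximately) one.
lo, hi, steps = -10.0, 14.0, 24000
dx = (hi - lo) / steps
area = sum(kde(lo + (k + 0.5) * dx, samples, sigma) for k in range(steps)) * dx

density_at_zero = kde(0.0, samples, sigma)
density_at_outlier = kde(4.0, samples, sigma)
print(area, density_at_zero, density_at_outlier)
```

Because every sample contributes a smooth bump, the estimate is differentiable in the sample values, which is what a gradient-based update of the output weights relies on.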

III. ELM UNDER MINIMUM INFORMATION DIVERGENCE CRITERION
According to ELM learning theory, multiple types of feature maps can be used in ELM, so that ELM can approximate any continuous target function (refer to [2] for details). That is, given any continuous target function $y(x)$, there exists a series of $\beta_i$ that makes the error equal to zero:

$$\lim_{\tilde{N} \to \infty} \left\| \sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x + b_i) - y(x) \right\| = 0 \tag{13}$$
Equation (13) is the cost function of ELM training. The purpose of ELM training is to make the error between the training output and the desired output close to zero. Traditional ELM training uses the MSE criterion, as in (7). However, the MSE criterion is sensitive to non-Gaussian noise. In this section, the MinID criterion based on the Euclidean information divergence is used as the cost function for ELM training.
Based on the MinID criterion, ELM-MinID is proposed to minimize the information divergence between the actual error $e$ and the desired error $e^{(d)}$ by adjusting the parameter $\beta$. In other words, the output weight matrix $\beta$ is adjusted to make the PDF of the error $e_k$ close to the desired density function $p_{e^{(d)}}$. By setting the desired density function $p_{e^{(d)}}$ to a Dirac delta function at zero, the actual error $e$ of ELM is also driven to concentrate around zero.
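We can get a new objective function of ELM-MinID, which minimizes the divergence between the actual error $e$ and the desired error $e^{(d)}$, as follows:

$$\min_{\beta} D(p_e \| p_{e^{(d)}}) = \min_{\beta} \int \left( p_e(x) - p_{e^{(d)}}(x) \right)^2 dx \tag{14}$$

We use the kernel method (11) and the Gaussian kernel function (12) to estimate the PDF of the actual error $e$, that is

$$\hat{p}_e(x) = \frac{1}{N\sigma} \sum_{i=1}^{N} K\left( \frac{x - e_i}{\sigma} \right) = \frac{1}{\sqrt{2\pi} N \sigma} \sum_{i=1}^{N} \exp\left( -\frac{(x - e_i)^2}{2\sigma^2} \right) \tag{15}$$

where $e_i$ $(i = 1, 2, \ldots, N)$ is the error sequence of ELM and $\sigma$ is the kernel width. From (2), one can get the error of the kth output:

$$e_k = t_k - y_k = t_k - \sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_k + b_i) \tag{16}$$

in which the error sample $e_i$ $(i = 1, 2, \ldots, N)$ can be expressed as

$$e_i = t_i - h_i \beta \tag{17}$$

where $h_i = [g(w_1 \cdot x_i + b_1), \ldots, g(w_{\tilde{N}} \cdot x_i + b_{\tilde{N}})]$ is the ith row vector of $H$ (the output matrix of the hidden layer). Here, $W = [w_1, w_2, \ldots, w_{\tilde{N}}]$ collects the input weight vectors.

Theoretically, we try to make the error values as concentrated around zero as possible. For the desired error $e^{(d)}$, we can choose the $\delta$ function as the probability density function $p_{e^{(d)}}$, i.e.,

$$p_{e^{(d)}}(x) = \delta(x) \tag{18}$$

However, in practice, this choice is difficult to work with directly. In real applications, the estimated information divergence is used as an alternative cost function, in which the desired error distribution is also estimated by the kernel method. We have the desired density function

$$\hat{p}_{e^{(d)}}(x) = \frac{1}{N\sigma} \sum_{j=1}^{N} K\left( \frac{x - e_j^{(d)}}{\sigma} \right) \tag{19}$$

where $e_j^{(d)}$ $(j = 1, 2, \ldots, N)$ are the samples of the desired error. The information divergence between $e$ and $e^{(d)}$ can be written as

$$D(\hat{p}_e \| \hat{p}_{e^{(d)}}) = \int \hat{p}_e^2(x)\,dx - 2\int \hat{p}_e(x)\,\hat{p}_{e^{(d)}}(x)\,dx + \int \hat{p}_{e^{(d)}}^2(x)\,dx \tag{20}$$

A detailed mathematical derivation of (20) is given in the Appendix. At last, we have the function of information divergence (21). Substituting (17) into (21), we get function (22).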
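When both densities are Gaussian kernel estimates, the estimated divergence can be evaluated without numerical integration, because the integral of the product of two Gaussians of width $\sigma$ centred at $a$ and $b$ equals a Gaussian of width $\sqrt{2}\sigma$ evaluated at $a - b$. The sketch below uses this standard identity, which may differ in form from the paper's own closed-form expression, and checks it against direct integration. All function names and sample values are hypothetical, and the desired error samples are assumed to be all zeros.

```python
import math

def gauss(u, s):
    """Zero-mean Gaussian density with standard deviation s, evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def kde(x, samples, sigma):
    return sum(gauss(x - v, sigma) for v in samples) / len(samples)

def cross_term(a, b, sigma):
    """Integral of the product of the Gaussian kernel estimates built on samples a and b."""
    s = math.sqrt(2.0) * sigma
    return sum(gauss(x - y, s) for x in a for y in b) / (len(a) * len(b))

def closed_form_divergence(errors, desired, sigma):
    return (cross_term(errors, errors, sigma)
            - 2.0 * cross_term(errors, desired, sigma)
            + cross_term(desired, desired, sigma))

def numeric_divergence(errors, desired, sigma, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule integral of the squared difference of the two estimates."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        total += (kde(x, errors, sigma) - kde(x, desired, sigma)) ** 2
    return total * dx

sigma = 0.5
errors = [0.3, -0.2, 0.1, 0.05]          # hypothetical error samples
d_closed = closed_form_divergence(errors, [0.0] * 4, sigma)
d_numeric = numeric_divergence(errors, [0.0] * 4, sigma)

# Effect of one outlier of growing magnitude on the divergence and on the MSE.
d_out5 = closed_form_divergence(errors + [5.0], [0.0] * 5, sigma)
d_out50 = closed_form_divergence(errors + [50.0], [0.0] * 5, sigma)
mse_out5 = sum(e * e for e in errors + [5.0]) / 5
mse_out50 = sum(e * e for e in errors + [50.0]) / 5
print(d_closed, d_numeric, d_out5, d_out50, mse_out5, mse_out50)
```

In this toy setting, moving the outlier from 5 to 50 leaves the estimated divergence essentially unchanged while the MSE grows by roughly two orders of magnitude, which illustrates the robustness argument made in the text.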
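One can update the parameter $\beta$ by the following gradient algorithm:

$$\beta(k) = \beta(k-1) - \eta \, \nabla D(p_e \| p_{e^{(d)}}) \Big|_{\beta = \beta(k-1)}, \quad \nabla D(p_e \| p_{e^{(d)}}) = \frac{\partial D(p_e \| p_{e^{(d)}})}{\partial \beta}$$

where $\eta > 0$ is the step size and $\beta(k)$ denotes the parameter vector at iteration $k$.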
Based on the above model optimization strategy, a robust learning algorithm for SLFNs under MinID can be obtained, which is referred to as the ELM-MinID and is described in Algorithm 1.

Algorithm 1 ELM-MinID
1) Randomly initialize the weight vectors $\{w_j\}_{j=1}^{\tilde{N}}$ together with their corresponding bias terms $\{b_j\}_{j=1}^{\tilde{N}}$.
2) Calculate the hidden layer output matrix $H$.
3) Update the weight vector $\beta$.
For k = 1, 2, ..., K do
    Compute the actual errors based on $\beta(k-1)$: $e_i = t_i - h_i \beta(k-1)$
    Calculate the gradient of the information divergence: $\nabla D(p_e \| p_{e^{(d)}})$
    Update the weight vector: $\beta(k) = \beta(k-1) - \eta \nabla D(p_e \| p_{e^{(d)}})$
Until $\| \nabla D(p_e \| p_{e^{(d)}}) \| < \xi$
EndFor
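The update loop of step 3 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the output is scalar, the desired error samples are all zeros, the divergence is evaluated through the Gaussian product identity (kernel width $\sqrt{2}\sigma$), the start value $\beta(0) = 0$ is an arbitrary choice, and the hidden-layer matrix is replaced by a hypothetical two-column matrix so that the example stays small. The gradient is derived by hand and compared with a central finite difference.

```python
import math

def gauss(u, s):
    """Zero-mean Gaussian density with standard deviation s, evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def errors_of(beta, h, t):
    """Error samples e_i = t_i - h_i . beta."""
    return [ti - sum(hk * bk for hk, bk in zip(hi, beta)) for hi, ti in zip(h, t)]

def divergence(beta, h, t, sigma):
    """Estimated Euclidean divergence between the error KDE and a KDE of zero desired errors."""
    e = errors_of(beta, h, t)
    n, s = len(e), math.sqrt(2.0) * sigma
    v_ee = sum(gauss(a - b, s) for a in e for b in e) / n ** 2
    v_e0 = sum(gauss(a, s) for a in e) / n
    return v_ee - 2.0 * v_e0 + gauss(0.0, s)

def gradient(beta, h, t, sigma):
    """Analytic gradient of `divergence` with respect to beta."""
    e = errors_of(beta, h, t)
    n, s = len(e), math.sqrt(2.0) * sigma
    grad = [0.0] * len(beta)
    for i in range(n):
        c0 = -2.0 * e[i] / s ** 2 * gauss(e[i], s) / n       # from the cross term
        for k in range(len(beta)):
            grad[k] += c0 * h[i][k]
        for j in range(n):
            d = e[i] - e[j]
            c = d / s ** 2 * gauss(d, s) / n ** 2            # from the pairwise term
            for k in range(len(beta)):
                grad[k] += c * (h[i][k] - h[j][k])
    return grad

def numeric_gradient(beta, h, t, sigma, eps=1e-6):
    out = []
    for k in range(len(beta)):
        up = list(beta); up[k] += eps
        dn = list(beta); dn[k] -= eps
        out.append((divergence(up, h, t, sigma) - divergence(dn, h, t, sigma)) / (2.0 * eps))
    return out

def train(h, t, sigma, eta, max_iter, tol):
    beta = [0.0] * len(h[0])                                  # assumed start value
    for _ in range(max_iter):
        g = gradient(beta, h, t, sigma)
        if math.sqrt(sum(v * v for v in g)) < tol:
            break
        beta = [b - eta * v for b, v in zip(beta, g)]
    return beta

# Hypothetical stand-in for the hidden-layer output matrix, with one outlier target.
xs = [-1.0 + 2.0 * k / 9 for k in range(10)]
h = [[1.0, x] for x in xs]
t = [0.5 + 0.3 * x for x in xs]
t[9] += 5.0
sigma, eta = 1.0, 0.1

g_analytic = gradient([0.2, -0.1], h, t, sigma)
g_numeric = numeric_gradient([0.2, -0.1], h, t, sigma)
d_before = divergence([0.0, 0.0], h, t, sigma)
beta = train(h, t, sigma, eta, max_iter=300, tol=0.001)
d_after = divergence(beta, h, t, sigma)
print(beta, d_before, d_after)
```

The iteration limit of 300 and the tolerance of 0.001 mirror the values of K and ξ quoted in the experiments; the kernel width and step size here are simply picked from within the search ranges given there.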

IV. EXPERIMENTAL RESULTS
In this part, we present experimental results to illustrate the performance of the ELM-MinID proposed in the previous section. The parameters of all algorithms are chosen by grid search and cross-validation. In each independent trial, the training and testing datasets are fixed. The average RMSE over 50 trials is obtained for each algorithm and reported as its performance. All the experiments are carried out in the MATLAB R2018a environment running on an Intel(R) Xeon(R) E-2124G processor with a speed of 3.40 GHz.

A. FUNCTION FITTING WITH SYNTHETIC DATASETS
In this subsection, two synthetic datasets are utilized to validate the proposed algorithm. The description of them is as follows.
Sinc: The synthetic data set is produced by $y_i = \mathrm{sinc}(x_i) + n$, where $n$ denotes noise and the sinc function is given as

$$\mathrm{sinc}(x) = \begin{cases} \sin(x)/x, & x \neq 0 \\ 1, & x = 0 \end{cases}$$

We generate 1000 data points with $x_i$ drawn randomly from [−10, 10].
Func: This artificial data set is generated by $(y_i, y_j) = \mathrm{func}(x_i, x_j) + n$, where $n$ is also a noise term and func is the two-variable target function. 1000 data points are constructed by random selection from an evenly spaced 50 × 50 grid on [−2, 2].

For the Sinc data set, we consider three long-tailed distributions of $n$: 1) a symmetric α-stable (SαS) distribution [18] with characteristic function $\varphi(t) = \exp(-\tau |t|^{\alpha})$, with shape parameter α = 1.5 and scale parameter τ = 0.5; 2) an SαS distribution with shape parameter α = 1.3 and scale parameter τ = 1; 3) a Laplace distribution with zero mean and variance 0.5. Similar Laplace noise is also added to the Func data set. In our simulations, 500 noisy samples are used for training and another 500 clean samples are used for testing. The activation function in this paper is the sigmoid function

$$g(x) = \frac{1}{1 + e^{-x}}$$

We compare the performance of the proposed ELM-MinID with three existing ELMs: ELM, RELM and ELM-RCC [19]. In order to make a fair comparison, these algorithms are compared at their best fitting accuracy based on the optimal parameter combination. Therefore, we need to predetermine the following parameters: the number of hidden nodes $\tilde{N}$, the regularization parameter λ, the kernel width σ, and the step size η. In ELM optimization, the parameters are usually chosen by grid search and cross-validation, such as k-fold, as done by Inaba et al. [20], Kai and Luo [13], Huang et al. [21], Da Silva et al. [22] and others. Similarly, in this part, we obtain the best parameter combination by a grid search over each parameter and five-fold cross-validation on every training set. We calculate the validation accuracy for different parameter combinations of the number of hidden nodes $\tilde{N} \in \{10, 20, \ldots, 400\}$, the regularization parameter $\lambda \in \{10^{-10}, 10^{-9}, \ldots, 10^{5}\}$, the kernel width $\sigma \in \{0.02, 0.04, \ldots, 1\}$, and the step size $\eta \in \{0.01, 0.02, \ldots, 0.1\}$.
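The Sinc data set with Laplace noise can be generated along the following lines. The sketch assumes that sinc(x) means sin(x)/x with the value 1 at x = 0, that inputs are drawn uniformly from [−10, 10], and that the first half of the points receives noise while the second half stays clean; the function names and the seed are hypothetical. A zero-mean Laplace variable with scale b has variance 2b², and it is sampled here as b times the difference of two independent unit-rate exponential variables.

```python
import math
import random

def sinc(x):
    return 1.0 if x == 0 else math.sin(x) / x

def laplace_noise(rng, variance):
    """Zero-mean Laplace sample with the given variance (variance = 2 * scale**2)."""
    scale = math.sqrt(variance / 2.0)
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def make_sinc_data(n_points, variance, rng):
    xs = [rng.uniform(-10.0, 10.0) for _ in range(n_points)]
    half = n_points // 2
    train = [(x, sinc(x) + laplace_noise(rng, variance)) for x in xs[:half]]  # noisy
    test = [(x, sinc(x)) for x in xs[half:]]                                  # clean
    return train, test

rng = random.Random(2020)
train_set, test_set = make_sinc_data(1000, 0.5, rng)
print(len(train_set), len(test_set))
```

Testing on clean targets means the reported RMSE measures how well the underlying function is recovered, not how well the noise is reproduced.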
The maximum number of hidden nodes for the ELMs is set to 400 because only 400 training samples are available in each fold (since we use five-fold cross-validation on 500 training samples) [23]. Additionally, in ELM-MinID, the termination tolerance ξ is 0.001 and the maximum iteration number K is 300. The best parameters for each algorithm are chosen according to the validation accuracy and summarized in Table 1.
VOLUME 8, 2020
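The fold sizes and the search grids quoted above can be written out explicitly. The sketch below assumes contiguous folds without shuffling and only enumerates the candidate values; which parameters are searched for which algorithm is not shown, and the names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold = n_samples // k
    for f in range(k):
        start = f * fold
        stop = (f + 1) * fold if f < k - 1 else n_samples
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val

fold_sizes = [(len(tr), len(va)) for tr, va in k_fold_indices(500, 5)]

# Candidate values as listed in the text.
hidden_nodes = list(range(10, 401, 10))                 # 10, 20, ..., 400
lambdas = [10.0 ** p for p in range(-10, 6)]            # 1e-10, ..., 1e5
sigmas = [round(0.02 * k, 2) for k in range(1, 51)]     # 0.02, 0.04, ..., 1
etas = [round(0.01 * k, 2) for k in range(1, 11)]       # 0.01, 0.02, ..., 0.1
print(fold_sizes, len(hidden_nodes), len(lambdas), len(sigmas), len(etas))
```

With 500 training samples and five folds, each model is fitted on 400 samples, which is the figure behind the cap on the number of hidden nodes.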
The experiments were run 50 times using the parameters in Table 1. Fig. 1 shows the fitting results of the four algorithms on Sinc with the three different noises. Further, the average (and standard deviation) values of the testing RMSEs are shown in Table 2, where the best result for each noise distribution is highlighted in bold. Table 3 reports the statistical significance of the difference between the best performance and the runner-up using the paired t-test. In Table 3, P < 0.05; that is, there is a significant difference in the testing RMSEs between the two algorithms. This shows that ELM-MinID has a better fitting ability. Fig. 2 shows the fitting results of the four algorithms on Func with Laplace noise (0, 0.5). In these experiments, ELM-MinID is more robust than the other algorithms under the same noises.

B. REGRESSION WITH BENCHMARK DATASETS
In the second experiment, eleven benchmark datasets from the UCI machine learning repository [24] are used to compare the regression performance of ELM-MinID with that of KELM, RELM and ELM-RCC. The descriptions of the data sets are presented in Table 4. In order to illustrate the robustness of these algorithms, training samples with different contamination rates are generated. This is done by assigning random values from [0, 1] to the target values of some training samples (all target values are normalized into [0, 1]).
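The contamination step can be sketched as follows, assuming min-max normalization of the targets, a uniform random choice of which samples to contaminate, and uniform replacement values; the function names, the seed, and the toy target list are hypothetical.

```python
import random

def normalize(targets):
    """Min-max normalize target values into [0, 1]."""
    lo, hi = min(targets), max(targets)
    return [(t - lo) / (hi - lo) for t in targets]

def contaminate(targets, rate, rng):
    """Replace a fraction `rate` of the normalized targets with uniform values from [0, 1)."""
    out = list(targets)
    n_bad = int(round(rate * len(out)))
    for idx in rng.sample(range(len(out)), n_bad):
        out[idx] = rng.random()
    return out

rng = random.Random(7)
clean = normalize([float(k) for k in range(50)])   # toy targets for illustration
noisy = contaminate(clean, 0.2, rng)
changed = sum(1 for a, b in zip(clean, noisy) if a != b)
print(changed)
```

Only the training targets are altered in this way; the inputs and the test targets are left untouched.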
The parameters of these algorithms are selected through grid search and five-fold cross-validation with the same parameter ranges as those in Section IV-A. In addition, in the KELM algorithm, the grid-search range of the kernel parameter γ is $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$. The optimal parameters are summarized in Table 5, except that the iteration number K is preset to 300 and the termination tolerance ξ is fixed to 0.001.
The 50-run training and testing RMSEs are shown in Tables 6, 7, and 8, which are for uncontaminated data sets and contamination rates of 20% and 40%, respectively. The best simulation results are highlighted in bold. We can see that when there is no contamination in the training data, all training methods obtain similar results. When 20% or 40% of the training samples are contaminated with outliers, KELM, RELM and ELM-RCC show worse regression performance than ELM-MinID. This is to be expected, because they use the $\ell_2$ norm, which is not suitable for data sets with outliers. Unlike the $\ell_2$ norm, the MinID criterion can capture more characteristics of the error distribution. Tables 9 and 10 report the statistical significance of the difference between the best performance and the runner-up for contamination rates of 20% and 40%, respectively. In those reports, P < 0.05; that is, there is a significant difference in the testing RMSEs between the two algorithms. From the analysis above, we conclude that the proposed ELM-MinID is robust on benchmark datasets with outliers.

V. CONCLUSION
In this paper, we proposed a robust learning algorithm for single-hidden-layer feedforward neural networks (SLFNs) called ELM under the minimum information divergence criterion (ELM-MinID), which provides a new error control method for ELM. The simulation results on function fitting with synthetic data and regression with benchmark data sets showed the superior noise tolerance and stable regression performance of the proposed method.

APPENDIXES
The information divergence between e and e (d) can be written as