Neural Networks Regularization With Graph-Based Local Resampling

This paper presents the concept of graph-based local resampling of perceptron-like neural networks with random projections (RN-ELM), which aims at regularization of the yielded model. The addition of synthetic noise to the learning set bears some similarity to the data augmentation approaches currently adopted in many deep learning strategies. With the graph-based approach, however, it is possible to direct resampling to the margin region instead of exhaustively covering the whole input space. The goal is to train neural networks with noise added in the margin region, which is located by structural information extracted from a planar graph. The so-called structural vectors, which are the training set vertices near the class boundary, are obtained from the structural information of the Gabriel Graph. Synthetic samples are added to the learning set around the structural vectors, improving generalization performance. A mathematical formulation showing that the addition of synthetic samples has the same effect as Tikhonov regularization is presented. Friedman and post-hoc Nemenyi tests indicate that outcomes from the proposed method are statistically equivalent to the ones obtained by objective-function regularization, implying that both methods yield smoother solutions, reducing the effects of overfitting.


I. INTRODUCTION
Many efforts have been made in recent decades to represent the learning problem of Single Hidden Layer Feedforward Networks (SLFN) [1]-[3] with convex formulations and to avoid the burden of iterative gradient descent on complex objective functions. Higher-dimensional random projection to the hidden layer is one approach to convexification [4]. Random projection methods rely on the principles of Cover's Theorem [5] to ensure that the projected data results in a linear problem that can be treated from a convex optimization perspective. Such an approach became popular in recent years under the framework of the Extreme Learning Machine (ELM) [6]-[9], a two-layer perceptron with a large random expansion in the hidden layer.
A large number of hidden neurons may, however, lead to neural networks with a far higher capacity than required to solve the problem [10]. The network then becomes overspecialized on the training samples, which may result in overfitting, and loses its generalization ability. Regularization methods aim at solving the overfitting problem by smoothing the separation surface, thus leading to improved performance on unseen data. A regularized surface can be achieved by combining the objective function with a penalization term, as shown in (1) [11]:

J = E + λΩ    (1)
where E is the training set error, λ is a regularization parameter and Ω is a model complexity penalty function. Functions E and Ω have conflicting behavior [12], so trading them off by selecting a proper value of λ is essential to achieve a more general model. Regularization methods have been proposed and applied to control the smoothness of the approximating function in many application problems. A common choice for the smoothness function is the squared norm of the weight vector [13], as presented in (2), known as Tikhonov regularization, L2 penalty or Ridge Regression [14]:

J = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² + λ Σ_{j=1}^{L} w_j²    (2)
where N is the number of samples of the training set and L is the number of hidden layer neurons of the network.
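As a concrete illustration of (2), the L2-penalized least-squares problem has the closed-form solution w = (HᵀH + λI)⁻¹Hᵀy. The NumPy sketch below (function and variable names are ours, not from the paper) shows how increasing λ shrinks the weight norm, trading training error for smoothness:

```python
import numpy as np

def ridge_weights(H, y, lam):
    """Closed-form Tikhonov / ridge solution: w = (H^T H + lam*I)^(-1) H^T y."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

# Toy data: a larger lambda trades training error for a smaller weight norm.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 20))
y = rng.normal(size=50)
assert np.linalg.norm(ridge_weights(H, y, 100.0)) < np.linalg.norm(ridge_weights(H, y, 1e-3))
```

Selecting λ is precisely the trade-off discussed above: λ → 0 recovers ordinary least squares, while large λ forces the weights toward zero.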
Since ELMs tend to be oversized, works in the literature apply regularization to control the smoothness of the resulting approximation function. Deng [15] proposed the Regularized Extreme Learning Machine (ELM-REG), which implements (2). The Optimally Pruned ELM proposed by Miche [16] achieves regularization by employing multiresponse sparse regression and leave-one-out cross-validation to remove the least relevant neurons. The Tikhonov Regularized OP-ELM [17] combines L1 and L2 penalties in order to generate a regularized separation surface. The L2,1-norm based Online Sequential Extreme Learning Machine proposed by Preeti [18] creates an iterative bi-objective optimization algorithm to solve an L2,1-norm-based minimization problem, dealing with real-time sequential data. All of these approaches require that λ be provided prior to training.
Silvestre et al. [19] proposed a method for parameter-free regularization of extreme learning machines that uses only an affinity matrix obtained from the training samples, leading to the same Tikhonov regularization effect. Araujo [20] proposed a method for automated parameter selection, based on the linear separability of the projected data, that requires neither user-defined parameters nor cross-validation. Both approaches rely on the quality and representativeness of the training samples in order to be effective.
Another regularization approach, which focuses neither on the dataset structure nor on presetting hyperparameters, was given by Bishop [21]: it consists of adding noise to the training set and is shown to be equivalent to Tikhonov regularization. Training with noise has been applied to deep neural networks, under the framework of data augmentation and dropout regularization [14], [22], and also as noise injection in hidden units to yield stochastic behavior that exploits a probabilistic formulation for optimization [23]. Furthermore, this approach has also been used with generative adversarial networks (GAN) applied to machinery fault diagnosis, adding synthetic samples based on the distribution of the original samples to avoid training with imbalanced data [24]. The most widely used data augmentation strategy for deep neural networks consists of randomly applying operations such as rotations, cropping and mirroring to images. While leading to improvements in performance, this strategy has been further explored with more specific approaches: dataset augmentation in feature space is explored by DeVries and Taylor [25], a multiphase blending method is proposed by Quan et al. [26] and Lemley et al. [27] use a GAN to generate samples with features that improve performance. Adding local noise to border regions can also be viewed as analogous to boosting, a robust machine learning approach that combines multiple weak classifiers and assigns higher relevance to patterns located near the border regions [28]. Boosting has already been applied to ELM learning, especially in problems involving imbalanced data [29].
Resampling over an unconstrained input space can be prohibitive, particularly in the higher dimensions typical of current applications. Local resampling in the margin region, however, may yield the same regularization effects without the need to exhaustively cover the whole input space. The margin region can be identified by considering the geometry of the dataset as captured by proximity graphs such as the Gabriel Graph (GG) [30]. Torres et al. [31] proposed a geometric approach that uses the GG to build a large margin classifier based on the edges between points of different classes, defining a boundary region. These edges can be used to define the border region for local resampling.
The proposed method explores both the geometric information of the dataset and the regularization effect obtained when training with noise. Structural information is extracted from the GG in the form of structural vectors (SVs), which are vectors that share edges with vertices from the opposite class [31], [32]. The SVs are then used to generate synthetic noise samples in the separation region in order to smooth the decision surface. It is shown that the generated noise samples lead to a Tikhonov regularization effect. These noise samples are added to the training set and used to train an ELM.
The proposed method, Regularization with Noise of Extreme Learning Machines (RN-ELM), is compared to the standard ELM algorithm and to ELM-REG on 18 real-world datasets. The datasets differ in size, dimension, class overlap and imbalance. The norm of the weights and the accuracy are used to evaluate the models. Statistical tests show a significant difference between RN-ELM and the standard ELM, which indicates that local resampling in the border region yields the expected regularization effect. Furthermore, no significant statistical difference was observed between RN-ELM and ELM-REG, which reinforces the expected regularization behavior of training with the resampling approach. It is also formally shown in this paper that local resampling in convex networks is equivalent to Tikhonov regularization. By combining different concepts, such as training with noise, data augmentation and graph-based margin resampling, this paper adds to the formal proofs of regularization a new perspective on neural network training.
This paper is organized as follows: Section II presents a review of the relevant literature, Section III explains the proposed method, Section IV details the experimental setup, Section V presents the results, and Section VI concludes the work.

II. THEORETICAL BACKGROUND

A. GRAPH-BASED STRUCTURAL INFORMATION
Computational Geometry methods allow dataset patterns to be represented by a planar structure. One example is the Gabriel Graph (GG) [30], a planar connected graph built from the geometric information of a dataset x ∈ R^m, in which vertices x_i and x_j are connected by an edge if and only if inequality (3) holds for every other vertex x_k:

δ²(x_i, x_j) ≤ δ²(x_i, x_k) + δ²(x_j, x_k), ∀ x_k, k ≠ i, j    (3)

where δ(.) is the Euclidean distance between vectors.
The edge (x_i, x_j) defined by inequality (3) is represented in Fig. 1, whereas Fig. 2 shows two edges that do not satisfy the inequality. Given the GG edges of a particular dataset, those connecting samples from different classes can be considered for generating synthetic samples in the border region. Fig. 3 depicts the GG of a two-moons problem and the synthetic samples added to the border region.
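The edge test in inequality (3) can be implemented directly by pairwise squared-distance comparisons. The sketch below (illustrative names and a brute-force O(n³) check, not necessarily the paper's implementation) builds the GG edge list and extracts the vertices that share an edge with the opposite class:

```python
import numpy as np
from itertools import combinations

def gabriel_edges(X):
    """Gabriel Graph edges: (i, j) is an edge iff, for every other point k,
    d2(i, j) <= d2(i, k) + d2(j, k), i.e. inequality (3) holds."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    n = len(X)
    edges = []
    for i, j in combinations(range(n), 2):
        if all(d2[i, j] <= d2[i, k] + d2[j, k] for k in range(n) if k not in (i, j)):
            edges.append((i, j))
    return edges

def structural_vectors(X, y):
    """Indices of vertices that share a GG edge with the opposite class."""
    sv = set()
    for i, j in gabriel_edges(X):
        if y[i] != y[j]:
            sv.update((i, j))
    return sorted(sv)
```

For example, for four collinear points labeled [0, 0, 1, 1], only the two middle points touch an edge between classes, so they are the structural vectors.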
In the separation region, the expected value E[y|x] tends to the value at which the effects of labels from opposite classes on the error function are balanced. According to Geman et al. [33], the function f(x, w) that approximates E[y|x] minimizes the general approximation error. Local resampling with equally probable labels from opposite classes in this region tends to bring f(x, w) closer to E[y|x], thus minimizing the approximation error. Once synthetic samples are added to the border region, models with a large number of neurons are less likely to overfit, since the new samples smooth the separation surface.

B. EXTREME LEARNING MACHINE AND REGULARIZATION
ELM is a learning algorithm for SLFNs [6] based on the random projections approach. It is easily implemented: the biases and weights of the hidden layer are randomly assigned, and the weights of the output layer are determined by a generalized inverse matrix.
Given a training set {(x_i, y_i) | x_i ∈ R^m, y_i ∈ R, i = 1, . . . , N}, the network output is given by (4):

ŷ_i = Σ_{j=1}^{L} w_j g(v_j^T x_i + b_j)    (4)

where L is the number of hidden layer neurons, g(.) is the activation function, v_j = [v_j1, v_j2, . . . , v_jm]^T is the weight vector that connects the input to the j-th hidden neuron and w_j is the weight that connects the j-th hidden neuron to the output. Finally, b_j is the bias term of the j-th hidden neuron. For a SLFN with L hidden layer neurons, which is capable of approximating a function from N samples, there exist w_j, v_j and b_j such that (5) is satisfied:

Σ_{j=1}^{L} w_j g(v_j^T x_i + b_j) = y_i, i = 1, . . . , N    (5)

which can be written compactly as (6):

Hw = y    (6)

where H is the N × L hidden layer output matrix with entries h_ij = g(v_j^T x_i + b_j), w = [w_1, . . . , w_L]^T and y = [y_1, . . . , y_N]^T. The output weight vector w is calculated using the Moore-Penrose [34] pseudoinverse:

w = H⁺ y    (7)

Regularization can be employed to smooth the effects of overfitting in oversized networks, as in ELM-REG [35]. Equation (8) shows the expressions for obtaining the weight vector for N smaller than L and vice versa:

w = H^T (HH^T + I/C)^{-1} y,  if N < L
w = (H^T H + I/C)^{-1} H^T y,  if N ≥ L    (8)
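A minimal ELM of this form can be sketched in a few lines of NumPy. The class below is illustrative (names and defaults are ours): the hidden layer weights are drawn at random and never trained, and the output weights come from the pseudoinverse, as in (7).

```python
import numpy as np

class ELM:
    """Minimal ELM sketch: random hidden layer as in (4), output weights
    from the Moore-Penrose pseudoinverse as in (7)."""
    def __init__(self, n_hidden=100, seed=0):
        self.L = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.V + self.b)   # h_ij = g(v_j^T x_i + b_j)

    def fit(self, X, y):
        m = X.shape[1]
        self.V = self.rng.uniform(-0.5, 0.5, (m, self.L))  # random input weights v_j
        self.b = self.rng.uniform(-0.5, 0.5, self.L)       # random biases b_j
        self.w = np.linalg.pinv(self._hidden(X)) @ y       # w = H+ y, equation (7)
        return self

    def predict(self, X):
        return np.sign(self._hidden(X) @ self.w)

# Toy usage: two Gaussian blobs labeled +1 / -1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.5, (30, 2)), rng.normal(-2.0, 0.5, (30, 2))])
y = np.concatenate([np.ones(30), -np.ones(30)])
model = ELM(n_hidden=100, seed=0).fit(X, y)
```

With L larger than N, the pseudoinverse interpolates the training set, which is exactly the oversized regime where overfitting becomes a concern.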
Regularization effects can also be achieved by resampling, as shown by Bishop [21]. In his original work, he demonstrates that the addition of small-amplitude synthetic noise leads to a penalty term equivalent to that of the regularized sum-of-squares error. The method proposed in this paper is based on the same principle, but with GG-based local resampling in the border region.

III. PROPOSED METHOD
Structural information is extracted from a planar graph (GG), as defined in Section II-A [36]. The addition of synthetic patterns to the training set can be seen in Fig. 4. In this example, two classes sampled from Gaussian distributions are represented as empty and filled circles. Synthetic samples are triangles and upside-down triangles. The geometric vectors and mean points are indicated by thick black circles and an ''x'' mark, respectively. The general expression for the sum of squared errors (SSE) is given in (9):

E = Σ_{i=1}^{N_1} (y_i − f(x_i, w))²    (9)
where N_1 is the number of samples in the training set. Once the synthetic samples are added, the error function can be rewritten as in (10):

E = Σ_{i=1}^{N_1} (y_i − f(x_i, w))² + Σ_{k=1}^{N_2} (v_k − f(r_k + ε_k, w))²    (10)

where N_2 is the number of samples added to the training set, (r_k + ε_k) is the k-th synthetic sample, composed of r_k selected from the geometric vectors, ε_k is a random noise term and v_k is the label of the sampled pattern, which is the same as that of r_k. The additional term in (10) shifts the solution in the direction of the added synthetic samples (r_k + ε_k), so the larger N_2 is, the more influential the synthetic samples become.
Considering that the conditions of linear separability are met by the ELM hidden layer projection, the function f(h_i, w) = w^T h_i, and consequently f(r_k + ε_k, w) = w^T (r_k + ε_k), is considered, leading to (11):

E = Σ_{i=1}^{N_1} (y_i − w^T h_i)² + Σ_{k=1}^{N_2} (v_k − w^T (r_k + ε_k))²    (11)

which can be rewritten in matrix form as (12):

E = (y − Hw)^T (y − Hw) + (v − (R + E)w)^T (v − (R + E)w)    (12)
The output weights that describe the separation hyperplane are obtained by differentiating (11) with respect to w and equating the result to zero.
Training set and target values are represented by matrix H and vector y.
The noisy synthetic samples are represented by matrix P, which is the sum of matrices R (composed of the reference vectors) and E (random noise). The target values for these samples are given in vector v.
The matrix corresponding to the regularization term, obtained by expanding P^T P, is expressed in (24):

P^T P = R^T R + R^T E + E^T R + E^T E    (24)
Finally, substituting (24) into (18) leads to the new weight update equation shown in (25):

w = (H^T H + R^T R + R^T E + E^T R + E^T E)^{-1} (H^T y + (R^T + E^T) v)    (25)
Thus, (25) shows the new least squares weight update equation when local resampling is applied. As can be observed in (24), the regularization term is composed of the resampling terms E and R, leading to smoothing effects on the separation surface. In the absence of noise, E and R are null and (25) reduces to the standard least squares solution (7).
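Under these definitions, the update (25) is a one-line linear solve. The sketch below (the helper name is ours) assumes H, R and E are already expressed in the same feature space, as in the derivation above; with R = E = 0 it reduces to the ordinary least-squares solution.

```python
import numpy as np

def rn_elm_weights(H, y, R, E, v):
    """Weight update (25) with local resampling (symbols as in the derivation):
    w = (H^T H + R^T R + R^T E + E^T R + E^T E)^(-1) (H^T y + (R^T + E^T) v)."""
    P = R + E                    # noisy synthetic samples, P = R + E
    A = H.T @ H + P.T @ P        # P^T P expands to the four terms of (24)
    b = H.T @ y + P.T @ v
    return np.linalg.solve(A, b)
```

Setting R and E to zero matrices recovers the normal-equations solution (HᵀH)⁻¹Hᵀy, matching the reduction to (7) noted above.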
Suppose N_2 = N_p + N_n, where N_p is the number of positively labeled synthetic samples and N_n the number of negatively labeled ones. The term (R^T + E^T)v can then be expanded as (26):

(R^T + E^T)v = Σ_{k=1}^{N_p} (r_pk + ε_pk)(+1) + Σ_{k=1}^{N_n} (r_nk + ε_nk)(−1)    (26)

Assume that the reference vector is the same for both classes, i.e., (27):

r_pk = r_nk = r    (27)

If the sampled set is balanced, i.e., N_p = N_n = M, the reference vectors cancel out, leading to (28):

(R^T + E^T)v = Σ_{k=1}^{M} ε_pk − Σ_{k=1}^{M} ε_nk    (28)

If the synthetic noise is generated according to Gaussian distributions, ε_p ∼ N(µ = 0, σ²) and ε_n ∼ N(µ = 0, σ²), the differences d_k = ε_pk − ε_nk are also zero-mean Gaussian, as in (29):

d_k ∼ N(0, 2σ²)    (29)

Finally, according to the Law of Large Numbers [37], since the variables d_k are independent and identically distributed, when M → ∞,

(1/M) Σ_{k=1}^{M} d_k → 0    (30)

The result shown in (30) proves that, for a sufficiently large number of generated samples, the term (R^T + E^T)v in (25) vanishes and (25) can be rewritten as (31):

w = (H^T H + R^T R + R^T E + E^T R + E^T E)^{-1} H^T y    (31)
Analyzing (31), it can be seen that the addition of a sufficient number of synthetic samples is equivalent to Tikhonov regularization. In order to avoid the need for an asymptotically large number of samples, symmetry can be enforced by having each sample generate a mirrored counterpart, with the same label, reflected about r, as defined by (32) and (33):

x_k = r + ε_k    (32)
x'_k = r − ε_k    (33)

IV. EXPERIMENTS
The experiments were performed on binary classification problems. The first one was carried out on the two-moons synthetic problem for visualization purposes. The second was performed on real-world datasets and compared to the standard regularized ELM [35].

A. TWO-DIMENSIONAL PROBLEM
In order to compare the standard ELM with RN-ELM, both methods were applied to the two-moons dataset. The ELM separation surface with 500 hidden layer neurons results in overfitting, as shown in Fig. 5, whereas the RN-ELM separation surface with the same number of hidden neurons after regularization with local resampling is shown in Fig. 6. The addition of noise samples to the training set leads to regularization and to smaller values of the weight norm.

(qsr), Sonar (snr), Statlog Heart (sth)), six datasets were obtained from the KEEL Repository [39] (Appendicitis (apd), Bupa (bpa), Ecoli1 (ec1), Haberman (hbm), Monk2 (mk2), Breast Cancer Wisconsin Original (wcs)), and, finally, Golub (glb) [40] and Hess (hes) [41]. All datasets are binary classification problems. Instances containing missing values were discarded. Table 1 summarizes the dataset sizes and numbers of variables. The performance of the proposed method (RN-ELM) was compared to two other ELM learning approaches. Input data were standardized to zero mean and unit standard deviation and the outputs were assigned +1 and −1 labels. The hyperbolic tangent was used as the hidden neuron activation function, with initial weights sampled from a uniform distribution within the interval [−0.5, 0.5]. The numbers of hidden layer neurons were 10, 30, 100, 500, and 1000 [19]. Each dataset was randomly split into training and test sets with a 70%/30% ratio. The number of samples for resampling was obtained by 10-fold cross-validation within the range n = {1 . . . 10}. Local resampling was drawn from a normal distribution with standard deviation defined by (35), which guarantees that all data lies within three standard deviations of the mean.
where D is the distance between border vertices of opposite classes. For ELM-REG, the regularization parameter was selected within the range C = {2^−24, . . . , 2^15} [19] with 10-fold cross-validation [35]. For each dataset, three ELM training methods were compared (ELM, ELM-REG, and RN-ELM). Overall performance was assessed by comparing mean accuracy (Table 3) and the weight norm ||w|| (Table 4). For each network configuration, average values were obtained over 30 trials. Finally, the Friedman and post-hoc Nemenyi tests were adopted for comparing multiple models over multiple domains [19], [42].
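This testing protocol can be reproduced with SciPy. The sketch below uses a made-up accuracy table (the numbers are illustrative, not the paper's results): the Friedman test checks for any difference among the three methods, and the Nemenyi critical difference CD = q_α √(k(k+1)/(6N)), with q_0.05 = 2.343 for k = 3 classifiers (from Demšar's tables), decides which pairs differ.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Illustrative accuracy table (made-up numbers): rows = datasets,
# columns = (ELM, ELM-REG, RN-ELM).
acc = np.array([
    [0.80, 0.85, 0.84],
    [0.70, 0.75, 0.76],
    [0.90, 0.92, 0.91],
    [0.65, 0.70, 0.69],
    [0.88, 0.90, 0.89],
    [0.77, 0.80, 0.81],
])

# Friedman test over the three methods (one measurement per dataset).
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)).
N, k = acc.shape
mean_ranks = np.mean([rankdata(-row) for row in acc], axis=0)  # rank 1 = best
cd = 2.343 * np.sqrt(k * (k + 1) / (6.0 * N))
# Two methods are statistically indistinguishable at alpha = 0.05 if their
# mean ranks differ by less than cd.
```

This mirrors the comparison reported in the results: methods whose mean ranks are less than one CD apart cannot be declared different.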

V. RESULTS
RN-ELM was compared with the standard ELM and ELM-REG on 18 real-world classification datasets. As expected, regularization of networks with a small number of hidden neurons (10 or 30) did not lead to better results: standard ELM accuracy was within one standard deviation of the other two methods and even performed better for the apd, bcr, bpa, ec1, ion, hes, mk2, pks, qsr, snr, sth, and wcs datasets. However, when the number of hidden neurons is 100 or greater, regularization plays a major role, as expected: for most datasets, ELM-REG and RN-ELM achieved better results than ELM. Furthermore, the proposed method led to results similar to those of ELM-REG, and both were within one standard deviation of each other, except for the apd, mk2, and pks datasets. The only dataset for which regularization did not improve results, with outcomes mostly within one standard deviation, was snr. The mean accuracies obtained are presented in Table 3.
For most datasets, especially for a large number of hidden neurons (L = 500 and L = 1000), RN-ELM and ELM-REG were both capable of reducing the norm of the weight vector (used as a measure of network complexity), indicating that the network outputs are smoothed. These results can be seen in Table 4. The proposed method (RN-ELM) achieved performance similar to that of ELM-REG in terms of mean accuracy and weight norm, which indicates that local resampling also leads to regularization. In order to compare the three classifiers over the multiple datasets evaluated, the Friedman test and the Nemenyi post-hoc test were used [42], [43]. The results of the Friedman test can be seen in Table 2. For a significance level α = 0.05, the null hypothesis of equality between ELM training approaches can be rejected for L = {10, 30, 100, 500, 1000}. The Nemenyi post-hoc test was applied next, yielding the results summarized in Figs. 7 to 11. It can be seen that, for mean accuracy values, RN-ELM is not statistically different from ELM-REG: in all cases, even though ELM-REG is ranked higher, both methods are less than one Critical Difference (CD) apart. Due to the lack of standardization in running experiments and comparing model performances on different datasets, the 10-fold cross-validation adopted in this work seems to provide the most general methodology for benchmarking and comparing different models in the literature. Although there is no guarantee that the folds were exactly the same across different publications, the statistical properties of 10-fold cross-validation yield a reliable approximation of the overall performance [45]. The results presented in the benchmarking paper by Gestel et al. [46], which adopted 10-fold cross-validation and grid search, are very similar to the ones obtained in this work.
For instance, the performances obtained by those authors on the following datasets were: aca (87.00%), bpa (70.20%), ion (96.00%), pid (76.80%), snr (73.10%), sth (84.70%) and wbs (96.40%). As can be observed in Table 3, the results are quite close, which may suggest that, since grid search was also adopted by Gestel et al. [46], the outcomes obtained with RN-ELM may provide a reliable general approximation without the need for an exhaustive search of the parameter space. Comparisons with results from papers that did not adopt 10-fold cross-validation also suggest that the performance obtained with RN-ELM on the remaining datasets is within the range reported in the literature [47]-[50].

VI. CONCLUSION
It has been shown formally in this paper that ELM training with local resampling leads to Tikhonov regularization. This outcome follows Bishop's developments from the mid-1990s [21]; however, the graph-based local resampling approach presented here directs the separation function to the margin region, without the need to exhaustively cover the whole input space.
The results presented in this paper also show that such an approach reduces the norm of the weights, indicating that the method yields smoother solutions, reducing the effects of overfitting. Performance metrics also indicate that the outcomes are statistically equivalent to the ones obtained by ELM-REG with regularization parameters selected by cross-validation.
Directed resampling with the graph-based approach may reduce costly input-space exploration in higher-dimensional problems involving data augmentation. Although the graph needs to be generated in order to locate the resampling region, it is based on pairwise information, so its construction can be fully parallelized.
VÍTOR M. HANRIOT is currently pursuing the bachelor's degree in control and automation engineering with the Universidade Federal de Minas Gerais (UFMG). In his final semester, he is also an Intern with the Computational Intelligence Laboratory (LITC), UFMG.
ANTONIO P. BRAGA received the B.Sc. degree in electrical engineering and the master's degree in computer science from the Universidade Federal de Minas Gerais (UFMG), Brazil, in 1987 and 1991, respectively, and the Ph.D. degree in electrical engineering, in the area of recurrent neural networks, from the University of London, Imperial College, in 1995. Since 1991, he has been with the Electronics Engineering Department, Universidade Federal de Minas Gerais (UFMG), where he is a Full Professor and the Head of the Computational Intelligence Laboratory. He is also an Associate Researcher of the Brazilian National Research Council. As a Professor and a Researcher, he has coauthored many books, book chapters, journal articles, and conference papers. He has served on the program committees of many international conferences and was the Program Co-Chair for IJCNN 2018. He was also an Associate Editor of many international journals, including Engineering Applications of Artificial Intelligence, Neural Processing Letters, International Journal of Computational Intelligence and Applications, and IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.