An Improved Extreme Learning Machine for Imbalanced Data Classification

In the field of machine learning, the Extreme Learning Machine (ELM) has been widely used in classification and regression tasks. However, like many traditional machine learning algorithms, ELM often yields unsatisfactory classification results when facing imbalanced data. For this reason, we propose an extreme learning machine algorithm with output weight adjustment, called OWA-ELM, which shifts the decision boundary of ELM toward the majority class and improves the classification performance on imbalanced data. Specifically, we add a reasonable increment Δ to the connection weights between the hidden layer neurons and the minority output neuron in ELM, so that the output value of the minority output neuron increases. As a result, the classification accuracy of the minority class samples is improved without significantly affecting the classification of the majority class samples. The performance of OWA-ELM was compared with that of ELM, WELM, CS-ELM and CCR-ELM. In experiments on 22 data sets, the OWA-ELM algorithm achieved the best result 9 times and the second-best result 4 times on G-mean, and the best result 13 times and the second-best result once on F-measure. Therefore, the OWA-ELM algorithm is effective for imbalanced data classification.


I. INTRODUCTION
In practical classification applications of machine learning, data sets are often imbalanced [1]. The classification problem of imbalanced data is particularly prominent in many fields, such as computer vision [2][3], medical science [4], information security [5] and industry [6]. When traditional classification methods deal with such data sets, the classification results tend to be biased towards the majority class, and the minority class is easily ignored [7]. This is because these classifiers generally assume that the distribution of samples is balanced [8].
Extreme Learning Machine (ELM) is a single hidden layer feedforward neural network [9][10]. The input weights and the thresholds of the hidden layer neurons are randomly generated and do not need to be adjusted during the learning process. Moreover, ELM does not use the backpropagation algorithm, which has high time complexity; instead, it obtains the unique optimal solution analytically. Therefore, ELM has the advantages of fast learning speed and good generalization performance. It has been widely used in data classification problems in various fields, such as face recognition [11][12], military applications [13], image processing [14][15] and medical diagnosis [16][17]. However, just like traditional machine learning algorithms, ELM was not designed to classify imbalanced data, which leads to poor classification performance on imbalanced data [18]. When ELM is faced with imbalanced data sets, the classification results tend to favor the majority classes, and the minority classes are easily ignored.
In order to overcome the shortcomings of ELM in classifying imbalanced data, Zong et al. [19] proposed the weighted extreme learning machine (WELM) algorithm based on the idea of cost sensitivity. Before the training process of WELM, the samples are given weights, and the weight of the minority samples is greater than the weight of the majority samples, so as to improve the classification accuracy of the minority samples. In practical applications, one of two weighting schemes can be selected according to the sample distribution. Yu et al. [20] proposed the label-weighted extreme learning machine (LW-ELM) algorithm to improve WELM. LW-ELM has a faster training speed on large-scale data sets because it eliminates a large-matrix multiplication operation; the authors also provided two weighting schemes of their own. Xiao et al. [21] proposed the class-specific cost regulation extreme learning machine (CCR-ELM).
CCR-ELM improves the classification accuracy on imbalanced data by modifying the optimization function of the ELM algorithm and introducing separate regularization parameters for different classes as a balance between structural risk and empirical risk. In order to overcome the shortcomings of CCR-ELM and reduce the number of parameters, Raghuwanshi et al. [22] proposed the CS-ELM algorithm. Compared with CCR-ELM, CS-ELM has one parameter fewer and avoids the degradation to ELM that occurs in CCR-ELM when its two regularization parameters have different orders of magnitude. In order to solve the problem of data imbalance in online learning, Mao et al. [23] proposed a new online sequential extreme learning machine method with a two-stage hybrid strategy. In the offline phase, principal curve and database techniques are used to model the data. In the online phase, they proposed a leave-one-out cross-validation method using the Sherman-Morrison matrix inversion formula to handle online imbalanced data. Meanwhile, an add-delete mechanism is used to update the network weights.
Considering the shortcomings of the ELM algorithm and the properties of imbalanced data, an improved extreme learning machine with output weight adjustment (OWA-ELM) is proposed. The proposed algorithm improves the classification accuracy of the minority class by adding a reasonable increment Δ to the connection weights between the hidden layer neurons and the minority output neuron in ELM.
The remainder of this paper is organized as follows. Section II introduces the related algorithms. The proposed method is presented in Section III. Section IV provides the experimental results and analysis. Finally, the conclusion is given in Section V.

II. RELATED WORK
In this section, ELM, WELM and CCR-ELM are briefly introduced.

A. EXTREME LEARNING MACHINE
Extreme learning machine (ELM) [9][10] is a single hidden layer feedforward neural network. Its structure is shown in Figure 1. As mentioned above, in ELM the input weights and the biases of the hidden layer neurons are randomly generated and do not need to be changed during the training process. The output weight matrix can be calculated directly and does not need to be adjusted by the backpropagation algorithm. Therefore, ELM has the advantages of fast training speed and good generalization performance.
Let {(x_i, t_i)} (i = 1, 2, …, N) represent the N training samples, where x_i = [x_i1, x_i2, …, x_in]^T is the feature vector and t_i = [t_i1, t_i2, …, t_im]^T is the true label vector of the i-th training sample; the superscript T represents the transposition of a matrix or a vector. Let n represent the number of neurons in the input layer, which is also the number of features of the samples. Let m represent the number of neurons in the output layer, which is also the number of classes in the data set. Let L denote the number of neurons in the hidden layer. Then the input weights can be expressed as w_j = [w_j1, …, w_jn]^T, j = 1, 2, …, L, and the bias vector of the neurons in the hidden layer can be expressed as b = [b_1, …, b_L]. The output of the i-th training sample in the hidden layer is as follows:

h(x_i) = [g(w_1 · x_i + b_1), …, g(w_L · x_i + b_L)]    (1)

where g(·) is the activation function. The output matrix of the hidden layer composed of all training samples is denoted as H = [h(x_1)^T, h(x_2)^T, …, h(x_N)^T]^T, with dimension N × L. Let β_ij denote the connection weight between the hidden layer neuron B_i and the output neuron O_j; then the output weight matrix, with dimension L × m, can be expressed as follows:

β = [β_1, β_2, …, β_m],  β_j = [β_1j, β_2j, …, β_Lj]^T    (2)

where β_j is the column of weights feeding the output neuron O_j. Collecting the true labels of all training samples gives the target matrix

T = [t_1, t_2, …, t_N]^T    (3)

The optimization function of ELM is as follows:

Minimize: L_ELM = (1/2)‖β‖^2 + (C/2) Σ_{i=1}^{N} ‖ξ_i‖^2
Subject to: h(x_i)β = t_i^T − ξ_i^T, i = 1, …, N    (4)

where ξ_i = [ξ_i1, …, ξ_im]^T is a vector composed of the training errors of sample x_i at the m output nodes, and C is the regularization parameter. According to the KKT theorem, the solution of (4) is as follows:

β = H^T (I/C + H H^T)^{−1} T    (5)

where I is a unit matrix with appropriate dimensions. For any given test sample x, the final prediction output is:

f(x) = h(x)β    (6)

and the predicted class label is the output neuron with the largest output value:

label(x) = arg max_{j ∈ {1, …, m}} f_j(x)    (7)

B. WEIGHTED EXTREME LEARNING MACHINE
Here, the weighted extreme learning machine (WELM) [19] is introduced. Each training sample is assigned a weight. In the diagonal matrix W, W_ii corresponds to the weight assigned to the sample x_i. The weight of the minority class samples is greater than the weight of the majority class samples, which makes the classifier pay more attention to the minority class samples. The authors provide two sample weighting schemes.

The first weighting scheme W1:

W_ii = 1 / #(t_i)    (8)

The second weighting scheme W2:

W_ii = 0.618 / #(t_i) if #(t_i) > AVG(t_i), and W_ii = 1 / #(t_i) otherwise    (9)

where #(t_i) represents the number of instances of the i-th class, i = 1, …, m, and AVG(t_i) is the average number of samples of each class in the training set.
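As an illustration, the two weighting schemes (8) and (9) could be implemented as in the following numpy sketch; the helper name welm_weights and the use of integer class labels are our assumptions, not part of [19]:

```python
import numpy as np

def welm_weights(y, scheme="W1"):
    """Diagonal weights W_ii for WELM from integer class labels y.

    W1: W_ii = 1 / #(t_i)
    W2: W_ii = 0.618 / #(t_i) if class t_i is larger than average,
        else 1 / #(t_i)
    """
    classes, counts = np.unique(y, return_counts=True)
    count_of = dict(zip(classes, counts))
    avg = counts.mean()                    # AVG(t_i): average class size
    w = np.empty(len(y))
    for i, label in enumerate(y):
        c = count_of[label]
        if scheme == "W2" and c > avg:     # shrink weights of majority classes
            w[i] = 0.618 / c
        else:
            w[i] = 1.0 / c
    return np.diag(w)                      # N x N diagonal matrix W
```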
The optimization function of WELM is as follows:

Minimize: L_WELM = (1/2)‖β‖^2 + (C/2) Σ_{i=1}^{N} W_ii ‖ξ_i‖^2
Subject to: h(x_i)β = t_i^T − ξ_i^T, i = 1, …, N    (10)

According to the KKT theorem, the solution of (10) is as follows:

β = H^T (I/C + W H H^T)^{−1} W T    (11)
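For concreteness, a minimal numpy sketch of the closed-form solutions (5) and (11) might look as follows, assuming a sigmoid activation and a one-hot target matrix T; the function names are illustrative, not from the original papers:

```python
import numpy as np

def elm_train(X, T, L, C, rng=None):
    """Plain ELM: random hidden layer, then beta from (5)."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    Win = rng.standard_normal((L, n))            # random input weights w
    b = rng.standard_normal(L)                   # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ Win.T + b)))   # N x L sigmoid hidden output
    N = X.shape[0]
    # beta = H^T (I/C + H H^T)^(-1) T, as in (5)
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return Win, b, beta

def welm_beta(H, T, W, C):
    """WELM output weights: beta = H^T (I/C + W H H^T)^(-1) W T, as in (11)."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + W @ H @ H.T, W @ T)

def elm_predict(X, Win, b, beta):
    """Prediction rule (6)-(7): pick the output neuron with the largest value."""
    H = 1.0 / (1.0 + np.exp(-(X @ Win.T + b)))
    return np.argmax(H @ beta, axis=1)
```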

C. CLASS-SPECIFIC COST REGULATION EXTREME LEARNING MACHINE
Class-specific cost regulation extreme learning machine (CCR-ELM) [21] was proposed by Xiao et al. in 2017, with the aim of enabling ELM to classify imbalanced data better. Its main feature is that it introduces separate regularization parameters for the loss costs of the different classes, as can be seen from its optimization function:

Minimize: L_CCR = (1/2)‖β‖^2 + (C+/2) Σ_{i=1}^{N+} ‖ξ_i‖^2 + (C-/2) Σ_{j=1}^{N-} ‖ξ_j‖^2
Subject to: h(x_i)β = t_i^T − ξ_i^T, i = 1, …, N    (12)

In (12), C+ and C- are the class-specific cost regulation parameters, and N+ and N- represent the number of minority samples and majority samples, respectively. The solution of the above formula is

β = H^T (I/C+ + I/C- + H H^T)^{−1} T    (13)
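Under the reconstruction of (13) above, the CCR-ELM output weights could be computed as in this short sketch (illustrative function name; H and T as in Section II):

```python
import numpy as np

def ccr_elm_beta(H, T, C_pos, C_neg):
    """CCR-ELM: beta = H^T (I/C+ + I/C- + H H^T)^(-1) T, as in (13).

    Note: if C_pos >> C_neg, then I/C_pos + I/C_neg ~ I/C_neg and the
    formula degenerates to the plain ELM solution (5) with C = C_neg.
    """
    N = H.shape[0]
    A = np.eye(N) / C_pos + np.eye(N) / C_neg + H @ H.T
    return H.T @ np.linalg.solve(A, T)
```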

III. PROPOSED METHOD
For balanced data, the output weight matrix β obtained by (5) is the optimal solution. But for imbalanced data, β may not be optimal, so ELM tends to ignore the minority classes when it is used to classify imbalanced data. In the ELM algorithm, ‖β‖ is required to be as small as possible in order to limit the structural risk of the model, since a large ‖β‖ results in weak generalization ability. However, slightly increasing each element of the column of the output weight matrix β that determines the output value of the minority output neuron may cause little additional structural risk while helping to improve the classification accuracy of minority samples. For this reason, the extreme learning machine with output weight adjustment (OWA-ELM) is proposed in this paper. By slightly increasing the corresponding column of the output weight matrix β, that is, slightly increasing the weights between the hidden layer neurons and the minority output neuron, the output of the minority output neuron is increased. Provided that the correct classification of the majority class samples is not significantly affected, increasing the output value of the minority output neuron increases the probability that minority class samples are correctly classified. In Figure 1, for any sample x, the output of neuron O_i is calculated as follows:

O_i(x) = Σ_{l=1}^{L} H_l β_li    (14)

Here, H_l represents the output of the hidden layer neuron B_l, l = 1, …, L. Assume that the output neuron O_i corresponds to the minority class. Then, according to (7), for a sample x, when the output value of O_i is greater than that of the other output neurons, x is predicted to be of the minority class. In ELM, which pursues high overall accuracy, the output value of a minority sample x on O_i is often slightly smaller than the output value of some other output neuron; that is, the minority sample is mistakenly classified into a majority class.
According to (14), for a minority sample x, when the outputs of the hidden layer neurons H_1, H_2, …, H_L are fixed, one possible way to slightly increase its output on O_i is to slightly increase the weights β_1i, β_2i, …, β_Li, which connect the hidden neurons B_1, B_2, …, B_L to O_i. Since the proposed method adds the same increment Δ to β_1i, β_2i, …, β_Li, according to the matrix representation of the output weight β of ELM in (2), the new output weight a_β after fine-tuning is obtained by simply adding Δ to each element of the column β_i:

a_β = [β_1, …, β_{i−1}, β_i + Δ·1_L, β_{i+1}, …, β_m]    (15)

where 1_L = [1, 1, …, 1]^T is an L-dimensional all-ones vector. The training process of the OWA-ELM algorithm is described in Algorithm 1.

Algorithm 1: Training process of OWA-ELM
Input: training set, number of hidden layer neurons L, regularization parameter C, increment Δ;
1. Randomly generate the input weights w_j and the biases b of the hidden layer neurons;
2. Calculate the hidden layer output matrix H;
3. Calculate the output weight of ELM β using (5);
4. Use (15) to adjust β obtained in step 3 to obtain the adjusted output weight a_β;
Return a_β.

Because the proposed method only adds a tiny value Δ to each element in one of the columns of β, the structural risk caused by the change of β is small. Generally, Δ is much smaller than the elements it is added to, so it does not have an excessive negative impact on the neural network. Subsequent experiments also show that the OWA-ELM algorithm does not suffer from structural risks such as over-fitting, and achieves good results under the G-mean and F-measure evaluations.
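A compact sketch of Algorithm 1, reusing the hypothetical elm_train helper from the Section II sketch; minority_col denotes the index of the minority output neuron and is assumed to be known:

```python
import numpy as np

def owa_elm_train(X, T, L, C, delta, minority_col, rng=None):
    """OWA-ELM (Algorithm 1): train a plain ELM, then add the increment
    delta to the column of beta feeding the minority output neuron (15)."""
    Win, b, beta = elm_train(X, T, L, C, rng)   # steps 1-3
    a_beta = beta.copy()
    a_beta[:, minority_col] += delta            # step 4: adjustment (15)
    return Win, b, a_beta
```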

IV. EXPERIMENTS AND ANALYSIS
The experimental environment is as follows: the CPU is an Intel Xeon E5-1620 v3 (4 cores, 3.50 GHz) and the memory is 8 GB. The operating system is Windows 10, and the software platform is MATLAB R2020a.
In order to verify the effectiveness of the OWA-ELM algorithm, it was compared with the ELM, WELM1, WELM2, CCR-ELM and CS-ELM algorithms on 22 imbalanced binary classification data sets (WELM1 and WELM2 are the WELM algorithm using weighting schemes W1 and W2, respectively).

A. PARAMETER SETTINGS
To ensure the reliability and accuracy of the experiments, each algorithm was run with ten repetitions of five-fold cross-validation on each data set, and the average and standard deviation were taken as the final result. For each algorithm, grid search was used to obtain the best parameter combination; five-fold cross-validation was also used in this step to ensure that the obtained parameter combination is the best. The search range of the number of hidden layer neurons L for all algorithms was [10, 20, …, 190, 200].
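As an illustration of this step, the search could be organized as in the sketch below; the grid for L follows the range stated above, while the grids for C and Δ and the scoring helper five_fold_gmean are hypothetical placeholders:

```python
import itertools
import numpy as np

L_grid = range(10, 201, 10)                    # hidden neurons: 10, 20, ..., 200
C_grid = [2.0 ** k for k in range(-10, 11)]    # hypothetical range for C
delta_grid = np.linspace(0.0, 0.01, 11)        # hypothetical range for delta

def grid_search(X, T, five_fold_gmean):
    """Return the (L, C, delta) combination with the best 5-fold G-mean."""
    best, best_score = None, -np.inf
    for L, C, d in itertools.product(L_grid, C_grid, delta_grid):
        score = five_fold_gmean(X, T, L, C, d)  # mean G-mean over 5 folds
        if score > best_score:
            best, best_score = (L, C, d), score
    return best, best_score
```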

B. DATASETS
The data sets used in the experiments are 22 binary data sets with different imbalance rates, downloaded from the KEEL data set repository. The detailed information of the data sets is shown in Table 1. The index used to measure the degree of data imbalance is the imbalance rate (IR), which is calculated as follows:

IR = #(N_maj) / #(N_min)    (16)

Here, #(N_maj) is the number of majority class instances and #(N_min) is the number of minority class instances. The higher the IR value, the more imbalanced the data.
C. EVALUATION METRICS
TP and TN respectively represent the number of samples of the minority class and the majority class that were correctly classified, while FP and FN represent the number of samples of the majority class and the minority class that were misclassified, respectively. Precision and Recall are defined as

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)    (17)

and the F-measure is calculated as

F-measure = (1 + α^2) · Precision · Recall / (α^2 · Precision + Recall)    (18)

In the experiments, the F1 form of the F-measure was used, that is, α in (18) was set to 1. Since the data sets used in the experiments are binary, G-mean can be expressed as:

G-mean = √(Recall × Specificity)    (19)

The calculation formula of Specificity is as follows:

Specificity = TN / (TN + FP)    (20)
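The metrics (16)-(20) follow directly from the confusion counts; a small sketch, assuming labels of 1 for the minority class and 0 for the majority class:

```python
import numpy as np

def imbalance_metrics(y_true, y_pred):
    """IR, F-measure with alpha = 1, and binary G-mean, per (16)-(20)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))   # minority correctly classified
    tn = np.sum((y_true == 0) & (y_pred == 0))   # majority correctly classified
    fp = np.sum((y_true == 0) & (y_pred == 1))   # majority misclassified
    fn = np.sum((y_true == 1) & (y_pred == 0))   # minority misclassified
    ir = np.sum(y_true == 0) / np.sum(y_true == 1)        # (16)
    precision = tp / (tp + fp) if tp + fp else 0.0        # (17)
    recall = tp / (tp + fn) if tp + fn else 0.0           # (17)
    f1 = (2 * precision * recall / (precision + recall)   # (18), alpha = 1
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0      # (20)
    gmean = np.sqrt(recall * specificity)                 # (19)
    return ir, f1, gmean
```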

D. EXPERIMENTAL RESULTS
Table 2 and Table 3 show the experimental results of each algorithm on the 22 data sets. The results marked in bold indicate the best results, while the underlined results denote the second best. From the experimental results, we can observe the following:
1. ELM cannot classify imbalanced data well, especially some highly imbalanced data. In the G-mean comparison, the results obtained by ELM on most data sets are the worst among ELM, WELM1, WELM2, CCR-ELM, CS-ELM and OWA-ELM, and under the F-measure evaluation ELM is inferior to WELM1, WELM2, CS-ELM and OWA-ELM. On data sets with large imbalance rates, such as abalone19, poker-8-9_vs_5 and winequality-red-4, the G-mean and F-measure values of ELM are even equal or close to zero. This is mainly because, like traditional classification algorithms, the extreme learning machine was not designed to handle imbalanced data, which makes it easy to ignore minority samples. However, on the shuttle-2_vs_5 data set, which also has a high imbalance rate, ELM achieved good results, as did the other algorithms. Perhaps the data distribution of this data set is relatively simple, so that all algorithms can find an appropriate decision boundary.
2. The performance of CCR-ELM on G-mean and F-measure is inferior to that of WELM1, WELM2, CS-ELM and OWA-ELM, and it is only slightly better than ELM on G-mean. This is because, in the solution of β of CCR-ELM (13), if C+ and C- are not of the same order of magnitude, for example C+ = 2^10 and C- = 2^2, then I/C+ + I/C- ≈ I/C-. As a result, the solution formula of β of CCR-ELM becomes almost the same as that of ELM.
3. The performance of the OWA-ELM algorithm in the experiments was better than that of the other algorithms, especially under the F-measure evaluation. Under the G-mean evaluation, it obtained the best result 9 times and the second-best result 2 times; under the F-measure, it obtained the best result 13 times and the second-best result once. This is due to the output weight adjustment in OWA-ELM: by adding a reasonable increment Δ to the connection weights between the hidden layer neurons and the minority output neuron, the output value of the minority output neuron is increased, so the probability of the minority samples being correctly predicted increases. This holds especially for minority samples in the overlapping area. The excellent performance of the OWA-ELM algorithm also confirms that slightly increasing the value of all elements in one column of the output weight matrix does not cause obvious structural risk to the ELM neural network or result in over-fitting, but enables the algorithm to classify imbalanced data better.
4. In addition to the imbalance rate, there are other factors that affect a classifier on imbalanced data. On some data sets with low imbalance rates, such as glass1, haberman, pima and vehicle1, the G-mean value of each algorithm is between 0.5 and 0.7, and the F-measure value of each algorithm does not exceed 0.62. But on some highly imbalanced data sets, such as the shuttle-2_vs_5 data set with an imbalance rate of 66.67, better results are achieved. This is because, besides the imbalance rate, the class imbalance problem also involves other factors, such as class overlapping and small disjuncts [24]. Perhaps data sets such as glass1, haberman, pima and vehicle1 suffer from serious class overlapping or small disjuncts, which affects the classification performance of each algorithm.

E. TRAINING TIME COMPARISON
In terms of time, we mainly compared the training time of ELM, WELM1, WELM2, CCR-ELM, CS-ELM and OWA-ELM. In order to make the results more reliable, the control variates method was adopted: the number of hidden layer neurons in each algorithm was 200; in ELM, WELM1, WELM2, CS-ELM and OWA-ELM, the regularization parameter C was 2^4; in CCR-ELM, C+ and C- were both 2^4; and in OWA-ELM, Δ was 0.003. Table 4 shows the comparison results. It can be seen from Table 4 that ELM has the least average training time on the 22 training sets, followed by OWA-ELM and CCR-ELM; the training time of OWA-ELM is almost equivalent to that of ELM, while WELM requires the most training time. OWA-ELM adjusts β during training, which slightly increases the training time compared with ELM, but this adjustment is a simple matrix addition with low cost. Although CCR-ELM, CS-ELM and OWA-ELM all perform addition operations on matrices, CCR-ELM needs to calculate I/C+ + I/C- in (13) and CS-ELM requires more matrix operations to solve β, whereas the addition OWA-ELM performs after obtaining the output weight is less complex and takes less time. These aspects cause the training time of OWA-ELM to be almost equal to that of ELM, and on most data sets even slightly shorter than that of CCR-ELM. CS-ELM also needs more training time, because the computation of its output weight matrix is more complicated. Comparing (11) and (5), it can be found that, compared with ELM, two matrix multiplication operations are added in the solution process of β in WELM. Matrix multiplication is more time-consuming than matrix addition, which directly leads to the training time of WELM exceeding that of the other algorithms. However, since OWA-ELM introduces an additional parameter Δ, it takes more time than ELM and WELM to search for the optimal parameter combination in the grid search. One possible way to mitigate this is to analyze the feasible range of Δ in advance and avoid searching in meaningless ranges. Another feasible method is to use the particle swarm optimization algorithm to search for the best combination of Δ, C and L.
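As an illustration, the relative cost of the β adjustment can be checked with a quick timing sketch on synthetic data; this reuses the hypothetical elm_train and owa_elm_train helpers from the earlier sketches and is not the timing protocol of Table 4:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))          # synthetic features
y = (rng.random(1000) < 0.1).astype(int)    # roughly 10% minority class
T = np.eye(2)[y]                            # one-hot targets

t0 = time.perf_counter()
elm_train(X, T, L=200, C=2**4)
t1 = time.perf_counter()
owa_elm_train(X, T, L=200, C=2**4, delta=0.003, minority_col=1)
t2 = time.perf_counter()
print(f"ELM: {t1 - t0:.4f}s, OWA-ELM: {t2 - t1:.4f}s")
```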

V. CONCLUSIONS
ELM has attracted attention and adoption in various industries because of its short training time and good generalization performance. However, in practical applications, data are often imbalanced, and when dealing with imbalanced data sets the performance of ELM is often not ideal. Therefore, an improved ELM algorithm with output weight adjustment (OWA-ELM) is proposed. By slightly increasing the corresponding column of the output weight matrix, that is, slightly increasing the weights between the hidden layer neurons and the minority output neuron, the output of the minority output neuron is increased. As a result, the classification accuracy of the minority class samples increases without significantly affecting the classification of the majority class samples. Experiments on 22 KEEL imbalanced data sets show that, in terms of G-mean and F-measure, the proposed OWA-ELM algorithm performs well on many data sets, especially under the F-measure evaluation. In the future, we will focus on optimizing the performance of OWA-ELM on imbalanced data sets with class overlapping and small disjuncts, and on shortening the time required to find the optimal parameter combination. In addition, applying the proposed algorithm to online imbalanced data classification is also important future work.