An Ensemble Learning Algorithm Based on Density Peaks Clustering and Fitness for Imbalanced Data

In view of the low classification accuracy of the minority class in imbalanced data, an algorithm called DPF-EL (density peaks and fitness combined with ensemble learning) based on density peaks clustering and fitness is proposed. First, the density peaks clustering algorithm is used to divide the majority class into different sub-clusters; the local density calculated during clustering is used to assign a weight to each sub-cluster, and the under-sampling number of each sub-cluster is determined by its weight. Second, the concept of fitness is introduced into the sub-clusters: the selection probability of each sample is calculated from its fitness, and the majority class is under-sampled according to this selection probability. Finally, combined with the boosting algorithm, iterative training is performed on the balanced data set. Experiments were conducted on KEEL imbalanced data sets, and the results show that the performance of the DPF-EL algorithm is better than that of the comparison algorithms, which indicates the feasibility of the proposed algorithm.


I. INTRODUCTION
Classification is one of the most extensively used machine learning (ML) techniques. Traditional ML classification algorithms usually assume that the number of samples in each class of a data set is balanced and treat the samples of different classes equally so as to improve the overall classification accuracy [1]. However, in real applications, the number of samples in the various classes of a data set is often imbalanced. When the number of samples in one or more classes (the majority class) far exceeds that of the others (the minority class), the classification algorithm tilts toward the majority class, causing low classification accuracy for the minority class [2]. The classification accuracy of the minority class is nevertheless essential in many cases. For example, in medical diagnosis [3], [4], [5], people suffering from malignant diseases are the minority class; suppose a traditional ML classification algorithm is used for the auxiliary diagnosis of malignant diseases. In that case, the classification accuracy for the malignant diseases is low, which may lead to misdiagnosis and delay the treatment of the patients. Today, with the increasing use of ML technology, the low classification accuracy of the minority class caused by data imbalance arises in many fields, such as intrusion detection [6], [7], [8], fraud detection [9], [10], [11], and target detection [12], [13], [14].
The associate editor coordinating the review of this manuscript and approving it for publication was Cheng Chin.
Many methods have been presented to solve the problem of imbalanced data, and they can be categorized into three classes: (a) data resampling, (b) improving the classification algorithms, and (c) data resampling combined with ensemble learning [15]. Data resampling mainly synthesizes minority class samples or removes majority class samples to reduce the imbalance rate of the data; examples include the synthetic minority over-sampling technique (SMOTE) [16] and the random under-sampling (RUS) method [17]. Improving the classification algorithm mainly modifies existing classification algorithms so that they can handle imbalanced data sets, such as cost-sensitive approaches [18], [19], [20], fuzzy support vector machines [21], [22], [23], [24], and improved random forests [25], [26], [27]. Due to the effectiveness of data resampling and the data diversity brought by ensemble learning, data resampling combined with ensemble learning has become one of the main methods for dealing with data imbalance at present; the research in this paper is also based on this method.
The method of data resampling combined with ensemble learning mainly uses different data resampling methods to balance the training data sets at the beginning of ensemble learning training [28]. Reference [29] proposed an algorithm that combines SMOTE with AdaBoost (the adaptive boosting algorithm) called SMOTEBoost (synthetic minority over-sampling technique with AdaBoost). This algorithm uses SMOTE to oversample the minority class during the iterative training of AdaBoost, so as to alleviate the effect of data imbalance. However, during oversampling, the SMOTE algorithm has a boundary problem, which blurs the classification boundary and worsens the classification accuracy of the minority class. Reference [30] proposed an algorithm called RUSBoost (random under-sampling with AdaBoost) that combines RUS with AdaBoost. It is similar to SMOTEBoost; the difference is that random under-sampling is used to balance the data sets during the iterative training. Though this algorithm can deal with imbalanced data effectively, due to the uncertainty of random under-sampling, samples carrying important information may be lost during under-sampling. Reference [31] proposed an under-sampling method based on density peaks. First, the majority class samples in the overlapping areas are identified and removed. Second, clustering is performed on the majority class samples with the overlap region removed, and each generated sub-cluster is under-sampled according to its size. Finally, the bagging algorithm is used to integrate the classifiers, so that better classification performance is obtained. Reference [32] proposed a clustering-based under-sampling method, which takes the centers of the sub-clusters as representative samples to replace the whole majority class, and then combines AdaBoost for iterative training. This method improves the classification accuracy on imbalanced data to a certain extent.
The deficiency of this method is that it only considers the cluster centers as the representative samples and ignores the selection of samples in the boundary area, which leads to the loss of samples near the decision boundary and affects the accuracy of classification.
To solve the problems existing in the above methods, this paper proposes an algorithm called DPF-EL based on density peaks clustering and fitness. The density peaks clustering algorithm [33] is a density-based clustering algorithm proposed by Rodriguez and Laio. Its main advantages are that it does not need iteration, can find cluster centers in one pass, and can identify clusters of any shape. Due to its simple implementation and superior clustering performance, the algorithm has been applied in many fields [34]. The algorithm in this paper uses density peaks clustering to divide the majority class into several different sub-clusters, and the under-sampling number of each sub-cluster is determined by the weight of the sub-cluster. To select the representative samples in the clusters, the local density of each sample is used as its fitness, and the selection probability of the samples is calculated from the fitness. Experiments were conducted on 13 imbalanced data sets with different real application backgrounds, and the results show that the DPF-EL algorithm has better classification performance than the other compared algorithms.
The rest of this paper is arranged as follows: Section II introduces the theory of density peaks clustering and the decision tree algorithm. Section III introduces this paper's theory, steps, and algorithm design. Section IV introduces the experimental design and result analysis. Section V concludes the entire paper and points out its limitations.

II. RELATED THEORIES
A. DENSITY PEAKS CLUSTERING ALGORITHM
The basic idea of density peaks clustering is to form clusters by calculating the local density of the sample points and finding the density peak points. The algorithm is based on the following assumptions:
1) Samples with high local density may be cluster centers.
2) The distance between cluster centers should be relatively large.
According to the above assumptions, the local density ρ_i and the minimum distance δ_i to points with higher density must first be calculated in order to select the cluster centers. The method is given below.
Assume that the data set D = {x_1, x_2, · · · , x_n}. For any sample x_i, the local density ρ_i can be calculated using the Gaussian kernel function, as shown in (1).

ρ_i = Σ_{j≠i} exp(−(d_ij / d_c)²)    (1)

where d_ij is the Euclidean distance between the two samples x_i and x_j, and d_c is the cutoff distance, generally set to the distance at the 2% position of the sorted pairwise Euclidean distances. The minimum distance δ_i to points with higher density is defined as shown in (2).

δ_i = min_{j: ρ_j > ρ_i} d_ij    (2)

For the point with the highest density, δ_i is taken as max_j d_ij. The density peaks clustering algorithm takes the samples with both higher local density and larger minimum distance as cluster centers. After the cluster centers are determined, each remaining sample is assigned to the cluster of its nearest neighbor with higher local density.
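As a concrete illustration, the two quantities ρ_i and δ_i can be computed as in the following minimal NumPy sketch; the function name and the ascending-sort reading of the 2% rule for d_c are our assumptions.

```python
import numpy as np

def density_peaks_stats(X, dc_percent=2.0):
    """Compute the DPC quantities: local density rho_i (Gaussian kernel, (1))
    and delta_i, the minimum distance to any point of higher density (2)."""
    n = len(X)
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # cutoff distance d_c: value at the dc_percent position of the sorted
    # pairwise distances (one reading of the paper's "2%" rule of thumb)
    triu = d[np.triu_indices(n, k=1)]
    dc = np.sort(triu)[int(len(triu) * dc_percent / 100.0)]
    # Gaussian-kernel local density: rho_i = sum_{j != i} exp(-(d_ij/d_c)^2)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # subtract the self term
    # delta_i = min distance to a denser point; for the global density peak,
    # use the maximum distance to any other point
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta, dc
```

Cluster centers are then the points for which both rho and delta are large.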

B. DECISION TREE ALGORITHM
Decision tree [35] is a common classification model, which has been widely used in ensemble learning due to its simple structure and high classification accuracy. The ensemble learning algorithm proposed in this paper uses the C4.5 algorithm to generate the base classifiers.
The C4.5 algorithm is a frequently used method for generating decision trees, and it takes the information gain ratio as the partition metric for selecting the optimal feature. The information gain ratio is calculated as follows. Assume that data set D has K categories, indexed k = 1, 2, · · · , K, and that p_k denotes the ratio of the number of samples of class k to the total number of samples in D. Data set D is divided into V sub-datasets by the values of feature a, and |D_v| is the number of samples in the v-th sub-dataset. The information gain ratio of feature a is calculated as shown in (3).

Gain_ratio(D, a) = (Ent(D) − Σ_{v=1}^{V} (|D_v| / |D|) Ent(D_v)) / (−Σ_{v=1}^{V} (|D_v| / |D|) log₂(|D_v| / |D|))    (3)
where Ent(D) refers to the information entropy, which measures the purity of data set D, and is calculated as shown in (4).

Ent(D) = −Σ_{k=1}^{K} p_k log₂ p_k    (4)
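A minimal sketch of the information gain ratio for a discrete-valued feature follows; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain ratio of a discrete feature.
    values[i] is the feature value of sample i, labels[i] its class."""
    n = len(labels)
    # partition D into sub-datasets D_v by feature value
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    # information gain: Ent(D) - sum_v (|D_v|/|D|) * Ent(D_v)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    # intrinsic value: -sum_v (|D_v|/|D|) * log2(|D_v|/|D|)
    iv = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    return gain / iv if iv > 0 else 0.0
```

A feature that splits the classes perfectly attains a gain ratio of 1, while a feature independent of the class attains 0.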

III. THE PROPOSED ALGORITHM
A. ADAPTIVE UNDER-SAMPLING WEIGHT CALCULATION BASED ON DENSITY
In existing clustering-based under-sampling methods, the under-sampling number of each cluster is usually determined according to a fixed proportion or count, without considering the density of the samples in the cluster. To keep the distribution of the data set consistent before and after sampling, more samples should be drawn from dense areas and fewer from sparse areas. Therefore, in this paper sampling weights are assigned to the clusters according to the local density of their samples: the denser a sub-cluster, the larger its sampling weight, and the sparser a sub-cluster, the smaller its sampling weight. According to formulas (1) and (2), the local density ρ_i and minimum distance δ_i of the majority class samples in data set D are calculated to generate C different sub-clusters D^k_maj, where k = 1, 2, · · · , C. For each majority sub-cluster formed by clustering, the density Rho^k_maj and sampling weight Weight^k_maj are calculated by formulas (5) and (6). Finally, the sampling weight Weight^k_maj of each sub-cluster is multiplied by the number of minority class samples in data set D to obtain the under-sampling number US^k_maj of the sub-cluster, as shown in formula (7).
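The weight computation can be sketched as follows. This is a hypothetical reading of formulas (5) to (7): it assumes the density of a sub-cluster is the sum of the local densities of its members, which is one plausible choice among several.

```python
import numpy as np

def undersampling_numbers(rho, cluster_labels, n_minority):
    """Derive each majority sub-cluster's under-sampling number from its
    density. Assumption: the density Rho_k of a sub-cluster is the sum of
    the local densities rho_i of its members; the weights are the
    normalized densities, and each count is the weight times the number
    of minority class samples."""
    rho = np.asarray(rho, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    clusters = np.unique(cluster_labels)
    cluster_rho = np.array([rho[cluster_labels == k].sum() for k in clusters])
    weights = cluster_rho / cluster_rho.sum()            # normalized weights
    counts = np.round(weights * n_minority).astype(int)  # per-cluster counts
    return dict(zip(clusters.tolist(), counts.tolist()))
```

The counts sum (up to rounding) to the minority class size, so merging the sampled majority samples with the minority class yields a roughly balanced set.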

B. UNDER-SAMPLING METHOD BASED ON FITNESS
This paper uses a sample-based fitness to select, as far as possible, samples distributed in both the center and the periphery of the sub-clusters. To some extent, the importance of samples can be approximately measured by density: if a sufficient number of high-density samples are selected, the learning models will have better classification performance.
For the sub-clusters generated by density peaks clustering, the samples with higher local density are the central or peripheral points of the sub-clusters, while the samples with lower local density are boundary or outlier (noise) points. It can be seen from Figure 1 that the local densities of the sub-cluster center and peripheral points are higher than those of the boundary points. Points 26 to 28 are far from most of the points and have lower local densities; they are regarded as outliers or noise points.
To give the center and peripheral points of the sub-clusters a larger selection chance, the local density of a sample is taken as its fitness. Specifically, the fitness of a sample point x_i in a sub-cluster is defined by formula (8).

f(x_i) = ρ_i    (8)

In genetic algorithms [36], fitness measures an individual's ability to adapt to the environment and is proportional to its selection probability. For a point x_i with fitness f(x_i), the selection probability p(x_i) can be calculated using formula (9).

p(x_i) = f(x_i) / Σ_{j=1}^{N} f(x_j)    (9)

where N is the number of samples in the sub-cluster.
From formulas (8) and (9), it can be concluded that the density of a sample in a sub-cluster is proportional to its selection probability. Hence, the central and peripheral points of the sub-clusters have a higher selection probability than the boundary and outlier points. In addition, the samples in the boundary region also have a certain probability of being selected, so useful samples related to the decision boundary are not lost. For convenience of description, the steps of the under-sampling method for a single cluster are given below.
Step1: calculate the selection probability p(x i ) of the samples in this cluster according to the formulas (8) and (9).
Step2: calculate the cumulative probability P_i for each sample according to formula (10).

P_i = Σ_{j=1}^{i} p(x_j)    (10)
Step3: generate a random number r within the interval [0, 1]. If r < P_1, select sample x_1; otherwise, select the sample x_i that satisfies the condition P_{i−1} ≤ r ≤ P_i.
Step4: repeat Step3 until the under-sampling number of this cluster is reached.
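The steps above amount to roulette-wheel selection without replacement, which can be sketched as follows; the function name and signature are illustrative.

```python
import random

def roulette_undersample(samples, rho, k, seed=None):
    """Steps 1-4: draw k distinct samples from one cluster, with selection
    probability proportional to local density (the fitness of each sample)."""
    rng = random.Random(seed)
    total = sum(rho)
    probs = [r / total for r in rho]                 # selection probabilities
    cum, acc = [], 0.0                               # cumulative P_i (Step 2)
    for p in probs:
        acc += p
        cum.append(acc)
    selected = set()
    while len(selected) < k:                         # Step 4: repeat
        r = rng.random()                             # Step 3: spin the wheel
        i = next((j for j, P in enumerate(cum) if r <= P), len(cum) - 1)
        selected.add(i)
    return [samples[i] for i in sorted(selected)]
```

Dense (central and peripheral) samples are drawn most often, but boundary samples retain a nonzero chance of selection.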
The pseudo code of the above steps can be summarized as in Algorithm 1. Each cluster is under-sampled according to Algorithm 1, and the samples obtained by under-sampling are merged with the minority class samples in data set D to form a balanced data set D′.

Algorithm 1 Fitness-Based Under-Sampling for a Single Cluster
Input: cluster samples {x_1, · · · , x_N}, under-sampling number US
Output: the set of selected samples, new_majority
1: new_majority ← ∅;
2: calculate the selection probability p(x_i) of each sample according to formulas (8) and (9);
3: while |new_majority| < US do
4: // select one sample by roulette wheel
5: m ← 0; // m is the cumulative probability
6: r ← Random(0, 1);
7: for i = 1 to N do
8: m ← m + p(x_i);
9: if r ≤ m then
10: break; // x_i is selected
11: end if
12: end for
13: new_majority ← new_majority ∪ {x_i}; // if x_i is not in new_majority
14: end while
15: return new_majority;

C. BASE CLASSIFIER GENERATION
The C4.5 algorithm is used to train the decision tree classifier on the balanced data set D′, and the depth of the decision tree is set to d. The steps for generating a decision tree using the C4.5 algorithm are given below:
Step1: for the data set D′, the information gain ratio of all features is calculated according to formulas (3) and (4), and the feature with the maximum information gain ratio is taken as the optimal partition feature, which is used to establish the root node; the child nodes are generated according to the different values of the optimal partition feature.
Step2: in the same way as Step1, the feature with the maximum information gain ratio is selected as the optimal partition feature for generating the sub-nodes, and the subsequent branches are recursively established until the samples at each node all belong to the same class or the set depth d is reached.
Step3: the classification rules are extracted to obtain the corresponding base classifier.

D. ALGORITHM DESIGN
The algorithm design of DPF-EL mainly embeds Algorithm 1 into the training framework of ensemble learning, and improves the classification performance on imbalanced data by repeated sampling and the training of corresponding classifiers. The pseudo code of the DPF-EL algorithm can be summarized as in Algorithm 2.
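A condensed sketch of this training-and-voting framework follows. It is a simplified stand-in, not the full method: a user-supplied base learner and under-sampler replace the paper's C4.5 tree and Algorithm 1, the error rate is computed without per-sample weights, and labels are assumed to be ±1.

```python
import math
import random

def train_boosted_ensemble(data, T, undersample, train_base, seed=0):
    """data: list of (x, y) with y in {-1, +1}. undersample(data, rng)
    returns a balanced data set; train_base(balanced) returns a classifier
    h(x) -> y. These are stand-ins for Algorithm 1 and the C4.5 learner."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(T):
        balanced = undersample(data, rng)          # resample each round
        h = train_base(balanced)                   # train a base classifier
        # error rate of h on the full training set
        err = sum(1 for x, y in data if h(x) != y) / len(data)
        err = min(max(err, 1e-10), 1 - 1e-10)      # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this classifier
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    """Alpha-weighted vote over the base classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```

Each round sees a differently sampled balanced set, which is the source of the ensemble's diversity.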

E. DPF-EL TIME COMPLEXITY ANALYSIS
The time complexity of DPF-EL is mainly concentrated in two aspects, which are analyzed as follows.
1) The time complexity of clustering the majority class using density peaks clustering. Since the time complexity of the density peaks algorithm is O(n²), the time complexity of this step is O(n²).

Algorithm 2 DPF-EL Design
Input: imbalanced data set D = {(x_1, y_1), · · · , (x_n, y_n)}; number of clusters, C; number of iterations, T
Output: classification results
1: initialize the weight of x_i: W_1(i) = 1/n, i = 1, 2, · · · , n;
2: cluster the majority class in data set D using the density peaks clustering algorithm;
3: calculate the under-sampling number of each cluster according to formulas (5) to (7);
4: for t = 1 to T do
5: create a balanced data set D_t according to the under-sampling method in Algorithm 1;
6: use D_t as the training data to train the base classifier h_t;
7: calculate the error rate of h_t: e_t = Σ_{i=1}^{n} W_t(i) I(h_t(x_i) ≠ y_i), where I is the indicator function;
8: calculate the weight of h_t: α_t = (1/2) ln((1 − e_t)/e_t);
9: update the sample weights: W_{t+1}(i) = W_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor;
10: end for
To classify a sample x_test with the ensemble classifier:
11: initialize the weight of each class to 0;
12: for t = 1 to T do
13: c = h_t(x_test); // c is the class predicted by h_t
14: add weight α_t to class c;
15: end for
16: return the class with the largest weight;

2) The time complexity of the iterative ensemble training. In each of the T iterations, a C4.5 base classifier is trained on the balanced data set D′, which costs O(p|D′|log|D′|), where p is the number of features; over T iterations this is O(Tp|D′|log|D′|).
To sum up, the time complexity of the DPF-EL algorithm is O(n²) + O(Tp|D′|log|D′|).

IV. EXPERIMENTS AND ANALYSIS
A. EVALUATION METRICS
G-mean [37], AUC [38], and Balance [39] are commonly used to assess the classification performance of algorithms on imbalanced data. They can be calculated from a confusion matrix. The method is given below.
According to Table 1, the following evaluation metrics can be obtained.
True Positive Rate (TPR), the percentage of positive samples that are correctly classified, as shown in (11).

TPR = TP / (TP + FN)    (11)

False Positive Rate (FPR), the percentage of negative samples that are misclassified, as shown in (12).

FPR = FP / (FP + TN)    (12)

Specificity, the percentage of negative samples that are correctly classified, which measures the ability to identify the negative class, as shown in (13).

Specificity = TN / (TN + FP)    (13)
G-mean, the geometric mean of the true positive rate and specificity. A higher G-mean value means the algorithm has better classification performance on imbalanced data. The method of calculation is given in (14).

G-mean = √(TPR × Specificity)    (14)

AUC, the area under the ROC curve. The higher the AUC value, the higher the true positive rate and the lower the false positive rate. The calculation formula is shown in (15).

AUC = (1 + TPR − FPR) / 2    (15)

Balance, a measure of the comprehensive classification performance of algorithms on imbalanced data. A higher Balance value means the algorithm achieves better comprehensive classification performance. The calculation formula is shown in (16).

Balance = 1 − √(((0 − FPR)² + (1 − TPR)²) / 2)    (16)
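Assuming the standard definitions of these metrics (the single-point form (1 + TPR − FPR)/2 for AUC is the usual choice when a single classifier operating point is evaluated), they can be computed from confusion-matrix counts as:

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """G-mean, single-point AUC, and Balance from a confusion matrix.
    tp/fn count the positive (minority) class, fp/tn the negative class."""
    tpr = tp / (tp + fn)                   # true positive rate (recall)
    fpr = fp / (fp + tn)                   # false positive rate
    spec = tn / (tn + fp)                  # specificity
    g_mean = math.sqrt(tpr * spec)         # geometric mean of TPR and spec
    auc = (1 + tpr - fpr) / 2              # single-operating-point AUC
    # Balance: 1 minus the normalized distance to the ideal ROC point (0, 1)
    balance = 1 - math.sqrt(((0 - fpr) ** 2 + (1 - tpr) ** 2) / 2)
    return g_mean, auc, balance
```

All three metrics equal 1 for a perfect classifier and decrease as either the minority class recall or the specificity degrades.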

B. EXPERIMENT DATA SETS
This paper uses 13 imbalanced data sets from the KEEL data set repository [40] to train and evaluate the algorithm. Since this paper only studies the two-class problem, on the wine data set the category ''3'' is selected as the minority class and the other categories as the majority class. The imbalance ratios of the experimental data sets range from 1.82 to 35.44. See Table 2 for detailed information.

C. EXPERIMENTAL DESIGN AND COMPARATIVE RESULTS
In the experiments, the proposed algorithm is compared with AdaBoost [41], SMOTEBoost [29], RUSBoost [30], cluster-based under-sampling with boosting (CUSBoost) [42], the neighborhood cleaning rule (NCL) [43], and clustering-based under-sampling (CBU) [32]. Among them, reference [32] uses two strategies for under-sampling; this paper chooses the second strategy, called CBU-NN (clustering-based under-sampling with nearest neighbors of the cluster centers), which has better classification performance, as the comparison algorithm, and the number of clusters is set to the number of minority class samples. All algorithms use the C4.5 algorithm to train the base classifiers, with G-mean, AUC, and Balance as the evaluation metrics.
To make the experimental results fair and objective, the algorithm in this paper is run ten times with ten-fold cross-validation, and the mean values of the evaluation metrics are shown in Table 3 to Table 5, where the bold value is the highest under that evaluation metric. From Table 3 to Table 5, it can be seen that the DPF-EL algorithm achieves the highest G-mean and Balance values on 7 data sets and the highest AUC values on 12 data sets; compared with G-mean and Balance, the advantage of the DPF-EL algorithm is more obvious under the AUC metric. On the data sets abalone9-18, haberman, pima, wine, winequalityred8vs6, and yeast1, the comprehensive classification performance of the DPF-EL algorithm is better: its G-mean, AUC, and Balance values are all the highest. In particular, on the winequalityred8vs6 data set, whose imbalance rate is as high as 35.44, the algorithm performs well; compared with the CBU-NN algorithm, its G-mean value is increased by 5.2%, its AUC value by 3.1%, and its Balance value by 2.3%. This shows that the classification performance of the proposed algorithm remains better on data sets with a high imbalance rate, and it also demonstrates that under-sampling combined with ensemble learning is an effective method for solving the imbalanced data classification problem.
In order to compare the classification performance of the different methods more visually, Figure 2 shows the mean values of the different evaluation metrics of the 7 methods on the 13 data sets. It can be seen from Figure 2 that, compared with the other methods, the mean values of the evaluation metrics of the proposed method are improved to a certain extent, which shows that its classification performance is better than that of the other methods.
On the whole, compared with other methods, the G-mean, AUC and Balance evaluation values of the proposed method are higher, which indicates that this method has a higher classification accuracy and better classification performance for imbalanced data.

D. THE IMPACT OF CUTOFF DISTANCE
In the under-sampling phase, the selection probability of a sample is positively correlated with its local density. To observe the influence of the value of the parameter d_c on the algorithm's performance, the parameter was set to different values. Figure 3 shows the changes in the evaluation metric values of the DPF-EL algorithm under different d_c (1%, 2%, 3%, 4%, 5%). The evaluation metric values in Figure 3 are the sums of the metric values over the 13 data sets.
As can be seen from Figure 3, when d_c = 1%, the sums of all metrics are the lowest; when d_c changes from 2% to 5%, the line chart changes gently, and the increases or decreases in the evaluation metric values are small. On the whole, a small change in d_c has a limited effect on the performance of the algorithm.

V. CONCLUSION
Under-sampling combined with ensemble learning can effectively solve the problems of imbalanced data learning and bring diversity to the data. However, existing algorithms usually face two problems: since the sizes of the clusters obtained by clustering differ, how to reasonably allocate the under-sampling number among the clusters, and how to select representative samples within each cluster.
This paper proposed an algorithm called DPF-EL. The algorithm calculates the under-sampling number for each sub-cluster according to the density of the samples in the cluster, which keeps the data distribution consistent before and after sampling. The fitness concept from genetic algorithms is used to model the samples of the sub-clusters, so that the central and peripheral samples of the sub-clusters have a larger selection probability and the representative samples in each cluster are retained as much as possible. Finally, the feasibility of the method is verified through experiments.
In real-life applications, imbalanced data may involve multiple classes. Future work will extend the DPF-EL algorithm to the classification of multi-class imbalanced data sets. In addition, since the method in this paper uses the density peaks clustering algorithm, its running time is relatively long on data sets with a large number of samples; how to shorten the running time through parallelization is also worth studying.