An Improved Majority Weighted Minority Oversampling Technique for the Imbalanced Classification Problem

Minority oversampling techniques play a pivotal role in imbalanced learning. However, traditional oversampling algorithms can cause problems such as intra-class imbalance among samples, loss of the important information carried by boundary samples, and high similarity between new and old samples. To address these issues, we propose a new oversampling method, the BIRCH and Boundary Midpoint Centroid Synthetic Minority Over-Sampling Technique (BI-BMCSMOTE). First, the algorithm uses BIRCH clustering to cluster the minority samples quickly. After identifying and removing noise, it marks boundary minority samples and assigns each a sampling probability. Second, it builds a density function for each sample cluster, calculates the cluster density and sampling weight, and performs midpoint synthetic sampling between the probabilistically marked boundary minority samples and the other minority samples in each cluster; the ratio between the two kinds of synthetic samples is then analyzed to improve model accuracy. The experimental results confirm the validity of the algorithm.


I. INTRODUCTION
Imbalanced data [1] refers to a dataset in which the amount of data in one or several classes is far larger than in the others. In two-class imbalanced data, the class with the larger amount is referred to as the majority class and the class with the smaller amount as the minority class. Data mining approaches have long been used to build models and support decisions, but when it comes to classifying imbalanced data, traditional classification models are not effective. This is because models drawn from standard classifiers, such as logistic regression, support vector machines and decision trees, are biased toward the majority class and misclassify some minority samples [2], or because genuine exceptions are mistaken for noise, and vice versa [3]. In addition, some imbalanced data has few samples and low density, and high feature dimensionality further limits what learning models can learn about some classes. The issues caused by imbalanced data arise in many areas of data mining, such as credit card fraud [4], medical diagnosis [5], network intrusion [6], and oil leakage [7].
Oversampling by synthesizing minority samples is a very popular technique for improving minority-class performance on imbalanced datasets.
The Synthetic Minority Oversampling Technique (SMOTE) marked the beginning of oversampling, and many researchers have since worked on improving it. In the past decade, the processing of imbalanced data has proceeded mainly in three directions: cost-sensitive learning [8], [9], algorithm modification [10], and data preprocessing [11]. Cost-sensitive learning assumes that misclassifying a minority sample carries a higher cost than misclassifying a majority sample; it can be implemented at both the data level and the algorithm level [2]. Algorithm modification adapts an existing algorithm or classification paradigm to the learning of the minority class. Data preprocessing has three classic and widely used techniques: undersampling [12], oversampling [10], [13], [14], [15], and mixed sampling. Undersampling is a reasoned pruning of the majority class, while oversampling is an effective supplement to the minority class. Chawla et al. [16] proposed SMOTE, which synthesizes new samples by linear interpolation between minority samples, whereas simply copying samples easily leads to overfitting. Researchers have since proposed many improved SMOTE variants. Han et al. [17] put forward the Borderline-SMOTE algorithm, which emphasizes the boundary samples of each class but overlooks the information carried by non-boundary samples. The Adaptive Synthetic Sampling Technique (ADASYN) [18] adaptively adjusts the weight of each minority sample according to the ratio of majority to minority samples among its nearest neighbors, and synthesizes new samples to balance the skewed distribution. The Majority Weighted Minority Oversampling Technique (MWMOTE) [19] sets the importance of each minority sample according to its density in the cluster and draws samples by the resulting probability weights, but its interpolation method remains simplistic. The Density-Based Synthetic Minority Oversampling Technique (DBSMOTE) [20] verifies the feasibility of using DBSCAN clustering and achieves good results in combination with SMOTE. The DBSCAN and Midpoint Centroid Synthetic Minority Oversampling Technique (DB-MCSMOTE) [21] first performs DBSCAN clustering on the minority samples and calculates a sampling weight for each cluster; a new sample is then synthesized from two points in the cluster that are far apart. This method preserves the key information of each cluster and eliminates noise, but does not consider the influence of the boundary minority samples.
This paper mainly studies oversampling techniques for synthesizing minority samples. To better verify the validity of the new oversampling algorithm, this paper generates a scatter plot of two-dimensional imbalanced data and uses visualization to verify that the synthetic data is assigned to the correct class, which intuitively demonstrates the contribution of the synthetic data to expressing minority-class characteristics. In addition to multiple original datasets, a credit card default dataset is added, on which the oversampling algorithm also performs well. On this basis, this paper adds a ratio analysis between the boundary samples and the normal samples, aiming to find the optimal ratio and thereby improve the performance of the new algorithm.
The remainder of this paper is organized as follows. Section 2 reviews SMOTE and presents the definitions, theorems, and other background required by the BI-BMCSMOTE algorithm. Section 3 proposes the new algorithm. Section 4 describes the experimental design of the empirical study. Section 5 presents and analyzes the experimental results, including an analysis of the ratio between the marked boundary minority samples and the intra-cluster minority samples.

II. REVIEW OF THEORY
A. SMOTE ALGORITHM
SMOTE is a classical oversampling algorithm for synthesizing minority samples in imbalanced learning. It improves on random oversampling rather than merely replicating samples: new samples are synthesized manually by linear interpolation among minority samples and added to the dataset to balance it, thus alleviating the overfitting problem that random oversampling easily produces [16]. The algorithm flow is as follows:
1) For each minority sample $x_i$, use the KNN algorithm under Euclidean distance to find its $k$ nearest minority neighbors.
2) Set the sampling ratio $N\%$ according to the sample imbalance rate. For each minority sample $x_i$, randomly select samples from its $k$ minority neighbors; denote a selected neighbor by $\hat{x}_i$.
3) For each randomly selected neighbor $\hat{x}_i$, construct a new sample according to
$$x_{new} = x_i + \delta \times (\hat{x}_i - x_i), \quad \delta \in (0, 1),$$
where $\delta$ is a random number.
4) Add the synthesized samples to the original dataset to form a balanced dataset.
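To make the interpolation step concrete, the following is a minimal sketch of SMOTE in Python (the language the paper's experiments use). The function name `smote_sample` and the array argument `X_min` are our own illustrative choices, not part of the original algorithm description.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=None):
    """Minimal SMOTE: interpolate between a minority sample x_i and one of
    its k nearest minority neighbors, x_new = x_i + delta * (x_hat - x_i)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # pick a minority sample x_i
        j = rng.choice(idx[i][1:])         # pick one of its k neighbors x_hat
        delta = rng.random()               # delta ~ U(0, 1)
        new_samples.append(X_min[i] + delta * (X_min[j] - X_min[i]))
    return np.array(new_samples)
```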

B. BIRCH ALGORITHM
Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH) is an incremental clustering method that uses a tree structure for fast clustering. The algorithm suits large samples: it reads multi-dimensional data points incrementally and dynamically, trying to produce clusters of optimum quality within the available resources (memory and time constraints). It runs fast enough to cluster with only a single scan of the dataset. Its core concepts are defined as follows.
Clustering Feature (CF) [22]: Given $N$ $d$-dimensional data points $\{x_i\}\ (i = 1, 2, \cdots, N)$ in a cluster, the Clustering Feature vector of the cluster is defined as the triple $CF = \langle N, \vec{LS}, SS \rangle$, where $N$ is the number of data points in the cluster, $\vec{LS} = \sum_{n=1}^{N} x_n$ is the linear sum of the $N$ data points, and $SS = \sum_{n=1}^{N} x_n^2$ is the square sum of the $N$ data points.
CF additivity theorem [22]: Assume that $CF_1 = \langle N_1, \vec{LS}_1, SS_1 \rangle$ and $CF_2 = \langle N_2, \vec{LS}_2, SS_2 \rangle$ are the CF vectors of two disjoint clusters. Then the CF vector of the cluster formed by merging the two disjoint clusters is
$$CF_1 + CF_2 = \langle N_1 + N_2, \vec{LS}_1 + \vec{LS}_2, SS_1 + SS_2 \rangle.$$
CF tree: a height-balanced tree, similar to a B+ tree, with three parameters: the branching factor $\beta$, the leaf factor $\lambda$, and the threshold $\tau$. The branching factor $\beta$ is the maximum number of entries in a non-leaf node; the leaf factor $\lambda$ is the maximum number of entries in a leaf node; the threshold $\tau$ is the maximum diameter of the subcluster represented by a CF entry stored in a leaf node. As shown in Fig. 1, a CF tree consists of a root node, branch nodes, and leaf nodes. Internal nodes contain no more than $\beta$ entries of the form $[CF_i, child_i]$, where $CF_i$ is the clustering feature of the $i$-th subcluster on the node and the pointer $child_i$ points to the $i$-th child of the node. A leaf node contains no more than $\lambda$ entries of the form $[CF_i]$; in addition, each leaf node has a pointer prev to the previous leaf node and a pointer next to the next one. Each node represents the cluster formed by merging the subclusters of its entries.
The BIRCH algorithm reads data points in turn and builds a CF tree in memory for clustering; it then applies Agglomerative Clustering to all CF entries globally to obtain a better clustering as the output.
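Since the paper later runs BIRCH through scikit-learn (Section 3-B), a minimal usage sketch may help; the parameter values and the random stand-in data below are illustrative only, not the paper's settings. In scikit-learn, `threshold` plays the role of τ, `branching_factor` the role of β, and setting `n_clusters` triggers the final global Agglomerative Clustering pass described above.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(200, 2))   # stand-in for the minority samples

# threshold ~ tau, branching_factor ~ beta; n_clusters=m runs the final
# global Agglomerative Clustering step over the CF-tree leaf entries.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(X_minority)
clusters = [X_minority[labels == c] for c in np.unique(labels)]
```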

III. BI-BMCSMOTE ALGORITHM
When synthesizing new samples, traditional oversampling algorithms such as SMOTE overlook several issues. First, they do not filter out noise points, so many noise points end up in the synthesized dataset. Second, a new sample is synthesized on the line connecting two points, so it may fall into a majority-class region between them. Third, the information carried by boundary samples is highly important, and ignoring them leads to only marginal improvement in the final classification.
In response to these problems, this paper proposes the BIRCH and Boundary Midpoint Centroid Synthetic Minority Over-Sampling Technique (BI-BMCSMOTE), which consists of four main steps: BIRCH clustering; marking boundary minority samples according to probability; calculating each cluster's density and assigning sampling weights; and, finally, synthesizing new minority samples. Fig. 2 shows the main steps of the BI-BMCSMOTE algorithm.

A. RELATED DEFINITIONS
Definition 1 (Cluster Density Distribution Function): The density distribution function of cluster $C_i$ is defined as the ratio of the number of sample points in cluster $C_i$ to the volume of the hypersphere formed by the sample points in $C_i$:
$$density(C_i) = \frac{NN_{C_i}}{V_{C_i}}, \quad V_{C_i} = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r_i^d,$$
where $NN_{C_i}$ is the number of sample points in cluster $C_i$, $V_{C_i}$ is the volume of the $d$-dimensional hypersphere [33] formed by the sample points in cluster $C_i$, and $r_i$ is the Euclidean distance from the centroid $u_i$ to the farthest point in the cluster.
Definition 2 (Sampling Weight): The sampling weight of cluster $C_i$ is the reciprocal of its density distribution function divided by the sum of the reciprocals of all cluster density distribution functions [21]:
$$w_i = \frac{1 / density(C_i)}{\sum_{j=1}^{m} 1 / density(C_j)}.$$
Since $\sum_{i=1}^{m} w_i = 1$, the number of new samples to generate in cluster $C_i$ is $N_{C_i} = w_i \times N$, where $N$ is the total number of samples to synthesize.
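A small sketch of Definitions 1 and 2 follows, assuming each cluster is a NumPy array of its sample points; the helper names are ours, and the hypersphere volume is the standard d-dimensional formula.

```python
import numpy as np
from scipy.special import gamma

def cluster_density(cluster):
    """Definition 1: density(C_i) = NN_Ci / V_Ci, where V_Ci is the volume
    of the d-dimensional hypersphere whose radius r_i is the distance from
    the centroid u_i to the farthest point in the cluster."""
    n, d = cluster.shape
    centroid = cluster.mean(axis=0)
    r = max(np.linalg.norm(cluster - centroid, axis=1).max(), 1e-12)
    volume = (np.pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d
    return n / volume

def sampling_weights(clusters):
    """Definition 2: w_i is the inverse density of C_i normalised over all
    clusters, so sparser clusters receive more synthetic samples."""
    inv = np.array([1.0 / cluster_density(c) for c in clusters])
    return inv / inv.sum()
```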

B. BIRCH CLUSTERING OF MINORITY SAMPLES
First, we use the BIRCH algorithm to cluster the minority samples in the original dataset. The theory and characteristics of BIRCH were elaborated in Section 2-B. Before performing BIRCH clustering, we also apply a denoising operation to improve the clustering result. This is necessary because we run BIRCH through the scikit-learn package in Python, which does not include a noise-removal step. We therefore treat a minority sample as noise when at least $k-1$ of its $k$ nearest neighbors belong to the majority class, and delete such samples before clustering the minority class; our tests confirm that this is effective.
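A possible reading of this denoising rule in code, under the assumption that a minority point is discarded when at least k−1 of its k nearest neighbors are majority class; the function name and label argument are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_noise(X, y, minority_label, k=5):
    """Drop minority samples whose k nearest neighbours (in the full
    dataset) include at least k-1 majority samples; returns P_1."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    keep = []
    for row in idx:
        neigh_labels = y[row[1:]]           # row[0] is the point itself
        n_majority = np.sum(neigh_labels != minority_label)
        keep.append(n_majority < k - 1)     # k-1 or more majority -> noise
    return X[y == minority_label][np.array(keep)]
```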

C. IDENTIFY BOUNDARY MINORITY SAMPLES AND PROPORTIONAL MIDPOINT OVERSAMPLING
Identifying the boundary minority samples can be carried out simultaneously with the steps in Section 3-B, because the identification method does not depend on the clusters. Considering boundary minority sampling together with proportional midpoint oversampling makes the new algorithm easier to understand.
If new samples are synthesized by ordinary oversampling alone, the key information carried by the boundary minority samples can be lost. Boundary minority samples are few in number and hard to learn, because they lie in regions that overlap with samples of other classes. We therefore extend the algorithm to identify and label the boundary minority samples, and then set each one's sampling probability according to the density of majority-class neighbors around it: the greater the density, the higher the probability of being sampled.
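The following sketch shows one way to implement this marking rule and probability assignment. Normalizing the extraction probability PR by the majority-neighbor count is our assumption, since the text only states that denser majority neighborhoods should receive higher probabilities; all names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mark_boundary(X, y, P1, minority_label, k=5):
    """Mark p_a in P1 as boundary when maj_a > min_a among its k nearest
    neighbours; PR weights each boundary sample by its majority count."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(P1)
    idx = idx[:, 1:]                               # drop the assumed self-match
    maj = np.array([(y[row] != minority_label).sum() for row in idx])
    is_boundary = maj > (k - maj)                  # maj_a > min_a
    H = P1[is_boundary]
    PR = maj[is_boundary] / maj[is_boundary].sum() # denser -> more likely
    return H, PR
```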
We then conduct proportional sampling. In each cluster, this method not only draws boundary minority samples according to their probabilities to synthesize new samples, but also uses the minority samples within the cluster to generate new samples. The ratio between the two kinds of synthetic samples is set to 1:1; results under other ratios are analyzed and discussed in Section 5-B. This is an innovation of our design: the importance of the boundary differs across datasets, and data whose boundary is less important is given a smaller synthesizing ratio for the boundary minority samples.
Finally, midpoint oversampling [21] is applied. The BI-BMCSMOTE algorithm synthesizes new samples differently from SMOTE. SMOTE searches for the closest neighbors of a minority sample, so the new and old samples can be nearly identical. In contrast, the BMCSMOTE step sorts the samples in a cluster by their distance to the centroid, pairs samples that are as far from each other as possible, and synthesizes each new sample on the line connecting the paired samples. This approach avoids high similarity by increasing the diversity of the new samples, and it also provides the classifier with more useful information for classification.
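A compact sketch of the pairing-and-synthesis step, assuming (as our reading of [21]) that each new sample is placed at the midpoint of the segment connecting a near-centroid point with its far-centroid partner.

```python
import numpy as np

def midpoint_synthesis(G):
    """Sort by distance to the centroid u_i, split into the near/far halves
    X_min and X_max, pair them in order, and place each new sample at the
    midpoint of the connecting segment."""
    centroid = G.mean(axis=0)
    order = np.argsort(np.linalg.norm(G - centroid, axis=1))
    G = G[order]
    mid = len(G) // 2
    near, far = G[:mid], G[mid:2 * mid]   # equal halves; odd leftover dropped
    return (near + far) / 2.0             # midpoint of each pair PP_j
```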

D. BI-BMCSMOTE ALGORITHM
Compared with traditional algorithms such as SMOTE, the BI-BMCSMOTE algorithm not only clusters the minority samples but also adds boundary minority samples, selected by probability, to the synthesis of new samples, and it interpolates between distant intra-cluster pairs rather than between nearest neighbors, as described in Section 3-C. The complete procedure is as follows:
1. For each minority sample, find its $k$ nearest neighbors; treat a minority sample as noise and remove it when $k-1$ of its neighbors belong to the majority class. This yields the new minority dataset $P_1$.
2. Substitute $P_1$ into the BIRCH clustering algorithm and output the clusters $C$.
3. For each $p_a \in P_1$, find its $k$ nearest neighbors, and count its majority neighbors $maj_a$ and minority neighbors $min_a$. When $maj_a > min_a$, mark the sample as a boundary minority sample. The resulting boundary minority sample set is $H$.
4. For each sample in $H$, compute its sampling probability from the density of its majority neighbors; the extraction probability of set $H$ is $PR$.
5. For each $C_i \in C$: calculate the density distribution function of cluster $C_i$ according to Definition 1, $density(C_i)$.
6. Calculate the oversampling weight of each cluster according to Definition 2, $W$.
7. For each $C_i \in C$:
1) According to the oversampling weight $w_i$, calculate the number of new samples to generate in cluster $C_i$, $N_{C_i}$, split in the ratio 1:1: for $N_{C_{i1}}$, extract sample points from $H$ with probability $PR$; for $N_{C_{i2}}$, extract sample points from $P_1$. This gives the sample sets $G_{i1}$ and $G_{i2}$, respectively.
2) Calculate the Euclidean distance $r_j$ from every extracted sample point to the centroid $u_i$, and sort the points by $r_j$ in ascending order: $C_{i1} = \{x_1, \cdots, x_{mid}, x_{mid+1}, \cdots, x_i\}$.
3) Split the sorted set at the middle into the near-centroid set $X_{min} = \{x_1, x_2, \cdots, x_{mid}\}$ and the far-centroid set $X_{max} = \{x_{mid+1}, x_{mid+2}, \cdots, x_i\}$.
4) Following the order of the sets $X_{min}$ and $X_{max}$, pair one sample from each set: $PP = \{x_1 \odot x_{mid+1}, x_2 \odot x_{mid+2}, \cdots, x_{mid} \odot x_i\}$, where $\odot$ denotes pairing, i.e., $PP_j = x_j \odot x_{mid+j}$, $j = 1, \cdots, mid$, and synthesize a new sample on the segment connecting each pair.

IV. EXPERIMENTAL STUDY
The empirical analysis in this section has four components: dataset description, evaluation measures, oversampling algorithm and classifier selection, and the parameter settings of the proposed and comparison methods.

A. DATASETS
This article uses eleven real datasets. Ten are selected from the UCI [23] and KEEL [24] repositories, and their characteristics are shown in Table 1; one further large-scale dataset is selected from the Kaggle database. Before applying the oversampling algorithms, all datasets undergo simple preprocessing, including the removal of duplicate variables and data standardization, to increase data validity.
The ten datasets in Table 1 were chosen because their sample sizes and numbers of variables span a wide range, which shows that the BI-BMCSMOTE oversampling algorithm can be applied to datasets of different dimensions. However, as applications develop, the sample sizes of the datasets to be tested keep increasing, and so do the numbers of variables. To show that BI-BMCSMOTE also performs well on large-scale data, this paper additionally uses the credit card default dataset from the Kaggle database, which records roughly 30,000 Taiwanese credit card customers' delinquency over five months in 2005. The dataset contains 25 variables and has an imbalance ratio of 4.52; the ID variable is meaningless and is deleted. The same simple preprocessing is applied to this dataset.
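A minimal preprocessing sketch matching this description; the file name and column names are placeholders for the Kaggle credit-default data, not names taken from the paper.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder file/column names for the Kaggle credit-default dataset.
df = pd.read_csv("credit_default.csv")
df = df.drop(columns=["ID"]).drop_duplicates()        # remove ID, duplicates
X = StandardScaler().fit_transform(df.drop(columns=["default"]))
y = df["default"].to_numpy()                          # 1 = default (minority)
```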
To visually demonstrate the superiority of the BI-BMCSMOTE oversampling algorithm, a synthetic two-dimensional imbalanced dataset is also used. As shown in Fig. 3, the blue triangles are the majority samples (80%) and the black circles are the minority samples (20%); X1 and X2 denote the two features. The dataset contains noise samples and the class boundary is not clearly demarcated, so direct classification without any countermeasure yields only mediocre results.

B. EVALUATION MEASURES
Accuracy is an important measure in traditional classification evaluation, but it is not suitable for imbalanced classification: because the minority class is so rare, a classifier can misclassify every minority sample and still achieve high accuracy. Researchers have therefore put forward stronger schemes for evaluating classifiers on imbalanced datasets [25], [26]. In this paper, F-measure and AUC are adopted as the evaluation measures for imbalanced classification [25]. Both are based on the confusion matrix (see Table 2), whose four combinations of predicted and actual results are True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). The F-measure is defined as
$$F\text{-}measure = \frac{(1 + \beta^2) \times recall \times precision}{\beta^2 \times precision + recall},$$
where $recall = \frac{TP}{TP+FN}$, $precision = \frac{TP}{TP+FP}$, and $\beta > 0$ expresses the relative importance of recall to precision. The F-measure is the harmonic mean of precision and recall, weighing the importance of the two indicators. AUC is the probability that a randomly chosen positive sample is ranked above a randomly chosen negative sample; it equals the area under the ROC curve, and its value is at most 1.
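Both measures are available in scikit-learn; a toy sketch is shown below (the arrays are illustrative only).

```python
from sklearn.metrics import fbeta_score, roc_auc_score

# Toy example: minority class coded as 1.
y_true  = [0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]   # predicted minority probabilities

f = fbeta_score(y_true, y_pred, beta=1.0)  # F-measure (harmonic mean at beta=1)
auc = roc_auc_score(y_true, y_score)       # area under the ROC curve
```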

C. OVER-SAMPLING ALGORITHM AND SELECTION OF CLASSIFIER
This paper compares five oversampling algorithms, SMOTE, Borderline-SMOTE, ADASYN, MWMOTE, and DBSMOTE, with the BI-BMCSMOTE oversampling algorithm, applying each to the ten real datasets and the larger credit default dataset for sample synthesis. After a balanced dataset is obtained, it is classified using five mainstream machine learning models: K-nearest neighbors (KNN) [26], random forest (RF) [28], support vector machine (SVM) [29], eXtreme Gradient Boosting (XGBoost) [30], and Light Gradient Boosting Machine (LGBM) [30]. Some of these classifiers are classic and well known, while others are currently popular and frequently used in academia and industry. They are selected to show that the balanced datasets synthesized by the BI-BMCSMOTE oversampling algorithm are valid and stable across a wide range of models.
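For reference, the five classifiers can be instantiated as follows; hyperparameters are left at library defaults here, whereas the paper tunes, for example, K ∈ [3, 5] for KNN.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(),
    "SVM": SVC(probability=True),   # probability=True enables AUC scoring
    "XGBoost": XGBClassifier(),
    "LGBM": LGBMClassifier(),
}
```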

D. EXPERIMENTAL ENVIRONMENT SETTING
In this paper, the number of neighbors for SMOTE, Borderline-SMOTE, and ADASYN is 5. The parameters of MWMOTE are k1 = 5, k2 = 3, k3 = 5. The neighborhood radius parameter of the DBSCAN clustering step inside DBSMOTE is searched over the range (0.001, 2) with a step of 0.05, and the MinPts parameter over the range (2, 10); the optimal radius and MinPts are selected by manual screening. The number of nearest neighbors of the KNN classifier is K ∈ [3, 5].
In our proposed method, the number of clusters generated by the BIRCH clustering algorithm is m ∈ [2, 6], and the number of minority nearest neighbors of the BI-BMCSMOTE algorithm is k ∈ [3, 12]. In the clustering process, Agglomerative Clustering is used to cluster all CF tuples, which eliminates the unreasonable tree structures caused by the sample reading order and the tree splits caused by the limit on the number of CF node entries.
The performance results for each dataset are obtained by stratified five-fold cross-validation [26]. When new samples are synthesized, the ratio of the selected boundary minority samples to the intra-cluster minority samples is 1:1. In the final balanced dataset, the ratio of the majority class to the minority class is 1:1 [31], which gives a more reliable evaluation.
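A sketch of this evaluation protocol, with the oversampler applied to the training folds only so that synthetic points cannot leak into the test fold. Here `oversample` is a placeholder for BI-BMCSMOTE or any comparison method, `X` and `y` are assumed NumPy arrays, and random forest stands in for the five classifiers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1s, aucs = [], []
for tr, te in skf.split(X, y):
    # placeholder oversampler: returns a 1:1 balanced training set
    X_bal, y_bal = oversample(X[tr], y[tr])
    clf = RandomForestClassifier().fit(X_bal, y_bal)
    f1s.append(f1_score(y[te], clf.predict(X[te])))
    aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
print(np.mean(f1s), np.mean(aucs))
```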

V. EXPERIMENTAL RESULTS AND ANALYSIS
This section first presents the visual results of the SMOTE and BI-BMCSMOTE algorithms on the synthetic two-dimensional imbalanced dataset to demonstrate the superiority of the BI-BMCSMOTE algorithm. It then tabulates and compares the performance of each oversampling algorithm under the different classifiers, with F-Measure and AUC as the evaluation indexes. Finally, to examine the ratio between new samples synthesized from extracted boundary minority samples and those synthesized from extracted intra-cluster minority samples, the ten real datasets and the credit dataset are tested further, followed by an interpretation of the results.

A. VISUAL ANALYSIS
Fig. 4 shows that, owing to the intrinsic noise in the dataset, the samples synthesized by the SMOTE algorithm invade the majority-class region and further amplify the noise. Moreover, minority samples with more minority neighbors take part in synthesizing more new samples, while the boundary minority samples have little chance to participate, so much of the synthesis effort is wasted on already well-represented intra-cluster samples. Because the boundary samples are insufficiently learned, important information about the boundary is also ignored. Fig. 5 shows that the BI-BMCSMOTE algorithm avoids this noise interference and, while focusing on learning the boundary samples, does not neglect the intra-cluster samples. Another advantage is that the ratio of boundary samples to intra-cluster samples can be adjusted freely during synthesis (the ratio is set to 1:1 in this test).

B. ANALYSIS OF EXPERIMENTAL RESULTS OF BI-BMCSMOTE ALGORITHM
The best result of each test is bolded to facilitate comparison. F-Measure rows are denoted by ''F'' appended to the dataset name, and AUC rows by ''A''. As shown in Tables 3-7, the BI-BMCSMOTE algorithm has an average F-Measure of 0.897 over the five classifiers, 2.5% higher than the average F-Measure of the other algorithms, and it attains the highest F-Measure in 76.36% of all 55 tests. Its largest F-Measure gain is over the MWMOTE algorithm (an average improvement of 3.05%). Across classifiers, the new algorithm's F-Measure is best on the LGBM classifier (an average of 0.9067), and its largest F-Measure advantage over the other algorithms appears on the SVM classifier (an average improvement of 5.21%). In addition, the BI-BMCSMOTE algorithm has an average AUC of 0.9384 over the five classifiers, 1.54% higher than the average AUC of the other algorithms, and it attains the highest AUC in 69.1% of all 55 tests. Its largest AUC gain is over the ADASYN algorithm (an average improvement of 2.06%). Across classifiers, the new algorithm's AUC is best on the RF classifier (an average of 0.9501), and its largest AUC advantage over the other algorithms appears on the SVM classifier (an average improvement of 3.8%). These results indicate that the BI-BMCSMOTE algorithm has clear advantages over the other oversampling algorithms, and that it is especially stable when run with the SVM classifier.

C. CHOOSING APPROPRIATE VALUES FOR BI-BMCSMOTE PARAMETERS
The BI-BMCSMOTE algorithm has two parameters to select: the number of clusters m generated by BIRCH clustering and the number of neighbors k of the minority samples. If m is set to a large value, more and smaller clusters are generated; if m is small, fewer and larger clusters are generated. The choice of m therefore depends on the size of the dataset; over the tested datasets of 106 to nearly 30,000 samples, m was best set between 2 and 6. For k, a larger value enlarges the neighborhood of each minority sample; if the dataset has few minority samples and k is set high, more majority-class points enter the neighborhoods of the minority samples and distort the experimental results. As long as k avoids extreme values, it has little influence on the results. The choice of k therefore depends on the number of minority samples and on the size of the dataset; over the tested datasets of 106 to nearly 30,000 samples, k was best set between 3 and 12.

D. RATIO ANALYSIS OF SYNTHETIC NEW SAMPLES
The new samples are generated from the previously defined boundary minority samples and from the intra-cluster minority samples. To verify whether the ratio between the new samples generated from these two types has an impact on the final result, Table 8 reports the test results of the new algorithm on the eleven datasets with the ratio set successively to 1:9, 2:8, 3:7, ..., 9:1. The results in Table 8 show that the 1:1 ratio does not always perform best. Finally, a frequency table of the optimal ratio is built by recording, for each ratio, the number of times it achieves the highest F-Measure and AUC. As shown in Fig. 6, the frequencies are high at both ends and low in the middle: some datasets carry important information at the boundary, so more boundary samples need to be synthesized for learning, whereas for datasets with less boundary information the learning should focus on synthesizing intra-cluster samples.

VI. CONCLUSION
This paper proposes a new oversampling algorithm, BI-BMCSMOTE, which combines attention to the minority-class boundary with cluster density functions, providing a new method for handling imbalanced datasets. The BI-BMCSMOTE algorithm executes four steps: perform BIRCH clustering with a single scan of the dataset using a tree structure; calculate the number of samples to generate in each cluster according to the cluster density; identify the boundary minority samples and mark them with sampling probabilities; and synthesize new samples proportionally from the marked boundary samples and the normal samples. The method benefits from the speed and stability of BIRCH clustering and from enhanced boundary learning, while midpoint centroid oversampling avoids overfitting. It can therefore synthesize diverse new samples and balance the dataset.
The novelty of the BI-BMCSMOTE algorithm is that, on the premise of identifying and removing noise, it considers the important information of the boundary samples while retaining both the normal-sample information and the boundary-sample information selected by importance. Our future research will focus on further improving the stability of the BIRCH clustering algorithm on data of different scales, and on improving the sample synthesis scheme to further prevent overfitting.