SOTB: Semi-Supervised Oversampling Approach Based on Trigonal Barycenter Theory

The problem of classifying imbalanced data is one of the active research directions in machine learning and bioinformatics. Imbalanced data greatly degrades the accuracy of classifiers. A good oversampling method improves the diversity and validity of the new samples, which can not only solve the imbalance problem of the sample data but also greatly improve classification accuracy. In this study, we propose the trigonal barycenter theory and a semi-supervised oversampling method based on it, called SOTB (Semi-supervised Oversampling method based on Trigonal Barycenter theory). SOTB works to: (1) construct non-intersecting triangles based on the Mahalanobis distance; (2) combine semi-supervised sampling with the trigonal barycenter theory to oversample the positive samples, which copes with the data imbalance problem without affecting the quality of the data. Lastly, extensive experiments were conducted to verify the effectiveness of the proposed method. The results demonstrate that SOTB improves the validity, diversity and rationality of the distribution of the newly generated samples, and alleviates the over-fitting phenomenon that is common in existing oversampling approaches. In particular, when compared with state-of-the-art oversampling methods, SOTB achieves the best classification performance.


I. INTRODUCTION
Imbalanced data is pervasive in everyday life, e.g., cancer gene data from hospital health examinations [1], software defect data generated by defect detection tools [2] and telecommunication fraud data [3], [4] produced in telecommunication systems. Data of this kind, whose samples fall into categories of very different sizes, are called imbalanced data [5]. Generally speaking, machine learning approaches use a large amount of sample data to train the classification model, which gives the model the capability of accurately predicting the class labels of data [6], [7]. However, employing imbalanced data to train the model degrades the recognition rate of the classification model, resulting in a decrease in classification performance. As an important means of improving classification and prediction performance, balancing imbalanced data plays an essential role in classification research and prediction analysis [8]. (The associate editor coordinating the review of this manuscript and approving it for publication was Navanietha Krishnaraj Rathinam.)
Generally speaking, in classification problems the number of negative samples is much greater than the number of positive samples, yet for practical purposes the recognition of positive samples matters more than that of negative samples. For example, in cancer gene detection, there may be only one cancer gene among thousands of genes. If a normal gene is detected as a cancer gene, it causes an unnecessary psychological burden to the patient. On the contrary, if a cancer gene is wrongly detected as a normal gene, the patient may miss the optimal treatment period, which can be life-threatening. Another illustrative example is virus detection, where only a few virus programs exist among a large number of software programs. If a normal program is recognized as a virus program, the computer may isolate it, which affects the normal operation of related programs. If a virus program is recognized as a normal one, it may bring serious consequences, e.g., information leaks and computer crashes that threaten the safety of personal property. In bioinformatics research, imbalanced data is pervasive in various application scenarios, e.g., identifying DNA-protein binding sites, RNA binding residues and snoRNA. Such imbalanced data plays an important role in bioinformatics research; for example, improving the prediction accuracy of DNA-protein binding sites can deepen our understanding of disease. Balancing imbalanced data has therefore become an important research direction in bioinformatics, and it is of great theoretical and practical significance to cope with the imbalanced classification problem and improve the recognition rate of positive samples.
The fundamental reason for the decline in classification performance caused by imbalanced data is that most classifiers train their optimal parameters by gradient descent or gradient ascent [9]. When imbalanced data is used to train a classifier, the negative samples are presented far more often than the positive samples, which implies that the parameters of the classifier are excessively trained to recognize negative samples; the classifier therefore identifies negative samples accurately but fails to accurately identify positive ones. By using a balanced dataset to train the classifier, the classifier can learn the characteristics of the positive samples thoroughly and thus improve its results. Therefore, it is important to propose an effective and efficient method to solve the data imbalance problem in machine learning.
Contributions: In order to improve the performance of imbalanced data classification, the main contributions of this paper are as follows: New theory. We integrate the trigonal barycenter theory to generate new samples, which has the following advantages: 1) the new samples do not spread across the boundary of the samples, so the effectiveness of oversampling is greatly improved; 2) the new samples are uniformly distributed in the sample space, effectively describing and filling the sample space and improving the performance of machine learning; 3) the newly generated samples are rich in diversity, which effectively alleviates the phenomenon of over-fitting.
New method. In order to effectively balance various data sets, we propose the SOTB method (Semi-supervised Oversampling method based on Trigonal Barycenter theory). SOTB applies semi-supervised theory to resample the imbalanced data and thereby improve classification accuracy.
Extensive experimental results. In order to verify the effectiveness of SOTB, we conducted extensive experiments by comparing it with the state-of-the-art oversampling methods to evaluate the performance.

II. RELATED WORK
How to effectively train the classification model has become one of the important research directions in machine learning. Researchers have devoted considerable attention and effort to it, and consequently several approaches have been proposed to solve the imbalance problem and improve classification performance. The most straightforward methods focus on modifying the classifier. For example, Zularaert et al. [10] proposed a new activation function in the convolutional neural network, which improves the validity of the convolution layer. Zhang et al. [11] viewed the performance evaluation function as the target function and proposed a loss function to improve the effect of training. Zhang et al. [12] replaced the middle layers of neural networks with a new function, which saves training time and improves classification accuracy.
However, the aforementioned methods of improving the classifier have the following drawbacks: 1) the improvement is not obvious, because existing classifiers already work well at the current stage; 2) there is an upper bound on classification accuracy, because the number of positive samples in the imbalanced dataset is insufficient, so the classifier cannot fully learn their features and the recognition rate of positive samples is difficult to improve. Improving the classifier cannot fundamentally cope with the imbalanced classification problem. Some researchers have therefore integrated the results of multiple classifiers to improve classification accuracy. For example, Zhang et al. [13] proposed an ensemble method that votes on the classification results of multiple convolutional neural networks in order to obtain more comprehensive results, and Siers and Islam [14] combined the classification results of multiple decision trees. However, these methods still have the following disadvantages: 1) the time cost is high, because assembling the classification results requires training multiple classifiers; 2) the improvement in classification accuracy is still limited due to the lack of positive samples.
In order to improve classification performance in an effective fashion, methods for balancing the data have been proposed. Some approaches discard a part of the negative samples so that the positive and negative samples become balanced in number; this is called the undersampling method [15]. Yu et al. [16] proposed to use random undersampling to predict protein residue binding sites by randomly abandoning a portion of the negative samples, which can increase the accuracy of predicting the protein residue binding position. Yen and Lee [17] found that the random undersampling method loses many useful samples due to random selection; to avoid this randomness, the negative samples are clustered and representative samples in the clusters are selected as the training set to improve the effectiveness of undersampling. Babar and Ade [18] believe that more attention should be paid to the distribution characteristics of the samples, and that undersampling based on the sample distribution can reduce the blindness of undersampling so as to guarantee classification accuracy. In summary, undersampling can improve the classification performance on positive samples to a certain extent, but it inevitably discards negative samples, so the classifier cannot fully learn the features of the negative samples, which reduces classification accuracy.
In order to reduce the loss of data features, several oversampling methods have been proposed. The basic idea of oversampling is to generate new samples with certain validity and rich diversity, simulating the acquisition of positive samples in reality in order to balance the numbers of positive and negative samples. Oversampling methods can solve the imbalance problem to some extent [19], [20], and oversampling has become an active research direction for the imbalanced classification problem. The most classical oversampling method is SMOTE, proposed by Chawla et al. [21]. Its basic idea is to randomly generate samples between a positive sample and its k nearest neighbors; since each newly generated sample is a random point between two samples, diversity is increased. Han et al. [22] observed that samples on the boundary between positive and negative samples are easily misclassified; their method finds positive samples at the boundary of the positive sample space and then performs SMOTE oversampling on them to improve the learning of boundary samples. Ramentol et al. [23] combined SMOTE with fuzzy rough set theory to achieve good classification results with decision trees. Nakamura et al. [24] proposed the LVQ-SMOTE method, which oversamples the positive samples using learning vector quantization; the newly generated positive samples effectively fill the positive sample space. Dong and Wang [25] proposed Random-SMOTE, which randomly performs SMOTE oversampling among positive samples to increase the diversity of SMOTE oversampling.
Oversampling algorithms based on SMOTE can improve classification accuracy, but current algorithms have some disadvantages, because most of them generate samples based on k nearest neighbor theory: a new sample is randomly generated between the currently selected sample and one of its k nearest neighbors. Samples generated this way have three drawbacks: (1) The newly generated samples may distribute across the boundary of positive and negative samples. Here the sample boundary refers to the edge of the sample space in which a certain type of sample is located. Fig. 1 shows the distribution of two kinds of samples, black for positive samples, white for negative ones; the central black five-pointed star is the currently selected positive sample at the initial stage. In the sample generation phase, since the currently selected sample is located in the central region of the positive samples, its k nearest neighbors are all positive samples, so the generated sample does not cross the sample boundary, as shown in Fig. 2(a).
As shown in Fig. 2(b), when the currently selected sample is located at the sample boundary, its k nearest neighbors include both types of samples: some are positive and the others are negative. Thus, some newly generated samples cross the sample boundary; these are called noise samples. Since such samples lie outside the sample space, using them to train the classifier greatly degrades classification accuracy.
(2) The distribution of synthetic samples is unstable. Oversampling methods based on k nearest neighbor theory mostly adopt Equation 1 to generate samples:

x_new = x_i + α · (x_j − x_i), α ∈ (0, 1)    (1)

that is, a random point between the two samples x_i and x_j is chosen as the new sample.
Although this method increases the diversity of the samples, the distribution of the new samples is unstable. Because each new sample is a random point between two existing samples, the generated samples may be concentrated in one area and sparse in another, resulting in an unstable distribution after oversampling. When the value of α is very small, the synthetic samples lie in the neighborhood of the currently selected sample, as shown in Fig. 2(a). If the value of α is large, the newly generated samples deviate from the currently selected sample, as shown in Fig. 2(b), which disrupts the stability of the data distribution.
(3) The new samples do not have rich diversity.
With the sample generation method of Equation 1, the newly generated samples can only be located on the connecting line between the two samples in space. Although a generated sample differs from the original ones, it is impossible to generate a new sample away from this connecting line, so the diversity of the new samples is insufficient, as shown in Fig. 3.
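The limitation of Equation 1 can be seen in a short sketch (the function name and data here are illustrative, not from the original paper): every generated point is a convex combination of its two parents, so it can never leave their connecting line.

```python
import random

def smote_style_sample(x_i, x_j, alpha=None):
    """Equation-1 style generation: x_new = x_i + alpha * (x_j - x_i), alpha in (0, 1)."""
    if alpha is None:
        alpha = random.random()
    return [a + alpha * (b - a) for a, b in zip(x_i, x_j)]

x_i, x_j = [0.0, 0.0], [2.0, 4.0]
print(smote_style_sample(x_i, x_j, alpha=0.5))  # midpoint: [1.0, 2.0]
# Whatever alpha is drawn, the new sample stays on the line y = 2x
# through the two parents, which is why its diversity is limited.
```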
In order to overcome the aforementioned problems, in this paper we propose a semi-supervised oversampling method based on the trigonal barycenter theory, called SOTB, which effectively solves the following problems of existing oversampling methods: (1) new samples distributed across the sample boundary, (2) an unstable distribution of new samples, and (3) a lack of richness in the diversity of new samples; it can also alleviate the over-fitting problem.

III. SEMI-SUPERVISED OVERSAMPLING METHOD BASED ON TRIGONAL BARYCENTER THEORY
In this study, the proposed semi-supervised oversampling method utilizes the characteristics of the trigonal barycenter and semi-supervised learning to generate positive samples, which effectively solves the problems of new samples crossing the sample boundary, an unstable distribution of samples, and insufficient diversity. In particular, it reduces over-fitting in the training phase and greatly eliminates the influence of imbalanced data. The semi-supervised oversampling algorithm based on the trigonal barycenter theory mainly includes the following steps: 1) distance measurement between samples, 2) non-intersecting triangle construction, 3) sample generation, and 4) semi-supervised oversampling.

A. DISTANCE MEASUREMENT BETWEEN SAMPLES
Measuring the similarity between two samples with an appropriate method is essential [8]. The Mahalanobis distance represents the covariance distance between two objects. Unlike the Euclidean distance, it takes into full consideration the relationships between different features, and this kind of relationship is scale-invariant. Furthermore, the Mahalanobis distance is not affected by dimensions, which implies that the Mahalanobis distance between two samples is independent of the unit of measurement.
Definition 1 (Covariance Matrix): Given a sample set X = {x_1, x_2, ..., x_i, ..., x_m}, i ∈ {1, 2, ..., m}, there are m samples, each x_i is an n-dimensional feature vector, so X is an m × n matrix in which each row is a sample. The covariance matrix is given in Equation 2 and Equation 3:

Σ = E[(x − E(x))(x − E(x))^T]    (2)

Σ_pq = E[(x_p − E(x_p))(x_q − E(x_q))]    (3)

where Σ represents the covariance matrix, x denotes a sample drawn from X, x_p denotes its p-th feature, and E(·) is the expectation.
Definition 2 (Mahalanobis Distance): For two samples x_i and x_j, taken as column vectors, their Mahalanobis distance is calculated by Equation 4:

d(x_i, x_j) = sqrt((x_i − x_j)^T Σ^{-1} (x_i − x_j))    (4)

where Σ^{-1} represents the inverse of the covariance matrix Σ and d(x_i, x_j) represents the Mahalanobis distance between x_i and x_j.
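As a minimal sketch, Definitions 1 and 2 can be computed with NumPy as follows (the helper name and toy data are illustrative; np.cov uses the unbiased m - 1 divisor):

```python
import numpy as np

def mahalanobis(x_i, x_j, X):
    """Mahalanobis distance of Definition 2, with the covariance matrix of
    Definition 1 estimated from the sample matrix X (rows are samples)."""
    sigma = np.cov(X, rowvar=False)           # Equations 2-3: covariance matrix
    sigma_inv = np.linalg.inv(sigma)          # assumes sigma is invertible
    d = np.asarray(x_i, float) - np.asarray(x_j, float)
    return float(np.sqrt(d @ sigma_inv @ d))  # Equation 4

# Toy data whose covariance is diagonal (0.75 * I), so the Mahalanobis
# distance is the Euclidean distance rescaled by 1 / sqrt(0.75).
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.],
              [2., 0.], [0., 2.], [2., 2.], [2., 1.], [1., 2.]])
print(round(mahalanobis([0., 0.], [3., 4.], X), 3))  # 5.774 (= 5 / sqrt(0.75))
```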

B. NON-INTERSECTING TRIANGLE CONSTRUCTION
A non-intersecting triangle construction is one in which the sides of any two triangles neither overlap nor intersect each other. In order to avoid overlapping triangles in the construction phase, we propose a non-intersecting triangle construction approach. This method regards the sample S_border that is farthest from the mean center sample S_mean as a benchmark, and uses the Mahalanobis distance between the other samples and S_border as a measurement. Ordering the samples by this measurement in ascending order, the non-intersecting triangles are constructed, yielding [m/3] triangles, where m is the number of positive samples and [·] represents rounding down.
The non-intersecting triangle construction method includes the following steps. Firstly, it finds the mean center of the positive samples by Equation 5:

S_mean = (1/m) · (x_1 + x_2 + ... + x_m)    (5)

Secondly, it finds S_border, the sample farthest from the mean center. Then, it computes the Mahalanobis distances between S_border and all positive samples other than S_border, sorts the distances in ascending order, and records the sequence numbers of the samples, so that all positive samples form a sequence sorted in ascending order of Mahalanobis distance to S_border, with S_border first in the sequence. Finally, following this ascending order, every three consecutive samples constitute a triangle, and a total of [m/3] triangles are generated.
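The construction steps above can be sketched as follows (an illustrative sketch: for brevity the Euclidean distance stands in for the Mahalanobis distance, and the data is made up):

```python
import numpy as np

def build_triangles(P):
    """Group the m positive samples into floor(m/3) triangles by the
    ascending-distance ordering described above."""
    P = np.asarray(P, dtype=float)
    s_mean = P.mean(axis=0)                               # Equation 5
    dists = np.linalg.norm(P - s_mean, axis=1)
    s_border = P[np.argmax(dists)]                        # farthest from S_mean
    order = np.argsort(np.linalg.norm(P - s_border, axis=1), kind="stable")
    m = len(P)
    # every three consecutive samples in the ordering form one triangle
    return [order[i:i + 3].tolist() for i in range(0, m - m % 3, 3)]

P = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6], [10, 10]]
print(build_triangles(P))  # [[6, 4, 5], [3, 1, 2]]; sample 0 is left over
```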

C. SAMPLE GENERATION METHOD
The sample generation method is the most important part of oversampling. Generation based on the triangle barycenter effectively alleviates the problems that new samples cross the sample boundary, that the distribution of new samples is unstable, and that the diversity of new samples is not rich. The specific generation method is to calculate the barycenter of the triangle via Equation 6 and Equation 7:

w_t = (x_it + x_jt + x_kt) / 3, t = 1, 2, ..., n    (6)

S_child = (w_1, w_2, ..., w_n)    (7)

where (x_i, x_j, x_k) represents a triangle, x_i represents the i-th sample, and x_it represents the t-th attribute value of the i-th sample. S_child is the sample newly generated from the current triangle.
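Equations 6-7 amount to averaging the three parents attribute-wise; a minimal sketch (the function name is illustrative):

```python
def barycenter(x_i, x_j, x_k):
    """Equations 6-7: each attribute of the child is the mean of the
    corresponding attributes of the triangle's three vertices."""
    return [(a + b + c) / 3.0 for a, b, c in zip(x_i, x_j, x_k)]

# The barycenter always lies strictly inside its triangle, which is why
# the generated sample cannot cross the sample boundary.
print(barycenter([0.0, 0.0], [3.0, 0.0], [0.0, 3.0]))  # [1.0, 1.0]
```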

D. SEMI-SUPERVISED OVERSAMPLING METHOD
The semi-supervised method is widely used in machine learning for improving classification accuracy. In order to cope with the data imbalance problem, we introduce a semi-supervised oversampling method, which merges the synthetic positive samples into the current positive samples to form a new positive sample set as the input of the next round of sampling. In this fashion, data balance is achieved after multiple rounds of sample generation. The data generation phase is shown in Fig. 4, which illustrates the semi-supervised oversampling method based on the trigonal barycenter theory. The initial phase includes six positive samples, S_1 to S_6. The mean center S_mean is obtained by Equation 5. Then, the Mahalanobis distance between S_mean and each positive sample is calculated to find the sample S_2 that is farthest from S_mean. The next step is to calculate the Mahalanobis distance of each sample to S_2 and sort the samples in ascending order to construct triangles, so S_1-S_3 and S_4-S_6 form two triangles. It is worth noticing that if the number of current samples is not a multiple of three, the remaining samples do not participate in sample generation. The next step is sample generation: on the basis of the trigonal barycenter theory, the new samples S_7 and S_8 are generated. The above steps are repeated to iteratively generate positive samples.
Because the synthetic child samples are used as mother samples in the next generation step, the proposed generation method is semi-supervised. The semi-supervised oversampling algorithm based on the trigonal barycenter theory is shown in Algorithm 1.
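Putting the pieces together, one round of generation feeding its children back in as mothers can be sketched as follows (an illustrative sketch under simplifying assumptions: Euclidean distance replaces the Mahalanobis distance, and at least three positive samples are assumed):

```python
import numpy as np

def sotb_round(P):
    """One generation round: order the positives, form triangles of
    consecutive samples, and emit each triangle's barycenter."""
    P = np.asarray(P, dtype=float)
    s_mean = P.mean(axis=0)                                  # Equation 5
    s_border = P[np.argmax(np.linalg.norm(P - s_mean, axis=1))]
    order = np.argsort(np.linalg.norm(P - s_border, axis=1), kind="stable")
    m = len(P)
    return np.array([P[order[i:i + 3]].mean(axis=0)          # Equations 6-7
                     for i in range(0, m - m % 3, 3)])

def sotb_oversample(P, n_neg):
    """Semi-supervised loop: merge each round's children into the positive
    set and repeat until the positives match the negative count."""
    P = np.asarray(P, dtype=float)
    while len(P) < n_neg:
        children = sotb_round(P)
        P = np.vstack([P, children[:n_neg - len(P)]])
    return P

P0 = [[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.], [1., 2.]]
print(len(sotb_oversample(P0, n_neg=10)))  # 10
```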

E. ALGORITHM COMPLEXITY ANALYSIS
For a positive sample set consisting of m positive samples whose attribute dimension is n, SOTB constructs the paired triangles by the non-intersecting triangle construction method and generates new samples through the trigonal barycenter theory. The construction method first finds the mean center of all positive samples by calculating the mean value of each attribute dimension over all positive samples, with time complexity O(n). Then, it calculates the Mahalanobis distance between the mean center and all positive samples and scans the distances to find the sample S_border farthest from S_mean; the time complexity of these two steps is O(2m). Secondly, it calculates the Mahalanobis distances of all positive samples to the boundary sample S_border and sorts them with the bubble sort method to obtain M_border; the time complexity of these two steps is O(m + m²). Lastly, the non-intersecting triangles are constructed and a new sample is generated inside each triangle, with time complexity O([m/3]). Therefore, the overall time complexity of SOTB is O(n + 3m + m² + [m/3]), which simplifies to O(m² + n).

IV. PROPERTIES OF SOTB ALGORITHM
The proposed semi-supervised oversampling approach based on trigonal barycenter theory has the following properties: Property 1: The newly generated samples do not distribute across the sample boundary.
For real data, a certain kind of sample is distributed in a continuous sample space, so it is very difficult to find the exact boundary of these sample spaces; currently, the sample boundary can only be roughly described by existing methods. In this paper, the boundary of the samples is delineated by the connecting lines between boundary samples.

Algorithm 1 Semi-Supervised Oversampling Based on Trigonal Barycenter Theory
Input: Positive sample set P, negative sample set N.
Output: Balanced data set D = {x_1, x_2, ..., x_m}.
1: while P.no < N.no do
2: Calculating the mean center S_mean of P by Equation 5;
3: Calculating the Mahalanobis distance between S_mean and each positive sample;
4: Finding S_border, the sample farthest from S_mean;
5: Calculating the Mahalanobis distance between S_border and each positive sample, sorting the distances by ascending order, and storing them in the matrix M_border;
6: Initializing S_child and putting newly generated samples into S_child;
7: for i = 0 to P.no − (P.no % 3) do
8: Initializing a sample s; // s represents the newly generated sample by the current triangle
9: if i % 3 == 0 then
10: Generating s at the barycenter of the i-th, (i+1)-th and (i+2)-th samples in the sorted order by Equations 6-7, and adding s to S_child;
11: end if
12: end for
13: P = P ∪ S_child;
14: end while

Samples outside the boundary are regarded as not belonging to the same category as the samples inside the boundary area. In particular, in this study we choose the trigonal barycenter instead of the barycenter of a polygon with more than three edges, because the barycenter of such a polygon may lie outside the polygon. As shown in Fig. 5, when the newly generated sample is not inside the polygon, it spans across the sample boundary, so it is viewed as a noise sample; using this kind of sample to train the classifier greatly reduces the classification performance of the classifier.
Property 2: The new samples distribute uniformly in the sample space. Whether the training samples are uniformly distributed in the sample space has a very important influence on the training of the classifier. For example, if the newly generated samples are too concentrated in a certain region, the classifier will enhance its learning in that region while the learning of other regions is weakened, resulting in insufficient learning of the samples in the weakened regions. SOTB handles this problem well, because each newly generated sample lies at the barycenter of three samples. After multiple iterations, the distribution of the new samples is not concentrated in any single region but is more uniform across the sample space, avoiding the case where some regions are too densely populated. The classifier can therefore be adequately and fully trained to achieve high classification accuracy.
Property 3: The newly generated sample is rich in diversity.
In the sample space, samples with rich diversity can describe the distribution of the samples very well and improve the classification performance of the classifier. The reason why random oversampling methods yield limited improvements in classification accuracy is that the synthetic positive samples are similar to the original positive samples; the new samples are not rich in diversity, so the classifier repeatedly learns the same samples, resulting in over-fitting. SMOTE, a famous oversampling method based on k nearest neighbors, has a certain diversity in sample generation, but this diversity is not rich enough. SOTB generates new samples at the barycenter of three samples, which makes the new samples rich in diversity, meets the requirement of sample diversity in imbalanced classification, and effectively improves classification accuracy. At the same time, due to the rich diversity of the newly generated samples, SOTB greatly reduces the over-fitting phenomenon.

V. EXPERIMENTS AND ANALYSIS

A. DATA DESCRIPTION
In order to verify the validity of the semi-supervised oversampling method based on the trigonal barycenter, we conducted experiments on six different types of imbalanced data sets from a publicly available database [26], as shown in Table 1. In Table 1, S represents the total number of samples, Pos and Neg represent the numbers of positive and negative samples, respectively, F represents the number of sample dimensions, and IR represents the ratio of the number of negative samples to the number of positive samples.

B. EVALUATION METRICS
In experiments, appropriate evaluation metrics are needed to fairly evaluate the performance of different classification methods and accurately reflect classification performance. The classification accuracy Acc cannot meet the evaluation requirements of imbalanced classification. For example, for a data set with an imbalance ratio of 9, if the classifier labels all samples as negative, the overall classification accuracy is 90%, yet none of the positive samples is correctly classified. Due to the importance of positive samples, the key task in imbalanced learning is their identification, so when selecting evaluation metrics we should pay more attention to identifying positive samples. In the confusion matrix of the binary classification problem, shown in Table 2, TP represents a positive sample that is correctly classified, FN a positive sample that is misclassified, FP a negative sample that is misclassified, and TN a negative sample that is correctly classified.
According to the particularity of the binary classification problem and the definition of the confusion matrix, the F1-measure is the harmonic mean of Recall and Precision:

Recall = TP / (TP + FN), Precision = TP / (TP + FP), F1 = 2 · Precision · Recall / (Precision + Recall)

Recall effectively evaluates the recognition rate of positive samples, and Precision effectively evaluates the confidence of the identified positive samples. The F1-measure combines Recall and Precision for a comprehensive evaluation, so it can comprehensively and effectively evaluate binary classification performance in the imbalanced classification problem.
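A small sketch of the metric (the function name is illustrative), with a guard for the degenerate TP = 0 case:

```python
def f1_measure(tp, fn, fp):
    """F1-measure from the confusion matrix of Table 2."""
    if tp == 0:                      # no true positives: Recall and Precision are 0
        return 0.0
    recall = tp / (tp + fn)          # recognition rate of positive samples
    precision = tp / (tp + fp)       # confidence of the predicted positives
    return 2 * precision * recall / (precision + recall)

# A classifier that predicts 'negative' for everything scores 0 here,
# even though its plain accuracy on a 9:1 data set would be 90%.
print(round(f1_measure(tp=8, fn=2, fp=4), 3))  # 0.727
```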

C. EXPERIMENTAL SETUP
In order to verify the effectiveness of the proposed method, we used a single-variable method in the experiments, and the experimental results were obtained with the same parameter values. The compared algorithms were run on six different imbalanced data sets. The proposed method is compared with four other sampling settings: unprocessed raw data, random oversampling, SMOTE oversampling and MAHAKIL oversampling. To reflect data balance, for all sampling methods except the unprocessed raw data, positive samples are generated until the positive and negative classes are equal in size, i.e., IR equals 1. Furthermore, we evaluate these sampling methods with four common classifiers: Decision Tree, Support Vector Machine, Random Forest and Neural Networks. In order to obtain comprehensive classification results, the experimental data was randomly divided into five parts by the five-fold cross-validation method [27], and to eliminate the influence of random factors on the results, we repeated each experiment ten times and took the average value as the final result.
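The evaluation protocol (five folds, ten repetitions, averaged) can be sketched with the standard library; the evaluate callback is a placeholder for training and scoring a classifier:

```python
import random

def five_fold_indices(n, seed):
    """Shuffle the n sample indices and split them into five near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

def repeated_cv_score(evaluate, n, repeats=10):
    """Average evaluate(train_idx, test_idx) over five-fold cross-validation
    repeated `repeats` times, damping the influence of random factors."""
    scores = []
    for r in range(repeats):
        folds = five_fold_indices(n, seed=r)
        for i, test in enumerate(folds):
            train = [j for k, fold in enumerate(folds) if k != i for j in fold]
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# Dummy evaluate: the held-out fraction per fold, which is always 1/5 here.
print(round(repeated_cv_score(lambda tr, te: len(te) / 100, n=100), 2))  # 0.2
```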

D. EXPERIMENTAL RESULTS ANALYSIS
The results of the semi-supervised oversampling method based on the trigonal barycenter theory are given in Figs. 6-9. Fig. 6 shows the classification results of the Decision Tree classifier on the six data sets; SOTB performs best on glass0, haberman, new-thyroid2 and pima, but not on vehicle1 and yeast. Fig. 7 shows the classification results of the different sampling methods with the Support Vector Machine classifier; the proposed SOTB sampling method performs best on glass0, haberman, new-thyroid2 and vehicle1, but not on pima and yeast. Fig. 8 shows the classification results with the Random Forest classifier; SOTB performs best on haberman, new-thyroid2, vehicle1 and yeast, but not on glass0 and pima. Fig. 9 shows the classification results with the Neural Networks classifier; SOTB performs best on all data sets.
From the experimental results, we can see that the proposed SOTB sampling method performs best on most of the data sets with the four commonly used classifiers. However, for some classifiers, oversampling does not improve the classification result. For example, for the classification of the glass0 data by the Random Forest classifier, the unprocessed raw data obtains the best classification performance. The main reason is that different data have different distributions, and the distribution of a particular data set may already suit a particular classifier, so the classification result after oversampling is not as good as that of the unsampled raw data. Therefore, we can conclude that it is unreasonable to evaluate sampling performance with a single classifier on a specific data set.
In order to further demonstrate the effectiveness of the sampling methods, we conducted a statistical analysis of the experimental results using the F1-measure: the better the classification results, the higher the ranking value assigned to the sampling method. The average ranking scores obtained in this way are given in Table 3.
According to Table 3, the unprocessed raw data yields the worst classification results with all four classifiers. After the imbalanced data is balanced by random oversampling, the classification effect is slightly improved, indicating that balancing the data set can improve classification performance, which agrees with reality. As discussed in Section II, SMOTE randomly generates a new sample between two samples; the newly generated sample has a certain diversity, but it may cross the sample boundary. The MAHAKIL method partitions the samples into two classes, pairs the i-th sample in the first class with the i-th sample in the second class, and generates a new sample at the midpoint of each pair; these samples have a certain diversity and do not cross the sample boundary. Because the samples generated by SMOTE and MAHAKIL have a certain diversity, classification performance is greatly improved by applying these two sampling methods. In addition, since MAHAKIL generates samples that do not span across the boundary, its average ranking value is higher than that of SMOTE on most classifiers; still, the samples generated by MAHAKIL and SMOTE are not very rich in diversity. SOTB uses the trigonal barycenter theory to generate samples with richer diversity than SMOTE and MAHAKIL, and the new samples do not cross the sample boundary, which guarantees their validity. The experimental results also show that the average ranking score of SOTB beats that of the other four methods, which demonstrates its advantages.

VI. CONCLUSION
The problem of binary classification in machine learning is very common. Traditional oversampling methods have largely failed to take into full consideration the validity of newly generated samples, the rational distribution of new samples, and the diversity of new samples. In this study, we propose a semi-supervised oversampling method based on the trigonal barycenter theory, which copes with the following problems: the lack of validity of new samples, the unstable sample distribution after sampling, and the lack of rich diversity in new samples; it also alleviates the over-fitting problem in machine learning. Compared with four different sampling methods, SOTB achieves the best classification results, which not only shows the superiority of the proposed method among oversampling methods, but also shows that improving the effectiveness and diversity of oversampling is an effective way to solve the imbalanced classification problem.
In our future work, we will focus on designing a more effective oversampling method. Implementing such a method requires an effective model to extract the basic features of positive samples, which can then be used to generate new positive samples. Furthermore, we will address the following problems: how to evaluate the newly generated samples and how to apply the proposed method in multi-class learning research.