Constrained Oversampling: An Oversampling Approach to Reduce Noise Generation in Imbalanced Datasets With Class Overlapping

Imbalanced datasets are pervasive in classification tasks and degrade the performance of classifiers in predicting minority samples. Oversampling is effective in resolving the class imbalance problem. However, existing oversampling methods generally introduce noise examples into the original datasets, especially when the datasets contain class overlapping regions. In this study, a new oversampling method named Constrained Oversampling is proposed to reduce noise generation in oversampling. The algorithm first extracts the overlapping regions in the dataset. Then, Ant Colony Optimization is applied to define the boundaries of the minority regions. Third, oversampling under constraints is employed to synthesize new samples and obtain a balanced dataset. Our proposal distinguishes itself from other techniques by incorporating constraints in the oversampling process to inhibit noise generation. Experiments show that it outperforms various benchmark oversampling approaches. An explanation for the effectiveness of our method is given by studying the impact of class overlapping on imbalanced learning.


I. INTRODUCTION
The class imbalance problem occurs when some of the classes in a dataset have significantly more samples than the others. This discrepancy in sample numbers between classes gives such datasets an imbalanced data structure. Researchers have reported that this imbalanced structure hampers learning, resulting in undesirable performance of data mining algorithms [1]-[4].
Imbalanced datasets exist in many real-world classification applications, such as detecting oil spills in satellite radar images [5], conducting medical diagnosis [6], detecting credit card fraud [7], analyzing neuroimaging data [8], predicting binding sites [9], and so on. By convention, classes with a larger quantity of samples are called the negative or majority classes, while the others are referred to as the positive or minority classes [10]. In these domains, minority samples (e.g. oil spills) are often the ones of interest, and misclassifying minority samples is costly [3], [4]. However, traditional classification algorithms, constructed under the assumption or expectation of a balanced data distribution, are usually inadequate for tackling imbalanced classification problems. Classifiers often perform disappointingly in recognizing and predicting minority instances because the imbalanced structure of the training set biases their decision boundaries toward the majority class. The class imbalance issue has received considerable attention and is identified as one of the main challenges in data mining [11]. (The associate editor coordinating the review of this manuscript and approving it for publication was Pasquale De Meo.)
Methods designed to remedy this pervasive problem roughly fall into two categories: data-based methods [12]-[15] and algorithm-based methods [16]-[19]. Algorithm-based methods attempt to modify learning algorithms to adapt the classifier to imbalanced datasets [20]-[22]. They rely on the selected classifier, so whenever a different classification algorithm is required, extra computation cost is inevitable [23]. Moreover, the research presented by Maloof suggests that re-sampling and adjusting classification algorithms by varying the cost matrix produce similar sets of classifiers [24]. Thus, we narrow the scope of our study to data-level solutions to the imbalanced classification problem without loss of generality.
Data-based methods usually refer to re-sampling approaches that preprocess datasets to achieve a balanced data structure. Before data-level approaches are applied, we should notice that datasets differ from each other in the number of classes: binary-class datasets contain only one majority class and one minority class, while multi-class datasets are composed of several majority and minority classes. Under multi-class conditions, we can either decompose the problem into multiple two-class classification problems [25] or extend schemes developed for binary-class scenarios to multi-class problems via pair-wise coupling techniques [26]. Therefore, we focus our efforts on imbalanced classification with binary classes, as most studies in the literature do.
Data-level strategies for binary-class imbalanced classification problems can be further divided into undersampling and oversampling. Undersampling removes samples belonging to the majority class according to designed criteria to reduce the degree of imbalance in the dataset. The least intricate undersampling proposal is Random Under Sampling (RUS), which randomly discards samples in the majority category to adjust the imbalanced data distribution. One of the more advanced undersampling techniques is the One-Sided Selection proposed by Kubat and Matwin [27]. This method detects and then deletes noisy and redundant samples in the majority category and leaves the minority category untouched. By choosing a representative subset of the negative examples, it balances the dataset; the classifier thus learns a clean boundary and shows better predictive performance. Undersampling based on clustering, put forward by Yen and Lee [28], is an approach that first clusters all the training samples. It then decides the number of majority instances that should remain in each cluster according to the ratio of majority to minority samples in that particular cluster. In the last step, it removes the calculated number of majority samples from each cluster to mitigate the imbalance between classes. ACOSampling, presented by Yu et al. [29], employs an ant colony optimization algorithm to remove less informative instances and search for optimal subsets of a randomly divided part of the original dataset. By doing this repeatedly, statistical results from all local optimal training subsets are collected into a frequency list. High-frequency majority samples are then extracted and combined with the original minority set to form a new training set.
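As a concrete reference point for the simplest of these strategies, RUS can be sketched in a few lines of Python. The function name and list-based data layout are our own; this is a minimal sketch of random undersampling, not of the more elaborate methods cited above.

```python
import random

def random_undersample(majority, minority, seed=0):
    # Keep a random subset of the majority class equal in size to the
    # minority class, leaving the minority class untouched.
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    return kept_majority, list(minority)

# Toy usage: 100 majority samples reduced to match 10 minority samples.
maj = list(range(100))
mino = list(range(100, 110))
kept, mino_out = random_undersample(maj, mino)
```

After the call, `kept` holds as many majority samples as there are minority samples, so the resulting training set is balanced.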
Unlike undersampling, oversampling concentrates on increasing the number of minority samples. Japkowicz and Stephen [30] conducted a series of experiments on how re-sampling methods perform on data sets of different complexity and concluded that undersampling appears to be less effective than oversampling when the two classes are assigned a symmetric cost. Meanwhile, oversampling was shown to help quite dramatically at all complexity and training-set-size levels. Similar results were reported by Batista et al. [31]. We are more concerned with oversampling methods because they are reported to be more effective and are also more closely related to our work.
Random Over Sampling (ROS) is a simple oversampling method that replicates minority-class samples at random to reach a prescribed imbalance level. It has been pointed out in previous work that ROS is prone to cause overfitting in learning [32]. Based on this intuitive method, more sophisticated oversampling techniques have been developed in the literature. Some of the improved methods are as follows.
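For completeness, ROS can be sketched in the same style as RUS above (again with our own function name and data layout, as an illustration rather than a reference implementation):

```python
import random

def random_oversample(minority, target_size, seed=0):
    # Replicate randomly chosen minority samples (with replacement)
    # until the minority class reaches target_size; the repeated exact
    # copies are what makes ROS prone to overfitting.
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra
```

Because every added point is an exact duplicate, a classifier can simply memorize the replicated examples, which is the overfitting problem noted above.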
Synthetic Minority Oversampling Technique (SMOTE), proposed by Chawla et al., forms new minority samples by linearly interpolating between minority samples that lie close to each other in feature space [33]. Despite its efficaciousness, this method has a major shortcoming: it blindly generates new samples for minority class examples without considering the distribution of original data, so sometimes noise samples are added into the dataset. This problem can be more serious when the original data set has one or more overlapping regions or holds noise samples in the original minority class.
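The interpolation rule at the core of SMOTE can be sketched as follows. This is a simplified illustration with a brute-force neighbor search, not the reference implementation; it also shows why noise arises, since nothing stops the segment from crossing into a majority region.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    # SMOTE sketch: each synthetic point lies on the line segment between
    # a randomly chosen minority sample and one of its k nearest
    # minority neighbours.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append([b + gap * (q - b) for b, q in zip(base, partner)])
    return synthetic

# Toy usage: four minority points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new = smote(minority, n_new=6, k=2)
```

Every synthetic point is a convex combination of two minority samples, so if majority samples lie between those two points, the new sample lands in majority territory, which is exactly the noise-generation problem discussed above.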
Han et al. developed an oversampling technique called Borderline-SMOTE based on the observation that misclassified samples are usually located on the borderline between the minority and majority classes [34]. In this proposal, only minority samples that lie near the borderline are operated on, and thus learning on the borderline is reinforced. There are two versions of their proposal: Borderline-SMOTE1, which only generates samples among borderline instances belonging to the minority class, and Borderline-SMOTE2, which also synthesizes examples between minority samples on the boundary and their nearest negative neighbors.
Another improved oversampling method is Adaptive Synthetic Sampling (ADASYN), presented by He et al. [35]. Unlike SMOTE, this method considers the original data distribution: it uses the number of majority-class samples among the k nearest neighbors of a minority sample as a criterion to automatically decide the number of samples synthesized around that specific minority sample. More samples are generated around instances situated close to the majority-class region than around those far away from the borderline between classes.
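ADASYN's allocation rule, more synthetic samples for minority points with more majority neighbors, can be sketched as follows (function name and toy data are ours; the generation step itself then follows SMOTE-style interpolation):

```python
import math

def adasyn_allocation(minority, majority, G, k=5):
    # ADASYN's allocation rule: minority samples whose k-neighbourhoods
    # contain more majority points receive more of the G synthetic samples.
    def majority_ratio(x):
        pool = ([(math.dist(x, p), 1) for p in majority]
                + [(math.dist(x, p), 0) for p in minority if p is not x])
        pool.sort(key=lambda t: t[0])
        return sum(is_maj for _, is_maj in pool[:k]) / k

    r = [majority_ratio(x) for x in minority]
    total = sum(r) or 1.0
    return [round(ri / total * G) for ri in r]

# Toy usage: three safe minority points near the origin and one minority
# point sitting inside a majority cluster around (5, 5).
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
majority = [[5.1, 5.0], [5.0, 5.1], [4.9, 5.0]]
alloc = adasyn_allocation(minority, majority, G=10, k=2)
```

In this toy case the entire budget of 10 synthetic samples is assigned to the minority point surrounded by majority samples, while the safe minority cluster receives none.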
Cluster Based Synthetic Oversampling (CBOS), devised by Barua et al. [36], attempts to avoid noise generation by first conducting unsupervised clustering among minority samples and then carrying out SMOTE process inside the clusters. This data generation mechanism avoids the creation of synthetic minority samples in the majority region under some circumstances. However, when overlapping areas between classes occur in the dataset, majority samples would also be included inside the clusters. Thus, oversampling in these clusters generates minority samples that fall into regions belonging to the majority class and leads to misclassification of majority samples.
Safe-level-SMOTE assigns each positive instance a safe-level ratio, which is the ratio of positive samples among the k nearest neighbors of that instance [37]. Synthetic samples are generated along the same line segment as in SMOTE, but they are placed closer to the minority instances with higher safe-level ratios. This approach is effective in eliminating noise generation; however, it does not help recognize samples in an overlapping region, since it generates no samples there. As a result, the usefulness of this proposal is weakened by the fact that overlapping between classes is pervasive in real-world datasets.
All in all, oversampling is widely investigated and used as a remediation strategy to cope with imbalanced datasets. So far, researchers have developed a range of advanced oversampling methods, such as SMOTE, Borderline-SMOTE, and so on. They have been applied to many real-world imbalanced datasets and are shown to be effective in improving the performance of classifiers. Unfortunately, although it has been pointed out that class overlapping is a main cause of misclassification in imbalanced datasets [38], previous researchers have not, as far as we know, devised competent oversampling methods to deal with class overlapping regions.
In class overlapping regions, majority samples mingle with minority samples so the boundaries between categories become ambiguous. Since classification algorithms are designed to learn the borderline of each class, they are less powerful in overlapping areas where borderlines seem to be confusing. Consequently, the examples situated in these regions are prone to be misclassified. Poor learning in overlapping regions calls for the demand of oversampling in these regions to improve the performance of classifiers. However, most existing oversampling approaches rest on the assumption that samples close to each other in feature space belong to the same class, which is not the case in overlapping regions. So, when existing oversampling methods are applied to these overlapping regions, noise minority samples that fall into the majority region are introduced into the datasets. These noise samples are detrimental to the performance of classification algorithms. As a result, traditional oversampling approaches show a deficiency in handling datasets with overlapping between classes.
In this study, we introduce an oversampling method that distinguishes itself from other oversampling techniques by incorporating constraints in the oversampling process to inhibit noise generation in overlapping regions. This method, namely Constrained Oversampling (CO), consists of three steps. First, samples placed in the overlapping area are identified with a simple k-nearest-neighbors (KNN) based algorithm. Second, in these overlapping regions, we apply the ant colony optimization (ACO) algorithm to search for feasible paths from randomly chosen majority samples to a specific minority sample, and the majority points on the paths that are closest to the destination are picked out as boundary samples. Finally, oversampling under distance constraints is performed between the boundary samples and the chosen minority point to attain a more balanced dataset.
The rest of this paper is organized as follows. The proposed constrained oversampling method is described in Section II. In Section III, experiments are conducted on various real-world datasets from the UCI repository to test this method, and the results are presented and discussed. At last, Section IV concludes our work.

II. A NEW OVERSAMPLING METHOD: CONSTRAINED OVERSAMPLING
In order to balance the dataset without producing noise samples, we execute oversampling in overlapping regions, and in the oversampling process, constraints are used to prevent noise generation. As depicted in Fig. 1, our strategy is divided into three stages.

A. THE EXTRACTION OF OVERLAPPING AREA
In the first step of our method, we build a KNN-based method to extract the overlapping regions included in the original dataset. As a geometric classifier, the KNN classifier determines the class membership of a data point from its distances to reference data points [39], and it is widely used in clustering [40]-[42]. KNN is thus a density-based method that classifies an input point by the proportion of each class among its nearest neighbors. In overlapping regions, minority and majority samples mix with each other, so by the nature of KNN, minority samples located in overlapping regions are more likely to be misclassified. This characteristic of KNN can be exploited to identify samples located in overlapping regions: when minority samples in overlapping regions are classified into the majority class by the KNN classifier, we know they are misclassifications because the data labels are known in our study.
First, we apply a 5-NN classifier to the original data set, as recommended by Mani and Zhang [43], and record the misclassified minority samples. For every misclassified minority sample, we choose its m nearest neighbors to form an overlapping set. If there are n misclassified minority samples, we obtain m * n data points, some of which are duplicates. The overlapping set O is obtained by removing these duplicates. It should be noted that the total number of overlapping regions does not need to be defined in advance: it is determined by the distribution of the misclassified minority samples and emerges as the overlapping set is extracted.
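The extraction step above can be sketched as follows. This is a brute-force illustration with our own variable names, using label 1 for the minority class and 0 for the majority class; a misclassified minority sample is one whose k-NN vote is won by the majority class.

```python
import math

def extract_overlap(samples, labels, k=5, m=3):
    # Flag minority samples that a k-NN vote would misclassify, then
    # collect the m nearest neighbours of each flagged sample into the
    # (deduplicated) overlapping set O, returned as sorted indices.
    def nearest(i, n):
        order = sorted((j for j in range(len(samples)) if j != i),
                       key=lambda j: math.dist(samples[i], samples[j]))
        return order[:n]

    overlap = set()
    for i, y in enumerate(labels):
        if y != 1:                      # only minority samples are tested
            continue
        votes = [labels[j] for j in nearest(i, k)]
        if votes.count(1) <= k // 2:    # the majority class wins the vote
            overlap.update(nearest(i, m))
    return sorted(overlap)
```

Using a set for `overlap` performs the duplicate removal that defines O; the number of connected overlapping regions simply falls out of where the flagged samples happen to lie.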
To better illustrate our method, we apply it to a simulated dataset with two features. Fig. 2 (a) displays the binary-class imbalanced dataset with two overlapping regions, which consists of 800 majority samples (represented by green circles) and 80 minority instances (represented by red pluses). After the above-mentioned method is applied to the dataset, overlapping regions containing 191 samples are extracted and shown in Fig. 2 (b). According to the distribution of the misclassified minority samples, two overlapping regions are generated (shown in Fig. 2 (b)). The following steps of our oversampling method will be carried out on these overlapping regions.

B. BOUNDARY DEFINITION BASED ON ANT COLONY OPTIMIZATION ALGORITHM
Traditional boundary definition methods such as Tomek-links [27] and KNN are density-based approaches that judge the boundary samples by the percentage of samples belonging to a different class among their nearest neighbors. In overlapping regions, where minority and majority samples mix with each other, traditional boundary definition methods are prone to take all majority samples in the overlapping regions as boundary samples and are thus imprecise [19].
We developed an ACO-based method to define the boundaries between classes in extracted overlapping regions. With this method, the boundaries between majority and minority regions are expressed in the form of a set of majority samples. Our proposal embodies two major merits: on one hand, it is more robust to noise in the original minority set since it does not solely depend on information conveyed by minority samples; on the other hand, it is advantageous in expanding decision region of minority category to identify boundary majority samples and then introduce them into the oversampling process.
In defining the boundaries, we require an algorithm that can automatically seek feasible itineraries from one class to the other. Ant colony optimization, developed by Colorni et al. [44], has been reported to perform well in solving various discrete combinatorial optimization problems, such as the traveling salesman problem [45], protein folding [46], fingerprint matching [47], fuzzy identification [48], map matching [49], and route improvement [50]. This population-based stochastic search method was inspired by the communication of ants during foraging. Its ability to find an optimal path from nest to food in a discrete space through iterative search is what we require for defining the boundaries surrounding minority regions.
In this work, the ant system version proposed by Dorigo et al. [51] is adopted, and in particular, the sight of the ants is limited to 3 (i.e., an ant can only search among its 3 nearest points at each step to figure out a way toward the food source). In each iteration of our ACO-based method, t different majority samples are randomly chosen from the pre-defined overlapping region, as well as one minority example. Then n_a ants are placed on each majority instance and ordered to search for the shortest path to that particular minority sample. We thus obtain t paths along which the minority sample can be reached from the majority area, and the last station in the majority region on each path is picked out and added to the boundary set. After each minority sample has been taken as the destination once, the boundaries between the majority and minority categories are constructed and described by a collection of majority samples. Fig. 3 demonstrates this process step by step, and pseudo-code of the establishment of boundaries based on ACO is listed as Algorithm 2.
Algorithm 2
Boundary Definition Based on ACO
1: Input: Minority and majority samples in the overlapping region, number of nests t, number of ants n_a, maximum number of iterations NC_max
2: Process:
3: for i = 1 : number of minority samples in the overlapping region
4: Randomly choose t different majority samples as 'nests'
5: for j = 1 : t
6: Start from majority instance nest_j; n_a ants are simultaneously ordered to search for the shortest path to minority sample Min_i using ACO
7: After NC_max iterations, the optimal path is saved as path_i,j
8: The last majority sample in this path is stored as boundary_i,j
9: endfor
10: endfor
11: Output: A set of majority samples boundary_i,j defining the borderline of the minority region

It should be pointed out that, in our method, the rigid convergence of ACO is not a must, since we aim at finding borderline points through which we can get into minority areas from the majority region. Once a feasible path is identified, we can find a majority sample located on the boundary between the two classes, so it does not matter whether this route is the optimal one or only one of the sub-optimal solutions.
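A full ant system is beyond a short sketch, so the following stand-in deliberately replaces the ACO search with a greedy walk that mimics the ants' limited sight of 3 points. Only the role of the routine matches the method above: it returns the last majority point on a path from a majority 'nest' toward a minority destination, using index-based points and label 0 for the majority class (all names are ours).

```python
import math

def boundary_sample(points, labels, nest, dest, sight=3):
    # Greedy stand-in for the ACO search: walk from majority index
    # `nest` toward minority index `dest`, stepping to whichever of the
    # `sight` nearest unvisited points lies closest to the destination;
    # return the last majority point reached before the destination.
    current, visited = nest, {nest}
    last_majority = nest
    while current != dest:
        unvisited = [j for j in range(len(points)) if j not in visited]
        if not unvisited:
            break
        candidates = sorted(
            unvisited,
            key=lambda j: math.dist(points[current], points[j]))[:sight]
        current = min(candidates, key=lambda j: math.dist(points[j], points[dest]))
        visited.add(current)
        if labels[current] == 0:      # 0 = majority
            last_majority = current
    return last_majority
```

On a one-dimensional chain of points where indices 0 to 3 are majority and 4 to 5 are minority, a walk from nest 0 to destination 4 ends with point 3 as the boundary sample, matching the intuition of "the last station in the majority region."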

C. THE CREATION OF SYNTHETIC SAMPLES BY OVERSAMPLING UNDER CONSTRAINTS
In step 3, we construct the final training set by oversampling between the boundary samples and the minority instances in overlapping regions.
To reduce the generation of synthetic minority samples in majority regions, we impose distance constraints to the oversampling process. The aim of applying constraints to oversampling is to constrain new samples to the neighborhood of existing minority samples but not to that of majority samples.
Before oversampling, we must deliberate over three questions: Which samples are worthy of oversampling? How many sample points shall be created for each chosen minority sample? Where should the generated minority samples be placed? Question 1 is answered in Subsections A and B of Section II. For the second question, ADASYN [35] is integrated into our method to decide the number of new synthetic samples generated for each minority sample in overlapping regions. Question 3 is answered by the constraints described as follows.
When a sample needs to be generated between a minority sample Min in the overlapping regions and a majority sample Maj belonging to the boundary set, we denote the Euclidean distance between these two instances as Dist(Min, Maj), the distance between Min and its 5th nearest neighbor as Dist(Min, 5thNN of Min), and that between Maj and its 5th nearest neighbor as Dist(Maj, 5thNN of Maj). The distance between the new synthetic sample and Min, denoted Dist, is decided according to the rules depicted in Fig. 4.
After Dist is calculated, the new sample is synthesized according to

Min_new = Min + Dist * Dir

where Min_new refers to the synthetic instance and Dir is a direction vector of unit length pointing from Min toward Maj. The locations of the newly synthesized sample on different occasions are illustrated in Fig. 4.
In Subsection A, we applied a 5-NN classifier to the original data set, as recommended by Mani and Zhang [43], to extract the overlapping area. For this data set, 5-NN has the best performance in classifying the original data, which means that 5-NN reflects the distribution characteristics of the original data well. In creating synthetic samples for the minority class, we increase the density of the minority samples in the overlapping area, but the distribution characteristics of the original data should be inherited. Therefore, when we define Dist, the 5th NN of Maj or Min is chosen. In this way, the newly generated data Min_new will be classified with a low error rate.
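The synthesis formula can be sketched as follows. The exact Dist rules come from Fig. 4 and are not reproduced here; the cap used below (half the Min-Maj distance, and no farther than Min's 5th-nearest-neighbour distance) is purely an illustrative assumption that keeps the new point on the minority side of the boundary.

```python
import math

def synthesize(min_pt, maj_pt, dist_min_5nn):
    # Place a synthetic sample on the segment from Min toward boundary
    # sample Maj. The cap on Dist below is an assumed stand-in for the
    # rules of Fig. 4, not the paper's actual rule set.
    d_mm = math.dist(min_pt, maj_pt)
    dist = min(0.5 * d_mm, dist_min_5nn)
    direction = [(b - a) / d_mm for a, b in zip(min_pt, maj_pt)]  # unit Dir
    return [a + dist * u for a, u in zip(min_pt, direction)]
```

With Min at the origin, Maj at (2, 0), and a 5th-NN distance of 0.4, the new sample lands at (0.4, 0), i.e. well inside the minority side of the segment.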
The whole procedure of this stage is described in detail as Algorithm 3.

Algorithm 3
The Constrained Oversampling Method
1: Input: Original data set S, borderline sample set boundary_i,j, minority set in overlapping region S_min, majority set in overlapping region S_maj, amount of SMOTE N%, number of nearest neighbors k
2: Process:
3: Calculate the number of minority samples to be generated: G = N% * (number of minority samples in original data set S)
4: for i = 1 : number of samples in S_min
5: Find the k nearest neighbors based on the Euclidean distance and get the ratio: r_i = (number of majority samples in k nearest neighbors) / k
6: endfor
7: for i = 1 : number of samples in S_min
8: Calculate the number of synthetic samples g_i that need to be generated for each minority sample according to: g_i = r_i * G / (sum of all r_i)
9: endfor
10: for i = 1 : number of samples in S_min
11: for u = 1 : g_i
12: Except for samples already recorded in the selected sample set S_selected, randomly choose a minority sample Min_i in the overlapping region, retrieve its t corresponding majority samples on the borderline in boundary_i,j, and randomly pick one out, denoted Maj_i
13: Record the minority sample selected in this iteration in the selected sample set S_selected so that it will not be re-selected in later iterations
14: Generate a new sample Min_new between Min_i and Maj_i according to the rules depicted in Fig. 4
15: endfor
16: endfor

In the way described above, to enhance learning on minority areas, newly generated minority samples are placed in overlapping regions, where learning and predicting minority samples is difficult. Meanwhile, learning on majority regions is not hindered, since these synthesized minority samples do not become noise points situated across the boundaries between classes. Our constraint-based oversampling method is powerful in handling class overlapping regions, where traditional clustering-based methods do not perform well.

III. EXPERIMENT
In this section, we conducted experiments on five UCI datasets [52] with different numbers of features and imbalance levels. To investigate the performance of the proposed method, six benchmark oversampling strategies described above, namely the original data without oversampling (Origin), SMOTE, Borderline-SMOTE1 (BOS1), Borderline-SMOTE2 (BOS2), ADASYN, and CBOS, were applied to the datasets alongside Constrained Oversampling (CO), and the results were compared. A well-known binary decision tree learner called Classification And Regression Tree (CART) [53] was chosen as the test classifier in these experiments.

A. DATASETS
Among the five benchmark datasets from the UCI Repository used in our tests, Pima and Haberman are composed of binary-class samples and the others are multi-class. For the multi-class problems, we selected one of the classes as the minority class and merged the remaining classes into the majority class. Table 1 shows the characteristics of the five data sets, namely the label of the minority class, the number of attributes, the number of minority instances, and the number of majority samples, sorted by the number of minority instances in ascending order. In particular, the imbalance level (IL) in the table refers to the ratio of the number of majority examples to that of minority examples.

B. EVALUATION METRICS AND PARAMETERS
The performance of a classifier can simply be shown by the "raw data" produced during testing, i.e., the counts of correct and incorrect classifications. Generally, this information is displayed in a confusion matrix, as follows.
In Table 2, Tp and Tn are the numbers of true positives and true negatives, respectively. Fp and Fn are the numbers of false positives and false negatives, respectively.
A widely adopted measurement of classification performance is the overall accuracy, calculated as

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)

However, on imbalanced data sets, accuracy behaves poorly as a metric for evaluating the performance of classifiers on minority examples. When a data set is highly imbalanced, we attain a high accuracy even if all of the minority examples are misclassified. To overcome this problem, other statistics such as G-mean and F-measure are also introduced into the assessment of classifiers. They are calculated as follows:

F-measure = (1 + β²) * Precision * Recall / (β² * Precision + Recall)

G-mean = sqrt(Acc+ * Acc−)

where β corresponds to the relative importance of precision versus recall and is usually set to 1. Both F-measure and G-mean take values between 0 and 1, and larger values indicate better performance. Precision, Recall, Acc+, and Acc− are further defined as:

Precision = Tp / (Tp + Fp)
Recall = Tp / (Tp + Fn)
Acc+ = Tp / (Tp + Fn)
Acc− = Tn / (Tn + Fp)

They are more reasonable indications of overall performance than accuracy due to some of their unique properties; in particular, the magnitude of either G-mean or F-measure relies heavily on how the classifier performs on the minority class.
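The measures above can be computed directly from the confusion-matrix counts; the following helper (our own naming) makes the relationships explicit:

```python
import math

def metrics(tp, tn, fp, fn, beta=1.0):
    # Evaluation measures computed from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # equals Acc+ (sensitivity)
    acc_neg = tn / (tn + fp)                # Acc- (specificity)
    f_measure = ((1 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))
    g_mean = math.sqrt(recall * acc_neg)
    return {"accuracy": accuracy, "f_measure": f_measure,
            "g_mean": g_mean, "acc_pos": recall, "acc_neg": acc_neg}
```

For instance, with Tp = 8, Fn = 2, Tn = 80, Fp = 10, the overall accuracy is 0.88 while the F-measure is only about 0.57, which illustrates why accuracy alone overstates performance on the minority class.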
In this paper, Overall Accuracy, G-mean, F-measure, and Acc+ are all calculated to provide a comprehensive understanding of the performance of the classifier on the given datasets. The initial parameters of the proposed method are listed in Table 3.

C. RESULTS
For each dataset presented in Subsection A of Section III, the minority class was over-sampled at 100%, 200%, 300%, 400% and 500% of its original size. We conducted 3 independent 10-fold cross-validation experiments at each percentage of oversampling and the final results were the average value of 3 tests [54]. In each test, all 7 different oversampling strategies -Origin, SMOTE, BOS1, BOS2, ADASYN, CBOS, and CO were applied to the same datasets. Performances of different methods on datasets oversampled at 200% are shown in Table 4, where four different metrics, Overall Accuracy, G-mean, Acc + , and F-measure are listed and compared. Other results are displayed in Fig. 5 to 9.
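The 10-fold splitting underlying this protocol can be sketched as follows (our own helper; repeating it with three different seeds and averaging the metrics reproduces the "3 independent 10-fold cross-validation" scheme):

```python
import random

def ten_fold_indices(n, seed=0):
    # Shuffle indices 0..n-1 and deal them into 10 folds; each fold
    # serves once as the test split of a 10-fold cross-validation run.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::10] for f in range(10)]
```

Each run trains on nine folds (after oversampling the training portion only) and tests on the held-out fold, and the reported numbers are the averages over the three seeded repetitions.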

D. DISCUSSION
We present the discussion of the results displayed above in two aspects: the impact of the oversampling rate on the behavior of oversampling methods, and the influence of class overlapping on the classifier and oversampling methods.

1) THE IMPACT OF OVERSAMPLING RATE ON THE BEHAVIOR OF OVERSAMPLING METHODS
Generally speaking, a larger oversampling rate leads to a dataset that is more favorable for learning the minority samples. This point is justified by Fig. 5 (c) to Fig. 9 (c). However, the progress in predicting the minority class usually comes at the cost of more misclassified majority samples. As depicted in Fig. 5 (a) and Fig. 6 (a), more synthesized samples sometimes mean more unreliable or noisy minority instances that fall into the majority region and lead to the misclassification of majority samples. So a larger oversampling rate in traditional oversampling methods may give rise to lower Overall Accuracy, G-mean, and F-measure.
As for our method, in which we try to eliminate the generation of noise in the oversampling process, the figures show ascending performance as the oversampling rate becomes larger. However, this approach expands the decision region of the minority class in a mild manner, and this characteristic leads to relatively low performance when the oversampling rate is not large enough. When more samples are generated, our method is shown to be effective.

2) THE INFLUENCE OF CLASS OVERLAPPING ON THE CLASSIFIER AND OVERSAMPLING METHODS
As mentioned in Section I, we rest our method on the view that overlapping between classes is to blame for the loss of performance of learning systems on imbalanced datasets. To examine this stand, we exhibit in Table 5 the characteristics of an overlapping region extracted with the method proposed in Subsection A of Section II.
First, we explored the influence of class overlapping on the performance of our classifier. We note that the Acc+ of CART is generally undesirable on datasets in which most minority samples are located inside the overlapping region. This phenomenon can be readily observed in Fig. 10, which shows the relationship between the overlapping ratio of minority samples and Acc+ on different datasets. It can be seen that on Haberman, which has as many as 64.20% of its minority samples placed inside the overlapping area, our decision tree suffers a great loss in precision: only an Acc+ of 25.97% is achieved. On the contrary, on Pima, a dataset with a similar imbalance level to Haberman but a much lower ratio of minority samples in the overlapping region (only 38.81%), our classifier turns out to be far more effective in recognizing minority samples, achieving an Acc+ of 56.01%. Similar patterns appear in the changes of G-mean and F-measure on the various datasets.
Based on the discussion above, we conclude that the performance of classifiers, at least CART, on an imbalanced dataset is heavily influenced by the overlapping ratio of minority samples. A data distribution in which most minority samples are positioned in the overlapping region can well lead to performance degradation. This result verifies and extends the conclusions reached by Yu et al. [29] and Jo and Japkowicz [55]. Further, we investigated the influence of class overlapping on different oversampling methods. The relationship between the overlapping ratio of minority samples and F-measure on different datasets at an oversampling rate of 200% is shown for the different oversampling strategies in Fig. 11. On Haberman, Abalone, and Yeast, the large overlapping ratio of minority samples may give rise to many synthetic noise samples. SMOTE, ADASYN, and CBOS, which do not take noise generation around the borderline into consideration, bear a greater loss in G-mean, Acc+, and F-measure compared to the other methods. Borderline-SMOTE shows better performance because it is devised to strengthen learning on the boundaries, but it is still not fully satisfactory, since its assumption is somewhat oversimplified for handling overlapping regions. Among these techniques, our CO method proves to be the best solution on datasets with serious class overlapping.
Based on this observation, we believe that noise generation in oversampling imposes a negative effect on learning algorithms and the noise-prevention mechanism would improve the effect of oversampling methods.

IV. CONCLUSION
In this paper, we have proposed a new oversampling technique, Constrained Oversampling, to address the class imbalance problem, especially in datasets with overlapping between classes.
This technique is executed in three successive steps: extract overlapping regions based on the KNN algorithm; define boundary samples for each minority instance using ACO; and synthesize the required number of minority samples under constraints. Two major results are expected after this technique is applied: learning on the border is strengthened, so that minority samples in imbalanced datasets are more easily recognized; and few noise samples are introduced into the original dataset when oversampling the minority class in overlapping regions, so that performance is not harmed by noise generation.
According to our experimental results, CO generally produces satisfying accuracy, G-mean, Acc+, and F-measure on various datasets that differ from each other in imbalance level, number of features, and size. Further analysis of the results also reveals the impact of class overlapping on the performance of the classifier and oversampling strategies.
It should be noted that this study has examined only the influence of class overlapping in the imbalanced classification problem. To get a comprehensive understanding of the impact of other factors in imbalanced datasets, much work remains to be done. Meanwhile, there remains substantial room for future work on reducing the considerable computational and storage cost of the proposed method.