A Hybrid Sampling Approach for Imbalanced Binary and Multi-Class Data Using Clustering Analysis

Unequal data distribution among classes usually causes a class imbalance problem. Due to class imbalance, classification models become biased toward the majority class and misclassify the minority class. The issue becomes even more complex in multi-class data. The most common remedy is data resampling, which either over-samples minority class instances or under-samples majority class instances. Under-sampling risks losing crucial information, whereas over-sampling can cause overfitting. We therefore propose a novel Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) approach to address these issues. CBHSID calculates the mean number of observations per class and uses it as a threshold to separate majority and minority classes. It applies affinity propagation cluster analysis to each class to create sub-clusters and calculates the distance of each data item of a sub-cluster from the centroid mean. During under-sampling, CBHSID removes observations that are far from the sub-cluster center; during over-sampling, it generates synthetic samples from observations near the sub-cluster center. We compared CBHSID with several state-of-the-art data balancing methods on 12 binary and 4 multi-class benchmark datasets. Based on Geometric Mean (G-Mean), Recall, and F1 score, our method outperformed the compared methods on 14 of the 16 datasets. The results also show that CBHSID is suitable for addressing class imbalance in both binary and multi-class classification. So far, we have validated CBHSID only on stationary data streams; testing it on non-stationary data streams in online learning environments remains future work.


I. INTRODUCTION
Learning from imbalanced data has remained an active area of interest in the research community for many years.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wang.

Health systems [1], [2], [3], [4], fraud detection systems [5], anomaly detection [6], and fault diagnosis [7], [8] are common examples of applications with unequal data distribution, which leads to the class imbalance problem [9]. In any dataset, the class with more data samples is the majority class, and the one with fewer samples is the minority class [10]. A classifier trained on such data becomes biased towards the majority class [11]. For example, in financial systems, abnormal data samples are rare: if only 0.1% of transactions are illegal, a model that labels every transaction as normal still achieves 99.9% accuracy while misclassifying virtually all illegal transactions. Such classifiers cannot be effective in real-world financial systems. Hence, it is hard for traditional classifiers to classify imbalanced data accurately [12]. Over the years, many approaches have been proposed to address the class imbalance issue. Their core objective is to improve the classifier's performance on the minority class. These approaches are generally divided into four categories: data-level approaches, algorithm-level approaches, ensemble approaches, and hybrid approaches [13], [14]. Data-level approaches address the class imbalance problem by data resampling, achieved either by increasing the minority class samples (over-sampling) or by decreasing the majority class samples (under-sampling). Over-sampling can cause overfitting, and eliminating data during under-sampling may lead to the loss of useful information [15], [16].
Algorithm-level approaches focus on producing an intelligent mechanism for data sampling. They apply weights to minority class samples, modifying the learning algorithm so that it gives more importance to the minority class. The main limitation of these approaches is that, for most real-life applications, it is hard to determine suitable weights for the minority class samples [17]. On the other hand, ensemble-level imbalance handling has become another major category of methods, combining multiple classifiers as in SMOTEBoost [18] and AdaBoost.NC [19].
Hybrid approaches combine data-level and algorithm-level approaches, trying to adopt the strengths of both, e.g., [20], [21], and [22]. The existing hybrid approaches use the traditional concepts of data-level and algorithm-level methods: they mostly over-sample the minority class and under-sample only the noisy samples of the majority class [23]. Instead of over-sampling the minority class and removing noise only, some approaches use cluster analysis to identify the most valuable samples of the majority class and then apply under-sampling, such as [17] and [24]. Cluster-based approaches help retain the useful information of the majority class and under-sample only the less important data. In general, existing data balancing approaches, except the cluster-based ones, cannot exploit this more relevant information for resampling, while existing cluster-based approaches either under-sample the majority class or over-sample the minority class. Hence, these approaches may again lead to overfitting.
Therefore, to overcome the limitations of the existing approaches, we propose the Cluster-based Hybrid Sampling approach for Imbalance Data (CBHSID). In CBHSID, we first calculate the mean number of data observations per class (threshold) and use it to separate the minority and majority classes of the underlying dataset. The same threshold also serves as the target size of each class after under-sampling and over-sampling: any class with fewer samples than the threshold is a minority class, whereas a class with more instances than the threshold is considered a majority class. After that, we divide each class into groups, or sub-clusters, of similar data points using affinity propagation [25]. Next, we calculate the number of data observations (k) to add to or remove from each sub-cluster based on its size. We then calculate the distance of each item from its sub-cluster's centroid. During under-sampling, we keep the observations closer to the centroid and discard the rest for every sub-cluster of the majority class. During over-sampling, observations near the centroid are used to generate k synthetic samples per sub-cluster to raise the number of examples in the minority class. Hence, every class (minority and majority) ends up with an almost identical number of instances, nearly equal to the threshold.
The rest of the paper is organized as follows. Section II discusses the existing approaches proposed to address the class imbalance issue. Section III describes the proposed method. Section IV presents the experimental results and analysis; finally, we conclude the paper in Section V.

II. RELATED WORK
Class imbalance is a widespread problem in many applications such as social media mining [26], risk management [27], and anomaly detection [6]. These applications face a situation where the data of one class (the minority) is skewed or underrepresented relative to the other (the majority class), termed class imbalance [28]. For such applications, classification models misclassify the minority class samples. Hence, different approaches have been proposed to improve the classifier's performance for the minority class. These approaches are broadly divided into four categories: data-level strategies, algorithm-level strategies, ensemble strategies, and hybrid strategies [13], [29]. Another data balancing approach builds on any of these four but first converts the data into clusters using cluster analysis and then applies data resampling. We briefly discuss each approach type in the following subsections.

A. DATA-LEVEL APPROACHES
The most common and simplest approach to address the class imbalance is a data-level approach that resamples the actual data. Resampling is achieved by over-sampling, under-sampling, or a combination of both [30], [31]. Random Over-sampling (ROS) [32] and Random Under-sampling (RUS) [33] are the two standard versions of data resampling. Such methods randomly select samples from the minority or majority class to grow or shrink that class. ROS often leads to overfitting [16], [29], while eliminating elements from the majority class in RUS sometimes causes the loss of valuable information [10].

B. ALGORITHM-LEVEL APPROACHES
Innovative algorithmic approaches like SMOTE [34] come into play to overcome the limitations of data-level approaches. SMOTE draws a line between each minority sample and its k nearest neighbors and generates synthetic data on these lines. Algorithm-level approaches focus on modifying the learning algorithm so that it gives more importance to the minority samples, such as MWMOTE [35] and Adaptive Synthetic Sampling (ADASYN) [36]. These approaches apply weights or costs to misclassified data: higher weights are applied to minority class examples, whereas majority class examples are assigned lower weights. The learning algorithm, which aims to minimize the global cost associated with misclassification, is adjusted accordingly so that it becomes biased towards the minority class [37]. The main limitation of this approach is that it is hard to set the actual values of the cost matrix for most real-life applications [17].
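To make the SMOTE idea concrete, here is a minimal sketch of its interpolation step only (not the full algorithm): pick a minority sample, pick one of its k nearest minority neighbors, and generate a point on the line segment between them. The helper name and the toy minority points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(X_min, k=3):
    """Generate one SMOTE-style synthetic point from minority samples X_min."""
    i = rng.integers(len(X_min))
    x = X_min[i]
    # distances from x to all minority samples
    d = np.linalg.norm(X_min - x, axis=1)
    neighbours = np.argsort(d)[1:k + 1]   # k nearest, skipping x itself
    nb = X_min[rng.choice(neighbours)]
    gap = rng.random()                    # random position on the segment
    return x + gap * (nb - x)

X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
print(smote_like_sample(X_min))
```

Because the synthetic point is a convex combination of two existing minority points, it always lies between them rather than being an exact duplicate, which is what distinguishes SMOTE from ROS.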

C. HYBRID APPROACHES
A hybrid method, SMOTETomek [23], applies both over-sampling and under-sampling. It uses SMOTE to generate synthetic data for minority classes and Tomek links [38] to remove ambiguous and noisy data from the decision boundary. SMOTETomek thus relies mainly on over-sampling: it does not really under-sample the majority class but only removes noisy data. Another hybrid data sampling approach introduced in the same paper, known as SMOTEENN, combines SMOTE with the Edited Nearest Neighbor (ENN) rule [39]. SMOTE is used for over-sampling, whereas ENN is applied for under-sampling. ENN finds the k nearest neighbors of an observation and determines the class with the most samples among those neighbors, i.e., the local majority class of that small group; it discards the observation and its neighbors if the observation's class differs from this local majority. SMOTEENN may struggle to reduce the imbalance ratio because of this removal rule: there could be a situation where an observation belongs to the negative (majority) class while most of its nearest neighbors belong to the positive (minority) class. In such cases, removing the observation and its nearest neighbors deletes minority class samples and may further increase the class imbalance.

D. ENSEMBLE APPROACHES
The ensemble approach is another popular method for dealing with the class imbalance problem. In ensemble learning, multiple classifiers are trained on the training data and their outputs are aggregated to get the final result. Bagging [40] and boosting [41] are well-known ensemble methods used against class imbalance. Bagging, or Bootstrap Aggregation, minimizes prediction variance: it creates a separate subset of the training data for each model, so that every model is trained on a different copy of the data (a subset of the primary dataset), and the final decision is obtained by aggregating the outputs of all models. Random forest [42] is the most common example of the bagging approach. Using multiple models instead of a single model reduces overfitting, although, since the final output is an aggregate of all models' results, no single model's prediction is decisive. In boosting, on the other hand, multiple classifiers are created sequentially: the data samples that the previous model misclassifies are assigned higher weights, a new classifier is trained on the re-weighted data, and the process is repeated until a certain level of accuracy is achieved. AdaBoost [43] and Gradient Boosting [44] are common examples of boosting approaches.
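The bagging-versus-boosting contrast above can be sketched with sklearn's stock implementations on an assumed imbalanced toy dataset (the sizes and seeds here are illustrative, not the paper's experimental setup).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=1)

# Bagging: each tree sees a different bootstrap sample; outputs are averaged.
bag = BaggingClassifier(n_estimators=25, random_state=1).fit(X_tr, y_tr)

# Boosting: trees are built sequentially, re-weighting misclassified samples.
boost = AdaBoostClassifier(n_estimators=25, random_state=1).fit(X_tr, y_tr)

print("bagging accuracy :", bag.score(X_te, y_te))
print("boosting accuracy:", boost.score(X_te, y_te))
```
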
Combining different approaches, such as data-level and algorithm-level, gives an improved technique known as a hybrid approach. Some intelligent approaches exist as well: Kubat and Matwin [45] offered an intelligence-based under-sampling method, one-sided selection (OSS), which selectively removes majority instances that are either redundant or borderline. Overall, none of the above-discussed methods can exploit the more relevant information for resampling.

E. CLUSTER-BASED APPROACHES
The cluster-based imbalance handling approaches aim to keep the more relevant information of the majority class and discard the rest, or even to create multiple clusters of the majority class and combine each sub-cluster with the minority class to create multiple balanced datasets [46]. Yen and Lee [17] presented a cluster-based under-sampling algorithm (CLUS) where training data with homogeneous characteristics is organized into groups and the number of majority class samples in each cluster is then downsized. Two cluster-based majority class under-sampling approaches were proposed in [24], where k clusters are created from the majority class and the ratio of each sub-cluster to the total number of majority class samples is calculated. In the first approach, a certain number of majority class samples is selected from each sub-cluster, whereas in the second approach the majority class samples nearest to the centroid of the sub-cluster are selected. Finally, the selected majority class samples are merged with the minority class samples to create a training set. The fast-CBUS [47] is another cluster-based under-sampling approach that creates k clusters of the minority class. It then defines the boundary of each sub-cluster based on the distance of the outermost minority item of that sub-cluster. The observations of the majority class that lie outside this boundary are discarded during under-sampling; this step is repeated for each sub-cluster. If the majority class observations in a sub-cluster are fewer than the minority ones, majority observations are selected randomly to balance the sub-cluster. This approach may face trouble with datasets where minority class samples are sparse and majority class samples are abundant within the boundary of the minority sub-clusters. Identifying a suitable value of k to generate the k sub-clusters is also not easy in most cases.
Another cluster-based approach for under-sampling was proposed by Tsai et al. [48], who apply cluster analysis to the majority class to convert it into small groups, or clusters. They use an instance selection method to select a certain number of samples from each group, develop a reduced majority class as a result, and combine it with the minority class to form a training set. Their proposed approach was tested only on binary class datasets. In [49], Zhu et al. applied a neighborhood-based clustering algorithm to divide the minority class samples into three groups based on their position: inland minority, borderline minority, and trapped minority. They further apply various over-sampling strategies to generate synthetic samples for each group of minority samples.
An adaptive weighted over-sampling approach using density peaks clustering is proposed in a more recent work [50]. In this study, clustering is applied to the minority class samples; the sparse (less dense) sub-clusters that are closer to the borderline of the majority class are assigned higher weights, giving them a higher probability of being selected for generating new synthetic samples. They also apply a heuristic filter to the minority samples. The existing cluster-based data sampling approaches mostly under-sample the majority class. Applying such methods in applications where the imbalance ratio is very high and minority class samples are rare may discard most of the majority class data in order to achieve a balanced training set. As a result, useful information can be lost.
Recently, [51] proposed a clustering strategy based on the Red Deer Algorithm (RDA) to address the class imbalance in kidney rejection data. To improve the accuracy of their final prediction models, the authors devised a three-stage clustering-based under-sampling strategy to discard samples from the majority class. The study was limited to the kidney rejection dataset only, and removing samples from the majority class may still cause the loss of important information.
To pre-process imbalanced data, the study in [52] presents Reduced Noise-SMOTE (RN-SMOTE). RN-SMOTE first uses SMOTE to over-sample the training data, which can introduce noisy synthetic examples into the minority class. DBSCAN is then used to detect and filter out this noise, and the original data is merged with the remaining clean synthetic instances. After the dataset has been rebalanced in this way, RN-SMOTE feeds it into the underlying classifier. RN-SMOTE is limited to the binary class imbalance problem.
Inspired by the cluster-based approaches, we propose an approach called Cluster-based Hybrid Sampling for Imbalance Data to address the class imbalance issue. The existing cluster-based data sampling approaches apply either over-sampling or under-sampling only, whereas our proposed approach combines new over-sampling and under-sampling methods, unlike SMOTEENN and SMOTETomek, which combine two existing approaches. The following section discusses the proposed approach in detail.

III. PROPOSED APPROACH
The Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) first calculates the threshold using (1), which is then used to separate the minority and majority classes of the underlying dataset:

µ = Z_D / C_D  (1)

where Z_D is the size of the dataset D and C_D is the number of classes in D. This threshold (µ) is then considered the target number of samples in each class after under-sampling and over-sampling. Any class with fewer samples than the threshold is considered a minority class (MI), whereas a class with more instances than the threshold is considered a majority class (MJ).
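This threshold step can be sketched in a few lines; the helper name and the 90/10 toy labels below are illustrative assumptions, not part of the paper.

```python
from collections import Counter

def split_by_threshold(y):
    """Split class labels into minority/majority using the threshold
    mu = Z_D / C_D (dataset size over number of classes), as in Eq. (1)."""
    counts = Counter(y)
    mu = len(y) / len(counts)
    minority = [c for c, n in counts.items() if n < mu]
    majority = [c for c, n in counts.items() if n >= mu]
    return mu, minority, majority

y = ["a"] * 90 + ["b"] * 10
mu, mi, mj = split_by_threshold(y)
print(mu, mi, mj)   # 50.0 ['b'] ['a']
```

With 100 samples and 2 classes, µ = 50; class "b" (10 samples) falls below the threshold and is treated as minority, while class "a" (90 samples) is majority.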
Each class is divided into groups of data examples with similar properties, called sub-clusters, using affinity propagation (AP) [25]. For under-sampling the majority class, data points of each sub-cluster that are far from the sub-cluster centroid are removed, leaving observations close to the centroid. For over-sampling the minority class, we use the observations of each sub-cluster that are closer to the centroid to generate synthetic data. A core advantage of our approach is that it does not increase the training complexity of the classifier, because it balances the training set in such a way that its size remains the same: as CBHSID is a hybrid approach, the number of observations it removes from the majority class during under-sampling equals the number of synthetic observations it adds to the minority class during over-sampling. The flow diagram of the proposed approach is shown in Figure 1.
A. CLUSTER ANALYSIS
Cluster analysis helps in identifying the most valuable data in a class, whereas well-known data balancing approaches like ROS and RUS select data randomly for addition or removal. AP was chosen as the clustering approach to divide each class into multiple sub-clusters (say k sub-clusters). AP does not require the value of k (the number of clusters to generate) in advance: applied to each class, it produces a certain number of sub-clusters based on data similarity and also provides the centroid of each sub-cluster. Next, we under-sample the majority class sub-clusters and over-sample the minority class sub-clusters to balance the data. CBHSID removes observations far from the centroid during under-sampling and generates synthetic data from observations close to the centroid during over-sampling.
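A minimal sketch of this clustering step with sklearn's AffinityPropagation, run on an assumed single-class toy dataset with three natural groups; note that no cluster count is passed in, and that the exact number of sub-clusters AP finds depends on its preference parameter.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Illustrative single-class data: three well-separated 2-D groups.
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2)) for loc in (0.0, 4.0, 8.0)])

ap = AffinityPropagation(random_state=0).fit(X)
print("sub-clusters found:", len(ap.cluster_centers_))
print("labels of first five points:", ap.labels_[:5])
```

The fitted model exposes both a sub-cluster label for every observation (`labels_`) and an exemplar point per sub-cluster (`cluster_centers_`), which plays the role of the centroid in the distance calculations that follow.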
To identify how far (or how close) an observation is from its centroid, CBHSID calculates the distance of each observation of the sub-cluster from the centroid. The distance is simply a difference of means. To get the distance of an observation, it first calculates the mean of every observation in the sub-cluster and stores it in a new feature called Mean. If there are m observations in a sub-cluster, each with n features, it calculates the mean of every observation using (2):

Mean_{X_r} = (1/n) Σ_{f=1}^{n} X_{r,f}  (2)

where X_r represents the r-th observation, X_{r,f} the f-th feature of the r-th observation, n the total number of features in an observation X_r, and Mean_{X_r} the calculated mean of all features of X_r. Using this mean, it calculates the absolute difference as the distance of each observation from the centroid of the sub-cluster using (3):

Distance_{X_r} = |Mean_{X_centroid} − Mean_{X_r}|  (3)

Equation (3) subtracts the mean of every observation of a sub-cluster from the mean of X_centroid, the observation at the center of the sub-cluster, and returns the absolute difference, which is taken as the distance of observation X_r from the centroid. Observations with smaller distance values are considered the nearest to the centroid; observations with higher distance values are considered the farthest. Generally, class imbalance handling approaches either increase the size of the training data (over-sampling) or decrease it and lose some helpful information (under-sampling). The advantage of our proposed method is that it keeps the valuable information and, since we apply both under-sampling and over-sampling, the overall size of the training data remains almost the same as the original. The details of the novel under-sampling and over-sampling techniques are discussed in the following subsections and then summarised in Algorithm 1, Algorithm 2, and Algorithm 3.
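The mean-difference distance of (2)–(3) reduces to two NumPy operations; the helper name and the toy sub-cluster below are illustrative assumptions.

```python
import numpy as np

def centroid_distances(SC, centroid_idx):
    """Eq. (2)-(3): per-row feature mean, then absolute difference
    between each row's mean and the centroid row's mean."""
    means = SC.mean(axis=1)                     # Eq. (2), one mean per observation
    return np.abs(means - means[centroid_idx])  # Eq. (3)

# Toy sub-cluster of 3 observations with 2 features; row 0 is the centroid.
SC = np.array([[1.0, 3.0], [2.0, 2.0], [10.0, 12.0]])
d = centroid_distances(SC, centroid_idx=0)
print(d)   # [0. 0. 9.]
```

Note that this scalar mean-difference distance can assign distance 0 to distinct observations (rows 0 and 1 above have the same mean), which is a property of the paper's definition rather than of the sketch.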

B. CLUSTER-BASED UNDER SAMPLING
Suppose a class is identified as the majority class (MJ). In that case, we have to remove a certain number of observations from it, N_REMOVE_MJ, to reduce the size of the majority class (N_MJ) to the threshold level; this is achieved using (4):

N_REMOVE_MJ = N_MJ − µ  (4)

After applying clustering analysis as discussed in Section III, the majority class is converted into multiple sub-clusters, and N_REMOVE_MJ is distributed among the sub-clusters based on their size; in other words, N_REMOVE_MJ equals the sum of the observations removed from each sub-cluster of MJ. We propose a novel formula to calculate the number of observations to remove from every sub-cluster of the majority class MJ, given in (5):

N_REMOVE_SCi_MJ = N_REMOVE_MJ × (N_SCi_MJ / N_MJ)  (5)

Here SCi is the i-th sub-cluster (SC), so SCi_MJ represents the i-th sub-cluster of the majority class MJ, and N_SCi_MJ is the number of observations in SCi of MJ. Hence, N_REMOVE_SCi_MJ is the number of observations to remove from the i-th sub-cluster of the majority class. The N_REMOVE_SCi_MJ observations farthest from the centroid (having the highest distance values) are removed from each sub-cluster of the majority class. This is achieved using the calculated distance of each observation: we arrange the distances in descending order (observations with higher distance on top) and remove the top N_REMOVE_SCi_MJ observations. In other words, the observations farthest from the centroid in terms of mean difference are removed, so that each sub-cluster retains the observations near its centroid. The remaining observations of all sub-clusters are merged to form an under-sampled majority class of size approximately µ.
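A compact sketch of this per-sub-cluster removal step, reusing the mean-difference distance; the helper name, the toy sub-cluster, and the choice of row 0 as centroid are illustrative assumptions.

```python
import numpy as np

def undersample_subcluster(SC, centroid_idx, n_remove):
    """Drop the n_remove observations farthest (in mean-difference distance)
    from the sub-cluster centroid, keeping the rest."""
    means = SC.mean(axis=1)
    dist = np.abs(means - means[centroid_idx])
    keep = np.argsort(dist)[: len(SC) - n_remove]   # smallest distances kept
    return SC[keep]

# Toy sub-cluster: three points near the centroid, one outlier.
SC = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 6.0], [1.2, 1.0]])
kept = undersample_subcluster(SC, centroid_idx=0, n_remove=1)
print(kept)   # the outlier [5.0, 6.0] is removed
```

In CBHSID, `n_remove` would be the N_REMOVE_SCi_MJ computed from (5), applied to every sub-cluster of the majority class in turn.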
In Figure 2, we illustrate how the under-sampling part of our proposed approach works. Figure 2(a) shows the actual majority class. After applying cluster analysis, we get the sub-clusters of the class, illustrated in Figure 2(b). In Figure 2(c), the lines from the center of a sub-cluster to an observation represent the distance of that observation from the centroid: observations with small distances are drawn with blue lines, and observations with higher distance values with red lines. As CBHSID removes observations far from the centroid and keeps observations near it based on the distance value, the observations near the centroid are shown within the dotted circle. Figure 2(d) shows the majority class after under-sampling.

C. CLUSTER-BASED OVER SAMPLING
For over-sampling, we need to add a certain number of samples to the minority class (MI), denoted N_ADD_MI, to increase the size of the minority class (N_MI) to the threshold level. N_ADD_MI is calculated using (6):

N_ADD_MI = µ − N_MI  (6)

After applying clustering analysis to the minority class, as discussed in Section III, the minority class illustrated in Figure 3(a) is converted into different sub-clusters, as illustrated in Figure 3(b). We further calculate the number of observations to add to each sub-cluster, N_ADD_SCi_MI, based on the size of the sub-cluster, i.e., the number of observations (N_SCi_MI) in the i-th sub-cluster of the minority class MI, using (7):

N_ADD_SCi_MI = N_ADD_MI × (N_SCi_MI / N_MI)  (7)

For over-sampling, we repeat the same method to calculate the distance of each observation from the centroid. This time we arrange the distances in ascending order (observations with smaller distance on top) and select the top N_ADD_SCi_MI observations, i.e., those nearest to the centroid, for generating synthetic data. To generate a synthetic observation with n features, we use all n features of a nearest observation one by one: the first feature of the nearest observation is used to generate the first feature of the synthetic observation, the second feature of the nearest observation generates the second feature of the synthetic observation, and so on.
To generate a synthetic value, we first take the absolute difference of two consecutive feature values of the nearest observation, i.e., abs(F_i − F_{i+1}), starting at i = 1 for the first feature. We multiply this difference by a randomly selected feature value of the centroid observation X_centroid, and we also multiply the same absolute difference by the mean value of the centroid observation. Finally, we sum these two products and add the actual value of F_i to obtain the synthetic value for the i-th feature of the synthetic observation, represented as F'_i.
We iterate this step, increasing i by one until i reaches n, with F_{n+1} taken from the Mean column, thereby generating n synthetic feature values and completing one synthetic observation (a full row with n features). The method is repeated until we have generated the required number of observations for the specific sub-cluster, i.e., N_ADD_SCi_MI, as illustrated in Figure 3(c). If N_ADD_SCi_MI (the number of samples to add to the i-th sub-cluster) is equal to or greater than N_SCi_MI (the total number of samples in the i-th sub-cluster), then all actual observations are used to generate synthetic observations, and a single actual observation may even be used to generate multiple synthetic ones. Selecting random values from the centroid observation helps generate unique synthetic observations even when multiple synthetic observations are produced from a single actual observation. Equation (8) gives the complete statement for calculating the synthetic value of a feature F'_i:

F'_i = F_i + |F_i − F_{i+1}| × random(X_centroid) + |F_i − F_{i+1}| × Mean(X_centroid)  (8)

where F_i and F_{i+1} are two consecutive features of the nearest observation and X_centroid is the central observation; random(X_centroid) returns a randomly chosen feature value of the central observation, and Mean(X_centroid) gives the mean value of the central observation.
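A sketch of this synthetic-sample rule, as I read Eq. (8): for each feature, the feature value is perturbed by the consecutive-feature difference scaled by a random centroid feature and by the centroid mean, with the appended Mean column serving as F_{n+1} for the last feature. The helper name and the toy vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(nearest, centroid):
    """Generate one synthetic observation from a near-centroid observation,
    following the Eq. (8)-style rule described in the text."""
    row = np.append(nearest, nearest.mean())   # append the Mean column as F_{n+1}
    c_mean = centroid.mean()
    synth = np.empty(len(nearest))
    for i in range(len(nearest)):
        diff = abs(row[i] - row[i + 1])
        # F'_i = F_i + diff * random centroid feature + diff * centroid mean
        synth[i] = row[i] + diff * rng.choice(centroid) + diff * c_mean
    return synth

nearest = np.array([1.0, 1.2, 0.8])
centroid = np.array([1.1, 1.0, 0.9])
print(synthesize(nearest, centroid))
```

Because `rng.choice(centroid)` draws a fresh random centroid feature per call, repeated calls on the same `nearest` observation yield distinct synthetic rows, matching the uniqueness argument in the text.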

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. EXPERIMENT SETTING
1) BASE CLASSIFIER
The well-known and straightforward Support Vector Machine (SVM) is used as the base classifier for our experiments; it has been widely used for classification tasks, as explained in [53]. SVM was initially formulated for binary class problems [54]. However, with suitable data arrangements, SVM is nowadays also used for multi-class classification, as in [55]. In this study, the simple version of SVM is used for binary class data, whereas for multi-class data the multi-class version of SVM is used in the one-vs-rest setting, which is considered less complex than one-vs-one.
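The one-vs-rest arrangement can be sketched with sklearn's wrapper around SVC; the iris dataset here is only an illustrative multi-class stand-in, not one of the paper's benchmark datasets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # illustrative 3-class dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# One binary SVM per class (one-vs-rest), as used for the multi-class datasets.
clf = OneVsRestClassifier(SVC()).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```

One-vs-rest trains C binary classifiers for C classes, versus C(C−1)/2 for one-vs-one, which is the complexity argument made above.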

2) PERFORMANCE METRICS
The classification performance is measured using the Geometric Mean (G-Mean), Recall, and F1 score metrics, which are the ones mostly used in the literature when data is imbalanced [56]. The G-Mean is the geometric mean of the True Positive Rate (TPR) and True Negative Rate (TNR); a good classifier should have high TPR and TNR values, which results in a high G-Mean. Recall is the proportion of correctly classified positive class observations to the total observations in the positive class. The positive class is the minority in most applications, such as health systems [1]; therefore, algorithms that produce high Recall values are considered more helpful in such applications. The F1 score is another valuable metric for measuring model performance when data is imbalanced: it is the harmonic mean of precision and Recall, so it accounts for both false positives and false negatives rather than considering the positive class alone.
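The three metrics can be computed from a confusion matrix in a few lines; the toy label vectors below are assumptions chosen so that the arithmetic is easy to follow (3 positives, 7 negatives).

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)            # Recall / sensitivity: 2/3
tnr = tn / (tn + fp)            # specificity: 6/7
g_mean = (tpr * tnr) ** 0.5     # geometric mean of TPR and TNR

print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("G-Mean:", round(g_mean, 3))
```

Note how G-Mean penalizes a classifier that ignores either class: if TPR were 0 (all positives missed), G-Mean would be 0 regardless of TNR, unlike plain accuracy.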

3) DATASETS
Experiments in the current study are performed on twelve binary and four multi-class datasets selected from the University of California Irvine (UCI) and KEEL data repositories. The datasets are chosen based on their class imbalance ratio, calculated as the number of data samples in the majority class over the number of data samples in the minority class, for both binary and multi-class datasets. The imbalance ratio of the selected datasets ranges from 1.38 to 41.4; a value closer to 1 means the data is less imbalanced, and a higher value means it is more imbalanced. The selected datasets are considered benchmarks and are widely used in the literature, as summarized by [29]. A summary of the datasets used in this study is given in Table 1.
The proposed approach is compared with well-known, commonly used state-of-the-art data balancing techniques: the data-level methods ROS, RUS, and SMOTE; the hybrid methods SMOTETomek and SMOTEENN; and a cluster-based approach, ClusterCentroid (ClusterC). A summary of these approaches is given in Table 2. The experiments are performed using the imblearn and sklearn Python libraries, with all approaches used with their default parameters. The train-test split method is used to generate different training and testing sets, each time with an 80:20 ratio [57]: for every dataset, 80% of the observations are used for training and 20% for testing the classifier. The experiments are run five times and the average of the results is taken as the final result.

B. EXPERIMENT RESULTS AND DISCUSSION
The performance of the proposed approach is measured on three different performance metrics, each discussed below. Table 3 shows the G-Mean results obtained using six state-of-the-art data balancing approaches and our proposed CBHSID approach on 16 datasets (12 binary-class and 4 multi-class). The results show that our proposed algorithm outperformed the other approaches on most of the datasets: it outperformed all compared methods on 10 out of 12 binary-class datasets, and on 3 of the 4 multi-class datasets (Penbased, Contraceptive, and Thyroid). It can also be observed that the datasets on which CBHSID performed better than the compared approaches span a wide range of imbalance ratios, i.e., CBHSID can perform well on datasets with low as well as high imbalance. Table 4 shows the performance in terms of Recall. The results show that our proposed algorithm outperformed the other approaches on 12 datasets out of 16; however, on the Yeast5, Yeast6, Dermatology, and Thyroid datasets, which are highly imbalanced, SMOTE performed better than CBHSID. Table 5 shows the performance of the compared approaches in terms of F1 score, where our proposed algorithm again outperformed the other approaches on 12 datasets out of 16.
A pairwise comparison was performed between CBHSID and the other selected state-of-the-art methods to further validate the performance of CBHSID. Table 6 shows the comparison results of CBHSID with the other methods for the 16 datasets on the three evaluation metrics; a similar kind of comparison is performed in [58]. When CBHSID performed better than or equal to the compared method, we represent it with '1', and with '0' otherwise. Columns 1 to 16 represent the datasets in the same sequence as in Table 1, and column 'SUM' represents the total number of datasets on which CBHSID performed better than the compared method. Regarding G-Mean (Table 6a), CBHSID is better than ROS and SMOTE on 13 datasets and worse on only 3, while it is better than RUS, SMOTEENN, and SMOTETomek on 14 datasets and worse on only 2. In Recall (Table 6b), CBHSID performed better than ROS, SMOTE, SMOTEENN, and SMOTETomek on 13, 12, 15, and 14 datasets, respectively, whereas it completely outperformed RUS and Cluster-C on all 16 datasets. In the same way, in terms of F1 score (Table 6c), CBHSID performed better than ROS, SMOTE, SMOTEENN, and SMOTETomek on 13, 12, 15, and 15 datasets, respectively, and again completely outperformed RUS and Cluster-C on all 16 datasets. Generally, CBHSID outperformed all other methods on most of the datasets for all three evaluation metrics. The performance of algorithms varies across metrics, and the same can be seen in our results: CBHSID outperformed all other methods in G-Mean on the Thyroid dataset (column 16) but did not produce good results in terms of Recall and F1 score for the same dataset. Similarly, on Glass6 (column 7) and Yeast3 (column 9), it was behind all methods in terms of G-Mean but completely outperformed all other methods on the Recall and F1 score metrics.
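The win/tie indicator used in Table 6 reduces to an element-wise comparison of per-dataset scores. The scores below are illustrative placeholders, not values from the paper's tables:

```python
import numpy as np

# Hypothetical per-dataset scores for CBHSID and one baseline (4 datasets).
cbhsid = np.array([0.91, 0.85, 0.78, 0.88])
baseline = np.array([0.89, 0.85, 0.80, 0.84])

# '1' where CBHSID is better than or equal to the baseline, '0' otherwise;
# the row sum plays the role of the 'SUM' column in Table 6.
wins = (cbhsid >= baseline).astype(int)
total = int(wins.sum())
print(wins, total)  # [1 1 0 1] 3
```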
On the Liver, Pima, Cancer, Heart failure, Yeast1, Haberman, Glass6, Yeast3, Yeast4, Yeast5, Yeast6, Penbased, and Contraceptive datasets, CBHSID remained consistent on all three evaluation metrics. We performed the statistical tests discussed in the following sub-section to verify whether CBHSID performed significantly better than the other methods.

1) STATISTICAL ANALYSIS
The results of all algorithms are statistically analyzed to assess the significance of the performance differences. We used the Friedman test [59] to compare the mean ranks of all the investigated algorithms across all experimental datasets. The Friedman test ranks all of the compared algorithms independently on each dataset: the best algorithm receives a rank of 1, the second-best a rank of 2, and so on. In the case of a tie, we assign the average rank value.
The mean rank is the average of an algorithm's ranks across all datasets; as a result, a lower mean rank signifies higher performance. If the null hypothesis, which states that all algorithms perform equally, is rejected, the Bonferroni-Dunn post-hoc test [60] is used to determine which algorithms are statistically distinct. Furthermore, we use the non-parametric Wilcoxon signed-ranks test [61] to analyze the performance difference between two algorithms more precisely. Unlike the Friedman test, which operates on the mean ranks of multiple algorithms, the Wilcoxon test ranks the performance differences of two algorithms for each dataset and compares these rankings while disregarding the signs of the positive and negative differences. Table 7 shows the mean ranks in G-Mean, Recall, and F1 score for the compared algorithms across all the experimental datasets. From Table 7, we can see that CBHSID exhibits the best mean ranks on all metrics, revealing that it produced the best results compared to the other class imbalance handling techniques. The same table also presents the Friedman test results, which reject the null hypothesis for all three metrics, G-Mean, Recall, and F1 score, at alpha = 0.05.
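The ranking scheme and the Friedman test can be sketched with scipy. The score matrix below is an illustrative placeholder (5 datasets, 3 algorithms), not the paper's data:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical scores: rows = datasets, columns = algorithms.
scores = np.array([
    [0.91, 0.88, 0.85],
    [0.87, 0.89, 0.84],
    [0.93, 0.90, 0.88],
    [0.86, 0.82, 0.83],
    [0.90, 0.87, 0.85],
])

# Rank per dataset: the best algorithm gets rank 1; ties get the average rank.
ranks = rankdata(-scores, axis=1)
mean_ranks = ranks.mean(axis=0)      # lower mean rank = better algorithm

stat, p = friedmanchisquare(*scores.T)
reject_null = p < 0.05               # null: all algorithms perform equally
```

If `reject_null` is true, a post-hoc test (Bonferroni-Dunn in this study) determines which pairs of algorithms actually differ.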
We proceed with the Bonferroni-Dunn post-hoc test. The critical difference (CD) in the Bonferroni-Dunn test is 2.014 (α = 0.05). The CD diagrams are given in Figure 4, where a horizontal line connects the methods with no significant performance difference; methods not connected by the same line show a significant performance difference. Since we are comparing CBHSID with the other methods, we mainly discuss the performance of CBHSID. Figure 4(a) shows that CBHSID produced significantly better results than SMOTE, SMOTEENN, SMOTETomek, and Cluster-C on the G-Mean metric, while there is no significant difference among CBHSID, ROS, and RUS on the same metric. Figure 4(b) shows that CBHSID produced significantly better results than SMOTETomek, Cluster-C, RUS, and SMOTEENN on the Recall metric; there is no significant difference among CBHSID, ROS, and SMOTE, which are hence connected by the same horizontal line. Figure 4(c) shows that CBHSID produced significantly better results than RUS, SMOTEENN, SMOTETomek, and Cluster-C on the F1 score metric, whereas there is no significant difference among CBHSID, ROS, and SMOTE on the same metric. Figures 4(b) and 4(c) thus show a statistically significant difference in performance for CBHSID versus RUS, CBHSID versus SMOTEENN, and CBHSID versus SMOTETomek on the Recall and F1 score metrics, while no significant difference was found for CBHSID versus ROS and CBHSID versus SMOTE. Table 8 shows the Wilcoxon signed-ranks test results for the pairwise comparison of CBHSID versus ROS, RUS, SMOTE, SMOTEENN, SMOTETomek, and Cluster-C on all three metrics.
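The critical difference above follows the standard Bonferroni-Dunn formula CD = q_α · sqrt(k(k+1)/(6N)) for k algorithms and N datasets. With k = 7 methods, N = 16 datasets, and the tabulated q_0.05 ≈ 2.638 for seven classifiers, the formula reproduces the CD of about 2.014 used in Figure 4:

```python
import math

def critical_difference(q_alpha, k, n_datasets):
    """Bonferroni-Dunn critical difference: CD = q_a * sqrt(k*(k+1) / (6*N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# k = 7 compared methods, N = 16 datasets; q_0.05 ~= 2.638 is the tabulated
# two-tailed Bonferroni-Dunn value for seven classifiers.
cd = critical_difference(2.638, 7, 16)
print(round(cd, 3))  # ~2.015, matching the CD of about 2.014 reported above
```

Two methods whose mean ranks differ by more than this CD are declared significantly different in the diagram.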
R+ stands for the sum of ranks for the datasets on which CBHSID outperformed the other method, R− denotes the sum of ranks for the opposite case, and I stands for identical results. The symbols '***' and '**' indicate that the considered algorithm is significantly better than the other at a significance level of 0.01 and 0.05, respectively. Since 16 datasets are used in this experiment, the critical values of the Wilcoxon test for α = 0.01 and α = 0.05 are 19 and 29, respectively. Table 8(a) shows that CBHSID performed significantly better than ROS on the G-Mean and F1 score metrics for α = 0.05, which is the opposite of the Bonferroni-Dunn test result, which showed no significant difference; however, for Recall, it shows no significant difference, which matches the Bonferroni-Dunn test. Table 8(b) shows that CBHSID is significantly better than RUS on the G-Mean metric for α = 0.05, whereas on the Recall and F1 score metrics, CBHSID is better for both α = 0.05 and α = 0.01. These results also differ slightly from the Bonferroni-Dunn test, which showed no significant difference between the performances of CBHSID and RUS on G-Mean.
Table 8(c) shows that CBHSID versus SMOTE is significant on G-Mean for α = 0.05 and non-significant on Recall and F1 score, which matches the Bonferroni-Dunn test. Similarly, CBHSID performed significantly better than SMOTEENN and SMOTETomek on all three performance metrics for α = 0.05; for α = 0.01, CBHSID is better than SMOTEENN on G-Mean, and better than both SMOTEENN and SMOTETomek on F1 score. Lastly, CBHSID remained significantly better than Cluster-C on all three metrics, with α = 0.05 on G-Mean and α = 0.01 on Recall and F1 score. Overall, CBHSID produced significantly better results than the other methods for most of the datasets on the G-Mean, Recall, and F1 score performance metrics.
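The R+/R− mechanics of the Wilcoxon signed-ranks test can be reproduced with scipy; the per-dataset scores below are randomly generated placeholders, not the paper's Table 8 inputs:

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

# Hypothetical per-dataset scores for CBHSID and one baseline (16 datasets).
rng = np.random.default_rng(42)
baseline = rng.uniform(0.6, 0.9, size=16)
cbhsid = baseline + rng.uniform(-0.02, 0.08, size=16)  # mostly better

# Rank the per-dataset differences by absolute size; R+ sums the ranks where
# CBHSID wins and R- those where it loses. The test statistic is min(R+, R-).
diff = cbhsid - baseline
abs_ranks = rankdata(np.abs(diff))
r_plus = abs_ranks[diff > 0].sum()
r_minus = abs_ranks[diff < 0].sum()

stat, p = wilcoxon(cbhsid, baseline)
# With n = 16, a statistic at or below 29 (alpha = 0.05) or 19 (alpha = 0.01)
# indicates a significant difference between the two algorithms.
significant_05 = stat <= 29
```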

V. CONCLUSION
This paper introduced CBHSID, a novel Cluster-based Hybrid Sampling approach for Imbalanced Data, to address the class imbalance problem for both binary and multi-class classification. In our approach, we first calculate the mean value, i.e., the average number of samples per class, which is then used as the threshold for data balancing. Next, we identify the minority and majority classes based on the calculated threshold. Cluster analysis is applied to each class to partition it into sub-clusters, and we calculate the distance of each data item from the centroid of its sub-cluster to identify the observations closer to the centroid. We achieve this by calculating the mean of every observation and selecting the observations whose mean is closer to the mean of the centroid observation. In the case of under-sampling, we discard the data samples away from the centroid and keep only those closer to it, whereas in the case of over-sampling, we generate synthetic data samples from the data observations near the center. The number of data observations to add (over-sampling) or remove (under-sampling) in any sub-cluster is calculated based on the size of that sub-cluster. We compared our proposed approach with six state-of-the-art data balancing methods on 12 binary and 4 multi-class datasets. The G-Mean, Recall, and F1 score results show that our method outperformed the compared methods on most of the datasets. On the G-Mean metric, CBHSID is better than ROS and SMOTE on 13 datasets, and better than RUS, SMOTEENN, and SMOTETomek on 14 datasets. On the Recall metric, CBHSID performed better than ROS, SMOTE, SMOTEENN, and SMOTETomek on 13, 12, 15, and 14 datasets, respectively, whereas it completely outperformed RUS and Cluster-C on all 16 datasets.
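The core clustering step summarized above can be illustrated as follows. This is a minimal sketch, not the authors' exact implementation: the synthetic two-blob class, the Euclidean distance, and the 50/50 near/far split are all illustrative assumptions; affinity propagation is the clustering method named in the paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Samples of a single class, drawn from two dense regions (illustrative data).
rng = np.random.default_rng(0)
X_class = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
                     rng.normal(5.0, 0.3, (30, 2))])

# Sub-cluster the class, then rank each sub-cluster's members by their
# distance to the sub-cluster centroid.
ap = AffinityPropagation(random_state=0).fit(X_class)
near_center, far_from_center = [], []
for label in np.unique(ap.labels_):
    members = X_class[ap.labels_ == label]
    centroid = members.mean(axis=0)
    order = np.argsort(np.linalg.norm(members - centroid, axis=1))
    half = len(members) // 2
    near_center.append(members[order[:half]])     # seeds for synthesis (over-sampling)
    far_from_center.append(members[order[half:]]) # candidates to drop (under-sampling)
```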
In the same way, on the F1 score metric, CBHSID performed better than ROS, SMOTE, SMOTEENN, and SMOTETomek on 13, 12, 15, and 15 datasets, respectively, and again completely outperformed RUS and Cluster-C on all 16 datasets. The results were verified with the Friedman test, on which CBHSID produced the best mean rank among all the compared methods, ranking first on all three performance metrics with 1.91 on G-Mean, 1.66 on Recall, and 1.59 on F1 score. The Bonferroni-Dunn and Wilcoxon signed-ranks post-hoc tests further showed that CBHSID produced significantly better results than the other methods. However, on the Yeast5, Yeast6, Dermatology, and Thyroid datasets, which are highly imbalanced, SMOTE performed better than CBHSID on the Recall and F1 score metrics.
The performance of CBHSID can be improved further by: i) using different distance metrics in the cluster analysis while calculating the distance of a data item from the centroid of its sub-cluster; ii) applying feature selection and generating synthetic data using the most relevant features; iii) refining the under-sampling of the majority class: after applying cluster analysis, data items are removed from each sub-cluster based on its size, but some sub-clusters may have very few data items, and removing items from them may cost the actual class its representation, so there should be a mechanism to handle such situations; iv) handling noise near the centroid: in the current study, we only considered data samples close to the centroid for over-sampling and under-sampling because we believe there is a low likelihood of noisy data being present near the centroid, but if noisy data is nevertheless present there, it could lead to the generation of noisy data during resampling; v) investigating class imbalance between sub-clusters and testing our approach on data of that complexity.
In the current study, our proposed approach was not tested on noisy data, which may affect the performance of CBHSID. In our future work, we are going to look into the matters discussed above to see their effect on the performance of CBHSID. We also plan to test CBHSID on non-stationary data streams in online learning environments.
MANZOOR AHMED HASHMANI (Senior Member, IEEE) received the Ph.D. degree in communication networks from the Nara Institute of Science and Technology, Japan. He is currently an Associate Professor with the Department of Computer and Information Sciences, Universiti Teknologi PETRONAS. He is also a member of the High Performance Cloud Computing Centre and the Centre for Research in Data Science. His research interests include computer communications (networks), artificial neural networks, and artificial intelligence.
HEITOR MURILO GOMES received the Ph.D. degree in computer science from Pontifícia Universidade Católica do Paraná, Brazil, in 2017. He is a machine learning researcher and a data scientist. He is currently a Lecturer (an Assistant Professor) in AI at the Victoria University of Wellington, New Zealand. His research interests include machine learning, data stream mining, big data analytics, and artificial intelligence.

ABDUL REHMAN GILAL received the Ph.D. degree in information technology from Universiti Teknologi PETRONAS (UTP), Malaysia. He has been mainly researching in the field of empirical software engineering and cybersecurity, creating methods and techniques for efficient and secure software development. Based on his research publication track record, he has contributed to the areas of human factors in software development, complex networks, databases and data mining, programming, and cloud computing.

VOLUME 10, 2022