PSU: Particle Stacking Undersampling Method for Highly Imbalanced Big Data

Imbalanced classes are a common problem in machine learning, and the computational cost required for proper resampling increases with the data size. In this study, a simple and effective undersampling method, named particle stacking undersampling (PSU), was proposed. Compared with other competing undersampling methods, PSU can significantly reduce the computational cost while minimizing information loss to prevent a prediction bias. The performance benchmark applied on 55 binary classification problems indicated that the proposed method not only achieved an enhanced classification performance over other well-known undersampling methods (random undersampling, NearMiss-1, NearMiss-2, cluster centroid, edited nearest neighbor, condensed nearest neighbor, and Tomek Links) but also provided a computational simplicity that can be scalable to large data. Moreover, an experiment verified that the two propositions forming the basis of the PSU algorithm can also be applied to other undersampling methods to achieve methodological improvements.


I. INTRODUCTION
Dealing with imbalanced data is a crucial task in data mining studies. In particular, concerning the classification problems, most datasets in the real world do not contain the exact equal number of instances in each class, i.e., the classes are unequally represented, which can eventually cause significant problems while applying some algorithms.
In supervised learning, most classifiers are designed to achieve the best accuracy at the risk of being overwhelmed by an underlying class distribution [1], [2]. In the worst case, the resulting classifier becomes indiscriminate, i.e., it may be biased toward the majority class presented in the training set without having performed any feature analysis. This causes diverse ramifications depending on the properties of classifiers. For instance, in support vector machines (SVM), prediction performance can deteriorate owing to a) minority data that do not correspond to an ideal hyperplane, b) soft-margins invalidated by minority data, and c) support vectors dominated by majority data [3].
Among various techniques devised to address the imbalanced data issue, the resampling technique is a widely used data-level solution [4] that is generally achieved using two main approaches: undersampling and oversampling.
The associate editor coordinating the review of this manuscript and approving it for publication was Senthil Kumar.
In the undersampling technique, instances from the majority class are eliminated to obtain a balanced training dataset. For example, random undersampling (RUS) randomly deletes instances in the majority class. However, such an approach can lead to information loss from the removed data points [2], [3]. To mitigate this side effect, Altınçay and Ergün [5] proposed the concept of cluster centroids (CC) based on the k-means clustering approach. In this method, instances belonging to the majority data are grouped into a certain number of clusters (for example, as in an integrated framework of RUS and k-means clustering in [6]). To avoid the loss of potentially useful data, various heuristic undersampling methods have also been proposed. Hart [7] formulated the condensed nearest neighbor (CNN) rule, and Wilson [8] introduced the concept of an edited nearest neighbor (ENN) by applying the k-NN approach to reduce the number of data points in the majority class. Similarly, Batista et al. [9] suggested a combination of CNN with Tomek Links [10]; in this approach, a learner first selects a subset of the majority class data, and the Tomek Links method is then applied to this subset. Mani and Zhang [11] proposed multiple versions of the NearMiss method using the k-NN approach: a) NearMiss-1, which generates a resampled dataset based on the mean distance from the minority class data to the k nearest points; b) NearMiss-2, which yields a resampled dataset with the mean distance to the k farthest points in the minority class data; and c) NearMiss-3, which selects the k nearest neighbor points in the majority class data to the whole minority class data.
Conversely, in the oversampling technique, the instances from the minority class are duplicated to match the number of majority class instances. Random oversampling (ROS) is a typical method that randomly replicates the minority class data. Another widely used oversampling method is the synthetic minority oversampling technique (SMOTE) [12]. The basic step of the SMOTE procedure is to perform an interpolation among the neighboring minority class data to synthesize under-represented instances. The SMOTE method is considered the standard oversampling framework to deal with imbalanced datasets [13] (as in various applications using SMOTE reported in [14]-[17]).
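The interpolation step described above can be illustrated with a minimal sketch. It is not the reference SMOTE implementation from [12]; the function name `smote_sketch`, the random choice of the base point, the default `k`, and the fixed seed are assumptions made for illustration only, and at least two minority points are required.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Synthesize n_new points by interpolating between minority neighbors.

    Hypothetical helper for illustration; needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbors)
        gap = rng.random()  # interpolation weight in [0, 1)
        # The new point lies on the segment between base and its neighbor.
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the region spanned by the minority class.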
Furthermore, algorithm-level solutions have also been proposed to address the imbalanced data issue. For instance, cost-sensitive modeling, a popular regularization treatment, is broadly used to mitigate the class imbalance problem. In SVM, cost-sensitive SVM (CS-SVM) [18] uses differing costs considering an underlying class distribution of training data to control the sensitivity of misclassification (see the heuristic-based CS-SVM proposed in [19]). Lin and Wang [20] combined the fuzzy concept with SVM (F-SVM), where the fuzzy membership of each input point was reformulated in SVM such that different inputs can provide different contributions to the construction of a hyperplane. Wu and Chang [21] modified the kernel function using the adaptive conformal transformation to modify the spatial resolution around the class boundary. Moreover, Li et al. [22] introduced an integrated framework combining AdaBoost and SVM (AdaSVM) to boost the accuracy of SVM on imbalanced data.
Although various approaches have been developed to cope with imbalanced data, limited attention has been paid to computational scalability. For large-scale imbalanced data, it is logical to use an undersampling method that not only adjusts the class distribution but also obtains a manageable training dataset. However, as shown in Fig. 1, the computational cost required for the undersampling process can become a critical concern as the data size increases.
Herein, a simple and effective undersampling method, referred to as particle stacking undersampling (PSU), was proposed, which can reduce the computational cost compared with other well-known undersampling methods, while minimizing the information loss to avoid a prediction bias. As elaborated in the following sections, this is enabled by achieving both data representability and peculiarity.
The remainder of the paper is organized as follows. Section II provides an overview of the key principles and computational procedures of the PSU algorithm. Section III presents the performance benchmarks for the proposed method against other resampling methods, both in terms of classification performance and processing time. Section IV provides a discussion focusing on the relation between the proposed principles and classification performance. Finally, Section V provides a summary of significant findings and future research directions.

II. PARTICLE STACKING UNDERSAMPLING

A. PRELIMINARIES
As previously noted, the undersampling method can lead to information loss owing to the artificial removal of the majority class instances from the training set. This implies that data representability can be attained when the distribution of the original data is maintained in resampled data. To realize this, we establish the first proposition as follows.
Proposition 1: Information loss can be minimized if the sum of the distance between the resampled and original data is minimized.
However, data redundancy increases the computational complexity without improving the quality of information. Therefore, securing independence among resampled data points is desirable. This leads to the formulation of the second proposition.
Proposition 2: Information redundancy can be minimized if the sum of the distance among resampled data is maximized.
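The two propositions can be written compactly as follows. This is only one possible formalization and is not taken from the source: the symbols D (original majority class data) and d (a distance function, e.g., Euclidean) are assumptions, and D_R is used for the resampled majority class data.

```latex
% Assumed notation: D = original majority class data, D_R \subseteq D =
% resampled majority class data, d(\cdot,\cdot) = a distance function.
\begin{align}
\text{Proposition 1:}\quad & \min_{D_R} \sum_{r \in D_R} \sum_{x \in D} d(r, x) \\
\text{Proposition 2:}\quad & \max_{D_R} \sum_{\{r, r'\} \subseteq D_R,\; r \neq r'} d(r, r')
\end{align}
```

Under this reading, the first objective rewards samples that stay close to the bulk of the original data, whereas the second rewards samples that are spread apart from each other.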
One may notice that the above propositions are consistent with the aim of the CC method but differ from that of borderline-oriented methods, such as Tomek Links. In fact, the latter place more emphasis on the identification of majority data relevant to a decision boundary. This may be beneficial when classes are readily separable but is otherwise susceptible to outlying data points. In highly imbalanced data, minority class data are often enclosed by the majority class data, which hinders the retention of the original distribution when relying on the borderline-oriented methods (see discussion regarding unintended outcomes from borderline-oriented methods in [23]). Furthermore, Tomek Links, by its nature, reduces the number of majority class data only to a limited extent and therefore has a limited ability to balance the class distribution. The CC method, in turn, is prone to be affected by outlying data points and can converge to locally optimal centroids. In particular, when outliers are sparsely distributed, centroids can be distorted; when they appear as a separate cluster, other clusters can be merged, both of which eventually deteriorate data representability. It is also well known that the k-means clustering method is sensitive to the choice of starting points; therefore, reproducible partitions are not always guaranteed [24]-[26]. Moreover, the CC method is severely impacted by time complexity, which renders it unsuitable for large data [27]-[29].
To address the aforementioned issues (see Fig. 2 for graphical insights), the proposed method focused on data representability and peculiarity while reducing the computational costs to ensure scalability to the mass of data.

B. ALGORITHMIC PROCEDURES
Concerning the two propositions and time complexity, the algorithmic procedures of PSU can be designed as presented in Algorithm 1, where D_R denotes the resampled majority class data. First, to attain data representability, data are split into multiple partitions based on the distance from the centroid of the majority class data. Each partition contains an equal number (m/n) of data points, from which one sample is selected to represent the partition. Notably, PSU selects existing data points as samples; hence, it can better reflect the distribution of the original data and also save computational time compared with clustering-based methods. When a sample is selected from a partition, the sample must be the farthest from the other samples that are already selected. The aim of this criterion is to secure data peculiarity to the greatest possible extent because this prevents redundant data points from being included in the final sample set. This is also intended to facilitate data representability so that the data points from different clusters (if any) can be equally represented, even when they are included in the same partition. Moreover, it is possible for sparsely distributed data points that are largely dismissed by k-NN-based methods to be represented, unless closely located samples are already selected. Finally, n majority data points are selected in such a way that the between-sample variation is maximized and the sample-to-original data variation is minimized.
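The procedure described above can be sketched in a few lines. This is a hedged illustration rather than the authors' Algorithm 1: the function name `psu_sketch`, the choice of the first sample (the point nearest the centroid), the reading of "farthest" as the largest summed Euclidean distance to the already selected samples, and the handling of partitions when m is not divisible by n are all assumptions.

```python
import math


def psu_sketch(majority, n):
    """Pick n majority points, one per distance-ordered partition.

    Hypothetical illustration of the partition-and-select idea.
    """
    m = len(majority)
    if not 0 < n <= m:
        raise ValueError("n must be between 1 and len(majority)")
    dim = len(majority[0])
    centroid = tuple(sum(p[d] for p in majority) / m for d in range(dim))
    # Order points by distance from the majority-class centroid.
    ordered = sorted(majority, key=lambda p: math.dist(p, centroid))
    selected = []
    for i in range(n):
        # Partition i holds roughly m/n consecutive points of the ordering.
        part = ordered[i * m // n:(i + 1) * m // n]
        if not selected:
            # Assumption: the first sample is the point nearest the centroid.
            pick = part[0]
        else:
            # Assumption: "farthest" means the largest summed distance to the
            # samples already selected; max() keeps the first maximizer.
            pick = max(part, key=lambda p: sum(math.dist(p, s) for s in selected))
        selected.append(pick)
    return selected


majority = [(float(x), float(y)) for x in range(5) for y in range(4)]
samples = psu_sketch(majority, n=4)
print(samples)
```

The sketch involves no random choices, so repeated calls on the same input return the same samples, and every returned sample is an existing majority data point rather than a synthesized centroid.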
PSU is intended to be a heuristic and deterministic undersampling method, i.e., the PSU method seeks a limited but representative set of majority data points with minimal operations, so that it can be applied to large data efficiently. Moreover, unlike the CC method, samples identified using the PSU method are reproducible, implying that the same unique solution can be obtained regardless of the experimental setting.

III. EXPERIMENTAL EVALUATION
In this section, seven well-known undersampling methods (RUS, NM-1, NM-2, CC, ENN, CNN, and Tomek Links) are compared with PSU by applying two popular kernels (linear and RBF) to SVM on 55 highly imbalanced datasets (imbalance ratio greater than 9) obtained from the KEEL repository [30], and the comparison results are presented. Note that 14 multi-class datasets were decomposed into 55 binary classification problems. Table 1 summarizes the description of these datasets.
The experiment was designed to repeat the test 100 times for each dataset to further reduce variations in random splits (Fig. 3). In particular, the optimal parameters of the linear and RBF kernels, namely C and γ, were determined based on a five-fold cross-validation on the train set with the respective undersampling methods. The classification performance of each fold was obtained by training SVM with the optimal parameters on the resampled train set using the same undersampling method that was used for the parameter optimization, and subsequently applying the trained model to the test set. Note that the area under the curve (AUC) and geometric mean (G-mean) were used as performance measures, as both were deemed comprehensive and balanced metrics that better reflect the classification performance on imbalanced data [19], [31]. Table 2 summarizes the results of the experiment (the results presented here can be reproduced at https://github.com/YongSeok-Jeon/IEEE.2020.PSU). On average, in terms of both AUC and G-mean, CC achieved the most accurate classifiers, followed by PSU and RUS, which significantly outperformed the other undersampling methods. However, the CC method had a significantly larger time complexity than the competing algorithms, except CNN. This was tolerable in the conducted experiment; however, when a large number of centroids need to be discovered in big data, the required computational load can become a critical drawback, as shown in Fig. 1. In contrast, PSU achieved competitive resampling performance in a relatively short processing time, approximately seventy times faster than CC.
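To make the G-mean measure mentioned above concrete, the following sketch computes it under the commonly used definition, i.e., the square root of the product of sensitivity and specificity. The source does not spell out its formula, so this definition, the helper name `g_mean`, and the label convention (1 for the minority class) are assumptions; the sketch also assumes both classes appear in `y_true`.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # recall on the minority (positive) class
    specificity = tn / (tn + fp)  # recall on the majority (negative) class
    return math.sqrt(sensitivity * specificity)


# Two minority and eight majority instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
print(accuracy, g_mean(y_true, always_majority))
```

A classifier that always predicts the majority class reaches an accuracy of 0.8 on this toy example but a G-mean of 0, which is why such a measure is preferred over accuracy on imbalanced data.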
To examine the statistical significance of the difference between the methods, the Friedman omnibus test [32] was first conducted on the rank values of classification performances for each undersampling method across the datasets. Consequently, the p-value was found to be less than the alpha risk of 0.05, indicating the existence of exceptional undersampling method(s). The Wilcoxon signed-rank test was then performed as a post-hoc analysis to facilitate the pairwise comparison of the undersampling methods with the adjusted alpha risk of 0.0017 (≈ 0.05/28) [33], [34]. Table 3 lists the p-values obtained from the post-hoc test; a value smaller than the adjusted alpha risk indicates that there existed a statistically significant difference between PSU and the corresponding benchmark method. Based on the result, it was confirmed that in the linear kernel, there was no dominance between PSU and CC, i.e., they equally yielded superior classification performance compared with the other methods in terms of both AUC and G-mean. However, in the RBF kernel, CC outperformed PSU, while they both maintained superiority to the others. One possible interpretation of this is that the RBF kernel tends to map data to a higher dimensional space; thus, unlike the linear kernel, it can better handle the case when isolated centroids represent sparsely distributed data points, considering that some of them could contain important relations between classes. Finally, the aggressive identification of CC can be supplemented using the RBF mapping, which can also provide an opportunity to discover information from data. However, this is associated with the cost of additional computing resources required for the complicated mapping and parameter optimization; in our experiment, the average execution time for the RBF kernel (40.85s) was more than three times that of the linear kernel (12.68s).
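The divisor of 28 in the adjusted alpha risk is consistent with a Bonferroni-style correction over all pairwise comparisons among the eight undersampling methods. That reading is an assumption, as the source does not state how the number was obtained; the short check below only reproduces the arithmetic.

```python
import math

# The eight methods compared in the experiment.
methods = ["PSU", "RUS", "NM-1", "NM-2", "CC", "ENN", "CNN", "Tomek Links"]

# Number of unordered pairs among the methods.
n_pairs = math.comb(len(methods), 2)

# Bonferroni-style adjustment of the 0.05 alpha risk (assumed interpretation).
adjusted_alpha = 0.05 / n_pairs
print(n_pairs, adjusted_alpha)
```

The quotient agrees with the reported 0.0017 when truncated to four decimal places.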
Below, key implications of the experiment are summarized in three points:
1) CC and PSU outperformed the other methods; however, PSU was considerably more scalable, considering that its time complexity was significantly lower than that of CC.
2) RBF-SVM in conjunction with CC may still be preferred if the processing time is tolerable, notwithstanding the high data complexity.
3) Borderline-oriented methods, such as ENN, CNN, and Tomek Links, were demonstrated to be relatively underperforming in terms of both resampling time and classification performance.

IV. DISCUSSION
In this section, a follow-up experiment was conducted to verify whether our propositions can serve as legitimate criteria in the undersampling practice. To enable comparing the extent to which the two propositions have been satisfied using different undersampling methods within a dataset, we introduce the fitness index ( ), which is defined as the sum of the distance between the resampled and original data divided by the sum of the distance among the resampled data. Note that by definition, a lower index corresponds to a greater extent of satisfying the propositions through the resampling process.
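The verbal definition of the fitness index can be turned into a small computation. The sketch below is one interpretation: it assumes Euclidean distance, sums over every (resampled, original) pair in the numerator and over every unordered pair of resampled points in the denominator, and uses the hypothetical name `fitness_index`. It also requires at least two distinct resampled points.

```python
import math
from itertools import combinations


def fitness_index(resampled, original):
    """Ratio of sample-to-original distance to between-sample distance."""
    # Numerator: summed distance between resampled and original points.
    between = sum(math.dist(r, o) for r in resampled for o in original)
    # Denominator: summed distance among the resampled points themselves.
    among = sum(math.dist(a, b) for a, b in combinations(resampled, 2))
    return between / among


original = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
spread = [(0.0,), (4.0,)]   # samples far apart from each other
clumped = [(0.0,), (1.0,)]  # samples close to each other
print(fitness_index(spread, original), fitness_index(clumped, original))
```

On this one-dimensional toy example, the spread-out pair yields an index of 5.0 and the clumped pair 17.0, so the less redundant sample set receives the lower (better) index.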
An ordinal association between the fitness index rank and classification performance rank was investigated for each dataset and then collated to present the overall pattern (see Fig. 4). Note that we focused on five undersampling methods (RUS, CC, NM-1, NM-2, and PSU) because the borderline-oriented methods (ENN, CNN, and Tomek Links) were not intended to balance the number of majority/minority class instances. The obtained results indicated a statistically significant positive rank correlation: the Spearman rank-order correlation coefficient between the performance rank and fitness index rank was found to be greater than 0.35 with the p-value of 0.
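For reference, the Spearman rank-order correlation between two rankings without ties can be computed with the classical closed form, 1 - 6Σd²/(n(n² - 1)). The sketch below uses that formula; the helper name `spearman_no_ties` and the example ranks are hypothetical and are not taken from the experiment.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of equal length."""
    n = len(rank_a)
    # Sum of squared rank differences.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical ranks of five undersampling methods on one dataset.
fitness_rank = [1, 2, 3, 4, 5]
performance_rank = [2, 1, 3, 5, 4]
print(spearman_no_ties(fitness_rank, performance_rank))
```

A value close to 1 means that methods with a better fitness index rank also tend to rank better in classification performance, which is the pattern the reported positive correlation describes.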
Notably, the identified correlation was derived based on the five undersampling methods, i.e., when a lower fitness index was achieved using any method, it was likely that the resulting classification performance surpassed those of the other methods. This implied that the two formulated propositions serve as common principles for the five undersampling methods, and accordingly, they can be further applied when there is a need to achieve methodological improvements.
Generally, the PSU method lies in the middle between the sophisticated undersampling methods that may be precise yet not practically applicable to large data and the intuitive undersampling methods that are straightforward yet rely on arbitrary procedures without strictly formulated principles. It is therefore important to examine the nature of data to achieve the maximum effect of the proposed method. For instance, when a given dataset is separable without difficulty, simpler methods can be preferred by providing high priority to the time complexity. However, when the construction of a representative set of virtual data points is desirable, more advanced methods may have to be used to handle the classification complexity.
VOLUME 8, 2020

V. CONCLUSION
In this paper, a simple and effective undersampling method, named PSU, was proposed. Compared with other competing undersampling methods, PSU can significantly reduce the computational cost, while minimizing information loss to avoid a prediction bias. This was achieved by realizing both data representability and peculiarity in the proposed algorithm. The performance benchmark indicated that the proposed method not only reached a competitive classification performance over other well-known undersampling methods but also provided a computational simplicity that can be scalable to large data. Further, we experimentally verified that two propositions that form the basis of the PSU algorithm can also be applied to other undersampling methods to achieve methodological improvements.
In practice, data are mostly imbalanced, and the computational cost required for proper resampling increases with the data size. To address this problem, we focused on a data-level undersampling solution; however, some algorithm-level solutions can achieve the same goal in other ways. In this regard, a hybrid approach can be considered to assess complementary interactions between resampling methods and characteristics of a classifier. In addition, a proper size of the partition and/or the number of samples to be drawn from each partition can be determined considering the distribution of a dataset. Lastly, a differential application of multiple approaches, in which delicate resampling is applied to the data located near the decision boundary while aggressive resampling is applied to other data, can be implemented to further improve the efficiency and effectiveness of the resampling process.
YONG-SEOK JEON received the bachelor's degree in systems management engineering from Sungkyunkwan University, South Korea. He is currently a Junior Researcher of the Industrial Engineering Department, Sungkyunkwan University. His current research interests include tree-based ensemble modeling, optimization modeling, support vector machines, heuristic modeling, and big data.
DONG-JOON LIM received the B.S. and M.S. degrees in industrial engineering from Sungkyunkwan University, South Korea, and the Ph.D. degree in engineering and technology management from Portland State University, Portland, OR, USA. He is currently an Assistant Professor with the Department of Systems Management Engineering, Sungkyunkwan University. His current research interests include technological forecasting, optimization modeling, productivity analysis, and data mining. He is also a developer of an open-source R package "DJL" which implements various decision support tools related to econometrics and technometrics. His academic honors include the Emerald Literati Network Award (outstanding author), the ENI Award (finalist for renewable and non-conventional energy), the Marie Brown Award, and various fellowships from PSU, SKKU, and A&P, among others.