SMOTified-GAN for class imbalanced pattern classification problems

Class imbalance in a dataset is a major problem for classifiers: for a training dataset whose majority class is positive, it results in poor prediction, with a high true positive rate (TPR) but a low true negative rate (TNR). Generally, the pre-processing technique of oversampling the minority class(es) is used to overcome this deficiency. Our focus is on hybridizing the Generative Adversarial Network (GAN) and the Synthetic Minority Over-Sampling Technique (SMOTE) to address class imbalanced problems. We propose a novel two-phase oversampling approach involving knowledge transfer that has the synergy of SMOTE and GAN. The unrealistic or overgeneralized samples of SMOTE are transformed into a realistic distribution of data by GAN in cases where there is not enough minority class data available for GAN to process effectively by itself. We named it SMOTified-GAN as GAN works on pre-sampled minority data produced by SMOTE rather than randomly generating the samples itself. The experimental results show that the sample quality of minority class(es) has been improved in a variety of tested benchmark datasets. Its performance is improved by up to 9\% over the next best algorithm tested on F1-score measurements. Its time complexity is also reasonable, at around $O(N^2d^2T)$ for a sequential algorithm.


I. INTRODUCTION
Class imbalance problem (CIP) refers to a type of classification problem where some classes are either heavily or moderately underrepresented in comparison to other classes [1]. The unequal distribution makes many conventional machine learning algorithms considerably less effective, especially for the prediction of minority classes [2]. A number of solutions have been proposed at the data and algorithm levels to deal with class imbalance, such as preprocessing for oversampling or undersampling, data augmentation, cost-sensitive learning/model penalization and one-class classification [1], [3]-[5].
An imbalanced dataset poses a major problem because classifiers become biased towards the majority class, which degrades the performance of the classifier model. It causes a high true positive rate (TPR) and a low true negative rate (TNR) when majority samples are positive [6]. Data imbalance is commonly seen in fraud/fault/anomaly detection [3], [7]-[11], medical diagnosis of lethal and rare diseases [5], [12], [13], software defect prediction [14], natural disasters, etc. [4].
Oversampling is the commonly used pre-processing technique, as undersampling removes important information and does not result in accurate classification [15]. Oversampling, too, suffers from the inclusion of illegitimate samples, which is still an active area of research [16], [17]. The Synthetic Minority Oversampling Technique (SMOTE) [18] is considered a "de facto" standard oversampling method. It is simple and effective; however, it may not produce diverse samples. SMOTE uses interpolation to randomly generate new samples from the nearest neighborhood of minority class data. It has been successfully used in regression [19] and classification problems [20] for a wide range of models [21]. A review of SMOTE and its applications is given in [22].
In the case of an imbalanced dataset, data samples can also be generated through models with a data augmentation approach. Generative Adversarial Network (GAN) and its variations are commonly used to generate new "fake" samples [23]-[26]. GAN was originally designed to generate realistic-looking images for large datasets; however, it can also generate minority class samples, thereby balancing the class distribution and avoiding overfitting effectively [27]. Imbalanced data classification is ubiquitous in application domains. Data augmentation techniques based on variations of GAN have been successfully applied in many applications, such as skin lesion classification [13] for better diagnosis or pipeline leakage in petrochemical systems [28].
VOLUME 4, 2016 arXiv:2108.03235v2 [cs.LG] 27 Mar 2022
Bayesian inference provides a principled framework to estimate an unknown quantity represented by the posterior distribution (of the parameters of a model), which is updated via Bayes' theorem as more information becomes available [29]-[31]. Markov Chain Monte Carlo (MCMC) sampling is typically used to implement Bayesian inference [32]. It features a likelihood function that, together with the prior distribution, is used to either accept or reject samples obtained from a proposal distribution in order to construct the posterior distribution of model parameters, such as the weights of a neural network [31]-[34]. A major limitation of the MCMC sampling technique is the high computational complexity of sampling from the posterior distribution [35], [36]. Recently, there has been much progress in MCMC sampling via the use of gradient-based proposals and parallel computing in Bayesian deep learning [37]-[39]. However, these have been mostly limited to model parameter (weight) uncertainty quantification rather than quantifying uncertainties in data or addressing class imbalanced problems. In the case of class imbalanced problems, MCMC sampling has been used for benchmark real-world imbalanced datasets [35], [40]. MCMC methods have been applied for handling imbalanced categorical data [36]. Das et al. [35] used Gibbs sampling (an MCMC method) to generate new minority class samples.
Another approach to class imbalance is a data-dependent cost matrix, where a weighted misclassification cost is assigned to the misclassified classes [4]. It is not easy to determine this cost [35]. The cost-sensitive loss function has penalty-based weights for misclassification errors from both majority and minority classes. The hybrid neural network with a cost-sensitive support vector machine (hybrid NN-CSSVM) in [41] considers a different cost for each misclassification. Castro et al. [2] have improved the misclassification error for imbalanced data by setting the cost parameter according to the ratio of majority samples in the training set. The one-class problem [8], [42], [43] also has a "minority" class, but it is generally considered an outlier that is removed from the training data. One-class modeling usually uses feature mapping or feature fitting to enforce the feature learning process [43].
In this paper, we propose a novel hybrid approach that combines the strengths and overcomes the deficiencies of two independent models, SMOTE and GAN. SMOTE is known to produce some irregular or "out of distribution" samples. Additionally, SMOTE has not generally been used with deep learning, and GAN has not generally been used for small datasets (minority classes) [11]. Hence, we refer to it as SMOTified-GAN, which is a two-phase process based on knowledge transfer or transfer learning [44]. First, SMOTE generates promising samples, which are then "transferred" to GAN, which no longer uses random samples. Our approach may work well for both small and large datasets. This could lead to more feasible and diverse data, which are further enhanced through GAN to prepare better-quality samples. We have obtained impressive results for our proposed method on numerical benchmark CIP datasets, mainly from the UCI library [45]. Its efficiency, which combines that of SMOTE and GAN as discussed in Section 2 and Section 3, is also reasonable. SMOTified-GAN, however, works on non-image data only in its current form.
The rest of the paper is organised as follows. Section 2 presents the state-of-the-art techniques to solve CIPs. Section 3 discusses the proposed method, SMOTified-GAN. Section 4 shows the experimental results and Section 5 discusses the outcome of the experiments. Lastly, Section 6 concludes the paper by summarizing the results and proposing some further extensions to the research.

A. SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE (SMOTE)
SMOTE is a "de facto" standard for pre-processing imbalanced data. It is not purely random sampling; instead, it uses interpolation among neighboring minority class examples. It is efficient and easy to implement. For each minority example, the k-nearest neighbors (KNN) are found, and neighbors are randomly selected from them for interpolation to create new samples. The pseudocode is given in Algorithm 1. The parameters $n$ and $d$ are the size and dimension of the minority class, respectively; $N$ is the size of the majority class and $k$ is the number of nearest neighbors. Lines 1-5 find the KNN for each minority sample, which are then used for interpolation to create new samples. Lines 6-12 describe the interpolation step, where $N - n$ samples are created and added to the minority class. Its time complexity on a single machine is of the order $O((N - n)\,d\,n \log k) \approx O(N^2 d \log k)$ [46]-[48].
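The interpolation idea described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' Algorithm 1: the function name `smote`, its parameters and the brute-force neighbor search are assumptions made for the example, and base samples are drawn at random rather than visited in a fixed order.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours (Euclidean)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    n = len(minority)
    k = min(k, n - 1)  # a sample cannot have more neighbours than other samples
    # Brute-force k-nearest-neighbour search inside the minority class.
    neighbours = []
    for i, x in enumerate(minority):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(x, minority[j]))
        neighbours.append(others[:k])
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)           # pick a minority sample
        j = rng.choice(neighbours[i])  # pick one of its nearest neighbours
        gap = rng.random()             # interpolation factor in [0, 1)
        x, nb = minority[i], minority[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical usage: create 6 new points from 4 minority samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why SMOTE output stays local and can lack diversity.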
Variants of SMOTE have been successfully applied to various application domains such as bioinformatics, video surveillance, fault detection and high-dimensional gene expression data sets [47], [49], [50]. These variants include regular SMOTE, Borderline-SMOTE, SVM-SMOTE and KMeans-SMOTE [51]. Kovacs [52] has implemented 85 variants of SMOTE in a Python library.

B. GENERATIVE ADVERSARIAL NETWORK (GAN)
GAN is a class of machine learning frameworks in which there is a contest between two neural networks with a continuous and simultaneous improvement of both neural networks. This technique learns to generate new data with the same statistics as the training set by capturing the true data distribution [53].
GAN has been successfully used for data augmentation. The two neural networks of GAN learn the target distribution and generate new samples to achieve a similar distributive structure in the generated over-sampled data. A GAN is simply the synergy of two deep learning networks that produce "fake" data examples emulating the properties of the real data [27], [54], [55].
Algorithm 1: Pseudocode for SMOTE. Input: d-dimensional minority samples X of size n from a training data set of size N that requires N − n over-samples; k defines the k-nearest neighbors. Line 12: return X.
GAN was not designed for oversampling imbalanced classes but for creating "fake" images that are hard to distinguish from real images. However, its success in data augmentation for oversampling has led to the introduction of many variations of GAN to solve CIP [24]-[26], [53], [56]-[58].
The first network is called the Generator, whose responsibility is to take a vector of random values and generate data similar to the real data used in training. The second network is called the Discriminator, which takes input data from both the real training data and the "fake" data from the generator and tries to classify them correctly. This process is shown in FIGURE 1.
The time complexity of GAN can be roughly given as $O(nTLd^2)$, where the new parameters $L$ and $T$ are the layer size and total iterations for a GAN. Its convergence rate with Stochastic Gradient Descent would be $O(\frac{1}{T} + \sigma^2)$, where $\sigma^2$ is the variance of the dataset [59]. See Section III for further details.

III. SMOTIFIED-GAN FOR CLASS IMBALANCE PROBLEM
Our proposed method tries to overcome the deficiencies of both SMOTE and GAN by using a transfer learning concept, where it first extracts knowledge about the minority class from SMOTE and then applies it to GAN. We have named it SMOTified-GAN as it tries to diversify the original samples produced by SMOTE through GAN. Additionally, the quality of the samples is further enhanced by making them emulate realistic samples. The process of SMOTified-GAN is shown in FIGURE 3.
Even though SMOTE is widely used as an oversampling technique, it suffers from some deficiencies. The major drawback of SMOTE is that it focuses on local information and therefore does not generate a diverse set of data, as shown in FIGURE 2(a). Additionally, FIGURE 2(b) shows that the 5 nearest neighbors of $x_1$, $\{x_2, \ldots, x_6\}$, are first blindly chosen and then interpolated (using Euclidean distance) to get the corresponding synthetic samples $\{a, \ldots, e\}$. Even then, there is a high chance of misclassification for sample $e$ with a majority sample $y_1$ [60]. The generated data are generally insufficiently realistic compared to those of GAN, which captures the true data distribution in order to generate data for the minority class [61].
GAN is not ideally suited to oversampling, as it was originally designed to produce realistic-looking images with convolutional neural networks (CNN) rather than oversamples for the minority class. Additionally, GAN may face a data scarcity problem, as the minority class is already small and model training requires even more of its data to be set aside for validation and testing purposes. Cross-validation techniques may, however, solve this problem to some extent.
The architecture of GAN consists of two networks, as mentioned in the previous section. The objective of the generator network is to generate data that fools the discriminator network into classifying it as "real". To optimize the generator's performance, the loss of the discriminator is maximized when data comes from the generator. To optimize the performance of the discriminator, the loss of the discriminator is minimized when given batches of both real and generated data. The objective of the discriminator is to not be "fooled" by the generator [53], [63].
FIGURE 2: SMOTE processing for oversampling. (a) Low diversity with SMOTE, taken from [62]. (b) Interpolation with SMOTE, taken from [60].
The discriminator score is computed from $D(x)$, the discriminator output probabilities for the real data $x$, and $D(G(z))$, the discriminator output probabilities for the data generated from noise $z$.
The generator score is computed from $D(G(z))$ alone. The pseudocode for the GAN algorithm is given in Algorithm 2, where SGD and weights are functions that, respectively, determine the gradient for a mini-batch using the Stochastic Gradient Descent (SGD) optimizer [64] or a variation of it such as ADAM [65] or RMSprop [66], and update the weights. Once the algorithm terminates, 'good' fake samples are collected with accumulateFakeEx based on classification accuracy.
Goodfellow [67] used sigmoid as the activation function, which results in discriminator and generator scores to minimize that are expressed in terms of $y$ and $\hat{y}$, the outputs of the Discriminator $D$ and Generator model $G$, respectively, before the activation function is applied.
Algorithm 2 (fragments). Input: ... x and noise samples z from an appropriate random number generator; an optional parameter is the size $n_{fake}$ of fake samples needed. $m_i$ denotes the minibatch indices for the $i$-th index and $T$ is the total number of iterations. Line 1: GAN(x, z, $n_{fake}$). Line 2: for t = 1 : T do (generally the step size S is 1). Subscripts d and g refer to the discriminator and generator entities, respectively.
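For reference, the scores described in this passage can be written in the usual binary cross-entropy form. This is a sketch of the standard GAN formulation, not necessarily the exact expressions of the original paper: it assumes that expectations are taken over mini-batches, that the generator uses the common non-saturating loss, and that $y$ and $\hat{y}$ denote the discriminator's pre-activation outputs on real and generated data respectively.

```latex
% Standard GAN scores, both to be minimized (assumed binary cross-entropy form).
\begin{align}
  \text{Discriminator:}\quad & -\mathbb{E}_{x}\big[\log D(x)\big]
                               - \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big] \\
  \text{Generator:}\quad     & -\mathbb{E}_{z}\big[\log D(G(z))\big]
\end{align}
% The same scores with a sigmoid activation \sigma applied to the
% pre-activation outputs y (real data) and \hat{y} (generated data).
\begin{align}
  \text{Discriminator:}\quad & -\operatorname{mean}\big(\log \sigma(y)\big)
                               - \operatorname{mean}\big(\log(1 - \sigma(\hat{y}))\big) \\
  \text{Generator:}\quad     & -\operatorname{mean}\big(\log \sigma(\hat{y})\big)
\end{align}
```

Minimizing the discriminator score pushes $D(x)$ towards 1 and $D(G(z))$ towards 0, while minimizing the generator score pushes $D(G(z))$ towards 1, which matches the "fooling" objective described above.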
The formalization of SMOTified-GAN is not very different from the original GAN. Only the random generator function of GAN is replaced with the repertoire of oversampled minority examples from SMOTE. The discriminator and generator scores are modified accordingly, with $x^*$ denoting the training samples of the minority class(es) and $u$ the over-sampled data of the same class(es) generated by another algorithm, SMOTE in this case. The pseudocode for SMOTified-GAN is given in Algorithm 3. Its implementation is straightforward. The Python code is available at https://github.com/anuraganands.
As illustrated in FIGURE 3, there are two sections of SMOTified-GAN. The first replaces the random number generator (refer to FIGURE 1) with the repertoire of oversamples from SMOTE. The second continues with the process of GAN using the new samples from SMOTE. Algorithm 3 also shows this process in two steps. Line (1) calls the SMOTE function given in Algorithm 1 and then Line (2) calls the GAN function given in Algorithm 2. However, this time the generated samples $u$ are used instead of random noise $z$.
Its time complexity for a sequential algorithm is the combination of SMOTE's and GAN's time complexities, i.e., $O(N^2 d \log k) + O(nTLd^2)$. Since $n$ is a small part of $N$, it can be assumed that $nL$ is comparable to $N$. This further simplifies the complexity to $O(N^2 d^2 T)$.
The major difference between our proposed method SMOTified-GAN and GAN is the use of a ready-made repertoire of samples generated by SMOTE instead of a set of random noise to begin with. Intuitively, this improves the input samples, which in turn produces better oversamples. This natural synergy of SMOTE and GAN gives the naïve GAN a jump-start with promising data before it further refines the unrealistic data from SMOTE.
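The two-phase idea can be illustrated with a deliberately tiny, dependency-free sketch. It is not the authors' implementation: the neural networks are replaced by a per-feature affine generator and a logistic-regression discriminator so that the gradients can be written by hand, the losses are assumed to be the standard binary cross-entropy GAN losses with $x$ replaced by $x^*$ and $z$ by $u$, and all names (`smote`, `smotified_gan`, `epochs`, `lr`) are hypothetical.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def smote(minority, n_new, k, rng):
    """Phase 1: interpolate between minority points and their k nearest neighbours."""
    samples = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nearest = sorted((p for p in minority if p is not x),
                         key=lambda p: math.dist(x, p))[:k]
        nb, gap = rng.choice(nearest), rng.random()
        samples.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return samples


def smotified_gan(x_star, n_new, k=5, epochs=200, lr=0.01, seed=0):
    """Phase 2: refine SMOTE samples u adversarially against real minority data x*."""
    rng = random.Random(seed)
    d = len(x_star[0])
    u = smote(x_star, n_new, k, rng)  # SMOTE output replaces the random noise z
    a, b = [1.0] * d, [0.0] * d       # toy generator G(u) = a * u + b, starts as identity
    w, c = [0.0] * d, [0.0]           # toy discriminator D(x) = sigmoid(w . x + c)

    def G(v):
        return [ai * vi + bi for ai, vi, bi in zip(a, v, b)]

    def D(v):
        return sigmoid(sum(wi * vi for wi, vi in zip(w, v)) + c[0])

    for _ in range(epochs):
        real, src = rng.choice(x_star), rng.choice(u)
        fake = G(src)
        # Discriminator step: minimise -log D(x*) - log(1 - D(G(u))).
        g_real, g_fake = D(real) - 1.0, D(fake)  # loss gradients w.r.t. the two logits
        for j in range(d):
            w[j] -= lr * (g_real * real[j] + g_fake * fake[j])
        c[0] -= lr * (g_real + g_fake)
        # Generator step: minimise -log D(G(u)).
        g_logit = D(fake) - 1.0
        for j in range(d):
            g_out = g_logit * w[j]               # gradient w.r.t. the j-th output of G
            a[j] -= lr * g_out * src[j]
            b[j] -= lr * g_out
    return [G(v) for v in u]                     # refined oversamples


# Hypothetical usage: oversample 5 minority points with 8 refined samples.
x_star = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
oversamples = smotified_gan(x_star, n_new=8, k=3)
```

Because the generator starts as the identity map, the first outputs are the SMOTE samples themselves; training then only has to adjust them, which mirrors the "jump-start" intuition. A practical version would swap the toy models for the neural networks and optimizer described in the experimental setup.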

IV. EXPERIMENTS AND RESULTS
In this section, we provide experimental results of the oversampling methods SMOTE, GAN and SMOTified-GAN on different datasets that have been taken from the CIP literature [68]-[70]. The minority class is over-sampled until it equals the majority class in size; the over-sampled data are then added to the training data, which are fed into the neural networks (NN) for classification. We have also tested on the original datasets without using any data augmentation technique.

A. DATASETS
We evaluate and compare our model on small to large datasets that feature class imbalance as shown in TABLE 1. The datasets were mainly obtained from the UCI machine learning repository [45] that have been used in a number of methods for CIPs [68]- [70]. Some datasets such as Credit-Card Fraud and Shuttle are highly imbalanced with minority class contribution as 0.17% and 0.29% respectively.

B. EXPERIMENTAL SETUP
We used the naïve GAN model [53] and naïve SMOTE [48] in this paper. SMOTified-GAN uses these two models; however, it is flexible enough to work with other combinations of their variations as well. The parameter settings such as learning rate, total epochs and loss functions are shown in TABLE 2. The GAN generator neural network features 3 hidden layers with 128 neurons in each layer. The GAN discriminator network is similar to the generator network, with the major difference of having only two layers: first a linear layer, followed by a leaky ReLU layer with alpha = 0.2. The classifier architecture is given in FIGURE 5. In GAN training, we use the binary cross-entropy loss function with a training data batch size of 128 and an initial learning rate of 0.00001 with the Adam optimizer.
After basic pre-processing steps, SMOTE oversampling is done with k = 5 neighbors. The stopping criteria for the training of SMOTified-GAN and naïve GAN are based on validation error to avoid overfitting. Additionally, it is ensured that the discriminator and generator losses remain significant and do not approach zero.

C. PRELIMINARY INVESTIGATION
The experiment has been conducted on 11 benchmark imbalanced datasets, which are used to train an NN to test the efficacy of various oversampling techniques. We used SMOTE, GAN and our proposed method SMOTified-GAN for oversampling. We have also tested with the original data without any data augmentation. The quality of classification and comparative results are shown in TABLE 3. As expected, all datasets show high train and test accuracy due to the high imbalance in the datasets. It is therefore important to look at the F1 scores, which reflect both precision and recall. The best F1 scores are shown in bold font.
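A small worked example shows why accuracy alone is misleading on imbalanced data. The class counts below are hypothetical and are not taken from TABLE 3; the helper names are chosen for illustration only.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive` (the minority class)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: 990 majority (0) and 10 minority (1) samples.
y_true = [0] * 990 + [1] * 10
always_majority = [0] * 1000  # a classifier that never predicts the minority class

print(accuracy(y_true, always_majority))             # 0.99
print(precision_recall_f1(y_true, always_majority))  # (0.0, 0.0, 0.0)
```

The degenerate classifier reaches 99% accuracy while its minority-class F1 score is zero, which is the situation the F1 comparison is meant to expose.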
It is clear from the experimental results that SMOTified-GAN has outperformed the other oversampling techniques. Only Connect4 is an outlier, where all oversampling techniques show poor results compared to training without oversampling. Surprisingly, SMOTE also performed worse by 3.6% compared to the original training dataset without any data augmentation. This result can be attributed to the fact that the dataset is highly imbalanced, with the minority class constituting only 3.84% of the training dataset. This does not provide enough data for generalization. So the minority class should not be over-sampled blindly for a given dataset. Conversely, no data augmentation with datasets such as Ecoli (6.0% minority class) and Wine (2.7% minority class) shows very poor and unacceptable results. Here, data augmentation techniques, especially SMOTified-GAN, show much better results, with F1-scores of 92.2% and 52.7% respectively.
SMOTified-GAN gives the best results, considering F1 score, for all other datasets with diverse proportions of the minority class, such as Creditcard Fraud (0.2% minority class), Spambase (39.4% minority class), Yeast (9.9% minority class) and Wine (2.7% minority class). The rest of the datasets, Ionosphere, Shuttle, Ecoli, Pageblocks and Poker, also favor SMOTified-GAN. FIGURE 4 on the comparative F1 score shows that SMOTified-GAN outperforms the other algorithms on 10/11 datasets. Its performance is significantly improved for Pageblocks, by 9%, and for Ecoli, by 10%. SMOTified-GAN has also produced better precision and recall for most of the datasets. It has the best precision for all 11 datasets and the best recall for 9/11 datasets. GAN and SMOTE give mixed results on different datasets. GAN has produced better results than SMOTE on 10/11 datasets. Notably, training without data augmentation is also better than GAN and SMOTE on 2/11 and 4/11 datasets respectively.
Datasets like Yeast, Ecoli, Wine, Poker and Pageblocks have a small number of minority class instances relative to the majority class, which allows SMOTified-GAN to show its potential over the other algorithms, as seen in the respective results. In datasets like Ecoli and Wine, the minority instances are so few that the non-oversampling method completely fails to predict the minority class. All models give high train and test accuracy on all datasets, which is attributed to the dominance of the majority class in these datasets; hence the true performance measure is the minority F1-score, which depends on both precision and recall. Overall, the proposed model of SMOTified-GAN outperforms the others in terms of F1-score and a comparatively low standard
The training loss curves of SMOTified-GAN's generator and discriminator models with respect to the number of epochs during training on selected datasets are shown in FIGURE 7. In general, the discriminator's loss curve converges fairly quickly, whereas the generator's loss curve shows high fluctuations; however, these fluctuations generally become steady at around 2000 epochs. We have also drawn the validation F1-score in the same graph to determine the termination criterion. The training stops once the validation F1-score reaches its highest value to avoid any over-fitting.
TABLE 3 presents a summary of the experimental results with NN using the respective oversampling methods (SMOTE, GAN and SMOTified-GAN) and also without any augmentation. It shows training and test accuracy and measurements of F1-score, precision and recall. Its purpose is to demonstrate the effectiveness of the methods for class imbalanced datasets. We report the mean, standard deviation and best performance for the respective evaluation metrics over 30 experimental runs, where each run has a different randomised initial position in weight space. This is done to incorporate model uncertainty in our results.
FIGURE 6 presents the receiver operating characteristic (ROC) curves on precision and recall for the tested datasets. The results for nine datasets are illustrated with all the tested algorithms. It also shows the area under the curve (AUC), which is a standard performance measure for imbalanced data. It is clear from the graph that our proposed SMOTified-GAN has the highest AUC for all the datasets except Abalone. For example, FIGURE 6(h) shows that the AUC for the Shuttle dataset with SMOTified-GAN, GAN, no data augmentation and SMOTE is 0.949, 0.911, 0.891 and 0.712 respectively, in descending order. SMOTified-GAN is better than the next best algorithm by up to 0.038. GAN and SMOTE show mixed results, as discussed earlier with F1-scores.
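As a reminder of what the AUC values summarize, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen minority (positive) sample receives a higher score than a randomly chosen majority (negative) sample. The function name and the toy scores are assumptions for illustration; they are unrelated to the reported results.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This pairwise form is quadratic in the number of samples, so it is suited to small illustrations rather than large datasets.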

V. DISCUSSION
A significant improvement in the quality of classification has been observed with the introduction of SMOTified-GAN as an oversampling technique. It has clearly outperformed naïve GAN and SMOTE on most of the datasets. The F1-score has been improved by up to 9% for the Ecoli dataset over the next best oversampling technique, and the precision has also shown significant growth of around 7% to 8% for the Wine and Yeast datasets. The recall has not improved much. SMOTE generally has low precision and SMOTified-GAN has relatively better precision than the other models. The most notable improvement over GAN can be seen with the Abalone, Pageblocks, Wine and Shuttle datasets, and over SMOTE with the Credit-card Fraud and Wine datasets.
The impact of the number of features, the percentage of the minority class and the size of the data on the quality of oversampling is mixed. For example, datasets with a higher number of features such as Credit-card Fraud (30 features, 0.172% of minority class) have shown good results, but Connect4 (42 features, 3.84% of minority class) has shown poorer results. However, SMOTE generally performs poorly on F1-score, even compared with raw data, in 3 out of 4 cases when the number of features is high, as in Connect4 (42 features, ...).
Furthermore, the best algorithm may not be clearly visible from the ROC curves in FIGURE 6; however, the AUC-ROC measures for each graph show that SMOTified-GAN outperforms the other algorithms. The larger the area the ROC curve occupies, the better the algorithm, which is shown by the AUC measures. For example, it is somewhat clear for Shuttle, which shows the best to worst in the order of SMOTified-GAN (0.949), GAN (0.911), no oversampling technique (0.891) and then SMOTE (0.712). So SMOTified-GAN is 3.5% better than the next best algorithm. Similarly, it is 2.1% better than the second best algorithm for Spambase.

VI. CONCLUSION
We presented a framework that addresses class imbalanced pattern classification problems by combining features from GAN and SMOTE. Our results show that the proposed framework significantly improves results on the majority of the class imbalanced problems. There were improvements of up to 9% on the F1 score for the benchmark datasets. Since it is an offline pre-processing technique with a reasonable time complexity of order $O(N^2 d^2 T)$, it does not affect the efficiency of the training process. We also visualized the learning process and found that the AUC of SMOTified-GAN is better than that of the second best algorithm by up to 2.1% (for Spambase) and 3.5% (for Shuttle).
There are several possible future directions from this work such as applying SMOTified-GAN to other neural networks such as CNNs and recurrent neural networks (RNNs) to oversample imbalanced image datasets and time-series data, respectively. Furthermore, different variations and combinations of SMOTE and GAN for the new model of SMOTified-GAN can improve it even further.
It will be interesting to investigate the conjoining of GAN with other over-sampling techniques such as MCMC. Its sampling method on a Bayesian framework can be used to incorporate uncertainty in the predictions and develop a probabilistic data generation process via GANs. The proposed framework can be used in a wide range of problems that face challenges when it comes to class imbalance issues. This framework can also be used to improve few-shot learning [72] to address problems where the model finds it difficult to draw decision boundaries due to a lack of data. Moreover, we can also investigate if the method can be used to address the bias-variance problems in order to improve the generalization ability of the model given that the training data differs significantly from the test dataset.

CODE AND DATA
We provide Python code and data for extending this work further 1 .