Deep Generative Models to Counter Class Imbalance: A Model-Metric Mapping With Proportion Calibration Methodology

Re-sampling methods are the most pervasive family of techniques for managing class imbalance in machine learning. The emergence of deep generative models for augmenting the under-represented class raises the question of how well the model chosen for data augmentation suits the metric selected to judge the goodness of classification. This work assesses that suitability by first augmenting the minority class with newly sampled data points from each generative model up to parity, and then studying classification performance on a large set of metrics. We extend the investigation to different proportions of augmented data points to identify the sensitivity of each metric to the degree of imbalance, which leads to the discovery of an optimum proportion for each metric. The models used are GAN, VAE and RBM, and the metrics include Precision, Recall, F1-Score, AUC, G-Mean and Balanced Accuracy. We compare these models with the established class of data-synthesizing counterparts on the aforementioned metrics. Deep generative models outperform the state-of-the-art on 5 metrics on multiple datasets and also comprehensively surpass the baselines. This work thereby recommends the following model-metric mappings: VAE for high Precision and F1-Score, RBM for high Recall, and GAN for high AUC, G-Mean and Balanced Accuracy, under various recommended proportions of the minority class.


I. INTRODUCTION
Class imbalance is a ubiquitous problem in machine learning tasks, where the class significant to a business or scientific need contributes a smaller proportion of the total data instances. Anomaly detection and its derivatives, namely fraud detection and money laundering, together with medical diagnosis, fault diagnosis and spam detection, are major examples [20]. Extreme imbalance is common in financial fraud datasets, where the minority class may represent fewer than 0.5% of the total instances [8]. To counter this, the most popular methods fall into the categories of over- and under-sampling, which manage data volume so that classes are more equally represented [29], increasing classifier performance [23], [24], [25]. For the purpose of this document, the former are referred to as data-augmentative methods, which either resample or generate data points, and the latter as data-reductive methods, which prune them away. Of the two, data-augmentative methods have gained widespread adoption among industry practitioners, as they exploit the entirety of the information in the data. They can be broadly classified into 3 types: replicating (classic), synthetic (prevalent) and generative (novel). The first augments via replication, and the second produces instances using locally linear interpolation. The recently introduced third category generates new instances by learning the data distribution. Instead of an interpolation process, these methods employ a wide range of algorithms, from Gibbs sampling and variational inference to game theory [22]. This difference raises the question of how best to study the contribution, or efficiency, of the newly generated instances in terms of performance metrics.
The associate editor coordinating the review of this manuscript and approving it for publication was Juan Liu.
It is well known that each metric gauges performance from a different perspective, which increases the significance of contextual metric selection. For instance, it is inappropriate to select precision, which measures model exactness, in lieu of recall, which measures model completeness, where a strong restriction on false negatives is required. Metric selection is to be followed by optimal model selection and its mapping onto the chosen metric, but a general absence of guidance in the open literature (an exception may be the overriding metrics definition for GAN performance assessment on high-dimensional data, Section III-A) compels this decision to be made on an intuitive or preferential basis. For the purpose of this document, model selection refers to the type of generative model (Section III), and not to the classifier, degree of complexity or hyperparameters. The high algorithmic variance among generative models motivates the search for a behavioural alignment of these models with specific metrics, while the following drawbacks of synthetic methods impede further exploration using the latter as a data-production base. 1) Synthetic instances may not be true representatives of the data, as they lack a guarantee of instance novelty and of randomization that mimics the level of noise in the data [28]. 2) These methods do not address the problem from a probabilistic perspective. Because they are obtained without a learned distribution, the synthesized examples lack the interpretation and information required by a classification model [4]. 3) Any new point obtained by a linear synthesizer is created by local interpolation within a neighbourhood. Data are linear only within local neighbourhoods [72], so these points cannot follow the curve of the data manifold.
VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The implications of imbalance for classification models have been well studied, but those for performance metrics are relatively under-explored. Among the latter, the work in [31] observes associations between metrics, with an emphasis on the coherence of the AUC metric, and the work in [30] explores the effect on metric performance as imbalance fluctuates via under-sampling on image datasets. However, the sensitivity of metrics to varying degrees of sample proportionality under over-sampling (synthetic or generative) is yet to be explored. Further, the literature is silent on unifying model-metric mapping with metric-sample proportionality.
This work therefore fills the research gap by proposing a quantifiable methodology, the Model-Metric Mapper (MMM), which presents a coherent and comprehensive prescription to the modeler for combating class imbalance. It comprises: 1) model-metric mapping and 2) calibration of metric-sample proportionality, both using 3) contextual metric selection.

A. CONTRIBUTIONS
The major contributions of MMM are:
1) It establishes an effective model-metric mapping, selecting the optimally performing models against the contextually relevant metrics.
2) It calibrates the metric-wise optimum data proportionality, exploring the degree of sensitivity of a metric to imbalance.
3) It advocates the use of deep generative models for data augmentation in the context of structured data.
4) It proposes a quantified methodology, guiding modelers in their choice of data augmentation models.

B. PAPER ORGANIZATION
The paper is organized as follows: Section II reviews prevalent approaches, Section III elaborates on deep, non-linear generative models, Section IV discusses performance evaluation metrics, Section V proposes the MMM methodology, Section VI describes the experimental setup, Section VII tabulates and discusses the results, Section VIII sets MMM in motion and Section IX concludes the paper.

II. PREVALENT APPROACHES
The problem of improving accuracy on skewed datasets was formulated in 2000 at the first workshop on the topic, held at the American Association for AI conference, Japkowicz and Holte [34]. This section mainly discusses the prevalent augmentation methods and a significant subset of reduction methods. The next section is devoted to a comprehensive discussion of deep non-linear models (generative methods), which are relatively new competitors to these methods. The methods are discussed below.
The Method: SMOTE performs nearest-neighbour identification followed by synthetic instance generation, using the Euclidean distance between feature vectors. A minority class instance (base) is randomly selected, and its k nearest neighbours (support) are identified. Each new instance is created by adding to the feature vector of the base instance the product of the feature-wise difference between the base-support pair and a random value between 0 and 1. This interpolation creates synthetic instances along the line segment between the base and support instances.
The Strengths and Weaknesses: In contrast with indiscriminate oversampling, which works in the data space, SMOTE works in the feature space. This makes decision boundaries flexible. However, its linearity restricts it to low-dimensional datasets. The technique fails to counter class overlap in data with multiple disjoint clusters, and when coupled with undersampling it leads to significant information loss. The synthetic instances lack variance and are not equipped to cover curved manifolds.
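The interpolation step described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation used in the experiments: the function name `smote_sample`, its parameters and the fixed seed are hypothetical, and only the standard library is used.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority instances by Euclidean distance to the base.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        others.sort(key=lambda p: math.dist(base, p))
        support = rng.choice(others[:k])
        # Interpolate: base + gap * (support - base), with gap drawn from [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (s - b) for b, s in zip(base, support)])
    return synthetic
```

Because every new point lies on a segment between two existing minority instances, the generated set can never leave the convex hull of the minority class, which is the linearity limitation noted above.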

B. INSTANCE HARDNESS THRESHOLD - METHODICAL, LINEAR UNDERSAMPLING
This undersampling technique by Smith et al. [49] removes frequently misclassified instances by computing their degree of hardness. The two-pronged approach first identifies recurrently misclassified instances and then explores the reasons at the level of the individual instance. This instance-level dual analysis contrasts with prevalent undersampling approaches (discussed later), which operate at an aggregate level.
The Method: Equations 1 and 2 formulate the technique. Equation 1 calculates the hardness of the instance $\langle x_i, y_i \rangle$. Because multiple learning algorithms are used, the loss is a sum over these, weighted by the term $p(d \mid z)$. The term $p(y_i \mid x_i, d)$ is the probability that $d$ assigns the label $y_i$ to $x_i$; a higher value means a higher-probability prediction. Here $d$ is obtained as $d = g(z, \beta)$ when the learning algorithm $g$ is executed with parameters $\beta$ on $z$. Substituting this into Equation 2 yields $g_k(z, \beta)$. The probability $p(d \mid z)$ is estimated as $1/|C|$, while the probability $p(y_i \mid x_i, g_k(z, \beta))$ is summed over the learning algorithms in the set $C$.
IHT uses the classifier output difference technique of Peterson and Martinez [67] to measure the degree of predictive variance between classifiers, followed by clustering. In this way 20 algorithms are summarized into 9; for example, BayesNet, DecTable, Ripper and Simple Cart are summarized as Ripper. These form the set C in Equation 2 for computing instance-wise hardness. Further, two quantities are computed per instance for feedback: the classifier score and the classification frequency.
The Strengths and Weaknesses: IHT's distinction is that it identifies the reasons for instance misclassification. The major reasons are class intersect, class tilt and borderline complexity, further subdivided into conflicting neighbours, disjunct size, class equilibrium, class probability and tree depths. The method also identifies a positive/negative instance-reason relation. Other undersampling approaches (see Related Works) rely on nearest-neighbour variants, which are vulnerable to class overlap. The major weaknesses of IHT are possible information loss due to undersampling and heavy dependence on the classifier for computing the hardness score.
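The hardness computation can be illustrated with a short sketch. It assumes the form given by Smith et al. [49], in which hardness is one minus the average probability that the classifiers in C assign to the true label; the helper names and the 0.5 threshold are hypothetical and are not taken from the experiments.

```python
def instance_hardness(true_label_probs):
    """Hardness = 1 - average probability assigned to the true label.

    true_label_probs[k][i] is p(y_i | x_i, g_k(z, beta)) for classifier k
    and instance i, so p(d | z) is implicitly estimated as 1 / |C|.
    """
    n_classifiers = len(true_label_probs)
    n_instances = len(true_label_probs[0])
    hardness = []
    for i in range(n_instances):
        mean_prob = sum(probs[i] for probs in true_label_probs) / n_classifiers
        hardness.append(1.0 - mean_prob)
    return hardness


def keep_mask(hardness, is_majority, threshold=0.5):
    """Keep every minority instance; drop majority instances harder than threshold."""
    return [(not maj) or (h <= threshold) for h, maj in zip(hardness, is_majority)]
```

In practice the probabilities would come from cross-validated predictions of each learning algorithm, so the score depends directly on which classifiers are placed in C.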

C. INDISCRIMINATE REPLICATION/REMOVAL - RANDOM SAMPLING
Apart from the methodical counterparts discussed above, the traditional approach to balancing skewed datasets is indiscriminate re-sampling. The dataset is balanced either via random replication of the minority instances or via indiscriminate removal of the majority instances, Chawla [50]. Removal increases variance and replication leads to overfitting, Pozzolo et al. [27]. Further demerits arise in class overlap settings, where removal discards meaningful information and replication leads to high misclassification, Garcia et al. [32], Cieslak and Chawla [33]. Estabrooks et al. [51] have shown that several base classifiers report improved accuracy on balanced datasets, with linear separability as a prerequisite.
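Both forms of indiscriminate re-sampling are simple to state in code. The following is a minimal sketch with hypothetical function names, using only the standard library.

```python
import random


def random_oversample(minority, target_size, seed=0):
    """Replicate minority instances at random (with replacement) up to target_size."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(minority) + extra


def random_undersample(majority, target_size, seed=0):
    """Keep a random subset of the majority class (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(list(majority), min(target_size, len(majority)))
```

Oversampling adds only exact copies of existing instances, and undersampling discards instances without regard to their position relative to the class boundary, which is the source of the demerits listed above.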

D. RELATED WORKS
Douzas and Bacao [7] propose Geometric-SMOTE (GMT), which produces instances in an ellipsoid surrounding the chosen minority instance. Douzas et al. [6] propose K-means-SMOTE (KMT), which combines clustering with SMOTE: the former identifies clusters and the latter generates samples within them, leading to noise reduction. Reference [2] proposes Adaptive-SMOTE, which uses instance complexity to partition the minority data before oversampling. The method gives better results than the borderline extensions by Han et al. [42], which generate instances near positive and negative neighbours. Janbandhu et al. [3] use Adasyn by He et al. [45] for oversampling, which creates different decision boundaries than SMOTE by producing samples near misclassified instances. Douzas and Bacao [39] use the Self-organizing-map-oversampling (SOMO) technique, which preserves the underlying manifold structure by creating a two-dimensional representation of the input space prior to applying SMOTE. Koziarski et al. [5] propose Radial-specific-over-sampling, which identifies potential regions for creating minority instances. Nekooeimehr and Lai-Yuen [38] use Adaptive semi-supervised-weighted-oversampling (A-SUWO), which uses cross validation to identify minority class cluster sizes and generates synthetic instances based on a weighting mechanism. Bunkhumpornpat et al. [40] use Density-based SMOTE, which uses the DBSCAN algorithm to discover clusters and generates instances along the shortest path from each minority class instance to a cluster's pseudo-centroid.
Undersampling approaches are discussed next. Addabbo and Maglietta [26] propose parallel-selective-sampling, which gives importance to majority instances near the class boundary and eliminates those further away. Lin et al. [21] use Clustering-balance to create majority class clusters equal in number to the minority instances, then reduce the majority until it equals the minority. Mani and Zhang [46] use NearMiss and its variants, which use nearest-neighbour heuristics for undersampling. NearMiss-1 and NearMiss-2 select positive samples with the smallest and the farthest distance to negative samples respectively, while NearMiss-3 follows a two-step approach. Tomek-links removes the majority instance by identifying the disparate pair having the closest link. SMOTE-Tomek by Batista et al. [53] and SMOTE-ENN by Batista et al. [54] perform oversampling followed by undersampling, cleaning noisy instances.

III. DEEP, NON-LINEAR GENERATIVE MODELS
Deep-learning-based generative models are capable of generating instances that have good likelihood guarantees under the parameters of the training distribution. The core idea behind generative modelling is that, given a collection of high-dimensional training instances, a model is able to do the following: 1) Density approximation: given a large set of instances, the model should estimate the probability density function well enough to describe the data. 2) New instance generation: the model should retain the joint distribution of the data over all variables and have a random process that can generate new data instances from the estimated training distribution. Overall, interest in deep generative models has spawned interesting outcomes, namely synthetic music, artwork and forged human faces. This study uses these models to generate instances for the minority class in an effort to combat the class imbalance problem. The authors observe that these non-linear models stand out from their traditional linear counterparts, whose new instances are upper bounded by the variation present in the dataset. Three types of generative models are discussed below. Later, the study provides the results of experiments performed on multiple imbalanced datasets using these models.

A. GENERATIVE ADVERSARIAL NETWORKS
GANs by [70] follow a game-theoretic approach in which two models (players) compete in an adversarial arrangement. The objective of the generator model is to generate instances analogous to the training distribution, while that of the discriminator model is to discriminate between actual and generated samples. Although non-linear and generative, GANs differ from VAE and RBM, as the latter two adopt a density approximation approach.
The Model: The equation shows the minimax objective of the two-model network. The discriminator D with parameters d maximizes the objective by pushing D(x), its output for an actual sample, as close to 1 as possible and D(G(z)), its output for a counterfeit, as close to 0 as possible. The generator G with parameters g, in contrast, minimizes the objective by pushing D(G(z)) as close to 1 as possible. The purpose of the two-model network is to make the generator produce images that the discriminator presumes to come from the training distribution rather than being counterfeit. The straightforward approach is for the generator to minimize the probability of the discriminator being correct, but this leads to flat gradients where learning is most needed and steep gradients where it is not. A change to the generator's objective leads to a marked improvement: rather than minimizing the probability of the discriminator being correct, the generator maximizes the probability of the discriminator being incorrect.
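The flat-gradient argument can be checked numerically. The sketch below assumes the standard formulation of [70], in which the generator term of the minimax objective is log(1 - D(G(z))) and the alternative generator loss is -log D(G(z)), and it assumes the discriminator output is a sigmoid of a logit. The function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(logit):
    """d/d(logit) of log(1 - D), the generator term of the minimax objective."""
    return -sigmoid(logit)


def non_saturating_grad(logit):
    """d/d(logit) of -log(D), the alternative generator loss."""
    return -(1.0 - sigmoid(logit))


# Compare both gradients when the discriminator rejects, is unsure about,
# and accepts a generated sample.
for logit in (-6.0, 0.0, 6.0):
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  saturating={saturating_grad(logit):+.4f}  "
          f"non-saturating={non_saturating_grad(logit):+.4f}")
```

When D(G(z)) is close to 0, which is the situation early in training, the minimax gradient almost vanishes while the alternative loss still provides a gradient of magnitude close to 1.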
The Strengths and Weaknesses: GANs derive their strength from game-theoretic foundations with the competitive two-model feature. The generated instances are high in quality and closely resemble the training data. This positions the models as strong candidates in image-based or transaction-based generative settings. Variants include C-GAN by [35] for transaction oversampling, Be-GAN by [57] for crisp, high-resolution output, Convolutional-GAN by [59] for vector-arithmetic-based morphing, Cycle-GAN by [58] for reversible domain transfer, and LS-GAN by [61] and Wasserstein-GAN by [60] for training stability. The major weaknesses of GANs are that they are difficult to train, lack quantified performance assessment, do not use density approximation and follow a complicated inversion mechanism.
Overriding metrics for performance assessment: Works by [17] and [11] propose a quantified performance assessment of GANs by expressing precision and recall differently. They consider the FID metric [19] uninformative due to its qualitative nature. The work in [17] uses density modelling with mode change, while the later work in [11], arguing that this is ambiguous, articulates a non-parametric manifold with truncation for metric estimation and the quality/variation tradeoff [12], [13]. Designed for assessing GAN performance on high-dimensional data, both approaches use balanced datasets. The former does introduce imbalance via mode change (class addition/removal), while the latter is silent on the subject.

B. VARIATIONAL AUTO ENCODERS
VAE by [69] are built on the idea of fusing autoencoders with probabilistic graphical models. The objective is to estimate and encode an intractable probability density via a tractable surrogate density, e.g. a Gaussian, and then minimize the Kullback-Leibler divergence between the two. These models differ from traditional autoencoders in that they introduce probability, shifting the paradigm from deterministic to stochastic.
Computing the posterior $p_\beta(z \mid x)$ in graphical models is intractable, so it has been estimated using Gibbs sampling or variational inference. VAE use the latter, which works by maximizing a lower bound on the data log-likelihood.
The Model: The encoder and decoder, with parameters φ and β respectively, produce the distribution parameters µ and σ. These are used to sample the latent factor representation (z | x) from the encoder and the reconstruction (x | z) from the decoder. Because these operations are differentiable, the lower bound can be maximized by gradient methods. Equation 5 shows the loss function, with the first term being the reconstruction error and the second the KL divergence regularizer, together making up the lower bound. The encoder assumes a tractable Gaussian $q_\phi(z \mid x)$ and uses the KL divergence to bring it close to the true posterior $p_\beta(z \mid x)$. Expanding this divergence yields the second term of the loss function, which keeps $q_\phi(z \mid x)$ close to the prior over z. The first term is the expectation of the conditional log-likelihood of x given z with respect to $q_\phi(z \mid x)$, which is maximized. Under the Gaussian assumption, maximizing this term makes the decoder a minimizer of the reconstruction error.
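The two terms of the loss can be made concrete with a small sketch. It assumes a standard normal prior over z, a diagonal Gaussian encoder parameterized by µ and the log-variance, and a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to a constant). These choices and the function names are assumptions for illustration, not details confirmed by the text.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way is what lets an autodiff framework pass
    gradients through mu and sigma.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def negative_elbo(x, x_reconstructed, mu, log_var):
    """Squared reconstruction error plus the KL regularizer."""
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    return reconstruction + gaussian_kl(mu, log_var)
```

The KL term is zero only when the encoder outputs exactly the prior, so minimizing the sum trades reconstruction fidelity against keeping the encodings close to the prior.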
The Strengths and Weaknesses: The stochastic characteristic distinguishes VAE encoders from their traditional counterparts. The latent encoding z, sampled from a Gaussian with parameters µ and σ, transforms the disjointed representation into a continuous one. This paves the way for a generative model that not only replicates but also generates interesting image variations. The combined optimization of the two terms, namely the KL divergence and the reconstruction loss, has a two-fold effect. The regularizer enforces random yet densely packed encodings, and the reconstruction loss encourages clustering of similar encodings. This leads to the decoder generating instances that have local variation within similar samples and that interpolate feature mixes between dissimilar clusters. The major weaknesses of VAE are that they generate blurry outputs at times, suffer from suboptimal variational inference due to amortization and approximation gaps, and produce gradients with high variance.

C. RESTRICTED BOLTZMANN MACHINES
These are unsupervised generative models proposed by [68], designed as a symmetrical arrangement of binary stochastic neurons in which two layers form a bipartite graph with non-linearity. Though deep models have produced sound generalization results, training and parameter optimization have been a challenge. Initialization with large weights leads to poor local minima, while small weights lead to small gradients. With calibrated weights, however, the learning algorithm performs well. This requires learning one layer of features at a time, with each layer capturing strong high-order correlations among the units in the layer below.
The Model: An RBM defines a distribution over the visible units v with latent variables h via an energy function E. As shown in Equation 6, a negative w leads to high energy and hence a decrease in probability, and vice versa; energy and probability are inversely related. The function gives the probability distribution p(v, h) shown in Equation 7. The challenge is that the partition function Z is a sum over all configurations of v and h. Since these are binary, the number of configurations grows exponentially with the number of units, making Z intractable. To counter this, [68] proposed contrastive divergence. The technique uses Gibbs sampling to approximate the joint distribution when direct sampling is difficult. Because there are no connections within a layer, the units of one layer are conditionally independent given the other layer, so the sampler alternates between layers, sampling all units of one layer given the values of the other.
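The energy function, the brute-force partition function and one half of a Gibbs step can be sketched as follows. The sketch assumes the standard binary RBM energy with visible biases a, hidden biases b and weights W; the exact form of Equations 6 and 7 is not reproduced here, and the function names are hypothetical.

```python
import itertools
import math
import random


def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v^T W h for binary v and h."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(bj * hj for bj, hj in zip(b, h))
    pair = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - pair


def partition_function(W, a, b):
    """Brute-force Z over all 2^(n_v + n_h) configurations; feasible only for tiny models."""
    n_v, n_h = len(a), len(b)
    return sum(
        math.exp(-energy(v, h, W, a, b))
        for v in itertools.product((0, 1), repeat=n_v)
        for h in itertools.product((0, 1), repeat=n_h)
    )


def sample_hidden(v, W, b, rng):
    """One half of a Gibbs step: hidden units are independent given v."""
    probs = [
        1.0 / (1.0 + math.exp(-(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))))
        for j in range(len(b))
    ]
    return [1 if rng.random() < p else 0 for p in probs]
```

With 2 visible and 2 hidden units the sum in `partition_function` already has 16 terms, and each additional unit doubles it, which is why sampling-based approximations are used instead.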
The Strengths and Weaknesses: RBM are highly expressive models with the capacity to encode a distribution without compromising computational efficiency. Symmetric connectivity between visible and hidden units makes faster algorithms possible. Unsupervised pre-training keeps parameter values in suitable ranges, which makes backpropagation efficient. Stacking units layer-wise creates deep belief networks, which serve as meaningful feature extractors. The weaknesses of RBM are that they are tricky to train, are vulnerable to local minima and rely on a partition function that is difficult to approximate.

D. RELATED WORKS
Engelmann and Lessmann [4] use conditional-WGAN, a type of GAN, on multiple credit-scoring datasets for minority class generation. Fiore et al. [8] use GAN to generate minority instances on a financial anomalies dataset and outperform traditional SMOTE. Zheng et al. [9] combine GAN with an adversarial denoising autoencoder to counter imbalance in a telecom fraud setting. The model outperforms state-of-the-art models, namely Bayesian belief network, fuzzy inference and deep auto-encoder models. Douzas and Bacao [35] perform generative oversampling using conditional-GAN on real and synthetic datasets. Park et al. [18] propose tabular-GAN, which uses a common structure for tabular and image data by converting tabular rows into a 2D matrix for convolution.
Tingfei et al. [14] use variational autoencoders to oversample the minority class and outperform synthetic models on financial datasets. Islam et al. [15] use VAE to generate accident events in a highly imbalanced setting and compare it with multiple SMOTE extensions. Dai et al. [16] use a contrastive variant of the variational autoencoder to generate the under-represented class on clinical datasets, surpassing linear models. Guo et al. [10] use a Gaussian Mixture VAE on high-dimensional time-series data to generate the minority class.
Zieba and Tomczak [37] use a Restricted Boltzmann Machine in an imbalanced credit rating evaluation mechanism. Boltzmann encoded adversarial machines by [64] extend RBM by training the model against an adversary, making it capable of discriminating between training and generated instances. The work in [65] uses Gaussian-Bernoulli models as an extension of RBM. These are equipped to process continuous data with improved gradients and are used for oversampling.

IV. PERFORMANCE EVALUATION METRICS
The metric most frequently used for evaluating model performance is accuracy. Although good at summarising, it is uninformative on imbalanced data: it weighs the confusion matrix quadrants equally by measuring the fraction of correct predictions among all predictions, and thus does not provide an adequate measure of performance on minority instances. The following discussion highlights metrics preferred over accuracy in this context. However, because each metric has a different formulation and gauges the model from a different perspective, the importance of the contextual relevance of the metrics is also discussed. In summary, contextual metric selection is required. This ensures that the attribute being measured is the one required to assess the model's performance.
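The weakness of accuracy on imbalanced data can be seen from the confusion matrix alone. The sketch below uses the standard textbook definitions of the threshold-based metrics, with the minority class taken as positive; the function name is hypothetical, and AUC is omitted because it needs ranked scores rather than counts.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from confusion matrix counts (positive = minority)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# A classifier that labels everything as majority on a 1%-minority dataset.
print(imbalance_metrics(tp=0, fp=0, fn=10, tn=990))
```

For this degenerate classifier the accuracy is 0.99, yet recall and G-Mean are 0 and balanced accuracy is 0.5, which shows why accuracy alone says little about the minority class.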

The proposed Model-Metric Mapper methodology (MMM) is conceived on the idea that appropriate model selection must follow suitable metric identification. The methodology adds that a metric-specific minority-to-majority proportion is further required for optimum performance. The methodology is introduced below with its distinguishing features, followed by the artifacts.
1) Distinguishing features: MMM exhibits 2 features that are highly significant to data imbalance:
• Model-to-metric mapping: Designed to work with heavily imbalanced datasets, MMM guides a practitioner in selecting the relevant model for sample generation based on the required metrics. This work holds that architecturally and algorithmically dissimilar models behave differently on distinct metrics (empirical evidence in Section VII). This is significant in the imbalance context, where performance on a specific metric is a key determinant in model selection. Hence, informed model selection is required. To elaborate, selecting a model with high precision in an infectious disease setting may be useless where a model with high recall would have been the obvious choice. Thus, MMM transforms the prevalent model selection approach from an intuitive or preference-based one to an informed one.
• Metric-wise sample proportionality calibration: Together with appropriate model selection, MMM identifies the optimum minority-to-majority proportion for a specific metric using a quadrant-wise calibration search. This work holds that metrics are sensitive to sample proportionality, hence a metric-specific proportion is to be identified (empirical evidence in Section VII, discussed in Section VIII). This is significant because the prevalent approaches of increasing the minority or decreasing the majority merely to balance the classes, without considering the metric-wise proportionality required, may not yield optimum results.
2) The artifacts: MMM consists of two independent modules, or artifacts. The first artifact systematically generates minority instances from three different deep generative models. The second subsumes the first while adding a systematic reduction of the majority class. The artifacts are discussed below.
Artifact I - Generative calibrations: This artifact is designed on the generative concept, with the premise that instances generated from deep generative models are superior to the competing synthetic or re-sampled counterparts in being novel and drawn from learnt joint distributions. This brings them in close proximity to the original instances. Using quadrant-based calibration, the artifact employs the generated instances in search of the optimum metric-wise majority-to-minority proportion and the model that yields it. The artifact has three streams, each encompassing an architecturally and algorithmically different generative model, namely GAN, VAE and RBM. VAE and RBM are density estimators, but as the density function is intractable, the former uses variational inference and the latter Gibbs sampling for approximation. GAN adopts a game-theoretic approach rather than working with a specific density function, which distinguishes it from the other two models.
Equation 8 represents Artifact I. $\chi_{min}$ are the original and $\chi_{min\_gen}$ the generated minority instances, the latter being generated as distinct sets from the 3 deep models. $\chi_{maj}$ are the original majority instances. $\alpha$ is the coefficient of proportionality, with values set as {1/4, 1/2, 3/4, 1}. It governs the proportion of minority class instances generated. The artifact uses a quadrant-wise calibration search, combining the majority and minority instances according to the coefficient of proportionality and measuring the result against the 6 performance metrics mentioned. The search cycle continues until the optimum metric-wise proportion and model are identified.
Artifact II - Chaining generative and reductive calibrations: This artifact is designed on the generative+reductive concept, with the premise that coupling minority instance generation with majority instance reduction in a systematic order may have the following effects. First, the reduced noise and induced novelty may lead to a different model-metric mapping than in Artifact I. Second, the majority-to-minority proportions may differ from those identified in the previous artifact. Therefore, Artifact II connects, or chains, IHT, the undersampling technique already discussed, with the contemporary instance generation of Artifact I. These constitute the two links in the artifact's chain, namely generatives and reductives.
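The Artifact I calibration search can be sketched as a loop over models and coefficient values. The sketch rests on an assumption that is not confirmed by the text: α is read as the target minority-to-majority ratio, so the number of generated instances is α times the majority count minus the original minority count. The `evaluate` callback and the function names are hypothetical placeholders for training the classifier and scoring it.

```python
ALPHAS = (0.25, 0.5, 0.75, 1.0)  # coefficient values stated in the text


def generated_count(n_min, n_maj, alpha):
    """Generated minority instances needed to reach ratio alpha (assumed meaning of alpha)."""
    return max(0, round(alpha * n_maj) - n_min)


def calibration_search(n_min, n_maj, models, evaluate):
    """Return {metric: (best_score, model, alpha)} over all model/alpha combinations.

    evaluate(model, n_generated) is a hypothetical callback that trains the
    classifier on the augmented data and returns a dict of metric scores.
    """
    best = {}
    for model in models:
        for alpha in ALPHAS:
            scores = evaluate(model, generated_count(n_min, n_maj, alpha))
            for metric, score in scores.items():
                if metric not in best or score > best[metric][0]:
                    best[metric] = (score, model, alpha)
    return best
```

A call such as `calibration_search(n_min, n_maj, ("GAN", "VAE", "RBM"), evaluate)` would return, for each metric, the best model and proportion, which is the form of the model-metric mapping that MMM recommends.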
The corresponding equation represents Artifact II. $\chi_{maj\_iht}$ are the majority and $\chi_{min\_gen}$ the minority sets, ensuing from the reduction and generation models respectively. $\chi_{min}$ are the original minority instances. $\alpha$ is the coefficient of proportionality, with values determined as 3/22 ≈ 1/7, 6/19 ≈ 1/3, 9/16 ≈ 1/2 and 12/13 ≈ 1. It governs the proportion of minority class instances generated. The artifact uses a quadrant-wise calibration search, combining the reduced majority with the original and generated minority instances according to the coefficient of proportionality and measuring the result against the 6 performance metrics. The search cycle continues until the optimum metric-wise proportion and model are identified. As for IHT, when applied to the majority class, it assigns probabilities to instances and removes those of lower probability, using factors such as class skew, overlap and decision boundary complexity. Artifact II differs from Artifact I in that the latter focuses on instance generation while the former couples reduction with generation. In addition, unlike the latter, the former alters both the minority and the majority instance counts. Therefore, MMM recommends metric-specific, informed model selection with a calibrated sample proportion, localizing the model and the required data proportion to the metric level. The methodology strengthens its recommendation by using 2 independent artifacts, both leading to similar conclusions.
[2]. It comprises genuine and fraudulent transactions by European cardholders. It is highly imbalanced, with 0.18% frauds. The Give-me-some-credit dataset classifies risky financial borrowers [73]. The minority class accounts for 6.68%, making it highly imbalanced. The Protein-homo dataset categorizes protein sequence comparability [74]. It is highly imbalanced, with the minority class at 1.11% of the total. The Skin-no-skin dataset covers a skin segmentation task [75]. With anomalies at 20%, it is imbalanced.
Anti-money-laundering-cases is a proprietary dataset comprising financial transactions flagged as cleared and laundered. It is heavily imbalanced, as the minority class constitutes 1.01% of the total volume. All the datasets were selected with volume, imbalance and tabular structure in view. Table 2.
The effect of using different classifiers is removed by using a single classifier, the industry-standard XGBoost, across all experiments. The parameters used are depth = 5, weak learners = 100 and learning rate = 0.1. The default implementations of the models are used, as provided for GAN [78], VAE [76], RBM [77], SMT [41], KMT [6], GMT [7] and IHT [49]. Evaluation metrics: based on the highly imbalanced nature of the datasets, the evaluation metrics used are Precision, Recall, F1-score, AUC, G-Mean and Balanced accuracy.
• Experiment set-I 'Generatives vs Synthetics': compares state-of-the-art synthetic models with Artifact I's generative models, using synthetic and generative oversampling respectively. These models collectively fall into the 'Data Augmentative Category', with results shown in Section 7.2.
• Experiment set-II 'Generatives + Reductives vs Synthetics + Reductives': compares state-of-the-art synthetic models with Artifact II's generative models, both employing a common undersampling technique. These models collectively fall into the 'Data Augmentative + Reductive Category', with results shown in Section 7.3.
• An inter-category comparison of the leading model from each of the two categories is performed to identify the overall top performer against each metric. This is discussed in Section 8.
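To make the choice of evaluation metrics concrete, the sketch below computes five of the six metrics from confusion-matrix counts, assuming their standard textbook definitions (G-Mean as the geometric mean of sensitivity and specificity, Balanced accuracy as their arithmetic mean). The function name `imbalance_metrics` and the counts in the example are illustrative assumptions, not values from the experiments; AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math
from typing import Dict


def imbalance_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute threshold-based metrics from confusion-matrix counts.

    The minority class is treated as the positive class. Ratios with a zero
    denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Illustrative counts: a classifier that labels every instance as majority
# on a set with 100 minority and 900 majority instances.
all_majority = imbalance_metrics(tp=0, fp=0, tn=900, fn=100)
print(all_majority)
```

In this illustrative case plain accuracy is 0.9 even though no minority instance is detected, while Recall and G-Mean are 0 and Balanced accuracy is 0.5, which is the usual motivation for reporting these metrics in imbalanced settings.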

VII. RESULTS
The results section is divided into 4 segments:

A. BASELINE COMPARISON
The baseline comprises the dataset in its original form and is compared with both the generative and the generative+reductive approaches from Artifact I and Artifact II respectively. For this comparison the Artifacts produce training data with a 1:1 minority-to-majority ratio. The objectives are: 1) To support and rationalize the argument that balancing leads to an increase in performance metrics in general. Artifact I generates the minority, while Artifact II reduces the majority together with increasing the minority, to achieve training data balance. 2) To evaluate the behaviour on 6 different performance metrics. This is significant because these metrics are preferred over traditional accuracy metrics for reporting in imbalanced settings. The comparison with baselines is performed on 6 distinct metrics over multiple datasets for generalizability. As shown in Table 3, approaches from both Artifacts surpass the baseline comprehensively over the 6 metrics, namely Precision, Recall, F1-Score, AUC, G-Mean and Balanced accuracy. Generative oversampling beats the baseline with maximum increases of 1.71%, 452.9%, 7.46%, 15.2%, 19.7% and 13.9%, while chaining generatives and reductives beats it with maximum increases of 1.16%, 435.29%, 51.67%, 30.55%, 82.39% and 30.55% across the 6 metrics. This provides a strong case that balancing, either by generation or by generation+reduction, leads to an increase in performance.
This work further identifies that a specific metric may require a more precise minority-to-majority ratio than mere balancing for further improvement, as will be shown in the next subsections.
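The percentage increases quoted in the baseline comparison are assumed here to be relative improvements over the baseline score. A minimal sketch of that arithmetic follows; the helper names `pct_increase` and `multiplier` and the example scores 0.50 and 0.60 are hypothetical and are not taken from Table 3.

```python
def pct_increase(baseline: float, score: float) -> float:
    """Relative improvement of `score` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (score - baseline) / baseline


def multiplier(pct: float) -> float:
    """How many times the baseline a score is, given its percentage increase."""
    return 1.0 + pct / 100.0


# Hypothetical scores: a metric moving from 0.50 to 0.60 is a 20% increase.
print(round(pct_increase(0.50, 0.60), 2))

# Under this reading, an increase of 452.9% corresponds to a score of
# roughly 5.5 times the baseline value.
print(round(multiplier(452.9), 3))
```

Read this way, the very large increases in the comparison indicate a metric that was close to zero on the unbalanced baseline rather than a score exceeding 1.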

B. GENERATIVES VS SYNTHETICS
This section compares generatives from Artifact I with state-of-the-art synthetic models, including SMT and its current extensions KMT and GMT. The comparison is performed proportion-wise. The objectives of this comparison are: 1) Compare 2 data-augmentative techniques using 6 metrics on multiple datasets, to empirically identify the benefit of using generative models over their synthetic counterparts. 2) Recommend a model-metric mapper. This work argues that model performance varies per performance metric. Therefore, an effective model should be selected based on the given metric. 3) Search for an optimum minority-to-majority ratio. The authors of this work hold that identifying and maintaining a precise metric-specific proportionality, together with the best-yielding model, improves performance.
The results are tabulated in Tables 4 to 9 and the proportionality is summarized in Figure 2. Top scores, second-best scores and proportionality quadrants use the same colours for ease of cross-reference.
Table 4 shows the results on the Precision metric. On all 5 datasets, VAE leads with scores of (0.87, 0.54, 0.89, 0.9, 0.88) against the synthetic KMT with (0.83, 0.53, 0.88, 0.89, 0.84). The best scores are found in the 1st quadrant, where the minority-to-majority ratio is 1/4. VAE reports a maximum increase of 4%.
On Recall, RBM clearly leads on all datasets with scores of (0.95, 0.97, 0.96, 0.98, 0.96), as shown in Table 5. SMT follows with (0.88, 0.54, 0.9, 0.96, 0.86). The majority of the top scores are reported in the 1st quadrant with proportionality 1/4. RBM reports a maximum increase of 43%.
Table 6 shows the results on F1-Score. VAE leads on 4 datasets with (0.78, 0.82, 0.94, 0.84) while SMT leads on 1 with (0.94). The second-best scores are (0.75, 0.31, 0.8, 0.92, 0.8) by GAN, SMT, KMT, SMT and KMT respectively. The majority of the best results are found in the 3rd quadrant, where the minority-to-majority ratio is 3/4. VAE reports a maximum increase of 5%.
Table 7 shows the results on the AUC metric. GAN excels on 3 datasets and SMT on 2, with scores of (0.94, 0.59, 0.88, 0.97, 0.93) and (0.93, 0.72, 0.94, 0.95, 0.92) respectively, with VAE following closely. The prime results are found in the 4th quadrant, where the minority equals the majority. The maximum increase reported by GAN is 2%.
Table 8 shows the results on the G-Mean metric. GAN and VAE follow each other closely on all the datasets. The best results are found in the 4th quadrant with proportionality 1/1.
The Balanced accuracy results are reported in Table 9. The best results are split between the 3rd and 4th quadrants, with 3/4 and 1/1 proportionality respectively. GAN reports a maximum increase of 2%.
Therefore, the following model-metric mappings are identified: VAE for Precision and F1-Score, RBM for Recall, GAN for AUC and Balanced accuracy, and GAN and VAE for G-Mean. A discussion of metric-wise sample proportionality is provided in Section 8.

C. GENERATIVES + REDUCTIVES VS SYNTHETICS + REDUCTIVES
This section compares the generatives+reductives from Artifact II with the same state-of-the-art synthetic models from the previous section. Both employ IHT as the majority instance reduction technique. The comparison is performed proportion-wise. The objectives of this comparison are: 1) Include a majority reduction technique with both the generative and synthetic approaches. This strengthens the argument for balancing the dataset not only by augmentation but also by reduction. 2) Reinforce and improve the identified model-metric mapper. The addition of a reduction technique may increase the efficiency of the model-metric mapping, and consistent findings will strengthen the argument for using deep generative models. 3) Improve metric-wise sample proportionality. As the proportion will comprise fewer majority instances, this may optimize the effective sample count.
The results are tabulated in Tables 10 to 15 and the proportionality is summarized in Figure 2. Top scores, second-best scores and proportionality quadrants use the same colours for ease of cross-reference. Table 10 […] and (0.67, 0.9, 0.67) respectively. All of the best scores are reported in the 1st quadrant, where the minority-to-majority ratio is 1/7. VAE-i reports a maximum increase of 6%. […] The best scores are reported in the 2nd quadrant, where the minority-to-majority ratio is 1/3. VAE-i reports a maximum increase of 2%. Table 13 […] Although the method in this section both generates minority instances and reduces majority instances, the findings are consistent with the previous section, where only the minority is generated. To elaborate, a similar model-metric mapping is identified: VAE-i for Precision and F1-Score, RBM-i for Recall, GAN-i for AUC and Balanced accuracy, and VAE-i and GAN-i for G-Mean. A discussion of the metric-wise sample proportionality is provided in Section 8.

D. COMPUTATIONAL EFFICIENCY
As for training efficiency, the training cost of the generative models grows only with the number of datasets, whereas that of the synthetic counterparts grows with the number of datasets multiplied by the number of proportions. To elaborate, if there are m datasets and each is to be augmented using n proportions, then the generative models require m units of training time while the synthetic models require m × n. As this work uses 5 datasets, each with 4 proportions, the training time is 5 units for the generative models and 5 × 4 = 20 units for the synthetic models. This training efficiency makes the generative models a stronger candidate.
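The m versus m × n count can be illustrated by enumerating the training jobs each family would need. This is a sketch under an assumption that the text does not spell out: a generative model is fitted once per dataset and then sampled for every proportion, whereas a synthetic oversampler is re-run for each dataset and proportion pair. The dataset labels and the function names `generative_jobs` and `synthetic_jobs` are placeholders.

```python
from itertools import product
from typing import List, Tuple

DATASETS = [f"dataset_{i}" for i in range(1, 6)]   # placeholder labels, m = 5
PROPORTIONS = ["3/22", "6/19", "9/16", "12/13"]    # n = 4


def generative_jobs(datasets: List[str], proportions: List[str]) -> List[str]:
    """One training job per dataset; proportions are assumed to be met by
    drawing more or fewer samples from the already trained model."""
    return list(datasets)


def synthetic_jobs(datasets: List[str], proportions: List[str]) -> List[Tuple[str, str]]:
    """One job per (dataset, proportion) pair."""
    return list(product(datasets, proportions))


n_generative = len(generative_jobs(DATASETS, PROPORTIONS))
n_synthetic = len(synthetic_jobs(DATASETS, PROPORTIONS))
print(f"generative runs: {n_generative}, synthetic runs: {n_synthetic}")
```

Both counts are linear in the number of datasets; the difference is that only the synthetic count also grows with the number of proportions explored.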

VIII. MMM IN MOTION
This section sets the proposed methodology in motion. MMM launches a six-pronged attack to neutralize class imbalance from six frontiers, as shown in Figure 3. The objective is to arrive at a data-driven and industry-neutral class imbalance solution. The motion is set as follows:
• The optimum minority-to-majority ratio against each specific metric is identified. The sensitivity of the metric to varying degrees of proportionality is elaborated. The model-metric mapping is established. The rationalization is strengthened by observing these against both categories.
• An inter-comparison of the leading models from each category is performed, establishing a rank-order preference. Finally, MMM recommends this rank-order-based model-metric mapping along with the optimum minority-to-majority proportionality.
1) Precision metrics require a low minority-to-majority ratio, as the prime results are reported in the 1st quadrant with sample proportionality 1/4 and 1/7 respectively, Figure 2. The scores drop in higher quadrants where minority representation increases, endorsing the sensitivity of the metric to proportionality, refer Tables […] generated instances are even more expressive independently, Tables 4, 10. Thus, for high Precision, MMM recommends VAE from Artifact I followed by VAE-i from Artifact II with the mentioned proportionality.
2) Recall metrics similarly require a low minority-to-majority ratio, as the prime results are reported in the 1st and 2nd quadrants with sample proportionality 1/4 and 1/3 respectively, Figure 2. Scores decline in higher quadrants as minority representation is increased, confirming the sensitivity of the metric to proportionality, Tables 5, 11. RBM and RBM-i are the top performers in their respective categories with the mentioned proportionality. The high representational strength of RBM-generated instances against Recall is confirmed by the fact that the synthetic SMT trails in both categories, even while using a high proportion of minority instances. The reduction in false negatives leading to high Recall can be attributed to these instances, marking the suitability of the RBM model against the metric.
Recommendation - RBM from Artifact I, 1/4 proportionality: An inter-category comparison between the two foremost models shows that RBM leads over RBM-i, Tables 5, 11. This shows that generated instances from RBM are more expressive independently than when combined with a reductive technique, as the latter requires not only more instances but also an equivalent reduction of the majority class. Thus, for high Recall, MMM recommends RBM from Artifact I followed by RBM-i from Artifact II with the mentioned sample proportionality.
3) F1-Score requires a moderate minority-to-majority ratio, as the highest scores are reported in the 3rd and 2nd quadrants with 3/4 and 1/3 sample proportionality respectively, Figure 2. The sensitivity of the metric to proportionality can be observed as scores drop when the minority-to-majority ratio is shifted to either extreme, Tables 6, 12. VAE and VAE-i excel in their respective categories. A modest representation strength of VAE-generated instances is observed against the F1-Score. The balance between false positives and false negatives leading to high F1-Score can be attributed to these instances, marking the suitability of the VAE model against the metric.
Recommendation - VAE from Artifact I, 3/4 proportionality: Comparing the two leaders from each category shows that VAE surpasses VAE-i, Tables 6, 12. This confirms that a moderate representation of the minority class is preferred over a restricted one. Therefore, for high F1-Score, MMM recommends VAE as the 1st and VAE-i as the 2nd model, with proportionality of 3/4 and 1/3 respectively.
4) AUC metrics require a high or near-equal minority-to-majority ratio, as the top scores are reported in the 4th quadrant with sample proportionality 12/13 and 1/1 respectively, Figure 2. The metric is highly sensitive to proportionality, as low scores are observed until near equilibrium between the two classes is achieved, Tables 7, 13. GAN-i and GAN lead on multiple datasets in their respective categories. This shows the high expressiveness of GAN-generated instances over other models against AUC. The enhanced capacity to grade predictions, leading to high AUC, can be attributed to these instances, marking the suitability of the GAN model against the metric.
Recommendation - GAN-i from Artifact II, 12/13 proportionality: An inter-category comparison of the two prime performers shows that GAN-i excels over GAN, Tables 7, 13. This shows that for the AUC metric, GAN instances are more expressive when a near equilibrium of both classes is maintained, but with a high count. Therefore, for high AUC, MMM endorses GAN-i as the 1st and GAN as the 2nd model, with proportionality 12/13 and 1/1 respectively.
5) G-Mean metrics, similar to AUC, require a high or near-equal minority-to-majority ratio, as the top scores are reported in the 4th quadrant with sample proportionality 1/1 and 12/13 respectively, Figure 2. Sensitivity to proportionality is evident, as low scores are reported with low minority counts, Tables 8, 14. GAN, VAE and GAN-i deliver near-comparable performance against the foremost models. To be expressive, both synthetic and generated instances require a near-equal presence of the opposite class with a high count. The balancing of the dataset is attributed to these instances, which increases modestness and leads to high G-Mean, marking the suitability of the GAN and VAE models against the metric.
Recommendation - GAN-i from Artifact II, 12/13 proportionality: An inter-category comparison between deep models establishes the lead of GAN-i over GAN and VAE, Tables 8, 14. For high G-Mean, MMM recommends GAN-i together with SMT as the 1st model, followed by GAN and VAE, with proportionality 12/13 and 1/1 respectively.
6) Balanced accuracy metric requires a moderate to high ratio, as the metric reports its highest scores in the 4th quadrant and in a split between the 3rd and 4th quadrants, with proportionality 12/13, 3/4 and 1/1 respectively, Figure 2. The metric is observed to have medium sensitivity, as low scores fall in the lower quadrants, Tables 9, 15. GAN and GAN-i lead on multiple datasets in their categories. GAN-generated instances show high expressive strength against Balanced accuracy. The equilibrium attained between true positives and true negatives is attributed to these instances, which increases comprehensiveness and leads to high Balanced accuracy, marking the suitability of the GAN model against the metric.
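The rank-ordered recommendations stated in items 1) to 5) can be collected into a small lookup table. The sketch below is one possible encoding, not an artifact of this work: the names `Recommendation`, `MMM_RECOMMENDATIONS` and `recommend` are hypothetical, and only the models, artifacts and ratios are taken from the text. Balanced accuracy is left out because no explicit rank order is given for it in item 6).

```python
from fractions import Fraction
from typing import Dict, List, NamedTuple


class Recommendation(NamedTuple):
    model: str
    artifact: str        # "I" = generation only, "II" = generation + reduction
    ratio: Fraction      # minority-to-majority proportionality


# Index 0 is the 1st recommendation, index 1 the 2nd, as stated in the text.
MMM_RECOMMENDATIONS: Dict[str, List[Recommendation]] = {
    "Precision": [Recommendation("VAE", "I", Fraction(1, 4)),
                  Recommendation("VAE-i", "II", Fraction(1, 7))],
    "Recall":    [Recommendation("RBM", "I", Fraction(1, 4)),
                  Recommendation("RBM-i", "II", Fraction(1, 3))],
    "F1-Score":  [Recommendation("VAE", "I", Fraction(3, 4)),
                  Recommendation("VAE-i", "II", Fraction(1, 3))],
    "AUC":       [Recommendation("GAN-i", "II", Fraction(12, 13)),
                  Recommendation("GAN", "I", Fraction(1, 1))],
    # The text names the synthetic SMT alongside GAN-i as 1st for G-Mean.
    "G-Mean":    [Recommendation("GAN-i", "II", Fraction(12, 13)),
                  Recommendation("GAN/VAE", "I", Fraction(1, 1))],
}


def recommend(metric: str, rank: int = 1) -> Recommendation:
    """Return the rank-th (1-based) recommendation for a metric."""
    return MMM_RECOMMENDATIONS[metric][rank - 1]


for metric in MMM_RECOMMENDATIONS:
    first = recommend(metric)
    print(f"{metric}: {first.model} (Artifact {first.artifact}), ratio {first.ratio}")
```

A table of this kind makes the metric-first workflow explicit: the practitioner picks the metric that matters for the application, then reads off the model and the proportion to calibrate towards.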

IX. CONCLUSION
The proposed MMM methodology covers a research gap in the class imbalance domain by building on two concepts. First, the authors are of the view that metrics, being distinct in their formulation and usage, are also sensitive to data proportions, but to varying degrees. An effective proportionality for one metric may not be suitable for another. Therefore, metric-wise proportionality calibration is required. Second, a highly suitable model on one metric may be less suitable on another. So, an informed model selection is required. Although deep models are known to have strong generative capabilities, their inherent architectural and algorithmic variation also makes a strong case for precise candidate selection. MMM, formulated on these concepts, concludes the following: 1) The optimal model-metric mapping is identified and 1st and 2nd recommendations are proposed. These are, Precision and F1-Score: VAE, VAE-i; Recall: RBM, RBM-i; AUC and G-Mean: GAN-i, GAN/VAE; and Balanced accuracy: GAN/GAN-i. 2) Metric-wise optimum minority-to-majority proportionality is calibrated on both the Augmentation and the Augmentation + Reduction categories. These are, Precision: