
Imbalanced Data Problem in Machine Learning: A Review




Abstract:

One of the prominent challenges encountered in real-world data is an imbalance, characterized by unequal distribution of observations across different target classes, which complicates achieving accurate model classifications. This survey delves into various machine learning techniques developed to address the difficulties posed by imbalanced data. It discusses data-level methods such as oversampling and undersampling, algorithm-level solutions including ensemble learning and specific algorithm adjustments, cost-sensitive algorithms, and hybrid strategies that combine multiple approaches. Moreover, this paper emphasizes the crucial role of evaluation methods like Precision, F1 Score, Recall, G-mean, and AUC in measuring the effectiveness of these strategies under imbalanced conditions. A detailed review of recent research articles helps pinpoint persistent gaps in generalizability, scalability, and robustness across these methods, underscoring the necessity for ongoing improvements. The survey seeks to offer an extensive overview of current approaches that improve the efficiency and effectiveness of machine learning models dealing with imbalanced datasets, thus equipping researchers with the insights needed to develop robust and effective models ready for real-world application.
Published in: IEEE Access (Volume: 13)
Page(s): 13686 - 13699
Date of Publication: 20 January 2025
Electronic ISSN: 2169-3536



SECTION I.

Introduction

Imbalanced data poses a major challenge in machine learning. It arises when the class distribution within a dataset is uneven [1]: one class is significantly larger than the others, so most instances belong to the dominant class (the negative or majority class) while only a small number fall into the other classes (the positive or minority class) [2]. This creates a fundamental problem, because classifiers trained on such skewed data become biased in favor of the dominant class and often overlook the minority class [3]. The result is the misclassification of minority-class instances, which is usually exceedingly critical because the cases of interest occur in only a few examples relative to the whole dataset [2]. Imbalanced datasets present four key challenges: bias, overlap, feature vector size, and dataset size [4]. The scenario is prevalent across domains such as healthcare (e.g., diabetes diagnosis [5] and skin lesion classification [6]), finance (e.g., credit fraud detection [7]), and engineering (e.g., fault detection in wind turbines [8] and the recognition of rotting or dead trees [9], [10]), among others.

The consequences of imbalanced data are far-reaching. In situations such as medical diagnosis, where accurately identifying rare diseases is vital, models that favor the majority class may lead to overlooked diagnoses and compromised patient care [11]. Similarly, in fraud detection, a machine learning model biased toward the predominant class yields a decrease in True Positives (TP) and an increase in False Positives (FP) [12], allowing fraudulent activities to go undetected. Addressing imbalanced data is therefore imperative for building reliable and effective ML systems.

Over the years, researchers have developed numerous methods to counteract the effects of class imbalance, encompassing three primary classifications: data-level techniques, algorithm-level techniques, and integrated methods [7], [13]. Data-level techniques involve modifying the dataset before the classifier is trained, including under-sampling the dominant class, over-sampling the minor class, or creating synthetic data [7], [14]. Conversely, algorithm-level techniques modify the learning algorithms to manage imbalanced data more efficiently without compromising the data itself, often through cost-sensitive learning, ensemble methods, or algorithm-specific adjustments [13].

While data-level and algorithm-level techniques offer valuable strategies for addressing imbalanced data, they each have limitations. Data-level techniques may discard potentially useful information or introduce noise through synthetic data generation [15]. Algorithm-level techniques, while effective, may not fully exploit the available data or may require complex adjustments for different algorithms [13].

Hybrid approaches combine data-level and algorithm-level techniques [7], aiming to leverage each category's strengths while mitigating their weaknesses. Examples include combined sampling techniques, algorithmic resampling strategies, and ensembles of resampled datasets. Evaluation metrics like accuracy and precision may not adequately capture model performance on imbalanced datasets. Hence, metrics such as the area under the precision-recall (PR) curve, the area under the receiver operating characteristic (ROC) curve, and the F1-score have been proposed and shown to be effective in classification tasks with imbalanced data [16]. Additionally, researchers have introduced class-weighted evaluation frameworks that accommodate arbitrary skews in class cardinalities and importance, effectively addressing challenges presented by imbalanced datasets.

The subsequent sections of the paper will follow this structure. Section II outlines the review strategy, including the research questions that guide the survey. Next, Section III delves into the foundational techniques of oversampling and undersampling, providing an overview of their role in addressing data imbalance. Section IV explores a broad spectrum of balance strategies, categorized into data-level, algorithm-level, and hybrid approaches, with each subsection detailing methods to enhance model performance when dealing with imbalanced datasets. In Section V, the limitations of each technique are discussed, along with justifications for their inclusion. Following this, Section VI examines various evaluation methods essential for assessing the effectiveness of models handling imbalanced data, highlighting metrics like F1 Score, AUC, and others to provide a nuanced understanding of model efficacy. Finally, Section VII synthesizes these insights to offer a conclusive summary of the current techniques for managing imbalanced data in machine learning, pinpointing existing gaps in each approach.

SECTION II.

Review Strategy

This survey investigates the imbalanced data problem in machine learning by reviewing studies from various databases, including Google Scholar, IEEE, Springer, Elsevier, MDPI, and others. Emphasis is placed on recent advancements, with most reviewed studies published between 2020 and 2024, ensuring the latest methodologies and insights are included. The search process utilized specific keywords such as “imbalanced data,” “class imbalance,” “machine learning,” “data-level techniques,” “algorithm-level solutions,” “oversampling,” “undersampling,” “SMOTE,” “cost-sensitive learning,” “Ensemble methods,” and “hybrid approaches.” This review has investigated 40 studies, and most of the papers are from journals ranked in Q1 and Q2 categories, ensuring high-quality and impactful contributions. The selected references encompass foundational approaches, innovative techniques, and domain-specific solutions, providing a comprehensive analysis of the field. This strategy offers a deep understanding of current approaches to addressing data imbalance challenges while identifying research gaps that indirectly illuminate potential future directions.

This review aims to address key questions that encompass all aspects of the imbalanced data problem in machine learning:

  1. Q1: How effective are fundamental approaches, such as oversampling and undersampling, in addressing class imbalance across applications?

  2. Q2: What are the findings and limitations of data-level, algorithm-level, and hybrid techniques in achieving class balance?

  3. Q3: How do these limitations constrain the overall performance of the balancing techniques?

  4. Q4: Which evaluation metrics best assess the success of balancing techniques in ML and DL tasks?

The first section of the review focuses on answering the first question, providing a foundational understanding. The second section elaborates on the second question, offering detailed insights. The third question is addressed by analyzing the limitations and shortcomings of various techniques, as discussed in the third section of the review. Finally, the fourth part provides a concise and clear response to the fourth question, wrapping up the review comprehensively.

SECTION III.

Fundamental Approaches to Class Distribution Balancing

This section delves into the fundamental methods for mitigating class imbalance in ML: oversampling and undersampling, shown in Figure 1. Class imbalance is critical in predictive modeling, often resulting in biased model outcomes that favor the majority class disproportionately. Addressing this imbalance is essential to foster fair and precise models. Oversampling and undersampling are two key techniques that adjust the composition of classes in training datasets to create a more balanced environment for model training. A thorough exploration of these foundational techniques prepares the groundwork for the advanced balancing methods detailed in subsequent sections of this survey.

FIGURE 1. Basic approaches to class distribution balance.

A. Oversampling

Oversampling enhances the minority class's representation within a dataset, bringing its frequency up to parity with the majority class. This adjustment can be realized by simply duplicating existing instances or by creating new, synthetic ones through methodologies like SMOTE (Synthetic Minority Oversampling Technique) [17]. SMOTE and its derivatives, such as Borderline-SMOTE and ADASYN, synthesize new samples by interpolating along the line segments connecting minority class instances to their nearest neighbors within the same class. These approaches operate primarily in the feature space, thereby injecting a higher degree of diversity into the minority class and supporting the model's capability to generalize from limited data [18].
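As a minimal sketch of this idea, assuming the imbalanced-learn library and a synthetic toy dataset (neither is part of the survey itself):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic binary dataset with roughly a 95:5 class ratio.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))  # heavily skewed, e.g. ~1900 majority vs ~100 minority

# SMOTE interpolates between each minority point and its k nearest
# minority-class neighbors to synthesize new feature vectors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # both classes now equally represented
```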

Advantages:

  • Increases the model’s generalization abilities by introducing a more comprehensive range of variability within the minority class, thus preparing the model for broader scenarios.

  • Safeguards against the loss of essential information in minority class instances, which is particularly vital in datasets where each example holds significant value.

Disadvantages:

  • There is a risk of overfitting, as models might begin to memorize the noise inherent in the synthetically generated samples rather than learn to generalize from the actual data.

  • Elevates the computational burden, especially when employing sophisticated synthetic instance generation techniques, which can be resource-intensive.

B. Undersampling

In contrast to oversampling, undersampling balances the class distribution by reducing the size of the majority class, often through the random deletion of its instances [17]. More refined techniques such as Cluster Centroids, Tomek Links, or Near Miss maintain the statistical integrity of the majority class while minimizing its quantity [19]. These methods enhance model performance by eliminating instances that are either redundant or less informative, thereby creating a more balanced dataset that stops the model from being dominated by the traits of the predominant class.
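A corresponding sketch for undersampling, again assuming imbalanced-learn and toy data; `RandomUnderSampler` illustrates random deletion, while `TomekLinks` illustrates the more refined boundary-cleaning variant:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Random deletion of majority instances until the classes reach parity.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_rus))

# Tomek Links removes only majority instances that form cross-class
# nearest-neighbor pairs, cleaning the class boundary instead of
# forcing an exact 1:1 balance.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print(Counter(y_tl))
```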

Advantages:

  • Significantly reduces the time required for model training by decreasing the dataset size, simplifying the learning process.

  • Reduces the likelihood of model bias toward the majority class, promoting more fair and balanced decision-making.

Disadvantages:

  • There is a danger of losing essential data, as the removal process may inadvertently discard instances crucial to comprehensively understanding the predominant class.

  • May lead to underfitting, especially if the diversity within the majority class is not fully captured, potentially impairing the model’s ability to generalize effectively.

C. Strategic Considerations for Choosing Between Oversampling and Undersampling

Choosing between oversampling and undersampling requires careful consideration of multiple aspects, such as dataset size, data characteristics, available computing resources, and the significance of minority class instances. Typically, oversampling is preferred in scenarios where the minority class includes critical, rare events that are essential to capture accurately, such as in fraud detection or diagnosing rare medical conditions. Conversely, undersampling is often more advantageous for extremely large datasets, where reducing the volume of data can significantly enhance computational efficiency and where there is enough data redundancy to minimize the risk of losing important information [20].

To achieve the best of both worlds, hybrid approaches that merge elements of both oversampling and undersampling are becoming more prevalent. Methods like SMOTEENN, combining SMOTE with Edited Nearest Neighbors, or various ensemble techniques incorporating different resampling strategies within a single classifier framework can provide a more balanced dataset [21]. These hybrid methods help ensure that models are not only efficient but also retain the integrity and diversity of the data, thus enhancing overall model performance without sacrificing detail or computational speed.
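For illustration, both combinations mentioned above are available off the shelf in imbalanced-learn; the dataset here is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# SMOTE oversampling followed by Edited Nearest Neighbours cleaning.
X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)

# SMOTE oversampling followed by removal of Tomek links.
X_tmk, y_tmk = SMOTETomek(random_state=0).fit_resample(X, y)
```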

SECTION IV.

Balance Techniques

Many dataset characteristics determine the most suitable techniques for addressing imbalance (data-level, algorithm-level, or hybrid). One key factor is the degree of imbalance, which varies from dataset to dataset; knowing it clarifies whether the class distribution is highly or only moderately skewed. Assessing the degree of imbalance is therefore critical for effectively handling imbalanced datasets and building reliable ML models.

A. Data-Level Techniques

Data-level techniques, which align class distributions by adjusting the size of training datasets through resampling, have become widely adopted [22]. Resampling aims to equalize the class distribution through two methods: undersampling and oversampling [23]. Although resampling directly balances the training set, it introduces two main challenges: oversampling may lead to overfitting and reduced generalization on the test set, whereas undersampling may result in a significant loss of knowledge from the majority class [13]. Standard undersampling methods include Random Undersampling, Tomek Links, and Cluster Centroids [24]. To prevent the substantial depletion of instances from the predominant class, oversampling techniques are often preferred [4]. Prominent oversampling methods include SMOTE, ADASYN (Adaptive Synthetic Sampling), and Borderline-SMOTE [24].

SMOTE represents a commonly adopted oversampling method [25]. It identifies nearby instances within the feature space, establishes connections between them, and generates new samples along those connections. Nonetheless, generating synthetic examples without accounting for the majority class can create ambiguous instances, especially when there is significant overlap between classes, which is a notable drawback of this approach [13].

Many hybrid-sampling methods and SMOTE variants, shown in Table 1, have been proposed to address these challenges. SMOTE-N (Synthetic Minority Over-sampling Technique for Nominal data) modifies SMOTE to suit nominal attributes [25]. SMOTEENN (combining SMOTE with Edited Nearest Neighbors) aims to rectify class imbalance by oversampling the minority class and enhancing dataset quality through the elimination of noisy samples, while SMOTE-Tomek (combining SMOTE with Tomek links) simultaneously creates synthetic samples for the minority class and undersamples the majority class, effectively rebalancing the dataset and improving classification outcomes [26]. Distance-based SMOTE (D-SMOTE) regulates class overlap through a distance parameter, creating synthetic samples that better represent the minority class. Bi-phasic SMOTE (BP-SMOTE), on the other hand, overcomes traditional SMOTE's shortcomings by enhancing the oversampling procedure through instance selection, guaranteeing the inclusion of only pertinent instances in the resultant training dataset [4]. CDSMOTE combines class decomposition and synthetic minority oversampling: it starts by dividing the majority class into subclasses and then uses SMOTE to increase the sample size of the minority class, striving to attain a balanced data distribution while retaining crucial information [27]. The Radius Synthetic Minority Oversampling Technique (RSMOTE), unlike traditional SMOTE, which connects minority samples to create synthetic instances along line segments, identifies the nearest samples from the majority class within a specified radius to generate synthetic data points, aiding in the creation of more diverse and realistic synthetic samples [28]. SASMOTE (a self-inspected adaptive SMOTE approach) overcomes traditional SMOTE limitations by prioritizing visible neighbors and eliminating low-quality samples; integrating adaptive nearest-neighborhood selection and self-inspection for uncertainty evaluation elevates the quality of resampled data, which is particularly beneficial for highly imbalanced healthcare classification tasks [29]. Borderline-SMOTE is a sampling technique employed for imbalanced datasets, particularly in situations such as fraud detection in credit card transactions; it generates synthetic samples for the underrepresented class by targeting instances close to the decision boundary between classes, often known as borderline instances [30]. The Oriented Oversampling with Spatial Information method (OOSI) tackles challenges in imbalanced and noisy datasets through a robust and adaptive approach comprising three critical phases: Oriented Information Sampling, Spatial Information Quantification, and Adaptive Data Space Partitioning [31]. The Synthetic and Dependent Wild Bootstrapped Oversampling method (SDWBOTE) helps overcome the challenges of skewed data in fault detection and localization tasks within wind turbine systems by considering temporal dependencies and relationships among samples [8].

TABLE 1. Effectiveness and Limitations of Hybrid-Sampling Techniques
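Two of the variants discussed above have reference implementations in imbalanced-learn; the following sketch (on toy data) shows how they differ in where they place synthetic points:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Borderline-SMOTE synthesizes only from minority points that lie
# near the decision boundary (the "danger" region).
X_b, y_b = BorderlineSMOTE(kind="borderline-1", random_state=0).fit_resample(X, y)

# ADASYN generates more synthetic points where the minority class is
# hardest to learn, i.e. where minority samples are surrounded by
# majority-class neighbors.
X_a, y_a = ADASYN(random_state=0).fit_resample(X, y)
```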

GAN-based methods, leveraging Generative Adversarial Networks (GANs) composed of a generator and a discriminator, are gaining traction. These techniques generate synthetic samples by learning the inherent data distribution and producing new samples that closely resemble actual data. For instance, the GAN-based data augmentation method introduced by [6] seeks to improve the classification accuracy of imbalanced skin lesion datasets. Another study explores GANs to address challenges associated with imbalanced datasets in machine learning; focusing on three real-world datasets (Car Evaluation, Human Activity Recognition, and Bank), it aims to enhance minority class representation through data augmentation. The GAN-based approach generates synthetic data to balance the dataset, improving classification accuracy and model performance, and the study underscores the potential of GANs as an effective tool for data augmentation and for boosting model robustness across diverse applications [33]. An innovative Active Balancing Mechanism (ABM) is proposed to tackle the challenges of imbalanced medical data, focusing on myocardial infarction (MI) detection using electrocardiogram (ECG) signals. The ABM incorporates Gaussian naïve Bayes and entropy to enhance classification accuracy and reliability. Additionally, a modified convolutional neural network (MCNN) is developed to further optimize performance, showcasing the method's potential for real-time monitoring and decision-making in clinical settings [32].
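To make the mechanics concrete, the following is a deliberately tiny PyTorch sketch of GAN-based minority oversampling for tabular features; the architecture, dimensions, and training length are illustrative assumptions, not the setups used in [6], [32], or [33]:

```python
import torch
import torch.nn as nn

latent_dim, n_features = 16, 10  # illustrative sizes

# Generator maps noise to synthetic minority-class feature vectors;
# the discriminator scores whether a vector looks like a real one.
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

X_min = torch.randn(200, n_features)  # stand-in for real minority-class rows

for step in range(1000):
    # Discriminator step: real minority rows vs. frozen generator output.
    real = X_min[torch.randint(len(X_min), (64,))]
    fake = G(torch.randn(64, latent_dim)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: produce samples the discriminator labels as real.
    loss_g = bce(D(G(torch.randn(64, latent_dim))), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# After training, G(torch.randn(k, latent_dim)) yields k synthetic minority rows.
```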

B. Algorithm-Level Techniques

Algorithmic strategies enhance learning of the minority class by directly altering the training process of the classifier or modifying algorithms to address the challenges associated with imbalanced data effectively [34]. Common strategies encompass:

1) Cost Sensitive Learning

Cost-sensitive learning entails allocating varying weights or costs to classes during the training process, prioritizing the penalization of misclassifications of the minority class [35]. MetaCost is a cost-sensitive classification algorithm that creates multiple classification models from the original dataset and computes the class probability for each sample. It then applies a conditional risk formula to assign a cost to each sample, subsequently re-labeling the training set based on these costs; still, it is less suitable for handling imbalanced data in multi-class scenarios [36]. The adaptive cost-sensitive learning (ACL) method, effective under large imbalance ratios, dynamically adapts the sample cost throughout the entire training process. The model is trained using weighted losses in a manner that harmonizes the contribution of each class to the model parameter updates, preventing the dominance of majority classes during training; the cost calculation considers the sample number distribution, the convergence trend of classes, and the convergence trend of samples. Applied to industrial fault diagnosis, this approach ensures adequate representation of minority classes and significantly enhances the performance of intelligent diagnostic models, particularly in scenarios with high imbalance ratios [37]. Another study employs cost-sensitive learning to dynamically adjust class-specific costs within deep neural networks, thereby improving the recognition of minority classes; this is particularly useful in applications such as medical diagnosis and fraud detection, where accurately identifying underrepresented classes is critical [35].
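As a small sketch of the basic mechanism, assuming scikit-learn and a toy dataset; neither the weighting scheme nor the 10x cost is taken from the cited methods:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Option 1: derive inverse-frequency class weights automatically, so the
# loss contribution of each class is balanced during training.
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
clf = LogisticRegression(class_weight={0: w[0], 1: w[1]}, max_iter=1000).fit(X, y)

# Option 2: encode an explicit per-sample misclassification cost, e.g.
# treating an error on a minority instance as 10x as costly.
sample_cost = np.where(y == 1, 10.0, 1.0)
clf_cost = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_cost)
```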

2) Algorithm-Specific Adjustments

Adjustments specific to algorithms entail customizing methods like decision trees, random forests, and SVMs to handle imbalanced data more effectively. A Support Vector Machine (SVM) equipped with Multiple Kernel Learning (MKL) can integrate different kernel functions and fine-tune their weights, thus improving its ability to classify imbalanced datasets more accurately [38]. The Stratified Sampling-Based Deep Neural Network (SSDNN) approach tackles imbalanced data challenges by employing a stratified sampling technique that partitions the dataset into non-duplicated groups with balanced class representation, enhancing prediction accuracy, though the stratified sampling step increases computational complexity [9]. GENDA (Generative Neighborhood-based Deep Autoencoder) is a novel generative model that addresses imbalance, particularly for image and time-series data [13]. By learning latent representations of the data and generating synthetic samples for minority classes, it is flexible enough to be applied in domains where original data usage is restricted; it provides advantages over other algorithm-level techniques by incorporating data augmentation and addressing imbalanced classification without making distribution assumptions. A novel deep-learning-based model is introduced to address data imbalance in medical image classification: acknowledging the underrepresentation of many medical conditions in datasets, the authors propose an approach that utilizes effective perturbation operations to extract relevant features from single-class samples [39]. Another study conducts an in-depth exploration of adapting class-balanced loss functions for Gradient Boosting Decision Trees (GBDT) across diverse tabular classification tasks, including binary, multi-class, and multi-label classification. The research highlights the effectiveness of these loss functions in addressing the class imbalance challenges common in real-world applications, and introduces a Python package that simplifies the integration of these loss functions into GBDT workflows, making advanced techniques more accessible to researchers and practitioners [40].
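Many standard learners already expose hooks for such adjustments; a brief scikit-learn sketch (toy data, not from any cited study):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# SVM: class_weight scales the soft-margin penalty C per class, so errors
# on the minority class are penalized more heavily during training.
svm = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

# Decision tree: the same hook reweights the impurity computation at each
# candidate split, counteracting the majority class's dominance.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X, y)
```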

3) Ensemble Methods

Ensemble methods merge multiple base classifiers or models to forge a more robust learner that addresses the imbalance problem [41]; Table 2 summarizes some ensemble methods, their findings, and their limitations. These approaches leverage various algorithms, or versions of the same algorithm, to bolster predictive performance on imbalanced datasets, mitigating the impact of class imbalance by blending predictions from multiple models and thus enhancing the classifier's overall performance. Boosting refers to a group of machine learning techniques that successively train a sequence of weak learners; each learner aims to rectify the errors of its predecessors by assigning higher importance to misclassified cases. This iterative process enables boosting to construct a resilient ensemble model capable of making accurate predictions, which is particularly beneficial in scenarios with class imbalance or noisy data. AdaBoost, a renowned boosting algorithm, continuously adjusts the weights of misclassified instances to improve the model's overall performance [42]. Bagging, or Bootstrap Aggregating, generates multiple training dataset instances via bootstrap sampling and trains an individual base learner on each sample; the predictions of these base learners are subsequently aggregated, usually through averaging, to generate the final prediction. This process enhances the overall performance and stability of machine learning models by mitigating variance and overfitting, and it indirectly aids in addressing imbalanced data issues [43]. Random forests, a widely favored ensemble learning technique, amalgamate numerous decision trees to enhance predictive accuracy: every tree in the forest is constructed from a randomized subset of features in the training data, and the ultimate prediction is formed by aggregating the predictions from all the trees [44]. Gradient Boosting optimizes weak learners, typically decision trees, to improve predictive accuracy; it functions by iteratively constructing decision trees to rectify the errors made by preceding trees, assigning greater weights to misclassified samples in each iteration [45]. A stacked deep learning algorithm leverages ensemble methods to manage imbalanced data adeptly, integrating techniques such as Stacked CNNs (Convolutional Neural Networks) and Stacked RNNs (Recurrent Neural Networks) to identify intricate patterns and temporal interdependencies [4].

TABLE 2. Effectiveness and Limitations of Ensemble Methods
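Several imbalance-aware variants of these ensembles are implemented in imbalanced-learn; a sketch on toy data (parameters are illustrative):

```python
from sklearn.datasets import make_classification
from imblearn.ensemble import (
    BalancedRandomForestClassifier,
    EasyEnsembleClassifier,
    RUSBoostClassifier,
)

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)

# Random forest whose bootstrap sample for each tree is balanced by
# undersampling the majority class.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting where each iteration trains on a randomly undersampled set.
rus = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

# Bag of AdaBoost learners, each fit on a balanced subset (EasyEnsemble).
easy = EasyEnsembleClassifier(n_estimators=10, random_state=0).fit(X, y)
```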

C. Hybrid Approaches

Hybrid approaches amalgamate various strategies, frequently blending data-level and algorithm-level techniques, as shown in Figure 2. These methodologies provide a comprehensive solution to class imbalance challenges, enhancing machine learning models' resilience, effectiveness, and reliability. By integrating multiple methods, they aim to alleviate issues like overfitting, information loss, and poor performance on minority classes, thus enhancing the overall effectiveness of machine learning models in managing class disproportionality. Table 3 highlights the hybrid techniques employed in recent years. Common hybrid approaches include:

TABLE 3. Effectiveness and Limitations of Hybrid Approaches to Handle Imbalanced Data

FIGURE 2. Approaches to handle the imbalanced data problem.

1) Data Preprocessing with Algorithm Adjustments

Data preprocessing with algorithm adjustments involves using techniques like oversampling, undersampling, or SMOTE to prepare the data before applying algorithm-level adjustments. The Enhanced Generative Adversarial Network (E-GAN) technique amalgamates features of Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs), leveraging the data generation prowess of GANs and the classification capabilities of CNNs [11]. A recent study tackles the challenge of imbalanced spectral data in materials science through a GAN-based data augmentation method; it jointly optimizes the GAN's generator and a classifier, allowing for the creation of synthetic samples that are both realistic and effectively distinguishable across material phases [46]. By employing transfer learning and domain adaptation techniques, specifically Maximum Mean Discrepancy (MMD), another approach addresses imbalanced data in detecting Return-Oriented Programming (ROP) attacks: it leverages balanced data from a source domain to train a model while minimizing the MMD to align the distributions of the source and target domains [47]. The integration of a Genetic Algorithm (GA) with a Support Vector Machine (SVM) seeks to optimize SVM parameters while using targeted sampling techniques to manage class imbalance effectively [48].
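A generic sketch of this pattern, combining a data-level sampler with an algorithm-level adjustment in one imbalanced-learn pipeline (the components are illustrative, not those of the cited studies):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)

# Resample first (data level), then train a class-weighted learner
# (algorithm level). imblearn's Pipeline applies the sampler only
# during fit, never at prediction time.
hybrid = Pipeline(steps=[
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
]).fit(X, y)
```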

2) Algorithmic Ensemble with Data-Level Techniques

Integrating algorithmic ensembles with data-level techniques involves generating multiple models from diverse resampled versions of the dataset and consolidating their predictions. SMOTEBoost and RUSBoost exemplify these techniques. SMOTEBoost combines the SMOTE algorithm with boosting to enhance predictions on imbalanced datasets; it overcomes the constraints of traditional boosting algorithms such as AdaBoost by creating synthetic examples for the minority class and adjusting the training distribution to focus on these instances [49]. RUSBoost combines Random UnderSampling (RUS) with boosting, undersampling the majority class and subsequently boosting the classifier on the balanced dataset [50]. Integrating DB-SLSMOTE with Random Forest tackles class imbalance by augmenting the minority class with synthetic samples generated from the density distribution; Random Forest then harnesses this balanced dataset to construct a resilient ensemble model, enhancing classification effectiveness [51]. Before applying the MCNN-LSTM model, the Tomek Links technique is utilized as an undersampling method to manage imbalanced data.

The MCNN-LSTM model is a hybrid framework in which Convolutional Neural Networks (CNNs) capture local features and patterns in text data, while Long Short-Term Memory (LSTM) networks are tailored for text classification tasks; the combination is instrumental in situations with imbalanced data, addressing the inherent challenges of imbalanced text classification [52]. Integrating oversampling techniques with ensemble deep learning models involves merging modified oversampling methods like Distance-based SMOTE (D-SMOTE) and Bi-phasic SMOTE (BP-SMOTE) with Stacked CNNs and Stacked RNNs, enhancing predictive accuracy and robustness in handling imbalanced datasets [4]. In [53], the authors combined Deep Convolutional Generative Adversarial Networks (DCGAN) for generating synthetic samples with Convolutional Neural Networks (CNN) for classification and feature extraction. The research in [54] integrates resampling techniques like SMOTE and undersampling (US) to rebalance class distributions; moreover, it employs Particle Swarm Optimization (PSO) for attribute selection to improve sensitivity and reduce data dimensionality, while MetaCost is utilized as an algorithm-level approach to address class imbalance effectively. Another investigation employs a blend of undersampling using Tomek Links, clustering via BIRCH, and oversampling through Borderline-SMOTE to address imbalance in credit card transaction datasets: by removing noise with Tomek Links, clustering the data with BIRCH, and generating synthetic instances for the underrepresented class through Borderline-SMOTE, this approach seeks to equalize the dataset effectively [7]. The ATOMIC approach represents an automated machine-learning method explicitly designed for imbalanced classification tasks; it addresses the challenges posed by imbalanced datasets through a combination of algorithmic ensembling, which optimizes the selection of learning algorithms, and data-level techniques, which optimize resampling strategies and hyperparameters [55]. A further study examines the challenges of classifying minority classes in imbalanced datasets, with a focus on cerebral stroke prediction and bankruptcy risk in financial data. By evaluating the performance of various machine learning algorithms, the research underscores the limitations of traditional resampling methods like SMOTE in clinical contexts and highlights the critical role of understanding dataset characteristics, as these factors greatly impact the effectiveness of predictive models [56].
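The common thread of these methods (train several learners on differently resampled views of the data and combine their outputs) can be sketched in a few lines; this is a simplified stand-in, not SMOTEBoost or RUSBoost themselves:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)

# Train one model per independently undersampled (balanced) view of the
# data, then average the predicted probabilities across the ensemble.
models = []
for seed in range(10):
    X_bal, y_bal = RandomUnderSampler(random_state=seed).fit_resample(X, y)
    models.append(DecisionTreeClassifier(random_state=seed).fit(X_bal, y_bal))

proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
```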

3) Real-World Applications of Hybrid Approaches

Hybrid approaches have effectively addressed class imbalance across various real-world domains. Enhancing model performance and reliability has become essential in critical fields such as healthcare, fraud detection, cybersecurity, materials science, telecommunications, and others. Each application utilizes hybrid techniques to tackle the unique challenges posed by imbalanced datasets, improving predictive accuracy and decision-making capabilities. Table 4 shows the applications and datasets used in each study discussed.

TABLE 4. Applications of Hybrid Approaches

In healthcare, techniques such as D-SMOTE and BP-SMOTE with Stacked CNN and RNN enhance predictive analytics for the early diagnosis of critical conditions like cardiovascular diseases and breast cancer [4]. Similarly, E-GAN improves disease detection accuracy for conditions such as breast cancer, diabetes, and chronic kidney disease by addressing imbalanced datasets effectively [11]. The DB-SLSMOTE with Random Forest method proves particularly valuable in detecting rare diseases by enhancing classification accuracy in datasets with limited positive cases [51]. DCGAN and CNN generate synthetic samples for medical imaging to mitigate class imbalances, supporting the accurate diagnosis of diseases like malaria [53]. Additionally, the integrated approach of SMOTE, Undersampling, PSO, and MetaCost optimizes the classification of underrepresented medical data, aiding healthcare professionals in making more informed decisions [54].

Fraud detection leverages methods like DB-SLSMOTE with Random Forest, enhancing models’ reliability for identifying fraudulent financial transactions [51]. The Tomek Links + BIRCH Clustering + Borderline SMOTE technique further improves fraud detection accuracy by addressing imbalances in transaction datasets [7]. Additionally, the ATOMIC Method automates the creation of machine learning solutions designed for imbalanced data, facilitating the detection of fraudulent activities in financial systems [55].

In cybersecurity, Transfer Learning with Domain Adaptation using MMD is utilized to enhance the detection of Return-Oriented Programming (ROP) attacks, a complex exploit technique [47]. This approach leverages transfer learning to address dataset imbalances, improving the accuracy and reliability of deep learning models for identifying such attacks.

In materials science, the GAN-Based Data Augmentation Method tackles the classification of material phases in imbalanced spectral datasets. Generating synthetic samples supports experimental design and materials characterization, as demonstrated in case studies involving hydrogels such as Pluronic F-127 and Alpha-Cyclodextrin [46].

In automated text classification, the Tomek Links Before MCNN-LSTM method categorizes news articles into diverse topics such as Politics, Sports, and Lifestyle [52]. This approach enhances content organization and retrieval for media organizations, ensuring improved representation and handling of underrepresented categories.

For telecommunications, the Genetic Algorithm with SVM improves user classification in systems such as Non-Orthogonal Multiple Access (NOMA) networks. By addressing class imbalances, this method ensures efficient resource allocation and optimized communication management [48].

The ATOMIC Method is applicable across various domains, such as anomaly detection, healthcare diagnostics, fraud detection, and credit scoring. Automating model optimization for imbalanced data simplifies analytical processes, enhancing decision-making and resource allocation in these critical areas [55].

The study presenting a Hybrid Ensemble focuses on cerebral stroke prediction by enhancing the reliability of machine learning models and assessing the effectiveness of SMOTE in clinical datasets [56].

SECTION V.

Discussion of the Limitations

This section justifies the limitations of each technique, beginning at the data level. SMOTE-Tomek and SMOTE-ENN face difficulties with datasets featuring high class overlap or noise: SMOTE-generated synthetic samples may resemble majority instances, causing classifier confusion, and the cleaning steps can unintentionally remove valuable minority samples or miss noisy majority instances, reducing effectiveness [26]. Distance-based SMOTE (D-SMOTE) requires high computational resources for distance calculations in high-dimensional spaces, leading to longer processing times; the "curse of dimensionality" in such datasets can diminish the relevance of distance metrics, affecting synthetic sample quality and efficiency [4]. Bi-phasic SMOTE (BP-SMOTE) relies on an iterative process with multiple SMOTE applications and instance-selection phases, which can be resource-intensive; as dataset size grows, processing time and resource demands increase, reducing scalability for real-world applications [4]. The effectiveness of class-decomposition SMOTE (CDSMOTE) relies on accurately decomposing the majority class into subclasses; inaccurate decomposition can produce poorly defined subclasses, diminishing the impact of oversampling and potentially biasing the model toward the majority class [27]. The performance of Radius-SMOTE (RSMOTE) depends heavily on correctly tuning parameters like the radius distance used to define boundaries for synthetic sample generation; poor tuning can cause excessive overlap with majority samples or inadequate minority sample generation, and dataset size and complexity increase computational costs, limiting scalability [28]. Self-Inspected Adaptive SMOTE (SASMOTE) requires precise hyperparameter tuning for optimal results, and its design around specific case studies limits generalizability across healthcare applications, suggesting a need for adaptation to broader contexts [29]. Borderline-SMOTE generates synthetic samples near decision boundaries, which may overlap with the majority class and create ambiguous regions; this overlap can confuse the classifier and reduce generalization performance, particularly with poorly separated classes [30]. Oriented Oversampling with Spatial Information (OOSI) may face runtime challenges on complex datasets due to high dimensionality and intricate distributions, as its adaptive spatial partitioning requires intensive computation, affecting scalability and efficiency, especially with noisy datasets [31]. If temporal dependencies between samples are not accurately captured, the Synthetic and Dependent Wild Bootstrapped Oversampling method (SDWBOTE) may carry over noise and bias from the original dataset, potentially resulting in poor real-world classifier performance and misrepresentation of minority classes [8]. GAN-based data augmentation demands substantial computational resources for training due to the complexity of adversarial optimization between the generator and discriminator; additionally, GANs are susceptible to mode collapse, where the generator fails to cover the entire data space, limiting sample diversity and augmentation quality [6]. The limitations of the Active Balancing Mechanism (ABM) are the risk that traditional undersampling discards valuable minority-class information and the reliance on validation with a single dataset, which restricts its generalizability to diverse clinical scenarios [32].

At the algorithm level, for SVM with Multiple Kernel Learning (MKL), optimizing multiple kernel functions increases computational demands, potentially limiting scalability in real-time applications; extensive parameter tuning is also needed for optimal results, which can be time-consuming and may reduce model interpretability, a key factor in fields like healthcare [38]. AdaBoost's emphasis on misclassified instances can make it highly sensitive to noisy data, increasing the risk of overfitting: in the presence of outliers or noise, the algorithm may over-focus on these points, reducing its ability to generalize effectively to unseen data [42]. Bagging is computationally intensive, as it trains one model per bootstrap sample; additionally, it may not adequately address class imbalance on its own, since bootstrap sampling can preserve the original imbalance, resulting in poor performance on minority classes [43]. Random forests may show bias toward the majority class on imbalanced datasets, as training often prioritizes majority-class accuracy; the model also requires careful hyperparameter tuning to prevent increased bias or overfitting, making balanced performance across classes challenging to achieve [44]. With additional trees, Gradient Boosting Decision Trees (GBDT) can become overly fitted to the training data, capturing noise and outliers instead of actual patterns; effective regularization is crucial to prevent performance loss on unseen data [45]. Stacked deep learning models are complex due to the integration of multiple architectures, raising the risk of overfitting; this complexity may lead the model to capture noise instead of general patterns, resulting in poor generalization, particularly with smaller or less diverse datasets [4].

At the hybrid level, in D-SMOTE and BP-SMOTE with Stacked CNNs and RNNs, the complexity of combining multiple deep learning architectures increases the risk of overfitting and complicates generalization; interpretability is also reduced, making it challenging to understand model decisions, especially in sensitive fields like healthcare [4]. The combined E-GAN and CNN model demands substantial computational resources and time, particularly for large datasets, and synthetic samples may lead to overfitting if they do not accurately reflect the minority-class distribution [11]. In DB-SLSMOTE with Random Forest, generating synthetic samples adds to training complexity and the risk of overfitting, with Random Forest training being especially resource-intensive for large datasets [51]. For Tomek Links before MCNN-LSTM, the model's dependence on a specific dataset (Indonesian news) limits its generalizability, and the lack of transfer learning prevents it from utilizing larger datasets to enhance performance [52]. DCGAN with CNN requires substantial computational resources for adversarial training, and synthetic samples may lead to overfitting, affecting the model's generalization on unseen data [53]. In SMOTE + US + PSO + MetaCost, concentrating on specific methods may restrict broader insights into alternative approaches; furthermore, conclusions may lack generalizability if the datasets do not include diverse medical characteristics [54]. The complexity and parameter sensitivity of Tomek Links + BIRCH Clustering + Borderline-SMOTE necessitate careful tuning, and its sensitivity to noise may leave residual noise, impacting model accuracy [7]. The ATOMIC method (meta-learning) could improve its handling of imbalanced data through a broader exploration of hyperparameters and algorithms, enhancing its performance and adaptability across different datasets [55]. For the Genetic Algorithm with SVM, the iterative nature of genetic algorithms results in high computational costs and a risk of overfitting if hyperparameters, such as population size and mutation rate, are not carefully optimized [48]. For transfer learning with MMD, high-quality source data is crucial, and limited validation data can reduce detection effectiveness; careful model selection is essential to ensure consistent results [47]. GAN-based data augmentation with joint optimization is computationally intensive due to the dual optimization between the generator and classifier, and it struggles with high phase similarity, which limits distinct sample generation and effective class separation [46]. In the hybrid ensemble approach, classifiers face challenges in accurately predicting minority classes in medical datasets, and SMOTE's theoretical validation may fail to align with real-world clinical applications [56].

SECTION VI.

Evaluation Methods

When handling imbalanced datasets in machine learning, choosing the appropriate evaluation metrics is crucial for precisely assessing model performance. This section explores various evaluation techniques particularly beneficial for addressing imbalanced data. It highlights the advantages of each method and its effectiveness in evaluating the model’s performance, particularly concerning accurately predicting outcomes for the minority class.

Accuracy is a fundamental assessment criterion in machine learning and data mining. However, accuracy can be misleading when used with an imbalanced dataset: a model may still achieve high overall accuracy even if its classification performance on minority categories is poor, as long as it performs well on the majority categories. For example, if 99% of the testing data are negative samples, a model can achieve 99% accuracy simply by classifying all the testing data as negative. Accuracy therefore cannot be chosen as an evaluation index in imbalanced learning. The evaluation indicators relating to imbalanced learning are shown in Table 5.
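The 99% example can be reproduced in a few lines; a small sketch using scikit-learn metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 negative samples and 10 positive samples.
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # a "model" that always predicts negative

print(accuracy_score(y_true, y_pred))  # 0.99 -> looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -> every positive case is missed
```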

TABLE 5. Evaluation Metrics for Imbalance Classification

In Table 5, most evaluation metrics are derived from the confusion matrix (CM), a critical tool for visually representing an algorithm's performance. It is particularly important when dealing with imbalanced datasets, as it delineates the count of accurate and inaccurate predictions for each class. This detailed breakdown is crucial for understanding the model's effectiveness across both the predominant and underrepresented classes. The main components of the confusion matrix are:

  • True Positives (TP): positive observations accurately identified as positive.

  • True Negatives (TN): negative observations accurately identified as negative.

  • False Positives (FP): negative observations incorrectly identified as positive.

  • False Negatives (FN): positive observations incorrectly identified as negative.

Metrics like AUC and G-mean are commonly used because they remain largely unaffected by class distribution imbalances. AUC is based on the entire ROC curve, while the G-mean incorporates different parts of the confusion matrix, ensuring a more balanced evaluation of model performance. This makes them suitable for situations with large differences in the number of positive and negative class samples. The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are valuable for assessing the quality of classifier outputs; they are particularly adept at evaluating performance across different threshold settings, offering robustness against class imbalance. The ROC curve graphs the True Positive Rate (TPR) against the False Positive Rate (FPR) across different threshold configurations, and the AUC condenses the ROC curve's insights into a single value: the likelihood that a classifier will rank a randomly chosen positive instance above a randomly chosen negative one.

The geometric mean (G-mean), computed as the square root of the product of recall and specificity, guarantees that enhancements in one class's efficiency do not adversely impact the other. This balance is crucial for effectively evaluating models where the minority class, often of higher interest in imbalanced datasets, must not be overlooked. For multiclass imbalance problems, the G-mean is often preferred, as it offers a unified measurement approach and eliminates the need to assess each class separately. For highly imbalanced Big Data, the area under the precision-recall curve (AUPRC) is a more effective metric for evaluating classifier performance: in such settings, the AUC metric fails to capture the information about precision scores and false-positive counts that the AUPRC reveals. The F1 Score denotes the harmonic mean of precision and recall, which is valuable in situations requiring a balance between the two and is common in datasets with imbalanced class distributions; it provides more context than accuracy when the class distribution is uneven. In these formulas, precision denotes the fraction of predicted positives that are truly positive [57], [58].
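The metrics above are all available in scikit-learn and imbalanced-learn; a short sketch computing them for a toy imbalanced problem (the dataset and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # scores for threshold-free metrics
pred = clf.predict(X_te)               # hard labels for F1 and G-mean

print("F1     :", f1_score(y_te, pred))                  # harmonic mean of precision/recall
print("G-mean :", geometric_mean_score(y_te, pred))      # sqrt(recall * specificity)
print("AUROC  :", roc_auc_score(y_te, proba))            # area under the ROC curve
print("AUPRC  :", average_precision_score(y_te, proba))  # area under the PR curve
```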

Considering these evaluation methods ensures a thorough understanding of a model’s performance on imbalanced datasets, guiding the development of efficient and equitable models.

SECTION VII.

Conclusion

This paper highlights the crucial importance of addressing class imbalance in machine learning initiatives. It begins by discussing basic strategies for balancing class distribution, which paves the way for a comprehensive exploration of techniques categorized into data-level, algorithm-level, and hybrid strategies. Additionally, this work examines the limitations inherent in each technique, providing justifications for their shortcomings, to offer a nuanced understanding of their practical challenges and opportunities for improvement. Subsequently, it underscores the importance of evaluation methods in assessing the efficacy of these strategies under imbalanced data conditions, examining metrics like F1 Score, AUC, and G-mean, among others. These metrics are vital for evaluating how various techniques fare, especially in accurately predicting outcomes for the minority class.

Recent studies have identified several gaps in addressing imbalanced data across data-centric, algorithmic, and blended approaches. At the data level, there is a pressing need for scalable techniques that can manage large imbalanced datasets and maintain their effectiveness across different domains. Algorithm-level challenges revolve around strengthening the resilience of methods against the evolving threats inherent in imbalanced data scenarios. Meanwhile, hybrid approaches face scalability, generalizability, and robustness issues, highlighting the necessity for methods that can effectively scale, function across diverse settings, and resist evolving threats in imbalanced data contexts.

These challenges underline the continuous push to develop more effective and flexible imbalanced data classification methods. Moreover, the balance between model complexity and generalization remains a significant hurdle, emphasizing the need for ongoing research. Understanding the strengths and limitations of each approach, including essential evaluation methods, equips researchers to develop machine-learning models that are effective, robust, and ready for real-world application. These models are purposefully crafted to manage the intricacies linked with data imbalance adeptly.
