
Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results



Abstract:

Data imbalance in Machine Learning refers to an unequal distribution of classes within a dataset. This issue arises mostly in classification tasks, where the distribution of classes or labels in a given dataset is not uniform. A straightforward way to address this problem is resampling: adding records to the minority class or deleting records from the majority class. In this paper, we have experimented with the two widely adopted resampling techniques: oversampling and undersampling. To explore both techniques, we chose a public imbalanced dataset from the Kaggle website, Santander Customer Transaction Prediction, and applied a group of well-known machine learning algorithms with the hyperparameters that give the best results for each resampling technique. One of the key findings of this paper is that oversampling performs better than undersampling across different classifiers and obtains higher scores on different evaluation metrics.
Date of Conference: 07-09 April 2020
Date Added to IEEE Xplore: 27 April 2020
Conference Location: Irbid, Jordan

I. Introduction

In machine learning and statistics, classification is defined as training a system on a labeled dataset so that it can assign a new, unseen instance to the class it belongs to. Recently, there has been enormous growth in data and, unfortunately, a lack of quality labeled data. Many traditional machine learning methods assume that the target classes have the same distribution. However, this assumption does not hold in several applications, for example weather forecasting [1], diagnosis of illnesses [2], and fraud detection [3], where nearly all instances are labeled with one class while only a few instances belong to the other class. As a result, models lean toward the majority class and neglect the minority class. This is reflected in model performance: such models perform poorly when the dataset is imbalanced. This is called the class imbalance problem. In such a situation, although good accuracy can be achieved, we do not obtain good enough scores on other evaluation metrics, such as precision, recall, F1-score [4], and ROC score.
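The two resampling strategies the paper compares can be sketched in a few lines. The following is a minimal illustration using only scikit-learn's `resample` utility on a toy 90/10 class split; the sample counts and feature dimensions are illustrative assumptions, not taken from the paper's Santander dataset or experimental setup.

```python
# Sketch of random oversampling and random undersampling on a toy
# imbalanced dataset (assumed 90 majority rows vs. 10 minority rows).
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy imbalanced data: 90 majority-class rows, 10 minority-class rows.
X_maj = rng.normal(0.0, 1.0, size=(90, 2))
X_min = rng.normal(2.0, 1.0, size=(10, 2))

# Oversampling: duplicate minority rows (sampling WITH replacement)
# until the minority class matches the majority class size.
X_min_over = resample(X_min, replace=True,
                      n_samples=len(X_maj), random_state=0)

# Undersampling: discard majority rows (sampling WITHOUT replacement)
# until the majority class matches the minority class size.
X_maj_under = resample(X_maj, replace=False,
                       n_samples=len(X_min), random_state=0)

print(len(X_min_over), len(X_maj_under))  # → 90 10
```

Either variant yields a balanced training set; the paper's finding is that the oversampled variant, which keeps all majority information, tends to score higher across classifiers and metrics.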
