Detecting Frauds and Payment Defaults on Credit Card Data Inherited With Imbalanced Class Distribution and Overlapping Class Problems: A Systematic Review

Credit cards are a popular e-payment option alongside cash. Recent reports show that credit card fraud and payment defaults are rising at an alarming rate each year. Researchers have therefore applied various machine learning techniques to address these two challenges. However, they struggle to mitigate two major problems inherent in credit card data: (i) imbalanced class distribution and (ii) overlapping classes. Mitigating these problems should make the detection of credit card fraud and payment defaults more effective, benefiting both card issuers and holders. Hence, this paper presents a systematic review, conducted using PRISMA, that identifies and compares credit card datasets, machine learning techniques, and evaluation metrics. Subsequently, we provide recommendations for handling these two problems. We extracted research papers published from 2016 to 2023 from the ScienceDirect, Springer, Association for Computing Machinery (ACM), and IEEE databases. Papers were included if they were written in English and published in peer-reviewed and indexed journals or conference proceedings. In total, 87 papers were selected based on the eligibility criteria. Our findings show that the European and Taiwan datasets are widely used in the research community. However, most researchers focus on tackling imbalanced class distribution rather than the two problems together. We recommend deep learning, ensemble learning, and sampling methods for effectively detecting fraud and payment defaults on credit card datasets that exhibit both problems. For evaluating machine learning algorithms, we recommend metrics that separately measure performance in detecting frauds/payment defaults and normal transactions.


I. INTRODUCTION
Fraud and payment defaults on credit cards are the two main challenges issuers face. The mainstream media reported that the COVID-19 pandemic triggered a rise in card fraud cases as it accelerated the digitalization of economies [1], [2]. Fraudsters exploit such situations to carry out fraudulent activities, and card fraud has already caused great losses globally.
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
According to the Nilson Newsletter of December 2022, the losses were approximately $32.34 billion worldwide in 2021 [3].
In Malaysia, the use of credit cards for e-payment transactions increased by 28.7% from 2021 to 2022 [4]. According to a news report, Malaysian credit card holders had yet to pay RM35.89 billion in outstanding balances as of July 2022 [5]. Credit card holders who fail to repay (payment default) may face interest charges for late payments, collection calls and visits, black marks on their credit reports, and even bankruptcy [6].
Credit card companies such as Visa are reportedly looking into artificial intelligence (AI) technology to detect fraud cases. Financial institutions also train machine learning models on financial data to identify credit card holders prone to payment default based on their repayment delinquency.
Researchers have also made great efforts to deal with credit card fraud and payment defaults. However, they often face difficulties with credit card data, as it usually suffers from imbalanced and overlapping class problems. The imbalanced class problem occurs when the number of samples in the majority class (normal transactions) is much greater than in the minority class (frauds/payment defaults). The data become even more complicated when the minority class samples have characteristics similar to those of the majority class samples, which causes the problem of overlapping classes. As a result, general machine learning algorithms learn the minority class ineffectively. Further, the algorithms become biased towards the majority class in order to maximize overall detection accuracy. Thus, the algorithms produce low detection rates for credit card fraud and payment defaults.
In this study, we aim to review credit card datasets, their inherent problems, the machine learning techniques used to tackle these problems, and the evaluation metrics used to measure the techniques in credit card research. We recommend effective machine learning techniques for handling imbalanced class distribution and overlapping class problems in credit card datasets, as well as suitable evaluation metrics for evaluating the techniques' performance.

II. RESEARCH QUESTIONS
The systematic review addresses the following research questions, which we answer in Section IV (Results).
1. Which credit card datasets are widely used for fraud and payment default detection in the research community?
2. How frequently are the two problems, imbalanced class distribution and overlapping classes, being addressed in credit card research?
3. Which machine learning techniques have been applied to address these two problems?
4. What metrics are appropriate to evaluate the performance of the machine learning techniques?
5. What is the highest detection result achieved so far?
6. Which machine learning techniques produce promising detection results?

III. METHODS
This systematic review was conducted based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [7]. PRISMA is the preferred reporting guideline for systematic reviews in many research communities.

A. ELIGIBILITY CRITERIA
For a paper to be included in this systematic review, it must report machine learning techniques for tackling the imbalanced class distribution and overlapping class problems in credit card datasets. The paper must report original research related to credit cards, irrespective of the maturity level of the published work. It must be written in English and published in a peer-reviewed and indexed journal or conference proceedings.

B. INFORMATION SOURCES
We retrieved papers from the following online bibliographic databases: ScienceDirect, Springer, ACM, and IEEE. Fig. 1 presents the PRISMA process for selecting research papers to answer the research questions.

D. STUDY SELECTION
In the first screening, we selected papers from peer-reviewed journals or conference proceedings related to the credit card domain. We checked the titles and abstracts of the papers and excluded papers without titles or abstracts and those not written in English. We also excluded papers that did not address the problems of imbalanced class distribution and overlapping classes. In the second screening, we read the full-text papers and excluded those that did not provide experimental results. After screening, we consolidated 87 papers for this systematic review and listed them in Appendix A.

E. DATA EXTRACTION
In this phase, we extracted six variables to answer the six research questions (refer to Section II). The variables include the widely used credit card datasets, the dataset problems, the machine learning techniques used to address the problems, the evaluation metrics used to measure the techniques' performance, the detection results, and the effective machine learning techniques for detection.

IV. RESULTS
To answer the research questions, we summarized and presented the variables in the following figures and tables.

A. THE WIDELY USED CREDIT CARD DATASETS
We reviewed the credit card fraud and payment default datasets employed by these 87 papers. A paper may report the use of more than one dataset. Fig. 2 shows the usage frequency of each dataset. The datasets are described in Table 1.

B. THE DATASETS' PROBLEMS
Dealing with credit card datasets is challenging for many researchers because of the two inherent dataset problems, i.e., imbalanced class distribution and overlapping class samples. Fig. 3 shows how frequently these problems are addressed in the selected papers. Based on the analysis, all the papers address the imbalanced class distribution problem. However, only 15 papers address the overlapping class problem together with the imbalanced class distribution problem. None address the overlapping class problem alone.
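As an illustration of how the two problems might be quantified, the following Python sketch computes an imbalance ratio and a simple nearest-neighbour estimate of class overlap. It is not taken from the reviewed papers: the function names, the toy data, and the overlap heuristic (a minority sample counts as overlapped when most of its k nearest neighbours belong to another class) are assumptions made only for this example.

```python
import math
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def overlap_fraction(features, labels, minority=1, k=3):
    """Return the fraction of minority samples whose k nearest
    neighbours mostly belong to another class (illustrative heuristic)."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    overlapped = 0
    for i in minority_idx:
        # Sort every other sample by Euclidean distance to sample i.
        neighbours = sorted(
            (math.dist(features[i], features[j]), labels[j])
            for j in range(len(features))
            if j != i
        )[:k]
        foreign = sum(1 for _, y in neighbours if y != minority)
        if foreign > k / 2:
            overlapped += 1
    return overlapped / len(minority_idx)


# Hypothetical toy data: 8 normal transactions (label 0) near the origin,
# 3 frauds (label 1) in a separate cluster, and 1 fraud inside the normal region.
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
     (0.1, 0.0), (0.0, 0.1), (0.2, 0.1), (0.1, 0.2),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (0.1, 0.1)]
y = [0] * 8 + [1] * 4

ratio = imbalance_ratio(y)        # 8 normal samples / 4 fraud samples
overlap = overlap_fraction(X, y)  # share of frauds surrounded by normal samples
print(ratio, overlap)
```

The imbalance ratio needs only the class counts, whereas the overlap estimate has to inspect the feature space.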

C. MACHINE LEARNING TECHNIQUES USED
Researchers used various machine learning techniques to address the two problems. Fig. 4 shows how frequently these techniques were used for handling the two problems. Note that researchers usually used multiple techniques in the same study. Thus, the total frequency of techniques used exceeds 87, the total number of papers we reviewed. Fig. 4 shows that the techniques most frequently used to address only the imbalanced class distribution problem are ensemble learning, deep learning, and sampling methods, with frequencies of 31, 26, and 25, respectively. On the other hand, sampling methods are the most frequently used techniques for addressing the imbalanced class distribution and overlapping class problems together, with a frequency of 8, followed by deep learning and ensemble learning, with frequencies of 7 and 6, respectively. Table 2 summarizes the techniques used in the selected papers.

D. EVALUATION METRICS
All papers used multiple evaluation metrics. Fig. 5 shows the frequency of the evaluation metrics used in the selected papers.

E. DETECTION RESULTS
We reviewed 87 papers; their details are provided in Appendix A. We then picked the top five research works from the appendix for each of the two popular datasets and describe them in the following subsection.

F. THE EFFECTIVE MACHINE LEARNING TECHNIQUES
Appendix B summarizes the top five research works that used the two widely used datasets: the European and Taiwan datasets. Our analysis shows that the most frequently used evaluation metric is the True Positive Rate (TPR/Recall/Sensitivity) (Fig. 5). Hence, we use this metric to compare the researchers' works and identify effective machine learning techniques for solving the two dataset problems: imbalanced class distribution and overlapping classes.
The research work that achieved the best fraud detection result on the European dataset is [93]. The researchers obtained the highest result on all evaluation measurements using deep learning and neural networks. Researchers [92] and [88] used sampling methods with ensemble learning and deep learning, respectively, and both achieved a 100% True Positive Rate (TPR). Researchers [23] and [63], who used neural networks, are not far behind, with TPRs of 99.98% and 99.6%, respectively.
As for the Taiwan dataset, [92] obtained the highest TPR of 98.09% using sampling methods and ensemble learning for detecting payment defaults. It is followed by [88], where the researchers used sampling methods and deep learning to achieve a TPR of 93%. Researchers [20] and [63] used sampling methods and ensemble learning. Reference [63] implemented a neural network and obtained a TPR of 92.4%, compared with 89.9% for [20]. In [61], the researchers used a linear classifier model with a sampling method and obtained a TPR of 83.6%.

V. DISCUSSION
We now discuss the answers to the questions set in Section II. The most widely used credit card fraud dataset is the European dataset. For credit card payment default, the Taiwan dataset is the most widely used. These datasets are popular because they are publicly available and have already been adopted by many researchers.
From our findings, the imbalanced class distribution problem is widely studied by researchers, whereas the overlapping class problem is less explicit. An imbalanced class distribution can easily be identified from the number of samples in each class. The overlapping class problem, on the other hand, requires a thorough dataset analysis, and its complexity may vary across datasets. The next two paragraphs discuss why researchers widely use specific machine learning techniques to tackle the two problems.
Deep learning and neural networks can handle highly imbalanced datasets. A study by [100] shows that neural networks are capable of classifying highly imbalanced data by taking into account the direction of the unit gradient for both majority and minority classes. In addition, a local boundary expansion strategy is used to address the minority class's inadequate empirical representation. Even without considering the unit gradient direction of the majority and minority classes, coupling deep learning and neural networks with sampling methods also handles imbalanced datasets well. Sampling methods, i.e., under-sampling and over-sampling, work by balancing a dataset's class distribution. In particular, over-sampling the minority class makes its decision boundary clearer. Under-sampling, on the other hand, reduces the overwhelming effect of the majority class. As a result, general learning algorithms can learn the minority class more easily than before sampling methods are applied to credit card datasets.
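The balancing effect of sampling can be sketched in a few lines of Python. The example below shows only the simplest variants, random over-sampling (duplicating minority samples) and random under-sampling (discarding majority samples), for binary labels. The function names, the toy data, and the fixed seed are assumptions made for illustration; the reviewed papers may use more elaborate sampling methods.

```python
import random
from collections import Counter


def random_oversample(samples, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority samples until both classes are equal in size."""
    rng = random.Random(seed)
    minority_rows = [s for s, y in zip(samples, labels) if y == minority]
    n_majority = sum(1 for y in labels if y != minority)
    extra = [rng.choice(minority_rows)
             for _ in range(n_majority - len(minority_rows))]
    return samples + extra, labels + [minority] * len(extra)


def random_undersample(samples, labels, minority=1, seed=0):
    """Keep all minority samples and an equally sized random subset of the majority class."""
    rng = random.Random(seed)
    minority_rows = [s for s, y in zip(samples, labels) if y == minority]
    majority_rows = [(s, y) for s, y in zip(samples, labels) if y != minority]
    kept = rng.sample(majority_rows, len(minority_rows))
    new_samples = minority_rows + [s for s, _ in kept]
    new_labels = [minority] * len(minority_rows) + [y for _, y in kept]
    return new_samples, new_labels


# Hypothetical toy data: 8 normal transactions (0) and 2 frauds (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

over_X, over_y = random_oversample(X, y)
under_X, under_y = random_undersample(X, y)
print(Counter(over_y), Counter(under_y))
```

Both functions return a balanced dataset. Over-sampling keeps every original sample at the cost of duplicates, while under-sampling discards majority samples and the information they carry.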
Ensemble learning, i.e., boosting, bagging, and stacking, works by training multiple machine learning algorithms and combining their detection outcomes. As single learning algorithms are usually limited and make errors, ensemble learning can improve detection by taking advantage of the strengths of different learning algorithms [101]. Further, ensemble learning can easily be coupled with sampling methods, can improve detection through a new aggregating strategy, or can be integrated with another ensemble learning method to form hybrid learning strategies [102].
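The idea of combining detection outcomes can be illustrated with hard majority voting, one of the simplest aggregation strategies. The Python sketch below is a minimal example under stated assumptions: the three rule-based base classifiers, their thresholds, and the transaction fields (amount, hour, foreign) are hypothetical stand-ins for trained models and are not drawn from the reviewed papers or datasets.

```python
from collections import Counter


def majority_vote(classifiers, sample):
    """Return the label predicted by most base classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical base rules; each flags a transaction (1 = fraud) on a different cue.
def by_amount(tx):
    return 1 if tx["amount"] > 1000 else 0


def by_hour(tx):
    return 1 if tx["hour"] < 5 else 0


def by_country(tx):
    return 1 if tx["foreign"] else 0


ensemble = [by_amount, by_hour, by_country]

tx_a = {"amount": 2500, "hour": 3, "foreign": False}  # two of three rules flag it
tx_b = {"amount": 40, "hour": 14, "foreign": True}    # only one rule flags it

label_a = majority_vote(ensemble, tx_a)
label_b = majority_vote(ensemble, tx_b)
print(label_a, label_b)
```

With an odd number of base classifiers and binary labels, the vote cannot tie. A single rule that misfires, as for the second transaction, is outvoted by the other two.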
We further discuss which evaluation metrics are appropriate to use. TPR (Recall or Sensitivity), AUC, Precision and F1 are suitable for imbalanced credit card datasets. TPR is the most widely used metric because it measures the detection performance of a machine learning technique on the minority class (frauds/payment defaults) separately from the majority class (normal transactions). More importantly, calculating TPR takes false negatives into account, i.e., frauds/payment defaults wrongly classified as normal transactions. Even though Accuracy is a very common metric, it is not suitable as the sole evaluation metric for imbalanced datasets because it reflects the overall performance of a machine learning technique across all dataset classes. The detection performance for frauds and payment defaults is hidden in Accuracy, which may be dominated by the high detection rate for normal transactions. Researchers favor machine learning techniques that give high detection rates for credit card fraud and payment defaults.
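The contrast between Accuracy and TPR can be made concrete with a small worked example. The Python sketch below is illustrative only: the class counts (990 normal transactions and 10 frauds) and the two sets of predictions are hypothetical and are not results from the reviewed papers. The metrics follow the standard confusion-matrix definitions, e.g., TPR = TP / (TP + FN).

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute Accuracy, TPR, TNR, Precision and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall / sensitivity
    tnr = tn / (tn + fp) if tn + fp else 0.0        # specificity
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "tpr": tpr, "tnr": tnr,
            "precision": precision, "f1": f1}


# Hypothetical imbalanced test set: 990 normal transactions (0), 10 frauds (1).
y_true = [0] * 990 + [1] * 10

# Model A labels every transaction as normal.
pred_a = [0] * 1000
# Model B raises 20 false alarms but catches 8 of the 10 frauds.
pred_b = [0] * 970 + [1] * 20 + [1] * 8 + [0] * 2

naive = evaluate(y_true, pred_a)
detector = evaluate(y_true, pred_b)
print(naive)
print(detector)
```

Model A reaches an Accuracy of 0.99 with a TPR of 0, while Model B has a lower Accuracy (0.978) but a TPR of 0.8. Judged by Accuracy alone, the model that detects no fraud would be preferred.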

VI. CONCLUSION
Imbalanced class distribution and overlapping class samples are the two main problems of credit card datasets. General machine learning algorithms may not be able to effectively detect the important minority classes (fraud and payment default), as the algorithms are usually biased towards the majority class samples (normal transactions). In addition, the overlapping class problem further affects the performance of the machine learning algorithms.
We analyzed the two widely used credit card datasets, European and Taiwan, and identified effective machine learning techniques for tackling the two problems so that frauds and payment defaults can be detected effectively. Based on our analysis and review, the three most common effective techniques are (i) deep learning and neural networks, (ii) ensemble learning, and (iii) sampling methods. We recommend these techniques for credit card fraud and payment default datasets, and also for future research dealing with other types of credit card data that exhibit the two problems.
The most common metric for evaluating the performance of learning algorithms is TPR (Recall or Sensitivity), followed by Accuracy, AUC, Precision and F1. However, since credit card datasets are usually imbalanced in their class distribution, using only Accuracy as the evaluation metric is inappropriate. If Accuracy is used to describe the performance of a machine learning algorithm on these imbalanced datasets, it should be complemented by the other four evaluation metrics (TPR, AUC, Precision and F1).
To conclude, this systematic review can become a reference for the datasets, machine learning techniques, and evaluation metrics related to future credit card research dealing with imbalanced class distribution and overlapping class problems.

FIGURE 1. The PRISMA process for paper collection and analysis.

FIGURE 2. Frequency of the datasets used in the selected studies.

FIGURE 3. Frequency of the problems being addressed in the selected papers.

FIGURE 4. Frequency of techniques used for handling the two problems.

FIGURE 5. Frequency of the evaluation metrics used in the selected papers.

TABLE 1. Summary of datasets used in the selected papers.

TABLE 2. Summary of techniques used in the selected papers.

TABLE 3. Summary of evaluation metrics used in the selected papers.

TABLE 3. (Continued) Summary of evaluation metrics used in the selected papers.

TABLE 5. The high detection rates achieved by the researchers' work using the popular credit card datasets. The numbers in bold show the highest TPR achieved by research work utilizing a particular dataset.