A Deep Learning Ensemble With Data Resampling for Credit Card Fraud Detection

Credit cards play an essential role in today’s digital economy, and their usage has recently grown tremendously, accompanied by a corresponding increase in credit card fraud. Machine learning (ML) algorithms have been utilized for credit card fraud detection. However, the dynamic shopping patterns of credit card holders and the class imbalance problem have made it difficult for ML classifiers to achieve optimal performance. In order to solve this problem, this paper proposes a robust deep-learning approach that consists of long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks as base learners in a stacking ensemble framework, with a multilayer perceptron (MLP) as the meta-learner. Meanwhile, the hybrid synthetic minority oversampling technique and edited nearest neighbor (SMOTE-ENN) method is employed to balance the class distribution in the dataset. The experimental results showed that combining the proposed deep learning ensemble with the SMOTE-ENN method achieved a sensitivity and specificity of 1.000 and 0.997, respectively, which is superior to other widely used ML classifiers and methods in the literature.


I. INTRODUCTION
Information technology advancements have significantly impacted the financial sector, leading to the broad adoption of electronic commerce (e-commerce) platforms. The recent outbreak of the novel coronavirus (COVID-19) pandemic has further shown the need for a more digital world and expanded the e-commerce industry [1], [2]. One of the major issues associated with modern e-commerce is the high incidence of credit card fraud [3]. In the last decade, there has been an increase in credit card fraud, which is a huge burden on financial institutions [4]. The increased credit card fraud rate is associated with the expansion of e-commerce and increased online transactions. Therefore, credit card fraud detection (CCFD) is crucial for financial companies to avoid losses.
Artificial intelligence (AI) and machine learning applications in the financial sector can produce excellent results for companies, such as improved efficiency, reduced operational cost, and enhanced customer satisfaction [5]. Several ML-based systems have been developed to detect credit card fraud. For example, Malik et al. [6] studied the use of hybrid models in CCFD. The hybrid models were achieved by combining a variety of ML algorithms, including extreme gradient boosting (XGBoost), random forest, adaptive boosting (AdaBoost), and light gradient boosting machine (LGBM). The experimental results indicated that the hybrid model based on AdaBoost and LGBM obtained the best classification performance. In a similar research work, Alfaiz and Fati [7] conducted a performance evaluation of ML classifiers and data resampling techniques for detecting credit card fraud. The classifiers used in the study include LGBM, XGBoost, random forest, categorical boosting (CatBoost), logistic regression, and naïve Bayes. The results indicated that the CatBoost classifier integrated with a k-nearest neighbor-based undersampling technique performed better than the other methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro.
Meanwhile, building robust machine learning-based CCFD models has remained a challenge for two main reasons. Firstly, conventional classifiers make predictions based on the transaction details only, such as amount, transaction country, and transaction type, ignoring the sequence of transactions that defines the clients' shopping behaviour, which is useful in identifying appropriate fraud patterns [8], [9]. Secondly, credit card fraud datasets are highly imbalanced since genuine transactions significantly outnumber fraudulent transactions [10]. Imbalanced classification is a predictive modelling problem where there is an uneven distribution of samples across the classes [11]. The class that makes up a large proportion of the dataset is called the majority class, while the class with a smaller proportion is called the minority class. Imbalanced classification is a challenge because most ML algorithms were designed with the assumption of an even class distribution. Therefore, using imbalanced data such as the credit card dataset results in models with poor classification performance, especially for the minority class, i.e., fraudulent transactions. Furthermore, correctly identifying the minority class samples is of utmost importance in imbalanced classification problems [12].
Deep learning (DL) and ensemble learning have recently dominated the ML field [13], [14], [15], [16], achieving excellent prediction performances in complex problems, and they could be applied to solve the challenges in credit card fraud detection. Deep learning, a subset of machine learning, is mainly a neural network with multiple layers [17]. Deep learning models using recurrent neural networks (RNN) have been employed for different sequential modelling-based ML tasks [18], [19], [20]. For example, Shen et al. [21] noted that algorithms that utilize sequential modelling, such as RNNs, usually perform better than conventional ML models. Meanwhile, simple RNN-based models are prone to the vanishing gradient problem, a situation where the RNN is unable to propagate relevant gradient information from the model's output end back to the layers near the input end [22]. However, LSTM and GRU-based RNNs were proposed to solve the vanishing gradient problem and have shown good performances in different sequence classification tasks [8], [23], [24].
Meanwhile, ensemble learning involves training multiple base classifiers and combining their outputs to obtain better performance than the single base classifiers. Some ML algorithms build models with low accuracy, high bias, or high variance [25]. Ensemble learning-based classifiers tend to outperform single classifiers [26]. However, DL-based ensemble models have rarely been employed for CCFD. Therefore, to fill this research gap and solve the challenges facing CCFD, this paper considers the CCFD as a sequence classification problem and uses the LSTM and GRU networks as base learners in a stacking ensemble model. Also, most ensemble learning-based CCFD studies employed voting-based methods. Hence, to contribute to existing literature, this study uses the MLP neural network as the meta-learner in the stacking ensemble. The proposed method uses the advantages of sequential modelling and ensemble learning to achieve enhanced CCFD.
Furthermore, this study uses the SMOTE-ENN method for solving the class imbalance problem. Hybrid resampling techniques such as SMOTE-ENN have shown enhanced effectiveness in dealing with imbalanced data compared to oversampling and undersampling techniques [27], [28], [29]. In SMOTE-ENN, the SMOTE step adds minority class samples to the dataset. Meanwhile, SMOTE obtains synthetic samples via linear interpolation between minority class samples in the neighbourhood and often generates noisy samples [30]. Therefore, it is essential to eliminate such noisy samples, and in the hybrid SMOTE-ENN, the ENN removes noisy samples. Meanwhile, the main contributions of this work include the following:
• An improved credit card fraud detection approach is proposed using the LSTM and GRU as base learners and MLP as a meta-learner in a stacking ensemble model.
• Improving the credit card fraud detection rate by combining the robustness of ensemble learning and deep learning.
• Overcoming the class imbalance problem in credit card datasets using the SMOTE-ENN method.
• Constructing the credit card fraud detection model using recurrent neural networks (i.e., LSTM and GRU), capable of learning client's spending behaviour and transaction sequences.
• A comparative study with recently proposed credit card fraud detection methods is conducted.
The rationale behind the proposed approach is that considering the CCFD as a sequential modelling problem could better detect slight differences in genuine clients' spending behaviour. The main objective of the research is to build a well-performing CCFD model using the SMOTE-ENN data resampling technique and deep learning-based stacking ensemble. The following classifiers would be implemented and used as the baseline for performance comparison: LSTM, GRU, MLP, AdaBoost, and random forest.
The remainder of the paper is organized as follows: Section II presents a literature review, while Section III presents the dataset and the various algorithms used in the study. Section IV introduces the proposed CCFD approach. Section V presents and discusses the experimental results, and the paper is concluded in Section VI.

II. RELATED WORKS
Several machine learning methods have been proposed for credit card fraud detection [31], [32]. Specifically, supervised learning algorithms have been shown to be highly effective in detecting credit card fraud, where labelled datasets containing previous transaction records are utilized to build machine learning models that can detect new fraudulent transactions. These supervised learning algorithms include logistic regression [33], support vector machines (SVM) [34], decision trees [35], adaptive boosting (AdaBoost) [36], random forest [37], and artificial neural networks (ANN) [38], [39], [40].
VOLUME 11, 2023
Meanwhile, numerous researchers have proposed different methods to improve the performance of machine learning algorithms for credit card fraud detection. For example, Taha and Malebary [41] proposed an optimized light gradient boosting machine (LightGBM) to detect fraud in credit card transactions. The method involves optimizing the hyperparameters of the LightGBM using a Bayesian-based algorithm. The authors used two credit card datasets in the study. They performed a comparative analysis with other classifiers such as SVM, decision tree, random forest, naïve Bayes, and categorical boosting (CatBoost). The optimized LightGBM performed better than the benchmarked classifiers, having obtained an accuracy of 98.40%, a precision of 97.34%, and an area under the receiver operating characteristic curve (AUC) of 92.88%.
In another research, Ileberi et al. [42] proposed a credit card fraud detection model based on feature selection using a genetic algorithm (GA). The selected features were then used to train different ML models using naïve Bayes, logistic regression, decision tree, ANN, and random forest algorithms. The study showed that the GA-based feature selection enhanced the performance of the various ML classifiers, and the random forest achieved the highest accuracy of 99.98%.
Furthermore, Salekshahrezaee et al. [43] aimed to develop robust CCFD models by studying the impact of feature extraction and data resampling on the following machine learning classifiers: CatBoost, random forest, extreme gradient boosting (XGBoost), and LightGBM. The feature extraction was achieved using principal component analysis (PCA) and convolutional autoencoder (CAE), while the data resampling methods include SMOTE, random undersampling (RUS), and SMOTE Tomek techniques. The experimental results indicated a significant increase in the performance of the classifiers when the RUS and CAE techniques were used for resampling and feature extraction, respectively.
Meanwhile, the class imbalance in most credit card datasets has made fraud detection difficult [44], [45]. In imbalanced datasets, the number of transactions labelled as legitimate (majority class) is significantly higher than those labelled as fraud (minority class). Therefore, conventional machine learning algorithms tend to underperform, especially in identifying fraud cases [9], [46]. Also, these conventional ML algorithms build CCFD models using individual transaction details, such as transaction location and amount, without considering the sequential information associated with the credit card clients [8].
Hence, deep learning techniques such as RNNs that consider the client's shopping behaviour and transaction sequences can be vital in detecting important fraud patterns [8], [47], [48]. For example, Benchaji et al. [8] proposed a credit card fraud detection method using an LSTM network to achieve sequential modelling and ensure improved fraud detection. The LSTM network was coupled with a uniform manifold approximation and projection (UMAP) technique to identify the most relevant attributes and an attention mechanism to improve the performance of the LSTM. The experimental results demonstrated that the proposed approach was robust in detecting credit card fraud.
However, deep learning-based ensemble models have rarely been employed for credit card fraud detection, even though combining deep learning-based techniques such as LSTM in an ensemble model could result in more robust models. Therefore, this paper aims to use LSTM and GRU networks to achieve sequential modelling and enhance the identification of fraudulent transactions. Meanwhile, to ensure the classification task is significantly improved, the LSTM and GRU networks would be employed as base learners in a stacking ensemble model, with an MLP classifier as the meta-learner. Additionally, the SMOTE-ENN technique would be employed to overcome the class imbalance problem and ensure effective machine learning.

III. MATERIALS AND METHODS
This section describes the credit card dataset used in the study and provides a detailed explanation of the various algorithms and methods used in formulating the proposed credit card fraud detection method.

A. CREDIT CARD DATASET
This study uses a credit card dataset containing transactions performed by European cardholders in September 2013, which is publicly available [49]. It contains 284,807 transactions, among which 492 transactions are labelled as fraudulent. The dataset is highly imbalanced, with only 0.172% labelled as fraudulent transactions. Most of the features were transformed to numerical variables using principal component analysis (PCA) because of confidentiality issues, and the names of the features were anonymized as V1, V2, V3, ..., and V28, excluding the "Time" and "Amount" features. The "Class" feature is the target variable, and it has values 1 and 0, representing fraud and non-fraud transactions, respectively.

B. MULTILAYER PERCEPTRON
Multilayer perceptron is a type of feedforward neural network comprising three layers: the input layer, hidden layer, and output layer. It is a powerful neural network with applications in several domains [50], [51], [52]. In the MLP network, data flows from the input to the output layer. The hidden layer, placed between the input and output layers, is the core of the MLP, which processes the input information and transfers it to the output layer [53]. The neurons are the processing elements in the MLP, and the neurons in each layer are connected to every neuron in the next layer. The input layer feeds the network with the input variables, and subsequent layers receive their inputs from the output of the previous layers [54]. The MLP network is usually trained using the backpropagation algorithm [55], enabling the network to update its weights to minimize the output error. The mean squared error (MSE) is the commonly used error function, and it is represented as:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (p_i - t_i)^2$$
where $n$ is the number of data points, and $p_i$ and $t_i$ are the predicted output and target output for sample $i$, respectively. Meanwhile, this layer-to-layer transfer of information is achieved using activation functions, such as the sigmoid function
$$\sigma(k) = \frac{1}{1 + e^{-k}}$$
where $e$ is Euler's number [53], [56].
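The error and activation functions described above can be illustrated with a minimal sketch in plain Python. The function names and the toy values are illustrative assumptions and are not taken from the study's implementation.

```python
import math


def sigmoid(k):
    """Logistic sigmoid: 1 / (1 + e^(-k))."""
    return 1.0 / (1.0 + math.exp(-k))


def mean_squared_error(predicted, target):
    """MSE = (1/n) * sum over i of (p_i - t_i)^2."""
    n = len(predicted)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / n


# Hypothetical network outputs for three transactions (1 = fraud, 0 = non-fraud).
p = [sigmoid(2.0), sigmoid(-1.0), sigmoid(0.0)]
t = [1.0, 0.0, 0.0]
mse = mean_squared_error(p, t)
```

Because the sigmoid maps any real input into (0, 1), the squared error per sample is bounded by 1, which is why the MSE of such outputs against 0/1 targets also lies in that range.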

C. LONG SHORT-TERM MEMORY
The LSTM network is a modified form of a recurrent neural network. Unlike conventional neural networks like MLP, RNNs are not limited to a unidirectional data flow [57]. They can loop through several layers and temporarily memorize information that can be used later. Meanwhile, the simple RNN is susceptible to the vanishing gradient problem, and the LSTM and GRU were developed to solve the problem [58].
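The LSTM can learn long-term dependencies, making it suitable for classifying sequential data, such as credit card data. LSTM networks consist of a memory cell $c_t$, with an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. The three gates control how the data is processed and used [9]. The following mathematical formulations, given here in their standard form, represent the flow of information within the LSTM layers:
$$f_t = \sigma(V_f x_t + W_f h_{t-1} + b_f)$$
$$i_t = \sigma(V_i x_t + W_i h_{t-1} + b_i)$$
$$o_t = \sigma(V_o x_t + W_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(V_c x_t + W_c h_{t-1} + b_c)$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$$
$$h_t = o_t \otimes \tanh(c_t)$$
where $x_t$ is the input at time step $t$, $V_*$, $W_*$, and $b_*$ are learnable parameters, $h_t$ is the hidden state, $\tilde{c}_t$ is the candidate memory cell, and $*$ is used in place of $f$, $i$, $o$, or $c$ to represent the given gates and memory cell. Meanwhile, $\sigma$ and $\tanh$ are the sigmoid and tanh activation functions and $\otimes$ is the element-wise product [59].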
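A single LSTM time step can be sketched in plain Python as follows. The sketch assumes the standard LSTM formulation, with $V_*$ applied to the current input and $W_*$ to the previous hidden state; this assignment, the helper names, and the all-zero parameters are assumptions made for illustration, not details confirmed by the study.

```python
import math


def sigmoid(k):
    return 1.0 / (1.0 + math.exp(-k))


def affine(V, x, W, h, b):
    """Compute V x + W h + b with plain lists (matrices are lists of rows)."""
    return [
        sum(v * x_j for v, x_j in zip(v_row, x))
        + sum(w * h_j for w, h_j in zip(w_row, h))
        + b_k
        for v_row, w_row, b_k in zip(V, W, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'o', 'c' to (V, W, b)."""
    pre = {name: affine(V, x_t, W, h_prev, b) for name, (V, W, b) in params.items()}
    f_t = [sigmoid(a) for a in pre["f"]]      # forget gate
    i_t = [sigmoid(a) for a in pre["i"]]      # input gate
    o_t = [sigmoid(a) for a in pre["o"]]      # output gate
    c_hat = [math.tanh(a) for a in pre["c"]]  # candidate memory cell
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hidden, names):
    """Hypothetical all-zero parameters, used only to keep the demo checkable."""
    return {
        name: (
            [[0.0] * n_in for _ in range(n_hidden)],
            [[0.0] * n_hidden for _ in range(n_hidden)],
            [0.0] * n_hidden,
        )
        for name in names
    }


params = zero_params(n_in=3, n_hidden=2, names="fioc")
h, c = [0.0, 0.0], [1.0, -2.0]
for x_t in [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]:  # a two-step toy sequence
    h, c = lstm_step(x_t, h, c, params)
```

With all-zero parameters every gate evaluates to 0.5 and the candidate memory to 0, so the cell state is simply halved at each step. Trained weights would instead let the forget gate decide how much of the earlier transaction history to retain.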

D. GATED RECURRENT UNIT
The gated recurrent unit, developed by Cho et al. [60], is a gating mechanism in recurrent neural networks similar to the LSTM network but without an output gate. The GRU replaces the three gates in LSTM with two, i.e., the update gate $z_t$ and reset gate $r_t$. The update gate and reset gate control information that flows into memory and information that flows out of memory, respectively [61]. These gates are basically vectors that determine the information passed on to the output and can be trained to keep past information or discard unnecessary information that does not contribute to the prediction. The gates in GRU are given sigmoid activations, thereby ensuring their values are in the range (0, 1), so that each gate acts as a soft switch between discarding and retaining information. Furthermore, in the GRU, the hidden state $h_t$ and cell state $c_t$ blend into one, i.e., $h_t = c_t$.
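The GRU update equations, given here in their standard form, include the following:
$$z_t = \sigma(V_z x_t + W_z h_{t-1} + b_z)$$
$$r_t = \sigma(V_r x_t + W_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(V_h x_t + W_h (r_t \otimes h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t$$
where $V_r$, $W_r$, $V_z$, $W_z$, $V_h$, and $W_h$ represent the weight matrices, $b_r$, $b_z$, and $b_h$ denote the bias vectors, and $\tilde{h}_t$ is the candidate hidden state [62].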
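A single GRU time step can be sketched in plain Python as follows. The sketch assumes the common convention in which the update gate weights the candidate state, i.e., $h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t$; some formulations swap the two weights. The helper names and all-zero parameters are hypothetical and serve only to make the arithmetic easy to follow.

```python
import math


def sigmoid(k):
    return 1.0 / (1.0 + math.exp(-k))


def affine(V, x, W, h, b):
    """Compute V x + W h + b with plain lists (matrices are lists of rows)."""
    return [
        sum(v * x_j for v, x_j in zip(v_row, x))
        + sum(w * h_j for w, h_j in zip(w_row, h))
        + b_k
        for v_row, w_row, b_k in zip(V, W, b)
    ]


def gru_step(x_t, h_prev, params):
    """One GRU time step; params maps 'z', 'r', 'h' to (V, W, b)."""
    V_z, W_z, b_z = params["z"]
    V_r, W_r, b_r = params["r"]
    V_h, W_h, b_h = params["h"]
    z_t = [sigmoid(a) for a in affine(V_z, x_t, W_z, h_prev, b_z)]  # update gate
    r_t = [sigmoid(a) for a in affine(V_r, x_t, W_r, h_prev, b_r)]  # reset gate
    gated = [r * h for r, h in zip(r_t, h_prev)]                    # r_t (x) h_{t-1}
    h_hat = [math.tanh(a) for a in affine(V_h, x_t, W_h, gated, b_h)]
    # Blend the previous state and the candidate state (assumed convention).
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_hat)]


def zero_params(n_in, n_hidden, names):
    """Hypothetical all-zero parameters, used only to keep the demo checkable."""
    return {
        name: (
            [[0.0] * n_in for _ in range(n_hidden)],
            [[0.0] * n_hidden for _ in range(n_hidden)],
            [0.0] * n_hidden,
        )
        for name in names
    }


params = zero_params(n_in=3, n_hidden=2, names="zrh")
h_next = gru_step([0.1, 0.2, 0.3], [0.8, -0.4], params)
```

Unlike the LSTM step, the GRU carries a single state vector, which reflects the merging of hidden state and cell state described above.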

E. ENSEMBLE LEARNING
Ensemble learning is a machine learning approach that combines multiple algorithms to achieve better classification performance than the individual base models [63], [64]. ML models usually have shortcomings, such as high bias, high variance, and low accuracy, and are not exempted from making errors [25], [65]. Therefore, rather than relying on one classifier, ensemble learning methods harness the strengths of two or more classifiers and often obtain higher accuracy than the individual base classifiers [66], [67], [68]. Ensemble learning methods can be broadly grouped into bagging, boosting, and stacking [69]. Stacking, which is the focus of this paper, involves using different base algorithms to build models, termed level-0 models, and a different algorithm called a meta-learner (or level-1 classifier) is trained to combine the predictions of the base models [70]. The trained level-0 models are tested with out-of-sample instances, and the predicted class labels and the actual labels make up the independent and dependent variables, respectively, in the new dataset employed for training the meta-classifier [71].
Unlike bagging and boosting, which use combination rules such as majority voting and weighted majority voting, the stacking ensemble model uses another ML algorithm (i.e., meta-learner) to aggregate the predictions from the level-0 models [72]. In the literature, stacking-based ensembles have been applied in diverse fields [73], [74], [75].

IV. PROPOSED CCFD METHOD
A. DATA RESAMPLING
Only 0.172% of the samples in the European credit card dataset are labelled as fraudulent. Therefore, the dataset is highly imbalanced, which usually leads to models with poor generalization ability. Both oversampling and undersampling methods have been widely utilized to solve the class imbalance challenge in different ML applications [76], [77], [78]. However, oversampling methods create balanced training sets by duplicating samples in the minority class, which could result in overfitting [79], while undersampling methods obtain balanced datasets by discarding selected majority class instances. Hence, undersampling could remove useful examples that might be crucial in building efficient ML models, and it is also inefficient in highly imbalanced datasets like the European credit card dataset.
Therefore, the SMOTE-ENN method is adopted to balance the credit card dataset in this study. The SMOTE-ENN, summarized in Algorithm 1, is a hybrid method, having both undersampling and oversampling aspects using ENN and SMOTE, respectively [80]. The SMOTE technique oversamples the minority class instances while the ENN deletes overlapping samples. The ENN's neighborhood cleaning rule discards a sample when its class label differs from that of at least two of its three nearest neighbors [81].

Algorithm 1 SMOTE-ENN Resampling Method
Input: training data $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
Procedure:
Step 1: Oversampling
1) Select an instance $x_i$ randomly from the minority class.
2) Find the $k$ nearest neighbors of $x_i$ and let $S_k$ represent the samples.
3) Randomly pick one of the samples in $S_k$, called $z$, then connect $x_i$ and $z$ to obtain a line segment in the feature space.
4) Generate a synthetic data point $p$ on this line segment, i.e., as a convex combination of $x_i$ and $z$, and assign the minority class label to $p$.
5) Repeat steps 1-4 to generate consecutive synthetic instances.
Step 2: Undersampling
6) Select a random instance $x_r \in S$.
7) Find the $k$ nearest neighbors of $x_r$, where $k = 3$.
8) Delete $x_r$ if most of its neighbors belong to the other class.
9) Repeat steps 6-8 for the whole training data.
Output: A balanced dataset for effective CCFD.
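The two stages of Algorithm 1 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the study: the function names, the default $k = 5$ for the SMOTE stage, the choice to oversample until the classes are equal in size, and the toy data are all hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x_i, m))[:k]
        z = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from x_i to z
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x_i, z)))
    return synthetic


def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with most of its k neighbours."""
    keep = []
    for i, x in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        disagree = sum(1 for j in nearest if y[j] != y[i])
        if disagree <= len(nearest) - disagree:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority_label=1, k_smote=5, k_enn=3, seed=0):
    """Oversample the minority class to parity, then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    synthetic = smote(minority, n_new, k_smote, random.Random(seed))
    X_all = list(X) + synthetic
    y_all = list(y) + [minority_label] * len(synthetic)
    return edited_nearest_neighbours(X_all, y_all, k_enn)


# Toy data (hypothetical): a grid of 36 legitimate points, 4 fraud points,
# and one legitimate point sitting inside the fraud region.
legit = [(0.5 * a, 0.5 * b) for a in range(6) for b in range(6)] + [(10.25, 10.25)]
fraud = [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5)]
X = legit + fraud
y = [0] * len(legit) + [1] * len(fraud)
X_res, y_res = smote_enn(X, y)
```

In this toy example the SMOTE stage adds 33 synthetic fraud points inside the square spanned by the four original ones, and the ENN stage removes the single legitimate point that overlaps the fraud region, since all three of its nearest neighbours carry the other label.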

B. DEEP LEARNING ENSEMBLE
The proposed CCFD method combines LSTM, GRU, and MLP neural networks to obtain a robust stacking-based ensemble model. The stacking framework comprises two layers, i.e., level-0 and level-1. At level-0, the base classifiers are trained and then tested with out-of-sample instances; the resulting predictions and their actual labels comprise the independent and dependent variables in the new dataset used to train the meta-classifier [71]. In this study, the LSTM and GRU networks are the level-0 learners, while the MLP is the level-1 learner. The main reasons for selecting the LSTM and GRU as the base learners include their robustness in modelling sequential data and their high prediction performance in such tasks. Also, their difference would ensure diversity in the ensemble, which is essential because different base models are likely to make different types of errors. A flowchart of the proposed methodology is shown in Figure 1.
From Figure 1, the proposed approach can be divided into three steps. The first step involves training the base models using the LSTM and GRU networks. Meanwhile, the 10-fold cross-validation (CV) approach is used to develop the ensemble, ensuring there is no data leakage. In the second step, the two base models generate out-of-fold predictions, which are transformed, and, together with the actual labels, form a new dataset. Specifically, the predicted target labels are used as attributes in the new dataset, whereas the original class labels make up the response variable.
Furthermore, the base learners are trained individually on the training data using the same CV indices, while the meta-classifier is trained with the out-of-fold predictions and the actual labels associated with these instances, since these instances were not used in training the level-0 models. The output of each base learner is 1 or 0, representing fraud and non-fraud transactions, respectively. For instance, assuming each sample in a credit card dataset $S$ is $(x_i, y_i)$, a new sample $(x'_i, y_i)$ is created, where
$$x'_i = \{h_1(x_i), h_2(x_i), \ldots, h_T(x_i)\} \tag{13}$$
Here, $x'_i$ is the out-of-sample prediction and $\{h_1, h_2, \ldots, h_T\}$ are the base models. The third step involves using the generated dataset to train the MLP-based meta-classifier, which is employed to combine the base models, i.e., assuming $x$ is a test instance, the final ensemble prediction for $x$ is $\hat{h}(h_1(x), h_2(x), \ldots, h_T(x))$, where $\hat{h}$ represents the meta-model [82]. The proposed approach is outlined in Algorithm 2.
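The construction of the level-1 training set from out-of-fold predictions can be sketched as follows. To keep the example self-contained, two simple threshold classifiers stand in for the LSTM and GRU base learners, and the meta-learner is not fitted; the function names, the contiguous fold layout, and the toy data are hypothetical and are not the study's implementation.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(X, y, learners, k=10):
    """Build the level-1 dataset: row i holds [h_1(x_i), ..., h_T(x_i)],
    where each h_t was fitted on folds that do not contain sample i."""
    n = len(X)
    meta_X = [[None] * len(learners) for _ in range(n)]
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        for t, fit in enumerate(learners):
            predict = fit(train_X, train_y)  # same CV split for every learner
            for i in test_idx:
                meta_X[i][t] = predict(X[i])
    return meta_X, list(y)


def make_threshold_learner(feature):
    """Stand-in base learner (NOT an LSTM or GRU): thresholds one feature
    at the midpoint between the two class means."""
    def fit(train_X, train_y):
        pos = [x[feature] for x, label in zip(train_X, train_y) if label == 1]
        neg = [x[feature] for x, label in zip(train_X, train_y) if label == 0]
        cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return lambda x: 1 if x[feature] > cut else 0
    return fit


# Toy data (hypothetical): 20 samples with alternating labels.
y = [i % 2 for i in range(20)]
X = [(label + 0.01 * i, 3 * label - 0.02 * i) for i, label in enumerate(y)]
learners = [make_threshold_learner(0), make_threshold_learner(1)]
meta_X, meta_y = out_of_fold_predictions(X, y, learners, k=10)
```

Each row of `meta_X` corresponds to $x'_i$ in (13), and `meta_y` holds the original labels. A meta-learner such as the MLP would then be fitted on this pair, and because every prediction comes from a model that never saw the corresponding sample, the meta-learner is not trained on leaked labels.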

Algorithm 2 Proposed DL Ensemble
Input: Credit card dataset $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
The selected base learners, i.e., LSTM and GRU
The meta-learner $L$, i.e., MLP
Procedure:
Step 1: Training the level-0 models
for $t = 1, \ldots, T$:
    Train a base classifier $h_t$ using $S$
end for
Step 2: Generating the new dataset $S_N$
for $i = 1, \ldots, m$:
    Generate a new example $(x'_i, y_i)$ using (13)
end for
Step 3: Fit the MLP-based meta-model $\hat{h}$ (or level-1 model) using $S_N$
return $H(x) = \hat{h}(h_1(x), h_2(x), \ldots, h_T(x))$
Output: Stacking-based DL ensemble classifier $H$.

V. RESULTS AND DISCUSSION
This study develops a deep learning ensemble with data resampling for improved credit card fraud detection. The experimental results have been split into two, i.e., the classifiers' performance before and after data resampling. Meanwhile, the proposed stacking ensemble was achieved using the LSTM and GRU neural networks as base learners and an MLP neural network as the meta-learner. It was implemented together with five other classifiers for performance comparison. The classifiers include AdaBoost [83], random forest [84], MLP [85], LSTM [86], and GRU [60]. All the models were developed using scikit-learn [87], a well-known ML library in Python, and the hardware used in building the models is a 16 GB RAM Windows computer having the following specification: Intel(R) Core(TM) i5-102100U CPU @ 1.60 GHz 2.10 GHz. Meanwhile, the 10-fold cross-validation technique is utilized to evaluate all the models' performance.
Furthermore, the following performance metrics were used to evaluate the models: sensitivity (SEN), specificity (SPE), receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC). Sensitivity indicates the ability of the classifier to predict fraudulent transactions as fraud. A classifier with a high sensitivity score is usually preferred in CCFD [88]. Specificity indicates the ability of the classifier to predict legitimate transactions [89]. These performance metrics are represented mathematically as follows:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
where true positive (TP) and false positive (FP) represent the number of instances correctly and wrongly predicted as fraud, respectively; in contrast, true negative (TN) and false negative (FN) denote the number of transactions correctly and wrongly predicted as non-fraud. Meanwhile, the ROC curve is a plot showing the ability of a classifier to distinguish between the fraud and non-fraud classes. It is obtained by plotting the true positive rate against the false positive rate at different threshold settings [90]. The AUC is a summary of the ROC curve, having a value of 0 to 1, where 0 indicates that all the classifier's predictions are wrong, and 1 indicates a perfect classifier.

A. PERFORMANCE OF THE CLASSIFIERS WITHOUT DATA RESAMPLING
Table 1 shows the prediction performance of the proposed deep learning ensemble and the selected classifiers trained with the original credit card dataset without resampling. The proposed method achieved higher specificity, sensitivity, and AUC scores than the individual classifiers that make up the ensemble (i.e., LSTM, GRU, and MLP). Also, the proposed ensemble performed better than the well-known AdaBoost and random forest classifiers. Specifically, the proposed ensemble achieved the highest sensitivity value of 90.5%, followed by the random forest (80.2%) and AdaBoost (77.5%). A similar trend is observed for the specificity and AUC values, where the proposed ensemble achieved the best performance.
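For reference, the sensitivity and specificity definitions used in this section, together with a rank-based computation of the AUC, can be sketched in plain Python. The labels, scores, and 0.5 decision threshold below are hypothetical values chosen for illustration; they are not results from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN), treating label 1 (fraud) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random fraud sample is scored above a
    random non-fraud sample (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for eight transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.6, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sen = sensitivity(tp, fn)
spe = specificity(tn, fp)
area = auc(y_true, scores)
```

The pairwise formulation gives the same value as the area under the ROC curve without requiring an explicit sweep over thresholds, which makes it convenient for small checks such as this one.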
Meanwhile, Table 1 shows that the sensitivity values are lower than the specificity values, which implies the classifiers can correctly identify the majority class samples or non-fraud transactions and fail to correctly predict the minority class samples, i.e., fraud transactions. Hence, it is necessary to solve the class imbalance problem in order to achieve optimal classification performance.

B. PERFORMANCE OF THE CLASSIFIERS AFTER DATA RESAMPLING
Table 2 shows the performance of the classifiers after the SMOTE-ENN-based resampling. The results suggest that the performance of the classifiers is higher than in Table 1. Results in Table 2 indicate improved sensitivity and specificity values, with the proposed deep learning ensemble obtaining the best performance. Specifically, the proposed method obtained a sensitivity of 100% and a specificity of 99.7%. The AdaBoost obtained the second highest sensitivity of 96.6%, followed by the GRU (95.2%) and random forest (94.5%). Regarding the specificity value, the AdaBoost achieved 98.4%, followed by the LSTM (97.1%) and random forest (96.1%).
Figure 2 shows the ROC curves of the various models. The proposed ensemble model's ROC curve is at the top-left, indicating it has the best classification performance compared to the benchmarked models. The DL ensemble achieved an AUC of 1.000, which is higher than the other methods. The high AUC indicates the DL ensemble performs better in distinguishing between fraud and non-fraud transactions. Furthermore, Figures 3 and 4 provide a comparison of the sensitivity and specificity scores achieved by the classifiers before and after the resampling step. The figures demonstrate the impact and importance of the data resampling, as it significantly influenced the learning and prediction performance of the classifiers. Using the balanced dataset ensured that the sensitivity, or true positive rate, is greatly improved, which implies an improved detection of the samples labelled as fraud.

C. COMPARISON WITH STATE-OF-THE-ART METHODS
This section compares the proposed method with well-performing methods in recent scholarly articles, including a weighted extreme learning machine (Weighted ELM) [91], an optimized LGBM [41], a deep neural network (DNN) based classifier [92], a cost-sensitive SVM [93], a neural network ensemble [94], a random forest-based genetic algorithm wrapper method (GA-RF) [42], a method that sequentially combines the C4.5 and naïve Bayes classifiers [95], a dynamic weighted ensemble technique using Markov Chain [96], a model developed using random forest algorithm and SMOTE based resampling (RF-SMOTE) [97], an XGBoost model with SMOTE based resampling [98], an LSTM ensemble with SMOTE-ENN [9], a comparison of SMOTE and ADASYN based resampling with a DNN classifier [4], and an ANN model with random undersampling (RUS) and similarity-based selection (SBS) resampling techniques [99].
The stacking-based DL ensemble obtained optimal performance in comparison with other well-performing methods in Table 3, reflecting the proposed method's robustness. Meanwhile, it would be beneficial to observe how the proposed approach would perform using a different dataset. Therefore, the Taiwan credit card dataset [100] is employed to evaluate the proposed method and the selected baseline classifiers. The Taiwan dataset is imbalanced, containing 30,000 instances, with 23,364 and 6,636 samples labelled as good and bad clients, respectively. Table 4 and Table 5 show the performance of the classifiers before and after the application of data resampling, and it can be seen that the proposed ensemble obtained superior performance. Furthermore, the classification performance of the classifiers was enhanced after resampling the data. Using the European credit card dataset, the sensitivity, specificity, and AUC increased by 10.5%, 3.2%, and 8.7%, respectively. Similar performance increases were observed when the models were trained using the Taiwan credit card dataset. Also, the proposed method achieved excellent performance compared to other credit card fraud detection methods in the literature, including those that used resampling techniques, such as oversampling-based methods (SMOTE and ADASYN) and RUS undersampling methods.
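As a quick arithmetic check, the reported 10.5% sensitivity gain on the European dataset appears consistent with a relative increase from the 90.5% sensitivity obtained without resampling to the 100% obtained with SMOTE-ENN. Reading the figure as a relative increase for the proposed ensemble is an assumption; the specificity and AUC gains cannot be checked in the same way here because their pre-resampling values are only given in Table 1.

```python
def relative_increase(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


# Sensitivity of the proposed ensemble before and after resampling, as
# quoted in the text (0.905 and 1.000).
sensitivity_gain = relative_increase(0.905, 1.000)
absolute_gain = (1.000 - 0.905) * 100.0  # gain in percentage points
```

Under this reading, the gain is about 10.5% in relative terms and 9.5 percentage points in absolute terms.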
From the above, it is fair to conclude that combining the hybrid SMOTE-ENN-based resampling and the stacking-based deep learning ensemble resulted in excellent classification performance. Additionally, the stacking-based deep learning ensemble proposed in this study effectively learned from the credit card datasets, predicted both fraud and legitimate transactions efficiently, and could be deployed for real-time fraud detection.

VI. CONCLUSION
The growth in e-commerce and credit card usage has led to a significant rise in credit card fraud, affecting financial institutions and customers. Credit card fraud detection is an essential and challenging task. This study contributes to the existing literature by proposing a new approach using a deep learning-based stacking ensemble with data resampling to detect credit card fraud effectively. The stacking ensemble uses LSTM and GRU neural networks as base learners and an MLP as the meta-learner. Meanwhile, the data resampling was achieved using the hybrid SMOTE-ENN method. The proposed method achieved sensitivity, specificity, and AUC values of 1.000, 0.997, and 1.000, respectively, outperforming the baseline classifiers, including AdaBoost, random forest, MLP, LSTM, and GRU. In terms of performance comparison with other scholarly works, the proposed method presents an excellent performance. Future works would aim to introduce diversity in the base models using classifiers with different training methods, such as combining LSTM with random forest, logistic regression, or SVM. Also, future research works could consider feature importance and risk factor analysis.