A Neural Network Ensemble With Feature Engineering for Improved Credit Card Fraud Detection

Recent advancements in electronic commerce and communication systems have significantly increased the use of credit cards for both online and regular transactions. However, there has been a steady rise in fraudulent credit card transactions, costing financial companies huge losses every year. The development of effective fraud detection algorithms is vital in minimizing these losses, but it is challenging because most credit card datasets are highly imbalanced. Also, using conventional machine learning algorithms for credit card fraud detection is inefficient due to their design, which involves a static mapping of the input vector to output vectors. Therefore, they cannot adapt to the dynamic shopping behavior of credit card clients. This paper proposes an efficient approach to detect credit card fraud using a neural network ensemble classifier and a hybrid data resampling method. The ensemble classifier is obtained using a long short-term memory (LSTM) neural network as the base learner in the adaptive boosting (AdaBoost) technique. Meanwhile, the hybrid resampling is achieved using the synthetic minority oversampling technique and edited nearest neighbor (SMOTE-ENN) method. The effectiveness of the proposed method is demonstrated using publicly available real-world credit card transaction datasets. The performance of the proposed approach is benchmarked against the following algorithms: support vector machine (SVM), multilayer perceptron (MLP), decision tree, traditional AdaBoost, and LSTM. The experimental results show that the classifiers performed better when trained with the resampled data, and the proposed LSTM ensemble outperformed the other algorithms by obtaining a sensitivity and specificity of 0.996 and 0.998, respectively.


I. INTRODUCTION
In the past decade, there has been a rise in e-commerce, which has increased credit card utilization significantly. The increasing credit card usage has brought about a constant increase in fraudulent transactions [1]. Fraudulent credit card transactions have severely impacted the financial industry. A recent report showed that about 27.85 billion dollars were lost to credit card fraud in 2018, a 16.2% increase compared to the 23.97 billion dollars lost in 2017, and it is estimated to reach 35 billion dollars by 2023 [2]. These losses can be reduced through efficient fraud monitoring and prevention. Meanwhile, machine learning (ML) has been The associate editor coordinating the review of this manuscript and approving it for publication was Joey Tianyi Zhou. applied to develop several credit card fraud detection systems [3]- [7]. However, credit card fraud detection remains a challenge from a learning perspective due to the class imbalance that exists in the datasets [4]. Though the class imbalance is not the only problem that has hindered credit card fraud detection, it is the most critical challenge [6]. The class imbalance is a problem that occurs in several real-world ML applications, where datasets have an uneven class distribution. For example, samples belonging to one class (the majority class) are higher than those of the other class (the minority class). Most credit card transaction datasets are imbalanced because the legitimate transactions significantly outnumber the fraudulent transactions [8]. Most traditional ML algorithms perform well when they are trained with balanced data. The skewed class distribution makes conventional ML algorithms have biased performance towards the majority class because the algorithms are not designed to consider the class distribution but the error rate [9]. Therefore, more minority class examples are misclassified than the majority class samples [10].
The methods used in the literature to classify imbalanced data can be grouped into three categories, including data-level, algorithm-level, and hybrid techniques. Data-level techniques tend to create a balanced dataset by undersampling the majority class or oversampling the minority class, sometimes the combination of both [11]. Algorithm-level methods aim to solve the class imbalance problem by modifying the classifier to give more attention to the minority class examples. Examples of algorithm-level techniques include ensemble learning and cost-sensitive learning methods [12]. Meanwhile, the hybrid methods combine both data-level and algorithm-level techniques.
Several research works have proposed different methods to handle the imbalanced class problem in credit card fraud detection. For example, Padmaja et al. [13] proposed a fraud detection method using k-reverse nearest neighbor (KRNN) to eliminate extreme outliers from the minority class samples. Secondly, hybrid resampling was performed on the dataset, i.e., undersampling of the majority class and the oversampling of the minority class. The resampled data was used to train several classifiers, including the naïve Bayes, C4.5 decision tree, and k-nearest neighbor (KNN) classifiers. Compared to traditional resampling methods, the proposed approach obtained superior performance.
Taha and Malebary [14] proposed a credit card fraud detection method using a light gradient boosting machine (LightGBM). The hyperparameters of the LightGBM were tuned using a Bayesian-based optimization algorithm. The technique achieved an accuracy of 98.40% and a precision of 97.34%. Furthermore, Randhawa et al. [1] studied the performance of some standard machine learning algorithms and hybrid classifiers, including ensemble classifiers based on majority voting. The experimental results show that the majority voting technique yields excellent performance in detecting fraudulent transactions.
Despite the numerous studies proposed to handle imbalanced data, this problem remains a challenge, especially in credit card fraud detection [6]. Since the advent of deep learning, recurrent neural networks (RNN), such as long short-term memory (LSTM) and gated recurrent units (GRU), have shown enormous potential in modelling sequential data [15]- [17]. Conventional machine learning algorithms have not been successful in credit card fraud detection because they do not adapt to the dynamic shopping trends of credit card clients, which results in misclassifications when used for fraud detection systems [18]. To address this problem and proffer a robust solution that models the time series in credit card transactions, this study employs the LSTM neural network. The rationale behind this study is that it can be more beneficial to consider the entire sequence of transactions rather than only individual transactions because a method capable of modelling time in credit card data will be more powerful in identifying small shifts in legitimate customer shopping behavior.
The contribution of this study is the development of a robust credit card fraud detection method using an LSTM neural network ensemble. In the process, we implement an effective feature engineering method via resampling of the imbalanced data using the SMOTE-ENN technique. The proposed ensemble technique uses the LSTM neural network as the base learner in the adaptive boosting (AdaBoost) algorithm. This method is significant for two reasons: the LSTM is a robust algorithm for modelling sequential data. Secondly, the AdaBoost technique builds strong classifiers that are less likely to overfit, with lesser false-positive predictions [19]. Hence, integrating the LSTM neural network and AdaBoost algorithm could be an excellent method for effective credit card fraud detection.
The rest of this paper is structured as follows: Section II discusses the credit card fraud detection dataset, together with the conventional AdaBoost and LSTM techniques. Section III presents the proposed credit card fraud detection system, including the feature engineering and the LSTM ensemble. Section IV presents the results and discussions, while Section V concludes the paper and provides future research direction.

A. DATASET
This research utilizes the well-known credit card fraud detection dataset [20]. The dataset was prepared by the Université Libre de Bruxelles (ULB) Machine Learning Group on big data mining and fraud detection [9]. The dataset contains credit card transactions performed within two days in September 2013 by European credit card clients. The dataset is imbalanced, with only 492 fraudulent transactions out of 284 807. Meanwhile, all the attributes except ''Time'' and ''Amount'' are numerical due to the transformation carried out on the dataset, and they are coded as V 1 , V 2 , . . . , V 28 for confidentiality reasons. The ''Amount'' attribute is the cost of the transaction and the ''Time'' attribute is the seconds that elapsed between a transaction and the first transaction in the dataset. Lastly, the attribute ''Class'' is the dependent variable, and it has a value of 1 for fraudulent transactions and 0 for legitimate transactions.

B. ADAPTIVE BOOSTING
The AdaBoost algorithm [21] is an ensemble technique used to build strong classifiers by voting the weighted predictions of the weak learners [22]. It has achieved excellent performance in several applications, including credit card fraud detection [1] and intrusion detection systems [23]. Overfitting is common in machine learning applications [12], leading to poor classification performance. However, classifiers trained using the AdaBoost technique are less likely to overfit and also, the risk of high false-positive predictions VOLUME 10, 2022 is reduced [19]. In the AdaBoost implementation, an algorithm is selected to train the base classifier using the initial input data. Secondly, the weights of the samples are adjusted, and more weight is given to the misclassified samples. Furthermore, the adjusted instances are employed to train the subsequent base learner, which attempts to correct the misclassifications from the previous models. The iteration continues until the specified number of models is built, or there are no misclassified samples in the data.

C. LONG SHORT-TERM MEMORY NEURAL NETWORK
Long Short-Term Memory neural network is a special type of recurrent neural network (RNN) that has achieved excellent performance in learning long-term dependencies and avoids the gradient disappearance problem [24]. LSTM consists of a memory cell c t to remember the previous information and three types of gates that controls how the historical information is used and processed. The three gates are forget gate f t , input gate i t , and output gate o t . The LSTM layers are updated using the following equations: Meanwhile, * can represent f , i, or o to denote the specific gate or c for the memory cell. Therefore, V * and W * are the weight matrices, h * represent the hidden state, b * is the bias, h t is the output vector at time instant t. Furthermore, σ and tanh are the sigmoid and tanh activation functions [15]. The operator ⊗ represents the Hadamard or element-wise product. The first step in the LSTM algorithm is the identification of unrequired information which would be removed from the cell. An LSTM cell serves as a memory to write, read, and delete information depending on the decisions given by the input, output, and forget gates, respectively [25].

III. PROPOSED CREDIT CARD FRAUD DETECTION METHOD A. SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE AND EDITED NEAREST NEIGHBOR (SMOTE-ENN)
The credit card dataset used in the study is highly imbalanced, leading to poor performance when used to build ML models. The synthetic minority oversampling technique (SMOTE) is widely used in solving the imbalanced class problem [26]- [28]. It is an oversampling technique that balances the class distribution in the dataset by adding synthetic samples to the minority class. In contrast, undersampling methods such as edited nearest neighbor (ENN) creates a balanced dataset by deleting some majority class samples. Meanwhile, undersampling techniques can delete potentially useful examples that might be vital in the learning process. Also, undersampling methods become ineffective when the samples in the majority class significantly outnumber those in the minority class, such as the credit card dataset used in this research. Furthermore, oversampling could lead to overfitting since it makes copies of existing data samples. Therefore, the proposed credit card fraud detection model employs the synthetic minority oversampling technique and edited nearest neighbor (SMOTE-ENN) method to obtain a balanced dataset. The SMOTE-ENN is a hybrid resampling technique that performs both oversampling and undersampling of the data. It uses SMOTE to oversample the minority class samples and ENN to remove overlapping instances [29]. This algorithm employs the neighborhood cleaning rule from the ENN to remove examples that differ from two in the three nearest neighbors [30]. Algorithm 1 presents the pseudocode for the SMOTE-ENN technique.

Algorithm 1 SMOTE-ENN Technique Input: Input data
Step 1: Oversampling: 1: Choose a random sample x i from the minority class 2: Search for the K nearest neighbors of x i 3: Generate a synthetic sample p by randomly selecting one of the K nearest neighbors q, and connect p and q to create a line segment in the feature space 4: Give the minority class label to the newly created synthetic sample 5: Generate successive synthetic samples as a convex combination of the two selected samples.
Step 2: Undersampling:  y 1 ), . . . , (x n , y i )}, where x * is the independent variable and y * is the dependent variable (i.e., fraud or legitimate transaction). Let D m represents the weight distribution of the training samples at the mth boosting iteration, which was assigned a similar value 1/n at the first iteration, then the total classification error of the current base model can be computed using: where x i denotes the input sample and y i is the corresponding label, L m represents the trained LSTM model at the mth iteration. Furthermore, the weight distribution of the input data is updated depending on the prediction performance of the previous classifier in order to assign higher weights to the incorrectly classified instances and lesser weights to the correctly predicted cases. The weight update is achieved using: where Z m denotes a normalization parameter and ∂ m represents the voting weight of the base learner L m . The normalization parameter ensures the weight D m+1 (i) have a suitable distribution. Meanwhile, Z m and ∂ m can be mathematically represented as: After M iterations, the ensemble classifier consists of M base learners. Therefore, the final AdaBoost prediction is the combined predictions weighted by ∂ m : where the sign function sgn(x) is computed using: The proposed method is represented algorithmically in Algorithm 2. The LSTM models are trained using the resampled data from Algorithm 1 and integrated with the AdaBoost technique to create a powerful ensemble. Lastly, the classification results from the LSTM networks are combined via the weighted voting technique to obtain the final prediction results.

IV. RESULTS AND DISCUSSION
This section presents the experimental results. The proposed LSTM ensemble is benchmarked against some classifiers, including the SVM, MLP, decision tree, LSTM, and the traditional AdaBoost. We performed experiments using the original and resampled datasets to demonstrate the impact of the SMOTE-ENN resampling technique on the performance of the various classifiers. Meanwhile, we used the Python programming language and its associated machine learning libraries for all the experiments. Furthermore, we utilized the stratified 10-fold cross-validation technique to evaluate the performance of the models. The stratified 10-fold crossvalidation technique is well suited for imbalanced classification problems. It ensures that the proportion of fraud and non-fraud samples found in the dataset is preserved in all the folds.
The performance of the models is evaluated using the following performance evaluation metrics: sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). Sensitivity, also called recall, indicates the proportion of fraud samples correctly predicted by the classifier [31]. In contrast, specificity (true negative rate) is the Algorithm 2 LSTM Based Ensemble for Credit Card Fraud Detection Input: training data, U ={(x 1 , y 1 ), . . . , (x n , y i )} LSTM network as base learner L the number of time steps T and learning iterations M Output: Ensemble prediction Procedure: Step 1: Learn base classifier 1: initialize the input data weight distribution using D m (i)= 1 n , for all i = 1, 2, . . . , n 2: for m=1, 2, . . . , M do 3: train an LSTM base learner using U 4: for t=1, 2, . . . , T do 5: compute the output of the LSTM input gate using (1) 6: compute the output of the LSTM forget gate using (2) 7: update the LSTM memory cell using (3) and (4)   8: update the LSTM output gate using (5) 9: compute the LSTM output vector using (6) 10: end for 11: return L m = h T Step 2: Construct the ensemble prediction 12: compute the training error of L m using (7) 13: set the voting weight of L m using (10) 14: update the weights of the training samples using (8) 15: end for 16: obtain the final ensemble prediction using (11) proportion of legitimate transactions predicted correctly by the classifier. Meanwhile, the AUC is a measure of the classifier's ability to distinguish between legitimate and fraudulent transactions. An AUC value of 1 implies a perfect model, and the closer the AUC value is to 1, the better the classifier [32]. The sensitivity and specificity can be represented mathematically as: where • True positive (TP) represents an instance where a transaction is fraudulent, and the classifiers correctly classify it as fraudulent.
• True negative (TN) denotes an instance where a transaction is legitimate, and the classifiers correctly predict it as legitimate.
• False-positive (FP) represents a case where a transaction is legitimate, and the classifier classifies it as fraudulent.
• False-negative (FN) is an instance where a fraudulent transaction is wrongly classified as legitimate. VOLUME 10, 2022

A. CLASSIFIERS PERFORMANCE WITHOUT DATA RESAMPLING
Firstly, we trained the proposed LSTM ensemble and the benchmark classifiers using the original data, which has not been resampled; this is necessary to demonstrate the impact of the data resampling on the performance of the classifiers.
The results obtained are shown in Table 1. The results show that the proposed method achieved superior performance than the other algorithms, having obtained a sensitivity of 0.839, a specificity of 0.982, and an AUC of 0.890. It is observed that all the classifiers obtained poor sensitivity values. The sensitivity or true-positive rate measures the proportion of actual fraud transactions that are correctly identified. The poor sensitivity observed in the classifiers, including the proposed ensemble, can be attributed to the class imbalance inherent in the data, hence, the need for efficient resampling.

B. CLASSIFIERS PERFORMANCE AFTER DATA RESAMPLING
In the second set of experiments conducted in this research, we used the balanced data to train the proposed LSTM ensemble and the other classifiers. The results obtained are shown in Table 2. From the experimental results, the proposed method obtained a sensitivity of 0.996, specificity of 0.998, and AUC of 0.990. Secondly, the classification performance of the various classifiers has been improved compared to Table 1, which can be attributed to the data resampling. Particularly, from Table 2, we can see that the sensitivity values of the classifiers are higher than those in Table 1. Sensitivity is a crucial metric in fraud detection, and the enhanced sensitivity values are significant because it is vital that our models correctly detect fraudulent transactions. Meanwhile, Fig 1 shows the various classifiers' receiver operating characteristic (ROC) curves. The ROC curve is used to visualize the trade-off between the true-positive rate and false-positive rate, and it is a measure of the prediction ability of the classifier [33]. From Fig 1, the ROC curve of the proposed LSTM ensemble is closer to the upper-left corner, which implies it has a better predictive ability than the other classifiers. Also, the proposed method obtained an AUC value of 0.99, which is superior to the other classifiers. These results imply that the proposed technique achieved high performance in detecting fraudulent and legitimate transactions. Furthermore, Fig 2 and

C. COMPARISON WITH EXISTING METHODS
It is not sufficient to base the superior performance of our proposed method on the comparison with conventional algorithms. However, it is necessary to compare our approach with existing credit card fraud detection methods in the  literature. The methods include the following: the sequential combination of C4.5 decision tree and naïve Bayes (NB) [5], a light gradient boosting machine (LightGBM) with a Bayesian-based hyperparameter optimization algorithm [14], a light gradient boosting machine (LightGBM) with a Bayesian-based hyperparameter optimization algorithm [14], a cost-sensitive SVM (CS SVM) [6], an optimized random forest (RF) classifier [34], a deep neural network (DNN) [35], a random forest classifier with SMOTE data resampling [36], an improved AdaBoost classifier with principal component analysis (PCA) and SMOTE method [37], a cost-sensitive neural network ensemble (CS-NNE) [38], a stochastic ensemble classifier operating in a discretized feature space [39], a model based on overfitting-cautious heterogeneous ensemble (OCHE) [40], a dynamic weighted ensemble technique using Markov Chain (DWE-MC) [41], and an extreme gradient boosting (XGBoost) ensemble classifier with SMOTE resampling technique [42].
In Table 3, the proposed LSTM ensemble with SMOTE-ENN showed excellent performance compared to the other state-of-the-art methods, indicating the robustness of the proposed approach. Lastly, to further validate the effectiveness of the proposed approach, we carried out more simulations using two more real-world datasets, i.e. the Taiwan default of credit card clients dataset [43] and the German credit dataset [44]. Both datasets have an imbalanced class distribution. The Taiwan dataset contains 30 000 instances, where 6 636 and 23 364 cases are categorized as bad and good clients, respectively. Meanwhile, the German dataset comprises 1 000 instances, where the bad clients are 300, and good clients are 700. The experimental results are tabulated in Tables 4-7.  From Tables 4-7, the proposed LSTM ensemble obtained the best performance compared to the other classifiers. For the Taiwan credit card dataset, the proposed LSTM ensemble  obtained a sensitivity of 0.924, a specificity of 0.951, and an AUC of 0.930. For the German credit dataset, the proposed classifier achieved a sensitivity of 0.904, a specificity of 0.933, and an AUC of 0.910. Therefore, from the above experimental results, it is fair to conclude that the combination of SMOTE-ENN and the proposed LSTM ensemble is an efficient method to detect credit card fraud.

V. CONCLUSION
Recently, machine learning has been crucial in detecting credit card fraud, though the class imbalance has been a significant challenge. This paper proposed an efficient approach for credit card fraud detection. Firstly, the SMOTE-ENN technique was employed to create a balanced dataset. Secondly, a robust deep learning ensemble was developed using the LSTM neural network as the base learner in the AdaBoost technique. From the experimental results, using the well-known credit card fraud detection dataset, the proposed LSTM ensemble with SMOTE-ENN data resampling achieved a sensitivity of 0.996, a specificity of 0.998, and an AUC of 0.990, which is superior to the other benchmark algorithms and state-of-the-art methods. Therefore, combining the SMOTE-ENN data resampling technique and the boosted LSTM classifier is an efficient method in detecting fraud in credit card transactions. Future research would consider more resampling techniques and improved feature selection techniques for enhanced classification performance.