Credit Card Fraud Detection Using State-of-the-art Machine Learning and Deep Learning Algorithms

Credit cards are widely used for online transactions because they are efficient and easy to use. As credit card usage has grown, so has the scope for credit card misuse. Credit card fraud causes significant financial losses for both cardholders and financial companies. The main aim of this study is to detect such fraud while addressing the associated challenges: the limited availability of public data, highly class-imbalanced data, the changing nature of fraud and high false alarm rates. The relevant literature presents a number of machine learning based approaches for credit card fraud detection, such as Extreme Learning Method, Decision Tree, Random Forest, Support Vector Machine, Logistic Regression and XG Boost. However, owing to their low accuracy, there is still a need to apply state-of-the-art deep learning algorithms to reduce fraud losses. The main focus has been to apply recent developments in deep learning algorithms for this purpose. A comparative analysis of machine learning and deep learning algorithms was performed to find efficient outcomes. The detailed empirical analysis is carried out using the European card benchmark dataset for fraud detection. Machine learning algorithms were first applied to the dataset, which improved the accuracy of fraud detection to some extent. Later, three architectures based on convolutional neural networks were applied to improve fraud detection performance. Adding further layers increased the detection accuracy further. A comprehensive empirical analysis has been carried out by varying the number of hidden layers and epochs and by applying the latest models. The evaluation shows improved results, with optimized values of accuracy, F1-score, precision and AUC of 99.9%, 85.71%, 93% and 98%, respectively. The proposed model outperforms state-of-the-art machine learning and deep learning algorithms for credit card fraud detection problems.
In addition, we have performed experiments by balancing the data and applying deep learning algorithms to minimize the false negative rate. The proposed approaches can be implemented effectively for the real-world detection of credit card frauds.


I. INTRODUCTION
Credit card fraud (CCF) is a type of identity theft in which someone other than the owner makes an unlawful transaction using a credit card or account details. A credit card that has been stolen, lost, or counterfeited might result in fraud. Card-not-present fraud, or the use of a credit card number in e-commerce transactions, has also become increasingly common as a result of the increase in online shopping. The expansion of e-banking and several online payment environments has led to increased fraud, such as CCF, resulting in annual losses of billions of dollars. In this era of digital payments, CCF detection has become one of the most important goals. For business owners, it cannot be disputed that the future is heading towards a cashless culture. As a result, typical payment methods will no longer be used in the future, and therefore they will not be helpful for expanding a business. Customers will not always visit the business with cash in their pockets. They are now placing a premium on debit and credit card payments. As a result, companies will need to update their environment to ensure that they can take all types of payments. In the next years, this situation is expected to become much more severe [1].
In 2020, there were 393,207 cases of CCF out of approximately 1.4 million total reports of identity theft [4]. CCF is now the second most prevalent type of identity theft recorded as of this year, following only government documents and benefits fraud [5]. In 2020, there were 365,597 incidences of fraud perpetrated using new credit card accounts [10]. The number of identity theft complaints climbed by 113% from 2019 to 2020, with reports of credit card identity theft increasing by 44.6% [14]. Payment card theft cost the global economy $24.26 billion last year. With 38.6% of reported card fraud losses in 2018, the United States is the most vulnerable country to credit theft.
As a result, financial institutions should place a high priority on equipping themselves with an automated fraud detection system. The goal of supervised CCF detection is to create a machine learning (ML) model based on existing transactional credit card payment data. The model should distinguish between fraudulent and nonfraudulent transactions, and use this information to decide whether an incoming transaction is fraudulent or not. The issue involves a variety of essential problems, including the system's quick reaction time, cost sensitivity, and feature preprocessing. ML is a field of artificial intelligence that uses a computer to make predictions based on prior data trends [1]. ML models have been used in a number of studies to solve numerous challenges. Deep learning (DL) algorithms have been applied in computer networks, intrusion detection, banking, insurance, mobile cellular networks, health care fraud detection, medical and malware detection, detection for video surveillance, location tracking, Android malware detection, home automation, and heart disease prediction. In this paper, we explore the practical application of ML, particularly DL algorithms, to identify credit card thefts in the banking industry. The support vector machine (SVM) is a supervised ML technique for data categorisation challenges. It is employed in a variety of domains, including image recognition [25], credit rating [5], and public safety [16]. SVM can tackle linear and nonlinear binary classification problems by finding a hyperplane, defined by the support vectors, that separates the input data, which can make it superior to other classifiers. Neural networks were the first method used to identify credit card theft in the past [4]. As a result, research is currently focused on DL, a branch of ML.
In recent years, deep learning approaches have received great attention due to substantial and promising outcomes in a variety of applications, such as computer vision, natural language processing, and voice. However, only a few studies have examined the application of deep neural networks in identifying CCF [3]. That work uses a number of deep learning algorithms for detecting CCF. In this study, however, we choose the CNN model and its layers to determine whether a transaction in the dataset is fraudulent or normal. Some transactions that have been labelled fraudulent are common in the datasets and demonstrate questionable transaction behaviour. As a result, we focus on supervised and unsupervised learning in this research paper.
Class imbalance is the problem in ML where the total number of instances of one class (positive) is far smaller than the total number of instances of another class (negative). The classification challenge of the unbalanced dataset has been the subject of several studies, and this large body of work offers a number of answers. Nevertheless, to the best of our knowledge, the problem of class imbalance has not yet been solved. We propose to alter the DL algorithm of the CNN model by adding additional layers for feature extraction as well as for classification of credit card transactions as fraudulent or otherwise. The top attributes from the prepared dataset are ranked using feature selection techniques. After that, CCF is classified using several supervised machine learning and deep learning models. In this study, the main aim is to detect fraudulent credit card transactions with the help of ML and deep learning algorithms. This study makes the following contributions:
• Feature selection algorithms are used to rank the top features from the CCF transaction dataset, which helps in class label prediction.
The rest of the paper is structured as follows: Section 2 examines the related works. The proposed model and its methodology are described in depth in Section 3. The dataset and evaluation measures are described in Section 4, which also shows the outcomes of our tests on a real dataset, as well as the analysis. Finally, Section 5 concludes the paper.

II. RELATED WORK
Several research studies have been carried out in the field of CCF detection. This section presents different research studies revolving around CCF detection. Moreover, we strongly emphasise the research that addressed fraud detection under the problem of class imbalance. Many techniques are used to detect credit card fraud. Therefore, to study the most related work in this domain, the main approaches can be categorised into DL, ML, CCF detection, ensemble and feature ranking, and user authentication approaches [1], [3].
Figure 1 shows the commonly used payment card authorization process for credit card authentication. There are two ways of authentication: passwords and biometrics. Biometrics-based authentication can be further divided into physiological authentication, behavioural authentication, and combined authentication [4], [5].

Figure 1: Payment Card Authorisation Process
A. Supervised Machine Learning Approaches
ML has many branches, and each branch can deal with different learning tasks. However, ML has different framework types. ML approaches such as random forest (RF) provide a solution for CCF. A random forest is an ensemble of decision trees [3]. Most researchers use the RF approach. To combine models, RF can be used along with network analysis; this method is called APATE [1]. Researchers can use different ML techniques, such as supervised and unsupervised techniques. ML algorithms such as LR, ANN, DT, SVM and NB are commonly used for CCF detection. Researchers can combine these techniques with ensemble techniques to construct solid detection classifiers [3]. An artificial neural network is formed by linking multiple neurons or nodes. A feed-forward multilayer perceptron is built up of numerous layers: an input layer, an output layer and one or more hidden layers. The first layer contains the input nodes, which represent the explanatory variables. These inputs are multiplied by specific weights and transferred to each of the hidden-layer nodes, where they are added together along with a certain bias.
An activation function is then applied to this summation to create the output of each neuron, which is then transferred to the next layer. Finally, the algorithm's reply is provided by the output layer. The weights are first set randomly and then adjusted using the training set to minimise the error. All these weights are adjusted by specific algorithms such as backpropagation [2], [6]. A Bayesian belief network is a graphical model of the contingency relationships between a set of variables. It was developed to relax the independence assumption in naïve Bayes and allow for dependencies among variables.

Table 1: Algorithms of Machine Learning and their Accuracy
Variables are represented as nodes, while conditional dependencies between variables are shown as arcs between nodes. A conditional probability table is attached to each node, giving the probabilities of the node's variable conditional on the values of its parent nodes [7], [8]. The computational procedure of the Bayesian belief network (BBN) is as follows. The first step is to find a structure for the network: it can be elicited from human experts or learned from the data by specific algorithms. Once the network topology is determined, the network is fitted straightforwardly using historical data, as in naïve Bayes, so that continuous variables are either discretised or assumed to be normally distributed. Correspondingly, in a BBN, each node is assumed to be independent of its non-descendants, given its parents in the graph [3], [9]. This is known as the Markov condition. A support vector machine (SVM) is a linear model for classification and regression problems. According to the SVM algorithm, we find the points closest to the separating line from both classes [10], [11]. These points are called support vectors. This paper is concerned with the integration of unsupervised techniques with supervised techniques for the classification

B. Deep Learning Approaches
DL algorithms include the convolutional neural network (CNN), deep belief networks (DBNs) and deep autoencoders; these are considered learning methods. They have numerous layers for processing data, representation learning and pattern classification [7], [15]. Deep learning is concerned with the study of artificial neural networks. Backpropagation is regarded as the standard technique for training neural networks [8], [16]. However, the efficiency of the backpropagation algorithm decreases greatly as the depth of the neural network increases, which can cause problems such as poor local optima and error dilution. Deep architectures should therefore be considered an achievement. They can theoretically address the optimisation difficulty within the training parameters [17], [18].
The training technique of the deep belief network is often considered the first effective case of deep architecture training. Traditional ML algorithms, such as SVM, DT and LR, have been extensively proposed for CCF detection [3]. These traditional algorithms are not very well suited for large datasets. A CNN is a DL method that works well with three-dimensional data, such as in image processing. It is similar to the ANN: the CNN has the same hidden-layer structure, but with a different number of channels in each layer and with special convolution layers. The word 'convolution' refers to the idea of moving filters across the data, which can be used to capture the key information and automatically performs feature reduction. Thus, the CNN is widely used in image processing. The CNN does not require heavy data pre-processing for training.
For image processing, the purpose of using a CNN is to minimise processing without losing key features by reducing the image to make predictions [4], [6]. The main terms in the CNN are feature maps, channels, pooling, stride, and padding. For text, image and video processing, CNN models conventionally take two-dimensional data as input, which is called the 2DCNN. The feature mapping process is used to learn the internal representation from the input data. The location of features is not relevant, and the same procedure can be used for one-dimensional data. Natural language processing is a very popular example of a 1DCNN application, where the problem becomes one of sequence classification. In a 1DCNN, the kernel filter moves top to bottom along the sequence of a data sample, rather than moving left to right and top to bottom as in the 2DCNN [17], [18].
Raghavan [16] defined an autoencoder as an actual neural network. An autoencoder encodes the data and then decodes it in the same way. In this method, the autoencoders are trained on non-anomalous points. According to the reconstruction error, the anomaly detector classifies a point as 'fraud' or 'no fraud': a point on which the system has not been trained is expected to have a higher reconstruction error [19], [20]. A point whose error lies slightly above the upper bound value, or threshold, is considered an anomaly. This technique is also used in [8] for autoencoder-based network anomaly detection. A generative adversarial network (GAN) is an ML model in which two neural networks compete to improve their prediction accuracy. GANs are often unsupervised and learn using a cooperative zero-sum game framework. The GAN is a fundamental category of deep-learning model [11], [21], and it is regarded as one of the most promising directions for DL progress. A GAN has two main modules. In training, each of the modules is a DL model, that is, a neural network.
The two main modules are a generator (G) and a discriminator (D). The generator network generates simulated data, and the discriminator determines the difference between the simulated data and the target data, yielding a true-or-false decision on the simulated data. Finally, the model can generate higher-quality simulated data to complete the data creation process [22], [23]. A variational autoencoder (VAE) is an autoencoder whose training distribution is regularised to guarantee that its latent space has adequate properties, allowing us to create fresh data. A VAE is obtained by introducing variation on the basis of the autoencoder. The VAE and the GAN are extremely similar. Once again, the goal is to transform and match the data distribution to generate virtual data that is near the target [8], [22].
Usually, the samples are assumed to follow a normal distribution. If all examples are found, the work can be very successful. Consequently, investigators frequently use neural networks to approximate the mean and variance of the normal distribution. Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in DL models [24], [25]. The LSTM network is suited to classifying, processing and making predictions based on time-series data. The LSTM is the most common type of RNN. An ordinary neural network (NN) cannot keep track of the preceding information of a learning task every time it has to perform a task. In very simple words, an RNN is a neural network with memory [26], [27]. RNNs tend to have short-term memory because of the vanishing gradient problem. Backpropagation is the backbone of neural networks, as it reduces the loss by adjusting the network weights using the gradients it computes. In RNNs, the gradient shrinks as it propagates backwards through the network, so the weights receive only a minor update. The earlier layers in the network are affected by these small updates: they do not learn much, and the RNN loses the ability to recall early examples in long sequences, making it a short-term memory network [28].
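The vanishing gradient effect can be illustrated with a toy calculation. The sketch below assumes a hypothetical scalar RNN with a tanh activation, so each step back through time multiplies the gradient by the recurrent weight times the tanh derivative; the weight and pre-activation values are illustrative and are not taken from this study.

```python
import math

def backprop_gradient_norms(weight, pre_activations):
    """Gradient magnitude that reaches each earlier time step.

    Toy scalar RNN h_t = tanh(weight * h_{t-1} + x_t): moving one step
    back multiplies the gradient by weight * (1 - tanh(a_t) ** 2).
    """
    grad = 1.0
    norms = []
    for a in reversed(pre_activations):
        grad *= weight * (1.0 - math.tanh(a) ** 2)
        norms.append(abs(grad))
    return norms

# Hypothetical values: recurrent weight 0.9, pre-activation 0.5 at 30 steps.
norms = backprop_gradient_norms(weight=0.9, pre_activations=[0.5] * 30)
print(f"1 step back: {norms[0]:.4f}, 30 steps back: {norms[-1]:.2e}")
```

Because every factor is below one in this setting, the gradient that reaches the earliest steps is orders of magnitude smaller than the gradient at the most recent step, which is why early inputs contribute little to the weight updates.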
The use of DL methods is still very limited, and methods such as CNN and LSTM are favoured for image classification and natural language processing (NLP), as is the RBM, because of their ability to handle massive datasets. How these DL methods perform in CCF classification is the major focus of this study [29]. In addition, data pre-processing is an important stage in the ML process. How classification performance is affected by data pre-processing when detecting credit card fraud is another question that needs to be answered. Table 2 presents the summary of deep learning algorithms.

III. RESEARCH METHODOLOGY
Research is methodical, and the research methodology is determined by the applied research method. Applied research is conducted to solve practical problems. Before real-world experimentation, the research covers all fundamentals by performing these steps:

A. List of features of credit card transaction data
Table 3 lists the important features and shows the mainframe transaction table of credit cards. Even though the overall structure of the transaction information table might differ slightly amongst card issuers, the vital attributes recorded are held in the database and are accessible for fraud detection modelling.
The country where the transaction takes place

Transaction City
The city where the transaction takes place

Approval Code
The response to the authorisation request, i.e., approve or reject.

Experimental Setup
We discuss the dataset to be used and the performance evaluation measures to be applied.

I. DESCRIPTION OF DATASET
The credit card dataset is accessible for research purposes. The dataset [11] holds transactions made by cardholders over a two-day period in September 2018. There were 284,807 transactions in total, of which 492, or 0.172 percent, were fraudulent. Because disclosing a consumer's transaction details is considered a confidentiality problem, principal component analysis (PCA) is applied to the majority of the dataset's features. PCA is a standard and widely used technique in the relevant literature for reducing the dimensionality of such datasets, increasing interpretability while minimizing information loss [2], [4], [19]. It does so by creating new uncorrelated variables that successively maximize variance. We apply the following machine and ensemble learning algorithms.

1) Extreme Learning Method
The extreme learning method (ELM) is a neural network for classification, clustering, regression and feature learning. It can be used with a single layer or multiple layers of hidden nodes. The parameters of the hidden nodes need not be tuned. The output weights of the hidden nodes are learned in a single step, which essentially amounts to learning a linear model. Given an ELM with a single hidden layer, we assume that the output function of the i-th hidden node is $h_i(z) = G(p_i, q_i, z)$, where $p_i$ and $q_i$ are the parameters of the i-th node. The output function is $f(z) = \sum_{i=1}^{L} \beta_i h_i(z)$, where $L$ is the number of hidden nodes and $\beta_i$ is the output weight of the i-th hidden node.

2) Decision Tree
Starting with the decision tree, the decision tree classifier is used to create the model. We set 'max depth' to '4' in the algorithm, which limits the tree to four levels of splits, and 'criterion' to 'entropy', which selects each split by the information gain it yields. Finally, we fit the model and store the predicted values.
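To make the 'entropy' criterion concrete, the following minimal sketch computes the entropy of a set of class labels and the information gain of a candidate split. It is a standard-library illustration on hypothetical toy labels, not the library implementation used in the experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels: 0 = legitimate, 1 = fraudulent.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
gain_pure = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])
gain_none = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])
print(gain_pure, gain_none)
```

A split that separates the classes perfectly recovers the full entropy of the parent node, whereas a split that leaves both children as mixed as the parent yields no gain; an entropy-based tree prefers the former at every node.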
3) K-Nearest Neighbours (KNN)
Supervised learning is learning in which the result that we want or expect is included in the training data (labelled data); the quantity that we need to learn is known as the target or the dependent variable. For K-Nearest Neighbours (KNN), we build the model using the 'K-Neighbours Classifier' model and take the value of k, the number of nearest neighbours, as '5'. The value of 'n-neighbours' is arbitrarily selected, but it can be selected more systematically by iterating over a range of values. We then fit the model and store the predicted values in the 'knn-yhat' variable.
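The voting rule behind KNN can be sketched in a few lines. The example below is a standard-library illustration with k = 5 on hypothetical two-dimensional points; the function name `knn_predict` and the data are assumptions for illustration, and the experiments themselves rely on a library classifier.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of query by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: class 0 near the origin, class 1 near (5, 5).
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
           (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5), (5.2, 5.8)]
train_y = [0] * 6 + [1] * 6

pred_near_origin = knn_predict(train_X, train_y, (0.4, 0.6), k=5)
pred_far = knn_predict(train_X, train_y, (5.4, 5.6), k=5)

# Iterating over a range of k values, as suggested in the text.
preds_by_k = {k: knn_predict(train_X, train_y, (0.4, 0.6), k) for k in (1, 3, 5)}
print(pred_near_origin, pred_far, preds_by_k)
```

In practice, the value of k would be chosen by comparing validation scores across such a range rather than by inspecting a single query.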

6) Logistic regression
Logistic regression is a simple algorithm that estimates the association between one dependent binary variable and the independent variables, computing the probability of the occurrence of an event. The regularisation parameter C controls the trade-off between increasing complexity (overfitting) and keeping the model simple (underfitting). For large values of C, the strength of regularisation is reduced, and the model increases its complexity, thus overfitting the data. The parameter 'C' is tuned using RandomizedSearchCV() for the different datasets: the original, the standardised and the dataset with the most important features. Once the parameter 'C' is defined for each dataset, the logistic regression model is instantiated and then fitted to the training data, as described in the methodology. The logistic regression hypothesis is $h_\theta(x) = g(\theta^{T} x)$, where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, so that
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \qquad (7)$$
Here θ (theta) is the vector of parameters that our model learns to fit our classifier.
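The logistic regression hypothesis can be evaluated directly. The sketch below is a minimal standard-library illustration; the parameter vector `theta` and the feature vectors are hypothetical, with the first feature fixed at 1.0 to play the role of the intercept.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """Estimated probability of the positive class: g(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0]            # hypothetical learned parameters
p_boundary = hypothesis(theta, [1.0, 0.5])   # theta^T x = 0
p_positive = hypothesis(theta, [1.0, 3.0])   # theta^T x = 5
print(p_boundary, p_positive)
```

A point with $\theta^{T} x = 0$ lies on the decision boundary and receives probability 0.5; larger values of $\theta^{T} x$ push the estimate towards 1.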

7) XG Boost
XG Boost is a decision-tree-based ensemble ML algorithm that uses a gradient boosting framework.

III. APPLIED DEEP LEARNING TECHNIQUES
We apply the following deep learning algorithms.

a. Baseline Model
Essentially, a baseline is a model that has a reasonable chance of providing acceptable results and is simple to set up. Baselines are usually quick to experiment with, and implementations are widely available in popular packages at low cost.

b. Convolutional Neural Network (CNN)
CNNs, also known as Conv-Nets, contain multiple layers and are mostly used for processing images. They are widely used for object detection, image processing and classification, time-series forecasting and anomaly detection.

Layers in the CNN Model
There are six distinct layers in the CNN model: 1. Input layer

Convo Layer
The convo layer is sometimes known as the feature extraction layer, since features of the text are extracted within this layer. First, a part of the text is passed to the Convo layer to perform a convolution operation, which calculates the dot product between the receptive field and the filter. The outcome of the operation is a single number in the output volume. The Convo layer also holds the ReLU activation function, which sets all negative values to zero.
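The convolution and ReLU steps can be sketched for a one-dimensional input as follows. This is a minimal standard-library illustration with a hypothetical sequence and a hypothetical two-element filter, not the Keras layer used in the experiments.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide kernel along sequence, take dot products, then apply ReLU."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]      # the receptive field
        dot = sum(w * v for w, v in zip(kernel, window)) + bias
        outputs.append(max(0.0, dot))           # ReLU: negatives become zero
    return outputs

# Hypothetical input sequence and filter.
features = conv1d_relu([1.0, -2.0, 3.0, 0.5, -1.0], kernel=[1.0, -1.0])
print(features)  # [3.0, 0.0, 2.5, 1.5]
```

Each output value is one dot product between the filter and a window of the input; the second window produces a negative value, which ReLU replaces with zero.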

Pooling Layer
The pooling layer is used to decrease the spatial volume of the input text after convolution. It is used between two convolution layers. If we put a fully connected layer after the Convo layer without first including a pooling or max pooling layer, then it will be computationally expensive, which we do not want. Therefore, max pooling must be used to reduce the spatial volume of the input text, as shown in Figure 2.
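As a minimal sketch of how max pooling reduces the volume of a feature map, the function below keeps only the largest value in each window. The pool size, stride and input values are hypothetical choices for illustration.

```python
def max_pool1d(values, pool_size=2, stride=2):
    """Keep the maximum of each window of pool_size values."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, stride)]

# Hypothetical feature map with six values, pooled down to three.
pooled = max_pool1d([3.0, 0.0, 2.5, 1.5, 4.0, 0.5])
print(pooled)  # [3.0, 2.5, 4.0]
```

With a pool size and stride of 2, the output is half the length of the input, so a fully connected layer placed afterwards needs correspondingly fewer weights.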

Fully Connected Layer (FC)
A fully connected layer includes weights, biases, and neurons. It connects the neurons in one layer to the neurons in another layer. This layer is used to classify data into different categories by training. These categories are:

SoftMax/Logistic Layer
The SoftMax or Logistic layer is the final layer of the CNN. It is placed after the FC layer. Logistic is used for binary classification, and SoftMax is used for multiclass classification.
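The relationship between the two output functions can be checked numerically. The sketch below, using only the standard library and hypothetical logits, shows that SoftMax yields a probability distribution over classes and that, for two classes, it reduces to the logistic function.

```python
import math

def sigmoid(z):
    """Logistic output for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """SoftMax output for multiclass classification (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])        # hypothetical three-class logits
two_class = softmax([1.5, 0.0])[0]      # two classes, second logit fixed at 0
print(probs, two_class, sigmoid(1.5))
```

The two-class SoftMax probability equals `sigmoid(1.5)`, which is why a single logistic unit suffices for the fraud versus nonfraud decision.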

Output Layer
The output layer holds the label, which is in the form of one-hot encoding. Hence, we have a better understanding of CNN. We implement a CNN in Keras. Figure 3 depicts the architecture of CNN from input to output layer.

Creation of the Model
The pipeline of the CNN model in Keras includes a conv layer, max pooling layer, dropout layer, conv layer, max pooling layer and dropout layer, followed sequentially by two fully connected layers. Figure 4 depicts input neural network and output of dropout layer.

Epochs and Batch Size
We used a dataset of 20 samples and a batch size of 2, and determined that the algorithm needed to run for three epochs. Consequently, each epoch uses ten batches (20/2 = 10). All batches are run through the algorithm, so we have ten iterations per epoch. This method is often an improvement over the sequential model. The most modification comes from the Stalk group and a few slight changes within the module of the sequential model.
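The relationship between samples, batch size and epochs is simple arithmetic, sketched below for the figures quoted in the text. The helper name `training_schedule` is hypothetical.

```python
import math

def training_schedule(n_samples, batch_size, epochs):
    """Return (batches per epoch, total iterations over all epochs)."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs

batches, iterations = training_schedule(n_samples=20, batch_size=2, epochs=3)
print(batches, iterations)  # 10 30
```

With 20 samples and a batch size of 2, one epoch consists of 10 weight updates, and three epochs give 30 updates in total.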

IV. PERFORMANCE-EVALUATION MEASURES
Traditional methods of evaluating ML classifiers use the confusion matrix, which relates the ground truth of the dataset to the model's predictions, where TP, TN, FP, and FN denote true positive, true negative, false positive and false negative, respectively.
1. ACCURACY
Accuracy is used to measure performance in the domain of information retrieval and data processing. The fraction of the results that are successfully classified can be represented by equation (9) as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
2. PRECISION
Precision is a performance measure defined as the ratio of correctly identified positives to the total number of identified positives. This can be seen as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$

F-MEASURE/F1-SCORE
The f-measure considers both the precision and the recall. The f-measure may be regarded as the harmonic mean of the two values, which can be seen as follows:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

RECALL
The recall is also referred to as the sensitivity, which is the ratio of relevant instances retrieved to the total number of relevant instances and can be seen as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
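The four measures can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical counts chosen for illustration; they are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 80 frauds caught, 10 missed, 20 false alarms.
metrics = classification_metrics(tp=80, tn=890, fp=20, fn=10)
print(metrics)
```

On imbalanced data, accuracy is dominated by the true negatives, so precision, recall and F1-score are usually more informative about how well the fraud class is detected.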

A. Data Visualisation
The dataset covers credit card transactions made in October 2018 by European cardholders. The dataset includes transactions that happened over two days, and it includes 492 frauds out of a total of 284,807 transactions. It contains only numerical input variables, which are the outcome of a PCA transformation. Due to confidentiality issues, we cannot provide the original features of the dataset or more background information about the data. The feature 'Time' contains the seconds elapsed between the first transaction in the dataset and each transaction. Figure 5 shows the class distribution of the CCF dataset into fraudulent and nonfraud transactions.

Figure 5: Class Distribution of Fraudulent and nonfraud transactions
Another insight about the data is that there are no null values; hence, there is no need to fill in missing values.

B. Top 10 algorithms in Machine Learning for Fraud Detection
In the study [3], the top ten ML algorithms are incorporated for the detection of credit card fraud. The list of these algorithms is given below:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naïve Bayes
6. CNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting Algorithms
These algorithms can also encompass association analysis, clustering, classification, statistical learning, and link mining. These are among the most critical topics covered by ML research and development.

I. THE CONFUSION MATRICES FOR MODELS
A confusion matrix is a visualisation of a classification model that shows how well the model's predictions match the original outcomes. Typically, the predicted results are stored in a variable that is then converted into a correlation table. The confusion matrix can be plotted as a heatmap using this correlation table. Even though there are numerous built-in methods to visualise a confusion matrix, we can define and visualise it from scratch to allow for better understanding. Figure 6 depicts the confusion matrices of the machine learning algorithms.

II. THE ACCURACY OF MACHINE LEARNING ALGORITHMS
In this phase, we build six distinct kinds of classification models. We could use numerous other models to resolve classification problems; however, these are the most popular models in use. All these models can be built using the algorithms provided by the scikit-learn package. The results of the applied ML algorithms are presented in Table 5.

III. RESULT OF THE CASE AMOUNT STATISTICS OF THE DATASET
As shown in the case count statistics in Figure 7, the values of the 'Amount' variable vary substantially compared with the rest of the variables. To reduce the wide range of its values, we can standardise it by means of the 'Standard-Scaler' method in Python.
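Standardisation rescales a variable to zero mean and unit variance. The sketch below reproduces that transformation with the standard library on hypothetical transaction amounts; it uses the population standard deviation, and the function name `standard_scale` is an assumption for illustration.

```python
import statistics

def standard_scale(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]   # constant column: nothing to scale
    return [(v - mean) / std for v in values]

# Hypothetical 'Amount' values with a wide range.
amounts = [1.0, 25.5, 149.62, 2.69, 378.66]
scaled = standard_scale(amounts)
print([round(v, 3) for v in scaled])
```

After scaling, 'Amount' is on a scale comparable to the PCA-transformed features, so it no longer dominates distance-based or gradient-based learners.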

IV. THE COMPARATIVE ANALYSIS OF MACHINE LEARNING ALGORITHMS
Figure 8 shows the comparative analysis of the applied ML algorithms for CCF using accuracy and F1 measure metrics.

C. Top 10 algorithms in Deep Learning for Fraud Detection
In [8], ten DL algorithms are identified as top algorithms. The list of these algorithms is given below:

I. THE EVALUATION METRICS
We can use a confusion matrix to summarise the actual vs. predicted labels, where the X-axis is the predicted label and the Y-axis is the actual label. If the model had predicted everything accurately, this would be a diagonal matrix in which the values off the main diagonal, which indicate incorrect predictions, would be zero. In this case, the matrix shows comparatively few false positives, meaning that a few legitimate transactions were flagged incorrectly. This trade-off might be desirable because false negatives would permit more fraudulent transactions to go through.
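A binary confusion matrix with actual labels on the rows and predicted labels on the columns can be built by simple counting. The labels below are hypothetical and serve only to illustrate the layout.

```python
def confusion_matrix(actual, predicted):
    """2x2 matrix: rows are actual labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels: 0 = legitimate, 1 = fraudulent.
actual = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm)  # [[4, 1], [1, 2]]
```

Here `cm[0][1]` counts false positives (legitimate transactions flagged as fraud) and `cm[1][0]` counts false negatives (frauds that slipped through); a perfect model would leave both at zero.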

III. THE SUMMARY OF THE CNN MODEL
Once a model is "built", the summary() method can be called to show its details, as shown in Table 8. It can also be beneficial, when constructing a sequential model incrementally, to show the summary of the model thus far with the current output. The total number of parameters is 119,457 and the total number of trainable parameters is 119,265. Finally, the number of nontrainable parameters is 192.

IV. THE SUMMARY OF THE BASELINE MODEL
Using the function, we now build and train the previously defined model. Note that the model is best suited to a large batch size (2048 or larger); this is important for ensuring that each batch has a decent chance of containing a rare positive (fraud) example. The summary of the baseline model is presented in Table 9. The total number of parameters is 497, all of which are trainable; the number of non-trainable parameters is 0.
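The effect of batch size on the chance of seeing a fraud example can be estimated with a short calculation. The sketch assumes examples are drawn independently and uses a fraud rate of 492 in 284,807 transactions, the figure commonly reported for the European cardholder dataset; that rate is an assumption here, since it is not stated in this section.

```python
def prob_batch_has_positive(positive_rate, batch_size):
    """P(at least one positive in a batch), assuming independent draws."""
    return 1.0 - (1.0 - positive_rate) ** batch_size

# Assumed class balance (commonly reported for this dataset).
fraud_rate = 492 / 284807

p_small = prob_batch_has_positive(fraud_rate, 32)     # small batch
p_large = prob_batch_has_positive(fraud_rate, 2048)   # large batch
```

Under these assumptions, a small batch rarely contains any fraud example, whereas a batch of 2048 almost always contains at least one, so each gradient step sees some positive signal.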

V. DISTRIBUTION OF THE DATA
Identifying fraudulent credit card transactions is a common type of imbalanced binary classification, with a positive class (is fraud) and a negative class (is not fraud). We then compare the distributions of the positive and negative instances over a few features. The positive and negative distributions are shown in Figure 11 and Figure 12, respectively.

VI. VARIATION OF EPOCHS
We train the model for 20 and 30 epochs, with and without careful initialisation, and compare the losses. Figure 13 shows the validation loss using zero bias and careful bias; careful initialisation gives a clear advantage in validation loss.
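The text does not define careful initialisation. A common choice for imbalanced data, assumed in the sketch below, is to set the bias of the sigmoid output unit to the log-odds of the positive class so that the initial predictions match the base fraud rate. The class counts are also assumed (492 fraud and 284,315 legitimate transactions, as commonly reported for this dataset).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_bce(p_pred, pos_rate):
    """Expected binary cross-entropy when every example gets probability p_pred."""
    return -(pos_rate * math.log(p_pred)
             + (1.0 - pos_rate) * math.log(1.0 - p_pred))

# Assumed class counts (not stated in this section).
pos, neg = 492, 284315
pos_rate = pos / (pos + neg)

zero_bias = 0.0
careful_bias = math.log(pos / neg)   # log-odds of the positive class

# With weights near zero, the initial output is sigmoid(bias) for every input.
loss_zero = expected_bce(sigmoid(zero_bias), pos_rate)
loss_careful = expected_bce(sigmoid(careful_bias), pos_rate)
```

With a zero bias the model starts by predicting 0.5 for every transaction, so its initial loss is ln 2; with the log-odds bias the initial loss is far lower, and the first epochs are not spent merely learning that fraud is rare.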

VIII. THE DIAGNOSIS MODEL BEHAVIOUR
The shape and dynamics of a learning curve can be used to diagnose the behaviour of an ML or DL model and, possibly, to recommend configuration changes that improve learning and performance. Learning curves are typically characterised as underfit, overfit, or good fit over the epochs. The learning curve plots training and validation accuracy, and training and validation loss, against epochs. Overfitting appears at the epochs where validation accuracy is lower than training accuracy and validation loss is greater than training loss. Figure 15 depicts the accuracy and loss of the CNN model using the balanced CCF dataset.
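The overfitting criterion described above can be expressed as a small check over a training history. This is a minimal sketch; the function name and the history values are hypothetical and do not come from the experiments.

```python
def overfitting_epochs(train_acc, val_acc, train_loss, val_loss):
    """Return the 1-based epochs where validation accuracy is below training
    accuracy and validation loss is above training loss."""
    flagged = []
    history = zip(train_acc, val_acc, train_loss, val_loss)
    for epoch, (ta, va, tl, vl) in enumerate(history, start=1):
        if va < ta and vl > tl:
            flagged.append(epoch)
    return flagged

# Hypothetical per-epoch history, for illustration only.
train_acc  = [0.90, 0.94, 0.96, 0.98]
val_acc    = [0.91, 0.94, 0.95, 0.95]
train_loss = [0.30, 0.20, 0.12, 0.07]
val_loss   = [0.28, 0.20, 0.15, 0.18]

flagged = overfitting_epochs(train_acc, val_acc, train_loss, val_loss)
```

In this toy history, the gap between the training and validation curves opens from the third epoch onwards, which is the pattern the learning-curve plots are meant to reveal.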

1) Architecture of 14 Layers
Our proposed model has 14 layers: a convolutional layer with 32 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.2. Then, we add another convolutional layer with 64 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.5. Then, we add a flatten layer, followed by a dense layer of 64 units with a ReLU activation function and a dropout layer with a dropout rate of 0.5, followed by 3 dense layers. The first dense layer has 100 units and a ReLU activation function. The second dense layer has 50 units and a ReLU activation function. The third dense layer has 25 units and a ReLU activation function. Finally, we add a dense layer for classification with a sigmoid activation function. At 100 epochs, the accuracy is 96.34%.

2) Architecture of 17 Layers
Our proposed model has 17 layers: a convolutional layer with 32 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.2. Then, we add another convolutional layer with 64 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.5. Then, we add another convolutional layer with 64 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.25.
Then, we add a flatten layer, followed by a dense layer of 64 units with a ReLU activation function and a dropout layer with a dropout rate of 0.5, followed by 3 dense layers. The first dense layer has 100 units and a ReLU activation function. The second dense layer has 50 units and a ReLU activation function. The third dense layer has 25 units and a ReLU activation function. Finally, we add a dense layer for classification with a sigmoid activation function. After 100 epochs, the accuracy is 95.53%.

3) Architecture of 20 Layers
Our proposed model has 20 layers: a convolutional layer with 32 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.2. Then, we add another convolutional layer with 64 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.5. Then, we add another convolutional layer with 64 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.5.
Then, we add another convolutional layer with 64 filters, a kernel size of 2 and a ReLU activation function, followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.25. Then, we add a flatten layer, followed by a dense layer of 64 units with a ReLU activation function and a dropout layer with a dropout rate of 0.5, followed by 3 dense layers. The first dense layer has 100 units and a ReLU activation function. The second dense layer has 50 units and a ReLU activation function. The third dense layer has 25 units and a ReLU activation function. Finally, we add a dense layer for classification with a sigmoid activation function. At 100 epochs, the accuracy is 94.92%.
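The three architectures differ only in the number of convolutional blocks, which the following sketch makes explicit by generating the layer sequence as plain data. It rests on several assumptions that the text does not confirm: each convolution is one-dimensional with a kernel size of 2, the dense layer after the flatten layer has 64 units, the output layer has a single unit, and the input layer is counted as a layer. The layer names mirror Keras class names, but no deep learning library is required to run the sketch.

```python
def build_layer_specs(conv_filters, conv_dropouts, kernel_size=2):
    """Return the layer sequence as (name, settings) pairs."""
    layers = [("Input", {})]                       # assumed to count as a layer
    for filters, rate in zip(conv_filters, conv_dropouts):
        layers.append(("Conv1D", {"filters": filters,
                                  "kernel_size": kernel_size,
                                  "activation": "relu"}))
        layers.append(("BatchNormalization", {}))
        layers.append(("Dropout", {"rate": rate}))
    layers.append(("Flatten", {}))
    layers.append(("Dense", {"units": 64, "activation": "relu"}))  # assumed size
    layers.append(("Dropout", {"rate": 0.5}))
    for units in (100, 50, 25):
        layers.append(("Dense", {"units": units, "activation": "relu"}))
    layers.append(("Dense", {"units": 1, "activation": "sigmoid"}))
    return layers

# Filter counts and dropout rates follow the three descriptions above.
variants = {
    14: build_layer_specs([32, 64], [0.2, 0.5]),
    17: build_layer_specs([32, 64, 64], [0.2, 0.5, 0.25]),
    20: build_layer_specs([32, 64, 64, 64], [0.2, 0.5, 0.5, 0.25]),
}
layer_counts = {name: len(spec) for name, spec in variants.items()}
```

Each extra convolutional block adds three layers (convolution, batch normalisation, dropout), which accounts for the step from 14 to 17 to 20 layers under this counting.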

Figure 17: Accuracy of the CNN model over number of layers

XII. THE COMPARATIVE ANALYSIS OF THE MACHINE LEARNING AND DEEP LEARNING ALGORITHMS
The most important distinction between DL and standard ML is how performance changes with the amount of data: DL techniques do not perform well when the amount of data is very small, because DL algorithms require a large quantity of data to learn features fully. In this study, the ML algorithms are less accurate than the DL algorithms, and the accuracy of both the existing ML and DL algorithms is low compared with that of the proposed model. Table 10 presents the comparative analysis of the ML and DL algorithms.

V. CONCLUSION AND FUTURE WORK
CCF is an increasing threat to financial institutions. Fraudsters constantly come up with new fraud methods, and a robust classifier can handle the changing nature of fraud. Accurately predicting fraud cases and reducing false-positive cases is the foremost priority of a fraud detection system. The performance of ML methods varies for each individual business case, and the type of input data is a dominant factor in the choice of ML method. For detecting CCF, the number of features, the number of transactions, and the correlation between the features are essential factors in determining the model's performance. DL methods, such as CNNs and their layers, are associated with text processing and with the baseline model. Using these methods for the detection of credit card fraud yields better performance than traditional algorithms. Comparing the performance of all the algorithms side by side, the CNN with 20 layers and the baseline model is the top method, with an accuracy of 99.72%. Numerous sampling techniques are used to increase performance on existing examples, but performance decreases significantly on unseen data. The performance on unseen data increased as the class imbalance increased. Future work may explore the use of more state-of-the-art deep learning methods to improve the performance of the model proposed in this study.

4) Random Forest (RF)
RF is an ensemble learning technique for classification and regression. Deep trees are used to learn irregular patterns. RF averages multiple deep trees trained on different parts of the same training sample, which reduces the variance. Given the training data P = p1, …, pn with responses Q = q1, …, qn, bagging repeatedly (X times) selects a random sample of the training set with replacement and fits trees to these samples as follows: For x = 1, …, X:

5) Support Vector Machine (SVM)
The SVM algorithm classifies text effectively. The SVM separates positive and negative instances with a large margin. The SVM provided better results than Naïve Bayes in earlier studies on fraud detection. A decision surface is used to split the training points into two categories based on support vectors. Optimisation is calculated as follows:

Therefore, when using unstructured data with prediction problems (text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. The XGBoost model for classification is called the XGBClassifier. It can be fitted to our training dataset. Models are fitted using the scikit-learn API and the model's fit() function. Parameters for training the model can be passed to the model in the constructor. Here, we use sensible defaults.
This model determines how to classify an extremely imbalanced dataset where the number of examples in one class greatly outnumbers the examples in another.
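The bagging step described for Random Forest, in which X random samples are drawn with replacement and one model is fitted to each, can be sketched with the standard library alone. The one-feature threshold classifier below is a toy stand-in for a decision tree, and the data are hypothetical; a full Random Forest also samples features at each split, which this sketch omits.

```python
import random

def bootstrap_sample(P, Q, rng):
    """Draw len(P) training pairs with replacement."""
    idx = [rng.randrange(len(P)) for _ in range(len(P))]
    return [P[i] for i in idx], [Q[i] for i in idx]

def fit_stump(P, Q):
    """Toy stand-in for a tree: threshold halfway between the class means.
    Assumes class 1 tends to have larger feature values than class 0."""
    zeros = [p for p, q in zip(P, Q) if q == 0]
    ones = [p for p, q in zip(P, Q) if q == 1]
    if not zeros or not ones:
        constant = Q[0]                  # sample contains a single class
        return lambda x: constant
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Majority vote over the fitted models."""
    votes = sum(model(x) for model in models)
    return 1 if 2 * votes > len(models) else 0

# Hypothetical one-feature training set, for illustration only.
P = [0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0]
Q = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
X = 25                                   # number of bagging rounds
models = [fit_stump(*bootstrap_sample(P, Q, rng)) for _ in range(X)]
```

Because each model sees a slightly different resample, individual thresholds vary, while the majority vote is more stable than any single model.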

2. Convo layer (Convo + ReLU)

Input layer
The input layer in the CNN model incorporates CSV data. Text data is characterised by three-dimensional matrices, which should be reshaped into one column.

Figure 4: Application of Dropout over Neural Network

2. Compile the Model
Categorical cross-entropy
We used binary cross-entropy in the earlier sections and in ML. Here, we use categorical cross-entropy, which means that we have multiple classes. The equation can be seen as follows:
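For reference, the standard definitions of the two losses can be written as below. The symbols (N samples, C classes, y for the true label and \hat{y} for the predicted probability) are notation assumed here rather than taken from the original.

```latex
% Binary cross-entropy (two classes, one predicted probability per sample)
L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]

% Categorical cross-entropy (C classes, one-hot true labels)
L_{\mathrm{CCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C}
  y_{i,c} \log \hat{y}_{i,c}
```

With C = 2 and one-hot labels, the categorical form reduces to the binary form.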

Figure 7: The Case Count Statistics for fraud and non-fraud transactions.

Figure 9: Metrics of Deep Learning with epoch sizes of 35 and 14

Figure 10: Area under the interpolated precision-recall curve

II. THE ACCURACY OF DEEP LEARNING ALGORITHMS
Table 7 shows the training and validation accuracy of the proposed CNN and baseline CNN algorithms. The CNN model is applied by varying the number of layers from 11 to 20, and the results are compared with the baseline 5-layer architecture.

Figure 11: Positive Distribution of the Data

Figure 12: Negative Distribution of the Data

Figure 13: Validation loss using Zero Bias and Careful Bias

VII. RECORD OF THE TRAINING DATASET
In this section, we plot the model's accuracy and loss on the training and validation sets. These plots are valuable when checking for overfitting, as they can help us learn more about the overfitting and underfitting of the model. Figure 14 depicts the training and validation loss, precision recall accuracy (PRC), precision and recall over 35 epochs.

Figure 14: Training and Validation history of Loss, Precision Recall Accuracy (PRC), Precisions and Recall (Epoch size 35)

Figure 15: Training and validation history of accuracy and loss of CNN model using 100 epochs

Figure 16 depicts the training and validation accuracy of the proposed model over 20 and 50 epochs.

Figure 16: Model Accuracy when Epoch Sizes are 20 and 50

XI. RESULT OF THE CNN LAYERS IMPLEMENTATION
Our proposed sequential model has a convolutional layer with 32 filters of size 3 and a ReLU activation function, which is followed by a batch normalisation layer and a dropout layer with a dropout rate of 0.25. Figure 17 depicts the accuracy of the CNN model using different layer architectures. The architectures of our proposed model are as follows.

Table 3: The list of features available in the CCF dataset

Table 4 presents the details of the dataset, which contains 31 columns: Time, V1, V2, V3, …, V28 as PCA-applied features, Amount, and the class label.

Table 4: Characteristics of the dataset

Table 5: The Accuracy and F1-score of Machine Learning algorithms

Table 6: The result of CNN model using epoch sizes of 35 and 14

Table 7: The accuracy of deep learning models using different epochs.

Table 8: The summary of CNN Sequential model

Table 9: The summary of baseline CNN Sequential model

Table 10 presents the training and validation results of the baseline deep learning model using 35 and 14 epochs.

Table 10: Results of deep learning model using different epochs

Table 10: Comparative analysis of ML and DL algorithms