A Closer Look Into the Characteristics of Fraudulent Card Transactions

The widespread use of the Internet has substantially increased the number of online card transactions, especially since the beginning of the last decade. Along with this increase, the worldwide banking sector was forced to deal with an unprecedented number of fraudulent activities. Hence, rule-based systems were designed to flag high-risk transactions and let experts confirm their fraudulent nature. In response, recent attacks exploited the static nature of rule-based systems to go undetected. Thus, researchers aimed at designing adaptive fraud detection systems, mainly utilizing machine learning techniques and, very recently, deep learning. However, these studies focused on detecting fraudulent activities and, to the best of our knowledge, none of them sought a better understanding of the characteristics of fraudulent card transactions in order to produce more resilient models. Therefore, in this study, we built the biggest data set ever used in research, consisting of 4B non-fraud and 245K fraud transactions contributed by 35 banks in Turkey. We then introduce and examine the performance of profile-based fraud detection models, namely a card-type based model, a transaction characteristics based model, and an amount-based model. We also performed temporal and spatial analyses on our data set to show the robustness of the proposed models against aging and zero-day attacks.


I. INTRODUCTION
The number of credit card transactions is increasing in line with technological developments and the rise of e-commerce. In 2017 alone, 375 billion card payments were made around the world [1]. However, 16.7 million fraudulent transactions occurred in the same year [2]. The ratio of fraudulent transactions to normal transactions is approximately 0.006% worldwide. Although this rate may seem insignificant, every fraudulent transaction hurts the reputation of banks. For this reason, banks are investing in fraud detection. The number of fraudulent activities increases and their methods change every day. It is very difficult and costly to detect fraudulent activities only by examining the transactions. Fast and accurate fraud detection is crucial to maintain customer satisfaction and trust. Therefore, banks need to identify these transactions as quickly as possible and in the least harmful way for the customer.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiqiang Wang.
Today, fraudulent activities using social engineering are predominantly performed through the Internet. Malware and phishing methods are engineered for this purpose. The most popular types of fraud include altering customer information through call centers and branches, ATM fraud, credit card application fraud, card account theft, lost or stolen cards, fake credit cards and card duplication [3]. Current commercial solutions used by banks for fraud detection are primarily rule-based. However, in recent years research has shown that machine learning methods are more effective than most rule-based solutions. The data sets used in the publications reporting these results do not always correspond to the real banking environment in terms of numbers, characteristics, and changes in time. In this paper, unprecedented analyses were conducted on a real data set to reveal previously unidentified characteristics of fraudulent activities. 245,000 fraudulent transactions and four billion non-fraudulent transactions obtained from different banks for the year 2017 were utilized in these analyses.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This study makes the following contributions to the literature:
• To the best of our knowledge, a data set with the largest number of fraudulent transactions was created. All analyses were carried out and all experimental results were obtained using this data set.
• Card transactions were profiled based on card-type, amount and transactional characteristics. The resulting models were shown to have negligible effect on the fraud detection performance when compared to similar models applied to the unprofiled data set.
• The performance of fraud detection models using transactional characteristics was shown to decay with time. Thus, such models require periodic retraining with recent fraud and non-fraud instances.
• We attempted to evaluate the zero-day performance of the models using both unprofiled and profiled data sets. We observed the behaviour of the models against the unseen fraudulent transactions.
• In contrast to existing studies on cost-sensitive fraud detection models, we aimed at expressing the performance of fraud detection models in terms of financial gain and loss.
The manuscript is organized as follows. Section II discusses state-of-the-art studies and points out our novel contributions to the literature. In Section III, we introduce the BKM data set, in particular how it satisfies big data characteristics and its importance to the fraud detection research domain. Section IV elaborates on the details of pre-processing and feature selection applied to the BKM data set, and also compares fraud detection models in terms of their suitability for the anticipated analysis tasks on the data set. In Section V, both the outlines and the results of the analyses are given in depth. We conclude the paper in Section VI.

II. RELATED WORK
Fraud detection is a very popular topic and is practiced in multiple areas. The survey of Abdallah et al. [4] examined a wide range of fraud topics including credit card fraud [5]-[7], telecommunication fraud [8], [9], healthcare insurance fraud [10]-[12], automobile insurance fraud [13], [14] and online auction fraud [15], [16]. The survey revealed that the majority of previous research was conducted on banking fraud, followed by insurance fraud, e-commerce fraud, and telecommunications fraud. The size and uneven distribution of the data are among the biggest problems with fraudulent banking transactions. Many studies have focused on solving these problems. However, to do so, many studies establish fraud detection models by randomly selecting a certain number of transactions among non-fraudulent transactions [17]-[20].
The privacy and protection rules of personal data have further complicated access to real banking transactions for research purposes. Ong Shu Yee et al. mimicked real-life data to overcome this problem [21]. In that study, particular attributes, such as credit card number, reference number and terminal id, were determined and mimicked to create synthetic transactions. In the study conducted by Andrea Dal Pozzolo et al., unlike other studies, a real system was designed and the co-utilization of data-driven and rule-based methods was suggested, along with periodic updates [22].
Among the reviewed studies, some were found to be comparable to ours in terms of methods, test scenarios, and performance metrics. Shiyang Xuan et al. created various classification scenarios and used the B2C transactions obtained from a Chinese e-commerce site between November 2016 and January 2017 [33]. They worked with a data set containing more than 30 million transactions with 62 features. However, their data set contained only 82,000 fraudulent transactions. In that study, Random-Tree-Based and Classification and Regression Tree (CART)-Based Random Forest algorithms were compared. The best accuracy (96.77%) and the best F-measure value (0.9691) were obtained by the CART-Based Random Forest algorithm. Training and test groups were formed and tested with non-fraud:fraud ratios from 1:1 to 10:1 in order to identify the significance of this ratio. Only January transactions were included in the data sets. The results show a continuous increase in accuracy. This was attributed to the enlargement of the test and training sets. In order to assure fair comparisons, the data set must be kept constant. The recall rate tends to decrease for fraudulent transactions. The best F-measure value is 0.964 at the 5:1 ratio. Finally, the first 11 days of January 2017 were tested against a model trained on the 2016 data. For fraudulent transactions, the recall rate was 59.62% and the accuracy of the test was 98.67%. The recall rate shows that when the test is performed with data from months excluded from training, fraud detection success decreases significantly.
Study [21] established the importance of the pre-transaction stage and, unlike other studies, also included the test time as a performance metric. In that study, normalization, smoothing, aggregation, attribute construction, and generalization of the data were carried out during the pre-transaction stage; then dimensionality reduction was performed using the Principal Component Analysis (PCA) technique. The experimental results were obtained using 10-fold cross validation. Bayesian classifiers such as K2, Tree Augmented Naive Bayes (TAN) and Naive Bayes were used along with the Logistic Regression and J48 (C4.5) algorithms. The highest accuracy for raw data belonged to the TAN algorithm at 84.0%, while the lowest belonged to the K2 algorithm at 41.8%. The success rates increased when engineered features were used. According to their results, the Logistic Regression and J48 decision tree algorithms scored 100% accuracy. The lowest accuracy rate, 95.8%, was obtained with the K2 algorithm. However, these success rates were achieved using a synthetic data set, so it is hard to tell whether the model will be as successful in real-life situations. As to the execution times, the TAN and J48 algorithms were found to take longer to execute than the others, while K2 and Naive Bayes were the fastest to produce results.
Some studies preferred adopting an ensemble approach, such as AdaBoost and Majority Voting. In study [34], a publicly available data set [28] containing 284,807 transactions (492 fraudulent) made in September 2013 by European card-holders was used to generate experimental results. The results were obtained by using the 10-fold cross validation method. The best fraud detection success belonged to Naive Bayes with 83.13% and the best non-fraud detection success belonged to the Random Forest algorithm with 99.99%. The AdaBoost and Majority Voting methods decreased fraud detection success and increased non-fraud detection success. This increase in accuracy was attributed to the data set containing more non-fraud data. The researchers also created a second data set of 287,224 transactions, of which 102 are marked as fraudulent. In tests conducted with this data set, initial fraud detection success was over 90% and non-fraud detection success was 99.99% for various algorithms. The AdaBoost and Majority Voting methods were observed to increase fraud detection success. However, the number of fraudulent transactions in both data sets is insufficient to draw conclusions about the success of the models.
The study published by Abhimanyu Roy et al. in 2018 aimed at observing the effect of deep learning methods on credit card fraud detection and used the data set provided by financial institutions engaged in retail banking [25]. The data set contains about 80 million transactions collected over an eight-month period. Only 0.14% of the data is fraudulent. They used four different Deep Learning topologies: Artificial Neural Networks (ANNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTMs), and Gated Recurrent Units (GRUs). In the study, classification results were obtained by using 10-fold cross validation. The best accuracy of the study was achieved by the GRU topology with 91.6%. The study claimed that increasing the number of layers and nodes led to higher success rates.
Some studies have used unsupervised methods [35] to detect fraud. Richard J. Bolton et al. stated that it would not always be possible to access fraudulent transactions, meaning that the success achieved in the supervised methods was misleading [36]. This study calculated a suspicion score for each user. This score was updated with every transaction. The Peer Group Analysis (PGA) tool was developed to observe the change in spending behavior. The study involved weekly analysis of total expenditures for 858 accounts in 52 weeks. The PGA was applied to four-week periods for each account.
Apapan Pumsirirat et al. have argued that methods of fraud are constantly changing, and therefore unsupervised methods should be used [37]. 80% of the data sets were used for training, while 20% were set aside for testing. Every instance had 21 features. In their study, anomaly detection in customer profiles was carried out using the Auto Encoder and Restricted Boltzmann Machine (RBM) methods. In the Auto Encoder based system, the Area Under Curve (AUC) value was 0.9603 for the European data set. The sizes of the data sets referenced in this section are presented in Table 1.
In contrast to the data sets mentioned in the literature, our data set is composed purely of real financial transactions of debit and credit cards issued by 35 different banks in Turkey. The data set spans a time period of 8 months and comprises more than 4 billion transactions, of which 0.006% were reported as fraudulent activity by the respective banks. Unlike other studies, we investigated the effect of profile-based fraud detection models on the performance. As the basis of the profile-based fraud detection, we clustered the data set according to card-type, amount spent, and transactional characteristics. Then, the transactions were tested for fraudulent activity using the respective model. In addition to the classical temporal performance decay analysis of the fraud detection models, we also formulated the zero-day attack performance. We specifically left out of the training sets some of the clusters generated by k-means clustering of the fraudulent instances. Those clusters were included in the test data sets to observe the performance of the respective model in case of a never-before-seen fraudulent activity. As in medical diagnosis problems or network intrusion detection scenarios, our data set is imbalanced by nature. Although the current literature includes cost-sensitive models [39]-[41], especially where the data imbalance problem becomes prominent, we attempted to measure the performance of the fraud detection models in terms of financial value.

III. DATASET
In this study, we use real banking data obtained from the banking sector in Turkey. The data set contains more than four billion credit and debit card transactions belonging to 35 member banks between January 2017 and August 2017. The ratio of fraudulent transactions to non-fraudulent ones is 0.006%.
The number of fraudulent transactions per month is given in Figure 1, whereas Figure 2 depicts the ratios of non-fraud to fraud transactions. Table 3 gives the detailed distribution of transactions regarding their card type, amount range, and transaction type. Each transaction in the data set contains 60 features. It was observed that many features were mis-valued and did not have an even distribution. Twenty-two such features were eliminated in the first stage. Of the remaining 38 viable features, three are numerical and the rest are categorical. The types and counts of the initial and viable features are given in Table 2.

IV. FRAUD DETECTION MODEL
In order to create a fraud detection model, incomplete or incorrect data was eliminated, distinguishing features were identified, the instances to be used in the model were selected, and the performance of classification algorithms was evaluated.
Analyses were performed on the whole data set. The data set was split 70% to 30% for training and testing, respectively. The training part of the data set was further divided into two subsets with a ratio of 70% to 30% for training and validation, respectively. This data set will be referred to as the master data set throughout the remainder of this study. Thus, the master training data set should be interpreted as the training part of the master data set as described above.
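The nested 70/30 split described above can be sketched as follows. This is a minimal illustration only: the helper name `split`, the fixed seed, and the integer placeholders standing in for transactions are assumptions, not part of the original pipeline.

```python
import random

def split(items, first_fraction, rng):
    """Shuffle a copy of items and cut it into two parts at first_fraction."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)              # fixed seed (assumption) for repeatability
transactions = list(range(1000))    # placeholder ids standing in for real records

# First cut: 70% training part, 30% test part.
train_val, test = split(transactions, 0.7, rng)
# Second cut, inside the training part only: 70% training, 30% validation.
train, validation = split(train_val, 0.7, rng)

print(len(train), len(validation), len(test))
```

Under this reading, the final training, validation, and test subsets hold 49%, 21%, and 30% of the data set, respectively.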

A. PRE-PROCESSING
As the majority of the viable features were categorical, most of the pre-processing effort went into dealing with them. For each categorical feature, the category values were analyzed and invalid ones were replaced with null values. To represent categorical values, the one-hot encoding method was used, as it is one of the most common methods for converting categorical features to numerical ones.
The one-hot vector is obtained by expressing each distinct value for a feature in binary form. As a result, the number of features increases by the number of existing distinct values for each feature. In every one-hot vector, the feature belonging to the respective category is expressed as ''1'', and other features are expressed as ''0''. This process is shown in Figure 3.
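As a small illustration of this encoding, the sketch below expands one categorical feature into binary columns. The feature name `card_type`, the sample rows, and the choice to encode a null value as all zeros are assumptions made for the example.

```python
def one_hot_encode(rows, feature):
    """Replace one categorical feature with one binary column per distinct value."""
    categories = sorted({row[feature] for row in rows if row[feature] is not None})
    encoded = []
    for row in rows:
        # Keep every other feature untouched.
        new_row = {key: value for key, value in row.items() if key != feature}
        for category in categories:
            new_row[f"{feature}={category}"] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded

rows = [
    {"card_type": "debit", "amount": 12.5},
    {"card_type": "gold", "amount": 300.0},
    {"card_type": None, "amount": 7.0},   # invalid value already replaced with null
]
encoded = one_hot_encode(rows, "card_type")
for row in encoded:
    print(row)
```

Each row gains as many new columns as the feature has distinct valid values, which is why the total feature count grows after encoding.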

B. FEATURE SELECTION
On a data set comprising all fraudulent transactions plus an equal number of randomly chosen non-fraudulent transactions (Table 4), the Information Gain [43], Gain Ratio [44], One Rule [45], Relief [46], and Symmetrical Uncertainty [47] algorithms were used for feature selection, and ranker values were obtained. The first 5, 10, 15, 20, 25, 30, 35, and 38 features were grouped according to their ranker values, and classification tests were conducted to decide which features to use. The accuracy values given in Figure 4 were evaluated and the first 30 common features were chosen (Table 5). Table 6 reflects the detailed analysis of the selected features and their categories in terms of fraudulent distribution.
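To make the ranking step concrete, the sketch below computes Information Gain, one of the rankers listed above, for categorical features on a balanced toy sample and sorts the features by it. The feature names and values are hypothetical, and the other four rankers are not shown.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the entropy left after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [label for v, label in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels = ["fraud", "fraud", "ok", "ok"]          # balanced sample, as in the text
features = {
    "channel": ["web", "web", "pos", "pos"],     # separates the classes perfectly
    "currency": ["TRY", "USD", "TRY", "USD"],    # carries no class information
}
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
print(ranking)
```

Taking the top-n names from such a ranking for each ranker and intersecting them would be one way to arrive at a set of common features.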

C. INSTANCE SELECTION
Since the fraud:non-fraud ratio is about 0.01%, instance selection was carried out to establish a balance within the data set. All of the fraudulent transactions were used in the analysis, while different methods were adopted to select non-fraudulent transactions. Many studies in the literature suggested random selection of non-fraudulent transactions [17]-[20], [33]. Therefore, non-fraudulent transactions were randomly selected in this study as well. However, to better reflect the characteristics of the transactions in the features, the transactions were first clustered using the k-Prototypes algorithm [48], and the optimum number of groups was determined by using the elbow technique [49]. Non-fraudulent transactions were then randomly selected, not from the overall data set, but from each group. As pointed out in the literature mentioned above, the ratio of non-fraudulent to fraudulent transactions in the training group affects the performance of the model. Therefore, training data sets were formed using the 1:1, 5:1, 10:1, 15:1, 20:1, 25:1, 30:1, 35:1, and 40:1 ratios, and classification tests were performed to decide the optimum ratio of non-fraudulent to fraudulent transactions. Non-fraudulent transactions were selected incrementally. Following the tests conducted using the classification algorithms in Section IV-D, the best F-measure was obtained with the 5:1 ratio, which is the same conclusion reached in study [33]. Consequently, we decided to use the 5:1 ratio in the remainder of our study.
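The per-group sampling can be sketched as below. The text only states that non-fraudulent transactions were drawn randomly from each group; allocating the draws in proportion to group size, as done here, is an assumption, as are the function name and the toy clusters. The incremental selection across ratios is not modelled.

```python
import random

def sample_non_fraud(clusters, n_fraud, ratio, rng):
    """Draw about ratio * n_fraud non-fraud instances, spread over the
    clusters in proportion to cluster size (proportionality is assumed)."""
    target = ratio * n_fraud
    total = sum(len(members) for members in clusters.values())
    selected = []
    for members in clusters.values():
        quota = min(len(members), round(target * len(members) / total))
        selected.extend(rng.sample(members, quota))
    return selected

# Two hypothetical clusters of non-fraud transaction ids.
clusters = {0: list(range(600)), 1: list(range(600, 1000))}
selected = sample_non_fraud(clusters, n_fraud=20, ratio=5, rng=random.Random(0))
print(len(selected))
```

With 20 fraudulent instances and a 5:1 ratio, the sketch draws 100 non-fraudulent instances, 60 from the larger cluster and 40 from the smaller one.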

D. CLASSIFICATION
As suggested in [17], [19], [20], [23], we considered the following machine learning algorithms for the classification process: Naive Bayes, Decision Tree [50], Random Forest [51] and Multi-Layer Perceptron [50]. The Naive Bayes [50] algorithm performs the classification process by calculating probabilities. For Naive Bayes classification, the Bernoulli Naive Bayes [52] classifier was used because the features were converted into one-hot vectors in the pre-processing step.
The classification results in terms of recall, specificity, precision, F-measure and MCC (Matthews correlation coefficient) are given in Table 7. Based on recall values for the fraud class, the Naive Bayes classifier outperforms the other classifiers, whereas the other classifiers performed better in the classification of non-fraud instances. A closer look at Table 7 reveals that the Naive Bayes classifier produces a rather large number of false alarms compared to the other classifiers. Our cursory tests show that, for a randomly selected month, Naive Bayes raises about six times as many false alarms as the other classifiers. From a financial point of view, this could be considered a major drawback of the Naive Bayes classifier, as it means more operational work for a financial institution to put the results of the classifier into action in real-life scenarios.
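For reference, the sketch below computes these five metrics from a confusion matrix using their standard textbook definitions, with fraud as the positive class. The counts are invented for illustration, and the function assumes that no denominator is zero.

```python
from math import sqrt

def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (fraud = positive class)."""
    recall = tp / (tp + fn)            # share of frauds that were caught
    specificity = tn / (tn + fp)       # share of non-frauds left alone
    precision = tp / (tp + fp)         # share of alarms that were real frauds
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Hypothetical counts: 100 frauds and 900 non-frauds in the test set.
result = classification_metrics(tp=80, fp=10, tn=890, fn=20)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The false alarms discussed above correspond to the `fp` count: a classifier can reach a high recall while its precision, and hence its operational cost, suffers.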

V. ANALYSIS OF FRAUDULENT TRANSACTIONS
In this section, the data set was assessed spatially and temporally, the success of the system against zero-day fraud was analyzed, the financial gain from detecting non-fraudulent and fraudulent transactions was examined, and finally, the successes and run times of the algorithms were compared. The Scikit-learn library was utilized to implement the classification processes used in the analyses [53]. The performance metrics provided in Equations 1, 2, 3, and 4 were used in the study.
Initial investigation of the performance of the selected classifiers was carried out on the master data set as a whole. However, because both non-fraudulent and fraudulent instances were selected randomly for the training, validation, and testing steps, the respective data sets could, for a given point in time, contain instances from the future. Thus, in order to prevent the model from learning about fraud types that had not yet occurred and from producing better recall values than could be obtained in real life, we opted for not including any future instance in the training and validation data sets, so as to obtain a better judgment of the performance of a given classifier. The test instances were naturally selected from future transactions, which reflects real-life behaviour. Table 8 reflects the results of this approach. Starting with January 2017, one month's instances were used to train the respective model and the following month's instances were used to test the performance of the model. Again, during training the instances used were split up into train and validation subsets with a ratio of 7:3. The last row in the table for each classifier gives the weighted average of the performance metrics for the tests conducted.
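The month-ahead protocol can be sketched as follows. The pairing of consecutive months follows the text; weighting the average by the number of test instances is an assumption, since the text does not state which weights were used, and the example values are invented.

```python
def month_ahead_pairs(months):
    """Pair each month (training) with the month that follows it (testing)."""
    return list(zip(months, months[1:]))

def weighted_average(values, weights):
    """Average of values in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

months = [f"2017-{m:02d}" for m in range(1, 9)]   # January to August 2017
pairs = month_ahead_pairs(months)
for train_month, test_month in pairs:
    print(f"train on {train_month}, test on {test_month}")

# Hypothetical recall values from two tests with 1 and 3 test instances.
average_recall = weighted_average([0.5, 1.0], [1, 3])
print(average_recall)
```

Eight months of data yield seven train/test pairs, and no pair lets the model see transactions dated after its test month.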

A. SPATIAL ANALYSIS
In this section, we examined the effect of designing classification models according to card types, amount spent and characteristics of non-fraudulent and fraudulent transactions.

1) CARD TYPE BASED CLUSTERING
In this subsection, we explored the effect of spending characteristics for each card type (debit, classic, gold or business) on the fraud detection success. First of all, we obtained the distribution of the recall values of the selected classifiers in terms of card types, as depicted in Figure 5. Thus, before generating an independent classification model for each card type, we tried to establish a baseline for performance comparison and obtained four new recall values for each classifier, one for each card type. It is evident from Figure 5 that all the classifiers perform poorly for debit cards.
As the next step, we decided to design two card-type based scenarios, namely Scenario 1 and Scenario 2, using the selected classifiers. In Scenario 1, all the fraudulent transactions were used alongside non-fraudulent transactions belonging only to the respective card type. In Scenario 2, on the other hand, both fraudulent and non-fraudulent instances were chosen based on the card type. For both scenarios, the aforementioned non-fraud to fraud ratio of 5:1 was preserved. Table 9 gives the overall number of non-fraud and fraud instances used in Scenario 1 and Scenario 2.
A comparative summary of the performance of all the selected classifiers, based on the recall value, is given in Figure 6. It is evident that, except for the case of debit cards, card-based profiling does not help to boost the classification performance, independent of the scenario used. For the debit card case, both the Naive Bayes and Random Forest classifiers yield the highest boost of about 26 percentage points, followed by Decision Tree and Multi-Layer Perceptron.

2) AMOUNT BASED CLUSTERING
In order to assess the effect of spending amounts on the fraud detection performance, we opted to group the transactions into bins using a logarithmic scale. The details of this logarithmic binning are given in Table 3.
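A logarithmic binning of this kind can be sketched as below. The decade-wide bins are inferred from the amount ranges mentioned in the text (0 to 10, 1000 to 10000); the exact bin edges of Table 3, the treatment of boundary values, and the function name are assumptions, and amounts are taken to be non-negative.

```python
def amount_bin(amount):
    """Return the half-open decade range [low, high) that contains amount.
    All amounts below 10 share the first bin, (0, 10)."""
    low, high = 0, 10
    while amount >= high:
        low, high = high, high * 10
    return (low, high)

for amount in (5, 10, 99.9, 2500.0):
    print(amount, amount_bin(amount))
```

Integer comparisons are used instead of a floating-point logarithm so that values lying exactly on a bin edge, such as 1000, are assigned consistently.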
A procedure similar to the procedure described in Section V-A1 was used to generate the models for each bin but this time amount ranges were used instead of card types.
The overall numbers of non-fraud and fraud instances used in Scenario 1 and Scenario 2 are given in Table 10. Figure 7 depicts the performance comparison of the generated models. Except for the Naive Bayes based classifiers, we believe that amount-based profiling has some merit, which could be used to detect and prevent frauds with larger amounts as early as possible. Most of the time, fraudsters attempt to exploit stolen credit card information with amounts in the range of 0 to 10 before committing a fraud with a considerably larger amount. Both Random Forest and Multi-Layer Perceptron classifiers show significant improvement over the master model for this amount range. Therefore, an intelligent fraud detection system could decide to use the most appropriate detection model and boost performance. The same two classifiers also show better results for the 1000 to 10000 amount range. This amount range is the most exploited one after successfully committing a test fraud within the 0 to 10 amount range. Thus, this approach could be used both to detect and to prevent larger frauds.

3) TRANSACTION BASED CLUSTERING
In order to observe the effect of grouping transactions according to their characteristics, non-fraudulent and fraudulent transactions were clustered using the k-Prototypes algorithm. The elbow method was chosen to determine the number of clusters. The data set was divided into k clusters, where k was chosen in the range of 1 to 200. In each iteration, the value of k was increased by 5 and the Mean Square Error (MSE) [54] was calculated. Figure 9 depicts the relationship between k and the corresponding MSE values. According to the results, we opted to use 60 clusters for fraudulent transactions and 120 clusters for non-fraudulent transactions.
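The text does not say how the elbow was read off the curve. One common heuristic, assumed here, is to pick the k whose point lies farthest from the straight line joining the first and last points of the error curve. The error values below are invented, and the clustering itself is not shown.

```python
def elbow(ks, errors):
    """Return the k farthest from the line through the curve's end points."""
    x0, y0, x1, y1 = ks[0], errors[0], ks[-1], errors[-1]

    def distance(x, y):
        # Numerator of the point-to-line distance; the denominator is the
        # same for every point, so it does not affect the argmax.
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)

    return max(zip(ks, errors), key=lambda point: distance(*point))[0]

ks = [1, 6, 11, 16, 21]            # k increased by 5 in each iteration
errors = [100, 40, 20, 15, 12]     # hypothetical MSE values
best_k = elbow(ks, errors)
print(best_k)
```

Because the heuristic works on raw coordinates, its answer depends on the scale of both axes; normalizing k and the error to a common range first is a frequent refinement.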
Since some clusters show similar characteristics, such clusters were merged iteratively according to the distances between their centroids. As a result of this merging operation, seven cases for non-fraudulent transactions with cluster sizes of (1, 20, 40, ..., 120) and four cases for fraudulent transactions with cluster sizes of (1, 20, 40, 60) were obtained. Afterwards, performance evaluation tests were run on the resulting 28 clustering combinations. For each scenario, the number of outputs of the respective model corresponded to the total number of clusters. The output of the model was then mapped to a binary classification, consisting of fraud and non-fraud classes, by aggregating the results according to the cluster they belong to. Then, the recall values for each scenario, given in Figure 8, were calculated based upon this aggregation.
The results in the figure clearly show that the Random Forest, Decision Tree, and Multi-Layer Perceptron classifiers attain almost the same performance regardless of how the non-fraudulent and fraudulent clusters are crossed. On the other hand, the Naive Bayes classifier is very susceptible to this setup: as Figure 8-a demonstrates, crossing an unequal number of non-fraudulent and fraudulent clusters produces significantly worse results. For the Naive Bayes classifier, the number of clusters for both classes should be equal to obtain optimum results. Also, leaving the non-fraudulent instances unclustered in the classification process suppresses this phenomenon, and the Naive Bayes classifier produces almost the same result regardless of the number of fraudulent clusters.
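The aggregation of multi-cluster outputs into a binary decision can be sketched as below. The cluster labels (`F0`, `N1`, and so on) and the sample predictions are hypothetical; the sketch only shows the mapping and the recall computed after it.

```python
def to_binary(predicted_cluster, fraud_clusters):
    """Map a predicted cluster label to the class that the cluster belongs to."""
    return "fraud" if predicted_cluster in fraud_clusters else "non-fraud"

def recall(true_labels, predicted_labels, positive="fraud"):
    """Share of positive instances that were predicted as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn)

fraud_clusters = {"F0", "F1"}                      # clusters built from fraud instances
predicted_clusters = ["F0", "N3", "F1", "N1"]      # raw multi-class model outputs
true_labels = ["fraud", "fraud", "fraud", "non-fraud"]

binary = [to_binary(c, fraud_clusters) for c in predicted_clusters]
fraud_recall = recall(true_labels, binary)
print(binary, fraud_recall)
```

A fraud assigned to the wrong fraud cluster still counts as detected under this mapping, since only the class of the predicted cluster matters.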

B. TEMPORAL ANALYSIS
In this subsection, we examined the temporal validity of a generated model, the effect of temporal changes within the instances on the classification performance, and the response of the generated model to previously unencountered fraudulent activities, which practically corresponds to zero-day attack analysis. Although temporal analysis and zero-day attack analysis of a given model could be perceived to be the same, there is a clear distinction in how these analyses were performed. In our temporal analysis tests, a given model trained with all types of fraudulent instances from up to 6 consecutive months was tested using instances from the upcoming month to investigate the effects of drift in the data. In the zero-day attack analysis, on the other hand, fraud instances were clustered based on their transactional characteristics and some clusters were specifically left out of the training data set; we refer to these as never-before-seen fraud attacks.

1) DETERIORATION RATE OF DETECTION MODEL
In this subsection, the deterioration rate of classification success was examined as the fraud detection model was used to detect fraudulent transactions farther away in time from the training and validation data set. Taking into account the considerably smaller number of fraudulent instances in August 2017, we chose March 2017 as the middle point within our data set so that we were able to test the temporal change of the success of a given model up to 2 months back and 5 months forward. March 2017 also has the largest number of fraudulent instances, which was another reason for making it the pivot month. Figure 10 shows a decay in recall rates as we move away from March, independent of the classifier used. The decay is significantly smaller for the Naive Bayes classifier in comparison with the other classifiers. On the other hand, specificity was not affected at all. For a financial institution, these findings dictate that any given fraud detection model should be updated with the instances of the current month to be prepared for the upcoming month. Ideally, with enough processing power, the model could also be refreshed on a weekly or even daily basis within the present month.

2) EFFECT OF TIME SPAN OF TRAIN SET INSTANCES
In this subsection, we concentrated on the effect of temporal changes within the instances on the classification performance. As previously stated, because of the insufficient number of fraudulent instances, we left out August 2017 and chose July 2017 as our test data set. Then, starting with June 2017 and going back to January 2017, we generated a total of 6 models for each selected classifier by adding one more previous month to the already included ones. For each model, the respective data set was split 70% for training and 30% for validation. The obtained performance results in terms of precision, recall, specificity, and F-measure for each selected classifier are summarized in Table 11.
Except for the case of the Multi-Layer Perceptron, the inclusion of past instances does not contribute to the performance of a given classifier. For the Multi-Layer Perceptron, with some fluctuations, a gain of up to 2 percentage points was observed.
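The six growing training windows can be enumerated as in the sketch below; the month labels and the function name are illustrative choices.

```python
def expanding_windows(months, test_month):
    """Training windows that end just before test_month and reach one
    month further into the past at each step."""
    end = months.index(test_month)
    return [months[start:end] for start in range(end - 1, -1, -1)]

months = [f"2017-{m:02d}" for m in range(1, 9)]   # January to August 2017
windows = expanding_windows(months, "2017-07")    # July is the test month
for window in windows:
    print(f"train on {window[0]} .. {window[-1]} ({len(window)} months)")
```

The first window contains June only and the last one spans January to June, which gives the six models per classifier mentioned above.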

3) ZERO-DAY PERFORMANCE
Although the temporal analysis and the zero-day attack analysis of a given model could be perceived as the same, there is a clear distinction in how these analyses were performed. In our temporal analysis tests, a given model trained with all types of fraudulent instances from up to 6 consecutive months was tested using instances from the upcoming month to investigate the effects of drift in the data. In the zero-day attack analysis, on the other hand, fraud instances were clustered based on their transactional characteristics, and some clusters, which we refer to as never-before-seen fraud attacks, were deliberately left out of the training data set.
In the detection of fraudulent financial transactions, we define the zero-day performance of a detection model as its response to a fraudulent activity of a type not included in the training data set during model generation. Therefore, we deliberately excluded some types of fraudulent transactions from the training data set and then tested the performance of the generated models with those previously unseen instances. We carried out zero-day performance tests on the master data set as well as on the data sets generated for the analysis of card type-based profiling and amount-based profiling. Figure 11 shows the results of these tests for each selected classifier in terms of recall values. Again, the Naive Bayes classifier performed better than the other classifiers. Nevertheless, we deem the performance of all classifiers acceptable for this previously unexplored and challenging task.
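The hold-out procedure described above can be sketched as follows. This is an illustration under stated assumptions: the record layout (`is_fraud`, `cluster`), the function names, and the toy data are hypothetical, and the clustering step itself is taken as already done.

```python
from typing import Callable, Dict, List, Set, Tuple

Transaction = Dict[str, object]  # hypothetical record layout


def zero_day_split(
    transactions: List[Transaction], held_out_clusters: Set[int]
) -> Tuple[List[Transaction], List[Transaction]]:
    """Move every fraud instance of a held-out cluster into the zero-day
    test set; everything else remains available for training."""
    train, zero_day_test = [], []
    for tx in transactions:
        if tx["is_fraud"] and tx["cluster"] in held_out_clusters:
            zero_day_test.append(tx)
        else:
            train.append(tx)
    return train, zero_day_test


def zero_day_recall(
    predict: Callable[[Transaction], bool], zero_day_test: List[Transaction]
) -> float:
    """Share of never-before-seen fraud instances that the model flags."""
    if not zero_day_test:
        return 0.0
    detected = sum(1 for tx in zero_day_test if predict(tx))
    return detected / len(zero_day_test)


# Toy data: non-fraud records carry no cluster label.
DATA: List[Transaction] = [
    {"id": 1, "is_fraud": False, "cluster": None},
    {"id": 2, "is_fraud": True, "cluster": 0},
    {"id": 3, "is_fraud": True, "cluster": 1},
    {"id": 4, "is_fraud": True, "cluster": 1},
]
TRAIN, ZERO_DAY = zero_day_split(DATA, held_out_clusters={1})
```

Because the zero-day test set contains only fraud instances, recall is the natural metric for it, which matches the recall values the text reports for Figure 11.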

C. MONETARY PERFORMANCE ANALYSIS OF FRAUD DETECTION
To the best of our knowledge, fraud detection studies conducted so far have assessed their performance in terms of accuracy, recall, precision, and f-measure values derived from correctly and incorrectly classified transactions without considering the financial value of the respective transactions [33]. In this subsection, we evaluated the performance of the selected classifiers in terms of the monetary value they represent.
For this analysis, we opted to use the master data set and based our evaluation on the recall values. For each classifier, the monetary value of the correctly classified fraud transactions was divided by the monetary value of all fraud transactions. The resulting ratios and the recall values of the respective classifiers are given in Figure 12. Our intention was to transform the recall value, which is obtained from instances of equal weight, into a monetary recall value representing the percentage of financial loss that would be recovered by using that classifier.
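The ratio described above can be sketched as follows. The function names and the toy amounts are hypothetical and chosen only to show how plain recall and monetary recall can diverge. They are not values from the BKM data set.

```python
from typing import Sequence


def plain_recall(is_fraud: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of fraud transactions that were flagged, each weighted equally."""
    total = sum(1 for f in is_fraud if f)
    caught = sum(1 for f, p in zip(is_fraud, flagged) if f and p)
    return caught / total if total else 0.0


def monetary_recall(
    amounts: Sequence[float], is_fraud: Sequence[bool], flagged: Sequence[bool]
) -> float:
    """Monetary value of correctly flagged fraud divided by the monetary
    value of all fraud transactions."""
    total_value = sum(a for a, f in zip(amounts, is_fraud) if f)
    caught_value = sum(
        a for a, f, p in zip(amounts, is_fraud, flagged) if f and p
    )
    return caught_value / total_value if total_value else 0.0


# Toy example: three fraud transactions and one legitimate transaction.
AMOUNTS = [10.0, 1000.0, 50.0, 200.0]
IS_FRAUD = [True, True, True, False]
FLAGGED = [True, False, True, False]  # the largest fraud is missed

print(plain_recall(IS_FRAUD, FLAGGED))
print(monetary_recall(AMOUNTS, IS_FRAUD, FLAGGED))
```

In this toy input the classifier catches two of the three fraud transactions but misses the largest one, so the plain recall is about 0.67 while the monetary recall stays below 0.06.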
A closer look at Figure 12 reveals that even though the Random Forest and Multi-Layer Perceptron classifiers were outperformed by the Naive Bayes classifier in plain recall, both captured a higher share of the total monetary value of the fraudulent transactions. We intend to pursue this line of investigation in a future study to establish a more formal relationship between the plain performance metrics and the so-called financial performance metrics.

D. TIME ANALYSIS
In banking transactions, it is crucial that payments are not interrupted [55]. Therefore, the running time performance of the system was also analyzed. IBM POWER9-based servers were used for the timing tests. The servers had a total of 160 CPU cores and a total of 1.1 TB of memory, and ran the Ubuntu 16.04 operating system. BKM processes almost 15 million transactions daily, which corresponds to roughly 174 transactions per second. For a transaction to pass through the system without further delay, the processing time of any given transaction should be less than 6 milliseconds. Under these constraints, to evaluate the timing performance of our classification models, we ran each classifier on 1 million instances. The average pre-processing time was determined to be 0.8 milliseconds independent of the classifier, whereas the average run time of each classifier is given in Table 12. The results show that our generated models, with the exception of the Random Forest classifier, could be deployed on a single server and still remain within safe limits in terms of pre-processing and classification time. For the Random Forest based model, a simple load balancing scheme should suffice to meet the aforementioned timing constraints.
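The arithmetic behind these figures can be sketched as follows. The text does not state how the 6 millisecond bound was obtained. One reading, assumed here, is that it approximates the time available per transaction when transactions are handled one at a time at the average daily rate. The classification times used in the checks are hypothetical and are not the values of Table 12, and the server count estimate is a simple throughput-based assumption.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
DAILY_TRANSACTIONS = 15_000_000  # "almost 15 million" in the text
LATENCY_LIMIT_MS = 6.0           # bound stated in the text
PREPROCESS_MS = 0.8              # average pre-processing time from the text


def transactions_per_second(daily: int) -> float:
    """Average arrival rate implied by a daily transaction count."""
    return daily / SECONDS_PER_DAY


def budget_ms(tps: float) -> float:
    """Time available per transaction under strictly sequential handling."""
    return 1000.0 / tps


def within_limit(classify_ms: float, limit_ms: float = LATENCY_LIMIT_MS) -> bool:
    """True if pre-processing plus classification fits the latency bound."""
    return PREPROCESS_MS + classify_ms < limit_ms


def servers_needed(classify_ms: float, limit_ms: float = LATENCY_LIMIT_MS) -> int:
    """Assumed estimate: servers required so that each one sees few enough
    transactions to keep up with the average arrival rate."""
    return math.ceil((PREPROCESS_MS + classify_ms) / limit_ms)


TPS = transactions_per_second(DAILY_TRANSACTIONS)
print(round(TPS), round(budget_ms(TPS), 2))
```

With a hypothetical classification time of 2.0 ms, `within_limit` returns `True`. With a hypothetical 7.5 ms it returns `False`, and `servers_needed` suggests spreading the load over two servers.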

VI. CONCLUSION
In this study, we completed a comprehensive analysis of the BKM data set, the biggest data set ever used in the fraud detection domain. It contains 4 billion non-fraud and 245K fraud transactions contributed by 35 banks in Turkey. Unlike most of the research work in the literature, we chose to generate fraud detection models according to predefined profile types. Those profile types were based on card type, amount range, and the characteristics of the financial transactions. We showed that, except for the case of debit cards, card-based profiling does not help to boost classification performance, independent of the scenario used. As for amount-based profiling, both the Random Forest and Multi-Layer Perceptron classifiers showed a significant improvement over the master model for the 0 to 10 and 1000 to 100000 amount ranges in Turkish Lira. Therefore, an intelligent fraud detection system could select the most appropriate detection model and boost performance. In contrast, generating multiple detection models based on the transaction characteristics adversely affected model performance. This is largely because the imbalance between non-fraudulent and fraudulent instances becomes more pronounced as the number of fraudulent instances shrinks. On the other hand, the resilience of the detection models is strongly related to the number of instances and their time span. Accordingly, our test results show that any fraud detection model should be periodically updated to include recent fraudulent instances. Regarding the zero-day performance, we showed that all models without exception demonstrated some weakness against previously unencountered fraudulent activities. Nonetheless, the overall performance of the classifiers was acceptable for such cases. As another contribution, we expressed the performance of a classifier in terms of the financial value it represents.
For the BKM data set, our experiments showed that a large number of false positives does not necessarily correspond to a large financial loss. We think that this observation calls for a more detailed investigation in future studies.
ALI GOKHAN YAVUZ received the Ph.D. degree in computer engineering from Yildiz Technical University, Istanbul, Turkey. He is currently an Associate Professor with the Department of Computer Engineering, Yildiz Technical University. He is also the Co-Director of the Intelligent Systems Laboratory. His current research interests include systems and network security, cloud computing, and big data.