Feature Engineering and Resampling Strategies for Fund Transfer Fraud With Limited Transaction Data and a Time-Inhomogeneous Modi Operandi

Detecting financial fraud to profile crimes and pinpoint system vulnerabilities is an essential issue in the financial industry. Because of interpretability requirements and the lack of mass transaction data due to privacy regulations, sophisticated handcrafted features have been adopted in much of the literature for fraud detection. In addition to established recency, frequency, monetary, and anomaly features, we propose behavior- and segmentation-type features based on statistical characteristics belonging solely to (non-)fraudulent accounts informed by financial expertise. Our proposed features are difficult for automatic feature generators to synthesize, and provide transparent cause-effect relationships and good prediction results. Features with time-inhomogeneous properties cause popular boosting classifiers such as XGBoost and LGBM to produce unstable detection results. We use the Kolmogorov–Smirnov test to detect and remove these features to improve XGBoost and LGBM detection performance and robustness. The resulting performance shown in our experiments is better than that of other classifiers, such as SVM and random forests. We examine the advantage of our technique by comparing it with several feature engineering works on fraud detection and automatic feature generation methods. On the other hand, we also find that generating training/testing sets with random sampling falsely eliminates such time inhomogeneity and results in misleading assessments of the robustness of machine learning models. These time-inhomogeneous phenomena also entail various modus operandi patterns, which influence the performance of different resampling methods for addressing data imbalance in fraud detection. Improper linear interpolation of SMOTE-related approaches leads to poor performance due to varying patterns of modi operandi. However, synthesizing fraudulent samples with simple oversampling and GANs mitigates this problem.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Yongming Li.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
With the emergence of new information technologies and the evolution of various financial services, the magnitude and variety of financial fraud have also grown. Common financial frauds include credit card fraud, fund transfer fraud, insurance fraud, mobile communication fraud, etc. Such frauds lead to considerable economic losses and therefore incur high fraud detection, management, and law enforcement costs. For example, a malicious scammer could guide victims to transfer money from their accounts to a criminal gang's account via phone or social communication platforms, such as Facebook and Line, and thereby commit a fund transfer fraud. In fact, the total loss of fraud in 2019 was 28.3 billion USD with very low clear-up rates as reported by the Communication Fraud Control Association. Fund transfer fraud (FTF), such as romance scams, buyer overpays, etc., is difficult to prevent and detect. Various fraud prevention acts have been enacted throughout the world [2], including (in Taiwan) the Proceeds of Crime Act, the Money Laundering Prevention Act, and the Money Laundering Control Act. Financial institutions must follow fraud prevention guidelines to detect crime, profile modi operandi, and identify vulnerabilities in fund transfer systems. Rule-based models are still widely adopted by Taiwan's financial institutions to identify suspicious accounts, but they generally fail to recognize the complicated and time-varying characteristics of modi operandi (i.e., the dataset shift problem [1]) committed by fraudulent actors [3]. For example, the rule-based model provided by our partner bank (Bank L hereafter) produces an extremely low recall rate (5.56%) with a poor precision rate (40%).
This incurs huge management costs and infringes on normal users' rights to access financial services without actually preventing fraud. Thus, it is critical to develop a high-performance fraud detection system with fair interpretability.
Recent studies broadly apply machine learning to detect FTFs. Although adopting automatic feature synthesis from raw data has recently become popular, the lack of sufficient transaction data limits its performance. Indeed, the amounts of raw transaction data available for training a machine learning model are limited due to privacy regulations. Whitrow et al. [4] also point out the difficulty of feature synthesis due to the high-dimensional nature of raw transaction data. In addition, automatic feature synthesis usually does not satisfy interpretability requirements and may require much running time, as verified in our experiments. However, sensible reasons (generated from interpretable features) are required to screen or freeze suspicious accounts. To address the above issue, Bhattacharyya et al. [5], Bahnsen et al. [6], and Whitrow et al. [4] study handcrafted features to retrieve patterns from raw transaction data. Baesens et al. [7], Zhang et al. [8], Xie et al. [9], and Bahnsen et al. [6] collect the features generated by previous fraud detection studies. Then they group these features into recency, frequency, monetary, and (unsupervised) anomaly detection (abbreviated as RFMA). Most of these features are constructed based on mathematical or statistical properties and involve little financial expertise. Baesens et al. [7] point out that data engineering is of the utmost importance to improve fraud detection performance even with simple machine learning models. In light of the above observations, we extend our conference work [10] to develop two new feature categories, namely, behavior and segmentation, generated based on characteristics belonging solely to (non-)fraudulent accounts informed by financial expertise. Such feature constructions can capture cause-effect relationships between modi operandi and features to improve interpretability; our later experiments show that the importance of these features ranks high, which reflects their strong relationship.
Behavioral features capture critical transaction patterns that are typically used only by fraudsters (or normal users) and that cannot be properly captured by the construction guidelines of recency, frequency, monetary, and anomaly raised in the literature. For example, a fraudster seeks to withdraw as much as possible before the freezing of the fraudulent account, and thus the amount attempted to be withdrawn from an account is often larger than the account balance. Such features are generally constructed based on financial expertise, which can be interpreted as the knowledge base of an expert system. This reveals that combining the last-generation AI (i.e., expert systems [11]) and the current-generation AI can produce better classification results given limited amounts of training data. Segmentation features are constructed by dividing the raw transaction data according to classification rules and then compiling summary statistics to extract meaningful phenomena. For example, we first calculate the number of fraudulent accesses to each ATM and note its owner bank. Next, banks can be classified by the total number of fraudulent ATM accesses; this is because the management and location selection strategies of a bank may affect the likelihood of fraudulent access. In addition to analyzing fraudulent behaviors, we follow the idea of Abdallah et al. [12], who use features to capture the patterns of non-fraudulent behaviors. This indirectly improves fraud detection performance by strengthening the ability to recognize normal accounts. For instance, a user can access over-the-counter services only by being physically present at the bank branch; fraudsters are unlikely to access such services, to avoid exposing their identities. Then a new feature is created by analyzing 405 different types of transactions to determine those that are used almost exclusively by normal users.
To analyze the performance of our proposed features, we first compare the performance of some feature engineering works on fraud detection with our real transaction data from Bank L. However, many proposed features in their works depend on the specific properties of their data and are difficult to apply to our data. This might explain the poor performance in training with their feature sets. To make the comparison fair, we collect and implement features from fraud detection feature engineering works, such as [2], [4], [5], [6], [7], [9], [13], [14], [15], [16], [17], [18], and [19]. Then, we train with the features collected from past research and/or behavior- and segmentation-type features to compare the fraud detection performance of different classifiers, such as support vector machines (SVMs), random forests, and extensions of gradient boosting decision trees like XGBoost and light gradient boosting machines (LGBM). Although XGBoost and LGBM have the potential to yield the best detection results due to their sophisticated boosting techniques, incorporating noisy features could significantly deteriorate the prediction performance. Specifically, the distributions of noisy features change significantly with time, making a machine learning model designed to fit the training set fail to detect fraud effectively in the testing set. To identify features with time-inhomogeneous characteristics, we propose a new scheme to measure the difference of a feature's distributions in the training and testing sets split by chronological order in terms of p-values generated by the Kolmogorov-Smirnov test (the KS test). The aforementioned deterioration in recall and precision rates can be addressed by removing features whose distributions vary significantly (i.e., have low p-values).
In addition, training XGBoost/LGBM with only our features and with our features in addition to those in previous works (except for noisy features) yields comparable detection performance; however, using our feature set requires less running time. Note that generating training/testing sets via random sampling can mistakenly improve fraud detection performance significantly because time-inhomogeneous characteristics in real data are eliminated. Besides, such unstable prediction problems do not occur in SVMs and random forests, possibly because sophisticated boosting techniques are not used in these classifiers. In this case, the detection performance improves as features are added, even for noisy features.
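The KS-based screening described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy feature dictionaries, and the significance level alpha = 0.05 are assumptions made here, and the p-value uses the standard asymptotic Kolmogorov series with a small-sample correction rather than an exact computation.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (this handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the series below is numerically 1 in this range
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)

def stable_features(train, test, alpha=0.05):
    """Keep features whose train/test distributions do not differ significantly.

    `alpha` is an assumed threshold; the text only says low p-values are removed.
    """
    kept = []
    for name in train:
        d = ks_statistic(train[name], test[name])
        if ks_pvalue(d, len(train[name]), len(test[name])) >= alpha:
            kept.append(name)
    return kept

# Toy chronological split: `train` holds earlier accounts, `test` later ones.
train = {"steady": [i / 100 for i in range(100)],
         "drifting": [i / 100 for i in range(100)]}
test = {"steady": [i / 100 for i in range(100)],
        "drifting": [0.5 + i / 100 for i in range(100)]}
kept = stable_features(train, test)  # only "steady" survives the screening
```

Note that the split must be chronological for the test to be informative: if the two sets were drawn by random sampling, both would mix early and late accounts and the shift in "drifting" would largely disappear.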
In addition to the handcrafted feature generation mentioned above, automated feature engineering is used in Kanter and Veeramachaneni [21], Lucas et al. [18], Esenogho et al. [22], and Ebiaredoh-Mienye et al. [23] to detect fraud. Kanter and Veeramachaneni [21] use a deep feature synthesis algorithm and substitute raw data via a transform primitive to generate primitive features. These features are then substituted into aggregation primitives that use relations between database tables to store different types of data elements to systematically generate aggregated features. Lucas et al. [18] use multi-perspective hidden Markov models to examine the amounts and recency of transaction sequences from the histories of credit card holders/merchants. Esenogho et al. [22] use long short-term memory (LSTM) to capture temporal patterns from credit card transaction data. Then, they train an adaptive boosting model with these synthetic features to detect credit card fraud. Ebiaredoh-Mienye et al. [23] use the stacked sparse encoder to generate feature representation for each observation for predicting credit card defaults. We train with features generated by the above automatic feature generation models, but most detection results are poor. One possible reason is that most fraud detection works study credit card frauds, which provide more aspects and larger amounts of transaction information than bank accounts' fund transfer data. Another reason might be that existing automatic feature generators can generate simple features, such as RFMA proposed in past literature, but find it challenging to create sophisticated behavior or segmentation features that require complex generation procedures and financial expertise. Our feature constructions, in contrast, can provide new insight into automatic feature generation given limited amounts of training data. 
In addition to extracting RFMA features from raw transaction data, generating a statistical summary of labels (i.e., fraudulent accesses) for raw transaction data with exogenously-obtained categorization information could yield useful features. For example, as mentioned above, we classify ATMs according to their owner banks and then count the number of fraudulent accesses for each bank to identify suspicious banks, as in Table 4. Similarly, statistical summaries of fraudulent/normal accesses classified by ATM branch (location) and transaction type also produce useful features as detailed in Figure 2 and Table 3, respectively.
The ratio of the number of fraudulent accounts to normal accounts is usually extremely biased, namely, 1 : 250 in our dataset. Such data imbalance can cause machine learning models to predict all account observations as normal. However, it is generally more important to identify fraudulent accounts (i.e., improve the recall rate) than to achieve high prediction accuracy. Accordingly, we resample the training dataset to balance the ratio of positive/negative observations. We find that modus operandi patterns can be divided into several subgroups as observed from a scatter plot projected from high-dimensional feature vectors of accounts. Thus, even though Vassallo et al. [27] suggest that SMOTE-NCL is especially useful for dealing with financial data imbalance, our experiments show that SMOTE-related approaches yield degraded prediction for time-inhomogeneous fraud detection. This is because interpolations adopted by SMOTE-related methods can improperly place synthesized positive observations where negative observations are dense and/or change the statistical properties of features belonging to positive observations. This problem can be avoided by adopting naive full/random oversampling methods or Wasserstein generative adversarial networks (GANs) (see [24]). Our experiments show that GANs outperform oversampling methods, which in turn outperform SMOTE-related methods.
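The contrast between replication and interpolation can be illustrated with a small sketch. It is a simplified stand-in for SMOTE (interpolation toward one of the k nearest minority neighbours), not the implementation of any particular library; the function names and the two-point toy data, which stand for two separate modus operandi subgroups, are assumptions.

```python
import random

def random_oversample(minority, n_new, rng):
    """Replicate existing minority points; no new locations are created."""
    return [rng.choice(minority) for _ in range(n_new)]

def smote_like(minority, n_new, rng, k=1):
    """Simplified SMOTE-style synthesis by linear interpolation."""
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # Rank the other minority points by squared distance to p.
        others = sorted((q for q in minority if q is not p),
                        key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
        q = rng.choice(others[:k])
        t = rng.random()  # position along the segment from p to q
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

rng = random.Random(0)
# Two fraudulent accounts from different subgroups, far apart in feature space.
minority = [(0.0, 0.0), (10.0, 10.0)]
replicated = random_oversample(minority, 5, rng)
synthetic = smote_like(minority, 5, rng)
```

Every replicated point coincides with an observed fraudulent account, whereas every synthetic point lies somewhere on the segment joining the two subgroups. If normal accounts are dense in that in-between region, the interpolated positives land among negatives, which is the failure mode described above.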
In addition to predicting modi operandi, the ability to interpret the predictions and summarize the patterns of modi operandi is also essential for an FTF detection system to pinpoint vulnerabilities in the procedures of financial services. Indeed, profiling modi operandi prevents fraudsters from utilizing fund transfer systems (see [25]) and fulfills the ''risk-oriented'' property (efforts should be allocated where the fraudulent risk is likely) asked for by the Financial Action Task Force (https://www.fatf-gafi.org/; see [26] and [27]). Our proposed behavioral and segmentation features provide a clear cause-effect relationship between features and fraudulent labels. Their quality can be examined by showing that our features generally rank high under the feature importance ranking procedure proposed in [28], published in a core finance journal (Journal of Banking and Finance).
We organize the structure of the remaining paper as follows. We first survey relevant fraud detection studies in Section II. Section III describes the data format provided by the partner bank. Section IV describes our feature engineering approach. The experiments in Section V are divided into three parts: Section V-A compares the performance of various machine learning models with the features suggested in the literature and those in this paper. We also analyze how the time inhomogeneity of input features influences the stability of fraud detection. Section V-B addresses the data imbalance by comparing the performance of various resampling methods and analyzes their relationship to the distributions of original/synthetic observations. Section V-C analyzes feature importance, and Section VI concludes.

II. LITERATURE REVIEWS
Financial fraud includes credit card fraud [17], [19], [29], phone fraud [30], online transaction fraud [31], instant payment fraud [32], etc. To ensure the interpretability of the detection results, most conventional banks detect frauds by rule-based methods. However, these methods generally fail to capture complex, time-varying characteristics of modi operandi; fraud detection performance thus tends to be poor. In addition, fraudsters can easily bypass these fixed rules.
It is impractical to train a machine learning model with raw transaction data due to data availability constrained by regulations and the heterogeneous and high-dimensional nature of transactions. Whitrow et al. [4] and Bhattacharyya et al. [5] train machine learning models with features extracted from aggregated raw transaction data to detect fraud. This process generates a set of features to capture insightful properties of (non-)fraudulent accounts. Xie et al. [9] show that many features generated in previous studies are generally based on transaction frequency. However, capturing temporal properties without considering the characteristics of modi operandi from other aspects hinders machine learning models from recognizing a wider variety of fraudulent behaviors. For example, chronological relationships in raw transactions are difficult to capture by frequency features. We note in this connection that fraudsters usually first make transactions with small amounts to test the vulnerability of a transaction system, and then make a large one. Xie et al. [9] argue that using interpretable monetary features to capture patterns from transactions of fraudulent accounts dramatically enhances fraud detection results. Zhang et al. [8] suggest that the features generated by recent fraud detection studies can be categorized into recency, frequency, and monetary (RFM) groups. In addition to RFM, Baesens et al. [7] propose another two groups: anomaly detection and other feature engineering techniques. They empirically compare the fraud detection results by training with the above features and show that excellent performance can be achieved by adequately constructing feature sets without using sophisticated machine learning models. In light of the above research, we create new features by carefully observing (non-)fraudulent behaviors and using financial expertise to capture their patterns to propose two feature categories: behavior- and segmentation-type features.
Advanced machine learning models such as graph-based models are utilized due to the availability of certain types of transaction information. Wang et al. [32] build a heterogeneous attribute graph to represent accounts' social relations and frequently used locations of online merchants by using semi-supervised graph embeddings to produce a low-dimensional representation for each node. To detect instant payment fraud with interpretable results, they use a hierarchical attention mechanism for each node to determine the relations between neighbors or attributes. Li et al. [33] identify paths of fraudulent fund transfers through graph-based models. Cheng et al. [17] extract RFM-based features from temporal and spatial information embedded in raw transaction data and detect credit card fraud by applying an attention mechanism to extract important features. Zheng et al. [30] aggregate transaction records (including transfer records from the sending account to the receiving account(s) and the receiving bank) from two banks to detect suspicious transfers. A GAN (generative adversarial network) is then applied with a denoising autoencoder to calculate the probability that a cross-bank transfer is fraudulent. However, such detailed spatial transaction data, social relationships, and cross-bank transaction records are unavailable in our raw transaction dataset. Hence, we do not consider such sophisticated methods.
Methods for detecting anomalies like fraudulent accounts generally face significant data imbalance problems; that is, the distribution of the training dataset is biased, with few/many observations in minority/majority classes. To mitigate the resulting learning bias toward majority classes, resampling procedures are used to alter the ratio of positive to negative observations (usually to be closer to 1) by oversampling, undersampling, or hybrid methods as categorized in [34]. Oversampling methods increase the number of observations in the minority group (e.g., fraudulent accounts in our case) by replicating or synthesizing new ones. Undersampling methods filter out observations from the majority group (e.g., nonfraudulent accounts). Hybrid methods combine oversampling and undersampling. Ghorbani and Ghousi [34] and Hordri et al. [35] compare different resampling methods, including SMOTE, borderline SMOTE, SMOTE-ENN, SVM-SMOTE, SMOTE-Tomek, and random under/oversampling. Vassallo et al. [27] claim that SMOTE-NCL is especially useful for dealing with financial data imbalance. We compare the above methods in our experiments, and consider the Wasserstein GANs [24].
Addressing imbalanced binary classification problems, such as fraud detection using machine learning methods, has also been a recent focus in academic financial journals. Khandani et al. [36] use generalized regression and classification trees to predict the delinquency and default rates of credit card holders since interpretable decision logic can be obtained from the rules of each tree node and the tree structure. They demonstrate significantly improved detection rates and show that such interpretable analyses may have important applications in forecasting systemic risk, that is, the risk of major collapses, such as the financial crisis of 2007-2008. Butaru et al. [28] also show that decision trees and random forests perform well in predicting credit card delinquency for different banks. However, they criticize the poor interpretability of these tree-based models, as the features selected by these models vary with time and banks. Moreover, the complex structure and high number of leaves of the trees complicate the comparison of overall feature selection results. To profile and compare the results of feature selection, they measure and rank the importance of each feature by the number of occurrences, the occurrence position (in the tree), and the information gain of each feature. This measures the effectiveness of each feature and further improves the interpretability and cause-and-effect of their machine learning models. In Section V-C, we adopt the ranking method of [28] to show the effectiveness of our proposed features based on XGBoost. Addo et al. [37] build binary classifiers based on different machine learning models to predict the probability of loan default. They rank the ten most important features and use them to assess the stability and performance of prediction among different models. They show that tree-based models outperform other models. Zhang et al. [38] compare major machine learning algorithms and sampling techniques to detect money laundering. 
They argue that decision trees are more flexible than parametric methods like logistic regression in capturing nonlinearity and accounting for missing values and outliers. However, decision trees suffer from overfitting and thus require stopping rules, for instance, to limit the maximum depth of the tree or the number of branches.

III. DESCRIPTIONS AND PREPROCESSING OF RAW TRANSACTION DATA
Bank L has provided us with raw transaction data from April 2018 to September 2019 and the records of fraudulent accounts provided by the National Police Agency. This paper proposes sophisticated features extracted from real raw transaction data that contain rich details of various patterns of normal and fraudulent behaviors, as shown in Table 1. Although Kaggle provides public datasets for fraud detection, the datasets are either generated by simulations or provide limited disclosure information due to strict privacy regulations. These drawbacks limit the ability to profile the characteristics of normal and fraudulent transaction behaviors by generating interpretable features. Chen et al. [39] show that due to this limitation, anti-fraud works such as [40] and [41] merely consist of descriptions of methods without experiments. Hence, our studies focus on real transaction data provided by Bank L. Inputting the entire transaction record to train a machine learning model to detect fraudulent transactions is unrealistic (see [4]) due to the high-dimensional nature and heterogeneity of the raw transaction data. Thus, the feature representation of each bank account is constructed by first aggregating n transactions for that account that occurred within a predetermined period. Features are then extracted from the aggregated transactions. This aggregate approach is widely adopted in [2], [4], [5], [6], [7], [9], [14], [15], [16], [17], [18], and [19]. However, determining the hyperparameter n faces a dilemma since increasing (decreasing) n results in more (fewer) aggregated transactions to describe the characteristics of a bank account, but fewer (more) bank accounts being included in the training/test dataset. Note that many accounts are used infrequently and are removed when we set a large n. Such removals may significantly damage the training of fraud detection systems because fraudulent accounts are rare and many of them are infrequently used.
To preserve a fair percentage of fraudulent accounts for training without sacrificing the number of aggregated transactions, we set n to 9. As shown in Fig. 1, the ratio of fraudulent accounts used for training drops rapidly when the hyperparameter n is larger than 9. The percentage of aggregated fraudulent transactions to all transactions of fraudulent accounts is also higher for the n = 9 scenario.
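The trade-off in choosing n can be made concrete with a small sketch. It assumes, for illustration only, that the n most recent transactions of each account are aggregated and that accounts with fewer than n transactions are dropped; the row layout (account, timestamp, amount), the function names, and the toy data are hypothetical.

```python
from collections import defaultdict

def aggregate_last_n(transactions, n):
    """Group (account, timestamp, amount) rows by account and keep the n most
    recent rows per account. Accounts with fewer than n rows are dropped."""
    by_account = defaultdict(list)
    for account, ts, amount in transactions:
        by_account[account].append((ts, amount))
    windows = {}
    for account, rows in by_account.items():
        if len(rows) >= n:
            windows[account] = sorted(rows)[-n:]
    return windows

def retained_fraction(transactions, fraud_accounts, n):
    """Share of fraudulent accounts that survive aggregation with window n."""
    kept = aggregate_last_n(transactions, n)
    return sum(1 for a in fraud_accounts if a in kept) / len(fraud_accounts)

# Toy data: account B (fraudulent) is used only twice, C (fraudulent) five times.
transactions = (
    [("A", t, 100) for t in range(3)]
    + [("B", t, 900) for t in range(2)]
    + [("C", t, 500) for t in range(5)]
)
fraud_accounts = ["B", "C"]
frac_n2 = retained_fraction(transactions, fraud_accounts, 2)
frac_n3 = retained_fraction(transactions, fraud_accounts, 3)
```

In this toy case every fraudulent account is retained with n = 2, while n = 3 already drops the infrequently used account B. Plotting such a fraction against n is one way to reproduce the kind of curve that motivates the choice n = 9.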

IV. CONSTRUCTION OF BEHAVIOR AND SEGMENTATION-TYPE FEATURES
Whitrow et al. [4] and Bhattacharyya et al. [5] retrieve the patterns of normal and fraudulent transactions by generating the features from aggregated transactions. These features are then used to train classification models to predict fraudulent accounts. Much of the literature follows this approach, and the features considered in the literature can be categorized as recency, frequency, monetary, and anomaly detection techniques (see [7]). This paper extends our previous conference work [10] to collect the features that can be generated based on our raw transaction data (see footnote 7) from [2], [4], [5], [6], [7], [9], [14], [15], [16], [17], [18], and [19], as illustrated in the upper part of Table 2. For ease of analyzing their effectiveness in fraud detection, the set of all these features is denoted as Other in the following experiments. Along with the newly proposed features, we construct two novel feature categories: behavior and segmentation, and denote the resulting feature set as Proposed. The Proposed categories are illustrated in the lower part of Table 2 and the feature constructions from Proposed are discussed in the following two subsections. Our experiments show that training gradient-boosting decision trees with the proposed features could yield good fraud detection results. This echoes the findings in [7] that careful feature engineering without sophisticated machine learning techniques can also achieve good performance.
Generally speaking, the past fraud detection literature creates features mainly based on abnormal patterns of modi operandi for identifying fraudulent accounts. Note that fraudsters must execute illegal behaviors to commit crimes even though their behaviors are generally similar to those of normal account holders most of the time in order to cover their identity. Thus, it is straightforward to identify fraudsters by pinpointing ''what fraudsters do''. However, we find that profiling certain normal behaviors avoided by fraudsters is also beneficial. This is because fraudsters avoid certain behaviors that would increase the risk of disclosing their identity or disturb their criminal plans. Accounting for both ''what fraudsters do not do'' and ''what fraudsters do'' helps distinguish fraudsters from normal users with similar transaction characteristics. Indeed, our improvements would reduce the number of false alarms, hence the inconvenience for normal account holders and the associated labor costs to screen or freeze the accounts [6]. Since the features constructed according to the guidelines of ''what fraudsters (don't) do'' cannot easily fit into RFMA proposed by Baesens et al. [7], we add the categories ''behavior'' and ''segmentation.''

A. BEHAVIOR FEATURES
Features belonging to the behavior category profile characteristics of (non-)fraudulent transactions. These features cannot be easily categorized into RFMA but are critical to recognizing the modus operandi patterns of financial expertise. We list the behavior-type features as follows.
Footnote 7: We ignore features such as the locations of transactions that are not contained in our raw transaction data defined in Table 1. The definitions of the features used in our experiments can be found in Table 4 of [10].

1) IMMEDIATE_PAID_OUT
We count the occurrences of payment made from an account immediately following the event of payment made into the same account on the same day. Note that fraudsters want to transfer funds before the police investigate and freeze the account in question.
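A minimal sketch of this count follows. It assumes that ''immediately following'' means the very next transaction on the account and that each transaction is given as a (date, direction) pair in chronological order; both the representation and the function name are hypothetical.

```python
def immediate_paid_out(transactions):
    """Count pay-outs that directly follow a pay-in on the same day.

    `transactions` is a chronologically ordered list of (date, direction)
    pairs, where direction is "in" or "out".
    """
    count = 0
    for prev, curr in zip(transactions, transactions[1:]):
        same_day = prev[0] == curr[0]
        if same_day and prev[1] == "in" and curr[1] == "out":
            count += 1
    return count

history = [
    ("2019-01-01", "in"), ("2019-01-01", "out"),   # counted
    ("2019-01-02", "in"), ("2019-01-03", "out"),   # different days, not counted
    ("2019-01-03", "in"), ("2019-01-03", "out"),   # counted
]
n_immediate = immediate_paid_out(history)
```

With the toy history above, two of the three pay-outs follow a same-day pay-in, so the feature value is 2.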

2) ATM_TRANSACTION
An ATM transaction is one of 405 types of transactions. We focus on this transaction type since most FTFs utilize ATMs to transfer or dispatch money, as it is convenient, safe, and unlikely to disclose the identities of fraudsters. We count the number of transactions using ATMs from all n aggregated transactions of the same account defined in Figure 1. Note that the 405 transaction types cause the one-hot encoding to result in unnecessarily high-dimensional inputs and poor fraud detection performance. Additionally, there is no natural order for these 405 transaction types, making label encoding unreasonable. In addition, target encoding is impractical since the number of fraudulent accounts to normal ones is extremely biased. Thus, we propose the following features to aggregate information on transaction types.

3) LT_COUNT
We count the occurrences of ''likely legal'' transactions, whose transaction types are frequently conducted by normal users but typically avoided by fraudsters, from aggregated transactions of an account. Some transaction types, such as withdrawals or deposits on bank counters, increase the risk of being identified or getting caught. Other types, such as purchasing and redeeming investment products, are generally irrelevant to the modi operandi. The five most frequently used transaction types used by (non-)fraudulent accounts are illustrated in Table 3; three of these are common to both normal and fraudulent accounts. This shows that naively identifying high-frequency transactions used by (non-)fraudulent accounts does not yield useful features. Our sophisticated feature generation idea is useful since LT_count is found to be the second most important feature, as will be discussed later in Table 13.

4) LAST_PAID_OUT_LARGER_THAN_SAVINGS
This denotes a case in which the last amount paid out from the account was larger than the balance of that account. Note that fraudsters would try their best to transfer funds from fraudulent accounts before these accounts are frozen.

5) FRAUD_FACTOR
This feature represents the likelihood of fraud as the product of several fraud-related features: Fraud_factor = (Last_paid_out_larger_than_savings / n) × (Immediate_paid_out / n) × (ATM_transaction / n), where n denotes the feasible number of aggregated transactions determined in Figure 1. This feature facilitates the capture of coexisting occurrences of fraud-related features to measure the likelihood of fraud.
[Table caption fragment: ''... [28] that will be discussed later in Table 13. ''*'' in parentheses denotes a feature importance of zero. We use red (black) color or ''Proposed'' (''Other'') to denote features used only in this study (in past literature categorized by [7]).'']
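As a worked instance of the product defining Fraud_factor, the sketch below evaluates it for one account. The function name and the example values are hypothetical; the default n = 9 is the value chosen in Section III.

```python
def fraud_factor(last_paid_out_larger_than_savings, immediate_paid_out,
                 atm_transaction, n=9):
    """Product of three fraud-related features, each scaled by the window n."""
    return ((last_paid_out_larger_than_savings / n)
            * (immediate_paid_out / n)
            * (atm_transaction / n))

# Hypothetical account: the last pay-out exceeded the balance (1), three
# immediate pay-outs, and all nine aggregated transactions made at ATMs.
score = fraud_factor(1, 3, 9)          # (1/9) * (3/9) * (9/9) = 1/27
no_signal = fraud_factor(0, 3, 9)      # any zero factor gives zero
```

Because the factors are multiplied, the score is positive only when all three signals occur together, which matches the stated aim of capturing coexisting fraud-related patterns.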

B. SEGMENTATION FEATURES
Segmentation features are constructed by discovering useful classification rules with summary statistics. We first label each ATM or account with its associated bank, branch, or other meaningful classification and then analyze their relationship to fraud. We list these features as follows:

1) SUSPICIOUS_ATM_BANK
We compute summary statistics of the number of fraudulent accesses to each ATM and identify its owner bank. Then we calculate the total number of fraudulent ATM accesses for each bank and label those in the top 5% (see the upper panel of Table 4) of fraudulent accesses as ''Suspicious ATM Banks.'' Note that in significance tests, five percent is a prevalent statistical threshold. High fraudulent access to ATMs belonging to a specific bank may result from the bank's ATM location selection or management policies. 8 For example, ATMs near train stations are frequently used by fraudsters. 9

2) LATM_COUNT
This is the number of times an account has accessed ''likely legal'' ATMs, defined as ATMs that have been used by fraudulent users fewer than six times. We decided on six because 95% (a frequently used statistical threshold) of ATMs in our training set have fewer than six fraudulent accesses (see the lower panel of Table 4).

8 For instance, a bank may deploy ATMs in branches of a chain store with which it cooperates.
9 See the news https://news.tvbs.com.tw/local/1415294
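The LATM_count rule described above can be sketched as follows. This is only an illustrative outline: the function names, the ATM identifiers, and the toy access lists are hypothetical, and the threshold of six is taken from the text.

```python
from collections import Counter


def likely_legal_atms(fraud_accesses, all_atms, threshold=6):
    """Return ATMs with fewer than `threshold` fraudulent accesses.

    fraud_accesses: one ATM identifier per fraudulent access observed
    in the training set (hypothetical input format).
    """
    fraud_counts = Counter(fraud_accesses)
    # Counter returns 0 for ATMs that never appear in fraud_accesses.
    return {atm for atm in all_atms if fraud_counts[atm] < threshold}


def latm_count(account_atm_accesses, legal_atms):
    """Number of times an account accessed a 'likely legal' ATM."""
    return sum(1 for atm in account_atm_accesses if atm in legal_atms)


# Toy data: ATM "A1" has 7 fraudulent accesses, "A2" has 2, "A3" has none.
fraud_accesses = ["A1"] * 7 + ["A2"] * 2
all_atms = {"A1", "A2", "A3"}
legal = likely_legal_atms(fraud_accesses, all_atms)
count = latm_count(["A1", "A2", "A3", "A3"], legal)
print(sorted(legal), count)
```

To avoid leaking test-period labels, the set of likely legal ATMs would be derived from the training set only and then applied unchanged to accounts in the testing set.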

3) SUSPICIOUS_BRANCH
This specifies whether ATM transactions in an account are executed where a lot of fraud occurs. Although actual ATM locations cannot be extracted from the raw transaction data (see Table 1), we instead identify each ATM's owner branch by comparing the serial number of the bank branch with the serial numbers of its ATMs. Then we profile the area where a branch office is located with its ATM transaction data since ATMs owned by the same branch office are located close to the office. We label the branches owning ATMs that have been accessed only by fraudsters as suspicious branches, as illustrated in Figure 2. Note that it does not imply that ATMs of suspicious branches are accessed only by fraudsters, as we have access to the transaction data of only Bank L accounts within a limited period.

4) BRANCH_ID
This summarizes the branch to which (non-)fraudulent accounts belong; it can also be used to generate the above feature.

Recently, open-source tools for automatic feature synthesis, such as Featuretools, have become popular. This tool extracts features from raw data via primitives such as averages, sums, minima, maxima, standard deviations, and the skew of raw data. The aggregates of the primitives of the raw data are then used to synthesize features, such as average debit amounts and maximum account balance. It is challenging to generate behavior- and segmentation-type features with these primitive aggregation techniques. Nevertheless, our discussion of feature construction suggests a new feature synthesis strategy. Specifically, many of the above features are constructed by forming a statistical summary of raw transaction data based on the labels (of normal and fraudulent accounts) and exogenously obtained classification information. For example, we can obtain ATM sequence numbers only from raw transaction data. By labeling each ATM sequence number with exogenous classification information, such as the bank or branch location it belongs to, we can construct a statistical summary as in Table 4. Using fraudulent and normal access labels, we can form features such as Suspicious_ATM_bank and Suspicious_branch to mark banks or branches with many fraudulent accesses. Given the wide variety of exogenous classification information related to different attributes of raw transaction data, it is not efficient to manually collect this information to construct statistical summaries and features. Our findings provide a possible development path for automatic feature construction: various segmentation features could be efficiently generated by using spiders to extract useful classification information from the Internet in combination with a statistical summary generator.
The following experiments compare fraud detection performance using behavior and segmentation features analyzed above with RFMA features and automatic feature engineering methods proposed in previous work. We also analyze the characteristics of noisy features that degrade detection performance and examine the performance improvements obtained by removing these features. The quality and interpretability of our proposed features are attested by their high ranking in terms of the feature importance proposed in [28] published in a premium finance journal.

V. EXPERIMENTS
Note that banks are required to provide qualified and interpretable fraud detection systems. In line with these requirements, we compare fraud detection performance under different feature engineering and resampling techniques in Sections V-A and V-B. Feature importance rankings in Section V-C analyze the interpretability of the features proposed in our paper and previous works.

A. ANALYSES OF FEATURE ENGINEERING TECHNIQUES
We first compare the fraud detection performance by training popular classifiers, such as SVM, random forests, XGBoost, and LightGBM (LGBM), with different feature engineering models and Bank L's real transaction data. We collect our proposed features and RFMA features surveyed in the past literature as defined in Table 2 and train different combinations of the features to compare the fraud detection performance. We also study unstable fraud detection results caused by time-inhomogeneous features and remove these features using the Kolmogorov-Smirnov test. In addition, the performance of automatic feature synthesis algorithms is also compared in this section.
We sort all transaction accounts and their aggregated transaction information in chronological order. The first 60% (last 40%) of the data are used as the training (testing) set. We try different parameter settings to optimize the fraud detection performance, and the best settings in our experiments for each machine learning model are shown in Table 5. As a severe data imbalance hinders the recognition of the characteristics of minority samples (i.e., fraudulent accounts), we use full oversampling to adjust the ratio of fraudulent to non-fraudulent accounts to 1 : 1, where all minority observations are duplicated an equal number of times. The effects of other resampling methods will be studied in Section V-B.

Then we use features proposed by credit card fraud detection works [9] and [7] to predict fraudulent accounts with Bank L's transaction data, as illustrated in Table 6. The prediction results are poor, likely because many of their features depend on specific information associated with their data and cannot be retrieved from our transaction data (see Table 1). Specifically, credit card transaction records include extensive data, such as consumption locations and merchandise, from which they generate meaningful features to improve credit card fraud detection performance. In our scenario, failing to retrieve these features from Bank L's transaction dataset degrades the performance of their models. To make comparisons of feature engineering fairer, we collect all features proposed by [2], [4], [5], [6], [7], [9], [14], [15], [16], [17], [18], and [19] that can be generated based on our raw transaction data to form a feature set Other. Then we compare the performance of training with ''Other'' and (or) our proposed feature set Proposed, as shown in Table 7, to show the advantage of our proposed features in addressing the above heterogeneity problem of raw transaction data.
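The chronological 60%/40% split and the full oversampling step described above can be sketched in plain Python. The record layout (timestamp, feature list, label), the function names, and the toy data are assumptions for illustration only, not the paper's implementation; in particular, rounding the number of copies down means the 1 : 1 ratio is exact only when the class sizes divide evenly.

```python
def chronological_split(records, train_fraction=0.6):
    """Sort by timestamp and cut once, so the test set lies in the future."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def full_oversample(train):
    """Duplicate every fraudulent record the same number of times."""
    fraud = [r for r in train if r[2] == 1]
    normal = [r for r in train if r[2] == 0]
    if not fraud:
        return list(train)
    copies = max(1, len(normal) // len(fraud))
    return normal + fraud * copies


# Hypothetical records: (timestamp, features, label), label 1 = fraudulent.
records = [(t, [float(t)], 1 if t in (1, 4) else 0)
           for t in reversed(range(10))]
train, test = chronological_split(records)
balanced = full_oversample(train)
print(len(train), len(test), len(balanced))
```

Oversampling is applied to the training portion only, so the testing set keeps its original class imbalance.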
Table 7 compares the performance for detecting fraudulent accounts by training different machine learning models with different combinations of features, as listed in the ''Model+Data'' column. Here, we first focus on gray cells that split the data chronologically. Although gradient-boosting classification models, such as LGBM and XGBoost, can achieve strong detection performance if features are properly selected, the inclusion of noisy features can generate unstable results. This echoes the finding of [42] that the classifiers mentioned above are sensitive to overfitting due to the existence of noisy data. In fact, using our Proposed features produces a good F1 score (73.95%) with a low training time, whereas recall rates and F1 scores deteriorate significantly, and training times increase, if we include all the features in Table 2 for training. This confirms the argument in [7], namely, that careful feature engineering improves the performance of machine learning models, as we can use far fewer features (denoted red in Table 2) to achieve good fraud prediction performance.
To determine the causes of the dramatic performance drop in XGBoost and LGBM, we monitor the change in detection performance using a leave-one-out feature selection mechanism; that is, we repeatedly remove one feature at a time from the set of training features. The significant performance deterioration is due to the presence of two features from the (unsupervised) anomaly detection category: LOF and KNN_distance. Simultaneously dropping these two noisy features, as shown in the right panel of Table 7, restores the XGBoost F1 scores, namely, 74.34% (w/ Other+Proposed w/o LOF & KNN_distance) and 68.42% (w/ Other w/o LOF & KNN_distance). Compared to the relatively low F1 scores of the random forest, the nonlinear kernel SVM produces only slightly worse performance than XGBoost if noisy features are removed. Furthermore, SVM performance increases steadily as training features are added, without the deterioration in detection ability caused by noisy features. However, a nonlinear kernel SVM cannot easily rank feature importance to capture the patterns of modi operandi or identify the weaknesses of the transaction procedures. It also takes much more running time than other models in our experiments. Given these interpretability and running time concerns, the following experiments focus on improving XGBoost for simplicity. 10

To explore why detection performance deteriorates significantly due to the presence of LOF and KNN_distance, we repeated the above experiments by sampling the training and testing sets chronologically and randomly with various proportions of the training/testing dataset, as illustrated in Table 9. We trained XGBoost with all Other and Proposed features in these experiments. The percentage in the first column denotes the proportion of data used as the training set; the remaining data were used as the testing set.
Recall that each account can be represented as a vector composed of its features; KNN_distance (LOF) profiles the statistical properties of an account's overall behavior, as it describes the average distance from its feature vector to neighborhood vectors (compared to the average density around the vectors), as stated in [7]. Thus, we can determine whether the overall behavior patterns of (non-)fraudulent accounts are similar during the training and testing periods by calculating the similarity of the cumulative distribution of KNN_distance (LOF) by the KS test. The null hypothesis of the KS test is ''the two distributions are the same'', which we reject in favor of the alternative hypothesis, ''the two distributions are different'', if the p value is small. 11 The p values to test the features' distributions of fraudulent and normal accounts are listed in columns 6, 7, 8, and 9 of Table 9.

10 Comparisons of LGBM are ignored for simplicity since its detection performance is similar to that of XGBoost.
11 The p value is the probability of obtaining test results that are at least as extreme as the current result given that the null hypothesis is correct. Intuitively, the size of the p value reflects the similarity of two distributions.

TABLE 7. Fraudulent detection performance when data are split chronologically (gray cells) and randomly (white cells). To evaluate the stability of our experiments, we conducted each experiment five times to ensure that the results do not vary significantly. The reported performance is the average of the experimental results.
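The two-sample KS comparison described above can be sketched in pure Python as follows. In practice one would more likely call scipy.stats.ks_2samp; the functions here are a self-contained approximation written for illustration. The p value uses the asymptotic Kolmogorov series with a common small-sample correction, which is an assumption of this sketch and is only rough for small samples. The two simulated "testing periods" are hypothetical data.

```python
import math
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x, then compare the two ECDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Approximate p value from the asymptotic Kolmogorov distribution."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the series converges slowly here and p is close to 1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(300)]
test_same = [rng.gauss(0.0, 1.0) for _ in range(300)]     # stable behavior
test_shifted = [rng.gauss(1.5, 1.0) for _ in range(300)]  # drifted behavior

d_same = ks_statistic(train, test_same)
d_shift = ks_statistic(train, test_shifted)
p_same = ks_pvalue(d_same, len(train), len(test_same))
p_shift = ks_pvalue(d_shift, len(train), len(test_shifted))
print(d_same, p_same, d_shift, p_shift)
```

A feature whose training and testing distributions yield a small p value, like the shifted sample here, would be flagged as time-inhomogeneous.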
In Table 9, we observe that chronological sampling (gray cells) yields highly divergent recall rates (1.49%-70.59%) and F1 scores (2.94%-82.76%) with changes in the training data size. These values seem to be highly correlated to the similarity of the cumulative distributions of KNN_distance and LOF proxied by the p values of the KS test. Specifically, low p values of these two features of fraudulent accounts also map to low recall rates, that is, the likelihood of detecting fraudulent accounts. Recall rates increase (70.59%) when p values increase. These relatively low p values are evidence that the modi operandi vary widely; therefore, the patterns captured from the training set may become invalid in the testing set, resulting in low and unstable recall rates. However, the p values of normal accounts are higher and more stable, which suggests relatively stable behavior for normal accounts and hence high precision rates (87.5%-100%). Note that the relatively low precision of 87.5% also maps to a low p value of 0.3% given a training set composed of 70% of the data.
The time-inhomogeneous properties of the KNN_distance and LOF features of fraudulent accounts disappear if the training/testing sets are partitioned by random sampling (white cells). As shown in Table 9, the p values in the chronological sampling cases are generally lower than those in the random sampling cases. This is because generating training and testing datasets by randomly sampling the raw transaction dataset results in similar cumulative distributions of KNN_distance and LOF features across training and testing sets. This further implicitly enables XGBoost to foresee the rapid changes in future modi operandi from the training set, which clearly is impossible in real-world applications, as reflected in the unrealistically high detection performance. The recall rates (86%-90%) and hence the F1 scores (88%-93%) all become high and stable regardless of changes in the training data size. We further repeated the experiments introduced at the beginning of Section V-A with a 60%/40% random partition of the training and testing sets, respectively, instead of a chronological split; the results of random partitioning are also shown in Table 7. We observe that the presence of noisy features LOF and KNN_distance no longer deteriorates the fraud detection results of XGBoost. The F1 score is 86% for ''XGBoost w/ Other+Proposed'' and 80% for ''XGBoost w/ Other,'' both of which outperform the counterpart experiments that remove LOF and KNN_distance. Since changing modus operandi patterns should not be foreseen, it is inappropriate to assess a machine learning model with time-inhomogeneous data by random sampling or cross-validation.
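The way random partitioning hides time inhomogeneity can be illustrated with a small simulation. Everything here is hypothetical: a single synthetic feature whose level drifts upward over time, and the gap between training and testing means used as a crude stand-in for the KS comparison of cumulative distributions.

```python
import random
import statistics

rng = random.Random(0)
n = 1000
# Hypothetical drifting feature: its level rises steadily over time.
values = [t / n + rng.gauss(0.0, 0.1) for t in range(n)]


def mean_gap(train, test):
    """Absolute difference between training and testing means."""
    return abs(statistics.mean(train) - statistics.mean(test))


cut = round(0.6 * n)

# Chronological split: the test period follows the training period.
chrono_gap = mean_gap(values[:cut], values[cut:])

# Random split: shuffling mixes early and late observations.
shuffled = values[:]
rng.shuffle(shuffled)
random_gap = mean_gap(shuffled[:cut], shuffled[cut:])

print(chrono_gap, random_gap)
```

Under the chronological split the two sets differ clearly, whereas after shuffling they look alike, so a model evaluated on the random split has effectively seen the future distribution during training.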
We also applied the KS test to all Other and Proposed features as in Table 10 with the chronological data split of Table 7 to examine the similarity of these features' cumulative distributions in the training/testing datasets: the p values of features other than LOF and KNN_distance are all high for both fraudulent and normal accounts. This implies that the distributions of these features in the training and testing sets are similar, and explains why the detection performance of XGBoost significantly improves when both LOF and KNN_distance are removed, as illustrated in Table 7. We examine the robustness of this finding by changing the proportion of the training set, as illustrated in Table 9 (blue cells). Excluding these two features stabilizes the recall rates and F1 scores, which are generally higher than the results in Table 9 (gray cells) regardless of changes in the training data size. We also use the discriminative learning studied in Bickel et al. [43] and Nair et al. [44] to show that removing noisy features makes the observations retrieved from training/testing sets harder to distinguish, as in Table 8; that is, fraud detection models become less likely to learn time-varying patterns, which results in reduced predictability. Because the F1 score of the discriminative learning experiment is high when noisy features are not removed, the original account transaction data do exhibit a significant dataset shift problem. The F1 scores of the discriminative learning model decline after removing noisy features, which indicates that this problem has been alleviated. This explains why fraud detection results improve after removing noisy features.

TABLE 9. Training XGBoost with various training set sizes. All numerical results are the averages from training and evaluating XGBoost five times. Gray and white cells denote data split chronologically and randomly, respectively, with all Other and Proposed features. Blue cells denote data split chronologically with KNN_distance and LOF excluded. Columns 6, 7, 8, and 9 list the p values generated by the KS test. The running times listed in the last column increase with the training set's size.
To verify that handcrafted feature generation is essential when mass transaction data are not available under the constraint of privacy regulations, as in our case, we compare several feature synthesis algorithms proposed by [18], [21], [23], and [22], as illustrated in Table 11. Lucas et al. [18] generate features with multiperspective hidden Markov models to learn the monetary and recency properties of the transaction sequences from the credit card transaction histories. Kanter and Veeramachaneni [21] use a deep feature synthesis algorithm to generate features. Ebiaredoh-Mienye et al. [23] use the stacked sparse encoder to generate a feature representation for each observation. We train the above three models with Bank L's transaction data to create features. These features are then used to train the classifier models listed in the second column of Table 11. Most automatic feature generation algorithms perform poorly, except that training the random forest with the deep feature synthesis algorithm yields good detection results. The performance of training gradient-boosting classification models with these algorithms is poorer than that of training with our proposed features. In addition, Esenogho et al. [22] use long short-term memory (LSTM) to capture temporal patterns from credit card transaction data. Then, they train an adaptive boosting model with these synthetic features to detect credit card fraud. However, the fraud detection performance in our experiment is also poor. This might be because most of these works study the detection of credit card fraud, which provides more aspects and larger amounts of transaction information compared to the fund transfer data of banks' accounts detailed in Table 1.
In addition, it might be a challenge for automatic feature generators to generate sophisticated behavior or segmentation features that require complex generation procedures and financial expertise in contrast to the simple RFMA features proposed in the past literature. Besides, these automatic feature generators typically require more computational resources, reflected in higher running times.

B. ADDRESSING DATA IMBALANCE
To visualize the dataset imbalance 12 and the accounts' pattern features, each observation in our dataset is represented by a high-dimensional vector composed of the Other and Proposed features analyzed in Section IV except for KNN_distance and LOF. We used a 60%/40% chronological split for the training and testing datasets, respectively. We used principal component analysis (PCA) to project the observations represented by high-dimensional vectors to a two-dimensional plane, as illustrated in Figure 3. There are significantly fewer fraudulent observations (denoted by blue spots) than non-fraudulent ones (orange spots). Fraudulent observations are clustered in several subgroups, suggesting various modi operandi in the training dataset.
To mitigate learning bias toward the majority class due to data imbalance, we compared the resampling methods analyzed in [34], [35], and [27] with a Wasserstein GAN (see [24]). The ratio of fraudulent to non-fraudulent observations was rebalanced from 1 : 250 to 1 : 1 in the training data. Scatter plots for after-resampling training data are shown in Figure 3, and the corresponding fraud detection performance is presented in Table 12. Full oversampling and random oversampling yield better F1 scores (74.34% and 73.87%) and higher areas under the curve (AUC) of the precision-recall curve (0.7689 and 0.7559) than other resampling methods since these two methods do not change the subgroup pattern of the modi operandi, as observed in Figures 3(b) and 3(c). Random undersampling produces the worst F1 score (41.67%) and AUC (0.6394), as it removes excessive non-fraudulent observations due to the extreme imbalance in the dataset, in which only 0.4% of accounts are fraudulent. Such removal harms the pattern learning of non-fraudulent accounts and reduces precision considerably. However, it also provides higher recall rates and uses less running time than other methods.

TABLE 12. Comparisons of resampling methods. The results were obtained by training XGBoost five times with Other and Proposed features excluding KNN_distance and LOF. The ratio of non-fraudulent to fraudulent accounts was adjusted to 1 : 1 by resampling. The first 60% (last 40%) of the data were split chronologically into training (testing) set data. The means of all standardized features' means and variances of the original data are the same as those of full oversampling. The running times are listed in the last column.
Unlike oversampling methods, which produce minority observations directly by replication, SMOTE-related methods produce observations by synthesizing new samples via linear interpolation. Our experiments in Table 12 suggest that these methods all yield lower F1 scores than the full/random oversampling methods. Figure 3 shows that SMOTE linear interpolation harms the learning of subgroup patterns of fraudulent accounts. Specifically, SMOTE, ADASYN, SMOTE_ENN, and SMOTE_TOMEK add synthesized fraudulent observations in places densely populated by non-fraudulent observations since these methods perform linear interpolation across all fraudulent observations. The resulting noise clearly reduces the ability to distinguish fraudulent accounts from normal ones. Borderline_SMOTE and SVM_SMOTE, however, perform linear interpolation selectively, and synthetic observations are added to the left part of the subfigure where fraudulent observations are dense, as shown in Figures 3(f) and 3(g). However, these two methods produce even lower F1 scores and AUCs.
The above problem might be explained by the argument in [45] that ''SMOTE does not change the expected value of the (SMOTE-augmented) minority class but it decreases its (minority class's) variability'' due to linear interpolation. Note that the decrease in the variability in fraudulent observations due to the use of SMOTE-related methods inhibits fraud detection models from identifying different fraud patterns. To estimate in greater detail how resampling methods influence the distributions of observations, we calculate the means of all standardized features' means and variances after applying different resampling procedures, as illustrated in columns 7 and 8 of Table 12. Applying full/random oversampling and random undersampling does not alter the means of features' means and variances, but applying SMOTE-related models could significantly reduce the mean of features' variances, especially for Borderline_SMOTE and SVM_SMOTE, because all synthetic observations are added to the tight area crowded with fraudulent accounts; that is, the left part of Figures 3(f) or 3(g). This phenomenon aggravates the effect of decreasing variance. Such unbalanced observation insertions (compared to other SMOTE-related methods) also decrease the mean of minority samples. The significant changes in the statistical properties could explain why these two SMOTE methods produce the poor detection results shown in Table 12. In conclusion, we suggest that SMOTE-related methods perform more poorly than simple oversampling methods if minority observations form several subgroups with severe data imbalance; this result is consistent with the findings in [46].

Furthermore, unlike the methods mentioned above, which use oversampling or linear interpolation to synthesize fraudulent observations, WGAN refines its synthetic samples via interaction between the generator and discriminator networks and does not insert improper observations, as illustrated in Fig. 3(l), nor does it decrease the means of features' means and variances, as in Table 12. Additionally, it somewhat prevents classification models from overfitting since synthetic observations are not exact replicas of the original observations. Thus, WGAN outperforms all other resampling methods in terms of the F1 score and AUC of the precision-recall curve, as illustrated in Table 12.
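The variance argument from [45] discussed above can be checked on a toy example. The sketch below is a deliberate simplification and not SMOTE itself: real SMOTE interpolates between a minority point and one of its nearest minority neighbors, whereas this hypothetical `smote_like` function interpolates between random pairs, mimicking interpolation across all fraudulent observations. The one-dimensional data with two subgroups are invented for illustration.

```python
import random
import statistics


def smote_like(minority, n_new, rng):
    """Synthesize points by linear interpolation between random pairs."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()
        synthetic.append(a + lam * (b - a))
    return synthetic


rng = random.Random(1)
# Hypothetical minority class with two subgroups (two modi operandi).
minority = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]

# Full oversampling: every observation is duplicated 50 times.
oversampled = minority * 50

# Interpolation: keep the originals and add 294 synthetic points.
synthetic = smote_like(minority, 294, rng)
interpolated = minority + synthetic

var_orig = statistics.pvariance(minority)
var_over = statistics.pvariance(oversampled)
var_smote = statistics.pvariance(interpolated)
# Synthetic points that fall in the empty region between the subgroups.
in_gap = sum(1 for x in synthetic if abs(x) < 0.5)

print(var_orig, var_over, var_smote, in_gap)
```

Duplication leaves the variance untouched, while interpolation shrinks it and places synthetic observations between the two subgroups, where no real minority observation lies.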

C. INTERPRETABILITY AND FEATURE IMPORTANCE
The interpretability of fraud detection models is critical, as unreasonable false accusations and missed arrests could have serious legal ramifications. (Non-)fraudulent behaviors can also be profiled with interpretability to improve existing fraud detection rules and to sketch the cause-effect relationship of the model's fraud detection process. Currently, banks are required to abide by anti-money-laundering (AML) guidelines by screening transaction records according to specified rules. However, the rule-based model of our partner bank yields poor precision (40%) and recall (5.56%) rates. Training XGBoost with sophisticated handcrafted features based on the transaction patterns of (non-)fraudulent accounts and financial expertise can improve fraud detection results and capture cause-effect relationships to abide by AML guidelines. Feature importance reflects the strength of a relationship between a modus operandi and the feature. Table 13 measures and ranks the importance of the features defined in Table 2 (except for KNN_distance and LOF) by following the method proposed by Butaru et al. [28] published in a premium finance journal. All 10 features in our behavior and segmentation categories have nonzero importances, whereas 3 of the 30 RFMA features (see [8] and [7]) have an importance of 0. In addition, the two most important features belong to the behavior or segmentation categories. The experiments in Table 7 also suggest that training XGBoost with Proposed features outperforms XGBoost with Other features (with or without KNN_distance and LOF). These results confirm the argument in [7], namely, that good fraud detection can be achieved by careful feature engineering techniques even with simple classifier models.

VI. CONCLUSION
Due to the limited amount of available transaction data and strong interpretability requirements, much of the literature addresses financial fraud detection by training a machine learning model with sophisticated handcrafted features instead of raw transaction data or automatically synthesized features. Handcrafted features generated in the literature can be divided into categories of recency, frequency, monetary, and anomaly (RFMA). This paper proposes behavior- and segmentation-type features describing non-RFMA characteristics belonging solely to (non-)fraudulent accounts. Behavior-type features are generally constructed based on financial expertise, which can be interpreted as a knowledge base of an expert system. Segmentation-type features can be constructed based on statistical summaries of the classifications of raw transaction data, as in Tables 3 and 4, providing a good hint for future designs of automatic feature generation. To show the superiority of our proposed features, we compare the performance of popular classifiers, such as SVM, random forests, XGBoost, and LGBM, trained with features generated by automatic generation algorithms, features proposed in the past fraud detection literature, and features proposed in this paper. We analyze the features that cause XGBoost and LGBM to produce unstable detection results. These noisy features are time-inhomogeneous and are detectable using the Kolmogorov-Smirnov test. According to the experimental results, although SVM and random forest produce stable predictions without suffering from this unstable detection problem, XGBoost and LGBM yield better fraud detection results with fair interpretability once noisy features are removed. In addition, the presence of noisy features reflects the time-inhomogeneous nature of the modi operandi. Improperly assessing the robustness of a machine learning model by generating training/testing sets with random sampling eliminates such time inhomogeneity and falsely produces good performance.
To address data imbalance due to the small number of fraudulent accounts, we examine multiple resampling methods and WGAN. Because SMOTE-related methods apply improper linear interpolations on different modus operandi patterns, they decrease the variability of overall fraudulent observations and generate low-quality fraudulent observations. However, full (random) oversampling and WGAN avoid these problems and improve the detection results. The quality of our proposed features (categories) is verified by showing that the features in the proposed categories rank high according to the method proposed in [28] published in a premium finance journal.

YEN-WU TI received the B.E. degree in mathematics from Tamkang University, in 1995, the M.E. degree in applied mathematics from the National Chiao Tung University, Hsinchu, Taiwan, in 1997, and the Ph.D. degree in computer science and information engineering from the National Taiwan University, Taipei, Taiwan, in 2009. He is currently an Associate Professor at the College of Artificial Intelligence, Yango University, China. His research interests include machine learning and algorithms.
MING-CHUAN HUANG received the B.E. degree in computer science from the National Taiwan Normal University, Taipei, Taiwan, in 2020. She is currently a Graduate Student at the Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan. Her research interest includes anti-money laundering.
TING-HUI CHIANG received the Ph.D. degree in computer science from the National Yang Ming Chiao Tung University, Taiwan, in 2018. He has been an Assistant Professor at the Department of Information Engineering and Computer Science, Feng Chia University, Taiwan, since 2019. His research interests include artificial intelligence and indoor localization. In AI research, he focuses on activity recognition, video inpainting, and fintech services. For localization research, he focuses on wireless localization, pedestrian dead reckoning, acoustic localization, particle filters, and AI-based localization models.
LIANG-CHIH LIU received the Ph.D. degree in finance from the National Chiao Tung University, in 2016. He joined the Department of Information and Finance Management, National Taipei University of Technology, in 2017, as an Assistant Professor. His research interests include computational finance, credit risk issues associated with corporate debt structure, issues concerning optimal call policy of corporate bonds, and text mining.