Amaretto: An Active Learning Framework for Money Laundering Detection

Monitoring financial transactions is a critical Anti-Money Laundering (AML) obligation for financial institutions. In recent years, machine learning-based transaction monitoring systems have successfully complemented traditional rule-based systems to reduce the high number of false positives and the effort needed to review all the alerts manually. Unfortunately, machine learning-based solutions also have disadvantages: while unsupervised models can detect novel anomalous patterns, they are usually characterized by a high number of false alarms; supervised models, instead, usually offer a higher detection rate but require a large amount of labeled data to achieve such performance. In this paper, we present Amaretto, an active learning framework for money laundering detection that combines unsupervised and supervised learning techniques to support the transaction monitoring process by improving detection performance and reducing compliance management costs. Amaretto exploits novel selection strategies to target a subset of transactions for investigation, making more efficient use of the feedback provided by the analyst. We perform the experimental evaluation on a synthetic dataset provided by our industrial partner, which simulates the profiles of clients trading in international capital markets. We show that Amaretto outperforms state-of-the-art solutions by reducing money laundering risk and improving detection performance. In particular, we compare state-of-the-art unsupervised and supervised techniques commonly used in the AML domain with the ones implemented in this work, showing that the Isolation Forest and Random Forest used in Amaretto perform best in the task under analysis, with an AUROC of 0.9 for the former (20% better on average) and a detection rate of 0.793 for the latter (30% better on average). In addition, they are characterized by lower investigation costs, computed in terms of the daily number of transactions to be examined and the number of false positives and false negatives. Finally, we compare Amaretto against a state-of-the-art active learning fraud detection system, achieving better detection performance and lower costs in all the analyzed scenarios. Notably, Amaretto improves the detection rate by up to 50% and reduces the overall cost by 20% in the most realistic scenario under analysis.


I. INTRODUCTION
Money laundering encompasses any process by which the income of unlawful activities (e.g., drug trafficking, illegal arms trafficking, tax evasion) is introduced into the financial system through multiple operations that conceal its illicit origins. Nowadays, money laundering affects all worldwide economies and is responsible for generating illegal financial flows between $1.6 and $2.85 trillion per year, equivalent to 2.1%-4% of the Gross World Product [1]. Transaction monitoring in AML consists of a set of activities carried out by analysts and automated systems to scrutinize customers' transactions. The aim is to detect suspicious behaviors linked to money laundering. Financial transactions include bank transfers, credit card payments, or investment banking transactions such as equity and derivative trades.
An expert system is often the first step in implementing AML procedures, deploying rules configured to monitor pre-determined unusual behaviors. The expert system generates alerts if rules are triggered (e.g., if amount > 10,000,000 then raise alert). The benefits of a heuristic-based approach are that the output of the system is easy to interpret and that subject matter experts (i.e., analysts working in the AML domain) can readily act on that information. The disadvantage of expert systems is that money laundering techniques and financial crime are always evolving, so the rules need to be updated to ensure they can capture these changes. Moreover, rules can only cover known anomalous behaviors and cannot detect unknown unusual behaviors, resulting in false negatives. The fact that rules have to be configured using specific static thresholds results in a high number of false positives and, subsequently, an increase in the volume of manual investigations.
Machine learning enhances these AML techniques by overcoming some of the pitfalls of rule-based systems. Machine learning models can extract and analyze patterns and insights from data and identify unusual correlations unknown to subject matter experts. Supervised machine learning models can classify transactions as normal or anomalous; however, they require a large sample of manually reviewed transactions (labels). In order to collect a valuable set of labeled data as quickly and as efficiently as possible, these modern systems can leverage active learning. Active learning is a technique that uses machine learning models to select for investigation the transactions that have the highest probability of improving the performance of the supervised machine learning system.
In this paper, we present Amaretto, an active learning system that combines unsupervised and supervised models organized in an ''analyst-in-the-loop'' framework. The unsupervised model allows the system to detect unknown anomalies and new patterns that have not been seen before, while the supervised model can use labels previously classified by subject matter experts to improve the detection rate. The system preprocesses the raw transactional data, converting it to aggregated vectors; the aggregation is performed because money laundering patterns usually comprise multiple transactions executed within a period of time. The unsupervised and supervised models take the vectors as input and compute an anomaly score for each one. Subsequently, a selection strategy is applied to choose the samples that the analyst will review. Finally, the labels collected from the analyst are used as training data for the supervised model that will compute the final risk score for the aggregated vectors.
We perform the experimental evaluation on a synthetic dataset provided by the industry partners we collaborated with. The dataset includes both normal transactions and anomalous patterns that may be linked to potential money laundering activities, and it is modeled on real-world investment banking scenarios. On this data, we compare state-of-the-art unsupervised models and demonstrate that Isolation Forest is the best performing for the AML task under analysis. Then, we conduct a similar assessment amongst supervised techniques, concluding that Random Forest outperforms the others. Subsequently, we show how the unsupervised model complements the supervised one by detecting new types of anomalies. Finally, we confirm the robustness of our design in a realistic scenario, identifying the best selection strategies among the ones proposed and showing that Amaretto outperforms state-of-the-art solutions by improving the detection rate and reducing the compliance management costs for financial institutions. Amaretto improves the True Positive Rate (TPR) by up to 50% and reduces the overall cost by 20% in the most realistic scenario under analysis. It is important to highlight that, to provide a meaningful comparison between Amaretto and the other approaches under analysis, we perform the entire experimental evaluation on the same dataset. In addition, to allow the reproducibility of the results, we released the synthetic dataset of transactions at https://github.com/necst/amaretto_dataset.
In summary, the contributions are the following: • Amaretto, an active learning system that combines unsupervised and supervised models organized in an ''analyst-in-the-loop'' framework.
• A novel selection strategy for an active learning framework that detects potential money laundering patterns. This strategy considers event diversity and prioritizes new anomalous patterns to improve the quality of the knowledge base and training set.
• The experimental evidence that demonstrates the importance of an active learning framework to achieve better detection performance and to reduce the cost of monitoring transactions for financial institutions. This analysis includes a comparison of supervised and unsupervised algorithms for detecting potential money laundering, including detailed benchmarking.

The remainder of this paper is structured as follows: In Section II, we provide some background concepts related to the money laundering detection problem, alongside an analysis of the main challenges, which is fundamental for understanding the choices we made in Amaretto. In Section III, we present some of the most relevant works related to money laundering and fraud detection, highlighting the main differences with Amaretto. In Section IV, we describe the synthetic dataset at our disposal and the classes of anomalies considered in this work. In Section V, we provide a detailed description of Amaretto, its main components, and the design choices that were made to build an end-to-end active learning framework. Then, in Section VI, we show the experimental evaluation of our framework. Finally, in Sections VII and VIII, we discuss the limitations, future work, and the conclusions of our work.

II. BACKGROUND AND CHALLENGES
Detecting money laundering can be considered one of the most challenging tasks within anomaly detection. First of all, there is no common worldwide regulation that sets standards for which transactions are suspicious with respect to money laundering activities. Furthermore, money laundering processes involve multiple transactions between different counterparties using diverse monetary instruments; therefore, traditional anomaly detection techniques analyzing transactions in isolation may be ineffective. This paper focuses on detecting unusual transactional activities that may be linked to money laundering occurring in capital markets. Unusual transactions and customer behaviors are considered outliers associated with a money laundering risk. In [2], an outlier is defined as ''an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data''. In this paper, an anomalous behavior is considered high risk for money laundering and, therefore, should be investigated. This is similar to fraud; however, money laundering often involves multiple transactions across multiple accounts, while fraud mainly occurs at the transactional level. In an ideal world with unlimited resources, an analyst would look at every transaction and then decide which ones are worth investigating further as potentially linked to money laundering. Considering the large volume of transactions executed daily in global markets, this approach is not feasible because financial institutions, regulators, and enforcers have a limited number of subject matter experts to deal with such a demand. To solve this problem, organizations employ a risk-based approach by adopting automated systems to flag and allocate transactions for review. The objective is to maximize the time spent investigating suspicious activities with a high risk of money laundering.
One of the key challenges in researching novel approaches to detect money laundering is the lack of standardized and available datasets. Financial institutions rarely share data due to confidentiality reasons and specific regulations. As part of this research, we leveraged a dataset generated by our industry partners; they specialize in AML and work with financial institutions to help them comply with AML regulations. The dataset implements different suspicious patterns similar to those defined by the Financial Action Task Force (FATF) on money laundering [3].

III. RELATED WORKS
In recent years, several systematic review papers have been published describing machine learning applied to the fraud and money laundering detection domains [4]-[12]. From a high-level point of view, current approaches can be divided into unsupervised, supervised, and active learning techniques. In addition, these works describe each method's strengths and weaknesses, highlighting the need for combining them and providing further motivation for the work presented in this paper. While supervised solutions achieve high performance in detecting known frauds, they cannot find new fraudulent patterns, resulting in false negatives; therefore, unsupervised techniques are needed to detect novel money laundering patterns.

A. UNSUPERVISED LEARNING
Unsupervised learning is mainly used to detect unusual correlations, and it is applied where it is expensive to obtain labels (i.e., multiple analysts need to review a significant number of data points). The main principle in money laundering detection is to quantify how a transaction (or group of transactions) deviates from the norm.
In the fraud detection domain, Ramaswamy et al. [13] propose a formulation for outliers in terms of the distance of a point from its neighbors.
Williams et al. [14] prove that a Replicator Neural Network can detect anomalies in very diverse datasets and, in some cases, overcomes issues commonly affecting Neural Networks, such as training with a small dataset.
BankSealer [15], [16] works in an unsupervised setting, extracting local, global, and temporal profiles [17] for each user to capture their behaviors. The same authors also study the security of fraud detection systems against mimicry and adversarial attacks [18], [19].
In the AML domain, a recent research trend has demonstrated the effectiveness of the application of Isolation Forests and Support Vector Machines to the detection of money laundering patterns [20], [21].
Le-Khac and Kechadi [22], [23] focus on detecting anomalies in investment funds; they suggest an approach based on clustering profiles into categories and feeding a Backpropagation Neural Network with the transformed data to output an anomaly score for each transaction. This approach is specific to the problem and dataset: the entire learning process is based on two high-level features derived from the raw data, which seems to offer a limited perspective on the complexity of the underlying behaviors.
Torres and Ladeira [24] propose a hybrid approach composed of unsupervised outlier detection algorithms and the use of Visual Analytics methods to support the real-time human analysis to reduce the incidence of false positives. Similarly to Amaretto, the proposed approach targets the problem of improving the analysis of the vast daily volume of financial transactions. However, due to the exploitation of Visual Analytics techniques, the presented approach impacts human analysts' processing time, possibly increasing the investigation costs.
Paula et al. [25] address the problem of money laundering in Brazilian exports using a Deep Learning Autoencoder, demonstrating its effectiveness against PCA-based methods.
The disadvantage of unsupervised models is that, in practice, an analyst still has to verify whether all the predictions were correct, and an unsupervised model cannot fully leverage the output of these reviews in subsequent runs. Also, unsupervised techniques tend to generate a large number of false positives due to unusual data correlations that are perfectly acceptable [26]. This is an issue for institutions since reviewed false positives translate into a direct cost for the organization. Moreover, the lack of focus due to the high number of alerts could lead to potential money laundering cases not being reviewed promptly or being missed altogether.

B. SUPERVISED LEARNING
Supervised learning is used when labels are available. The main principle in money laundering detection is to quantify how a transaction (or group of transactions) is similar to known fraudulent patterns.
In the fraud detection domain, Bhattacharyya et al. [27] demonstrate that in a real credit card fraud scenario, a Random Forest model outperforms Support Vector Machines and Linear Regression across all metrics used for the comparison. One of the first AML-specific studies focused on rule-based methodologies (Decision Trees). This approach was used to create automated systems such as the Financial Crimes Enforcement Network AI System (FAIS) [28]. This system allows the analyst to follow evidence left by linked transactions and computes an anomaly score for each transaction. Simple Bayesian networks are used to update and combine evidence that a transaction or activity is illicit.
In the AML domain, the most commonly used techniques are Random Forests, Support Vector Machines, Decision Trees, deep neural networks, and rule-based systems [9]-[11], [29]. In recent years, gradient boosting techniques have also been successfully exploited [29]-[33].
Jullum et al. [30] detect money laundering at a transactional level using XGBoost and demonstrate its effectiveness against traditional rule-based systems. Alkhalili et al. [34] propose a watch-list filtering component applied to ML methods (i.e., Support Vector Machines, Decision Trees, and Naive Bayes) to reduce the number of false positives and to minimize analyst effort; they demonstrate that the SVM outperforms the other algorithms. Similarly to Amaretto, both works [30], [34] highlight the importance of a selection strategy that takes into consideration non-reported alerts/cases. However, both works focus only on supervised learning techniques.
Tertychny et al. [31] address the scalability and the imbalance-resistance problem of the AML detection domain by proposing a two-layered ensembled modeling approach composed of a Logistic Regression model and gradient boosting techniques. They validate the approach using a real dataset of customer profiles and transaction histories, together with labels provided by AML experts.
Farrugia et al. [32] extract features from the historical transaction data of accounts marked as illegal activities by the Ethereum community and regular accounts on the Ethereum platform. The authors use XGBoost to build a classification model to detect illegal accounts.
Vassallo et al. [33] propose an adaptation of the XGBoost algorithm and present a comparative analysis of various offline decision tree-based ensembles. They demonstrate that decision tree-based gradient boosting algorithms outperform state-of-the-art Random Forest results at both account and transaction levels. In this work, as presented in Section VI, we also compare gradient boosting and random forest techniques, which achieve comparable performance. However, we decided to use the Random Forest algorithm for Amaretto's supervised module due to its lower cost in terms of false positives and negatives.
Supervised learning directly exploits manually reviewed transactions (i.e., labeled data) and generally outperforms unsupervised learning in anomaly detection and classification tasks [35]. However, a large amount of labeled data is required to achieve adequate performance. Additionally, supervised learning is not as effective at detecting new anomalous patterns (resulting in false negatives) compared to unsupervised learning. This is where active learning plays an important role in bridging unsupervised and supervised anomaly detection.

C. ACTIVE LEARNING
Amaretto implements an active learning system combining both supervised and unsupervised learning, leveraging their strengths and mitigating their weaknesses. Active learning is a process whereby a traditional anomaly detection system is enhanced with a model that queries a subject matter expert to label a transaction or group of transactions (suspicious or genuine). This model is used to select which transactions the subject matter expert should investigate to minimize manual data reviews and, at the same time, ensure the output of the overall anomaly detection system is improved. In [36], the authors exploit analyst feedback to self-tune and improve BankSealer's detection performance using a multi-objective genetic algorithm. In [37], the authors propose an ensemble of unsupervised methods, including a Density-based model, a Matrix Decomposition-based model, and a Replicator Neural Network. By combining the anomaly scores computed by the three models, their system ranks the instances based on the most anomalous ones and then presents them to the subject matter expert for review; the feedback collected is used to train a Random Forest model. Further to this research, in [38], the authors point out the importance of selecting different types of anomalies to enhance active learning frameworks (i.e., selecting different classes of anomalies).

D. DISCUSSION
With respect to the presented works, Amaretto explicitly focuses on reducing the cost and risk for a financial institution; the cost is due to the resources involved in manually reviewing multiple transactions, whilst the risk is linked to not detecting illicit activities. To do so, we directly exploit the main insights and results of the presented research works, evaluating them in terms of the investigation costs and not only from the detection performance point of view. Amaretto also optimizes the selection strategy (i.e., the strategy used to select the transactions for the subject matter expert to investigate) in order to spot novel anomalous patterns and improve the detection rate. This strategy prioritizes which transactions should be further investigated by AML investigators by considering the ''diversity'' of the output produced by the unsupervised module.
Another approach commonly applied in fraud and money laundering detection is the analysis of graphs, which is out of scope for Amaretto. For instance, Alarab et al. [39] present a novel approach based on a Graph Convolutional Network combined with a MultiLayer Perceptron to predict illicit transactions in the Bitcoin transaction graph. We refer the reader to [40]-[42] for a review of the research regarding graph-based anomaly detection methods in fraud detection, intrusion detection, telecommunication, and opinion networks.

TABLE 1. Capital market dataset details: number of transactions (total (T), legitimate (L), and anomalous (A)), ratio of anomalies over the total number of transactions, number of attributes, and number of transaction originators. The dataset is highly unbalanced, with only a small portion of transactions being anomalous.

IV. CAPITAL MARKET DATASET ANALYSIS
In the AML domain, one of the major limitations is the difficulty to obtain a real dataset from financial institutions due to privacy concerns. Besides, it is even more complicated to get a labeled dataset. Therefore, we make use of a synthetic dataset generated by our industry partner using a custom-built data generator, which simulates transaction profiles of clients transacting in international capital markets. We use the same synthetic dataset for all the experiments presented in Section VI in order to provide a meaningful evaluation of the performance of Amaretto and the other methods under analysis. In addition, to allow the reproducibility of the results and a fair comparison with future works, we released our synthetic dataset of transactions at https://github.com/necst/amaretto_dataset.
The data combines more than 10,000 parameters extrapolated from real market data. The dataset consists of 29,704,090 transactions executed by 400 end clients buying and selling specific securities in a specific market. Circa 90% of the users made at least 50,000 transactions, while 10% of the users performed circa 400,000 each, which means that 10% of the clients executed almost 50% of the transactions. 98.43% of the transactions have an amount of less than 1M USD, and 40% of them have an amount of less than 10K USD. The data covers 60 days divided into 12 weeks (a week comprises 5 days; Saturdays and Sundays are not included because markets are closed during the weekend). Transactions are evenly distributed among the 12 weeks, and most of them are executed during market opening hours, while only a small percentage is executed during the early morning hours and at the end of the day. Table 1 shows a summary of the statistics of the dataset. Key fields contained in the data include the transaction amount, the product class (there are 17 different product classes representing the main products traded in capital markets, e.g., Equity, Fixed Income), the product type (e.g., cash equity, future equity, bond), the time field, the currency, and the market. Within the data, it is not possible to identify any specific statistical distributions in any key field.

A. SYNTHETIC ANOMALIES OVERVIEW
Financial datasets are known to be extremely unbalanced, usually containing from 0.1% to 1% of anomalous transactions [18]. This was also confirmed by the domain experts we interviewed. Therefore, to replicate real-world scenarios, we set the number of anomalies to less than 1% of the data. As part of this dataset, we generated five classes of anomalies based on examples of suspicious patterns suggested by the FATF [3], an inter-governmental body that promotes the effective implementation of legal, regulatory, and operational measures for combating money laundering. Anomalous transactions follow the same overall trends as legitimate ones, so that they remain well hidden in the dataset. Below, we describe the classes of anomalies injected into the dataset.
Small but highly frequent transactions generated within a short timeframe: A pattern that contains multiple transactions below applicable reporting thresholds.
Transactions with rounded normalized amounts bought or sold within an account: It is unusual for transactions in capital markets to have rounded amounts (unless they occur in markets where foreign exchange conversion causes rounding errors).
Security bought or sold at an unusual time: It is unusual for clients to trade specific securities outside of a specific timeframe (for example, outside of the opening and closing times of a stock exchange).
Large asset withdrawal: A sudden spike in transaction amount withdrawn from an account or transferred out, which deviates from the previous transactional activity and is absent of any commercial rationale or related corporate action event.
An unusually large amount of collateral transferred in and out of an account within a short period of time: This behavior is unusual as a client would not be able to invest by simply trading collateral, or at least such a strategy would be unusual.

V. AMARETTO APPROACH
Supervised detection of money laundering requires sufficient labeled data. The only way to have reliable labels is to have all transactions manually reviewed by subject matter experts, which is not feasible. For this reason, we opted for a hybrid solution based on active learning [43], which combines unsupervised and supervised techniques in an analyst-in-the-loop framework. In this framework, the system uses unsupervised learning to rank activities by anomaly score so that the most suspicious ones are reviewed first. Supervised models are then trained on the domain experts' feedback (i.e., the labeled dataset) and used to select additional data points for review.

A. APPROACH OVERVIEW
In Figure 1, we present an overview of the approach implemented in Amaretto. The first step in the Amaretto workflow is to aggregate the raw transactional data across a specific timeframe to produce features representing high-level vectors that capture the behavioral profile of a customer. The models employed in Amaretto are trained with these high-level vectors generated from historical data. After the training phase, Amaretto computes an anomaly score for each new vector using both unsupervised and supervised models (if enough data is available to train the latter). A specific selection strategy based on the anomaly score is then used to choose the vectors that will be sent to the domain expert for review. The number of transactions sent each day for review (K) is a parameter of our system, based on the resources that a financial institution can allocate to this task. The domain expert analyzes these transactions to ascertain whether they are anomalous or not. This information is then saved as labels in the dataset. The reviewed labels contribute to a historical set of labeled data that is the input for the supervised component of the system. The supervised component is then used alongside the unsupervised model to continuously select the data to be reviewed by the domain expert. In Algorithm 1, we outline the pseudocode for the key steps of Amaretto.

B. DATA PREPROCESSING MODULE
Amaretto generates a set of high-level features derived from the transactional data. These aggregated features are based on an aggregation window: the window represents the time over which transactions are aggregated and is used for computing each set of aggregated features. The aggregation windows have the objective of capturing the short-term, mid-term, and long-term behavior of the user. We adopt the window sizes most used in the literature [17], [37]: 1 hour and 1 day for the short term, 7 days for the mid term, and 1 month for the long term. For example, as shown in Figure 2, if a daily window is chosen, the set of aggregated features is produced for each day by aggregating all the transactions the customer performed on that day. Aggregating transactions over a period of time is also helpful in the AML use case since it can capture correlations over time across multiple transactions.

Algorithm 1 Key Steps of Amaretto
1: for each day t do
2:   if t = 0 then
3:     Train mod_unsup on U_{t-1}
4:     Query K samples from U_t using the sampling strategy strat_unsup
5:     sample_unsup_t = Collect analyst feedback
6:     Add the selected points to L_{t-1}
7:   end if
8:   if t > 0 then
9:     Train mod_unsup on U_{t-1}
10:    Train mod_sup on L_{t-1}
11:    Query K/2 samples from U_t using the sampling strategy strat_unsup
12:    sample_unsup_t = Collect analyst feedback
13:    Select K/2 samples from U_t using the sampling strategy strat_sup
14:    sample_sup_t = Collect analyst feedback
15:    Add the selected points to L_{t-1}
16:  end if
17: end for

Within each period, multiple features are extracted and aggregated: total amount traded; average amount traded; the number of transactions; the number of transactions traded for each product class; the number of transactions traded for each currency; the total amount traded for each product type for each product class; the total amount traded for each product type for each currency; the average amount traded for each product type for each product class; the average amount traded for each product type for each currency; and the number of transactions traded during specific times of the day. When aggregating transactions over a timeframe for a customer, the set of aggregated transactions is considered anomalous if at least one of the underlying transactions is anomalous. The result of the aggregation and feature extraction process will be referred to as high-level vectors.

First of all, the EntryDate column is used to extract temporal features such as Weekday, Month, and Hour. The DataFrame containing the transactional data is grouped by the Originator and the temporal features mentioned above, and the financial features are then extracted from the resulting DataFrameGroupBy object. A data frame is created for each window in which the user performed at least one transaction, collecting all the customer's activities; these records are uniquely indexed through the features used to create the DataFrameGroupBy. The financial features extracted in this phase are designed to model the behavioral signature of the user in each window, capturing their spending patterns. The features of interest combine the Currency, Product Class, Product Type, and InputOutput columns. These high-level features were carefully selected, exploiting the domain expertise of the Napier AI team, to capture the behavioral variations that might indicate anomalous activity. The first step is to transform Product Class and Product Type into new features called Cash and Collateral, which indicate whether a transaction is performed through a simple cash transfer or through other types of securities. Then, a first aggregation, called Amount_IO_Aggregation, is carried out, extracting information about the amount of the transactions included in the aggregation window as statistical features such as mean_amount, sum_amount, code_small_amount, and code_round_amount. During this step, InputOutput_delta and Collateral_delta are also computed, indicating the difference between bought and sold operations and the difference between collateral and the other securities, respectively. Afterward, another DataFrameGroupBy object, called Product_Currency_Aggregation, is created by aggregating by the Currency, Product Class, Product Type, and InputOutput columns and computing the mean_amount, sum_amount, and count for each distinct value of the pivot columns. Finally, the two aggregations, Amount_IO_Aggregation and Product_Currency_Aggregation, are merged and indexed using the Originator and the temporal columns. The result is the final high-level vector used for training and analysis by the Anomaly Detection Module. An example of a high-level vector is shown in Table 2.
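To make the aggregation step concrete, the sketch below shows how daily high-level features could be computed with pandas. It is a minimal illustration, not Amaretto's actual preprocessing code: the column names (EntryDate, Originator, Amount, ProductClass) mirror the fields described above, and only a small subset of the features (total amount, average amount, transaction count, and per-product-class counts) is derived.

```python
import pandas as pd

def build_daily_vectors(tx: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw transactions into daily high-level vectors per customer."""
    tx = tx.copy()
    tx["EntryDate"] = pd.to_datetime(tx["EntryDate"])
    tx["Day"] = tx["EntryDate"].dt.date

    # Basic amount statistics per (customer, day) aggregation window.
    grouped = tx.groupby(["Originator", "Day"])
    vectors = grouped["Amount"].agg(
        sum_amount="sum", mean_amount="mean", n_transactions="count"
    )

    # Number of transactions per product class, pivoted into one column per class.
    per_class = (
        tx.groupby(["Originator", "Day", "ProductClass"]).size()
        .unstack(fill_value=0)
        .add_prefix("n_tx_class_")
    )
    return vectors.join(per_class, how="left").fillna(0)
```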

C. UNSUPERVISED MODULE
As pointed out in Section III, an unsupervised method is essential to detect new anomalous patterns never seen before. We decided to use Isolation Forest due to its high performance in detecting outliers even when they are present in small amounts [20], [21]. Another feature of Isolation Forest is its ability to deal with random noise, which is particularly useful in scenarios where subject matter experts may provide an incorrect label for a set of transactions.
Isolation Forest: The Isolation Forest algorithm is based on the isolation principle: it tries to separate data points from one another by recursively and randomly splitting the dataset into two partitions along its feature axes. The idea is simple: if a point is an outlier, it will not be surrounded by many other points, and therefore it will be easier to isolate it from the rest of the dataset with random partitioning. The algorithm uses the training set to build a series of isolation trees, which, when combined, form the Isolation Forest; each isolation tree is built upon a randomly sampled subset of the original data. The splitting is performed along a random feature axis, using a random split value that lies between the minimum and maximum values of that feature amongst the data points in that partition. This split process is performed recursively until a single point has been isolated from the others. The number of splits required to isolate an outlier is likely to be much smaller than the one needed for a regular point due to the lower density of points in the surrounding feature space. Isolation Forest leverages an ensemble of isolation trees, with anomalies lying, on average, closer to the root of each tree. The anomaly score can be derived from the path length h(x) of a point x, defined as the number of splits required to isolate the point in a tree. The anomaly score s of an instance x is defined as s(x, n) = 2^(-E(h(x)) / c(n)), with c(n) = 2H(n-1) - 2(n-1)/n, where E(h(x)) is the average of h(x) over the collection of isolation trees, n is the number of points used to build each tree, H(i) is the harmonic number, and c(n) is the average path length of unsuccessful searches in binary search trees, used to normalize h(x).
The system extracts high-level vectors related to historical transactions for each customer and uses an Isolation Forest to generate an anomaly score per high-level vector. We built a model for each customer to capture variations in individual behaviors and, at the same time, use the score to compare different behaviors. Subsequently, the score generated by the Isolation Forests is used as part of the selection strategy of the system to select the transactions for the subject matter expert to investigate.
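A minimal sketch of this per-customer unsupervised scoring is shown below, using scikit-learn's IsolationForest. The hyper-parameters and variable names are illustrative assumptions, not the values tuned for Amaretto.

```python
from sklearn.ensemble import IsolationForest

def score_customers(history: dict, today: dict) -> dict:
    """history/today map each customer id to a 2D array of high-level vectors."""
    scores = {}
    for customer, past_vectors in history.items():
        # One model per customer, trained on that customer's historical vectors.
        model = IsolationForest(n_estimators=100, random_state=0)
        model.fit(past_vectors)
        # score_samples is higher for normal points, so negate it to obtain
        # an anomaly score where higher means more anomalous.
        scores[customer] = -model.score_samples(today[customer])
    return scores
```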

D. SUPERVISED MODULE
A supervised model, used alongside an unsupervised model, improves the ability of the system to make future predictions. Supervised models usually yield more accurate predictions than unsupervised ones. Therefore the combination of these approaches leads to more robust results. In Amaretto, we adopt Random Forest [44] because it exhibited the best performance when compared to other supervised algorithms (as highlighted in Section III).
Random Forest: The basic component of a Random Forest is a Decision Tree [44], a structure that allows the categorization of data points into different classes. Starting from the root node, each data point traverses different branches of the tree, depending on the conditions set out for each node, until a leaf node is reached. The node rules are simple conditions verified on a given feature of the data point (e.g., is feature a_1 >= K? Or, for categorical features, is feature a_2 equal to C_0?). In the leaf node, the class of the data point is determined by looking at the most common (majority) class present in that leaf. A major advantage of this technique is the possibility of explaining the outcome of the algorithm by following the route of the data point through the tree to determine which conditions were met or not met in order to classify the point. One of the key challenges in using decision trees is overfitting. To deal with this problem, an ensemble of multiple decision trees can be used to form a Random Forest [45], which consists of multiple weak learners characterized by low bias and high variance. The bagging ensemble of these weak learners is a robust model since the overall prediction is made by averaging the prediction of each individual tree. Initially, when the system is bootstrapped, no labeled data is available; in this situation, only the unsupervised model is used for anomaly detection. When enough feedback from the subject matter expert has been collected, it is possible to train the Random Forest model. We train a single Random Forest model using high-level vectors across all customers. The supervised model outputs an anomaly score by computing the probability that a high-level vector is anomalous. As new labels are obtained from the subject matter expert, the Random Forest is re-trained accordingly, and predictions are run against the remaining unlabeled vectors.
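The supervised module can be sketched as follows: the Random Forest is retrained on the labels collected so far, and the predicted probability of the anomalous class is used as the anomaly score. The hyper-parameters below are placeholders, not the values used in Amaretto.

```python
from sklearn.ensemble import RandomForestClassifier

def supervised_scores(labeled_X, labeled_y, unlabeled_X):
    # labeled_y uses 1 for anomalous vectors and 0 for legitimate ones.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
    clf.fit(labeled_X, labeled_y)
    proba = clf.predict_proba(unlabeled_X)       # shape: (n_samples, n_classes)
    anomalous_col = list(clf.classes_).index(1)  # column of the anomalous class
    return proba[:, anomalous_col]               # probability of being anomalous
```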

E. SELECTION STRATEGIES MODULE
Amaretto combines supervised and unsupervised learning in three stages, each one with a different selection strategy.

1) FIRST STAGE: NEW ANOMALOUS PATTERNS DETECTION
The purpose of the first stage is to detect new anomalous patterns as well as common anomalous patterns. As previously mentioned, the anomaly score computed by the Isolation Forest is fundamental to detecting new anomalous behaviors. Two possible active learning selection strategies are available for this stage: the SELECT-TOP and SELECT-DIVERSE strategies. In the SELECT-TOP strategy, high-level vectors are ranked in decreasing order based on the anomaly score generated by the Isolation Forest. The system then selects the topmost anomalous vectors. However, this approach may not guarantee that all types of anomalies are covered (i.e., the top anomalies by anomaly score may all belong to the same anomaly type). In Algorithm 2 we present the pseudocode of the SELECT-TOP strategy.
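As an illustration of SELECT-TOP, the sketch below ranks the day's high-level vectors by the unsupervised anomaly score and returns the indices of the K most anomalous ones; it is a simplified view of the strategy, not the exact implementation.

```python
import numpy as np

def select_top(anomaly_scores: np.ndarray, k: int) -> np.ndarray:
    # Sort by anomaly score in descending order and keep the K most anomalous vectors.
    return np.argsort(anomaly_scores)[::-1][:k]
```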
As previously evidenced in Section III by [38], it is important to diversify the type of unusual patterns that are selected. For this reason, the SELECT-DIVERSE strategy uses clustering to group similar high-level vectors and draw samples from each cluster based on the anomaly score. Samples are drawn from each cluster, starting from the least dense cluster until the desired number of samples has been reached. The decision of starting from the least dense cluster is motivated by the following assumption: given that the number of non-anomalous high-level vectors is greater than the number of anomalous vectors, the latter should form less dense clusters. In Algorithm 3 we present the pseudocode of the SELECT-DIVERSE strategy.
The clustering algorithm used for this strategy is HDBSCAN [46], [47]. The first step of the algorithm is to build a weighted graph where each data point is a vertex and the weight of the edge between two points a and b is their mutual reachability distance, d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)}, where core_k(x) is the core distance of a point x, i.e., the distance between that point and its k-th nearest neighbor, and d(a, b) is the distance between the two points. The mutual reachability distance captures the density of the area around each point and has the effect of spreading apart isolated points. A minimum spanning tree is constructed from the resulting graph using Prim's algorithm, which aims to connect every point in the graph whilst minimizing the total weight of the edges of the resulting tree. The next part of the algorithm builds a hierarchy of clusters: the edges of the spanning tree are sorted by weight and removed one by one in decreasing order, so that each removal recursively splits a connected component into smaller clusters. The first step in cluster extraction is condensing this large and complicated cluster hierarchy into a smaller tree, governed by the ''minimum cluster size'' parameter: when a split produces a component with fewer points than the minimum cluster size, those points are considered to have ''fallen out'' of the parent cluster, which persists as a single cluster, rather than forming a new cluster of their own. To measure the persistence of clusters, a measure different from distance is defined: λ = 1/distance. For a given cluster, the values λ_birth and λ_death represent the value at which the cluster split off and became its own cluster and the value (if any) at which it split into smaller clusters, respectively. In turn, for each point p in that cluster, we can define λ_p as the value at which the point ''fell out of the cluster'', which lies somewhere between λ_birth and λ_death: the point either falls out of the cluster at some point during the cluster's lifetime or leaves the cluster when the cluster splits into two smaller clusters. For each cluster we then compute the stability as Σ_{p∈cluster} (λ_p − λ_birth). If the sum of the stabilities of the child clusters is greater than the stability of the cluster, we set the cluster stability to be the sum of the child stabilities. If, on the other hand, the stability of the cluster is greater than the sum of its children, we declare the cluster to be selected and unselect all its descendants. Once the root node is reached, the current set of selected clusters is returned as the flat clustering.
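The sketch below illustrates the SELECT-DIVERSE idea using the hdbscan package: cluster the day's high-level vectors, then draw the most anomalous samples from each cluster, visiting the smallest clusters first. Approximating ''least dense'' by cluster size and using a fixed per-cluster quota are simplifying assumptions of this sketch, not Amaretto's exact logic.

```python
import numpy as np
import hdbscan

def select_diverse(vectors: np.ndarray, anomaly_scores: np.ndarray, k: int) -> list:
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(vectors)
    # Group indices by cluster label (label -1 marks HDBSCAN noise points).
    clusters = [np.where(labels == c)[0] for c in np.unique(labels)]
    clusters.sort(key=len)              # smallest clusters first
    quota = max(1, k // len(clusters))  # samples to draw from each cluster
    selected = []
    for members in clusters:
        if len(selected) >= k:
            break
        ranked = members[np.argsort(anomaly_scores[members])[::-1]]
        selected.extend(ranked[:quota].tolist())
    return selected[:k]
```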

2) SECOND STAGE: RANDOM FOREST SELECTION
The second stage of the selection phase relies on the probability score generated by the Random Forest classifier. This process comprises two steps: the first one is the selection of the most anomalous vectors; the second one is the selection of the least anomalous aggregated transactions. The purpose of this stage is to take advantage of the higher accuracy of the Random Forest classifier to reinforce the information contained in the labeled dataset by automatically scoring all transactions in the dataset.

3) THIRD STAGE: UNCERTAIN DATAPOINTS SELECTION
In the third and last stage of the selection phase, the system selects the high-level vectors for which the two models show the most uncertainty. To assess the level of uncertainty, two active learning strategies can be followed: the SELECT-ENTROPY and SELECT-CONFLICT strategies. SELECT-ENTROPY uses the probability scores generated by the supervised model: samples whose probability is close to 0.5 have a high chance of being selected due to their high entropy and, hence, uncertainty. In Algorithm 4 we present the pseudocode of the SELECT-ENTROPY strategy.

Algorithm 4 Third Stage: SELECT-ENTROPY Strategy
Input: U_t, ratio_sup, p_center
Output: C_center
1: dist_i = |s_i - 0.5|  (distance of the score of each aggregation i from the center 0.5)
2: Sort dist_i in ascending order
3: C_center = Select the ratio_sup × p_center least distant aggregations
4: Return C_center
The SELECT-CONFLICT strategy takes into account the difference between the scores generated by the supervised and unsupervised model. A discrepancy in the score for each set of high-level vectors indicates that the outputs of the two models disagree. For this reason, the samples with a score discrepancy close to 1.0 are selected (i.e., the models disagree on whether the vectors are anomalous or not). In Algorithm 5 we present the pseudocode of the SELECT-CONFLICT strategy.

Algorithm 5 Third Stage: SELECT-CONFLICT Strategy
Input: U_t, ratio_sup, p_center
Output: C_center
1: unc_i = Compute the difference between the supervised score and the unsupervised score for each aggregation i
2: Sort unc_i in descending order
3: C_center = Select the ratio_sup × p_center most uncertain aggregations
4: Return C_center

Standardization and Ensembling: Amaretto exploits the power of Random Forest, which outputs class probabilities in [0, 1], and Isolation Forest, which outputs an anomaly score in [-1, +1). Even when the models yield outputs in the same range (e.g., probabilities in [0, 1]), their prediction distributions can be significantly different, so simply summing the predictions could be misleading. For this reason, we combine the supervised and unsupervised anomaly scores using an ensemble technique based on the Weibull distribution (see Figure 3). We selected the Weibull probability distribution because of its shape and flexibility, which fits the anomaly score distribution and allows us to better discern between normal and anomalous instances (i.e., it amplifies the distance between these two classes). In this way, we transform the anomaly scores produced by each model into probabilities in the interval [0, 1]. The ensembling is carried out as follows: we fit a Weibull probability distribution function to the anomaly scores produced by each model (see Figure 3a); we then compute the corresponding cumulative distribution function through integration (see Figure 3b); finally, for each new prediction s_x, we redefine the anomaly score as F(s_x) = P(X <= s_x), obtained by plugging the old anomaly score into the Weibull cumulative distribution function (see Figure 3c).
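This standardization step can be sketched with SciPy as follows: a Weibull distribution is fitted to each model's historical scores, and new scores are mapped through the fitted cumulative distribution function so that both modules output comparable probabilities in [0, 1]. Shifting the scores to be non-negative before fitting, and averaging the calibrated probabilities at the end, are assumptions of this sketch rather than Amaretto's exact procedure.

```python
import numpy as np
from scipy.stats import weibull_min

def weibull_calibrator(historical_scores: np.ndarray):
    # Shift scores so they are non-negative (Isolation Forest scores can be negative).
    shift = historical_scores.min()
    c, loc, scale = weibull_min.fit(historical_scores - shift, floc=0)

    def to_probability(new_scores: np.ndarray) -> np.ndarray:
        # Plug the (shifted) scores into the fitted Weibull CDF: F(s) = P(X <= s).
        return weibull_min.cdf(new_scores - shift, c, loc=loc, scale=scale)

    return to_probability

# Example usage: calibrate each module on its past scores, then combine.
# calib_unsup = weibull_calibrator(past_unsup_scores)
# calib_sup = weibull_calibrator(past_sup_scores)
# combined = 0.5 * (calib_unsup(unsup_scores) + calib_sup(sup_scores))
```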

VI. EXPERIMENTAL EVALUATION
In this section, we describe the experiments conducted to assess the performance and effectiveness of Amaretto. First, we compare the Isolation Forest used in Amaretto with state-of-the-art unsupervised solutions to confirm that our choice is the best for money laundering detection (Section VI-B). Then, we test the unsupervised techniques to assess their prediction ability with different daily budgets (Section VI-C). Afterward, we evaluate Random Forest against other supervised solutions to prove that it performs best in our domain (Section VI-D). Furthermore, we prove the importance of an unsupervised model in combination with a supervised one in detecting new anomalous patterns (Section VI-E). Then, we compare the different selection strategies of Amaretto (Section VI-F). Finally, we compare Amaretto with AI^2, a state-of-the-art active learning framework, in an AML scenario (Section VI-G).

A. EVALUATION APPROACH AND METRICS
The data contained in our dataset can be considered as time-series data. We split the dataset into two sets. The first one, which is used for training the models and for hyper-parameter optimization, contains the first 7 weeks of data, corresponding to 17,327,387 transactions. The second one, which is used to evaluate the model performance by running tests, includes the subsequent 5 weeks of data, corresponding to 12,376,703 transactions. It is important to highlight that we perform the entire experimental evaluation on the same dataset to provide a meaningful comparison between Amaretto and the other approaches under analysis.
Given the temporal nature of the data, for the experiments in which Amaretto was evaluated in a realistic setting (Sections VI-C and VI-G), we used a walk-forward testing approach [48], as shown in Figure 4. This allows us to fully test the system on a daily working routine, like the real-world scenario in which a subject matter expert has to investigate a set of anomalous cases each day. We split the testing data on a daily basis: each ''simulated'' day, K samples from the unsupervised ranking are provided to the analyst for investigation (i.e., labeling). Based on the assigned labels, Amaretto trains both the supervised and unsupervised learning models and uses them for ranking the samples of the subsequent day. In Section VI-C we provide an analysis of the daily budget K that allows the system to achieve a suitable detection rate whilst minimizing the effort of the subject matter expert in reviewing the samples.
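A minimal sketch of this walk-forward evaluation loop is shown below: each simulated day, the current models rank the day's vectors, the top K are ''reviewed'' (their true labels revealed), the labeled pool grows, and the models are retrained before the next day. The function names rank_fn and retrain_fn are placeholders for Amaretto's scoring and training steps.

```python
def walk_forward(days, k, rank_fn, retrain_fn, labeled_pool):
    """days is an iterable of (vectors, true_labels) pairs, one per simulated day."""
    for vectors, true_labels in days:
        scores = rank_fn(vectors)              # anomaly scores from the current models
        reviewed = scores.argsort()[::-1][:k]  # indices of the K vectors sent to the analyst
        for i in reviewed:
            labeled_pool.append((vectors[i], true_labels[i]))  # analyst feedback
        retrain_fn(labeled_pool)               # update the models for the next day
```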
For the hyper-parameter optimization, we used Bayesian Optimization [49] due to its ability to achieve accurate parameter selection within a reasonable amount of time. Bayesian Optimization is a probabilistic model-based approach for finding an input value or a set of values to an objective function that yields the lowest loss.

1) METRICS
To evaluate Amaretto, we adapt common evaluation metrics to our context. A True Positive (TP) is an anomalous high-level vector correctly classified as anomalous, a False Positive (FP) is a legitimate high-level vector wrongly ranked as anomalous, a False Negative (FN) is an anomalous vector wrongly ranked as legitimate, and a True Negative (TN) is a legitimate vector correctly ranked as non-anomalous. On the basis of these definitions, we assess the system performance by computing the following metrics.
Accuracy: percentage of high-level vectors correctly classified, (TP + TN) / (TP + TN + FP + FN).
Precision: fraction of vectors ranked as anomalous that are truly anomalous, TP / (TP + FP).
Recall (True Positive Rate, TPR): fraction of anomalous vectors correctly detected, TP / (TP + FN); analogously, the False Positive Rate (FPR) is FP / (FP + TN) and the False Negative Rate (FNR) is FN / (FN + TP).
F1-score: harmonic mean of Precision and Recall, 2 · (Precision · Recall) / (Precision + Recall).
Matthews Correlation Coefficient (MCC): quality of the detection expressed as the correlation coefficient between the observed and predicted classifications, MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)); a coefficient of +1 represents a perfect ranking, 0 a prediction no better than random, and −1 total disagreement between prediction and observation.
Area Under the Receiver Operating Characteristic (ROC) Curve (AUROC): the area under the curve obtained by plotting the TPR against the corresponding FPR at various threshold settings. The AUROC gives a measure of the solution performance, where a perfect model has an AUROC of 1.
The test data is very imbalanced (0.27% of anomalous transactions), so metrics like accuracy are not very meaningful. However, to make a fair comparison with state-of-the-art solutions, all metrics are included as a reference.
The AUROC is a useful indicator for benchmarking algorithms; if the ROC curve of a model is consistently higher than the curve of other estimators, this indicates the former achieves better performance. For these reasons, we use the AUROC and the ROC curve to assess the performance of various unsupervised models and the metrics described above for assessing the performance of the supervised models.
We also considered additional metrics to account for class imbalance and different classification costs: the Precision-Recall Curve and the Cost Metric. The Precision-Recall Curve shows the tradeoff between Precision and Recall at different thresholds. A high area under the curve represents both high Recall and high Precision, where high Recall relates to a low false negative rate and high Precision relates to a low false positive rate; high scores for both indicate that the classifier returns accurate results (high Precision) as well as the majority of all positive instances (high Recall). The Cost Metric, described in [50], accounts for the different costs of false positives and false negatives for an institution: a unit cost is applied to each FP, whilst a higher cost is applied to each FN, since the cost of allowing a money launderer into the system is hundreds of times higher than the cost of a false positive and may result in fines for the institution. A normalization step can be applied to obtain a value that is independent of the number of transactions. As suggested in [50], we set C_R, the cost ratio between FN and FP, to 100, and used this value to assess the optimal operating condition of our system; it could, however, be set to reflect the real costs of anomalous transactions in different scenarios.
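As a worked illustration of the cost metric, the sketch below assumes the simple form cost = C_R * FN + FP (false negatives weighted C_R times more than false positives), normalized by the number of transactions; this is our reading of the description above, and the exact formulation is the one given in [50].

```python
def normalized_cost(fn: int, fp: int, n_transactions: int, c_r: float = 100.0) -> float:
    # False negatives are weighted c_r times more than false positives.
    raw_cost = c_r * fn + fp
    # Normalize so the value does not depend on the transaction volume.
    return raw_cost / n_transactions
```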

B. EXPERIMENT 1: UNSUPERVISED ALGORITHM COMPARISON
In this experiment, we compare state-of-the-art unsupervised solutions with the Isolation Forest used in Amaretto. In particular, we take into consideration Autoencoders [25], Variational Autoencoders [51], and Extended Isolation Forests [52]. In addition, we also test the unsupervised techniques described in [37], which exploit a matrix decomposition-based model, a density-based model, and an Autoencoder; we use PCA as the matrix decomposition model [53] and a Copula distribution as the density-based model. We also test a threshold-based model [15] that uses the mean and standard deviation computed for each feature of the high-level vector: given these descriptive statistics, we compute a one-sided threshold as the sum of the mean and standard deviation. When scoring new samples, every feature that exceeds its respective threshold adds the surplus to the risk score, while features below the threshold contribute a risk score of 0. For this experiment, we train and tune all the models on the training dataset composed of 7 weeks of data (17,327,387 transactions) and evaluate the performance on the subsequent five weeks of data (12,376,703 transactions). As shown in Figure 5, the Isolation Forest exhibits the best performance with an AUROC of 0.9. Surprisingly, the threshold-based model and the matrix decomposition-based model outperformed the Autoencoder, which is considered one of the best models for outlier detection.
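For reference, the threshold-based baseline described above can be sketched as follows: per-feature thresholds are set to mean plus standard deviation on the training vectors, and at scoring time every feature exceeding its threshold adds the surplus to the risk score. This is a minimal sketch of the baseline, not the exact implementation evaluated in the experiment.

```python
import numpy as np

def threshold_model(train_vectors: np.ndarray):
    # One-sided threshold per feature: mean + standard deviation on the training data.
    thresholds = train_vectors.mean(axis=0) + train_vectors.std(axis=0)

    def risk_score(vectors: np.ndarray) -> np.ndarray:
        # Features above their threshold contribute the surplus; others contribute 0.
        surplus = np.clip(vectors - thresholds, 0, None)
        return surplus.sum(axis=1)

    return risk_score
```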

C. EXPERIMENT 2: DAILY BUDGET K ESTIMATION FOR UNSUPERVISED RANKING
For this experiment, we benchmark the performance of the unsupervised models analyzed in the previous experiment while varying the number of samples reviewed each day by the analyst. For every day in the test set, each model computes the anomaly score for the high-level vectors, which is then used to rank them. The K top-ranked vectors are considered anomalous; e.g., for K = 10, the 10 vectors with the highest score are selected for review. The purpose of this experiment is to assess the best daily budget that allows the system to achieve a suitable detection rate whilst minimizing the effort of the subject matter expert in reviewing the high-level vectors. The metrics presented in Table 3 are the average metrics computed for each technique and budget. As shown in Table 3, the Isolation Forest is the model that achieves the best results for every budget K, achieving an average Precision of 0.904 and an average FPR of less than 0.01 (for K = 10). This means that the Isolation Forest allows the subject matter expert to focus only on the most anomalous vectors. The matrix decomposition-based model achieves comparable performance only with a higher budget K, whilst for lower values the Isolation Forest is better. The daily budget values considered in this experiment represent a small percentage of the daily vectors that are generated. For this reason, the FNR is high for a small daily budget, and it decreases as the budget increases. It is important to highlight that the financial company with which we collaborated considers around 1%-2% of the data received daily a reasonable volume of transactions that its specialized analysts can manually inspect. In our dataset, this percentage corresponds to K = 5 (approximately 6,000 transactions) and K = 10 (approximately 12,000 transactions).

TABLE 3. Experiment 2 - Daily budget K estimation: performance metrics of the unsupervised models varying the number of samples reviewed by the analyst each day (K). Isolation Forest achieves the best performance for every budget K, with an average Precision of 0.904 and an average FPR of less than 0.01.

D. EXPERIMENT 3: SUPERVISED ALGORITHMS COMPARISON
With this test, we compare the Random Forest model of Amaretto with state-of-the-art supervised models based on Gradient and Category Boosting techniques. Gradient Boosting [54] is considered one of the best algorithms for classification tasks. Category Boosting [55] is an alternative boosting algorithm based on decision trees. It offers computational and efficiency improvements compared to Gradient Boosting-based models. For this experiment, we train and tune all the models on the training dataset composed of 7 weeks of data (17,327,387 transactions) and evaluate the performance on the subsequent five weeks of data (12,376,703 transactions), running predictions on a daily basis. Table 4 presents the average metrics for each technique. In line with the results obtained by state-of-the-art works [30] and as shown in Table 4, the metrics are quite similar between Random Forest, Category Boosting (CatBoost), and Gradient Boosting (LGBM) models, whilst other supervised methods do not perform as well. The CatBoost model exhibits the highest Precision, although Random Forest achieves the highest TPR and the lowest FNR. Furthermore, if we take into account cost-related metrics, we can conclude that Random Forest is better suited in this context compared to the other models.

E. EXPERIMENT 4: DETECTING NOVEL ANOMALOUS PATTERNS
The goal of this experiment is to assess the performance of the supervised and unsupervised techniques employed in Amaretto when detecting new anomalous patterns. We execute several runs of this experiment in order to test each combination of the classes of anomalies existing in the dataset. For every run, a class of anomalies is withheld from the training set and only introduced in the test set for evaluation purposes. During the run, the models are trained using the high-level vectors obtained from the training set containing the remaining classes of anomalies, excluding the withheld one. After several iterations of the system (precisely, after the 15th day of the experiment), the withheld pattern is introduced in the test data to assess the behavior of the two models. The results of each run are then averaged on a daily basis and shown in Figure 6a and Figure 6b. As expected, the performance of the Isolation Forest model is consistent, while the Random Forest exhibits a decay in performance when new anomalies are introduced: its TPR halved, while its False Negative Rate (FNR) tripled. This shows that the Random Forest performs poorly at detecting the new anomalous patterns introduced in the dataset. On the other hand, the performance of the Isolation Forest is not negatively affected by the new anomalies, showing its capability of detecting the newly introduced anomalous pattern.
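To make the leave-one-class-out protocol concrete, the sketch below withholds one anomaly class from training and measures how often each model flags it at test time (the label encoding and names are assumptions for illustration, not Amaretto's actual code):

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def withheld_pattern_run(X_train, y_train, X_test, y_test, withheld_class):
    """Drop one anomaly class from training, train both models, and measure
    how often each flags the withheld pattern in the test data.
    Assumed labels: 0 = normal, positive integers = anomaly class id."""
    keep = y_train != withheld_class
    X_tr, y_tr = X_train[keep], (y_train[keep] > 0).astype(int)

    iso = IsolationForest(random_state=0).fit(X_tr)               # unsupervised
    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)   # supervised

    withheld = X_test[y_test == withheld_class]
    iso_tpr = (iso.predict(withheld) == -1).mean()   # -1 marks outliers
    rf_tpr = rf.predict(withheld).mean()             # 1 marks anomalies
    return iso_tpr, rf_tpr
```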

F. EXPERIMENT 5: AMARETTO CONFIGURATIONS
The purpose of this experiment is to test which of the selection strategies is the most suitable for our dataset. The experiment works as follows: on the first day, the labeled dataset is empty; therefore, the supervised model is not used. After the first day, the labeled dataset contains the samples selected by the Isolation Forest and reviewed by a subject matter expert. From this point onwards, the entire selection strategy can be employed (first, second, and third stage). The mapping between the approaches adopted in the first and third stages, as well as the names of the configurations, is outlined in Table 5. Figure 7a and Figure 7b show the average Precision and the average AUROC of the score generated by the Random Forest. Overall, the performance of the four configurations is similar. We decided to focus on the configuration that employs SELECT-DIVERSE and SELECT-ENTROPY (Amaretto_3, the red bar in Figure 7b) because it provides the best trade-off between performance and cost, especially for lower K: it achieves the best overall average Precision and AUROC with a comparable detection performance but fewer transactions to investigate, reducing both the cost of manually reviewing multiple transactions and the risk for a financial institution of not detecting illicit activities. In fact, with these two strategies selected, even with a daily budget of K = 10, the average Precision is higher than the one obtained with K = 20 and comparable with the one obtained with K = 50.
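As an illustration of how an entropy-based selection stage can rank candidates, the sketch below picks the K samples about which the supervised model is least certain; the actual SELECT-ENTROPY and SELECT-DIVERSE implementations in Amaretto may differ in their details:

```python
import numpy as np

def select_entropy(proba: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the K samples whose predicted class distribution
    has the highest entropy, i.e., those the classifier is least sure about.
    `proba` is the (n_samples, n_classes) output of predict_proba."""
    eps = 1e-12
    entropy = -(proba * np.log(proba + eps)).sum(axis=1)
    return np.argsort(entropy)[::-1][:k]

# Hypothetical usage:
# proba = random_forest.predict_proba(todays_vectors)
# review_idx = select_entropy(proba, k=10)
```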

G. EXPERIMENT 6: COMPARISON WITH THE STATE OF THE ART
In this final experiment, we compare Amaretto with AI 2 [37], since it represents, to the best of our knowledge, the state-of-the-art active learning framework for anomaly detection. AI 2 comprises an ensemble of three unsupervised techniques: a density-based model using a Copula-based multivariate distribution, a matrix decomposition-based model based on PCA, and an Autoencoder. The combination of the anomaly scores computed by the three models is used to rank the most anomalous high-level vectors for review by a subject matter expert. The feedback collected is then used to train a Random Forest that additionally analyzes the high-level vectors.
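The sketch below illustrates one simple way the three unsupervised scores could be fused into a single ranking (rank-normalization followed by averaging); the exact fusion rule used by AI 2 may differ:

```python
import numpy as np

def combine_scores(score_lists, k=10):
    """Rank-normalize each model's anomaly scores to [0, 1], average them,
    and return the indices of the K most anomalous high-level vectors."""
    normalized = []
    for scores in score_lists:
        ranks = np.argsort(np.argsort(scores))        # 0 = least anomalous
        normalized.append(ranks / (len(scores) - 1))
    combined = np.mean(normalized, axis=0)
    return np.argsort(combined)[::-1][:k]

# Hypothetical usage with per-model score arrays:
# top = combine_scores([copula_scores, pca_scores, autoencoder_scores], k=10)
```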
The experiment is divided into three parts: I - Static Scenario, II - Real-world Scenario, and III - Real-world Scenario with Risk Profiles. In the first part, we compare the frameworks in a static scenario, i.e., we collect 10 samples per day over a period of 10 days from each framework and then use this labeled data to train the Random Forest and predict all the remaining high-level vectors. In the second part, we compare the frameworks in a real-world scenario, studying how effectively they support the daily routine of a subject matter expert and their detection performance. In the third part, we compare the frameworks' performance in a real-world scenario while taking into account different risk profiles for a financial institution.

1) STATIC SCENARIO
The purpose of the first part of this experiment is to verify the frameworks' performance with a minimum amount of training data. During this part, we also assess the active learning inner modules, i.e., the components of each framework in charge of computing the anomaly score and selecting the samples to be shown to the subject matter expert. For the first 10 days of the test set, only the inner module is employed with a minimum daily budget (K = 10), collecting 100 labeled samples. This labeled dataset is then used to train a Random Forest. Subsequently, the trained Random Forest model computes the probability score and the prediction for all the remaining high-level vectors. Figure 8a and Figure 8b show the comparison of our system against AI 2 using the probability score. As exhibited by the ROC curve plotted in Figure 8a, Amaretto achieves an AUROC of 0.93, whereas AI 2 obtains 0.89. Furthermore, as shown in the Precision/Recall curve in Figure 8b, Amaretto reaches an average precision of 0.61, which is 31% better than AI 2.
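A minimal sketch of the static-scenario evaluation, assuming the 100 labeled vectors and the remaining test vectors are available as arrays (names are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical: 100 vectors labeled during the first 10 days (K = 10 per day)
rf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)

scores = rf.predict_proba(X_remaining)[:, 1]           # probability score
print("AUROC:", roc_auc_score(y_remaining, scores))
print("Average precision:", average_precision_score(y_remaining, scores))
```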

2) REAL-WORLD SCENARIO
In the second part of the experiment, we assess how applying the frameworks can decrease the workload of the subject matter expert in a real-world scenario. Initially, only the unsupervised machine learning techniques can be employed, since no feedback has been collected yet. After the first day, the Random Forest works alongside the unsupervised models in both the active learning loop and the prediction phase. For this test, we consider the worst-case scenario with a minimum daily budget of K = 10. Figure 8c shows the average precision computed using the probability score. Amaretto doubles its Precision in approximately ten days, i.e., with a dataset of 100 high-level vectors, and keeps increasing its performance. During the tests, Amaretto reaches a maximum average precision of 0.78, while AI 2 reaches 0.57. As shown in Figure 8d, Amaretto also achieves better performance in terms of AUROC: as in the previous test, the maximum AUROC is 0.94, and the average AUROC is 0.847, improving on AI 2 by circa 14%.
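The sketch below shows one way such a daily loop could be wired, with the supervised model joining the ranking once feedback containing both classes is available; the score combination is a naive sum used only for illustration and is not Amaretto's actual rule, and all data names and the review oracle are hypothetical:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

iso = IsolationForest(random_state=0).fit(train_vectors)   # hypothetical training weeks
labeled_X, labeled_y = [], []                               # grows by K samples per day

for day_vectors in daily_batches:                           # hypothetical daily iterator
    score = -iso.score_samples(day_vectors)                 # unsupervised anomaly score

    # once feedback containing both classes exists, add the supervised signal
    y_all = np.hstack(labeled_y) if labeled_y else np.array([])
    if len(np.unique(y_all)) > 1:
        rf = RandomForestClassifier(random_state=0).fit(np.vstack(labeled_X), y_all)
        score = score + rf.predict_proba(day_vectors)[:, 1]

    review_idx = np.argsort(score)[::-1][:10]               # daily budget K = 10
    labels = analyst_review(day_vectors[review_idx])        # hypothetical oracle
    labeled_X.append(day_vectors[review_idx])
    labeled_y.append(labels)
```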

3) REAL-WORLD SCENARIO WITH RISK PROFILES
In the last part of the experiment, we test the frameworks considering different risk profiles that a financial institution can adopt. This is done by considering different threshold values for the probability score, corresponding to different use cases. For example, a lower threshold (corresponding to a lower anomaly probability score) could be used where a high financial risk is estimated. By doing this, more transactions will be considered as candidates for review, hence reducing the false negatives but increasing the false positives. As shown in Figure 8e, Figure 8f, Figure 8g, and Figure 8h, Amaretto outperforms AI 2 across all thresholds. In the low-risk use case, Amaretto achieves a TPR of 0.428, while in the high-risk use case it achieves a TPR of 0.596, an improvement of circa 50% w.r.t. AI 2. Only in the low-risk case does AI 2 achieve a better FPR than Amaretto; on the other hand, Amaretto achieves a higher TPR, a lower FNR, and a lower cost, balancing the overall performance. In addition, in every scenario, the cost of Amaretto is always lower than that of AI 2.
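A minimal sketch of how such risk profiles can be expressed as thresholds on the probability score (the threshold values below are hypothetical, not the ones used in the experiment):

```python
import numpy as np

# Hypothetical risk profiles: a lower threshold flags more transactions,
# trading fewer false negatives for more false positives.
RISK_PROFILES = {"low_risk": 0.8, "medium_risk": 0.5, "high_risk": 0.2}

def flag_for_review(anomaly_proba: np.ndarray, profile: str) -> np.ndarray:
    """Return a boolean mask of high-level vectors whose anomaly probability
    meets or exceeds the threshold of the chosen risk profile."""
    return anomaly_proba >= RISK_PROFILES[profile]

# Hypothetical usage:
# mask = flag_for_review(random_forest.predict_proba(X)[:, 1], "high_risk")
```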

VII. LIMITATIONS AND FUTURE WORKS
The main limitation of this research work is the lack of an intuitive explanation for the anomaly score returned by Amaretto, which is used to rank the high-level vectors. The two composing algorithms (i.e., Random Forest and Isolation Forest) are used in a black-box fashion: each tree behaves as a rule-based model whose prediction can be explained by following the path that led to the given classification. However, since we are considering a "forest," we average the output of several trees; therefore, the resulting probability does not provide helpful insight and cannot help the human analyst better understand the result. Future work may focus on integrating an explainability library into Amaretto and implementing the corresponding explainability processes. For example, SHAP [56] is one of the most interesting approaches for explaining the model output.
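As an example of the direction we have in mind, the sketch below applies SHAP's tree explainer to a trained Random Forest to obtain per-feature contributions for the flagged vectors (model and data names are placeholders; this is not part of the current Amaretto implementation):

```python
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical trained supervised model and flagged high-level vectors
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

explainer = shap.TreeExplainer(rf)               # tree-based explainer for forests
shap_values = explainer.shap_values(X_flagged)   # per-feature contributions

# Summary of which features push vectors towards the anomalous class
# (the exact output layout depends on the shap version)
shap.summary_plot(shap_values, X_flagged)
```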
The available dataset represents another limitation. Even though it was synthetically generated from a real-world dataset, it is very limited in timespan, containing only 60 days of transactions. A broader dataset would have allowed us to analyze the system's evolution over time in more depth; for example, seasonal models and models based on specific anomalous patterns could be developed.

FIGURE 8. Experiment 6 - Comparison of Amaretto (blue) versus AI 2 (orange) with K = 10 for the three experimental settings: Scenario I - Static, Scenario II - Real-world, and Scenario III - Real-world with risk profiles. Amaretto outperforms AI 2 in almost all the metrics and scenarios under analysis, at the cost of a slightly higher average FPR but a lower overall cost.
It could also be interesting to see how Amaretto works in more complex scenarios and to further analyze the anomalies, for example by investigating the relationships among them or which anomalies occur more frequently. We assumed that the subject matter expert always provides correct labels when reviewing the data; further research should be conducted to assess the impact of incorrect labeling on the system and to ensure that the models accommodate such errors. Another avenue for future research is automatically tuning Amaretto's parameters to improve performance and reduce the number of samples required to achieve satisfactory results. Furthermore, alternative unsupervised and supervised models should be tested.
In this work, we did not consider graph-based or deep learning models, since they usually have higher computational requirements than standard machine learning approaches; our aim was to build the simplest and lightest active learning system able to outperform state-of-the-art solutions. In addition, even if the available dataset seems to offer enough data for applying such categories of algorithms, the experimental evaluation performed in this work suggests the opposite: the deep learning models tested in this work (i.e., the Autoencoder and the Variational Autoencoder) do not perform well when evaluated in a real-world scenario in which an analyst can review only a limited number of samples. Future work may explore applying deep learning algorithms (e.g., models that overcome the shortcomings of random forests) to complement Amaretto on larger datasets that span a longer period. For example, LSTM-based neural networks could be employed to exploit the temporal correlations in money laundering patterns.
Finally, it would be interesting to evaluate Amaretto on detecting other kinds of financial crime, such as credit card fraud, to investigate its flexibility.

VIII. CONCLUSION
In this paper, we presented Amaretto, an active learning system for anomaly detection applied to transaction monitoring for money laundering detection in capital markets. Amaretto comprises an unsupervised model for detecting known and unknown anomalous patterns and four strategies to optimally sample the data for a subject matter expert to review. The reviewed data is fed into a supervised learning model to continuously improve the system's performance. Amaretto was able to process over 29 million transactions, extracting aggregated features and highlighting customer behavioral patterns over time to detect unusual correlations. We then presented the experimental results obtained on a synthetic dataset generated to resemble genuine as well as potential money laundering patterns. We compared unsupervised techniques commonly used in anomaly detection tasks with state-of-the-art solutions, showing that the Isolation Forest is the best-performing algorithm for the AML task under analysis. We also compared supervised techniques and determined that the Random Forest outperforms the others. Subsequently, we showed how important it is for the unsupervised component of the system to detect novel patterns, since the supervised models could not accurately detect them. Finally, we conducted experiments to confirm the robustness of our design in a real-world scenario. We identified the best selection strategies amongst the ones proposed and showed that Amaretto achieves state-of-the-art performance within a short time frame and with minimal manual input from a subject matter expert.