Sequence Embeddings Help Detect Insurance Fraud

Roughly 10 percent of the insurance industry’s incurred losses are estimated to stem from fraudulent claims. One solution is to use tabular data to construct models that can distinguish between claims that are legitimate and those that are fraudulent. However, while canonical tabular data models enable robust fraud detection, complex sequential data have been out of the insurance industry’s scope. For health insurance, we propose deep learning architectures that process insurance data consisting of sequential records of patient visits and characteristics. Both the sequential and tabular components improve the quality of the model, generating new insights into the detection of health insurance fraud. Empirical results derived using relevant data from a health insurance company show that our approach outperforms state-of-the-art models and can substantially improve the claims management process. We obtain a ROC AUC metric of 0.873, while the best competitor based on state-of-the-art models achieves 0.815. Moreover, we demonstrate that our architectures are more robust to data corruption. As more and more semi-structured event sequence data become available to insurers, our methods will be valuable for many similar applications, particularly when variables have a large number of categories, such as those from the International Classification of Diseases (ICD) codes or other classification codes.


I. INTRODUCTION
Fraud causes substantial costs and losses for the finance and insurance industries. Examples include fraudulent credit card transactions and insurance fraud. Indeed, experts estimate that each year roughly 10 percent of the insurance industry's incurred losses and loss adjustment expenses stem from fraudulent claims (cf. https://www.iii.org/article/background-on-insurance-fraud). Fraud detection is a critical function and core competence in these industries and their claims management processes.
The proliferation of digitization in finance and insurance has led to big datasets suited to fraud detection. In this paper, we propose architectures for categorical sequence embeddings via deep learning that help improve the classification of fraudulent and valid claims compared to other machine learning methods.
Analyzing fraud with statistical and machine learning methods poses particular challenges. First, claims data are often available in a so-called unstructured format that is challenging to process using classic machine learning approaches. Second, fraud data are highly unbalanced because the number of fraudulent cases is minimal compared to the number of non-fraudulent ones, and we can consider each fraud an anomaly. Third, claims do not have a fixed length because the number of items billed in a claim varies. These characteristics influence the choice of classification approach and performance measures.
It is well known that deep learning outperforms other machine learning methods for analyzing unstructured data, such as text or images. In this paper, we develop deep learning architectures tailored to claims data and to handling each of the challenges mentioned above. For our analysis, we use claims from outpatient doctor visits, which have a particular structure. These consist of unstructured categorical sequences of treatments and have properties of text data (for example, the varying number of billed items mentioned above). Moreover, medical claims usually encode treatments as categorical variables with thousands of categories. In this paper, we develop methods for analyzing such (semi)unstructured data. We test our methods on a dataset from a major health insurance company. Our empirical results show that they outperform other state-of-the-art methods in the prediction of fraudulent claims, making the claims management process more efficient. The summarized workflow we follow is presented in Figure 1.
Figure 1: Workflows for a prediction model for classic machine learning and deep learning. The development of machine learning models requires accurate feature engineering, whereas deep learning models can handle unstructured categorical sequences without sophisticated preprocessing and, thus, we can skip the Feature engineering step. In the paper we briefly consider each step of the data-based model construction, while focusing on approaches in the middle stage related to the model construction.
We start with some information on insurance fraud and fraud detection and an overview of the literature in section II. In section III, we present models and methods for analyzing general text data and analyzing the unique structure of claims data to detect insurance fraud. After this, we describe our data in section IV. Similarly, we describe the models available for these tasks in section V. Further, in section VI, we present the results of our analysis and experiments. Finally, we conclude in section VII. Additional results of our analysis can be found in the Appendix.

A. INNOVATION
Many approaches to anomaly/fraud detection require problem-specific solutions. At the same time, deep learning promises to provide more general models that extract insights as embeddings or representations: from unstructured, complex and high-dimensional data, deep learning generates a meaningful low-dimensional representation that we can easily use to indicate a particular system state. Numerous successful applications back this hope for complex scenarios with unstructured data, sequential data, and imbalanced class distribution. Nevertheless, one should still carefully examine the robustness of the various deep learning models by deploying them.
There are no deep learning models that consider fraud prediction for medical insurance based on both claims data and general patient features. We aim to address this gap in the literature.

II. BACKGROUND AND RELATED WORK
The literature review consists of four parts related to different aspects of the studied domain. We start with a description of anomaly detection as a major challenge in machine learning and its relation to fraud detection. After that, we provide an overview of the application of fraud detection algorithms in various fields, such as the healthcare and insurance sectors. Then, we discuss how the concept of embedding is leveraged in order to work with diverse types of sequential data. Finally, we review different ways to cope with the class imbalance problem that is typical for fraud detection.

1) Anomaly Detection
Detecting anomalies in data is one of the core problems in data analysis, [1]. Researchers from many disciplines investigate it for application domains, including time-series modeling, [2], [3], predictive maintenance of technical systems, [4], [5], and applications in the finance and insurance industries, [6].
Anomalies in data are worthy of attention because they can translate into important and often critical actionable information in a wide variety of application domains. For example, in credit card transactions, anomalies can indicate when unauthorized purchases have occurred due to credit card or identity theft, [7]. Money laundering, as a type of financial fraud, can be detected by a simultaneous analysis of trading networks and features of its entities, [8]. The outcome of seeking a low-rank approximation and analyzing the residuals of a graph-based similarity matrix and feature matrix leads not only to detection of fraud patterns but also to tracing of the suspicious features. For a comprehensive overview, we refer to the reviews by [1], [9], which are dedicated to various fraud detection problems and based on machine learning.
Similarly, anomalies in health insurance claims can flag up potential insurance fraud carried out to gain inappropriate compensation for services not rendered, [10].

2) Machine Learning for Healthcare and Insurance
While experts have been using machine learning methods to solve healthcare and insurance problems successfully for several years, the literature does not seem to have covered deep learning and embeddings for fraud detection until recently. [11] provide one of the few examples focusing on detecting automobile insurance fraud. They process text descriptions of the accidents, extract standard text features manually, and combine these with deep-learning models. Although the accuracy of their model is superior to that of existing approaches, neither the precise architecture of the best model nor the approach to training and validating it is outlined clearly in their paper. Another example is [12]. The authors use hierarchical clustering with deep neural networks to detect fraud in candidates' descriptions during job recruitment, significantly improving the prediction accuracy of conventional methods. [13] uses manually-crafted features to predict instances of automobile insurance fraud. [14] highlight the importance of constructing robust data-based models in healthcare. The authors generate adversarial examples for predictive models based on multivariate electronic health records (EHR), represented by temporal sequences of numerical data. To find efficient attacks on medical records, they propose an optimization-based attack strategy limiting the size of the perturbation of the initial input. An analysis of the best attacks helps to identify susceptible locations in each patient's medical records and subsequently prevent mistakes in the most critical measurements. [15] develop an unsupervised deep learning model for fraud detection by utilizing the information on insured people. The authors use the Autoencoder to obtain the aggregate reconstruction error (A-RE) for the underlying data and further indicate high A-RE instances as fraud. 
[16] present a comparison of metrics obtained with various machine learning and deep learning models in combination with data-imbalance techniques to detect health insurance fraud.

3) Machine Learning Embeddings
We can use embeddings to address anomaly detection problems in different application domains. For example, [17] develop an approach to embedding entities, representing events from computer systems, into a common latent space. Each event involves heterogeneous attributes such as time, user, source process, destination process, and more. [18] study the problem of detecting structurally inconsistent nodes in graphs, to identify, for example, outlier authors in a network in which different authors connect through co-authorship of papers.
At the moment, however, the community is especially interested in embeddings for applications related to natural language processing. Given the breadth of the field, we focus here on works with embeddings of simple entities, such as words. These include the classic TF-IDF approach, [19], the more recent and well-known word2vec, [20], and GloVe, [21]. The last two methods consider co-occurrences of words, whereas TF-IDF is solely a normalized one-hot encoding for a dictionary of words at hand.
For event sequences related to sequences of clients' visits, there are also numerical approaches such as that described in [22]. For a related use case, [23] construct unsupervised embeddings based on deep learning architecture for various types of event sequence data. Financial transaction data are another type of event sequence. Here, deep learning provides significant quality improvements, allowing better model and embeddings quality, for example, in the works of [24], [25]. However, [26] emphasizes the importance of assessing the robustness of this type of approach.
In addition to embeddings for single events, we need those for whole sequences. There are multiple approaches to concatenate embeddings of simple entities. For example, we can construct a text embedding from an embedding of each word within a given text. Simple heuristics include taking the maximum value among each dimension for word embeddings or taking mean values, [27]. More complex approaches use convolutional neural networks, recurrent neural networks and transformers, [28], [29], [27].
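The two simple heuristics mentioned above, taking the mean or the maximum over each embedding dimension, can be sketched in a few lines. This is a minimal illustration in plain Python; the function names and the toy three-dimensional embeddings are assumptions made for this example and are not taken from the cited works.

```python
def mean_pool(vectors):
    """Average each embedding dimension over all tokens in the sequence."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def max_pool(vectors):
    """Take the maximum of each embedding dimension over all tokens."""
    return [max(dim) for dim in zip(*vectors)]


# Hypothetical 3-dimensional embeddings of a three-token sequence.
token_embeddings = [
    [0.5, -1.0, 2.0],
    [1.5, 0.0, -2.0],
    [1.0, 4.0, 3.0],
]

sequence_mean = mean_pool(token_embeddings)
sequence_max = max_pool(token_embeddings)
```

Both functions return a vector whose length equals the embedding dimension, regardless of the number of tokens, which is what makes them usable for sequences of varying length.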

4) Imbalanced Classification Problems
Skewed distributions or imbalanced classes are one of the most critical challenges to solving fraud detection problems. Generally speaking, there are far fewer instances of fraudulent items than normal ones. In the literature, classes with fewer objects are referred to as minority classes and the other classes as majority classes. The resulting imbalance makes it difficult for learners to detect patterns in the minority class data. [30] mentions three broad approaches to learning from imbalanced data:
• Data-level methods that modify the dataset to achieve balance between the minority and majority classes in their distributions and remove difficult observations,
• Algorithm-level methods that directly modify existing learning algorithms to alleviate the bias towards majority class objects and adapt them to mining data with skewed distributions,
• Hybrid methods that combine the advantages of these two approaches.
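To make the first two families concrete, the following sketch shows one simple representative of each: random over-sampling of the minority class (data-level) and inverse-frequency class weights that a cost-sensitive learner could consume (algorithm-level). The function names, the weighting heuristic, and the toy labels are assumptions chosen for illustration; they are not the specific methods of [30].

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Data-level sketch: duplicate minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def class_weights(labels):
    """Algorithm-level sketch: weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Toy data: 8 valid records (0) and 2 fraudulent ones (1).
labels = [0] * 8 + [1] * 2
samples = list(range(10))

balanced_x, balanced_y = random_oversample(samples, labels)
weights = class_weights(labels)
```

A hybrid method would combine both ideas, for example by moderately over-sampling the minority class and still weighting its errors more heavily.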
For the data-level approach, [31] use under-sampling for a skewed class in a fraud detection system for credit cards, and [32] assess how resampling the multiplier selection influences classification accuracy. For the algorithmic-level approach, [33] use cost-sensitive classifiers to address the class imbalance problem. In turn, [34] propose the FraudMiner model to handle class imbalance by entering the unbalanced data directly into the classifier. Some papers, such as [35], try to combine both approaches to make them work for the general data scenario. More general-purpose approaches include over-sampling, [36], combinations of over- and under-sampling, [37], and meta-learning to automate the selection of imbalanced classification methods, [38]. [39] explore highly imbalanced datasets connected with credit card fraud and show that feature selection methods can enhance the metrics of the classifiers. [40] consider financial transactions and develop a proactive strategy for fraud prevention that helps to overcome the issue of imbalanced data. The proposed conversion of the time-series into a transformed domain allows exploiting only a legitimate class, thereby making it possible to operate even in the absence of previous fraudulent cases. [41] provide a comprehensive overview of existing classic machine learning and deep learning techniques to address class imbalance. Moreover, the authors discuss various metrics and their particular application to achieve a better reflection of model performance.
VOLUME 4, 2016

A. LEARNING OF CLASSICAL DATA-BASED MODELS
The typical scenario for supervised learning starts with data consisting of a sample of observations. More formally, we define $\{(x_p, y_p)\}_{p=1}^{P}$ as our dataset with $P$ input/output pairs. For each input $x_p$, an $N$-dimensional vector, we have a corresponding output $y_p$, a categorical or discrete variable with so-called labels.
Each sample, $(x_p, y_p)$, contains a description of an object given by features, $x_p$, and the value/label of the target variable for that object, $y_p$. In disability insurance, for example, annual income, education, occupation, age, and past medical records describe a customer. Furthermore, the target variable, $y_p$, is a binary label, $y_p \in \{0, +1\}$, that represents whether the customer's claim is fraudulent (+1) or valid (0).
Our dataset stems from a large international health insurance company. The observations consist of medical bills, including information about the treatments provided, as well as their costs, types, total amounts, and general client features such as age and occupation. The target variable depends on whether the bill was classified as fraudulent by a clerk handling the claim.
Thus, we can learn a model that predicts the target variable, $y_p$, by taking a new object's features, $x_p$, as its input.
The power of machine learning is that we can learn a model that adapts to a given sample and is able to generalize well to unseen data similar to that in the training sample. Machine-learning methods make it possible to learn nonlinear and complex relationships in datasets.
Data scientists have devised various ways to generate features manually from complex but unstructured data such as images and texts. They use these features as input for classic machine learning models. The standard limitation of these approaches is that they require object descriptions in a restrictive format. Usually, they use fixed, small-length vectors, which is not feasible for many real-world objects such as texts or images with millions of pixels.
For insurance-related tasks, the corpus consisting of bills has different lengths for different patients and describes visits to a doctor. Each patient has a different number of visits or medical bills listing a varying number of treatments. We can construct one-hot-encoding features counting the number of specific prescriptions or specific visit types for a given patient. In other fields, such as economics, manually generated features are also widespread. For example, the variable "age" is often constructed from the variable "date of birth". However, we observe that such approaches yield results of reasonable but limited quality.

B. THE DEEP LEARNING REVOLUTION
The deep learning revolution changed the rules of the game for machine learning data-based models, [42]. Now algorithms can learn representations or embeddings of object descriptions to generate features that are informative enough to provide accurate predictions while using relatively simple machine learning models. Examples are the fully connected neural networks (FCNs) with only a few layers. The strength of deep learning lies in feature extraction, which means learning informative features from high-dimensional unstructured and complex input data.
The most successful applications of deep learning are in the field of image processing. However, deep learning impacts other areas, [43], such as natural language processing, [44], or graph data, [45].
The key idea behind deep learning is to apply a sequence of non-linear transformations, the so-called layers of the neural network, to object descriptions. The objective is to produce an informative embedding and then use it as input for a final classifier.
Deep learning models enjoy numerous architectures and variations that can be chosen for the task at hand. With this in mind, in this study, we test several basic architectures from the deep learning literature, to find the best one for our settings. These are Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), Simple Word Embeddings-based Models (SWEM), and Transformer. GRU and LSTM belong to the family of Recurrent Neural Networks (RNN), which have unique mechanisms for regulating the flow of data and handling the problems of the canonical RNN by enabling long memory and efficient training. CNN is an architecture widely used in computer vision tasks. However, the leveraging of processing data with convolutional kernels might also be beneficial in other domains. Furthermore, the SWEM is sometimes comparable with more complex models and deserves study when working with embeddings. In addition, we consider the Transformer model, whose attention-based mechanism has revolutionized deep learning for natural language processing.

C. CONCEPT OF EMBEDDINGS
In this paper, we address the problem of representing healthcare insurance data by using embeddings for fraud detection.
An embedding is a transformation of object descriptions to vectors that belong to the same low-dimensional space. These low-dimensional representations have the characteristic that similar instances have a smaller distance between them in the embedded space. For example, a helpful embedding provides vector representations of words so that the relationship between two vectors mirrors the relationship between the two terms. We can see this in the popular word2vec model by [46] from the natural language processing literature. It constructs a low-dimensional vector of real numbers. As a result, words appearing in a similar context have similar vector representations. In short, embeddings are a general framework for dimensionality reduction and a practical approach to extracting features of intrinsic relations between complex objects.
In our case, we learn an embedding space explicitly constructed for sequential data from healthcare insurance claims. Such representation significantly helps in the detection of fraudulent patterns.
To make these ideas clear, let us suppose that a text consists of words from a dictionary of V different words. The classic way to transform the text into numeric features is with a one-hot encoding for each word in the text. In other words, we represent each word in the text sequence as a V-dimensional vector consisting of zeros except for one entry at the location corresponding to that word. This approach is standard for encoding categorical variables. However, the representation is not efficient if the dictionary is extensive, as is typical for general texts and healthcare data. In turn, via modern approaches, we can represent embeddings of words from the dictionary with real-valued s-dimensional vectors, such that s is much smaller than V. This allows for a compressed representation of the textual input description. In this representation, the entries of the embedded vector are usually all different from zero. The embedding of the dictionary into the vector space should also maintain some of the relations between the words. For example, a desirable property of a word embedding is that the difference in the vector space between the words "queen" and "king" should be similar to that between the words "woman" and "man". Learning word embeddings with such properties makes them a powerful tool for text analysis [47]. We can learn such embeddings from unstructured data in a supervised or unsupervised way.
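The contrast between the two representations can be illustrated with a small sketch. It assumes a toy dictionary of V = 6 entries and an embedding dimension s = 2; the embedding matrix E is filled with random numbers here purely as a stand-in for values that a model would learn.

```python
import random

V = 6  # toy dictionary size
s = 2  # toy embedding dimension; in practice s is much smaller than V


def one_hot(index, size):
    """V-dimensional vector of zeros with a single one at the word's position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


rng = random.Random(0)
# Hypothetical embedding matrix with one s-dimensional row per dictionary entry.
# Random values stand in for learned parameters.
E = [[rng.uniform(-1.0, 1.0) for _ in range(s)] for _ in range(V)]


def embed(index):
    """Dense representation: look up the row of E that belongs to the word."""
    return E[index]


def embed_via_one_hot(index):
    """Multiplying the one-hot vector by E selects the same row as the lookup."""
    x = one_hot(index, V)
    return [sum(x[i] * E[i][j] for i in range(V)) for j in range(s)]
```

The lookup and the matrix product give the same vector, which is why an embedding layer can be seen as a linear map applied to one-hot inputs without ever materializing the V-dimensional vectors.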

D. APPLICATION OF EMBEDDINGS
In recent years, learning embeddings that represent complex relationships within data has become common in the machine learning community. Authors apply different embedding types to various domains, such as natural language processing (NLP), network analysis, and computer vision.
As noted above, word embeddings, such as word2vec, GloVe by [21], AdaGram by [48], and others, provide vector representations of words such that the relationship between two vectors mirrors some linguistic connection between the two terms. In supervised problems, word and sentence embeddings have proven effective for natural language processing tasks such as part-of-speech tagging, [49], phrase-based machine translation, [50], named-entity recognition, [51], and word sense disambiguation, [48].
Beyond text, we can use embeddings for all types of data representations. For example, graph and network embeddings attempt to capture local and global attributes on graphs. One option consists of engineered graph features, and another of training on graph data. Classical approaches for graph embeddings include feature-based methods, such as graph kernels, [52], [53], and data-driven algorithms that yield distributed graph representations, [54], [55], [56]. Using such embeddings, we can solve various tasks related to network data analysis. One example is [57] who use anonymous walk embeddings for graph influence set completion.

A. OVERVIEW
Our dataset consists of many health insurance claims from the outpatient care setting. It comprises about 0.38 million patients' medical bills with 3.3 million items in total. Each data point is a sequence of treatments encoded with anonymized IDs.
There are 15 input features in total. In the data, we have two types of features for each patient: general and visit-specific. The first type includes age, sex, insurance type, and doctor specialty. These features relate to the patient, insurance, and doctor in general. For each patient, visit-specific features describe individual outpatient visits. For this type of feature, we consider treatment IDs, the type of treatment, the number of treatments, the cost of therapy, a factor that increases the cost of treatment due to potential complications, the total billable amount, the billing type, the cost category, and the performance type. We encode the type of treatment using one of more than two thousand categories. However, some of the available features are uninformative, and our experiments reveal that the models perform better if we discard them. In the following discussion, all features except treatment IDs will be referred to as global features.
For each record, we have a label. The label is either "fraudulent" and coded as 1 or "non-fraudulent" (valid) and coded as 0. Here "fraudulent" refers to the fact that the final amount of the bill was amended, which can happen for many reasons. About 2% of records are fraudulent. The challenge is to identify whether the record corresponds to a fraudulent activity based on the input features available.
We present in Table 1 a sample of the final version of the refined data, which consist of both input features and a target to train all our models.
The number of treatments on a bill varies significantly among patients. In Figure 2, we provide a histogram of the number of treatments and items for the dataset. The distribution of treatments is nonuniform. Most patients have only a small number of treatments in the outpatient setting.
We make a representative random subsample of our data and its description available in a public repository (see footnote 2).

B. TREATMENTS
Our approach aims to determine whether information from labels of treatments can help automatically identify fraud. Because the number of treatments varies for each patient, we must aggregate all treatments into one vector. To do so, we construct an embedding of all treatments into a vector of a fixed dimensional size. In the literature, we can find approaches to dealing with varying input sizes. One example is [58]. In our case, the natural way to construct embeddings is to use methods that have their roots in natural language processing (NLP) because a medical bill lists a series of individual treatments from a sizeable but finite dictionary. Each anonymized treatment belongs to a dictionary of size 2205. We summarize treatments in up to 17 upper-level treatment groups. Similarly, we create another dictionary feature with a size of 24 for the type of benefit.
Also, we emphasize that specific treatments are not more prone to fraud. If we measure the correlation between the presence of a particular treatment and the target variable, the maximum absolute value for correlation is only 0.0243. We must therefore use more sophisticated machine learning approaches to make it possible to identify fraudulent series of treatments.
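The check described above can be sketched as follows: for every treatment, build a binary presence indicator per bill and compute its Pearson correlation with the label, then report the largest absolute value. The helper names and the toy bills are assumptions for illustration; the original analysis may have used a different correlation measure or implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists (both non-constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def max_abs_correlation(bills, labels):
    """Largest |correlation| between 'treatment present on bill' and the label."""
    vocabulary = sorted({t for bill in bills for t in bill})
    best = 0.0
    for treatment in vocabulary:
        presence = [1.0 if treatment in bill else 0.0 for bill in bills]
        # Skip indicators that are constant, for which correlation is undefined.
        if 0 < sum(presence) < len(bills):
            best = max(best, abs(pearson(presence, labels)))
    return best


# Toy example in which treatment "b" perfectly separates the classes.
toy_bills = [["a", "b"], ["a"], ["b", "c"], ["c"]]
toy_labels = [1, 0, 1, 0]
toy_result = max_abs_correlation(toy_bills, toy_labels)
```

On the toy data the maximum is close to one by construction, whereas a value as small as the one reported above indicates that no single treatment is informative on its own.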

C. DISTRIBUTION OF TREATMENTS
To understand the essence of our data, we rank the number of treatments in the dataset. For example, the most frequent treatment has a rank of one. This approach is close to the empirical Zipf's law, [59], commonly found in NLP. Intuitively, the frequency of any treatment is roughly inversely proportional to its rank in the frequency table. Figure 3 demonstrates this behavior for our dataset as a log-log plot. However, we observe a heavier tail with rarer treatments having higher frequencies than what we would expect from Zipf's law. One interpretation is that there are few rare treatments in our data and that the diversity is higher for tasks using natural language texts.
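A rank-frequency table of this kind can be computed as in the sketch below, which also fits the slope of the log-log relationship by least squares. Under an idealized Zipf's law the slope is close to -1. The function names and the toy bill are assumptions for illustration, not the code behind Figure 3.

```python
import math
from collections import Counter


def rank_frequency(bills):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent treatment."""
    counts = Counter(t for bill in bills for t in bill)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Toy data whose frequencies are exactly proportional to 1 / rank.
toy_bills = [["T1"] * 12 + ["T2"] * 6 + ["T3"] * 4 + ["T4"] * 3]
pairs = rank_frequency(toy_bills)
slope = loglog_slope(pairs)
```

A heavier tail than Zipf's law predicts would show up as a flatter curve, that is, a slope closer to zero, at high ranks.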

A. CLASSIC MACHINE LEARNING APPROACH
Using classic machine learning algorithms, we consider two types of representations for the sequence of treatments. These are bag of words (BoW) and term frequency-inverse document frequency (TF-IDF).
The idea behind BoW is to represent a sequence $t = \{w_1, w_2, \ldots, w_{i_t}\}$ by counting the number of times $n_{w,t}$ a token $w$ appears in it. This technique generates a vector of frequencies of the tokens in the considered sequence. Some terms, such as "a" or "the", appear multiple times but provide little information in sentences. Therefore, a normalized version of BoW leads to TF-IDF. A TF-IDF is the product of the term frequency and inverse document frequency for texts and documents. For the frequency term, we divide $n_{w,t}$ by the total number of words in the text, $\sum_{w'} n_{w',t}$, whereas, for the inverse document frequency, we take the logarithm of the total number of documents divided by the number of records that contain the considered word $w$. We can swiftly transfer the idea behind the TF-IDF to other contexts beyond words, texts, and documents. We emphasize that neither of these approaches considers the order of a token in a sequence. However, neglecting the order can decrease performance in many problems.
To apply the ideas of BoW and TF-IDF in our setting, we form a dictionary of all unique treatment IDs. In this case, the quality of the classic machine learning models with BoW features is slightly better than with TF-IDF. Hence, we discard the latter and consider only a more straightforward BoW processing in the following discussions.
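Both representations can be sketched for sequences of treatment IDs as follows. The sketch uses the common definition of TF-IDF, term frequency times the logarithm of the number of documents over the document frequency; the function names and the toy treatment IDs are assumptions, and smoothing variants used by some libraries are deliberately left out.

```python
import math
from collections import Counter


def bag_of_words(sequence, vocabulary):
    """Count how often each dictionary entry appears in the sequence."""
    counts = Counter(sequence)
    return [counts[w] for w in vocabulary]


def tf_idf(sequences, vocabulary):
    """TF-IDF rows for non-empty sequences: (count / length) * log(N / document frequency)."""
    n_docs = len(sequences)
    doc_freq = {w: sum(1 for seq in sequences if w in seq) for w in vocabulary}
    rows = []
    for seq in sequences:
        counts = Counter(seq)
        total = len(seq)
        rows.append([
            (counts[w] / total) * math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            for w in vocabulary
        ])
    return rows


# Toy bills with hypothetical treatment IDs.
bills = [["T1", "T2", "T1"], ["T2", "T3"], ["T2"]]
vocabulary = sorted({t for bill in bills for t in bill})

bow_first_bill = bag_of_words(bills[0], vocabulary)
tfidf_rows = tf_idf(bills, vocabulary)
```

In the toy data, "T2" occurs on every bill, so its inverse document frequency is zero and TF-IDF removes it entirely, while BoW keeps its raw count. Neither vector changes if the treatments on a bill are reordered.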

1) Logistic Regression, Random Forest and Gradient Boosting
The literature considers logistic regression a common baseline for predictive modeling. Despite the simplicity of the rules underlying the construction of the relationship between features and the probabilities of belonging to a particular class, this model gives solid results for many problems.
Another popular model in the classic machine learning literature is the Ensemble of decision trees according to [60]. In this model, each decision tree distributes the input objects to the leaves based on the features of the object and learned rules in the nodes. In a leaf, the classifier returns the probability of belonging to a specific class. In the Ensemble, we use a weighted sum of basic decision tree classifiers. The ensemble of decision trees offers many benefits: it is fast to construct, can almost avoid overfitting, successfully handles missing values and outliers, and provides competitive performance, [60].
A popular and efficient way to perform classification using the predictions of several decision trees is the Gradient boosting algorithm. It trains the trees sequentially, fitting each new tree to a target derived from the errors of the previous trees. One of the many advantages of Gradient boosting is its ability to solve imbalanced classification problems and easily incorporate various imbalanced classification heuristics, [35].
We select the LightGBM framework presented by [61] as our preferred implementation of Gradient boosting for our experiments. This high-speed implementation provides state-of-the-art performance, needs less memory to run, and supports learning on graphical processing units (GPUs). An added benefit is that we can tune it by adjusting a vast number of hyperparameters.
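The sequential principle of Gradient boosting can be illustrated with a deliberately small sketch. It is not LightGBM and not the configuration used in our experiments: it assumes a single numeric feature with at least two distinct values, depth-one trees (stumps), a squared-error loss, and hypothetical values for the number of trees and the learning rate.

```python
def fit_stump(xs, residuals):
    """Fit a depth-one tree: pick the threshold that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


# Toy one-dimensional data: the label switches from 0 to 1 between x = 3 and x = 4.
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [0, 0, 0, 1, 1, 1]
model = gradient_boost(toy_x, toy_y)
```

Production implementations such as LightGBM differ in many respects, for example in the loss function, the tree-growing strategy, and regularization, so the sketch only conveys how later trees correct earlier ones.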

B. DEEP LEARNING APPROACHES
We examine several deep learning models to identify those capable of processing our specific data best and yielding the most precise predictions. For this, we describe the overall pipeline of working jointly with diverse types of features to obtain the fraud probability distribution for each insurance claim.

1) Model Pipeline
a: Treatments embedding layer
First, we embed each treatment in a sequence into a vector of dimension d. The model identifies its optimal values during the learning process. However, there is a different number of treatments for each patient. Thus, we pad all lists of treatments with empty treatments to achieve an equal length across our vectors. In conjunction with this, we construct masks for the padded parts and pass them to the subsequent layers so that these layers can ignore the padding.
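The padding and masking step can be sketched as follows. The reserved ID 0 for the empty treatment, the function name, and the toy treatment IDs are assumptions for illustration; deep learning frameworks offer equivalent utilities.

```python
PAD_ID = 0  # hypothetical ID reserved for the "empty treatment"


def pad_and_mask(sequences, pad_id=PAD_ID):
    """Pad all sequences to the longest length and mark real entries with 1."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Toy bills with hypothetical treatment IDs of different lengths.
toy_bills = [[17, 4, 256], [9], [33, 12]]
padded, masks = pad_and_mask(toy_bills)
```

The mask has the same shape as the padded input, so a subsequent layer can multiply by it or use it to skip padded positions when aggregating.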
b: Obtaining the encoding vector from treatment embeddings
Then, we pass the resulting embeddings to the input of one of the following neural models: CNN, GRU, LSTM, SWEM, and Transformer to get a single output vector. At this stage, we construct a so-called encoding vector, which should encapsulate as much information as possible from a sequence of embeddings.

c: Approaches to use global features
Third, the global features consist of numerical and categorical features. As part of our pipeline, we scale them beforehand. Then, we try to use information from categorical features in two ways for our classification problem. In the first approach, we treat all global features in each record as a vector and pass it as an input for several feed-forward layers. After that, we concatenate the resulting vector with the encoding vector obtained from the treatments, i.e., the visit-specific features. In the second approach, we obtain the embedding of each distinct global feature and then receive an encoding vector by simply averaging all embeddings.

VOLUME 4, 2016
Fourth, after obtaining the concatenated vector, which incorporates global and visit-specific features, we process it further to get the final probability distribution. In particular, we apply dropout and then several fully connected layers, such as a gated combination of linear and non-linear transformations. Finally, we implement a linear layer to obtain logits, which we can transform into probabilities through the softmax function. Then, by analyzing the output probabilities, we can decide whether a record is fraudulent. An example of the architecture we use to obtain the encoding vector from treatment embeddings with an LSTM model is available in Figure 4. We depict the global features related to the patient as a single vector. We assume that we process them with one of the above-mentioned methods.
Figure 4: The model architecture to identify fraudulent claims. We pass each treatment's embedding to the input of a Neural Network model, for example, an LSTM, to obtain the encoding vector. Then, we concatenate this encoding vector with processed global features and pass it to the several layers of linear and non-linear transformations to obtain the final probability distribution.
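The last step, turning logits into class probabilities with the softmax function, can be illustrated with a short sketch. The logit values, the class order (legitimate first, fraudulent second), and the 0.5 decision threshold are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for one claim: [legitimate, fraudulent]
probs = softmax([0.3, 1.7])
is_fraud = probs[1] > 0.5  # assumed threshold, not prescribed by the text
```

In practice, the threshold would be chosen to match the desired trade-off between missed fraud cases and false alarms.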

2) Training Deep-learning Models
We train the full model for 100 epochs and set the parameter "patience" to 5 epochs. This means that if our target metric does not improve after five epochs, we stop training. We use the Adam optimization algorithm by [62] with a learning rate equal to 0.001 and minimize cross-entropy loss. Our classification problem is imbalanced, so we train the model with balanced batches for more effective learning, implying oversampling from the minority class, because this approach shows the best results in general and specifically for our problem.
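The stopping rule with "patience" can be sketched as below. The function name early_stopping_run and the list of validation scores are hypothetical; the scores stand in for the target metric that a real training loop would compute after each epoch. Epochs are counted from zero.

```python
def early_stopping_run(validation_scores, max_epochs=100, patience=5):
    """Return (best_epoch, best_score, last_epoch_run) under early stopping."""
    best_score, best_epoch = float("-inf"), -1
    stale, last_epoch = 0, -1
    for epoch, score in enumerate(validation_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, last_epoch


# Made-up validation metric per epoch
scores = [0.70, 0.75, 0.80, 0.79, 0.80, 0.78, 0.79, 0.80, 0.81]
best_epoch, best_score, last_epoch = early_stopping_run(scores)
```

In this toy run, training stops before the final score of 0.81 is reached, which shows the trade-off inherent in a finite patience value.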

A. METRICS
To evaluate the classification models, we can use many metrics. However, handling our imbalanced classification problem requires detecting the minority class with high precision. For fraud detection, it is crucial to find all actual positive events. An outstanding classifier would correctly predict all instances of the minority class with a low false alarm rate.
Below we consider the canonical metrics for measuring the quality of imbalanced classification problems. The confusion matrix forms the basis of our metrics. Using Table 2, we can generate the most common metrics in the literature to estimate the performance of a classifier with different focuses, such as the area under the ROC curve (ROC AUC) and the area under the PR curve (PR AUC). These metrics use standard concepts such as recall, precision, and false-positive rate. To be more precise in our discussion, we introduce the following notation.
The recall is the true-positive rate, $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$. It is the percentage of positive instances correctly classified. When this metric is equal to one, it means that we can identify all fraud cases.
The precision is the percentage of positive instances among positive predictions. We define it as $\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$. A high value for precision means that our model captures the underlying fraud behavior.
The false-positive rate is $\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}$ and represents the percentage of negative instances that the model misclassifies as positive.
We define the F1 score as $2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$. It lies in the interval [0, 1], and higher F1 scores are preferable.
The canonical ROC curve shows how well the model classifies items in the two classes. Ideally, the first class must display a high value for the true-positive ratio (TPR), whereas the second class must show a low false positive ratio (FPR). Therefore, a high-quality prediction corresponds to a "balance" between these values.
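As an illustration, the area under the ROC curve can be computed through its rank interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below assumes binary labels (1 for fraud) and at least one instance of each class; the function name and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Toy example with two legitimate (0) and two fraudulent (1) claims
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of samples, so it is suitable only for small illustrations; library implementations use sorting instead.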
Similarly, the PR AUC describes how well the model classifies the minority, or fraud, class. We want to maximize the TP number. This metric is well-suited for imbalanced datasets, because it reflects the model's ability to identify fraudulent behavior. The focus on the minority class enables the PR AUC metric to capture performance on the most relevant category better than other metrics.
To sum up, we consider four relevant metrics to evaluate the performance of our models: ROC AUC, PR AUC, F1 score, and the confusion matrix. ROC AUC, PR AUC, and F1 score lie in the interval [0, 1], and our goal is to maximize them. The confusion matrices that we provide have the following structure: [[TN, FP], [FN, TP]]. [41] prove that our chosen metrics ensure a comprehensive performance review.
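The threshold-based metrics defined above follow directly from a confusion matrix in the [[TN, FP], [FN, TP]] layout. The sketch below uses made-up counts and assumes that no denominator is zero; the function name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Compute recall, precision, FPR, and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "fpr": fpr, "f1": f1}


# Hypothetical counts: 100 legitimate claims and 20 fraudulent ones
cm = [[90, 10], [5, 15]]
metrics = metrics_from_confusion(cm)
```

With these counts, the recall is 0.75, the precision is 0.6, the false-positive rate is 0.1, and the F1 score is about 0.667.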

B. VALIDATION PROCEDURE
To evaluate the performance of the model, we use data splitting. We randomly split the dataset and follow a standard approach among data science practitioners, using 60% of the medical bills during the model training phase and the remaining 40% to validate and test our models. Given our imbalanced data, we do the splits in a stratified way: the ratios of classes in the training and test samples coincide with those of the initial selection. In practice, to get a more reliable estimate of model performance, we generate several partitions and distribute them into train and test sets (cross-validation). In subsection A of the Appendix, we show that the metrics evaluated on several splits and on a single split are close. Due to this fact, we provide an analysis of the experiments for a fixed train-test split.
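A stratified split of this kind can be sketched as follows. The function name, the fixed seed, and the toy labels are assumptions for illustration; the sketch splits indices only once into a 60% part and a 40% part and does not separate validation from test data.

```python
import random


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Split sample indices so that each class keeps the same train share."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 10 fraudulent (1) and 90 legitimate (0) bills
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
```

Both parts then contain 10% fraudulent bills, matching the class ratio of the full toy dataset.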

C. RESULTS
In this section, we evaluate the metrics of our classic machine learning and deep learning models. We compare the results for different subsets of features and identify the model that performs best. In addition, we try several techniques to alleviate the class imbalance issue. We also report how the metrics depend on the size of the train set and on the dimensions of the encoding and embedding vectors. Moreover, we explore the robustness of our models by corrupting the initial data. Such a thorough analysis of the models provides a clear picture of the best way to construct the classification model considering the nature of the data. For reproducibility purposes, all implementations and experiments are available in a public repository (see footnote 3). All necessary packages and their versions to train deep learning models are indicated in the poetry.lock and pyproject.toml files in our repository. To build classic machine learning models, we use the scikit-learn library [63] and the LightGBM framework [61]. The technical details for the experiments are specified in subsection B of the Appendix.

1) Performance of Machine Learning Models. Usefulness of Visit-specific Features
We recall that we have two types of features. One set consists of global features, and the other is visit-specific. With this in mind, we generate BoW features from the visit-specific ones and compare three different sets of features. These are global, visit-specific, and a combination of the two. We present the results of Machine Learning models in Table 3.
The results indicate that using all available features leads to the most accurate predictions and, consequently, the highest ROC AUC and PR AUC values among all models. Thus, we focus on the use of both sets of features in further discussion. With the LightGBM model, we achieve the best ROC AUC score.
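The BoW features mentioned above can be illustrated with a short sketch that counts how often each treatment occurs in a patient's sequence. The function name and the treatment codes are hypothetical, and the actual feature construction in the experiments may differ in details such as weighting.

```python
from collections import Counter


def bag_of_words(sequences):
    """Turn treatment sequences into count vectors over a shared vocabulary."""
    vocabulary = sorted({token for seq in sequences for token in seq})
    rows = []
    for seq in sequences:
        counts = Counter(seq)
        rows.append([counts.get(token, 0) for token in vocabulary])
    return vocabulary, rows


# Made-up treatment codes for two patients
visits = [["A01", "B17", "A01"], ["C03"]]
vocabulary, bow = bag_of_words(visits)
```

Each row has one entry per distinct treatment, so the vector dimension grows with the vocabulary size, and the order of treatments within a sequence is lost.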

2) Performance of Neural Network Models
In Table 5, we report the performance across different models to obtain the encoding vector from treatment embeddings in our overall pipeline.
3 https://github.com/fursovia/fraud_detection/tree/2021_update
GRU and LSTM process input embeddings sequentially, taking one vector at each step and computing a hidden state that is used at subsequent timesteps. In the CNN encoder, each convolution layer outputs a vector of fixed dimension. This output dimension corresponds to the number of filters learned by that layer. After several convolutions interleaved with max-pooling layers, we receive an encoding vector, and we transfer it to the next steps in the pipeline. In turn, SWEM does not have learnable parameters. This model produces an encoding vector by simply taking the average of the treatment embeddings. Lastly, in the Transformer model, we encode a sequence of treatment embeddings into another series of vectors of the same dimensions with the attention mechanism. Subsequently, we average the sequence of new vectors to obtain a final encoding vector. We note that higher-quality results are produced by the Optimized LSTM with hyperparameters optimized for better performance.
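The SWEM-style averaging described above can be sketched in a few lines. The example assumes that padded positions are excluded through a 0/1 mask; the function name and the two-dimensional toy embeddings are hypothetical.

```python
def swem_mean(embeddings, mask):
    """Average the embedding vectors of the real (unmasked) treatments."""
    kept = [vector for vector, flag in zip(embeddings, mask) if flag]
    if not kept:
        raise ValueError("sequence contains no real treatments")
    dimension = len(kept[0])
    return [sum(vector[i] for vector in kept) / len(kept)
            for i in range(dimension)]


# Two real treatments followed by one padded position
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
encoding = swem_mean(embeddings, mask)
```

The resulting encoding vector has the same dimension as a single treatment embedding, regardless of the sequence length.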
We find that processing treatment embeddings with SWEM yields the best results in our settings. Meanwhile, the Transformer performs considerably worse, possibly due to an inappropriate architecture for constructing the encoding vector for this particular task. This more complex model fails to capture the data patterns and does not yield results comparable to those of other models. Obtaining the encoding vector with a simpler model has therefore proven to be beneficial for our problem.
We can observe better results when we process our global features with linear layers rather than obtaining embeddings. The reason for this might again be that a more straightforward approach is more beneficial here. We hypothesize that we can better preserve valuable information when we apply some transformations to initial global features, whereas constructing new representations becomes harder.
In some tasks, processing the data from two directions and using both the previous and future contexts might improve quality. To check whether this is the case for our problem, we compare metrics from uni-directional and bi-directional GRU and LSTM models in Table 4. As we can see, under a bi-directional setting, we slightly improve the results of both models.

3) Optimization of Hyperparameters
The LSTM is the model with learnable parameters that performs only slightly worse than SWEM and outperforms all of the remaining models. We therefore search for optimal hyperparameters of our entire pipeline, and especially of the LSTM, to assess any possible improvements in the metrics and whether we manage to beat the performance of the classic Machine Learning model, LightGBM. To optimize the hyperparameters, we use a modern software library for automated hyperparameter search, Optuna [64].
We search for the optimal values of 13 parameters, including dropout rates, the dimension of treatment embeddings, the learning rate of the optimizer, the number of linear layers, and the type of activation function between them, among others. We present the results from the optimized model in Table 5. The optimization procedure allows us to achieve metrics that are superior to LightGBM for the same set of features. A deep learning model with properly fine-tuned hyperparameters and an optimal architecture is thus a suitable classifier for our problem. By contrast, classic Machine Learning models require constructing high-dimensional BoW vectors, which may cause memory problems and longer training times if the dictionary of unique tokens is large.

4) Addressing class imbalance problem
Data in fraud detection problems are inherently imbalanced. We try basic data-level approaches to address this problem when using classic machine learning models. In particular, we consider Random Under Sampling, Random Over Sampling, SMOTE, and ADASYN. In the Random Under Sampling method, we remove random samples from the majority class. With Random Over Sampling, we duplicate random fraudulent examples in the train set. SMOTE and ADASYN generate new synthetic samples of the minority class. We provide the performance of the models trained on the initial set and on the resampled sets in Table 6. The results are given for the optimal ratio of the number of samples in the minority class to the number of samples in the majority class. Some of the approaches significantly enhance the metrics.
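As a minimal sketch of the simplest of these approaches, Random Over Sampling duplicates randomly chosen minority samples until a target minority-to-majority ratio is reached. The function name, the seed, and the toy data below are assumptions for illustration; SMOTE and ADASYN are not covered by this sketch.

```python
import random


def random_over_sample(features, labels, ratio=1.0, minority=1, seed=0):
    """Duplicate minority samples until n_minority / n_majority reaches ratio."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    target = int(ratio * n_majority)
    n_extra = max(0, target - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra  # originals plus duplicates
    return [features[i] for i in keep], [labels[i] for i in keep]


# Toy train set: 8 legitimate (0) and 2 fraudulent (1) samples
features = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_features, new_labels = random_over_sample(features, labels, ratio=0.5)
```

Resampling of this kind should be applied to the train set only, so that the test metrics reflect the original class distribution.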
As for deep learning models, we incorporate a balanced batch sampler in the training procedure. It samples each batch in such a way that the numbers of examples of each class are equal. A balanced batch sampler may facilitate more effective training of neural networks compared with picking the samples in a batch randomly. Evidence of the benefit of balanced batches for some models is given in Table 6.
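A balanced batch sampler can be sketched as follows. The sketch assumes binary labels, an even batch size, and a majority class with at least half a batch of samples; it draws minority indices with replacement, which corresponds to oversampling. The function name and the toy labels are hypothetical.

```python
import random


def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield index batches with equally many samples from each class."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    half = batch_size // 2
    for _ in range(n_batches):
        # minority class: sample with replacement (oversampling)
        # majority class: sample without replacement within a batch
        batch = rng.choices(positives, k=half) + rng.sample(negatives, half)
        rng.shuffle(batch)
        yield batch


# Toy labels: 3 fraudulent and 50 legitimate samples
labels = [1] * 3 + [0] * 50
batches = list(balanced_batches(labels, batch_size=8, n_batches=5))
```

Each batch then contains four fraudulent and four legitimate samples, even though fraud makes up only a small share of the toy dataset.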

5) Dependence of Model Quality on Sample Size
We examine how the size of the training set affects the quality of the model on the test set. We train the model with increasing random subsets of 10, 20, . . ., 100 percent of the initial training data. In Figure 5, we see that the PR AUC and ROC AUC continue to increase as we increase the proportion of used training data. Having limited training data is more detrimental for LightGBM than for the LSTM model. This fact should be taken into account when we are in a regime of small data. We also see that the results are stable for different datasets used. Thus, we can conclude that our deep learning methods and LightGBM will provide satisfactory quality for different subsets of our data.

6) Dependence of Metrics on Embedding and Encoding Dimensions in LSTM Model
To understand the dependency of metrics on the dimensions of treatment embeddings and the encoding vector obtained from them for the LSTM model, we conduct two experiments. First, we fix the size of the embeddings and increase only the dimension of the encoding vector. We depict the result in Figure 6a. After reaching some optimal value of about 100, a subsequent increase of the encoding vector dimension leads only to a deterioration in performance.
Second, we show that we can improve our performance if we augment the dimension of the embedding vector along with the size of the encoding vector. The evolution of metrics when we simultaneously change the embedding and encoding sizes and set them to one value is visible in Figure 6b. Mapping from the initial feature space to an embedded space of small dimension results in information loss and, consequently, in worse model performance.

7) Reliability of Models
Two significant issues for machine learning models are reliability and resistance to malicious attacks. A relevant challenge in fraud detection is when malicious users of a decision model can provide slightly distorted data to the system and fool it. In such a case, the method is of limited use. We can see examples of this in [65], [66] and surveys in [67], [68].
We consider two approaches to evaluate the reliability of our models. The first assesses whether "a model is robust concerning random errors in data submitted to a system", [69]. The second verifies whether "a model is robust to malicious efforts when someone tries to break the system in a particular way by corrupting the input to the system," [65].
In our case, we test the reliability of the history of treatments. We ask ourselves: Can we change this slightly and obtain an entirely different outcome with the model? To do so, we compare the quality of the model before and after corrupting the test data.
We test these issues in two ways. First, to test the model's reliability, we randomly add a different number of treatments from the vocabulary to the end of the sequence of treatments for each patient. Second, to test the model's resistance to malicious attacks, we select a subset of 100 treatments from the vocabulary and add them one by one to the patients' treatment history to find the model output most affected by the addition of a single treatment. After selecting the most harmful treatment from our subset, we repeat the same procedure to choose the second most malicious treatment. This represents a greedy approach to performing an adversarial attack on our data [26].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
In Figure 7, we show the ROC AUC values after corrupting the sequences of treatments with a varying number of tokens. We see that changing inputs to the trained model leads to a drop in its quality, particularly if we use the greedy strategy to attack the LightGBM model. At the same time, we can observe that the LSTM model is more stable to data corruption and shows only a slight drop in quality even after adding a substantial number of malicious tokens.
However, to make the models more robust, we should augment the training data with more cases and possible distortions of the initial data, a so-called data augmentation, and keep the model undisclosed to avoid malicious attacks.
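The greedy attack described above can be sketched as follows. This is a simplified illustration under stated assumptions: score_fn stands for any trained model that maps a treatment sequence to a fraud score, "most harmful" is measured as the total absolute change in that score over all sequences, and the toy linear scorer with made-up token weights is hypothetical rather than one of the models in this paper.

```python
def greedy_attack(score_fn, sequences, candidates, n_tokens):
    """Greedily pick tokens whose addition changes the model output the most."""
    current = [list(seq) for seq in sequences]
    chosen = []
    for _ in range(n_tokens):
        def damage(token):
            # total absolute change in score if `token` is appended everywhere
            return sum(abs(score_fn(seq + [token]) - score_fn(seq))
                       for seq in current)
        best = max(candidates, key=damage)
        chosen.append(best)
        current = [seq + [best] for seq in current]
    return chosen, current


# Hypothetical stand-in for a trained model: a sum of per-token weights
weights = {"A01": 0.1, "B17": -0.4, "C03": 0.05}


def toy_score(sequence):
    return sum(weights.get(token, 0.0) for token in sequence)


sequences = [["A01"], ["C03", "A01"], []]
chosen, attacked = greedy_attack(toy_score, sequences,
                                 ["A01", "B17", "C03"], n_tokens=2)
```

Each step evaluates every candidate on every sequence, so the cost grows linearly with the number of candidates, the number of sequences, and the number of added tokens.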

VII. CONCLUSION
A constant challenge for insurers and financial companies, insurance fraud is, in essence, an anomaly detection problem. In this paper, we propose and examine deep learning architectures that are tailor-made for insurance claims data based on embeddings for unstructured data and compare them with classic machine learning approaches based on careful feature engineering. During the model training we also construct embeddings for treatment IDs.
We analyze the performance of classical machine learning models and our proposed methods in solving the task of claim classification on real-world data. Processing unstructured categorical sequences related to outpatient doctor visits with our best model, we get a ROC AUC score of 0.873, whereas the state-of-the-art model shows a worse result with a ROC AUC of 0.815. Moreover, our empirical experiments confirm that we can improve our model further by optimizing the neural network architecture, increasing the volume of data used for training, and incorporating techniques for addressing the class imbalance problem. The significance of choosing the proper embedding and encoding dimensions in our deep learning models is also demonstrated. In addition, we identify that our architecture is robust to random disturbances of the data, as well as adversarial and malicious changes, and can enhance the claims management process. If we add 5 malicious tokens to a sequence, the performance of the classic machine learning model degrades to a ROC AUC of 0.640, while the deep learning model degrades only slightly, with a ROC AUC of 0.840 for the corrupted input.
As digitization continues to proliferate, increasing amounts of unstructured data in the form of text will become available, including electronic health records, claims data, personnel files, and financial statements. These data will often have a unique structure and contain variables with many categories that classical methods cannot handle. The deep learning architectures and embeddings that we propose in this paper are bespoke for such data. As a result, our approach is relevant to researchers, administrators, and managers in healthcare, organizational economics, insurance, and other fields.

A. CROSS-VALIDATION FOR PERFORMANCE EVALUATION
For a more credible assessment of the generalization ability of the models, it is helpful to evaluate the performance not on a single train-test split but repeatedly on several splits. A popular procedure for that is called cross-validation. The main idea behind cross-validation is that we split our initial data into k sets (folds). Then, we train the model on k − 1 folds and test it on the remaining fold.
We calculate the model metrics on a single train-test split. To understand whether we get biased results, we evaluate several models with a cross-validation procedure. We split our dataset into three folds to preserve the testing to training sizes ratio as for a single train-test split. In Table 7, we provide the comparison of results calculated with a single train-test split and 3-fold cross-validation. As we observe, almost all metrics evaluated on a single split lie within the standard deviation of the mean values that we calculate with the cross-validation procedure. Therefore, we can conclude that the size of our dataset is large enough to get a reliable estimation of the models' performance by implementing a single train-test split.
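The k-fold procedure described above can be sketched as follows. The function name and the interleaved assignment of indices to folds are assumptions for illustration; in practice, the data would typically be shuffled and stratified before the folds are formed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i in range(k):
        test = folds[i]
        train = [index for j, fold in enumerate(folds) if j != i
                 for index in fold]
        yield train, test


# Three folds over ten samples, mirroring the 3-fold setup used here
splits = list(k_fold_indices(10, 3))
```

With three folds, each model is trained on roughly two thirds of the data and tested on the remaining third, and the metrics are then averaged over the three runs.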

B. EXPERIMENT DETAILS
We perform the experiments with a single NVIDIA TITAN RTX GPU with 24 GB memory and Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. All Deep Learning models were trained using GPU. During the training process of Machine Learning models only CPU was utilized. The overall size of our data is about 240 MB. We launch all the experiments on the system with Linux Mint 19.3 (Tricia) OS.
The training time for different models is given in Table 8. We provide the averaged result of five runs. The running times are acceptable in all cases given typical requirements for model training in industry. Moreover, they allow running hyperparameter optimization and using large sample sizes.

RODRIGO RIVERA-CASTRO is a research engineer with the Skolkovo Institute of Science and Technology. He pursues his doctoral research in the area of unsupervised methods for learning event sequences and time series. His interests are extracting rich representations from temporal sequences to enable downstream tasks and power applications in finance, marketing, supply chain, and the web.
ALEXEY ZAYTSEV was born in Kharkiv, Ukraine. In 2012, Alexey graduated from MIPT. In his Master's thesis, Alexey proposed a modification of the Bayesian approach for linear regression that allows automated feature selection. He completed a Ph.D. in Math at IITP RAS in 2017. Now Alexey is an assistant professor at Skoltech. Dr. Zaytsev focuses his research on the development of new methods for sequential data, Bayesian optimization, and embeddings for weakly structured data.