Big Data-Driven Abnormal Behavior Detection in Healthcare Based on Association Rules

Healthcare insurance fraud causes millions of dollars of losses to public healthcare funds around the world, which makes it very important to strengthen the management of medical insurance in order to guarantee the steady operation of medical insurance funds. Healthcare fraud detection methods can reduce the losses of healthcare insurance funds and improve medical quality. Existing fraud detection studies mostly focus on finding normal behavior patterns and treat those who violate them as fraudsters. However, fraudsters can often disguise themselves with normal-looking behaviors, such as behaving consistently with one another when they seek medical treatment. To address this issue, we combine the MapReduce distributed computing model with association rule mining and propose a medical cluster behavior detection algorithm based on frequent pattern mining. It can detect certain consistent behaviors of patients in medical treatment activities. We verify the effectiveness of the method by analyzing 1.5 million medical claim records. Experiments show that the method performs better than several benchmark methods.


I. INTRODUCTION
Medical insurance is a social insurance system established to compensate workers for economic losses caused by disease risks. The medical insurance funds are established via payments from insured employers and individuals, and the medical expenses of the insured are partly compensated by medical insurance institutions. The establishment and implementation of the medical insurance system can enable patients to obtain the necessary help, reduce the burden of medical expenses, and prevent the diseased members of society from becoming "poor after illness" [1].
In recent years, China's social medical insurance has developed rapidly. Increasing the coverage of social medical insurance has become the most important task for China's social security system. By the end of 2018, 1.345 billion people had registered in the basic medical insurance, covering more than 95 percent of the total population. As shown in Table 1, the total income of the basic medical insurance funds for the whole year of 2018 was 2,109.011 billion yuan, and the total expenditure was 1,760.765 billion yuan [2]. It can be seen from Figure 1 that the amount of China's medical insurance funds keeps increasing every year, while the balance rate keeps decreasing: it declined continuously from 23.0% in 2012 to 10.0% in 2018. Therefore, how to ensure the normal operation of social medical insurance funds, improve the level of medical insurance management, and reasonably and effectively avoid potential business risks has become an extremely important issue.
The associate editor coordinating the review of this manuscript and approving it for publication was Xin Luo.
In August 2016, the National Audit Office of China authorized local audit institutions to conduct special audits on medical insurance funds, such as basic medical insurance and urban and rural residents' critical illness insurance [3]. This was the most comprehensive audit since China's health care reform. The audit randomly selected funds from 28 provinces, 166 cities, and 569 counties (cities and districts) to check their performance in 2015 and the first half of 2016. The total funds reached 343.313 billion yuan, of which 1.578 billion yuan were illegal, revealing many irregularities including repeated reimbursement of medical expenses, fraudulent medical treatment by some designated agencies and individuals, and fraudulent medical insurance funds through decommissioning and hospitalization.
From a global perspective, the problem of medical insurance anomalies has also attracted much attention [4]. In 2016, a large German public medical insurance company was forced to pay a fine of €7 million to the German Federal Insurance Agency over medical insurance anomalies. The same year, the U.S. Department of Justice cracked down on the biggest medical anomaly case in its history, involving up to $900 million and more than 300 people, such as doctors, nurses, and pharmacists, who were accused of participating in medical anomalies. Abnormal medical insurance seriously endangers the entitlements of the insured, and these abnormal behaviors of medical insurance must be put under control.
Common abnormal medical insurance behaviors can be divided into three categories according to the subjects of such behaviors [5]. Category one covers abnormal behaviors of insured individuals, such as frequent medical consultations and fraudulent use of other people's health insurance cards. Category two covers abnormal behaviors of medical institutions, such as over-diagnosis and unreasonable medication. Category three is joint fraud conducted by individuals and institutions, such as fake invoices and admission checklists. Due to the unequal flow of information, categories two and three often remain so elusive that the non-medical staff of the General Medical Insurance Bureau can hardly find hidden abnormal behaviors in verification [6].
Although it is difficult to find medical insurance anomalies, medical units have kept a large number of medical visit records and data with the widespread use of medical information systems. Similarly, all medical reimbursement behaviors have been recorded in the medical reimbursement data set. Through research and analysis of the medical reimbursement data set, abnormal behaviors hidden in it can be discovered. Traditional analyses of medical insurance anomalies mostly rely on medical practitioners' experience to make manual rules and simple statistical analyses. It is difficult to accurately sort out complete abnormal behavior information from complex medical insurance data in this way. At the current level of informatization in the medical industry, medical data exhibit the four basic characteristics of big data [7], [8]. In the context of big data, we can establish a distributed medical insurance abnormal behavior detection model based on Hadoop, which helps medical practitioners find medical insurance anomalies more quickly, dig out abnormal points in massive data, and supervise abnormal behavior. This has great practical significance and value.
VOLUME 8, 2020
There is a special medical phenomenon in the analysis of medical insurance fraud. It usually manifests as multiple medical insurance cards being consumed too frequently at the same time, which is called medical agglomeration behavior [9]. This kind of behavior may be conducted by certain special illness groups, such as chronic patients, and may also be a kind of fraud. Finding out who exhibits such medical-treatment-related behaviors is therefore meaningful. On the one hand, targeted management and services could be provided for people with special diseases. On the other hand, detected fraud should lead to effective improvement of regulation.
Therefore, we propose a distributed anomaly detection method for medical aggregation behaviors. Our main contributions in this paper are as follows: 1) constructing a medical aggregation behavior model that includes a formal description of medical aggregation behaviors; 2) designing a distributed anomaly detection algorithm and a corresponding interpretation of the detection results; 3) comparing the method with several benchmark methods for frequent itemset mining, which shows that its performance advantage becomes more significant as the amount of data increases and that it can significantly improve the accuracy of fraud detection. More specifically, our method, DCMMAB, is better than the comparison methods by more than 20% in precision.
At present, this method has been integrated into the medical big data analysis platform to provide decision support for auditors in the medical insurance claims system to assess the possibility of fraud.
The rest of the paper is organized as follows: Section 2 reviews related work on fraud detection; Section 3 briefly introduces the framework and related concepts of big data; Section 4 gives the problem definition of medical aggregation behavior detection and introduces the method of mining medical aggregation behavior based on distributed computing (DCMMAB); Section 5 analyzes real medical insurance data with our method, interprets the experimental results, and compares the method with several benchmark methods to show its superiority; Section 6 summarizes our work and discusses several future research directions.

II. RELATED WORK
Medical insurance fraud is not unique to any one country; countries around the world that implement medical insurance systems face similar problems. At present, the research on medical insurance fraud is mainly divided into three aspects: the causes and characteristics of fraud, how to combat fraud, and the identification of fraud.
In terms of the causes and characteristics of fraud, reference [10] explains the causes of medical fraud based on the perspective of information asymmetry. Reference [11] refers to the profound experience of anti-fraud behaviors, and applies phenomenology of qualitative explanation to explain the causes of fraud behaviors. Reference [12] constructs a patient-centric analysis model by analyzing various frauds. Reference [13] details the classification and causes of fraud in American medical insurance funds.
In terms of how to combat fraud, reference [14] analyzes fraud behavior from the perspective of the costs and benefits of the fraudster, and proposes an impact factor model of fraud behaviors. Reference [15] analyzes the causes of fraud and its harm, and gives corresponding suggestions on how to combat it. Reference [16] analyzes fraud in the collection, payment, and funds management of medical insurance funds, and proposes a series of measures to combat fraud.
In terms of the identification of fraudulent behaviors, traditional medical fraud detection methods are mainly based on rules established by experts [17]. Once a medical record violates the given rules, it is judged to be fraudulent. The effectiveness of these methods is constrained by the correctness of the rules. With the widespread application of big data technology in the medical field, data mining technology has been applied to the detection of medical fraud. As early as 1999, studies pointed out that potential patterns in a data set can be discovered through data mining technology, thus providing a basis for scientific decision-making [18]. Reference [19] introduces successful cases of data mining technology in medical fraud detection. Reference [20] applies data mining and machine learning techniques to the construction of a model library and a method library from the perspective of risk prevention and control of medical insurance funds.
At present, some anomaly-detection methods [21]-[24] have also been applied to the detection of medical fraud. Reference [25] uses the IBM Bluemix platform and an open cloud platform to build a medical reimbursement data analysis and display platform. It describes the diagnosis and treatment process between patients and doctors as a graph, where patients and doctors are different nodes and a diagnosis and treatment process is regarded as an edge, so that the connection between the patient and the doctor can be analyzed. Reference [26] combines the semi-supervised IsoMap method and the LOF method to detect abnormal expenses and time, and constructs a medical insurance claim data anomaly detection system. In addition, many studies [27]-[29] have focused on dividing patients into different groups by certain rules and fitting medical expenses with a different model for each group to determine whether there are abnormalities.
In summary, we can see that the method of detecting medical insurance fraud has gone through three stages. The first stage is the prevention and treatment of medical insurance fraud from the perspective of system management and funds payment model, where the application is relatively simple and insufficient. The second stage is the introduction of data mining technology to control the risk of medical insurance fraud. Compared with the first stage, the efficiency of medical insurance fraud detection has been greatly improved.
With the development of medical informatization, medical insurance data have grown rapidly and medical insurance fraud has become more diverse, so detection has entered a third stage, that of machine learning. Therefore, on the basis of previous research and big data related technology [30], this paper proposes a distributed detection algorithm for mining medical aggregation behavior fraud, and applies it to actual data analysis.

III. BACKGROUNDS AND PRELIMINARIES
In this section, we introduce the preliminary terms and concepts needed to understand MapReduce [31] and HDFS (Hadoop Distributed File System) [32].
Hadoop is a distributed computing platform developed based on the Java language by Apache. Because of its high reliability, high scalability, high efficiency, high fault tolerance, low cost, and complete open source, it is widely used in many industries and scientific research fields. Hadoop provides users with a distributed infrastructure with transparent underlying details of the system. Its distributed file system HDFS and distributed computing model MapReduce have been proven to be able to successfully analyze and process big data in parallel on a large number of computer clusters. In MapReduce-based development, developers only need to pay attention to the segmentation of the data set, the division of Map and Reduce tasks, and the implementation of Map and Reduce functions. All other complex parallel computing programming problems, such as distributed storage, task scheduling, load balancing, fault tolerance and network communication are all completed by the MapReduce framework, which greatly reduces the difficulty of development.
Users can use Hadoop to easily organize computer resources, build distributed platforms and complete parallel programming with the help of MapReduce computing models, providing a feasible and efficient solution for the storage and processing of massive data.
The whole processing flow of MapReduce is shown in Figure 2. The principle is to use a set of input key-value pairs <key, value> to generate a set of output key-value pairs. The user expresses this calculation by customizing the Map and Reduce functions. MapReduce's calculation process is described as follows:
Input: The input data set is divided into M splits of the same size, the split information and configuration information are stored on HDFS, and the task is submitted to JobTracker. JobTracker assigns M Map subtasks and R Reduce subtasks to idle TaskTrackers, and puts all tasks in a queue.
Map subtask: Obtain data from HDFS, generate <key, value> after processing, call Map function to receive all input key-value pairs, generate an intermediate set as the output of Map function, and divide it into R parts by the same Hash function. The result is written into the file and the location information is sent to the JobTracker. JobTracker sends the location information to the node that assumes the Reduce subtask.
Reduce subtask: The node obtains the output subset (1 / R) of the Map task according to the received location information, sorts them based on the key value, and then combines all <key, value> with the same key value to form a smaller set as input to the Reduce function. After the Reduce function finishes running, it outputs the results to a file.
Output: After all Map and Reduce sub-tasks are completed, JobTracker returns the output results of the Reduce sub-tasks to the client program, which is merged by the client program to obtain the final result.
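The Input, Map, Reduce, and Output steps described above can be mimicked in a single process. The following is a minimal sketch rather than Hadoop code: the names `run_mapreduce`, `map_visit`, and `count_cards` are hypothetical, and the record layout (card, place, day) is an assumption made only for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    """Single-process imitation of the Map, partition, sort, and Reduce steps."""
    # Map: every input record yields intermediate (key, value) pairs.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_fn(record):
            # Partition by hash so that equal keys reach the same reducer.
            partitions[hash(key) % num_reducers][key].append(value)

    # Reduce: sort the keys of each partition and merge values per key.
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_fn(key, part[key])
    return output


def map_visit(record):
    # Hypothetical record layout: (card, place, day).
    card, place, day = record
    yield (place, day), card


def count_cards(key, cards):
    # Number of distinct cards seen at one place on one day.
    return len(set(cards))


records = [
    ("MC1", "H1", "2017-01-03"),
    ("MC2", "H1", "2017-01-03"),
    ("MC1", "H2", "2017-01-04"),
]
visits = run_mapreduce(records, map_visit, count_cards)
```

In a real cluster the partitions would live on different nodes and the intermediate results would be written to files; here they are plain in-memory dictionaries.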

IV. MATERIALS AND METHODS
In this section, we propose a distributed fraud detection method based on the definition of medical insurance gathering behavior in reference [9], which can mine data records that may participate in aggregation fraud from medical insurance data. This is a novel and practical fraud detection problem. It overlaps with work on frequent pattern mining, but differs from it significantly.

A. PROBLEM DEFINITION
According to the definition of the medical insurance gathering behavior in reference [9], this behavior usually manifests as multiple medical insurance cards being consumed in the same place too frequently at the same time when the patient is in the hospital. This phenomenon of medical agglomeration may indicate a tendency to violate the rules: one person holding multiple medical insurance cards may consume several cards for one treatment. Therefore, the manifestation of medical agglomeration behavior can be simplified as a kind of consistency: multiple medical insurance cards being used in the same hospital at the same time. This kind of consumption behavior can be regarded as an anomaly if it is too frequent, and should then be supervised.
Definition 1: The two core data in the medical record are the visit time and the visit place. In our model, we use one day as the unit of consultation time, so let $d$ be the set of visit times and $l$ be the set of visit places. The two together form a medical visit matrix, so the medical records of each medical insurance card $MC_i$ can be expressed as a matrix whose entries are indexed by a visit place $l_m$ and a visit day $d_n$. According to Definition 1, a medical gathering behavior can be expressed as follows: in the medical database (MDB), $\{MC_1, MC_2, \cdots, MC_i\}$ have the same value on $l_m d_n$. This leads to the following definition.
Definition 2: Let $S$ be the medical behavior matrix composed of $\{MC_1, MC_2, \cdots, MC_i\}$. Then $S$ is a subset of MDB; the same row in $S$ represents the medical records of MC every day, and the same column represents the medical records of different MC on the same day. According to Definition 1, if there is a medical gathering behavior among $\{MC_1, MC_2, \cdots, MC_i\}$, each row of $S$ has the same value.
Let the parameter min_row be the minimum number of rows and the parameter min_column be the minimum number of columns. If the number of rows S.row of the matrix $S$ is not less than min_row, $S$ is considered to be a frequent pattern, and if the number of columns S.column of $S$ is not less than min_column, $S$ is considered to be abnormal. Mining medical gathering behavior requires finding all abnormal matrices $S$.
Definition 3: Mining the medical gathering behavior: for a given min_row and min_column, find all matrices $S$ that simultaneously meet the following conditions: 1) S.row ≥ min_row; 2) S.column ≥ min_column.
Definition 3 transforms the aggregation behavior mining problem into a frequent pattern mining problem. The distributed aggregation behavior mining algorithm for the big data environment is introduced in detail below to solve this problem.
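The two conditions of Definition 3 can be written as a small predicate. The sketch below is illustrative only: `BehaviorMatrix` and `is_gathering` are hypothetical names, and the matrix is reduced to its two sizes, with S.row read as the number of co-occurrences and S.column as the number of cards, following the way these values are interpreted in the experimental section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorMatrix:
    """Only the two sizes of a medical behavior matrix S are kept here."""
    row: int     # S.row: how many times the cards appear together
    column: int  # S.column: how many cards appear together


def is_gathering(s: BehaviorMatrix, min_row: int, min_column: int) -> bool:
    # Definition 3: both thresholds must hold at the same time.
    return s.row >= min_row and s.column >= min_column


# Thresholds used later in the experiments: min_row = 5, min_column = 2.
candidates = [BehaviorMatrix(11, 2), BehaviorMatrix(5, 4), BehaviorMatrix(6, 1)]
flagged = [s for s in candidates if is_gathering(s, min_row=5, min_column=2)]
```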

B. MEDICAL AGGREGATION BEHAVIOR MINING METHOD BASED ON DISTRIBUTED COMPUTING
The vast majority of databases have a horizontal data format $\{ID, MC, l, d, \cdots\}$. The distributed medical aggregation behavior mining method first scans the original medical database, deletes data that does not meet the requirements, and then transposes the format to generate a vertical data format. The first-order matrix is used to generate the second-order matrix, and so on. Each new matrix is intersected with the first-order matrix to generate a higher-order matrix until no new matrix is generated.
In the original medical database, data whose S.column is less than min_column cannot exist in a pattern with S.column greater than or equal to min_column. Therefore, when scanning the original medical database, the algorithm deletes this part of the data first. Then the MCs with the same $l$ and $d$ are merged and transposed, so that the medical database format is converted to $\langle l_i d_j, [MC_1, MC_2, \cdots, MC_i] \rangle$.
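The pruning and transposition step can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the authors' MapReduce job: `to_vertical` is a hypothetical name, records are assumed to be (card, place, day) tuples, and a (place, day) group is dropped when it holds fewer than min_column distinct cards.

```python
from collections import defaultdict


def to_vertical(records, min_column):
    """Convert horizontal (card, place, day) records to the vertical format.

    Returns a dict mapping (place, day) to the sorted list of cards used there
    on that day, keeping only groups with at least min_column distinct cards.
    """
    groups = defaultdict(set)
    for card, place, day in records:
        groups[(place, day)].add(card)
    return {
        key: sorted(cards)
        for key, cards in groups.items()
        if len(cards) >= min_column
    }


records = [
    ("MC1", "H1", "2017-01-03"),
    ("MC2", "H1", "2017-01-03"),
    ("MC1", "H1", "2017-01-03"),  # duplicate visit, counted once
    ("MC3", "H2", "2017-01-03"),  # a lone card cannot be part of a gathering
]
vertical = to_vertical(records, min_column=2)
```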
During cross calculation, the S.row can be calculated at the same time, so it is no longer necessary to repeatedly scan the entire medical database.
In addition, since most of the $l_i d_j$ values are 0, the efficiency will be greatly improved compared to other frequent pattern mining algorithms. After the transposition is completed, the algorithm can generate the matrix $S_1$ simultaneously.
After generating $S_1$, higher-order matrices can be continuously generated through crossover operations.

For $S_i$, $S_{i+1}$ can be generated by the self-connection of $S_i$. But for $\forall S \subseteq S_{i+1}$, if $S$ can be generated by the self-connection of $S_i$, then $S$ can also be generated by connecting $S_i$ with $S_1$. Therefore, connecting $S_i$ and $S_1$ to generate $S_{i+1}$ can replace the self-connection operation of $S_i$. Since $S_i$ contains a large amount of data, this method can greatly reduce the number of cross operations.
In addition, according to the nature of association rule mining, any sub-pattern of a frequent pattern is frequent, so $S_i$ can be pre-expanded to $S_{i+1}$ before the cross operation. If $S_{i+1}$ contains an infrequent $S_i$, $S_{i+1}$ can be deleted.
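The level-wise generation of $S_{i+1}$ from $S_i$ and $S_1$ can be illustrated with a single-process sketch. It is written under assumptions and is not the paper's Algorithm 1 to Algorithm 3: `mine_gatherings` is a hypothetical name, a pattern is represented as a sorted tuple of cards together with the set of (place, day) keys they share, S.row is taken as the size of that set, and S.column as the number of cards. The sketch prunes only by the min_row threshold after each intersection and does not implement the pre-expansion check described above.

```python
from collections import defaultdict


def mine_gatherings(vertical, min_row, min_column):
    """Level-wise mining: extend each pattern in S_i with one card from S_1.

    vertical maps (place, day) to a list of cards. The result maps a sorted
    tuple of cards to the set of (place, day) keys shared by all of them.
    """
    # S_1: every card with the set of (place, day) keys where it appears.
    s1 = defaultdict(set)
    for key, cards in vertical.items():
        for card in cards:
            s1[card].add(key)
    s1 = {card: keys for card, keys in s1.items() if len(keys) >= min_row}

    level = {(card,): keys for card, keys in s1.items()}
    results = {}
    while level:
        # Patterns with enough cards (S.column >= min_column) are reported.
        for cards, keys in level.items():
            if len(cards) >= min_column:
                results[cards] = keys
        next_level = {}
        for cards, keys in level.items():
            for card, card_keys in s1.items():
                if card <= cards[-1]:
                    continue  # extend in sorted order so each set is built once
                shared = keys & card_keys  # intersection gives the new S.row
                if len(shared) >= min_row:
                    next_level[cards + (card,)] = shared
        level = next_level
    return results


vertical = {
    ("H1", "d1"): ["A", "B"],
    ("H1", "d2"): ["A", "B", "C"],
    ("H2", "d3"): ["A", "B"],
    ("H1", "d4"): ["C"],
}
patterns = mine_gatherings(vertical, min_row=3, min_column=2)
```

Because every extension intersects an existing key set with one first-order key set, the database is never rescanned, which mirrors the reason given above for preferring the connection with $S_1$ over self-connection.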
We assign the scanning, construction, and generation results of each stage to the map subtask and reduce subtask. Algorithm 1 to Algorithm 3 are the pseudo-codes of each stage functions.

V. EXPERIMENT RESULTS
In this section, we present the experimental results using the proposed method. First, we describe the experimental environments and provide implementation details. Then, we demonstrate the effectiveness of this method by comparing with several benchmark methods. Finally, we show the practical usage of this method in real systems.

Algorithm 3 MRJoin
Input: $S$, S.row, $R_i\{l, d\}$, min_column, min_row
Output: Key: $S_{len+1}$, Value:
The experimental platform is a Hadoop analysis platform built from 14 servers. The detailed description of the nodes is shown in Table 2. The medical insurance data in the experiment are selected from the medical insurance claim system of the medical insurance administration department of a county in China, so there is almost no sparse input data in the data set. The data set covers the county's outpatient records for the past three years (January 2017 to October 2019). Table 3 shows an example of the original medical insurance record. After cleaning the data set, we obtained 1,574,775 outpatient reimbursement records from 151,679 patients.

B. RESULTS AND ANALYSIS
After detecting and analyzing this data set with our algorithm, we obtained 872,042 pieces of correlation data, whose quantity distributions are shown in Figure 3 to Figure 5. Figure 3 is "the quantitative distribution of the associated data over different S.row values", which can be regarded to some extent as a two-dimensional view of Figure 5. From Figure 3, we can see that as S.row increases, the number of associated data keeps falling and reaches a stable state after S.row = 12. A higher S.row means more simultaneous patient visits, fewer correlated data, and a higher possibility of fraud, which is in line with what we know. Figure 4 is "the quantitative distribution of the associated data over different S.column values". Similarly, it can be regarded to some extent as a two-dimensional view of Figure 5. From Figure 4, we can see that as S.column increases, the number of associated data first increases and then decreases, symmetrically around the peak. Likewise, a higher S.column means that more patients go to the clinic at the same time and a higher possibility of fraud, which also fits our perception. Of course, analyzing the associated data by S.row or S.column alone is not comprehensive enough; we need to combine them. Figure 5 is the distribution of the number of associated data under different values of S.row and S.column. Compared with Figure 3 and Figure 4, the distribution of the associated data in Figure 5 changes somewhat, because S.row and S.column influence each other. Although an increase of S.row or S.column means a higher possibility of fraud, Figure 5 shows that they do not increase at the same time, so we need to find a balance point for min_row and min_column.
For further analysis, we need to limit min_row and min_column. When min_row = 5, the related records in the data set are reduced to 719. Then we remove the records of a single medical insurance card, that is, we set min_column = 2. The data set then contains 291 pieces of data, whose quantity distribution is shown in Figure 6. We performed frequent itemset (S.row) analysis and support (S.column) analysis on the 291 pieces of data, and obtained the following findings: 1) Among the 291 pieces of data, when min_column = 2, there are 18 pieces of data with S.row ≥ 7, and the maximum S.row is 11. That is, during the three years from January 2017 to October 2019, two medical insurance cards appeared in the same hospital at the same time seven or more times, and this phenomenon occurred 18 times. At most, two medical insurance cards appeared 11 times in the same hospital at the same time, so it is reasonable to believe that this is not an accidental phenomenon, but more likely to be fraud. 2) Among the 291 pieces of data, the maximum S.column is 4. There are four such pieces of data, and all of them have S.row = 5. That means that during the three years from January 2017 to October 2019, four medical insurance cards appeared five times in the same hospital at the same time, and this phenomenon occurred four times. Since S.column is large enough, we have reason to believe that it is likely to be fraud.
In this method, S.row and S.column in the medical behavior matrix S explain the anomaly from two different angles. Of course, data analysis can detect anomalous data, but it cannot establish that an anomaly is fraud. It only indicates a possibility, and the greater the S.row or S.column, the higher the possibility. Although in many cases professionals such as doctors or government regulators are still needed to determine whether an anomaly is really fraud, the anomaly data detected by the algorithm give them a meaningful starting point.
As in reference [9], we apply the classic frequent itemset mining algorithms Apriori [33], Eclat [34] and BP-Growth [35] to the frequent pattern mining problem in this paper and compare them with our method. When min_row = 2, the running time of these methods under different data amounts is shown in Figure 7.
The experimental results show that under the same S.row condition, Apriori has the longest running time, because it needs to scan the database repeatedly. It also generates many candidate sets in the process of pattern growth, which requires repeated crossover operations. Compared with Apriori, BP-Growth adopts a tree structure and can directly obtain frequent sets without generating candidate frequent sets, which greatly reduces the number of scans of the transaction database, so it is more efficient. However, it is not suitable for parallel computing. Although Eclat also mines data in vertical format, our method preprocesses the data before transposing the format and deletes data that does not meet the requirements in advance. It also uses the connection operation between $S_i$ and $S_1$ instead of the self-connection of $S_i$, so it is more efficient.
Figure 8 shows the performance of DCMMAB against the other approaches. We make several observations from Figure 8 that confirm our research motivation. Due to the extremely low percentage of positive data, the performance of the Apriori method leaves room for improvement, as most fraudsters will try their best to bypass routine detection rules. The BP-Growth method has high precision but low recall because there are few behavior patterns in the crowd. Eclat can hardly find meaningful frequent itemsets from the whole crowd because of the curse of cardinality. In contrast, our DCMMAB method improves the precision by more than 20%, which shows that our approach can effectively reduce false positives. Our method also performs better on other metrics. For example, its recall is 15% higher than that of the Eclat method.
As a result of high precision and high recall, when these two metrics are combined together to form the f-measure shown in Figure 8, DCMMAB consistently beats the comparison approaches in the experiments. On average, DCMMAB outperformed the comparison method by more than 10% on the F-measure.
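For reference, precision, recall, and the F-measure can be computed from a set of flagged records and a set of known fraudulent records as in the sketch below. The function name `precision_recall_f1` and the example identifiers are hypothetical, and the F-measure is assumed here to be the balanced F1 score; the values bear no relation to the results reported in Figure 8.

```python
def precision_recall_f1(flagged, fraudulent):
    """Compare flagged records with known fraudulent records."""
    flagged, fraudulent = set(flagged), set(fraudulent)
    true_positives = len(flagged & fraudulent)
    # Precision: share of flagged records that are really fraudulent.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: share of fraudulent records that were flagged.
    recall = true_positives / len(fraudulent) if fraudulent else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up identifiers: four flagged records, three fraudulent ones, two shared.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low, which is why a method with high precision but low recall does not score well on it.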
Experimental results show that our method has better performance for medical insurance fraud of medical aggregation behavior.

VI. CONCLUSION
In this paper, we give a definition of medical aggregation behavior and propose an effective method of fraud identification. The method, DCMMAB, combines the MapReduce distributed computing model and association rule mining to detect abnormal behaviors in the medical insurance reimbursement process. We use a real dataset from a county's medical insurance system in China, which contains 1.5 million records of medical claims activity from 150,000 users. Experimental results show that as the amount of data increases, the performance advantages of this method become more obvious, which can significantly improve the accuracy of fraud detection. More specifically, our DCMMAB is better than the comparison methods by more than 20% in precision.
At present, this method has been integrated into the medical big data analysis platform to provide decision support for auditors in the medical insurance claims system to assess the possibility of fraud. In subsequent research, we will focus on the differences between different diseases, and explore the potential links between disease types and medical behaviors.