A Statistical Model to Detect DRG Outliers

This study aims to develop a statistical model to detect both high and low outlier cases in terms of diagnosis-related group (DRG) distributions. A data set containing five DRGs with 458 patient cases was selected for the study. The distributions of DRG cost and length of stay (LOS) are examined first: all the distributions of DRG costs are lognormal, whereas the distributions of LOS are neither lognormal nor normal. A statistical model referred to as LM is set out for outlier detection in terms of the lognormal distributions of DRG costs. The LM algorithm is compared with the geometric mean (GM), inter-quartile (IQ) and L3H3 algorithms. LM has the highest statistics for Accuracy, Kappa coefficient, Sensitivity and Youden’s index. In addition, LM has the largest area under the ROC curve (AUC). We find that LM is a superior method to detect both low and high outliers for DRG costs, thereby improving the efficiency and effectiveness of DRG prospective payment systems and the equity of healthcare.


I. INTRODUCTION
To curb the rapid growth of health expenditure, countries are increasingly switching from reimbursing hospitals for their costs to diagnosis-related group prospective payment systems (DRG-PPS). DRGs are diagnosis-related groups of patients that have similar patterns of resource use and are clinically meaningful [1]. Hospitals are paid a fixed price for each patient treated in a DRG [2]. DRGs can also contain some treatment episodes (cases) with resource use much higher (or lower) than the average case; these are called outlier cases.
A DRG-PPS essentially pays hospitals the regional average cost in each DRG for each patient admitted to the hospital in that group. If DRG weights were calculated from the average costs of all patients within a DRG, including the outlier cases, hospitals would be overpaid for the majority of patients. Furthermore, if outlier cases were not paid for separately, hospitals would have particularly strong incentives to avoid these high-cost cases or to discharge them inappropriately early. Insurers therefore pay, in addition to the regular DRG payment, 'outlier' payments for especially long or expensive cases. These payments mitigate problems of access and under-provision of care for very sick patients by providing additional payments to the hospitals that take care of them, thereby making payments to hospitals more equitable. Outlier payments can be viewed as insurance against excessive losses on a case: they give additional money to hospitals that treat sicker and more expensive patients than average, and they reduce the access problem for patients whom hospitals can identify as likely to need very expensive treatment.
The associate editor coordinating the review of this manuscript and approving it for publication was Senthil Kumar.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Consequently, the detection of outliers is crucial for hospitals to provide services equitably, avoid undertreatment and improve the quality of healthcare services. One approach, ''L3H3'' [3], [4], defines outliers based on the average length of stay (ALOS) for each DRG; cases with LOS less than one-third of the State (or region) ALOS are defined as low outliers, and those with LOS more than three times the State ALOS are defined as high outliers. Alternatively, an algorithm referred to as inter-quartile (IQ) sets the threshold for high resource-use outliers at ''75th percentile + 1.5 × inter-quartile range'' of DRG cost, and the threshold for low outliers at ''25th percentile − 1.5 × inter-quartile range'' of DRG cost [5]-[7]. Another algorithm, referred to as GM3, uses the geometric mean plus three standard deviations (SDs) of DRG cost, and GM2, proposed by Cots et al. [8], uses the geometric mean plus two SDs of DRG cost. For the sake of brevity, we refer to GM2/GM3 as GM. GM considers only the high outlier cases even though low outlier cases occur alongside them, so it is incomplete for outlier detection or outlier trimming. Moreover, centring GM on the geometric mean rather than the arithmetic mean implies that the DRG cost is not assumed to be normally distributed, whereas using the standard deviation as the range of variation implies that the DRG cost is assumed to be normally distributed. In other words, the two premises of GM are contradictory. We propose a new approach for outlier detection, referred to as LM, in which both the centre and the range of variation are derived from the lognormal distribution of DRG cost. This study contributes to the current literature of outlier detection in DRGs in the following ways.
(i) A new statistical model is proposed to detect both high outlier cases and low outlier cases in terms of the probability distributions of DRGs; (ii) the distributions of DRG cost and DRG LOS are examined, and the former is found to be lognormal; (iii) a general mathematical model for outlier detection is presented; (iv) real-world data illustrates our approach and the GM, L3H3 and IQ algorithms are compared and systematically investigated.
The paper proceeds as follows. Section 2 briefly presents the related works and points out their limitations. Section 3 introduces our approach and mathematical model for outlier detection. Section 4 presents an empirical application to demonstrate the proposed model's merits. Finally, concluding remarks are presented in the last section.

II. RELATED WORKS
Apart from the aforementioned outlier definitions, Burnett et al. [9] define outliers as cases in which lengths of stay exceed the mean by the lesser of 20 days or 1.94 SDs for spinal cord injury (SCI); Alameda and Suárez [10] define ''Medical outlier'' as a patient admitted to a ward different from the Internal Medicine ward since, due to the lack of beds in medical wards, many patients are placed in other departments' wards (usually in surgical wards).
Outlier detection is widely applied in the healthcare sector. Cots et al. [11] analyse patient discharge records of the acute public hospitals' Minimum Data Set in Catalonia (Spain) in 1998, and conclude that hospital structural level influences the presence of outliers. Spanish and Belgian studies have shown that high deficit cost outliers account for roughly 5% of high outlier cases producing 11-20% of inpatient costs [7], [11]. Pirson et al. [7] define outliers using the IQ and show that the probability of a patient being a high resource use outlier was higher with an increase in the length of stay, when the patient was treated in an intensive care unit, with a major or extreme severity of illness. The outliers defined by the IQ are used to select nursing activity outliers and LOS outliers respectively by DRG and by the severity of illness (SOI) [12]. Ijaz et al. [13] develop a cervical cancer prediction model (CCPM) that removes outliers by using outlier detection methods such as density-based spatial clustering of applications with noise (DBSCAN) and isolation forest (iForest) and by increasing the number of cases in the dataset in a balanced way to offer early prediction of cervical cancer. Ijaz et al. [14] propose a Hybrid Prediction Model (HPM), consisting of Density-based Spatial Clustering of Applications with Noise (DBSCAN)-based outlier detection to remove the outlier data, Synthetic Minority Over-Sampling Technique (SMOTE) to balance the distribution of class, and Random Forest (RF) to classify the diseases, which can provide early prediction of type 2 diabetes (T2D) and hypertension based on input risk-factors from individuals. Srinivasu et al. [15] propose a computerized process of classifying skin disease through deep learning based MobileNet V2 and Long Short Term Memory (LSTM). Ali et al. [16] propose a smart healthcare system for heart disease prediction using ensemble deep learning and feature fusion approaches.
The algorithms of outlier detection are continually developed. Vidmar et al. [17] introduce regression-through-origin based control limits to identify outliers in healthcare quality monitoring. Vidmar and Blagus [18] develop control limits for a double-square-root chart on the basis of prediction intervals from regression-through-origin by adjusting the confidence level and transforming the chart into an asymmetric funnel plot. Bauder and Khoshgoftaar [19], [20] provide a general outlier detection model, based on Bayesian inference, using probabilistic programming to detect claims fraud in Medicare medical insurance payments. Hauskrecht et al. [21] exploit the support vector machine (SVM) with a linear kernel to learn a probabilistic model that predicts future clinical care actions from the patient-state features to identify any unusual clinical actions in the electronic health records (EHRs) of a patient, and Hauskrecht et al. [22] apply it to a separate test set of 8158 ICU patient cases to generate alerts. Massi et al. [23] employ k-means clustering of providers to identify locally consistent and locally similar groups of hospitals to detect fraud such as upcoding in Hospital Discharge Charts (HDC) databases.
Outliers play a vital role in the DRG system. Keeler et al. [24] simulate per-case payments for a policy that did not include any outlier payments, the current outlier policy, and several other policies that minimize risk subject to different coinsurance constraints, and show that the outlier policy achieves each of its goals to at least some extent. Felder [25] studies the adequacy of the outlier threshold rule for risk-averse hospitals with preferences depending on the expected value and the variance of profits. A comparative static analysis reveals that the optimal outlier threshold decreases with an increase in the hospital's degree of absolute risk aversion as well as with an increase in the per diem cost of treatment. Freitas et al. [26] calculate LOS outliers for each diagnosis-related group (DRG) and show that age, type of admission, and hospital type are significantly associated with high LOS outliers in the Portuguese National Health Service. Mehra et al. [27] exploit an interquartile range of case earning (i.e., DRG income − total cost) with LASSO regularized logistic regression to predict the outliers of high profit and high deficit under SwissDRG of a tertiary care centre.
The provision of anthroposophic medical complex (AMC) leads to a prolonged length of stay and cannot be adequately reimbursed by the current G-DRG system and it is suggested an additional payment should be negotiated individually due to the heterogeneity of the patient population [28].
A rich literature introduces the algorithms of general outlier detection, but little literature has systematically discussed and compared the approaches of DRG outlier detection.

III. METHODOLOGY

A. FRAMEWORK
This section describes the block diagram and steps of the process. As shown in Fig. 1, there are various components involved in the proposed framework. The proposed work focuses on outlier detection in terms of DRG distributions. The process is partitioned into three phases: examining the distributions of DRG cost, modeling for outlier detection in terms of DRG distributions, and analysing the algorithm of outlier detection.

1) DATA INPUT
The required DRG data contains at least the following four fields: case ID, DRG code, cost and length of stay (LOS). The patient cases are grouped by DRG code.
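The data input step can be sketched as follows. This is only an illustration: the class name `Case`, its field names and the three records are hypothetical and are not taken from the study data set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One patient case with the four required fields (hypothetical names)."""
    case_id: str
    drg: str      # DRG code
    cost: float   # cost in Yuan
    los: int      # length of stay in days


def group_by_drg(cases):
    """Return a dict mapping each DRG code to the list of its cases."""
    groups = defaultdict(list)
    for case in cases:
        groups[case.drg].append(case)
    return dict(groups)


# Made-up records for illustration only.
cases = [
    Case("c1", "BY13", 8200.0, 9),
    Case("c2", "BY13", 15400.0, 14),
    Case("c3", "DT10", 3100.0, 4),
]
groups = group_by_drg(cases)
```

Grouping first keeps every later step (distribution test, threshold computation, evaluation) a per-DRG operation on a plain list of cases.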

2) DISTRIBUTION TEST
We examine the DRG distributions first to build up the appropriate statistical model. As some researchers assert the distribution of DRG cost is lognormal [7], [29], we focus on examining whether it is plausible.

3) MODELING FOR OUTLIER DETECTION IN TERMS OF DRG DISTRIBUTIONS
Based on our empirical test, a statistical model is developed in terms of the distributions of DRG costs.

4) ANALYSIS OF THE ALGORITHMS
Outliers detected by different algorithms are usually inconsistent; it is therefore essential to analyse the performance of each outlier detection algorithm.

B. MODELING
Let G be the proposed DRG system, which is defined as

$$G = (I, A, \{V_a \mid a \in A\}, F) \tag{1}$$

where $I = \{i_1, i_2, \ldots, i_n\}$ is a finite nonempty set of objects, namely the set of all inpatients; $A$ is a finite nonempty set of attributes; $V_a$ is a nonempty set of values for $a \in A$; and $F: I \to O$ is the set of functions. For example, $F_1 = f_1(I \to O_1)$ for the computation of L3H3, or $F_2 = f_2(I \to O_2)$ for the computation of GM3, and so on.
Each information function F is a total function that maps an object of I to exactly one value in $V_a$. An information table represents all available information and knowledge; that is, objects are perceived, observed, or measured only through a finite number of properties.

C. LM MODEL
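Assuming X is the cost of a DRG, we consider two cases.

Case A: X follows a normal distribution. Under the normal distribution assumption, the mean $\bar{x}$ and standard deviation SD are as follows:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2}$$

$$SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{3}$$

Case B: X follows a lognormal distribution. Under the lognormal distribution assumption, let $Z = \ln(X) \sim N(\mu, \sigma^2)$. The probability density function of X is:

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0 \tag{4}$$

Then the geometric mean GM, the expected mean $E(X)$ and the expected standard deviation $\sqrt{\mathrm{Var}(X)}$ of the lognormal X are as follows:

$$GM = e^{\mu} \tag{5}$$

$$E(X) = e^{\mu + \frac{1}{2}\sigma^2} \tag{6}$$

$$\sqrt{\mathrm{Var}(X)} = e^{\mu + \frac{1}{2}\sigma^2}\sqrt{e^{\sigma^2} - 1} \tag{7}$$

The quantile of the lognormal distribution is [30]:

$$Q_{x(\varphi)} = e^{\mu + \sigma q_{(\varphi)}} = e^{\mu} e^{\sigma q_{(\varphi)}} \tag{8}$$

where $Q_{x(\varphi)}$ is the $\varphi$-th percentile of the lognormal distribution and $q_{(\varphi)}$ is the $\varphi$-th percentile of the standard normal distribution. Proofs of Eqs. (5)−(8) are provided in the Appendix.

Spanish and Belgian studies have shown that high deficit cost outliers account for roughly five percent of cases but produce 11-20 percent of inpatient costs [7], [11]. Other studies suggest that nearly five percent of cases fall in a high outlier category except for very low-volume groups, with a recommendation that the percentage should not exceed ten percent [8], [31]. Thus, the significance level α = 0.10 is reasonable. Under the standard normal distribution, $q_{(\varphi)} = 1.64$ at α = 0.10 (two-sided, i.e., five percent in each tail). Thus, we have the following definition, and we refer to this algorithm as LM:

$$\text{HT of LM: } H_{LM} := \{x \mid x \ge e^{\mu + 1.64\sigma}\} \tag{9}$$

$$\text{LT of LM: } L_{LM} := \{x \mid x \le e^{\mu - 1.64\sigma}\} \tag{10}$$

where $e^{\mu + 1.64\sigma}$ is referred to as the high threshold (HT) for high outlier detection of LM, and $e^{\mu - 1.64\sigma}$ is referred to as the low threshold (LT) for low outlier detection of LM.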
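A minimal sketch of the LM thresholds may make the definition concrete. Several details here are assumptions rather than statements from the paper: σ is estimated with the sample standard deviation (n − 1 denominator), cases exactly on a threshold are counted as outliers, the function names are hypothetical, and the cost values are made up.

```python
import math
import statistics


def lm_thresholds(costs, q=1.64):
    """Return (LT, HT) = (exp(mu - q*sigma), exp(mu + q*sigma)) on log costs."""
    logs = [math.log(x) for x in costs]
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)  # assumption: sample SD with n - 1
    return math.exp(mu - q * sigma), math.exp(mu + q * sigma)


def lm_classify(costs, q=1.64):
    """Label each cost as 'high', 'low' or 'inlier' under LM."""
    lt, ht = lm_thresholds(costs, q)
    labels = []
    for x in costs:
        if x >= ht:
            labels.append("high")
        elif x <= lt:
            labels.append("low")
        else:
            labels.append("inlier")
    return labels


# Made-up costs in Yuan for one DRG; the last case is deliberately extreme.
costs = [4900, 6600, 8100, 8100, 9000, 9900, 8100, 7300, 9000, 160000]
lt, ht = lm_thresholds(costs)
labels = lm_classify(costs)
```

Because both thresholds are exponentials, LT is always positive, which is what lets LM flag low-cost cases at all.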

IV. CASE STUDY

A. DATA DESCRIPTION
To illustrate our proposed approach, we arbitrarily select five DRGs with 458 patients from a hospital in August 2017. The hospital implements CN-DRG, developed by the Chinese government, with a total of 787 DRGs. The data set supporting the findings and conclusions of this study has been uploaded to a data repository, IEEE Dataport [32]. Cost and LOS are the important attributes of DRG systems, that is, A = {Cost, LOS}. In this study, we have used the average length of stay and total billings for the episode as proxy measures for real cost and therefore resource use. The costs are measured in Yuan and LOS in days. The selected five DRGs are BR23, BY13, DT10, ES10 and PV13, and their descriptions and basic descriptive statistics are listed in Table 1.

B. EXAMINATION OF DISTRIBUTIONS
To observe the distributions of the DRGs, we estimated their densities with kernel density estimation (KDE). KDE was run in R, whose default bandwidth for a Gaussian kernel density estimator follows a rule of thumb: 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34, multiplied by the sample size to the negative one-fifth power [33] (p. 48). Figs. 2 and 3 show the kernel density estimations for DRG cost and DRG LOS, respectively. The shapes look like normal or lognormal distributions, but a rigorous statistical test is more reliable. As mentioned, a number of researchers have asserted that the probability distributions of DRG cost and LOS are lognormal [7], [29]. There are two common approaches to examining a probability distribution: one is a χ² test, the other is a normality test with data transformation. To avoid repetition, we only demonstrate the latter.
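The rule of thumb quoted above can be written out as a short sketch. The study itself used R; the Python functions below are hypothetical illustrations, the quartile convention (`method="inclusive"`) is an assumption, and the fallback for a zero spread is a guess at sensible behaviour rather than something stated in the text.

```python
import math
import statistics


def rule_of_thumb_bandwidth(x):
    """0.9 * min(SD, IQR / 1.34) * n ** (-1/5) for a Gaussian kernel."""
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    if spread == 0:
        # Assumed fallback so the bandwidth never collapses to zero.
        spread = sd or abs(x[0]) or 1.0
    return 0.9 * spread * len(x) ** -0.2


def gaussian_kde(x, grid, h):
    """Evaluate a Gaussian kernel density estimate with bandwidth h on grid."""
    norm = 1.0 / (len(x) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in x)
        for g in grid
    ]


sample = [1, 2, 3, 4, 5]              # toy data, not study data
h = rule_of_thumb_bandwidth(sample)
density = gaussian_kde(sample, [3.0], h)
```

Taking the minimum of the two spread measures keeps a few extreme costs from inflating the bandwidth and over-smoothing the density.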
Common normality tests include the Jarque-Bera test, among others; they differ in the characteristics of the normal distribution they focus on, such as which moments (skewness, kurtosis) they examine. Razali and Yap [34] concluded that the Shapiro-Wilk test has the best power for a given significance among these tests. Therefore, we select the Shapiro-Wilk test for the distribution test. Table 2 shows the results of the normality test for DRG costs and LOSs without and with logarithm transformation. As shown in rows 2 and 3, normality of the cost and LOS is rejected at the significance level α = 0.05. As shown in row 4, after logarithm transformation we cannot reject the hypothesis that the DRG costs are lognormally distributed for any DRG at α = 0.05. However, as shown in row 5, lognormality of DRG LOS is rejected at α = 0.05 for every DRG except PV13.

C. ANALYSES OF ALGORITHMS
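After concluding that all the DRG costs follow a lognormal distribution, LM can be applied to the DRG costs to detect the outlier cases in terms of the lognormal distribution of DRG costs. To compare LM with the other algorithms, we redefine GM2/GM3 to enable them to detect low outliers as follows:

$$\text{HT of GM2: } \{x \mid x \ge e^{\mu} + 2SD\} \tag{11}$$

$$\text{LT of GM2: } \{x \mid x \le e^{\mu} - 2SD\} \tag{12}$$

$$\text{HT of GM3: } \{x \mid x \ge e^{\mu} + 3SD\} \tag{13}$$

$$\text{LT of GM3: } \{x \mid x \le e^{\mu} - 3SD\} \tag{14}$$

Therefore, $e^{\mu} + 2SD$ and $e^{\mu} - 2SD$ are referred to as the HT and LT of GM2, respectively; analogously, $e^{\mu} + 3SD$ and $e^{\mu} - 3SD$ are referred to as the HT and LT of GM3, respectively. Note that $e^{\mu}$ is the geometric mean.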
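The redefined GM thresholds can be sketched as below, assuming SD is the sample standard deviation of the untransformed costs; the function name and the cost values are hypothetical. The example shows how a single very expensive case can push the low threshold below zero.

```python
import statistics


def gm_thresholds(costs, k):
    """Return (LT, HT) = (GM - k*SD, GM + k*SD), with k = 2 or 3."""
    gm = statistics.geometric_mean(costs)  # equals exp(mean of log costs)
    sd = statistics.stdev(costs)           # assumption: sample SD of raw costs
    return gm - k * sd, gm + k * sd


# Made-up costs in Yuan; the last case is deliberately extreme.
costs = [4900, 6600, 8100, 8100, 9000, 9900, 8100, 7300, 9000, 160000]
lt2, ht2 = gm_thresholds(costs, k=2)
lt3, ht3 = gm_thresholds(costs, k=3)
```

Since costs cannot be negative, a negative LT means no case can ever be flagged as a low outlier under that rule.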
Let $Q_1$ denote the first quartile, $Q_2$ the second quartile and $Q_3$ the third quartile. The IQ algorithm can be expressed as follows:

$$\text{HT of IQ: } \{x \mid x \ge Q_3 + 1.5(Q_3 - Q_1)\} \tag{15}$$

$$\text{LT of IQ: } \{x \mid x \le Q_1 - 1.5(Q_3 - Q_1)\} \tag{16}$$

Let $\tau$ stand for LOS and $\bar{\tau}$ stand for the average LOS; then L3H3 can be expressed as follows:

$$\text{HT of L3H3: } H3 := \{\tau \mid \tau \ge 3\bar{\tau}\} \tag{17}$$

$$\text{LT of L3H3: } L3 := \{\tau \mid \tau \le \bar{\tau}/3\} \tag{18}$$

To illustrate the LM algorithm, we take the DRG BY13 as an example. There are 33 cases of BY13 in total. Table 2 shows that the distributions of DRG costs cannot be rejected as lognormal. Thus, we take logarithms of the costs, then calculate the mean of the log costs, $\mu = \frac{1}{n}\sum_{i=1}^{n}\ln(x_i) = 9.02$, and the standard deviation of the log costs, $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(\ln(x_i) - \mu)^2} = 0.865$. Hence, we have the HT and LT of LM for BY13 as (19) and (20), respectively:

$$HT = e^{\mu + 1.64\sigma} = e^{9.02 + 1.64 \times 0.865} \tag{19}$$

$$LT = e^{\mu - 1.64\sigma} = e^{9.02 - 1.64 \times 0.865} \tag{20}$$

Table 3 displays the HT for high outlier detection and the LT for low outlier detection of the algorithms. Except for DT10, all the low thresholds for GM2 and IQ are negative; all the low thresholds for GM3 are negative. Because costs are never negative, a negative low threshold cannot flag any case, so the detection of low outliers does not work. In other words, only LM and L3H3 can detect both high and low outliers, while GM3, GM2 and IQ can in practice only detect high outliers.

Typically, L3H3 detects outliers only by LOS. Theoretically, LM, IQ and GM can detect outliers by cost or LOS, but in practice healthcare providers and insurers are more concerned about the DRG cost; thus LM, IQ and GM are applied only to the DRG cost in outlier detection.
Since the results of outlier detection by different algorithms are usually inconsistent, a better way to judge an algorithm is by the agreement between its result and the actual outlier status of each case. Table 4 is a 2 × 2 contingency table presenting the agreement and disagreement between detected and actual outliers and inliers for the different algorithms.
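The IQ and L3H3 thresholds and the BY13 arithmetic can be sketched as follows. The function names and the quartile convention (`method="inclusive"`) are assumptions. The BY13 figures are computed from the rounded values µ = 9.02 and σ = 0.865 given in the text, so they come out on the order of 34,000 and 2,000 Yuan and may differ slightly from the values reported in Table 3.

```python
import math
import statistics


def iq_thresholds(costs):
    """Return (LT, HT) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = statistics.quantiles(costs, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def l3h3_thresholds(alos):
    """Return (LT, HT) = (ALOS / 3, 3 * ALOS) for length of stay."""
    return alos / 3, 3 * alos


# BY13 worked example with the rounded mu and sigma from the text.
mu, sigma = 9.02, 0.865
by13_ht = math.exp(mu + 1.64 * sigma)
by13_lt = math.exp(mu - 1.64 * sigma)

# Toy inputs, not study data.
iq_lt, iq_ht = iq_thresholds([1, 2, 3, 4, 5])
los_lt, los_ht = l3h3_thresholds(9.0)
```

The toy IQ example already yields a negative low threshold, which mirrors the behaviour reported for most DRGs in Table 3.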

1) ACCURACY, PRECISION, SENSITIVITY AND SPECIFICITY
In this work, we consider standard evaluation parameters such as accuracy, precision, sensitivity and specificity.
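These measures, together with Youden's index reported later, can be computed from a 2 × 2 table as in the sketch below. The cell names (true positives, false positives, false negatives, true negatives) follow the common convention with "outlier" as the positive class; how they map onto the cells of Table 4 is an assumption, and the counts are made up.

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard 2 x 2 measures; 'positive' means 'flagged as outlier'."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # share of actual outliers detected
    specificity = tn / (tn + fp)   # share of actual inliers left unflagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
    }


# Made-up counts for illustration only.
m = detection_metrics(tp=8, fp=2, fn=4, tn=86)
```

Youden's index combines sensitivity and specificity into one number, which is useful when outliers are rare and accuracy alone is dominated by the inliers.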

2) KAPPA COEFFICIENT
The Kappa coefficient (κ) is popularly used to quantify the agreement between raters [35]. Kappa is a function of the observed and expected proportions of agreement, and its value lies in [−1, 1], where 1 indicates complete consistency and 0 indicates inconsistency or independence. A negative value means that the agreement is worse than random. In this study, Kappa is used to determine the degree of agreement between the results for detected and actual cases:

$$\kappa = \frac{PO - PC}{1 - PC}$$

where PO is the proportion of observed agreement (in Table 4: PO = (a + d)/T) and PC is the proportion of chance agreement (in Table 4: PC = (g1·h1 + g2·h2)/T²).
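A minimal sketch of the Kappa computation follows. The layout assumed here is that a and d are the agreeing cells, b and c the disagreeing cells, g1 and g2 the row totals, and h1 and h2 the column totals; this layout is an assumption about Table 4, and the counts are made up.

```python
def cohen_kappa(a, b, c, d):
    """Kappa = (PO - PC) / (1 - PC) for a 2 x 2 agreement table."""
    t = a + b + c + d
    g1, g2 = a + b, c + d          # assumed row totals
    h1, h2 = a + c, b + d          # assumed column totals
    po = (a + d) / t               # observed agreement
    pc = (g1 * h1 + g2 * h2) / t ** 2  # agreement expected by chance
    # Undefined (division by zero) when PC equals 1, i.e. when every case
    # falls in a single agreeing cell.
    return (po - pc) / (1 - pc)


# Made-up counts for illustration only.
kappa = cohen_kappa(a=8, b=2, c=4, d=86)
perfect = cohen_kappa(a=5, b=0, c=0, d=5)
```

Swapping b and c leaves the result unchanged, so the sketch does not depend on which disagreeing cell holds the false positives.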
The outliers detected by the LM, GM2/GM3, IQ and L3H3 algorithms are presented in the form of a contingency table in Table 5 for both high outlier and low outlier detection. LM has the highest number of correct outlier detections (37) and L3H3 the lowest (7). The accuracy, precision, Kappa coefficient, sensitivity, specificity and Youden's index for the algorithms are presented in Table 6. LM has the highest statistics for Accuracy, Kappa coefficient and Sensitivity, while GM3 has the highest Precision and Specificity. LM also has the highest Youden's index, indicating that, overall, LM is the best of these algorithms at detecting outliers.

To compare the performance of the algorithms visually, receiver operating characteristic (ROC) curves are exhibited in Fig. 4. LM clearly outperforms the others: it has the largest area under the ROC curve (AUC), which indicates the best performance, followed by GM2, IQ, GM3 and L3H3.

With respect to computer memory space and computation time, all of the LM, L3H3, GM2/GM3 and IQ algorithms first set the outlier thresholds and then judge the outliers against them. In this regard, the complexities of the algorithms are identical, namely O(2n), i.e., linear in the number of cases, for both high outlier and low outlier detection.

V. CONCLUSION
Resources are scarce and need to be properly distributed and clearly justified. Outliers have an influence on hospital costs and therefore should be considered in the financing of hospitals. It is important to be aware of this information for hospital planning and policy. Outlier cases lead to the overvaluation of the estimated mean cost of each DRG due to the lognormal distribution of the cost function. For cost analysis, it is essential to identify the effect of this overvaluation because each method yields different results. The designations of the multiples and fractions in L3H3, the quartile and multiples of inter-quartile range in IQ, and the multiples of standard deviation in GM are arbitrary. Among these approaches, L3H3 only detects outliers in terms of LOS, while IQ and GM fail to detect low outliers where the thresholds for low outliers are negative.
Given these shortcomings, the present paper sets out a statistical model to detect both low and high outliers in terms of the lognormal distribution of DRG cost, which is theoretically deduced and empirically verified. Compared with other approaches such as L3H3, IQ and GM, our proposed approach, with its mathematical modeling, algorithm derivation and parameter setting, is theoretically sound and more rigorous, so that the results of outlier detection are reliable. In addition, in terms of accuracy, sensitivity, Kappa coefficient and Youden's index, LM performs best among the compared algorithms at detecting outliers, which helps improve the efficiency of DRG payments and equitable treatment to advance the quality of healthcare. However, there are limitations to this study. Our approach assumes that the probability distribution of DRG cost is lognormal, an assumption that does not hold for DRG LOS, which is not lognormally distributed. Moreover, the small sample size of the illustrative example limits how fully the differences between these approaches can be demonstrated. Hence, developing new models for different probability distributions and examining our proposed approach on large-scale data are directions for future research.
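The expected standard deviation of X is:

$$\sqrt{\mathrm{Var}(X)} = e^{\mu + \frac{1}{2}\sigma^2}\sqrt{e^{\sigma^2} - 1}$$

Following Daly and Bourke [30], the quantile of the lognormal distribution is

$$Q_{x(\varphi)} = e^{\mu + \sigma q_{(\varphi)}} = e^{\mu} e^{\sigma q_{(\varphi)}} \tag{A10}$$

where $q_{(\varphi)}$ is the quantile of the standard normal distribution. This completes the proofs.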
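For readers who want the intermediate step behind the quantile formula, the following is a short sketch of the standard argument, assuming only that $Z = \ln(X) \sim N(\mu, \sigma^2)$ and that the logarithm is strictly increasing; it is not claimed to reproduce the authors' own proof.

```latex
\begin{align*}
\varphi = P\bigl(X \le Q_{x(\varphi)}\bigr)
  &= P\bigl(\ln X \le \ln Q_{x(\varphi)}\bigr)
   = P\!\left(\frac{Z-\mu}{\sigma} \le \frac{\ln Q_{x(\varphi)}-\mu}{\sigma}\right), \\
\frac{\ln Q_{x(\varphi)}-\mu}{\sigma} &= q_{(\varphi)}
  \quad\Longrightarrow\quad
  Q_{x(\varphi)} = e^{\mu+\sigma q_{(\varphi)}} .
\end{align*}
```

Setting $q_{(\varphi)} = \pm 1.64$ in the last expression gives the high and low thresholds of LM.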
SHUGUANG LIN is currently pursuing the Ph.D. degree with the School of Economics and Management, Fuzhou University. He is a Visiting Scholar with the Business School, The University of Auckland. His research interests include data mining, data envelopment analysis, performance evaluation, decision-making theory, and methodology with applications in health.
PAUL ROUSE is currently a Professor of management accounting with The University of Auckland. His research interests include performance and productivity measurement with specialist applications to health, productivity modeling, case mix funding models, and performance in hospitals and revenue and cost management systems.
YING-MING WANG is a Doctoral Supervisor and a Distinguished Professor of ''Yangtze River Scholar'' with the Ministry of Education of China. He is currently a Professor with the School of Economics and Management, Fuzhou University. His research interests include data envelopment analysis, rule base reasoning, decision-making theory, and methodology.
FAN ZHANG is currently a Chief Physician, a Professor, and a Master Supervisor. His main research interests include performance evaluation, data envelopment analysis, and performance in hospitals.