An Auto-Approval Approach for Laboratory Test Assessment

Background: Auto-approval (also known as autoverification) is the task of automatically evaluating the consistency of a test result through the laboratory information system rather than its manual evaluation by biochemists. Most existing auto-approval systems rely on a rule-based solution obtained from expert knowledge. However, producing a complete and general rule base for every single test type is a challenging issue. As a result, existing studies have relied on only a small subset of laboratory tests. Methods: The rule-based auto-approval process was re-investigated in this study, and the rules predetermined by human experts were utilized as a pre-filtering step for grouping laboratory test results by some common criteria. Subsequently, a machine learning-based approval method, smart-approval, was proposed to approve the tests more precisely. At this point, the expert knowledge in the rule-based pre-filtering was extended by imitating the experts' behavior in the smart-approval step. Two novel datasets (entitled the pilot and real-time datasets) containing human experts' responses to previously studied tests were used to train the machine learning models. Results: Experiments were conducted with several machine learning models on the pilot dataset to obtain trained models based on cross-validation. Here, the random forest classifier provided the best approval performance while also outperforming the approval success of existing studies in the literature. To observe the real-time performance of these trained models, they were also evaluated on unseen real-time data for four months. Here, random forest reaffirmed that it was the best approval model. Conclusions: The proposed auto-approval system relying on random forest provides convincing classification performance on both of the obtained datasets. With a correct approval rate of 98.51%, it surpasses many powerful approval methods in the literature.

concerns have been clearly explained in [6]. Accordingly, the instrumentation's capabilities in terms of validity of results, error flags, data management, and transmission cannot be ensured. Linking the LIS with an auto-approval procedure may be difficult for some analyzers, because the criteria for evaluating the results may vary according to the analysis and the tests. Therefore, a strict rule-based process is likely to fail unless it is fully tested and validated per test. Due to these limitations, automatic approval of laboratory test results has not been fully achieved in either academic studies or real-world LIS implementations. Indeed, a full auto-approval process that eliminates the concerns above should be further investigated.
The studies on auto-approval of laboratory test results are presented in Table 1 together with the approval method used and the approval success achieved. It is clear that the common practice of auto-approval approaches for laboratory test results has been to define a rule-based decision-making process considering some criteria in testing [12]-[14], [16], [18]. In [12], the reference range, measurement range, critical value limits, and auto-verification limits were considered to decide whether the corresponding test should be automatically approved. In [13], a limit check, delta check, essential limits of value, and consistency check were used; 80.0% of the tests were auto-approved. This approval ratio was increased to the range of [89.6%, 99.5%] in [14], where 11 criteria were used to determine the rule-based auto-approval procedure. The results are highly convincing in terms of showing the prediction capability of simple rules on auto-approval; however, more attention should be paid to measuring falsely approved tests to make fair comparisons. More recently, in [16], a rule-based auto-approval process was defined based on 31 biochemistry tests. Four specific criteria were determined to decide whether the corresponding test result should be auto-approved or not: Analytical Measurement Ranges (AMR), critical values, interference indices, and delta check. In [19], review rules were defined to be set in the expert software director for routine urinalysis on the AutionMAX-SediMAX. A dataset of 1002 urinalyses was used; for the complete rule set, the review rate was 47.6% and the efficiency for clinically significant sample selection was 58%.
In [10], 86.6% of laboratory tests were auto-approved by a statistical approach called LabRespond, which considers the plausibility of the test result, observed frequency, expected frequency, delta check, and sensitivity in its auto-approval decision-making procedure. Another statistical approach was proposed in [11]; the objective was to implement a Bayesian method to detect mismatched specimens using blood laboratory data, with LabRespond used as the baseline in comparisons. As the state of the art, Machine Learning (ML) has entered the auto-approval process, using historical data to learn what to approve among all the test results. In [15], an artificial neural network was used to make this decision using the same criteria as [13], and 78.4% of the test results were auto-approved. An expert-included evaluation was performed to observe the conflicts and agreements between the neural model and the human analyzer. The authors reported that the neural model approved 9 (13.8%) results that had been rejected by the human analyzers and rejected 145 (3.9%) results approved by the analyzer, meaning that the False Positive Rate (FPR) was 13.8% and the recall was 96.1%.
In an auto-approval system, two fundamental requirements should be fulfilled. First, the system should auto-approve as many tests as possible: the tests worth approving should not be rejected, so that the need for a human analyzer is minimized. Second, it should distinguish the fault-prone tests and avoid approving them. In other words, false negative (FN) decisions increase the overall cost of the process in terms of time and human resource usage, while false positive (FP) decisions are risky in terms of diagnosis and treatment. Indeed, both of these requirements are crucial, and there is a trade-off between them that makes the problem challenging. Even though the studies above have provided promising results, they tend to approve as many laboratory tests as possible, i.e., to maximize the approval ratio. However, none of them has reached a mature level of approval when evaluated in terms of both the auto-approval ratio and the capability of discriminating the fault-prone tests from the tests to be approved.
In the approval process of a human analyzer, (s)he first checks the result against some predefined criteria and approves the tests that look ordinary. (S)he approves the rest of the tests based on some other factors, such as the correlation between criteria, experience, and checking similar cases. Thus, the auto-approval task may be considered as a process instead of a one-step decision-making method. In other words, applying a rule-based approach to some criteria is necessary to distinguish, at a low level, the tests that should be approved. In these circumstances, the remaining tests are ambiguous with regard to approval, since a test that fails a rule-based algorithm may still be worth auto-approving, and even the human analyzer needs some other material to make the final decision, which indeed corresponds to expertise. To interpret these cases, it is necessary to extract the hidden relations of the criteria from the decisions previously made by experts. On the other hand, it is a well-accepted observation that approving the tests based on predefined criteria means that every result is subjected to the same review process [19]. In practice, however, the review and approval process may differ according to the test type and some other factors, since the correlation of the criteria would probably change as well. To that end, test-specific auto-approval should gather attention and, besides the general rule-based approval process, the correlation of the criteria and the hidden patterns between these criteria and the target approval decisions should be investigated. In computer science, these two points bring the subject to data-driven ML, which obtains these patterns by imitating expertise from historical data. ML also has the ability to customize, based on the relations in the data, for each individual test type. Since the task here is to approve or disapprove laboratory tests, the ML model's decision on the approval of the tests corresponds to binary classification.
Based on this foresight, the approval of laboratory test results has been handled by a smart decision-making process in this study. This process consists of a rule-based pre-filtering phase considering only the common criteria for auto-approval and a test-specific smart-approval phase based on ML. The idea behind this process is that the pre-filtering is needed to assess the individual impact of the criteria on the corresponding test's approval decision: it handles the auto-approval process by predefined rules considering common criteria, without specializing per test. It examines only the criteria that will cause a test to be definitely approved, and the tests that may or may not be approved over these criteria. These criteria are common to each laboratory test and encode expert opinion. Subsequently, the smart-approval aims to classify the tests as approve or disapprove. It investigates the tests that failed pre-filtering via an extended parameter space, and it extracts potential correlations between the parameters and the final decision, and among the parameters themselves. Here, an ML model decides whether the test should be auto-approved or not. Briefly, while the applied pre-filtering checks general expectations of a test result, the smart-approval brings personalized decisions for each test type individually into the approval process.
Different ML methods were implemented in the smart-approval to find the best-performing one for the problem. These methods are the Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Naive Bayes Classifier (NB), and Random Forest (RF). Additionally, a Fuzzy Rule-Based System (FRBS) was also implemented to bring an expert-based solution to the problem. SVM, MLP, NB, and RF are fully data-driven methods, whereas FRBS uses data only for the fuzzy rule generation task. In any case, labeled data was needed to train these models and evaluate their approval performance. Therefore, a pilot dataset was acquired containing the tests together with the experts' approval decisions.
Besides, a real-time dataset was also formed to observe the performance of the resulting auto-approval process, embedded in the LIS cycle, against newly incoming stream data. On this dataset, the decisions of the auto-approval process were observed and compared with the simultaneously arising human decisions. This process, deployed on the LIS, was monitored for four months, and the behavioral similarities and differences between the proposed method and the human analyzer were measured. At the end of these four months of observation, the results showed that the expert rules determined in pre-filtering cannot cover all conditions worth approval, and a large number of tests fail, which causes FN predictions to be high, as expected. The proposed smart-approval then correctly classifies the remaining tests in all of the studied test results for almost every type of utilized ML method. Since the solution here starts the auto-approval with common criteria in a rule-based step and extends the procedure with a test-specific ML approach, the hidden relations extracted by ML contribute precisely to the approval process for each test type. With the most outstanding results, the smart-approval including the RF classifier performed better than the smart-approval with the other ML approaches and the fuzzy rule-based approach. In total, it auto-approved more of the tests to be approved while also distinguishing the fault-prone tests more accurately.
The contributions of this paper are briefly listed below:
• An auto-approval process was proposed that uses expert knowledge both as rules in the pre-filtering step and as previous expert behavior to feed the machine learning models.
• To the best of our knowledge, this is the only study that handles the auto-approval problem with both test-common and test-specific approaches jointly. Moreover, some of the utilized ML solutions (RF, SVM, FRBS) were employed in the auto-approval problem for the first time.
• RF achieved the highest success while being applied for the first time in this field.
• Two novel datasets were presented containing human expert responses to the given test results.

II. DATA ACQUISITION
The data were obtained from the laboratory's data repository. As already mentioned, the proposed auto-approval process provides a test-specific solution to the problem; that is why the scope of this study has been limited to the most common test types. Utilizing the proposed auto-approval process, the solution can be easily extended to a different and even broader set of test types.
Containing the employed test types only, two different datasets were collected from the data repository and named the pilot and real-time datasets. These datasets have identical attributes but different values for those attributes. There are no intersecting records between them because they were collected in different time intervals. The pilot dataset was collected over 4 months (from Wednesday, 1st January 2020 to Friday, 1st May 2020). It was used for model training and performance testing of the ML methods, and it was also used as a benchmark dataset in comparisons to determine the superiority of the ML methods over one another. Subsequently, more recently performed tests, between Saturday, 2nd May 2020 and Wednesday, 12th August 2020, were acquired to form the real-time dataset. This dataset was prepared to measure the generalization and accuracy capabilities of the proposed auto-approval process in the unseen real-time data flow of the ALIS. It needs more attention during the performance evaluation since it directly gives preliminary information about the performance of a real-time implementation of the ALIS with the proposed auto-approval process. The details of these tests are presented in Table 2. In the first column, an ID is presented as a unique number for the corresponding test type, used in the following tables and analyses. The second column gives the national code determined by the government; SUT (Health Implementation Communique) is a legislative communique that guides, prices, and regulates the implementation of the state's health-related social policies and other implementation details. The third and fourth columns present the Logical Observation Identifiers Names and Codes (LOINC) of the utilized biochemical tests.
In Table 3 and Table 4, the general statistics of the pilot and real-time datasets are presented with the age and gender distribution of each utilized biochemical test. The total number of tests in the raw form of each dataset is presented in the second column. Subsequently, the number and percentage of approved and disapproved tests are given, respectively. The total number of tests in the pilot dataset is 1 048 000, where 1 022 966 of them were approved and the remaining 25 034 tests were disapproved by the expert. Regarding the real-time dataset, 1 052 743 of the 1 068 640 performed tests were approved by the experts, and the total number of disapproved tests was 15 897.
In Table 3 and Table 4, it can be clearly seen that the disapproved tests are in the minority. This means that the experts tend to approve the test results, and few tests were rejected or requested to be repeated. It indicates that the data does not have a balanced distribution considering the frequency of approved and disapproved tests. For example, the disapproval of the urea nitrogen test (Test ID = 2046) is so rare that there are only 10 fault-prone tests among 76 216 recorded tests, which would be misleading for any ML model. Therefore, several ML models were investigated and carefully selected among those robust to this problem. Additionally, the performance metrics have also been extended to cover both the tests to be approved and the fault-prone tests.

III. A SMART AUTO-APPROVAL MODEL FOR THE LABORATORY TEST RESULTS
In the proposed auto-approval process, the problem is handled in two steps, as seen in Figure 1: a rule-based pre-filtering step and a smart-approval step relying on ML methods.
In the first step of the proposed auto-approval process, pre-filtering, a rule-based decision-making process was defined to eliminate fault-prone test results based on some predefined criteria. The rules were provided by human experts to imitate their approval process, and they are the minimal, common rules that can be used to approve any kind of test. Indeed, they may not provide a complete approval scheme with high coverage. However, the main objective here is to provide a simple rule-based approach to distinguish the tests to be approved from the tests that may be faulty, and to leave the major approval task to the subsequently applied ML models.
In Table 5, the 11 criteria used in pre-filtering are presented with their short descriptions. These include (i) the test result, (ii) average median, (iii) has previous test result(s), (iv) repetition information, (v) delta check, and (vi) interference indices; the remaining criteria are listed in Table 5.
[Figure: The criteria/parameters used in the proposed auto-approval process. The topmost block represents the common criteria used in both pre-filtering and the proposed smart-approval, the second block represents the criteria that pre-filtering examines, and the criteria utilized only in smart-approval are presented in the third block.]
The pseudo-code of the rule-based pre-filtering is presented in Figure 2. Once a test result arrives, the algorithm first expects the moving average and average median criteria to be acceptable (Line 2). To that end, the average and median of the last 100 results of each test are calculated at regular intervals and recorded as the moving average and average median values, respectively. If the average of these results from the device is not within the expected values, the corresponding test result is marked as fault-prone. Then, in Line 5, the algorithm checks the interference value. If there are interference test values such as a lipemic, icteric, or hemolytic index value, or device alerts such as test calibration requirements, the test is considered fault-prone. In the same line, the repetition information is also tested, and previously repeated tests are considered fault-prone. Moreover, if the result is in the panic range, it is marked as fault-prone as well.
In Line 9 and Line 12 of Figure 2, the quality control and delta check are evaluated, respectively, and invalidated tests are considered fault-prone. Regarding the delta check, the test result is compared with the previous results that were studied at different times and with different samples; if the difference between the results is not within the expected values, it may be an alert for a faulty test. Subsequently, in Line 15, any previous test of the same test type for the corresponding patient is searched for. If there is no previous test result for the corresponding patient (Line 16), the current test result is expected to be within the limit range. Otherwise (if there is at least one previously studied result for the patient), the approval range is extended to the analytical range, as shown in Line 20. If the result is not within the specified range (limit range or analytical range), the test is considered fault-prone.
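For illustration, the rule order described above can be sketched in Python; the function signature, boolean helper flags, and range representation below are illustrative assumptions rather than the actual implementation behind Figure 2.

```python
# A minimal sketch of the rule-based pre-filtering described above.
# All flags and ranges are illustrative assumptions, not the exact
# implementation behind Figure 2.

def pre_filter(result, moving_avg_ok, interference, repeated, in_panic_range,
               qc_passed, delta_check_ok, has_previous,
               limit_range, analytical_range):
    """Return 'approve' or 'fault-prone' following the rule order in Figure 2."""
    # Line 2: moving average / average median of the last 100 results
    if not moving_avg_ok:
        return "fault-prone"
    # Line 5: interference indices (lipemic, icteric, hemolytic), device
    # alerts, repetition, and panic-range results are all rejected
    if interference or repeated or in_panic_range:
        return "fault-prone"
    # Lines 9 and 12: quality control and delta check
    if not qc_passed or not delta_check_ok:
        return "fault-prone"
    # Lines 15-20: range check depends on whether previous results exist
    low, high = limit_range if not has_previous else analytical_range
    if not (low <= result <= high):
        return "fault-prone"
    return "approve"
```

Tests that pass every rule are auto-approved; the rest are handed over to the smart-approval step.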
In the calculation of the lower and upper limits of the analytical range, Equation 1 was used, based on the reference range of the test for the corresponding patient, where m represents the midpoint of the reference range, and q_l and q_h are the lowest and highest quantitative limits.
To apply the ML methods in the presented study, the tests marked as fault-prone in the pre-filtering process were evaluated. Following this approach, a general approval mechanism was employed through expert rules, and the remaining test results are approved by test-specific models to reduce the FN predictions, i.e., tests that deserve approval but are marked as fault-prone because the pre-filtering rules cannot cover them. Here, the ML-based auto-approval complements the decision of the rule-based pre-filtering by analyzing the hidden patterns in data containing the previous behaviors of the experts. Details of its implementation and performance comparisons are included in the following subsections.

A. DATA PREPARATION
The parameters used in the proposed ML-based smart-approval are listed in Table 5. Accordingly, (i) the test result, (ii) delta check, (iii) average median, (iv) repetition, (v) gender, (vi) date of birth of the patient, (vii) date of the assay, (viii) the reference range, (ix) in the reference range (binary), (x) has any previous test result(s) (binary), and (xi) the set of previous test results were utilized to derive the ML models. Some of these parameters were tuned, and new parameters were derived to be included in the final parameter set. Using the date of the assay and the date of birth, an age parameter was provided; using the reference range, a distance to reference range parameter was calculated; and lastly, by fusing the set of previous test results, a previous test result parameter was obtained. The calculation details of these derived parameters are as follows. Using the date of birth of the patient and the time of blood collection and assay, a new parameter was obtained as age. This parameter was calculated on a day basis by simply subtracting the date of birth from the date of assay.
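The day-based age calculation can be sketched as follows; this is a minimal illustration, since the actual ALIS implementation is not shown in the paper.

```python
from datetime import date

def age_in_days(date_of_birth: date, date_of_assay: date) -> int:
    # Day-based age: the number of days between birth and assay.
    return (date_of_assay - date_of_birth).days
```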
There may be plenty of laboratory test results for the corresponding patient for the same type of test. In these circumstances, the current test result (t) should be considered together with these previous results. One of the most important parameters used in the approval process is the reference range, which determines whether the current test result is appropriate compared with the average of healthy people with a similar profile [20]. Therefore, this range was also embedded in the smart-approval model through a newly generated parameter called distance to reference. The value of this parameter was calculated by measuring the numerical distance of t to the reference range ([min, max]). For test results that fall within the reference range, the value of this parameter was set to zero.
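A minimal sketch of the distance to reference parameter is given below, assuming a simple gap-to-nearest-bound distance; the exact distance measure is not specified in the text.

```python
def distance_to_reference(t: float, ref_min: float, ref_max: float) -> float:
    """Numerical distance of a test result t to the reference range [min, max].

    Results inside the reference range get a distance of zero, as described
    in the text; the gap-to-nearest-bound measure itself is an assumption.
    """
    if t < ref_min:
        return ref_min - t
    if t > ref_max:
        return t - ref_max
    return 0.0
```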
In the resulting parameter set, a total of 10 parameters were provided: (i) the test result, (ii) delta check, (iii) average median, (iv) repetition, (v) gender, (vi) age, (vii) distance to reference range, (viii) in reference range, (ix) has previous test result, and (x) previous test result. Five of these parameters were numerical, and the remaining parameters were binary or categorical. In the ML implementations, the categorical attributes were one-hot encoded.
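The encoding step can be sketched as below; which attributes are treated as categorical, and their category values, are assumptions made only for illustration.

```python
# Hypothetical sketch of encoding one record into the final parameter set.
# The split into numerical vs. categorical attributes, and the category
# values themselves, are illustrative assumptions; only the one-hot idea
# follows the text.

def one_hot(value, categories):
    """One-hot encode a categorical value over a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def encode_record(rec):
    """Concatenate numerical parameters with encoded categorical ones."""
    numeric = [rec["test_result"], rec["average_median"], rec["age"],
               rec["distance_to_reference"], rec["previous_test_result"]]
    encoded = []
    encoded += one_hot(rec["gender"], ["F", "M"])
    encoded += one_hot(rec["repetition"], [False, True])
    encoded += [1 if rec["in_reference_range"] else 0]
    encoded += [1 if rec["has_previous_test_result"] else 0]
    return numeric + encoded
```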

B. ML METHODS AND IMPLEMENTATION
Five methods were implemented to decide the best-performing ML model for smart-approval, and their behaviors on the pilot dataset were examined. Accordingly, a kernel-based SVM, a perceptron-based MLP, a statistical learning method (NB), an ensemble learning approach (RF), and a rule-based FRBS were employed and compared by their auto-approval capability.
An SVM aims to create a hyper-plane or a set of hyper-planes in a high-dimensional space to separate the space into sub-spaces that are distinguished from each other as much as possible. It can be used for both classification and regression problems. A good separation is achieved when the obtained hyper-plane has a large functional margin, meaning it has the largest distance to the nearest training data points of any class. In general, the larger the margin, the higher the generalization capability of the classifier. In this study, the support vector classifier in the scikit-learn library [21], based on LIBSVM [22], was used for the SVM implementation. The Radial Basis Function (RBF) kernel was chosen, with the parameter C set to 1.0 and gamma set to (n · σ²_X)⁻¹, where n is the number of features in the dataset X and σ²_X is the variance of the flattened X. MLP is a kind of feed-forward artificial neural network whose major aim is mapping the input features to the output(s) by arranging the weights on the corresponding output to minimize the loss function. In this study, a 4-layer MLP with two fully connected hidden layers was implemented with back-propagation. Adam [23], a stochastic gradient descent method, was used for weight optimization based on the binary accuracy metric. The initial weights of the network were set with Glorot and Bengio's method [24]. The numbers of hidden-layer neurons were determined empirically as n + 5 and n, respectively, where n represents the number of features. The activation function in the first three layers was the Rectified Linear Unit (ReLU), while the sigmoid was used for the last layer. The implementation was based on the TensorFlow library [25].
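The gamma setting described for the SVM above can be written out explicitly. The sketch below is a pure-Python rendering of gamma = (n · σ²_X)⁻¹, which coincides with scikit-learn's default gamma="scale" for SVC, so SVC(kernel="rbf", C=1.0, gamma="scale") would reproduce the described configuration.

```python
def scale_gamma(X):
    """Compute gamma = 1 / (n_features * Var(X)) over the flattened dataset X.

    This matches scikit-learn's gamma="scale" default for the RBF kernel,
    so SVC(kernel="rbf", C=1.0, gamma="scale") reproduces the setting
    described in the text.
    """
    flat = [v for row in X for v in row]
    n_features = len(X[0])
    mean = sum(flat) / len(flat)
    variance = sum((v - mean) ** 2 for v in flat) / len(flat)
    return 1.0 / (n_features * variance)
```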
NB is a statistical supervised learning method based on conditional probability theory. It relies on the assumption that each particular attribute affects the output class label individually; the attributes are independent of each other during the determination of the class label of the corresponding sample. It is a fast, accurate, and reliable algorithm, broadly accepted for classification purposes on large datasets [26], [27]. Therefore, it was used in the decision-making process of the smart-approval module. Since some of the smart-approval parameters are in numerical form, Gaussian NB was applied in this study, in which the likelihood of the attributes is assumed to be Gaussian during the conditional probability calculation [28]. The Gaussian naive Bayes classifier in the scikit-learn library [21] was used for the NB implementation.
Random forest is an ensemble supervised learning method that constructs a number of decision trees at training time by bootstrapping [29]. The output of the model is then obtained by voting (for classification problems) or by averaging the predictions (for regression problems) of the individual trees. To inject randomness and reduce the correlation between trees, each tree utilizes a subset of the parameters as predictors in each splitting step. The implementation of RF was performed using the scikit-learn library [21]. Entropy and Gini impurity were evaluated in turn to determine the best split, and it was observed that neither metric had a remarkable superiority over the other; Gini impurity was selected for the final experiments. The number of predictor variables was the square root of the total number of parameters, and the maximum depth of each tree was set to 5. The hyper-parameters of tree construction were decided empirically to maximize the generalization capability of the resulting forest.
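As a concrete illustration of the split criterion, the Gini impurity score used by the trees can be computed as below. The equivalent scikit-learn configuration implied by the text would be roughly RandomForestClassifier(criterion="gini", max_depth=5, max_features="sqrt"), though the exact call is an assumption.

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_c^2.

    Zero means a pure node; 0.5 is the maximum for a balanced binary
    split (approve vs. disapprove).
    """
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```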
Fuzzy rule-based systems are expert systems relying on fuzzy set theory. The decision-making mechanism of FRBSs is based on fuzzy rules, and these rules are ideally learned from a human expert [30]-[32]. In this respect, they are slightly distinguished from conventional ML approaches that need data to be trained. The reason for considering the FRBS as an ML approach in this study is that the rules are learned from data instead of from human experts: there are several parameters to be taken into account while determining the fuzzy rules, and it is highly challenging for a human to determine such complex fuzzy rules with high coverage of all conditions of these parameters [33]. Therefore, in this study, the fuzzy rules are generated by the Wang-Mendel rule generation algorithm (WM), detailed in [34]. Briefly, in the WM method, each transaction in the utilized dataset is considered a candidate fuzzy rule with a certain degree. First, the parameters (corresponding to the linguistic variables in the fuzzy system) in each transaction are mapped to the most strongly fired fuzzy set, as is the consequent linguistic variable. The rule degree is then calculated with the product operator over the membership degrees of the linguistic variables in the corresponding fuzzy sets. There may of course be conflicts in the consequent parts of these candidate rules; in such cases, the candidate rule with the higher rule degree is included in the final rule set.
Mamdani-style fuzzy inference was employed for classification based on the resulting fuzzy rules [35]. The FRBS implementation was based on the scikit-fuzzy library [36]. Each parameter addressed a linguistic variable and had three triangular fuzzy sets, where the universe of discourse was divided into equal intervals and the midpoints of these intervals correspond to the most vagueness (i.e., the degrees of membership in the neighboring fuzzy sets are equal). During the inference process, minimum and maximum operations were used for rule firing and aggregation, respectively. The fuzzy set obtained from the aggregation step was defuzzified by taking the center of the area as the crisp output.
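The triangular membership functions and the Wang-Mendel rule degree described above can be sketched as follows; this is an illustrative rendering, not the scikit-fuzzy-based implementation used in the study.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def wm_rule_degree(memberships):
    """Wang-Mendel candidate-rule degree: the product of the membership
    degrees of the fired fuzzy sets (antecedents and consequent)."""
    degree = 1.0
    for m in memberships:
        degree *= m
    return degree
```

With the unit interval divided into equal sub-intervals, a point midway between two peaks (e.g., 0.25 for peaks at 0 and 0.5) belongs to both neighboring sets with degree 0.5, which is the "most vagueness" condition mentioned above.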

IV. EXPERIMENTS & RESULTS
Performance evaluation of the proposed auto-approval procedure was completed through several experiments based on two main scenarios. First, conventional 5-fold cross-validation was performed to observe the performance of the ML models in the smart-approval module. The best ML model was then selected and placed into the ALIS to monitor the accuracy of the proposed auto-approval process on unseen real-time data.
The models' performance evaluation was based on precision (P), recall (or true-positive rate, sensitivity) (R), accuracy (A), true-negative rate (or specificity) (TNR), negative predictive value (NPV), false-positive rate (FPR) and area under the Receiver Operating Characteristics curve (ROC-AUC) measures which are commonly used for classification purposes. Here the positive class corresponds to the approved tests and the negative class corresponds to the disapproved tests.
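These measures follow directly from the confusion-matrix counts; a minimal sketch with hypothetical counts is given below.

```python
def classification_metrics(tp, fp, tn, fn):
    """Evaluation metrics used in this study; positive = approved tests,
    negative = disapproved tests."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),            # true-positive rate / sensitivity
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "TNR": tn / (tn + fp),               # specificity
        "NPV": tn / (tn + fn),
        "FPR": fp / (fp + tn),
    }
```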

A. EXPERIMENTS ON PILOT DATASET
The pilot dataset was used to obtain the ML models and to evaluate the success of each one for comparison. In the regular ALIS procedure, the tests are sent directly to the biochemist for manual approval through the ALIS. As seen in Table 3, up to 95% of these tests were directly approved by the biochemists for each test type. The rest of the tests were disapproved by a repetition request or rejection. At the end of this process, the manually determined binary labels (approve, disapprove) for the pilot dataset tests were obtained, and the data became applicable to supervised training.
The ML models mentioned above (FRBS, MLP, NB, RF, and SVM) were evaluated by 5-fold cross-validation on the resulting pilot dataset. Here, the data was partitioned by test type so that each test type could be evaluated individually. This produced 28 datasets, one per test type, and each test type yielded an individual smart-approval model. For each model, the obtained dataset was partitioned into five equal subsets by randomly selecting the tests belonging to each subset. In each iteration of cross-validation, 80% of the tests were used for model training and 20% for model testing.
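The partitioning scheme can be sketched as below; the record layout (a test_id key per record) is an assumption for illustration.

```python
import random
from collections import defaultdict

def split_by_test_type(records):
    """Group records so each test type gets its own dataset (28 in the study)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["test_id"]].append(rec)
    return groups

def five_fold_indices(n, seed=0):
    """Randomly partition n sample indices into five roughly equal folds;
    each fold serves once as the 20% test split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]
```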
The average values obtained from 5-fold cross-validation are presented in Table 9 and Table 10 for training and testing, respectively. Since each test type was evaluated individually, the performance metrics are presented separately for each test type. In Table 6, the averages of the values in Table 9 and Table 10 are also given to represent the results more clearly and briefly. Here, the over-line symbols denote the average, and the average values are given with the standard deviation (in brackets) of the corresponding performance metric. It can be seen that predicting the tests to be approved is easier than predicting the fault-prone tests, and the models have a common tendency to approve the tests.
The most important motivation emphasized here is to automatically approve as many tests as possible. Accordingly, Table 6 shows that each implemented method has high accuracy in predicting the tests to be approved. However, this tendency of the models to approve increases the importance of the ability to disapprove the tests that could be faulty. Under these circumstances, the metrics TNR, NPV, and FPR should also be examined during the performance evaluation of the ML models to analyze the performance of negative-class predictions. According to Table 6, NB, RF, and MLP provide better prediction performance on this side of the evaluation. In fact, NB and MLP are on par with each other: based on NPV, MLP outperforms NB, while NB is superior on TNR and FPR. Yet, RF is the best model according to every evaluation metric.

B. EXPERIMENTAL REAL-TIME USAGE OF AUTOMATIC APPROVAL SYSTEM IN ALIS
In this section, the proposed auto-approval procedure was deployed to ALIS to be assessed on real-time unseen data. Once the result of a laboratory test request has been completely obtained, the laboratory analyzer enters the data into the ALIS per test or in batches. These data contain the required criteria/parameters and the test result for each test. First, these tests were pre-filtered by the aforementioned criteria; this step applied a rule-based approach to distinguish the tests to be approved from the tests that may be faulty. Then, the tests selected as fault-prone were transferred to the smart-approval step to finalize the auto-approval process by deciding on approval or disapproval. As already mentioned, the decision-making of this step is modeled on the decisions made by human experts on existing data. Additionally, different test types were investigated separately. To that end, while pre-filtering proposes a general strategy to distinguish the tests to be approved from fault-prone tests for all test types, the subsequent smart-approval step provides a test-specific approach to increase efficiency and precision.
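The two-stage flow described above can be sketched as follows. The pre-filtering thresholds, the delta-check rule, and the model stub are illustrative assumptions, not the study's actual rules or trained models:

```python
# Stage 1: generic rule-based pre-filter shared by all test types.
def pre_filter(test):
    """Return 'approve' for clearly consistent results, else 'fault-prone'.
    The in-range and delta-check rules here are hypothetical examples."""
    low, high = test["reference_range"]
    if low <= test["result"] <= high and abs(test["delta_check"]) < 0.2:
        return "approve"
    return "fault-prone"

# Stage 2: route fault-prone tests to the test-type-specific model.
def auto_approve(test, models):
    if pre_filter(test) == "approve":
        return "approve"
    model = models[test["test_type"]]       # one smart-approval model per type
    return "approve" if model(test) else "disapprove"

# Illustrative usage with a stub standing in for a trained classifier:
models = {"potassium": lambda t: t["result"] < 7.0}
sample = {"test_type": "potassium", "result": 6.0,
          "reference_range": (3.5, 5.1), "delta_check": 0.05}
decision = auto_approve(sample, models)
```

Here the pre-filter is deliberately generic, while each entry in `models` can be trained and tuned for its own test type, which mirrors the efficiency/precision split described above.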
Here, in the real-time experiments, the ML models (MLP, NB, RF, and SVM) trained on the pilot dataset were used. In other words, there is no re-training stage in these experiments, in order to observe the generalization capability of the pre-trained models. Regarding the FRBS, since the Wang-Mendel method used here was inadequate to develop a complete ruleset, the resulting fuzzy systems could not produce precise outputs for a considerable number of tests in the unseen data. Defining a complete ruleset manually is also challenging with this number of input parameters, as the system would require 3^10 fuzzy rules. Even if such a ruleset were generated somehow, performing inference with it in a real-time application would be prohibitive in terms of memory and computational cost. Therefore, this method was not included in this experiment.
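Assuming the 3^10 figure refers to 10 input parameters, each partitioned into 3 fuzzy sets (an interpretation of the garbled original, not a confirmed detail), the size of a complete Wang-Mendel rule base follows directly:

```python
# One rule is needed per combination of fuzzy-set assignments across inputs.
fuzzy_sets_per_input = 3   # assumed partitioning per input parameter
n_inputs = 10              # assumed number of input parameters
complete_rules = fuzzy_sets_per_input ** n_inputs  # 3**10 = 59049 rules
```

The exponential growth in the number of inputs is the reason a complete rule base quickly becomes impractical to define by hand or to evaluate in real time.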
In the real-time experiments, the human biochemists and the 4 ML models assessed the incoming tests simultaneously without seeing each other's decisions. This process continued for four months for each applied test. At the end of this process, the consensus and conflict between the ML models' and the human experts' decisions became measurable, and the accuracy-based performance evaluation of the proposed auto-approval procedure could be performed. Although the rule-based pre-filtering directly approved a portion of the incoming tests, its approval ratio is still weak for some test types such as direct bilirubin (2847). Even the tests most frequently approved by pre-filtering, iron (2012) and iron-binding capacity, unsaturated (2854), have an approval ratio of almost 70%. However, the main point considered here is the approval capability of the test-specific ML approaches, and this pre-filtering application was only a pre-processing step to obtain the fault-prone tests. Using the resulting fault-prone tests, the pre-trained ML models were run in daily real-time usage. The experimental results are presented in Table 11 for the MLP, NB, RF, and SVM models, respectively.
In Table 11, the TP, FP, TN, and FN counts were obtained by setting the classification threshold to 0.5. Accordingly, the values of the P, R, A, TNR, NPV, and FPR metrics were computed from these counts. Based on these metrics, the results can be evaluated from two perspectives. First, when the model evaluation is performed by considering the capacity to predict the tests to be approved, the P and R values draw attention. The corresponding table shows that all of the machine learning models provided high P values, which means that the tests approved by the models do belong to the class of tests to be approved. However, according to R, NB and MLP may fail for some of the test types, such as alanine aminotransferase (2001).

The second perspective in the evaluation corresponds to the capacity to predict fault-prone tests. Accordingly, TNR, NPV, and FPR are relevant here; while the TNR and NPV values are expected to be close to 1, the best FPR value is 0. Table 11 shows that the best values for these metrics are provided only by RF for almost all test types, except aspartate transaminase (2008), potassium (2036), and direct bilirubin (2847), where the model tends to approve some of the fault-prone tests. The reflection of this tendency can also be seen in the ROC-AUC values for the same test types. However, it is also clear in Table 11 that the overall performance on these tests is worse for the other ML approaches, and RF is still the superior method. Table 11 has been summarized and simplified in Table 7 by taking the average values of the performance measures for each test type to make it more traceable. Here, the superiority of RF for auto-approval can be seen for each of the performance measures.
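For reference, the threshold-based counts described above can be derived from model scores as in this minimal sketch (the scores and labels are made up, not the study's data):

```python
def confusion_counts(scores, labels, threshold=0.5):
    """scores: model approval probabilities in [0, 1].
    labels: 1 = test that should be approved (positive class),
            0 = test that should be disapproved (negative class)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp, fp, tn, fn

# Illustrative scores and labels:
counts = confusion_counts([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0])
```

Lowering the threshold below 0.5 would trade FP for FN, i.e., approve more tests at the cost of approving more fault-prone ones, which is exactly the tension discussed above.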
As already detailed in previous sections, the first aim of auto-approval is to approve as many tests worth approval as possible, to decrease the need for human intervention.
Predicting the fault-prone tests is the second significant aim. In this study, an RF-based auto-approval process has been proposed and its approval accuracy has been deeply analyzed. In Table 8, the overall process has been evaluated based on the ratio of correctly approved tests for each test type. Here, the total number of tests is first presented together with the number of fault-prone tests remaining after the rule-based pre-filtering. Among these tests, the numbers of approved and disapproved tests are shown. In the last column, the approval ratio has been calculated by dividing the number of correctly approved tests (by rule-based pre-filtering and smart approval) by the total number of tests. The last row in the table shows the result for the complete real-time dataset. Accordingly, 98.51% of the applied tests have been auto-approved correctly.
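The approval-ratio computation of Table 8 amounts to the following; the counts in the example are illustrative, not the study's data:

```python
def approval_ratio(total, prefilter_correct, smart_correct):
    """Ratio of correctly approved tests: tests correctly approved by the
    rule-based pre-filtering plus tests correctly approved by the
    smart-approval models, divided by the total number of tests."""
    return (prefilter_correct + smart_correct) / total

# Illustrative counts chosen to reproduce a 98.51% ratio:
ratio = approval_ratio(total=10000, prefilter_correct=6500, smart_correct=3351)
```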

V. DISCUSSION & CONCLUSION
In this study, an RF classifier was employed for the auto-approval of biochemical tests on ALIS. Several experiments evaluated the proposed auto-approval system from different perspectives. Here, the auto-approval performance of the proposed ML-based models was tested on a pilot dataset for verification. However, it is crucial to perform testing in real-time usage for validation, because how such a system will behave in daily use can only be understood through real-time implementation and monitoring. Accordingly, the following findings were obtained:
• The proposed auto-approval system can provide convincing classification performance for both the pilot dataset and the real-time tests. Regarding the pilot dataset, the average precision, recall, and accuracy values reached almost 100%, and a 99% ROC-AUC value was obtained. These high performance values slightly decreased, but in some cases persisted, when the models trained on the pilot dataset were transferred to the real-time application on ALIS in a different time interval. While precision, recall, accuracy, and negative predictive values reached almost 100%, a 91% ROC-AUC value was obtained with a 20% false-positive rate.
• According to the classification performance, it can be concluded that the automatic approval success of the proposed model is higher than that of the machine-learning-based studies in the literature [15]. Regarding the MLP approach in [15], it has been shown in this study that RF radically outperforms MLP. Besides, the performance can still be increased by fine-tuning the RF models with newly obtained data.
• In the real-time application, which lasted for 4 months, 98.51% of the applied tests were auto-approved correctly. This approval ratio is remarkably higher than the approval ratio in [17], which is one of the state-of-the-art studies with the best approval performance in the literature.
After the validation of the proposed auto-approval process in the real-time experiments, the test results started to be presented together with the auto-approval decision obtained from the proposed RF models on the ALIS interface. Within the scope and applications of this study, the ALIS did not yet auto-approve the tests by using the proposed auto-approval process; instead, the proposed system was only used to support the experts' decisions and to measure the behavioral differences between the auto-approval process and the human experts. Following studies will focus on implementing the auto-approval process so that it is invisible to the system's user. Additionally, more test types will be covered to reduce the magnitude of human effort required for approval.