Machine Learning Model for Hepatitis C Diagnosis Customized to Each Patient

Machine learning is now widely used in many fields and has had a notable impact on disease diagnosis. However, traditional machine learning models are general-purpose: a single model is used to evaluate the health status of every patient. Such a model depends on a large amount of data and abundant computing power, is judged by its average performance, and cannot achieve optimal results on a specific problem. In this paper, we propose to train a unique model for each patient to improve the accuracy and ease of use of the model. The proposed approach addresses the problem from three perspectives: (1) targeted data processing, (2) model structure design, in which patient-related information is passed into the model, and (3) tailored hyperparameter optimization. Preliminary experimental results show that the custom model offers high accuracy, high confidence, and low resource requirements when diagnosing a patient. On the Hepatitis C dataset, over 99% accuracy and 94% recall were achieved using a small dataset (only 615 individuals' data) and without knowledge of the relevant field. Traditional approaches such as XGBoost or multi-algorithm ensembles achieved less than 95% accuracy and less than 70% recall. Out of a total of 56 patients, the custom model identified 53, which is 20 more than traditional methods, providing a new and efficient tool for future hepatitis C prevention and treatment efforts.


I. INTRODUCTION
Hepatitis C is a silent killer that is hard to detect: a serious disease that is slowly progressive and potentially carcinogenic, and that can remain latent in the body for 10-20 years [1], [2]. Typically, only about% of patients with hepatitis C virus infection can recover spontaneously within six months, and 70% of patients turn into chronic viral infection [3]. The hepatitis C virus is extremely stealthy, and WHO estimates that only about one in five of the more than 50 million people living with hepatitis C worldwide are aware that they have the disease, with an underdiagnosis rate of up to 80% [4]. In the early to mid-stages of hepatitis C infection, there are usually no obvious signs and symptoms. Patients may experience dizziness, weakness, and poor sleep, which can easily be confused with fatigue caused by work or study [5]. As a result, many patients are found to have hepatitis C only when they are examined for other diseases, and some are found to have it only when cirrhosis or liver cancer is detected. It is because of this stealthy nature that the damage caused by hepatitis C is chronic and progressive. The hepatitis C virus replicates primarily in the liver cells and damages them [6]. Over time, liver cells in the body continue to develop inflammation, degeneration, and necrosis. There is no vaccine to prevent hepatitis C, so people at risk can only be diagnosed and treated in a timely manner by taking the initiative to get tested for the hepatitis C virus at the hospital [7], [8]. Although hepatitis C is dangerous, only 1 ml of blood is needed to test for infection with the virus. Once diagnosed, there is no need to panic, as more than 95% of patients with hepatitis C can be cured with standardized and systematic treatment [9], [10], [11], [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, the greatest difficulty in the prevention and treatment of hepatitis C is that most patients do not know that they have the disease. The mainstream diagnostic tools for hepatitis C are: 1. Liver function tests (LFTs), which assess liver disease from liver-related metabolites [13], [14], [15]. 2. Hepatitis C antibody tests, which clarify whether the body is infected with the hepatitis C virus. If the test result is positive, it indicates that the patient is currently infected with hepatitis C or has previously been infected with hepatitis C [16], [17]. 3. Hepatitis C virus RNA tests, which can effectively determine how long the infection has been present and how much of the virus is present in the patient's body [18], [19]. 4. Liver puncture or ultrasound, which is the main way to determine the severity of the liver disease. Generally speaking, if the disease is serious or has a long duration, these two tests should be used to analyze the progress of the liver disease, which is also the key to the current treatment process for patients who are diagnosed [20], [21].
Among these tests mentioned above, only liver function tests are easy to perform at regular checkups and have high marginal utility (LFTs can be used to analyze many diseases related to the liver). Antibody and RNA tests are more targeted and less prevalent in general health care facilities, and are relatively costly and not conducive to mass adoption. Puncture tests or ultrasound are generally used to detect the progression of disease in patients with confirmed disease and are not suitable for making early disease diagnosis. This is why making good use of the data from liver function tests has become an effective means of identifying hepatitis C patients earlier.
To make the best use of the collected data, a powerful tool such as machine learning is a natural choice. However, the current direction of machine learning is deep learning, which relies on large datasets and therefore conflicts with the small amount of medical data accumulated today. Borisov et al. point out that deep learning methods have a major disadvantage in the processing of structured data [22], and the performance of deep learning models with huge numbers of parameters is even far behind some commonly used tree models [23]. There are also obvious ethical issues with today's machine learning models when dealing with medical problems, as they are judged by their average performance on the validation set. Perhaps the model can perform well on average, but who wants to be the ''unlucky patient'' who is misjudged by the model? Every data sample processed by the model is closely related to a patient. A disease diagnostic model does not merely categorize a data sample; it actually affects the future of a real, flesh-and-blood individual. To overcome these ethical issues, the primary pursuit of a custom model for the selected target patient is the highest possible degree of accuracy. The aim is not only a good overall average performance, but also an acceptable worst-case performance for each individual case.
In this paper, we propose a machine learning model for hepatitis C diagnosis customized for each patient. The major difference from the traditional model is that the data of the patient to be diagnosed is incorporated into the training process. The comparison of the customized solution and traditional machine learning is shown in Figure 1. With the help of richer information, the model achieves better accuracy and can correctly categorize almost all patients. The second section analyzes some of the relevant research developments, and the third section describes the dataset used and some basic data processing tools. The fourth section clarifies the principles of the model construction and details the process as much as possible in order to facilitate the replication of the results by subsequent scholars or medical practitioners. The fifth section presents some experimental results, and the last section provides a summary and outlook.

II. RELATED WORK
Although this article studies the diagnosis of hepatitis C, it is essentially an analysis based on medical data already collected and does not involve medical domain knowledge. Therefore, in this section, we do not analyze the virology of hepatitis C or disease-related knowledge, but mainly summarize the existing data analysis tools for the disease and their effects. We discuss: 1. the development of structured data processing in the field of machine learning; 2. the customization and lightweighting of traditional models for specific application scenarios to make them easier to use; and 3. the progress of studies using the same dataset.

A. METHODS OF PROCESSING STRUCTURED DATA
1) GRADIENT BOOSTED DECISION TREE
The field of structured data (i.e., tabular data) has historically been dominated by conventional machine learning algorithms like Gradient Boosted Decision Tree (GBDT) [24] due to their better performance [23]. Scientists and businesses alike rely heavily on several GBDT algorithms, the most popular of which are XGBoost, LightGBM [25], and CatBoost [26]. GBDT is a scalable gradient boosting tree technique that produces state-of-the-art results on numerous tabular datasets, and XGBoost is one of its most prominent implementations. The process known as ''gradient boosting'' builds new models using the residuals of older models to produce more accurate predictions [27], [28]. XGBoost's foundation is the same as GBDT's, but it has been improved upon. For example, the second-order derivative makes the loss function more accurate, regularization terms are used to avoid tree overfitting, and block storage allows parallel computation.
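The residual-fitting idea behind gradient boosting can be illustrated with a minimal sketch. It is not XGBoost or any of the cited implementations: it assumes a squared-error loss, a single numeric feature, and depth-one trees (stumps), and the names `fit_stump`, `predict_stump`, and `gradient_boost` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimising squared error on a 1-D feature."""
    best = None
    for t in np.unique(x)[:-1]:  # candidate split points (the right side is never empty)
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = y.mean()
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient of the squared-error loss
        stump = fit_stump(x, residual)
        pred += lr * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred

# Toy data (illustrative only, not from the paper).
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x)
base, stumps, pred = gradient_boost(x, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
```

In this toy setting the training error after boosting (`mse_end`) is lower than that of the constant baseline (`mse_start`), which is the behavior the residual-fitting description predicts.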

2) DEEP NEURAL MODELS
Since deep neural networks have been so successful in image recognition, numerous recent studies have extended deep learning to the area of tabular data, with the goal of improving performance on tabular data by introducing novel neural architectures [22], [29], [30]. Based on the deep learning ideas they draw from, these models can be classified into two categories.
Attention-based models. Given the novel route taken by attention-based models in deep learning, several researchers have experimented with attention-like modules in tabular deep networks. Two types of attention have recently been proposed: intra-sample attention, where the features within a single sample interact, and inter-sample attention, where individual data points take advantage of row-level or sample-level interactions [31], [32].
Differentiable trees. This line of work seeks to make decision trees differentiable, motivated by the impressive results obtained by decision tree ensembles on tabular data. Because classical decision trees are not differentiable and cannot be optimized by gradient methods, their use is limited in some application scenarios. Recent research has found a solution to this issue: tree functions and tree routing are made differentiable by smoothing the decision functions in the internal tree nodes [33], [34].
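As an illustration of what smoothing a decision function means, the following sketch replaces a hard threshold test with a sigmoid so that the node output varies continuously with the feature value. It is a generic example rather than the method of any cited work; the function name `soft_split` and the `temperature` parameter are assumptions made for this illustration.

```python
import math

def soft_split(x_k, threshold, left_value, right_value, temperature=1.0):
    """Soft decision node: the sample is routed to the right leaf with
    probability sigmoid((x_k - threshold) / temperature)."""
    p_right = 1.0 / (1.0 + math.exp(-(x_k - threshold) / temperature))
    return (1.0 - p_right) * left_value + p_right * right_value

at_threshold = soft_split(0.5, 0.5, 0.0, 1.0)                    # halfway between the leaves
far_right = soft_split(10.0, 0.5, 0.0, 1.0, temperature=0.1)     # close to the right leaf
far_left = soft_split(-10.0, 0.5, 0.0, 1.0, temperature=0.1)     # close to the left leaf
```

As the temperature approaches zero the node behaves like a hard split, while larger values give smoother routing through which gradients can flow.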
But even with the improvement of these new approaches and the combination of them, it is still difficult for deep neural models to outperform traditional GBDT across the board in structured data.

B. CUSTOMIZATION AND LIGHTWEIGHTING OF COMPLEX MODELS
With the rapid accumulation of data [35], [36], a variety of all-encompassing datasets have been built [37], [38], [39], while the differences between data and compatibility issues have been ignored. This neglect makes it difficult for complex models designed for complete scenarios to perform consistently on all problems [40], [41]: there will always be particular problems for which the prediction is substantially off, and there will always be images that cannot be correctly classified. As a result, applying large proven models to specific datasets, that is, adapting the original models to specific problems, is not easy to achieve. In the field of problem-based machine learning research, this has received only minimal exploration.
Some researchers [42], [43], [44], [45], [46] discuss how existing complex models can be tailored to specific problems, using transfer learning to make the original model better applicable to specific datasets. Since traditional mature neural networks are large and bloated, other researchers [47], [48], [49], [50], [51] attempt to compress the parameters of an existing model by means of knowledge distillation and model simplification in order to improve computing speed.
Several scholars have also proposed data augmentation techniques [52], [53], [54], [55], [56] that can improve the accuracy of a model in specific scenarios.
However, these solution ideas are still rarely discussed for very specific individual problems, and this paper will try to fill the gap and demonstrate the feasibility.

C. HEPATITIS C DISEASE DIAGNOSIS USING THE SAME DATASET
In the field of medical diagnostics, machine learning has been showing its capabilities since very early on. Back in 2017, Hashem et al. compared several ways to predict hepatitis C using blood markers, yielding best accuracy rates of 66.3% to 84.4% [57]. In 2018, Hoffmann et al. collected and organized the dataset used in this paper, and several medical researchers analyzed the data with a tree model, yielding a best accuracy rate of 75.3% [58]. This dataset was donated to the UCI Machine Learning Repository in June 2020 [59], [60]. After that, Chicco and Jurman used the dataset to perform ensemble learning on the AST/ALT ratio and achieved a 95.4% accuracy rate on whether the disease was present or not [61]. Chawathe et al. achieved a 95% accuracy rate and an 89% recall rate by fusing multiple models. But for specific applications in medical diagnosis, all this needs to be enhanced [62].
We need to make every effort so that all patients are accurately identified and all healthy patients can be correctly classified without additional biopsies.

III. METHODOLOGY
The algorithm design in this paper is based on thinking from two perspectives, from the perspective of the user of the model and from the perspective of the data.

1) USER'S PERSPECTIVE
When one of a patient's laboratory indicators contributes significantly to the outcome of a disease, this indicator is important and should be given attention. This kind of judgment is what experienced physicians are good at, since working with key characteristic variables requires a sound understanding of them. Likewise, a physician will recognize when an indicator in the laboratory results has little bearing on the diagnosis of a particular disease [63], [64], [65].
In general, after a systematic study of medical knowledge and a certain period of internship, a doctor can make a general judgment about various laboratory indicators, which ones are important and which ones do not play a role. However, the superficial cognition of inexperienced doctors is not enough to judge the causal relationship between variables in a short time, so if doctors are allowed to intervene in the screening of data at the early stage of data processing, the inaccuracy of doctors' own judgment will be transferred to the data. Therefore, it is wiser to have physicians with extensive experience review the trained model to check whether the importance differences of the weights in the model are consistent with the objective laws of the real world, so as to ensure the reliability and interpretability of the model. It's also an effective way to get more value out of experienced physicians and allow excellent medical resources to serve more people [66].

2) THE PERSPECTIVE OF DATA
The data itself will naturally present differences in the influence of different variables, and will also show the relationship between different data samples. When it comes to data related to disease diagnosis, the data will then reflect similarities between patients. The general process of machine learning mainly describes the relationship between variables, but not much attention is paid to the relationship between samples. The data processing customized for patients proposed in this study is going to fill this gap and explore how to use the relationship between samples to improve the accuracy of the model. The effect of focusing on some of the key samples can be achieved by modifying the ratio of the number between samples, just like a person focuses on the key information in a scene.
Guided by the above ideas, the algorithm proposed in this paper implements model customization for patients in three stages. 1) Data processing stage: targeted sample augmentation. 2) Model structure design stage: patient data are passed directly to the model. 3) Hyperparameter optimization stage: model performance under different hyperparameters is judged by new evaluation criteria. We call an individual patient who needs a disease diagnosis a ''target patient''. Each patient's laboratory results can be considered a sample, and a medical dataset will have a very large number of samples.
The framework of the algorithm is depicted in Figure 2, which is divided into three major parts: data processing, model building, and parameter optimization. The yellow box on the left is the acquisition process of the traditional machine learning model, and the blue box on the right is the acquisition process of the custom machine learning model proposed in this paper.

A. TARGETED DATA AUGMENTATION
This paper proposes to adjust the proportion of training samples (targeted data augmentation). The operation of this part is shown in Figure 3.
EDA (Exploratory Data Analysis) [67], [68] is an essential part of machine learning and is the first step after acquiring data. In this process, the original data is explored with as few a priori assumptions as possible, summarizing the structure of the data and presenting specific patterns. For a single feature, the data engineer expects the values of that feature to be uniformly or normally distributed within the data. For the whole sample space, the data engineer expects the data points to be uniformly distributed (that is, the probability density corresponding to each sample point in the whole sample space is equal) [69]. This is because imbalanced data can seriously affect the model's effectiveness and can even distort the judgment of whether a model is good or bad. The accuracy of the model is very high for the high-proportion categories, while the deviation of the prediction is exceptionally high for the low-proportion categories. Nevertheless, a researcher may naively believe that a good model has been obtained, because the higher-proportion categories have a more significant effect on the loss and the metric [70].
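A short worked example makes the effect of imbalance on the metric concrete. It assumes that the 56 patients and 615 individuals reported in the abstract correspond to the positive class and the total sample count, and it evaluates a degenerate classifier that labels everyone as healthy; this classifier is hypothetical and is not one of the models compared in the paper.

```python
n_total = 615     # individuals in the dataset, as stated in the abstract
n_patients = 56   # hepatitis C patients, as stated in the abstract

# A degenerate "model" that predicts "healthy" for every individual.
true_negatives = n_total - n_patients
true_positives = 0

accuracy = (true_negatives + true_positives) / n_total
recall = true_positives / n_patients

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Under these assumptions the accuracy is 559/615, roughly 0.909, even though not a single patient is identified, which is why accuracy alone is a misleading criterion on this kind of data.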
However, in a dataset containing a large number of samples, there is always only limited sample data in the region that should be focused on. If the laboratory result of a patient who needs to be diagnosed is placed in the original sample space, the percentage of training samples that are similar to the target patient is tiny. In order to improve the accuracy of the model for the target patient, the proportion of training samples can be adjusted by targeted data augmentation [71]. In the case of the disease diagnosis problem discussed in this paper, to make the custom model more accurate for the target patient, attention to cases that differ significantly from the target patient is reduced, and extra attention is paid to cases that are very similar to the target patient. This allows the model to be more sensitive in identifying potential patients and also allows it to make correct judgments when faced with healthy cases [72].
The specific operation is as follows: 1. Find the samples in the previously collected case dataset that are closest to the target patient in the sample space; 2. Increase the number of these similar samples by a specific method so that they occupy a larger proportion of the sample space [73], [74]. This idea of targeted data augmentation raises two questions: 1. how to describe the similarity between samples in the sample space, that is, how to determine that the laboratory results of two patients are similar; 2. by what means to expand the number of similar samples. In machine learning and data mining, the concept of ''statistical distance'' is often introduced to describe the magnitude of differences between individuals and thus evaluate their similarity and class. Depending on the characteristics of the data, different measures can be used. In general, a distance function d(x, y) needs to satisfy the following criteria [75]:
1. Non-negativity: $d(x, y) \ge 0$ (1)
2. Identity of indiscernibles: $d(x, y) = 0 \iff x = y$ (2)
3. Symmetry: $d(x, y) = d(y, x)$ (3)
4. Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$ (4)
Based on these criteria, the Euclidean distance [76], Mahalanobis distance [77], Chebyshev distance [78], Minkowski distance [79], and Bhattacharyya distance [80] were selected as candidate options. After a comparison test, the Mahalanobis distance was finally chosen as the criterion to describe sample similarity. Its most prominent advantage is that it modifies the traditional Euclidean distance to correct for inconsistent scales and correlations among the dimensions. It can genuinely reflect the similarity relationship between samples without being constrained by dimensional scales.
FIGURE 2. Framework of the custom algorithm. The yellow box is the acquisition process of the traditional machine learning model, and the blue box is the acquisition process of the custom machine learning model.
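The similarity search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the covariance is estimated from the training matrix itself, uses a pseudo-inverse in case that covariance is singular, and introduces the hypothetical helper names `mahalanobis_to_target` and `nearest_samples`. The data are synthetic.

```python
import numpy as np

def mahalanobis_to_target(X, x_target):
    """Mahalanobis distance from every row of X to x_target, using the covariance of X."""
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance
    diff = X - x_target
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))

def nearest_samples(X, x_target, k):
    """Indices of the k samples closest to the target patient."""
    return np.argsort(mahalanobis_to_target(X, x_target))[:k]

# Synthetic features with very different scales (illustrative only).
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0])
X = rng.normal(size=(100, 3)) * scales
x_target = X[0]

d = mahalanobis_to_target(X, x_target)
idx = nearest_samples(X, x_target, k=5)

# Rescaling every feature leaves the Mahalanobis distances unchanged.
d_rescaled = mahalanobis_to_target(X / scales, x_target / scales)
```

The last two lines illustrate the scale-invariance property cited as the reason for preferring this distance: dividing each feature by its scale does not change the computed distances, whereas a Euclidean distance would be dominated by the largest-scale feature.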
The other distance criteria in the comparison experiments were more or less affected by the differing scales of the dimensions, resulting in calculated distances that did not satisfy the needs of subsequent experiments. In the future, we will also try to update the similarity criterion by means of metric learning after introducing additional information from professionals. This option allows experienced physicians to judge and score the similarity of patients. A criterion for evaluating the degree of similarity between patients is then derived by learning from these scores through metric learning. To increase the number of the few samples in the training set that are similar to the target patient, and thereby achieve sample balancing, the SMOTE (Synthetic Minority Over-sampling Technique) [81] algorithm was chosen after comparing various methods for adjusting the sample proportions. SMOTE is an interpolation-based method that synthesizes new samples for small sample classes. By calculating the Mahalanobis distance between the sample points in the training set and the target patient, a certain number of samples that are most similar to the target patient are oversampled. The result of this processing is shown in Figure 4. The figure shows a two-dimensional (two-feature) sample space in which a yellow triangle represents a positive sample and a blue pentagon represents a negative sample. The goal is to categorize the green square (the target sample) in the sample space. After the targeted data augmentation process, the samples in the original sample space have been selectively augmented (which can be interpreted as simply copying samples to increase their weights). The samples that are more similar to the target sample are augmented more, so that more attention is paid to them.
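The oversampling step can be sketched as below. This is a simplified SMOTE-style interpolation restricted to the samples already selected as nearest to the target patient; it is not the reference SMOTE algorithm of [81] nor the authors' code. The function name `targeted_oversample`, the rule of interpolating only between selected samples with the same label, and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def targeted_oversample(X, y, neighbor_idx, m, seed=0):
    """Append m synthetic samples interpolated between pairs of selected samples.

    Each synthetic point lies on the segment between two samples from neighbor_idx
    that share a label; if the pair is the same sample, the result is a plain copy.
    """
    if m == 0:
        return X.copy(), y.copy()
    rng = np.random.default_rng(seed)
    neighbor_idx = np.asarray(neighbor_idx)
    X_new, y_new = [], []
    for _ in range(m):
        i = rng.choice(neighbor_idx)
        same_label = neighbor_idx[y[neighbor_idx] == y[i]]  # always contains i
        j = rng.choice(same_label)
        lam = rng.random()  # interpolation weight in [0, 1)
        X_new.append(X[i] + lam * (X[j] - X[i]))
        y_new.append(y[i])
    return np.vstack([X, np.array(X_new)]), np.concatenate([y, np.array(y_new)])

# Synthetic two-feature data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = (X[:, 0] > 0).astype(int)
idx = [0, 1, 2, 3, 4]  # assumed to be the samples closest to the target patient

X_aug, y_aug = targeted_oversample(X, y, idx, m=10)
```

Because every synthetic point is a convex combination of two selected samples, the added points stay inside the region occupied by the target patient's neighbors, which raises the share of that region in the training set.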
Define the data before processing as D, shown in Equation (5), where there are n samples in total, indexed by i. There are d features (dimensions) in total, indexed by k, and l output labels (dimensions) in total, indexed by o. X and Y are then written as separate matrices, as in Equation (6).
Targeted sample augmentation makes the original dataset richer; here m is defined as the number of added samples. The new dataset is defined as D_new, shown in Equation (7). X and Y change accordingly, as in Equation (8).
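Equations (5) to (8) are not reproduced in this text. One plausible form that is consistent with the definitions given above (n samples indexed by i, d features indexed by k, l labels indexed by o, and m added samples) is sketched below; the exact notation of the original equations may differ, and the subscript "new" on X and Y is an assumption.

```latex
\begin{align}
D &= \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^{d},\; y_i \in \mathbb{R}^{l} \tag{5} \\
X &= [x_{ik}] \in \mathbb{R}^{n \times d}, \qquad Y = [y_{io}] \in \mathbb{R}^{n \times l} \tag{6} \\
D_{\mathrm{new}} &= \{(x_i, y_i)\}_{i=1}^{n+m} \tag{7} \\
X_{\mathrm{new}} &= [x_{ik}] \in \mathbb{R}^{(n+m) \times d}, \qquad Y_{\mathrm{new}} = [y_{io}] \in \mathbb{R}^{(n+m) \times l} \tag{8}
\end{align}
```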

B. CUSTOM MODEL STRUCTURES FOR PATIENTS
Following the above research idea, this study proposes a way of passing the target patient's information directly to the model. The operation of this part is shown in Figure 5. Is there any part of a neural network model design that allows the model to receive specific information directly? The answer is yes. Most ordinary algorithms map inputs to outputs one to one: one input yields one output, and there is no connection between different inputs. The structure of the traditional neural network is relatively simple: input layer, hidden layer, output layer [82].
RNN [83] differs from the traditional neural network in that, at each time step, the output of the previous step is fed into the next hidden layer and trained together with the new input. Inspired by RNN, this study proposes to take the selected target patient as a special input and bring it into the hidden layer. The biggest advantage of this approach is that it makes the model more sensitive to the target patient throughout the training process. Since only one network layer is added, with one hyperparameter that is preset and one parameter that is trained, there is little impact on the overall complexity of the model.
The procedure is divided into two steps. First, after the data are normalized uniformly, the selected target patient is multiplied by a ''bias coefficient'' e and added to all the input data of the training set. The bias coefficient can be freely set and represents the initial offset of the entire training set toward the target patient. e can be positive or negative, with larger absolute values indicating a greater influence of the target patient on the training set. The physical meaning of this operation can be understood as a shift of all sample points in the sample space along the direction of the selected target patient. If the bias coefficient is e = −1, the origin of the sample space coordinate system becomes the selected target patient's sample point. Second, a bias layer with only one parameter is added immediately after the input layer. In the bias layer, the selected target patient is multiplied by a ''restore coefficient'' e′ and added to the input data again. e′ is adjusted automatically during model training by back-propagation; it is the only trainable parameter added to the model. The structural comparison between the traditional model and the custom model is shown in Figure 6 [84]. Only one layer is added to the model structure, and only one parameter that needs to be trained is added.
In short, the input data are shifted twice along the direction of the selected target patient. The first shift moves the whole training set according to the predefined parameter e. The second shift moves the samples in each batch during training, while the trainable parameter e′ is updated by back-propagation. As e′ iterates, the origin of the coordinate system of the source dataset is displaced back and forth along the direction of the selected target patient, which forces the model to be stable for all sample points in that direction.
When this approach was first designed, it was expected that the restore coefficient e′ would gradually converge to the opposite of the bias coefficient e during training, that is, e′ = −e. When e′ = −e is reached, the input passed into the subsequent hidden layer is the original data, and all artificially added bias is counteracted.
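The two shifts can be written out in a few lines. The sketch below only illustrates the arithmetic of the bias coefficient (written `e`) and the restore coefficient (written `e_prime`); it is not the authors' network code, the helper names `pre_shift` and `bias_layer` are hypothetical, the restore coefficient is set by hand rather than learned, and the data are synthetic.

```python
import numpy as np

def pre_shift(X, x_p, e):
    """First shift: add e * x_p to every (normalised) training sample before training."""
    return X + e * x_p

def bias_layer(X_shifted, x_p, e_prime):
    """Second shift: the bias layer adds e_prime * x_p, where e_prime is the trainable scalar."""
    return X_shifted + e_prime * x_p

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # synthetic, already "normalised" samples
x_p = X[2]                   # the selected target patient

e = -1.0
shifted = pre_shift(X, x_p, e)           # with e = -1 the target patient becomes the origin
restored = bias_layer(shifted, x_p, -e)  # with e' = -e the original data are recovered
```

With e = −1 the row of the target patient in `shifted` is the zero vector, and applying e′ = −e returns exactly the original matrix, matching the two special cases described in the text.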
In the sample space, in addition to the absolute position coordinates, which contain all the information about a sample, the direction of the sample is also crucial information. In the process of adaptive bias adjustment, the directions of almost all samples change with each change of e′, and only the direction of the target sample remains constant. The change of the direction vector of each sample in Figure 7 illustrates this very clearly. The left panel represents the unbiased sample space, while the right panel shows the biased sample space. A comparison of the two plots shows that only the direction of the target patient, represented by the green square, is stable, while the directions of all other samples have changed.
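The invariance of the target direction follows from simple vector arithmetic: shifting by c times the target vector maps the target x_p to (1 + c) x_p, which points the same way whenever 1 + c > 0. The sketch below checks this numerically. It reads ''direction'' as the unit vector from the origin and takes c as the net shift e + e′; both readings, as well as the toy coordinates, are assumptions made for this illustration.

```python
import numpy as np

def direction(v):
    """Unit vector pointing from the origin to v."""
    return v / np.linalg.norm(v)

X = np.array([[1.0, 2.0], [3.0, 1.0], [-1.0, 2.0]])  # toy samples
x_p = X[0]   # target patient
c = 0.5      # net shift e + e', assumed greater than -1 so the target does not flip sign
X_shifted = X + c * x_p

# Cosine between each sample's direction before and after the shift.
cosines = [float(direction(a) @ direction(b)) for a, b in zip(X, X_shifted)]
```

The cosine for the target sample is 1 (its direction is unchanged), while the cosines for the other two samples are below 1, meaning their directions have rotated.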
However, reading the final value of e′ after training showed that, in the experiments, e′ was mostly negative regardless of whether e was set to a positive or a negative value. In other words, the model tends to move the overall sample space toward the negative half-axis during learning. This can be explained as follows: because the Rectified Linear Unit (ReLU) [85] is used as the activation function in the subsequent hidden layer, more values on the negative half-axis are mapped to zero, the sample space becomes more concentrated, and the overall function is more likely to converge. In order not to lose the accuracy and separability of the data in the original sample space, a parameter selection procedure for the value of e is subsequently introduced. The above method can be easily applied to an MLP (Multilayer Perceptron). A simplified MLP with only one hidden layer is defined, and its operational logic is summarized in the function f(x) shown in Equation (9). After the adaptive bias adjustment is introduced into the MLP model, the function changes as shown in Equation (10).
x_p is the laboratory report data of the target patient to be analyzed, and X_p is obtained by copying x_p to the same size as X. Of the two newly added variables, e is selected during the model construction phase and can be optimized later as a hyperparameter, while e′ is updated iteratively during training. Only one scalar that needs to be iterated, e′, is added during training, so the impact on the number of model parameters is negligible. Adaptive bias adjustment is not directly applicable to multiple patients for the time being. As an alternative, a model with adaptive bias adjustment can first be trained for each patient, and customization for the selected group of target patients can then be achieved by model fusion. That is, it is possible to customize both to individual patients and to several patients who share common characteristics. An example is mass testing for hepatitis C infection in some alcoholic populations.
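Equations (9) and (10) are not reproduced in this text. A plausible form that agrees with the verbal description (one hidden layer with ReLU, the pre-training shift by e, and the bias layer with the trainable e′) is sketched below. The symbols W_1, b_1, W_2, b_2 for the layer weights and biases and the row-wise matrix convention are assumptions; the original notation may differ.

```latex
\begin{align}
f(X) &= \operatorname{ReLU}\!\left(X W_1 + b_1\right) W_2 + b_2 \tag{9} \\
f(X) &= \operatorname{ReLU}\!\left(\left(X + e\,X_p + e'\,X_p\right) W_1 + b_1\right) W_2 + b_2 \tag{10}
\end{align}
```

Under this form, setting e′ = −e makes Equation (10) reduce to Equation (9), which is the convergence behavior that was initially expected of the restore coefficient.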

C. VALIDATION AND PARAMETER TUNING
Once a model has been built for a single problem, a question arises: how can the effectiveness of this new model be evaluated? In the past, all-purpose models were evaluated by reserving a separate portion of the collected data as the validation set and then tuning the model according to the performance of the trained model on that set [86]. However, such a process is no longer applicable to small sample sizes. Samples in a small sample space are already very rare, and each unique sample contributes significantly to the complexity of the entire sample space. Once some samples are set aside as a validation set and do not participate in training, the training effect of the model itself is greatly affected. Evaluation by K-fold cross-validation, in turn, suffers from a loss of accuracy when the final model is fitted [87]. Therefore, a novel construction of the validation set is proposed in this study. The operation of this part is shown in Figure 8.
When the quality of a trained model needs to be evaluated, two main criteria are generally used as a reference. One is the loss on the training set, such as the root-mean-square error (RMSE) [88], and the other is the loss on the validation set. This general case requires that the loss or RMSE of the validation set can be calculated, meaning that the correct output of the validation set needs to be known. This is possible in the general research and development phase because the validation set is split off from the complete dataset. But how can the accuracy of the model be evaluated in a real application scenario, when nobody has the correct output for the sample to be diagnosed? This study proposes to select a number of samples from the training set that are closest to the selected target patient as the validation set, in order to evaluate the accuracy and stability of the model for the selected target patient [89], [90].
This change mainly affects the loss function [91]; the original loss function is given in Equation (11).
After substituting the new validation set, only the y-values used in the loss function change (as in Equation (12)), with no additional computational cost. The main advantage is that the validation set now represents the accuracy and stability of the model for the target patient, rather than for the entire sample space as a traditional validation set does.

1) OPTIMAL PARAMETER SELECTION
With metrics that evaluate the accuracy and stability of the model with respect to the selected target patient as a guide, hyperparameter optimization of the custom model can be carried out with the help of the Optuna framework [92], [93]. In addition to the usual hyperparameters, such as the number of nodes per layer, the number of epochs, and the drop-out ratio, it is found that optimizing the bias coefficient e not only preserves the accuracy and separability of the data in the original sample space as much as possible, but also improves the accuracy of the model for the selected target patient.
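The search over the bias coefficient and the drop-out ratio can be illustrated with a minimal sketch. It uses a plain random search from the Python standard library as a stand-in for Optuna, and the function validation_error is a hypothetical toy surface that replaces "train the custom model and return the validation error"; neither is the implementation used in this study. The search ranges (-0.5 to 0.5 and 0 to 0.3) are the ones reported in the experiments.

```python
import random


def validation_error(bias_coefficient, dropout_ratio):
    """Hypothetical stand-in for training the custom model and
    returning the error on the target-specific validation set."""
    # Toy bowl-shaped surface with its minimum at (-0.2, 0.1); illustrative only.
    return (bias_coefficient + 0.2) ** 2 + (dropout_ratio - 0.1) ** 2


def random_search(objective, n_trials=200, seed=0):
    """Sample both hyperparameters uniformly and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {
            "bias_coefficient": rng.uniform(-0.5, 0.5),
            "dropout_ratio": rng.uniform(0.0, 0.3),
        }
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(validation_error)
print(best_params, best_score)
```

With Optuna, the same objective would instead be handed to a study, which chooses the next trial adaptively rather than uniformly at random.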

2) EARLY TERMINATION OF TRAINING
In the training process of conventional models, training is usually terminated by setting epochs or terminated early when the validation set loss is no longer decreasing [94], [95].
Since the scenario developed in this paper has an explicit target patient, the model can be called to compute the output for the selected target patient after each iteration, and training can be terminated early when that output has stabilized or when the loss on the validation set no longer decreases significantly.
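One possible form of this stopping rule is sketched below. The window length and the two tolerances are assumptions chosen for illustration; the paper does not specify them, and the function name should_stop is hypothetical.

```python
def should_stop(target_outputs, val_losses, window=5,
                output_tol=1e-3, loss_tol=1e-4):
    """Decide whether training can be terminated early.

    target_outputs: model output for the target patient after each epoch.
    val_losses:     validation loss after each epoch.
    window, output_tol, loss_tol are illustrative assumptions.
    """
    if len(target_outputs) < window + 1 or len(val_losses) < window + 1:
        return False  # not enough history yet

    # Condition 1: the prediction for the target patient has stabilized.
    recent = target_outputs[-window:]
    output_stable = max(recent) - min(recent) < output_tol

    # Condition 2: the validation loss has not improved noticeably
    # during the last `window` epochs.
    best_before = min(val_losses[:-window])
    best_recent = min(val_losses[-window:])
    loss_stalled = best_before - best_recent < loss_tol

    return output_stable or loss_stalled
```

In a training loop, the function would be called once per epoch with the two histories accumulated so far.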

IV. DATASET AND PREPROCESSING
To cope with the shortcomings of traditional detection methods, which are difficult, costly, and time-consuming, this paper tries to diagnose hepatitis C status through the use of blood biomarkers. The Hepatitis dataset from the UCI Machine Learning Repository was selected to show the effectiveness of the custom algorithm. This dataset concerns blood biomarkers for hepatitis C virus detection. It contains 615 cases with laboratory values of blood donors and hepatitis C patients, together with demographic values such as age. The target attribute for classification is the category (blood donors vs. hepatitis C), and there are 14 attributes. Ethical Considerations: The data involved in this paper were all obtained from publicly available sources [59] and have been properly cited according to the data publisher's requirements. Some of the data relate to patient case information; information related to identity has been removed or desensitized by the data publisher so as not to compromise patient privacy.
However, the accuracy of previous work is not sufficient for medical applications, so more advanced tools are needed to analyze the data. This section also analyzes the data using some of the most popular machine learning classification algorithms, in order to provide a baseline against which the approach proposed in this paper can be compared.
Exploratory Data Analysis: Basic checks were performed on the data using functions of pandas [96] and NumPy [97]. The training and test sets were loaded and briefly browsed with head() and .shape; the relevant statistics of the data were reviewed with .describe(); and the data types, column names, and missing (NaN) information were inspected with .info(). The presence of NaN values in each column was checked to identify missing and abnormal data. This gives a preliminary impression of the data. Some basic information and analysis of the data are shown in Table 1.
Handling of abnormal data and missing values [98]: Each kind of data has its own actual meaning behind it. When the data value exceeds the normal range or is a meaningless expression, it needs to be adjusted or supplemented in a targeted way. The dataset used here was reviewed by the medical staff, and there were no obvious abnormal values. For patient data with missing values in the dataset, this study chose to remove them.
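The inspection and missing-value steps described above can be sketched with pandas as follows. The small table is made up for illustration and merely stands in for the real dataset; the column subset and the values are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the real data (values are illustrative).
df = pd.DataFrame({
    "Age": [32, 45, 51, 60],
    "ALB": [38.5, np.nan, 43.2, 41.0],
    "AST": [22.1, 24.7, np.nan, 110.3],
    "Category": [0, 0, 0, 1],
})

print(df.head())       # first rows
print(df.shape)        # (rows, columns); shape is an attribute, not a method
df.info()              # dtypes and non-null counts per column
print(df.describe())   # summary statistics of the numeric columns

# Count missing values per column, then drop every row that has any.
missing_per_column = df.isna().sum()
clean = df.dropna().reset_index(drop=True)
print(missing_per_column)
print(clean)
```

Dropping rows is the choice made in this study; on a dataset this small, every removed row is a real loss, which is why the count of missing values per column is worth checking first.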
The processed data samples have the following features from x1 to x12: Age, Sex, ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, and PROT. Each sample has a label y1 with 1 and 0 for disease or absence of disease, respectively.
In filter and wrapper feature selection approaches, the feature selection process is explicitly decoupled from the learning process, which allows for more accurate correlation analysis. As the name suggests, correlation analysis examines how closely two variables are related by analyzing them together. Correlation analysis is only meaningful if there is some link or probabilistic relationship between the variables. Karl Pearson, a well-known statistician, developed the correlation coefficient [99], a statistical measure of how strongly two variables are related to one another. The product-moment method obtains the coefficient from the products of the two variables' deviations from their respective means; it is especially useful for calculating the simple linear correlation coefficient. The seaborn visualization package was used to create a scatter plot of the correlation analysis matrix, shown in Figure 9 [100].
The correlation coefficient r ranges from −1 to +1 and has the following characteristics [101]. A positive correlation between two variables is indicated by an r value greater than zero, whereas a negative correlation is indicated by an r value less than zero. When |r| = 1, there is a perfect linear correlation between the two variables; in other words, they are functionally related. If r = 0, there is no linear relationship between the two variables. When 0 < |r| < 1, some degree of linear correlation exists between the two variables. As |r| approaches 1, the linear relationship strengthens; as it approaches 0, the relationship weakens.
Generally, the strength of correlation can be divided into three levels: |r| < 0.4 for low linear correlation; 0.4 ≤ |r| < 0.7 for significant correlation; and 0.7 ≤ |r| < 1 for high linear correlation [102]. The correlation analysis revealed that feature x6 (AST, aspartate aminotransferase) is very important for the final label. Features x7 (BIL, bilirubin) and x11 (GGT, gamma-glutamyl transferase) also contribute to some extent. There is also a clear correlation between features x3 (ALB, albumin) and x12 (PROT, protein). Visualization of the relationships between numerical features based on the correlation analysis, together with several common means of preliminary data analysis, was also used to gain a preliminary understanding of the data, but no modifications were made to the data at this stage.
There appears to be a considerable amount of noise and outliers [103]. Some data engineers choose to remove outliers at a fixed rate and then normalize the data to facilitate analysis [104]. However, considering that this is a medical dataset, all the data are kept in this case so that the data can cover more rare cases. The main purpose of feature engineering is to improve the performance of machine learning by transforming data into features that better represent the underlying problem: outliers are typically processed to remove noise, and features are constructed to enhance the representation of the data. To make the machine learning model usable by people without a rich industry background, no additional domain knowledge is introduced here to perform complex processing of the data.
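As a worked illustration of the coefficient and the three levels, the following sketch computes Pearson's r from the deviations about the means and maps |r| to a level. The function names are hypothetical, and treating |r| = 1 as "high" is an assumption made here for convenience.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation from the products of deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


def correlation_level(r):
    """Map |r| to the three levels used in the text.

    |r| = 1 (perfect correlation) is reported as "high" here (assumption).
    """
    a = abs(r)
    if a < 0.4:
        return "low"
    if a < 0.7:
        return "significant"
    return "high"


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(r, correlation_level(r))
```

For the sample above, the deviation products sum to 8 and both sums of squared deviations equal 10, so r = 8 / 10 = 0.8, which falls in the high range.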
Because of the limited amount of data in the medical dataset, each patient's data information is very precious. Therefore, in order to make full use of this information, the training set is divided in a special way. Each time a specific patient is analyzed, we define the patient's laboratory results as a separate test set and assign all the remaining data to the training set. Whenever a patient changes, the training set changes as well. This is designed to mimic the actual scenario of hospital diagnosis, i.e., for a new patient seeking medical treatment, all the previously saved analysis data is used as the training set to train the model for the new patient.
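The split described above amounts to holding out exactly one patient at a time. A minimal sketch, assuming the features are stored in a NumPy array with one row per patient (the function name patient_split is hypothetical):

```python
import numpy as np


def patient_split(X, y, target_index):
    """Hold out one patient as the test sample; all others form the training set."""
    mask = np.ones(len(X), dtype=bool)
    mask[target_index] = False
    return X[mask], y[mask], X[target_index], y[target_index]


# Toy data: 5 patients with 2 features each (illustrative values only).
X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0, 0, 1, 0, 1])

for i in range(len(X)):
    X_train, y_train, x_target, y_target = patient_split(X, y, i)
    # A custom model would be trained on (X_train, y_train) here
    # and then applied to x_target.
    print(i, X_train.shape, x_target, y_target)
```

Because the training set changes with every target patient, a new model is fitted for each row, which is what makes the per-patient analysis time relevant.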

V. EXPERIMENTS
The experiments are divided into two phases: a comparison phase and a hyperparameter tuning phase. The comparison phase verifies the effectiveness of the custom model and compares its performance with that of other commonly used models on several evaluation criteria. The second phase shows the best performance the custom model can reach by tuning some of its hyperparameter settings. The experiments were conducted on a workstation with an Intel Xeon W-2125 CPU, a Quadro RTX 4000 GPU with 8 GiB of video memory, 32 GiB of DDR4 RAM, and an SSD for secondary storage. All experiments were performed multiple times and the average results were recorded.

A. PARAMETERS OF THE APPROACHES
Three main improvement approaches are presented in the methodology section, all of which introduce new hyperparameters that were not present in the original machine learning model. Some of these hyperparameters are presented and analyzed next. In the first phase of the experiments, the parameters of the improvement approaches were held at fixed settings to verify the generalizability of the improvement scheme. The special parameters are described below.

1) TARGETED DATA AUGMENTATION
a: SIMILARITY THRESHOLD
By calculating the Mahalanobis distance, the degree of similarity between each sample in the training set and the selected target patient can be obtained; the smaller the Mahalanobis distance, the more similar the sample. A threshold is set so that samples with a Mahalanobis distance below it are identified as extremely similar to the selected target patient, and SMOTE oversampling is performed on these extremely similar samples. In this experiment phase, the similarity threshold was set to a fixed value of 2.5.
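The thresholding step can be sketched with NumPy as follows. Estimating the covariance matrix from the training set and using a pseudo-inverse are assumptions made for this illustration, and the function names are hypothetical.

```python
import numpy as np


def mahalanobis_to_target(X_train, x_target):
    """Mahalanobis distance from every training sample to the target patient.

    Assumption: the covariance matrix is estimated from the training set.
    """
    cov = np.cov(X_train, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X_train - x_target
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))


def similar_indices(X_train, x_target, threshold=2.5):
    """Indices of training samples closer to the target than the threshold."""
    d = mahalanobis_to_target(X_train, x_target)
    return np.flatnonzero(d < threshold), d


# Toy training set whose sample covariance is the identity matrix.
X_train = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0]])
x_target = np.array([1.0, 1.0])

idx, dist = similar_indices(X_train, x_target, threshold=2.5)
print(idx, dist)
```

In this toy example the four corner points lie at distance sqrt(2) from the target and the centre point at distance 0, so all five samples fall under the threshold of 2.5.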

b: THE PROPORTION OF MINORITY CLASSES AFTER OVERSAMPLING
The samples identified as extremely similar to the selected target patient were oversampled and expanded. These samples are treated as the minority class and are expanded until their number reaches a set proportion of the majority class (the samples considered less similar). In this experiment phase, the proportion of the oversampled minority class was fixed at 0.3.
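A minimal sketch of how the proportion could translate into a number of synthetic samples is given below. It assumes that the proportion is the ratio of similar samples to less similar samples after oversampling. The interpolation step is a simplified, SMOTE-style stand-in that pairs similar samples at random instead of using nearest neighbours, so it is not the SMOTE algorithm itself; all names are hypothetical.

```python
import numpy as np


def n_synthetic_needed(n_similar, n_other, proportion=0.3):
    """Synthetic samples needed so that similar / other reaches `proportion`.

    Assumption: the proportion is measured against the less similar samples.
    """
    target = round(proportion * n_other)
    return max(0, target - n_similar)


def interpolate_samples(similar, n_new, seed=0):
    """Simplified SMOTE-style interpolation between random pairs of similar samples."""
    rng = np.random.default_rng(seed)
    base = rng.integers(0, len(similar), size=n_new)
    mate = rng.integers(0, len(similar), size=n_new)
    gap = rng.random((n_new, 1))  # position along the segment base -> mate
    return similar[base] + gap * (similar[mate] - similar[base])


similar = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
n_new = n_synthetic_needed(n_similar=len(similar), n_other=50)
synthetic = interpolate_samples(similar, n_new)
print(n_new, synthetic.shape)
```

Each synthetic point lies on a segment between two real similar samples, so the added density stays inside the region around the target patient.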

2) NOVEL CONSTRUCTS OF VALIDATION SET
The Number of Results Identified as Similar: The samples in the training set are sorted by Mahalanobis distance from smallest to largest. Given the number of similar results, that many of the samples most similar to the selected target patient are copied from the training set to form the validation set. In this stage, this parameter is fixed at 10, i.e., the ten samples most similar to the selected target patient are selected as the validation set.
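The construction of this validation set can be sketched as a nearest-neighbour selection. As before, estimating the covariance from the training set is an assumption, and the function name nearest_validation_set is hypothetical.

```python
import numpy as np


def nearest_validation_set(X_train, y_train, x_target, k=10):
    """Copy the k training samples closest to the target patient.

    The training set itself is left unchanged; the copies form the validation set.
    """
    inv_cov = np.linalg.pinv(np.cov(X_train, rowvar=False))
    diff = X_train - x_target
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    dist = np.sqrt(np.maximum(d2, 0.0))
    order = np.argsort(dist)[:k]  # smallest distance first
    return X_train[order].copy(), y_train[order].copy(), order


# Random toy data standing in for the training set (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = rng.integers(0, 2, size=50)
x_target = rng.normal(size=3)

X_val, y_val, order = nearest_validation_set(X_train, y_train, x_target, k=10)
print(order)
```

Because the selected samples are copied rather than removed, the model still trains on every available sample, which is the point of this construction for small datasets.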

3) ADAPTIVE BIAS ADJUSTMENT
Bias Coefficient (e): As introduced in 3.2 above. In this phase, the bias coefficient is fixed at -0.2.

B. MODEL CONSTRUCTION
The most basic back-propagation neural network model (MLP, multilayer perceptron) with three sequential fully connected layers is chosen as the backbone network [105]. Based on the number of independent variables, the number of nodes in each of the three fully connected layers is set to 120, and each layer uses the ReLU activation function. The final output layer has only one node and uses sigmoid as its activation function. The model has 30,722 trainable parameters in total. The Adam optimizer [106] is used, and binary_crossentropy is chosen as the loss function. The overall model construction is kept as simple as possible so that the merit of the final output can be evaluated without complex techniques. Callbacks are used for the model: val_loss is the monitored quantity, and the best model is saved. The batch size is 256, and the maximum number of epochs is 100.

C. COMPARISON MODEL SELECTION
After completing the processing of the data, the initial screening of the algorithms was performed with the help of the AutoGluon platform [107]. First, the TabularPredictor and TabularDataset classes of AutoGluon are imported, and the training data are loaded into an AutoGluon TabularDataset object [108]. Next, AutoGluon is used to automatically train different models based on different algorithms, and the performance of the trained models is evaluated by making predictions on the reserved test set. XGBoost and LightGBM were the two most efficient algorithms, so the XGBoost and LightGBM models were chosen as references.

D. CUSTOM MODEL PERFORMANCE
1) THE ORIGINAL PERFORMANCE
Next, the custom model is evaluated. In the context of this problem, the laboratory result information is unique to each patient. This section focuses on experimenting with combinations of the parameters involved in the three improvement approaches so that the most accurate results can be obtained for each selected target patient. The parameter selection has three rounds. In the first round, the number of nodes per layer, the number of epochs, and the batch size are determined based on the problem complexity, the number of parameters, and the number of samples. In the second round, repeatable experiments are conducted on a certain number of samples to find parameters that are common to the whole dataset: the similarity threshold, the number of results identified as similar, and the proportion of minority classes after oversampling. These parameters are all closely related to the distribution of the samples in the overall sample space, and they are fixed at values that suit essentially all selected target patients. In the third round, the bias coefficient and the drop-out ratio are selected for each selected target patient by the Optuna framework. Optuna essentially runs repeated trials over multiple parameters and outputs the optimal parameters according to the monitored quantity, which here is the MAE of the validation set. The bias coefficient varies from -0.5 to 0.5, and the drop-out ratio varies from 0 to 0.3. The resulting optimal parameters are: similarity threshold, 3; number of results identified as similar, 5; and proportion of minority classes after oversampling, 0.3. Two judgment conditions were subsequently added: (1) targeted data augmentation is stopped when fewer than 6 samples fall below the similarity threshold; (2) when more samples fall below the similarity threshold, the data augmentation selects at most 15 samples as the expansion base.
For the same test set samples as used for the XGBoost and LightGBM models, each sample is treated in turn as the selected target patient, and a custom model is constructed and used to classify it; the final results are shown in Table 5. There is a significant improvement over XGBoost and LightGBM. To better verify the applicability of the method, the entire dataset was then iterated over: each sample in the whole dataset is treated in turn as the selected target patient, and a custom model is constructed and used to classify it; the final results are shown in Table 6.
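The two judgment conditions can be expressed as a small guard around the augmentation step. The sketch below assumes that, when more than 15 samples qualify, the 15 closest ones are kept; the text only states the upper limit, so this choice and the function name augmentation_base are assumptions.

```python
def augmentation_base(distances, threshold=3.0, min_similar=6, max_base=15):
    """Select the samples used as the base for targeted augmentation.

    distances: Mahalanobis distance of each training sample to the target patient.
    Returns [] when augmentation is switched off (fewer than `min_similar`
    samples below the threshold); otherwise the indices of at most `max_base`
    samples, closest first (keeping the closest ones is an assumption).
    """
    similar = sorted((d, i) for i, d in enumerate(distances) if d < threshold)
    if len(similar) < min_similar:
        return []
    return [i for _, i in similar[:max_base]]


# Only five samples are similar enough: augmentation is disabled.
print(augmentation_base([0.5, 1.0, 2.0, 2.5, 2.9] + [5.0] * 10))

# Twenty samples qualify: only the fifteen closest are used.
print(augmentation_base([0.1 * i for i in range(20)] + [10.0]))
```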

2) THE PERFORMANCE WITH UPGRADED APPROACHES
By analyzing the results for each selected target patient, the following conclusions can be drawn. Although the results are good, in a real-world environment the unevenness of the sample space can still cause trouble for the custom model. (1) When there are too few similar samples, forcibly selecting less similar samples as the base for augmentation increases the density of the overall training set in a region that deviates from the selected target patient. It is more sensible to turn off the targeted sample augmentation in this case. (2) When there are too many similar samples, the similar sample set is spread more evenly over the overall sample space because of its larger size, and it is no longer sensitive to the selected target patient. The overall performance on the validation set then cannot accurately reflect the accuracy of the custom model for the selected target patient. (3) For some selected target patients with an unusual distribution, the most similar samples may belong to the opposite category. The custom model cannot yet distinguish this case, and more comprehensive, richer, and better balanced data are needed to resolve it.
The new results are shown in Table 7 after introducing the automatic disablement of the targeted sample augmentation and the setting of the upper limit of similarity samples.

3) THE PERFORMANCE FOR HIGHER RECALL
Among the target application scenarios of this study, especially when the model is applied to large-scale screening, accuracy is certainly a crucial evaluation criterion. The highest possible accuracy rate allows patients to be accurately identified and treated, and also eliminates the need for additional follow-up testing in healthy individuals.
But the situation changes when hospitals use custom models to diagnose the patients who visit them. For each patient, a false negative poses a much greater risk than a false positive, so it is important to improve the recall of the model as much as possible. Guided by this need, the evaluation criteria of the custom model were adjusted: the binary_crossentropy used as the loss function was partially modified so that the penalty for false negatives is greater than that for false positives. With this adjustment, the new performance of the model on the test set is shown in Table 8. All patients in this test set were correctly identified, but this also led to a more significant decrease in the other evaluation criteria. In the recall enhancement experiment on the total data, 55 of the 56 patients were identified.
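One common way to penalize false negatives more heavily is to weight the positive-class term of the binary cross-entropy. The sketch below illustrates that idea; the weight value of 3.0 and the function name weighted_bce are assumptions, and the exact modification used in this study is not specified.

```python
import numpy as np


def weighted_bce(y_true, y_prob, fn_weight=3.0, eps=1e-7):
    """Binary cross-entropy with a heavier penalty on missed positives.

    fn_weight scales the loss term of the positive class, so predicting a low
    probability for a true patient costs more than predicting a high
    probability for a healthy person. fn_weight=3.0 is an illustrative value.
    """
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    loss = -(fn_weight * y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
    return float(loss.mean())


missed_patient = weighted_bce([1], [0.1])   # true patient predicted as healthy
false_alarm = weighted_bce([0], [0.9])      # healthy person predicted as patient
print(missed_patient, false_alarm)
```

With fn_weight set to 1.0 the function reduces to the ordinary binary cross-entropy, so the weight directly controls how much accuracy is traded for recall.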

VI. CONCLUSION
This study integrates target patient analysis into a three-stage machine learning process of data processing, model building, and parameter optimization. In the first stage, data processing, targeted data augmentation is performed on the training data in light of the patient's information, so that the dataset becomes relevant to the target patient. By calculating quantities such as the Mahalanobis distance, the relevant information within the data is fully explored, and the weight of the samples closely related to the target patient is increased according to the requirements of the scenario. In this process, the target patient information provides the optimization direction for data processing. The second stage is model training, which uses the target patient information as additional fixed training data so that the target patient acts as a constraint throughout the training process. The goal of this model is better performance on a specific target patient, which differs from the goal of previous models that emphasize broad adaptability. In this process, the target patient information provides additional information for model training. The third stage, parameter optimization, uses the target patient information as a reference standard. This criterion can be used to verify the magnitude of the error after each iteration, to tune the model through back-propagation based on that error, and to compare the advantages and disadvantages of several approaches after all training is completed. It provides a reliable reference for parameter tuning related to the target patient, and further research in this direction is worthwhile.
In the testing of the hepatitis C dataset, an extremely accurate model (accuracy of 99.4%) was built without introducing additional information and without any relevant medical background. A comparison of test results among the various models is shown in Figure 10. This accuracy far exceeds that of the decision tree model based on expert system logic used in the article by the original provider of the data with a medical background. A richer comparison of experimental results is detailed in Table 9.

FIGURE 10.
Comparison of test results among various models. For the five models compared, all use the same test set, except for custom-pro, whose result was obtained on the whole dataset. The custom and custom-pro models show advantages in various metrics, and custom-recall shows that the customized solution can achieve almost 100% recall at the expense of some of the remaining criteria.

TABLE 9.
Richer comparison of experimental results. The custom models at the top of the table are the ones proposed in this paper, the models in the middle of the table are the results of tuning and optimization using mature algorithms, and the models at the bottom of the table are the results of other teams using the same dataset. RF is short for Random Forest, LR is short for Linear Regression, DT is short for Decision Tree, and KNN is short for K-Nearest Neighbor.
In testing on the hepatitis C dataset, the custom model outperformed the extremely well-developed XGBoost and LightGBM models (selected by AutoGluon). This efficient and accurate model does not require cumbersome tuning and data processing, and medical practitioners do not need to master complex machine learning techniques to use it directly as a diagnostic aid.
The analysis time for each individual patient in this case study is about 30 s, which meets the time requirement for medical diagnosis once a patient has finished the blood test. The equipment required for training and running the model is common and easy to obtain. This means that no separate viral test is needed, and that only the simplest of blood tests is needed to detect the vast majority of patients with hepatitis C, providing a powerful tool for hepatitis C disease control. The custom model can also trade some accuracy for higher recall as needed. Its confidence level is greatly improved and no longer relies on the general average level of the model for evaluation, which ensures that the model can be applied to the treatment of specific patients.
LERAN CHEN was born in Shandong, China, in 1998. He received the B.S. degree in mechanical engineering from the Southern University of Science and Technology (SUSTech), in 2020, where he is currently pursuing the Ph.D. degree with the joint Ph.D. degree between SUSTech and The Hong Kong Polytechnic University. His research interests include machine learning, custom algorithms, and smart manufacturing.
PING JI received the Ph.D. degree in USA. In 1984, he joined as an Assistant Lecturer at Beihang University. He was at the National University of Singapore (NUS), in 1992. He joined The Hong Kong Polytechnic University (PolyU), Hong Kong, in 1996, where he is currently a Professor with the Department of Industrial and Systems Engineering. He has authored or coauthored more than 100 journal articles. His current research interests include enterprise resources planning, operations management and optimization, and its applications.
YONGSHENG MA received the B.Eng. degree from Tsinghua University, Beijing, in 1986, and the M.Sc. and Ph.D. degrees from UMIST, U.K., in 1990 and 1994, respectively.
He started his career as a Polytechnic Lecturer in Singapore from 1993 to 1996; and then a Research Fellow, a Senior Research Fellow, and a Group Manager from 1996 to 2000 at the Singapore Institute of Manufacturing Technology. He was an Associate Professor with Nanyang Technological University, Singapore, from 2000 to 2007. He was a Full Professor at the University of Alberta (UA) from 2007 to 2021. He has been a Full Professor with the Southern University of Science and Technology (SUSTech), Shenzhen, China, since July 2021. He has an established research profile with many research projects from different sources, and published more than 200 papers internationally in recognized top journals, conferences, and book chapters. His research interests include CAx interoperability, CAD/CAE integration, collaborative and concurrent engineering in MRP/ERP/CRM, and product life cycle management. His specialty is in feature-based intelligent product and engineering process informatics.
Dr. Ma has been a member of ASEE, SME, SPE, ASME, CSME and a Canada (Alberta) registered Professional Engineer (P.Eng.), since 2009. In 2012, he received the prestigious ASTech Award from The Alberta Science and Technology Leadership Foundation together with Drader Manufacturing Ltd. He was an Associate Editor of IEEE TRANSACTIONS OF AUTOMATION SCIENCE AND ENGINEERING, from 2009 to 2013. He has been an Editorial Board Member of Advanced Engineering Informatics (ADVEI, Elsevier), since 2012, and has been an Associate Editor, since 2020. Concurrently, he is an Associate Editor of ASME Journal of Computer Information Science and Engineering (JCISE), and an Editorial Member of Scientific Reports (Springer Nature).