Identification Tool for Gastric Cancer Based on Integration of 33 Clinical Available Blood Indices through Deep Learning

Gastric cancer (GC) is one of the most common cancers in the world. In cancer detection, liquid biopsy, as a noninvasive and rapid method, is growing in importance. Unlike traditional liquid biopsy, which relies on a single biomarker, this study integrated a variety of blood biochemical indices and established an identification system by means of deep learning under the H2O framework. From 2951 samples, 58 routine blood and biochemical indices, together with age and gender, were collected as comprehensive indices to establish the identification model. Then, the number of indices was reduced to simplify the model, and 33 indices were utilized to build the final identification tool. Tenfold cross-validation was used to evaluate the performance of the proposed method. The sensitivity, specificity, accuracy, and area under the ROC curve on the cross-validation set were 85.44%, 83.82%, 84.54% and 0.9165, respectively. The identification tool is freely available online at http://www.cppdd.cn/GC2. The proposed system provides a new approach to identifying GC, with the advantages of being efficient, noninvasive and economical. Deep learning on the integration of these blood biochemical indices will bring insights into the comprehensive understanding of GC pathology, as well as the prevention, screening, diagnosis, and prognosis of GC.


I. INTRODUCTION
Gastric cancer (GC) was responsible for over 1.08 million new cancer cases in 2020 and more than 768,000 deaths worldwide [1]. Although the incidence and mortality of GC have declined, it is still the fifth most commonly diagnosed cancer and ranks fourth in the mortality rate globally [2]. The high prevalence of GC, poor prognosis [3], and limited treatment options have resulted in a heavy medical burden [4].
The development of GC is a multistage, multistep, and multimechanism process that is highly heterogeneous in terms of structural growth, cell differentiation, molecular pathogenesis, and stages [4]. Therefore, the diagnosis of GC is difficult, especially early GC. Meanwhile, the 5-year survival rate of patients with early GC after undergoing surgery exceeds 90%. In contrast, the 5-year survival rate of patients with stage IV GC is less than 5% [5]. The earlier a patient with GC is diagnosed, the easier it can be cured [6]. However, due to nonspecific symptoms of early patients, less than 25% of patients with GC are detected at an early stage [3]. Therefore, a fast and low-cost detection tool is urgently needed to detect GC.
Endoscopy and imaging examination are the gold standard diagnostic methods for GC [7]. However, the detection accuracy of conventional endoscopy is only 69% to 79% [8], and imaging examination has difficulty identifying early lesions with high accuracy [9]. In addition, endoscopy may cause gastric bleeding, gastric perforation, and bacterial infections [10]. Imaging examination should be combined with pathological examination to make a final diagnosis [11]. Both endoscopy and imaging examination require experienced doctors [8], and there is a severe shortage of endoscopists and large-scale endoscopy centers in poverty-stricken areas [12]. Therefore, endoscopy and imaging examinations are not suitable for large-scale screening.
As a minimally invasive method, liquid biopsy can provide information for cancer diagnosis, tumor monitoring, and clinical prognosis [13]. Blood is the most important body fluid for this purpose [14], and plasma is excellent at reflecting gastrointestinal diseases [15]. GC tissue is believed to release tumor markers into the peripheral blood, such as circulating tumor DNA (ctDNA), cell-free DNA (cfDNA), tumor-associated RNA, protein, and circulating tumor cells (CTCs) [16]. With the progress of genomics and proteomics and the development of PCR detection technology, an increasing number of blood biomarkers have been found and applied to the diagnosis and prognosis of cancer [17]. Traditional serum markers, such as carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA19-9), have been applied to diagnose GC [18]. Studies have shown that the pooled sensitivity of blood DNA methylation as a diagnostic marker of GC is 57% (95% CI, 50-63%) [19]. The level of ctDNA is affected by tumor type and stage, and its use needs further study [13]. cfDNA is more sensitive than conventional tumor markers, but it cannot reliably distinguish GC from other conditions, such as inflammatory diseases and infection [16].
MicroRNA (miRNA) [20], long noncoding RNA (lncRNA) [6], and circular RNA (circRNA) [21] play a key role in the occurrence and development of tumors. However, the chemical instability of RNA prevents it from becoming a good biomarker [19]. The number of circulating tumor cells is considered to be fewer than 5 in 7.5 ml of blood, which makes detection difficult [13]. The combined detection of multiple biomarkers can avoid the interference of blood mutation template molecules and reduce sampling deviation and individual differences [17], [22]. Organs, tissues, and blood continuously exchange substances, and blood indices are in a state of dynamic balance. This balance is likely to be disrupted by disease, resulting in abnormal blood indices [23]. Wu et al. showed that the neutrophil-lymphocyte ratio (NLR) and platelet-lymphocyte ratio (PLR) differed significantly among GC patients at different stages. The combination of PLR and CEA (AUC=0.780) is better than CEA alone (AUC=0.671) for the diagnosis of gastric cancer [10]. Deep learning, with its excellent self-learning ability, can handle multidimensional nonlinear statistical relationships better than traditional methods [24]. Applying deep learning to clinically available blood indices provides a new approach to the differentiation of GC.
In this work, an identification model of GC was established based on blood indices through a deep learning algorithm. GC and other diseases could be effectively distinguished by this model.
[...] indices from Ortho VITROS 5600. The details of the dataset are listed in Table I, and detailed information is provided in supplementary table S1. This study met ethical requirements, with the consent of all patients and healthy volunteers, and was reviewed and approved by the ethics committee of the Second Hospital of Lanzhou University.

B. Machine Learning Method
Deep learning utilizes multiple neural network layers to extract more important information from multidimensional data [8]. The artificial neural network can effectively reflect the nonlinear process of tumor development and metastasis [24]. The H2O framework includes an advanced artificial neural network [25], with deep learning algorithms embedded in the framework. The deep learning algorithm is similar to the classical multilayer perceptron (MLP) [26], which is optimized through continuous iteration. The model was trained by stochastic gradient descent, with the gradients computed by backpropagation. The performance of the identification model can be further optimized by adjusting the hyperparameters of the neural network.
Stratified sampling was used to divide the total data into a training set (2682 samples) and a test set (269 samples). The initial H2O classification model was evaluated by tenfold cross-validation on the training set; the test set was used for external validation and did not participate in the construction of the model. The first step was to choose the number of hidden layers. In the initial experiment, we set the number of neurons in each hidden layer to 50 and then increased the number of hidden layers, recording the AUC and logloss on the training set and under tenfold cross-validation, as shown in Fig. 1. As shown in Fig. 1a, as the number of hidden layers increased, the AUC on the training set continued to increase, whereas the AUC of the tenfold cross-validation first increased and then oscillated. The logloss of the tenfold cross-validation tended to increase with the number of hidden layers, as shown in Fig. 1b. Thus, adding hidden layers improved the fit to the training set, but the cross-validation results showed that the model tended to overfit. To balance overfitting and underfitting, three hidden layers were selected as the basic structure of the neural network. Then, a grid search was applied to tune the hyperparameters of the h2o.deeplearning function, including "input dropout ratio", "activation", "initial weight distribution", "loss", "distribution", etc. Multiple hyperparameters were combined to determine the best configuration. The specific settings are shown in Table II. Finally, the number of iterations and the sizes of the hidden layers were adjusted. The final number of iterations was set to 1000, and the three hidden layers had 80, 80, and 110 neurons. Initially, all features, including age, gender, 26 routine blood indices, and 32 biochemical indices, were used as the input layer of the neural network to construct model-1.
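The division into 2682 training and 269 test samples can be illustrated with a minimal sketch of a stratified split. This is not the authors' code: the function name `stratified_split`, the largest-remainder rounding, and the class balance used in the demonstration (1100 positives, 1851 negatives) are assumptions made only for illustration; the actual class counts are given in Table I, which is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, n_test, seed=0):
    """Split sample indices into (train, test) while keeping the class
    proportions of `labels` approximately equal in both parts.

    Hypothetical helper for illustration; not part of the H2O package.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    # Ideal (fractional) number of test samples per class.
    quotas = {c: len(ix) * n_test / total for c, ix in by_class.items()}
    # Round down, then hand the remaining slots to the classes with the
    # largest fractional remainders so the test set has exactly n_test items.
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = n_test - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    train_idx, test_idx = [], []
    for c, ix in by_class.items():
        rng.shuffle(ix)
        test_idx.extend(ix[:counts[c]])
        train_idx.extend(ix[counts[c]:])
    return sorted(train_idx), sorted(test_idx)


# Assumed class balance, for demonstration only (1 = GC, 0 = non-GC).
demo_labels = [1] * 1100 + [0] * 1851
train, test = stratified_split(demo_labels, n_test=269)
print(len(train), len(test))
```

Because the test indices are drawn class by class, the held-out set mirrors the class mix of the full dataset, which keeps sensitivity and specificity estimates on the external test set comparable with those from cross-validation.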
To simplify the blood testing process and reduce model noise, the built-in function of the H2O package (h2o.varimp) was applied to calculate the importance of the indices. The calculated importance values are listed in Table III as importance percentages. Then, the number of indices was reduced to establish the final model (model-2).
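The reduction step can be sketched as a simple ranking: convert relative importances to percentages and keep the highest-ranked indices. The sketch below assumes the importances are already available as a name-to-value mapping; the helper names and the numeric values are hypothetical and are not taken from Table III, and the code does not reproduce how H2O computes the importances internally.

```python
def importance_percentages(relative_importance):
    """Express each relative importance as a share of the total (sums to 1)."""
    total = sum(relative_importance.values())
    return {name: value / total for name, value in relative_importance.items()}


def select_top_features(relative_importance, k):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(relative_importance, key=relative_importance.get, reverse=True)
    return ranked[:k]


# Hypothetical importances for four indices (illustrative values only).
demo_importance = {"age": 4.0, "PCT": 3.0, "K": 2.0, "Mg": 1.0}

shares = importance_percentages(demo_importance)
kept = select_top_features(demo_importance, k=2)
print(shares)
print(kept)
```

In the study the same idea is applied to 60 input features, with the cut-off chosen by observing how model performance changes as the number of retained features varies.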
The deep learning neural network was executed by the H2O package (version 3.32.0.1) in R (version 3.2.4). The H2O package is a parallel machine learning package that provides fast, scalable machine learning algorithms.

C. Assessment Method
The generalization capability of the model can be evaluated by cross-validation, thereby guarding against overfitting [27]. In tenfold cross-validation, the training set was divided into 10 parts; 9 of them were used to build the model, and the remaining part was used as the internal test set to verify the performance of the model. This procedure was repeated 10 times, so that each part served once as the internal test set, yielding ten models for assessing the accuracy and reliability of the approach. The external test set did not participate in the establishment of the model and was used only to verify the performance of the model.
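The fold construction described above can be sketched as follows. The text does not say whether the folds were stratified by class, so the class-wise assignment here is an assumption, as are the function name `stratified_kfold` and the class balance used in the demonstration.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds.

    Samples of each class are shuffled and dealt round-robin into the folds,
    so every fold has roughly the same class mix (an assumed design choice).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(
            idx for f in range(k) if f != held_out for idx in folds[f]
        )
        yield training, validation


# Assumed class balance of the 2682 training samples, for demonstration only.
demo_labels = [1] * 1000 + [0] * 1682
splits = list(stratified_kfold(demo_labels, k=10))
print([len(validation) for _, validation in splits])
```

Each sample is held out exactly once, so the cross-validation metrics are computed over predictions for all 2682 training samples, none of which were seen by the model that predicted them.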
The classification results of the model are summarized with a confusion matrix [28]. Sensitivity (Sens, (1)), specificity (Spec, (2)), and accuracy (ACC, (3)) were calculated from the numbers of true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) samples:
$$\mathrm{Sens} = \frac{TP}{TP + FN} \quad (1)$$
$$\mathrm{Spec} = \frac{TN}{TN + FP} \quad (2)$$
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \quad (3)$$
The receiver operating characteristic (ROC) curve was drawn with sensitivity on the y-axis and 1-specificity on the x-axis. The area under the curve (AUC) was obtained by calculating the area under the ROC curve. The closer the AUC is to 1, the better the classification performance of the model.
Note: * Statistically significant difference in a feature between the gastric cancer group (positive samples) and the whole negative sample group, which comprises three sub-groups (normal group, other cancer group, and other gastric disease group). 1 Statistically significant difference in a feature between the gastric cancer group and the normal sub-group. 2 Statistically significant difference in a feature between the gastric cancer group and the other cancer sub-group. 3 Statistically significant difference in a feature between the gastric cancer group and the other gastric disease sub-group. The detailed statistical method used to calculate the significant differences is given in the supplementary material.
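As a worked illustration of these assessment metrics, the sketch below computes sensitivity, specificity, and accuracy from confusion-matrix counts, and the AUC through the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration; they are not the study's data or its exact implementation.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive sample scores higher
    than a random negative one (ties count as one half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three positive and four negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, fp, tn, fn))
print(roc_auc(y_true, scores))
```

In the toy example one positive sample scores below one negative sample, so 11 of the 12 positive-negative pairs are ranked correctly and the AUC is 11/12. Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas the AUC does not.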

A. Good Identification Performance
Model-1 is based on 60 features, with excellent classification and prediction performance. For the external test set, the AUC of model-1 was 0.9152, the sensitivity was 80.20%, the specificity was 91.07%, and the accuracy was 86.99%. The ROC curve of the external test set is shown in Fig. 2a.
As shown in Fig. 3, with the increase in features involved in modeling, the performance of the model increases. Finally, we selected 33 features to build the final model (model-2). Model-2 has good performance, with a sensitivity, specificity, total accuracy and area under the curve of 85.44%, 83.82%, 84.54% and 0.9165 for the cross-validation set, respectively. The selected features are shown in Table III. For the external test set, the sensitivity, specificity, and accuracy of model-2 were 85.15%, 81.55%, and 82.90%, respectively. The ROC curve of the test set is shown in Fig. 2b, and the AUC was 0.9126.
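The reported external-test percentages can be checked for internal consistency with a small search. The class counts of the 269-sample test set are not stated in this section, so the sketch below looks for integer counts (number of positives, TP, TN) that reproduce the reported sensitivity, specificity, and accuracy to two decimals. The function name and the rounding tolerance are assumptions; any matching combination is a back-calculation, not a figure taken from the paper.

```python
def consistent_counts(n, sens, spec, acc, tol=0.005):
    """Find (positives, TP, TN) whose percentages match the reported values.

    n    : total number of test samples
    sens : reported sensitivity in percent
    spec : reported specificity in percent
    acc  : reported accuracy in percent
    tol  : half-width of the two-decimal rounding interval, in percent
    """
    eps = 1e-9  # guard against floating-point noise at the interval edge
    matches = []
    for positives in range(1, n):
        negatives = n - positives
        for tp in range(positives + 1):
            if abs(100 * tp / positives - sens) > tol + eps:
                continue
            for tn in range(negatives + 1):
                if abs(100 * tn / negatives - spec) > tol + eps:
                    continue
                if abs(100 * (tp + tn) / n - acc) <= tol + eps:
                    matches.append((positives, tp, tn))
    return matches


# Reported external-test metrics for model-1 and model-2 (269 samples).
model_1 = consistent_counts(269, sens=80.20, spec=91.07, acc=86.99)
model_2 = consistent_counts(269, sens=85.15, spec=81.55, acc=82.90)
print(model_1)
print(model_2)
```

For both models the search finds a combination with 101 positive and 168 negative test samples (81 and 86 true positives, 153 and 137 true negatives, respectively), so the three reported percentages for each model are mutually consistent with a single test set.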

B. User-Oriented Online Diagnostic Tool
We have designed an online identification website for patients and medical workers at http://www.cppdd.cn/GC2. The user can input the corresponding blood test data in the text box according to the prompt and obtain the corresponding identification results. The homepage of this website is shown in Fig. 4. Users should use the same equipment as the blood testing equipment used in this study. The system deviation caused by using other blood testing equipment might affect the accuracy of this identification.

A. Advantages of This Study
A comprehensive comparison of previous works and our method is given in Table IV. In general, compared with previous methods, the identification method for GC presented in this study is suitable for large-scale screening of GC, especially in regions that lack medical resources, with the advantages listed below. First, the blood indices on which this method was built were all collected from a clinically available blood biochemical detector. These data are inexpensive and easy to obtain. Traditional methods for the detection of gastric cancer often rely on the detection of characteristics in gastric tissue. For example, Lu et al. detected the content of hsa_circ_0005758 (circRNA) in gastric tissue, and the sensitivity and specificity were 75.0% and 67.7%, respectively [30]. Pang et al. detected the content of LINC00152 (lncRNA) in gastric tissue; the sensitivity was 62.5%, and the specificity was 68.1% [31]. In addition, spectral-based methods on tissue images have been developed and have achieved outstanding results. For example, Li et al. analyzed the spectral characteristics of gastric mucosa tissue to diagnose GC through a deep learning method and achieved good results, with a specificity of 96.7% and a sensitivity of 96.6% [9]. The above studies using gastric tissue require experienced gastroscopic doctors to accurately extract gastric mucosa samples, which is not suitable for screening GC on a large scale. The sensitivity of the traditional GC biomarker pepsinogen was 69% [32]. Lin et al. used urinary surface-enhanced Raman spectroscopy (SERS) based on gold nanoparticles to distinguish GC from healthy samples, with a sensitivity of 90% and a specificity of 93.8% [33]. However, Lin's method was less effective at distinguishing between GC and breast cancer, with a specificity of 81.4% and a lower sensitivity of 62.0%. Cui et al. measured the content of mir-106a in gastric juice samples, and the sensitivity and specificity were 73.8% and 89.3%, respectively [20].
The sampling of gastric juice is more complicated than that of blood. Research on blood markers is extensive. Hu et al. conducted a meta-analysis of DNA methylation in serum and plasma, and the results showed high specificity and low sensitivity [19]. Li et al. and Zhao et al. analyzed the content of circular RNA in plasma, and the results were not ideal [15], [18]. The use of plasma markers, such as protein P08493, MYC (cfDNA), and miR-20a (miRNA), did not achieve good results [5], [34], [35].
Combining multiple indices can achieve better results in the diagnosis of GC. Similar to this work, Zhu et al. used the gradient boosting decision tree (GBDT) method to analyze clinical blood indices, such as hemoglobin, and biomarkers, such as carcinoembryonic antigen (CEA), to establish a gastric cancer identification model and achieved good results [36]. Su et al. used machine learning to analyze the mass spectra of various proteins in serum and obtained good results [37]. However, mass spectrometers are costly and place high demands on operators, so this method is not suitable for large-scale screening.
[Figure legend fragment: p-value 0.05-0.01, one star; p-value 0.01-0.001, two stars; p-value <0.001, three stars.]
Second, this method was developed on a large and comprehensive dataset, including not only gastric cancer samples and healthy samples but also samples of other major cancers and of other gastric diseases with symptoms similar to those of gastric cancer. These are the samples most likely to be confused with GC in mass screening. Most previous work was based on small sample sizes [9], [33], [37]. Only a few studies, such as those by Hu et al. [19] and Huang et al. [32], used a large dataset with thousands of samples, as in this work.
Third, this method shows better identification performance than previous works, especially on the test set, which suggests better generalization ability and stronger robustness. It also achieves a better AUC value than our previous work [29].

B. Analysis of Key Features
Cancer is associated with dysregulation of multiple biological processes, and the analysis of key features facilitates the discovery and validation of biomarkers that contribute to understanding cancer pathogenesis and developing drugs [38]. To gain insight into key features in different populations, all samples were categorized into four population groups: the GC group, normal group, other cancer group, and other gastric disease group. We selected features with significant differences and drew boxplots comparing the groups, as shown in Fig. 5. In Fig. 5a, the levels of PCT in the normal group were higher than those in the GC group. Current studies have shown that tumor cells can indirectly promote platelet production and activation. Activated platelets can also protect and even promote the growth and metastasis of tumor cells [39], [40]. The lower PCT observed in the GC group may arise because the hematopoietic function of bone marrow in some patients with gastric cancer is suppressed by chemotherapy drugs or radiotherapy, resulting in thrombocytopenia [41], [42].
As shown in Fig. 5b, the levels of potassium in the normal group were higher than those in other groups. Cancer and gastric diseases may lead to disordered homeostasis and potassium ion balance. Wu et al. showed that the potassium content in GC tissue was higher than that in normal tissue [43]. Ding et al. proved that the Eag1 potassium channel was overexpressed in GC tissue [44]. In this experiment, the low levels of potassium in the blood of GC patients may be related to this, which requires further study.
In Fig. 5c, the basophil count of the normal group is shown to be higher and more concentrated than that of the other groups, which is also confirmed by the higher basophil ratio in Fig. 5d. Tumor-infiltrating basophils are considered to be a poor prognostic factor for GC, but there is no significant correlation between blood basophils and tumor-infiltrating basophils [45]. The higher percentage of basophils in the normal group may be due to the increase in white blood cells in the other groups.
As shown in Fig. 5e, the calcium content in the normal group was higher than that in the GC group. Studies have shown that calcium may inhibit the damage of salt to gastric mucosa, thereby reducing the risk of GC [46]. However, the work of Xie et al. indicated that serum calcium concentration was positively correlated with the expression of calcium-sensing receptor (CaSR). Calcium can activate the overexpression of CaSR to promote the proliferation of GC cells [47]. These contradictory conclusions need further study.
Monocytes (Mo) are believed to promote tumor growth and proliferation [48]. High levels of monocytes are thought to confer poor prognosis in GC patients [49]. Figure 5f shows that the Mo% of the normal group is significantly higher than that of the other groups, but there is no significant difference in the Mo content among the groups. The results show that the count of white blood cells in all groups except the normal group increased significantly.
A low concentration of creatine kinase (CK) in GC tissue has been demonstrated [50]. This is similar to the results shown in Fig. 5g. CK is very important for ATP homeostasis in cells. Abnormal CK levels may lead to apoptosis [51].
The increase in eosinophils in the tissue can improve the prognosis of patients with GC. It has been proven that eosinophils have a good anticancer effect on colorectal cancer [52]. In Fig. 5h, the EO% of the normal group is significantly lower than that of the GC group. Eosinophils in cancer patients may increase due to autoimmune responses.
The red blood cell distribution width-SD (RDW-SD) in the normal group was lower than that in the other groups, as shown in Fig. 5i. Wei et al. showed that the red blood cell distribution width of GC patients is significantly higher than that of the normal group and further increases with the development of tumors. This phenomenon may be related to the decrease in hemoglobin caused by malnutrition in GC patients and may be related to inflammation [53]. Chang et al. showed that the ratio of CK-MB to total CK in hematological malignancies was higher in patients with colorectal cancer, lung cancer, and hepatocellular carcinoma [54]. This conclusion is roughly the same as the distribution of CK-MB and CK in this study and provides a direction for distinguishing patients with gastric diseases from patients with gastric cancer.
It should be noted that some features, such as Mg and DBil in Table III, did not differ significantly (high p-value) between the gastric cancer sample group and the whole negative sample group, but they did differ significantly (p-value <0.05) from at least one of the three negative sub-groups (normal group, other cancer group, and other gastric disease group). This is precisely why deep learning is utilized here. Deep learning methods, with their excellent self-learning ability, can handle multidimensional nonlinear statistical relationships better than traditional statistical methods [24]. Therefore, these features were also included as key features in the final model, since their relationship with gastric cancer is multidimensional and nonlinear rather than linear.

VII. CONCLUSION
We utilized an artificial neural network under the H2O framework to process routine blood data and blood biochemical data, extracted 33 important indices as biomarkers for the identification of GC, and established an identification model and a user-friendly web testing terminal. These biomarkers are considered to have great value in the physiological research of GC. Compared with conventional diagnostic methods, the diagnostic model established in this study has the advantages of higher accuracy, noninvasiveness, and inexpensive detection. It has the potential to screen a wide range of people and to effectively reduce the pressure on the medical system. With the further collection and learning of blood test data, the performance of this diagnostic model can be further improved.

APPENDIX
The original data of the samples used for modeling and validation are provided in supplementary table S1. The detailed statistical method and results for the differences in the selected features between the gastric cancer group and the other sub-groups are provided in supplementary table S2. The box plots for all 33 selected features are provided in supplementary file S3.