Systems Medicine Design based on Systems Biology Approaches and Deep Neural Network for Gastric Cancer

Gastric cancer (GC) is the third leading cause of cancer death in the world. It is associated with the stimulation of microenvironment, aberrant epigenetic modification, and chronic inflammation. However, few researches discuss the GC molecular progression mechanisms from the perspective of the system level. In this study, we proposed a systems medicine design procedure to identify essential biomarkers and find corresponding drugs for GC. At first, we did big database mining to construct candidate protein-protein interaction network (PPIN) and candidate gene regulation network (GRN). Second, by leveraging the next-generation sequencing (NGS) data, we performed system modeling and applied system identification and model selection to obtain real genome-wide genetic and epigenetic networks (GWGENs). To make the real GWGENs easy to analyze, the principal network projection method was used to extract the core signaling pathways denoted by KEGG pathways. Subsequently, based on the identified biomarkers, we trained a deep neural network of drug-target interaction (DeepDTI) with supervised learning and filtered our candidate drugs considering drug regulation ability and drug sensitivity. With the proposed systematic strategy, we not only shed the light on the progression of GC but also suggested potential multiple-molecule drugs efficiently.


INTRODUCTION
G ASTRIC cancer (GC), an important cancer worldwide, is responsible for over 1,000,000 new cases in 2018 and an estimated 783,000 deaths, making it the fifth most frequently diagnosed cancer and the third leading cause of cancer death [1]. The current optimal GC therapy is surgical resection with intent and adjuvant chemotherapy or radiotherapy [2]. Although there are many molecularly targeted drugs entering clinical trials for GC, none of them is approved by U.S. Food and Drug Administration (FDA) except for Trastuzumab and Ramucirumab targeting HER2 and VEGFR, respectively [3]. With the lower response rate to both of them and the emergence of secondary-resistance to Trastuzumab, there is an urgent need to find new therapies for GC patients [4].
MicroRNAs (miRNAs) are small non-coding RNAs (usually 20-22 nucleotides long) which play an essential role in all biological pathways in multicellular organisms including mammals [5]. It is known that miRNAs participate in the progression of cancers by modulating cell development, differentiation, proliferation, and apoptosis. To the target of miRNA, the alteration in miRNA expression would lead to change its function as a tumor suppressor or promoter of tumorigenesis [6]. Moreover, the other types of non-coding RNAs that exceed 200 nucleotides in length are long noncoding RNAs (lncRNAs). They mediate epigenetic regulation on cancer-related genes, such as DNA methylation, DNA hydroxymethylation, and post-translational modifications, and the regulation of miRNAs [7]. The dysregulation of oncogenes and tumor suppressor genes due to multiple genetic and epigenetic alterations caused by miRNAs and lncRNAs is considered a driving force of tumorigenesis, including GC. Recent evidence shows that the microenvironments of GC are associated with lymphatic invasion, vascular invasion, lymph node metastasis and the survival of GC patients [8]. Therefore, both microenvironments of GC and the molecular mechanisms triggered by the corresponding cascade signaling pathways are important for improving the management of GC and discovering potential therapeutic targets.
On average, for traditional drug discovery, the pipelines take 12 to 16 years from inception to market and cost one to two billion dollars [9]. However, drug repurposing, which is a strategy for identifying new uses for approved or investigational drugs differing from their original objective and purpose, could be done in less than half a time at a quarter of the cost [10]. The general approaches used in drug repurposing are novel data sources, retrospective clinical analysis, pathway mapping, genetics association, molecular docking and, signature matching [11]. For signature matching, the connectivity map (CMap), which was established by the Broad Institute, consists of gene expression profiles generated by dosing of more than 1300 compounds in numbers of cultivated cell lines. It has widely been applied in several human disease investigations, including colon cancer, prostate cancer, breast cancer, diet-induced obesity, and Alzheimer's disease [12], [13], [14], [15].
In this study, we propose a systems medicine design procedure using systems biology approaches for finding essential biomarkers and recommending potential candidate drugs by the application of DeepDTI and two filters considering drug regulation ability and drug sensitivity. For systems biology approaches, first, candidate PPI and GRN networks are constructed by big databases mining, respectively. Integrating interaction and regulation information from multiple databases, a candidate GWGEN is built and used for system modeling of NGS data. Second, with the help of gene expression profiles for each stage of GC, we conduct systems identification by solving the constrained linear least-squares problem and the system order detection scheme by Akaike information criterion (AIC) to obtain real GWGENs. Third, the principal network projection approach is used to extract the core GWGENs from the real GWGENs. According to the projection values, the core signaling pathways are obtained in respect of the annotation of KEGG pathways. Based on the analyses of the core signaling pathways, the progression molecular mechanisms for each stage of GC could be found as well as essential biomarkers. In order to recommend candidate drugs for the selected biomarkers, we train a DeepDTI in advance. The drug-target pairs are transferred to feature vectors by PyBioMed under python 2.7 environment [16].
The trained DeepDTI, which is applied to predict candidate drugs based on our selected biomarkers, is with testing AUC of 0.982. Moreover, considering drug regulation ability and drug sensitivity, two filters are designed to propose multiple-molecule drugs for preventing the progression from early stage to middle stage and middle stage to the late stage of GC. Nowadays, few studies propose systematic design procedure from identifying potential biomarkers for one disease to recommending their potential candidate drugs from the viewpoint of systems engineering. Systems medicine design procedure, an interdisciplinary approach to understand disease progression mechanisms by interpreting heterogeneous data, gives us an alternative way for drug discovery to overcome GC.

RELATED WORK
Over the past decade, many researchers have been dedicated to analyzing protein-protein interactions (PPIs) dataset combining with functional information by different algorithms. It not only improves the accuracy of protein complex detection but also broadens our understanding of biological processes mediated through protein interactions. Based on topological features of the PPI network and Gene Ontology (GO) annotations, cancer genes were identified and validated by well-known classifiers, support vector machines (SVMs) [17]. Carried out by walking on the fingerprints similarity network randomly, a multi-level PPINs reconstruction method (MLPR) was proposed [18]. Clustering-based on maximal cliques (CMC) algorithm with iterative scoring method has shown to be more robust toward random noise and achieve better performance of protein complex prediction. By applying the shortest path algorithm in PPI networks, candidate genes which are associated with the formation and development of gastric cancer could be found [19]. The surge of public next-generation sequencing (NGS) data availability has facilitated the application of systems biology by integrating heterogeneous data to investigate pathogenetic mechanisms of various human diseases with computational modeling approaches. Based on constructing the pathway logic symbolic systems model via genomic, transcriptional and proteomic profiles, the subnetworks within EgfR-MAPK pathway are identified [20]. Bayesian networks, which could describe direct molecular interaction as well as indirect influences that proceed through additional unobserved components, have demonstrated to be useful for understanding the operation of cell signaling networks [21]. Moreover, one study has proposed a mathematical model of the plant clock using biological hypotheses and parameters solved by optimization methods to gain insights into the clock components within core mechanisms [22]. Systems biology approaches have been used to identify essential biomarkers and propose potential drugs based on molecular mechanisms and pathogenic mechanisms in papillary thyroid cancer and two strains of Candida albicans infection [23], [24]. Systems biology approaches with modeling techniques enable us to elucidate the pathways, which are critically involved in tumor formation and progression, consequences of altered cell behavior in tissue environment and effects of molecular therapeutics [25].
Drug-target interaction (DTI) could be divided into three main categories, which are ligand-based, docking simulation, and chemogenomic approaches. Ligand-based methods compare candidate ligand to the known ligands of a target protein to predict its binding via machine learning methods. However, insufficient knowledge between ligands and the given target would lead to poor prediction performance [26]. Docking simulation has been used to predict DTI successfully [27] whereas it heavily relies on the available three-dimensional native structure of the protein target. It also should be noted that three-dimensional structures obtained from methods of nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography are quite costly and time-consuming. Meanwhile, docking simulation cannot be applied to ion channel proteins and G-protein coupled receptors (GPCRs) whose structures are too complex to get [28]. Chemogenomic methods, namely featurebased methods, combine drugs and targets into sets of descriptors (i.e. feature vectors). For instance, He et al. have turned drugs into a vector-based on 28 common chemical functional groups, encoded target proteins into another vector with the pseudo amino acid composition by incorporating biochemical and physicochemical features, and applied the nearest neighbor algorithm to predict DTI [29]. Moreover, Li et al. have represented drugs by their substructure fingerprints capturing the existence of certain functional groups or fragments and transformed target protein sequence data to make drug-target interaction classifier by discriminative vector machine [30].
Inspired by the previous works, we proposed a systems medicine design procedure including reverse engineering, system order detection, and principal network projection approaches for identifying essential biomarkers as shown in Fig. 1. Moreover, in order to explore the drug-target interaction among these identified biomarkers, a DeepDTI is trained in advance. Considering drug regulation ability and drug sensitivity, we design two filters for candidate small molecule compounds. Consequently, we suggest two multiple-molecule drugs to prevent the progression from early to middle and middle to late stage of GC.

IDENTIFYING BIOMARKERS BASED ON MOLECULAR PROGRESSION MECHANISMS
To identify potential biomarkers as drug targets for the progression stages of GC (early stage to middle stage, middle stage to late stage) based on molecular progression mechanisms, we apply big data mining, model construction, network identification, and principal network projection methods. The related flowchart is shown in Fig. S1, available online. The procedure could be divided into four steps: (1) Big database mining. (2) Construction of the candidate GWGEN consisting of PPIN and GRN. (3) Identifying the real GWGEN in each stage (early stage, middle stage, and late stage) with the help of NGS gastric cancer data. (4) Extraction of the core GWGENs from the real GWGENs via the principal network projection (PNP) approach. After getting the core GWGENs, we can find core signaling pathways in the annotation of KEGG pathways and analyze molecular progression mechanisms of the GC to obtain potential biomarkers.

Datasets
In order to construct the candidate GWGEN, we refer to numbers of databases. For the candidate protein-protein interaction, we use the following databases: DIP [31], BIND [32], Biological General Repository for Interaction Datasets (BIOGRID) [33], IntAct [34], and MINT [35]. Besides, for the transcriptional regulations between transcription factors (TFs) and their target genes, we consider the following databases: Integrated Transcription Factor Platform (ITFP) [36], the Human Transcriptional Regulation Interactions database (HTRIdb) [37], and the candidate TRANScription FACtor database (TRANSFAC) [38]. Moreover, for the posttranscriptional regulation between miRNA, lncRNA and their target genes, corresponding databases we look up are shown below: TargetScanHuman [39], CircuitDB [40], and StarBase2.0 [41]. Based on the information above, we successfully construct the candidate GWGEN comprising candidate PPIN and candidate GRN. The genome-wide NGS data are downloaded from the Cancer Browser website (https://genome-cancer.ucsc.edu/). The data which we used in our analyses are gene, miRNA, lncRNA expression profiles. They were both measured experimentally using the Illumina HiSeq 2000 RNA Sequencing platform by the British Columbia Cancer Agency TCGA genome characterization center. According to the seventh edition of the "TNM classification of malignant tumors" published by the UICC, the GC could be classified into stage IA/ IB, stage IIA/IIB, stage IIIA/IIIB/IIIC and stage IV. In this study, we take stage IA/IB as early stage; stage IIA/IIB as middle stage; stage IIIA/IIIB/IIIC/IV as late stage. In order to avoid overfitting problem caused by few samples in stage IV, we put stage III and IV together to be our late stage. After preserving common samples in gene, miRNA, and lncRNA expression profiles, there are 46, 106, 149 samples in early stage, middle stage and advanced stage, respectively.
To compare selected biomarker gene expression with normal stomach tissue, we used GSE20795, which collected from 142 solid tissue samples representing 20 organs. They were taken from post-mortal human healthy donors of different ages killed in road accidents no later than 36 hours after death [42]. For drug regulation ability, we explored gastric cancer related cell line in the library of integrated network-based cellular signature (LINCS) L1000 (level 5) dataset [43]. It has over a million gene expression profiles Candidate genome-wide genetic and epigenetic network (GWGEN) consisting of candidate protein-protein interaction network (PPIN), gene regulatory network (GRN), lncRNA regulatory network (LRN), and miRNA regulatory network (MRN) was constructed by big database mining. Afterward, with the help of NGS gastric cancer data, false positives of the candidate GWGEN were pruned to build up real GWGENs by system identification and system order detection methods. By the principal network projection (PNP) approach, we could extract core GWGENs from the real GWGENs. Based on the projection values, the differential core pathways from early to middle and middle to late stages could be obtained by the annotation of KEGG pathways. Subsequently, we identify essential biomarkers based on the molecular progression mechanisms and compared their signatures with the normal stomach tissues. Consequently, the potential candidate small-molecule compounds are predicted by DeepDTI.
treated with 19811 small molecule compounds across 75 human cell lines. Considering drug sensitivity, profiling relative inhibition simultaneously in mixtures (PRISM) repurposing 19Q3 primary screen, developed by the Broad Institute of MIT and Harvard, was downloaded from the DepMap portal [44].

Systematic Interaction Models for the Candidate GWGEN
We integrate interaction and regulation information from multiple databases to build the candidate GWGEN represented by the Boolean matrix (i.e. 0 or 1 if the interaction is nonexistent or existent). Here, to get real GWGENs for each stage of GC, we build systematic models for proteins, genes, miRNAs, and lncRNAs. For the candidate PPIN, which is a part of the candidate GWGEN, the systematic interaction model of the kth protein is given as follows: where y k ½n and y g ½n denote the expression level of the kth protein and the gth protein for the nth sample, respectively; a kg indicates the interaction ability between the kth protein and the gth protein; b k;PPIN is the basal level of protein k;v k ½n represents the stochastic noise of the kth protein caused by the model uncertainty and measurement noise for sample n; K is the number of proteins; and N is the number of samples. The systematic models of GRN, lncRNAs regulation network (LRN), and miRNA regulation network (MRN) could be found in the Supplementary Materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/ 10.1109/TCBB.2021.3095369.

Parameter Estimation for the Systematic Interaction Model via System Identification and System Order Detection Methods
In order to identify the parameters a kg and b k;PPIN for the candidate PPI; s ik ; t is ; d it , and b i;GRN for the candidate GRN; g qk ; u qs ; z qt , and b q;LRN for the LRN; pk ; m ps ; c pt , and b p;MRN for the MRN in the candidate GWGEN, we apply system identification method and system order detection scheme on our systematic models in Equations (1), (S1), (S2) and (S3), available online. We rewrite PPIN interactive Equation (1) in the following regression form: . . .
where f k ½n denotes the regression vector which can be obtained from the NGS data and u k represents the unknown parameter vector to be estimated for the kth protein in PPIN. The Equation (2) of the kth protein can be augmented for N samples of the NGS data as below: which could be simply represented as: Therefore, the parameters in the vector u k can be estimated by solving the following constrained linear leastsquares problem, u k is the estimated interaction parameters for the kth protein. By the similar parameter identification method in Equations (S4)-(S15) in Supplementary Materials, available online, we could also estimate the interaction parameters of GRN, LRN, and MRN in the candidate GWGEN. Due to various experimental conditions might lead to errors within data coming from different databases, the candidate GWGEN constructed by big databases mining would exist false-positives of interactions. We employed the system order detection scheme on PPIN model in (4) as well as the GRN, LRN, and the MRN models, which could be found in Supplementary Materials, available online, to prune these false-positives. According to the system order detection scheme by Akaike information criterion (AIC), the insignificant interactions and regulations in candidate GWGEN, which were out of the system order, were removed. Finally, we obtained the real GWGENs in early, middle, and late stage of GC cells.
For the PPIN model in (4), AIC for detecting the number of interactions of the kth protein could be defined in the following equation: whereû k indicates the estimated interaction parameters vector of the kth protein by solving the least square problem in (6); 1 N ðY k À F k Áû k Þ T ðY k À F k Áû k Þ is the estimated residual error. According to the AIC theory, the real system order To each protein, we used forward and backward search to find the real system order by AIC. The insignificant protein interactions which are out of the real system order would be pruned away to get the real PPIN consequently. By doing so, we could obtain the real GWGENs of three stages of GC in Figs. S1-S3, available online, by pruning false-positives in candidate GWGEN with the help of system order detection scheme in (7). The total number of nodes containing receptors, proteins, lncRNAs, TFs and miRNAs and edges of their interaction in candidate GWGEN and real GWGENs are shown in Table S1, available online. It is noted that the nodes and edges decrease a lot compared to those in the candidate GWGEN. This phenomenon indicated that the false-positives were removed by system order detection scheme. Moreover, the real system order detection of GRN, LRN, and MRN are demonstrated in the Supplementary Materials, available online.

Extracting Core GWGENs From the real GWGENs by Using the Principal Network Projection Method
To the real GWGENs in three stages of GC as shown in Figs. S1-S3, available online, they are still too complicated to further investigate molecular progression mechanisms directly. Therefore, we propose the principal network projection (PNP) method to obtain the core GWGENs. In order to apply PNP method to extract the core GWGEN from the real GWGEN for each stage of GC, we have to construct a combined network matrix H that contains all estimated parameters in the real GWGEN as follows: whereâ kg could be obtained inû k by solving the Equation (6) and pruning false-positives by AIC method in (7). With the same concept, the rest of parameters could be computed and found in the Supplementary Materials, available online. For the combined network matrix H, the first K rows denote the PPI network; the following I rows denote the gene regulation network; the following Q rows denote the lncRNA regulation network and the last P rows denote the miRNA regulation network. Furthermore, in the PPI network, proteins do not have interactions between lncRNAs and miRNAs so that we set zeros in the combined matrix H. The PNP method is based on the singular value decomposition of H in (8) as bellows: . The fraction of eigenexpression fE m g is calculated from the eigenexpression levels fd m g which are listed in the diagonal of D. The eigenexpression fraction E m is defined as below: In terms of energy, we chose minimum Z singular vectors of Uand V with energy (i.e. P Z m¼1 E m ) satisfying P Z m¼1 E m ! 0:85. The top Z principal singular vectors construct 85 percent principal network structure of the real GWGEN. Subsequently, we define the projection of H to the top Z right-singular vectors of V for extracting important downstream nodes in the following: where h :;c is the c-th column vector of H; v T n;: is the n-th row vector of V T .
Afterward, we defined the 2-norm projection value of each downstream node in the real GWGEN to the top Z right-singular vectors which also stands for 85 percent principal network structure as follows: where DðcÞ is the 2-norm projection value of each downstream c-th node including TFs, genes, lncRNAs, miRNAs and proteins in the real GWGEN on the top Z right-singular vectors. If the projection value DðcÞ approximates zero, it means that the corresponding c-th node is almost independent to the principal network structure. In other words, the larger the projection value of a node in real GWGEN, the higher possibility the node will be an essential component of the principal network structure. Similarly, we could use the same method to calculate the upstream nodes' projection values by taking each row vector of combined network matrix H to project on the top Z left-singular vectors. In conclusion, the core GWGENs shown as Figs S4-S6, available online, could be extracted from the real GWGENs based on the top-ranked 3000 projection values of the downstream and upstream nodes, which reflected 85 percent of the real GWGENs in GC cells.

Investigating Molecular Progression Mechanisms by Differential Core Signaling Pathways From Early to Middle Stage of Gastric Cancer Cells
Based on each element's projection value in the core GWGEN, we could obtain the differential core signaling pathways from early-stage to middle-stage GC cells shown in Fig. 2. For the first core signaling pathway in the earlystage GC cells, the receptor, ITPR1, a ligand-gated ion channel that mediates calcium release from the endoplasmic reticulum [45]. Upon the stimulation by ligand inositol 1,4,5-trisphosphate, the mutated receptor ITPR1 would trigger downstream signals through BCL2, BECN1, and SNAI1 to activate TF GATA2, which was affected by methylation [46]. TF GATA2 could positively regulate target gene PRKAA2 to induce autophagy and inhibit apoptosis. Moreover, the BECN1 might be modified by phosphorylation [47]. In the second core signaling pathway, the ligand IL6 bound to the receptor IL6R and activated TF ZEB1 through signaling transduction proteins TRAF6, WWC3, and MACF1. The TF ZEB1 would positively regulate its target genes ATG5 and WNT5A, bringing about autophagy and inflammatory, respectively. Moreover, the gene ATG5 with mutation might contribute to cancer development by deregulating the autophagy process [48]. Accompanying with the post translational effect from the miR200, TF ZEB1 might alleviate cancer progression in primary tumors [49]. In the last core signaling pathway in the early-stage GC cells, the ligand calcium ion interacting with receptor PLSCR1 transmitted signaling to TF STAT1 through a cascade of proteins MYL6, RAB37, and MAP2K1, which was influence by mutation and phosphorylation [50], [51], [52]. Subsequently, the TF STAT1 would positively regulate its target genes TXNIP, which led to apoptosis, and IRF1, which promoted cell cycle and inflammatory response and inhibited cell proliferation. It is noted that the interaction between TF STAT1 and miR155 is associated with responding to inflammatory signals or infection. Furthermore, compared with middle-stage GC cells, we found that miR155 had a lower regulation ability on STAT1. We inferred the stronger regulation ability of miR155 would cause STAT1-deficient and make gastric cancer cells susceptible to viral and bacterial infection [53]. For the core signaling pathways of middle stage GC cells shown in Fig. 2, the ligand CCL25 combined with receptor CCR9 would modulate TF ELK1 through the mediation of signaling transduction proteins PCM1, MOS, KRTAP4-2, OSR1, and MARS. The TF ELK1 would posit-tively regulate its target genes PDCD10 resulting in the promotion of angiogenesis and the inhibition of cell proliferation, and NAP1L1 leading to cell proliferation as well. It is known that the OSR1 is a tumor suppressor [54]. Modified by methylation, it would lose its original functions, including inhibition of cell growth, arrested cell cycle, and induced apoptosis against GC [55]. In the next core signaling pathway in the middle-stage GC cells, the ligand M6P bound to receptor M6PR and transmitted the signal to TF FOXP3, which was influenced by methylation and acetylation [56], [57]. Regulated by the TF FOXP3, the first target gene LST1 would have an inflammatory response; the second target gene MTDH would result in the promotion of autophagy and angiogenesis and the inhibition of apoptosis; the third target gene ATG16L1 would lead to autophagy. The regulation of miR384 on MTDH suppressed the growth, migration, and invasion of GC cells [58]. However, the identified lncRNA CRNDE negatively regulated miR384, which might promote cell proliferation, migration, and invasion in GC cells [59]. According to the differential core signaling pathways analysis results and considering the overlap nodes with the LINCS dataset, we identified PLSCR1, BCL2, BECN1, and STAT1 to be the essential biomarkers for preventing the progression from early stage to middle stage of GC. Comparing these identified biomarkers' expressions with the normal stomach tissues from GSE120795 [42], we find that they have higher expressions than normal stomach tissues, which are marked in red in Fig. 2.

Investigating Molecular Progression
Mechanisms by Differential Core Signaling Pathways From Middle to Late Stage of Gastric Cancer Cells The differential core signaling pathways from mid-stage to late-stage GC are shown in Fig. 3. For the first core signaling pathway, the ligand estrogen binds to receptor ESR1 with mutation triggering downstream cascade transmitting signaling to TF PRDM14 through signaling transduction proteins TUSC3 and CDH1. The aforementioned signaling also transmitted to TF JUN through RAB12. Among them, we found that both TUSC3 and CDH1, which are tumor suppressors, were affected by mutation and methylation [60], [61]. The TF PRDM14 would regulate gene TXNRD1 resulting in cell proliferation. Owing to the regulation of miR125b2 on TXNRD1, cell proliferation might slow down. However, we inferred that TXNRD1 with overexpression would abolish the inhibitory effect of miR125b2, 'which is the same as the phenomenon found in hepatocellular carcinoma, and promote the progression from mid-stage to late-stage GC [62]. Moreover, the TF JUN would regulate gene BFAR to inhibit apoptosis. For the second pathway in the Fig. 3, receptor LGR5 interacting with R-spondin ligand could trigger the canonical pathway and transmit signaling to TF NFKB1 through LCK, ZNF26, AKT1, KLF5, and VCAM1. It is noted that the AKT1 modulated by phosphorylation would contribute to gastric cancer progression [63]. After receiving the signaling, TF NFKB1 would regulate EGFR to inhibit apoptosis and induce cell migration. The low expression compared with normal cells might be influenced by methylation and the regulation of lncRNA LINC00596. For the third core signaling pathway, the ligand androgen combining with receptor AR transmitted the signaling to TF YBX1 through ZEB2. The TF YBX1 would regulate TGFB3 modulated by methylation and lead to epithelial-to-mesenchymal (EMT) [64]. Along with the enhanced expression of receptor EGFR, the ligand TGFA bound to EGFR initiating the signal transduction cascade through NFKB1, MT1F, MAP3K7, BCL11A to TF EGR1. One of signaling transduction proteins, MT1F, was modulated by methylation, resulting in gastric carcinogenesis [65]. After receiving the signaling, TF EGR1 regulated EAPP to promote cell proliferation in the middle stage GC.
For the core signaling pathways of late-stage GC, the hypermethylation and mutation maintained the receptor EGFR with overexpression in gastric cancer cells. Therefore, the ligand TGFA bound to the receptor EGFR initiating signal transduction cascade through FOXJ1 and GDF5. The activated TF ETS1 would regulate SOX9 mediating EMT and cell migration. For the second core signaling pathway of late-stage GC, the receptor ERBB4, which was activated by the ligand NRG4, transmitted the signaling to GRB2 and triggered three core signaling pathways. The first one went through LCP1, MYH9, and CASP14 to activate the TF YBX1. It would regulate FGF10 inhibiting cell proliferation and ARF4 promoting cell migration. The second one passed the signaling to the TF PDX1 directly. After receiving signaling, the TF PDX1 would regulate ABL1 resulting in cell migration and ANGPTL2 leading to cell migration and cell invasion. The third one delivered signaling through HSFX2 and RTN4 to TF GATA1 modified by acetylation [66]. The activated TF GATA1 would regulate MIEN1 to inhibit apoptosis and induce cell migration. It is noted that MIEN1 modulated by demethylation was associated with the inhibition of cell proliferation and support of cell migration [67]. According to the differential core signaling pathways analysis results and considering the overlap nodes with the LINCS dataset, we identified PCM1, AKT1, JUN, EAPP, and EGR1 to be the essential biomarkers for preventing the progression from middle stage to late stage of GC. Comparing these identified biomarkers' expressions with the normal stomach tissues from GSE120795 [42], we find that JUN, EGR1, and EAPP owing higher expressions than normal stomach tissues and PCM1 and AKT1 owing lower expressions than normal stomach tissues, which are marked in red and blue in Figs. 1 and 2, respectively.

Deep Neural Network of Drug-Target Interaction Prediction
In order to explore the drug-target interaction toward our identified biomarkers, as shown in Table S2-S3, available online, we trained DeepDTI in advance. The flowchart of designing DeepDTI for drug discovery is in Fig. S7, available online. The drug-target interaction data used in this study were from BindingDB [68]. In our collected dataset, there are 38015 drugs and 7292 targets. Among them, the number of known and unknown interactions is 80291 and 19966109, respectively. The descriptors of drugs and targets were transformed by PyBioMed [16]. The PyMolecule module in PyBioMed, which we used to transform the drug descriptors, is responsible for calculating the commonly used structural and physicochemical descriptors. The drug features include constitutional and geometrical descriptors among other molecular properties. The PyProtein module in PyBioMed, which we applied to transforming target descriptors, is responsible for calculating the widely used structural and physicochemical properties of proteins and peptides from amino acid sequences. The descriptor transformation is under python 2.7 environment. After the transformation, the properties of drugs and targets could be described by a feature vector, respectively. The total number of drug features is 363 and the total number of target feature is 996. Every drug-target pair is represented by concatenating drug feature vector and target feature vector. A drug-target pair is described by the feature vector in the following:\eqalign{ where n drugÀt arg et denotes a feature vector of drug-target pair; D represents the feature vector of the corresponding drug; d i denotes the i-th drug feature; T indicates the feature vector of the corresponding target; t j is the j-th target feature; I is the total number of drug features; J is the total number of target features. Moreover, since features in the drug-target pair are measured in different scales, they do not contribute equally to the model fitting and might make model suffer from poor performance. To deal with this potential problem, we did data scaling by standardization. Consequently, we applied the principal component analysis (PCA) to remove noisy features and reduce memory consumption, which makes feature dimension in (13) decreased from 1359 to 1000 features [69]. For DeepDTI, it has four hidden layers with the rectified linear unit (ReLU) activation function. We used sigmoid function as an activation function for the output layer, which could make the output in the range of 0 to 1. Since drug-target interaction is a binary classification problem, we choose binary cross-entropy to be the loss function as below: ð y n Ã log ðŷ n Þ þ ð1 À y n Þ Ã log ð1 Àŷ n Þ Þ whereŷ is the predicted probability distribution and y is the actual probability distribution. N is the number of samples. y n andŷ n are the n-th label and the n-th predicted probability, respectively. Moreover, the loss function in (15) is the summation of binary cross-entropy for each sample.
LðuÞ ¼ Here, we want to obtain the optimal network parameter set u Ã to achieve the minimization of the objective function as follows: which could be achieved by the backpropagation algorithm.
In the deep neural network, the update of network parameter set, including weight ðw e Þ and bias ðb e Þfor the i-th layer, is described in the following form: where h is the learning rate, and u iÀ1 and rLðu iÀ1 Þ are given as follows: ; The backpropagation algorithm containing forward pass and backward pass helps us compute rLðu iÀ1 Þ which is a high dimension vector efficiently. Before using the DeepDTI to predict potential candidate drugs, we have to turn our selected essential biomarkers into drug-target pair feature vectors, which were the input data of Deep-DTI. If the predicted result approximates one, it reflects the higher probability that the drug-target pair would have interaction.

The Application of Deep Neural Network of Drug-Target Interaction
As shown in Fig. S7, available online, in order to propose the potential candidate multiple-molecule drugs having a higher probability of drug-target interaction for the identified biomarkers, we trained a DeepDTI in advance. The interaction dataset used for constructing DeepDTI is collected from BindingDB [68]. In total, there are 80291 known drug-target interactions between 38015 drugs and 7292 proteins. It is noted that the number of unknown drug-target interactions is 19966109. In order to address the class imbalance problem, we randomly choose the number of negative instances (unknown drug-target interactions) in the same size of our positive instances (known drug-target interactions).
We trained the entire model using 70 percent of data together with 10 percent being the validation set. The remaining 30 percent of the data were used for testing. Subsequently, we did feature scaling by standardization. Assisted with principal component analysis (PCA) for dimensionality reduction, we obtained 1000 out of 1359 features. For the architecture of DeepDTI, we used Adam as an optimizer (learning rate ¼ 0.003) with binary cross-entropy loss. The input layer had 1000 neurons, followed by four hidden layers with 512, 256, 128, and 64 neurons. The output layer has one neuron. The activation function was set as ReLU except for the output layer using a sigmoid function. Moreover, to mitigate the overfitting problem, we applied dropout 0.5, 0.4, 0.3, and 0.1 to the hidden layers, respectively. In addition, early stopping was used to terminate model training once the model performance stop improving on the validation set. In terms of model selection, we used 5-fold cross-validation to search for the optimal hyperparameters, including the number of node, batch size, and dropout rate based on the validation accuracy. The training and validation accuracy and the training and validation loss of 5-fold cross-validation are shown in Fig. S8, available online. and Fig. S9, available online, respectively. Taking the found hyperparameter settings, we retrained the Deep-DTI. For the retrained DeepDTI, the training accuracy, validation accuracy, and testing accuracy are 95.162, 93.061, and 93.351 percent, respectively. Meanwhile, the loss of training and testing sets and the accuracy of training and testing sets for the retrained DeepDTI are shown in Fig.  S10, available online. Furthermore, the area under receiver operating characteristic curve (AUC) is useful for organizing binary classifiers and visualizing their performance. The training ACU, validation AUC, and testing AUC are 0.983, 0.983, and 0.982, respectively. From the perspective of the DeepDTI application, we used it to predict potential candidate drugs for the identified biomarkers. Based on the DeepDTI prediction results, the candidate drugs with a score approaching one would be selected, which means that they hold a higher probability of interacting with our identified biomarkers.

Design Two Filters for Drug Regulation Ability and Drug Sensitivity
In order to narrow down our candidate drugs predicted by DeepDTI, we designed two filters for considering the drug regulation ability and drug sensitivity as design specifications. For considering drug regulation ability (perturbation signature) on our identified biomarkers, we used the LINCS L1000 level 5 dataset consisting of assay results from 978 genes within 75 cell lines treated with 19811 small molecule compounds. With the help of LINCS L1000, we could know whether a gene was up-regulated or down-regulated after treating it with a small molecule compound. The first filter goal is to select candidate drugs, which could reverse the abnormal gene expression to normal expression, as shown in Tables S2-S3, available online. In the LINCS L1000 dataset, we pulled out the information of AGS, which was a stomach cell line. If one drug is with a negative value, we regard it as a down-regulator to a specific gene; if one drug has a positive value, we regard it as an up-regulator to a particular gene. Afterward, we used the drug sensitivity (PRISM Repurposing Primary Screen) dataset containing 4518 drugs tested across 578 human cancer cell lines to consider drug sensitivity. The second filter aims to find the candidate drugs with negative values implying that they are much more sensitive toward the stomach cell line. Consequently, we suggested sulforaphane, neratinib, pimozide, fostamatinib, and parthenolide as a potential multiple-molecule drug for preventing the progression from early stage to middle stage GC; elesclomol, fostamatinib, parthenolide, rucaparib, bortezomib, and YM-155 as a potential multiple-molecule drug for preventing the progression from middle stage to late-stage GC. The corresponding drug targets for each small molecule compound are shown in Tables 1 and 2.

The Identified Biomarkers and Their
Corresponding Potential Multiple-Molecule Drug to Prevent the Progression From Early to Middle Stage of Gastric Cancer Considering the drug regulation ability and drug sensitivity as design specifications for the identified biomarkers, we proposed a multiple-molecule drug containing sulforaphane, neratinib, pimozide, fostamatinib, and parthenolide to prevent the progression of GC from early stage to middle stage. The corresponding drug targets were PLSCR1, BCL2, BECN1, and STAT1 as shown in Table 1.
In terms of drug regulation ability, these small-molecule compounds are able to restore to the normal expression of four target genes. The results showed that the expressions of PLSCR1, BCL2, BECN1, and STAT1 were decreased through the treatment with the proposed potential multiple-molecule drug. Among them, sulforaphane has been found to be active against several forms of cancers by protecting cells from DNA damage to the modulation of the cell cycle via proapoptotic, anti-angiogenesis and anti-matastasis activities [70]. Treatment of neratinib could attenuate the invasive ability of gastric cancer cells [71]. There is a study suggesting neratinib interacts with BCL2 inhibitor to kill mammary cancer cells [72]. Pimozide is a clinical drug, which is used to treat psychotic disease. However, more and more studies have indicated that it also held the ability to suppress cancer cells [73], [74], [75]. Furthermore, pimozide has been demonstrated that its treatment resulted in the decreased level of BCL2 [76]. In addition to the treatment of chronic immune thrombocytopenia [77], there is a study showing that repurposing of fostamatinib is an effective treatment to   [78]. Moreover, evidence has demonstrated that fostamatinib could also prevent the promotion of pro-tumorigenic microenvironment mediated by B-cell [79]. To parthenolide, a natural compound extracted from Pyrethrum partheniuma, is primarily used for the treatment of fever, arthritis, and migraine [80], [81]. In one synergistic analysis, parthenolide has been demonstrated to facilitate apoptosis and inhibit proliferation, migration, and invasion by inhibiting the STAT3 signaling pathway in human gastric carcinoma cells [82]. Notably, parthenolide not only induced cellular apoptosis in many types of tumors but also inhibited cell growth in gastric cancer cell lines significantly [83], [84].

The Identified Biomarkers and Their
Corresponding Potential Multiple-Molecule Drug to Prevent the Progression From Middle to Late Stage of Gastric Cancer For preventing the progression from middle to late stage of GC, we proposed a multiple-molecule drug comprising of elesclomol, rucaparib, bortezomib, YM-155, fostamatinib, and parthenolide. The corresponding drug targets were PCM1, AKT1, JUN, EGR1, and EAPP as shown in Table 2.
From the viewpoint of drug regulation ability, the expression of PCM1 and AKT1 were increased and the expression of JUN, EGR1, EAPP were decreased after the treatment with the proposed potential multiple-molecule drug. In other words, it has the ability to reverse the abnormal expression toward the five drug targets. Elesclomol, a novel small molecule compound, induces apoptosis in cancer cells through the induction of oxidative stress [85]. One study has suggested that elesclomol-elicited apoptosis was influenced by AKT survival signaling in breast cancer cells [86]. Moreover, it has been granted orphan drug and fast track designation from the U.S. Food and Drug Administration (FDA) for the treatment of metastatic melanoma [86], [87]. Rucaparib has been used to treat ovarian cancer or used as maintenance therapy in adult patients with recurrent or relapsed ovarian cancer who are in a complete or partial response to platinum-based chemotherapy [88]. It is noted that rucaparib has been found to share the key benzamide moiety with a weak AKT inhibitor [89]. Bortezomib, a proteasome inhibitor, is one of the clinically approved molecular drugs for multiple myeloma [90]. Additionally, it has been shown to display anti-tumor effects in several solid tumors including gastric, prostate, and pancreatic cancer [91], [92], [93], [94]. Functional cooperation of MYC and EGR1 is required for bortezomibinduced cell death [95]. YM-155 is identified as a specific inhibitor of survivin. The treatment of YM-155 significantly inhibited cell proliferation and induced apoptosis of gastric cancer cells [96]. Moreover, most of the proposed small-molecule compounds are approved by the U.S. Food and Drug Administration (FDA) except for sulforaphane, parthenolide, elesclomal, and YM-155. Sulforaphane, ClinicalTrials. gov Identifier NCT03232138, has been in the phase 2 clinical trial studying about whether it is a useful dietary supplement for prevention of lung cancer in human. YM-155, Clin-icalTrials.gov Identifier NCT00281541, is in the phase 2 clinical trial for patients with unresectable stage III or metastatic stage IV melanoma. Taken together, repurposing old drugs to treat both common and rare disease is gradually becoming an attractive proposition. Drug repurposing with the help of computation approaches leads to reducing overall development costs and shorten development timelines. Identifying new uses for approved or investigational drugs that are outside the scope of the original medical indication might bring new hope to cancer therapeutics.

The Model Evaluation for Systems Identification
to Infer the Core Signaling Pathways of Gastric Cancer In drug discovery, biomarker identification is an important problem. The ligand in the microenvironment binds to receptor triggering downstream signaling cascade and leads to the progression of tumor cells. To investigate the core signaling pathways of GC and identify biomarkers based on the molecular progression mechanisms, we leveraged NGS data to do system identification using reversed engineering method. Afterward, we applied AIC, which could prune the false positives of interactions and regulations and conquer the overfitting and underfitting problems, to system order detection for obtaining real GWGENs. For model evaluation, we took another independent dataset (GSE26899), including 11 early-stage GC samples, 18 middle-stage GC samples, and 63 late-stage GC samples and calculated the AIC values for the common symbols. Subsequently, we executed random permutation for 1000 times on our original data. To the new data, taking one common symbol as an example, if the common symbol's AIC value could be lower than all of the AIC values after random permutation in the original dataset, we would say that the common symbol has significant p-value (p-value ¼ 0.001). For the original data, during 1000 times random permutation, if we could not find the AIC value, which is lower than the identified AIC value, we would say the symbol with significant p-value (p-value ¼ 0.001). Here, we would like to know how many significant genes in independent data overlap with significant genes in the original data. In other words, it reflects the robustness of our proposed model. According to the model evaluation results for each stage of GC in the independent data, the overlap proportion is 0.28, 0.3, and 0.47, respectively. Even if the sample size is small coming from a different platform, the proposed model could still be robust to provide some significant reliable results. Moreover, the more samples we have from new data, the more significant results we obtain from our proposed model.

CONCLUSION AND FUTURE WORKS
According to our proposed systems medicine design procedure, we applied systems biology approaches, including system modeling, reversed engineering, system order detection, and principal network projection methods to identify biomarkers. Moreover, we trained a DeepDTI in advance to predict drug-target interaction for the identified biomarkers. Considering drug regulation and drug sensitivity as design specifications by the two designed filters, two multiple-molecule drugs are proposed to prevent the progression from early to middle and middle to late stage of GC, respectively. The data-driven computational approaches involving systemic analyses and deep neural network learning scheme lead to the formulation of repurposing hypotheses. With the proposed systems medicine design procedure, we provide alternative methods to medicine design for the therapeutics of GC. In this study, we demonstrated the validity of our pipeline for the identification of stage-specific biomarkers. It should be noticed that only gene expression data is being used in the current pipeline. Along with the development of NGS technology, more and more genomics data have become publicly available giving us a great opportunity to enhance pipeline. For example, epigenetics (e.g. histone modification) has been proved to play pivotal rules and such regulation could be modeled by integrating Chip-seq data into our pipeline; similarly, by integrating Bisulfite-seq data, we can characterize the regulation of DNA methylation on gene expression; in addition, by integrating Hi-C (or CHIA-PET) data, we can model the regulatory effect of long-range chromosome interactions (e.g. enhancer-promoter, enhancer-enhancer) on gene expression. In summary, we expect that as multiple types of genomics data being considered, the proposed pipeline could help us reveal a much more comprehensive gene regulatory landscape of GC cells and speed up drug repurposing efficiency. The DeepDTI and model evaluation analysis code are available upon request to the corresponding.

ACKNOWLEDGMENTS
This work was supported by the Ministry of Science and Technology of Taiwan under grant number MOST 107-2221-E-007-112-MY3.