Integrative Network Modeling Highlights the Crucial Roles of Rho-GDI Signaling Pathway in the Progression of non-Small Cell Lung Cancer

Non-small cell lung cancer (NSCLC) is the most prevalent form of lung cancer and a leading cause of cancer-related deaths worldwide. Using an integrative approach, we analyzed a publicly available merged NSCLC transcriptome dataset using machine learning, protein-protein interaction (PPI) networks and bayesian modeling to pinpoint key cellular factors and pathways likely to be involved with the onset and progression of NSCLC. First, we generated multiple prediction models using various machine learning classifiers to classify NSCLC and healthy cohorts. Our models achieved prediction accuracies ranging from 0.83 to 1.0, with XGBoost emerging as the best performer. Next, using functional enrichment analysis (and gene co-expression network analysis with WGCNA) of the machine learning feature-selected genes, we determined that genes involved in Rho GTPase signaling that modulate actin stability and cytoskeleton were likely to be crucial in NSCLC. We further assembled a PPI network for the feature-selected genes that was partitioned using Markov clustering to detect protein complexes functionally relevant to NSCLC. Finally, we modeled the perturbations in RhoGDI signaling using a bayesian network; our simulations suggest that aberrations in ARHGEF19 and/or RAC2 gene activities contributed to impaired MAPK signaling and disrupted actin and cytoskeleton organization and were arguably key contributors to the onset of tumorigenesis in NSCLC. We hypothesize that targeted measures to restore aberrant ARHGEF19 and/or RAC2 functions could conceivably rescue the cancerous phenotype in NSCLC. Our findings offer promising avenues for early predictive biomarker discovery, targeted therapeutic intervention and improved clinical outcomes in NSCLC.

in NSCLC. We further assembled a PPI network for the feature-selected genes that was partitioned using Markov clustering to detect protein complexes functionally relevant to NSCLC. Finally, we modeled the perturbations in RhoGDI signaling using a bayesian network; our simulations suggest that aberrations in ARHGEF19 and/or RAC2 gene activities contributed to impaired MAPK signaling and disrupted actin and cytoskeleton organization and were arguably key contributors to the onset of tumorigenesis in NSCLC. We hypothesize that targeted measures to restore aberrant ARHGEF19 and/or RAC2 functions could conceivably rescue the cancerous phenotype in NSCLC. Our findings offer promising avenues for early predictive biomarker discovery, targeted therapeutic intervention and improved clinical outcomes in NSCLC.

I. INTRODUCTION
N ON-SMALL cell lung cancer (NSCLC) represents around 85% of all lung cancers, with a 5-year overall survival rate of about 15%. Until recently, there were unclear guidelines and only a generalised approach for the management of NSCLC. There have been increasing attempts to develop strategies to combat NSCLC that take into account a wide range of aspects such as specific oncogenes, signaling pathways, stage of the disease, and other predictive factors [1]. Key to these strategies is the development of ad-hoc computational approaches that can pinpoint key drug targets and develop specific therapies for individual patients [2].
The rapid advancements in high-throughput omics technologies have led to a proliferation of biological data. Consequently, computational approaches in the domains of machine learning, statistics, and data science are increasingly being used to analyse large biological datasets and solve biological problems [3]. These techniques have paved way for cutting down on the low hit-to-lead ratio and the laborious, time-consuming and expensive experiments. Machine learning methods can help significantly to interrogate high-throughput biological data and generate new hypotheses without solely relying on experiments [4].
Biological pathways are an interconnected series of interactions between different biomolecules that modulate key cellular processes. When the functioning of a key pathway is significantly perturbed by aberrantly behaving oncogenes or tumor-suppressor genes, cellular physiology can be seriously compromised and may potentially lead to cancer [5]. Cancer genes have also been shown to clump together in a few, yet key cellular networks. Consequently, pathway and network analyses are useful tools to analyze and decipher the genes that are likely to drive the onset and progression of the cancerous phenotype. Bayesian networks is one such methodology that has been successful in probing the mechanistics of various cancers, including breast [6], pancreatic [7], and lung cancers [8].
In this study, we have used a merged lung cancer transcriptome dataset to unearth key cellular factors that are likely to strongly correlate with the onset of NSCLC and that can be targeted to potentially steer the out-of-control cancerous cell proliferation back to normalcy. The paper is designed as follows. In Section II, we discuss the methodology used using machine learning models, interpretative phenomenological analysis, and bayesian networks. In Section III, we discuss our results. In Section IV, we discuss the significance of our results, offer concluding remarks, and lay out the probable course of action in investigating NSCLC and other diseases in the future.
An overview of the workflow used for this analysis is presented in Fig. 1.

A. Datasets and Feature Selection for Machine Learning
We analyzed a publicly available merged lung transcriptome dataset comprising 1,118 patient-derived samples that included 925 primary NSCLC tumors paired against 193 normal lung tissues from ten independent Gene Expression Omnibus (GEO) microarray datasets [9]. Our analytical pipeline examined 10,077 genes within this dataset as features and disease status (NSCLC/Normal) as the target variables. To select the best contributing features to optimize model complexity, variance, and to avoid over-fitting, we employed Boruta feature selection [10]. Boruta algorithm employs the random forest classifier to estimate the variable importance feature. To minimize randomness and improve accuracy, Boruta generates additional shadow variables by shuffling the values of the original features and subsequently estimating feature importance. Boruta feature selection identifies the necessary features in predicting the target variable and decreases the model complexity at the same time. This procedure was repeated 100 times to impart robustness to the feature selection process.
Following Boruta feature selection, we retained only those features/genes that were also profiled in an independent study that performed next generation sequencing (RNAseq) of nonsmall cell lung cancer (GEO accession GSE81089) [11]; this RNAseq data was also used as an independent test set to assess model performances. The dataset was then separated into a training and validation set at an 80/20 split.

B. Functional Enrichment Analysis
The selected feature genes were examined for the enrichment of specific biological themes using TargetMine 1 data analysis platform [12] and Ingenuity Pathway Analysis (IPA) platform. The enrichment of specific KEGG pathways [13], Reactome pathways [14], Gene Ontology (GO) associations [15] and IPA pathway associations was estimated using Fisher's exact test. The inferred p-values were further adjusted for multiple-test-correction to control the false-discovery-rate using the Benjamini-Hochberg procedure [16] and the annotations/pathways were judged to be significant if the adjusted p-values ≤ 0.05.

C. Co-Expression Gene Module Network Analysis
We employed the weighted gene co-expression network analysis (WGCNA) package in R [17] to identify groups of coexpressed genes among the selected features. The blockwise-Module function was used with the default parameters and power = 14 (high correlation); deepSplit = 2 (medium-sensitivity) to allow for reasonably-sized modules of highly correlated genes. The selected genes were examined for functional enrichment analysis with TargetMine [12].

D. Outliers and Over-Sampling
We used Z-score estimation to identify the outliers in the dataset [18]. Z-score for each data point was computed using the equation given below, where x is any data point and μ, σ are the mean and standard deviation of the dataset respectively. If |Z| > 3, the data point was judged to be an outlier and purged from the dataset. A heavily-skewed distribution in favour of the NSCLC samples in the merged dataset would have significantly impacted the efficacy of the ML algorithms and render classification a challenging task. We therefore, attempted to address this data imbalance by oversampling the minority class, i.e., the 'Normal' class using Synthetic Minority Oversampling Technique (SMOTE) [19]. SMOTE is a well known data augmentation approach that randomly selects a minority class instance p and finds its k nearest minority class neighbors. Next, it selects one of the k nearest neighbors q at random and connects p and q to infer a line segment in the feature space to create a synthetic data point. The synthetic instances are finally generated as a convex combination of the two chosen instances p and q. We used SMOTE to create as many synthetic examples for the minority class as were required to balance the two cohorts. This synthetically augmented dataset was used for the subsequent model building using different ML algorithms.

F. Model Evaluation
Model performances were evaluated by testing their prediction performances on the validation set as well as the independent test dataset. We estimated model accuracy and the area under the AUROC score using the scikit-learn package [27] from Python.

G. Bayesian Network Analysis
Interactions among genes are relatively sparse and therefore, bayesian modeling is an attractive approach to model gene regulatory networks [28], [29]. Bayesian networks have been previously deployed in modeling diverse biological pathways such as Peroxisome proliferator-activated receptor pathway, Alzheimer's disease, and diabetes [29]- [31].
A bayesian network is a collection of nodes which interact in a directed acyclic manner. Bayesian network may be represented as G, Φ , where G is a Directed Acyclic Graph (DAG) and Φ is the Conditional Probability Distribution (CPD) of each node. Using the local Markov independence assumption, any Joint Probability Distribution (JPD) that satisfies the Markov condition can be described as a product of CPDs [32]. That is, where X i is a random variable and P a(X i ) is the set of parents of X i .

1) Discretization:
Discretization of gene expression values into a binary framework buffers the model against noise and reduces computational complexities [33]. We used the maximum likelihood estimator for the mean μ as the threshold and discretized the gene expression values into 0 s and 1 s. Values greater than the threshold were all normalized as '1' and values lesser than the threshold were all normalized as '0'.
2) Prior and Posterior Distributions: Next, we applied bayesian modeling to estimate the network parameters. Let Φ X be the probability that a node X takes the value '1'. For each node, we assumed that Φ X was Beta distributed. That is, Γ(α+β) and Γ is the standard Gamma function.
The data likelihood for n observations is defined as: where i is the number of successes ('1 s').
Since the binomial likelihood is a conjugate to the beta distribution, the posterior distributions of the variables were defined as: where α X = (α X + i) and β X = (β X + n − i).
Using the above equation, the expected value is thus defined as: Similarly, if we have two nodes X and Y , connected such that Y is the parent of X, the conditional posterior probability is defined as: and the expected value of X 1 |Y 1 is defined as: where α X 1 |Y 1 = (n 11 + 1), β X 1 |Y 1 = (n 01 + 1), n 11 is the number of observations where both X and Y are '1 s,' and n 01 is the number of observations where X is '0' and Y is '1' simultaneously.
3) PPI Network Analysis: Lung-specific PPI networks for the selected features were constructed using TargetMine [12] as described previously [34]. We further employed Markov-Clustering (MCL) algorithm [35] to extract specific sub-clusters from the extended PPI network (library (markov-clustering) ver. "0.0.6.dev") To ensure that the inferred clusters were biologically meaningful and minimally noisy, we retained only those clusters that included between five and twenty nodes. The clusters were examined for enriched functional themes as described above.

A. Feature Selection and Functional Enrichment Analysis
We used Boruta feature selection protocol to identify an initial list of 489 genes out of the 10,077 genes originally sampled in the dataset. To ensure robustness of the selected features, we retained only 412 of these 489 genes (features) that overlapped in the training and the independent test datasets. Further, principle coordinate analysis followed by tSNE distribution analysis demonstrated that the selected features neatly resolved the NSCLC cohort from the Normal/Healthy cohort (Fig. 2(a)).
We then examined the selected features for enriched biological themes to identify the key cellular processes that were dysregulated in NSCLC. Specifically, the signaling by Rho GTPases emerged one of the key enriched pathways-13 of the selected feature genes were mapped to the enriched Reactome pathway "Rho GTPase cycle" (R-HSA-194840; p = 0.024901); WGCNA algorithm further sequestered 44 genes (Blue module, Additional File 1) specifically perturbed in NSCLC cohort compared with the normal cohort; five of these genes were mapped to enriched Reactome pathway "Rho GTPases Activate Formins" (R-HSA-5663220; p = 0.006487). RhoGDI pathway also emerged among the top five canonical pathways associated with the selected features by IPA analysis (Table I). Hence, we further investigated the probable roles of RhoGDI signaling in the context of NSCLC and as a potential target for therapeutic intervention.

B. Machine Learning
We generated multiple prediction models using different machine learning classifiers and computed their area under curvereceiver operator characteristic (AUROC) score [36] and the Cohen's kappa coefficient [37] to assess their performances. AUROC score ranges from 0.5 and 1.0; where a score of 0.5 indicates that the model has no discriminatory ability to differentiate between classes, while a score of 1.0 indicates high measure of separability and therefore, an excellent model performance. Similarly, the Cohen's kappa coefficient ranges from 0 and 1, with 0 reflecting the worst performing model and 1 reflecting the best performing model.
Using the 412 selected features, we trained the classifiers both on the validation and the independent test data sets. Figs. 3 and 4 show the AUROC and Cohen's kappa coefficient scores obtained by each method on the validation and test sets respectively. Based on scores and the strength of the classifiers, we further categorized them into ensemble classifiers (CatBoost, RandomForest, XGBoost, LGBM) and normal classifiers (LogisticRegression, DecisionTree, SVC).
Next, we analyzed the ensemble and normal classifiers by plotting the respective ROC curves and computed their AU-ROC values as shown in Fig. 5. The support vector classifier (SVC) and the decision tree classifier models seemingly underperformed, since these models lag behind when faced with a large dimensional dataset with only 412 features and 1,118 rows before over-sampling. This lag results in over-fitting of the normal classifiers, and hence, the ensemble classifiers are employed to overcome this limitation.
XGBoost (Booster) emerged from this analysis as the bestperforming method (Figs. 3 and 4). Based on a 5-fold crossvalidation, XGBoost displayed the least deviation in the scores, and consequently, the smallest over-fitting model and offered the most robust predictions. XGBoost also achieved the highest AU-ROC score of 0.993 and the highest Cohen's kappa coefficient value of 1.

C. Bayesian Network Analysis
In parallel to the previously described machine learning approach, we constructed a bayesian network equivalent of the RhoGDI signaling pathway from literature [38], and mapped the key regulatory genes that could potentially be involved in the progression of and cellular proliferation in NSCLC.
Starting from our original list of 412 genes, we selected those that matched the RhoGDI signaling pathway or that are known to be related to said pathway as according to the definitions of the KEGG, Reactome and IPA pathway maps, and used them as input for the definition of the network. Then, we inferred the expected values for each node (using equations 6 and 7) and overlaid these to the nodes of the RhoGDI pathway in order to create both healthy (Fig. 6) and tumor (Fig. 7) versions of it.
The expected values reflect the probability of a given gene to be over-expressed given the over-expression of its upstream genes, with lower values indicating a relative lower probability of the gene being over-expressed, whereas a higher value indicates a relatively higher probability. To help quickly identify high-(and low-) likely over-expressed genes, we used color blue to display nodes with value less than 0.4; yellow for nodes with values between 0.4 and 0.6; and red for nodes with values higher than 0.6.
When comparing the over-expression probability values, we observed that genes at the bottom of the RhoGDI pathway, namely LACTB (lactamase beta; 0.67), ERMAP (erythroblast membrane associated protein [Scianna blood group]; 0.71), and CFL2 (cofilin 2; 0.92), were less likely to be over-expressed in NSCLC as compared to Normal (Fig. 6). Since these genes directly influenced the output processes of Cytoskeletal Reorganization, Actin Linkage, and Actin Stabilization (Fig. 6), we proceeded to examine these three processes, in order to   pinpoint their key modulators and potential targets for gene intervention.
1) Gene Intervention: The activity of output processes Cytoskeletal Reorganization, Actin Linkage, and Actin Stabilization appeared to be significantly diminished in NSCLC compared with Normal, hence, we attempted to pinpoint the key factors that could be potentially modulated to restore these processes to normalcy.
For this, we computed the conditional probabilities for an output gene (LACTB, ERMAP and CFL2) to be overexpressed given an upstream gene was under-expressed simultaneously. That is, we computed P (an output gene is ↑ | an upstream gene is ↓) for each combination of an output gene and an upstream gene in the RhoGDI pathway (Additional File 2). Then, to identify the genes that were significantly involved across all three output processes, we computed their average probability scores. Our calculations suggest that the genes with the highest scores, ARHGEF19 (Rho guanine nucleotide exchange factor 19; 0.77) and RAC2 (Rac family small GTPase 2; 0.77) are key modulators of the three output processes.

IV. DISCUSSION
NSCLC is the most widespread of all lung cancers and a leading cause of cancer-related mortality worldwide. Furthermore, NSCLC patients remain highly susceptible to a relapse post-operatively. The availability of high-throughput data from NSCLC clinical cohorts has offered unprecedented opportunities to investigate NSCLC biology and evolve new and more effective strategies for NSCLC treatment.
In this study, we have employed an integrative computational approach involving machine learning, functional enrichment analysis, bayesian modeling and PPI network analysis to analyze NSCLC transcriptome dataset. First, using Boruta feature selection, we prioritized the genes (features) that could effectively differentiate NSCLC cohort from the Normal/Healthy cohort. Next, we generated multiple machine learning models using the selected features and evaluated their effectiveness in delineating the NSCLC cohort from the Normal. XGBoost displayed the best performance in these trials, displaying high AUROC scores and Cohen's Kappa coefficient values. Our analysis therefore, demonstrated the suitability of the feature-selected genes to build predictive diagnostic models for NSCLC.
Functional enrichment analysis (and co-expression network analysis using WGCNA) of the feature-selected genes highlighted RhoGDI signaling as one of the key pathways modulating tumorigenesis of NSCLC. We generated a PPI network with the feature selected genes, which was subsequently partitioned into sub-clusters using the Markov Cluster Algorithm. Four of these sub-clusters were associated with the RhoGDI Signaling pathway. Moreover, one sub-cluster included KNL1, a key gene involved in the Kinetochore Metapahase Signaling Pathway, which had previously emerged as one of the key canonical pathways from the functional enrichment analysis. We believe that these four sub-clusters indeed represent protein complexes highly relevant to NSCLC.
In parallel, we implemented a bayesian framework to pinpoint specific genes and processes that were linked with aberrations in RhoGDI signaling and were likely to be responsible for the proliferation of the NSCLC phenotype. Bayesian modeling analysis suggested that the dysregulation of Actin stabilization induced by aberrations in the expressions of LACTB, ERMAP and CFL2 genes were likely to play crucial roles for NSCLC. Taken together, we hypothesized that these cellular processes are significantly under-expressed and therefore more likely to be destabilized in NSCLC as compared to Normal.
Next, we examined the activities of ARHGEF19 and RAC2 gene products in the context of RhoGDI signaling and we hypothesized that they were primal in modulating RhoGDI signaling and influencing Actin stabilization and cytoskeleton reorganization further downstream. Indeed, it has been demonstrated previously that the stabilization of the Actin cytoskeleton structures and inhibiting Focal adhesion turnover was able to impede the progression and invasion of NSCLC [39]. Moreover, the over-expression of ARHGEF19 was previously hypothesized to play a crucial role in NSCLC tumorigenesis by activating MAPK signaling [40]. Our analysis has now suggested that ARHGEF19 may play a similarly crucial role in modulating the stability of the Actin and cytoskeleton reorganization and therefore, would be an extremely attractive candidate for therapeutic intervention to normalize MAPK signaling and Actin stabilization in the context of NSCLC. Also consistent with our observations, RAC2 was previously reported to be upregulated in NSCLC tissues and linked with poor prognoses of NSCLC patients [41] and therefore is an attractive candidate for anti-NSCLC therapy. As mentioned before, our study is strongly focused on the expression of genes associated with NSCLC and the selection of the initial datasets clearly reflects this focus. While the treatment of NSCLC has been, to the best of our knowledge, focused on mutations in tumor driver genes such as EGFR, the emergence of resistance against EGFR targeted treatments has necessitated exploring additional parameters such as signaling pathway perturbations that can be examined in the context of gene expression. By using the Boruta selection process to determine the relevant genes in the original dataset, we highlighted the expression of genes that were strongly correlated with our NSCLC cohort, leaving out others that are known to be drivers in lung cancer when mutated.
One study by Liang and colleagues directly links the expression of protein EGFR (among others) to NSCLC, however, the data used by these authors include information from the patients medical records (such as smoking history) together with tumor histology [42]. Here, the authors were able to correlate the expression of EGFR to tumor stage and lymph node metastasis; however, given the limitations of the chosen dataset (which does not include such information), this was a path impossible for us to follow. Nevertheless, it indeed provides an interesting and challenging line of research to continue applying the methods first introduced here, and we look forward at addressing it in the future.
In conclusion, our study has highlighted gene expression signatures and protein complexes strongly correlated with NSCLC. More specifically, we have provided roles for RhoGDI signaling and the ARHGEF19 and RAC2 genes in the proliferation of NSCLC phenotype. Collectively, our findings have offered a deeper mechanistic understanding of the pathophysiology of NSCLC and showcased newer avenues for effective anti-NSCLC strategies. In particular, therapeutic targeting of RhoGDI signaling appears to be a promising approach for the treatment of NSCLC. As part of our future work, we expect applying our methodology in the context of collecting more data, could be helpful in further understanding the role the RhoGDI pathway and/or other well know tumor driver pathways/genes might have, not only in the identification, but also in the prognosis of NSCLC patients.

ACKNOWLEDGMENT
The statements made herein are solely the responsibility of the authors.
IV. DATA AVAILABILITY The custom scripts, data sets and supplementary material associated with this study are available at https://github.com/ lokeshtr/NSCLC.