SECTION I

## INTRODUCTION

A gene regulatory network (GRN) plays a key role in addressing various biological problems; for example, it can identify transcription factors of specific disease marker genes [1] and can be used to investigate evolutionary clues in developmental processes that affect the development of certain diseases [2]. Thus, various mathematical and statistical methods have been introduced to identify gene regulatory networks [3], [4], [5], [6], [7], [8] that can be used to infer causalities among genes. In parallel, intrinsically probabilistic models have long been used in computer system performance [9], and have been extended to the study of gene regulatory networks [10], while statistical inference of various data related issues has also been considered for many years [11]. Recently, many reverse engineering methods have been assessed by comparing their performance in a competition called DREAM (Dialogue for Reverse Engineering Assessments and Methods) [12]. However, in GRN related applications, the approach has focused on networks that consist of at most a few hundred genes, and the computational complexity of evaluating even such relatively small systems requires the use of parallel programming techniques on multiprocessor machines to reduce the very large computation times [13], [14]. It is therefore also important to improve the inference algorithms so that they provide more reliable large-scale regulatory networks for researchers who are unable to use expensive computational equipment.

Generally, there are two main challenges for genome-wide scale GRNs. First, microarray data used in GRN inference consist of at most a few hundred samples $(n)$ while the number of genes $(p)$ can be up to tens of thousands. Therefore, models based on ordinary differential equations or on linear regression using ordinary least squares employ regularization approaches, where a penalty added to the ordinary least squares estimation makes the solution unique and shrinks the coefficients of the predictors towards zero [15], [16]. H. Zou and T. Hastie introduced the Elastic-net (Enet), which combines Lasso and Ridge penalties and is useful if $n\ll p$ even when predictors are highly correlated [17], [18]. The Graphical Gaussian model (GGM) with a Stein-type shrinkage estimator [19], [20] is also available for large-scale GRN construction. The GGM has been shown to outperform mutual information and Pearson correlation based algorithms [21]. Second, along with this dimensionality problem, various sources of heterogeneous biological data need appropriate integration methods [22], [23] to enhance the network inference, which primarily uses mRNA expression data. Bayesian networks (BNs) have been used to combine different types of data by updating their prior probability [24], [25]. However, BNs with discretized data suffer from information loss, and the approach with continuous data is computationally intractable.

In this paper we tackle these two problems together, which is one of the important issues in systems biology and is difficult to achieve because of the large scale and the complexity of living systems. Our approach is similar to that of J. Zhu et al., who integrated transcription factor binding site and protein-protein interaction information along with gene expression data to construct large-scale GRNs [22]. ARACNE [8], [26], [27], [28] and MINDy [29] are also well-known reverse engineering methods based on mutual information. However, these approaches may lose gene expression information because of the discretization of the data.

In our work we use a Bayesian model averaging (BMA) technique, which enables us to merge appropriate locally fitted models and also to integrate prior knowledge obtained from other sources of biological data. Based on this, we have compared our proposed method with other regularization based linear models (Enet and GGM). The proposed method showed better sensitivities than the other linear models at a fixed specificity. In the brain tumor analysis, DNA-protein affinity information [30] was integrated in constructing three grade-dependent large-scale networks. We found a set of key “regulatory genes” that show high connectivity in the network structure and whose gene ontology (GO) terms are mainly involved in regulation and developmental processes; these may give useful information about the regulatory genes that trigger tumor cells.

SECTION II

## METHODS

### A. BMA for Linear Regression Models

Suppose we have a standardized gene expression dataset $X=({\bf x}_{1},\ldots,{\bf x}_{p})$, where ${\bf x}_{i}$ is the length-$n$ expression vector of the $i$th gene, and let $M_{il}$ be the $l$th regression model in which the $i$th gene is the dependent variable with a set of independent variables (in-degrees), $K$:

$${\bf x}_{i}=\sum_{j\in K}b_{ji}{\bf x}_{j}+e_{i}\eqno{\hbox{(1)}}$$

where $e_{i}$ is an error term and $b_{ji}$ is a coefficient representing the effect of gene $j$ on the $i$th gene. Let $\theta_{ji}$ be the true coefficient; then, by Bayes' rule and the law of total probability,

$$\eqalignno{p(\theta_{ji}\vert X)=&\,\sum_{l=1}^{L}p(\theta_{ji}\vert X,M_{il})p(M_{il}\vert X)&\hbox{(2)}\cr{\rm where}\quad p(M_{il}\vert X)=&\,{p(X\vert M_{il})p(M_{il})\over\sum_{h=1}^{L}p(X\vert M_{ih})p(M_{ih})}&\hbox{(3)}}$$

and $L$ is the number of all possible models which include the $j$th gene as one of their predictor variables. These equations mean that the full posterior distribution of $\theta_{ji}$ is a weighted average of its model-specific posterior distributions, $p(\theta_{ji}\vert X,M_{il})$. The weight is the posterior model probability, $p(M_{il}\vert X)$ in (3). Raftery obtained the posterior model probability using the approximation $p(X\vert M_{il})\approx\exp(-(1/2)BIC_{il})$, where $BIC_{il}$ is the Bayesian information criterion of the model $M_{il}$ [31]. Here $BIC_{il}=n\log(1-R_{il}^{2})+K_{n}\log n$, where $R_{il}^{2}$ is the coefficient of determination of the model $M_{il}$ and $K_{n}$ is the length of $K$ ($K_{n}=3$ in this study). So, in an exhaustive search, the maximum number of models is $L=\sum_{k=1}^{K_{n}}{n\choose k}$ (in this count $n$ denotes the number of candidate predictor genes, as in Section II-C, rather than the number of samples). Although the maximum in-degree is fixed at three, our ensemble approach allows a node to have more than three in-degrees, as shown in the results section.
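As a rough illustration of how (3) can be evaluated with the BIC approximation, the following minimal sketch turns the $R^{2}$ values of a few candidate models into posterior model probabilities. The function names `bic` and `posterior_model_probs` and the $R^{2}$ values are hypothetical and chosen only for illustration; a uniform model prior is assumed when none is given, and the minimum BIC is subtracted before exponentiating purely for numerical stability.

```python
import math

def bic(r_squared, n_samples, n_predictors):
    """BIC of a regression model: n * log(1 - R^2) + k * log(n)."""
    return n_samples * math.log(1.0 - r_squared) + n_predictors * math.log(n_samples)

def posterior_model_probs(bics, priors=None):
    """Normalize exp(-BIC / 2) * prior over the candidate models, as in (3)."""
    if priors is None:
        # Assumption: uniform model prior when no prior knowledge is supplied.
        priors = [1.0 / len(bics)] * len(bics)
    best = min(bics)  # shifting by the minimum BIC leaves the ratios unchanged
    weights = [math.exp(-0.5 * (b - best)) * p for b, p in zip(bics, priors)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical R^2 values for three candidate models of one target gene (n = 50).
bics = [bic(0.60, 50, 1), bic(0.70, 50, 2), bic(0.72, 50, 3)]
probs = posterior_model_probs(bics)
```

In this toy example the two-predictor model receives the largest weight: the third predictor raises $R^{2}$ only slightly, which does not offset the extra $\log n$ penalty.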

### B. Prior Update

Along with the mRNA expression data, there are various sources of biological data which can enhance the inference of the interactions among genes. In order to integrate this information, we define a prior distribution having the form of a Gibbs distribution, which has been used successfully in other gene network integration studies [32], [33]. With this form, the model prior, $p(M_{il})$, takes sharply higher values as more of the model's variables are supported by the prior information.

Let $E_{1}(M_{il})$ be the “energy” function of $M_{il}$. This energy can be considered as the gap between the observed and expected degree to which the model $M_{il}$ is true. If we define $S_{i}$ as the set of genes whose expression affects the $i$th gene, then the corresponding energy function is

$$E_{1}(M_{il})={1\over K_{n}}\sum_{k=1}^{K_{n}}\left(1-I({\bf x}_{k}\in S_{i})\right)\eqno{\hbox{(4)}}$$

where ${\bf x}_{k}$ is the $k$th predictor variable of $M_{il}$ and $I(C)$ is an indicator function which is 1 when $C$ is true and 0 otherwise. Equation (4) shows that the energy is zero when all predictors of the model $M_{il}$ appear in the prior information $S_{i}$, and that it increases as fewer of them do. So the prior distribution having the form of a Gibbs distribution is

$$P(M_{il}\vert\beta)={e^{-\beta E_{1}(M_{il})}\over\sum_{h=1}^{L^{\prime}}e^{-\beta E_{1}(M_{ih})}}\eqno{\hbox{(5)}}$$

where $L^{\prime}$ is the total number of selected regression models and $\beta$ is a hyperparameter which indicates the strength of the influence of the prior knowledge. In the same way, other biological sources can be used to enhance the biological significance of the inferred GRNs. If we define two energy functions, $E_{1}$ and $E_{2}$, the prior probability distribution is

$$P(M_{il}\vert\beta_{1},\beta_{2})={e^{-\left(\beta_{1}E_{1}(M_{il})+\beta_{2}E_{2}(M_{il})\right)}\over\sum_{h=1}^{L^{\prime}}e^{-\left(\beta_{1}E_{1}(M_{ih})+\beta_{2}E_{2}(M_{ih})\right)}}\eqno{\hbox{(6)}}$$

We used $\beta_{1}=\beta_{2}=5$.
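To make (4) and (5) concrete, here is a minimal sketch for a single energy function. The names `energy` and `gibbs_prior`, the gene labels, and the supported set are hypothetical; the sketch also assumes that the average in (4) is taken over the predictors actually present in each model.

```python
import math

def energy(predictors, supported):
    """Energy (4): fraction of a model's predictors NOT backed by prior knowledge."""
    return sum(1 for g in predictors if g not in supported) / len(predictors)

def gibbs_prior(models, supported, beta=5.0):
    """Prior (5): exp(-beta * E) normalized over the candidate models."""
    weights = [math.exp(-beta * energy(m, supported)) for m in models]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical prior knowledge: g1 and g2 are known to affect the target gene.
supported = {"g1", "g2"}
models = [("g1", "g2"), ("g1", "g7"), ("g7", "g8")]
prior = gibbs_prior(models, supported)
```

With $\beta=5$, the fully supported model has energy 0, the half-supported one 0.5, and the unsupported one 1, so their prior weights are in the ratio $1:e^{-2.5}:e^{-5}$.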

### C. Build a Final GRN Structure

In our ensemble approach, the large number of genes not only causes a massive computation time but also yields too many plausible models, even when the maximum in-degree is fixed. We therefore used the following model selection procedure instead of Occam's window, which is normally used in BMA algorithms.

• Choose the best $m$ models among the models having $k=1.$
• Extend the selected models to models with $k=2$ by adding one independent variable.
• Select the best $m$ models among the extended models.
• Repeat until $k=K_{n}.$
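The procedure above can be sketched as a greedy, beam-style search. This is only an illustration under stated assumptions: `stepwise_search` is a hypothetical name, `score` is any user-supplied function for which lower is better (BIC, for instance), and ties are broken by the sorted predictor tuple so that the result is deterministic.

```python
def stepwise_search(candidates, score, m, max_size):
    """Keep the best m models of each size, growing them one predictor at a time."""
    def rank(models):
        # Lower score is better; the model tuple itself breaks ties.
        return sorted(models, key=lambda model: (score(model), model))[:m]

    frontier = rank((g,) for g in candidates)
    selected = list(frontier)
    for _ in range(2, max_size + 1):
        # Extend every kept model by one predictor that it does not yet contain.
        extended = {tuple(sorted(model + (g,)))
                    for model in frontier for g in candidates if g not in model}
        frontier = rank(extended)
        selected.extend(frontier)
    return selected

# Toy score (hypothetical): each gene has a fixed "usefulness" and the score is
# the negated sum, so models with more useful genes rank first.
usefulness = {0: 3.0, 1: 2.0, 2: 1.0, 3: 0.5}

def toy_score(model):
    return -sum(usefulness[g] for g in model)

selected = stepwise_search([0, 1, 2, 3], toy_score, m=2, max_size=3)
```

With $m=2$ and a maximum size of three, the sketch keeps six models in total, two of each size, whenever enough candidate models exist at every size.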

Thus a total of $L^{\prime}=m\times K_{n}$ regression models can be selected for one gene. From [31], the posterior probability that the effect, $\theta_{ji}$, is nonzero is

$$P(\theta_{ji}\neq 0\vert X)=\sum_{l=1}^{L^{\prime}}p(M_{il}\vert X)I\left(j\in K(M_{il})\right)\eqno{\hbox{(7)}}$$

where $K(M_{il})$ is the set of predictors in the model $M_{il}$.
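Equation (7) amounts to summing the posterior probabilities of the selected models that contain gene $j$. A minimal sketch follows; the function name `edge_probability`, the gene labels, and the model probabilities are hypothetical.

```python
def edge_probability(gene, models, model_probs):
    """P(theta_ji != 0 | X): posterior mass of the models that contain `gene`."""
    return sum(p for model, p in zip(models, model_probs) if gene in model)

# Hypothetical selected models for one target gene, with posterior probabilities.
models = [("g1",), ("g1", "g2"), ("g2", "g3"), ("g1", "g2", "g3")]
model_probs = [0.1, 0.4, 0.2, 0.3]

edge_probs = {g: edge_probability(g, models, model_probs)
              for g in ("g1", "g2", "g3", "g4")}
```

Because the candidate regulators are pooled over all selected models, the number of genes with nonzero probability can exceed the per-model limit $K_{n}$, even though no single model does.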

In order to determine the final GRN structure, we introduce a measure called the edge ratio, $Er$, which can be used as a criterion to prune the edges whose probability in (7) is likely to be zero:

$$Er={\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}I\left(P(\theta_{ji}\neq 0\vert X)\neq 0\right)\over n(n-1)/2}\eqno{\hbox{(8)}}$$

where the denominator is the number of all possible undirected edges and the numerator is the number of undirected edges whose effect sizes are nonzero. It is known that the in-degree distribution of a biological network is scale free, that is, it follows $ck^{-\gamma}$, where $k$ is the in-degree, $c$ is a normalization constant, and $2<\gamma<3$ [34] ($\gamma=2.5$ in this study). Therefore, $Er$ can be expressed as a function of the total number of genes, $n$, and its criterion $\alpha_{Er}$ can be obtained numerically using the following equation:

$$\alpha_{Er}={2\sum_{k=1}^{n-1}ck^{1-\gamma}\over n-1}\eqno{\hbox{(9)}}$$

The final network structure is determined by deleting the edges having the smallest probability in (7) until $Er\leq\alpha_{Er}$.
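One way to read (9) is that $\sum_{k}ck^{1-\gamma}$ is the expected degree under the power law, so $n$ times that value divided by $n(n-1)/2$ gives the expected edge ratio. The sketch below evaluates (9) numerically and prunes a set of edge probabilities accordingly. It assumes that $c$ normalizes $ck^{-\gamma}$ over $k=1,\ldots,n-1$, which the text does not state explicitly; the names `edge_ratio_threshold` and `prune_edges` are hypothetical.

```python
def edge_ratio_threshold(n_genes, gamma=2.5):
    """alpha_Er from (9): 2 * sum_k c * k^(1 - gamma) / (n - 1)."""
    ks = range(1, n_genes)
    # Assumption: c makes c * k^(-gamma) sum to one over k = 1, ..., n - 1.
    c = 1.0 / sum(k ** -gamma for k in ks)
    mean_degree = sum(c * k ** (1.0 - gamma) for k in ks)
    return 2.0 * mean_degree / (n_genes - 1)

def prune_edges(edge_probs, n_genes, gamma=2.5):
    """Keep the highest-probability edges so that the edge ratio is <= alpha_Er."""
    alpha = edge_ratio_threshold(n_genes, gamma)
    max_edges = int(alpha * n_genes * (n_genes - 1) / 2)
    nonzero = [(edge, p) for edge, p in edge_probs.items() if p > 0]
    nonzero.sort(key=lambda item: item[1], reverse=True)
    return dict(nonzero[:max_edges])

# The threshold shrinks as the network grows, reflecting increasing sparsity.
thresholds = {n: edge_ratio_threshold(n) for n in (500, 1000, 1500, 2000)}
```

Sorting once and truncating is equivalent to repeatedly deleting the least probable edge until the criterion is met, provided the probabilities are distinct.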

### D. Permutation Test for Detecting Regulatory Genes

In a conventional gene expression analysis, differentially expressed genes (DEGs) are identified as causal genes for tumor patients by comparing their expression levels with those of normal samples. However, being a DEG does not mean that a gene's expression affects the expression of other genes. So, in this analysis, we define “regulatory genes” as genes whose expression induces the activation of other genes in tumor samples, and we identify them by comparing the degree of each gene among the three network structures.

Let $k_{ij}$ be the number of nodes connected to the $i$th gene in the $j$th group [$i=1,2,\ldots,n-1$; $j=$ “N” (non-tumor), “L” (low grade), “H” (high grade)]. We then performed permutation tests with the alternative hypothesis that $k_{iL}-k_{iN}(=d_{i})>0$ for the comparison of the non-tumor and low grade tumor networks. To carry out this test, we randomly chose a gene in the low grade network structure, took its degree as $k_{iL}^{\prime}$, and computed $d_{i}^{(m)}=k_{iL}^{\prime}-k_{iN}$. By iterating this process $M$ times, we obtained the empirical $p$-value of $d_{i}$ as follows:

$$p\hbox{-}{\rm value},\quad p_{i}={1\over M}\sum_{m=1}^{M}I\left(d_{i}>d_{i}^{(m)}\right)\eqno{\hbox{(10)}}$$

where $I(C)$ is the indicator function. If the $p$-value is less than a criterion $\alpha$, the null hypothesis is rejected. We used $M=5000$ and $\alpha=0.01$.
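The resampling scheme can be sketched as follows. The function name `permutation_p_value` and the toy degree tables are hypothetical. Note also one assumption about the intended reading of (10): the sketch counts the resampled differences that are at least as large as the observed one, which is the usual convention for a one-sided test in which small $p$-values lead to rejection.

```python
import random

def permutation_p_value(k_low, k_normal, gene, n_iter=5000, seed=0):
    """Empirical one-sided p-value for d_i = k_iL - k_iN > 0."""
    rng = random.Random(seed)
    observed = k_low[gene] - k_normal[gene]
    genes = list(k_low)
    hits = 0
    for _ in range(n_iter):
        other = rng.choice(genes)  # a randomly chosen gene supplies k'_iL
        null_diff = k_low[other] - k_normal[gene]
        # Assumed orientation: count null draws at least as large as observed.
        if null_diff >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical degrees: gene 0 becomes a hub in the low grade network, while
# every other gene keeps the same small degree in both networks.
k_normal = {g: 1 + g % 3 for g in range(1000)}
k_low = dict(k_normal)
k_low[0] = 40

p_hub = permutation_p_value(k_low, k_normal, gene=0)
p_plain = permutation_p_value(k_low, k_normal, gene=4)
```

In this toy setting the hub gene obtains a very small empirical $p$-value, because only a draw of the hub itself matches its degree gain, whereas a gene with unchanged degree obtains a large one.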

SECTION III

## SIMULATED NETWORK COMPARISON

Fig. 1. In/out-degree distributions of 500, 1000, 1500, and 2000 node networks. The x-axis is the logarithm of the in/out-degree and the y-axis is the logarithm of the proportion of nodes with the corresponding in/out-degree. (a) In-degree $({\rm p}=500)$. (b) In-degree $({\rm p}=1000)$. (c) In-degree $({\rm p}=1500)$. (d) In-degree $({\rm p}=2000)$. (e) Out-degree $({\rm p}=500)$. (f) Out-degree $({\rm p}=1000)$. (g) Out-degree $({\rm p}=1500)$. (h) Out-degree $({\rm p}=2000)$.

To evaluate the proposed method, we generated large-scale gene expression data as follows. First, a large-scale network structure whose in-degree distribution is scale-free [34] was obtained using the R package “igraph.” This network structure can be converted into a covariance matrix [35] with appropriate edge weights, which are randomly chosen from {−1, 1}. The expression data were then generated from the multivariate normal distribution with the obtained covariance matrix using the R package “mvtnorm.” We employed the following three measures to compare the proposed Bayesian model averaging based network (BMAnet) and BMAnet with prior (BMAnetP) with two other methods, the elastic-net (Enet) [17] and the Gaussian graphical model with shrinkage estimator (GGM) [19]:

$$\eqalignno{{\rm Sensitivity}=&\,{TP\over TP+FN}\cr{\rm Specificity}=&\,{TN\over FP+TN}\cr{\rm Positive}\ {\rm predictive}\ {\rm value}\ (PPV)=&\,{TP\over TP+FP}&\hbox{(11)}}$$

where $TP$, $FP$, $FN$, and $TN$ are the numbers of true positives, false positives, false negatives, and true negatives, respectively. In this study, we fixed the specificity at 0.99, since the sparsity of a large-scale GRN structure yields relatively good specificity regardless of sensitivity.
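The measures in (11) can be computed from a true and a predicted edge list as in the following minimal sketch. It treats edges as undirected and counts every unordered node pair once, which is an assumption made here for simplicity; the function names and the toy edge lists are hypothetical.

```python
def confusion_counts(true_edges, predicted_edges, n_nodes):
    """Count TP, FP, FN, TN over all unordered node pairs."""
    true_set = {frozenset(e) for e in true_edges}
    pred_set = {frozenset(e) for e in predicted_edges}
    tp = len(true_set & pred_set)
    fp = len(pred_set - true_set)
    fn = len(true_set - pred_set)
    tn = n_nodes * (n_nodes - 1) // 2 - tp - fp - fn
    return tp, fp, fn, tn

def performance(tp, fp, fn, tn):
    """Sensitivity, specificity, and positive predictive value as in (11)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
    }

# Toy network with 5 nodes: 3 true edges and 3 predicted edges.
true_edges = [(0, 1), (1, 2), (2, 3)]
predicted_edges = [(0, 1), (2, 1), (3, 4)]
counts = confusion_counts(true_edges, predicted_edges, n_nodes=5)
scores = performance(*counts)
```

Because the number of true negatives grows with the number of node pairs, a sparse network keeps the specificity high even when many predicted edges are wrong, which is why the comparison fixes the specificity and reports the sensitivity and PPV.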

We generated 40 datasets consisting of 10 replications for each of 500, 1000, 1500, and 2000 nodes. There are 50 samples in each dataset. Fig. 1 shows the scale-free in/out-degree distributions of the simulated network structures, where the in-degree is at most a few hundred. In Fig. 2 and Table I, BMAnetP shows the best sensitivity in all four cases, and BMAnet without prior also performs better than Enet. The PPV decreases as the number of genes increases, mainly because the number of possible edges grows much faster (quadratically) than the number of genes, which raises the chance of false positives.

Fig. 2. The performance comparison among BMAnetP (black), BMAnet (dark gray), Enet (gray), and GGM (white). (a) ${\rm p}=500$. (b) ${\rm p}=1000$. (c) ${\rm p}=1500$. (d) ${\rm p}=2000$.
TABLE I THE PERFORMANCE COMPARISON (MEAN)

In order to evaluate the algorithms with non-linear expression data, the DREAM4 dataset [12], [36] consisting of 100 nodes was taken, and the stochastic differential equation approach of GeneNetWeaver [37] was used to generate the simulation data. A total of 30 datasets were generated, each consisting of 100 time-series samples obtained by knocking out one gene. We randomly chose 4 time points from each of the 30 datasets and assembled a complete simulation dataset with 120 samples. The comparison was performed 10 times with the specificity fixed at 0.95, and the results are shown in Table II. In this result, BMAnet and BMAnetP perform better than the others; Enet in particular shows worse sensitivity here, although its sensitivity was similar to that of BMAnet in Table I.

TABLE II THE PERFORMANCE COMPARISON OF DREAM4 DATASET (MEAN (STANDARD DEVIATION))
SECTION IV

## HUMAN BRAIN TUMOR GRNS

Our method was applied to the human brain tumor dataset GSE4290, which was collected from the Gene Expression Omnibus (GEO). This dataset has a total of 180 samples, which consist of 23 non-tumor, 76 low grade tumor (26 astrocytomas + 50 oligodendrogliomas), and 81 high grade tumor (glioblastomas) samples. We chose 4422 genes based on the $p$-values of ANOVA tests, whose null hypothesis is that the expression levels of the $i$th gene are the same among the three groups, and built three network structures, one for each group. Note that the expressions in this dataset were not measured over a time course, so the estimated networks are more likely to be “co-expression networks” than “gene regulatory networks.”
Fig. 3. (a) In-degree and (b) out-degree distributions of three network structures.

Fig. 3 shows the degree distributions of the three estimated networks. Although they have a slightly higher proportion of nodes with a degree of around 7 ($\log 7\approx 2$) than the power law distribution predicts, they still show patterns similar to the power law shape. To allow a comparable visualization of the regulatory genes, three pairs of estimated networks, (a)-(b), (c)-(d), and (e)-(f), are shown in Fig. 4. In each network structure, edges are omitted for clarity, and nodes are placed closer to the center as their degree increases. Black nodes are the regulatory genes that are significant in the permutation test in (10), while white nodes represent the DEGs from the ANOVA test. We identified a total of 65, 56, and 57 regulatory nodes in non-tumor vs. low grade, non-tumor vs. high grade, and low grade vs. high grade, respectively. As expected, they are mostly located near the center of the networks because of their high connectivity. These regulatory genes do not overlap with the DEGs, which implies that they could give different information from that of the conventional DEG approach.

Fig. 4. Estimated network structures of non-tumor [(a), (c)], low grade [(b), (e)], and high grade [(d), (f)]. In each network structure, black and white nodes represent the identified regulatory genes and DEGs, respectively. The higher the degree of a node, the closer it is to the center. Edges are omitted for clarity. In each row (a pair of two network structures), we found regulatory genes whose degree of connection in the right network structure is significantly higher than that in the left one.

In order to annotate the functional properties of these regulatory genes, we obtained the over-represented GO terms of this group of genes using GOstat [38], which performs a hypergeometric test evaluating the statistical significance of the functional and molecular mechanisms of the genes of interest (Table III). The regulatory genes in the non-tumor vs. high grade comparison are the same as those in the low grade vs. high grade analysis except for one gene, so they share GO terms, which are mostly involved in developmental and regulation processes. In the low grade group, the regulatory genes have GO terms similar to those of the high grade tumor, but GO:0005832 suggests that there could be some recovery processes at this stage of the tumor.
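For readers unfamiliar with this kind of enrichment test, the following sketch computes a textbook one-sided hypergeometric tail probability. It is not GOstat's implementation, and the function name `hypergeom_enrichment_p` is hypothetical. In the usage line, only the 4422 background genes and the 65 regulatory genes come from the text; the 200 annotated genes and the 10 hits are invented for illustration.

```python
from math import comb

def hypergeom_enrichment_p(n_population, n_annotated, n_selected, n_hits):
    """P(X >= n_hits) when n_selected genes are drawn without replacement from a
    population in which n_annotated genes carry the GO term of interest."""
    upper = min(n_annotated, n_selected)
    total = comb(n_population, n_selected)
    tail = sum(comb(n_annotated, x) * comb(n_population - n_annotated, n_selected - x)
               for x in range(n_hits, upper + 1))
    return tail / total

# Hypothetical counts: 200 of 4422 genes carry the term, 10 of 65 selected do.
p_enriched = hypergeom_enrichment_p(4422, 200, 65, 10)
```

A small tail probability indicates that the term appears among the selected genes more often than random sampling from the background would suggest; in practice such $p$-values are also corrected for the number of GO terms tested.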

SECTION V

## DISCUSSION

In this study, we propose a large-scale GRN construction approach via the BMA technique, which is usually employed to take into account uncertainties in model selection. We suggest that BMA could be one of the best approaches for explaining GRNs where various molecules such as proteins alternatively interact with each other depending on their environmental conditions. Moreover, the Bayesian approach enables us to integrate different sources of biological information such as the protein-DNA binding affinity.

TABLE III SIGNIFICANTLY OVER-REPRESENTED GO TERMS OF THE IDENTIFIED REGULATORY GENES

In the simulation study that we also report in this paper, our method shows better performance than the previous methods, Enet and GGM, for both linear and non-linear model based simulation data. Although our method requires a longer run time than Enet and GGM, it took approximately 15 h for 4422 genes, which may be acceptable in practice for large-scale network analysis. In the brain tumor application that we have considered, the regulatory genes identified in the tumor networks could be the key genes causing disease by regulating other genes, since most of their GO terms are closely related to regulation and developmental processes. This regulatory gene detection approach could be very important since it provides information which is not obtained from conventional DEG approaches. Through further investigation of regulatory genes, we may be able to identify more genes that are responsible for triggering a particular disease, so that those genes could then be used as drug targets in prevention and therapy.

Finally, parallel programming techniques with CPU/GPU processors [13], [14] would enable us to improve the efficiency of our reverse engineering algorithm and to use more complicated models [39], [40] that include post-translational processes such as ubiquitination and phosphorylation.

## Footnotes

H. Kim is with the Intelligent Systems and Networks Group, Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: haseong.kim08@imperial.ac.uk).

E. Gelenbe is with the Intelligent Systems and Networks Group, Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: e.gelenbe@imperial.ac.uk).
