A Sparse Bayesian Learning Method for Structural Equation Model-Based Gene Regulatory Network Inference

Gene regulatory networks (GRNs) are networks formed by the interactive relationships between genes. Reconstructing GRNs from massive genetic data is important for understanding gene functions and biological mechanisms, and can provide effective support for medical treatment and genetic research. A series of artificial-intelligence-based methods have been proposed to infer GRNs from both gene expression data and genetic perturbations; such algorithms can be more accurate than models that consider gene expression data alone. A structural equation model (SEM), which provides a systematic framework conveniently integrating both types of gene data, is a commonly used model for GRN inference. Considering the sparsity of GRNs, in this paper we develop a novel sparse Bayesian inference algorithm based on a Normal-Exponential-Gamma (NEG) type hierarchical prior (BaNEG) to infer GRNs modeled with SEMs more accurately. First, we reparameterize an SEM as a linear-type model by integrating the endogenous and exogenous variables; then, a Bayesian adaptive lasso with a three-level NEG prior is applied to derive the corresponding posterior mode and estimate the parameters. Simulations on synthetic data are run to compare the performance of BaNEG with some state-of-the-art algorithms; the results demonstrate that the proposed algorithm visibly outperforms the others. Moreover, BaNEG is applied to a real data set composed of 47 yeast genes from Saccharomyces cerevisiae to discover potential relationships between genes.


I. INTRODUCTION
A gene is a segment of DNA and the basic physical and functional unit of heredity. Genes direct the synthesis of functional molecules such as proteins and functional RNA, and thereby determine biological functions and phenotypes. This regulation is mainly implemented via the process of gene expression, including transcription and translation [1]. With the development of high-throughput technologies such as DNA microarrays and RNA-seq, comprehensive genomic data such as genome-wide gene expression measurements and gene variations in biological individuals can be easily obtained via experimental methods [2]. A large amount of genomic data has been reported in the prior literature [3]-[5], and several public repositories (such as the Gene Expression Omnibus (GEO)) have been built to gather genomic data from bacteria, yeasts, plants and humans.
Genes in living organisms usually interact with each other and act together rather than working in isolation. Gene expression reflects regulatory relationships among individual genes and can be fully exploited to reconstruct the underlying GRNs [6]-[8]. Delineating the structure of a GRN is of significant importance for understanding gene functions, cell physiology and biological mechanisms. Additionally, in practical applications, while several intelligent approaches have been developed to implement computer-aided diagnosis [9] and treatment [10], and to help construct better healthcare systems [11]-[13], GRNs can provide a theoretical basis for various medical services at the molecular-biology level (such as genetic testing, genetic diagnosis and molecular targeted therapy). Artificial intelligence is an important technology for big data analysis and has been applied successfully in many fields and scenarios [14]-[16]. Given the explosive amount of genetic data, machine learning is also essential for tractable GRN inference. Much progress has been made in inferring GRN structures from gene expression data. In the early stage, Boolean networks were popular for GRN inference: the state of a gene is represented by a Boolean variable, and the interactions between genes are represented by Boolean functions that determine the state of a gene on the basis of the states of some other regulatory genes [17]-[19]. Boolean models are simple but cannot reflect quantitative biochemical details. Information-theoretic methods infer GRNs by detecting the correlations or mutual information between genes [20]-[22]; these methods carry a heavy computational burden for large data sets and cannot directly infer GRNs with feedback loops. As for approaches based on Gaussian graphical models, undirected GRNs can be determined by the precision matrix [23], [24], and Bayesian networks are employed to infer directed acyclic GRNs (DAGs) [25]-[27]. Differential equations [28], [29] and regression models [30] have been widely used to identify both DAGs and directed cyclic GRNs (DCGs) by estimating the adjacency matrices, and a series of related inference algorithms have been developed in the past few years.
Basu et al. [31] built a computationally efficient iterative random forest (iRF) algorithm to search for high-order interactions, which can also be used to infer GRNs from Boolean gene expression data. The iRF algorithm adopts a bagging step to assess the stability of recovered interactions, which allows identification that is robust to small bootstrap perturbations in the data. However, iRF does not consider genetic perturbations and is not applicable to continuous gene expression data; the required data discretization may cause information loss. Several methods have been developed to infer GRNs by exploiting genetic perturbations, including algorithms based on Bayesian networks [32], [33], causal models based on likelihood tests [34], [35], and approaches based on SEMs. Among them, the approaches based on SEMs have attracted considerable research attention. Different from iRF, GRNs modeled with SEMs take genetic perturbations into consideration explicitly. The gene expression data are treated as endogenous variables, and the genetic variations observed at eQTLs are generally viewed as genetic perturbations and treated as exogenous variables. Due to the character of SEMs, the regulatory effects of both types of variables on each gene can be analyzed simultaneously. Besides, SEMs can provide more accurate GRN prediction than iRF by supporting inference from continuous gene expression data. Furthermore, the topological structures of GRNs are depicted by adjacency matrices, which makes it possible to directly infer both DAGs and DCGs from SEMs.
Gene expression data are usually high-dimensional, so it is difficult to process them without any constraints. As discussed in [36], [37], GRNs, and biochemical networks more generally, are sparse, meaning that a gene directly regulates or is regulated by a small number of genes relative to the total number of genes in the network. Motivated by this fact, network sparsity constraints need to be incorporated into the inference of GRNs modeled with SEMs. In 2004, Xiong et al. [38] proposed to model GRNs with SEMs, after which several related algorithms were put forward successively [39]-[41]. Cai et al. [40] proposed a systematic inference algorithm for sparse SEMs named SML to infer GRNs by exploiting both gene expression data and eQTL data, which was shown to significantly outperform earlier algorithms [39], [41]. Subsequently, Dong et al. [42] formulated a linear regression model from an SEM and developed an iterative Bayesian inference algorithm named LRBI to infer GRNs. More recently, Chen and Ren [43] addressed large systems of SEMs with a 2SPLS algorithm: they obtained consistent estimation via ridge regression at the first stage and then employed the adaptive lasso at the second stage to achieve consistent variable selection. The simulation results in [42] showed that LRBI performs better than SML in terms of power of detection (PD), whereas SML performs better than LRBI in terms of false discovery rate (FDR). In [43], the simulation results demonstrated that 2SPLS has better PD than SML, and lower FDR than SML when the sample size is relatively small, whereas for larger sample sizes the FDR of 2SPLS is worse than that of SML.
A review of the simulation studies above shows that, in general, none of these state-of-the-art algorithms for sparse SEMs has uniformly better performance than the others. Motivated by this, in this paper we focus on the inference of GRNs modeled with SEMs and develop a novel algorithm named BaNEG to improve on existing methods. We first reparameterize the SEM by merging the exogenous and endogenous variables, and then develop a Bayesian adaptive lasso inference algorithm with a hierarchical NEG prior to solve the reparameterized linear-type models. Several simulations are conducted to compare the performance of the proposed BaNEG algorithm with three state-of-the-art algorithms: LRBI [42], SML [40] and 2SPLS [43]. The results demonstrate that BaNEG performs similarly to LRBI [42] in terms of PD, visibly outperforming SML [40] and 2SPLS [43], while the FDR of BaNEG is generally better than that of all the other algorithms.

II. METHODS

A. GRNS MODELED WITH SEMS
As in [39], [40], both gene expression data and genetic perturbations (such as eQTLs) can be integrated into SEMs to model the corresponding GRNs. We consider an SEM with p endogenous variables and q exogenous variables sampled from N individuals; the variables represent the expression levels of p genes and the genotypes of q variant cis-eQTLs, respectively. We use cis-eQTLs here mainly because empirical evidence shows that local genetic polymorphisms tend to have larger effects than trans-eQTLs [44], [45]. The expression levels of these p genes can be measured by technologies such as cDNA microarray and RNA-seq. Let Y := [y_1, y_2, · · · , y_p] be an N × p matrix, in which y_i := [y_{1i}, y_{2i}, · · · , y_{Ni}]^T, i = 1, · · · , p, denotes the expression levels of the ith gene across the N individuals. Let X := [x_1, x_2, · · · , x_q] be an N × q matrix, in which x_j := [x_{1j}, x_{2j}, · · · , x_{Nj}]^T, j = 1, · · · , q, denotes the genotypes of the jth cis-eQTL across the N individuals. Thus an SEM can be represented as a set of structural equations, which can be integrated as

Y = YB + XF + E, (1)

where the p × p matrix B encodes the topological structure of the GRN inferred from the N observations. In B, the entry B_{ij} represents the regulatory effect of the ith gene on the jth gene. It is often assumed that a gene has no effect on itself, which implies B_{ii} = 0 for i = 1, · · · , p. The q × p matrix F denotes the causal effects of the cis-eQTLs, in which F_{km} represents the effect of the kth cis-eQTL on the mth gene. For the unique identifiability of the SEM, as stated in [40], [42], [43], we assume that each gene in the GRN has a unique nonempty set of cis-eQTLs, so q is larger than or equal to p. E is an N × p matrix capturing the residual error terms. X and E are assumed to be independent of each other.
For convenience of calculation, model (1) can be split into p structural equations, each representing the regulatory effects of all endogenous and exogenous variables on one gene. The ith equation can be expressed as

y_i = Y_{-i} b_i + X f_i + e_i, (2)

where the N × 1 vector y_i contains the expression levels of the ith gene, Y_{-i} denotes Y with its ith column removed, and b_i is the ith column of B excluding the ith entry, whose value is already known to be zero. The q × 1 vector f_i denotes the ith column of F. e_i is an N × 1 error vector whose elements are assumed to be independent and identically distributed as N(0, σ²), i.e., a normal distribution with zero mean and variance σ².
Without any other restrictions, the task of inferring a GRN from such a model is to estimate the p(p−1) unknown entries in B and, in passing, the pq unknown entries in F. Since GRNs, and biochemical networks more generally, are typically sparse [22], [30], the adjacency matrix B is sparse. We assume that the loci of the cis-eQTLs have been obtained by applying existing eQTL mapping methods, whereas the effect sizes are unknown. As elucidated in [44], most eQTLs have weak effects on gene expression levels, so the matrix F is usually sparse, with a small number of nonzero entries whose locations have been determined. Although both B and F are sparse, for relatively large p the inference task is still challenging.

B. MODEL REPARAMETERIZATION
According to the rules of matrix multiplication, the SEM in (1) can be rewritten as

Y = Wβ + E, (3)

where W = [Y, X] is an N × (p + q) design matrix and β = [B^T, F^T]^T is the parameter matrix composed of (p + q) × p parameters. Thus the original SEM is reparameterized as a multivariate linear model. In this linear model, the response matrix Y is the same as that of the original SEM, while the predictor matrix W is an N × (p + q) matrix including both the exogenous and endogenous variables. Therefore, our main concern becomes the (p + q) × p parameters in the matrix β.
Model (3) can also easily be split into p univariate linear models,

y_i = W β_i + e_i, i = 1, · · · , p, (4)

where β_i is the ith column of β, namely a (p + q) × 1 vector. As mentioned above, B and F are both sparse. Moreover, the diagonal elements of B are zeros, and the locations of the nonzero elements in F are known in advance; that is, the row indices of the unknown parameters to be estimated are known before the inference. We adopt a vector s_i to denote the row indices of the parameters to be estimated in β_i. By selecting the columns of the design matrix W identified by s_i, model (4) can be simplified into

y_i = W_{s_i} β_{s_i} + e_i, (5)

where W_{s_i} is the reduced form of W that only contains the columns indexed by s_i, and β_{s_i} is the reduced parameter vector that only includes the unknown rows. The dimension of β_{s_i} may differ across i because our BaNEG algorithm allows a different number of cis-eQTLs for each gene; we use p_i to denote the dimension of β_{s_i}. As such, our task is transformed into the p linear models in (5), in which the coefficient vectors are known to be sparse.
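To make the reparameterization concrete, the following Python sketch builds the design matrix W = [Y, X] and extracts the reduced design W_{s_i} of model (5) for one gene. This is an illustrative implementation of our own; the function name, variable names, and the assumption that gene i's cis-eQTL indices are supplied as a list are not from the paper.

```python
import numpy as np

def reduced_design(Y, X, i, eqtl_idx_i):
    """Build W = [Y, X] as in (3) and select the columns indexed by s_i
    for gene i, giving the reduced model (5).  A minimal sketch:
    eqtl_idx_i (the indices of gene i's cis-eQTLs among the q eQTL
    columns) is assumed known from a prior eQTL mapping step."""
    N, p = Y.shape
    W = np.hstack([Y, X])                      # N x (p + q) design matrix
    # s_i: every gene except gene i itself (B_ii = 0), plus gene i's cis-eQTLs
    gene_cols = [j for j in range(p) if j != i]
    eqtl_cols = [p + k for k in eqtl_idx_i]    # eQTL columns follow the gene columns
    s_i = gene_cols + eqtl_cols
    return W[:, s_i], s_i                      # y_i = W_si @ beta_si + e_i
```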
C. BAYESIAN LASSO

Since e_i is assumed to follow a normal distribution with zero mean and variance σ², the likelihood of model (5) can be expressed as

p(y_i | W_{s_i}, β_{s_i}, σ²) = (2πσ²)^{−N/2} exp{ −(y_i − W_{s_i} β_{s_i})^T (y_i − W_{s_i} β_{s_i}) / (2σ²) }. (6)

According to Bayes' theorem, the joint posterior distribution can be obtained via

p(β_{s_i}, σ² | y_i) ∝ p(y_i | W_{s_i}, β_{s_i}, σ²) p(β_{s_i} | σ²) p(σ²). (7)

D. INFER GRNS VIA BAYESIAN ADAPTIVE LASSO WITH NEG PRIOR
One approach to selecting λ² is to give λ² a Gamma hyperprior [51], which motivates the following three-level NEG prior [55]:

β_{s_i} | σ², τ²_1, · · · , τ²_{p_i} ∼ N(0, σ² D_τ),
τ²_j | ψ_j ∼ Exp(ψ_j / 2), j = 1, · · · , p_i,
ψ_j ∼ Gamma(a_0, b_0), (9)

where D_τ = diag(τ²_1, · · · , τ²_{p_i}) and ψ_j is the penalty parameter corresponding to λ² in the Bayesian lasso. Note that in the Bayesian lasso, the same penalty parameter λ² is applied to all coefficients in each iteration. However, we generally prefer to put larger penalty parameters on the coefficients of less important variables, which increases the degree of shrinkage of such coefficients. We therefore adopt the Bayesian adaptive lasso and put a separate penalty parameter ψ_j on each coefficient β_{s_i,j}. In the third level of the NEG prior in (9), a conjugate Gamma prior with shape parameter a_0 and inverse scale parameter b_0 is assigned to ψ_j, which alleviates the strong dependence of the shrinkage degree on ψ_j. In fact, we do not need to pay much attention to the choice of the hyperparameters ν_0, η_0, a_0, b_0, because the deeper a parameter sits in a hierarchical Bayesian model, the less effect it has on the inference [56], so appropriate values can be pre-specified for these hyperparameters.
By assigning σ² an Inverse-Gamma prior with shape parameter ν_0/2 and inverse scale parameter η_0/2, the full conditional posterior distributions of all the parameters (β_{s_i}, σ², 1/τ²_j, ψ_j), j = 1, · · · , p_i, can be given (following [51], [52]) by

β_{s_i} | · ∼ N(A^{−1} W_{s_i}^T y_i, σ² A^{−1}),
σ² | · ∼ Inv-Gamma( (N + p_i + ν_0)/2, (‖y_i − W_{s_i} β_{s_i}‖² + β_{s_i}^T D_τ^{−1} β_{s_i} + η_0)/2 ),
1/τ²_j | · ∼ Inverse-Gaussian( √(ψ_j σ² / β²_{s_i,j}), ψ_j ),
ψ_j | · ∼ Gamma( a_0 + 1, b_0 + τ²_j / 2 ),

where A = W_{s_i}^T W_{s_i} + D_τ^{−1}. It has been proved by Park and Casella [51] and Leng et al. [52] in their appendices that, under NEG-type priors as in (9), unimodality of the posterior distribution is guaranteed.
Then we can iteratively draw samples of all unknown parameters from the above full conditional posterior distributions using a Gibbs sampler to estimate the unknown parameters. The convergence of the sampling process for such hierarchical models has been investigated and shown to be rapid [57]. To initialize the parameters, we preset ν_0 = N/5 and η_0 = 1. The hyperparameters a_0, b_0 are pre-specified as small values (e.g., a_0 = 0.1, b_0 = 0.5) to make the prior for ψ_j essentially noninformative. ψ_j, τ²_j and σ² are initialized by sampling from their respective prior distributions, and the starting value of β_{s_i} is set to a vector of ones.
With each parameter initialized, the Gibbs sampler is run iteratively: all parameters (β_{s_i}, σ², 1/τ²_j, ψ_j) are updated in turn by drawing samples from their conditional posterior distributions given the current values of the other parameters, until sufficient effective samples of β_{s_i} are obtained. We introduce the potential scale reduction factor R [58] to monitor the convergence of our Gibbs sampler. Two parallel chains with the starting points set above are run, and we calculate R for all p_i entries of β_{s_i} to compare between- and within-sequence variances. Following [58], the convergence condition is set as R < 1.1; once R meets this condition for all p_i entries of β_{s_i}, the simulation is deemed to have converged, and the subsequent iterations generate effective samples.

Two stopping conditions are also used in our simulations to avoid an excessive number of iterations. First, a maximum number of iterations is set (50 in the following). Second, after each effective iteration we compare the parameters of interest β_{s_i} with those of the last iteration; once the sum of squared differences is small enough, we consider the iteration to have reached a stable state and stop. Specifically, while k is smaller than the preset maximum number of iterations, in the (k + 1)th iteration we test whether Σ_j (β^{(k+1)}_{s_i,j} − β^{(k)}_{s_i,j})² is smaller than a pre-specified small threshold such as 0.01; if so, we deem that enough iterations have been simulated, otherwise the sampling continues. We then discard half of the convergent posterior samples by extracting only every 2nd draw from each sequence, to avoid dependence between adjacent iterations. The simulations in the next section show that with this sampling strategy the Gibbs sampler converges quickly.

Unlike the non-Bayesian lasso, penalized analysis in a Bayesian framework does not shrink insignificant coefficients exactly to zero; it only shrinks the coefficients of insignificant variables to very small values. The simplest way to perform variable selection is to apply a threshold thr: if an estimated coefficient β_{s_i,j} exceeds the pre-specified threshold thr in absolute value, it is kept unchanged; otherwise, it is set to zero. The smaller thr is, the better the performance in terms of PD and the worse the FDR; conversely, a larger thr produces a better FDR but worse PD.
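To make the sampling scheme concrete, here is a minimal Python sketch of such a Gibbs sampler for a single gene. It is our own illustrative implementation, not the authors' code: the conditional draws assume the standard Bayesian adaptive lasso forms of Park and Casella [51] and Leng et al. [52] with the Gamma hyperprior on each ψ_j as in (9), and the R-based convergence monitoring, stopping rules and thinning described above are omitted for brevity.

```python
import numpy as np

def gibbs_adaptive_lasso_neg(y, W, n_iter=2000, nu0=None, eta0=1.0,
                             a0=0.1, b0=0.5, seed=0):
    """Gibbs sampler for the Bayesian adaptive lasso with the three-level
    NEG prior in (9); a sketch under the stated assumptions."""
    rng = np.random.default_rng(seed)
    N, p_i = W.shape
    nu0 = N / 5 if nu0 is None else nu0
    beta = np.ones(p_i)                        # start beta at a vector of ones
    psi = rng.gamma(a0, 1.0 / b0, size=p_i)    # psi_j ~ Gamma(a0, rate=b0)
    tau2 = rng.exponential(2.0 / psi)          # tau_j^2 ~ Exp(rate=psi_j/2)
    sigma2 = eta0 / 2 / rng.gamma(nu0 / 2)     # sigma^2 ~ Inv-Gamma(nu0/2, eta0/2)
    WtW, Wty = W.T @ W, W.T @ y
    draws = []
    for _ in range(n_iter):
        # beta | rest ~ N(A^{-1} W'y, sigma2 * A^{-1}), A = W'W + D_tau^{-1}
        A_inv = np.linalg.inv(WtW + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Wty, sigma2 * A_inv)
        # sigma2 | rest ~ Inv-Gamma((N+p_i+nu0)/2, (RSS + beta'D^{-1}beta + eta0)/2)
        resid = y - W @ beta
        scale = (resid @ resid + beta @ (beta / tau2) + eta0) / 2
        sigma2 = scale / rng.gamma((N + p_i + nu0) / 2)
        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(psi_j sigma2 / beta_j^2), psi_j)
        mu = np.sqrt(psi * sigma2 / (beta ** 2 + 1e-12))   # guard tiny beta_j
        tau2 = 1.0 / rng.wald(mu, psi)
        # psi_j | rest ~ Gamma(a0 + 1, rate = b0 + tau_j^2 / 2)
        psi = rng.gamma(a0 + 1, 1.0 / (b0 + tau2 / 2), size=p_i)
        draws.append(beta.copy())
    return np.asarray(draws)                   # posterior samples of beta_si
```

Thresholding the posterior mean of the returned draws at thr then yields the selected edges, as described above.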

III. RESULTS

A. SIMULATION STUDIES
The performance of a GRN inference algorithm is often measured by PD and FDR. PD is equivalent to the true positive rate and measures the proportion of true edges that are correctly identified among all true edges. FDR measures the proportion of false positive edges among all detected edges. A positive or negative edge indicates that an edge exists or does not exist, respectively. Let N_tp denote the number of true positive edges detected by an inference algorithm, N_fp the number of false positive edges, N_tn the number of true negative edges, and N_fn the number of false negative edges. Then PD = N_tp/(N_tp + N_fn) and FDR = N_fp/(N_fp + N_tp). In the following simulations, we mainly evaluate the performance of the GRN inference algorithms by comparing their PD and FDR. The goal is higher PD and lower FDR; the ideal situation is PD = 1 and FDR = 0. In addition, several other measures are also considered to further compare the algorithms, including the Matthews correlation coefficient (MCC) and the G-mean. MCC is a correlation coefficient (ranging from −1 to 1) measuring the correlation between a detected GRN and the corresponding true GRN, and the G-mean (ranging from 0 to 1) is usually used to evaluate performance on imbalanced data. The larger the values of these two measures, the better the performance of an algorithm (the best case being MCC = 1 and G-mean = 1).
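These measures can be computed from an estimated adjacency matrix as in the short Python helper below. This is our own illustrative code; the name grn_metrics and the convention of treating off-diagonal nonzeros as directed edges are our assumptions.

```python
import numpy as np

def grn_metrics(B_true, B_est):
    """PD, FDR, MCC and G-mean for an estimated adjacency matrix."""
    mask = ~np.eye(B_true.shape[0], dtype=bool)      # self-loops are excluded
    t = B_true[mask] != 0                            # true edge indicators
    d = B_est[mask] != 0                             # detected edge indicators
    tp = np.sum(t & d);  fp = np.sum(~t & d)
    fn = np.sum(t & ~d); tn = np.sum(~t & ~d)
    pd_ = tp / (tp + fn)                             # power of detection (TPR)
    fdr = fp / (fp + tp) if (fp + tp) > 0 else 0.0
    denom = np.sqrt(float(tp + fp) * float(tp + fn)
                    * float(tn + fp) * float(tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    gmean = np.sqrt(pd_ * tn / (tn + fp))            # sqrt(TPR * TNR)
    return pd_, fdr, mcc, gmean
```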
To benchmark our BaNEG algorithm, we compare it with other algorithms that infer GRNs modeled with SEMs while incorporating genetic perturbations into the models. A series of algorithms has been proposed to solve this problem, such as the PC-algorithm [59], the QDG algorithm [34], the QTLnet algorithm [35], the NEO algorithm [60], the AL-based algorithm [39], the SML algorithm [40], the LRBI algorithm [42] and the 2SPLS algorithm [43]. Logsdon et al. compared the performance of their AL-based algorithm with that of the NEO, QDG, QTLnet and PC algorithms, and the simulation results in [39] demonstrate that AL generally outperforms the other four. Later, Cai et al. compared their SML algorithm with the QDG and AL-based algorithms in [40], showing that SML performs significantly better than both. Combined with the further comparative simulations in [42], [43], we can conclude that SML and 2SPLS outperform all the other algorithms listed above in terms of FDR, while LRBI and 2SPLS perform best in terms of PD. Therefore, in this section we run simulations on synthetic data to compare the performance of BaNEG with LRBI, SML and 2SPLS to demonstrate the superiority of BaNEG.
The GRNs and corresponding synthetic data used in this section are simulated in the way recommended in [39], [40], [42]. The structure of a simulated GRN is represented by a p × p adjacency matrix B: if there is an edge from node i to node j, the regulatory effect B_{ij} is randomly generated from a uniform distribution over (−1, −0.5) ∪ (0.5, 1); all other elements, including the diagonal, are set to 0. For convenience and without loss of generality, we take F to be a p × p identity matrix, i.e., each gene has one and only one cis-eQTL, so the genotype matrix X is an N × p matrix. The genotype of each eQTL, X_{ij}, is randomly generated following the setting of an F2 cross, i.e., it takes the values 1, 2, 3 with probabilities 0.25, 0.5, 0.25, respectively. Each error term E_{ij} is simulated from a Gaussian distribution with zero mean and variance σ², and the intercept term is set to zero. Then Y can be obtained from (1) as Y = (XF + E)(I − B)^{−1} (a code sketch of this generation procedure is given below).

First, we conduct simulation studies to compare the performance of our BaNEG algorithm with LRBI [42], SML [40] and 2SPLS [43]. DAGs and DCGs composed of 30 and 50 genes are simulated with σ² = 0.01. Each node is simulated to have N_e = 3 regulatory edges on average. The sample size N ranges from 30 to 500. The variable selection threshold thr for BaNEG and LRBI is set to 0.1. For each type of GRN with 30 genes we simulate 100 replicates, and for the larger GRNs with 50 genes, 50 replicates. The performance of each algorithm is obtained by averaging the PD and FDR over all replicates of the same setup. The results for DAGs with 30 and 50 genes are presented in Fig. 1 and Fig. 2, and the results for DCGs in Fig. 3 and Fig. 4.

From Fig. 1 (a) (c) we can see that, for DAGs, the PD of BaNEG and LRBI is close or equal to 1 for sample sizes from 50 to 500, with BaNEG having slightly better PD than LRBI at N = 30 and 50, whereas the PD of SML and 2SPLS is clearly lower. As shown in Fig. 1 (b) (d), for DAGs the FDR of BaNEG stays at 0 for sample sizes from 150 to 500 and is lower than that of the other three algorithms for almost all sample sizes. The only exception is N = 30, where the FDR of BaNEG is larger than that of 2SPLS, but the PD of 2SPLS is at the same time much lower than that of BaNEG. As for the MCC and G-mean shown in Fig. 2, BaNEG visibly performs better than all three of the other algorithms. Fig. 3 and Fig. 4 present the PD, FDR, MCC and G-mean of the four algorithms for DCGs; the performance of BaNEG and LRBI is similar to that for DAGs, whereas SML and 2SPLS perform noticeably worse than for DAGs, meaning that the superiority of BaNEG over SML and 2SPLS is even more remarkable.
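For concreteness, the data-generation procedure described at the start of this comparison can be sketched in Python as follows. This is an illustrative implementation of our own: the exact edge-placement scheme of [39], [40], [42] may differ, and here each node draws about N_e regulators from the nodes preceding it, so the resulting B is acyclic (a DAG).

```python
import numpy as np

def simulate_sem_dag(p=30, N=500, n_e=3, sigma2=0.01, seed=0):
    """Generate (Y, X, B) under the SEM Y = YB + XF + E with F = I_p."""
    rng = np.random.default_rng(seed)
    B = np.zeros((p, p))
    for j in range(1, p):
        k = min(j, rng.poisson(n_e))           # about n_e regulators per node
        if k == 0:
            continue
        parents = rng.choice(j, size=k, replace=False)
        # regulatory effects uniform on (-1, -0.5) U (0.5, 1)
        B[parents, j] = rng.choice([-1, 1], size=k) * rng.uniform(0.5, 1.0, size=k)
    F = np.eye(p)                                    # one cis-eQTL per gene
    X = rng.choice([1, 2, 3], size=(N, p), p=[0.25, 0.5, 0.25])  # F2-cross genotypes
    E = rng.normal(0.0, np.sqrt(sigma2), size=(N, p))
    Y = (X @ F + E) @ np.linalg.inv(np.eye(p) - B)   # solve Y = YB + XF + E for Y
    return Y, X, B
```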
Second, we evaluate and compare the performance of the above four algorithms on larger sparse DAGs with 300 genes, here setting N_e = 1. The noise variance σ² is still set to 0.01, the sample sizes run from 100 to 1000, and 10 replicates are simulated for each sample size. The performance of BaNEG, LRBI, SML and 2SPLS is shown in Fig. 5. From Fig. 5 (a) (b), the PD of BaNEG is equal to 1 except when the sample size is N = 100, outperforming LRBI, SML and 2SPLS for almost all sample sizes; the FDR of BaNEG is exactly 0 for sample sizes from 400 to 1000, again better than all three other algorithms. As shown in Fig. 5 (c) (d), the MCC and G-mean of BaNEG are equal to 1 except in very few cases, whereas the other three algorithms all perform worse. Note that LRBI shows rather poor performance for small sample sizes (N < 400).
Next, we run simulations on synthetic data sets with p = 50, N_e = 3 and σ² = 0.01 to study the impact of the variable selection threshold thr on BaNEG. DAGs and DCGs are simulated with thr ranging over {0.08, 0.1, 0.2} and sample sizes from 30 to 500. Besides BaNEG, which is based on the Bayesian adaptive lasso with the NEG prior, the other Bayesian method, LRBI, is also applied to the simulated data sets. The results are presented in Fig. 6. From Fig. 6 (a) (c), the PD of the two algorithms is equal to 1 for sample sizes from 100 to 500 at all values of thr; nevertheless, at N = 30 or 50 the PD of BaNEG slightly exceeds that of LRBI for the same thr. The difference in FDR, shown in Fig. 6 (b) (d), is more distinct: overall, the FDR of BaNEG is better than that of LRBI, especially for small sample sizes, and for each algorithm the FDR with a larger thr is better than that with a smaller thr.
Moreover, we run simulations on DAGs and DCGs with 50 genes to analyze the performance of BaNEG for different N_e and σ². We keep thr = 0.1 for all simulations. DAGs and DCGs are simulated with N_e equal to 2 or 5 and σ² ranging over {0.01, 0.05, 0.1}. The results for sample sizes from 50 to 1000 are depicted in Fig. 7. As shown in Fig. 7, when we keep N_e constant and increase σ² from 0.01 to 0.05 and 0.1, the PD for both DAGs and DCGs is hardly affected, decreasing only slightly at sample size N = 50, whereas the FDR is evidently impacted negatively. When we keep σ² constant and increase N_e from 2 to 5, both PD and FDR become slightly worse.
Finally, to compare the computational expense of the four algorithms, we record the running time of each when inferring DAGs and DCGs with 30, 50, 100 and 300 genes from the same data sets. All algorithms were run on a laptop with an Intel(R) Core(TM) i7-6700HQ CPU at 2.60 GHz and 16 GB of RAM. Table 1 reports the running times of all algorithms on the different GRNs at sample size N = 500. As shown in Table 1, while maintaining its performance advantage over the other three algorithms, BaNEG is much faster than SML and 2SPLS. It is, however, visibly slower than LRBI because of its more complicated hierarchical posterior model and convergence monitoring strategy. Given the previous comparisons across the various performance measures, this moderate sacrifice in computational cost buys an obvious advantage in network accuracy.

B. REAL DATA ANALYSIS
In this section, we apply the BaNEG algorithm to a real data set composed of 47 yeast genes from 112 segregants of Saccharomyces cerevisiae to explore the corresponding underlying GRN. To further verify the simulation results of the last section, we also run the other three algorithms, LRBI, SML and 2SPLS, on this real data set.
Brem and Kruglyak [44] measured the expression levels of 5727 genes and the genotypes of 2957 genetic markers in 112 segregants of a cross between a lab strain (BY4716) and a wild strain (RM11-1a) of Saccharomyces cerevisiae. The sample size is very small compared with the number of genes. Chen and Ren [43] filtered this data set, yielding 112 observations of 722 genes and 732 genotypes of cis-eQTLs. The 2SPLS algorithm was applied to infer this large GRN and generated a network with 323 edges, which formed a few subnetworks; the three largest subnetworks consist of 47 genes and 60 regulatory edges in total. We apply our proposed BaNEG algorithm to the yeast data set comprising the gene expression levels and cis-eQTL genotypes of these 47 genes from the 112 samples to infer the underlying yeast GRN. In this data set, the gene expression matrix corresponds to the matrix Y in model (1) and the cis-eQTL data to the matrix X, so the BaNEG algorithm can be applied directly to infer the adjacency matrix B and thereby construct the underlying GRN of the 47 yeast genes.
As shown in Fig. 6, when we increase thr from 0.1 to 0.2, the FDR of BaNEG improves greatly while the PD becomes only slightly lower. Therefore, for the real-data analysis on the yeast data set of 47 genes with 112 observations, we set thr = 0.2 to lower the FDR as far as possible while maintaining good PD. With this setting, BaNEG detected 324 regulatory edges. To improve the reliability of the inferred GRN, 100 bootstrap data sets of 112 samples each were generated by randomly resampling the original 112 observations with replacement. The BaNEG algorithm was then run on each bootstrap data set, and a frequency matrix was constructed by counting the frequency of each edge across the 100 inferred networks. By eliminating the edges whose frequency was less than 80, we obtained a GRN of 45 genes and 90 edges, including 69 positive and 21 negative effects. The inferred GRN is shown in Fig. 8; each node stands for a gene, and a directed edge from node A to node B indicates that the source gene A has a regulatory effect on the target gene B, with a positive entry in the adjacency matrix denoting an activating effect (depicted as →) and a negative entry an inhibiting effect. Most genes were detected with at least one regulating or regulated edge; only 2 genes, YJL140W and YPL031C, were excluded from the inferred GRN.

According to the simulation results of the last section, the BaNEG algorithm has PD similar to LRBI, larger PD than SML and 2SPLS, and lower FDR than all three of the other algorithms. We therefore expect that, on the same real data set, BaNEG can detect more true edges than SML and 2SPLS and may detect fewer false edges than LRBI. LRBI, SML and 2SPLS were accordingly also run with 100 bootstraps, and only edges with frequencies of at least 80 were retained in the inferred GRNs. As a result, the LRBI algorithm detected 46 genes and 155 edges; the only eliminated gene was YPL031C, which was also excluded from the GRN inferred by BaNEG, and almost all edges detected by BaNEG are included in the GRN identified by LRBI (only 5 are not). The SML algorithm found only 30 genes and 37 edges, of which 27 are also in the BaNEG GRN. The GRN inferred by the 2SPLS algorithm includes 42 genes and 58 edges, 26 of which are in the BaNEG GRN. Comparing the results of the four algorithms on the yeast data set, we find that BaNEG detected fewer edges than LRBI and more edges than SML and 2SPLS, owing to its lower FDR and higher PD, which is in accordance with our expectation from the simulation studies.
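The bootstrap stabilization step used above can be sketched as follows. This is our own illustrative code: infer_fn stands for any inference routine (e.g. a wrapper around BaNEG) that returns an estimated adjacency matrix for a (Y, X) pair, and the function and parameter names are assumptions.

```python
import numpy as np

def bootstrap_stable_edges(Y, X, infer_fn, n_boot=100, min_freq=80, seed=0):
    """Rerun inference on bootstrap resamples and keep only the edges
    detected in at least min_freq of the n_boot inferred networks."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    freq = np.zeros((p, p), dtype=int)
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)       # resample rows with replacement
        freq += infer_fn(Y[idx], X[idx]) != 0  # accumulate edge detections
    return freq >= min_freq                    # boolean stable-edge adjacency
```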

IV. DISCUSSION
With the development of high-throughput sequencing technologies, considerable effort has been made to infer GRNs from gene expression data [17], [20], [23], [25], [30], though most of these traditional methods deal with the gene expression data alone. In recent years, another family of approaches has been developed that integrates genetic perturbations with gene expression data to infer GRNs jointly; by exploiting this additional genetic information, the accuracy of inference can be improved. SEMs provide a systematic framework that directly integrates both types of gene data and offers the flexibility to model both DAGs and DCGs via adjacency matrices [40]. Motivated by this, in this paper we develop a more efficient novel approach based on Bayesian learning, named BaNEG, to infer GRNs modeled with SEMs.
In BaNEG, we combine the three-level NEG-type prior with the Bayesian lasso to form an adaptive hierarchical posterior model for sparse linear models, and then apply it to SEMs to infer GRNs from both gene expression data and genetic perturbations for the first time. This is realized in two stages: first, the original SEM is reparameterized as a linear-type model by merging the endogenous and exogenous variables; then we use the Bayesian adaptive lasso with the NEG prior and Gibbs sampling to infer the reparameterized linear-type model. The proposed algorithm has the following main advantages. First, the reparameterization stage transforms SEMs into linear models; this linear model exploits the whole structure of the SEM by integrating the gene expression levels and cis-eQTL loci into one design matrix, which makes it possible to infer the parameters jointly. Second, the simplified form in model (5) reduces the dimension of the reparameterized models, which significantly improves the inference efficiency and makes the method more applicable to large GRNs (such as whole-genome GRNs). Finally, BaNEG adopts the NEG-type prior to achieve sparsity of GRNs. The other Bayesian algorithm, LRBI, chose an NG-type prior, which may have an infinite spike at zero and be flat for large coefficient values, and consequently fails to penalize such large values; with the NG-type prior, the spike at zero has strong consequences for the behavior of the posterior, not all of which are welcome. The NEG distribution, in contrast, includes the Laplace prior as a limiting case and has the advantage of a finite limit at zero for all parameter values in its range [55]. The simulations confirm that BaNEG maintains the high PD of LRBI (and is even slightly better), while achieving a much lower FDR.
For the real data set of 47 yeast genes from 112 samples, BaNEG discovered more potential edges than SML and 2SPLS but fewer edges than LRBI. Combined with the simulation studies on synthetic data, there is good reason to believe that BaNEG discovered fewer false edges than LRBI and more true edges than SML and 2SPLS. Hence, we believe that the GRN constructed by BaNEG is more reliable than those of the other algorithms and is of great reference value for GRN inference, which is meaningful for discovering gene functions and gene-gene interactions.

V. CONCLUSION
The inference of GRNs is of significant and profound importance for better understanding inherent biological mechanisms and for precision medicine. In this study, we develop and present a method to infer the topological structures of GRNs from both gene expression data and cis-eQTL data. Systematic simulation studies demonstrate that the BaNEG algorithm outperforms three state-of-the-art algorithms (LRBI, SML and 2SPLS). The results inferred from a real data set support the simulation results and can therefore be considered reasonable and meaningful in a biological sense. In conclusion, the BaNEG algorithm is an effective and efficient approach for inferring underlying GRNs from gene expression data and genetic perturbations, and would be a useful tool for medical treatment and genetic research in practice.
YUNGANG ZHU received the Ph.D. degree in computer science from Jilin University, China, in 2012. He was a Visiting Research Fellow or a Postdoctoral Fellow with the Vienna University of Technology, Austria, the Dresden University of Technology, Germany, and the University of Trento, Italy. He is currently an Assistant Professor with the College of Computer Science and Technology, Jilin University. He has authored or coauthored over ten articles on international journals or conferences. His current research interests include probabilistic graphical models, information fusion, statistical machine learning, and data mining, with applications to knowledge engineering.