Policy Iteration Approach to the Infinite Horizon Average Optimal Control of Probabilistic Boolean Networks

Abstract—This article studies the optimal control of probabilistic Boolean control networks (PBCNs) with the infinite horizon average cost criterion. By resorting to the semitensor product (STP) of matrices, a nested optimality equation for the optimal control problem of PBCNs is proposed. A novel policy iteration-type algorithm is then derived via the Laurent series expansion technique and the Jordan decomposition method, in which finitely many iteration steps provide the optimal state feedback law. Finally, the intervention problem of the probabilistic Ara operon in E. coli is solved as a biological application to demonstrate the effectiveness and feasibility of the proposed theoretical approach and algorithms.

Fundamental concepts and properties, such as stability, stabilization, controllability, observability, synchronization, sampled-data control, and pinning control of Boolean or multivalued logical networks, have been thoroughly discussed and exploited in recent years [11]-[18].
To deal with the randomness of gene regulatory networks [19], probabilistic BNs (PBNs) were proposed [20]. As discussed in [20]-[22], the PBN model can well capture the key qualitative features of a gene regulatory network with inherent biological uncertainty, where, at each time step, the updating rule of each gene is randomly selected from among several possible regulatory rules.
The dynamical model inference and network identification for PBNs have great practical significance in bioinformatics, since the inference and reconstruction of gene regulatory networks are key issues in genomic signal processing [23], [24]. An effective method for calculating the Boolean functions, and the corresponding selection probabilities of a Boolean function in a PBN, based on the network structure and steady-state probabilities, was designed in [25]. Recently, a tractable learning algorithm for the identification of large-scale PBNs was introduced in [26], in the framework of the stochastic conjunctive normal form, an equivalent representation of the PBN. In recent years, the optimal control and optimization problems for Boolean control networks (BCNs) have received considerable attention [27]-[29]. A special finite horizon Mayer-type optimization problem for BCNs was discussed by Laschov and Margaliot [30]. In addition, they also investigated the minimum-time control of BCNs [31]. Finite horizon optimal control problems for PBNs and stochastic logical networks were investigated in [28] and [32], respectively. An integer programming algorithm [29] and a polynomial optimization algorithm [33] for the finite horizon optimal control problem of a PBN were developed by Kobayashi and Hiraishi to reduce the computational complexity.
In general, analyzing the long (infinite) horizon criterion and designing the corresponding optimal controller are more challenging issues compared with the finite horizon problem. The basic criteria for the infinite horizon problem of BCNs or PBCNs are twofold: the discounted and average cost criteria. Pal et al. [34] first investigated the infinite horizon discounted cost problem for PBNs, and the obtained theoretical results were successfully applied to the intervention problem of a melanoma gene-expression network. In addition, an improved dynamic programming method, which provides a finite time convergence algorithm for the discounted infinite optimization problem of PBCNs, was developed in [35]. Furthermore, a new policy iteration-type algorithm, which solves the discounted infinite horizon problems of PBCNs, was derived in [36], and it has been successfully utilized to design a feedback law for residual gas fraction control in an internal combustion engine [37].
Applying topological properties of trajectories and graph theory, the infinite horizon problem for deterministic BCNs was first addressed by Zhao et al. [38]. Using a recursive algorithm, Fornasini and Valcher [39] solved the average infinite horizon optimization problem for deterministic BCNs as the limit of the corresponding finite horizon optimization problems. A policy iteration approach for the infinite horizon optimal control of deterministic BCNs was also given in [40]. By introducing the optimal input-state transfer graph, Zhu et al. [41] successfully reduced both the computational complexity and the space complexity of finding optimal controllers for deterministic BCNs.
Pal et al. [34] first investigated the optimal control problem for PBCNs, and a policy iteration algorithm was deduced under the assumption that the PBCN is ergodic (or recurrent), which requires that the transition matrix of the PBCN under every stationary policy consists of a single recurrent class. As mentioned in [34], the context-sensitive PBN satisfies the ergodic assumption, and accordingly, the policy iteration algorithm given in [34] works effectively for context-sensitive PBNs. However, it was found that the optimality criteria given by [34, Th. 5] and the corresponding policy iteration algorithm are no longer applicable to the general PBCN (please see Example 13). As far as the authors know, there are still no reported works solving the general nonergodic case, which is the main motivation of this work. The results of this work can be regarded as a generalization to probabilistic BCNs (PBCNs) of the results that we obtained in [40] for deterministic BCNs. This generalization is nontrivial because, when designing the optimal controller for PBCNs, one must carefully address the conditional transition probabilities resulting from the selection of the update Boolean functions, especially in the nonergodic case.
The main contributions of this work can be briefly summed up in the following aspects. 1) By applying the STP technique, a nested-type optimality criterion (see Theorem 10) for the infinite horizon problem of PBCNs with average cost is derived. Compared with the optimality criterion given in [34], which requires the ergodic assumption, the nested optimality criterion proposed in this work can be applied to an arbitrary PBCN without any ergodicity requirement (please see Example 13). 2) By resorting to the Jordan decomposition technique (see Proposition 14) and the Laurent-type series expansion (see Lemma 17), a policy iteration algorithm (Algorithm 1) is deduced. Compared with the value iteration algorithm given in [39], the theoretical analysis of the proposed iteration algorithm guarantees that finitely many iteration steps provide an optimal stationary policy (see Remark 20). This article is organized as follows: An equivalent matrix expression of the model considered in this work is presented in Section III, after introducing the problem formulation in Section II. The main results of this article are derived in Sections IV and V. Then, the probabilistic intervention problem of the Ara operon in E. coli, as a practical biological application, is solved by the proposed algorithm in Section VI. Finally, brief conclusions are provided in Section VII. For clarity of presentation, some technical proofs are presented in the Appendix.

II. PRELIMINARIES AND PROBLEM FORMULATION
This section summarizes basic concepts of PBNs and the average optimal control problem for them.

A. Definitions and Notations
The notations used in this article are listed in Table I. For ease of statement, the following definitions will be used.
, and briefly defined as
3) The STP [10] of A ∈ R^{m×n} and B ∈ R^{p×q}, denoted by A ⋉ B, is defined as A ⋉ B = (A ⊗ I_{s/n})(B ⊗ I_{s/p}), where s is the least common multiple of n and p, and ⊗ is the Kronecker product. The symbol ⋉ may be omitted without causing confusion.
4) The power-reducing matrix is given by M_pr^n = diag[δ_{2^n}^1, δ_{2^n}^2, ..., δ_{2^n}^{2^n}], i.e., a block diagonal matrix with diagonal blocks δ_{2^n}^1, ..., δ_{2^n}^{2^n}. It yields the identity x ⋉ x = M_pr^n ⋉ x for any x ∈ Δ_{2^n}.
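As a minimal numerical sketch of the two definitions above (written in Python with NumPy; the function names are illustrative only), the STP and the power-reducing matrix can be computed directly from their definitions.

```python
import numpy as np

def stp(A, B):
    """Semitensor product A ⋉ B = (A ⊗ I_{s/n})(B ⊗ I_{s/p}),
    where A is m×n, B is p×q, and s = lcm(n, p)."""
    n, p = A.shape[1], B.shape[0]
    s = int(np.lcm(n, p))
    return np.kron(A, np.eye(s // n)) @ np.kron(B, np.eye(s // p))

def power_reducing_matrix(n):
    """M_pr^n: block diagonal matrix with diagonal blocks δ_{2^n}^1, ..., δ_{2^n}^{2^n},
    so that x ⋉ x = M_pr^n ⋉ x for every canonical vector x ∈ Δ_{2^n}."""
    N = 2 ** n
    M = np.zeros((N * N, N))
    for i in range(N):
        M[i * N + i, i] = 1.0   # block i holds the column vector δ_{2^n}^{i+1}
    return M
```

For canonical vectors, the STP reduces to the Kronecker product, which is exactly why the power-reducing identity above holds.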

B. Boolean Network
A BN, which is a directed network containing binary (Boolean) logical-valued state nodes, can be represented by a set of nodes V = {x_1, x_2, ..., x_n} and a set of Boolean functions F = {f_1, ..., f_n}, each with k specified input nodes, where the value of f_i is assigned to the next state of node x_i. In general, k could vary with i, but, without loss of generality, we assume that each x_i has an update rule function f_i : D^n → D, by allowing unnecessary nodes in the update rule function f_i to be fictitious. For an update rule function f, a logical node is said to be fictitious if the value of f does not depend on it. Hence, the dynamics of the BN described by V and F can be written as x_i(t+1) = f_i(x_1(t), ..., x_n(t)), i = 1, ..., n.

C. Probabilistic Boolean Control Network
A PBCN is a directed network containing Boolean logical-valued state nodes V = {x_1, ..., x_n} and input control nodes U = {u_1, ..., u_m}. In a PBCN, the update rule of state node x_i is regulated by one Boolean function, which is randomly selected from a set of candidate Boolean functions. More precisely, define a collection of Boolean function sets F and the corresponding collection of probability sets ϒ as follows.
r_i^j is the probability that the logical function f_i^j will be chosen as the update law for node x_i. Note that Σ_{j=1}^{c(i)} r_i^j = 1. In summary, the dynamics of a PBCN with n state nodes V = {x_1, ..., x_n} and m input nodes U = {u_1, ..., u_m} can be described as x_i(t+1) = f_i(x_1(t), ..., x_n(t), u_1(t), ..., u_m(t)), i = 1, ..., n,
with the probability of f_i being chosen as f_i^j given by Pr{f_i = f_i^j} = r_i^j.
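As a minimal illustration of this update mechanism (in Python, with illustrative names; the candidate functions for different nodes are assumed to be selected independently at each time step, consistent with the realization probabilities used in Section III), one time step of a PBCN can be sketched as follows.

```python
import random

def pbcn_step(x, u, F, probs):
    """One update step of a PBCN in logical (non-vector) form.
    x     : tuple of n Boolean states
    u     : tuple of m Boolean control inputs
    F     : F[i] is the list of candidate update functions f_i^j(x, u) -> bool
    probs : probs[i][j] = r_i^j = Pr{f_i = f_i^j}, with sum_j r_i^j = 1."""
    x_next = []
    for i, candidates in enumerate(F):
        # randomly pick which Boolean rule updates node i at this time step
        f = random.choices(candidates, weights=probs[i], k=1)[0]
        x_next.append(bool(f(x, u)))
    return tuple(x_next)
```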

D. Average Optimal Control Problem
Now, consider the class of policies, each of which consists of an infinite sequence of control laws π = {μ_t : t ∈ Z_{≥0}}, where a control law μ_t : D^n → D^m, t ∈ Z_{≥0}, maps the logical state x(t) onto the control input u(t) = μ_t(x(t)) in such a way that μ_t(x(t)) ∈ D^m for all x(t) ∈ D^n. For notational brevity, we refer to π_μ = {μ, μ, ...} as the stationary policy μ.
Given an initial state x_0 and a policy π = {μ_0, μ_1, ...}, consider the infinite horizon average expected cost
J_π(x_0) = lim sup_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} g(x(t), u(t)) ]    (5)
where E denotes the expectation and g : D^n × D^m → R is the per-step cost function.
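As a quick, purely illustrative complement to the exact analysis developed below, the average cost of a fixed stationary policy can be estimated by simulation; the function names, the truncation horizon, and the number of runs in the following Python sketch are assumptions made only for this example.

```python
def estimate_average_cost(step, policy, g, x0, horizon=10000, runs=20):
    """Monte Carlo approximation of the average cost criterion
    J_pi(x0) = limsup_{T→∞} (1/T) E[ Σ_{t<T} g(x(t), u(t)) ]
    for the stationary policy u = policy(x); step(x, u) samples the next state."""
    totals = []
    for _ in range(runs):
        x, acc = x0, 0.0
        for _ in range(horizon):
            u = policy(x)
            acc += g(x, u)
            x = step(x, u)      # one random PBCN transition
        totals.append(acc / horizon)
    return sum(totals) / runs   # average over independent runs
```

Such a simulation only approximates the lim sup in (5) by a long finite average; the remainder of the article computes the average cost exactly.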
We consider the set of all policies π, that is, the set of all sequences of functions π = {μ_0, μ_1, ...}. Then, the optimal average cost function J* is given by the infimum of J_π over all policies π. The aim of the infinite horizon average optimal control of PBCNs is to find an optimal policy π* that achieves the optimal cost J*, that is, J_{π*} = J*.
Example 1: Consider a PBCN (2)-(4) with three state nodes V = {x_1, x_2, x_3}, one input node U = {u_1}, and update dynamics for which, accordingly, c(1) = 1 and c(2) = c(3) = 2.
As described in Fig. 1, the update rule of state x_2 with (or without) self-loop influence is given by f_2^2 (or f_2^1), assuming that the probability of having a self-loop influence on x_2 is 0.4, i.e., r_2^1 = Pr{f_2 = f_2^1} = 0.6 and r_2^2 = Pr{f_2 = f_2^2} = 0.4. Similarly, assume that the probability of having a self-loop influence on x_3 is 0.3; hence, r_3^1 = 0.7 and r_3^2 = 0.3. Consider the optimal average control problem (6) for this PBCN, with the following cost function:

III. MATRIX EXPRESSION OF MODEL
As illustrated in Section II, PBCNs are expressed as a combination of logical functions and their stochastic switching. Since logical functions are not directly amenable to algebraic manipulation, this section provides a systematic way to deal with PBCNs by exploiting STP techniques.
Since there are c(i) possible update logical functions f_i^j for each node x_i ∈ V, the number of possible realizations of the PBCN is C_0 = c(1)c(2)···c(n). We give a lexicographic order to the possible realizations of the PBCN by defining a special matrix whose (α, j)th entry specifies the candidate function selected for node x_j in the α-th realization. Then, according to (3) and (4), the probability that network α is selected as a realization of the PBCN is p[α], given by (12). Furthermore, a Boolean variable X ∈ D is identified with a vector x ∈ Δ_2 in the following form: 1 ∼ δ_2^1, 0 ∼ δ_2^2. Under this vector identification, the overall state and input are expressed by an STP of elements as x(t) = x_1(t) ⋉ ··· ⋉ x_n(t) ∈ Δ_N, with N := 2^n, and u(t) = u_1(t) ⋉ ··· ⋉ u_m(t) ∈ Δ_M, with M := 2^m, respectively. Accordingly, the logical state space D^n and the input space D^m can also be rewritten as Δ_N and Δ_M, respectively.
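The vector identification just introduced can be made concrete with a short helper (illustrative name; Python with NumPy): a Boolean state (X_1, ..., X_n) is mapped to the canonical vector x = x_1 ⋉ ··· ⋉ x_n ∈ Δ_{2^n}.

```python
import numpy as np

def state_to_delta(bits):
    """Encode a Boolean state (X_1, ..., X_n), with 1 ~ δ_2^1 and 0 ~ δ_2^2,
    as the canonical vector x = x_1 ⋉ ... ⋉ x_n ∈ Δ_{2^n}."""
    idx = 0
    for b in bits:
        idx = 2 * idx + (0 if b else 1)   # 'true' occupies the first half at each level
    x = np.zeros(2 ** len(bits))
    x[idx] = 1.0                          # x = δ_{2^n}^{idx+1}
    return x

# example: (1, 1, 1) maps to δ_8^1 and (0, 0, 0) maps to δ_8^8
```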
Applying the STP of matrices defined in (1), a logical function can be represented in an algebraic form.
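Before this representation is stated formally in Lemma 2 below, the following sketch shows how such a structure matrix can be computed by exhaustive evaluation; the function name and the column-ordering convention (all-true assignment first, matching 1 ∼ δ_2^1, 0 ∼ δ_2^2) are assumptions of this illustration.

```python
import numpy as np
from itertools import product

def structure_matrix(f, p):
    """Structure matrix M_f ∈ L_{2×2^p} of f : {0,1}^p -> {0,1},
    so that f(X_1,...,X_p) ~ M_f ⋉ x_1 ⋉ ... ⋉ x_p in vector form."""
    M = np.zeros((2, 2 ** p))
    # column k corresponds to the assignment whose STP equals δ_{2^p}^{k+1};
    # the all-true assignment comes first, the all-false assignment last
    for k, bits in enumerate(product((1, 0), repeat=p)):
        M[0 if f(*bits) else 1, k] = 1.0
    return M

# example: conjunction has structure matrix δ_2[1 2 2 2]
M_and = structure_matrix(lambda a, b: a and b, 2)
```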
Lemma 2 ([10, Th. 3.2]): Let f : D^p → D^q be a logical function, and let x_i ∈ Δ_2 be the vector form of X_i ∈ D. Then, there exists a unique logical matrix M_f ∈ L_{2^q × 2^p} such that f can be rewritten in the multilinear form f(X_1, ..., X_p) = M_f ⋉ x_1 ⋉ ··· ⋉ x_p (13). The matrix M_f is called the structure matrix of the logical function f, and (13) is called the algebraic expression of f. Assume that the structure matrix of the logical function f_i^j is M_{f_i^j}; then, based on [44, Th. 2] and Lemma 2, we can obtain, for each α ∈ [1, C_0], the following algebraic expression of the α-th possible realization of the PBCN: x(t+1) = L[α] ⋉ u(t) ⋉ x(t) (14). The p[α]-weighted sum of the realization matrices is called the transition matrix of the PBCN (16), where p[α] and L[α] are given by (12) and (15), respectively. For a given u = u_1 ⋉ ··· ⋉ u_m ∈ Δ_M at time step t, denote by p_ij(u) the transition probability from a logical state δ_N^i to the next logical state δ_N^j, as in (17). The following fact implies that the transition probabilities of a PBCN can be directly calculated from the transition matrix.
Proof: See the Appendix.
Lemma 5: For a given PBCN (2)-(4), the evolution dynamics of the expectation of the state x(t) can be expressed in a linear form driven by the transition matrix of the PBCN (2)-(4) defined in (16).
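Numerically, the transition matrix and the probabilities p_ij(u) can be assembled as in the sketch below. The variable name Theta and the column layout are assumptions of this illustration; they presume the linear form E[x(t+1)] = Theta ⋉ u(t) ⋉ E[x(t)] of Lemma 5, with the transition matrix built as the p[α]-weighted sum of the realization matrices L[α].

```python
import numpy as np

def pbcn_transition_matrix(Ls, ps):
    """Expected transition matrix Theta = Σ_α p[α] L[α] of the PBCN,
    where L[α] ∈ L_{N×NM} is the structure matrix of realization α and
    p[α] its selection probability (cf. (12), (15), (16))."""
    return sum(p * L for p, L in zip(ps, Ls))

def transition_prob(Theta, i, j, k, N):
    """p_ij(u): probability of moving from δ_N^i to δ_N^j under u = δ_M^k,
    read off the column block of Theta selected by the input (1-based indices)."""
    return Theta[j - 1, (k - 1) * N + (i - 1)]
```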
Proof: See the Appendix.
The per-step cost function g : D^n × D^m → R in the formulation (5) can be expressed by a matrix expression, as in (20).¹
¹As considered in [39], an equivalent linear form of the per-step cost function g : Δ_N × Δ_M → R can also be given.
Example 6: Recall Example 1. In this example, the possible realizations are ordered according to Definition (10), their selection probabilities follow from (12), and the structure matrices of the candidate update functions (e.g., M_{f_3^1} and M_{f_3^2}) are obtained from Lemma 2. As a result, according to (15) and (16), we get the transition matrix of this example, given by (23), as shown at the bottom of the page. Furthermore, according to Lemma 4, the transition probabilities of this PBCN can be obtained. The transition probability diagram under the fixed control inputs u = δ_2^1 and u = δ_2^2 is shown in Fig. 2.
In addition, according to (20), the cost matrix of (9) is

IV. AVERAGE OPTIMALITY CRITERION
In this section, an optimality criterion, called the nested condition, for the average optimal control problem is derived without the conventional ergodic assumption.
The set of all feedback logical laws μ : Δ_N → Δ_M is denoted by U, that is, U = {μ | μ : Δ_N → Δ_M}. One easily sees that the cardinality of U is |U| = M^N, by noticing that the state space Δ_N and the control space Δ_M are both finite. In the framework of the vector formulation, a control law can be regarded as a logical function from Δ_N to Δ_M. Therefore, referring to Lemma 2, we first present the following fundamental result.
Proposition 7: For an arbitrary control law μ ∈ U, there exists a unique logical matrix K_μ ∈ L_{M×N} that satisfies μ(x) = K_μ ⋉ x for all x ∈ Δ_N, and K_μ is called the structure feedback matrix of μ. For any given μ ∈ U, we define the matrix Π_μ associated with μ as in (25); it contains all the information about the transition probabilities of the PBCN under the feedback control law μ, where K_μ is the structure matrix of μ and M_pr^n is the power-reducing matrix given in Section II.
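Given the PBCN transition matrix and a structure feedback matrix K_μ, the closed-loop matrix Π_μ of (25) can be assembled numerically. The sketch below presumes the form Π_μ = Theta ⋉ K_μ ⋉ M_pr^n, which is what substituting u(t) = K_μ ⋉ x(t) into the open-loop dynamics and applying the power-reducing identity suggests; Theta is an illustrative name for the transition matrix of (16), and the exact expression should be read from (25).

```python
import numpy as np

def closed_loop_matrix(Theta, K_mu):
    """Pi_mu = Theta ⋉ K_mu ⋉ M_pr^n for the state feedback u(t) = K_mu ⋉ x(t):
    substituting the feedback gives Theta K_mu x x = Theta (K_mu ⊗ I_N) M_pr x."""
    N = K_mu.shape[1]
    M_pr = np.zeros((N * N, N))          # power-reducing matrix, x ⋉ x = M_pr x
    for i in range(N):
        M_pr[i * N + i, i] = 1.0
    return Theta @ np.kron(K_mu, np.eye(N)) @ M_pr
```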
For any given μ ∈ U with a structure matrix K μ , since holds.
Hence, for a given μ ∈ U, the state feedback control turns the PBCN into a closed-loop system whose transition matrix Π_μ is given by (25).
Lemma 8: For any policy π = {μ_0, μ_1, ...}, the vector form of the expected cost, J_π = (J_π(δ_N^1), ..., J_π(δ_N^N)), can be expressed as in (28); in particular, for a stationary policy π_μ = {μ, μ, ...}, it reduces to (29), where the cost vector g_μ associated with the feedback control law μ ∈ U is defined as in (30).
Proof: See the Appendix.
Lemma 9: For a given μ ∈ U, if there exist two vectors (J, h) ∈ R^N × R^N that satisfy equations (31) and (32), then J* ≤ J.
Proof: Left-multiplying (32) by Π_μ and applying equality (31), we get the corresponding relation. Repeating this process by induction, we deduce the analogous equation for any n ∈ Z_{≥0}. Summing these expressions from 0 to n − 1, recalling that Π_μ is a stochastic matrix (so that its powers remain bounded), and applying (29), we obtain that J* ≤ J. The proof is complete.
We are now ready to present an optimality criterion for the average optimal control problem of PBCNs, stated in the following theorem.
Theorem 10: Assume that there exist two vectors (J, h) ∈ R^N × R^N such that, for any i ∈ [1, N], condition (34) holds. Then, the vector J is the optimal cost vector of the average optimal problem (6), that is, J = J*.
Remark 11: In the optimality conditions (34), the control index set of state δ_N^i, which is the control candidate domain of the minimization problem on the left-hand side of (34a), depends on the index set of control inputs that achieve the minimum in (34a) when J is substituted into it. This is the reason that the optimality conditions (34) are said to be nested.
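To make the nested structure concrete, the conditions can be paraphrased in standard multichain average-cost notation as follows; this is an illustrative reconstruction (the symbol Γ_i and the labels "gain" and "bias" equations are introduced here only for exposition), and the precise form is the one given in (34).

```latex
% Gain equation: J is invariant under the best achievable expected one-step
% transition, and Gamma_i collects the inputs attaining the minimum.
\[
  J_i \;=\; \min_{k \in [1,M]} \sum_{j=1}^{N} p_{ij}\!\left(\delta_M^{k}\right) J_j ,
  \qquad
  \Gamma_i \;=\; \operatorname*{arg\,min}_{k \in [1,M]} \sum_{j=1}^{N} p_{ij}\!\left(\delta_M^{k}\right) J_j .
\]
% Bias equation: the minimization is restricted ("nested") to the set Gamma_i
% produced by the gain equation.
\[
  J_i + h_i \;=\; \min_{k \in \Gamma_i}
  \left\{ g\!\left(\delta_N^{i}, \delta_M^{k}\right)
        + \sum_{j=1}^{N} p_{ij}\!\left(\delta_M^{k}\right) h_j \right\} .
\]
```

In the ergodic (unichain) case the gain equation is satisfied by a constant J, so the restriction set contains every input and the pair collapses to the single equation used in [34, Th. 5]; in the general multichain case the restriction is what makes the two equations nested.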
Proof of Theorem 10: The nested condition (34) implies that, for each i ∈ [1, N], the index set given in (34a) and the index set defined below have a nonempty intersection. In other words, the existence of a minimizer of the left-hand side of (34a) is guaranteed by the nested condition, and such an index defines a stationary feedback law. Recalling the definitions of these index sets and of the cost vector g_μ given by (30), we have, for each i ∈ [1, N], a pair of relations that is equivalent to (31) and (32) for this feedback law. Thus, according to Lemma 9, we get J* ≤ J.
Next, we prove that if the vectors (J, h) ∈ R^N × R^N satisfy condition (34), then there is a constant C ≥ 0 such that the vectors J and h̄ = h + CJ satisfy the following condition for each i ∈ [1, N]. We refer to condition (39b) as the modified optimality condition of condition (34b); notice that condition (39a) is the same as condition (34a). Suppose that the vectors (J, h) given in (34) fail to satisfy the modified condition for some state index i_0 and some input index j_0 ∈ [1, M] outside the index set of i_0; then, recalling the definition of that index set given by (34a), we have a strict inequality. Now, set h̄ := h + C_3 J, where C_3 > 0 will be given later. Then, taking C_3 large enough in terms of the constants above, condition (39) is satisfied. Hence, condition (39) is equivalent to the displayed relations, and applying condition (43b) to μ_1 implies the corresponding inequality. Left-multiplying (46) by Π_{μ_0} and using inequality (44), we obtain the next inequality. Then, repeating the above process by induction, we get, for any n ∈ Z_{≥0}, the analogous inequality, where the matrix product Π_{μ_0} ··· Π_{μ_{n−1}} is set to I_N if n = 0. Therefore, summing up these expressions from 1 to n + 1, we obtain, for all i ∈ [1, N], the corresponding bound. Then, due to the arbitrariness of π, we deduce that (47) holds for all i ∈ [1, N]. Finally, combining (34) and (47), we get J = J*, which finishes the proof.
Remark 12: As mentioned in Section I, Pal et al. [34, Th. 5] gave an optimality criterion for the average optimal control of PBNs, and it has been successfully applied in the case of context-sensitive PBCNs. However, it requires the ergodic assumption and cannot be applied in the general case, as illustrated by the following example.
Example 13: Consider the PBCN with two state nodes V = {x_1, x_2}, one input node U = {u_1}, and the following update dynamics. Here, F = {F_1, F_2} and ϒ = {ϒ_1, ϒ_2}, with the candidate functions and selection probabilities given below. Then, according to the analysis given in Section III, one easily obtains the transition matrix (49). The corresponding transition probability diagram is shown in Fig. 3.
We consider the optimal average control problem (6) for this PBCN, where the per-step cost function g is given below. If one tries to solve this optimal control problem based on the results of [34], one needs to solve the optimality criteria given in [34, Th. 5] for all i ∈ [1, 4]. According to the transition matrix in (49) and the cost matrix G in (50), the first and second equations of this system are given by (52). The first equation of (52) indicates that λ = 0, which, upon substitution into the second equation of (52), shows that the system is inconsistent. The above analysis implies that the criterion given in [34, Th. 5] cannot solve this optimal control problem. The main reason is that the algorithm in [34] requires the ergodicity of the PBCN (please see [34, Th. 4]). However, it is easily observed that the PBCN considered in this example is not ergodic. Indeed, there is no possible way to go from the state δ_4^1 to the state δ_4^2 under either of the control inputs u = δ_2^1 and u = δ_2^2, as shown in Fig. 3, which implies that this PBCN is not ergodic, since the ergodicity of a PBCN requires that it be possible to go from each state to every state under each stationary policy.
In the following, we will see that the policy iteration algorithm based on Theorem 10 can successfully solve the problem in this example.

V. POLICY ITERATION ALGORITHM
By using matrix analysis techniques, including the Jordan decomposition and the Laurent series expansion, a policy iteration algorithm for the optimal control formulation given in Section II is proposed in this section. At the end of the section, the convergence of the algorithm within finitely many iteration steps is illustrated with the example discussed in Sections II-IV.
Several preliminary results, which will be used to prove the correctness of the policy iteration algorithm, are introduced first.
Proposition 14: For any control law μ ∈ U, the rank of I_N − Π_μ ∈ R^{N×N} is less than N, that is, r := Rank(I_N − Π_μ) < N. In addition, there exist a nonsingular upper triangular matrix S_μ ∈ R^{r×r} and a nonsingular matrix V_μ ∈ R^{N×N} such that (53) holds.
Proof: By its definition, Π_μ is a stochastic matrix, so I_N − Π_μ is singular and Rank(I_N − Π_μ) < N. In addition, since r = Rank(I_N − Π_μ) < N, applying the Jordan decomposition theorem (see [45, Th. 3.1.11]), there exist a nonsingular upper triangular matrix S_μ ∈ R^{r×r} and a nonsingular matrix V_μ ∈ R^{N×N} such that (53) holds.
According to the Jordan decomposition form of the matrix I_N − Π_μ given in Proposition 14, we now deduce a computational formula for J_μ as follows.
Lemma 15: For any μ ∈ U, define the corresponding limiting matrix Π*_μ by the Cesaro-type limit in (54). Then the following hold. 1) The Cesaro-type limit defined by the right-hand side of (54) always exists, and accordingly, the limiting matrix Π*_μ satisfies the properties (55) and (56). 2) The vector J_μ defined in (29) can be calculated from Π*_μ and g_μ, as in (57).
Proof: From the definition (54) of the limiting matrix Π*_μ, we get Π_μ Π*_μ = Π*_μ, which means the first equation of (56) holds, and similarly, we also deduce that Π*_μ Π_μ = Π*_μ. According to the Jordan decomposition (53), we have (58). Then, using again the definition (54) of the limiting matrix Π*_μ, we get (59). Furthermore, recalling Π*_μ Π_μ = Π*_μ from (56), we get S_μ L_22 = 0 from (58) and (59). Then, since the upper triangular matrix S_μ ∈ R^{r×r} is nonsingular, we obtain L_22 = 0. Hence, from (59), we obtain (55), which also implies that the Cesaro-type limit (54) exists.
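As a numerical illustration of Lemma 15, and of the deviation-type matrix appearing in Lemma 16 below, the Cesaro limit can be approximated by truncated averaging. The convention in this sketch is an assumption: the closed-loop matrix Π_μ built earlier is column-stochastic, so the cost vector indexed by the initial state is computed from its transpose; whether (57) and (61) are written with the matrix itself or its transpose depends on the article's convention, which is not reproduced here.

```python
import numpy as np

def cesaro_limit(P, T=20000):
    """Approximate P* = lim_{T→∞} (1/T) Σ_{t=0}^{T-1} P^t by a truncated sum;
    the limit satisfies P* P = P P* = P* P* = P*."""
    acc, Pt = np.zeros_like(P), np.eye(P.shape[0])
    for _ in range(T):
        acc += Pt
        Pt = Pt @ P
    return acc / T

def evaluate_stationary_policy(Pi_mu, g_mu):
    """Average cost J_mu and bias h_mu of a stationary policy, indexed by the
    initial state: with P = Pi_mu^T (row-stochastic), J_mu = P* g_mu and
    h_mu = [(I - P + P*)^{-1} - P*] g_mu (the deviation-matrix form)."""
    N = Pi_mu.shape[0]
    P = Pi_mu.T
    P_star = cesaro_limit(P)
    H = np.linalg.inv(np.eye(N) - P + P_star) - P_star
    return P_star @ g_mu, H @ g_mu
```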
Lemma 16: For any control law μ ∈ U, define h_μ := H_μ g_μ, with H_μ given by (60). Then, h_μ can be calculated by (61).
Proof: From the Jordan decomposition (53) and from (55), it follows that the matrix I_N − Π_μ + Π*_μ is nonsingular. Then, using L_22 = 0, we get the corresponding expression, and, noticing the definition (60) of H_μ, we obtain (61).
We also present the Laurent series expansion of (I_N − α Π_μ)^{-1}, which plays an important role in the proof of the correctness of the policy iteration algorithm given below.
Lemma 17: For any given feedback logical control law μ ∈ U and 0 < α < 1, we have (I_N − α Π_μ)^{-1} = (1 − α)^{-1} Π*_μ + H_μ + F(α, μ), where F(α, μ) ∈ R^{N×N} denotes a matrix that converges to zero as α → 1.
Proof: See the Appendix. Based on Theorem 10, we now present a policy iteration algorithm for PBCN.
Remark 18: In Algorithm 1, the initial policy μ^0 is a preset variable given by the user. As discussed in the proof of correctness of Algorithm 1 below, the iteration converges to an optimal policy from an arbitrary initial policy μ^0. Indeed, the initial policy of Example 5.1 is simply chosen as μ(x) = δ_2^1, ∀x. The next proposition characterizes the monotonicity property between two control laws in U.
Proposition 19: For any two control laws μ, η ∈ U, we define the following three special subsets of the state space. If two different feedback laws μ, η ∈ U satisfy condition (68), then the monotonicity relation (69) holds for all 0 < α < 1.

Step 1. Policy Evaluation:
For a given stationary policy μ^n, compute the corresponding J_{μ^n} and h_{μ^n} based on (57) and (61).
Step 2. Policy Improvement:
2.A Choose a stationary policy μ^{n+1} whose structure matrix satisfies the first improvement condition, and set q_i^{n+1} = q_i^n if possible.

2.B If μ^{n+1} = μ^n, go to Step 2.C; else return to Step 1.
2.C Choose a stationary policy μ^{n+1} whose structure matrix satisfies the second improvement condition, and set q_i^{n+1} = q_i^n if possible.
2.D If μ^{n+1} = μ^n, stop and set μ* = μ^n; else return to Step 1 and repeat the process.
Proof: See the Appendix. The condition (68), which can guarantee the monotonicity relation (69) between two different control laws η and μ, will help to prove that the policy improvement process given in Step 2 of Algorithm 1 is greedy.
In addition, if μ^{n+1} = μ^n at the termination step, then the nested condition (34) holds for all i ∈ [1, N]. Hence, applying Theorem 10, we obtain J_{μ^n} = J*, which implies that the stationary policy {μ^n} is optimal.
Remark 20: The above correctness analysis for Algorithm 1 also implies that finitely many iterations suffice to deduce an optimal stationary policy, since the number of stationary policies, which are the control candidates in the optimization formulation, is finite. At the termination step n, the equation J_{μ^n} = J* holds, and {μ^n} is a stationary optimal policy. As a result, the following fact holds.
Example 22: Now, we apply Algorithm 1 to solve the problem given in Example 1, continuing from Example 6.
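Beyond this particular example, the following compact sketch indicates how Algorithm 1 can be implemented numerically. The cost-matrix layout, the truncated Cesaro averaging used in place of (53)-(61), the tolerance handling, and all names are assumptions of this illustration, not the article's code.

```python
import numpy as np

def policy_iteration(Theta, G, N, M, mu0=None, T=20000, tol=1e-9):
    """Illustrative sketch of the nested policy iteration (Algorithm 1).
    Theta : N × (M·N) transition matrix, columns ordered in (u, x) blocks
    G     : per-step costs with G[k-1, i-1] = g(δ_N^i, δ_M^k)  (assumed layout)
    Returns a feedback law mu (mu[i-1] in [1, M]) and its average-cost vector J."""

    def evaluate(mu):
        # closed-loop (column-stochastic) matrix and cost vector of policy mu
        Pi = np.column_stack([Theta[:, (mu[i] - 1) * N + i] for i in range(N)])
        g = np.array([G[mu[i] - 1, i] for i in range(N)], dtype=float)
        P, acc, Pt = Pi.T, np.zeros((N, N)), np.eye(N)
        for _ in range(T):                      # truncated Cesaro limit P*
            acc += Pt
            Pt = Pt @ P
        P_star = acc / T
        H = np.linalg.inv(np.eye(N) - P + P_star) - P_star
        return P_star @ g, H @ g                # J_mu and h_mu

    def exp_next(vec, i, k):                    # E[vec(x(t+1)) | x = δ_N^{i+1}, u = δ_M^k]
        return Theta[:, (k - 1) * N + i] @ vec

    mu = list(mu0) if mu0 else [1] * N
    while True:
        J, h = evaluate(mu)
        new_mu = list(mu)
        for i in range(N):                      # Step 2.A: reduce the gain first
            gains = [exp_next(J, i, k) for k in range(1, M + 1)]
            if gains[mu[i] - 1] > min(gains) + tol:
                new_mu[i] = int(np.argmin(gains)) + 1
        if new_mu != mu:                        # Step 2.B
            mu = new_mu
            continue
        for i in range(N):                      # Step 2.C: improve the bias on the gain-minimizers
            gains = [exp_next(J, i, k) for k in range(1, M + 1)]
            cand = [k for k in range(1, M + 1) if gains[k - 1] <= min(gains) + tol]
            vals = {k: G[k - 1, i] + exp_next(h, i, k) for k in cand}
            if vals[mu[i]] > min(vals.values()) + tol:
                new_mu[i] = min(vals, key=vals.get)
        if new_mu == mu:                        # Step 2.D: termination
            return mu, J
        mu = new_mu
```

Because only finitely many stationary policies exist, such a sweep terminates after finitely many improvements, mirroring Remark 20.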

VI. APPLICATION TO INTERVENTION PROBLEM OF ARA OPERON NETWORK

In this section, the optimal control on intervention of the arabinose (Ara) operon in E. coli is performed as an application to a practical biological network. The contributions [46] and [47] have studied the regulation of the Ara operon in E. coli and observed that the regulatory behavior is determined by the presence or absence of arabinose. Fig. 5 shows a graphical interpretation of the Ara network. As investigated in [48], the update logic of the Ara operon network can be discretized into the Boolean equations (71), where M_S denotes the mRNA of the structural genes (araBAD), M_T is the mRNA of the transport genes (araEFGH), E is the enzymes AraA, AraB, and AraD, coded for by the structural genes, T is the transport protein, coded for by the transport genes, A is the intracellular arabinose (high levels), A_em is the intracellular arabinose (at least medium levels), C is the cAMP-CAP protein complex, D is the DNA loop, and Ara+ is the arabinose-bound AraC protein. For more details of the biological justification of each update function of (71), please see [48]. There exist four Boolean control variables: the AraC protein (unbound to arabinose), the extracellular glucose, the extracellular arabinose (at least medium levels), and the extracellular arabinose (high levels); those variables are denoted by Ara-, G_e, A_em, and A_e, respectively. Furthermore, the variable D is 1 if the DNA is looped and 0 if it is not looped. All the other variables represent the concentration levels of the corresponding gene products: 1 denotes "present" or "high concentration," and 0 denotes "absent" or "low (basal) concentration." Consider the context-sensitive case, as discussed in [42], with perturbation on the influence of node M_S on the logical function f_E and of node D on the logical function f_{M_S}, as indicated by the dotted lines in the network diagram.
We consider a special intervention problem that requires designing an optimal feedback law of the four control parameters (the AraC protein, the extracellular glucose, and the extracellular arabinose at medium and high levels) to maximize the "present" level of the mRNA of the structural genes (araBAD). Under the above PBN expression of the Ara operon network, this intervention problem can be converted into an average minimum cost problem, with a cost function g that rewards the presence of M_S, noticing that x_7 = M_S denotes the mRNA of the structural genes (araBAD). Now, using Algorithm 1, we easily obtain the optimal performance for this problem, J*(x) ≡ -1 for all x ∈ Δ_512, and the corresponding optimal feedback law, given in the following form: (·, 0, 0, 0), if x_1 ∨ x_2 = 0 and x_9 = 0; (0, 1, 0, 0), if x_1 ∨ x_2 = 0 and x_9 = 1; (1, 0, 1, 0), if x_1 ∨ x_2 = 1 and x_9 = 0; (0, 1, 1, 0), if x_1 ∨ x_2 = 1 and x_9 = 1.
Furthermore, under a new state representation y = (y_1, ..., y_9) obtained by a special permutation of the state variables, the optimal stationary policy is given in Fig. 10. The policy iteration takes 5084.525867 s to obtain the exact optimal law on a computer with 8-GB RAM and a quad-core 3.2-GHz processor.

VII. CONCLUSION
This article dealt with the infinite horizon optimization problem for general PBCNs under an average cost criterion. Based on the STP, an equivalent matrix expression of the model was presented. Then, combining the techniques of the Laurent series expansion and the Jordan decomposition, a novel nested policy iteration-type algorithm, which solves the probabilistic optimization problem for an arbitrary PBCN without any ergodicity requirement, was deduced. Finally, some practical applications, including the optimal intervention problem of the Ara operon, were solved to illustrate the benefit of the proposed algorithm for solving optimal control problems of PBCNs.

APPENDIX

Proof of Lemma 4: According to (14) and the definition (17) of the conditional transition probability p_ij(u), we have (73). In addition, by observing the corresponding relationship and recalling the definition (16) of the transition matrix, we obtain (74). Then, by combining (73) and (74), we complete the proof of this lemma.
Proof of Lemma 5: According to the law of total probability, we have (75). In addition, using Lemma 4, we obtain (76). Combining (75) and (76), we arrive at the claimed linear form, which completes the proof.
Proof of Lemma 8: Since (29) is immediately obtained from (28), we just prove (28).