Optimal Control of Probabilistic Boolean Networks: An Information-Theoretic Approach

The primary challenge in the biological sciences is to control gene regulatory networks (GRNs), thereby creating therapeutic intervention methods that alter network dynamics in the desired manner. The optimal control of GRNs with probabilistic Boolean control networks (PBCNs) as the underlying structure is a solution to this challenge. Owing to the exponential growth of the network size with the number of genes, we need an optimal control approach that scales to large systems without imposing any limitations on network dynamics. Furthermore, we are encouraged to use the graphics processing unit (GPU) to reduce time complexity by utilizing readily available and enhanced computational resources. In this paper, the optimal control of PBCNs in the Markovian framework is developed employing an information-theoretic approach based on the Kullback-Leibler (KL) divergence. We convert the nonlinear optimal control problem of the PBCN to a linear problem by using an exponential transformation of the cost function, also known as the desirability function. The linear formulation enables us to compute an optimal control using the path integral (PI) method. Furthermore, we offer sampling-based methodologies for approximating the PI and thereby optimizing PBCN control. The sampling-based method can be implemented in parallel, which enables solving the optimal control problem for large PBCNs.


I. INTRODUCTION
The popularity of Boolean networks (BNs) has increased gradually since Kauffman introduced them [1]. BNs are qualitative models that describe the dynamics of the gene activity profile (GAP), characterizing the status of a gene as active/inactive (or 0/1) in gene regulatory networks (GRNs). BNs can be used to investigate GRNs in a restricted environment; however, their simplicity, along with their deterministic rigidity, limits the ability to analyze reasonably complex network dynamics. The use of probabilistic switching between several networks substantially improves a model's capacity to represent behaviors that are close to actual observations. The probabilistic Boolean networks (PBNs) introduced by Shmulevich [2] include this stochastic nature, making them a natural choice for modeling GRNs under limited information. PBNs with the addition of Boolean control inputs are called probabilistic Boolean control networks (PBCNs). Since the introduction of the semi-tensor product (STP) of matrices by Cheng et al. [3], many fundamental properties of switched Boolean control networks, PBNs and PBCNs have been characterized in the literature, including but not limited to observability [4]-[6], controllability [7]-[9], reconstructibility [4], fault detection [10], stabilizability [11]-[15], structure identification [16], [17], output tracking control [18], [19] and model checking [20]. Triggered control of PBCNs is examined in [23], optimal time-varying feedback controllability for PBCNs is discussed in [22], and the authors have proposed a pinning control approach in [24]. Because PBCN states form a Markov chain with a finite state space, the model is compatible with Markov decision processes (MDPs). In [25], the authors incorporate discrete-time MDPs into PBCN modeling and establish a probability criterion to restrict the induced loss defined by the state cost for finite time steps, before the network encounters the desirable states for the first time.
Several more studies use the MDP framework for PBCN optimal control, including [26]- [28]. In [29], the context-sensitive PBNs with perturbations are used to minimize the finite horizon expected cost. Recently, reinforcement learning (RL) based techniques have been proposed in [30]- [36] to solve the optimal control problem of PBCNs.
The curse of dimensionality, i.e., the problem of exponentially growing numbers of states and controls as the network size grows, affects most of the approaches available to control PBCNs. This problem cannot be solved without making some simplifying assumptions. Due to the necessity of computations on large matrices, the matrix-based techniques, i.e., STP or MDP based, suffer degraded performance. Moreover, matrix-based methods become infeasible at smaller system sizes than methods which do not involve matrix operations. Furthermore, for large systems, the approximate solution techniques impose a heavy computational load, and convergence to the (sub)optimal solution is only guaranteed in the limiting scenario of infinitely many iterations. In addition, adequate error-margin performance can only be ensured after several simulations.
The techniques from continuous-time stochastic optimal control (SOC) [37], [38] can be adapted for PBCNs to address most of the above challenges. A family of nonlinear control problems with control affine dynamics and quadratic control cost is examined by Kappen [39] using the path integral (PI) based representation. The PI-based SOC is restated as a problem of minimizing the Kullback-Leibler (KL) divergence between controlled and uncontrolled transition distributions in [40]. The KL divergence is also referred to as an information cost between two distributions. The resulting control formulation is regarded as informationtheoretic control [41] with the cumulative sum of the state dependent cost and information cost as free energy [42]. The framework of linearly-solvable MDP (LMDP) [43] achieves a comparable formulation for discrete state space with the restriction of information cost in terms of KL divergence and the transition probabilities denoting continuous inputs.
Inspired by the preceding discussion, we develop a novel information-theoretic strategy to effectively solve the optimal control of PBCNs and implement the same using a graphics processing unit (GPU) based parallel processing framework. The optimal control of PBCNs is proposed in the MDP framework, with the cost-to-go consisting of the state cost and the information cost. The following are the key contributions of the framework proposed in this paper: 1) To take advantage of the inherent stochastic behavior of PBCNs, an information-theoretic formulation utilizing the augmented state space is proposed for optimal control of PBCNs. 2) To obtain the solution to information-theoretic control through approximation and overcome limitations of the Monte Carlo sampling method, an entropy-based improved Monte Carlo sampling technique is proposed. 3) To overcome memory constraints, a matrix-free approach for simulating the PBCN model is developed. 4) To obtain scalability of optimal control in large PBCNs, a GPU-based parallel implementation is introduced. The paper is organized as follows: Section II introduces the required fundamentals. In Section III, the PBCN classical and information-theoretic optimal control formulation over the proposed augmented state space is investigated. Section IV deals with the solution in terms of desirability function estimation, employing the path integral representation and its improved Monte Carlo sampling-based approximation. In Section V, several algorithms and a GPU-based implementation are developed. A couple of illustrative examples are presented to validate the effectiveness of the proposed method, and the scalability is demonstrated by implementation on a 37-gene T-cell network in Section VI.

NOTATIONS
R, R+, Z, and Z+ denote the sets of real numbers, nonnegative real numbers, integers, and nonnegative integers, respectively. The scalar multiplication of two numbers is denoted by ×. E[·] represents the expectation. The set of integers {m_1, . . . , m_2} is indicated for any given integers m_1, m_2 ∈ Z+ such that m_1 ≤ m_2. The symbol |P| denotes the cardinality of any set P. If both of its inputs are identical, the function δ(·, ·) returns 1, else it returns 0. We use the symbols X (U) and s (a) for states (control inputs) of the PBCN and MDP, respectively. We denote the Boolean domain by B := {0, 1}. Similarly, the Cartesian product of B n times is given by B^n := B × · · · × B. Logical AND, OR, and NOT operations are denoted by ∧, ∨, and ¬, respectively.

II. PRELIMINARIES
In the following, we present a brief review of probabilistic Boolean control networks (PBCNs) in the Markovian framework, Markov decision processes (MDPs), and the information-theoretic control framework.

A. PROBABILISTIC BOOLEAN CONTROL NETWORKS
For i ∈ {1, 2, . . . , n} and k ∈ {1, 2, . . . , m}, consider the nodes x_i(t) ∈ B and the Boolean control inputs u_k(t) ∈ B. In this network, the vector representation of the expression levels of all the genes at time t is given by the row vector x(t) ∈ B^n defined as x(t) := [x_1(t) x_2(t) . . . x_n(t)]. Similarly, the row vector corresponding to the inputs is represented by u(t) ∈ B^m and defined as [u_1(t) . . . u_m(t)]. For every node x_i, f_i^(j) : B^(n+m) → B represents a possible predictor function (or an update rule) to determine the value of gene x_i, and |F_i| is the number of possible predictor functions of x_i. A PBCN with n genes and m control inputs for t ∈ Z+ is described as follows:

x_i(t + 1) = f_i^(j)(x(t), u(t)), j ∈ {1, . . . , |F_i|}. (1)

The probability of selection associated with the predictor function f_i^(j) is denoted by c_i^(j), and it satisfies the following condition:

c_i^(j) ≥ 0, Σ_{j=1}^{|F_i|} c_i^(j) = 1. (2)

For any PBCN, the total number of constituent Boolean control networks (BCNs) is given by

N = Π_{i=1}^{n} |F_i|. (3)

Let f̂_l denote the dynamics of the l-th constituent BCN and P_1, P_2, . . . , P_N be the selection probabilities associated with each of the N possible networks. The selection probability of the l-th BCN, denoted by P_l, is obtained as

P_l = Π_{i=1}^{n} c_i^(j_i),

with the selection of the functional relationships comprising (f_1^(j_1), . . . , f_n^(j_n)).
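As a quick numerical illustration of the constituent-network count N in (3) and the selection probabilities P_l, the following sketch enumerates the constituent BCNs for a hypothetical two-gene network; the predictor counts and the selection probabilities c_i^(j) are assumed values, not taken from the paper:

```python
import math
from itertools import product

# Assumed selection probabilities c_i^(j), one inner list per gene;
# each inner list sums to 1 as required by condition (2).
selection_probs = [
    [1.0],        # gene 1: a single predictor, |F_1| = 1
    [0.7, 0.3],   # gene 2: two predictors, |F_2| = 2
]

# Total number of constituent BCNs: N = prod_i |F_i|.
N = math.prod(len(c_i) for c_i in selection_probs)

# Selection probability P_l of each constituent BCN is the product of the
# chosen predictors' probabilities, one choice j_i per gene.
P = [math.prod(combo) for combo in product(*selection_probs)]
```

Here N = 2 and P = [0.7, 0.3]; the P_l always sum to 1 because each gene's selection probabilities do.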
B. MARKOV DECISION PROCESSES
An MDP is a Markov process that generates a series of states following the Markov property Pr(s(t + 1) | s(t), . . . , s(0)) = Pr(s(t + 1) | s(t)), where s(t + 1), s(t), s(0) ∈ S, and that is affected by external interventions. Given a state s ∈ S and a control action a ∈ A, we can calculate the transition probability for a possible next state s′ ∈ S as P(s′|s, a) = Pr(s(t + 1) = s′ | s(t) = s, a(t) = a).
Likewise, there is a cost associated with any current state s, current action a, and future state s′, given as R(s′|s, a) = E{r(t + 1) | s(t) = s, a(t) = a, s(t + 1) = s′}, which indicates the favorability of a state. In this case, r(t + 1) represents a real-valued function of the state and action.

C. INFORMATION-THEORETIC CONTROL FRAMEWORK
The standard control problem with discrete control inputs and arbitrary control cost differs from the information-theoretic control problem [42] presented here.
Definition 1: A continuous input is regarded as the input across a continuous set of transition probabilities from a given state under the effect of a discrete control input a, i.e., a(s′|s) = P(s′|s, a).
Definition 2: At state s, the uncontrolled distribution, also referred to as the passive dynamics, P̄(·|s), describes the system behavior in the absence of control inputs. In continuous stochastic systems, the passive dynamics (or distribution) is clearly defined, but its discrete-case equivalent corresponds to a random walk in the state space.
Definition 3: [45] The KL divergence given below is a dissimilarity measure between two distributions h_1(s) and h_2(s):

KL(h_1 ∥ h_2) = Σ_s h_1(s) log( h_1(s) / h_2(s) ), (4)

which is finite only if h_1 is absolutely continuous with respect to h_2.
The information-theoretic approach can be applied to MDPs where the immediate cost representing the one-step running cost, i.e., l(s, a), is calculated as the sum of the state cost and the KL divergence between the controlled and passive dynamics, KL(P(·|s, a) ∥ P̄(·|s)), as shown below.
KL(P(·|s, a) ∥ P̄(·|s)) = E_{s′∼P(·|s,a)}[ log( P(s′|s, a) / P̄(s′|s) ) ], where P(s′|s, a) and P̄(s′|s) are the transition probabilities under the controlled and passive dynamics, respectively.
l(s, a) = q(s) + E_{s′∼a(·|s)}[ log( a(s′|s) / P̄(s′|s) ) ]. The cost of altering the passive distribution can thus be viewed as the KL divergence.
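The one-step cost l(s, a) above can be sketched numerically; the three-state transition distributions and the state cost below are illustrative values (not from the paper), with a uniform passive distribution standing in for the random-walk passive dynamics:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for finite distributions given as lists.
    Assumes q[i] > 0 wherever p[i] > 0 (absolute continuity)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def one_step_cost(q_state, controlled, passive):
    """l(s, a) = q(s) + KL(controlled transition dist || passive dist)."""
    return q_state + kl_divergence(controlled, passive)

# Illustrative numbers: a 3-state uniform passive distribution and a
# controlled distribution that pushes probability onto the first state.
passive = [1 / 3, 1 / 3, 1 / 3]
controlled = [0.8, 0.1, 0.1]
cost = one_step_cost(0.5, controlled, passive)
```

The KL term is zero exactly when the controlled and passive distributions coincide, so the cheapest action is "do nothing" unless the state cost rewards deviating.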
By using the Bellman principle of optimality, the optimal control is determined as the minimizing control a of the cost-to-go function J(s):

J(s) = min_{a∈A} [ q(s) + E_{s′∼a(·|s)}[ log( a(s′|s) / P̄(s′|s) ) + J(s′) ] ]. (5)
To obtain the linear form of the above nonlinear Bellman equation, the desirability function is defined as z(s) = exp(−J(s)). Introducing the normalization term

G_z(s) = Σ_{s′} P̄(s′|s) z(s′) = E_{s′∼P̄(·|s)}[ z(s′) ] (6)

in (5) yields the following optimization problem:

min_{a∈A} [ q(s) − log G_z(s) + KL( a(·|s) ∥ P̄(·|s) z(·)/G_z(s) ) ],

which achieves its minimum when the controls are selected using

a*(s′|s) = P̄(s′|s) z(s′) / G_z(s). (7)

Substituting the minimizing control gives the linear equation

z(s) = exp(−q(s)) G_z(s). (8)
The function z(s) indicates how desirable a state is. Since the desirability function is defined as the exponential of the negative cost-to-go, it takes higher values for trajectories with lower costs. The normalizing term is an expectation operator, which is linear; as a result, equation (8) can be solved using techniques applicable to linear systems of equations, such as the eigenvalue problem, iterative backward evaluation, and others. The linear equation (8) is solved to find the desirability function of all states. The information-theoretic control framework does not provide the optimal control input explicitly; instead, it gives the optimal transition probabilities under the optimal control input.
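The iterative backward evaluation mentioned above can be sketched on a small example; the three-state passive dynamics and state costs below are assumed purely for illustration:

```python
import math

# Assumed passive dynamics P_bar[s][s'] and state costs q for a toy 3-state chain.
P_bar = [
    [0.5, 0.5, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.5, 0.5],
]
q = [1.0, 1.0, 0.0]  # state 2 is the desirable (zero-cost) state

def backward_desirability(P_bar, q, horizon):
    """Iterate z(s) = exp(-q(s)) * sum_s' P_bar(s'|s) z(s'),
    starting from terminal desirability 1 (zero terminal cost assumed)."""
    n = len(q)
    z = [1.0] * n
    for _ in range(horizon):
        Gz = [sum(P_bar[s][sp] * z[sp] for sp in range(n)) for s in range(n)]
        z = [math.exp(-q[s]) * Gz[s] for s in range(n)]
    return z

z = backward_desirability(P_bar, q, horizon=10)
```

As expected, the zero-cost state ends up with the largest desirability, and the optimal transition probabilities follow from (7) by reweighting each row of P_bar with z.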

III. PBCN OPTIMAL CONTROL PROBLEM FORMULATION
The control-dependent counterpart of the probability distribution vector w(t) for one-step evolution is given by

w(t + 1) = w(t) P(U(t)), (9)

where P(U(t)), represented by P(U) in the sequel for U ∈ U, is a controlled transition probability matrix in R^{2^n × 2^n}. For a detailed description, one can refer to [28]. Let f̂_l(X_I, U_K) be the dynamics of the l-th constituent BCN. For notational simplicity, we have suppressed the dependence of the states (X) and control inputs (U) on time t.
Since the controlled Markov chain can be utilized to describe the dynamical behavior of a PBCN, the theory of MDPs can be employed to find an optimal intervention policy [46]. The PBCN can be defined as an MDP comprising the set of states X := {X_1, . . . , X_{2^n}} with X_I = 1 + Σ_{i=1}^{n} 2^{n−i} x_i, the set of controls U := {U_1, . . . , U_{2^m}} with the control index U_K = 1 + Σ_{k=1}^{m} 2^{m−k} u_k, and the probability of transition from X = X_I to X′ = X_J under input U ∈ U obtained from the element in the I-th row and J-th column of the controlled transition probability matrix P(U) in (9). In the following, we discuss the classical approach, its limitations, and our proposed framework for PBCN optimal control.
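The index map X_I = 1 + Σ_{i=1}^{n} 2^{n−i} x_i and its inverse can be written out directly; this is a small self-contained sketch of that encoding:

```python
def state_index(x):
    """X_I = 1 + sum_i 2^(n-i) x_i for a Boolean row vector x = [x_1, ..., x_n]."""
    n = len(x)
    return 1 + sum(x[i] * 2 ** (n - 1 - i) for i in range(n))

def index_to_state(X_I, n):
    """Inverse map: recover [x_1, ..., x_n] from the 1-based index X_I."""
    bits = format(X_I - 1, f"0{n}b")
    return [int(b) for b in bits]
```

For example, for n = 2 the GAP [0, 0] maps to X_1 and [1, 1] maps to X_4; the analogous formula with m bits indexes the control vectors.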

A. CLASSICAL APPROACH TO OPTIMAL CONTROL
Under the probability distributions specified by (9), the optimal control problem is to minimize the expected cost (or cost-to-go function) of trajectories beginning from any state. To regulate the PBCN behavior, control inputs are applied over a finite horizon t_f, and the states are divided into two categories: favorable (low penalty) and undesired (high penalty). The optimal control seeks a policy π = {µ_0, µ_1, · · · , µ_{t_f−1}}, where µ_t : X → U maps the state space to the control space. A one-step cost l(X, U) : X × U → R+ comprises the state-dependent cost q(X) : X → R+ and the control cost g(U) : U → R+, i.e., l(X, U) = q(X) + g(U). For a state X, the cost-to-go of a trajectory generated by the policy is calculated as

J^π(X) = E[ φ(X(t_f)) + Σ_{t=0}^{t_f−1} l(X(t), µ_t(X(t))) ],

and the corresponding Bellman equation is

J_t(X) = min_{U∈U} [ l(X, U) + E_{X′∼P(·|X,U)}[ J_{t+1}(X′) ] ]. (10)

The optimal control policy µ*_t(X) minimizes the right-hand side of (10) for each state X and time t. Because the cost remains bounded, a solution to the optimal control problem always exists. Hence, in this case, the resultant control policy is a time-dependent short-term policy that alters the PBCN's dynamic behavior.
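The classical finite-horizon dynamic programming recursion can be sketched on a tiny MDP; the two-state, two-input transition probabilities and costs below are assumed numbers, not a network from the paper:

```python
# Backward recursion J_t(X) = min_U [ q(X) + g(U) + sum_X' P(X'|X,U) J_{t+1}(X') ]
# for an assumed 2-state, 2-input MDP; P[U][X][X'] is the transition probability.
P = {
    0: [[0.9, 0.1], [0.2, 0.8]],
    1: [[0.4, 0.6], [0.6, 0.4]],
}
q = [1.0, 0.0]   # state cost: state 1 is the favorable state
g = [0.0, 0.5]   # control cost per input

def finite_horizon_control(P, q, g, t_f):
    n = len(q)
    J = [0.0] * n          # zero terminal cost assumed
    policy = []
    for _ in range(t_f):
        J_new, mu = [], []
        for X in range(n):
            costs = {
                U: q[X] + g[U] + sum(P[U][X][Xp] * J[Xp] for Xp in range(n))
                for U in P
            }
            U_best = min(costs, key=costs.get)
            mu.append(U_best)
            J_new.append(costs[U_best])
        J, policy = J_new, [mu] + policy   # prepend: policy[t] is the map at time t
    return J, policy

J0, policy = finite_horizon_control(P, q, g, t_f=5)
```

Note that each backward step performs the exhaustive minimization over all inputs that the next paragraph identifies as a bottleneck for large networks.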
Despite the fact that the MDP framework enables dynamic programming, the established Bellman equation (10) is nonlinear and ineffective in several cases: i) it rarely offers analytic solutions and is, in general, computationally expensive to solve using approximation methods; ii) each iteration in iterative approaches requires an exhaustive search over all feasible inputs; iii) in terms of computational cost and memory, an exact solution for a large PBCN suffers from the curse of dimensionality. In comparison to traditional MDP-based approaches that approximate the solution of a nonlinear Bellman equation, the information-theoretic framework that can overcome the said hurdles is computationally more powerful because it approximates the solution of a linear equation. However, in posing the PBCN optimal control in this context, the following difficulties arise: i) The classical PBCN possesses discrete and finite inputs; however, we require continuous inputs that represent the transition probabilities. ii) Because the PBCN is non-affine in the control action, a possible choice of passive dynamics can correspond to a network with equally likely occurrence of actions. However, out of P(U_K) ∀U_K ∈ U, there is no obvious choice for the passive dynamics P̄(·|X), and switching between the controlled dynamics P(U_K) may not be feasible if the absolute continuity requirement (4) is not fulfilled. iii) The information-theoretic cost cannot be incorporated into the context-specific control cost specified for a given PBCN. In the next section, we develop a novel state space architecture to overcome these difficulties.

B. CONSTRUCTION OF AUGMENTED STATE SPACE
To allow information-theoretic control of PBCN, we start with augmented state space generation and provide the resolution of the difficulties described in the preceding section.
Definition 4: For a PBCN (1), the augmented state space X̄ is constructed as the set of all tuples generated by the Cartesian product of X and the set of available control inputs U, i.e.,

X̄ := X × U = {(X_I, U_K) : X_I ∈ X, U_K ∈ U}. (11)

Definition 5: The transition probability of the augmented state X̄ = (X_I, U_K) to X̄′ = (X_J, U_L) is defined as

P̄(X̄′|X̄) = (1/2^m) P(X_J | X_I, U_K). (12)

Remark 1: The transition probability for the augmented states is the average value of all the underlying BCN transition probabilities. Whereas any alternative choice is likely to experience bias, the impact of the continuous control inputs is considered to be equally probable for an unspecified natural intervention rate.
Example 1: [28] Consider a two-gene PBCN with a single control input, with the system evolving according to (1).
As shown below, the augmented state space resolves the challenges in restructuring the PBCN optimal control in the paradigm of information-theoretic control. 1) The input A of the augmented state space can now be considered equivalent to the continuous transition probabilities, i.e., P(X̄′|X̄, A) = A(X̄′|X̄), as required by the information-theoretic framework. 2) All of the controlled transition probabilities P(U_K) can be integrated into a single state transition matrix P̄, which is 2^{m+n} × 2^{m+n} in size. The probability matrix P̄ is used to represent the state transition matrix of the system's passive dynamics, which addresses problem (ii) from the preceding section. Furthermore, meeting the criterion of absolute continuity is no longer required in switching between controlled dynamics. 3) The penalty to be incurred for reshaping the passive dynamics can be viewed as the KL divergence between the continuous input A and the passive dynamics P̄.
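The augmented state space construction can be sketched in a few lines; the passive dynamics here draw the next control input uniformly, an assumption consistent with the equally probable inputs of Remark 1, and `step_fn` is a hypothetical placeholder for one stochastic gene-state transition:

```python
import itertools
import random

def build_augmented_states(n, m):
    """Augmented state space X_bar = X x U as (state index, input index) tuples,
    both 1-based, for n genes and m control inputs."""
    states = list(range(1, 2 ** n + 1))
    inputs = list(range(1, 2 ** m + 1))
    return list(itertools.product(states, inputs))

def passive_step(X_bar, step_fn, m, rng=random):
    """One sample from the passive dynamics over augmented states:
    the gene state evolves under the current input via step_fn(X, U) -> X',
    and the next input is chosen uniformly at random (assumed uniform choice)."""
    X, U = X_bar
    X_next = step_fn(X, U)
    U_next = rng.randint(1, 2 ** m)
    return (X_next, U_next)

aug = build_augmented_states(n=2, m=1)  # 2^(n+m) = 8 augmented states
```

Because the trajectories only ever sample transitions, the full 2^{m+n} × 2^{m+n} matrix P̄ never needs to be stored, which is the point of Remark 2 below.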

Remark 2:
The aforementioned transition probability matrix, 2^{n+m} × 2^{n+m} in size, over the augmented state space is not the same as the STP-based matrix of size 2^{n+m} [34]. The approach developed in this paper does not involve a matrix representation of the system dynamics. The additional dimensions show that the state space has grown, yet they do not cause problems when used in large systems. This is due to the fact that the methodology is based on a trajectory-dependent approach. The trajectories can be obtained from recorded data or produced through simulations using a known dynamic model.

C. PBCN INFORMATION-THEORETIC CONTROL
The proposed information-theoretic optimal control of the PBCN is presented in FIGURE 2. In the augmented state space, the Bellman equation is reformulated with the immediate cost l(X̄, A) and the terminal cost φ(X̄(t_f)). In this case, q̄(X̄) is the augmented state cost obtained as q(X_I) + g(U_K). The immediate cost is given as

l(X̄, A) = q̄(X̄) + KL( A(·|X̄) ∥ P̄(·|X̄) ), (13)

and the corresponding Bellman equation is

J_t(X̄) = min_A [ l(X̄, A) + E_{X̄′∼A(·|X̄)}[ J_{t+1}(X̄′) ] ]. (14)

The minimizing control A of the optimal control is generated from (13).
Taking into account the desirability functions z_t(X̄) = exp(−J_t(X̄)) and z_{t+1}(X̄) = exp(−J_{t+1}(X̄)), incorporating z_t and z_{t+1} in (14) provides

− ln z_t(X̄) = min_A [ q̄(X̄) + E_{X̄′∼A(·|X̄)}[ ln( A(X̄′|X̄) / ( P̄(X̄′|X̄) z_{t+1}(X̄′) ) ) ] ]. (16)

Consider the normalizing term for the denominator of the expectation,

G[z_{t+1}](X̄) = Σ_{X̄′} P̄(X̄′|X̄) z_{t+1}(X̄′). (17)

The expectation term is expressed as follows:

E_{A(·|X̄)}[ ln( A(X̄′|X̄) / ( P̄(X̄′|X̄) z_{t+1}(X̄′) ) ) ] = KL( A(·|X̄) ∥ P̄(·|X̄) z_{t+1}(·) / G[z_{t+1}](X̄) ) − ln( G[z_{t+1}](X̄) ).

Since the term − ln(G[z_{t+1}](X̄)) is constant with respect to the distribution A(·|X̄), it passes outside the expectation. Substituting the expectation in (16) gives

− ln z_t(X̄) = min_A [ q̄(X̄) − ln( G[z_{t+1}](X̄) ) + KL( A(·|X̄) ∥ P̄(·|X̄) z_{t+1}(·) / G[z_{t+1}](X̄) ) ]. (18)

Because the normalizing term − ln G[z_{t+1}](X̄), the state cost q(X_I), and the control cost g(U_K) are independent of A (the optimizing variable), the minimum in (18) is achieved if the contribution of the KL divergence is zero. As a result, by setting the KL divergence component to zero, the optimal continuous inputs are found as

A*(X̄′|X̄) = P̄(X̄′|X̄) z_{t+1}(X̄′) / G[z_{t+1}](X̄). (19)

Replacing A(·|X̄) with the optimal continuous control input A*(·|X̄) yields

z_t(X̄) = exp(−q̄(X̄)) G[z_{t+1}](X̄), (20)

where z(X̄) is referred to as the optimal desirability function of the augmented state X̄. Replacing the normalizing term G[z_{t+1}] with Σ_{X̄′} P̄(X̄′|X̄) z_{t+1}(X̄′) in (20) results in a linear Bellman equation in the desirability function for optimal control at time t:

z_t(X̄) = exp(−q̄(X̄)) Σ_{X̄′} P̄(X̄′|X̄) z_{t+1}(X̄′). (21)

In the augmented setting, the desirability function is evaluated for all augmented states by solving the linear equation (21). In this case, the state X̄(t) ∈ X̄ will have the maximum desirability if the trajectories starting from that state result in the lowest values of the cost function.
Moreover, the desirability function here comprises both costs, i.e., the state and control costs. This helps in finding the optimal control input for the problem under consideration. For small-size systems, the desirability function (21) can be obtained through matrix operations.

IV. PATH INTEGRAL SOLUTION TO INFORMATION-THEORETIC CONTROL OF PBCN
For large systems, the PI-based method is advantageous since it can be adapted to work without explicitly computing the state transition matrix.

A. DESIRABILITY FUNCTION ESTIMATION USING PATH INTEGRAL
Using the Feynman-Kac lemma and the diffusion process [39], it is possible to transform the backward-in-time calculation for the optimal control solution into a forward-in-time computation. A similar rationale can be used for the discrete system to establish the PI formulation with continuous inputs and information cost [47]. The desirability function is determined using the PI, in which the expected cost of a particular state is estimated by taking into account all of the system's possible paths. By substituting G[z_{t+1}](X̄) over the augmented states, the desirability function z_t(X̄) is expressed in terms of the PI as follows:

z_t(X̄) = exp(−q̄(X̄)) E_{X̄′∼P̄(·|X̄)}[ z_{t+1}(X̄′) ]. (22)

Similarly, we have

z_{t+1}(X̄′) = exp(−q̄(X̄′)) E_{X̄″∼P̄(·|X̄′)}[ z_{t+2}(X̄″) ], (23)

with X̄″ denoting the augmented state at time t + 2. Substituting z_{t+1}(X̄′) in (22), we get the following form in terms of the probability-independent cost q̄(·) for z_t(X̄):

z_t(X̄) = exp(−q̄(X̄)) E_{X̄′}[ exp(−q̄(X̄′)) E_{X̄″}[ z_{t+2}(X̄″) ] ].

With the recursion of z(·) in the inner expectation up to the terminal time t_f, the PI-based desirability function is derived as follows:

z_t(X̄) = E_{P̄}[ exp( − φ(X̄(t_f)) − Σ_{τ=t}^{t_f−1} q̄(X̄(τ)) ) ]. (24)
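The PI representation says the desirability is an expectation, under the passive dynamics, of the exponentiated negative path cost, so it can be estimated by rolling trajectories forward. The sketch below does this for an assumed two-state toy chain (zero terminal cost), not a PBCN from the paper:

```python
import math
import random

def estimate_desirability(start, step, q_bar, horizon, samples, rng):
    """Forward Monte Carlo estimate of z_t(X) = E[ exp(-sum of running costs) ]
    under the passive dynamics given by step(X, rng) -> X'."""
    total = 0.0
    for _ in range(samples):
        X, path_cost = start, 0.0
        for _ in range(horizon):
            path_cost += q_bar(X)
            X = step(X, rng)
        total += math.exp(-path_cost)
    return total / samples

# Assumed toy passive dynamics: jump to state 0 or 1 uniformly at random.
def passive_step(X, rng):
    return rng.choice([0, 1])

rng = random.Random(0)
q_bar = lambda X: float(X == 0)   # state 0 incurs unit cost, state 1 is free
z0 = estimate_desirability(0, passive_step, q_bar, 5, 2000, rng)
z1 = estimate_desirability(1, passive_step, q_bar, 5, 2000, rng)
```

Starting in the costly state forces one unit of cost on the first step, so its estimated desirability is roughly a factor e smaller than that of the cost-free state.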

B. SAMPLING BASED APPROXIMATION
In most cases, the analytic assessment of the PI is impractical, necessitating numerical approximation. The PI is calculated from randomly sampled trajectories drawn from the distribution P̄ using the Monte Carlo (MC) sampling method, which yields the desirability function approximation shown below:

z_t(X̄) ≈ (1/S) Σ_{s=1}^{S} exp( − q̄_{s,t→t_f} ),

where S is the total number of samples. The cost over the sampled path s, i.e., q̄_{s,t→t_f}, is defined as φ(X̄_s(t_f)) + Σ_{τ=t}^{t_f−1} q̄(X̄_s(τ)). MC sampling, on the other hand, has the following drawbacks: i) When generating the augmented passive dynamics P̄, an arbitrary starting distribution for the selection of the control action must be considered. ii) The optimal desirability generated corresponds to the arbitrarily assumed initial control action distribution; the ordering of desirabilities may differ for different initial control action distributions. iii) The contribution of every randomly generated path is weighted equally in the desirability estimation; therefore, the non-optimal paths could introduce significant bias in the estimation. iv) There is no way to indirectly comment on other distributions under the selection of different control actions. v) Uncertainty in the system dynamics is not taken into account.

C. ENTROPY-BASED IMPROVED MC SAMPLING
The limitations mentioned above can be overcome by introducing the following modifications in MC sampling: i) A weighting function is introduced to modify the expected cost of trajectories starting from the required augmented state. ii) A cost-dependent weightage is assigned to each path; the assigned weights can be either monotonically increasing or decreasing over the range of costs. iii) Uncertainty in the system dynamics is accounted for by introducing probability dependence in the weighting function.
We describe the weighting function in terms of the average entropy, the relative cost, and an application-specific parameter providing an extra degree of freedom. The entropy of a gene measures the randomness associated with that gene and is determined by the probability distribution of its predictor selection: the distribution has the highest entropy if it is uniform, while any other distribution has lower entropy. For example, in Example 1, because gene x_2 has higher randomness, its entropy will be higher than that of gene x_1, which has zero entropy. Following these principles, we define the average entropy for a PBCN, which measures the randomness depending on the distribution of the selection probabilities across the overall network.
Definition 6: The average entropy H of a PBCN is the mean of the normalized entropies of the genes' predictor-selection distributions, so that H = 0 for a deterministic network and H = 1 when all selection distributions are uniform.
Definition 7: The relative error R_e of a sampled path is its cost measured relative to the minimum and maximum path costs over the sample set.
Definition 8: The index I_d is an entropy-dependent exponent used to shape the weighting function.
The weighting function w_{s,t}(X̄) for the s-th sample is determined using the relative error R_e (25) and the power factor (26), considering the path costs in either a pessimistic or an optimistic manner. We propose two approaches in the subsequent sections, termed min-skew (optimistic) and max-skew (pessimistic).

1) Min-skew sampling
The optimistic approach reinforces the greedy estimation of costs by skewing the expected path cost in the direction of the minimum (min-skew). In the min-skew approach, the paths with lower costs are accommodated with higher weightage and vice versa. The control action that pertains to the minimum expected cost after min-skew is selected. FIGURE 3 represents the functioning of min-skew, which is summarized as follows: i) Evaluate the original cost distribution k(c) (FIGURE 3 (a)) using the sampled trajectories. ii) Generate the weighting function w(c), which acts as a nonlinear scaling function (FIGURE 3 (b2)). iii) Apply the nonlinear scaling w(c) on the x-axis (cost), depending on the relative distance from the minimum cost (point a in FIGURE 3 (a)), to determine the inward shift (towards a). Point a remains stationary and point b experiences the maximum shift, hence the term min-skew. iv) The expected cost after min-skew is acquired by averaging over the skewed distribution k′(c) normalized by the total shift. Some candidate weighting functions w(c) for the min-skew case are given in (27), where α and the remaining parameters are application-specific.

2) Max-skew sampling
In contrast to min-skew, in the pessimistic consideration, preference is given to limiting the worst-case scenarios by skewing the expected path cost in the direction of the maximum (max-skew). In max-skew, paths with higher costs are assigned higher weightage and vice versa. The control action that pertains to the minimum expected cost after max-skew is chosen. The effect of max-skew is portrayed in FIGURE 3 (b2) and (c2). Some candidate weighting functions w′(c) for the max-skew case are given in (28). The use of (28) as the time-dependent weighting function can be contemplated by considering two extreme cases. 1) When the PBCN contains no uncertainty (i.e., it is nothing but a BCN), it has zero entropy (H = 0). The weighting function in (28) comes out to be 1 for the path corresponding to the minimum cost and 0 for all the remaining paths. This is precisely the behavior desired, as for deterministic systems the path of minimum cost can be traversed with certainty. 2) The PBCN has extreme uncertainty embedded in its structure when all the network selection probabilities are equal, i.e., c_i^(j) = 1/|F_i| for all i, j. In this case the entropy assumes its largest possible value H = 1, and all paths are weighted equally, as expected.
The approximation of the desirability over S samples using entropy-based improved MC sampling, which introduces a bias towards the minimum, is specified as

z_t(X̄) ≈ Σ_{s=1}^{S} w_{s,t}(X̄) exp( − q̄_{s,t→t_f} ) / Σ_{s=1}^{S} w_{s,t}(X̄). (29)

Once the desirability of an augmented state is estimated using (29), we propose a method to determine the corresponding discrete optimal action in the next section.
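The weighted estimate can be sketched as a self-normalized average in which lower-cost paths receive larger weights; the exponential weight shape and the parameter α below are illustrative stand-ins for the paper's candidate weighting functions:

```python
import math

def weighted_desirability(path_costs, weight_fn):
    """Self-normalized weighted estimate: sum_s w_s exp(-cost_s) / sum_s w_s."""
    weights = [weight_fn(c, path_costs) for c in path_costs]
    total_w = sum(weights)
    return sum(w * math.exp(-c) for w, c in zip(weights, path_costs)) / total_w

def min_skew_weight(c, costs, alpha=5.0):
    """Illustrative min-skew-style weight: decays with the path's relative
    position between the minimum and maximum sampled costs (alpha assumed)."""
    c_min, c_max = min(costs), max(costs)
    if c_max == c_min:
        return 1.0
    r = (c - c_min) / (c_max - c_min)   # relative cost in [0, 1]
    return math.exp(-alpha * r)

costs = [1.0, 2.0, 5.0, 1.2]
z_weighted = weighted_desirability(costs, min_skew_weight)
z_plain = sum(math.exp(-c) for c in costs) / len(costs)
```

Relative to the plain MC average, the weighted estimate is pulled towards the low-cost (high-desirability) paths, which is the intended optimistic bias.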

V. ALGORITHMS AND GPU-BASED IMPLEMENTATION

A. OPTIMUM DISCRETE INPUT FROM OPTIMUM DESIRABILITY FUNCTION
The augmented state space desirability function corresponding to the optimal continuous control input is obtained in (24). As a result, the continuous input must be linked to the discrete inputs of the original PBCN. For each state, the optimal discrete input is determined by calculating the maximum desirability across the set of augmented states X̄ ∈ X̄ for the given state X_I ∈ X. The procedure consists of the following steps: 1) The vector of the optimal desirability function of a given state X_I with the passive dynamics P̄ is computed as

z_t(X_I) = exp(−q̄(X_I)) G_z(X_I), (30)

where z_t(X_I) = [z(X_I, U_1), · · · , z(X_I, U_{2^m})]^T, G_z(X_I) = [G_z(X_I, U_1), · · · , G_z(X_I, U_{2^m})]^T, and exp(−q̄(X_I)) is a diagonal matrix with the 2^m elements exp(−q̄(X_I, U_K)). 2) The optimum discrete input U* ∈ U for state X_I is

U*(X_I) = argmax_{U_K∈U} z(X_I, U_K). (31)

In summary, the augmented dynamics incorporate the corresponding distributions for all possible state transitions under the feasible control inputs.
The augmented state that gives the optimal desirability for state X_I is obtained by iterating over all possible future augmented states. This automatically includes the state transitions from state X_I under all inputs, along with the induced control costs. Therefore, the augmented state with the optimal desirability for X_I corresponds to the optimal discrete input U*.
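The argmax step is simple enough to state directly; the desirability values in the table below are arbitrary illustrative numbers:

```python
def optimal_input(z_aug, X_I, inputs):
    """Pick the input U maximizing the desirability of the augmented state
    (X_I, U); z_aug maps (state, input) tuples to desirability values."""
    return max(inputs, key=lambda U: z_aug[(X_I, U)])

# Illustrative desirability table for 2 states and 2 inputs (assumed values).
z_aug = {(1, 1): 0.2, (1, 2): 0.7, (2, 1): 0.9, (2, 2): 0.1}
U_star = optimal_input(z_aug, X_I=1, inputs=[1, 2])
```

For state 1 the second input wins, for state 2 the first; repeating this per state recovers a discrete feedback law from the continuous-input solution.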

B. SCALABLE CONTROL ALGORITHM
To avoid the problems associated with a matrix-based implementation, a couple of algorithms are suggested to run in conjunction, which yields the optimal control solution for large PBCNs. The matrix-free evaluation of the next state is obtained from Algorithm 1. In this algorithm, we use the predictor-function selection probability of each gene and compare it with a randomly generated probability. Based on this comparison of probabilities, the expression level of each gene is determined.
Return {x_1(t + 1), · · · , x_n(t + 1)}

Algorithm 1 is used to generate the trajectories starting from the given initial state X_I up to the terminal time t_f, while the PBCN optimal control input U* is evaluated using Algorithm 2. Here, r_u(arg1, arg2) generates a random sequence of length arg1 by sampling from a uniform distribution over 0 to (arg2 − 1). In Algorithm 2, we translate the objective and initialize the cost vectors and the state. For each of the S samples, the path cost of the corresponding trajectory is calculated using the function presented in Algorithm 3.

Algorithm 2 Information-theoretic control of PBCNs
Result: Optimal control input U*
Initialization: m, n, number of samples S, cost vectors for all genes and control inputs, t_f, initial state X_I(0)
4. Calculate the path cost q̃_{t→t_f} for S samples by Sequential_Cost_Compute(·) (Algorithm 3) or Parallel_Cost_Compute(·) (FIGURE 5)
5. Evaluate the desirability function z̃(X_I(t), U_K(t)) using (24) for MC sampling and (29) for entropy-based improved MC sampling
6. Calculate U*(X_I(t)) using (31)

Algorithm 3 Sequential cost computation over S samples
function Sequential_Cost_Compute(S, U_K, t)
1. for (s = 1 to S)
2. Generate control sequence U_s ← {U_K, r_u(t − 1, 2^m)}
3. Generate network sequence X_s as:
4. for (U_r in U_s)
5. Calculate X_s(r) for U_r using Algorithm 1
6. Evaluate path cost q̃_{t→t_f}(s) ← Σ_{τ=t}^{t_f−1} q(X_s(τ))
end function
Return cost vector q̃_{t→t_f}

Even though Algorithms 1–3 together allow the evaluation of optimal control inputs for large PBCNs, the sample size required for a reasonable approximation of the desirability function grows rapidly. Consequently, the time taken to calculate the costs of the sampled trajectories grows significantly. This problem is resolved by incorporating a variation of Algorithm 3 in the parallel computation framework, as discussed in the following section.
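The sequential sampling loop can be sketched as follows. This is a hedged illustration, not the paper's implementation: the exponential averaging used for the desirability estimate stands in for equation (24), and the greedy selection over inputs stands in for (31); the temperature `lam`, the generic `step` function, and the encoding of control inputs as integers in {0, …, 2^m − 1} are all assumptions.

```python
import math
import random

def path_cost(x0, u_seq, step, state_cost):
    """Accumulate q(X(tau)) along one sampled trajectory (Algorithm 3 style)."""
    x, cost = x0, 0.0
    for u in u_seq:
        cost += state_cost(x)
        x = step(x, u)            # matrix-free next state (e.g. via Algorithm 1)
    return cost

def mc_desirability(x0, u0, S, horizon, step, state_cost, lam=1.0, n_ctrl=2):
    """MC estimate of the desirability of applying u0 now (in the spirit of (24))."""
    total = 0.0
    for _ in range(S):
        # the first input is fixed to u0; the remainder are sampled uniformly,
        # mirroring r_u(. , 2^m) in Algorithm 3
        u_seq = [u0] + [random.randrange(n_ctrl) for _ in range(horizon - 1)]
        total += math.exp(-path_cost(x0, u_seq, step, state_cost) / lam)
    return total / S

def optimal_input(x0, S, horizon, step, state_cost, n_ctrl=2):
    """Pick the input with the highest estimated desirability (cf. (31))."""
    return max(range(n_ctrl),
               key=lambda u: mc_desirability(x0, u, S, horizon, step,
                                             state_cost, n_ctrl=n_ctrl))
```

Because each sample is drawn independently, the inner loop of `mc_desirability` is exactly the part that the next section moves onto the GPU.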

C. GPU BASED PARALLEL IMPLEMENTATION
Because of its programmable improvements [48], the graphics processing unit (GPU) has found utility in domains other than gaming. NVIDIA, one of the most popular GPU manufacturers, offers the compute unified device architecture (CUDA), a collection of resources for multi-threaded applications that perform multiple tasks in parallel. FIGURE 4 shows the generalized architecture for GPU-based parallel processing. The GPU is a co-processor for the CPU with its own dynamic random access memory (DRAM) [49]. In addition to executing a large number of tasks in parallel, CUDA organizes threads into logical blocks, each of which maps onto a multiprocessor. Because the number of threads a block can handle is limited, the blocks are organized into grids so that a large number of threads can run at the same time without communicating. Through PI, the desirability function is evaluated by estimating the costs of the paths. The fact that each path's cost is calculated independently of the others lends itself to the parallel computing architecture. The numbers of threads, blocks, and grids used in the CUDA architecture are decided by the number of samples needed to approximate the desirability function. The GPU efficiently executes multiple instances of the cost calculation task, greatly reducing the computational time. Threads are the basic components of computation executed on the cores of the GPU in the CUDA architecture. As shown in FIGURE 5, each thread, identified by the thread id s, independently computes the cost q̃_{s,t→t_f} for one sample. Starting from an initial state, each thread evaluates a trajectory and its path cost, as shown in the right part of FIGURE 5. The resulting path cost is then transferred to the CPU by each thread, and the CPU further executes instructions to obtain a desirability function estimate based on the available sample costs.
The task division between the GPU and the CPU allows the system memory to be refreshed after each cycle of path cost calculation and desirability function estimation. As a result, for accurate estimation of the desirability function, a much larger number of samples than the system's memory handling capacity can be used.
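The per-sample independence that makes the GPU mapping possible can be illustrated with a CPU thread pool; this is only an analogue of the CUDA scheme, not the CUDA implementation itself. Each worker plays the role of one GPU thread computing a single sample's path cost, and the host-side reduction into the desirability estimate happens after the pool returns. The `step` and `state_cost` callables and the per-sample seeding scheme are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_cost(args):
    """One worker's job: roll out one trajectory and return its path cost,
    mirroring one CUDA thread computing q_{s, t->t_f} for sample id s."""
    seed, x0, horizon, step, state_cost, n_ctrl = args
    rng = random.Random(seed)          # independent random stream per sample
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = rng.randrange(n_ctrl)      # uniformly sampled control input
        cost += state_cost(x)
        x = step(x, u)
    return cost

def parallel_costs(x0, S, horizon, step, state_cost, n_ctrl=2, workers=8):
    """Compute S independent path costs in parallel; the caller then reduces
    them into the desirability estimate on the 'host' side."""
    jobs = [(s, x0, horizon, step, state_cost, n_ctrl) for s in range(S)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sample_cost, jobs))
```

On the real GPU, the reduction `(1/S) Σ_s exp(−q̃_s/λ)` over the returned costs is exactly the desirability-function step performed on the CPU after each batch of samples.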

VI. RESULTS AND DISCUSSION
Some of the existing results from the literature are compared with the solution obtained using the technique proposed in this paper. The first is a two-gene artificial PBCN from [28] that was solved for optimal control using the conventional dynamic programming method. In the second illustrated case, the polynomial optimization-based approximate solution approach for a three-gene example from [50] is compared. Furthermore, we demonstrate the efficacy of entropy-based improved sampling in this scenario. After establishing the validity of the results, the technique is applied to biological network optimal control problems. The WNT5A biological network, which has seven genes and one control gene, is the first biological network to be considered. A T-cell signaling network with 37 genes and three inputs is used to show the method's effectiveness for large systems. Furthermore, for this example, a GPU-based parallel implementation is used. The Python programming language is used to perform all of the simulations. The parallel implementation algorithm is executed on a Google Colab GPU with a maximum virtual RAM of 12.72GB and a maximum disk space of 68.40GB.

1) Artificial 2-gene network
We match the solution of our proposed method with the results from [28], where the problem is solved using the classical dynamic programming approach for the two-gene artificial PBCN given in Example 1. In this case, the terminal penalties associated with states X_1, X_2, X_3 and X_4 are assumed to be 0, 1, 2 and 3, respectively. No cost is associated with the states at intermediate times. A control cost of 1 is levied whenever control U_2 is applied. The control objective with t_f = 5 is the Bellman recursion

J_t(X(t)) = min_{U(t)} E[ q(X(t), U(t)) + J_{t+1}(X(t + 1)) ], ∀t ∈ {0, 1, 2, 3, 4},
J_5(X(5)) = φ(X(5)).

The terminal costs and intermediate costs for augmented states are
For the augmented state space, the passive dynamics are derived in Example 1, and the optimal control problem is solved using (23) and (31). Algorithms 1 and 2 are employed with sequential computation to arrive at the optimal control solution. MC sampling is used to estimate the desirability functions of the states. The control sequence obtained is u(t) = 0 ∀t ∈ {1, 2, 3, 4}, ∀X(t) ∈ {1, 2, 3, 4}, with u(5) = 0 for X(5) ∈ {1, 2, 4} and u(5) = 1 for X(5) = 3, which matches exactly with the result in [28].
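The classical dynamic programming baseline from [28], against which the result above is matched, can be sketched as a backward Bellman recursion over a small state space. The transition matrices and costs below are hypothetical placeholders, not those of Example 1; the sketch only shows the recursion structure.

```python
import numpy as np

def finite_horizon_dp(P, q, phi, tf):
    """Backward Bellman recursion J_t(x) = min_u E[q(x,u) + J_{t+1}(X')],
    with terminal condition J_{tf} = phi.

    P   : P[u] is the N x N transition matrix under input u
    q   : q[u] is the length-N running-cost vector under input u
    phi : length-N terminal-cost vector
    """
    J = np.asarray(phi, dtype=float)
    policy = []
    for _ in range(tf):
        # cost-to-go of each input in each state, shape (n_inputs, N)
        Q = np.stack([q[u] + P[u] @ J for u in range(len(P))])
        policy.append(Q.argmin(axis=0))   # greedy input per state
        J = Q.min(axis=0)
    policy.reverse()                      # policy[t][x] = optimal input at time t
    return J, policy
```

This recursion stores the value function for every state at every stage, which is precisely the exponential bookkeeping that the sampling-based PI approach avoids.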

2) Artificial 3-gene network
The Boolean functions and corresponding transition probabilities for the PBCN under consideration are given in (32). The optimal control problem is formulated such that the expression of gene x_3 is deregulated at the end of the treatment horizon. This objective translates to finding the control input that minimizes the cost (33) in the finite-horizon (t_f = 2) case.
The optimal solution is achieved by using MC sampling in Algorithms 1 and 2, and the control actions obtained for all states are a(0) = 1, a(1) = 1, which match the results given in [50]. The average costs, starting from all states, over 32000 simulation epochs are depicted in the corresponding figure. For the given treatment window, the disease should be treated using the drug based on the observations and time, referring to (34). The min-skew sampling is employed to solve this problem, and the cost distribution for state 2 is provided in FIGURE 7. A time- and path-dependent weighting function is used in this case. The comparison shows the improvement in the average cost for all states after incorporating improved MC sampling.

3) Biological WNT5A network
The biological network related to melanoma contains seven genes: WNT5A, pirin, S100P, RET1, MART1, HADHB, and STC2. The concentration level of each gene is given by 0 (low concentration) or 1 (high concentration). Because the gene pirin serves as the control input u, we may generate a BCN with dynamics f_d^(i) for all genes x_i as follows: pirin: x_1 = ¬x_5, MART1: x_4 = ¬x_6 ∨ a, S100P: x_2 = ¬x_6, HADHB: ... The following update rule transforms this BCN into a PBCN by introducing random perturbations: each gene takes the value f_d^(i)(X(t), u(t)) with probability p, the value 0 with probability (1 − p)/2, and the value 1 with probability (1 − p)/2.

WNT5A expression is directly correlated with the development of melanoma. As a result, the control goal is to stop WNT5A from expressing at the end of a specified time horizon t_f of the dynamic system's evolution. Assigning the cost q(x_1) = 1 for WNT5A and q(x_i) = 0 for all other genes, as well as the terminal cost q_{t_f}(x_1) = 10 for WNT5A and q_{t_f}(x_i) = 0 for all other genes, yields the control objective in mathematical form. The optimal solution is obtained for two sets of function selection probabilities and random perturbations for the starting state. From this result we can clearly see that, for a high degree of uncertainty, the optimal choice is to avoid pirin expression. For all initial states and a selection probability of p = 0.8, FIGURE 8 shows the estimated cost with and without the application of control.

Because evaluating the transition probability matrix is impractical in this case, we utilize the matrix-free technique to simulate the dynamics of the Boolean network (Algorithm 1), followed by the information-theoretic approach to determine the optimal solution. Using the suggested Algorithm 2, results are compared between the sequential implementation and the GPU-based parallel implementation. The control inputs and the corresponding trajectory starting from the initial state X_I(0) = 98024258941 with 4.096 × 10^8 samples are given as follows. The state encountered at the terminal time t_f is 5941237809, and its Boolean equivalent is given by 00000101100010001000000001100000110001. It can be clearly observed from the Boolean representation of the state at t_f = 5 that the gene expression for Calcin (x_3), DAG (x_7), NFkB (x_23), Ras (x_29), and Rlk (x_31) is downregulated (shown by a bar over the Boolean number).
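The random-perturbation update used to turn the deterministic BCN into a PBCN can be sketched directly from its definition: each gene follows its Boolean function with probability p and is otherwise forced to 0 or 1 with probability (1 − p)/2 each. The function name and the list-of-callables encoding below are illustrative, not the paper's code.

```python
import random

def perturbed_update(x, u, boolean_funcs, p, rng=random):
    """One PBCN step with random perturbation probability (1 - p).

    x             : tuple of current gene values (0/1)
    u             : tuple of control values (0/1)
    boolean_funcs : boolean_funcs[i] is a placeholder f(x, u) -> 0/1
                    for the deterministic dynamics f_d^(i)
    """
    nxt = []
    for f in boolean_funcs:
        r = rng.random()
        if r < p:                         # follow the Boolean function
            nxt.append(f(x, u))
        elif r < p + (1 - p) / 2:         # forced to 0
            nxt.append(0)
        else:                             # forced to 1
            nxt.append(1)
    return tuple(nxt)
```

Setting p = 1 recovers the deterministic BCN, which gives a quick sanity check on the branching probabilities.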
According to TABLE 1, the time needed to execute the proposed algorithm in a sequential manner increases dramatically with the number of samples (Column 2), whereas the time required to accomplish the same operation using GPU-based parallel processing increases only moderately with the number of samples (Column 3).

B. DISCUSSION
Despite the richness and elegance of the dynamic programming (DP) approach given in [28], solving the Bellman equation for most practical problems is computationally intractable. This is because the value function must be recorded for each state, and the number of states in PBCNs increases exponentially. As a result, a number of strategies based on solving the Bellman equation approximately have emerged. Policy iteration [34], value iteration [46], [53], and Q-learning [35], [54] are all common approximation approaches in the reinforcement learning field. Policy iteration has been shown to be superior to value iteration in that it delivers the optimal stationary policy in a finite number of steps, whereas value iteration may require an infinite number of steps to converge [46]. Q-learning [54] is another such technique, and its learning time grows as the number of genes increases. As a consequence, the authors remark that the Q-learning technique might not always scale to large networks [54]. Furthermore, some of the approaches are matrix-based [34]; they operate directly on matrices and are therefore not suitable for large PBCNs due to memory constraints. The fundamental advantage of our proposed framework is that, given the cost function, the minimization of the objective can be performed analytically. We derive the linear Bellman equation, which allows for forward-in-time simulation with parallel computation of the cost function evaluation; this is not possible with iterative approaches because of the interdependence between iterations. Our method could readily be extended to a model-free approach, in which the transition probabilities are learned from data rather than calculated from known network dynamics.
The effectiveness of optimal control methods proposed in the literature was validated through the use of biological networks such as the 7-gene WNT5A network [29], [33], [46], [53], the 13-gene (9 state, 4 control) ARA OPERON network [34], and the 8-gene artificial network [55], among others. Neither approximation nor analytical PBCN optimal control approaches have been validated using a large biological network of 40 genes (37 states, three control), to the best of the authors' knowledge. The results of applying information-theoretic PBCN optimal control to a T-cell signaling network demonstrate the scalability of the method with large PBCNs.

VII. CONCLUSION
For PBCNs set in the traditional MDP form, an information-theoretic optimal control formulation is developed by adopting stochastic optimal control theory. The nonlinear control problem is transformed into a linear problem using the proposed solution method. The resulting formulation is solved analytically using a matrix-free technique, allowing large systems to be handled. The suggested framework's scalability is validated through the use of GPU-based parallel processing for rapid estimation of the desirability function. The methodology described is general and can be