Kullback–Leibler Control in Boolean Control Networks

This article addresses the Kullback–Leibler (KL) control problem in Boolean control networks. In the considered problem, an extended stage cost function depending on the control inputs is introduced; in contrast, the stage cost of the conventional KL control problem in the Markov decision process cannot take the control inputs into consideration. An associated Bellman equation and a matrix-based iteration algorithm are presented. The theoretical analysis shows that the proposed KL control results in an approximated form of conventional dynamic programming (DP). Furthermore, a convergence analysis is presented for the weight parameter converging to zero and diverging to infinity. In practical application examples, the conventional DP and the proposed KL control are compared.


I. INTRODUCTION
Logical dynamic systems have been employed for modeling complicated dynamical behavior, such as biological systems [1], combustion engines [2], and transportation systems [3]. The mathematically tractable structure of Boolean control networks (BCNs) [4] has resulted in their widespread use. After the development of the semi-tensor product (STP) technique [5], [6], Boolean networks have been analyzed in terms of their matrix-based expressions. With respect to BCNs, theoretical concepts similar to those of continuous-valued systems and the related analyses have been developed: the stabilization problem [7], estimation problem [8], [9], decoupling problem [10], robustness analysis [11], [12], conservation laws [13], Lyapunov functions [14], [15], and reinforcement learning [16]. To satisfy various control objectives, several types of control structures have been reported. A common closed-loop structure has been frequently used for stabilization: stabilization under random switching [17] and pinning control [18], [19]. Event-based [20], [21] and sampled-data [22], [23] control schemes, which have been explored in continuous-valued systems analysis, have also been developed for BCNs. In contrast to these settings, a time-varying feedback law is exploited in the optimal control formulation [16], [24], as discussed in this article.
In practical systems, as there exist preferred and undesired states [25], control schemes that avoid forbidden states in biological systems [26] or obstacles in robotic systems [27] have been discussed. In these reports, a feedback law is provided that makes the transition probabilities to the forbidden states equal to zero; this technique is similar to dynamic programming (DP) [28] with a modified stage cost taking the value of infinity, which is discussed in this article. However, the aforementioned schemes are specialized to make the transition probabilities zero, and they cannot quantitatively evaluate the probabilistic transitions to preferred and undesired states. The development of a control scheme that quantitatively addresses the transition probabilities while minimizing a given objective function via optimal control is still an open and challenging issue.
Based on the background summarized above, in this study, the Kullback–Leibler (KL) divergence, which evaluates the similarity of two given probability distributions, is used, and the KL control problem formulation [29] is applied to BCNs with an extended stage cost function that takes the control input into consideration.
In the conventional Markov decision process (MDP), it is assumed that the KL control can control the transition probabilities directly. This assumption holds in various trajectory planning problems, such as the maze game (see [30]) and the trajectory planning of robots (see [27]), wherein the control input of the problem is the state transition itself. In BCNs, the state transition depends on the system structure, and the relationship between an applied input and the resulting state transition probability is complicated; therefore, a theoretical analysis and a practical code implementation for BCNs become difficult tasks. To the best of the authors' knowledge, optimal control problems with an extended stage cost function taking the control input into consideration, together with the corresponding Bellman equation, have not been explored and remain an open issue. In addition, although the existing KL control is successfully implemented in a matrix-based form, which is computationally efficient in recent programming languages, a matrix-based expression of the aforementioned formulation, including the extended stage cost function, has yet to be developed.
The contribution of this article is broadly summarized as follows.
1) The optimal control problem with the KL divergence and an extended stage cost function depending on the control input is addressed; in contrast, the existing KL control (see [27], [29]) has a stage cost that depends only on the state. That is, the considered problem has a higher modeling capacity. This article presents the corresponding Bellman equation and the matrix-based iteration of the proposed algorithm.
2) The theoretical analysis provides the new insight that the solution of the aforementioned KL control problem results in the conventional DP modified by replacing the max and arg max operations with an extended log-sum-exp-based approximation function and an extended softmax function, respectively. Furthermore, theoretical support for the following two limiting cases is given.
a) If the weight parameter in the KL control scheme converges to zero, the value function and optimal control input obtained using the KL control converge to those of the conventional DP.
b) If the weight parameter in the KL control scheme diverges to +∞, the value function and optimal control input obtained using the KL control converge to those of the desired transition probabilities given for the KL divergence.
II. PRELIMINARIES

The notations used in this article are summarized as follows.
1) For a matrix A ∈ R^{m×n}, its (i, j) element is denoted by [A]_{i,j} or A_{i,j}. The m × n zero matrix is denoted by O_{m×n}. The n × n identity matrix is denoted by I_n, and its ith column vector is denoted by δ^i_n. The set of these columns is denoted by Δ_n = {δ^i_n, i = 1, …, n}.
2) A matrix Diag(a) ∈ R^{n×n} calculated from a vector a ∈ R^n is a diagonal matrix, the (i, i) element of which is a_i.
3) A matrix A ∈ R^{m×n} with all its column vectors belonging to Δ_m is called a logical matrix. L_{m×n} is the set of all m × n logical matrices.
4) The Boolean domain, which comprises T (True) and F (False), is denoted by D = {T = 1, F = 0}. T and F are identified with δ^1_2 and δ^2_2, respectively.
5) The STP [6] of matrices A ∈ R^{m×n} and B ∈ R^{p×q} is defined as A ⋉ B = (A ⊗ I_{s/n})(B ⊗ I_{s/p}), where ⊗ is the Kronecker product and s is the least common multiple of n and p. This article omits the "⋉" in A ⋉ B for notational simplicity.
6) 1_n ∈ R^n is the column vector whose elements are all one.
7) S_n = {p ∈ R^n | p_i ≥ 0 (i = 1, …, n), 1_n^⊤ p = 1} is the unit simplex.
8) For a vector a ∈ R^n, max(a) and min(a) denote its maximum and minimum elements.
9) For vectors a, b ∈ R^n, a ≤ b indicates that a_i ≤ b_i for each i = 1, …, n.
10) In this article, the argument x ∈ R ∪ {−∞} of the exp operation is considered, and exp(−∞) = 0 is defined. In addition, 0 × (+∞) = 0 and 0 log 0 = lim_{x↓0} x log x = 0 are set formally, which makes the function x log x continuous and convex on [0, +∞).
11) In this article, mathematical operations for matrices and vectors are defined as elementwise operations; for example, [exp(A)]_{i,j} = exp(A_{i,j}).
12) ⊙ and ⊘ are the elementwise product and division of matrices, respectively.
13) P(A) is the probability of an event A, and P(A|B) is the conditional probability of A under a condition B. E[X] is the expectation of the stochastic variable X, and E[X|B] is the conditional expectation of X under B.
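The STP of item 5) can be sketched directly from its definition; the snippet below is an illustrative NumPy implementation (not the library form used in the STP literature) and checks it against the structure matrix of Example 1, introduced subsequently.

```python
import numpy as np
from math import lcm

def stp(A, B):
    # semi-tensor product: (A ⊗ I_{s/n}) (B ⊗ I_{s/p}), s = lcm(n, p)
    n, p = A.shape[1], B.shape[0]
    s = lcm(n, p)
    return np.kron(A, np.eye(s // n)) @ np.kron(B, np.eye(s // p))

def delta(i, n):
    # canonical basis column delta_n^i as an (n, 1) array
    e = np.zeros((n, 1)); e[i - 1, 0] = 1.0
    return e

# structure matrix delta_4[2,2,4,4,3,4,3,4] of Example 1
M = np.hstack([delta(j, 4) for j in [2, 2, 4, 4, 3, 4, 3, 4]])

# x_{k+1} = M u_k x_k: input delta^1_2 drives delta^1_4 to delta^2_4
x_next = stp(stp(M, delta(1, 2)), delta(1, 4))
assert np.array_equal(x_next, delta(2, 4))
# for conformable matrices, the STP reduces to the usual matrix product
assert np.allclose(stp(M, np.eye(8)), M)
```

Note that the dimension-matching identities are handled automatically by the Kronecker factors, which is the design point of the STP.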
Furthermore, the approximation of the max operation performed by the log-sum-exp function is summarized here. The log-sum-exp function LSE(x, μ) = μ log(1_n^⊤ exp(x/μ)) for x ∈ R^n and μ > 0 (see [31, Example 10.45]) has been used in various research fields. Based on the definition of the log-sum-exp function, the extended log-sum-exp function is introduced as

eLSE(x, μ, p) = μ log(p^⊤ exp(x/μ)), p ∈ S_n

together with the extended softmax function

esoftmax(x, μ, p) = (p ⊙ exp(x/μ)) ⊘ (p^⊤ exp(x/μ))

where a simple calculation provides the latter from the former, thus indicating that esoftmax(x, μ, p) is the partial derivative of eLSE(x, μ, p) with respect to the variable x on R^n.

Theorem 1: The extended log-sum-exp function eLSE satisfies the following inequalities.
1) For arbitrary x, x′ ∈ R^n,

eLSE(x′, μ, p) ≥ eLSE(x, μ, p) + esoftmax(x, μ, p)^⊤ (x′ − x) (1)

and

eLSE(x′, μ, p) ≤ eLSE(x, μ, p) + esoftmax(x, μ, p)^⊤ (x′ − x) + (1/(2μ)) ‖x′ − x‖². (2)

2) If min(p) > 0, for an arbitrary x ∈ R^n,

eLSE(x, μ, p) ≤ max(x) and eLSE(x, μ, p) ≥ max(x) + μ log(min(p)). (3)

3) For an arbitrary x ∈ R^n,

eLSE(x, μ, p) ≤ p^⊤ x + (1/(2μ)) ‖x‖² and eLSE(x, μ, p) ≥ p^⊤ x. (4)

Proof: Item 1): The inequalities are obtained by evaluating the Hessian of eLSE(x, μ, p) with respect to x directly. That is, with s = esoftmax(x, μ, p),

∇²_x eLSE(x, μ, p) = μ^{−1} (Diag(s) − s s^⊤). (5)

For an arbitrary vector ξ ∈ R^n, the quadratic form is calculated as

ξ^⊤ ∇²_x eLSE(x, μ, p) ξ = μ^{−1} ( Σ_{i=1}^n s_i ξ_i² − (Σ_{i=1}^n s_i ξ_i)² ). (6)

The following Cauchy–Schwarz inequality holds:

(Σ_{i=1}^n s_i ξ_i)² = (Σ_{i=1}^n √s_i · √s_i ξ_i)² ≤ (Σ_{i=1}^n s_i)(Σ_{i=1}^n s_i ξ_i²) = Σ_{i=1}^n s_i ξ_i² (7)

where the last equality uses Σ_{i=1}^n s_i = 1. On applying (7) to the quadratic form (6), the quadratic form becomes nonnegative, which indicates that the Hessian is positive semidefinite. Therefore, the convexity of eLSE(x, μ, p) is demonstrated, and the resulting convex inequality (1) follows. As the second term s s^⊤ of the Hessian (5) is positive semidefinite, the largest eigenvalue of the Hessian (5) is upper bounded by the largest eigenvalue of the first diagonal matrix, which means that μ^{−1} max(p ⊙ exp(x/μ) ⊘ (p^⊤ exp(x/μ))) ≤ μ^{−1}; the equivalence of the largest-eigenvalue bound and Lipschitz smoothness (see [32, Sec. 2.1]) results in inequality (2) of Item 1).
Item 2): The flow is similar to that of the conventional log-sum-exp function ([31, Example 10.45]). The first inequality is given by

eLSE(x, μ, p) = μ log(p^⊤ exp(x/μ)) ≤ μ log(exp(max(x)/μ) p^⊤ 1_n) = max(x)

where the last equality uses p^⊤ 1_n = 1 on recalling p ∈ S_n. On using i* ∈ arg max_{i=1,…,n} x_i, the following inequality results in the second inequality:

eLSE(x, μ, p) ≥ μ log(p_{i*} exp(x_{i*}/μ)) ≥ max(x) + μ log(min(p)).

Item 3): Substituting x′ = x and x = 0_n into (2) results in the first inequality of (4). On recalling p ∈ S_n and the convexity of exp, the Jensen inequality indicates that p^⊤ exp(x/μ) ≥ exp(p^⊤ x/μ), which results in the second inequality of (4).

Remark 1: The conventional log-sum-exp function LSE(x, μ) = μ log(1_n^⊤ exp(x/μ)) coincides with the extended log-sum-exp function eLSE(x, μ, p) with p = 1_n/n ∈ S_n up to the additive constant μ log n, where n is the dimension of x.
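The functions eLSE and esoftmax and the bounds of Theorem 1 can be checked numerically; the following sketch is an illustrative NumPy restatement (not code from the article) evaluated at random test points.

```python
import numpy as np

def eLSE(x, mu, p):
    # extended log-sum-exp: mu * log(p^T exp(x/mu)), p on the unit simplex
    return mu * np.log(p @ np.exp(x / mu))

def esoftmax(x, mu, p):
    # p * exp(x/mu), normalized; the gradient of eLSE with respect to x
    w = p * np.exp(x / mu)
    return w / w.sum()

rng = np.random.default_rng(0)
n, mu = 5, 0.7
p = rng.random(n); p /= p.sum()
x = rng.normal(size=n)

s = esoftmax(x, mu, p)
v = eLSE(x, mu, p)
# Theorem 1, Item 2): max(x) + mu*log(min(p)) <= eLSE <= max(x)
assert x.max() + mu * np.log(p.min()) <= v <= x.max()
# Theorem 1, Item 3): p^T x <= eLSE <= p^T x + ||x||^2/(2 mu)
assert p @ x <= v <= p @ x + (x @ x) / (2 * mu)
# esoftmax is a point on the simplex and the gradient of eLSE
assert abs(s.sum() - 1.0) < 1e-12
eps = 1e-6
e0 = np.zeros(n); e0[0] = eps
fd = (eLSE(x + e0, mu, p) - eLSE(x - e0, mu, p)) / (2 * eps)
assert abs(fd - s[0]) < 1e-5
```

The same quantities are available in SciPy as `scipy.special.logsumexp` with its weight argument; the plain NumPy form above is kept for self-containedness.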

III. PROBLEM FORMULATION
This article is focused on BCNs with the state update represented as follows:

x_i(k+1) = f_i(u_1(k), …, u_{n_u}(k), x_1(k), …, x_{n_x}(k)), i = 1, …, n_x (8)

where x_i(k) ∈ D are the state variables, u_l(k) ∈ D are the control variables, and f_i : D^{n_u+n_x} → D are given Boolean functions. The state and control variables are expressed in the STPs as x_k = x_1(k) ⋉ ⋯ ⋉ x_{n_x}(k) ∈ Δ_{2^{n_x}} and u_k = u_1(k) ⋉ ⋯ ⋉ u_{n_u}(k) ∈ Δ_{2^{n_u}}. Although conventional studies on BCNs have addressed the deterministic control input u_k ∈ Δ_{2^{n_u}}, this study takes into consideration a randomized control input and the corresponding conditional probabilities of u_k under a given state x_k at the kth step as follows:

c_{k,i,l} = P(u_k = δ^l_{2^{n_u}} | x_k = δ^i_{2^{n_x}}).

An initial state x_0 ∈ Δ_{2^{n_x}} is deterministically given, and the design problem of c_{k,i,l} is addressed. It should be noted that c_{k,i} = [c_{k,i,1}, …, c_{k,i,2^{n_u}}]^⊤ ∈ S_{2^{n_u}}, which indicates that c_{k,i} should be a point on the unit simplex. Herein, c_{k,i,l} (k = 0, …, N − 1, i = 1, …, 2^{n_x}, l = 1, …, 2^{n_u}) is referred to as a selection probability, in analogy with probabilistic BCNs, using a concept similar to that of random state switching [33]. Then, the structure matrix M ∈ L_{2^{n_x} × 2^{n_u+n_x}} of the BCNs (8) uniquely exists and satisfies the state equation x_{k+1} = M u_k x_k [6]. Although x_k depends on the design and stochastic behavior of the randomized control, it is simply denoted by x_k for notational simplicity herein.

Next, a desired transition probability P̄(x_{k+1}|x_k) is given, and the difference between the desired and actual transition probabilities is introduced as the KL divergence

KL(P(·|x_k) ‖ P̄(·|x_k)) = Σ_{j=1}^{2^{n_x}} P(x_{k+1} = δ^j_{2^{n_x}} | x_k) log [ P(x_{k+1} = δ^j_{2^{n_x}} | x_k) / P̄(x_{k+1} = δ^j_{2^{n_x}} | x_k) ]. (9)

The objective function is the sum of the objective function of the conventional optimal control problem and the KL divergence

min E[ h(x_N) + Σ_{k=0}^{N−1} ( g_k(x_k, u_k) + μ KL(P(·|x_k) ‖ P̄(·|x_k)) ) ] (10)

where g_k and h are bounded, and the KL divergence depends on the selection probabilities c_{k,i} of u_k, the feasible sets of which are C_{k,i}. Compared with the conventional KL control [29], the KL divergence is introduced with a weight coefficient μ > 0, and the stage cost function g_k(x_k, u_k) is also extended to include the cost of the control input u_k, instead of the conventional form g_k(x_k). In the case of μ = 0 in Problem (10), which means that Problem (10) ignores the KL divergence, the problem can be solved using the conventional DP (see [28]).
As observed in Example 2, the KL divergence quantitatively evaluates the similarity between P and P̄, and it has a broader modeling capability compared with the conventional forbidden-state-based techniques [12], [26]. If it is preferred that the system be fixed at a target point or in a given set of states, similar to the stabilization problems (see [15], [18], [22]), the desired transition probabilities to the target points are set to large values. Throughout this article, the following simplified notations of the transition probabilities are used:

p_{k,i,j} = P(x_{k+1} = δ^j_{2^{n_x}} | x_k = δ^i_{2^{n_x}}), p̄_{k,i,j} = P̄(x_{k+1} = δ^j_{2^{n_x}} | x_k = δ^i_{2^{n_x}}). (11)

Furthermore, the following notations are introduced in this article:

inv(j|i) = {l ∈ {1, …, 2^{n_u}} | M δ^l_{2^{n_u}} δ^i_{2^{n_x}} = δ^j_{2^{n_x}}}

together with an element inv*(j|i, k) ∈ inv(j|i) minimizing the stage cost g_k(δ^i_{2^{n_x}}, δ^l_{2^{n_u}}) over l ∈ inv(j|i), and inv†(j|i, k) = inv(j|i) \ {inv*(j|i, k)}. In summary, inv(j|i) is the set of all the indices of the control inputs driving x_k = δ^i_{2^{n_x}} to x_{k+1} = δ^j_{2^{n_x}}, and l = inv*(j|i, k) is the cost-minimizing index among them. The optimality of inv*(j|i, k) [equivalently, the nonoptimality of inv†(j|i, k)] can be easily confirmed and is omitted herein owing to space limitations.

From the definition of inv(j|i), the union of inv(j|i) with respect to j = 1, …, 2^{n_x} satisfies

{1, …, 2^{n_u}} = ∪_{j=1}^{2^{n_x}} inv(j|i), and inv(j|i) ∩ inv(j′|i) = ∅ if j ≠ j′. (12)

Example 1: The following BCNs with a 2-D state and a single input are considered:

x_1(k+1) = x_1(k) ∧ u(k), x_2(k+1) = x_2(k) ∧ ¬u(k). (13)

The aforementioned BCNs have the structure matrix M = δ_4[2, 2, 4, 4, 3, 4, 3, 4]. In the case of the state x_k = δ^4_4, there are multiple inputs resulting in x_{k+1} = δ^4_4; for such cases, inv(4|4) = {1, 2}.

In Problem (10), the following assumption is made.

Assumption 1: In Problem (10):
1) The desired transition probability satisfies p̄_{k,i,j} = 0 whenever inv(j|i) = ∅.
2) The selection probability is given by c_{k,i,l} = p_{k,i,j} for l = inv*(j|i, k) and c_{k,i,l} = 0 for l ∈ inv†(j|i, k), for l = 1, …, 2^{n_u}, where j is the index satisfying M δ^l_{2^{n_u}} δ^i_{2^{n_x}} = δ^j_{2^{n_x}}. (14)

Remark 3: The second and third lines of the definition of c_{k,i,l}, which concern the case of l = inv*(j|i, k), are considered. These two cases are classified using the value of p̄_{k,i,j} as follows.
a) To avoid the zero-division issue, a transition probability corresponding to p̄_{k,i,j} = 0, which means that x_{k+1} = δ^j_{2^{n_x}} is a forbidden state, is excluded in the definition of the KL divergence (9). Instead, the feasible set C_{k,i} takes the forbidden state into consideration.
b) If p̄_{k,i,j} = 1, then c_{k,i,l} = p_{k,i,j} = 1, which is a redundant condition. Here, an index j_fixed is set subject to satisfying p̄_{k,i,j} = 0 for an arbitrary j ≠ j_fixed; Σ_{j=1,…,2^{n_x}} p̄_{k,i,j} = 1 then results in p̄_{k,i,j_fixed} = 1. In addition, the index l satisfying l = inv*(j_fixed|i, k) results in c_{k,i,l} = 1 on using case a) above because c_{k,i,l} = p_{k,i,j} = 0 for an arbitrary j ≠ j_fixed.

Example 2: The BCNs having a 2-D state and a single input of Example 1, the structure matrix of which is given by M = δ_4[2, 2, 4, 4, 3, 4, 3, 4], are considered. The equations M δ^1_2 δ^1_4 = δ^2_4 and M δ^2_2 δ^1_4 = δ^3_4 hold; that is, from x_0 = δ^1_4, the state cannot arrive at either x_1 = δ^1_4 or x_1 = δ^4_4. Here, Problem (10) with the simple setting of N = 1, g_0 = 0, h = 0, μ = 1, and x_0 = δ^1_4 is considered, that is, the minimization problem of KL(P(·|x_0) ‖ P̄(·|x_0)) is considered.
Case 1): The desired transition probabilities are given by P̄(x_1 = δ^2_4 | x_0) = P̄(x_1 = δ^3_4 | x_0) = 0.5.
Because the initial value x_0 = δ^1_4 is deterministic, the objective function no longer requires the expectation operation and is given by

KL(P(·|x_0) ‖ P̄(·|x_0)) = c_{0,1,1} log(c_{0,1,1}/0.5) + c_{0,1,2} log(c_{0,1,2}/0.5).

The optimal solution is c*_{0,1,1} = c*_{0,1,2} = 0.5, and the optimal objective function value is zero (the detailed derivation is omitted herein because the problem is a special case of Problem (26), introduced subsequently). In the conventional optimal control problems, the optimal control is obtained according to a deterministic law, which means that (c_{0,1,1}, c_{0,1,2}) = (0, 1) or (1, 0). On recalling that 0 log 0 is formally defined by lim_{x↓0} x log x = 0 in this article, both the aforementioned deterministic laws result in KL(P(·|x_0) ‖ P̄(·|x_0)) = log 2. This example suggests that the randomized control input must be taken into consideration to minimize an objective function that includes the KL divergence.
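The computation in Case 1) can be verified directly; the snippet below evaluates the KL divergence with the 0 log 0 = 0 convention against the desired distribution (0.5, 0.5) implied by the stated optimum, confirming both the optimal randomized value 0 and the value log 2 of the deterministic laws.

```python
import numpy as np

def kl(p, q):
    # KL divergence with the convention 0 log 0 = 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

q = [0.5, 0.5]                      # desired probabilities of Case 1)
assert kl([0.5, 0.5], q) == 0.0     # optimal randomized control
assert np.isclose(kl([1.0, 0.0], q), np.log(2))   # deterministic laws
assert np.isclose(kl([0.0, 1.0], q), np.log(2))
```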
Case 2): The following desired transition probabilities are given: P̄(x_1 = δ^2_4 | x_0) = 1 and P̄(x_1 = δ^3_4 | x_0) = 0. This means that the transition from x_0 = δ^1_4 to x_1 = δ^3_4 is forbidden. Eventually, the feasible set C_{0,1} degenerates to the point (c_{0,1,1}, c_{0,1,2}) = (1, 0), which is the trivial optimal solution, and the resulting transition probability coincides with the desired one.
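The index sets inv(j|i) used in Examples 1 and 2 can be enumerated directly from the structure matrix; the sketch below (an illustration in plain Python, with the column-index encoding of M as an assumption) reproduces inv(4|4) = {1, 2}, the unreachability of δ^1_4 and δ^4_4 from δ^1_4, and the partition property (12).

```python
# structure matrix M = delta_4[2,2,4,4,3,4,3,4], stored by its column indices:
# column (l-1)*4 + i of M is delta_4^{M_cols[...]}, i.e., the next state
# reached from x = delta^i_4 under the input u = delta^l_2
M_cols = [2, 2, 4, 4, 3, 4, 3, 4]
n_state, n_input = 4, 2

def inv(j, i):
    # all input indices l driving x_k = delta^i to x_{k+1} = delta^j
    return {l for l in range(1, n_input + 1)
            if M_cols[(l - 1) * n_state + (i - 1)] == j}

assert inv(4, 4) == {1, 2}                        # two inputs keep delta^4_4 fixed
assert inv(1, 1) == set() and inv(4, 1) == set()  # unreachable from delta^1_4
assert inv(2, 1) == {1} and inv(3, 1) == {2}      # Example 2 transitions
# partition property (12): the sets inv(j|i) partition {1, ..., 2^{n_u}}
for i in range(1, n_state + 1):
    union = set().union(*(inv(j, i) for j in range(1, n_state + 1)))
    assert union == set(range(1, n_input + 1))
```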

A. Reformulation as Optimal Trajectory Planning Problem
This section reformulates the optimal control problem (Problem (10) with Assumption 1) as an optimal trajectory planning problem because the latter has a more tractable structure. If inv(j|i) ≠ ∅ and l = inv*(j|i, k), which means that there exists a control input driving x_k = δ^i_{2^{n_x}} to x_{k+1} = δ^j_{2^{n_x}}, the relationship

c_{k,i,l} = p_{k,i,j} (15)

holds under Assumption 1. An optimal selection probability c*_{k,i,l} of c_{k,i,l} can therefore be obtained from an optimal transition probability p*_{k,i,j} of p_{k,i,j}, and the optimal control Problem (10) is the design problem of p_{k,i,j}. Here, the consideration of Assumption 1 results in

p_{k,i,j} = c_{k,i,l} with l = inv*(j|i, k) if inv(j|i) ≠ ∅, and p_{k,i,j} = 0 otherwise. (16)

There are two settings resulting in p_{k,i,j} = 0: 1) inv(j|i) = ∅, that is, the state x_k = δ^i_{2^{n_x}} cannot move to x_{k+1} = δ^j_{2^{n_x}}; and 2) inv(j|i) ≠ ∅; however, the designer deliberately sets p̄_{k,i,j} = 0, that is, x_{k+1} = δ^j_{2^{n_x}} is a forbidden state. In both settings, p̄_{k,i,j} = 0, and Assumption 1 results in p_{k,i,j} = 0; therefore, only the transition probabilities p_{k,i,j} whose desired values satisfy p̄_{k,i,j} ≠ 0 are required to be designed, and the number of such variables satisfies

card{j | p̄_{k,i,j} ≠ 0} ≤ 2^{n_u} (17)

because the number of possible next states x_{k+1} is not more than that of u_k ∈ Δ_{2^{n_u}}. The feasible set of the variable p_{k,i} is denoted as follows:

P_{k,i} = {p ∈ S_{2^{n_x}} | p_j = 0 for each j such that p̄_{k,i,j} = 0}. (18)

The stage cost function g_k(x_k, u_k) is reformulated as w_k(x_k, x_{k+1}) as follows:

w_k(δ^i_{2^{n_x}}, δ^j_{2^{n_x}}) = g_k(δ^i_{2^{n_x}}, δ^l_{2^{n_u}}) with l = inv*(j|i, k) if inv(j|i) ≠ ∅, and w_k(δ^i_{2^{n_x}}, δ^j_{2^{n_x}}) = +∞ otherwise. (19)

The equivalence of the expectations of g_k(x_k, u_k) and w_k(x_k, x_{k+1}) is presented here. The expectation of g_k is calculated as follows:

E[g_k(x_k, u_k) | x_k = δ^i_{2^{n_x}}] = Σ_{l=1}^{2^{n_u}} c_{k,i,l} g_k(δ^i_{2^{n_x}}, δ^l_{2^{n_u}}).

On recalling (12), which indicates that {1, …, 2^{n_u}} = ∪_{j=1}^{2^{n_x}} inv(j|i) and inv(j|i) ∩ inv(j′|i) = ∅ if j ≠ j′, the sum with respect to l = 1, …, 2^{n_u} can be expressed as that with respect to j = 1, …, 2^{n_x} as follows:

Σ_{l=1}^{2^{n_u}} c_{k,i,l} g_k(δ^i_{2^{n_x}}, δ^l_{2^{n_u}}) = Σ_{j=1}^{2^{n_x}} Σ_{l ∈ inv(j|i)} c_{k,i,l} g_k(δ^i_{2^{n_x}}, δ^l_{2^{n_u}}).

In the sum with respect to l ∈ inv(j|i), the selection probability satisfies c_{k,i,l} = 0 for l ∈ inv†(j|i, k). Therefore, the sum is simplified as follows:

Σ_{l ∈ inv(j|i)} c_{k,i,l} g_k(δ^i_{2^{n_x}}, δ^l_{2^{n_u}}) = p_{k,i,j} w_k(δ^i_{2^{n_x}}, δ^j_{2^{n_x}})

where the second equality uses the relationship between p_{k,i,j} and c_{k,i,l} in (15) and the definition of w_k in (19). Consequently, the optimal control problem (Problem (10) with Assumption 1) is reformulated as the following optimal trajectory planning problem:

min_{p_{k,i} ∈ P_{k,i}} E[ h(x_N) + Σ_{k=0}^{N−1} ( w_k(x_k, x_{k+1}) + μ KL(P(·|x_k) ‖ P̄(·|x_k)) ) ]. (20)

The optimal trajectory planning problem of the state x_k is more tractable because the state variable is directly controlled (see the maze game in [30, Sec. 8.1]).
Example 3: The BCNs (13) in Example 1 are considered again. The structure matrix is M = δ_4[2, 2, 4, 4, 3, 4, 3, 4]. In the case of x_k = δ^1_4, the selection probabilities of the control are equivalent to the transition probabilities. Because inv(1|1) = inv(4|1) = ∅, which means that there is no control input u_k driving x_k = δ^1_4 to either x_{k+1} = δ^1_4 or x_{k+1} = δ^4_4, the corresponding transition probabilities are set to zero based on the definition of P_{k,i} in (18). In the case of x_k = δ^4_4, because inv(4|4) = {1, 2}, there are multiple inputs resulting in x_{k+1} = δ^4_4. For such cases, a control input minimizing the stage cost in Problem (20), given by

G_k = [0.1 0.2 0.4 0.4
       0.1 0.3 0.5 0.7]

with [G_k]_{l,i} = g_k(δ^i_4, δ^l_2) for each k = 0, …, N − 1, is considered; the inequality g_k(δ^4_4, δ^1_2) = 0.4 < 0.7 = g_k(δ^4_4, δ^2_2) results in inv*(4|4, k) = 1. It should be noted that the transition probability matrices used in this article are similar to those used in the conventional MDP; that is, they are the transposed matrices of the STP-based transition matrices for the probabilistic BCNs [26].
The control variable p_{k,i,j} = [P_k]_{i,j} is then designed; the original control variables, which are the selection probabilities of the control input, are recovered from the control variable P_k via the relationship (15).
In addition, the reformulation of the stage cost function, for each k = 0, …, N − 1, is given by the matrix W_k with [W_k]_{i,j} = w_k(δ^i_4, δ^j_4). In this reformulation, the path from x_k = δ^4_4 to x_{k+1} = δ^4_4 admits the two inputs in inv(4|4) = {1, 2} with the costs g_k(δ^4_4, δ^1_2) = 0.4 and g_k(δ^4_4, δ^2_2) = 0.7; the reformulated stage cost takes the minimizing value w_k(δ^4_4, δ^4_4) = 0.4, and the nonoptimal cost 0.7 is excluded. For s = 0, …, N − 1, the following optimal trajectory planning subproblem derived from the optimal control problem (10) with Assumption 1 is considered:

v^{KL,μ}_s(x_s) = min E[ h(x_N) + Σ_{k=s}^{N−1} ( w_k(x_k, x_{k+1}) + μ KL(P(·|x_k) ‖ P̄(·|x_k)) ) | x_s ] (21)

subject to p_{k,i} ∈ P_{k,i} (k = s, …, N − 1, i = 1, …, 2^{n_x}). The following theorem provides the Bellman equation of the optimal trajectory planning problem (20).
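For Example 3, the reformulated stage cost matrix W can be assembled from the stage-cost matrix G and the structure matrix; the encoding below (M stored by its column indices, [G]_{l,i} = g_k(δ^i_4, δ^l_2)) is an assumption for illustration. Unreachable transitions receive +∞, and multiple inputs to the same next state are resolved by the minimum, as in (19).

```python
import numpy as np

M_cols = [2, 2, 4, 4, 3, 4, 3, 4]     # structure matrix delta_4[2,2,4,4,3,4,3,4]
G = np.array([[0.1, 0.2, 0.4, 0.4],   # row l, column i: cost of input l at state i
              [0.1, 0.3, 0.5, 0.7]])
n_state, n_input = 4, 2

W = np.full((n_state, n_state), np.inf)
for i in range(n_state):
    for l in range(n_input):
        j = M_cols[l * n_state + i] - 1    # next state under input l at state i
        W[i, j] = min(W[i, j], G[l, i])    # inv* picks the cheapest input

assert W[3, 3] == 0.4                            # min(0.4, 0.7) at state 4
assert np.isinf(W[0, 0]) and np.isinf(W[0, 3])   # unreachable from delta^1_4
assert W[0, 1] == 0.1 and W[0, 2] == 0.1
```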
Theorem 2: The optimal trajectory planning problem (20) derived from the optimal control problem (10) with Assumption 1 is considered. In Problem (20), the Bellman equation on the value function v^{KL,μ}_s (s = 0, …, N − 1) in (21) is given as

v^{KL,μ}_s(δ^i_{2^{n_x}}) = min_{p_{s,i} ∈ P_{s,i}} Σ_{j=1}^{2^{n_x}} p_{s,i,j} [ w_s(δ^i_{2^{n_x}}, δ^j_{2^{n_x}}) + v^{KL,μ}_{s+1}(δ^j_{2^{n_x}}) ] + μ Σ_{j=1}^{2^{n_x}} p_{s,i,j} log(p_{s,i,j}/p̄_{s,i,j}). (22)

Proof: In v^{KL,μ}_s, as defined in (21), the sum of the stage costs from the (s + 1)th step and the terminal cost is rearranged as follows:

E[ h(x_N) + Σ_{k=s+1}^{N−1} (w_k + μ KL(P(·|x_k) ‖ P̄(·|x_k))) | x_s = δ^i_{2^{n_x}} ] = Σ_{j=1}^{2^{n_x}} p_{s,i,j} E[ h(x_N) + Σ_{k=s+1}^{N−1} (w_k + μ KL(P(·|x_k) ‖ P̄(·|x_k))) | x_{s+1} = δ^j_{2^{n_x}} ]. (23)

On the right-hand side of the equation above, the minimum value and the minimizer of the inner expectation are v^{KL,μ}_{s+1}(δ^j_{2^{n_x}}) and the corresponding optimal p_{k,i} (k = s + 1, …, N − 1, i = 1, …, 2^{n_x}), respectively. The stage cost at the sth step is expanded as follows:

E[ w_s(x_s, x_{s+1}) + μ KL(P(·|x_s) ‖ P̄(·|x_s)) | x_s = δ^i_{2^{n_x}} ] = Σ_{j=1}^{2^{n_x}} p_{s,i,j} w_s(δ^i_{2^{n_x}}, δ^j_{2^{n_x}}) + μ Σ_{j=1}^{2^{n_x}} p_{s,i,j} log(p_{s,i,j}/p̄_{s,i,j}). (24)

Consequently, the combination of (23) and (24) results in the claim of the theorem.

Remark 4: In Theorem 2, especially at s = 0, the value v^{KL,μ}_0(x_init) is the optimal objective function value of Problem (20), and the minimizer associated with v^{KL,μ}_0(x_init) is an optimal solution of Problem (20).
The following theorem provides the vectorized expression of the Bellman equation (22) using the vectorized value function and stage cost

v^{KL,μ}_k = [v^{KL,μ}_k(δ^1_{2^{n_x}}), …, v^{KL,μ}_k(δ^{2^{n_x}}_{2^{n_x}})]^⊤, w_{k,i} = [w_k(δ^i_{2^{n_x}}, δ^1_{2^{n_x}}), …, w_k(δ^i_{2^{n_x}}, δ^{2^{n_x}}_{2^{n_x}})]^⊤.

Theorem 3: The optimal trajectory planning problem (20) derived from the optimal control problem (10) with Assumption 1 is considered. The value function v^{KL,μ}_k satisfies

v^{KL,μ}_k(δ^i_{2^{n_x}}) = −μ log( p̄_{k,i}^⊤ exp(−(w_{k,i} + v^{KL,μ}_{k+1})/μ) ). (25)

The optimal transition probability is given by

p^{*,μ}_{k,i} = ( p̄_{k,i} ⊙ exp(−(w_{k,i} + v^{KL,μ}_{k+1})/μ) ) ⊘ ( p̄_{k,i}^⊤ exp(−(w_{k,i} + v^{KL,μ}_{k+1})/μ) ).

Proof: The basic flow of the proof is similar to that presented in the original article of the KL control [29]. For the variable p_{k,i,j}, a variable transform with d_{k,i,j} ∈ R is introduced as p_{k,i,j} = p̄_{k,i,j} d_{k,i,j}. The case of p_{k,i,j} = p̄_{k,i,j} ∈ {0, 1} formally defines d_{k,i,j} = p_{k,i,j}. For the transformed control variable vector d_{k,i} = [d_{k,i,1}, …, d_{k,i,2^{n_x}}]^⊤ with the feasible set D_{k,i} = {d_{k,i} | p̄_{k,i} ⊙ d_{k,i} ∈ P_{k,i}}, Problem (20) results in the equivalent design problem of d_{k,i} ∈ D_{k,i} (k = 0, …, N − 1, i = 1, …, 2^{n_x}). If there exists j such that p_{k,i,j} = p̄_{k,i,j} = d_{k,i,j} ∈ {0, 1}, this variable degenerates to a point, and it need not be considered (especially, if there exists p_{k,i,j} = p̄_{k,i,j} = d_{k,i,j} = 1, then D_{k,i} is a single feasible point; the optimal solution is this feasible point and evidently satisfies the Bellman equation presented subsequently, so this trivial case is excluded from the remainder of this proof). Thus, D_{k,i} is convex and has interior feasible points. On using Theorem 2 and the vectorized v^{KL,μ}_{k+1} and w_{k,i} and vectorizing the Bellman equation (22), the following subproblem is obtained:

min_{d_{k,i} ∈ D_{k,i}} (p̄_{k,i} ⊙ d_{k,i})^⊤ (w_{k,i} + v^{KL,μ}_{k+1}) + μ (p̄_{k,i} ⊙ d_{k,i})^⊤ log(d_{k,i}). (26)

The Hessian of the objective function with respect to d_{k,i} is μ Diag(p̄_{k,i} ⊘ d_{k,i}), which is positive semidefinite on the feasible set; hence, the subproblem is convex. Here, the Lagrange function with the Lagrange multipliers λ_i (i = 1, …, 2^{n_x}) for the equality constraint p̄_{k,i}^⊤ d_{k,i} = 1 is introduced, and (d^{*,μ}_{k,i}, λ^{*,μ}_i) denotes the associated Karush–Kuhn–Tucker (KKT) point. The optimality condition is

p̄_{k,i,j} [ w_{k,i,j} + v^{KL,μ}_{k+1,j} + μ (log d_{k,i,j} + 1) + λ_i ] = 0, j = 1, …, 2^{n_x}. (27)

Then, 1) the objective function and the equality constraint set are both convex, and 2) the feasible set has an interior feasible point, indicating that the Slater condition is satisfied. Therefore, a point satisfying the optimality condition is an optimal solution (see [34, Sec. 5.5.3]). Solving (27) together with p̄_{k,i}^⊤ d_{k,i} = 1 and recalling p^{*,μ}_{k,i} = p̄_{k,i} ⊙ d^{*,μ}_{k,i}, all the equations of the theorem are obtained.
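The closed-form minimizer in Theorem 3 can be spot-checked numerically for a single state: for a cost vector c = w + v and a desired distribution p̄ with full support, the candidate p* = p̄ ⊙ exp(−c/μ)/Z should attain a value of p^⊤c + μ KL(p‖p̄) no larger than that of any other feasible point. The concrete numbers below are arbitrary test data, not values from the article.

```python
import numpy as np

def objective(p, c, mu, pbar):
    # one-step cost p^T c + mu * KL(p || pbar), with the 0 log 0 = 0 convention
    mask = p > 0
    return p @ c + mu * np.sum(p[mask] * np.log(p[mask] / pbar[mask]))

rng = np.random.default_rng(1)
n, mu = 4, 0.5
pbar = rng.random(n); pbar /= pbar.sum()
c = rng.normal(size=n)

w = pbar * np.exp(-c / mu)
p_star = w / w.sum()                 # candidate minimizer of Theorem 3
v_star = -mu * np.log(w.sum())       # claimed optimal value

assert np.isclose(objective(p_star, c, mu, pbar), v_star)
for _ in range(100):                 # random feasible competitors
    p = rng.random(n); p /= p.sum()
    assert objective(p, c, mu, pbar) >= v_star - 1e-12
```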
For efficient code implementation using the transformed variable [29], a variable z^μ_k is introduced as follows:

z^μ_k = exp(−v^{KL,μ}_k/μ) ∈ R^{2^{n_x}}. (28)

The following theorem provides an iterative equation in the matrix form.

Theorem 4: The same condition as in Theorem 3 for the optimal trajectory planning subproblem (21) derived from the optimal control problem (10) with Assumption 1 is considered. The variable z^μ_k in (28) satisfies the following matrix equation:

z^μ_k = (P̄_k ⊙ exp(−W_k/μ)) z^μ_{k+1}, z^μ_N = exp(−h/μ)

where [P̄_k]_{i,j} = p̄_{k,i,j}, [W_k]_{i,j} = w_k(δ^i_{2^{n_x}}, δ^j_{2^{n_x}}), and h = [h(δ^1_{2^{n_x}}), …, h(δ^{2^{n_x}}_{2^{n_x}})]^⊤, and the optimal transition probability matrix is given by

P^{*,μ}_k = Diag(z^μ_k)^{−1} (P̄_k ⊙ exp(−W_k/μ)) Diag(z^μ_{k+1}).

Proof: The Bellman equation given in Theorem 3 is reformulated as

−v^{KL,μ}_k(δ^i_{2^{n_x}})/μ = log( p̄_{k,i}^⊤ exp(−(w_{k,i} + v^{KL,μ}_{k+1})/μ) ).

Taking the exponential of both sides results in the following equation:

exp(−v^{KL,μ}_k(δ^i_{2^{n_x}})/μ) = p̄_{k,i}^⊤ ( exp(−w_{k,i}/μ) ⊙ z^μ_{k+1} ).

On vertically stacking the equation above with respect to i = 1, …, 2^{n_x}, the vectorized equation of z^μ_k is obtained. In addition, the optimal transition probability p^{*,μ}_{k,i} is rearranged as follows:

p^{*,μ}_{k,i} = ( p̄_{k,i} ⊙ exp(−w_{k,i}/μ) ⊙ z^μ_{k+1} ) ⊘ ( p̄_{k,i}^⊤ (exp(−w_{k,i}/μ) ⊙ z^μ_{k+1}) ).

On transposing and vertically stacking the equation above, the matrix form of P^{*,μ}_k in the theorem is obtained.

The algorithms of the DP for the optimal trajectory planning problem (20) with μ = 0 (see [28]) and the matrix-valued KL control in Theorem 4 are presented as Algorithms 1 and 2, respectively.

Algorithm 1 DP for Problem (20) With μ = 0
1: v^DP_N ← h
2: for k = N − 1, …, 0 do ▷ Backward calculation on k
3: for i = 1, …, 2^{n_x} do
4: v^DP_k(δ^i_{2^{n_x}}) ← min(w_{k,i} + v^DP_{k+1})
5: p^DP_{k,i} ← arg min(w_{k,i} + v^DP_{k+1})
6: end for
7: end for
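The matrix iteration of Theorem 4 can be sketched compactly in NumPy; the Example 3 data below are reconstructed under the encoding assumptions of the earlier snippets (W assembled from the stage cost G, and a uniform desired distribution over the reachable next states), and the convention exp(−∞) = 0 is realized automatically by floating-point arithmetic.

```python
import numpy as np

def kl_control(Pbar, W, h, mu):
    # Backward iteration of Theorem 4: z_k = (Pbar_k * exp(-W_k/mu)) z_{k+1},
    # z_N = exp(-h/mu); exp(-inf) = 0 encodes the forbidden transitions.
    N = len(W)
    z = np.exp(-h / mu)
    v = [None] * (N + 1)
    v[N] = h.astype(float)
    P_opt = [None] * N
    for k in range(N - 1, -1, -1):
        A = Pbar[k] * np.exp(-W[k] / mu)
        z_new = A @ z
        with np.errstate(divide="ignore", invalid="ignore"):
            P_opt[k] = (A * z) / z_new[:, None]   # optimal transition matrix
            v[k] = -mu * np.log(z_new)            # value function
        z = z_new
    return v, P_opt

# Example 3 data: W from G and M (0.4 = min(0.4, 0.7) at state 4),
# desired probabilities uniform over the reachable next states
W0 = np.array([[np.inf, 0.1, 0.1, np.inf],
               [np.inf, 0.2, np.inf, 0.3],
               [np.inf, np.inf, 0.5, 0.4],
               [np.inf, np.inf, np.inf, 0.4]])
Pbar0 = np.where(np.isfinite(W0), 1.0, 0.0)
Pbar0 /= Pbar0.sum(axis=1, keepdims=True)
N = 3
v, P = kl_control([Pbar0] * N, [W0] * N, np.zeros(4), mu=1.0)
```

Each returned P[k] is row-stochastic, and transitions into unreachable states carry exactly zero probability.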
Remark 5: The computational complexity of both algorithms is O(N · 2^{n_x} · min(2^{n_u}, 2^{n_x})) if an appropriate implementation is considered. More precisely, in the fourth and fifth lines of Algorithm 1, because w_{k,i} has elements with the value +∞ and the number of indices j satisfying w_{k,i,j} < +∞ is not more than 2^{n_u} [see (17), (19)], the min and arg min operations need not take these elements into consideration. In Algorithm 2, the ith row vector of P̄_k, which is p̄_{k,i}^⊤, has not more than 2^{n_u} nonzero elements [see (17), (18)]; that is, P̄_k is a sparse matrix if 2^{n_u} ≪ 2^{n_x} (for further details of the sparse implementation, please refer to the Appendix). Although the theoretical computational complexity is the same for the two algorithms, a unified comparison of their computation times cannot be obtained, and the practical computation time depends on the implementation.
It should be noted that the conventional DP (Algorithm 1) and the KL control (Algorithm 2) solve Problem (20) with μ = 0 and μ ∈ (0, +∞), respectively; the target problem formulations of these two algorithms are independent and do not overlap; however, the obtained Algorithm 2 is consistent with the conventional Algorithm 1 in terms of computation time.
It should be noted that a very small value of μ causes the exp operations in Algorithm 2 to underflow in floating-point arithmetic.

Algorithm 2 KL Control for Problem (20) With μ ∈ (0, +∞)
1: z^μ_N ← exp(−h/μ)
2: for k = N − 1, …, 0 do ▷ Backward calculation on k
3: z^μ_k ← (P̄_k ⊙ exp(−W_k/μ)) z^μ_{k+1}
4: P^{*,μ}_k ← Diag(z^μ_k)^{−1} (P̄_k ⊙ exp(−W_k/μ)) Diag(z^μ_{k+1})
5: end for
6: v^{KL,μ}_k ← −μ log z^μ_k for k = 0, …, N

Remark 6: The value function v^{KL,μ}_k and the optimal transition probability p^{*,μ}_{k,i} given in Theorem 3 are interpreted as

v^{KL,μ}_k(δ^i_{2^{n_x}}) = −eLSE(−(w_{k,i} + v^{KL,μ}_{k+1}), μ, p̄_{k,i})

and

p^{*,μ}_{k,i} = esoftmax(−(w_{k,i} + v^{KL,μ}_{k+1}), μ, p̄_{k,i})

respectively. In other words, Algorithm 2 is the same as Algorithm 1 obtained on replacing the min(·) and arg min(·) operations of the fourth and fifth lines at the kth iteration step with −eLSE(−(·), μ, p̄_{k,i}) and esoftmax(−(·), μ, p̄_{k,i}), respectively. To the best of the authors' knowledge, the form of Algorithm 2 has not been studied in the context of BCNs and related MDPs. It should be noted that the interpretation above is naturally derived from the theoretical analysis and is not introduced as a heuristic algorithm. The subsequent analysis of μ ↓ 0 implies that the KL control is a generalization of the conventional DP for μ = 0.
Hereafter, the main focus of the remainder of this section is the convergence behavior of the KL control with respect to the weight parameter μ. The convergence behavior of the aforementioned approximation is summarized in the subsequent theorems. As observed in the arg min operation in Algorithm 1, an optimal solution of the conventional problem formulation, which is the case of μ = 0 of Problem (20), is not necessarily a point but can be a simplex, which means that

arg min(a) = arg min_{δ ∈ S_n} δ^⊤ a = {δ ∈ S_n | δ_i = 0 (a_i > min(a), i = 1, …, n)}.

Therefore, the convergence of a sequence b_k ∈ S_n to arg min(a) ⊂ S_n, which is denoted by b_k → arg min(a), is defined by b_{k,i} → 0 for each i such that a_i > min(a).
Theorem 5: Algorithms 1 and 2 for the optimal trajectory planning problem (20) derived from the optimal control problem (10) with Assumption 1 are considered. In the limit of μ ↓ 0, v^{KL,μ}_k(δ^i_{2^{n_x}}) → v^DP_k(δ^i_{2^{n_x}}) and p^{*,μ}_{k,i} → arg min(w_{k,i} + v^DP_{k+1}) for each k = 0, …, N − 1 and i = 1, …, 2^{n_x}. The proof of the theorem above is given in the Appendix.
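Theorem 5 can be observed numerically on the Example 3 data: per inequality (3), each backward step of Algorithm 2 deviates from the DP step by at most μ log 2 here (the support of each desired row has at most two elements with probability 0.5), so for a small μ the two value functions nearly coincide. The encoding below repeats the assumptions of the earlier snippets.

```python
import numpy as np

W = np.array([[np.inf, 0.1, 0.1, np.inf],
              [np.inf, 0.2, np.inf, 0.3],
              [np.inf, np.inf, 0.5, 0.4],
              [np.inf, np.inf, np.inf, 0.4]])
Pbar = np.where(np.isfinite(W), 1.0, 0.0)
Pbar /= Pbar.sum(axis=1, keepdims=True)
N, n = 3, 4

# Algorithm 1: conventional DP (min over w + v), terminal cost h = 0
v_dp = np.zeros(n)
for _ in range(N):
    v_dp = np.min(W + v_dp, axis=1)

# Algorithm 2 with a small mu
mu = 1e-2
z = np.ones(n)                        # exp(-h/mu) with h = 0
for _ in range(N):
    z = (Pbar * np.exp(-W / mu)) @ z
v_kl = -mu * np.log(z)

assert np.all(v_kl >= v_dp - 1e-12)   # KL value upper-bounds the DP value
assert np.max(v_kl - v_dp) <= 0.03    # accumulated gap of order N * mu * log 2
```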
In contrast to the limit of μ ↓ 0, the diverging μ → +∞ emphasizes the minimization of the KL divergence. The objective function value v̄_k obtained with the desired transition probabilities p̄_{k,i} (k = 0, …, N, i = 1, …, 2^{n_x}) is introduced. The limiting behavior of μ → +∞ is summarized in the following theorem.

Theorem 6: Algorithm 2 for the optimal trajectory planning problem (20) derived from the optimal control problem (10) with Assumption 1 is considered. In the limit of μ → +∞, v^{KL,μ}_k(δ^i_{2^{n_x}}) → v̄_k(δ^i_{2^{n_x}}) and p^{*,μ}_{k,i} → p̄_{k,i} for each k = 0, …, N − 1 and i = 1, …, 2^{n_x}.
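The μ → +∞ limit can similarly be observed on the Example 3 data: for a large μ, the optimal transition matrix returned by Algorithm 2 is close to P̄, and the value function is close to the expected cost under P̄ (same encoding assumptions as in the earlier snippets).

```python
import numpy as np

W = np.array([[np.inf, 0.1, 0.1, np.inf],
              [np.inf, 0.2, np.inf, 0.3],
              [np.inf, np.inf, 0.5, 0.4],
              [np.inf, np.inf, np.inf, 0.4]])
Pbar = np.where(np.isfinite(W), 1.0, 0.0)
Pbar /= Pbar.sum(axis=1, keepdims=True)
N, n, mu = 3, 4, 1e5

# Algorithm 2 with a large mu; P0 holds the optimal transitions at k = 0
z = np.ones(n)
P0 = None
for k in range(N - 1, -1, -1):
    A = Pbar * np.exp(-W / mu)
    z_new = A @ z
    P0 = (A * z) / z_new[:, None]
    z = z_new
v_kl = -mu * np.log(z)

# expected cost under the desired transition probabilities (v-bar)
Wf = np.where(np.isfinite(W), W, 0.0)   # 0 * inf = 0 convention on the support
v_bar = np.zeros(n)
for _ in range(N):
    v_bar = (Pbar * Wf).sum(axis=1) + Pbar @ v_bar

assert np.max(np.abs(P0 - Pbar)) < 1e-3
assert np.max(np.abs(v_kl - v_bar)) < 1e-3
```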
The proof of the theorem above can be referred to in the Appendix.
Example 4: Here, the control problem of Example 3 for the BCNs (8) is considered again. The control period is set as N = 3, and a terminal cost h is set. The value function and the optimal solution obtained using the conventional DP for μ = 0 (Algorithm 1) and the KL control for μ ∈ (0, +∞) (Algorithm 2) are compared. The matrix-valued value function and the control variable of the DP are defined as V^DP = [v^DP_0, …, v^DP_N] and P^DP_k, respectively; those of the KL control, which are denoted by V^{KL,μ} and P^{*,μ}_k, respectively, are similarly defined. As shown in the proof of Theorem 5 in the Appendix, the difference between v^DP_k and v^{KL,μ}_k increases backward in k; therefore, this example provides the values of P_k at k = 0 owing to the space limitations. In the case of N = 3, V^DP and P^DP_0 are computed, where the subvector p^DP_{0,3,[3,4]} is an arbitrary vector in S_2. The value function and the optimal transition probabilities of the KL control are computed in the cases of μ = 0.01 and μ = 1. For a varying weight parameter μ, the optimal values without the KL divergence and the values of the KL divergence are illustrated in Fig. 2. The selected value of μ can balance the stage cost and the KL divergence, which are indicated on the x-axis and y-axis of Fig. 2, respectively.

V. APPLICATION EXAMPLES

A. Lac Operon Model
First, the lac operon model proposed in [35] (see [35] for further details of the model) is presented. The lac operon model has the variables M (lac mRNA), B (β-galactosidase, LacZ), R (repressor protein, LacI), A (allolactose), L (lactose), P (transport protein, LacY, "lac permease"), and C (cAMP-CAP protein complex). A variable with the subscript "e" represents the extracellular concentration, a variable with the subscript "m" represents that the concentration is at least medium, and the other variables represent the intracellular concentrations. If a variable takes the value True or False, its concentration is high or low, respectively. The state and control variables are given by (x_1, …, x_{10}) = (M, B, R, A, L, P, C, R_m, A_m, L_m) and (u_1, u_2, u_3) = (L_e, L_em, G_e), respectively, and the Boolean update functions for the states are given in [35]. In this numerical example, as in Example 3, the desired transition probabilities p̄_{k,i,j} were uniformly set over the reachable next states for each k = 0, …, N − 1, with N = 100, and the stage and terminal costs were randomly given. The differences v^{KL,μ}_0 − v^DP_0 and v̄_0 − v^{KL,μ}_0 are depicted in Fig. 3, and the values of the cost and the KL divergence are illustrated in Fig. 4.

B. Frozen Lake
As an example of a middle-scale problem, the frozen lake [36] in Gymnasium is illustrated here. The detailed game setting is omitted owing to space limitations but can be found in the Web documentation [36]. The frozen lake is not a Boolean model but a trajectory planning problem in the MDPs, and the proposed framework can be easily applied by setting the state space to Δ_16 rather than Δ_{2^{n_x}}. Because deterministic systems are discussed in this article, the probabilistic slipping is ignored here.
In the map of the frozen lake (Fig. 5), a player starts at Start (S) and arrives at Goal (G) by moving on Frozen (F) tiles. Because the player cannot move anymore after falling into a Hole (H), the desired transition probabilities to the Holes are set to zero, which means that p̄_{k,i,05} = p̄_{k,i,07} = p̄_{k,i,11} = p̄_{k,i,12} = 0 for any k and i. In addition, evidently time-consuming routes, such as 04 → 00, are excluded by setting their desired transition probabilities to zero. The time-invariant desired transition probabilities and costs are given and indicated in the map (Fig. 5).
A situation with multiple people, such as an evacuation in the case of a disaster, is considered. A single optimal route, which the conventional DP provides, can result in congestion; therefore, the desired transition probabilities are selected to separate the flow of people, e.g., p_{00,01} = p_{00,04} = 0.5, although route 00 → 04 with cost w_{00,04} = 1.0 is more reasonable than 00 → 01 with w_{00,01} = 2.0. In addition, the buffer area 03 is utilized.
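A minimal sketch of this setup, assuming the 4x4 grid numbering used in the text (states 00-15), is:

```python
import numpy as np

# Desired transition probabilities P for the frozen lake (assumed 4x4
# grid numbering; only the start-state row is filled in this sketch).
HOLES = [5, 7, 11, 12]          # Holes 05, 07, 11, 12

P = np.zeros((16, 16))
# Separate the flow of people leaving the start state 00 over two routes.
P[0, 1] = 0.5                   # 00 -> 01, cost w_{00,01} = 2.0
P[0, 4] = 0.5                   # 00 -> 04, cost w_{00,04} = 1.0
# ... the remaining rows are filled analogously from the map in Fig. 5 ...

P[:, HOLES] = 0.0               # forbid transitions into any Hole

print(P[0, [1, 4, 5]])
```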
A long control period N = 10 is considered such that the player can reach the goal. The transition probabilities obtained are summarized in Table I. For a small value of μ, the minimization of the cost is emphasized, and the obtained transition probability of a nonoptimal route, such as p_{00,01}, takes a small value. In contrast, a large value of μ results in a small value of the KL divergence, indicating that the optimal transition probabilities are close to the desired values.
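As an illustrative sketch (not the full algorithm), the balance governed by μ can be observed on the single choice 00 → 01 versus 00 → 04 by normalizing p_{i,j} exp(−μ^{−1}(w_{i,j} + v_j)); the continuation values are assumed equal here:

```python
import numpy as np

# Sketch of the mu trade-off for the choice 00 -> 01 vs. 00 -> 04, using
# the optimal transition probability proportional to p * exp(-(w + v)/mu).
p = np.array([0.5, 0.5])      # desired probabilities toward 01 and 04
w = np.array([2.0, 1.0])      # transition costs w_{00,01}, w_{00,04}
v = np.array([0.0, 0.0])      # assumed equal continuation values

def p_star(mu):
    z = p * np.exp(-(w + v) / mu)
    return z / z.sum()

print(p_star(0.1))    # small mu: mass concentrates on the cheaper route 04
print(p_star(100.0))  # large mu: close to the desired (0.5, 0.5)
```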
Remark 7: The aforementioned numerical examples illustrate the advantages of the extended KL cost: 1) the extended KL cost can take into consideration the cost of the control (e.g., the transition costs indicated near the edges in Fig. 5), whereas the existing KL cost (e.g., [27] and [29]) cannot; and 2) using the weight parameter μ, the similarity to the desired transition probabilities and the transition costs can be balanced. Furthermore, the limiting cases μ → 0 and μ → +∞ are theoretically supported by Theorems 5 and 6.

VI. CONCLUSION
This article addressed the optimal control problem with a stage cost function depending on the control input and the KL divergence. The introduced KL divergence can balance the objective function value and the desired state transition probabilities. Furthermore, the convergence behavior of the KL control with respect to the weight parameter was presented.
In this work, the target system is a deterministic Boolean network; thus, probabilistic Boolean control networks (PBCNs) are not considered. The KL control problem for PBCNs requires the consideration of the innate stochastic behavior of PBCNs and results in a more complicated discussion; it can be investigated in future studies.

APPENDIX

A. Proof of Theorem 5
For notational simplicity, the set J_{k,i} = {j : p_{k,i,j} > 0} is used, and D_{k,i} denotes the selection matrix that extracts the elements indexed by J_{k,i}. It should be noted that D_{k,i} w_{k,i} comprises the elements w_{k,i,j} such that p_{k,i,j} > 0, which means that the elements w_{k,i,j} = +∞ associated with p_{k,i,j} = 0 are excluded. Similarly, D_{k,i} p_{k,i} comprises the elements p_{k,i,j} > 0. On recalling that the term exp(−∞) = 0 in the eLSE and that the ±∞ values in the max/min operations can be ignored, the corresponding identities hold for an arbitrary vector a ∈ R^{2^{n_x}}. The modification using D_{k,i} is introduced to exclude computations involving ∞ from the fundamental inequalities given by Theorem 1.
The value v^{KL,μ}_{N−1,i} satisfies the claimed inequalities at the final step, and the induction hypothesis is assumed at step k + 1. Equations (1) and (2), with the substitutions a = D_{k,i}[w_{k,i} + v^{KL,μ}_{k+1}] and p = D_{k,i} p_{k,i}, yield bounds on γ^{(1)}_{k,i}. Next, the bound of the inner product involving the esoftmax term is considered. The esoftmax term always takes a value in S^{|J_{k,i}|}, indicating that it is a non-negative vector, and the induction hypothesis applies to v^{KL,μ}_{k+1}. For the upper bound, on using the Hölder inequality aᵀb ≤ ‖a‖_1 ‖b‖_∞ = (1ᵀa) max(b) for non-negative vectors a and b (see [37, Appendix B]), the desired upper bound follows.
On applying the aforementioned upper and lower bounds of the esoftmax term, (29) is simplified. For γ^{(2)}_{k,i}, (3) results in the corresponding inequality. Equations (30) and (31) then show that v^{KL,μ}_{k,i} satisfies the claimed inequalities, and the induction concludes that v^{KL,μ}_k ≥ v^{DP}_k (k = 0, . . ., N − 1). On taking the maximum value of both sides, a recursive inequality for max(v^{KL,μ}_k − v^{DP}_k) is obtained; however, an explicit expression of its solution is difficult to derive. Instead, using a sequence c_k that does not depend on μ, the remaining part is the proof of an inequality bounding max(v^{KL,μ}_k − v^{DP}_k); combining the bound at step k + 1 with (32) results in the desired bound.
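The Hölder-type bound aᵀb ≤ (1ᵀa) max(b) for non-negative vectors, invoked above, can be sanity-checked numerically; a minimal sketch:

```python
import numpy as np

# Numeric check of the Hölder-type bound used in the proof:
# a^T b <= ||a||_1 * ||b||_inf = (1^T a) * max(b) for non-negative a, b.
def holder_bound_holds(a, b, tol=1e-12):
    # compare both sides of the inequality (tol absorbs float rounding)
    return a @ b <= a.sum() * b.max() + tol

rng = np.random.default_rng(0)
checks = [holder_bound_holds(rng.random(8), rng.random(8))
          for _ in range(1000)]
print(all(checks))
```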

2) Proof of Item 2):
The jth element of the optimal p^{*,μ}_{k,i} ∈ S^{2^{n_x}} is proportional to p_{k,i,j} exp(−μ^{−1}[w_{k,i,j} + v^{KL,μ}_{k+1,j}]), and μ^{−1}[w_{k,i,j} + v^{KL,μ}_{k+1,j}] diverges to +∞ as μ → 0. 2) Case w_{k,i,j} + v^{DP}_{k+1,j} < w_{k,i,j′} + v^{DP}_{k+1,j′}: similar to the discussion above, there exist −ε < 0 and μ̄ > 0 such that the difference of the bracketed terms is bounded by −ε for all μ < μ̄. The discussion above does not specify the convergence point, but the convergence of p^{*,μ}_{k,i} as μ → 0 is established.
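The exponential vanishing used in this case analysis can be sketched numerically; the gap ε and the desired probabilities below are illustrative assumptions:

```python
import numpy as np

# For two next states j and j' with gap eps = (w_j + v_j) - (w_j' + v_j') > 0,
# the probability ratio p*_j / p*_j' = (p_j / p_j') * exp(-eps / mu)
# vanishes as mu -> 0, so the mass concentrates on the minimizing state.
p_j, p_jp = 0.4, 0.6   # illustrative desired probabilities
eps = 1.5              # illustrative positive gap

def ratio(mu):
    return (p_j / p_jp) * np.exp(-eps / mu)

for mu in [10.0, 1.0, 0.1]:
    print(mu, ratio(mu))   # decreases toward zero as mu shrinks
```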

C. Sparse Implementation of Algorithm 2
For practical implementation, the third line of Algorithm 2, which computes P_k ⊙ exp(−μ^{−1} W_k) with the elementwise product ⊙, can consume a significant amount of time because a naive implementation results in a dense matrix exp(−μ^{−1} W_k). In particular, W_k, defined in (19), has many +∞ elements, and exp(−μ^{−1} W_k) is computed in the manner of dense matrix computation. Instead, the following sparse matrix W̃_k can be used: W̃_{k,i,j} = W_{k,i,j} if p_{k,i,j} > 0, and W̃_{k,i,j} = 0 if p_{k,i,j} = 0.
Because p_{k,i,j} exp(−μ^{−1} W_{k,i,j}) = p_{k,i,j} exp(−μ^{−1} W̃_{k,i,j}) = 0 if p_{k,i,j} = 0, the modification from W_k to W̃_k does not affect the result; however, the elements of exp(−μ^{−1} W̃_k) with W̃_{k,i,j} = 0 take the value 1, and the matrix is still dense. Therefore, a common implementation named expm1, which computes expm1(x) = exp(x) − 1, can be used: P_k ⊙ exp(−μ^{−1} W̃_k) = P_k ⊙ expm1(−μ^{−1} W̃_k) + P_k, and both terms on the right-hand side are sparse whenever P_k is sparse. It should also be noted that a small value of μ causes z^μ_k to become almost zero and causes the overflow of Diag(1_{2^{n_x}} ⊘ z^μ_k), where ⊘ denotes the elementwise division.
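A minimal numpy sketch of this expm1 rewriting, with illustrative matrices and a dense array standing in for a sparse-matrix format, is:

```python
import numpy as np

# Entries of W with p = 0 are +inf; Wt replaces them by 0 so that
# P * exp(-Wt/mu) = P * expm1(-Wt/mu) + P, and expm1(-Wt/mu) is zero
# wherever Wt is zero (i.e., it keeps the sparsity pattern of Wt).
mu = 0.5
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.3, 0.0, 0.7]])
W = np.where(P > 0, 1.0, np.inf)     # illustrative costs; +inf where p = 0
Wt = np.where(P > 0, W, 0.0)         # sparse surrogate of W

dense = P * np.exp(-W / mu)          # naive computation with +inf entries
sparse = P * np.expm1(-Wt / mu) + P  # expm1 form: both terms sparse

print(np.allclose(dense, sparse))    # the two computations agree
```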

From the value above and Fig. 1, convergence is observed.

Fig. 2. Optimal values with varying weight parameter μ > 0 (v^{KL,μ}_{k,4} = v^{DP}_{k,4} = v_{k,4} = 1.5 with the zero value of the KL divergence, degenerating to the point (1.5, 0) as indicated in the plot above, because the corresponding state cannot move to any other state).

Fig. 5. Map of the frozen lake and the associated desired transition probabilities and stage cost function. The values ∗/∗ on the right-hand side of an edge represent the time-invariant desired transition probability and the cost, respectively, i.e., p_{i,j}/w_{i,j}.