Policy Optimization for Markovian Jump Linear Quadratic Control: Gradient Method and Global Convergence

Recently, policy optimization has received renewed attention from the control community due to various applications in reinforcement learning tasks. In this article, we investigate the global convergence of the gradient method for quadratic optimal control of discrete-time Markovian jump linear systems (MJLS). First, we study the optimization landscape of direct policy optimization for MJLS, with static state-feedback controllers and quadratic performance costs. Despite the nonconvexity of the resultant problem, we are still able to identify several useful properties such as coercivity, gradient dominance, and smoothness. Based on these properties, we prove that the gradient method converges to the optimal state-feedback controller for MJLS at a linear rate if initialized at a mean-square stabilizing controller. This article brings new insights for understanding the performance of the policy gradient method on the Markovian jump linear quadratic control problem.


I. INTRODUCTION
Recently, reinforcement learning (RL) [1] has achieved impressive performance on continuous-control tasks such as locomotion [2] and robotic hand manipulation [3]. One main algorithmic framework for such RL applications is policy optimization [4]. Specifically, policy-based RL methods including the policy gradient method [5], natural policy gradient [6], [7], and natural actor-critic [8], [9] have been widely used in various control tasks. These methods enable flexible policy parameterizations and optimize control performance directly.
Although policy-based RL methods have shown great promise in addressing complex control tasks, the selection and tuning of these methods are not yet fully understood [10], [11]. This has motivated a recent research trend focusing on understanding the performance of policy optimization algorithms on simplified benchmarks, such as the linear quadratic regulator (LQR) [12]-[22], linear robust control [23]-[25], and linear control of Lur'e systems [26]. Notice that even for LQR, directly optimizing over the policy space leads to a nonconvex constrained problem. Nevertheless, one can still prove the global convergence of policy-gradient methods on the LQR problem by exploiting properties such as gradient dominance, almost smoothness, and coercivity [12], [13]. This provides a good sanity check for applying policy optimization to more advanced control applications.
Building upon this progress on understanding policy-based RL for linear time-invariant (LTI) systems, this article moves one step further and presents new theoretical results on policy optimization of Markov jump linear systems (MJLS) [27]. MJLS form an important class of hybrid dynamical systems that find many applications in control [28]-[33] and machine learning [34], [35]. The research on MJLS has great practical value and, at the same time, raises new and interesting theoretical questions. Different from the LTI case, the state/input matrices of MJLS are functions of a jump parameter sampled from an underlying Markov chain. Controlling unknown MJLS poses many new challenges over a traditional LQR due to the appearance of this Markov jump parameter, and it is the coupling effect between the state/input matrices and the jump parameter distribution that causes the main difficulty. To this end, the optimal control of MJLS provides a meaningful benchmark for further understanding of policy-based RL algorithms.
However, the theoretical properties of policy-based RL methods on discrete-time MJLS have been overlooked in the existing literature [36]-[39]. In this article, we take a step toward bridging this gap. Specifically, we develop a new convergence theory for the direct policy optimization of MJLS. Despite the nonconvexity of the resultant policy-search problem, we are still able to identify several useful properties such as coercivity, gradient dominance, and smoothness. Then, we use these identified properties to prove that the gradient method converges to the optimal state-feedback controller for MJLS at a linear rate, provided that a stabilizing initial controller is used.
Our article generalizes the convergence theory for LTI policy optimization [12], [13], [20] to the MJLS case. This extension is nontrivial and heavily relies on the operator-theoretic stability arguments used in the MJLS literature [27]. Our article also expands on the previous results published by the authors in a conference paper [40], making significant extensions in identifying the smoothness property and analyzing the gradient-descent method in the MJLS setting. Our work serves as an important step toward understanding the theoretical aspects of policy-based RL methods for MJLS control. A follow-up work [41] extends our convergence theory of the gradient method to the model-free policy gradient setting. When the models are unknown, the gradient method can still be implemented using zeroth-order optimization techniques and yields global-convergence guarantees [41]. The sample complexity analysis in [41] heavily relies on the cost properties identified in our article.

II. PRELIMINARIES AND PROBLEM SETUP
A. Notation
We denote the set of real numbers by R. For a matrix A, we use the notation A^T, ‖A‖, tr(A), σ_min(A), and ρ(A) to denote its transpose, maximal singular value, trace, minimum singular value, and spectral radius, respectively. Given matrices {D_i}_{i=1}^m, let diag(D_1, ..., D_m) denote the block-diagonal matrix whose (i, i)th block is D_i. The Kronecker product of matrices A and B is denoted A ⊗ B. We use vec(A) to denote the vectorization of the matrix A. We write Z ≻ 0 (respectively, Z ⪰ 0) when a symmetric matrix Z is positive definite (respectively, positive semidefinite). Given a function f, we use df to denote its total derivative [42]. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
We now introduce some specific matrix spaces and notation motivated by the MJLS literature [27]. Let M^N_{n×m} denote the space made up of all N-tuples of real matrices, i.e., V = (V_1, ..., V_N) ∈ M^N_{n×m} with V_i ∈ R^{n×m} for each i. Notice that both V and S ∈ M^N_{n×m} are sequences of matrices. It is also convenient to define the entrywise sum V + S := (V_1 + S_1, ..., V_N + S_N) and the inner product ⟨V, S⟩ := Σ_{i=1}^N tr(V_i^T S_i).

B. Markovian Jump Linear Quadratic Control
In this article, we consider the optimal control of the following discrete-time MJLS:

x_{t+1} = A_{ω(t)} x_t + B_{ω(t)} u_t,  x_0 ∼ D    (1)

where x_t ∈ R^d is the system state and u_t ∈ R^k corresponds to the control action. The system matrices A_{ω(t)} ∈ R^{d×d} and B_{ω(t)} ∈ R^{d×k} depend on the switching parameter ω(t), which takes values in Ω := {1, ..., N_s}. We will denote A = (A_1, ..., A_{N_s}) ∈ M^{N_s}_{d×d} and B = (B_1, ..., B_{N_s}) ∈ M^{N_s}_{d×k}. The jump parameter {ω(t)}_{t=0}^∞ is assumed to form a time-homogeneous Markov chain whose transition probabilities are given by

P(ω(t+1) = j | ω(t) = i) = p_{ij}.    (2)

Let P denote the probability transition matrix whose (i, j)th entry is p_{ij}. The initial distribution of ω(0) is given by π = [π_1 ⋯ π_{N_s}]^T. Obviously, we have p_{ij} ≥ 0, Σ_{j=1}^{N_s} p_{ij} = 1, and Σ_{i∈Ω} π_i = 1. We further assume that system (1) is mean-square stabilizable.¹ Our control design objective is to choose the actions {u_t}_{t=0}^∞ to minimize the following quadratic cost function:

C = E_{x_0∼D} [ Σ_{t=0}^∞ ( x_t^T Q_{ω(t)} x_t + u_t^T R_{ω(t)} u_t ) ]    (3)

where D denotes the initial state distribution and Q_i ≻ 0, R_i ≻ 0 for all i ∈ Ω. For simplicity, it is assumed that π_i > 0 for every i ∈ Ω and that E_{x_0∼D}[x_0 x_0^T] is full rank; these conditions indicate that there is a chance of starting from any mode i and that the covariance of the initial state is full rank. These assumptions can be informally thought of as a counterpart of the persistent-excitation condition in the system identification literature and are quite standard for learning-based control. The above problem can be viewed as the MJLS counterpart of the standard LQR problem, and hence is termed the

¹The mean-square stability of MJLS is reviewed in the sequel.
"MJLS LQR problem." It is known that the optimal cost for the MJLS LQR problem can be achieved by a linear state feedback of the form

u_t = −K_{ω(t)} x_t    (4)

with K = (K_1, ..., K_{N_s}) ∈ M^{N_s}_{k×d}. Combining the linear policy (4) with (1), we obtain the closed-loop dynamics

x_{t+1} = Γ_{ω(t)} x_t,  Γ_i := A_i − B_i K_i    (5)

with Γ = (Γ_1, ..., Γ_{N_s}) ∈ M^{N_s}_{d×d}. Note that using this formulation, we can write the cost (3) as C(K) = E_{x_0∼D}[ Σ_{t=0}^∞ x_t^T (Q_{ω(t)} + K_{ω(t)}^T R_{ω(t)} K_{ω(t)}) x_t ]. The optimal controller for the above MJLS LQR problem can be computed by solving a system of coupled algebraic Riccati equations (AREs) [43]. Specifically, define the operator E_i(V) := Σ_{j∈Ω} p_{ij} V_j for V ∈ M^{N_s}_{d×d}. Let P = (P_1, ..., P_{N_s}) be the unique positive-definite solution to the following AREs:

P_i = Q_i + A_i^T E_i(P) A_i − A_i^T E_i(P) B_i ( R_i + B_i^T E_i(P) B_i )^{−1} B_i^T E_i(P) A_i.    (6)

Then, it is known that the optimal controller is given by

K_i^* = ( R_i + B_i^T E_i(P) B_i )^{−1} B_i^T E_i(P) A_i.    (7)

Notice that the existence of such a controller is guaranteed by the stabilizability assumption. In this article, we will revisit the above MJLS LQR problem from a policy optimization perspective.
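To make the coupled AREs (6) and the optimal gains (7) concrete, here is a minimal sketch that solves them by fixed-point (value) iteration under the mean-square stabilizability assumption. This is not a method advocated in the article; all system data are illustrative, and the helper name `E_op` is ours:

```python
import numpy as np

# Illustrative two-mode MJLS (Ns = 2, d = 2, k = 1); all numbers are made up.
A = [np.array([[0.9, 0.5], [0.0, 0.8]]), np.array([[0.8, 0.2], [0.1, 1.05]])]
B = [np.array([[0.0], [1.0]]), np.array([[1.0], [0.5]])]
Q = [np.eye(2), 2 * np.eye(2)]
R = [np.eye(1), np.eye(1)]
Pr = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix, rows sum to 1
Ns = 2

def E_op(P, i):
    """The coupling operator E_i(P) = sum_j p_ij P_j from (6)."""
    return sum(Pr[i, j] * P[j] for j in range(Ns))

# Fixed-point (value) iteration on the coupled AREs (6).
P = [np.eye(2) for _ in range(Ns)]
for _ in range(2000):
    P_new = []
    for i in range(Ns):
        EiP = E_op(P, i)
        G = R[i] + B[i].T @ EiP @ B[i]
        P_new.append(Q[i] + A[i].T @ EiP @ A[i]
                     - A[i].T @ EiP @ B[i] @ np.linalg.solve(G, B[i].T @ EiP @ A[i]))
    if max(np.linalg.norm(P_new[i] - P[i]) for i in range(Ns)) < 1e-12:
        P = P_new
        break
    P = P_new

# Optimal mode-dependent gains (7).
K_star = [np.linalg.solve(R[i] + B[i].T @ E_op(P, i) @ B[i],
                          B[i].T @ E_op(P, i) @ A[i]) for i in range(Ns)]
```

On convergence, each P_i is symmetric positive definite and satisfies the corresponding ARE up to numerical tolerance; the gains K_i^* differ across modes, reflecting the coupling induced by E_i.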

C. Policy Optimization for LTI Systems
Before proceeding to policy optimization of MJLS, we briefly review some relevant results for LTI systems [12]. Consider the LTI system x_{t+1} = A x_t + B u_t, where A ∈ R^{d×d} and B ∈ R^{d×k}. Let u_t be determined by a static state-feedback controller, i.e., u_t = −K x_t. We adopt the following standard quadratic cost function:

C(K) = E_{x_0∼D} [ Σ_{t=0}^∞ x_t^T (Q + K^T R K) x_t ].    (8)

The following gradient formula is also well known [12], [44]:

∇C(K) = 2( (R + B^T P_K B) K − B^T P_K A ) Σ_K

where P_K solves the Lyapunov equation P_K = Q + K^T R K + (A − BK)^T P_K (A − BK) and Σ_K := E_{x_0∼D} Σ_{t=0}^∞ x_t x_t^T is the state correlation matrix [12]. In addition, one can optimize (8) using the gradient method K ← K − η∇C(K), or the natural policy gradient method K ← K − η∇C(K)Σ_K^{−1}. In [12], it was shown that these methods are guaranteed to converge to K^* linearly if a stabilizing initial policy is used. One advantage of these gradient-based methods is that they can be implemented in a model-free manner. More discussion of these methods and their model-free variants can be found in [12].
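As a minimal runnable sketch of these LTI formulas, the Lyapunov equations for P_K and Σ_K can be solved via Kronecker-product vectorization and plugged into the gradient formula. The system data and helper names (`solve_P`, `solve_Sigma`) below are our own illustrative choices, not from the article:

```python
import numpy as np

# Illustrative LTI system (mildly unstable open loop).
A = np.array([[1.0, 0.2], [0.0, 1.1]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
Sigma0 = np.eye(2)          # E[x_0 x_0^T], assumed full rank
d = A.shape[0]

def solve_P(K):
    """P_K from P = Q + K^T R K + (A-BK)^T P (A-BK), via vec(P)."""
    Acl = A - B @ K
    W = Q + K.T @ R @ K
    vecP = np.linalg.solve(np.eye(d * d) - np.kron(Acl.T, Acl.T),
                           W.reshape(-1, order='F'))
    return vecP.reshape(d, d, order='F')

def solve_Sigma(K):
    """State correlation Sigma_K = Sigma0 + (A-BK) Sigma_K (A-BK)^T."""
    Acl = A - B @ K
    vecS = np.linalg.solve(np.eye(d * d) - np.kron(Acl, Acl),
                           Sigma0.reshape(-1, order='F'))
    return vecS.reshape(d, d, order='F')

def grad(K):
    """Well-known policy gradient: 2((R + B^T P B)K - B^T P A) Sigma_K."""
    P = solve_P(K)
    return 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ solve_Sigma(K)

def cost(K):
    return np.trace(solve_P(K) @ Sigma0)

K = np.array([[1.0, 1.6]])   # a stabilizing initial gain (rho(A-BK) < 1)
costs = [cost(K)]
for _ in range(500):
    K = K - 1e-3 * grad(K)   # plain gradient method with a small stepsize
    costs.append(cost(K))
```

With a sufficiently small stepsize the cost decreases along the iterates and the closed loop remains stable, matching the qualitative behavior established in [12].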

D. Problem Setup: Policy Optimization for MJLS
In this section, we reformulate the MJLS LQR problem as a policy-optimization problem. Since the optimal cost for the MJLS LQR problem can be achieved by a linear state-feedback controller, it is reasonable to confine the policy search to the class of linear state-feedback policies. Hence, we set K = (K_1, ..., K_{N_s}) and determine the control action as u_t = −K_{ω(t)} x_t. This leads to the following policy-optimization problem whose decision variable is K.
Problem (Policy Optimization for MJLS):
minimize: cost C(K), given in (3)
subject to: state dynamics, given in (1);
control actions, given in (4);
transition probabilities, given in (2);
stability constraint: K stabilizes (1) in the mean-square sense.
When N_s = 1, the above problem reduces to policy optimization for LTI systems [12]. We want to emphasize that the above problem is indeed a constrained-optimization problem. Recall that, given K, the resultant closed-loop MJLS (5) is mean-square stable (MSS) if E[‖x_t‖²] → 0 as t → ∞ for any initial conditions x_0 and ω(0). We can trivially apply the well-known equivalence between mean-square stability and stochastic stability for MJLS [27] to show that C(K) is finite if and only if K stabilizes the closed-loop dynamics in the mean-square sense. Therefore, the feasible set of the above policy-optimization problem consists of all K stabilizing the closed-loop dynamics (5) in the mean-square sense. For simplicity, we denote this feasible set by K. For K ∈ K, C(K) can be calculated as

C(K) = Σ_{i∈Ω} π_i E_{x_0∼D} [ x_0^T P_i^K x_0 ]    (9)

where P^K = (P_1^K, ..., P_{N_s}^K) ∈ M^{N_s}_{d×d} and each P_i^K is solved via the following coupled Lyapunov equations:

P_i^K = Q_i + K_i^T R_i K_i + Γ_i^T E_i(P^K) Γ_i.    (10)

The goal of policy optimization is to apply iterative gradient-based methods to search for the cost-minimizing element K^* within the feasible set K. A fundamental question is how to check whether K ∈ K for any given K. There are several ways to do this, and we give a brief review here. We need to introduce a few operators that are standard in the MJLS literature. Specifically, for any V ∈ M^{N_s}_{d×d}, we define

E_i(V) := Σ_{j∈Ω} p_{ij} V_j,  T_j(V) := Σ_{i∈Ω} p_{ij} Γ_i V_i Γ_i^T,  L_i(V) := Γ_i^T E_i(V) Γ_i.    (11)

The following monotonicity property of E_i is quite useful: if V ⪰ S, then E_i(V) ⪰ E_i(S) for each i ∈ Ω. It is also easy to check that both T and L are Hermitian and positive operators. From [27], we also know that T is the adjoint operator of L.
The operator T is useful in describing the covariance propagation of the MJLS (5). Specifically, define X_i(t) := E[ x_t x_t^T 1_{{ω(t)=i}} ] and X(t) := (X_1(t), ..., X_{N_s}(t)); then X(t+1) = T(X(t)).
In addition, we know that Σ_{t=0}^∞ X(t) exists if K ∈ K. We denote this limit by X^K, and we have

X^K = X(0) + T(X^K).    (12)

The operator L is useful for value computation, since the coupled Lyapunov equations (10) can be written compactly as P_i^K = Q_i + K_i^T R_i K_i + L_i(P^K). Also, notice that L is actually a linear operator and has a matrix representation (see [27, Prop. 3.4] for more details). Now we are ready to present the following well-known result, which can be used to check whether K is in K or not.
Proposition 1 ([27]): The following assertions are equivalent: 1) K ∈ K, i.e., the closed-loop MJLS (5) is MSS; 2) the spectral radius of (the matrix representation of) T is strictly less than one, and equivalently so is that of L; 3) there exists Y ≻ 0 such that Y − L(Y) = Q + K^T R K (blockwise). Occasionally, we will use the notation L^K and T^K when there is a need to emphasize the dependence of these operators on K.
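Proposition 1 suggests a practical numerical test: assemble the matrix representation of T via Kronecker products and check its spectral radius. The sketch below is ours (the helper `ms_stable` is a hypothetical name); it uses the standard MJLS convention X_j(t+1) = Σ_i p_ij Γ_i X_i(t) Γ_i^T:

```python
import numpy as np

def ms_stable(Gamma, Pr):
    """Test mean-square stability of x_{t+1} = Gamma_{w(t)} x_t by checking
    the spectral radius of the matrix representation of T."""
    Ns, d = len(Gamma), Gamma[0].shape[0]
    T = np.zeros((Ns * d * d, Ns * d * d))
    for j in range(Ns):
        for i in range(Ns):
            # Block (j, i) acts on vec(X_i): vec(G X G^T) = (G kron G) vec(X).
            T[j*d*d:(j+1)*d*d, i*d*d:(i+1)*d*d] = Pr[i, j] * np.kron(Gamma[i], Gamma[i])
    return np.max(np.abs(np.linalg.eigvals(T))) < 1.0

# Scalar example: mode 1 is deterministically unstable (|1.2| > 1), yet the
# switched system is MSS because the chain spends enough time in mode 2.
Pr = np.array([[0.5, 0.5], [0.5, 0.5]])
mss = ms_stable([np.array([[1.2]]), np.array([[0.2]])], Pr)
not_mss = ms_stable([np.array([[1.2]]), np.array([[1.1]])], Pr)
```

The scalar example illustrates the gap between deterministic and mean-square stability that makes the MJLS feasible set K harder to characterize than its LTI counterpart.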

III. OPTIMIZATION LANDSCAPE AND COST PROPERTIES
In this section, we study the optimization landscape of the MJLS LQR problem and identify several useful properties of C(K). Based on [40, Lemma 1], the cost (9) is continuously differentiable with respect to K, and the gradient ∇C(K) can be calculated blockwise as

∇C(K)_i = 2 L_i^K X_i^K    (13)

where L^K = (L_1^K, ..., L_{N_s}^K) ∈ M^{N_s}_{k×d} and L_i^K is given by

L_i^K = ( R_i + B_i^T E_i(P^K) B_i ) K_i − B_i^T E_i(P^K) A_i.    (14)

Moreover, X^K in the above gradient formula is given by (12). Since K is a tuple of real matrices, we have ∇C(K) ∈ M^{N_s}_{k×d}. Next, we present an explicit formula for the Hessian of the cost. To avoid tensors, we restrict our analysis to the quadratic form of the Hessian ∇²C(K)[E, E] on a matrix sequence E ∈ M^{N_s}_{k×d}.

Lemma 1: For K ∈ K, the Hessian of the MJLS LQR cost C(K) applied to a direction E ∈ M^{N_s}_{k×d} is given by

∇²C(K)[E, E] = 2 Σ_{i∈Ω} tr( X_i^K E_i^T ( R_i + B_i^T E_i(P^K) B_i ) E_i ) − 4 Σ_{i∈Ω} tr( X_i^K E_i^T B_i^T E_i( (P^K)'[E] ) Γ_i )    (15)

where (P^K)'[E] := d/dt |_{t=0} P^{K+tE} and Γ = (Γ_1, ..., Γ_{N_s}).

Proof: Applying a Taylor series expansion about E [45], the quadratic form of the Hessian ∇²C(K) on E equals the second derivative of t ↦ C(K + tE) at t = 0. Writing (9) as C(K) = ⟨P^K, X(0)⟩ and differentiating the coupled Lyapunov equations (10) twice along the direction E, one obtains a coupled Lyapunov equation for the second derivative of P^{K+tE} whose forcing term involves E, (P^K)'[E], and Γ. Since T is the adjoint operator of L, the resulting inner product with X(0) can be rewritten as an inner product with X^K. We then get the desired result by noting that each block of X^K is symmetric and using the cyclic property of the trace.

Optimization Landscape for MJLS: Now we discuss the optimization landscape for the MJLS LQR problem. Notice that LTI systems are just a special case of MJLS. Since policy optimization for quadratic control of LTI systems is nonconvex, the same is true in the MJLS case. By examining the gradient formula (13), it becomes clear that, as long as X_i^K is full rank and π_i > 0 for all i, any stationary point given by ∇C(K) = 0 has to satisfy

K_i = ( R_i + B_i^T E_i(P^K) B_i )^{−1} B_i^T E_i(P^K) A_i for all i ∈ Ω.

Substituting this equation into (10) leads to the global solution K^* defined by the coupled AREs (6), and hence the only stationary point is the globally optimal solution.
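As a numerical sanity check of the gradient formula, one can solve (10) and (12) by fixed-point iteration and assemble the gradient blockwise; the result can then be validated against finite differences of the cost. Everything below (system data, helper names `solve_PK` and `solve_XK`) is our own illustrative sketch:

```python
import numpy as np

# Toy two-mode MJLS with open-loop-stable modes, so K = 0 is feasible.
A = [np.array([[0.5, 0.2], [0.0, 0.4]]), np.array([[0.3, 0.0], [0.1, 0.6]])]
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), np.eye(2)]
R = [np.eye(1), np.eye(1)]
Pr = np.array([[0.6, 0.4], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
Sigma0 = np.eye(2)           # E[x_0 x_0^T], full rank
Ns, d = 2, 2

def E_op(V, i):
    return sum(Pr[i, j] * V[j] for j in range(Ns))

def solve_PK(K, iters=300):
    """Coupled Lyapunov equations (10) by fixed-point iteration."""
    Gam = [A[i] - B[i] @ K[i] for i in range(Ns)]
    P = [np.zeros((d, d)) for _ in range(Ns)]
    for _ in range(iters):
        P = [Q[i] + K[i].T @ R[i] @ K[i] + Gam[i].T @ E_op(P, i) @ Gam[i]
             for i in range(Ns)]
    return P

def solve_XK(K, iters=300):
    """X^K from X = X(0) + T(X), with X_i(0) = pi_i * Sigma0."""
    Gam = [A[i] - B[i] @ K[i] for i in range(Ns)]
    X0 = [pi0[i] * Sigma0 for i in range(Ns)]
    X = [x.copy() for x in X0]
    for _ in range(iters):
        X = [X0[j] + sum(Pr[i, j] * Gam[i] @ X[i] @ Gam[i].T for i in range(Ns))
             for j in range(Ns)]
    return X

def cost(K):
    P = solve_PK(K)
    return sum(pi0[i] * np.trace(P[i] @ Sigma0) for i in range(Ns))

def grad(K):
    """Blockwise gradient: 2[(R_i + B_i^T E_i(P) B_i) K_i - B_i^T E_i(P) A_i] X_i."""
    P, X = solve_PK(K), solve_XK(K)
    return [2 * ((R[i] + B[i].T @ E_op(P, i) @ B[i]) @ K[i]
                 - B[i].T @ E_op(P, i) @ A[i]) @ X[i] for i in range(Ns)]
```

A central-difference check of `grad` against `cost` at K = 0 matches to several digits, which is a useful guard against index or adjoint mistakes in an implementation.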
When the initial mode is sufficiently random, i.e., π_i > 0 for all i, the optimization landscape for the MJLS case becomes quite similar to that of the classic LQR case. Based on this similarity, it is reasonable to expect gradient-based methods to work well in the MJLS LQR setting despite the nonconvex nature of the problem. Compared with the LTI case, however, the characterization of K is more complicated for MJLS. Hence, one main technical issue is how to show that gradient-based methods can handle the feasibility constraint K ∈ K without using projection.
Key Properties of the MJLS LQR Cost: To analyze the performance of gradient-based methods for the MJLS LQR problem, a few key properties of C(K) will be used. By assumption, we have μ := min_{i∈Ω}(π_i) · σ_min( E_{x_0∼D}[x_0 x_0^T] ) > 0. Then, we can identify several key properties of C(K) as follows.
Lemma 2: The cost (9) satisfies the following properties.
1) Coercivity: The cost function C is coercive in the sense that C(K_l) → +∞ for any sequence {K_l} ⊂ K such that either ‖K_l‖ → +∞ or K_l converges to an element K on the boundary ∂K.
2) Almost smoothness: Given K, K' ∈ K, the cost function C defined in (3) satisfies

C(K') − C(K) = 2 Σ_{i∈Ω} tr( X_i^{K'} (K'_i − K_i)^T L_i^K ) + Σ_{i∈Ω} tr( X_i^{K'} (K'_i − K_i)^T ( R_i + B_i^T E_i(P^K) B_i )(K'_i − K_i) ).

3) Gradient dominance: Given the optimal policy K^*, the following inequality holds for any K ∈ K:

C(K) − C(K^*) ≤ c ‖∇C(K)‖²,  c := max_{i∈Ω} ‖X_i^{K^*}‖ / ( 4 μ² min_{i∈Ω} σ_min(R_i) ).

4) Compactness of sublevel sets: For any finite α, the sublevel set K_α := {K ∈ K : C(K) ≤ α} is compact.

5) Smoothness on the sublevel sets:
For any sublevel set K_α, choose the smoothness constant L accordingly (the explicit expression depends on α, μ, and the problem data through an auxiliary constant ξ). Then, for any K ∈ K_α, we have ‖∇²C(K)‖ ≤ L. In addition, for any (K, K') satisfying tK' + (1 − t)K ∈ K_α for all t ∈ [0, 1], the following inequality holds:

C(K') ≤ C(K) + ⟨∇C(K), K' − K⟩ + (L/2) ‖K' − K‖².    (21)

Proof: To prove Statement 1, first notice that (9) and (10), together with μ > 0 and R_i ≻ 0 for all i, imply that C(K) is lower bounded by a term growing quadratically in ‖K‖. This directly shows that C(K_l) → +∞ as ‖K_l‖ → +∞. Next, we assume K_l → K ∈ ∂K. Based on Proposition 1, we know that for all l there exists Y_l ≻ 0 such that

Y_l − L^{K_l}(Y_l) = Q + K_l^T R K_l

where the dependence of L on K is emphasized by the superscript. We now want to show that the sequence {Y_l} is unbounded, and will use a contradiction argument. Suppose that {Y_l} is bounded. By the Bolzano-Weierstrass theorem, {Y_l} admits a subsequence {Y_{l_n}}_{n=0}^∞, which converges to some limit point denoted by Y. Clearly, we have Y ⪰ 0. For the same subsequence {l_n}_{n=0}^∞, we have lim_{n→∞} K_{l_n} = K ∈ ∂K. For all l_n, we still have Y_{l_n} − L^{K_{l_n}}(Y_{l_n}) = Q + K_{l_n}^T R K_{l_n}. Now letting n go to ∞, by continuity, leads to the equation Y − L^K(Y) = Q + K^T R K. Since Q ≻ 0, R ≻ 0, and L^K(Y) ⪰ 0, we conclude Y ≻ 0. By Proposition 1, we have K ∈ K, and this contradicts the fact that K ∈ ∂K. Therefore, {Y_l} must be unbounded. Since each Y_l is positive definite, we can further conclude that {tr(Y_l)} is unbounded and C(K_l) → ∞. This completes the proof of Statement 1. Next, we prove Statement 2. Recall that Γ'_i = A_i − B_i K'_i. Based on (10), one can show that the difference P^{K'} − P^K satisfies a coupled Lyapunov equation driven by the operator L^{K'}, whose forcing term can be expressed in terms of L^K and (K' − K). Here, the notation L^{K'} emphasizes that this is the operator associated with K'. Now we can prove Statement 2 by applying this equation and the fact that T^{K'} is the adjoint operator of L^{K'}.
To prove Statement 3, we rewrite the almost-smoothness condition in Statement 2 and complete the square with respect to (K'_i − K_i); minimizing the resulting expression over K' then yields the gradient-dominance inequality. Statement 4 can be proved using the continuity and coercivity of C(K). With the coercive property in place, we can continuously extend the function domain from K to M^{N_s}_{k×d} by allowing ∞ as a function value. Based on [46, Prop. 11.12], we know that K_α is bounded for any finite α. Since C(K) is continuous on K, the set K_α is also closed. Hence, Statement 4 holds as desired.
Finally, to prove Statement 5, we only need to bound the norm of ∇²C(K); the desired conclusion then follows by applying the mean-value theorem. Since ∇²C(K) is self-adjoint, its operator norm can be characterized as

‖∇²C(K)‖ = sup { |∇²C(K)[E, E]| : E ∈ M^{N_s}_{k×d}, ‖E‖ = 1 }.

Based on the Hessian formula (15), we have |∇²C(K)[E, E]| ≤ 2q_1 + 4q_2, where, for simplicity, q_1 and q_2 denote the suprema over ‖E‖ = 1 of the absolute values of the first and second sums on the right-hand side of (15), respectively. As a matter of fact, q_1 and q_2 can be bounded as in (23) and (24); the proofs of (23) and (24) are tedious and hence deferred to the appendix for readability. Now we are ready to prove the L-smoothness of C(K) within the set K_α. Notice that C(K) ≤ α for any K ∈ K_α. Hence, we can combine (23) and (24) to show 2q_1 + 4q_2 ≤ L, where L is given in Statement 5. Based on the mean-value theorem, this leads to the desired conclusion. It is worth emphasizing that (21) only holds when the line segment between K and K' is in K_α. Since K_α is nonconvex in general, it is possible that there exist K, K' ∈ K_α such that (21) does not hold. Now we briefly explain the importance of the above properties. When applying the gradient method to search for K^*, the following two issues need to be addressed, and our techniques heavily rely on the above cost properties. 1) Feasibility: One has to ensure that the iterates generated by the gradient method always stay in the nonconvex feasible set K. The coercivity implies that the function C(K) serves as a barrier function on K. Based on the coercivity and the compactness of the sublevel sets, one can show that the decrease of the cost ensures that the next iterate stays inside K. 2) Convergence: After ensuring feasibility, one next needs to show that the iterates generated by the optimization method converge to K^*. The smoothness and gradient-dominance properties play a key role in the convergence proof in the absence of convexity.

IV. GRADIENT METHOD AND CONVERGENCE
In Section II-C, we reviewed policy optimization for the LTI case. In this section, we consider the gradient method in the MJLS LQR setting and provide new global-convergence guarantees. In the MJLS LQR setting, the gradient method iterates as

K^{n+1} = K^n − η ∇C(K^n)    (25)

where K^0 is required to be in K. The stepsize η is a hyperparameter to be tuned. When the parameters (A, B, Q, R) are exactly known, the gradient ∇C(K^n) can be evaluated using the formula (13). If the model parameters are unknown, one can still estimate the gradient from data using either zeroth-order optimization [41] or the policy-gradient theorem [47]. Now, we present the convergence theory for the update rule (25) with exact gradient information. We first need to ensure that the iterates generated by (25) always stay in K. Consider the one-step gradient update K' = K − 2ηL^K X^K. We will use the coercivity of C(K) and the compactness of K_α to show that, for all K ∈ K, we can choose η such that K' will also be in K.

Lemma 3: Suppose K ∈ K_α and K' = K − η∇C(K). Set L as described in Statement 5 of Lemma 2. If 0 < η ≤ 1/L, then we have K' ∈ K_α ⊂ K and

C(K') ≤ C(K) − (η/2) ‖∇C(K)‖².    (26)

Proof: We define the interior of K_α as K_α^o := {K ∈ K : C(K) < α} and, for a slightly larger level α⁺ > α chosen such that ‖∇²C(K)‖ ≤ 1.1L for all K ∈ K_{α⁺}, we define K_{α⁺}^o similarly. Clearly (K_{α⁺}^o)^c is a closed set and (K_{α⁺}^o)^c ∩ K_α = ∅. Since K_α is compact, we know that the distance between K_α and (K_{α⁺}^o)^c is strictly positive. We denote this distance by δ. Let us choose τ = min{0.9δ/‖∇C(K)‖, 1/(1.1L)}. Obviously, the line segment between K and (K − τ∇C(K)) is in K_{α⁺}. Since ‖∇²C(K)‖ ≤ 1.1L on K_{α⁺}, we have

C(K − τ∇C(K)) ≤ C(K) + ( −τ + 1.1Lτ²/2 ) ‖∇C(K)‖².

As long as τ ≤ 2/(1.1L), we have −τ + 1.1Lτ²/2 ≤ 0 and C(K − τ∇C(K)) ≤ C(K). Hence, we have K − τ∇C(K) ∈ K_α. Actually, it is straightforward to see that the entire line segment between K and (K − τ∇C(K)) is in K_α. The rest of the proof follows from induction. We can apply the same argument to show that the line segment between (K − τ∇C(K)) and (K − 2τ∇C(K)) is also in K_α. This means that the line segment between K and (K − 2τ∇C(K)) is in K_α.
Since τ > 0, we only need to apply the above argument finitely many times to show that the line segment between K and (K − η∇C(K)) is in K_α for any 0 < η ≤ 1/L.² Since ‖∇²C(K)‖ ≤ L for all K ∈ K_α, we have

C(K') ≤ C(K) + ( −η + Lη²/2 ) ‖∇C(K)‖² ≤ C(K) − (η/2) ‖∇C(K)‖²

where the last step follows from the fact that 0 < η ≤ 1/L. This completes the proof.
Next, we can combine (26) with the gradient-dominance property to show that each step of the gradient-descent method contracts the optimality gap. This step is quite standard.
Lemma 4: Suppose K ∈ K_α and K' = K − η∇C(K). Set L as described in Statement 5 of Lemma 2. If 0 < η ≤ 1/L, then the following inequality holds:

C(K') − C(K^*) ≤ ( 1 − η/(2c) ) ( C(K) − C(K^*) )

where c is the gradient-dominance constant from Statement 3 of Lemma 2, i.e., the constant satisfying C(K) − C(K^*) ≤ c‖∇C(K)‖² for all K ∈ K.

Proof: By Lemma 3, we know that K' is stabilizing. We can combine (26) with Statement 3 in Lemma 2 to show

C(K') ≤ C(K) − (η/2) ‖∇C(K)‖² ≤ C(K) − ( η/(2c) ) ( C(K) − C(K^*) )

which directly leads to the desired conclusion. Now we are ready to prove the global convergence of the policy-gradient method (25).
Theorem 1: Suppose K^0 ∈ K. Choose α = C(K^0) and set L as described in Statement 5 of Lemma 2. For any stepsize 0 < η ≤ 1/L, the iterates generated by the gradient-descent method (25) always stay in K and converge to the global minimum K^* linearly as follows:

C(K^n) − C(K^*) ≤ ( 1 − η/(2c) )^n ( C(K^0) − C(K^*) )    (27)

where c is the gradient-dominance constant in Statement 3 of Lemma 2.

Proof: We use an induction argument. Since α = C(K^0), we have K^0 ∈ K_α. By Lemma 4, we know that (27) holds for n = 1. Since C(K^1) ≤ C(K^0), we have K^1 ∈ K_α. We can apply Lemma 4 again to show that (27) holds for n = 2. It is now clear that we can repeatedly apply the above argument to show that (27) holds for any n.
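To illustrate the behavior guaranteed by Lemma 3 and Theorem 1 numerically, the self-contained sketch below runs the gradient method (25) on a toy two-mode MJLS from the feasible initial policy K = 0 with a conservatively small stepsize. All data and helper names (`lyap_P`, `lyap_X`) are our own illustrative choices:

```python
import numpy as np

# Toy two-mode MJLS (illustrative data); K = 0 is mean-square stabilizing here.
A = [np.array([[0.5, 0.2], [0.0, 0.4]]), np.array([[0.3, 0.0], [0.1, 0.6]])]
B = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Q = [np.eye(2), np.eye(2)]
R = [np.eye(1), np.eye(1)]
Pr = np.array([[0.6, 0.4], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
Sigma0 = np.eye(2)
Ns, d = 2, 2

def E_op(V, i):
    return sum(Pr[i, j] * V[j] for j in range(Ns))

def lyap_P(K, iters=150):
    """Coupled Lyapunov equations (10) by fixed-point iteration."""
    Gam = [A[i] - B[i] @ K[i] for i in range(Ns)]
    P = [np.zeros((d, d)) for _ in range(Ns)]
    for _ in range(iters):
        P = [Q[i] + K[i].T @ R[i] @ K[i] + Gam[i].T @ E_op(P, i) @ Gam[i]
             for i in range(Ns)]
    return P

def lyap_X(K, iters=150):
    """X^K from X = X(0) + T(X), with X_i(0) = pi_i * Sigma0."""
    Gam = [A[i] - B[i] @ K[i] for i in range(Ns)]
    X0 = [pi0[i] * Sigma0 for i in range(Ns)]
    X = [x.copy() for x in X0]
    for _ in range(iters):
        X = [X0[j] + sum(Pr[i, j] * Gam[i] @ X[i] @ Gam[i].T for i in range(Ns))
             for j in range(Ns)]
    return X

def cost(K):
    return sum(pi0[i] * np.trace(lyap_P(K)[i] @ Sigma0) for i in range(Ns))

def grad(K):
    P, X = lyap_P(K), lyap_X(K)
    return [2 * ((R[i] + B[i].T @ E_op(P, i) @ B[i]) @ K[i]
                 - B[i].T @ E_op(P, i) @ A[i]) @ X[i] for i in range(Ns)]

# Gradient method (25): K^{n+1} = K^n - eta * grad C(K^n).
K = [np.zeros((1, 2)) for _ in range(Ns)]
eta = 0.02
costs = [cost(K)]
for _ in range(200):
    g = grad(K)
    K = [K[i] - eta * g[i] for i in range(Ns)]
    costs.append(cost(K))
```

With a stepsize well below 1/L, the recorded costs decrease monotonically and the gradient norm shrinks along the iterates, matching the qualitative predictions of Lemma 3 and Theorem 1.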
From the above proof, one can see that, without using projection, one can still guarantee that the gradient method stays in the feasible set and converges to the global minimum. When the model parameters are known, there are many other methods that can be used to compute K^* [27], [48]. We do not claim that the gradient method is more desirable than other methods when the model information is known. The purpose of our study is to bring new insights for understanding policy-based RL methods in the MJLS setting. When the model is unknown, one can still apply model-free techniques such as zeroth-order optimization [49], [50] to estimate the gradient ∇C(K) from data. Then, the gradient-estimation errors have to be explicitly addressed, and this has recently been done in the follow-up work [41]. The analysis in [41] combines the theory in our article with estimation error bounds to handle the model-free case.
Remark 1: The above proof technique is more general than the implicit-regularization arguments in our previous conference paper [40], which addresses the convergence of the natural gradient method,³ iterated as K^{n+1} = K^n − η∇C(K^n)(X^{K^n})^{−1}. In [40], it has been shown that the natural-gradient method with a suitably small constant stepsize always stays in K and converges to the global minimum K^* at a linear rate. The proof in [40] relies on an implicit-regularization argument, where P^K is used to construct a Lyapunov function for K' and then guarantee K' ∈ K. Similar ideas have been used to show the convergence properties of policy-optimization methods for the mixed H₂/H∞ state-feedback design problem [23]. However, such an implicit-regularization proof idea does not work for the gradient method: the value function at step n cannot be directly used as a Lyapunov function at step (n + 1). A similar fact has also been observed for the mixed H₂/H∞ control problem [23].

²The argument even works for any η ≤ 2/(1.1L). Since the stepsize leading to the fastest convergence rate is 1/L, we state our result only for η ≤ 1/L.

³See Appendix A.1 in the arXiv version of [47] for more explanation of the terminology "natural gradient."

A. Proof of the Bound (23)
The proof of (23) is straightforward. Notice that

q_1 ≤ max_{i∈Ω} ‖ R_i + B_i^T E_i(P^K) B_i ‖ · Σ_{i∈Ω} tr(X_i^K)    (A.1)

where the last step follows from (11). Hence, we only need to bound ‖P^K‖_max := max_{i∈Ω} ‖P_i^K‖ and Σ_{i∈Ω} tr(X_i^K). Recall (9). Therefore, we have

C(K) = Σ_{i∈Ω} π_i E_{x_0∼D}[ x_0^T P_i^K x_0 ] ≥ μ ‖P^K‖_max

which leads to the following upper bound:

‖P^K‖_max ≤ C(K)/μ.    (A.2)

Notice that T is the adjoint operator of L. Hence, we also have C(K) = Σ_{i∈Ω} tr( (Q_i + K_i^T R_i K_i) X_i^K ) ≥ min_{i∈Ω} σ_min(Q_i) · Σ_{i∈Ω} tr(X_i^K), i.e.,

Σ_{i∈Ω} tr(X_i^K) ≤ C(K) / min_{i∈Ω} σ_min(Q_i).    (A.3)

Combining (A.1)-(A.3) yields (23).

B. Proof of the Bound (24)
For simplicity, we shorten the notation (P^K)'[E] to (P^K)'. To prove (24), first notice that we can use the Cauchy-Schwarz inequality to show

| Σ_{i∈Ω} tr( X_i^K E_i^T B_i^T E_i((P^K)') Γ_i ) | ≤ ‖ E (X^K)^{1/2} ‖ · ‖ B^T E((P^K)') Γ (X^K)^{1/2} ‖.    (B.1)

Next, we bound ‖ B^T E((P^K)') Γ (X^K)^{1/2} ‖² as follows. Since E_i((P^K)') E_i((P^K)') is positive semidefinite, we have

Σ_{i∈Ω} tr( E_i((P^K)') E_i((P^K)') Γ_i X_i^K Γ_i^T ) ≤ max_{i∈Ω} ‖E_i((P^K)')‖² · Σ_{i∈Ω} tr( Γ_i X_i^K Γ_i^T ).

If K ∈ K, we know T(X^K) − X^K ≺ 0 (indeed, T(X^K) − X^K = −X(0) by (12)), and hence the following also holds: Σ_{i∈Ω} tr( Γ_i X_i^K Γ_i^T ) ≤ Σ_{i∈Ω} tr(X_i^K). Therefore, substituting the above bounds into (B.1) leads to

| Σ_{i∈Ω} tr( X_i^K E_i^T B_i^T E_i((P^K)') Γ_i ) | ≤ max_{i∈Ω} ‖B_i‖ · max_{i∈Ω} ‖E_i((P^K)')‖ · Σ_{j∈Ω} tr(X_j^K)    (B.2)

since ‖(X^K)^{1/2}‖² = Σ_{j∈Ω} tr(X_j^K) and ‖E‖ = 1. Based on (B.2), proving (24) only requires showing that the following bound holds for any ‖E‖ = 1 and K ∈ K:

max_{i∈Ω} ‖E_i((P^K)')‖ ≤ ξ C(K)/μ    (B.3)

where ξ is the constant from Statement 5 of Lemma 2. Once (B.3) is proved, it can be combined with (B.2), (A.2), and (A.3) to verify (24) easily. Now the only remaining task is to prove (B.3). Let us first show (P^K)'[E] ⪯ ξ P^K given ‖E‖ = 1. We will use [27, Corollary 2.7], which states that X̃ ⪯ X if (X, X̃) satisfy X − L(X) = S and X̃ − L(X̃) = S̃ with S̃ ⪯ S and K ∈ K. Since E(P^K) ⪰ 0 and R ≻ 0, one can verify that the forcing term of the coupled Lyapunov equation satisfied by (P^K)' is upper bounded by ξ times the forcing term of (10); hence (P^K)' ⪯ ξ P^K. Combining this with the bound ‖P^K‖_max ≤ C(K)/μ in (A.2) directly leads to (B.3), which completes the proof.