Stability-certified reinforcement learning: A control-theoretic perspective

We investigate the important problem of certifying the stability of reinforcement learning policies when interconnected with nonlinear dynamical systems. We show that by regulating the input-output gradients of policies, strong guarantees of robust stability can be obtained based on a proposed semidefinite programming feasibility problem. The method is able to certify a large set of stabilizing controllers by exploiting problem-specific structures; furthermore, we analyze and establish its (non)conservatism. Empirical evaluations on two decentralized control tasks, namely multi-agent flight formation and power system frequency regulation, demonstrate that the reinforcement learning agents can achieve high performance within the stability-certified parameter space and also exhibit stable learning behaviors in the long run.


I. INTRODUCTION
Reinforcement learning (RL) aims at guiding an agent to perform a task as efficiently and skillfully as possible through interactions with the environment. Consider the interconnected system illustrated in Fig. 1, where G is the environment and π_θ is the policy. The goal of RL is to maximize the expected return:

η(π_θ) = E[ Σ_{t=0}^∞ ρ^t r(x_t, u_t) ],   (1)

where r(x, u) is the reward at state x ∈ R^{n_s} and action u ∈ R^{n_a}, ρ ∈ (0, 1] is the future discount factor, and the expectation is taken over the policy π_θ as well as the initial state distribution and the dynamics G. While remarkable progress has been made in RL algorithms, such as policy gradient [1]–[3], Q-learning [4], [5], and actor-critic methods [6], [7], a fundamental issue that remains unresolved in the literature is how to analyze or certify the stability of the interconnected system, which is closely related to the safety aspect of reinforcement learning [8]–[10]. Stability verification is challenging for two key reasons: (i) both the environment and the control policy (e.g., deep neural networks) are often highly nonlinear; and (ii) the policy changes dynamically during the learning phase.

FIGURE 1. The goal of RL is to maximize expected rewards through interaction and exploration. In this study, we address the challenge of maintaining the stability of the interconnected system during the learning and control process.

In robust control, Lyapunov functions are widely used to analyze and verify stability under uncertainty [11], [12]. For nonlinear systems without global stability guarantees, region-of-attraction and reachability analyses have been employed for local convergence analysis [13]–[15]. The main challenge of these methods is that the robustness guarantee can be conservative due to coarse constraints on the nonlinearity, such as those based on Lipschitz constants [16], [17], leading to a limited search space for safe policies. To mitigate this issue, the integral quadratic constraint (IQC) framework proposed in [18] can be employed, which has been widely used to analyze the stability of large-scale complex systems, such as aircraft control [19]. Nevertheless, existing techniques can be computationally intensive for deep neural networks, and establishing necessary conditions for robustness has been limited to only a few cases (e.g., block-diagonal structured uncertainty operators with bounded singular values [20]).
To address this issue, we introduce a more informative quadratic constraint and analyze the necessity of the certificate criterion, as an extension of the preliminary conference version [21]. This opens up new possibilities to safely apply deep RL to nonlinear large-scale real-world systems whose stability is otherwise impossible to certify using existing approaches. As an overview, we propose a ''safety set'' of policies P(ξ) based on partial gradients, i.e.,

P(ξ) = { π : ξ_ij ≤ ∂_j π_i(y) ≤ ξ̄_ij, ∀ i ∈ [n_a], j ∈ [n_s], y ∈ R^{n_s} },   (2)

where y ∈ R^{n_s} is the output vector, n_a and n_s are the dimensions of the input and output, respectively, and ∂_j π_i is the partial derivative of π_i with respect to the j-th input. Our framework designs a set of numerical bounds ξ ∈ R^{n_a×n_s} (lower) and ξ̄ ∈ R^{n_a×n_s} (upper) such that, as long as the policy stays within the ''safety set,'' the stability of the interconnected system is guaranteed. Importantly, this work bridges robust control and reinforcement learning to provide provable guarantees of stability during the learning and control process. The key contributions of the present study are as follows:
• Development of a general framework to certify the stability of reinforcement learning for nonlinear systems, with new strategies to incorporate the stability requirement into RL;
• A new characterization of the safety set of policies based on a point-wise IQC, as well as the analysis of its non-conservatism;
• Numerical evaluation on decentralized control problems for flight formation and power grid frequency regulation.
The rest of the paper is organized as follows. We formulate the problem in Section II. The main results for the stability analysis are presented in Section III, where we also analyze the conservatism of the certificate. Section IV presents numerical studies on two nonlinear decentralized control tasks. Conclusions are drawn in Section V.

NOTATIONS
Let A ⪰ B or A ≻ B denote that A − B is positive semidefinite or positive definite, respectively. Also, R and R_+ represent the sets of real and nonnegative real numbers, respectively.
The notation [n] denotes the set {1, . . . , n}. We use x to denote a vector, and boldface x to denote a signal with value x(t) at each time t. Also, we use x_{0:T} to denote the signal over the time interval [0, T]. We use {x_ij}_{i∈[m], j∈[n]} to denote a collection of entries x_ij in a vector form, or simply {x_ij} if the ranges of i and j are clear from the context. By convention, we arrange the entries in {x_ij}_{i∈[m], j∈[n]} such that x_ij is at location k = (i − 1)n + j. Similarly, we use {x_ij} to denote a collection of signals. The i-th entry of a vector x is written as x_i or [x]_i, and the i-th dimension of a signal x is x_i or [x]_i. For norms, we denote |x| = √(x_1² + · · · + x_n²) as the 2-norm of a vector x, and ‖x‖ = √(∫_0^∞ |x(t)|² dt) as the 2-norm of a signal. When ‖·‖ is applied to an operator, it denotes the induced norm from L to L, where L is the vector space of signals with bounded 2-norm; when we need to specify n as the spatial dimension of a signal, we write L^n. Also, ⟨x, y⟩ = ∫_0^∞ x(t)^⊤ y(t) dt denotes the inner product between two signals in L; therefore, ‖x‖ = √⟨x, x⟩. The notation Ψ* is used for the adjoint of an operator Ψ. Given a matrix M ∈ R^{m×n}, diag(M) denotes a diagonal matrix whose diagonal elements are given by the entries M_11, . . . , M_1n, M_21, . . . , M_2n, . . . , M_m1, . . . , M_mn. We use ⊗ to denote the Kronecker product.

II. PROBLEM FORMULATION
Consider a continuous-time dynamical system:

ẋ(t) = f_t(x(t), u(t)),

with the state x(t) ∈ R^{n_s} and the control action u(t) ∈ R^{n_a}. In general, f_t(·, ·) can be a time-varying and nonlinear function.
In this work, we focus on an important class of systems described by:

ẋ(t) = A x(t) + B u(t) + g_t(x(t)),   (3)

where f_t(·, ·) comprises a linear time-invariant (LTI) component modeled by a matrix A ∈ R^{n_s×n_s} that is Hurwitz (i.e., every eigenvalue of A has a strictly negative real part) and a control matrix B ∈ R^{n_s×n_a}, as well as a slowly time-varying component g_t(·) that captures the nonlinearity and uncertainty of the original system. The output y(t) = Cx(t) ∈ R^{n_s} is a linear function of the states, where C ∈ R^{n_s×n_s} may have a sparsity pattern in the context of decentralized control [22]. Nevertheless, since we can account for the sparsity pattern in the design of the control policy, we henceforth assume that y(t) = x(t) for simplicity. The control input based on RL is

u(t) = π_t(y(t); θ_t) + e(t),   (4)

where π_t(y(t); θ_t) is usually a neural network parametrized by θ_t (which can be time-varying during learning). The vector e(t) ∈ R^{n_a} captures the input perturbation, which is assumed to have bounded energy over time (‖e‖² = ∫_0^∞ |e(t)|² dt < ∞). Let {π_t | t ∈ [0, T]} be the trajectory of policies deployed in the system over the time interval [0, T]. We propose the following notion of dynamic stability, adapted from the classical definition of the L_2 gain [20], [23].
Definition 1 (Dynamic Input-Output Stability): Given a dynamical system G, the L_2 gain of the system G with the input u(t) = π_t(y(t); θ_t) + e(t) is defined as the worst-case ratio between the total output energy and the total input perturbation energy:

γ(G, {π_t | t ∈ [0, T]}) = sup_{e ∈ L_2, ‖e‖ ≠ 0} ‖y‖ / ‖e‖,

where L_2 is the set of all square-integrable signals. If this gain is bounded by a finite number γ̄, then the closed-loop system is said to be dynamically input-output stable (with finite gain γ̄).
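As a quick numerical illustration of Definition 1 (a toy example of our own, not from the paper), the following sketch estimates the energy ratio ‖y‖/‖e‖ for one particular finite-energy disturbance by forward-Euler simulation; since the L_2 gain is the supremum of this ratio over all admissible disturbances, any single run only provides a lower bound:

```python
import numpy as np

def estimate_l2_gain(A, B, policy, e_signal, dt=1e-3, T=20.0):
    """Estimate ||y|| / ||e|| for one disturbance by simulating
    x' = A x + B (pi(x) + e) with y = x (forward Euler)."""
    x = np.zeros(A.shape[0])
    y_energy = e_energy = 0.0
    for k in range(int(T / dt)):
        e = e_signal(k * dt)
        x = x + dt * (A @ x + B @ (policy(x) + e))
        y_energy += dt * np.dot(x, x)      # accumulate integral of |y|^2
        e_energy += dt * np.dot(e, e)      # accumulate integral of |e|^2
    return np.sqrt(y_energy / e_energy)

# toy example: stable A, a 0.5-Lipschitz saturating policy, decaying pulse e
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
B = np.array([[0.0], [1.0]])
policy = lambda x: np.tanh(-0.5 * x[[0]])
e_signal = lambda t: np.array([np.exp(-t)])
print(estimate_l2_gain(A, B, policy, e_signal))  # lower bound on the gain
```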
The stability-certified RL problem is thus to find an optimal policy π_θ that maximizes the expected reward η(π_θ) defined in (1) while ensuring that the system is dynamically input-output stable, i.e., γ(G, {π_t | t ∈ [0, T]}) < γ̄ for a given finite gain γ̄. Our approach is to delineate a safety set of controllers P(ξ) based on partial gradients, such that as long as π_t ∈ P(ξ) for all t ∈ [0, T], we can guarantee that γ(G, {π_t | t ∈ [0, T]}) is bounded by γ̄. After identifying the safety set, one can impose it via a constraint during RL, and the problem can then be solved by any RL algorithm (e.g., policy gradient [1], [2], Q-learning [4], [24], actor-critic [6], [7]). Furthermore, the certification can be regarded as an S-procedure (cf. [25]), and we analyze its non-conservatism by showing that this condition is necessary for the robustness of a surrogate system that is closely related to the original system.

A. QUADRATIC CONSTRAINTS ON GRADIENT-BOUNDED FUNCTIONS
First, we introduce the time-domain definition of IQC [26]:
Definition 2 (IQC in the Time Domain): Consider the signals w ∈ L and y ∈ L with w = ∆(y), where ∆ is a bounded and causal operator. Let Ψ be a stable LTI system and M = M^⊤ be a symmetric matrix. Then, ∆ is said to satisfy the hard IQC defined by (Ψ, M), denoted by ∆ ∈ IQC(Ψ, M), if

∫_0^T z(t)^⊤ M z(t) dt ≥ 0, ∀ T ≥ 0,

where z = Ψ [y; w] is the filtered output given by Ψ. Furthermore, if

z(t)^⊤ M z(t) ≥ 0, ∀ t ≥ 0,   (8)

then ∆ is said to satisfy the point-wise IQC.
Recall that a function h(x) is Lipschitz continuous with constant ξ > 0 if and only if it satisfies the following quadratic constraint for all x_α, x_β ∈ R^n:

[x_α − x_β; h(x_α) − h(x_β)]^⊤ [ξ² I, 0; 0, −I] [x_α − x_β; h(x_α) − h(x_β)] ≥ 0,   (9)

which corresponds to the point-wise IQC (8) with Ψ = I and M = [ξ² I, 0; 0, −I]. It can be observed that a Lipschitz continuous function with constant ξ is a member of P(ξ), where ξ̄_ij = −ξ_ij = ξ. Nevertheless, this constraint can sometimes be too conservative. For example, consider the function h : R² → R² in (10), where |a| ≤ 0.1 is a deterministic but unknown parameter with a bounded magnitude. Clearly, the function has Lipschitz constant 1; however, the above characterization ignores the non-homogeneity of h_1(x) (i.e., the first output of h(x)) and h_2(x), as well as the sparsity of the dependence on x. Indeed, h_1(x) only depends on x_1 with its slope restricted to [−0.1, 0.6] for all possible |a| ≤ 0.1, and h_2(x) only depends on x_2 with its slope restricted to [−1, 1]. In the context of control synthesis, the non-homogeneity of outputs often arises from physical constraints, and the sparsity of inputs can be due to distributed local information. To explicitly address these requirements, we propose the following quadratic constraint.
Lemma 1 (Function With Bounded Partial Gradients): Consider an arbitrary vector-valued function h : R^n → R^m that is differentiable with bounded partial derivatives (i.e., ξ_ij ≤ ∂_j h_i(x) ≤ ξ̄_ij for all x ∈ R^n). Define the vectors c ∈ R^{nm} and c̄ ∈ R^{nm} with the entries c_ij = ½(ξ̄_ij + ξ_ij) and c̄_ij = ξ̄_ij − c_ij for i ∈ [m] and j ∈ [n]. Let W = I_m ⊗ 1_{1×n}. Then, for all λ ∈ R^{n×m}_+ and x_α, x_β ∈ R^n, there exists a vector-valued function q : R^{2n} → R^{nm} (with entries q_ij(·, ·) for i ∈ [m] and j ∈ [n]) with the property that

h_i(x_α) − h_i(x_β) = Σ_{j=1}^n q_ij(x_α, x_β), ∀ i ∈ [m],   (11)

such that the following quadratic constraint is satisfied:

[x_α − x_β; q(x_α, x_β)]^⊤ M(λ; ξ) [x_α − x_β; q(x_α, x_β)] ≥ 0,   (12)

where M(λ; ξ) is given by

M(λ; ξ) = [ D(λ), U(λ, c); U(λ, c)^⊤, −diag({λ_ij}) ],   (13)

with D(λ) ∈ R^{n×n} diagonal, [D(λ)]_jj = Σ_{i∈[m]} λ_ij (c̄_ij² − c_ij²), and U(λ, c) ∈ R^{n×mn} given by [U(λ, c)]_{j,(i−1)n+j} = λ_ij c_ij and zero elsewhere.
The above bound is a direct consequence of standard tools in real analysis [27]. To understand this result, it can be observed that (12) is equivalent to:

Σ_{i∈[m], j∈[n]} λ_ij ( c̄_ij² (x_{α,j} − x_{β,j})² − (q_ij − c_ij (x_{α,j} − x_{β,j}))² ) ≥ 0,   (14)

for λ_ij ≥ 0 and h_i(x_α) − h_i(x_β) = Σ_{j=1}^n q_ij, where we omit the dependence of q_ij on x_α and x_β for notational simplicity. Since (14) holds for all λ_ij ≥ 0, it is equivalent to the condition that (q_ij − c_ij (x_{α,j} − x_{β,j}))² ≤ c̄_ij² (x_{α,j} − x_{β,j})² for each i and j, which is a direct result of the bounds imposed on the partial derivatives of h_i. If we apply it to the example function (10), we can specify that ξ_11 = −0.1, ξ̄_11 = 0.6, ξ_22 = −1, ξ̄_22 = 1, and all the other bounds (ξ_12, ξ̄_12, ξ_21, ξ̄_21) are zero. This clearly yields a more informative constraint than merely relying on the Lipschitz constraint (9). In fact, for a differentiable l-Lipschitz function, we have ξ̄_ij = −ξ_ij = l, and by restricting the multipliers λ_ij to an appropriate uniform choice, (12) is reduced to (9). However, as illustrated in this example and the experiments in Section IV, since the condition in Lemma 1 can incorporate richer information about the problem structure, it often gives rise to less restrictive stability bounds compared to (9). Remark 1: The constraint introduced above can be considered as a point-wise IQC; however, it is unconventional in the sense that it involves an intermediate unknown function q(·, ·) that is related to the output h(·). For stability analysis, let x_β = x* ∈ R^n be the equilibrium point, and without loss of generality, assume that x* = 0 and h(x*) = 0.
Then, one can define the quadratic functions

φ_ij(x, q) = c̄_ij² x_j² − (q_ij − c_ij x_j)²,

and the condition (12) can be written as

Σ_{i,j} λ_ij φ_ij(x, q) ≥ 0,   (15)

which can be used to characterize the set of (x, q) associated with the function h(·), as we will discuss in Section III-D.
To simplify the mathematical treatment, we have focused on differentiable functions in Lemma 1; nevertheless, the analysis can be extended to non-differentiable but continuous functions (e.g., the ReLU function max{0, x}) using the notion of generalized gradient [28, Chap. 2]. In brief, by reassigning the bounds on partial derivatives to uniform bounds on the set of generalized partial derivatives, the constraint (12) can be directly applied.
In relation to the existing IQCs, this constraint has wider applications for the characterization of gradient-bounded functions. The Zames-Falb IQC introduced in [29] has been widely used for single-input single-output (SISO) functions h : R → R, but it requires the function to be monotone, with the slope restricted to [0, +∞). The multi-input multi-output (MIMO) extension holds true only if the nonlinear function h : R^n → R^n is restricted to be the gradient of a convex real-valued function [30], [31]. As for the sector IQC, the scalar version cannot be used (because it requires h_i(x) = 0 whenever there exists j ∈ [n] such that x_j = 0, which is extremely restrictive), and the vector version is in fact (9). In contrast, the quadratic constraint in Lemma 1 can be applied to non-monotone, vector-valued, gradient-bounded functions.
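To see Lemma 1 in action, the following is a small numerical check (our own construction, not from the paper). We take a hypothetical function with the same structure as the example in (10) (h_1 depends only on x_1 with slope in [−0.1, 0.6], and h_2 only on x_2 with slope in [−1, 1]), build q_ij by the mean-value construction used in Appendix V-A, and verify both the decomposition (11) and the scalar bound behind (14) on random samples:

```python
import numpy as np

# Hypothetical function with the structure of the example in (10):
# h1 depends only on x1 (slope in [-0.1, 0.6]); h2 only on x2 (slope in [-1, 1]).
def h(x):
    return np.array([0.25 * x[0] + 0.35 * np.sin(x[0]), np.sin(x[1])])

def dh(x):  # Jacobian; every partial derivative stays inside the stated box
    return np.array([[0.25 + 0.35 * np.cos(x[0]), 0.0],
                     [0.0, np.cos(x[1])]])

lo = np.array([[-0.1, 0.0], [0.0, -1.0]])   # lower bounds  xi_ij
hi = np.array([[ 0.6, 0.0], [0.0,  1.0]])   # upper bounds  bar-xi_ij
c, cbar = (hi + lo) / 2.0, (hi - lo) / 2.0  # centers and radii from Lemma 1

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 2001)
for _ in range(200):
    xa, xb = rng.normal(size=2), rng.normal(size=2)
    dx = xa - xb
    # mean-value construction: q_ij = dx_j * (average of d_j h_i on the segment)
    Js = np.array([dh(xb + t * dx) for t in ts])
    Javg = (0.5 * (Js[:-1] + Js[1:])).mean(axis=0)   # trapezoidal average
    q = Javg * dx[None, :]
    assert np.allclose(q.sum(axis=1), h(xa) - h(xb), atol=1e-5)  # property (11)
    # scalar bound behind (14): |q_ij - c_ij dx_j| <= cbar_ij |dx_j|,
    # which makes (12) hold for every nonnegative multiplier lambda
    assert np.all(np.abs(q - c * dx[None, :]) <= cbar * np.abs(dx)[None, :] + 1e-9)
print("Lemma 1 constraints hold on all sampled pairs")
```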

B. STABILITY CERTIFICATION
In this part, we assume that the nonlinear component g_t(x) is zero and leave the generalization to the next subsection. It is desirable to study the stability certification of RL for an LTI system G, whose state-space representation is given by:

ẋ_G(t) = A x_G(t) + B u(t), y(t) = x_G(t), u(t) = π_t(y(t)) + e(t),   (16)

where x_G ∈ R^{n_s} is the state (the dependence on t is omitted for simplicity), and the policy π_t is changing due to RL. As before, A is Hurwitz. The goal is to certify the safety set P(ξ) such that as long as π_t remains in P(ξ), the RL-controlled system is stable in the sense that there exist constants γ and c_γ with:

‖y_{0:T}‖² ≤ γ² ‖e_{0:T}‖² + c_γ, ∀ T ≥ 0.   (17)

To this end, define SDP(P, λ, γ, ξ) as the feasibility of the linear matrix inequality

[A^⊤P + PA + I, PBW, PB; W^⊤B^⊤P, 0, 0; B^⊤P, 0, −γ²I] + blkdiag(M(λ; ξ), 0) ⪯ 0,   (18)

where P = P^⊤ ≻ 0, λ ≥ 0, the rows and columns are indexed by (x_G, q, e), and M(λ; ξ) and W = I_{n_a} ⊗ 1_{1×n_s} are defined in Lemma 1 (with m = n_a and n = n_s).
We will next show that stability can be certified using the aforementioned linear matrix inequality (LMI). Theorem 1: Let A be Hurwitz and π_t : R^{n_s} → R^{n_a} be a time-varying causal controller. Assume that π_t has bounded partial derivatives (i.e., ξ_ij ≤ ∂_j [π_t]_i(y) ≤ ξ̄_ij for all y ∈ R^{n_s}, i ∈ [n_a], j ∈ [n_s], and t ≥ 0). If there exist P ≻ 0, λ ≥ 0, and a scalar γ > 0 such that SDP(P, λ, γ, ξ) is feasible, then the RL-controlled system (16) is stable with the gain γ.
Proof: To proceed, we choose q according to Lemma 1 and multiply [x_G^⊤ q^⊤ e^⊤] to the left and its transpose to the right of the augmented matrix in (18), and use the constraints w = Wq and y = x_G. Then, SDP(P, λ, γ, ξ) can be written as a dissipation inequality:

V̇(x_G) + Σ_{i,j} λ_ij φ_ij(x_G, q) + |y|² − γ² |e|² ≤ 0,   (19)

where V(x_G) = x_G^⊤ P x_G is known as the storage function, and V̇(·) is its derivative with respect to time t. Denote the initial state of the system by x_0. Because the second term is guaranteed to be non-negative at all times t by Lemma 1, if SDP(P, λ, γ, ξ) is feasible with a solution (P, λ), we have:

V̇(x_G(t)) ≤ γ² |e(t)|² − |y(t)|²,

which is satisfied at all times t. The above inequality can be integrated from t = 0 to t = T, and then it follows from P ≻ 0 that:

‖y_{0:T}‖² ≤ γ² ‖e_{0:T}‖² + x_0^⊤ P x_0,   (20)

which completes the proof. We remark that SDP(P, λ, γ, ξ) is quasiconvex, in the sense that it reduces to a standard LMI for a fixed γ. To solve it numerically, we start with a small γ and gradually increase it until a solution (P, λ) is found. This is repeated for multiple sets of ξ. Each iteration (i.e., the LMI for a given pair of γ and ξ) can be solved efficiently by interior-point methods. As an alternative to searching over γ for a given ξ, more sophisticated methods for solving the generalized eigenvalue optimization problem can be employed [32].
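To make the certification step concrete, the following is a minimal sketch in Python with CVXPY, specialized to the symmetric-bound case ξ_ij = −ξ̄_ij of Section III-D, where c = 0 and M(λ; ξ) = blkdiag(X, −Λ) with X = diag_j(Σ_i λ_ij ξ̄_ij²) and Λ = diag({λ_ij}); the function names, solver choice, and tolerances are our own assumptions rather than the paper's implementation:

```python
import numpy as np
import cvxpy as cp

def sdp_feasible(A, B, xi_bar, gamma, eps=1e-6):
    """Feasibility of SDP(P, lambda, gamma, xi) in (18) for symmetric
    gradient bounds [-xi_bar[i, j], xi_bar[i, j]]."""
    ns, na = B.shape
    m = na * ns                                   # length of the stacked q
    W = np.kron(np.eye(na), np.ones((1, ns)))     # W = I_{n_a} (x) 1_{1 x n_s}

    P = cp.Variable((ns, ns), symmetric=True)
    lam = cp.Variable((na, ns), nonneg=True)      # multipliers lambda_ij >= 0

    # M(lambda; xi) = blkdiag(X, -Lam) in the symmetric case
    X = cp.diag(cp.sum(cp.multiply(lam, xi_bar ** 2), axis=0))
    Lam = cp.diag(cp.hstack([lam[i, j] for i in range(na)
                             for j in range(ns)]))  # entry order k = i*ns + j

    lyap = cp.bmat([
        [A.T @ P + P @ A + np.eye(ns), P @ B @ W,         P @ B],
        [(P @ B @ W).T,                np.zeros((m, m)),  np.zeros((m, na))],
        [(P @ B).T,                    np.zeros((na, m)), -gamma**2 * np.eye(na)],
    ])
    mpad = cp.bmat([
        [X,                  np.zeros((ns, m)),  np.zeros((ns, na))],
        [np.zeros((m, ns)),  -Lam,               np.zeros((m, na))],
        [np.zeros((na, ns)), np.zeros((na, m)),  np.zeros((na, na))],
    ])
    n_tot = ns + m + na
    prob = cp.Problem(cp.Minimize(0),
                      [P >> eps * np.eye(ns),
                       lyap + mpad << -eps * np.eye(n_tot)])
    prob.solve(solver=cp.SCS)
    return prob.status in ("optimal", "optimal_inaccurate")

def smallest_gain(A, B, xi_bar, lo=1e-2, hi=1e3, tol=1e-2):
    """Bisection on gamma: each trial is a standard LMI (quasiconvexity)."""
    if not sdp_feasible(A, B, xi_bar, hi):
        return None       # not certifiable at these gradient bounds
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if sdp_feasible(A, B, xi_bar, mid) else (mid, hi)
    return hi
```

The outer bisection exploits the quasiconvexity noted above: for each trial γ, the problem is a convex LMI feasibility check.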

C. EXTENSION TO NONLINEAR SYSTEMS WITH UNCERTAINTY
In the context of RL, we often need to deal with systems with nonlinear dynamics and/or unmodeled dynamics. We model the nonlinear, uncertain, and potentially time-varying part of the system with g_t(x(t)) in (3) and regard it as an uncertainty block with IQC constraints. Specifically, consider the LTI component G:

ẋ_G(t) = A x_G(t) + B u(t) + v(t), y(t) = x_G(t),   (21)

where A is assumed to be Hurwitz. The nonlinear part is moved to the feedback:

u(t) = π_t(y(t)) + e(t), v(t) = g_t(y(t)),   (22)

where e ∈ R^{n_a} and w ∈ R^{n_a} are defined as before, and g_t : R^{n_s} → R^{n_s} is the nonlinear and time-varying component. We assume that g_t satisfies the IQC defined by (Ψ, M_g) as in Definition 2 (see [18], [33] for some examples).
The filter Ψ has the state-space representation:

ψ̇(t) = A_Ψ ψ(t) + B_Ψ [y(t); v(t)], z(t) = C_Ψ ψ(t) + D_Ψ [y(t); v(t)],   (23)

where ψ ∈ R^{n_s} is the internal state and z ∈ R^{n_z} is the filtered output. By denoting x = [x_G; ψ] ∈ R^{2n_s} as the new state, one can combine (21) and (23) by eliminating y and letting w = Wq:

ẋ(t) = A x(t) + B_e e(t) + B_q q(t) + B_v v(t), z(t) = C x(t),   (24)

where A, B_e, B_q, B_v, C are matrices of proper dimensions defined above. We define SDP(P, λ, γ, ξ) as the LMI feasibility problem (in P ≻ 0 and λ ≥ 0) that certifies the dissipation inequality

V̇(x) + z^⊤ M_g z + Σ_{i,j} λ_ij φ_ij(y, q) + |y|² − γ² |e|² ≤ 0   (25)

along the trajectories of (24), where V(x) = x^⊤ P x and M(λ; ξ) is defined in (13). The next theorem provides a stability certificate for the nonlinear time-varying system (3). Theorem 2: Let A in (21) be Hurwitz, and π_t : R^{n_s} → R^{n_a} be a time-varying causal controller. Assume that: (i) π_t has bounded partial derivatives (i.e., ξ_ij ≤ ∂_j [π_t]_i(y) ≤ ξ̄_ij for all y ∈ R^{n_s}); and (ii) g_t satisfies the IQC defined by (Ψ, M_g), where Ψ is a stable LTI system. If there exist P ≻ 0 and a scalar γ > 0 such that SDP(P, λ, γ, ξ) given in (25) is feasible, then the RL-controlled system (3) is stable with the gain γ.
Proof: See Appendix V-B. Theorem 2 requires A to be stable, but this is not a conservative assumption. Indeed, any arbitrary system can be decomposed (in infinitely many ways) into a stable component Ax and a second part g t (x). In other words, g t (x) gives the flexibility to perform this decomposition. Alternatively, one can find a linear nominal model for the system (using linearization of the known part of the dynamics), stabilize it via an internal controller to arrive at a stable matrix A, and then design a main RL controller to deal with the remaining dynamics g t (x).
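For instance, the internal-controller route described in the last sentence can be sketched with standard pole placement (a toy plant of our own choosing; scipy's place_poles computes the stabilizing gain):

```python
import numpy as np
from scipy import signal

# Toy plant with one unstable mode: stabilize the nominal linear part first,
# so that the certificate can treat the remaining dynamics as g_t(x).
A_full = np.array([[0.3, 1.0], [0.0, -0.5]])
B = np.array([[0.0], [1.0]])

K_n = signal.place_poles(A_full, B, [-1.0, -2.0]).gain_matrix
A = A_full - B @ K_n          # Hurwitz nominal matrix used by the SDP
print(np.linalg.eigvals(A))   # both eigenvalues have negative real parts
```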
The present study models the unknown system as a known nominal model plus unknown (unmodeled) dynamics, with weak assumptions on the prior knowledge about the system. This is standard in control theory; indeed, it is theoretically impossible to design a high-performance stabilizing controller without prior knowledge about the system. If such knowledge is not available, the only option is to perform system identification to gain it, which requires extensive data collection and sufficient system excitation. Note that having the nonlinearity satisfy an IQC is different from having the nonlinearity be known. For example, if we assume that an unknown parameter of the system has norm less than 1, then there are still infinitely many possibilities for that parameter; an IQC serves the same purpose. In the two case studies provided in this paper (see Sec. IV), it is NP-hard to design an optimal distributed controller even for a known model of the system. In those examples, it is easy to define the nominal part and the nonlinear part, and then check the IQC. This way, we can effectively handle the NP-hardness of the problem: one can ignore the nonlinearity to break down the NP-hardness of the controller design and then use the IQC to make up for the discarded nonlinear part through an RL controller.
The stability analysis fundamentally depends on the uncertainty of the system: as the system becomes more uncertain (e.g., due to time variation or finite-sample estimation), the stability guarantee becomes more conservative. The proposed method is a general framework that can address a variety of uncertainties, both in the system model and in the reinforcement learning policy. As we impose more and tighter constraints, the results become less conservative. Therefore, rather than being restrictive assumptions, the IQC constraints imposed on the nonlinear, uncertain, and potentially time-varying components of the system incorporate useful information that makes the stability analysis less conservative.

D. ANALYSIS OF CONSERVATISM OF THE STABILITY CERTIFICATE
We focus on the system (16), where an LTI system is interconnected with a fixed RL policy. Without loss of generality, we also assume that ξ_ij = −ξ̄_ij for all i ∈ [n_a] and j ∈ [n_s] to streamline the presentation. To certify stability of the original system, as will be shown in the next proposition, it suffices to examine the stability of the following system:

y = G_11 q + G_12 e, q = π̃(y),   (26)

where G = [G_11 G_12] is the LTI system with the state-space realization ẋ = Ax + BWq + Be, y = x (with a slight abuse of notation, we write G for the channel q ↦ y, i.e., G_11, whenever e = 0), and π̃ ∈ P̃(ξ) is a functional in the uncertainty set

P̃(ξ) = { π̃ : L^{n_s} → L^{n_a n_s} | π̃ is causal, ‖π̃_ij(x)‖ ≤ ξ̄_ij ‖x_j‖, and π̃_ij(x) = 0 whenever x_j = 0, ∀ i ∈ [n_a], j ∈ [n_s] }.   (27)

Recall that the notation x implies a signal, representing x(t) for t from 0 to infinity. This means that although the argument of π(·) is a vector, the argument of π̃(·) is a signal; therefore, π̃(·) is a functional extension of the function π. Accordingly, for signals x ∈ L^{n_s} and q ∈ L^{n_a n_s}, define φ_ij(x, q) = ξ̄_ij² ‖x_j‖² − ‖q_ij‖² and the set

S(ξ) = { (x, q) | φ_ij(x, q) ≥ 0, ∀ i ∈ [n_a], j ∈ [n_s] }.
Proposition 1: If the system (26) is stable for all π̃ ∈ P̃(ξ), then the system (16) is stable for all π ∈ P(ξ).
Proof: See Appendix V-C.
We now show that the pair (x, q) belongs to S(ξ) if and only if there exists a sector-bounded functional π̃ ∈ P̃(ξ) such that q = π̃(x).
Lemma 2: Suppose that x ∈ L^{n_s} and q ∈ L^{n_a n_s}. Then, the pair (x, q) belongs to S(ξ) if and only if there exists an operator π̃ : L^{n_s} → L^{n_a n_s} such that q = π̃(x) and π̃ satisfies the following conditions:

‖π̃_ij(x)‖ ≤ ξ̄_ij ‖x_j‖, and π̃_ij(x) = 0 whenever x_j = 0, ∀ i ∈ [n_a], j ∈ [n_s].   (28)

Proof: See Appendix V-D.
The previous result indicates that the input and output pair of π̃ can be described by S(ξ). We next show that this set should be separated from the signal space of the dynamical system in order to ensure robust stability.
Lemma 3: If (G, π̃) is robustly stable, then there cannot exist a nonzero q ∈ L such that x = Gq and (x, q) ∈ S(ξ).
Proof: We prove this lemma by contraposition. If there exists a nonzero q ∈ L such that x = Gq and (x, q) ∈ S(ξ), then it follows from Lemma 2 that there exists an operator π̃ such that q = π̃(x) = π̃(Gq). This implies that the operator (I − π̃G) is singular, and therefore (I − Gπ̃) is singular, implying that the interconnected system is not robustly stable.
Consider the set generated by the LTI system:

Φ = { {φ_ij(x, q)}_{i∈[n_a], j∈[n_s]} ∈ R^{n_a n_s} | q ∈ L^{n_a n_s}, ‖q‖ = 1, x = Gq },   (29)

and the positive orthant

Ω = { {r_ij} ∈ R^{n_a n_s} | r_ij ≥ 0, ∀ i ∈ [n_a], j ∈ [n_s] }.   (30)

Lemma 3 implies that the two sets Φ and Ω are separated if (G, π̃) is robustly stable.
To prove the necessity of the stability certificate, we first show in Lemma 4 that the sets Φ and Ω are strictly separated when the system is robustly stable. We then draw the connection in Proposition 2 between the existence of a strictly separating hyperplane and the feasibility of the SDP condition (18). These two results together imply that the stability certificate derived from (18) is necessary for any robustly stable system controlled by RL (Theorem 3).
Define Σ^x_ij ∈ R^{n_s×n_s} as the diagonal matrix whose j-th diagonal entry equals ξ̄_ij² (and zero elsewhere), and Σ^q_ij ∈ R^{n_a n_s × n_a n_s} as the matrix whose (k, l)-th element equals 1 if k = l = (i − 1)n_s + j and 0 otherwise. Thus, we can write φ_ij(x = Gq, q) as an inner product:

φ_ij(Gq, q) = ⟨Gq, Σ^x_ij Gq⟩ − ⟨q, Σ^q_ij q⟩ = ⟨q, T_ij q⟩,   (32)

where T_ij = G* Σ^x_ij G − Σ^q_ij. Now, we show that strict separation occurs when the system is robustly stable.
Lemma 4: If (G, π̃) is robustly stable for all π̃ ∈ P̃(ξ), then the sets Φ and Ω are strictly separated, i.e., D(Φ, Ω) = inf_{r∈Ω, y∈Φ} |r − y| > 0.
Proof: Assume, to arrive at a contradiction, that D(Φ, Ω) = inf_{r∈Ω, y∈Φ} |r − y| = 0. Consider a sequence ε_k → 0 as k tends to ∞. For each ε_k, we construct a signal q^(k) with bounded support on the time interval [t_k, t_{k+1}] such that ‖q^(k)‖ = 1 and the signal x^(k) = G q^(k) ∈ L^{n_s} satisfies:

φ_ij(x^(k), q^(k)) ≥ −ε_k, ∀ i ∈ [n_a], j ∈ [n_s].

We also construct a functional π̃^(k) with ‖q^(k) − π̃^(k)(x^(k))‖ ≤ C ε_k for some constant C > 0 that depends on the sector bounds ξ (the existence of such q^(k) and π̃^(k) is shown in Lemma 6 in the Appendix). Patching these constructions together over the disjoint time intervals yields a functional π̃ with ‖π̃‖ ≤ C_π, where C_π > 0 is a finite number that depends on ξ. We have

‖(I − π̃G) q^(k)‖ = ‖q^(k) − π̃(G q^(k))‖ ≤ C ε_k.

Because ε_k → 0, the right-hand side approaches 0, and so does the left-hand side. Therefore, since ‖q^(k)‖ = 1, the mapping I − π̃G cannot have a bounded inverse, which contradicts the robust stability assumption. This implies that Φ and Ω are strictly separable.
Next, we draw the connection to the SDP problem (18). Observe that

⟨λ, φ(Gq, q)⟩ = Σ_{i∈[n_a], j∈[n_s]} λ_ij ⟨q, T_ij q⟩ = ⟨q, (Σ_{i,j} λ_ij T_ij) q⟩,

for all q ∈ L^{n_a n_s} and x = Gq, where φ(Gq, q) = {φ_ij(Gq, q)}.
Proposition 2: There exist λ ≥ 0 and a scalar α < 0 such that

⟨λ, φ(Gq, q)⟩ ≤ α ‖q‖²,   (33)

for all q ∈ L^{n_a n_s} and x = Gq, if and only if there exists P ⪰ 0 satisfying the SDP condition (18).
Proof: By (32), the condition (33) is equivalent to the operator inequality

Σ_{i,j} λ_ij T_ij ⪯ α I ≺ 0.

By the KYP lemma, this is equivalent to the existence of P ⪰ 0 such that the corresponding LMI on the state-space realization of G holds; by the Schur complement, P satisfies the KYP condition if and only if it satisfies (18), which proves the statement. Finally, we are able to show the necessity of the stability certificate below.
Theorem 3: Let π̃ : L^{n_s} → L^{n_a n_s} be a bounded causal controller such that π̃ ∈ P̃(ξ). Then, the input-output stability of the feedback interconnection of system (26) implies that there exist P ⪰ 0, γ > 0, and λ ≥ 0 such that SDP(P, λ, γ, ξ) in (18) is feasible.
Proof: Since the system is input-output stable, the sets Φ and Ω are strictly separable due to Lemma 4. Since both Φ and Ω are convex (see Lemma 5 in Appendix V-E), there exist a strictly separating hyperplane parametrized by λ ∈ R^{n_a n_s} and scalars α, β such that ⟨λ, φ⟩ ≤ α < β ≤ ⟨λ, ν⟩ for all φ ∈ Φ and ν ∈ Ω. Since ⟨λ, ν⟩ must be bounded from below over Ω, we have λ ≥ 0, and without loss of generality, we can set β = 0 and α < 0. This condition is equivalent to (33), and by Proposition 2, it implies that the SDP condition is feasible.

E. REINFORCEMENT LEARNING WITH STABILITY REGULARIZATION
The key takeaway message from the previous analysis is that in order to ensure stability during the learning and control process, it is critical to regulate the magnitudes of the partial gradients. In this section, we use trust region policy optimization (TRPO) [3] as an example to illustrate our approach to addressing this requirement. Note that our methods can be applied to other types of RL algorithms as well.

1) STABILITY PENALTY
In this method, we consider adding a penalty term to the RL objective function:

R_pen(θ) = E_x [ Σ_{i∈[n_a], j∈[n_s]} 1( ∂_j π_i(x; θ) ∉ [ξ_ij, ξ̄_ij] ) ],   (36)

where 1(·) is the indicator function, which aims to keep the partial gradients along the trajectories inside the bounds by softly penalizing violations. This term is closely related to the gradient-magnitude term used in the literature,

E_x [ ‖∇_x π(x; θ)‖² ].   (37)

Both terms encourage the behavior that small changes in the input should not change the output drastically, and they are different from typical regularization terms designed for the weights θ (e.g., weight decay in neural networks). Interestingly, the penalty (37) was termed ''double backpropagation'' in [34], and recently rediscovered in [35], [36] to improve the generalization performance of image classification tasks. The present study provides theoretical and empirical justification for this penalty, along with (36), to improve the stability of the RL-controlled dynamical system. For policy gradient with TRPO, by manipulating the expected return η(π) using the identity proposed in [37], the following ''surrogate objective'' η_pol(θ) can be designed:

η_pol(θ) = E_{x,u∼π_old} [ (π_θ(u | x) / π_old(u | x)) A^{π_old}(x, u) ],

where the expectation is taken over the old policy π_old = π_{θ_old}, the ratio inside the expectation is also known as the importance weight, and A^{π_old}(x, u) is the advantage function given by:

A^{π_old}(x, u) = E_{x′∼T(x,u)} [ r(x, u) + ρ V^{π_old}(x′) − V^{π_old}(x) ],

where the expectation is with respect to the dynamics x′ ∼ T(x, u) (in our case, the dynamics is a nonlinear dynamical system, and x′ is the next state given the current state x and input u), and it measures the improvement of taking action u at state x over the old policy in terms of the value function V^{π_old}. Combined with (36), the modified objective is given by:

η_new(θ) = η_pol(θ) − ω R_pen(θ),

where ω ≥ 0 is the regularization coefficient, whose value is selected such that the scale of the corresponding term is about [0.01, 0.05] of the surrogate loss value η_pol(θ). In practice, the modified objective η_new(θ) can be estimated using trajectories sampled from π_old.
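A minimal PyTorch-style sketch of the penalty term is given below; the soft hinge used in place of the hard indicator in (36), as well as the function and variable names, are our own assumptions:

```python
import torch

def gradient_penalty(policy, states, xi_lo, xi_hi):
    """Soft surrogate of the indicator penalty in (36): penalize partial
    gradients of the policy that leave the certified box [xi_lo, xi_hi]."""
    states = states.detach().requires_grad_(True)
    actions = policy(states)                    # shape (batch, n_a)
    penalty = states.new_zeros(())
    for i in range(actions.shape[1]):
        # rows of d pi_i / d y, one per sample; summing over the batch is
        # valid since the samples are independent through the network
        grads = torch.autograd.grad(actions[:, i].sum(), states,
                                    create_graph=True)[0]   # (batch, n_s)
        over = torch.relu(grads - xi_hi[i]) + torch.relu(xi_lo[i] - grads)
        penalty = penalty + (over ** 2).mean()
    return penalty

# usage inside the TRPO update, cf. (36): choose omega so that the penalty
# is roughly 1-5% of the magnitude of the surrogate objective
# loss = -eta_pol + omega * gradient_penalty(pi_net, batch_states, lo, hi)
```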

2) HARD THRESHOLDING
After each gradient step, we obtain an upper bound on the Lipschitz constant l(π_θ) of the updated neural network π_θ (e.g., using the simple approach introduced in [38]). If the upper bound exceeds the certified bound l°, then we execute a hard-thresholding (HT) step that rescales the weight matrix of each layer by (l°/l(π_θ))^{1/n_L}, where n_L is the number of layers of the neural network. The goal is to ensure that the Lipschitz constant of the RL policy remains bounded by l°. Remark 2: We note that the proposed approaches are applicable to existing RL algorithms (e.g., policy gradient, Q-learning, and actor-critic), since we either modify the objective function (1) or the resulting policy directly. For neural networks, it remains an open problem how to calculate the Lipschitz constant or partial gradients exactly [39]. In some cases, the bounds given by the simple approach proposed in [38] can be conservative. Alternatively, one can estimate the bounds on partial gradients using existing trajectories, which is reasonable as long as the trajectory does not deviate significantly from the history.
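A sketch of the HT step is given below, using the product of layer spectral norms as the Lipschitz upper bound in the spirit of [38] (valid when the activations are 1-Lipschitz, e.g., ReLU or tanh); the helper name is ours:

```python
import torch

def hard_threshold(policy_net, l_cert):
    """Rescale layer weights so that the product of spectral norms (an upper
    bound on the Lipschitz constant for 1-Lipschitz activations) is <= l_cert."""
    layers = [m for m in policy_net.modules() if isinstance(m, torch.nn.Linear)]
    l_hat = 1.0
    for m in layers:
        l_hat *= torch.linalg.matrix_norm(m.weight, ord=2).item()
    if l_hat > l_cert:
        scale = (l_cert / l_hat) ** (1.0 / len(layers))
        with torch.no_grad():
            for m in layers:
                m.weight.mul_(scale)   # uniform per-layer shrinkage
    return min(l_hat, l_cert)          # bound after thresholding
```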

IV. CASE STUDIES
In this section, we empirically study stability-certified RL on two important problems: flight formation [40] and power grid frequency regulation [41]. Designing an optimal controller for these systems is challenging because they consist of interconnected subsystems with limited information sharing, and their underlying models are typically nonlinear and even time-varying and uncertain. Indeed, for the case of distributed control, which aims at designing a set of local controllers whose interactions are specified by physical and informational structures, it has long been known to amount to an NP-hard optimization problem in general [22]. End-to-end reinforcement learning comes in handy because it requires no model information: the agent simply interacts with the environment while collecting rewards.
In a multi-agent setting, each agent explores and learns its own policy independently, without knowledge of the other agents' policies [42]. For simplicity of implementation, we consider the synchronous and cooperative scenario, where the agents take an action at each time step and observe the reward for the whole system. Their goal is to collectively maximize the rewards (or minimize the costs) shared equally among them. The present analysis aims at offering stability certificates when applying RL to dynamical systems, which is orthogonal to the line of research that aims at improving the performance of existing RL algorithms. For illustration, we adopt an approach based on policy gradient that combines TRPO [3] with the natural gradient [2] and the two regularization techniques proposed in Section III-E. For both the multi-agent formation and power grid frequency regulation problems, we employ the method proposed in [41] to design the nominal controller, which is distributed and stabilizing for the nominal LTI system without uncertainty and nonlinearity, since those two parts, together with reward maximization, are handled through RL.

A. MULTI-AGENT FLIGHT FORMATION
Consider the multi-agent flight formation problem [40], where each agent can only observe the relative distance from its neighbors, as illustrated in Fig. 2. The goal is to design a local controller for each aircraft such that a predefined pattern is formed as efficiently as possible.
The physical model for each aircraft is adopted from [40], where z_i, θ_i, and u_i denote the horizontal position, angle, and control input of aircraft i, respectively, and δ > 0 characterizes the physical coupling of rolling moment and lateral acceleration. We consider 10 aircraft in the experiments. One particular strength of RL is that the reward function can be highly nonconvex, nonlinear, and arbitrarily designed; however, since quadratic costs are widely used in the control literature, consider the case r(x(t), u(t)) = x(t)^⊤ Q x(t) + u(t)^⊤ R u(t). For the following experiments, assume that Q = 1000 × I_15 and R = I_4. In addition, we designed a nominal static distributed controller K_n to make the largest eigenvalue of A + BK_n negative [41].
The task for multi-agent RL is to learn the controllers u_i(t), where each controller only takes as inputs the relative distances of agent i to its neighbors. For example, agent 1 can only observe z_1(t) − z_2(t) − d (i.e., the 1st entry of x(t)); similarly, agent 2 can only observe z_1(t) − z_2(t) − d and z_2(t) − z_3(t) − d (i.e., the 1st and 5th entries of x(t)).

1) STABILITY CERTIFICATE
To obtain the stability certificate, we apply the method in Section III-C. The nonzero entries of the nonlinear component g(x(t)) are of the form sin(θ) − θ, which can be treated as an uncertainty block with the slope restricted to [−1, 0] for θ ∈ [−π/2, π/2]; therefore, the Zames–Falb IQCs can be employed to construct (23) [29], [43]. As for the RL agents u_i, their gradient bounds can be certified according to Theorem 2. Specifically, we assume that each agent u_i is l-Lipschitz continuous, and solve (25) for a given pair of γ and l. The certified gradient bounds (Lipschitz constants) under different constraints are plotted in Fig. 3. The L_2 constraint (9) can only certify stability for Lipschitz constants up to 0.8. By setting the bounds ξ_ij and ξ̄_ij to 0 for each agent i that does not access input j due to input sparsity, we can increase the certified value to 1.2.
To further increase the set of certifiably stable controllers, we monitor the partial gradient information for each agent and encode it as non-homogeneous gradient bounds. For instance, if ∂_j π_i(x) has been consistently positive over the latest iterations, we set ξ̄_ij = l and ξ_ij = −εl, where ε > 0 is a small margin (e.g., 0.1) to allow exploration. As a result, we can significantly enlarge the certified Lipschitz bound, up to 2.5, as shown in Fig. 3.
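This monitoring step can be sketched as follows (a helper of our own design; it tightens only the side of the interval that the observed gradients never visit, with the margin ε described above):

```python
import numpy as np

def refine_bounds(grad_history, l, margin=0.1):
    """grad_history: array (iters, n_a, n_s) of monitored partial gradients.
    Returns (lower, upper) bounds: symmetric [-l, l] by default, tightened to
    [-margin*l, l] (resp. [-l, margin*l]) when the sign never flipped."""
    lower = -l * np.ones(grad_history.shape[1:])
    upper = l * np.ones(grad_history.shape[1:])
    lower[(grad_history > 0).all(axis=0)] = -margin * l
    upper[(grad_history < 0).all(axis=0)] = margin * l
    return lower, upper
```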

2) RL PERFORMANCE
The trajectories of rewards, averaged over 20 independent experiments, are shown in Fig. 4. In this example, agents with a one-hidden-layer neural network (each with 5 hidden units) learn most efficiently with stability regularization, significantly outperforming the linear controller.
The learned 5-layer neural network policy is employed in an actual control task, as shown in Fig. 5. Compared to the nominal controller (i.e., no RL control), the flights can be maneuvered more efficiently. In terms of the actual cost, the RL agents achieve a cost of 41.0, which is about 30% lower than that of the nominal controller (58.3). This result can be examined in both the actual state-action trajectories in Fig. 5 and the control behaviors in Fig. 6. The results indicate that RL is able to improve a given controller when the underlying system is nonlinear and unknown.

B. POWER SYSTEM FREQUENCY REGULATION
In this case study, we focus on the problem of distributed control for power system frequency regulation [41]. The IEEE 39-Bus New England Power System under analysis is shown in Fig. 7. In a distributed control setting, each generator can only share its rotor angle and frequency information with a pre-specified set of counterparts that are geographically distributed. The main goal is to optimally adjust the mechanical power input to each generator such that the phase and frequency at each bus can be restored to their nominal values after a possible perturbation.
Let the rotor angles and frequency states be denoted as θ = [θ_1 · · · θ_n]^⊤ and ω = [ω_1 · · · ω_n]^⊤, and let the generator mechanical power injections be denoted as p_m = [p_{m_1} · · · p_{m_n}]^⊤. Then, the state-space representation of the nonlinear system is given by the swing dynamics of [41] with the stacked nonlinearity g(θ) = [g_1(θ) · · · g_n(θ)]^⊤, where L is a Laplacian matrix whose entries are specified in [41, Sec. IV-B]. For linearization (also known as the DC approximation), the nonlinear part g(θ) is assumed to be zero when the phase differences are small [41], [44]. In contrast, we retain this term in the stability certification to demonstrate the capability of producing non-conservative results even for nonlinear systems. Similar to the flight formation case, we assume that there exists a distributed nominal controller that stabilizes the system. To conduct multi-agent RL, each controller p_{m_i} is a neural network learned online.

1) STABILITY CERTIFICATE
The nonlinearities in g_i(θ) are of the form θ_ij − sin θ_ij, where θ_ij = θ_i − θ_j represents the phase difference, whose slope is restricted to [0, 1 − cos(θ̄)] for every θ_ij ∈ [−θ̄, θ̄] (here, we assume θ̄ = π/3 to include both normal and abnormal operational conditions); thus, we can apply the Zames–Falb IQC. By simply applying the L_2 constraint in (9), we can only certify stability for Lipschitz constants up to 0.4, as shown in Fig. 8. With input sparsity, we can certify l up to 0.6. With output non-homogeneity, which is visualized in Fig. 9, we can further extend the certificate up to 1.1.

2) RL PERFORMANCE
First, we visualize the behavior of the learned neural network controller in a typical control case (Fig. 10). The nominal controller is designed by the method in [41] for the LTI nominal system and does not maximize the reward. As can be seen, the RL policies can regulate the frequencies more efficiently than the nominal controller (i.e., no RL), lowering the cost significantly (from 50.8 to 23.9). More importantly, we compare the cases of RL with and without regulating the Lipschitz constants in Fig. 11. Without regulating the gradients, the RL is able to reach a performance slightly higher than its stability-certified counterpart. However, after about iteration 500, the performance starts to deteriorate (due to a possibly large gradient variance and high sensitivity to the step size) until it completely loses the previous gains and starts to behave erratically. This intolerable behavior is due to the large Lipschitz gains that grow unboundedly, as shown in Fig. 12. In comparison, RL with regulated gradient bounds is able to make a substantial improvement and preserve the stability of the interconnected system at the same time.

FIGURE 11. Long-term performance of RL for agents with regulated gradients by soft penalty (SP) and hard thresholding (HT). The RL agents without regulating the gradients exhibit ''dangerous'' behaviors in the long run.

C. DISCUSSIONS
While the case studies focus on applying the framework to systems with nonlinearities, the method can be readily deployed to handle uncertainties (which may arise due to finite-sample estimation in system identification), since the unmodeled parts can be characterized in a principled way using IQC constraints [18]. In practice, to certify the stability of a system under the feedback control of reinforcement learning agents, one needs to summarize the available information into the linear part of the model and impose as many valid constraints as possible on the unmodeled parts to reduce conservativeness.
One limitation of the present study is the requirement that the nominal system matrix A be Hurwitz. Even though for many applications it is not difficult to design a stabilizing controller, as showcased in the experiments, this requirement may exclude some possibilities for reinforcement learning agents to control highly unstable systems. In particular, if K_n is required to be distributed, finding a stabilizing K_n even for a known LTI system is NP-hard in the worst case. The second limitation is that the analysis is restricted to global asymptotic stability; it would be interesting to extend the analysis to local stability.

V. CONCLUSION
In this paper, we focused on the challenging task of ensuring the stability of RL in real-world dynamical systems. The present study makes the following contributions:
• Development of a new quadratic constraint based on partial gradients, which can be integrated into an SDP-based problem to certify stability of RL-controlled systems;
• Analysis of the (non)conservatism of the certification condition, which is shown to be almost necessary and sufficient under some assumptions;
• Evaluation of the framework on decentralized nonlinear control tasks, demonstrating that the proposed approach can certify policies with Lipschitz constants about 3 times larger than those certifiable by existing methods for the flight formation and power system frequency regulation problems.
The proposed approach can systematically address nonlinearity in the neural network policy and uncertainty/time variation in the underlying system. A key benefit is the ability to certify a much larger set of controllers for exploration by reinforcement learning. Moreover, unlike some existing techniques that require the controller to be static for the stability analysis, our method allows it to change over time, which is crucial for RL since the policies are continuously updated with new data. Most importantly, the regulation of gradient bounds was shown to improve on-policy learning performance and to avoid ''catastrophic'' effects caused by unregulated high gains. The present study represents a key step towards the safe deployment of RL in mission-critical real-world systems.

APPENDIX

A. PROOF OF LEMMA 1
Proof: For a vector-valued function h : R^n → R^m that is differentiable with bounded partial derivatives (i.e., ξ_ij ≤ ∂_j h_i(x) ≤ ξ̄_ij for all x ∈ R^n), the mean-value construction

q_ij(x_α, x_β) = (x_{α,j} − x_{β,j}) ∫_0^1 ∂_j h_i (x_β + t(x_α − x_β)) dt

satisfies h_i(x_α) − h_i(x_β) = Σ_{j=1}^n q_ij(x_α, x_β) and

|q_ij − c_ij (x_{α,j} − x_{β,j})| ≤ c̄_ij |x_{α,j} − x_{β,j}|.

The result follows by introducing nonnegative multipliers λ_ij ≥ 0 and summing the resulting inequalities, which yields (14) and hence (12).

B. PROOF OF THEOREM 2
Proof: The proof is in the same vein as that of Theorem 1. The main technical difference is the consideration of the filtered state ψ and the output z to impose IQC constraints on the nonlinearity g_t(y) in the dynamical system [18]. The dissipation inequality follows by multiplying both sides of the matrix in (25) by [x^⊤ q^⊤ v^⊤ e^⊤] and its transpose:

V̇(x) + z^⊤ M_g z + Σ_{i,j} λ_ij φ_ij(y, q) + |y|² − γ² |e|² ≤ 0,

where x and z are defined in (24), and V(x) = x^⊤ P x is the storage function with V̇(·) as its time derivative. The second term on the left side is non-negative because g_t ∈ IQC(Ψ, M_g), and the third term is non-negative due to the smoothness quadratic constraint in Lemma 1. Thus, if there exists a feasible solution P ≻ 0 to SDP(P, λ, γ, ξ), integrating the inequality from t = 0 to t = T yields:

‖y_{0:T}‖² ≤ γ² ‖e_{0:T}‖² + x_0^⊤ P x_0.

Hence, the nonlinear system interconnected with the RL policy π is certifiably stable in the sense of a finite L_2 gain.

C. PROOF OF PROPOSITION 1
Proof: First, we define the set P̂(ξ) of operators in P̃(ξ) that act point-wise in time. Since the constraint on π̃ in P̂(ξ) is point-wise at each time instance, it is straightforward to verify that P̂(ξ) ⊆ P̃(ξ). It suffices to show that for any π ∈ P(ξ), there exists a policy π̃ ∈ P̂(ξ) such that π = Wπ̃. Let y^0_j = [0 · · · 0 y_{j+1} · · · y_{n_s}]^⊤ ∈ R^{n_s} for every j ∈ {0, 1, . . . , n_s}, with y^0_0 = y and y^0_{n_s} = 0. Then, one can write:

π_i(y) = Σ_{j=1}^{n_s} ( π_i(y^0_{j−1}) − π_i(y^0_j) ) = Σ_{j=1}^{n_s} π̃_ij(y),

where π̃_ij(y) = π_i(y^0_{j−1}) − π_i(y^0_j) satisfies |π̃_ij(y)| ≤ ξ̄_ij |y_j| if y_j ≠ 0 and π̃_ij(y) = 0 if y_j = 0. The bound is due to the mean-value theorem and the bounds on the partial derivatives of π_i. Since the above argument is valid for all i ∈ [n_a], it means that π̃ ∈ P̂(ξ) ⊆ P̃(ξ), and π = Wπ̃.

D. PROOF OF LEMMA 2
Proof: For the sufficiency direction, since π̃_ij is sector bounded and π̃_ij(x) = 0 if x_j = 0, we have

‖q_ij‖ = ‖π̃_ij(x)‖ ≤ ξ̄_ij ‖x_j‖.

By rearranging the above inequality, it can be concluded that (x, q) ∈ S(ξ). For the necessity direction, we can construct π̃(y) = q ⟨y, x⟩ / ‖x‖² for all y ∈ L^{n_s}. This leads to π̃(x) = q, and the condition φ_ij(x, q) ≥ 0 is equivalent to ‖q_ij‖ ≤ ξ̄_ij ‖x_j‖. Thus, we have π̃_ij(x) = 0 if x_j = 0, and the sector bound condition is satisfied.

E. STATEMENT AND PROOF OF LEMMA 5
Lemma 5: For a given linear time-invariant operator G, the closure of Φ defined in (29) is convex.
Proof: Because G is time-invariant, denoting by D_τ the delay operator at scale τ, we obtain D*_τ T_ij D_τ = T_ij. For simplicity, let φ(q) denote φ(Gq, q), since we restrict the first argument of φ to depend only on q. Let y = φ(q) and ỹ = φ(q̃) be two elements of Φ, with ‖q‖ = ‖q̃‖ = 1. By considering q_τ = √α q + √(1 − α) D_τ q̃ for α ∈ [0, 1], one can write

φ_ij(q_τ) = α φ_ij(q) + (1 − α) φ_ij(q̃) + 2√(α(1 − α)) Re⟨T_ij q, D_τ q̃⟩.

By letting τ → ∞, we obtain Re⟨T_ij q, D_τ q̃⟩ → 0, where Re(x) denotes the real part of a complex vector x. Thus,

lim_{τ→∞} φ(q_τ) = α y + (1 − α) ỹ,

and lim_{τ→∞} ‖q_τ‖² = α ‖q‖² + (1 − α) ‖q̃‖² = 1. Therefore, α y + (1 − α) ỹ lies in the closure of Φ, which proves the convexity of the closure.

F. STATEMENT AND PROOF OF LEMMA 6
Lemma 6: Suppose that D(Φ, Ω) = inf_{r∈Ω, y∈Φ} |r − y| = 0. Given any ε > 0 and t_0 ≥ 0, there exist a closed interval [t_0, t_1] and two signals x ∈ L^{n_s} and q ∈ L^{n_a n_s} with ‖q‖ = 1 and x = Gq such that

φ_ij(Π_{[t_0,t_1]} x, Π_{[t_0,t_1]} q) ≥ −ε, ∀ i ∈ [n_a], j ∈ [n_s],

where Π_{[t_0,t_1]} projects the signal onto the support [t_0, t_1].