Stable Reinforcement Learning for Optimal Frequency Control: A Distributed Averaging-Based Integral Approach

Frequency control plays a pivotal role in reliable power system operations. It is conventionally performed in a hierarchical way that first rapidly stabilizes the frequency deviations and then slowly recovers the nominal frequency. However, as the generation mix shifts from synchronous generators to renewable resources, power systems experience larger and faster frequency fluctuations due to the loss of inertia, which adversely impacts frequency stability. This has motivated active research in algorithms that jointly address frequency degradation and economic efficiency on a fast timescale, among which the distributed averaging-based integral (DAI) control is a notable one that sets controllable power injections directly proportional to the integrals of frequency deviation and economic inefficiency signals. Nevertheless, DAI does not typically consider the transient performance of the system following power disturbances and has been restricted to quadratic operational cost functions. This manuscript aims to leverage nonlinear optimal controllers to simultaneously achieve optimal transient frequency control and find the most economic power dispatch for frequency restoration. To this end, we integrate reinforcement learning (RL) into the classic DAI, which results in RL-DAI control. Specifically, we use RL to learn a neural network-based control policy mapping the integral variables of DAI to the controllable power injections, which provides optimal transient frequency control, while DAI inherently ensures frequency restoration and optimal economic dispatch. Compared to existing methods, we provide provable guarantees on the stability of the learned controllers and extend the allowable cost functions to a much larger class. Simulations on the 39-bus New England system illustrate our results.


I. INTRODUCTION
The key to the normal operation of a power system is the balance between electric power supply and demand over the network [1]. For instance, the main cause of the 2021 Texas power crisis was that the deficient supply of power due to frozen equipment could not meet the high demand for electricity in cold weather. A system frequency deviation from its nominal value is a reflection of a power imbalance [2], which makes frequency control a vital task for grid operators. Traditionally, this task is performed in a hierarchical structure composed of three layers with timescale separation: primary droop control (<20 s), secondary frequency restoration (30 s-10 min), and tertiary economic dispatch (>15 min) [2].
Nowadays, power systems are experiencing a change in the generation mix, where conventional synchronous generators are gradually being replaced by renewable energy sources like solar and wind [3]. It is anticipated that the renewable share of the electricity generation mix in the United States will double from 21% in 2020 to 42% in 2050 [4]. Intermittent renewable sources are typically inverter-interfaced, which may adversely affect the robustness of the frequency dynamics due to the loss of inertia [5]. This exposes power systems to larger and faster frequency fluctuations than before, which has motivated active research on flexible distributed frequency control schemes that can break the hierarchy by simultaneously addressing frequency degradation and economic efficiency at a fast timescale.
A key challenge in frequency control is that the power imbalances across the network are not explicitly known. A number of studies have proposed distributed algorithms to overcome this challenge [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. The works in [6], [7], [8], [9], [10], [11], [12], [13], [14] focus on optimizing the steady-state frequency and economic performance using a principled design of fixed updating rules involving agent communication for the control dynamics. The approaches mainly fall into two categories. The first category [7], [8], [12], [13], [14] rests on a primal-dual interpretation of power system dynamics under a properly designed optimization problem. This approach, however, always requires the estimation of certain system parameters. The second category [6], [9], [10], [11] builds upon various consensus algorithms to converge to an equilibrium with nominal frequency and economic efficiency. A notable example in this category is the distributed averaging-based integral (DAI) mechanism [6], [10], [11], [16], where the controllable power injections are directly proportional to the integrals of frequency deviation and economic inefficiency signals. A key caveat of this approach is that DAI control has so far been restricted to quadratic cost functions.
The works above focus on the optimization of the steady-state performance and do not typically consider the transient performance along the system trajectories following power disturbances. In fact, the optimization of transient performance is a challenging problem due to the nonlinearity of power dynamics and the uncertainty in power disturbances. Reinforcement learning (RL) [17] is a powerful tool for learning from interactions with uncertain environments and determining how to map situations to actions so that a desired performance is optimized. By virtue of this feature, RL has emerged [18], [19] as an effective instrument to address the optimal transient frequency control problem in nonlinear power systems under unknown power disturbances. Nevertheless, the Achilles' heel of standard RL algorithms is their lack of provable stability guarantees, which presents a significant barrier to their practical implementation for the operation of power systems. In fact, many works [20], [21], [22], [23] optimize the transient performance by learning control policies that exhibit good performance against data but without any provable guarantees on steady-state performance. The recent paper [24] on RL for optimal primary frequency control proposes a way to address the stability issue by identifying a set of properties that make a control design stabilizing and then restricting the search space of neural network-based controllers accordingly. In this paper, we extend this idea to achieve provable guarantees on frequency restoration and economic efficiency at steady state.
With the aim of filling the gap between the optimization of steady-state and transient frequency control performance, we propose to unify these two perspectives by integrating RL into DAI control. In our approach, we employ RL to seek the control policy that is optimal in terms of transient performance as a map from the integral variables of DAI to the controllable power injections, while DAI inherently ensures the optimal steady-state performance. This results in nonlinear optimal frequency controllers, which we term RL-DAI control, that generalize the standard DAI approach. Specifically, the generalization is manifold:
- Unlike the classic DAI control that only addresses quadratic operational cost functions, RL-DAI control admits any strictly convex cost functions whose gradients are identical up to heterogeneous scaling factors;
- Unlike standard RL methods that do not provide stability guarantees, RL-DAI control encodes the stabilization requirement on the control policy as mild conditions on its continuity and monotonicity. Such conditions can be easily realized by an ingenious parameterization of the control policy as a monotonic neural network, which is trained by an RL algorithm based on a recurrent neural network (RNN);
- RL-DAI control jointly optimizes both steady-state and transient frequency control performance by leveraging the added degrees of freedom in tuning the parameters that characterize the nonlinear control policy.

The rest of this paper is organized as follows. Section II describes the power system model and formalizes the optimal frequency control problem. Section III describes the proposed generalized DAI control and shows how it guarantees the asymptotic convergence of the closed-loop system to the equilibrium that achieves the steady-state performance objectives. Section IV illustrates how to integrate RL with DAI control such that the transient performance can be optimized without jeopardizing stability.
Section V validates our results through detailed simulations. Section VI gathers our conclusions and outlook.

II. MODELLING APPROACH AND PROBLEM STATEMENT
In this section, we describe the power system model used in this paper for analysis and the frequency control problem we aim to address.

A. POWER SYSTEM MODEL
We consider an n-bus power system whose topology can be characterized by a weighted undirected connected graph (V, E), where buses indexed by i, j ∈ N := {1, . . . , n} are linked through transmission lines denoted by unordered pairs (i, j) ∈ E. The set of buses N is a disjoint union of two subsets: the buses with generators G (which can have collocated loads) and the buses that are purely loads L. Generator buses are described by differential equations, while load buses, characterizing frequency-responsive devices such as frequency-sensitive loads and inverter-based resources performing droop control, are governed by algebraic equations. More precisely, given the net power injection p_i (in p.u.) on each bus, the dynamics of the voltage angle θ_i (in rad) and the frequency deviation ω_i (in p.u.) from the nominal frequency f_0 (50 Hz or 60 Hz depending on the particular system) evolve according to

θ̇_i = 2π f_0 ω_i, ∀i ∈ N,    (1a)
m_i ω̇_i = p_i + u_i − α_i ω_i − Σ_{j:(i,j)∈E} B_{ij} sin(θ_i − θ_j), ∀i ∈ G,    (1b)
0 = p_i + u_i − α_i ω_i − Σ_{j:(i,j)∈E} B_{ij} sin(θ_i − θ_j), ∀i ∈ L,    (1c)

where the parameters are defined as follows: m_i := 2H_i > 0 is the generator inertia constant (in s), α_i > 0 is the frequency sensitivity coefficient of the generator or load (in p.u.) depending on whether i ∈ G or i ∈ L, and B_{ij} > 0 is the susceptance (in p.u.) of the line (i, j) ∈ E. Finally, u_i is a controllable power injection to be designed for frequency control.
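For illustration, the dynamics in (1) can be stepped forward numerically. The sketch below is a minimal forward-Euler step of the generator-bus swing dynamics, assuming a lossless network whose line stiffnesses are collected in a symmetric susceptance matrix B; all parameter values in the usage are made up for illustration and do not come from the paper.

```python
import numpy as np

def swing_step(theta, omega, p, u, m, alpha, B, f0=60.0, dt=1e-3):
    """One forward-Euler step of the swing dynamics in the spirit of (1).

    theta: voltage angles (rad); omega: frequency deviations (p.u.);
    p: net power injections; u: controllable injections;
    m: inertia constants 2*H_i; alpha: frequency sensitivities;
    B: symmetric line susceptance matrix of the (lossless) network.
    """
    n = len(theta)
    # Net electrical power flowing out of each bus over the lines.
    flow = np.array([sum(B[i, j] * np.sin(theta[i] - theta[j]) for j in range(n))
                     for i in range(n)])
    omega_next = omega + dt * (p + u - alpha * omega - flow) / m
    theta_next = theta + dt * 2.0 * np.pi * f0 * omega
    return theta_next, omega_next
```

With zero disturbances the flat state is an equilibrium, while a positive net injection drives the local frequency deviation upward, matching the power-balance intuition of Section I.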

Remark 1 (Model assumptions):
The model in (1) implicitly makes the following assumptions that are well-justified for frequency control on transmission networks [8], [25].

B. FREQUENCY CONTROL
The basic goal of frequency control is to keep the frequency of the power system close to its nominal value. We now illustrate this in more detail by following the same line of argument as in [6], [9], [11], [26].
Since the frequency dynamics depend only on the phase angle differences, for ease of analysis, we express the dynamics in center-of-inertia coordinates [11], [26] by making the following change of variables: δ_i := θ_i − (Σ_{j=1}^n θ_j)/n, ∀i ∈ N. Then, the dynamics in (1) can be rewritten and stacked into the vector form (2), where the power flow terms are collected into the gradient of the potential energy function U(δ) := −Σ_{(i,j)∈E} B_{ij} cos(δ_i − δ_j) of the lossless network. Here, a vector with a set as subscript denotes the subvector composed only of the elements from that set, e.g., ω_G := (ω_i, i ∈ G) ∈ R^{|G|}. The equilibria of (2) satisfy the stationarity conditions (3). Clearly, (3a) implies that there exists a scalar ω* such that ω* = 1_n ω*. Then, (3b) and (3c) can be combined into

0_n = p + u* − A 1_n ω* − ∇U(δ*),    (4)

with A := diag(α_i, i ∈ N) ∈ R^{n×n}. Moreover, after premultiplying (4) by 1_n^T, we can characterize ω* as

ω* = (Σ_{i=1}^n p_i + Σ_{i=1}^n u_i*) / (Σ_{i=1}^n α_i),    (5)

where we have used the zero net power flow balance 1_n^T ∇U(δ) = 0 of a lossless power system. Observe from (5) that, in the absence of frequency control, a system undergoing power disturbances p will synchronize to a nonzero frequency deviation, i.e., ω* ≠ 0, since ω* = 0 only if Σ_{i=1}^n p_i + Σ_{i=1}^n u_i* = 0. Therefore, the main goal of frequency control is to regulate the frequency such that ω* = 0 by providing appropriate controllable power injections u to meet the power disturbances p. Note that the existence of equilibrium points (δ*, 1_n ω*) such that (4) holds for ω* = 0 is equivalent to the feasibility of the power flow equation

p + u = ∇U(δ),    (6)

which is a standing assumption of this paper. In fact, if (6) is feasible, then there are multiple choices of u that meet (6). Consequently, we seek to select among the possible candidates by optimizing certain performance metrics, which we describe next.
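As a quick numerical illustration of the synchronization argument above, the snippet below evaluates the steady-state frequency deviation, assuming (5) takes the form ω* = (Σ_i p_i + Σ_i u_i*)/(Σ_i α_i) as reconstructed from the surrounding discussion: ω* vanishes exactly when the controllable injections balance the disturbances.

```python
import numpy as np

def steady_state_freq(p, u, alpha):
    """Synchronized steady-state frequency deviation of the lossless
    network: omega* = (sum_i p_i + sum_i u_i) / (sum_i alpha_i)."""
    return (np.sum(p) + np.sum(u)) / np.sum(alpha)

p = np.array([-0.3, 0.1, 0.0])                      # hypothetical disturbances (p.u.)
balanced = steady_state_freq(p, np.array([0.2, 0.0, 0.0]), np.ones(3))
unbalanced = steady_state_freq(p, np.zeros(3), np.ones(3))
```

Here `balanced` is (numerically) zero because the injections cancel the disturbances, while `unbalanced` is negative, reflecting the under-frequency caused by a net power deficit.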

C. PERFORMANCE ASSESSMENT
For the design of frequency control strategies, not only frequency performance but also economic factors must be taken into account. Moreover, the secure and efficient operation of power systems relies on properly controlled frequency and cost in both slow and fast timescales. Thus, we now introduce the frequency and economic performance metrics used in this paper based on different timescales.

1) STEADY-STATE PERFORMANCE METRICS
Our control objective in the long run is to achieve nominal frequency restoration, i.e., ω* = 0_n, as well as the lowest steady-state aggregate operational cost C(u*) := Σ_{i=1}^n C_i(u_i*), where the cost function C_i(u_i) quantifies either the generation cost on a generator bus or the user disutility on a load bus for contributing u_i. This results in the following constrained optimization problem, called the optimal steady-state economic dispatch problem:

min_{u*} Σ_{i=1}^n C_i(u_i*)    (7a)
s.t. Σ_{i=1}^n p_i + Σ_{i=1}^n u_i* = 0,    (7b)

where (7b) is a necessary constraint on the steady-state controllable power injections u* in order to achieve ω* = 0_n, as discussed in Section II-B. Here, we adopt the standard assumption that each cost function C_i(u_i) is strictly convex and continuously differentiable [9], [13] with respect to u_i. Then, the optimization problem (7) has a strictly convex objective function and an affine equality constraint. Thus, by the Karush-Kuhn-Tucker conditions [27, Chapter 5.5.3], u* is the unique minimizer of (7) if and only if it ensures identical marginal costs [6], [9], [10], [16], [26], i.e.,

∇C_i(u_i*) = ∇C_j(u_j*), ∀i, j ∈ N.    (8)

2) TRANSIENT PERFORMANCE METRICS
Following sudden major power disturbances, the transient frequency dip can be large in the first few seconds, especially in low-inertia power systems. This may trigger undesired protection measures and even cause cascading failures. Thus, besides the steady-state performance, one should also pay attention to the transient frequency performance with moderate economic cost.
With this aim, we define the following transient performance metrics evaluated along the trajectories of the system (1):
- Frequency nadir is the maximum frequency deviation from the nominal frequency on each bus during the transient response, i.e., ||ω_i||_∞ := sup_{t ∈ [0, T]} |ω_i(t)|;
- Finite-horizon economic cost measures the average cost on a generator or load bus for its participation in frequency control during a time horizon T, i.e., C̄_{i,T} := (1/T) ∫_0^T C_i(u_i(t)) dt.

Then the optimal transient frequency control problem becomes

min_u Σ_{i=1}^n ( ||ω_i||_∞ + ρ C̄_{i,T} )  s.t. u stabilizes the system (1),    (9)

where ρ > 0 is the coefficient for the tradeoff between the frequency performance and the economic cost. Note that we impose the stability requirement on u as a hard constraint in the optimization problem (9), which will play a pivotal role in its design.
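Both transient metrics are straightforward to evaluate on a sampled trajectory. The sketch below computes the per-bus frequency nadir and the finite-horizon average cost from discretized data; the quadratic cost used in the usage is only a placeholder, not the paper's cost model.

```python
import numpy as np

def frequency_nadir(omega_traj):
    """Per-bus frequency nadir: max |omega_i(t)| over a sampled
    trajectory of shape (time_steps, n_buses)."""
    return np.max(np.abs(omega_traj), axis=0)

def avg_cost(u_traj, cost, h, T):
    """Finite-horizon average cost (1/T) * int_0^T C_i(u_i(t)) dt,
    approximated by a Riemann sum with sampling interval h."""
    return (h / T) * np.sum(cost(u_traj), axis=0)
```

For example, a single-bus trajectory with deviations 0.1 and -0.2 has nadir 0.2, and a constant unit injection under a placeholder cost 0.5 u^2 has an average cost of 0.5 over the horizon.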

D. REINFORCEMENT LEARNING FOR FREQUENCY CONTROL
Our goal is to design an optimal stabilizing controller that brings the system (1) to an equilibrium that restores the nominal frequency, i.e., ω* = 0_n, and solves the optimal steady-state economic dispatch problem (7), while solving the optimal transient frequency control problem (9) at the same time. A good starting point for achieving our steady-state control goal is the well-known DAI control. However, it is not straightforward to also optimize the transient performance. This is hard to achieve purely with conventional optimization methods since power systems are nonlinear and power disturbances are unknown. Therefore, we integrate RL into DAI to jointly optimize steady-state and transient performance.

III. GENERALIZED DISTRIBUTED AVERAGING-BASED INTEGRAL CONTROL
We start this section by briefly reviewing distributed averaging-based integral (DAI) control and then generalize it to account for nonlinear control laws. We show that the closed-loop system is stable as long as the nonlinearity satisfies mild conditions on continuity and monotonicity. Interestingly, the steady-state performance objectives are still achieved under our proposed generalized DAI even for nonquadratic cost functions, provided that the gradients of the cost functions on individual buses satisfy a certain condition.

A. REVIEW AND GENERALIZATION OF DAI
DAI control [6], [10], [11], [16] is an established frequency control strategy to meet the steady-state performance objectives discussed in Section II-C. However, most works restrict the cost functions to be quadratic, i.e.,

C_i(u_i) = (c_i/2) u_i²,    (10)

where c_i > 0 is the cost coefficient. With the cost functions given by (10), the power injections u in the system (2) under DAI are

u_i(s_i) = k_i s_i,    (11a)
ṡ_i = −ω_i − c_i Σ_{j=1}^n Q_ij (c_i u_i(s_i) − c_j u_j(s_j)),    (11b)

where k_i > 0 is a tunable control gain and Q_ij ≥ 0 is the weight associated with an undirected connected communication graph (V, E_Q) (not necessarily the same as the physical system topology graph (V, E)) such that Q_ij = Q_ji > 0 if and only if buses i and j communicate, i.e., (i, j) ∈ E_Q. The core idea behind DAI is that the control dynamics can settle down only if the nominal frequency restoration, i.e., ω_i = 0, ∀i ∈ N, and the identical marginal costs, i.e., ∇C_i(u_i) = c_i u_i = c_j u_j = ∇C_j(u_j), ∀i, j ∈ N, are simultaneously achieved, which ensures that ω* = 0_n and u* is the solution of the optimal steady-state economic dispatch problem (7). However, DAI in (11) does not address the transient behavior of the system. Thus, trajectories that eventually reach the equilibrium may exhibit poor transient performance. Therefore, it is desirable to modify DAI so that we have more flexibility in improving the transient performance. Inspired by the fact that the control policy u_i(s_i) of DAI in (11a) is directly proportional to the integral variable s_i, we generalize DAI by allowing u_i(s_i) to be any Lipschitz continuous and strictly increasing function of s_i with u_i(0) = 0. This extends u_i(s_i) from a linear function to any nonlinear function satisfying these mild requirements. The added nonlinearity provides us with a good opportunity to solve the optimal transient frequency control problem (9) while retaining the optimal steady-state performance.
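For concreteness, the snippet below steps the classic DAI integrator under quadratic costs, assuming the gain enters through the proportional policy u_i = k_i s_i and that the integrator tracks the frequency deviation plus the marginal-cost disagreement with communication neighbors; this is a sketch under our stated reading of (11), not necessarily the authors' exact form.

```python
import numpy as np

def dai_step(s, omega, c, Q, k, h=0.01):
    """One Euler step of a classic DAI integrator with quadratic costs
    C_i(u_i) = c_i * u_i**2 / 2, so marginal costs are c_i * u_i.

    s: integral states; omega: frequency deviations; c: cost
    coefficients; Q: symmetric communication weights; k: control gains.
    The integrator is at rest only when omega = 0 and all marginal
    costs agree, which is exactly the steady-state objective.
    """
    u = k * s                                   # injection proportional to integral state
    mc = c * u                                  # marginal costs c_i * u_i
    # Laplacian-style disagreement of marginal costs with neighbors.
    disagreement = Q.sum(axis=1) * mc - Q @ mc
    s_next = s + h * (-omega - c * disagreement)
    return s_next, u
```

At a state with zero frequency deviation and equal marginal costs (e.g., c = (2, 1), s = (1, 2), k = 1 gives marginal costs (2, 2)), the integrator does not move, illustrating the settling condition described above.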
Assumption 1 (Continuity and monotonicity of u): ∀i ∈ N, the function u i (·) : R → R is Lipschitz continuous and strictly increasing with u i (0) = 0.
In our treatment, we also relax the restriction to quadratic cost functions by allowing DAI to handle any cost functions satisfying the following assumption.
Assumption 2 (Scaled cost gradient functions): There exist a strictly convex and continuously differentiable function C_o(·) : R → R and a group of positive scaling factors ζ_i > 0, i ∈ N, such that ∇C_i(x) = ∇C_o(ζ_i x), ∀x ∈ R, ∀i ∈ N.

Assumption 2 admits any strictly convex cost functions whose gradients are identical up to heterogeneous scaling factors. To look in greater detail at how this generalizes beyond quadratic cost functions, we provide some examples of common strictly convex functions that satisfy Assumption 2. If the cost functions on individual buses are:
- power functions with positive even integer powers:

C_i(u_i) = (c_i/r) u_i^r + b_i,    (13)

where c_i > 0, b_i ∈ R, and r is a positive even integer, then ∇C_i(u_i) = c_i u_i^{r−1}. Thus, we can choose C_o(x) := x^r/r, which satisfies, ∀i ∈ N, ∇C_i(x) = ∇C_o(ζ_i x) with ζ_i = c_i^{1/(r−1)};
- arbitrary strictly convex and continuously differentiable functions that are identical up to constant terms: C_i(u_i) = C_o(u_i) + b_i, which satisfy Assumption 2 with ζ_i = 1, ∀i ∈ N.

Clearly, the quadratic cost functions in (10) belong to the first scenario since (13) reduces exactly to (10) if r = 2 and b_i = 0, which validates our generalization through Assumption 2.
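To make the scaled-gradient structure concrete, the sketch below checks it on the power-function example, assuming Assumption 2 reads ∇C_i(x) = ∇C_o(ζ_i x) (our reconstruction of the garbled statement): with C_i(u) = c_i u^r / r and C_o(x) = x^r / r, the factor ζ_i = c_i^{1/(r−1)} matches the two gradients.

```python
def grad_cost(c_i, r):
    """Gradient of the power-function cost C_i(u) = c_i * u**r / r
    (r a positive even integer), i.e. grad C_i(u) = c_i * u**(r-1)."""
    return lambda u: c_i * u ** (r - 1)

def grad_common(r):
    """Gradient of the common cost C_o(x) = x**r / r."""
    return lambda x: x ** (r - 1)

def zeta(c_i, r):
    """Scaling factor zeta_i = c_i**(1/(r-1)), chosen so that
    grad C_i(x) = grad C_o(zeta_i * x) for all x."""
    return c_i ** (1.0 / (r - 1))
```

For instance, with c_i = 5 and r = 4, both sides of the identity evaluate to 40 at x = 2, since c_i x^3 = (c_i^{1/3} x)^3.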
Remark 2 (Cost function extensions and limitations): Although the generation costs of conventional synchronous generators are usually modeled as quadratic functions of their mechanical power output, this need not be the case for inverter-based resources like solar [28] and wind [29] power, as well as electrochemical batteries [30]. Thus, it is important to extend the class of cost functions that DAI can handle, especially when non-conventional generation participates in frequency control. Admittedly, our extension via Assumption 2 is still limited, and its further relaxation is a possible direction for future research.

Now, under Assumptions 1 and 2, we are ready to propose the generalized DAI as follows:

u_i = u_i(s_i), ∀i ∈ N,    (16a)
ṡ_i = −ω_i − ζ_i Σ_{j=1}^n Q_ij (∇C_i(u_i(s_i)) − ∇C_j(u_j(s_j))), ∀i ∈ N,    (16b)

where each u_i(·) satisfies Assumption 1. Although not obvious at first sight, the generalized DAI in (16) preserves the ability of the classic DAI to restore the nominal frequency and solves the optimal steady-state economic dispatch problem (7) for cost functions satisfying Assumption 2. In addition, the more general nonlinear form of u introduced in Assumption 1 provides us with more degrees of freedom to better deal with the optimal transient frequency control problem (9). All of these statements will become clear as the analysis unfolds.
After combining (2) and (16), we can write the overall closed-loop system dynamics under our proposed generalized DAI compactly as the closed-loop system (17). We conclude this subsection by stating a property of the inverse functions of the cost gradient functions on individual buses, which is a consequence of Assumption 2. It is useful later for the stability analysis of the closed-loop system (17).

Lemma 1 (Scaled inverse gradient functions): Let Assumption 2 hold. Then ∇C_i^{-1}(x) = ∇C_o^{-1}(x)/ζ_i, ∀x ∈ R, ∀i ∈ N.
Proof: See the Appendix A.
Provided that the cost functions on individual buses satisfy Assumption 2, Lemma 1 indicates that, for any given marginal price, each generator or load contributes the same amount of controllable power injection up to a scaling factor.
Remark 3 (Existence and monotonicity of ∇C_i^{-1}(·) and ∇C_o^{-1}(·)): Note that the existence of ∇C_i^{-1}(·) is guaranteed by the strict convexity of C_i(·). More precisely, by [31, Theorem 2.14], the strict convexity of C_i(·) implies that ∇C_i(·) is a strictly increasing function. According to [32, Chapter 12.9], this further implies that ∇C_i(·) has an inverse function ∇C_i^{-1}(·), which is also strictly increasing. The same argument holds for ∇C_o^{-1}(·).

B. CLOSED-LOOP EQUILIBRIUM ANALYSIS
The first result we show is that the steady-state performance objectives are still achieved under the generalized DAI. This is captured by the following theorem.

Theorem 1 (Steady-state optimality of generalized DAI): Let Assumptions 1 and 2 hold. Then the equilibrium (δ*, ω*, s*) of the closed-loop system (17) is a unique point satisfying

ω* = 0_n,    (19a)
p + u(s*) = ∇U(δ*),    (19b)
u(s*) = ∇C^{-1}(γ 1_n),    (19c)

with γ uniquely determined by

Σ_{i=1}^n p_i + ∇C_o^{-1}(γ) Σ_{i=1}^n (1/ζ_i) = 0.    (20)

Proof: The equilibrium analysis is similar to the one presented in Section II-B, except that the effect of the integral variables s introduced by the generalized DAI should be considered. Hence, in steady state, (17) yields

0_n = p + u(s*) − A 1_n ω* − ∇U(δ*),    (21a)
0_n = 1_n ω* + Z L_Q ∇C(u(s*)),    (21b)

where we have used the equilibria characterizations ω* = 1_n ω* and (4). We then proceed to investigate the equilibrium by following a similar argument as in [16, Lemma 4.2]. Premultiplying (21b) by 1_n^T Z^{-1} yields

0 = 1_n^T Z^{-1} 1_n ω* + 1_n^T L_Q ∇C(u(s*)) = 1_n^T Z^{-1} 1_n ω*,    (22)

where the second equality is due to the property of the Laplacian matrix [33] that 1_n^T L_Q = 0_n^T. Note that 1_n^T Z^{-1} 1_n > 0 since Z^{-1} ≻ 0 by construction. Thus, (22) implies that ω* = 0. Then, (21b) becomes Z L_Q ∇C(u(s*)) = 0_n, which indicates that ∇C(u(s*)) ∈ range(1_n), i.e., ∇C(u(s*)) = γ 1_n for some constant γ. Thus, u(s*) = ∇C^{-1}(γ 1_n). Substituting this and ω* = 0 into (21a) yields

p + ∇C^{-1}(γ 1_n) = ∇U(δ*).    (23)

Then, premultiplying (23) by 1_n^T and invoking Lemma 1 together with the fact that 1_n^T ∇U(δ) = 0 yields the equation (20) that determines γ. We can show that the solution γ of (20) is unique by way of contradiction. Suppose that both γ and γ̄ satisfy (20), where γ ≠ γ̄. Then,

0 = (∇C_o^{-1}(γ) − ∇C_o^{-1}(γ̄)) Σ_{i=1}^n (1/ζ_i) ≠ 0,    (24)

where the inequality is due to the fact that ∇C_o^{-1}(·) is strictly increasing, as discussed in Remark 3. Clearly, (24) is a contradiction. Thus, γ is unique.
It remains to show that the equilibrium is unique. We focus first on the uniqueness of s*. By way of contradiction, suppose that both s* and s′ satisfy (19c), where s* ≠ s′. Then, u(s*) − u(s′) = 0_n, which after being multiplied by (s* − s′)^T yields

0 = (s* − s′)^T (u(s*) − u(s′)) > 0,    (25)

where the inequality results from our requirement that u_i(s_i) is strictly increasing with respect to s_i. Clearly, (25) is a contradiction. Thus, s* is unique. By [34, Lemma 1], the solution δ* to (23) is unique. This concludes the proof of the uniqueness of the equilibrium.

Theorem 1 verifies that the generalized DAI preserves the steady-state performance of the classic DAI. It forces the closed-loop system (17) to settle down at a unique equilibrium point where the frequency is nominal, i.e., ω* = 0_n, and the controllable power injections u* meet the identical marginal cost requirement (8) since, ∀i ∈ N,

∇C_i(u_i*) = ∇C_i(u_i(s_i*)) = γ.    (26)

Thus, the objectives of nominal steady-state frequency and optimal steady-state economic dispatch for a broader range of cost functions are both achieved.

C. LYAPUNOV STABILITY ANALYSIS
Having characterized the equilibrium point and confirmed the steady-state performance of the closed-loop system (17) under the generalized DAI, we are now ready to investigate the system stability by performing a Lyapunov stability analysis. More precisely, the stability under the generalized DAI can be certified by finding a well-defined "weak" Lyapunov function that is nonincreasing along the trajectories of the closed-loop system (17). The main result of this subsection is presented below, whose proof is enabled by a sequence of smaller results that we discuss next.

Theorem 2 (Asymptotic stability of generalized DAI): Let Assumptions 1 and 2 hold. Then the unique equilibrium point (δ*, 0_{|G|}, s*) of the closed-loop system (17) characterized by Theorem 1 is locally asymptotically stable.
Inspired by [9], we begin our construction of a Lyapunov function by defining the following integral function:

L(s) := Σ_{i=1}^n ∫_0^{s_i} u_i(τ) dτ,    (27)

which clearly satisfies ∇L(s) = u(s). Another useful property related to L(s) is given in the next lemma.

Lemma 3 (Strict convexity of L(s)): Let Assumption 1 hold. Then L(s) defined in (27) is strictly convex.

Proof: See the Appendix C.
Having characterized the properties of L(s), we consider the following Lyapunov function candidate:

W(δ, ω_G, s) := (1/2) ω_G^T M_G ω_G + U(δ) − U(δ*) − ∇U(δ*)^T (δ − δ*) + L(s) − L(s*) − u(s*)^T (s − s*),    (28)

where M_G := diag(m_i, i ∈ G) and (δ*, 0_{|G|}, s*) corresponds to the unique equilibrium point of the closed-loop system (17) satisfying (19). The next result shows that this is a well-defined Lyapunov function candidate on X.
Next, we examine the system stability through the derivative of W (δ, ω G , s) along the trajectories of the closed-loop system (17), whose expression is provided by the next result.
Lemma 5 (Directional derivative of Lyapunov function): Let Assumptions 1 and 2 hold. Then the derivative of the Lyapunov function W(δ, ω_G, s) defined in (28) along the trajectories of the closed-loop system (17) is given by

Ẇ(δ, ω_G, s) = −ω_G^T A_G ω_G − (p_L + u_L(s_L) − ∇_L U(δ))^T A_L^{-1} (p_L + u_L(s_L) − ∇_L U(δ)) − u(s)^T Z L_Q ∇C(u(s)),    (29)

where A_G := diag(α_i, i ∈ G), A_L := diag(α_i, i ∈ L), and ∇_L U(δ) collects the entries of ∇U(δ) on the load buses.

Proof: See the Appendix E.

Our next step is to show that Ẇ(δ, ω_G, s) is nonpositive. From (29), we observe that the quadratic terms defined by A_G and A_L^{-1} are clearly nonpositive. Thus, it only remains to determine the sign of the cross-term u(s)^T Z L_Q ∇C(u(s)) defined by the scaled Laplacian matrix Z L_Q. The first thing we notice is that this term, as a bilinear form with respect to the scaled Laplacian matrix Z L_Q, can be expanded as the expression provided below.
Lemma 6 (Bilinear form on R^n for scaled L_Q): ∀x, y ∈ R^n,

x^T Z L_Q y = (1/2) Σ_{i=1}^n Σ_{j=1}^n Q_ij (ζ_i x_i − ζ_j x_j)(y_i − y_j).

Proof: See the Appendix F.

With the help of Lemma 6, the problem of determining the sign of the cross-term u(s)^T Z L_Q ∇C(u(s)) can be solved using the following corollary.

Corollary 1 (Sign of cross-term): If Assumption 2 holds, then u(s) T ZL Q ∇C(u(s)) ≥ 0 with equality holding if and only if ∇C(u(s)) ∈ range(1 n ).
Proof: See the Appendix G.
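A quick numerical check of Corollary 1 is possible, again assuming Assumption 2 in the form ∇C_i(x) = ∇C_o(ζ_i x): the cross-term then equals (Zu)^T L_Q ∇C_o(Zu), which is nonnegative for any strictly increasing ∇C_o and vanishes when all ζ_i u_i agree. The graph, scalings, and ∇C_o(x) = x³ below are arbitrary illustrations.

```python
import numpy as np

def cross_term(u, zeta, Q, grad_common):
    """Evaluates u^T Z L_Q grad C(u), taking grad C_i(x) = grad C_o(zeta_i * x),
    so the term equals (Z u)^T L_Q grad C_o(Z u)."""
    L = np.diag(Q.sum(axis=1)) - Q                  # graph Laplacian L_Q
    return float((zeta * u) @ (L @ grad_common(zeta * u)))

rng = np.random.default_rng(0)
n = 5
Q = rng.random((n, n))
Q = (Q + Q.T) / 2.0
np.fill_diagonal(Q, 0.0)                            # symmetric communication weights
zeta = rng.random(n) + 0.5                          # positive scaling factors
u = rng.standard_normal(n)
val = cross_term(u, zeta, Q, lambda x: x ** 3)      # increasing grad C_o => val >= 0
```

Choosing u = 1/ζ elementwise makes all ζ_i u_i identical, so the marginal costs lie in range(1_n) and the cross-term drops to (numerically) zero, matching the equality condition of Corollary 1.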
Theorem 2 shows that the closed-loop system (17) under the generalized DAI is locally asymptotically stabilized to the unique equilibrium characterized by Theorem 1, where the steady-state performance objectives are achieved even for a broader range of cost functions beyond quadratic ones.

IV. REINFORCEMENT LEARNING FOR OPTIMAL TRANSIENT FREQUENCY CONTROL
In this section, we focus on integrating reinforcement learning (RL) into our proposed generalized DAI for the purpose of further improving the transient performance of the system without jeopardizing its stability. Basically, after parameterizing the control policy u_i(s_i) that maps the integral variable s_i of the generalized DAI to the controllable power injection u_i on each bus as a monotonic neural network, we train those neural networks with an RL algorithm based on a recurrent neural network (RNN). We now discuss this in more detail, mostly following the approach from [24].

A. MONOTONIC NEURAL NETWORKS FOR STABILITY GUARANTEE
Recall from Section III that the stability of the closed-loop system (17) under the generalized DAI is ensured by any nonlinear control policy u_i(s_i) that is a Lipschitz continuous and strictly increasing function of s_i with u_i(0) = 0, as summarized by Assumption 1, provided that the cost functions on individual buses satisfy Assumption 2. Within this class of stabilizing controllers, we want to find one that has the best transient performance when the frequencies are recovering to their nominal values. In principle, optimizing over all controllers satisfying Assumption 1 is an infinite-dimensional problem. To make the problem tractable, we need to parameterize u_i(s_i) in some way. Neural networks emerge as a natural candidate due to their universal approximation property [36]. According to [24, Theorem 2], any function of our interest can be approximated accurately enough by a single-hidden-layer fully-connected neural network with rectified linear unit (ReLU) activation functions for properly designed weights and biases. Such a neural network is called a stacked ReLU monotonic neural network [24, Lemma 5], whose architecture is shown in Fig. 1.

FIGURE 1. Stacked ReLU monotonic neural network.
As illustrated in Fig. 1, each stacked ReLU monotonic neural network is essentially a parameterized mapping from the state s_i to the control u_i with weights k_i^+, k_i^- ∈ R^d and biases b_i^+, b_i^- ∈ R^d, split between a top part and a bottom part of d hidden neurons each. Thus, it is natural to investigate the neural network by this partition. For the top part, the state s_i is transformed linearly as (s_i − b_{i,j}^+) and then processed by the ReLU in the jth hidden neuron, ∀j ∈ D := {1, . . . , d}. Denote the ReLU operator by σ(·) for simplicity. Then, the output of the jth hidden neuron in the top part becomes σ(s_i − b_{i,j}^+), which is linearly weighted by k_{i,j}^+ before contributing to the final output. Therefore, the contribution from the jth hidden neuron to the final output is k_{i,j}^+ σ(s_i − b_{i,j}^+), which implies that the jth hidden neuron is activated only if the state s_i exceeds the threshold b_{i,j}^+, as illustrated in Fig. 2(a). At the output layer, the contributions from all d hidden neurons in the top part yield Σ_{j=1}^d k_{i,j}^+ σ(s_i − b_{i,j}^+). The bottom part follows a similar logic, where the d hidden neurons produce Σ_{j=1}^d k_{i,j}^- σ(−(s_i − b_{i,j}^-)). In order to let u_i(s_i) satisfy Assumption 1, we need to ensure that its net slope is always positive and finite. This gives rise to the following requirements on the weights and biases [24]:

b_{i,1}^+ = 0 ≤ b_{i,2}^+ ≤ · · · ≤ b_{i,d}^+,  b_{i,1}^- = 0 ≥ b_{i,2}^- ≥ · · · ≥ b_{i,d}^-,    (30a)
Σ_{j=1}^ι k_{i,j}^+ > 0,  Σ_{j=1}^ι k_{i,j}^- < 0,  ∀ι ∈ D,    (30b)

which can be illustrated via Fig. 2. Observe from Fig. 2(a) that the requirements on biases in (30a) assign the top d and bottom d hidden neurons to take charge of the control policy u_i(s_i) for s_i ≥ 0 and s_i ≤ 0, respectively, since they are deactivated the other way around, where a by-product of b_{i,1}^- = b_{i,1}^+ = 0 is that u_i(0) = 0. Observe from Fig. 2(b) that the requirements on weights in (30b) simply ensure that the slope of the piece-wise linear curve u_i(s_i) is always positive and finite since the ιth piece of u_i(s_i) for s_i ≥ 0 and s_i ≤ 0 has the slope Σ_{j=1}^ι k_{i,j}^+ and −Σ_{j=1}^ι k_{i,j}^-, respectively, ∀ι ∈ D. Thus, under (30), the stacked ReLU monotonic neural network shown in Fig. 1 produces a control policy u_i(s_i) satisfying Assumption 1 with a shape as sketched in Fig. 2(b).
The inequality constraints in (30) are not trivial to enforce when training the neural networks. A simple trick to address this challenge is to introduce a group of intermediate parameters μ_i^+, χ_i^+, μ_i^-, and χ_i^- from which the weights and biases are constructed such that the orderings and cumulative-sum conditions in (30) hold by design (e.g., through nonnegative transformations and cumulative sums of the intermediate parameters). In this way, the requirements in (30) naturally hold. Thus, the remaining problem is just the search for the optimal parameters μ_i^+, χ_i^+, μ_i^-, and χ_i^- through the training of the neural networks. To highlight the dependence of the control policy u_i(s_i) generated by the ith neural network on these parameters, we will use the notation u_i(s_i; Θ_i(d)) in the rest of this paper, where Θ_i(d) denotes the set of parameters Θ_i := {μ_i^+, χ_i^+, μ_i^-, χ_i^-} associated with a stacked ReLU monotonic neural network that has 2d hidden neurons.
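A minimal sketch of such a policy is given below. The reparameterization used here (squares plus a small positive offset for the slope increments, cumulative squares for the thresholds) is our own stand-in for the intermediate-parameter trick; it enforces the slope and threshold orderings of (30) by construction rather than reproducing the exact construction in [24].

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def stacked_relu_policy(s, w_pos, b_pos, w_neg, b_neg):
    """Monotonic stacked-ReLU policy u_i(s_i) with d neurons per stack.

    w_pos, w_neg: shape (d,) raw parameters for slope increments;
    b_pos, b_neg: shape (d-1,) raw parameters for thresholds.
    Squaring (plus a small offset) keeps every cumulative slope strictly
    positive, and cumulative squares order the thresholds, so the policy
    is strictly increasing with u_i(0) = 0 by construction.
    """
    kp = w_pos ** 2 + 1e-6                                  # slope increments for s >= 0
    bp = np.concatenate(([0.0], np.cumsum(b_pos ** 2)))     # 0 = b_1 <= b_2 <= ...
    kn = w_neg ** 2 + 1e-6                                  # slope increments for s <= 0
    bn = -np.concatenate(([0.0], np.cumsum(b_neg ** 2)))    # 0 = b_1 >= b_2 >= ...
    top = np.sum(kp * relu(s - bp))                         # active only for s > 0
    bottom = -np.sum(kn * relu(bn - s))                     # active only for s < 0
    return top + bottom
```

In the text's sign convention the bottom-stack weights are negative with negative cumulative sums; here they are folded into the leading minus sign, which yields the same piecewise-linear, strictly increasing shape through the origin.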

B. RECURRENT NEURAL NETWORK-BASED REINFORCEMENT LEARNING
A recurrent neural network (RNN) is a class of neural networks with a looping mechanism that allows information learned at one step to flow to the next step. This feature makes an RNN suitable for learning parameters related to dynamic systems, where prior states usually affect subsequent states. Thus, to efficiently train the stacked ReLU monotonic neural networks for the parameters Θ_i(d), ∀i ∈ N, we adopt the RNN-based RL algorithm proposed in [24], whose scheme is shown in Fig. 3(a).
To fit into the RNN architecture, we have to discretize the problem (9). Let the lth sampling instant after the initial time t_0 be (t_0 + lh), where h is the sampling interval. Then, using the voltage angles θ := (θ_i, i ∈ N) ∈ R^n as an example, we would like to approximate the continuous-time state at the lth sampling instant θ(t_0 + lh) by a discrete-time state θ^l. Provided that θ^0 = θ(t_0), the simplest way to obtain θ^l ≈ θ(t_0 + lh) is through the recurrence relation given by the Euler method, i.e., θ^{l+1} = θ^l + 2π f_0 h ω^l. In general, the Euler method works well for a small enough sampling interval h, which means that the discretized system dynamics are sufficiently close to the original system dynamics. This allows us to solve the discretized problem below instead:

min_{{Θ_i(d), i ∈ N}} Σ_{i=1}^n ( ||ω_i||_∞ + ρ C̄_{i,T} )    (31)

with

||ω_i||_∞ := max_{0 ≤ l ≤ ⌊T/h⌋} |ω_i^l|,  C̄_{i,T} := (h/T) Σ_{l=0}^{⌊T/h⌋−1} C_i(u_i(s_i^l)),    (32)

where ⌊T/h⌋ denotes the floor of T/h. After a random initialization of {Θ_i(d), i ∈ N}, the RNN-based RL algorithm solves the problem (31) by gradually learning the optimal {Θ_i(d), i ∈ N} through an iterative process mainly composed of three parts, namely sampled trajectory generation, loss function evaluation, and backpropagation-based policy improvement, as illustrated in Fig. 3(a). For a given operating point (θ^0, ω^0_G, s^0) of the power system, a sample of the system trajectory over a time horizon T following sudden power disturbances can be generated by randomly initializing the constant input p to the RNN cell and then simulating ⌊T/h⌋ time steps forward. The detailed structure of the RNN cell is shown in Fig. 3(b), where the control signals u(s^l) produced by the n stacked ReLU monotonic neural networks with parameters {Θ_i(d), i ∈ N} are applied to the discretized closed-loop system with states (θ^l, ω^l_G, s^l) for the generation of (θ^{l+1}, ω^{l+1}_G, s^{l+1}). In this way, the RNN cell evolves forward for l = 0, . . . , ⌊T/h⌋ − 1, which yields one trajectory under the parameters {Θ_i(d), i ∈ N} that are shared temporally.
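The accuracy claim for the Euler method can be checked on a scalar test equation: shrinking the step size shrinks the global error roughly in proportion, which is the behavior the discretization above relies on.

```python
import math

def euler(f, x0, h, steps):
    """Forward-Euler integration x_{l+1} = x_l + h * f(x_l)."""
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x

# Integrate dx/dt = -x from x(0) = 1 up to t = 1 and compare with the
# exact solution exp(-1) for two different sampling intervals h.
coarse = euler(lambda x: -x, 1.0, 0.1, 10)     # h = 0.1
fine = euler(lambda x: -x, 1.0, 0.01, 100)     # h = 0.01
```

Here `fine` lands within about 0.002 of exp(-1) while `coarse` is off by about 0.02, illustrating why a sufficiently small h keeps the discretized trajectory close to the continuous one.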
The loss function for this trajectory is defined as the objective function of the problem (31), which can be computed by applying (32) to the stored historical outputs of the RNN cell. Usually, to improve training efficiency, multiple trajectories are generated in parallel, each excited by a randomly initialized constant input p. The collection of these trajectories is called a batch and the number of such trajectories is called the batch size. The loss function for a batch of size |B| is then defined as the mean of the losses of the individual trajectories in the batch, where, with a slight abuse of notation, we simply introduce a superscript [b] to denote the loss along the bth trajectory in the batch rather than accurately distinguishing ‖ω_i^l‖_∞ and C_{i,T} across trajectories, so as not to overcomplicate the notation. Once the loss of the batch J({ i (d ), i ∈ N}) has been calculated, the parameters { i (d ), i ∈ N} are adjusted by the batch gradient descent method with a suitable learning rate ξ, where the gradient is approximated efficiently through backward gradient propagation. This completes one epoch of the iterative training process. The procedure is repeated until the number of epochs E is reached, which produces the optimal { i (d ), i ∈ N} for the stacked ReLU monotonic neural networks, so that the corresponding control policy u_i(s_i; i (d )) on each bus enables the generalized DAI to best handle the optimal transient frequency control problem (9).
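The three-part loop described above (trajectory generation, batch loss evaluation, gradient-based parameter update) can be sketched on a toy scalar system. Here a finite-difference gradient and a simple linear policy u = −kω stand in for backpropagation through the stacked ReLU networks, so the dynamics, learning rate, and horizon are all illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_loss(k, p_batch, h=0.05, steps=200, m=1.0, d=1.0):
    """Mean loss over a batch of trajectories, each excited by a constant
    disturbance p; the policy u = -k*omega is a toy stand-in for the
    stacked ReLU monotonic policy of the paper."""
    losses = []
    for p in p_batch:
        omega, loss = 0.0, 0.0
        for _ in range(steps):
            u = -k * omega
            omega += (h / m) * (p + u - d * omega)   # Euler step of the toy dynamics
            loss += h * omega ** 2                   # accumulated squared deviation
        losses.append(loss)
    return float(np.mean(losses))

def train(epochs=50, batch_size=8, lr=0.05, eps=1e-4):
    k = 0.0                                          # initialization of the parameter
    for _ in range(epochs):
        p_batch = rng.uniform(-1.0, 1.0, size=batch_size)   # one batch of disturbances
        # Central finite difference replaces backpropagation in this sketch.
        grad = (batch_loss(k + eps, p_batch) - batch_loss(k - eps, p_batch)) / (2 * eps)
        k -= lr * grad                               # batch gradient descent step
    return k
```

Each pass over a batch corresponds to one epoch; in the paper the gradient is instead obtained by backpropagation through the unrolled RNN cells.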

V. NUMERICAL ILLUSTRATIONS
In this section, we numerically validate the performance of our RL-DAI on the 39-bus New England system [37]. First, we train the optimal control policy u(s) = (u_i(s_i; i (d )), i ∈ N) for the RL-DAI on the ideal power system model described in Section IV-B using the TensorFlow framework. Then, we validate the performance of the trained RL-DAI on a more realistic power system test case using the Power System Toolbox (PST) [38] for MATLAB.
The dynamic model of the 39-bus New England system contains 10 generator buses and 29 load buses, whose union is denoted by N. Each of the 10 generator buses is distinctly indexed by some i ∈ {30, . . . , 39} =: G and each of the 29 load buses is distinctly indexed by some i ∈ {1, . . . , 29} =: L. The generator inertia constants m_i, the voltage magnitudes v_i, and the susceptances B_{ij} are obtained directly from the dataset. Since the values of the frequency sensitivity coefficients are not provided by the dataset, we set α_i = 150 p.u., ∀i ∈ G, and α_i = 100 p.u., ∀i ∈ L.⁷ We then add the frequency control u_i governed by our RL-DAI to each bus, where the underlying communication graph (V, E_Q) has 38 edges forming a path that connects all buses, with Q_{ij} = 125, ∀{i, j} ∈ E_Q. The operational cost functions are assumed to be C_i(u_i) = c_i u_i^4/4 + b_i, ∀i ∈ N, where the cost coefficients c_i and b_i are generated randomly from (0, 1) and (0, 0.001), respectively. Thus, ∀i ∈ N, ∇C_i(u_i) = c_i u_i^3 by (14) and ζ_i = c_i^{1/3} by (15), which determines the evolution of the integral variable s_i in RL-DAI according to (16). In the remainder of this section, we first train the control policy u(s) = (u_i(s_i; i (d )), i ∈ N) on the ideal power system model and then apply the optimal control policy obtained from the training to the more realistic power system test case.
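With these quartic costs, equalizing the marginal costs c_i u_i^3 = γ pins down the steady-state economic dispatch in closed form. The sketch below computes it for a hypothetical total imbalance P split among a handful of buses and checks that the marginal costs coincide; the function name and the three-bus example are our own illustrative assumptions.

```python
import numpy as np

def optimal_dispatch(c, P):
    """Steady-state dispatch for costs C_i(u) = c_i*u**4/4 (+ constant b_i).

    At the optimum all marginal costs agree: c_i*u_i**3 = gamma for every i,
    subject to the balance constraint sum(u_i) = P."""
    c = np.asarray(c, dtype=float)
    # From u_i = (gamma/c_i)**(1/3) and sum(u_i) = P:
    gamma = (P / np.sum(c ** (-1.0 / 3.0))) ** 3
    u = np.sign(gamma) * (np.abs(gamma) / c) ** (1.0 / 3.0)
    return u, gamma
```

This is the steady state that DAI drives the system toward; RL-DAI shapes only the transient on the way there.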

A. TRAINING OF STACKED RELU MONOTONIC NEURAL NETWORKS
Following the RNN-based RL algorithm described in Section IV-B, we can train the stacked ReLU monotonic neural networks that construct the control policy u(s) = (u_i(s_i; i (d )), i ∈ N).
⁷ All per-unit values are on the system base, where the system power base is S_0 = 100 MVA and the nominal system frequency is f_0 = 50 Hz.
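As a rough illustration of how sign constraints on weights and biases can enforce monotonicity through the origin, the following is a simplified monotone ReLU policy. It is not the paper's exact stacked architecture or its constraints (30); the class name and parameterization are our own assumptions.

```python
import numpy as np

class MonotonePolicy:
    """A simplified monotone ReLU policy u(s): nondecreasing with u(0) = 0.

    Nonnegative output weights (alphas, gammas) and sign-constrained
    breakpoints (betas >= 0, deltas <= 0) guarantee both properties,
    mimicking the role of the paper's constraints (30)."""

    def __init__(self, alphas, betas, gammas, deltas):
        assert np.all(np.asarray(alphas) >= 0) and np.all(np.asarray(gammas) >= 0)
        assert np.all(np.asarray(betas) >= 0) and np.all(np.asarray(deltas) <= 0)
        self.alphas = np.asarray(alphas, float)
        self.betas = np.asarray(betas, float)
        self.gammas = np.asarray(gammas, float)
        self.deltas = np.asarray(deltas, float)

    def __call__(self, s):
        # Each relu(s - beta) is nondecreasing and vanishes at s = 0 (beta >= 0);
        # each -relu(delta - s) is nondecreasing and vanishes at s = 0 (delta <= 0).
        pos = np.sum(self.alphas * np.maximum(s - self.betas, 0.0))
        neg = -np.sum(self.gammas * np.maximum(self.deltas - s, 0.0))
        return pos + neg
```

If the sign constraints are dropped, the policy can violate u(0) = 0 or monotonicity, which is precisely the failure mode examined in Section V-B and Fig. 8.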

B. TIME-DOMAIN RESPONSES FOLLOWING POWER IMBALANCES
We now proceed to test the performance of our trained RL-DAI on the 39-bus New England system test case provided by PST. Although the simplified model is used in our theoretical analysis and training process, the test case adopts a more detailed power system model that includes line losses and voltage dynamics. As for the generators, instead of the ideal swing dynamics, our simulations adopt two-axis sub-transient reactance models equipped with multi-stage turbines, where the turbine parameters are chosen to be somewhat heterogeneous to make the case more realistic. Moreover, instead of controllable power injections that follow our commands instantly and perfectly, the frequency control is implemented through inverter-interfaced resources on individual buses, where we explicitly model the dynamics of the interfacing converters, including the phase-locked loop and the inner current control loop, as in [39].
For the purpose of comparison, the frequency deviations of the original system without the additional frequency control u, when there is a step change of 3 p.u. in power consumption at buses 15, 23, and 29 at time t = 15 s, are provided in Fig. 6. Clearly, in the absence of proper control, the transient frequency dip is large and there is a noticeable steady-state frequency deviation from the nominal frequency.
The performance of the system under RL-DAI in the same scenario is given in Fig. 7. Some observations are in order. First, a comparison between Figs. 6 and 7(a) shows that RL-DAI significantly improves the frequency nadir and perfectly restores the frequencies to the nominal value, as predicted by Theorem 1. Second, Fig. 7(d) confirms that RL-DAI helps generators and loads asymptotically reach identical marginal costs, so the optimal steady-state economic dispatch problem is solved. Third, the subtle differences between the shapes of the trajectories of s and u in Fig. 7(b) and (c) indicate the nonlinearity of the learned control policy u(s). Clearly, this nonlinearity does not jeopardize the stability of the system.
As a baseline, the performance of the system under linear-DAI in the same scenario is also provided in Fig. 7. Here, linear-DAI denotes the DAI with the linear control policy u_i(s_i) = k_i s_i as in (11a), but with the dynamics of the integral variable s_i modified according to (16) in order to handle the biquadratic operational cost functions. By Theorems 1 and 2, linear-DAI trivially restores the nominal frequency, achieves identical marginal costs, and ensures closed-loop stability; our interest is therefore in its transient performance. To make a fair comparison, we gradually and uniformly increase the linear control gain k_i for linear-DAI until the settling time under linear-DAI matches that under RL-DAI. By settling time, we mean the time required for all states to reach and stay within 1% of their final values, which is around 343 s in our case. This procedure yields k_i = 1.57, ∀i ∈ N. Thus, when generating the dynamic simulation results for linear-DAI in Fig. 7, we adopt the linear control policy u_i(s_i) = 1.57 s_i, ∀i ∈ N, whose shape is shown in Fig. 5. Observe from Fig. 7 that the trajectories of the aggregate economic cost are similar under RL-DAI and linear-DAI, but RL-DAI improves the frequency nadir far more than linear-DAI does. This observation is confirmed by Table 1, which summarizes the transient performance metrics under the two controllers. Thus, at practically the same economic cost, RL-DAI improves the frequency dynamics more than linear-DAI does, which shows the advantage of our proposed RL-DAI.
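The 1% settling-time criterion used to tune k_i can be computed from a simulated trajectory as follows; the helper name and the handling of a near-zero final value are our own assumptions.

```python
import numpy as np

def settling_time(t, x, tol=0.01):
    """Return the time after which x stays within tol (fraction) of its
    final value; mirrors the 1% criterion used to match linear-DAI to RL-DAI."""
    x = np.asarray(x, float)
    final = x[-1]
    band = tol * max(abs(final), 1e-12)      # guard against a zero final value
    outside = np.abs(x - final) > band
    if not outside.any():
        return t[0]                          # already settled at the start
    last_out = np.max(np.nonzero(outside)[0])
    if last_out + 1 >= len(t):
        return None                          # never settles within the horizon
    return t[last_out + 1]
```

Applying this to every state trajectory and taking the maximum gives the system-level settling time used in the comparison.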
Finally, to better illustrate the importance of the constraints (30) on the weights and biases of the stacked ReLU monotonic neural networks shown in Fig. 1 for the stability of the closed-loop system under RL-DAI, we also test the case where the constraints (30) are relaxed during the training of RL-DAI. Examples of the resulting control policies u_i(s_i) are plotted in Fig. 5, which clearly shows that each u_i(s_i) obtained neither necessarily passes through the origin nor necessarily increases with s_i. This implies that Assumption 1, one of the key assumptions for Theorems 1 and 2, may be violated if the constraints (30) are not enforced, so there is no stability guarantee in this case. This is in line with the dynamic simulation results shown in Fig. 8, where the system is obviously unstable since all states eventually blow up. Last but not least, note the slight frequency oscillations immediately following t = 0 s in Fig. 8(a), an artifact caused by the failure to satisfy u_i(0) = 0, ∀i ∈ N. This directly illustrates why each control policy u_i(s_i) must pass through the origin: otherwise the controllers take actions even at the equilibrium point.

VI. CONCLUSIONS AND OUTLOOK
This paper develops a framework that bridges the gap between optimizing the steady-state and transient frequency control performance of power systems undergoing power disturbances. In particular, we propose to generalize DAI by adding properly designed nonlinearity to it. This gives us the freedom to integrate RL techniques into DAI to find a stable nonlinear controller that optimizes the transient frequency behavior, while taking advantage of the DAI mechanism to optimize the steady-state cost of restoring the frequency to its nominal value.
Several interesting research directions are open. First, we would like to expand the characterization of the properties of the learned controller beyond its stabilizing nature to address questions about robustness to unmodelled dynamics. In addition, although the proposed controllers only require local information when they are implemented, they are trained in a centralized manner. Understanding how they can be trained in a distributed way is also very relevant. Finally, in this paper, we have extended the class of cost functions from quadratic ones to those that satisfy a certain scaling property. It would be interesting to see if DAI control can be applied to an even larger class of cost functions.

B. Proof of Lemma 2
By a similar argument as in the proof of [11, Lemma 4], we can easily get the first inequality and
‖ω_L‖_2 ≤ ν̄(‖δ − δ*‖_2 + ‖u(s) − u(s*)‖_2)   (33)
for some constant ν̄ > 0. Under Assumption 1, each u_i(s_i) is Lipschitz continuous, which implies that there exists some constant η > 0 such that the bound holds ∀i ∈ N, which is exactly (29). Here, some terms are added or removed to construct a quadratic form without affecting the original value of Ẇ(δ, ω_G, s). In step ①, the property that −∇_G U(δ*) + p_G + u_G(s*) = 0_{|G|} from (19b) of Theorem 1 is used. In step ②, the fact that ∇U(δ)^T 1_n = 0 is used. In step ③, the fact that ω_L is determined by (26) and the property that u(s*)^T Z L_Q = ∇C_o^{−1}(γ) 1_n^T Z^{−1} Z L_Q = ∇C_o^{−1}(γ) 1_n^T L_Q = 0_n^T, obtained from (19c) of Theorem 1, are used. In step ④, the property that −∇_L U(δ*) + p_L + u_L(s*) = 0_{|L|} from (19b) of Theorem 1 is used.

F. Proof of Lemma 6
This is a direct extension of the standard Laplacian potential function [33]. We claim that, ∀i, j ∈ N, the signs of (∇C_i(u_i(s_i)) − ∇C_j(u_j(s_j))) and (ζ_i u_i(s_i) − ζ_j u_j(s_j)) agree, i.e., (38) holds, where sign denotes the sign function. To see this, without loss of generality, ∀i, j ∈ N, we only consider the case where ∇C_i(u_i) − ∇C_j(u_j) > 0. Since ∇C_i(u_i) > ∇C_j(u_j), we have ζ_i u_i = ζ_i ∇C_i^{−1}(∇C_i(u_i)) > ζ_i ∇C_i^{−1}(∇C_j(u_j)) = ∇C_o^{−1}(∇C_j(u_j)) = ζ_j ∇C_j^{−1}(∇C_j(u_j)) = ζ_j u_j. Here, the inequality results from the fact that ζ_i > 0 and ∇C_i^{−1}(·) is strictly increasing, as mentioned in Remark 3, while the second and third equalities hold by Lemma 1. The case where ∇C_i(u_i) − ∇C_j(u_j) ≤ 0 follows from an analogous argument. Therefore, our claim that (38) holds is true, which, together with the fact that Q_{ij} > 0, ∀{i, j} ∈ E_Q, implies that (37) is nonnegative. Notably, since each term of the sum in (37) is nonnegative, when the sum vanishes, each term must be 0. Thus, u(s)^T Z L_Q ∇C(u(s)) = 0 if and only if, ∀{i, j} ∈ E_Q, (∇C_i(u_i(s_i)) − ∇C_j(u_j(s_j)))(ζ_i u_i(s_i) − ζ_j u_j(s_j)) = 0, which is equivalent to ∇C_i(u_i(s_i)) = ∇C_j(u_j(s_j)), ∀{i, j} ∈ E_Q, by (38). Recall that the communication graph (V, E_Q) is assumed to be connected, which implies that ∇C_1(u_1(s_1)) = . . . = ∇C_n(u_n(s_n)), i.e., ∇C(u(s)) ∈ range(1_n).
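For the quartic costs used in Section V, the sign-parity claim (38) can also be checked numerically: since x ↦ x³ is strictly increasing, sign(ζ_i u_i − ζ_j u_j) = sign((ζ_i u_i)³ − (ζ_j u_j)³) = sign(c_i u_i³ − c_j u_j³). A minimal sketch of this check, with the function name our own:

```python
import numpy as np

rng = np.random.default_rng(1)

def sign_parity_holds(c_i, c_j, u_i, u_j):
    """Check the parity claim (38) for costs with grad C_i(u) = c_i*u**3
    and zeta_i = c_i**(1/3): the signs of the marginal-cost difference and
    of the scaled-injection difference must agree."""
    lhs = np.sign(c_i * u_i**3 - c_j * u_j**3)            # grad C_i - grad C_j
    rhs = np.sign(np.cbrt(c_i) * u_i - np.cbrt(c_j) * u_j)  # zeta_i*u_i - zeta_j*u_j
    return lhs == rhs
```

The general proof above replaces the explicit cube with the strict monotonicity of ∇C_i^{−1}(·).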