Multi-Agent Deep Reinforcement Learning for Sectional AGC Dispatch

To coordinate system economy, security and control performance in the secondary frequency regulation of the power grid, a sectional automatic generation control (AGC) dispatch framework is proposed. The sectional dispatch method divides AGC dispatch into three sections. In addition, a hierarchical multi-agent deep deterministic policy gradient (HMA-DDPG) algorithm is proposed for the framework. The algorithm takes the economy and security of the system into account during AGC dispatch while ensuring AGC control performance. Simulations on the Guangdong province power grid system and the IEEE 39 bus system compare the control effect of the sectional dispatch method with that of several other AGC dispatch methods. The results show that the sectional dispatch method achieves the best effect.


I. INTRODUCTION
Automatic generation control (AGC) is an important operation task of the interconnected power grid; it maintains the system frequency and the tie-line exchange power at their expected values [1]. Regular AGC still dispatches the total AGC generation power command of the system to each generation unit with a fixed rule, based on engineering experience or simply on unit capacity and regulation speed. Such fixed dispatch cannot satisfy the control performance standard (CPS) appraisal in a control area with a high renewable energy penetration rate and insufficient regulation resources. Moreover, in an interconnected power grid supplied mainly by traditional hydropower and coal-fired power, the short-term random fluctuation and the corresponding system power deficiency are relatively small, so the small power regulation variation of the AGC units has little effect on the security and generation cost of the power grid. For these reasons, the regular AGC dispatch methods ignore the effect on the security constraint and generation cost. However, the integration of various new energy sources and the development of ultra-high voltage (UHV) technology have made an increase in uncertain power disturbances inevitable. In particular, when a mono-polar blocking fault occurs on one or even multiple DC transmission lines, the huge load disturbance will reduce the system frequency significantly. Correspondingly, the regulation output of the AGC units will increase significantly to meet the frequency stability requirement. As a result, line overload may occur, seriously affecting system frequency stability and security.
Therefore, in an interconnected power grid with large-scale new energy and DC transmission lines, the AGC dispatch method should consider not only control performance but also the effect on power grid security and unit generation costs [2].
(The associate editor coordinating the review of this manuscript and approving it for publication was Huai-Zhi Wang.)
Some experts and scholars have already proposed improvements to address the deficiencies of the regular AGC dispatch methods [2]-[5]. The model proposed in [2] takes into account the generation cost and the regulation cost in AGC as well as the system security constraint, but it ignores control performance optimization during AGC dispatch, which results in CPS index deterioration and poor control performance. In [3], the AGC dispatch coefficient is optimized with the hierarchical Q-learning method. However, hierarchical Q-learning needs to discretize its state and action and cannot regulate the AGC generation power command continuously and smoothly. Furthermore, the algorithm ignores the effect on system security and operation economy in AGC dispatch. In [4], a dynamic optimization dispatch strategy replaces the original regular AGC dispatch to create a new AGC dynamic optimization dispatch model. It considers the power balance constraint but ignores the system security constraint and generation cost, and because its AGC dispatch interval is too long, it is difficult to ensure regulation control performance in the case of a sudden large-scale load disturbance. Reference [5] proposes a predictive control method for the interconnected power grid model that takes participation factors into account for an interconnected power system with multiple units in a single area. This method links two regulation and control frameworks with different time scales, namely economic scheduling and AGC, and considers the system power balance constraint, but it ignores the system security power flow constraint and control performance.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To sum up, the current AGC dispatch methods have the following deficiencies. First, AGC control performance, system operation economy and the security constraint are not considered simultaneously. In the electricity market environment, the power grid scheduling center tries to ensure operation economy, which causes the power grid to operate close to the boundary of the system security constraint range. In this situation, adjusting the AGC unit output may cause the power grid to operate beyond the security constraint range. Second, it is impossible to fully achieve AGC control performance, system operation economy and the network security constraint at the same time, so the preference among them must be determined based on needs.
In response to the above-mentioned problems, a sectional AGC dispatch model that simultaneously considers AGC control performance, the regulation cost, and system economy and security is proposed in this paper. Under the guidance of the CPS appraisal rule in China [1], the AGC dispatch problem is divided into three sections according to different CPS1 indices. This ensures fast recovery of the system frequency, allows the system generation cost to be quickly restored to the optimal state of the current system, and takes the system security constraint into account during AGC dispatch. To solve this model, the HMA-DDPG is proposed for higher solution efficiency and solution quality.
The main motivations and novelties of this paper are as follows:
1) Previous studies of AGC dispatch do not consider the coordination of AGC control performance, generation cost and security constraints, especially for collaborative optimization in a power grid with large-scale new energy and distributed energy. To fill this gap, a sectional AGC dispatch method is proposed to solve the coordination problem when the real-time total power regulation command is dispatched to all the controllable regulation resources.
2) The HMA-DDPG proposed in this paper uses a hierarchical framework and multi-agent optimization, which makes the strategy more effective. It has the advantages of fast convergence, a lower tendency to fall into a local optimum, and continuous AGC dispatch for multiple objects.

II. MULTI-AGENT DEEP DETERMINISTIC POLICY GRADIENT (MADDPG)
In 2017, the MADDPG was proposed in [6]. Its basic framework is as follows.

A. BASIC ASSUMPTIONS
Assume that there are N agents in environment E, and each agent has its own strategy. The total strategy set of the N agents is $\pi = \{\pi_1, \pi_2, \ldots, \pi_N\}$. A neural network is used to represent each strategy, with the parameter set $\theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$. The environment system satisfies the following assumptions [6]:
Assumption 1: each agent's strategy depends only on the state it observes and has nothing to do with the states that other agents observe, that is, $a_i = \pi_i(o_i)$.
Assumption 2: in an unknown environment, the reward value of each agent and the next state are unpredictable after an action is taken. The reward comes from the feedback from the environment, and its own action only depends on the strategy.
Assumption 3: during training, the agents do not communicate with each other, or the communication content is only a component of their respective observation.
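Assumption 1 can be illustrated with a minimal sketch in which every agent maps only its own observation to an action. The linear-tanh policy, the `make_policy` helper and the weight values are hypothetical stand-ins for the neural networks $\pi_i$ with parameters $\theta_i$; they are not taken from [6].

```python
import math
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], float]


def make_policy(weights: Sequence[float]) -> Policy:
    """Hypothetical stand-in for a neural network pi_i with parameters theta_i."""
    def policy(observation: Sequence[float]) -> float:
        # Deterministic action a_i = pi_i(o_i), squashed into (-1, 1).
        return math.tanh(sum(w * o for w, o in zip(weights, observation)))
    return policy


def act(policies: List[Policy], observations: List[Sequence[float]]) -> List[float]:
    """Each agent acts on its own observation only (Assumption 1)."""
    return [pi(o) for pi, o in zip(policies, observations)]


# Illustrative weights and observations (assumed values).
policies = [make_policy([0.5, -0.2]), make_policy([0.1, 0.3])]
observations = [[1.0, 2.0], [0.0, 0.0]]
actions = act(policies, observations)
```

Changing the observation of one agent leaves the actions of the others unchanged, which is the decentralized-execution property the assumption describes.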

B. MADDPG TRAINING METHOD
The training framework of the MADDPG is shown in Figure 1. Each agent in the environment consists of an actor network, a critic network, a target actor network and a target critic network. For ease of illustration, agent i is used as an example in Figure 1, and the other agents are represented by squares.
Centralized training with decentralized execution is adopted. That is, each agent obtains the action to be performed in the current state from its own strategy, $a_i^j = \pi_i^j(o_i^j)$, interacts with the environment to obtain an experience sample, and then stores the sample in its own experience buffer pool. After all agents have interacted with the environment, each agent randomly samples experience from its experience pool to train its own neural networks. To accelerate the learning process of an agent, the critic network input must include the observed states of the other agents and the actions taken by them, that is, $Q = Q(s_j, a_1, a_2, \ldots, a_n; \theta^Q)$, where $s_j = (o_1, o_2, \ldots, o_n)$. The critic network parameter is updated by minimizing the loss, which is calculated as in [6]:

$$L(\theta_i^Q) = \mathbb{E}\left[\left(Q_i(s_j, a_1, \ldots, a_n) - y\right)^2\right], \qquad y = r_i + \gamma\, Q_i'(s_j', a_1', \ldots, a_n')\Big|_{a_k' = \pi_k'(o_k')} \qquad (1)$$

where $r_i$ is the reward of agent i, $\gamma$ is the discount factor, and primed quantities refer to the target networks and the next state. Afterwards, the parameter of the actor network is updated through the policy gradient, which is calculated as in [6]:

$$\nabla_{\theta_i} J = \mathbb{E}\left[\nabla_{\theta_i} \pi_i(o_i)\, \nabla_{a_i} Q_i(s_j, a_1, \ldots, a_n)\Big|_{a_i = \pi_i(o_i)}\right] \qquad (2)$$

C. SECTIONAL AGC DISPATCH ALGORITHM-HMA-DDPG
In the MADDPG, the agents are independent of each other. Under Assumption 3, the agents neither communicate with each other nor have hierarchical relations. The author found in simulation that, when the critic of each agent simply receives extra information (such as the actions of other agents) for centralized training while the agents' reward functions are uncorrelated and do not cover global information, the algorithm is easily trapped in a local optimum. Besides, the traditional MADDPG has difficulty converging when there are too many agents, and its strategy estimation method may result in overestimation [7]: that is, an agent will produce a strong strategy against a competing agent through over-fitting.
However, such a strong strategy is very fragile, as it is difficult for the strategy to adapt when the opponent updates its own strategy. To address the above-mentioned problems, the author proposes HMA-DDPG based on the idea of hierarchical reinforcement learning, and its framework is shown in Table 1. This method has the following characteristics:
1) It changes the original distributed optimum-seeking method of the MADDPG into a centralized-distributed optimum-seeking method through a hierarchical method, and adopts centralized training as well as decentralized execution. During training, the critic of each agent can observe the actions of all other agents; during testing, each agent's actor only needs to observe the local state to make a decision.
2) During training, each agent's convergence is more directional. As the hierarchical method is used, the agents on the lowest layer converge first, and agents of all layers converge in order from the bottom layer to the top layer.
3) Compared with the hierarchical reinforcement learning method, this method adopts the centralized training and decentralized execution of the MADDPG and can observe global information during centralized training, which is more conducive to global optimum seeking and easier to obtain an optimal solution.
4) The hierarchical method decomposes the original task into many sub-tasks, and the state and action spaces of each sub-task are greatly reduced. Compared with the single-agent DDPG method, the risk of the curse of dimensionality is reduced, which favors algorithm convergence.
5) Hierarchical learning is adopted to force the final result of the problem to be related to each agent's decision, and therefore the cooperation and game relationship among agents is strengthened.

The phasor measurement unit (PMU) in the wide area measurement system (WAMS) uses the global positioning system pulse as its synchronous clock.
The PMU can directly measure the generator power angle, generator outlet frequency, active power, voltage and the power flow of important buses of the converter station, synchronously collecting data from each bus in the power grid.
It can provide dynamic power grid data with millisecond-level precision for the scheduling main station. Therefore, to obtain more comprehensive measurement information about the power grid, the author uses the synchronous PMU to measure the voltage, current, power, power grid frequency, real-time power flow state, primary frequency regulation variation of units and other information of the power grid buses for the AGC to use. As a result, frequency stability, the dynamic power flow of the current power grid and the satisfaction of the real-time security constraint can all be considered at the same time during AGC dispatch. Figure 2 shows a sectional AGC dispatch system. In each AGC control interval, the power grid scheduling center obtains the real-time CPS index value of the current moment, the power plant generation plan and other historical values from the SCADA database of the energy management system [8]. At the same time, the synchronous PMU in the WAMS database measures the real-time bus voltage and current, the frequency, the primary regulation output of each unit and the real-time power flow state, and sends them to the sectional AGC dispatch system. The controller of the sectional AGC dispatch system is a regular PI controller, which calculates the total AGC generation power command $P_{\text{order-}\Sigma}$. The power grid scheduling center then dispatches the total AGC generation power command to each AGC unit through the related generation power command optimization algorithm and calculates the regulation output of each AGC unit, $P_{\text{order-}n}$. The AGC generation power command of each unit is sent to the generation control system of each power plant through an information transmission system [9], [10]. The WAMS collects the regulation output of each unit and the operation information of other units and then sends them to the sectional AGC dispatch system of the scheduling center. The control cycle is 8 s.

2) SECTIONAL AGC DISPATCH SYSTEM
The AGC dispatch algorithm provides the optimal dispatch strategy and outputs sets of continuous data, which are the participation factors allocated to the n units. The AGC generation power command of each of the n units is the product of the total AGC generation power command output by the PI controller and the corresponding participation factor. In Figure 2, $P_{\text{order-}\Sigma}$ is the total AGC power regulation command of the scheduling center, $P_{\text{order-}i}$ is the AGC generation power command of the ith unit, and $a_i$ is the participation factor of the ith unit. They satisfy $P_{\text{order-}i} = a_i \cdot P_{\text{order-}\Sigma}$. To satisfy the power balance constraint, the participation factors satisfy the constraint of formula (3), that is, they sum to 1: $\sum_{i=1}^{n} a_i = 1$.
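The relation between the total command and the per-unit commands can be sketched as follows. The function name `dispatch_commands`, the tolerance and the numerical values are illustrative assumptions, and the sketch takes the power balance constraint to mean that the participation factors sum to 1.

```python
from typing import List, Sequence


def dispatch_commands(total_command_mw: float,
                      factors: Sequence[float],
                      tol: float = 1e-9) -> List[float]:
    """Split the total AGC command among units: P_order-i = a_i * P_order-total."""
    # Power balance: the participation factors must sum to 1 (assumed reading
    # of the constraint), so the unit commands add up to the total command.
    if abs(sum(factors) - 1.0) > tol:
        raise ValueError("participation factors must sum to 1")
    return [a * total_command_mw for a in factors]


# Illustrative values: a 120 MW total command shared by three units.
commands = dispatch_commands(120.0, [0.5, 0.3, 0.2])
```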

B. SECTION
To solve the problem of integrated control of power flow and frequency in the interconnected power grid with large-scale new energy, an AGC optimization model for sectional dynamic allocation of generation power commands is proposed in this paper. This model considers not only the power deficiency of the power grid, the ramp rate constraint for units [11] and the frequency quality constraint [12] that are considered in regular AGC systems, but also the unit generation cost [13] and the security constraint of regular optimal power flow models that are considered in economic scheduling [14]. The model adopts the power flow formula that considers static characteristics to represent the relationship between the power/frequency change [15] and the power flow as well as the security constraint [16], [17]. The sectional AGC dispatch process is divided into three parts according to the CPS1 instantaneous value $C_{\text{CPS1}}$ and the average value of CPS1 within one minute $C_{\text{1minCPS1}}$. That is, a dispatch method of three sections is adopted, as shown in Figure 3.

1) FIRST SECTION-FREQUENCY STABILITY CONTROL SECTION
The judgment standard is the CPS1 instantaneous value [18]: $C_{\text{CPS1}} < C_M$, with $C_M$ being a constant smaller than 200%. When the area meets this standard, AGC dispatch uses the method for the first section.
If the CPS1 instantaneous value satisfies $C_{\text{CPS1}} \geq 180\%$, the CPS appraisal in this period is excellent [19]. As load prediction technology is mature, the author suggests setting $C_M = 180\%$. The main considerations of control in the first section are the CPS index and the regulation cost. The objective function is the weighted linear sum of the two, and its cumulative value is minimized. The purpose of this section is to quickly recover the frequency and the CPS index to a normal level. The objective function is

$$\min E = \sum_{t=1}^{T} \sum_{n=1}^{N} \left[ \mu_1 P_{\text{error-}n}^2(t) + \mu_2 C_n(t) \right]$$

subject to the power balance, ramp rate and regulation capacity constraints on the quantities defined below,
where t is the discrete time; $P_{\text{error-}n}$ is the difference between the unit generation power command received by the nth unit and this unit's actual output (hereinafter referred to as the control error) (MW); $C_n$ is the AGC regulation cost of the nth unit ($); E is the cumulative value of the square of the control error $P_{\text{error-}n}$ and the regulation cost $C_n$ of each unit within the time period T; N is the total number of units; $\mu_1$ and $\mu_2$ are constants that reconcile the different dimensions of the optimization targets in the multi-target optimization problem and are the weights of the square of the control error and of the regulation cost in the control target, respectively; $P_{\text{order-}\Sigma}$ is the total AGC system generation power command value (MW); $P_{\text{order-}n}$ is the AGC generation power command dispatched to the nth unit (MW); $UR_n$ and $DR_n$ are the ramp rate upper limit and lower limit of the nth unit (MW); $P_{Gn}$ is the actual regulation output of the nth unit (MW); $P_{Gn}^{\max}$ and $P_{Gn}^{\min}$ are the regulation capacity upper and lower limits of the nth unit, respectively (MW).
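The first-section objective described above, a weighted cumulative sum of squared control errors and regulation costs over T steps and N units, can be sketched as below. The function name, the nested-list data layout and the sample values are assumptions for illustration; the constraints of the model are not enforced here.

```python
from typing import Sequence


def first_section_objective(control_errors_mw: Sequence[Sequence[float]],
                            regulation_costs: Sequence[Sequence[float]],
                            mu1: float,
                            mu2: float) -> float:
    """Cumulative value E over T time steps (outer index) and N units (inner index)."""
    total = 0.0
    for errors_t, costs_t in zip(control_errors_mw, regulation_costs):
        for err, cost in zip(errors_t, costs_t):
            # mu1 weights the squared control error, mu2 the regulation cost.
            total += mu1 * err ** 2 + mu2 * cost
    return total


# Illustrative data: T = 2 time steps, N = 2 units (assumed values).
errors = [[1.0, 2.0], [0.0, 3.0]]
costs = [[10.0, 20.0], [30.0, 40.0]]
objective = first_section_objective(errors, costs, mu1=1.0, mu2=0.5)
```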

2) SECOND SECTION-TRANSITION CONTROL SECTION
The judgment standard is $C_M \leq C_{\text{CPS1}} \leq C_Q$ and $C_{\text{1minCPS1}} < C_Q$, where $C_{\text{CPS1}}$ is the instantaneous value of CPS1, $C_M = 180\%$, $C_Q = 195\%$, and $C_{\text{1minCPS1}}$ is the average CPS1 value within one minute. Areas that meet this standard use the generation power command dispatch method for the second section. This section is set mainly to prevent a direct change from the first-section dispatch method to the third-section dispatch method. Such a change would alter the generation power commands abruptly and therefore cause a sudden change of the CPS index and generator output, which would affect frequency stability and the system CPS appraisal index. With the transitional second-section dispatch method set between the first-section and third-section dispatch methods, the objective function needs to simultaneously consider the generation cost, the CPS index and the regulation cost, with the generation cost as the main consideration. Besides, the system power flow constraint needs to be considered at the same time.
where $C_{nG}(t)$ is the generation cost of the nth unit at moment t; $P_{Gi}(t)$ and $Q_{Gi}(t)$ are the active and reactive power output by the ith generator, the former being the sum of the base point power of the ith generator and its AGC regulation power; $P_{Di}(t)$ and $Q_{Di}(t)$ are the active and reactive loads of the ith generator at moment t; $P_i(t)$ and $Q_i(t)$ are the injected active and reactive power of the ith generator at moment t; $G_{ij}$ and $B_{ij}$ are the real part and imaginary part of the element in the ith row and jth column of the system bus admittance matrix, respectively; $e_i(t)$ and $f_i(t)$ are the real part and imaginary part of the voltage of bus i at moment t, respectively; $a_i$, $b_i$ and $c_i$ are the cost coefficients of the ith unit; $S_{ij}(t)$ is the apparent power transmitted on the line between buses i and j at moment t. The subscripts "max" and "min" represent the upper and lower limits of the corresponding variables in this paper. As AGC control is a dynamic process, it is difficult to satisfy the last two constraints of formula (7) within the first few control cycles after entering the section. However, the two constraints will gradually be satisfied as AGC control proceeds. Therefore, the two constraints can be treated as a control target.
In the model of this paper, the generation units are divided into two categories: regular generation units that only participate in primary frequency regulation, and AGC units that participate in both primary and secondary frequency regulation. For both categories, the generation power is related to the system frequency through a static characteristic, in which $f(t)$ is the current frequency of the system; $f_N$ is the system's rated frequency; $P_{Gi0}(t)$ is the base point of the ith unit at moment t; and $K_{Gi}$ is the active-frequency static characteristic coefficient of the ith generator. If the static frequency characteristic is also considered for the loads, the active and reactive loads of each node are expressed as in [14], where $f_N$ is the rated frequency, 50 Hz; $P_{DNi}(t)$ and $Q_{DNi}(t)$ are the active load and reactive load of bus i under the rated voltage and frequency at moment t, respectively; and $K_{Pfi}$ and $K_{Qfi}$ are the static frequency characteristic parameters of the load model.

3) THIRD SECTION-OPTIMAL POWER FLOW SECTION
With $C_Q = 195\%$ as the judgment threshold, this section considers only the generation cost as the objective function. The constraints are the security constraints, including the equality constraints and the inequality constraints in formula (7).
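The three judgment standards can be summarized in a small selector. This is a sketch under stated assumptions: the function name is hypothetical, the thresholds are the values suggested in the text ($C_M = 180\%$, $C_Q = 195\%$), and the third section is treated as the remaining case once the first-section and second-section conditions fail.

```python
C_M = 180.0  # first-section threshold on the instantaneous CPS1 value (%)
C_Q = 195.0  # second/third-section threshold (%)


def select_section(c_cps1: float, c_1min_cps1: float) -> int:
    """Return 1, 2 or 3 for the dispatch section that applies."""
    if c_cps1 < C_M:
        # Frequency stability control section.
        return 1
    if c_cps1 <= C_Q and c_1min_cps1 < C_Q:
        # Transition control section.
        return 2
    # Assumption: every remaining case is handled by the optimal power flow section.
    return 3


sections = [select_section(150.0, 170.0),
            select_section(190.0, 185.0),
            select_section(198.0, 197.0)]
```

Keeping the second section between the other two reflects the stated purpose of avoiding an abrupt switch from the first-section method to the third-section method.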

IV. SETTINGS FOR SECTIONAL AGC DISPATCH ALGORITHM
As the HMA-DDPG responds quickly and does not easily fall into a local optimum in online operation after pre-training, it is used in the dispatch processes of the first section and the second section, which require fast response. In the dispatch process of the third section, after the frequency has stabilized, the security constraint and the inequality constraints of the power grid need to be strictly satisfied. Considering the continuous non-linear characteristic of the optimization model [10], the primal-dual interior point method [20], which has good convergence, is selected to find the solution.

A. FIRST SECTION DISPATCH AND SECOND SECTION DISPATCH ALGORITHM-HMA-DDPG
The HMA-DDPG is used in first-section and second-section dispatch. First, the units with little difference in the secondary frequency regulation delay time are grouped into categories. Afterwards, the categorized unit groups are further divided into several layers of unit sets based on other regulation characteristics of the units. Each set corresponds to one agent i. The hierarchical dispatch method is shown in Figure 4.

1) ACTION SPACE
To satisfy the constraint of formula (3), during each AGC dispatch process, for a certain agent i with n allocated units, the AGC dispatch algorithm only needs to output the participation factors for n-1 units [18]. The participation factor for the nth unit is $a_n = 1 - \sum_{k=1}^{n-1} a_k$. The nth unit defined in this paper is the balance unit, and the unit with the greatest adjustable capacity is selected as the balance unit. For any agent i at moment t, the participation factors of the first n-1 units are the agent's actions, n-1 in total, as shown in formula (14).

2) STATE SPACE
The state of the highest-layer agents is similar to that of the sub-agents. As the HMA-DDPG is used for first-section and second-section dispatch, the CPS index $C_{\text{CPS1}}$ in the SCADA and the total AGC generation power command dispatched by the scheduling center must be observed in the state space, which must also include power grid dynamic data from the WAMS. For a grid topology with N buses and M units (including X new energy units), the state space has (7+M) dimensions. The actor state input of the agent corresponding to the nth category unit group therefore includes: the generation power command value allocated to the nth unit group by the previous-layer algorithm, $P_{\text{order-}n}$; the system CPS index, $C_{\text{CPS1}}$; the actual active output value of the nth category of unit groups, $P_{G\text{-}n}$; the power difference of the nth category of unit groups, $P_{\text{error-}n}(t)$; the power grid load value, $L_A$; the real-time active output of each unit in the power grid, $(P_{G1}, P_{G2}, \ldots, P_{Gn})$; and the overload index of the power grid lines, $\text{Line}_{\text{out}}$, which equals M when any line in the power grid is overloaded and 0 otherwise:

$$\text{Line}_{\text{out}} = \begin{cases} M, & \text{there is line overload} \\ 0, & \text{there is no line overload} \end{cases} \qquad (15)$$

The critic state input during centralized training also includes the actions of all agents.
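A minimal sketch of the overload index of formula (15) and of the actor state assembly is given below. The function names, the ordering of the state entries and the use of apparent-power limits to detect an overload are assumptions made for illustration.

```python
from typing import List, Sequence


def line_overload_index(line_flows_mva: Sequence[float],
                        line_limits_mva: Sequence[float],
                        num_units: int) -> int:
    """Formula (15): M (the number of units) if any line is overloaded, else 0."""
    overloaded = any(abs(s) > s_max
                     for s, s_max in zip(line_flows_mva, line_limits_mva))
    return num_units if overloaded else 0


def build_actor_state(p_order_n: float, c_cps1: float, p_g_n: float,
                      p_error_n: float, load: float,
                      unit_outputs: Sequence[float], line_out: int) -> List[float]:
    """Concatenate the observed quantities into one actor input vector (assumed order)."""
    return [p_order_n, c_cps1, p_g_n, p_error_n, load,
            *unit_outputs, float(line_out)]


# Illustrative values: the second line exceeds its limit, M = 3 units.
line_out = line_overload_index([80.0, 120.0], [100.0, 100.0], num_units=3)
state = build_actor_state(50.0, 185.0, 45.0, 5.0, 900.0,
                          [300.0, 350.0, 250.0], line_out)
```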

3) REWARD FUNCTION
Based on the requirements of first-section and second-section dispatch, it is necessary to ensure continuous and smooth change of the generation output and of the action of each agent. The reward function is divided into two parts based on the CPS1 instantaneous value. When $C_{\text{CPS1}} < 180\%$, the learning objective of the first layer and the sub-layers is to control the total power deviation and the regulation cost of this section. When $180\% \leq C_{\text{CPS1}} \leq 195\%$ and $C_{\text{1minCPS1}} < 195\%$, the learning task of the first layer and the sub-layers is to control the total power deviation and regulation cost, the generation cost and the system security constraint.
The highest-layer agent reward function is built from the following terms: $P_{\text{error-}h}(t)$ is the difference between the CPS command and the total unit output (MW); W is a positive constant term. To ensure that the algorithm finds the gradient, the positive constant term and $\mu_3 C_{hG}(t)$ are added to the reward function when the CPS enters the second-section and third-section dispatch state; $C_{h\text{-}\Sigma}(t)$ is the total regulation cost ($); $\mu_1$, $\mu_2$ and $\mu_3$ are weight coefficients. $\mu_3$ shall be greater than $\mu_1$ and $\mu_2$ in order to give more emphasis to generation cost reduction during the second-section dispatch process.
The sub-agent reward function is built from the following terms: $P_{\text{error-}n}(t)$ is the control error of the nth category unit group (MW); $C_{n\text{-}\Sigma}(t)$ is the sum of the regulation costs of all units of the nth category unit group ($); $\mu_1$, $\mu_2$ and $\mu_3$ are constants, which are the weight coefficients of the control error, the regulation cost and the generation cost, respectively; $P_{\text{order-}n}(t)$ is the generation power command value dispatched to the nth category unit group of this layer (MW); $P_{Gn}(t)$ is the actual output of the nth category unit group (MW).
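One way the two-part reward could be assembled from the terms listed above is sketched here. The exact functional form is not given in this text, so the signs, the squared error term, the function name and the default threshold are all assumptions: costs and errors are penalized, and the positive constant W and the generation cost term appear only once $C_{\text{CPS1}}$ reaches 180%.

```python
def highest_layer_reward(c_cps1: float,
                         p_error_mw: float,
                         regulation_cost: float,
                         generation_cost: float,
                         mu1: float, mu2: float, mu3: float,
                         w: float,
                         c_m: float = 180.0) -> float:
    """Assumed reward shape: negative weighted cost, switched by the CPS1 value."""
    # Part shared by both regimes: control error and regulation cost.
    penalty = mu1 * p_error_mw ** 2 + mu2 * regulation_cost
    if c_cps1 < c_m:
        # First section: only control error and regulation cost matter.
        return -penalty
    # Second section: add the constant W and penalize the generation cost.
    return w - penalty - mu3 * generation_cost


# Illustrative values (assumed), with mu3 larger than mu1 and mu2.
r_first = highest_layer_reward(170.0, 10.0, 200.0, 5000.0, 1e-3, 1e-3, 1e-2, 100.0)
r_second = highest_layer_reward(190.0, 10.0, 200.0, 5000.0, 1e-3, 1e-3, 1e-2, 100.0)
```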

B. THIRD-SECTION DISPATCH ALGORITHM-PRIMAL-DUAL INTERIOR POINT METHOD
The algorithm adopts the general primal-dual interior point method. The basic principle for finding the optimal solution is as follows: introduce slack variables to convert the inequality constraints into equality constraints; introduce the perturbation factor and the penalty term to change the original optimization problem into a new optimization problem; use the Kuhn-Tucker conditions to obtain a series of non-linear equations; finally, use the Newton-Raphson method to solve the non-linear equations and judge convergence through the duality gap.

Table 5. $T_s$ is the secondary frequency regulation delay time, and payment refers to the AGC regulation cost.

1) REWARD FUNCTION
This simulation uses only the first-section dispatch method and does not consider the sectional dispatch method. Therefore, the state space and the reward function are set again. The principle of first-section dispatch is mainly to ensure the CPS index, with fast regulation of the units and economical regulation also considered. That is, the difference between the total unit output and the generation power command of the scheduling end is considered first, before the AGC regulation cost is considered. The reward function of each agent is shown in formula (19), where $\mu_1 = 10^{-6}$ and $\mu_2 = 10^{-7}$.

2) ACTION SPACE
As the HMA-DDPG is adopted, the action is set to be the participation factor of the current agent for the next layer. Based on formula (14), the sum of all participation factors needs to be 1. When an action space with two or more dimensions is used, a penalty function has to be added to the reward function to satisfy the constraint of (14): when the sum of the participation factors is not 1, a large positive number is subtracted. The author found that this method seriously affects the algorithm's optimum seeking and therefore causes a local optimum, difficulty in convergence, increased training cost and other problems. Therefore, every two units or unit groups are set as one agent, and each agent has only one action: the participation factor of one unit or one category of units is $a_{ij}$, and the participation factor of the other unit or unit group is $1 - a_{ij}$. In this way, the penalty term can be avoided. The specific hierarchical method is shown in Figure 6. The action space is $a_{ij}$.
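The pairwise split described above can be sketched as a recursive binary tree in which each agent passes the share $a_{ij}$ of its command to one child and $1 - a_{ij}$ to the other, so the unit commands always add up to the total without a penalty term. The tree layout, unit names and factor values below are hypothetical and do not reproduce Figure 6.

```python
from typing import Dict, Tuple, Union

# A node is either a unit name (leaf) or (a, first_child, second_child).
Node = Union[str, Tuple[float, "Node", "Node"]]


def split_command(node: Node, command_mw: float,
                  out: Dict[str, float]) -> Dict[str, float]:
    """Propagate a command down the hierarchy; each agent has a single action a."""
    if isinstance(node, str):
        out[node] = command_mw
        return out
    a, first, second = node
    if not 0.0 <= a <= 1.0:
        raise ValueError("participation factor must lie in [0, 1]")
    split_command(first, a * command_mw, out)
    # The complementary share 1 - a keeps the sum of commands equal to the input.
    split_command(second, (1.0 - a) * command_mw, out)
    return out


# Hypothetical two-layer hierarchy: the top agent splits between a unit group
# (itself an agent over units G1 and G2) and the single unit G3.
tree: Node = (0.6, (0.5, "G1", "G2"), "G3")
unit_commands = split_command(tree, 100.0, {})
```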

3) STATE SPACE
The state space is as follows:
1. $P_{\text{order-}n}$, the generation power command of the nth category unit group of this layer;
2. $P_{Gn1}$, the actual output value of the first unit of the nth category unit group;
3. $P_{Gn2}$, the actual output value of the second unit of the nth category unit group;
4. $P_{\text{error-}n}$, the control error of the nth category unit group.
The state of the nth category unit group is therefore:
$$[\, P_{\text{order-}n} \;\; P_{Gn1} \;\; P_{Gn2} \;\; P_{\text{error-}n} \,]$$
During training, the input state of the critic also includes the participation factors of all agents and the difference values between them and 1:
$$[\, a_{11},\; 1-a_{11},\; a_{21},\; 1-a_{21},\; a_{32},\; 1-a_{32},\; a_{42},\; 1-a_{42},\; a_{52},\; 1-a_{52},\; a_{62},\; 1-a_{62},\; a_{72},\; 1-a_{72} \,]$$
The hyperparameter settings of each agent are shown in Table 2.

4) PRE-LEARNING AND ONLINE TEST
In the pre-learning stage, a continuous step load disturbance with a cycle of 1800 s and an amplitude of ±760 MW is applied to area A (Guangdong power grid). After the algorithm has completed pre-learning with enough iterations, the HMA-DDPG is applied online. During online operation, the load disturbance is a random step disturbance with an amplitude of no more than 760 MW and a cycle of 1800 s. To verify the algorithm's superiority, six AGC generation power command dispatch algorithms with different principles are compared. The simulation time is 86400 s. Figure 7 shows the CPS1 change in the first 3200 s. Based on the change curve within the range of 0 s to 400 s, the CPS1 instantaneous value of the HMA-DDPG algorithm has already reached 199.82% at 99 s, while those of the other algorithms are 196.05%, 197.96%, 197.93%, 197.42% and 198.02%, respectively. At the same moment, the CPS1 of HMA-DDPG is higher, and during the stable restoration process the CPS1 values of HMA-DDPG are all better than those of the other algorithms. Based on the curve between 2700 s and 3200 s (the small diagram on the right of ... [22]: 145.03%, DDPG: 146.08%. HMA-DDPG could respond to frequency change more quickly, resulting in a smaller minimal CPS1 value than those of the other algorithms and ensuring the stability of the system frequency.
In Table 3, $|\Delta f|$, $C_{\text{CPS1}}$ and $|E_{ACE}|$ are the average value of the absolute values of the frequency deviation, the average value of the CPS1 index and the average value of the ACE absolute values, respectively. The payment is the total regulation cost. It can be concluded that the control performance of HMA-DDPG is better than that of the other five algorithms, and its regulation cost is lower than those of the other five algorithms. Therefore, HMA-DDPG is superior.

B. LOAD FREQUENCY CONTROL MODEL FOR IEEE 39 BUS SYSTEM
To further demonstrate the superiority of the sectional AGC dispatch method, an IEEE 39 bus system is selected as the power grid topology of area A. The third bus is a photovoltaic generation bus, and the 21st bus is a wind turbine generation bus. The bus topology is shown in Figure 8. A random load disturbance with an amplitude of 700 MW is applied at the converter station on the 13th bus from 0s, and the specific information is shown in Figure 16 of the Appendix. The photovoltaic generation output curve and wind turbine generation output curve settings are shown in Figure 17 and Figure 18 of the Appendix. The total load increment curve is shown in Figure 19 of the Appendix. Bus 1 is the connection bus of the area tie-line. Units #7 and #8 are nuclear power units with fixed generation power and do not participate in AGC. The power grid topology of area B is not considered. Unit parameters are shown in Table 6 of the Appendix.
At 37,200s, a permanent fault occurs on the 35th line, which changes from a double-circuit line to a single-circuit line. At 39,000s, the 35th line changes back to a double-circuit line. The fault timeline of the simulation is shown in Figure 9. To verify that the calculation result is correct, three methods commonly used in engineering are used for comparison: the PROP dispatch method, the priority AGC dispatch method and the OPF dispatch method.
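The fault timeline can be expressed as a time-dependent power limit for line 35. The sketch below assumes, as stated in the discussion of the results, that the limit is halved while only one circuit is in service. The rated limit of 1000 MW is a hypothetical value, and the restoration time is left as a parameter.

```python
FAULT_START_S = 37_200  # double-circuit to single-circuit (from the text)
FAULT_END_S = 39_000    # back to double-circuit (value given in this subsection)


def line35_limit(t_s: float, rated_limit_mw: float,
                 start_s: float = FAULT_START_S,
                 end_s: float = FAULT_END_S) -> float:
    """Power limit of line 35 at time t_s; halved while one circuit is out."""
    if start_s <= t_s < end_s:
        return 0.5 * rated_limit_mw
    return rated_limit_mw


def is_overloaded(flow_mw: float, t_s: float, rated_limit_mw: float) -> bool:
    """True if the absolute line flow exceeds the limit valid at time t_s."""
    return abs(flow_mw) > line35_limit(t_s, rated_limit_mw)


# Hypothetical rated limit: a 600 MW flow is secure before the fault
# but overloads the line once the limit is halved.
before = is_overloaded(600.0, 36_000, rated_limit_mw=1000.0)
during = is_overloaded(600.0, 38_000, rated_limit_mw=1000.0)
```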

1) STATE SPACE
As shown in Figure 6, the state spaces of the highest-layer agents and the sub-agents are similar.
The state of the agent of the $n$th category unit group includes: the generation power command value dispatched from the previous-layer algorithm to the $n$th category unit group of this layer, $P_{\text{order}-n}(t)$; the system's instantaneous CPS1 index, $C_{\text{CPS1}}$; the actual output value of the $n$th category unit group, $P_{G-n}(t)$; the power difference of the $n$th category unit group, $P_{\text{error}-n}(t)$; the total power grid load, $L_A$; the active output of each AGC unit, $(P_{G1}, P_{G2}, P_{G3}, P_{G4}, P_{G5}, P_{G6}, P_{G9}, P_{G10})$; the new energy unit outputs, $P_{wt}$ and $P_{pv}$; and the power grid overload index, $Line_{\text{out}}$. The state input to the actor is:

$$\left[\, P_{\text{order}-n}(t) \;\; P_{\text{error}-n}(t) \;\; C_{\text{CPS1}} \;\; P_{G-n} \;\; P_{G1} \;\; P_{G2} \;\; P_{G3} \;\; P_{G4} \;\; P_{G5} \;\; P_{G6} \;\; P_{G9} \;\; P_{G10} \;\; P_{wt} \;\; P_{pv} \;\; Line_{\text{out}} \,\right] \tag{22}$$

During training, besides formula (22), the state input to the critic also includes the participation factors of the other agents, as shown in formula (22).

2) ACTION SPACE
As shown in Figure 6, two units form one set; the participation factor of one unit is $a_{ij}$ and that of the other unit is $1 - a_{ij}$. The action of each agent $i$ is the participation factor $a_{ij}$, and the action dimension is 1. The action space is shown in formula (22).
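How a single participation factor divides a power command between the two members of a set can be shown with a short sketch. The two-layer layout, the clipping of the factor to [0, 1] and all numerical values are assumptions made for illustration and do not reproduce the hierarchy of Figure 6.

```python
from typing import Tuple


def split_command(p_order_mw: float, a: float) -> Tuple[float, float]:
    """Split a power command between two members with factors a and 1 - a."""
    a = min(max(a, 0.0), 1.0)  # assumption: the actor output is clipped to [0, 1]
    return a * p_order_mw, (1.0 - a) * p_order_mw


# Hypothetical two-layer example: the upper agent splits the total command
# between two groups, and each sub-agent splits its group command between
# two units.
total_mw = 400.0
group_a, group_b = split_command(total_mw, 0.75)
unit_1, unit_2 = split_command(group_a, 0.5)
unit_3, unit_4 = split_command(group_b, 0.25)
```

Because each pair of factors sums to 1, the unit commands always add up to the command received from the layer above, so each agent only has to learn a one-dimensional action.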

4) PRE-LEARNING
During the pre-learning stage, a continuous sinusoidal load disturbance with a cycle of 3600s, an amplitude of 900 MW and a duration of 3600s is applied to the 13th bus of area A. The phase is 0.5π. At the same time, the same disturbance is applied to the total load. Sinusoidal waves with a cycle of 3600s, an amplitude of 200 MW and a duration of 3600s are applied to the 3rd and 21st new energy generation unit buses. In each episode, the phases of the latter three types of sinusoidal waves are selected randomly to ensure sample diversity.
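One training episode of the disturbances described above can be generated as in the following sketch. It assumes that the "latter three" waves are the total load wave and the two new energy waves, that their phases are drawn uniformly from [0, 2π), and that one sample per second is sufficient. The function and key names are hypothetical.

```python
import math
import random
from typing import Dict, List


def sine_disturbance(amplitude_mw: float, phase_rad: float,
                     period_s: int = 3600, duration_s: int = 3600) -> List[float]:
    """Sampled sinusoid A * sin(2*pi*t/T + phase), one value per second."""
    return [amplitude_mw * math.sin(2.0 * math.pi * t / period_s + phase_rad)
            for t in range(duration_s)]


def sample_episode(rng: random.Random) -> Dict[str, List[float]]:
    """Build the four disturbance waves for one pre-learning episode."""
    def random_phase() -> float:
        return rng.uniform(0.0, 2.0 * math.pi)  # assumption: uniform phase

    return {
        "bus13_load": sine_disturbance(900.0, 0.5 * math.pi),  # fixed phase
        "total_load": sine_disturbance(900.0, random_phase()),
        "bus3_pv": sine_disturbance(200.0, random_phase()),
        "bus21_wind": sine_disturbance(200.0, random_phase()),
    }


episode = sample_episode(random.Random(0))
```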

5) CONVERGENCE EFFECT
The convergence effect is shown in Figure 10. The highest-layer agents are the last to converge; as can be seen from Figure 10, the average reward of the agents gradually and smoothly approaches its maximum value.

6) RESULT OF ONLINE OPERATION TEST
After training, an online simulation of 86,400s, that is, 24 hours, is carried out. The specific data are shown in Table 4. As can be seen from Figure 11, Figure 12 and Figure 13, during an instantaneous step disturbance, the CPS1 variation of the HMA-DDPG remains smaller than those of OPF and PROP and close to that of the priority method. At the same time, the peaks of the frequency deviation and the area control error (ACE) are relatively small. As the priority algorithm adopts the hydropower units, which have the fastest responses, its CPS1 peak is close to that of the HMA-DDPG, but it is not as stable as the HMA-DDPG during the recovery. Besides, the CPS1 of the HMA-DDPG is always superior to that of the priority algorithm, and while CPS1 approaches 200%, the $|E_{\text{ACE}}|$ of the HMA-DDPG is close to that of the priority algorithm and superior to those of the other algorithms. The HMA-DDPG is also superior to the priority algorithm in the recovery process. The same is true of the frequency deviation. The regulation cost of the HMA-DDPG is smaller than that of the OPF algorithm, and its overload duration is the shortest. The stability of the HMA-DDPG is better than that of the other three algorithms, as can be seen from the curves.
According to the number of overloaded lines shown in Figure 15, line overload under both the HMA-DDPG and OPF occurs when the disturbance suddenly appears, and both leave the overload state within a few control intervals. As the HMA-DDPG has a faster early-stage response than the OPF algorithm, it leaves the line overload state sooner. Based on Table 4, the proportion of the line overload duration of the HMA-DDPG is smaller than that of the OPF, while the PROP and the priority algorithm can leave several lines of the system in a serious long-term overload state. Without manual intervention, the secure and stable operation of the power grid would be seriously affected. Therefore, as far as control of the line overload state is concerned, the HMA-DDPG is superior to the other three algorithms.
Based on Figure 14 and Figure 15, line 35 changes from a double-circuit line to a single-circuit line at 37,200s, and the maximum power limit of the line is reduced by half. The units respond quickly, and there is a sudden change in the unit output at 37,200s. After only a short period of overload, the line current returns to within the constraint, while under the PROP and priority algorithms the line remains in a long-term overload state. At 40,000s, line 35 returns from a single-circuit line to a double-circuit line after successful forced transmission, and all the units return to normal. The power grid is maintained in a secure and economical operation state, and a comprehensive optimum of the control performance and the generation cost is ensured even under wind power and photovoltaic disturbance.
In Table 4, $|\Delta f|$, $C_{\text{CPS1}}$ and $|E_{\text{ACE}}|$ are the average of the absolute frequency deviation, the average of the CPS1 index and the average of the absolute ACE, respectively. The payment $C_{NG}$ is the total regulation cost, that is, the total generation cost of the eight units, and $t_o/t_n$ is the percentage of the line overload time relative to the total process time. Generally, it can be seen that the HMA-DDPG proposed in this paper has the minimal $|\Delta f|$ and $|E_{\text{ACE}}|$, the maximal $C_{\text{CPS1}}$, the lowest generation cost and the lowest overload time proportion compared with the other three algorithms. Adopting the hydropower units that have the lowest regulation cost, the priority algorithm has the lowest regulation cost. However, this algorithm uses only one hydropower unit for frequency regulation, which causes a serious overload state at the unit's outlet and on the nearby lines and therefore increases the grid loss and the unit output, which in turn increases the generation cost. As a result, the comprehensive economic benefit is reduced. Therefore, it can be concluded that the sectional AGC dispatch of the HMA-DDPG proposed in this paper ensures fast recovery of the frequency and of the system to the optimal power flow state while considering the system security constraint, thereby achieving a comprehensive optimum of control performance, economic benefit and system security.
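The averaged indices of Table 4 can be computed from simulation time series as in the sketch below. It assumes equally spaced samples of equal length and a per-sample count of overloaded lines. The function name, the key names and the sample values are hypothetical, and the cost term $C_{NG}$ is omitted because the unit cost functions are not restated here.

```python
from statistics import fmean
from typing import Dict, Sequence


def summarize(delta_f_hz: Sequence[float], cps1_pct: Sequence[float],
              ace_mw: Sequence[float], overloaded_lines: Sequence[int]) -> Dict[str, float]:
    """Average |delta f|, average CPS1, average |ACE| and overload time share."""
    n = len(overloaded_lines)
    overloaded_samples = sum(1 for k in overloaded_lines if k > 0)
    return {
        "mean_abs_delta_f": fmean([abs(x) for x in delta_f_hz]),
        "mean_cps1": fmean(cps1_pct),
        "mean_abs_ace": fmean([abs(x) for x in ace_mw]),
        # t_o / t_n: share of samples with at least one overloaded line, in %
        "overload_ratio_pct": 100.0 * overloaded_samples / n,
    }


# Hypothetical four-sample series, for illustration only.
metrics = summarize(
    delta_f_hz=[0.5, -0.5, 0.25, -0.25],
    cps1_pct=[200.0, 190.0, 180.0, 190.0],
    ace_mw=[10.0, -20.0, 30.0, -40.0],
    overloaded_lines=[0, 1, 2, 0],
)
```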

VI. CONCLUSION
In summary, the main contributions of this work are as follows:
1) The proposed sectional AGC dispatch can satisfy not only the technical and economic requirements of the power grid but also the system's security constraints. It thus addresses the problem of coordinating system economy, security and control performance during secondary frequency regulation in the power grid.
2) HMA-DDPG, the hierarchical multi-agent algorithm, is employed for AGC dispatch. It reduces the difficulty and dimensionality of the optimum search and makes the strategy more effective, thus providing a reference for optimal AGC dispatch involving a large-scale power grid and multiple units.
3) The simulation analysis indicates that the sectional AGC dispatch based on HMA-DDPG can adjust AGC unit output as the system state changes, thus ensuring a comprehensive optimum of control performance, economic benefit, security and stability for the power grid.