Virtual Generation Alliance Automatic Generation Control Based on Deep Reinforcement Learning

This article proposes a distributed hierarchical automatic generation control (AGC) framework with multiple regulation units for the performance-based frequency regulation market, named virtual generation alliance automatic generation control (VGA-AGC). The framework aims to coordinate the control algorithm with the AGC dispatch algorithm and to adapt to the development of AGC from a centralized framework toward a centralized-decentralized framework. It also introduces a multi-agent distributed multiple improved deep deterministic policy gradient (MADMI-TD3) algorithm, which is characterized by strong global search capability and fast optimization. The algorithm derives an optimal AGC strategy in a stochastic environment so as to obtain optimal cooperative control of AGC. Simulation on the load frequency control (LFC) model of a provincial interconnected power grid shows that the algorithm outperforms existing algorithms and conventional engineering methods in both control performance and economic benefit: it improves control performance and reduces the regulation mileage payment.


I. INTRODUCTION
The ever-increasing innovation in renewable energy makes the power grid more dispersed, diverse, and random [1]-[3]. A traditional AGC strategy, in which the total AGC generation power command of the system is generated by a proportional-integral (PI) controller and dispatched to the units by a proportional dispatch method [4], has difficulty dealing with strong random disturbances.
Especially when a sudden power disturbance occurs in a complicated power grid in which large-scale wind turbines have poor disturbance tolerance, the traditional AGC strategy may trigger chain reactions that cause a larger power disturbance and threaten the safety and stability of the system frequency. For example, the ''8·9'' blackout in the UK [5] and the ''9.28'' blackout in Australia [6] were both caused by wind turbines tripping off the grid, which led to a serious drop in frequency. During these accidents, the AGC could not respond in time and the frequency regulation capacity was insufficient. Thus, it is very important to improve the response speed and control performance of AGC in a complicated power grid with large-scale wind turbines.
The associate editor coordinating the review of this manuscript and approving it for publication was Bin Zhou.
The algorithms can generally be divided into two categories. The first category is the control algorithm, such as the conventional PID algorithm [7], [8], sliding mode control (SMC) [9], active disturbance rejection control (ADRC) [10], fractional-order PID (FOPID) [11], fuzzy control [8], and the reinforcement learning family, such as the Q learning algorithm [12]-[14], R(λ) learning algorithm [15], deep Q-network (DQN) [16], and double deep Q-network (DDQN) [17]. Generally speaking, these algorithms treat the entire power grid as a single area when calculating the generation command, which is then distributed proportionally to the AGC regulation units. The second category is the generation power command dispatch algorithm, such as the classical genetic algorithm (GA), quadratic programming, gray wolf optimizer (GWO) [18]-[20], proportional algorithm, particle swarm optimization (PSO) [21], moth-flame optimization (MFO) [22], whale optimization algorithm (WOA) [23], ant lion optimizer (ALO) [24], dragonfly algorithm (DA) [25], group search optimizer (GSO) [26], chicken swarm optimization (CSO) [27], sine cosine algorithm (SCA) [28], etc. [28], [29]. The classical PID control algorithm is generally adopted as the control algorithm, and the dispatch algorithm is used to dispatch the total power regulation command of AGC to each AGC unit with the aim of minimizing the regulation payment. Separating the two types of algorithms has certain advantages; for example, the control algorithm and the dispatch algorithm can be designed separately. However, the two types of algorithms also have a cooperation problem: the control algorithm aims to minimize the frequency control deviation, whereas the dispatch algorithm aims to minimize the regulation payment. Coordinating the two types can reduce both the frequency deviation and the regulation payment, thus improving the control performance and lowering the regulation payment of AGC.
The methods mentioned above are all based on the centralized control framework, which requires collecting real-time operation data from all units and therefore transmitting a large amount of information. As the number of units grows, the convergence time of these methods increases considerably, and it may become difficult to meet the real-time control requirements of AGC [28].
The performance-based frequency regulation market (hereinafter referred to as the ''frequency regulation market'') [30] was proposed in Order No. 755, issued by the Federal Energy Regulatory Commission (FERC) in 2011, with the aim of encouraging more fast-response regulation units, such as wind turbine units, photovoltaic generation units, and flexible loads, to participate in secondary frequency regulation. Under the new frequency regulation market mechanism, the AGC regulation payment changes from a simple fixed payment per unit of regulation output to a dynamic compensation payment determined by the comprehensive frequency regulation performance index, the frequency regulation mileage, and the frequency regulation mileage quotation. The original combination of control algorithm and dispatch algorithm in the AGC scheduling framework (hereinafter referred to as the conventional combined algorithm) is not suited to the new frequency regulation market mechanism, and the problem of coordinating the control algorithm with the dispatch algorithm has become more serious.
This article designs a virtual generation alliance automatic generation control (VGA-AGC) framework with various units, including distributed energy units and flexible load units, in order to solve the coordination problem between the control algorithm and the dispatch algorithm in performance-based frequency regulation, to enable AGC to control more frequency regulation units, and to adapt to the development of AGC from a centralized framework toward a centralized-decentralized framework. The VGA-AGC framework also coordinates the AGC control algorithm with the generation power command dispatch algorithm. The VGA-AGC framework, which is based on the MADMI-TD3 algorithm, has the following characteristics.
1) The proposed MADMI-TD3 algorithm employs multiple actor networks and critic networks with different parameters for distributed optimization. Techniques such as classified experience replay, variable noise models, and warm boot of the experience pool are used to improve the global search ability and optimizing speed of the algorithm. The algorithm can obtain the optimal control strategy in a strongly random environment, with random disturbances caused by large-scale distributed energy in the power grid.
2) The proposed MADMI-TD3 algorithm employs multiple actor networks and critic networks with different parameters for distributed optimization. In addition, techniques such as classified experience replay, variable noise models, and warm boot of the experience pool are used to obtain an adaptive reinforcement learning control algorithm with superior global search ability and optimizing speed. The algorithm can obtain the optimal control strategy in a strongly random environment, with random disturbances caused by the introduction of large-scale distributed energy into the power grid.
3) A simulation verification on the LFC model for an interconnected power grid of a province has shown that the algorithm proposed here is superior to the current algorithms and conventional engineering methods in terms of control performance and economic benefits: that is, the improvement of the control performance and reduction of the regulation mileage payment.

II. VIRTUAL GENERATION ALLIANCE AUTOMATIC GENERATION CONTROL (VGA-AGC)
A. CONTROL FRAMEWORK
Different from the conventional AGC system, VGA-AGC has a framework with multiple agents for generating and dispatching the total AGC generation power command. The agents that compose the virtual generation alliance cooperate with each other across all layers. The AGC control cycle is 4 s.

B. VIRTUAL GENERATION ALLIANCE
Virtual generation alliance: Professor Clerc divided the whole particle swarm into several subgroups, namely ''alliances'' [31], each consisting of several particles. Following this idea, this article divides the units into several territory groups according to their type, each called an ''alliance''. The alliance is in effect a new dispatch and control layer added between the scheduling center and the plant controller (PLC), a form of centralized-decentralized autonomy corresponding to the territory groups of units. Each territory has an administrator, the lord agent. As shown in Figure 1, the VGA-AGC framework defines four roles for the agents: king agent, general agent, lord agent, and knight (units).

1) KING AGENT
It refers to the controller for an area of the power grid. In this article, the king agent based on MADMI-TD3 replaces the conventional PI controller. During offline training, the king agent can observe the state of every agent in the environment in order to evaluate its own action, and thus adjusts its actions based on global information. Compared with a conventional controller, which makes decisions based only on the area control error (ACE), the king agent has superior robustness and coordination. The king agent is responsible for the real-time output of the total power regulation command.
VOLUME 8, 2020

2) GENERAL AGENT
It is the total dispatch agent subordinate to the king agent. It dispatches the total power regulation command issued by the king agent to the next level of agents, the lord agents. The general agent also needs to observe the state of all agents during offline training in order to evaluate and adjust its own actions and thus obtain the optimal dispatch strategy.

3) LORD AGENT
Units of different types are classified into groups called territory groups. The agent responsible for a territory group is the lord agent, which dispatches the generation power command issued by the general agent to the units in its territory group. The lord agents include the following types: the coal lord for coal-fired generation units, the gas lord for CHP and liquefied natural gas (LNG) units, the hydropower lord for hydroelectric units, the flexible load lord, and the virtual power plant (VPP) lord for various distributed energies such as wind power, photovoltaic power, and P2G.

4) KNIGHT (UNITS)
It refers to the generation units, which are responsible for outputting power under the generation commands of the lords.

C. APPLICATION PROCESS
Before VGA-AGC is applied to the power grid, which is called online testing, it must undergo sufficient offline training:

1) OFFLINE TRAINING
During offline training, in each AGC control cycle, every agent communicates with the EMS of the scheduling center and can at the same time observe all actions performed by the other agents, so that the agents understand the environment and cooperate with each other. They can therefore evaluate and make their decisions based on those of the other agents.

2) ONLINE TESTING
During online testing, the king agent only needs to obtain the frequency deviation, the ACE, and the integral values of the two, while the other agents only need to communicate with their superior agents to obtain the power regulation command, which realizes centralized-decentralized autonomous control of AGC.

D. REGULATION MILEAGE PAYMENT
According to the rules of China Southern Power Grid (CSG), the regulation mileage of each AGC regulation unit in response to the AGC generation power command is given by formula (1) [32].
In the formula for the regulation mileage of frequency regulation unit $i$, $M_i(k)$ refers to the frequency regulation mileage of the $i$th AGC unit within the $k$th control interval, and $P_i^{out}(k)$ refers to the actual regulation power output of the $i$th AGC unit within the $k$th control interval.
In the formula for the regulation mileage payment [32], $D_i$ refers to the total regulation mileage payment of the $i$th AGC unit over $N$ control intervals, $\lambda$ refers to the price of the frequency regulation mileage, $S_i^p$ is the comprehensive frequency regulation performance score of the $i$th AGC unit, and $N$ is the number of control intervals within each frequency regulation service period. For example, with an AGC control cycle of 4 s and a real-time frequency regulation market settlement cycle of 900 s, $N = 900/4 = 225$.
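As a rough illustration of how these settlement quantities fit together, the following Python sketch assumes that the mileage of a unit in one control interval is the absolute change of its regulation output across that interval, and that the payment is the sum over the intervals of price times performance score times mileage. Both forms are assumptions made for illustration rather than a restatement of formulas (1) and (2); the function names and the sample trajectory are hypothetical.

```python
from typing import List


def regulation_mileage(p_out: List[float]) -> List[float]:
    """Per-interval mileage, ASSUMED here to be |P_out(k+1) - P_out(k)|."""
    return [abs(b - a) for a, b in zip(p_out, p_out[1:])]


def mileage_payment(p_out: List[float], price: float, score: float) -> float:
    """ASSUMED payment: sum over intervals of price * score * mileage."""
    return sum(price * score * m for m in regulation_mileage(p_out))


# Number of control intervals in one settlement cycle (values from the text).
CONTROL_CYCLE_S = 4
SETTLEMENT_CYCLE_S = 900
N_INTERVALS = SETTLEMENT_CYCLE_S // CONTROL_CYCLE_S

# Hypothetical regulation output trajectory (MW) of a single unit.
outputs = [0.0, 10.0, 25.0, 20.0]
payment = mileage_payment(outputs, price=2.0, score=0.5)
```

Under this assumed form, a unit that moves back and forth accumulates mileage in both directions, which is why overregulation raises the payment.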
$f_1$ is the absolute value of the total frequency deviation, and $f_2$ is the absolute value of the total ACE. The objective function can be expressed as formula (3), where $n$ is the number of AGC units, $\Delta f(k)$ is the frequency deviation of control interval $k$, and $P_i^{out}(k+1)$ is the output of AGC unit $i$ at the beginning of control interval $k+1$.

2) GENERAL AGENT
In the objective function of the general agent in VGA-AGC, both the frequency error and the regulation mileage payment are considered, so that the optimization accounts for control performance and the frequency regulation mileage payment together. In this objective function, $P_{\text{error-G}}(k)$ is the total power control error in control interval $k$, $P_{\text{order}}(k)$ is the total AGC generation power command of control interval $k$, and $D_i(k)$ is the regulation mileage payment of unit group $i$ in control interval $k$.
Here $P_{\text{error-L}}(k)$ is the power control error of control interval $k$, $d_i(k)$ is the regulation mileage payment of unit group $i$ in control interval $k$, $P_{\text{error-L-}j}(k)$ is the total power control error of territory unit group $j$ in control interval $k$, and $P_{\text{order-L-}j}(k)$ is the total AGC generation power command of territory unit group $j$ in control interval $k$; $P_{G_j}^{n_j}(k)$ is the actual total regulation output of territory unit group $j$ at the beginning of control interval $k$, and $S_{n_j}^p$ is the comprehensive frequency regulation performance index of unit $n_j$.

F. REGULATION UNIT AND RELEVANT CONSTRAINTS

1) CONVENTIONAL REGULATION UNIT
The conventional regulation units include coal-fired units, LNG units, and hydroelectric units.
Constraints: the power balance constraints, the regulation direction constraints, the upper and lower limits of the AGC regulation capacity, and the generation ramp rate constraints are given in sequence by formulas (11) and (12), where $P_{\text{order}}(k)$ refers to the total AGC power regulation command at the beginning of the $k$th control interval, $P_i^{\max}$ refers to the upper limit of the AGC regulation power of the $i$th AGC unit, and $P_i^{\min}$ refers to the lower limit of the AGC regulation power of the $i$th AGC unit. In addition, $P_i^{rate}$ refers to the ramp rate of the $i$th AGC unit.
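The capacity, ramp rate, and regulation direction constraints listed above can be sketched as a simple feasibility check on a single unit's command. This is a minimal Python illustration under stated assumptions: the ramp rate is taken in MW/s, the control interval is 4 s, and the helper names `feasible_command` and `same_direction` are hypothetical rather than taken from the article.

```python
def feasible_command(target: float, previous: float, p_min: float,
                     p_max: float, ramp_rate: float, dt: float = 4.0) -> float:
    """Clip a unit's regulation command to its capacity and ramp-rate limits.

    ramp_rate is ASSUMED to be in MW/s and dt is the control interval in s.
    """
    max_step = ramp_rate * dt
    lower = max(p_min, previous - max_step)
    upper = min(p_max, previous + max_step)
    return min(max(target, lower), upper)


def same_direction(total_command: float, unit_command: float) -> bool:
    """Regulation direction check: a unit must not move against the total."""
    return total_command * unit_command >= 0.0


# A 50 MW request is limited by the ramp rate first, then by capacity.
ramp_limited = feasible_command(50.0, previous=0.0, p_min=-20.0,
                                p_max=30.0, ramp_rate=2.0)
capacity_limited = feasible_command(50.0, previous=28.0, p_min=-20.0,
                                    p_max=30.0, ramp_rate=2.0)
```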

2) REGULATION UNIT FOR THE CHP AND P2G
The CHP consists of a compressor, a combustion chamber and a steam turbine. In its transformation formula, $P_{CHP}$ is the electrical power of the CHP unit, $\eta_e$ is the generation efficiency, and $G_{CHP}$ is the gas consumption power.
Constraints on the feasible operating range of CHP: the relevant constraints are shown in formulas (7) and (9), where $H_i$ is the heat output of the CHP and $P_i^{\min}(H_i)$ is the lower limit of the CHP electrical power output when the heat output is $H_i$. The corresponding lower limit of the CHP's heat output is defined for the electrical power output $P_i^0 + P_i$.
P2G converts electrical energy into hydrogen or natural gas, which is easy to transport and store. Its model is described by formula (10), and it can be regarded as a fast unit. In formula (10), $G_{P2G}$ is the natural gas production of the P2G, $P_{P2G}$ is its electrical power consumption, $\eta_{P2G}$ is the conversion efficiency, and $f_{LHV}$ is the lower heating value of the natural gas.

3) REGULATION UNIT OF RENEWABLE ENERGY
The frequency regulation units of the renewable energy system include photovoltaic power and wind turbines, which are controlled by power electronic equipment; their constraints are shown in formula (7). It is assumed that the active wind turbine output is tracked and controlled based on the maximum power point. The active power output $P_g$ is calculated from the wind speed, where $V_w$ is the wind speed, $V_w^{in}$ and $V_w^{out}$ are the cut-in and cut-out wind speeds respectively, $V_w^{base}$ is the rated wind speed of the wind turbine, and $P_w^{base}$ is the rated output power of the wind turbine. $P_{pv}$ is the active photovoltaic power output; in its formula, $P_{pv}^{base}$ is the rated generated power of the photovoltaic power station, $\alpha_{pv}$ is the photovoltaic temperature conversion power factor, $T$ is the temperature at the current moment, $T_{ref}$ is the reference temperature, and $S_{pv}$ is the illumination intensity at the current moment.
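To make the role of the cut-in, rated, and cut-out wind speeds concrete, the following Python sketch implements one common piecewise power curve. The linear ramp between the cut-in and rated speeds is an assumption chosen for simplicity (cubic ramps are also widely used), so this is not necessarily the curve used in the article; the function name and the sample speeds are hypothetical.

```python
def wind_power(v: float, v_in: float, v_base: float, v_out: float,
               p_base: float) -> float:
    """Piecewise wind turbine output for wind speed v.

    ASSUMPTION: linear ramp between cut-in (v_in) and rated (v_base) speed.
    """
    if v < v_in or v >= v_out:
        return 0.0          # below cut-in or beyond cut-out: no output
    if v >= v_base:
        return p_base       # between rated and cut-out: rated output
    return p_base * (v - v_in) / (v_base - v_in)


# Hypothetical turbine: cut-in 3 m/s, rated 12 m/s, cut-out 25 m/s, 100 MW.
half_ramp = wind_power(7.5, v_in=3.0, v_base=12.0, v_out=25.0, p_base=100.0)
```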

4) FLEXIBLE LOAD REGULATION UNIT
Load aggregators are introduced according to the source-load collaborative frequency control theory of the power grid. The relevant equipment comprises three temperature-controlled devices (air conditioner, refrigerator, and electric water heater) together with energy storage and electric vehicles. Their general constraints are shown in formula (7).
General heating/cooling air conditioner load is an important part of demand response. In the mathematical expression for cooling air conditioning, $Q_t$ is the indoor heat absorbed from outdoors at time $t$; $T_t^{out}$ and $T_t^{in}$ are the outdoor and indoor temperatures at time $t$ respectively; $R$ is the thermal resistance of the house; $P_t$ is the heating power at time $t$; $C_{air}$ is the specific heat capacity of air; and $\Delta t$ is the time increment. The electrical characteristics of the refrigerator load are shown in formula (14), where $T_t^{FR}$ is the internal temperature of the refrigerator at time $t$; $s_t^{FR}$ is the on/off state of the refrigeration function; $\alpha_{FR}$ is the refrigeration coefficient of the refrigerator when the refrigeration function is on; and $\gamma_{FR}$ is the warming coefficient of the refrigerator when the refrigeration function is off.
It is assumed that after hot water is consumed, an equal amount of cold water is immediately introduced. According to the second law of thermodynamics, the water temperature can be expressed in terms of the following quantities: $T_{hw}$ is the hot water temperature at time $t$; $\rho$, $V_{tank}$, and $C_w$ are the water density, water tank volume, and specific heat capacity of water respectively; $V_t^{cold}$ is the volume of cold water introduced at time $t$; $T_{cold}$ is the temperature of the introduced cold water; and $h_t^{wh}$ is the heating power at time $t$.
The battery energy storage system (BESS) can help conventional regulation units maintain frequency stability owing to its fast response and flexible output control. For the BESS, $SOC_{im}^{\min}$ and $SOC_{im}^{\max}$ are the lower and upper limits of the SOC, $\eta_{ch}$ is the charging efficiency, $\eta_{dis}$ is the discharge efficiency, and $E_{im}$ is the rated capacity.
Electric vehicles can be regarded as a kind of large-scale energy storage facility. After a single electric vehicle is connected to the charging pile, its state of charge (SOC) at time $t$ can be expressed in terms of the charging/discharging power $P$ and the battery capacity $C_{\max}$, where the SOC is the ratio of the current battery charge to the battery capacity. To ensure the safety of the battery during charging/discharging, $P$ shall remain within the charging/discharging power limits of the battery:

$$P_{dis.max} \le P \le P_{char.max} \tag{19}$$

where $P_{dis.max}$ and $P_{char.max}$ are the maximum discharging power and the maximum charging power that the battery can withstand under safe operating conditions respectively.
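The power limit in (19) and the SOC bounds can be combined into a single storage update step. The Python sketch below is a minimal illustration, assuming that charging power is positive and discharging power is negative, that efficiencies scale the energy as shown in the comments, and that the SOC is simply clipped at its limits; the function name, default values, and units are hypothetical.

```python
def step_soc(soc: float, power: float, dt_h: float, capacity: float,
             p_dis_max: float, p_char_max: float,
             eta_ch: float = 0.95, eta_dis: float = 0.95,
             soc_min: float = 0.1, soc_max: float = 0.9) -> float:
    """Advance the state of charge by one time step.

    ASSUMPTIONS: power > 0 charges and power < 0 discharges (so p_dis_max
    is negative); power is in kW, dt_h in hours, capacity in kWh.
    """
    # Enforce P_dis.max <= P <= P_char.max as in (19).
    p = min(max(power, p_dis_max), p_char_max)
    if p >= 0.0:
        delta = p * eta_ch * dt_h / capacity      # losses reduce stored energy
    else:
        delta = p * dt_h / (eta_dis * capacity)   # losses increase the drain
    # Keep the result inside the allowed SOC band.
    return min(max(soc + delta, soc_min), soc_max)


charged = step_soc(0.5, 100.0, 1.0, 100.0, p_dis_max=-20.0,
                   p_char_max=20.0, eta_ch=1.0, eta_dis=1.0)
```

In the usage line, the 100 kW request is first limited to 20 kW, so the SOC rises from 0.5 to 0.7 rather than jumping to the upper bound.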

III. MADMI-TD3
A. DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
DDPG, an actor-critic deep reinforcement learning algorithm, incorporates deep neural networks into the deterministic policy gradient (DPG). DDPG employs only a single actor network to explore the environment, which leads to a large amount of redundancy in the information used by the agent and thus to slow parameter updates. It is also difficult to ensure the diversity of samples, and the algorithm easily falls into a local optimum.

B. MADMI-TD3 FRAMEWORK
MADMI-TD3 is a deep reinforcement learning algorithm developed from DDPG [33], [34]. In order to overcome the over-estimation of Q value [35], [36] and the low training efficiency problem [37] in DDPG, the algorithm adopted seven techniques for improvement of the stability and training efficiency.

1) CLIPPED MULTIPLE Q-LEARNING STRATEGY
MADMI-TD3 employs the clipped multiple Q-learning strategy to calculate the target value. To reduce the cost of training, each explorer uses an independent actor network and two critic networks. $\pi_{\phi_1}$ is the strategy of the actor network, which is updated based on $Q_{\theta_1}$ of the critic network; $Q_{\theta_2}$ is the value given by the other critic network; and the target values $y_t^1$ and $y_t^2$ are equal.
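For orientation, the clipped target used in the standard TD3 algorithm takes the smaller of the two target-critic estimates, which is what makes both targets equal and counteracts over-estimation. The Python sketch below shows that standard form as an assumption about what the clipped multiple Q-learning target looks like here; the function name and the discount factor value are hypothetical.

```python
def clipped_q_target(reward: float, q1_next: float, q2_next: float,
                     gamma: float = 0.99, done: bool = False) -> float:
    """Shared target y = r + gamma * min(Q1', Q2') (standard TD3 form).

    q1_next and q2_next are the two target-critic values for the next
    state and the (smoothed) target action.
    """
    if done:
        return reward  # no bootstrapping past a terminal transition
    return reward + gamma * min(q1_next, q2_next)


# The more pessimistic critic (3.0) determines the target for both critics.
y = clipped_q_target(reward=1.0, q1_next=5.0, q2_next=3.0, gamma=0.9)
```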

2) STRATEGY DELAYED UPDATING
MADMI-TD3 updates the actor network once after the critic network has been updated $d$ times, which ensures that the actor network is updated when the Q value error is low and thus improves the updating efficiency of the actor network.
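The delayed update rule amounts to a simple schedule: the critic is updated at every step and the actor only at every $d$th step. The Python sketch below lays out that schedule without any networks; the function name and the step counts are hypothetical.

```python
from typing import List, Tuple


def delayed_update_plan(total_steps: int, d: int) -> List[Tuple[int, bool]]:
    """Return (step, update_actor) pairs.

    The critic is updated on every step; the actor only on every d-th step.
    """
    if d < 1:
        raise ValueError("d must be at least 1")
    return [(step, step % d == 0) for step in range(1, total_steps + 1)]


plan = delayed_update_plan(total_steps=6, d=2)
actor_updates = sum(1 for _, update_actor in plan if update_actor)
```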

3) SMOOTH REGULARIZATION OF TARGET STRATEGY
The algorithm introduces a regularization method that reduces the variance of the target values and smooths the Q-value estimate by bootstrapping from similar actions. Smooth regularization is achieved by adding random noise to the target strategy and averaging over the mini-batch.
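In standard TD3, this smoothing is done by perturbing the target action with clipped Gaussian noise before the target critics evaluate it. The Python sketch below shows that standard form for a scalar action; treating it as the form used here is an assumption, and the function name, noise scale, and clipping range are hypothetical.

```python
import random


def smoothed_target_action(target_action: float, sigma: float,
                           noise_clip: float, a_low: float, a_high: float,
                           rng: random.Random) -> float:
    """Target action plus clipped Gaussian noise, kept inside action bounds."""
    noise = rng.gauss(0.0, sigma)
    noise = max(-noise_clip, min(noise_clip, noise))   # clip the noise itself
    return max(a_low, min(a_high, target_action + noise))


rng = random.Random(0)
samples = [smoothed_target_action(0.0, 0.2, 0.5, -1.0, 1.0, rng)
           for _ in range(5)]
```

Averaging the critic targets over such perturbed actions within a mini-batch gives similar actions similar values, which is the smoothing effect described above.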

4) DISTRIBUTED REINFORCEMENT LEARNING FRAMEWORK BY DECENTRALIZED IMPLEMENTATION AND CENTRALIZED TRAINING
In MADMI-TD3, each agent has multiple explorers, a leader, and two shared experience pools. The leader includes two critic networks and an actor network, while each explorer has its own actor network and its own environment. In an exploration environment with several different explorers, the explorers first generate transition experience from their own environments and add it to the two experience buffer pools according to the classification criteria. The leader then samples experience from the buffer pools according to the criteria and keeps learning. To speed up the learning process of an agent, the input of the critic network in the leader includes the observed states and actions of the other agents, so that each agent gains a comprehensive understanding of the environment, properly evaluates its strategy, and cooperates with the other agents. Finally, the actor network in each explorer periodically updates its parameters from the latest actor network of the leader. Since centralized training and decentralized execution are adopted for the multiple agents, the explorers of different agents are grouped into teams, and distributed training is implemented with multiple explorer teams in parallel, as described in Section 1.1.2. In centralized training, each agent obtains the action to be performed in the current state according to its own strategy. After all agents have interacted with the environment, each agent randomly selects experience from the experience pool to train its neural networks.

5) VARIABLE NOISE MODEL
In the proposed algorithm, random noise with different variances is applied to the actor networks of different explorers so as to produce different samples. The noise model is chosen at random between Gaussian noise and Ornstein-Uhlenbeck (OU) noise, which increases the randomness and diversity of the explored samples:

$$\text{noise} = \mathcal{N}(0, \sigma) \ \text{or} \ \mathrm{OU} \tag{24}$$
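One way to realize such a variable noise model is to give every explorer its own noise source with its own variance and a randomly chosen type. The Python sketch below is an illustration under assumptions: the OU process uses a simple Euler discretization, and the values of `theta`, `mu`, `dt`, the per-explorer `sigmas`, and all class and function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Union


class GaussianNoise:
    """Uncorrelated noise drawn from N(0, sigma)."""

    def __init__(self, sigma: float, rng: random.Random) -> None:
        self.sigma = sigma
        self.rng = rng

    def sample(self) -> float:
        return self.rng.gauss(0.0, self.sigma)


class OUNoise:
    """Temporally correlated Ornstein-Uhlenbeck noise (Euler discretization).

    theta, mu and dt are ASSUMED values, not taken from the article.
    """

    def __init__(self, sigma: float, rng: random.Random, theta: float = 0.15,
                 mu: float = 0.0, dt: float = 1.0) -> None:
        self.sigma, self.rng = sigma, rng
        self.theta, self.mu, self.dt = theta, mu, dt
        self.state = mu

    def sample(self) -> float:
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.state += drift + diffusion
        return self.state


def make_explorer_noises(sigmas: Sequence[float],
                         seed: int = 0) -> List[Union[GaussianNoise, OUNoise]]:
    """Give each explorer its own variance and a randomly chosen noise type."""
    rng = random.Random(seed)
    noises = []
    for sigma in sigmas:
        noise_cls = rng.choice([GaussianNoise, OUNoise])
        noises.append(noise_cls(sigma, random.Random(rng.randrange(2 ** 32))))
    return noises


noises = make_explorer_noises([0.05, 0.1, 0.2])
```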

6) CLASSIFIED EXPERIENCE REPLAY
The proposed algorithm classifies samples by the mean value of the immediate rewards, and two completely independent experience buffer pools are used to store the experience samples. When the network model is initialized, the average immediate reward of all samples in the two experience buffer pools is set to 0. During training, the immediate reward of each new sample is compared with this average: if the immediate reward is greater than $r_a$, the mean of all immediate rewards in the experience samples, the sample is stored in experience buffer pool 1; otherwise it is stored in experience buffer pool 2.
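The classification rule can be sketched with two bounded buffers and a running mean of the immediate rewards. In the Python sketch below, the comparison uses the mean before the new sample is included, which is an assumption about a detail the text leaves open; the class name, the transition layout `(state, action, reward, next_state)`, and the capacity are hypothetical.

```python
from collections import deque
from typing import Any, Deque, Tuple

Transition = Tuple[Any, Any, float, Any]  # (state, action, reward, next_state)


class ClassifiedReplay:
    """Two pools split by whether a reward exceeds the running mean reward."""

    def __init__(self, capacity: int) -> None:
        self.pool1: Deque[Transition] = deque(maxlen=capacity)  # above mean
        self.pool2: Deque[Transition] = deque(maxlen=capacity)  # the rest
        self.mean_reward = 0.0  # initialized to 0, as stated in the text
        self.count = 0

    def add(self, transition: Transition) -> None:
        reward = transition[2]
        # ASSUMPTION: compare against the mean of the rewards seen so far.
        pool = self.pool1 if reward > self.mean_reward else self.pool2
        pool.append(transition)
        self.count += 1
        self.mean_reward += (reward - self.mean_reward) / self.count


buffer = ClassifiedReplay(capacity=1000)
for r in (1.0, 0.5, 2.0):
    buffer.add(("s", "a", r, "s_next"))
```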

7) ''WARM BOOT'' OF EXPERIENCE POOL
To improve the optimizing behavior of the algorithm, so that it does not lose the direction of optimization by learning too many low-value samples in the early stage of training, the experience pool in this article is designed with a ''warm boot''. That is, before formal training, each agent conducts ''warm-up'' training to produce samples, which are classified according to the principle of classified replay. Then, according to the reward value, the samples in experience pool 1 obtained by ''warm-up'' training are divided into two parts according to the classification criteria and placed into experience pool 1 and experience pool 2 respectively. In both experience pools, the ''warm-up'' samples with high reward value are retained in advance, so that the algorithm does not drift during formal training, which yields a better solution and faster convergence. During training, $n\xi$ samples are drawn from pool 1 with probability $\xi$, and $n(1-\xi)$ samples are drawn from pool 2 with probability $1-\xi$. The detailed framework is displayed in Fig. 2, and the explicit process is given in Table 1.
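The mixed sampling from the two pools can be sketched as drawing a fixed share of each mini-batch from each pool. Interpreting $\xi$ as the share of the batch taken from pool 1 is an assumption made for this Python illustration; the function name, the tagged placeholder transitions, and the batch size are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # placeholder transition type for the demo


def sample_mixed(pool1: Sequence[Sample], pool2: Sequence[Sample], n: int,
                 xi: float, rng: random.Random) -> List[Sample]:
    """Draw about n*xi samples from pool 1 and the rest from pool 2.

    ASSUMPTION: xi is read as the share of the mini-batch taken from pool 1.
    """
    n1 = min(len(pool1), round(n * xi))
    n2 = min(len(pool2), n - n1)
    return rng.sample(list(pool1), n1) + rng.sample(list(pool2), n2)


high = [("high", i) for i in range(20)]   # stands in for experience pool 1
low = [("low", i) for i in range(20)]     # stands in for experience pool 2
batch = sample_mixed(high, low, n=10, xi=0.7, rng=random.Random(0))
```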

IV. VGA-AGC SYSTEM BASED ON MADMI-TD3 IN PERFORMANCE-BASED FREQUENCY REGULATION MARKET
A. KING AGENT - CONTROLLER AGENT

1) ACTION SPACE
For any control interval, the output of the king agent is the total generation power command $P_{\text{order}}(k)$, which defines the action space.

2) STATE SPACE
The state space is composed of the frequency deviation $\Delta f$, the time integral of the frequency deviation (

3) REWARD FUNCTION
In the reward function, $A$ is the control reward term, which is equal to 2 when the absolute value of the frequency deviation is less than 0.05.

B. GENERAL AGENT - TOTAL DISPATCH AGENT

1) ACTION SPACE
In order to meet the power balance constraint, the participation factors of the generation unit groups must satisfy formula (29). Suppose there are $n$ unit groups; to satisfy formula (29), participation factors need to be allocated to only $n-1$ of them in each AGC dispatch period, and the participation factor of the $n$th group is calculated as

$$a_{Gn} = 1 - \sum_{i=1}^{n-1} a_{Gi} \tag{30}$$

In this article, generation unit group $n$ is defined as the balancing unit group, and a unit group with large regulation capacity is selected for this role. The action consists of the participation factors of the remaining $n-1$ groups, subject to the associated constraints.

2) STATE SPACE
The state space of the general agent is composed of the total AGC generation power command $P_{\text{order}}$ and $P_{GG1}$, $P_{GG2}$, $P_{GG3}$, $P_{GG4}$, $P_{GG5}$, which are the outputs of the different unit groups respectively, as shown in formula (33).

3) REWARD FUNCTION
In order to keep the form of the reward function consistent, the comprehensive frequency regulation performance score used for the regulation mileage is the historical average value obtained through a long-term simulation.
In the reward function, $\sigma$ is the penalty term: if the participation factor of the balancing unit group is less than 0, $\sigma$ is equal to −0.5.
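The balancing-group rule and the penalty term of the general agent can be sketched together. The Python illustration below assumes that the participation factors sum to one, so that the balancing group receives the remainder, and that the penalty is zero when the balancing factor is non-negative; both points, along with the function names and sample factors, are assumptions for illustration.

```python
from typing import List, Sequence

PENALTY = -0.5  # value stated in the text for a negative balancing factor


def complete_factors(free_factors: Sequence[float]) -> List[float]:
    """Append the balancing group's factor so that all factors sum to 1.

    ASSUMPTION: the power balance constraint requires the factors to sum to 1.
    """
    return list(free_factors) + [1.0 - sum(free_factors)]


def balance_penalty(free_factors: Sequence[float]) -> float:
    """Penalty term; ASSUMED to be 0 when the balancing factor is >= 0."""
    return PENALTY if complete_factors(free_factors)[-1] < 0.0 else 0.0


def dispatch(total_command: float, free_factors: Sequence[float]) -> List[float]:
    """Split the total command among the groups by participation factor."""
    return [a * total_command for a in complete_factors(free_factors)]


# Four groups: three free factors, the fourth is the balancing group.
commands = dispatch(100.0, [0.2, 0.3, 0.1])
```

Because the last factor is derived rather than chosen, the dispatched commands always add up to the total command, and the penalty discourages actions that would push the balancing group in the opposite direction.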

C. LORD AGENT J - SUB DISPATCH AGENT
The lord agents have similar state spaces, action spaces, and reward functions.

1) ACTION SPACE
In this article, unit $n$ is defined as the balancing unit. For lord agent $j$ ($j = 1, 2, 3, 4, 5$), the action space is as follows:

$$\left[a_{Lj1},\ a_{Lj2},\ \dots,\ a_{Lji},\ \dots,\ a_{Lj(n-1)}\right]$$

2) STATE SPACE
The state space of lord agent $j$ is composed of the AGC generation power command $P_{\text{order-}j}$ input to lord agent $j$ and the actual outputs $P_{Gi}$ ($i = 1, 2, 3, \dots, n$) of the $n$ units managed by lord agent $j$, as shown in formula (37).
The weight coefficient in the reward function and the hyperparameter design in the pre-learning are as shown in Table 2.

V. SIMULATION VERIFICATION
To verify the effectiveness of the proposed VGA-AGC based on MADMI-TD3, the conventional AGC framework (PI + GA, PI + PROP) and VGA-AGC based on other deep reinforcement learning algorithms (MADMI-DDPG, TD3 and DDPG) are used for comparison. The interconnected power grid system of a province includes 32 regulation units. The specific control model is shown in Figure 3, and the parameters of the units are shown in Table 5.

A. SIMULATION OF A PROVINCIAL INTERCONNECTED POWER GRID UNDER RANDOM STEP DISTURBANCE

1) PRE-LEARNING STAGE
In the pre-learning stage, a continuous sinusoidal disturbance with a period of 3600 s, an amplitude of 1800 MW, a duration of 3600 s, and a phase of 0.5π is added to area A. The specific control model diagram is shown in Figure 3, and the parameters of the generation units are shown in Table 2. The training chart is shown in Figure 4, in which the curve represents the mean reward of the corresponding episodes for MADMI-TD3. The average reward of the MADMI-TD3 algorithm converges smoothly to an optimal solution, and the algorithm is stable.
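The training disturbance described above can be reproduced as a sampled sine wave. The Python sketch below uses the period, amplitude, duration, and phase given in the text and assumes one sample per 4 s AGC control cycle; the function name is hypothetical.

```python
import math
from typing import List


def sinusoidal_disturbance(amplitude_mw: float = 1800.0,
                           period_s: float = 3600.0,
                           duration_s: float = 3600.0,
                           phase_rad: float = 0.5 * math.pi,
                           step_s: float = 4.0) -> List[float]:
    """Sampled load disturbance A * sin(2*pi*t/T + phase), one value per step.

    ASSUMPTION: the signal is sampled once per 4 s AGC control cycle.
    """
    n_samples = int(duration_s // step_s)
    return [amplitude_mw * math.sin(2.0 * math.pi * (k * step_s) / period_s
                                    + phase_rad)
            for k in range(n_samples)]


disturbance = sinusoidal_disturbance()
```

With a phase of 0.5π the disturbance starts at its positive peak and reaches its negative peak halfway through the episode.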

2) STEP DISTURBANCE ONLINE TEST
For a power grid containing various regulation units, a step load disturbance with an amplitude of no more than 800 MW is used for testing. The results are shown in Figures 5-9 and Table 4. [...] moments, the 10-minute average CPS1 of MADMI-TD3 is significantly higher than that of the other six algorithms. According to Fig. 6, the power control deviation of MADMI-TD3 is much smaller than that of the conventional combination algorithm, and its response is faster. This shows that the actual total output of the generation units under MADMI-TD3 is closer to the actual load disturbance. The reason is that the conventional combination algorithm uses too many slow-response units for frequency regulation. In addition, it does not take the coordination of the PI controller and the generation command dispatch algorithm into account, which may cause fluctuation of the total AGC generation power command output by the PI controller (as shown in Fig. 7). As a result, the outputs of some generation units are regulated frequently and ''overregulation'' occurs, which greatly increases the regulation mileage of those units and hence the regulation mileage payment. In contrast, MADMI-TD3 uses more fast-response units for frequency regulation, such as hydroelectric units, renewable energy units, and flexible loads (as shown in Fig. 8). Moreover, because the coordination of agents is considered during offline training, the king agent in the VGA-AGC based on MADMI-TD3 does not suffer from the instability and discordance caused by poor cooperation between the control algorithm and the dispatch algorithm. The output of the king agent (controller) tracks the load disturbance accurately and in real time. With MADMI-TD3, the actual total output of the generation units stays close to the actual load disturbance, and the control performance of AGC is significantly improved.
Thus, the possibility of ''overregulation'', the regulation mileage, and the regulation mileage payment are all reduced. Compared with the conventional combination algorithm, MADMI-TD3 also produces a smaller frequency deviation (as shown in Fig. 9) and a faster frequency recovery. As shown in Fig. 9, the regulation mileage payment of the MADMI-TD3 algorithm is lower than that of the conventional combination algorithm. According to the statistical results in Table 3, among the compared algorithms MADMI-TD3 has the smallest $|\Delta f|$, $|e_{ACE}|$, and regulation mileage payment, and the largest mean CPS1, $C_{CPS1}$.
Among the algorithms tested within the VGA-AGC framework, MADMI-TD3 achieves the best control performance and the lowest regulation mileage payment.
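The link between overregulation and the regulation mileage payment discussed above can be illustrated with a minimal sketch. It assumes the common definition of regulation mileage as the sum of absolute changes in regulation output between consecutive control intervals, and a payment equal to mileage times a clearing price times a performance score. The function names, the sample trajectories, the price, and the score are hypothetical and are not taken from the simulations in this article.

```python
def regulation_mileage(output_mw):
    """Sum of absolute step-to-step changes in regulation output (MW)."""
    return sum(abs(b - a) for a, b in zip(output_mw, output_mw[1:]))


def mileage_payment(output_mw, price_per_mw, performance_score):
    """Assumed settlement rule: mileage x clearing price x performance score."""
    return regulation_mileage(output_mw) * price_per_mw * performance_score


# Two illustrative responses to the same 100 MW step disturbance.
smooth = [0, 40, 80, 100, 100, 100]       # monotonic tracking
overshoot = [0, 60, 130, 90, 110, 100]    # overregulation, then corrections

smooth_mileage = regulation_mileage(smooth)        # 40 + 40 + 20 = 100 MW
overshoot_mileage = regulation_mileage(overshoot)  # 60 + 70 + 40 + 20 + 10 = 200 MW
```

Both trajectories settle at the same final output, but under these assumptions the overshooting one accumulates twice the mileage and therefore twice the payment for the same price and performance score.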

B. RANDOM POWER DISTURBANCE ONLINE TEST
In the two-area power grid system, the outputs of the photovoltaic and wind units are simplified as random output models and treated as random load disturbances of the AGC system. Meanwhile, part of the capacity of the wind turbine and photovoltaic units participates in the secondary frequency regulation of the system. The wind model consists of three small-capacity wind turbine units and one large offshore wind turbine unit; the latter does not participate in secondary frequency regulation. Fig. 13 shows the 24 h curves of the load disturbance, the wind turbine output, and the photovoltaic unit output.
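One way to picture how renewable output can be treated as a random load disturbance is sketched below. This is only an illustration under stated assumptions: a bounded random walk stands in for the random output models (the article does not specify them here), renewable output enters the net disturbance with a negative sign, and the share of wind and photovoltaic capacity that participates in regulation is ignored. All names, capacities, and step sizes are hypothetical.

```python
import random


def random_walk_output(capacity_mw, steps, max_step_mw, rng):
    """Bounded random walk standing in for a stochastic unit output (MW)."""
    output = [capacity_mw / 2]
    for _ in range(steps - 1):
        nxt = output[-1] + rng.uniform(-max_step_mw, max_step_mw)
        output.append(min(max(nxt, 0.0), capacity_mw))  # clip to [0, capacity]
    return output


def net_disturbance(load_mw, wind_mw, pv_mw):
    """Renewable output offsets load, so it enters with a negative sign."""
    return [l - w - p for l, w, p in zip(load_mw, wind_mw, pv_mw)]


rng = random.Random(0)          # fixed seed for repeatability
steps = 24                      # one sample per hour (assumed resolution)
load = [500.0] * steps          # flat load, illustrative only
wind = random_walk_output(200.0, steps, 50.0, rng)
pv = random_walk_output(100.0, steps, 20.0, rng)
disturbance = net_disturbance(load, wind, pv)
```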
The online test results are shown in Figures 10 to 12. Because the system contains an offshore wind power plant, the disturbance changes very rapidly, with an amplitude larger than that of the normal step disturbance. According to Fig. 10, the total generation command output by the king agent (controller) of the MADMI-TD3 algorithm remains smooth and close to the actual power disturbance, whereas that of the PI + PROP controller clearly exceeds the actual load disturbance. This causes overregulation of the total actual unit output and results in a larger frequency deviation and a larger ACE, as shown in Table 4. This indicates that, because the coordination of the control algorithm and the dispatch algorithm is considered, MADMI-TD3 has better control performance and robustness in the control process. Fig. 11 shows the total actual regulation power output of six algorithms. For the conventional combination algorithm, burrs and overregulation clearly appear more frequently. The total actual regulation power output of the deep reinforcement learning algorithms is closer to the actual load disturbance, which yields better control performance and economic benefits, as shown in Table 4. Because the conventional combination algorithms exhibit relatively large overregulation of the total actual regulation power output, their regulation mileage and regulation mileage payment increase. Hence, their regulation mileage payment is higher than that of the deep reinforcement learning algorithms in each market settlement cycle (as shown in Fig. 12), indicating that the deep reinforcement learning algorithms, which consider the collaboration of the control algorithm and the dispatch algorithm, are more economical in frequency regulation.
The simulation results for all of the above algorithms are summarized in Table 4. According to Table 4, MADMI-TD3 has the smallest |Δf|, |e_ACE| and regulation mileage payment and the largest C_CPS1 (the mean CPS1) among the compared algorithms. The data in Table 4 show that, for the LFC model subject to random disturbance, MADMI-TD3 achieves better control performance than the conventional combination algorithms, with a smaller frequency deviation. The other deep reinforcement learning algorithms, by contrast, do not explore and optimize sufficiently, and are therefore weaker than MADMI-TD3 in both control performance and economic benefits.
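For readers who want to reproduce summary statistics of this kind from simulated time series, the following is a minimal sketch. It assumes the NERC-style form of CPS1, CPS1 = (2 - CF) x 100% with CF = mean(ACE x Δf) / (-10 B ε1²), where B is the (negative) frequency bias in MW/0.1 Hz and ε1 is the frequency-error target in Hz. The exact definition used for Table 4 is not restated in this section, and the function names and sample values below are hypothetical.

```python
def mean_abs(values):
    """Mean absolute value, e.g. for |Δf| or |e_ACE| over a test run."""
    return sum(abs(v) for v in values) / len(values)


def cps1_percent(ace_mw, delta_f_hz, bias_mw_per_0p1hz, epsilon1_hz):
    """Assumed NERC-style CPS1: (2 - CF) * 100, CF = mean(ACE*df) / (-10*B*eps1^2)."""
    products = [a * f for a, f in zip(ace_mw, delta_f_hz)]
    mean_product = sum(products) / len(products)
    cf = mean_product / (-10.0 * bias_mw_per_0p1hz * epsilon1_hz ** 2)
    return (2.0 - cf) * 100.0


# Hypothetical samples: ACE in MW, frequency deviation in Hz.
ace = [10.0, -5.0, 0.0, 5.0]
delta_f = [0.01, -0.01, 0.0, 0.005]

mean_abs_ace = mean_abs(ace)
cps1 = cps1_percent(ace, delta_f, bias_mw_per_0p1hz=-100.0, epsilon1_hz=0.02)
```

Under this definition a run with zero ACE scores exactly 200%, and the score falls as ACE and frequency deviation become more strongly correlated in sign.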

VI. CONCLUSION
FIGURE 13. Disturbance curve.
To conclude:
1) In the performance-based frequency regulation market, the VGA-AGC based on the proposed MADMI-TD3 helps build a centralized-decentralized autonomous framework that solves the problem of coordinating the conventional control algorithm and dispatch algorithm. Compared with the conventional combination algorithm, the VGA-AGC framework achieves a comprehensive optimization of control performance and economic benefits in the secondary frequency regulation of a power grid with large random disturbances.
2) The proposed MADMI-TD3 uses multiple actor networks with different parameters for distributed optimization. Techniques such as experience replay, various noise models, and a warm boot of the experience pool are used to improve the global search ability and optimization speed of the algorithm. The algorithm can obtain the optimal control strategy in a strongly random environment, which addresses the random disturbances caused by large-scale distributed energy sources in the power grid.
3) According to the simulation results, the proposed method significantly improves control performance, attaining the maximum CPS1 index among the compared algorithms, and effectively reduces the regulation mileage payment.