Deep Reinforcement Learning Based Multi-Objective Integrated Automatic Generation Control for Multiple Continuous Power Disturbances

This paper proposes a multi-objective integrated automatic generation control (MOI-AGC) scheme that combines the controller and the dispatch, which helps improve both control performance and economy in a power grid subject to multiple continuous power disturbances. A distributed classification replay twin delayed deep deterministic policy gradient (DCR-TD3) algorithm is then designed for the MOI-AGC. On the one hand, DCR-TD3 introduces a classification replay method based on multiple explorers with actor networks of different parameters for distributed optimization. On the other hand, the optimal control strategy is obtained through DCR-TD3 in an extremely random environment, using the frequency deviation, the area control error and the frequency regulation mileage payment as the reward function. This helps address the frequency instability caused by multiple stochastic disturbances in a grid with a large number of distributed energy sources. Simulation verification is performed on a two-area load frequency control (LFC) model, with the results showing that the proposed algorithm achieves better control performance and economic benefits. Moreover, compared with existing algorithms, it achieves a regional optimum control, reducing the frequency regulation mileage payment.


I. INTRODUCTION
The continuous increase in the utilization rate of distributed energy in the power grid is accompanied by a constant increase in stochastic disturbances. Thus, traditional AGC methods can no longer meet such a large-scale, complicated energy system's requirement for frequency stability [1].
Generally, traditional AGC generates the system's total AGC power regulation command and dispatches it to each generation unit with various methods, such as a proportion integration (PI) controller and simple proportional dispatch based on generator unit capacity and regulation speed. However, it is difficult to meet the requirement of the power grid control performance standard (CPS) appraisal in a control area with a high renewable energy penetration rate and insufficient regulation sources [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Huai-Zhi Wang. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

To address secondary frequency regulation in a power grid with multiple renewable energy sources, scholars have classified algorithms into two categories. One is control algorithms for AGC, such as proportion integration differentiation (PID) commonly employed in engineering, fuzzy PID and fuzzy logic control [2]-[4], sliding mode control (SMC) [5], active disturbance rejection control (ADRC) [6], fractional-order PID (FOPID) [7], Q-learning in reinforcement learning, deep Q-learning (DQN) [8], [9], and others. Usually, such algorithms treat the power grid as a single area to calculate the total AGC power regulation command, and then dispatch it to each AGC unit in proportion (PROP). The other category is optimization algorithms for AGC dispatch, such as the genetic algorithm (GA), grey wolf optimizer (GWO) [10], [11], proportion (PROP) [12], particle swarm optimization (PSO) [13], moth-flame optimization (MFO) [14], whale optimization algorithm (WOA) [15], ant lion optimizer (ALO) [16], dragonfly algorithm (DA) [17], group search optimizer (GSO) [18], chicken swarm optimization (CSO) [19], sine cosine algorithm (SCA) [20], etc. Generally, such algorithms adopt the classic PID controller as the AGC control algorithm and then dispatch the total AGC power regulation command to each AGC unit via the optimization algorithm, aiming to minimize the regulation payment. Overall, both categories have certain advantages; for example, the control and optimization algorithms are separated, so each can be designed independently, but coordination between the two algorithms remains a problem.
The control algorithm takes minimizing the frequency deviation as its control objective, while the optimized dispatch algorithm takes minimizing the regulation payment as its optimization objective. When the two algorithms are simply combined, both the frequency deviation and the regulation payment may increase, worsening the generation control performance. Moreover, the actual calculation time, which grows as precision improves, may exceed the maximum time allowed by the regulation command [21].
To encourage faster-responding AGC frequency regulation units to participate in secondary frequency regulation, the Federal Energy Regulatory Commission (FERC) issued Order No. 755, proposing a fairer and more reasonable mechanism for the performance-based frequency regulation market (hereinafter referred to as the frequency regulation market) [22]. Under this mechanism, the payment of each AGC unit consists of two parts [23], namely, the AGC capacity payment and the regulation mileage payment, which are directly affected by the frequency regulation mileage quotation. That is, the AGC frequency regulation unit with a higher quotation will be more involved in AGC frequency regulation for a better control performance. Some independent system operators (ISO), such as PJM, China Southern Power Grid (CSG) and others, invoke frequency regulation resources in accordance with the actual frequency regulation performance of different AGC units. Guided by the new frequency regulation market mechanism, the regulation payment has changed from the original simple linear calculation method, with a fixed price per unit of regulation power capacity, to a dynamic calculation method in which the regulation mileage payment is affected by the comprehensive regulation performance score, the regulation mileage and the regulation mileage quotation. This makes the original method of combining the generation control algorithm with the optimization dispatch algorithm no longer suitable for the current frequency regulation mechanism. When multiple continuous power disturbances appear in a power grid, the system frequency tends to deteriorate, and the mutual influence and conflict between the control performance of the generation control algorithm and the optimization performance of the dispatch algorithm worsen further.
In [24], the total AGC power regulation command is generated by a traditional PI controller, followed by a fixed dispatch method: the participation degree of all AGC regulation units is proportional to their regulation mileage capacity. In [25], a real-time AGC scheduling method is designed, making the fast regulation resources take on the relatively large AGC generation power commands required by a high regulation mileage. A multi-agent deep reinforcement learning algorithm with action exploration and recognition thinking was proposed in [26], effectively alleviating the frequency instability caused by multiple stochastic disturbances in a grid with numerous distributed energy sources. However, its controller only outputs the total AGC power regulation command, without in-depth research on the dispatch algorithm, which tends to cause poor performance due to uncoordinated collaboration between the control algorithm and the optimization algorithm. To address the poor coordination caused by combining the generation control algorithm and the optimization algorithm in a microgrid, an integrated framework with adaptive deep dynamic programming was designed [27]. This framework can prevent the poor performance brought by the combination of the control algorithm and the optimization algorithm. However, because it introduces the economic scheduling problem into AGC, it is not suitable for a large power grid with a large number of units. In addition, the regulation payment under the frequency regulation market is ignored.
These methods, though simple and practical, ignore the coordination of the two algorithms and their joint optimization. Therefore, they cannot meet the ISO's requirement for an optimal comprehensive benefit of control performance and regulation mileage payment.
To overcome the weakness of combining the generation control algorithm with the optimization dispatch algorithm in the frequency regulation market, the MOI-AGC framework is designed, and the DCR-TD3 is proposed for this framework. This algorithm uses multiple actor networks with different noise parameters for distributed optimum-seeking and finds the optimal strategy in an extremely random environment (interconnected power grids with large-scale distributed energy) by using the frequency deviation, the area control error (ACE) and the regulation mileage payment as the comprehensive reward function.
The innovations of this paper are as follows: 1) Previous studies of AGC ignored the coordination of the control algorithm and the optimization algorithm, and thus do not satisfy the benefit pursuit of the ISO. This is especially true of multi-objective optimization in a performance-based frequency regulation market of a power grid with large-scale distributed energy. To fill this gap, the MOI-AGC is proposed to balance the technical and economic benefits of the ISO when it distributes the total power regulation command to all the regulation units.
2) The conventional reinforcement learning approach continuously updates a single neural network of an agent with a single experience pool during parameter updating, causing an extremely slow parameter-updating pace and a tendency toward local optima. This is especially true of the deep deterministic policy gradient (DDPG), which requires high exploration capability. To address this problem, DCR-TD3 is proposed for the MOI-AGC. Employing multiple explorers with actor networks of different parameters for distributed optimization, this algorithm introduces the classification replay method to obtain the optimal control strategy in an extremely random environment.

II. DCR-TD3 ALGORITHM
TD3 [28], deriving from DDPG [29], is a deep reinforcement learning algorithm with an actor-critic framework. In order to solve the overestimation problem of the Q value in the actor-critic framework, TD3 uses three key tricks.
1) Clipped double Q learning under the actor-critic framework [28]. TD3 selects the optimal action using the current actor network and uses the evaluation strategy of the target critic network.
The target value is as follows [31]:

y = r + γ · min(Q_θ1(s_t, π_φ(s_t)), Q_θ2(s_t, π_φ(s_t)))

where Q_θ1 and Q_θ2 are the value functions of the two current networks under the same state s_t and the action π_φ(s_t).
To reduce the training cost, two critic networks and an independent actor network are adopted in TD3.
2) Delayed policy updating [28]. The critic network can reduce its deviation from the target Q value through multi-step updating. However, if the critic network's deviation is large, updating the actor network causes the strategy to diverge. Therefore, TD3 updates the actor network once after every d updates of the critic network.
3) Target policy smoothing regularization [28]. Like DDPG, TD3 uses a deterministic strategy, so the target value is easily affected by the function's approximation error during critic updating, resulting in an inaccurate target value. Therefore, a regularization strategy is introduced to reduce the target value variance: TD3 smooths the Q-value estimate by bootstrapping from similar state-action pairs.
At the same time, smooth regularization is realized by adding a random noise ε to the target strategy and averaging over the mini-batch:

ã = π_φ(s) + ε, ε ~ clip(N(0, σ̃), −c, c)

The traditional reinforcement learning algorithm continuously updates a single neural network of a single agent during parameter updating [30]. This results in a large amount of redundancy in the information used for agent updating, an extremely slow parameter-updating pace, and a tendency toward local optima [31], especially for the deep deterministic policy gradient, which requires high exploration capability.
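As a concrete illustration, the clipped double-Q target with target policy smoothing can be sketched in a few lines of Python. Here `q1`, `q2` and `actor` are hypothetical stand-ins for the critic and actor networks, not the paper's implementation:

```python
import random

def td3_target(r, s_next, q1, q2, actor, gamma=0.99, sigma=0.2, c=0.5):
    """Sketch of the TD3 target value: target policy smoothing plus
    clipped double-Q. q1, q2 and actor are placeholder callables."""
    # Clipped Gaussian noise smooths the target action.
    eps = max(-c, min(c, random.gauss(0.0, sigma)))
    a_next = actor(s_next) + eps
    # The minimum of the two critics curbs Q-value overestimation.
    return r + gamma * min(q1(s_next, a_next), q2(s_next, a_next))
```

With constant stand-in critics, the clipping and the minimum are easy to check by hand; in a real agent the callables would be neural networks evaluated on batches.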
The DDPG adopts the method of adding noise [32]. However, it is difficult to ensure sample diversity with only one actor network exploring the environment [33]. To solve this problem, a distributed reinforcement learning training method is introduced into TD3, together with classification replay, yielding the DCR-TD3. The training framework adopted by the DCR-TD3 is a distributed reinforcement learning framework with multiple explorers, a leader and two shared experience buffer pools. The leader includes two critic networks and an actor network. Each explorer has one actor network, its own network model and its own environment. The explorers first generate transfer experience in their own environments and add it to one of the two experience buffer pools according to the classification standard. The leader then samples transfer experience from the buffer pools according to the same standard and keeps learning. Finally, the explorers' actor networks regularly update their parameters from the latest leader actor network. The actor networks of different explorers use random noise with different variances in their optimum-seeking strategies to increase the randomness and diversity of the explored samples.

As the algorithm adopts an experience replay mechanism, a classification probability replay (classification replay for short) method is used. The standard for classification probability replay is as follows: analogous to the ε-greedy strategy in Q-learning, two independent experience buffer pools are used to store the experience samples in DCR-TD3. During network model initialization, the average reward of all samples in the two pools is set to 0. Each sample's immediate reward is compared with this average: if it is larger than the average, the sample is placed in pool 1; otherwise, it is placed in pool 2.
During training, n·ξ samples are selected from pool 1 with probability ξ, and n·(1−ξ) samples are selected from pool 2 with probability 1−ξ. The specific framework is shown in Fig. 1, and the specific procedure in Table 1.
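The two-pool classification replay described above can be sketched as follows. This is a minimal illustration: the class and method names are ours, and the incremental running-average bookkeeping is one plausible reading of the stated standard:

```python
import random

class ClassificationReplay:
    """Two-pool classified replay (sketch). Samples whose immediate reward
    beats the running average go to pool 1, the rest to pool 2."""
    def __init__(self):
        self.pool1, self.pool2 = [], []
        self.avg_reward, self.count = 0.0, 0  # average initialized to 0

    def add(self, transition, reward):
        # Classify against the current average, then update the average.
        (self.pool1 if reward > self.avg_reward else self.pool2).append(transition)
        self.count += 1
        self.avg_reward += (reward - self.avg_reward) / self.count

    def sample(self, n, xi=0.7):
        # Draw about n*xi samples from pool 1 and the rest from pool 2.
        n1 = min(int(n * xi), len(self.pool1))
        n2 = min(n - n1, len(self.pool2))
        return random.sample(self.pool1, n1) + random.sample(self.pool2, n2)
```

In DCR-TD3 the transitions would be (s, a, r, s') tuples produced by the explorers; here strings suffice to show the classification logic.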

III. DISTRIBUTED CLASSIFICATION REPLAY TWIN DELAYED DEEP DETERMINISTIC POLICY GRADIENT
A. FREQUENCY REGULATION MILEAGE
Frequency regulation mileage is a new quantitative indicator [14] that judges the actual regulation amount of each AGC frequency regulation unit based on the AGC generation power command allocated by the power grid scheduling center in real time. Based on the rules of the CSG market, the total frequency regulation mileage of a certain period is the sum of the regulation mileage of all AGC frequency regulation units responding to the AGC power regulation command within that period. The frequency regulation mileage of an AGC unit responding to an AGC generation power command is the absolute value of the difference between the actual regulation power output value at the end of the response and the value at the start of the response. The details are as follows [8]. The frequency regulation mileage calculation formula for the ith unit is:

ΔM_i(k) = |P^out_i(k) − P^out_i(k−1)|

where ΔM_i(k) is the frequency regulation mileage of the ith AGC frequency regulation unit during the kth control interval and P^out_i(k) is the actual regulation power output of the ith AGC frequency regulation unit during the kth control interval.
The regulation mileage payment of each AGC frequency regulation unit is as follows [8]:

D_i = Σ_{k=1}^{N} λ · S^p_i · ΔM_i(k)

where D_i is the total regulation mileage payment of the ith AGC frequency regulation unit over N control intervals, λ is the frequency regulation mileage price, S^p_i is the comprehensive frequency regulation performance indicator score of the ith AGC frequency regulation unit, and N is the number of control intervals in each period of the frequency regulation service. If the AGC control cycle is 4 s and the time cycle of the real-time frequency regulation market clearing is 15 min (900 s), N is 225. S^p_i is in effect the frequency regulation quality indicator of the AGC frequency regulation unit and is related to its dynamic regulation characteristic; that is, an AGC frequency regulation unit with a quicker response will receive a higher comprehensive frequency regulation performance indicator score. Based on the rules of the CSG market, S^p_i consists of the following three scores [8].
1) Ramp rate score: refers to the rate at which an AGC frequency regulation unit responds to an AGC generation power command. The calculation formula is expressed in terms of S^rate_i, the regulation rate score of the ith AGC frequency regulation unit, P^rate_i, its maximum regulation rate, and P^rate_a, the average regulation rate of all AGC frequency regulation units within the corresponding control area.
2) Response delay score: refers to the time delay of an AGC frequency regulation unit responding to an AGC generation power command. The score is expressed in terms of S^delay_i, the response time score of the ith AGC frequency regulation unit, and T^d_i, the regulation time constant of the ith AGC frequency regulation unit.
3) Regulation precision score: evaluates regulation precision based on the deviation between the allocated AGC generation power command input and the actual unit regulation power output. The score is expressed in terms of S^pre_i, the regulation precision score of the ith AGC frequency regulation unit, P^in_i(k), the AGC generation power command input of the ith unit at the beginning of the kth control interval, P^out_i(k+1), the unit regulation power output of the ith unit at the beginning of the (k+1)th control interval, and ΔP_{i,a}, the regulation error of the ith unit during each cycle, which is 1.5% of its rated regulation power output.
Therefore, the comprehensive frequency regulation performance indicator score is the sum of the three regulation scores multiplied by their respective weights:

S^p_i = ω_1·S^rate_i + ω_2·S^delay_i + ω_3·S^pre_i (12)

where ω_1, ω_2 and ω_3 are the weight coefficients of the different regulation scores; their values are 0.50, 0.25 and 0.25 respectively in the frequency regulation market of the CSG.
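The mileage, the weighted comprehensive score and the resulting payment can be sketched directly from the descriptions above. This is an illustrative reading: `p_out` is assumed to hold the actual outputs at the start of each control interval, and the function names are ours:

```python
def regulation_mileage(p_out):
    """Per-interval mileage: |change in actual regulation output| between
    consecutive control intervals (assumption on the indexing)."""
    return [abs(p_out[k] - p_out[k - 1]) for k in range(1, len(p_out))]

def comprehensive_score(s_rate, s_delay, s_pre, w=(0.50, 0.25, 0.25)):
    # Weighted sum of the three scores; defaults are the CSG weights of (12).
    return w[0] * s_rate + w[1] * s_delay + w[2] * s_pre

def mileage_payment(p_out, price, score):
    """Total mileage payment of one unit over the period:
    D_i = sum_k price * score * M_i(k), per the text's description."""
    return sum(price * score * m for m in regulation_mileage(p_out))
```

For an output trajectory [0, 10, 5] MW, the mileage per interval is [10, 5] MW, and the payment scales linearly with both the price λ and the score S^p_i.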

B. MULTIPLE CONTINUOUS POWER DISTURBANCES
The power grid is a complex nonlinear large-scale system. Once a disturbance forms in one place, it may spread to other nodes and cause disturbances there. Therefore, in a UHV (ultra-high voltage) synchronous power grid, it is impractical to consider only a single disturbance; a disturbance should be regarded as the combined effect of multiple events in time, space and type that change the operating state of the system. Multiple continuous power disturbances cause drastic changes in system power loss, thereby affecting frequency stability. Therefore, an AGC system with better control performance is required in the face of multiple continuous power disturbances, so that the frequency can recover quickly. The combination of a traditional control algorithm and a dispatch algorithm cannot meet this requirement.

C. OBJECTIVE FUNCTION
In this research, to achieve an optimal comprehensive benefit of control performance and economy in the dynamic dispatch of the AGC power regulation command, the objective is divided into three parts so that the system power disturbance is balanced quickly and accurately. That is, the absolute value of the total frequency deviation f_1, the absolute value of the total area control error f_2 and the total regulation mileage payment f_3 are all to be minimized in the AGC frequency regulation process. Obviously, the first two objectives conflict with the third. They can be expressed as (13):

min f_1 = Σ_{k=1}^{N} |Δf(k)|, min f_2 = Σ_{k=1}^{N} |e_ACE(k)|, min f_3 = Σ_{i=1}^{n} Σ_{k=1}^{N} λ·S^p_i·ΔM_i(k) (13)

where n is the number of AGC regulation units, Δf(k) is the frequency deviation of the kth control interval, and P^out_i(k+1) is the unit regulation power output of the ith AGC unit at the beginning of the (k+1)th control interval.
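The three objectives of (13) can be evaluated from logged trajectories as follows. This is a sketch: the per-unit mileage payments are assumed to be precomputed as in Section III-A, and the function name is illustrative:

```python
def objectives(freq_dev, ace, payments):
    """Evaluate the three AGC objectives over one appraisal period:
    total |frequency deviation|, total |ACE|, and total mileage payment."""
    f1 = sum(abs(df) for df in freq_dev)   # total |delta f| over intervals
    f2 = sum(abs(e) for e in ace)          # total |e_ACE| over intervals
    f3 = sum(payments)                     # total payment over units
    return f1, f2, f3
```

Because f1 and f2 reward using fast (often expensive) units while f3 penalizes payment, the tuple makes the stated conflict between the objectives explicit.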

D. CONSTRAINT CONDITION
Constraints of coal-fired units, liquefied natural gas (LNG) units, oil units and hydropower units are as follows: the power balance constraint, the frequency regulation direction constraint, the constraint on the AGC regulation capacity upper and lower limits, and the generation ramp rate constraint. These constraints are expressed in (14):

Σ_{i=1}^{n} ΔP^in_i(k) = ΔP^order(k), ΔP^in_i(k)·ΔP^order(k) ≥ 0, P^min_i ≤ ΔP^in_i(k) ≤ P^max_i, |P^out_i(k+1) − P^out_i(k)| ≤ P^rate_i·T (14)

where ΔP^order(k) is the total AGC power regulation command at the beginning of the kth control interval, P^max_i and P^min_i are the AGC regulation capacity upper and lower limits of the ith regulation unit respectively, and P^rate_i is the ramp rate of the ith regulation unit.
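A hedged sketch of how the capacity and ramp-rate constraints might be enforced on a single unit's command is given below. The power balance and direction constraints act across units and are omitted here; all names and the 4-second cycle default are illustrative assumptions:

```python
def feasible_command(p_cmd, p_prev, p_min, p_max, ramp, dt=4.0):
    """Clip one unit's AGC command to its capacity limits and to the
    ramp-rate constraint over a single control cycle of dt seconds."""
    # Capacity upper and lower limits.
    p = max(p_min, min(p_max, p_cmd))
    # Ramp-rate constraint around the previous output.
    p = max(p_prev - ramp * dt, min(p_prev + ramp * dt, p))
    return p
```

For a unit starting at 0 MW with a 5 MW/s ramp limit, any command outside ±20 MW is saturated within one 4 s cycle, regardless of capacity headroom.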

IV. MOI-AGC BASED ON DCR-TD3 FRAMEWORK
Unlike the traditional AGC system, which adopts a combination of a PI controller and a fixed dispatch algorithm, the control algorithm of the MOI-AGC based on DCR-TD3 can output the AGC power regulation commands for multiple AGC units at once, giving consideration to the comprehensive optimum of both the frequency deviation and the regulation mileage payment. The system monitors, calculates and stores ACE/Δf/CPS data as well as long-term historical records of each area of the interconnected power grid, and inputs the frequency deviation of the interconnected power grid (Δf), the area control error (ACE) and the regulation power output of each unit into the MOI-AGC. The algorithm then calculates the AGC power regulation command to be sent to each AGC unit. The specific framework is shown in Fig. 2.

A. ACTION SPACE
To ensure that the agent's m explorers can use more standard noise sequences during training, the AGC generation factors of the n units are used as the agent's actions at any moment t, as shown in (15).
where a_i is the AGC generation factor of the ith unit, P^max_iG is the maximal regulation capacity of the ith unit, and P^min_iG is the minimal regulation capacity of the ith unit.
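One plausible reading of how the generation factors could map to per-unit commands, treating the factors as normalized participation weights over the total command, is sketched below. This mapping is an assumption for illustration, not the paper's exact definition of (15):

```python
def dispatch(total_cmd, factors):
    """Distribute the total AGC regulation command among units in
    proportion to their generation factors (assumed positive weights)."""
    s = sum(factors)
    return [total_cmd * a / s for a in factors]
```

In practice each resulting command would still be clipped by that unit's capacity and ramp-rate constraints from (14) before being sent out.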

B. STATE SPACE
The inputs of the MOI-AGC based on DCR-TD3 are the frequency deviation (Δf), the time integral of the frequency deviation (∫_0^t Δf dt), the area control error (e_ACE) and the actual regulation power outputs (ΔP_Gi) of the n units. There are n + 3 states in total, as shown in (16):

s = [Δf, ∫_0^t Δf dt, e_ACE, ΔP_G1, …, ΔP_Gn] (16)

C. SELECTION OF REWARD FUNCTION
Based on (13), a comprehensive reward function is formed in this paper from the quadratic term of the ACE of each control cycle, the absolute value of the frequency deviation, and the linear weighting and constraint penalty term of the regulation mileage.
To ensure consistency of the reward function form, the comprehensive regulation performance metric score for regulation mileage used in this paper is the historical average value obtained from long-term simulation.
The reward function is expressed in terms of the following quantities: t is the discrete moment, e_ACE(t) is the ACE at moment t, Δf(t) is the frequency deviation at moment t, ΔP_GΣ(t) is the actual total regulation power output at moment t, ΔP_Gi(t) is the actual regulation power output of the ith unit at moment t, ΔP_Gi(t−1) is that at moment t−1, d_i(t) is the regulation mileage payment of the ith unit at moment t, λ is the regulation mileage price, and S^p_i is the comprehensive regulation performance metric of the ith unit, taken as its historical average value. A is a control reward term, which equals 10 when the absolute value of the frequency deviation is smaller than 0.05 Hz.
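A hedged sketch of such a reward is shown below, with illustrative weights rather than the paper's values (the actual coefficients are given in Table 2), and the per-step mileage payment already aggregated into one number:

```python
def reward(ace, freq_dev, payment, w=(1.0, 1.0, 0.001), bonus=10.0):
    """Illustrative comprehensive reward: penalize the squared ACE and
    |delta f|, weight the mileage payment linearly, and add the control
    reward A = 10 when |delta f| < 0.05 Hz. Weights w are assumptions."""
    r = -(w[0] * ace ** 2 + w[1] * abs(freq_dev) + w[2] * payment)
    if abs(freq_dev) < 0.05:
        r += bonus
    return r
```

The negative sign turns the minimization objectives of (13) into a reward to maximize, while the bonus term shapes the agent toward keeping the frequency inside the ±0.05 Hz band.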

D. PARAMETER SELECTION
The weight coefficient in the reward function and the hyperparameters in pre-learning are designed, as shown in Table 2.

V. SIMULATION VERIFICATION
To verify the effectiveness of the proposed DCR-TD3 for MOI-AGC, a simulation test is run on the power grid of a certain province, with the traditional algorithms for AGC generation power command dispatch (PI+GA, PI+PROP and PI+equal dispatch), the regular DDPG and the TD3 as comparison examples.
The interconnected power grid system of the province has 10 units. The specific control model diagram is shown in Fig. 3 and unit parameters in Table 5.

A. SIMULATION OF A PROVINCIAL INTERCONNECTED POWER GRID UNDER RANDOM STEP DISTURBANCE 1) PRE-LEARNING STAGE
In the pre-learning stage, a continuous sinusoidal power disturbance with a cycle of 1800 s, an amplitude of 900 MW, a duration of 1800 s and a phase of 0.5π is applied to area A. The specific control model is shown in Fig. 3, the unit parameters are shown in Table 5, and the training chart is shown in Fig. 4.
In Fig. 4, the curves show the average reward value of every episode for each algorithm. It can be seen that the learning rate is slow and there are obvious oscillations during learning for the TD3 and DDPG; although the average reward values of the DCR-DDPG and the DCR-TD3 are close, the DCR-TD3 converges to the optimal solution earlier and more stably. At the same time, as the DCR-TD3 adopts distributed training, its learning duration is significantly shorter than those of the TD3 and DDPG. In addition, the average reward values of the DCR-TD3 and DCR-DDPG after convergence are greater than those of the TD3 and the DDPG.

TABLE 3. Random step disturbance results of the provincial power grid system.

FIGURE 5. A provincial grid random disturbance power regulation command and total regulation power output change.

2) RANDOM STEP DISTURBANCE TEST
Random step power disturbances with amplitudes between −400 MW and 1500 MW are added to the provincial power grid, which consists of coal-fired units, LNG units, oil units, hydropower units, wind turbine units and photovoltaic units. The test results are shown in Fig. 5 to Fig. 8 and Table 3.
Based on Fig. 5 to Fig. 8 and Table 3, it can be concluded that the MOI-AGC based on DCR-TD3 delivers optimal control performance and the smallest regulation mileage payment under random step power disturbance compared with the other algorithms. Fig. 5 to Fig. 8 show the online optimization results for the MOI-AGC based on DCR-TD3, DCR-DDPG, TD3 and DDPG, and for the traditional algorithms (PI+GA, PI+PROP and PI+equal dispatch). It can be seen from Fig. 5 that the power control deviation of the DCR-TD3 is much smaller than that of the PI+PROP algorithm at the same moment, and the former also responds more quickly, meaning that the actual unit regulation power output of the DCR-TD3 is closer to the actual power disturbance at the same moment. This is because the DCR-TD3 makes more use of quick-response regulation units (G9, G10), as shown in Fig. 6. This also explains the small change in, and the quick recovery of, the frequency deviation of the DCR-TD3 compared with that of the traditional algorithms (Fig. 7). In addition, as the traditional algorithms use too many slow-response regulation units for frequency regulation, the total AGC power regulation commands output by their PI controllers are greater than the power disturbance, which causes some units to over-modulate and overshoot, increasing their regulation mileage and thus the regulation mileage payments. By contrast, the DCR-TD3 mostly uses fast-responding units for frequency regulation within the MOI-AGC framework, and so does not suffer from instability or overshooting due to improper PI controller parameter settings. In this way, not only is the AGC control performance improved, but the overshooting of each unit is also reduced, as the actual unit regulation power output always stays close to the actual power disturbance; therefore the regulation mileage is reduced and so is the regulation mileage payment. As shown in Fig. 8, the regulation mileage payment of the DCR-TD3 is lower than those of the traditional algorithms.
Based on Table 3, |Δf|, |e_ACE| and the regulation mileage payment of the DCR-TD3 are all the smallest; its CPS1 metric average value (C_CPS1) is the greatest.

B. SIMULATION UNDER MULTIPLE CONTINUOUS POWER DISTURBANCES
As wind turbine generation, photovoltaic generation and irregular random step power disturbances are highly random and uncontrollable, the distributed energy output models of photovoltaic and wind turbine generation are simplified in the two-area power grid system and treated only as random power disturbances of the AGC system, without participating in system frequency regulation. Random wind simulated by band-limited white noise is used for the wind output model, while the photovoltaic output model is derived from a simulation of the all-day light intensity change curve. Continuous random step power disturbances that occur irregularly are used to represent multiple continuous power disturbances in the power grid, and these multiple continuous power disturbances only appear when the total power disturbance exceeds 800 MW. Fig. A1 shows the 24-hour power disturbance curves of the wind turbine generation, photovoltaic generation and the irregular random step power disturbance. The random power disturbance introduced by interconnecting distributed energy with the system allows a more accurate analysis of DCR-TD3 performance in the two-area extremely random environment. Power disturbance with an appraisal cycle of 86,400 s is used to test the performance of seven algorithms: DCR-TD3, DCR-DDPG, TD3 and DDPG, which all use the MOI-AGC framework, and the traditional algorithms (PI+GA, PI+PROP and PI+equal dispatch). Fig. 9 and Fig. 10 show the online optimization results. As shown in Fig. 9, the total unit regulation power output curve of the DCR-TD3 is smoother, and its total regulation power output is closer to the power disturbance at the same moment than those of the other algorithms, which indicates that the control performance of DCR-TD3 is clearly better. As shown in Fig. 10, the regulation mileage payment of the DCR-TD3 is slightly lower than those of the other three MOI-AGC framework algorithms and far smaller than those of the three traditional algorithms.
The result statistics of the simulation examples for all the above-mentioned algorithms are shown in Table 4. The average absolute values of the frequency deviation of the seven algorithms are 0.00278 Hz, 0.00281 Hz, 0.00845 Hz, 0.00294 Hz, 0.00437 Hz, 0.00452 Hz and 0.00453 Hz respectively; |Δf|, |e_ACE| and the regulation mileage payment of the DCR-TD3 are the smallest, and its average CPS1 metric value (C_CPS1) is the greatest, as shown in Table 4. It can also be seen from the comparison under multiple continuous power disturbances that the control performance of the three other MOI-AGC algorithms, DCR-DDPG, TD3 and DDPG, is better than that of the traditional algorithms, and their frequency deviations are smaller. However, as their exploration and optimum-seeking processes are not properly optimized, their control performance and economy are not as good as those of the DCR-TD3. It can be concluded that under multiple continuous power disturbances with large-scale new energy, the MOI-AGC based on DCR-TD3 has a quicker response speed and better stability and economy than the traditional AGC framework combining a PI controller with dispatch; the DCR-TD3 also has better control performance and economy than the other three MOI-AGC framework algorithms and achieves a comprehensive optimum of control performance and economic benefits.

VI. CONCLUSION
In summary, the main contributions of this work are as follows: 1) The proposed MOI-AGC based on DCR-TD3 integrates the functions of the control algorithm and the optimization dispatch algorithm. Under the frequency regulation market mechanism, it offers better control performance and economic benefits in the secondary frequency regulation of a large power grid than the traditional approach of combining a control algorithm with an optimization algorithm, in an environment with multiple units and large-scale renewable energy.
2) As the algorithm is improved by distributed classification replay, a large number of samples can be learned from. Moreover, experience replay is performed by sampling from the two experience pools with different probabilities, so the control capability over the system is constantly improved and better decisions can be made. This provides a better control effect and avoids the weaknesses of the approach of combining the control algorithm with the optimization algorithm. Compared with the DDPG and TD3 algorithms, the DCR-TD3 obtains the optimal solution more easily, converges faster and is less prone to local optima.
3) A simulation is run for a provincial power grid under two types of disturbance. The simulation results indicate that the proposed method can significantly decrease the system frequency deviation and the total power deviation, obtain the greatest CPS1 metric value, and reduce the regulation mileage payment.

TABLE 5.