Power System Load Frequency Active Disturbance Rejection Control via Reinforcement Learning-Based Memetic Particle Swarm Optimization

Load frequency control (LFC) is necessary to guarantee the safe operation of power systems. Aiming at the frequency and power stability problems caused by load disturbances in interconnected power systems, an active disturbance rejection controller (ADRC) is designed. An ADRC has eight parameters that need to be adjusted, which is challenging to do manually, thus limiting the development of this approach in industrial applications; in both theory and application, there is still no unified and efficient parameter optimization method. The traditional particle swarm optimization (PSO) algorithm suffers from premature convergence and a high computational cost. Therefore, in this paper, we utilize an improved PSO algorithm, reinforcement-learning-based memetic particle swarm optimization (RLMPSO), for the parameter tuning of the ADRC to obtain better control performance for the controlled system. To highlight the advantages of the proposed RLMPSO-ADRC method and to prove its superiority, the results were compared with those of other control algorithms in both a traditional non-reheat two-area thermal power system and a nonlinear power system with a governor dead band (GDB) and a generation rate constraint (GRC). Moreover, the robustness of the proposed method was tested by simulations with parameter perturbations and different working conditions. The simulation results showed that the proposed method can meet the demand in LFC for the frequency deviation to stabilize to 0 with higher performance, and it is worthy of popularization and application.


I. INTRODUCTION
Maintaining the relative stability of the frequency and voltage is a prerequisite for ensuring the safe operation of a power system. As a well-known issue in power systems research, how to implement load frequency control (LFC) to ensure a constant frequency of a power system and improve the power quality and economic benefits has been widely studied. Furthermore, as the demand for power quality has increased, interconnected power systems have emerged. However, the exchange power of tie-lines in an interconnected power system is susceptible to disturbances, which may cause unnecessary economic losses. Therefore, determining how to design a controller to ensure the safe, stable, and efficient operation of the power system is of great significance.
(The associate editor coordinating the review of this manuscript and approving it for publication was Seyedali Mirjalili.)
The essential requirement of a control strategy is the ability to handle parameter uncertainty and achieve good anti-disturbance performance while obtaining the desired dynamic performance to the greatest possible extent. Moreover, the design of the controller should not be too complicated, so as to provide practical solutions for engineering debugging. As an important means of active frequency modulation in power systems, LFC control strategies currently cover sliding mode control (SMC) [1], linear matrix inequalities [2], robust control [3], and predictive control [4]. As the most traditional control methods, proportional-integral (PI) [5] and proportional-integral-derivative (PID) [6] controllers have been the most widely used in the LFC field due to their clear principles, simple implementation, and certain degree of robustness. They can restore the stability of a power system, but their performance is limited. Other control strategies, such as adaptive control, can make corresponding real-time adjustments to controller parameters or rules, but the design is complex, and they are not easy to apply in industry [7]. A variable-structure controller can respond to system disturbances and parameter changes at faster speeds, thereby significantly improving the dynamic performance of the controlled system, but the practical applications of such a controller are limited [8]. For LFC, a robust control strategy can theoretically cope with the problems caused by system disturbances and parameter modeling errors. However, the controller order is high, and the design algorithm is relatively complex, relying on engineering experience [9]. Model predictive control (MPC) exhibits strong robustness and adaptability, and it does not require high model accuracy. However, the stability analysis and robust performance detection of multivariable MPC need further study [10].
In practical applications, PID still occupies a dominant position in engineering. Han [11], [12] first proposed active disturbance rejection control (ADRC), which builds on the PID idea of eliminating the effects of disturbances based on errors. It attributes all uncertain factors and external disturbances of the system to a total disturbance estimated from the system inputs and outputs, and this disturbance can then be eliminated by designing a control law. ADRC has been widely used in many industrial fields, such as aircraft control [13], motor control [14], ship control [15], and vehicle control [16]. ADRC has also been applied to LFC. For example, Rahman and Chowdhury [17] compared the control effects of ADRC and PID for an LFC system, and the simulation results showed that ADRC is a powerful substitute for PID and has significant performance advantages for LFC. Zheng et al. [18] applied ADRC to a three-area interconnected power system in both regulated and deregulated environments. It is worth mentioning that most of the ADRC controllers currently designed for LFC use linear ADRC (LADRC), proposed by Gao [19], in which the internal structure of the ADRC is greatly simplified. To a certain extent, this is a consequence of the difficulty of ADRC parameter tuning. Therefore, determining how to tune the parameters of ADRC is of great significance to promote its application.
Intelligent optimization algorithms are derived from the observation and simulation of biological systems in the natural world, and they are a new class of strategies suitable for optimization problems. In terms of parameter tuning, intelligent optimization algorithms, such as particle swarm optimization (PSO) [20], the genetic algorithm (GA) [21], simulated annealing (SA) [22], and ant colony optimization (ACO) [23], have shown good optimization capabilities. However, some limitations and shortcomings of intelligent optimization algorithms cannot be ignored. As stated by the 'No Free Lunch' (NFL) theorem, no single algorithm can be designated as the best algorithm applicable to all optimization problems. Therefore, various optimized and improved intelligent algorithms are constantly being proposed. For example, the standard PSO, on the one hand, can quickly fall into local optima at the beginning of the search process; on the other hand, its computational cost increases with the sample population size [24]. Therefore, improved PSO algorithms such as differential evolution particle swarm optimization (DEPSO) [25] and reinforcement-learning-based memetic particle swarm optimization (RLMPSO) [26] came into being. RLMPSO is an improvement from the memetic algorithm (MA) perspective, where the MA is a hybrid algorithm that consists of a local search method, reinforcement learning (RL), and the globally searching PSO algorithm.
In this study, the proposed ADRC is evaluated on a two-area thermal power system, which mainly contains a non-reheat turbine and nonlinear links with a generation rate constraint (GRC) and a governor dead band (GDB). For the first time, RLMPSO is used to adjust the controller's parameters. To verify the effectiveness of the proposed method, simulation analysis on both linear and non-linear power systems with a GRC and a GDB was conducted, and the results were compared with those from other methods. Moreover, robustness tests were also carried out for uncertain parameter values and disturbances. The main contributions of this study are summarized as follows: (1) Aiming at the two-area interconnected power system with non-reheat turbines, two third-order ADRC controllers were designed.
(2) The RLMPSO algorithm was used to optimize the eight parameters in the ADRC, and the effectiveness of the designed method was verified by comparison with other methods.
The rest of the paper is arranged as follows. Section 2 describes the mathematical model of the LFC system. In Section 3, the ADRC is designed for LFC. Section 4 introduces the parameter optimization process based on RLMPSO. Section 5 shows the simulation results, and the corresponding analysis is presented. Section 6 concludes this paper.

II. POWER SYSTEM MODEL DESCRIPTION
LFC is a popular research subject for power systems, where the stability of the frequency is a prerequisite for the safe and reliable operation of the power grid. Fig. 1 shows a schematic diagram of the linear structure of the i-th area in the interconnected power system, which mainly includes three links: the governor G_gi, the turbine G_ti, and the generator G_pi. The governor mainly controls the guide vane or intake valve through the feedback of the speed deviation to control the speed and load of the turbine. The turbine acts as a prime mover that generates mechanical power to drive the generator, which converts mechanical energy into electrical energy. Generally, due to the strong coupling between interconnected power systems, when a load disturbance occurs in one area, the other areas will also be affected, resulting in instability of the entire grid. Therefore, the main goal of LFC is to keep the frequency deviation f_i within a safe range by overcoming the influence of load disturbances. Furthermore, in the interconnected power system, the exchange power P_tiei of the tie-line between two adjacent areas also needs to be stabilized at its planned value. This study does not consider the load frequency control problem under deregulated environments involving economic benefits, so the planned value of the exchange power of the tie-line is 0. The mathematical expressions of the above three links are

G_gi(s) = 1 / (T_gi s + 1),
G_ti(s) = 1 / (T_ti s + 1),
G_pi(s) = K_pi / (T_pi s + 1),   (1)

where T_gi, T_ti, and T_pi are the time constants of the governor, the non-reheat turbine, and the generator, respectively, and K_pi represents the gain of the generator. According to Fig. 1, f_i can be derived as

f_i = G_pi(s) [ G_ti(s) G_gi(s) (u_i − f_i / R_i) − P_Li − P_tiei ].   (2)

The expression of the area control error (ACE) is written as

ACE_i = B_i f_i + P_tiei,   (3)

where u_i is the increment of the valve position change, which needs to be adjusted by the controller, and P_Li denotes the load disturbance. P_tiei is expressed as

P_tiei = (2π / s) Σ_{j=1, j≠i}^{N} T_ij (f_i − f_j).   (4)

The meanings of the symbols shown in Fig. 1 are described in Table 1.
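For illustration, the three first-order links of Eq. 1 can be integrated numerically. The Python sketch below (a minimal illustration, not the paper's simulation setup; the function name and all parameter values are illustrative) simulates one area with only primary droop feedback under a step load disturbance.

```python
# Sketch: Euler integration of one area's governor-turbine-generator chain
# under a step load disturbance, with primary droop feedback only.
# All parameter values are illustrative, not the paper's.

def simulate_area(dP_L=0.1, T_g=0.08, T_t=0.3, K_p=120.0, T_p=20.0,
                  R=2.4, dt=0.001, T_end=20.0):
    """Return the frequency deviation after T_end seconds."""
    x_g = x_t = df = 0.0          # governor, turbine, frequency-deviation states
    for _ in range(int(T_end / dt)):
        u = -df / R               # primary speed-droop feedback
        x_g += dt * (u - x_g) / T_g                  # governor:  1/(T_g s + 1)
        x_t += dt * (x_g - x_t) / T_t                # turbine:   1/(T_t s + 1)
        df += dt * (K_p * (x_t - dP_L) - df) / T_p   # generator: K_p/(T_p s + 1)
    return df
```

With droop alone, the frequency settles at a nonzero offset (−K_p·ΔP_L / (1 + K_p/R) for this sketch); secondary control, i.e., the LFC controller, is needed to drive the deviation back to zero.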

III. DESIGN OF ACTIVE DISTURBANCE REJECTION CONTROL (ADRC)
ADRC is derived from the combination of the classic PID and modern control theory, which has the advantages of not relying on model information and eliminating unknown disturbances. It only needs to know the order of the system. Therefore, this section mainly focuses on the design of the ADRC for the above model. For the multi-area interconnected power system, to stabilize the frequency deviation and the exchange power of the tie-line to 0, tie-line bias control (TBC) mode is adopted, which uses ACE i as the controller input. In addition, in view of the coupling problem between areas, Tan [27] proposed a decentralized control method and carried out a theoretical proof, in which the controller in each area can be designed separately on the premise of ignoring the tie-line exchanged power.

A. TRANSFER FUNCTION PROCESSING
According to the decentralized control mentioned above, ignoring P_tiei, Eq. 3 can be written as

Y(s) = G_i(s) U(s) + G_di(s) D(s),   (5)

where Y(s), U(s), and D(s) are the Laplace transforms of ACE_i, u_i, and P_Li, respectively. Combining this with Eq. 1, the expressions of G_i(s) and G_di(s) can be presented as

G_i(s) = B_i K_pi / [ (T_gi s + 1)(T_ti s + 1)(T_pi s + 1) + K_pi / R_i ],
G_di(s) = −B_i K_pi (T_gi s + 1)(T_ti s + 1) / [ (T_gi s + 1)(T_ti s + 1)(T_pi s + 1) + K_pi / R_i ].   (6)

Converting the above equation into differential-equation form, we obtain

y''' + a_2 y'' + a_1 y' + a_0 y = b u + w(d),   (7)

where the coefficients a_0, a_1, a_2, and b follow from Eq. 6, and w(d) collects the terms in the load disturbance. Therefore, Eq. 5 can be organized into the following third-order system:

y''' = f(y, y', y'', w(d)) + b_0 u,   (8)

where f is the total disturbance containing the uncertainties caused by parameter perturbations of the system model and external disturbances caused by load disturbances. Since the actual value of b cannot be known in practice, the adjustable parameter b_0 is used to substitute for b.

B. DESIGN OF ADRC FOR LOAD FREQUENCY CONTROL (LFC)
ADRC is composed of a tracking differentiator (TD), extended state observer (ESO), and nonlinear state error feedback (NLSEF), where the TD can suppress the noise amplification effect in the input signal, the ESO can estimate the total unknown disturbance f , and the NLSEF can eliminate the estimated disturbance. Moreover, the ADRC structure diagram corresponding to the third-order system is shown in Fig. 2.
Because the control goal in this study seeks to make ACE i (t) = 0 at steady state, the tracking reference input v = 0. Moreover, since the load disturbance of the power system does not contain white noise, there is no need for a filter. Thus, the design of the TD can be ignored.
According to Eq. 8, the states can be defined as

x_1 = y, x_2 = y', x_3 = y'', x_4 = f.   (9)

The corresponding state equation is

x_1' = x_2, x_2' = x_3, x_3' = x_4 + b_0 u, x_4' = f', y = x_1.   (10)

The ESO can be expressed as

e = z_1 − y,
z_1' = z_2 − β_01 e,
z_2' = z_3 − β_02 fal(e, a_1, δ),
z_3' = z_4 − β_03 fal(e, a_2, δ) + b_0 u,
z_4' = −β_04 fal(e, a_3, δ),   (11)

where z_1, z_2, z_3, and z_4 are the estimated values of x_1, x_2, x_3, and x_4, respectively. β_01, β_02, β_03, and β_04 represent the observer gains, which, when appropriate, can ensure that the state estimates are close to the actual values. The nonlinear function fal(·) has the following form:

fal(e, a, δ) = |e|^a sgn(e),  |e| > δ,
fal(e, a, δ) = e / δ^(1−a),   |e| ≤ δ,   (12)

where a and δ are the adjustable parameters.
After the disturbance state is estimated, it needs to be eliminated. We define e_i = v_i − z_i, i = 1, 2, 3, and then the control law can be designed as

u_0 = β_1 fal(e_1, a_1, δ_0) + β_2 fal(e_2, a_2, δ_0) + β_3 fal(e_3, a_3, δ_0),   (13)

where β_i, i = 1, 2, 3 are the feedback control gains, and fal(·) has the same form as Eq. 12. u_i in Eq. 8 is then expressed as

u_i = (u_0 − z_4) / b_0.   (14)

By substituting Eqs. 13 and 14 into Eq. 8, under the premise of f ≈ z_4, we obtain

y''' = f − z_4 + u_0 ≈ u_0.   (15)

Therefore, appropriate feedback control gains can ensure that ACE_i converges to 0. In this study, δ = 0.05, δ_0 = 0.005, a_1 = 0.3, a_2 = 0.8, and a_3 = 1.1.
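A minimal discrete-time sketch of the fal(·) function, one ESO update, and the control law, in Python. The observer and feedback gains below are placeholders rather than tuned values; only the structure and the fal(·) exponents (0.3, 0.8, 1.1) follow the text.

```python
import math

def fal(e, a, delta):
    """Nonlinear gain of Eq. 12: linear near zero, power law outside."""
    if abs(e) > delta:
        return math.copysign(abs(e) ** a, e)
    return e / delta ** (1.0 - a)

def eso_step(z, y, u, h, b0, betas, a=(0.3, 0.8, 1.1), delta=0.05):
    """One Euler step of the fourth-order ESO; z = [z1, z2, z3, z4].
    betas = (beta01, beta02, beta03, beta04) are placeholder observer gains."""
    z1, z2, z3, z4 = z
    e = z1 - y
    z1 += h * (z2 - betas[0] * e)
    z2 += h * (z3 - betas[1] * fal(e, a[0], delta))
    z3 += h * (z4 - betas[2] * fal(e, a[1], delta) + b0 * u)
    z4 += h * (-betas[3] * fal(e, a[2], delta))
    return [z1, z2, z3, z4]

def control_law(e1, e2, e3, z4, b0, k=(10.0, 5.0, 1.0), delta0=0.005):
    """Nonlinear feedback (Eq. 13) minus disturbance compensation (Eq. 14);
    k holds placeholder feedback gains beta1..beta3."""
    u0 = (k[0] * fal(e1, 0.3, delta0) + k[1] * fal(e2, 0.8, delta0)
          + k[2] * fal(e3, 1.1, delta0))
    return (u0 - z4) / b0
```

Note that fal(·) is continuous at |e| = δ, since δ^a = δ/δ^(1−a); the linear segment avoids high-frequency chattering near the origin.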
The above design process shows that for a third-order controlled system, the parameters that need to be adjusted in the ADRC are β 01 , β 02 , β 03 , β 04 , β 1 , β 2 , β 3 , and b 0 . It is very difficult to manually adjust eight parameters at the same time, and an effective optimization method is urgently needed.

IV. REINFORCEMENT-LEARNING-BASED MEMETIC PARTICLE SWARM OPTIMIZATION (RLMPSO)-OPTIMIZED ADRC
As a control technology for estimating and compensating for uncertain disturbances, ADRC has attracted widespread attention in industry and academia since its proposal. However, there has not been a set of efficient and unified tuning rules for parameter adjustment. In this study, the RLMPSO algorithm is applied for the parameter tuning problem of the nonlinear ADRC to test the superiority of the combination of RL and the intelligent optimization algorithm in a new field.

A. BASICS OF PARTICLE SWARM OPTIMIZATION (PSO) AND REINFORCEMENT LEARNING (RL)

1) PRINCIPLE OF PSO
The PSO originated from the simulation of bird predation behavior. The main concepts of PSO include a population, potential solutions (called particles), and iterative search space, where each particle is composed of a position, speed, and fitness value. It moves at an adaptive speed in the search space and retains the best position it has ever visited, that is, the position with the lowest function value (in general, only the minimization problem is considered). At the same time, it tracks the current optimal particle in the solution space to realize the information exchange between particles and then adjusts the flight direction and distances between the particles to complete the optimization search.
In the PSO algorithm, each iteration needs to complete two main processes: the determination of the individual best position p_best of each particle and the determination of the best particle position g_best of the group. The position and velocity of each particle in the population are updated according to

V_{k+1} = ω_k V_k + c_1 r_1 (p_best − X_k) + c_2 r_2 (g_best − X_k),
X_{k+1} = X_k + V_{k+1},   (16)

where k is the current iteration number, X_k is the current position of the particle, and V_k is the current velocity of the particle. c_1 and c_2 represent the learning factors, ω_k denotes the inertia weight, and r_1 and r_2 are random numbers in [0, 1]. The initial population is generally randomly generated by

X_i = L + R_0 ∘ (U − L),
V_i = V_min + R'_0 ∘ (V_max − V_min),   (17)

where X_i and V_i represent the position and velocity of particle i, respectively, and R_0 and R'_0 are random vectors with the same dimension as X_i, each component lying in [0, 1]. U and L represent the upper and lower bounds of the particle position (that is, of the parameter solution), respectively. V_max and V_min denote the upper and lower bounds of the particle velocity, respectively.
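As a concrete illustration of Eqs. 16 and 17, the sketch below implements the standard PSO loop in Python; the bounds, learning factors, and the sphere test objective are illustrative choices, not the paper's settings.

```python
import random

def pso_minimize(f, dim=2, n=20, iters=200, L=-5.0, U=5.0,
                 w=0.7, c1=1.5, c2=1.5, seed=1):
    """Standard PSO: random initialization (Eq. 17) and the velocity/position
    update (Eq. 16), with velocity clamped to 20% of the position range."""
    rng = random.Random(seed)
    vmax = 0.2 * (U - L)
    X = [[L + rng.random() * (U - L) for _ in range(dim)] for _ in range(n)]
    V = [[-vmax + rng.random() * 2 * vmax for _ in range(dim)] for _ in range(n)]
    pbest = [x[:] for x in X]
    pval = [f(x) for x in X]
    g = min(range(n), key=lambda i: pval[i])
    gbest, gval = pbest[g][:], pval[g]
    for _ in range(iters):
        for i in range(n):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                V[i][d] = (w * V[i][d] + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                V[i][d] = max(-vmax, min(vmax, V[i][d]))
                X[i][d] = max(L, min(U, X[i][d] + V[i][d]))
            fx = f(X[i])
            if fx < pval[i]:                  # update individual best
                pbest[i], pval[i] = X[i][:], fx
                if fx < gval:                 # update global best
                    gbest, gval = X[i][:], fx
    return gbest, gval

# Example: minimizing the sphere function converges near the origin.
best, val = pso_minimize(lambda x: sum(v * v for v in x))
```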

2) PRINCIPLE OF RL
RL is a method to find the optimal strategy through the continuous interaction between an agent and an uncertain environment. When the intelligent agent ''communicates'' with the environment through actions, the environment will return the current reward to the agent, through which the action can be evaluated [28]. The basic framework is shown in Fig. 3. We assume that the environment produces state s t at time t, and the reward value r t can be obtained based on the reward function. The agent can obtain the optimal action a t through the state-action value function based on the cumulative reward R c , where the state-action value function can be regarded as the evaluation value of the action.
Q-learning is the most common RL algorithm, first proposed by Watkins in 1989 [29]. As an RL method based on a time difference, the selection of the current state and action in Q-learning can be regarded as an event, and any event corresponds to a state-action value function Q(s t , a t ), which is stored in the Q table. Through the updated iteration of the Q table, the intelligent agent in Q-learning will gradually approach the optimal strategy for sequential decision-making problems in the continuous in-depth interaction between the agent and the environment. A typical learning process of Q-learning can be described as follows: Step 1: Model initialization. For all discrete states s ∈ S and actions a ∈ A, initialize their corresponding value functions Q(s, a) ∈ Q.
Step 2: Initialize state s. On the basis of the Q table, select the initial action a by an ε-greedy policy:

π(a|s) = argmax_{a∈A} Q(s, a)   with probability 1 − ε,
π(a|s) = a random action in A   with probability ε,   (18)

where ε represents the probability of exploration. From Eq. 18, we observe that the action corresponding to the largest Q value is selected with probability 1 − ε; otherwise, an action is randomly selected from the action space.
Step 3: Perform action a t , thus obtaining the corresponding reward value r t and the state s t+1 at the next moment.
Step 4: Update Q(s_t, a_t) according to

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ],   (19)

where α is the learning rate, and γ represents the discount factor, which reflects the importance of rewards at future moments.
Step 5: Determine whether to end the process. If yes, output the optimal strategy; otherwise, return to the third step.
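Steps 1-5 can be sketched in a few lines of Python. The toy chain environment below is purely illustrative (it is not the particle-control task of the RLMPSO); only the ε-greedy selection of Eq. 18 and the update rule of Eq. 19 mirror the text.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain: moving 'right' reaches a rewarded
    terminal state. Actions: 0 = left, 1 = right."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]     # Step 1: initialize Q table
    for _ in range(episodes):
        s = 0                                     # Step 2: initialize state
        while s < n_states - 1:
            if rng.random() < eps:                # explore with probability eps
                a = rng.randrange(2)
            else:                                 # exploit: greedy action
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1   # Step 3: act, observe
            r = 1.0 if s2 == n_states - 1 else 0.0
            target = r + gamma * max(Q[s2])       # Step 4: TD update (Eq. 19)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2                                # Step 5: loop until terminal
    return Q

Q = q_learning()
# After training, "right" dominates in every non-terminal state.
```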

B. RLMPSO
The particle swarm algorithm suffers from premature convergence and a high computational cost. Improving the particle swarm algorithm from the perspective of a memetic algorithm is a significant research direction.

1) PRINCIPLE OF RLMPSO
The main idea of RLMPSO is to embed RL into the search stage of each particle in the particle swarm algorithm. Under RL control, each particle performs one of five possible operations: exploration, convergence, high-jump, low-jump, and fine-tuning. Moreover, each action is rewarded or punished based on its performance. In addition, the population size in the RLMPSO algorithm is small, and each particle evolves independently; for example, one particle may perform exploration while the other particles perform their own operations. The schematic diagram of the RLMPSO structure, in which Q-learning is adopted, is shown in Fig. 4. Fig. 4 shows the structure of the entire RLMPSO, which integrates RL and PSO. Particles in the PSO act as agents in the Q-learning, and the search space of the particles is used as the environment in Q-learning. The state is expressed as the current operation of each particle, namely exploration, convergence, high-jump, low-jump, or fine-tuning. Actions are defined as the changes from one state to another. In other words, Q-learning controls the operation of each particle in the PSO group. Specifically, RL adaptively switches particles from one operation (state) to another based on the performances of the particles. Positive rewards are given to particles that perform well, and particles that perform poorly are punished.

2) DEFINITION OF ACTION
As mentioned earlier, there are five operations that each particle can perform. The five operations are introduced in this section.
Exploration and convergence are two operations distinguished by the strength of the particle's global search and by whether the particle prefers to track the current global best position g_best, which is realized through different values of ω, c_1, and c_2. In exploration, the particle probes the solution space with a larger ω, and, to maximize the global search, it moves away from the current global best position g_best, with c_1 > c_2. The convergence operation is the opposite: the global search power of the particles is attenuated, so the particles converge in the direction of g_best. In this case, c_1 < c_2, and ω is small.
The main idea of the jump operations is to avoid premature convergence of particles by changing the individual best particle p_best,i (i is the particle subscript) to escape possible local optima. Specifically, a random value is added to each dimension (that is, to each parameter value to be optimized) of p_best,i:

p_best,i = p_best,i + r_n,   (20)

where r_n is a normally distributed random number, r_n ∼ N(0, σ²). For the high-jump operation, the particle changes p_best,i with a larger step length, and the standard deviation σ is close to 1; for the low-jump operation, the particle step size is smaller, and σ is close to 0. Similar to the jump operations, the fine-tuning operation also adds a random value to each dimension of the individual best particle p_best,i, so that the entire population performs a local search within the neighborhood of the current global optimal solution. The difference is that the fine-tuning of each dimension of the manipulated particle is performed independently, and the fine-tuning operation performs a certain number of fitness evaluations in cycles when each dimension changes. During the evaluation process, p_best,i and the corresponding velocity variable V_i,d (i represents the current particle subscript, and d is the dimension subscript) change as the fitness value changes.
For example, suppose that particle i performs the fine-tuning operation, the maximum dimension (the number of parameters to be optimized) is D, and the maximum number of fitness evaluations is E_m. For the current dimension d ∈ [1, D], the current fitness evaluation number of this dimension is e ∈ [1, E_m], and the objective function to be minimized is f(x). The realization process is described as follows: Step 1: Update the velocity V_i,d using its update equation, where L_i,d is the step size, a is the acceleration factor, p is the parameter that controls the speed attenuation, and r is a uniformly distributed random number in [−0.5, 0.5]. Step 2: Record the original best fitness value f_best and calculate f(p_best,i + V_i,d). Step 3: If f(p_best,i + V_i,d) < f_best, update p_best,i to p_best,i + V_i,d. Step 4: Update L_i,d with its update equation and let e = e + 1. Step 5: If e ≤ E_m, return to Step 1; otherwise, let d = d + 1.
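The high-jump and low-jump perturbations of Eq. 20 can be sketched as below, assuming the Gaussian step is drawn independently per dimension; the function name and σ values are illustrative.

```python
import random

def jump(p_best, sigma, rng=random):
    """Return a perturbed copy of p_best (Eq. 20): each dimension is shifted
    by a Gaussian step r_n ~ N(0, sigma^2). sigma near 1 -> high jump,
    sigma near 0 -> low jump."""
    return [p + rng.gauss(0.0, sigma) for p in p_best]

rng = random.Random(42)
high = jump([1.0, 2.0, 3.0], sigma=0.9, rng=rng)    # large escape step
low = jump([1.0, 2.0, 3.0], sigma=0.05, rng=rng)    # small local step
```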

3) Q TABLE
Unlike the standard PSO, RLMPSO can perform any operation at any stage of the search process, namely exploration, convergence, high-jump, low-jump, and fine-tuning. RL is responsible for tracking the best performance of each particle. Each particle has its own Q table, which belongs to that particle alone and not to the population. The dimension of the Q table is 5 × 5:

         E     C     H     L     F
    E [ a_11  a_12  a_13  a_14  a_15 ]
    C [ a_21  a_22  a_23  a_24  a_25 ]
    H [ a_31  a_32  a_33  a_34  a_35 ]
    L [ a_41  a_42  a_43  a_44  a_45 ]
    F [ a_51  a_52  a_53  a_54  a_55 ]   (24)

In Eq. 24, rows represent states, and columns represent actions. E, C, H, L, and F represent the five operations: exploration, convergence, high-jump, low-jump, and fine-tuning, respectively. The state E indicates that the current operation of the particle is exploration, and the action E indicates that the next operation performed by the particle is exploration. Because the number of fitness evaluations for each dimension in the fine-tuning operation is calculated based on the number of iterations, the fine-tuning operation requires a large number of global iterations, while the other operations require only one. At the same time, since fine-tuning operates on the individual best position, its execution must be postponed until after the global operations, that is, after the exploration, convergence, and jump operations. Therefore, to delay the execution of fine-tuning at the beginning of the search process and give higher priority to the other operations, the initial Q-table entries of action F (the last column of the Q table) are set to a very large negative value. Before activating the fine-tuning operation, RLMPSO must perform M iterations and then take the minimum value (which may be negative) of the other four columns of the Q table as the initial value of the column of action F.
During the RLMPSO execution, the best action for the current state can be retrieved from the Q table:

a_{t+1} = argmax_{a_i ∈ A} Q(s_t, a_i),   (25)

where a_{t+1} is the action to be performed next, that is, the best action for the current state, s_t represents the current state, a_i ∈ A denotes an action, and A is the action space. In summary, the principle diagram of the movement adjustment of each particle in RLMPSO is shown in Fig. 5.

C. DESIGN OF RLMPSO-OPTIMIZED ADRC
As reported previously [26], RLMPSO performs significantly better than many other PSO variants in optimizing multiple unimodal functions, multimodal functions, composite functions, and two practical optimization problems involving train gear and pressure vessel design. This section uses the RLMPSO algorithm to optimize the ADRC parameters to improve the dynamic performance of the controlled system. The following time-weighted integral of absolute error (ITAE) is selected as the objective function:

ITAE = Σ_{i=1}^{n} ∫_0^T t ( |f_i(t)| + |P_tiei(t)| ) dt,   (26)

where n is the total number of areas of the controlled system, and T is the simulation time. When the ITAE decreases, the controller receives an instant reward of 1; otherwise, it receives −1. The schematic diagram of the ADRC based on the RLMPSO algorithm is shown in Fig. 6.
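A discrete approximation of the ITAE fitness can be computed as below; the decaying test signal is illustrative, standing in for the simulated deviations f_i and P_tiei.

```python
import math

def itae(t, signals):
    """Riemann-sum approximation of the time-weighted integral of the summed
    absolute deviations: sum over signals of integral t*|e(t)| dt."""
    J = 0.0
    for k in range(1, len(t)):
        dt = t[k] - t[k - 1]
        J += t[k] * sum(abs(s[k]) for s in signals) * dt
    return J

# Example: a deviation decaying to zero accumulates a small, finite ITAE.
ts = [0.01 * k for k in range(1001)]            # 0 to 10 s
df = [0.1 * math.exp(-ti) for ti in ts]         # decaying frequency deviation
J = itae(ts, [df])
```

Because of the factor t, early transients are penalized lightly while persistent late errors dominate, which rewards controllers that settle quickly.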
The parameters of the RLMPSO are shown in Table 2. In addition, the ranges [L, U ] of the controller parameters are given in Table 3. The corresponding speed range is 0.2[L, U ]. The description of the optimization process is shown in Fig. 7.

V. NUMERICAL SIMULATION RESULTS AND ANALYSIS

A. TRADITIONAL TWO-AREA NON-REHEAT THERMAL POWER SYSTEM
In this section, the two-area non-reheat thermal power system [30] is used as the simulation model, where the two areas of the interconnected power system have the same structure, and each area contains a non-reheat turbine. The model parameters are listed as follows: B_1 = B_2 = 0.425 p.u.MW/Hz; R_1 = R_2 = 2.4 Hz/p.u.; T_g1 = T_g2 = 0.03 s; T_t1 = T_t2 = 0.3 s; K_p1 = K_p2 = 120 Hz/p.u.; T_p1 = T_p2 = 20 s; T_12 = 0.545 p.u.; a_12 = −1.

1) SIMULATION ANALYSIS OF NOMINAL PARAMETERS
Suppose that at t = 0 the first area is disturbed by a step load perturbation (SLP) of P_L1 = 0.1 p.u. According to the discussion above, each area adopts a third-order ADRC. The optimized parameters are shown in Table 4. Fig. 8 shows the response results of the frequency deviation and tie-line exchanged power, where the optimization results of various algorithms, including BFOA-PID [31], HBFOA-PID [32], hPSO-PS-FUZZY-PID [33], TLBO-PID [34], ISFS-PID [35], DSA-FOPID [36], and DSA-FOPI-FOPD [36], are also given. Table 5 shows the settling time and ITAE performance index results for the frequency deviation in each area and the system tie-line exchanged power. Fig. 8 shows that the frequency deviation and tie-line power deviation in each area stabilized at zero in the steady state, which means that the steady-state performances of all the controllers were the same. As shown in Table 5, the proposed RLMPSO-ADRC LFC controller achieved a significantly smaller ITAE value than the LFC controllers from previous studies. The ITAE value of this method was 0.00015, approximately 1/120 of the best ITAE value among the other methods (0.018, obtained by DSA-FOPI-FOPD). This means that the proposed method has better control performance.

2) ROBUSTNESS ANALYSIS
In modern complex power systems, the uncertainty of the system parameters is a crucial issue. Therefore, it is crucial for the LFC controller to be robust to parameter uncertainty in the system. To examine the robustness of the proposed control strategy, Figs. 9 and 10 show the time-domain response curves when the system model parameters B, R, T_p, K_p, T_g, and T_t were changed by +30% and −30%, respectively. The response performance of the proposed method was better than those of the other methods, i.e., it had the minimum overshoot and undershoot and the shortest stabilization time, thus verifying the robustness of the proposed method. In addition, to test the ability of the proposed method to suppress different load disturbances, Fig. 11 shows the output response of the system controlled by the proposed method under different load disturbances, where P_L1 = 0 p.u. and P_L2 = 0.1 p.u. during t = 0-10 s, and P_L1 = 0.1 p.u. and P_L2 = 0.1 p.u. during t = 10-20 s.
As shown in Fig. 11, the RLMPSO-ADRC control system could always suppress the system frequency deviation and the fluctuations of the tie-line power deviation under different load disturbances and restore the system to a stable state in a very short time, thus demonstrating that the proposed LFC controller had good disturbance rejection capabilities and robustness.

B. TWO-AREA POWER SYSTEM WITH NONLINEARITY
In practice, the system will inevitably be subject to the internal constraints of the physical system dynamics. The nonlinearities caused by the GRC and the GDB are considered, as shown in Fig. 12 and 13, respectively. The GRC is a saturation nonlinearity that arises because the generator has difficulty responding on demand to a large load disturbance and cannot provide a sufficient rate of change of output power. The GDB is the range of speed change within which the governor produces no valve action, which avoids excessive governor movement caused by small speed deviations. Similarly, using the third-order ADRC, where the controller parameter ranges are still as shown in Table 3, the optimized parameter results are shown in Table 6. For a step load disturbance of P_L2 = 0.02 p.u. added to the second area at t = 0, the simulation results are shown in Fig. 14.
To facilitate the performance comparison with other algorithms, this section uses the following performance indicators for evaluation: the integral absolute error (IAE), the overshoot O_sh, the undershoot U_sh, and the settling time T_s, as shown in Table 7, where IAE = ∫_0^∞ |e(t)| dt. As shown by the time-domain response curves in Fig. 14, the RLMPSO-ADRC had the smallest fluctuations of f_1, f_2, and P_tie compared with the other four algorithms and the fastest recovery speed. Table 7 also numerically shows the effectiveness of the proposed method. The transient performance and response speed were significantly better than those of the other algorithms, thus verifying the effectiveness of the proposed method for the LFC control system.

VI. CONCLUSION
In this study, for the problem of frequency instability caused by load disturbances in a power system, a load-frequency active disturbance rejection controller was designed. In view of the difficulty of determining the controller parameters, we introduced the RLMPSO algorithm, which combines RL with PSO and has a better convergence speed. To verify the effectiveness of the proposed RLMPSO-ADRC, we conducted many simulations and comparison experiments. We first applied the proposed method to the traditional two-area non-reheat thermal power system and compared its performance with those of previously reported methods. Compared with other strategies, the proposed method showed a smaller response deviation and settling time, thus highlighting its advantages. The robustness of the method in the power system was then studied. The system response under critical parameter states and different disturbance conditions maintained good dynamics, proving that the proposed method has good robustness to the uncertainties of the system. Finally, the proposed method was applied to a nonlinear power system containing a GDB and a GRC, and the dynamic response performance of the system was compared with that of control strategies such as predictive control and conventional control. The results showed that the method has a stronger ability to suppress system load disturbances, which indicates that the proposed method is an effective solution to LFC problems. Therefore, the proposed method has theoretical significance and research value for the application of ADRC in actual LFC systems.
YUEMIN ZHENG was born in 1996. She received the B.E. degree from Shijiazhuang Tiedao University, Shijiazhuang, China, in 2018. She is currently a Graduate Student with Nankai University, Tianjin, China. Her current research interests include active disturbance rejection control and reinforcement learning. VOLUME 9, 2021