Efficient Load Frequency Control of Renewable Integrated Power System: A Twin Delayed DDPG-Based Deep Reinforcement Learning Approach

Power systems have been evolving dynamically due to the integration of renewable energy sources, making it more challenging for power grids to control frequency and tie-line power variations. In this context, this paper proposes an efficient automatic load frequency control of a hybrid power system based on deep reinforcement learning. By incorporating intermittent renewable energy sources, variable loads, and electric vehicles, the complexity of the interconnected power system is increased for a more realistic representation. The proposed method tunes the proportional-integral-derivative (PID) controller parameters using an improved twin delayed deep deterministic policy gradient (TD3) based reinforcement learning agent, where a non-negative fully connected layer with an absolute-value function is added to avoid negative gain values. Multiple deep reinforcement learning agents are trained to obtain the optimal controller gains for the given two-area interconnected system, and each agent uses the local area control error information to minimize the deviations in frequency and tie-line power. The integral absolute error of the area control error is used as a reward function to derive the controller gains. The proposed approach is tested under random load-generation disturbances along with nonlinear generation behaviors. The simulation results demonstrate the superiority of the proposed approach compared to other techniques presented in the literature and show that it can effectively cope with nonlinearities caused by load-generation variations.


I. INTRODUCTION
The growing energy demand, environmental impacts, and depletion of fossil fuels have led to large-scale use of renewable energy sources (RES). The utilization of these RES results in complex and dynamic electric power systems [1]. Therefore, it is becoming more challenging for modern grids to maintain the frequency and tie-line power within specified limits in interconnected areas. A deviation in frequency indicates an imbalance between the electric load and the generation [2]. As the load varies continuously, the absence of immediate corrective action can lead to severe damage. Recently, large penetration of intermittent renewable energy sources into the grid led to a total blackout of a power system [3]. Hence, effective control strategies are vital under uncertain conditions in order to achieve a balance between system reliability and efficiency. Therefore, automatic load frequency control (ALFC) plays an important role in maintaining the load-generation balance by regulating the tie-line power flow and frequency oscillations between interconnected areas.
At present, classical proportional-integral-derivative (PID) type controllers are used by utilities for load frequency control (LFC) because of their simple structure, high reliability, and better performance-to-cost ratio. PID controller gains have been tuned over the decades based on experience, using trial-and-error procedures and conventional tuning methods such as Ziegler–Nichols, but these strategies perform poorly under random load variations and a wide range of operating conditions [4]. Over the years, researchers have proposed several intelligent and optimization-based control strategies for LFC. Fuzzy logic and the adaptive neuro-fuzzy inference system (ANFIS) are proposed to tune the PID parameters in [5], [6]. However, a fuzzy system needs field expertise to tune the membership functions, and it is difficult to acquire the specific knowledge due to its inadaptability [7]. Recently, many advanced control techniques have been proposed for LFC, such as model predictive control (MPC) [8], sliding mode control (SMC) [9], disturbance rejection control [10], and variable structure control [11]. But these controllers are complex and not widely used in industry, so it is worthwhile to improve the PID controller owing to its widespread application. As unideal gains are the primary impediment to optimal PID settings, the gain values have been derived by heuristic approaches such as the genetic algorithm (GA) [12], particle swarm optimization (PSO) [13], the firefly algorithm (FA) [14], grey wolf optimization (GWO) [15], and ant colony optimization (ACO) [16]. However, these schemes are mostly proposed for conventional power systems without considering RES and nonlinear constraints.
Apart from that, researchers have also proposed cascade controllers for LFC in [17], [18], but such techniques require an additional controller that also has to be tuned, which increases the complexity of the strategy.
In recent years, reinforcement learning (RL) based control techniques have been identified as a promising solution for the modern grid. A critical literature review on electric power system control using reinforcement learning is presented in [19]. Reinforcement learning exhibits superiority over conventional control schemes because of its self-learning approach: an interactive trial-and-error method based on observations it receives from the dynamic environment. Hence, reinforcement learning can make decisions and solve realistic control problems more effectively. Several studies in the literature address the frequency control of interconnected areas using reinforcement learning schemes. Data-driven RL based control techniques are presented in [20]-[22] for LFC of multi-area power systems. However, when designing traditional RL agents, the degree of action discretization becomes crucial, since the control action is taken from a low-dimensional action domain, resulting in limited control performance [23]. Deep learning was therefore combined with RL to overcome these deficiencies, a combination called deep reinforcement learning. A new approach is proposed in [24] for frequency control using DRL in the continuous action domain, but this kind of technique lacks a consistent gradient signal due to the concurrent learning behavior of the agents [25].
To solve continuous control problems, the deep deterministic policy gradient (DDPG) was put forward by Lillicrap et al. [23]; it does not necessitate discretization of either the states or the actions. Recently, Yan et al. [26] proposed a multi-agent deep reinforcement learning (MA-DRL) approach for multi-area LFC using DDPG. The concept behind that article is offline centralized learning and online individual application for each control area, where the objective function is maximized by formulating the controller as an MA-DRL problem. However, since DDPG updates the Q-value in the same way as deep Q-networks (DQN), it inherits the drawback of Q-value overestimation, which may lead to a suboptimal policy and incremental bias [27]. Moreover, because the authors of [26] first implemented a PID controller on the power system to collect data for agent initialization, the specific dataset may lead to sub-optimal convergence under continuous load-generation variations. A grid-area coordinated LFC technique based on effective exploration with multi-agent DDPG (EE-MADDPG) is presented in [28], but it cannot be practically implemented on an actual grid due to the abruptly changing conditions of the power grid. Furthermore, as discussed earlier, such control schemes are not widely used in industry compared to the PID controller. Therefore, in this paper, we propose a twin delayed deep deterministic policy gradient (TD3) based deep reinforcement learning approach to fine-tune the PID controller parameters under uncertain conditions. TD3 resolves the defects of DDPG by employing delayed policy updates, twin critics, and clipped noise added to the target actions. Moreover, unlike the above papers that use DRL for LFC, our proposed technique can directly interact with the power system model to tune the PID gains for an actual power grid.
Multiple TD3 agents are trained to minimize the frequency and tie-line power deviations of the power system, where each agent uses the local area control error information to decide its action. Furthermore, for better performance we replaced the actor network's fully connected output layer with a new layer implementing the function y = abs(weights) * x. This new layer ensures that the weights are positive, as gradient descent optimization may drive the weights to negative values. Moreover, a new integrated hybrid power system architecture is proposed for the interconnected system, which comprises wind, PV, electric vehicle, hydro, and thermal plants. Nonlinearities such as the governor dead band (GDB) and generation rate constraints (GRC) are also considered, because many existing studies ignore these realistic nonlinear behaviors.
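To make the non-negative layer concrete, here is a minimal NumPy sketch of the y = abs(weights) * x idea. This is an illustration only, not the authors' implementation (the paper uses a deep-learning toolbox layer); the function name and sample values are hypothetical.

```python
import numpy as np

def nonneg_pid_layer(weights, state):
    """Sketch of the modified actor output layer y = abs(weights) * x.

    `state` holds the proportional, integral, and derivative of the ACE.
    Taking the absolute value of the weights guarantees the effective
    PID gains are non-negative, even if gradient descent drives the
    raw weights negative during training.
    """
    gains = np.abs(weights)            # Kp, Ki, Kd forced non-negative
    return float(np.dot(gains, state))

# Raw weights may go negative during training; the layer still computes
# the action with non-negative effective gains.
raw_weights = np.array([-0.5, 1.2, -0.03])      # hypothetical trained weights
state = np.array([0.1, 0.02, 0.5])              # [ACE, integral of ACE, dACE/dt]
action = nonneg_pid_layer(raw_weights, state)
```

Because abs(w) is applied inside the layer, no projection or clipping step is needed after each gradient update, which is what removes the extra computation the paper refers to.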
Our contributions in this paper can be summarized as follows:
• To the best of our knowledge, this paper is the first work that uses deep reinforcement learning with continuous control actions to optimally tune the PID controller parameters.
• We improve the twin delayed deep deterministic policy gradient based agent to avoid negative PID gain values while training the agent for LFC, which considerably reduces the computational process.
• We evaluate our novel approach on the given system and compare the performance with metaheuristic and DRL based techniques. In addition, a sensitivity analysis is performed to verify the robustness of the proposed scheme.
The rest of the paper is organized as follows. The modeling of the two-area interconnected system is discussed in Section II. In Section III, the preliminaries used in our work are described. Our proposed TD3 scheme and its implementation is illustrated in Section IV, while the Section V covers the simulation results and discussion of the proposed method. Finally, the article is concluded in Section VI.

II. SYSTEM MODELING
This section briefly describes the proposed renewable integrated power system. An unequal two-area interconnected power system is considered in our study. Area 1 consists of hydro, reheat thermal, and diesel units, while area 2 integrates wind, hydro, PV, and electric vehicles, as shown in Figure 1. The differential equations for load frequency control of two-area systems are widely reported in the literature [5]-[16], [29]-[31] and can be given as follows.
where ΔP_Gi is the governor valve position of the ith area, ΔP_Ti is the power generation level of the ith area, and Δf_i is the frequency deviation of the ith area. The tie-line connects the two areas for power sharing, and any load variation in one area can be compensated by the neighboring area through this tie-line. Under any perturbation, the tie-line power deviation can be expressed as

ΔP_tie = (2πT_12 / s)(Δf_1 − Δf_2),

where T_12 is the synchronizing coefficient between the two areas, which gives the relationship between the machine power angle and the frequency deviation. The area control error (ACE), which is usually the input to the controller, is given by

ACE_i = ΔP_tie,i + B_i Δf_i,

where B_1 and B_2 in Figure 1 are the frequency bias parameters, described as B_i = D_i + 1/R_i; here R_i is the governor speed regulation parameter and D_i is the load frequency dependency parameter. The area size ratio is a_12 = −P_r1/P_r2, where P_ri is the power capacity (MW) of each area. The detailed parameter values used for the simulation are listed in the Appendix and taken from [29]-[31]. The block diagrams and transfer functions of the conventional power system are extensively discussed in the literature [5]-[16], whereas we have integrated the electric vehicle (EV), wind, and PV into the system; their details are given below.
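The ACE computation described above can be sketched in a few lines of Python. The parameter values in the usage example are illustrative assumptions, not the Appendix values.

```python
def frequency_bias(R_i, D_i):
    # B_i = D_i + 1/R_i (standard LFC frequency-bias definition)
    return D_i + 1.0 / R_i

def area_control_error(dP_tie_i, df_i, B_i):
    # ACE_i = ΔP_tie,i + B_i * Δf_i -- the controller input for area i
    return dP_tie_i + B_i * df_i

# Illustrative operating point: a small tie-line export with the
# frequency sagging slightly below nominal.
B = frequency_bias(R_i=2.4, D_i=0.0083)       # hypothetical area parameters
ace = area_control_error(dP_tie_i=0.01, df_i=-0.02, B_i=B)
```

A positive ACE indicates over-generation in the area (the controller should reduce generation), while a negative ACE indicates a deficit.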

A. ELECTRIC VEHICLE MODEL
An aggregate model of the EV, comprising a battery charger and primary frequency control, is illustrated in Figure 2. EV fleets can compensate the unscheduled load by exchanging power between the battery and the grid via a charger. The dead band function along with droop characteristics is taken into account, since all EVs may otherwise disconnect from the grid, resulting in frequency deviation. The upper and lower limits of the dead band are set to 10 mHz and -10 mHz respectively, whereas the droop coefficient (R_EV) value is taken the same as for the other plants. K_EV represents the EV gain; its value (between 0 and 1) is determined by the EVs' state of charge (SOC), while the battery time constant is represented by T_EV. ΔP_max_AG and ΔP_min_AG are the maximum and minimum power outputs of the EV fleets, and these reserves can be calculated as in [32]. The incremental generation change of the EVs in an area is denoted by ΔP_EV, and N_EV indicates the total number of electric vehicles connected to the system.

B. WIND GENERATION MODEL
A wind turbine (WT) based on a doubly-fed induction generator (DFIG) is investigated in this study. Wind turbines convert wind energy into electricity, and the output power can be characterized as

P_w = (1/2) ρ A C_p(λ, β) V³,

where ρ, A, C_p, and V represent the air density, blade swept area, power coefficient, and wind speed, respectively. The power coefficient C_p is a function of the blade pitch angle β and the tip speed ratio λ. The corresponding transfer function can be written as in [33]. A wind energy system can cause instability in the system due to its intermittent nature; the resulting continuous power fluctuation can be handled by a battery. A battery energy storage system (BESS) stores excess electrical energy and, if equipped with a large battery bank, can supply a substantial amount of power for a long duration. The simplified transfer function of the BESS is expressed as follows.
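The cubic wind-power relation can be sketched directly. The sample air density, swept area, and C_p values below are typical textbook assumptions, not the paper's parameters.

```python
def wind_power(rho, area, c_p, v):
    """P_w = 0.5 * rho * A * C_p * V^3 (watts, SI units).

    rho:  air density [kg/m^3]
    area: blade swept area [m^2]
    c_p:  power coefficient C_p(lambda, beta), dimensionless
    v:    wind speed [m/s]
    """
    return 0.5 * rho * area * c_p * v ** 3

# Illustrative values: sea-level air, 100 m^2 swept area, C_p = 0.4, 10 m/s wind.
p = wind_power(rho=1.225, area=100.0, c_p=0.4, v=10.0)
```

The V³ dependence is what makes wind output so intermittent: a 26% drop in wind speed halves the available power, which motivates the BESS buffering discussed above.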
C. PHOTOVOLTAIC MODEL
Photovoltaic (PV) modules are solar energy-generating components. The relationship between voltage and current is nonlinear because of the variation in solar radiation throughout the day. Therefore, to increase the output power of the PV panel, a maximum power point tracker (MPPT) must be used.
The following is a description of the PV plant's transfer function with MPPT [34].
K_PVi and T_PVi represent the gains and time constants of the PV system, respectively. The incremental conductance (IC) method is used to track the MPP of the PV system, based on the conditions that dI/dV = −I/V at the MPP, dI/dV > −I/V to the left of the MPP, and dI/dV < −I/V to the right of it.
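One update step of the IC method can be sketched as follows. The step size and operating-point values are illustrative assumptions, not the paper's MPPT parameters.

```python
def ic_mppt_step(v, i, dv, di, step=0.01):
    """One incremental-conductance voltage update (illustrative step size).

    At the MPP dP/dV = 0, i.e. dI/dV = -I/V; the sign of
    (dI/dV + I/V) tells which side of the MPP the operating point is on.
    Returns the voltage adjustment to apply.
    """
    if dv == 0.0:
        # Irradiance change at constant voltage: follow the current change.
        return step if di > 0 else (-step if di < 0 else 0.0)
    incremental = di / dv          # dI/dV
    instantaneous = -i / v         # -I/V
    if incremental > instantaneous:
        return step                # left of MPP -> raise the voltage
    if incremental < instantaneous:
        return -step               # right of MPP -> lower the voltage
    return 0.0                     # at the MPP -> hold

# Illustrative operating point left of the MPP: current still rising with voltage.
dv_cmd = ic_mppt_step(v=30.0, i=5.0, dv=0.1, di=0.01)
```

In practice the returned adjustment drives the converter's voltage reference until the equality condition holds, at which point the panel operates at its maximum power point.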

D. NONLINEAR GENERATION BEHAVIORS
Generation rate constraint (GRC) and governor dead band (GDB) are incorporated into the system for a more realistic approach. Power generation can only vary within a certain limit, called the GRC; the GDB, on the other hand, is the range of steady-state speed change within which the governor valve position does not change. The GDB has a significant effect that may lead to random fluctuations; the factors that contribute to it are backlash in the governor linkages between the servo piston and camshaft, and valve overlap in the hydraulic relays [35]. The nonlinear models are shown in Figure 3; for the thermal plant, the GRC and GDB values are taken as ±3%/min and 0.06% (0.036 Hz), respectively. The GRC lowering and raising values of the hydro power plant are 360%/min and 270%/min, respectively, whereas its GDB limit is 0.02%. The thermal unit's GDB is incorporated in the governor transfer function as given below, where N_1, N_2, and ω_0 are taken as 0.8, -0.2, and π, respectively. Moreover, an area participation factor (apf) for each plant is considered to determine how much each unit contributes to the nominal loading. K_H, K_T, and K_D are the apfs of the hydro, thermal, and diesel plants of area 1, respectively. Similarly, the apfs for area 2 are specified, where the sum of these factors must equal 1 for each area. The apfs for each unit are listed in the Appendix.

III. PRELIMINARIES
In this section, a brief description of the deep reinforcement learning techniques used in our study is presented.

A. DEEP DETERMINISTIC POLICY GRADIENT
DDPG is an improved class of deterministic policy gradient that combines DPG and DQN; it is a model-free, off-policy actor-critic algorithm. Moreover, it can be used in continuous action spaces through its policy-function (actor) and Q-function (critic) framework, which is essential for power system analysis since the system operates with continuous actions because of varying load and generation. A general actor-critic network is shown in Figure 4. The critic uses the temporal difference (TD) technique to update its parameters in the same way as DQN, whereas the DPG algorithm is used to update the actor, whose exploratory action is a = µ(s|θ_µ) + N, where N is a random noise process. Exponential smoothing is used to update the corresponding parameters θ_µ and θ_Q of the target actor and critic networks, as stated below [36].
The learning stability may be improved owing to the slow and smooth variation of the target networks and hyperparameters. Using the critic framework, the action values can be estimated with the Bellman equation.
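The exponential-smoothing (soft) target update can be sketched as follows, treating the parameters as plain lists for illustration; τ below is a typical small smoothing factor, not a value from the paper.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Exponentially smoothed target update: theta' <- tau*theta + (1-tau)*theta'.

    With a small tau, the target network trails the learned network
    slowly, which stabilizes the TD targets during training.
    """
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]

# Illustration: the target parameter moves only a small step toward the source.
updated = soft_update(target_params=[0.0, 1.0], source_params=[1.0, 1.0], tau=0.1)
```

The same update rule is applied to both the target actor and the target critics.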
Next, y = r + γQ′(s′, a′) is used as the TD target with a discount factor γ ≤ 1, and the critic parameters are updated by minimizing the loss function across all samples.
For training, DDPG employs the experience replay (ER) technique, in which a random dataset is sampled from the replay buffer and trained in mini-batches. The current actor parameters are updated by mapping the state to an action through the action-value function and then backpropagating the neural network gradient. To maximize the expected discounted reward, the following policy gradient is used [23].
To learn the parameterized policy, the actor-critic technique converts Monte Carlo based updates into TD updates. Meanwhile, the classic on-policy method is transformed into an off-policy one by adding experience replay and a target network from DQN, which enhances sample efficiency. The performance of Q-learning is known to be affected by overestimation of the value function, so the policy update is negatively affected if the overestimation persists throughout training. Because of these limitations, approaches such as double Q-learning and double DQN were developed, which employ two value networks to decouple Q-value updates from action selection. Twin delayed DDPG (TD3) [37] addresses the overestimation of the Q-value using the following three techniques.
To begin, TD3 imitates the concept of double Q-learning, computing the next-state value with two Q-value networks as given below.
To compensate for the overestimation of the Q-value, the target is taken as the clipped minimum of the two values and then put into the Bellman equation to compute the loss function (as stated in Eq. 21) and the TD error, as shown in Figure 5 and given below [38].
Even though this Q-value update rule may result in an underestimation bias compared with the classic Q-learning technique, the action values are not directly propagated through the policy update. Moreover, to achieve better convergence, the target network is set up as a deep function approximator that provides stable objectives during the learning phase. On the other hand, the value estimates are prone to divergence if the error accumulates. Therefore, the policy network is updated at a lower frequency than the value network in order to limit error propagation, so that a high-quality update can be obtained.
Finally, to avoid overfitting, the Q-value computation needs to be smoothed in order to resolve the trade-off between bias and variance. Hence, a clipped normally distributed noise is applied to each target action as a regularization, resulting in the revised target update shown below [37].
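Putting the three ingredients together, the TD3 target computation can be sketched as follows. This is an illustrative NumPy sketch with stub critics; the function names and default hyperparameters (γ, σ, noise clip, action bounds) are common assumptions, not the paper's settings.

```python
import numpy as np

def td3_target(reward, s_next, q1_target, q2_target, actor_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5,
               a_low=-1.0, a_high=1.0, rng=None):
    """TD3 target: clipped double-Q plus target-policy smoothing.

    a' = clip(mu'(s') + clip(eps, -c, c), a_low, a_high),  eps ~ N(0, sigma)
    y  = r + gamma * min(Q1'(s', a'), Q2'(s', a'))
    """
    rng = np.random.default_rng(0) if rng is None else rng
    eps = float(np.clip(rng.normal(0.0, sigma), -noise_clip, noise_clip))
    a_next = float(np.clip(actor_target(s_next) + eps, a_low, a_high))
    # The minimum of the two target critics counteracts overestimation bias.
    return reward + gamma * min(q1_target(s_next, a_next),
                                q2_target(s_next, a_next))
```

Both critics are then regressed toward this single target y, while the actor is updated only every few critic steps (the "delayed" part of TD3).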

IV. PROPOSED METHOD
The twin delayed DDPG-based agent is trained to act as an LFC controller to optimally tune the PID parameters.
Multi-agents have been trained where each area has its own frequency controller (agent) in the proposed interconnected system, and the elements involved in this formulation are stated as follows.

A. ENVIRONMENT
Everything in the interconnected power system apart from the agent is referred to as the environment. At every time step, the agent takes the environment's state as information to choose an appropriate action, and the environment then returns a reward and a new state for that particular action.

B. OBSERVATIONS
The state, or observations, represents the frequency response that will be used by the TD3 algorithm, policy, and reward function.

C. REWARD
To evaluate the agent's behavior in each state, the environment gives feedback that indicates whether or not the system is converging toward its objectives. As a result, the reward function directly influences the agent to take actions that maximize the reward and thereby approach the objective function.

D. ACTION
The action is the agent's output in the form of a control signal to the power plant; its value is decided by the policy to maximize the reward in a given state. The implementation of the twin delayed DDPG agent is illustrated in Figure 6. The area control error (ACE) is the input/state for each agent in the proposed interconnected system. The state (s), or observations, is given as the proportional, integral, and derivative of the ACE to calculate the action (a) of each agent in both areas. Based on the reward function and the frequency response, the agent tries out different PID values while interacting with the power system, and this action exploration continues until the specified objectives are approached. To obtain the optimal PID parameters, the reward function plays an important role in steering the actions toward solving the defined load frequency control problem, because the reward enters the Bellman equation (Eq. 24) of the proposed TD3 algorithm. As the agent learns entirely by itself through continuous parameter updates, a proper reward function helps achieve fast convergence with less computation and high performance. In this paper, the negative absolute sum of the frequency deviations and the tie-line power deviation is defined as the objective/reward function to minimize the fluctuations of tie-line power and frequency in both areas. The reward function is stated as follows.
r = −(|Δf_1| + |Δf_2| + |ΔP_tie|) − penalty,

where the negative sign means that minimizing the error maximizes the reward. If a particular agent's action does not reduce the error, a penalty is applied that lowers the reward, so the RL agent keeps exploring action/PID values that maximize the reward.
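Assuming the reward has the form described above, it can be sketched as follows; the penalty argument and function name are illustrative assumptions.

```python
def lfc_reward(df1, df2, dp_tie, penalty=0.0):
    """Negative absolute sum of the deviations (plus an optional penalty).

    r = -(|df1| + |df2| + |dP_tie|) - penalty
    The reward is 0 only when all deviations vanish, so maximizing the
    reward drives the frequency and tie-line errors toward zero.
    """
    return -(abs(df1) + abs(df2) + abs(dp_tie)) - penalty

# Illustrative step: small deviations in both areas and on the tie-line.
r = lfc_reward(df1=0.01, df2=-0.02, dp_tie=0.005)
```

The extra penalty term models the error penalty applied when an action fails to reduce the deviations.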

E. DESIGN OF TD3-BASED CONTROLLER
As the primary goal of the scheme is to minimize the frequency and tie-line power deviations under uncertain conditions, a dynamic environment is created by integrating RES and applying random step disturbances in the power system to train the TD3 agent. The agent receives the frequency response from the environment in the form of the proportional, integral, and derivative of the ACE, and outputs the control signal to the environment. The agent consists of critic and actor networks, where the actor is a policy structure that decides the action and the critic estimates the value function.
To create the TD3 agent, the actor and critic are built as deep neural networks. In the actor, we mimic a PID controller with the neural network: a feature-input layer takes the proportional, integral, and derivative of the ACE as inputs, and a fully connected layer produces the controller output. Furthermore, we improved the TD3 agent by replacing the fully connected layer in the actor network with a new layer implementing the function y = abs(weights) * x. This new layer ensures that the weights (PID gains) are positive, as gradient descent optimization may drive the weights to negative values. The parameters considered while creating the actor and critic networks of the TD3 agent are listed in Table 1. The critic network shown in Figure 6 is made up of 9 layers in total; as it receives the frequency response (s) and the actor's action (a), feature-input layers are used for both inputs, each followed by a fully connected layer, and a concatenation layer then links the two paths. A rectified linear unit (ReLU) is used between the fully connected layers as the activation function. The Adam optimizer is applied to update the parameters of the actor and critic networks, while the Glorot initializer is used for the weights of the fully connected layers. To formulate the TD3 agent, two critic networks (Q_1(s, a), Q_2(s, a)) are created; these two networks help the agent estimate the long-term reward based on states and actions. The structures and parameters of the target actor and target critics are taken to be the same as those of the actor and critics. The target actor and target critic parameters are continuously updated by the agent to improve the stability of the optimization. The steps used in implementing the proposed TD3 algorithm for LFC are briefly discussed as follows:
Step 1. Create critic and actor functions for the agent to estimate the value function and policy during training at each time step.
Step 2. Specify the agent options such as experience replay buffer length, mini-batch size, and Gaussian noise.
Step 3. Based on specified parameters in step 1 and step 2, create the TD3 agents for both areas.
Step 4. To train the TD3 agent the following algorithm is used.
Once training is completed, the absolute values of the actor network's weights are fetched as the proportional, integral, and derivative gains of the PID controller. The flowchart in Figure 7 gives a simplified representation of the PID tuning procedure.
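The gain-fetching step and the resulting PID control law can be sketched as follows. This is an illustrative Python sketch; the weight values and function names are hypothetical, not outputs of the paper's training runs.

```python
import numpy as np

def fetch_pid_gains(actor_output_weights):
    """Read the trained actor output-layer weights as PID gains.

    Taking the absolute value mirrors the modified y = abs(weights) * x
    layer, so the fetched gains are guaranteed non-negative.
    """
    kp, ki, kd = np.abs(np.asarray(actor_output_weights, dtype=float))
    return kp, ki, kd

def pid_control(ace_p, ace_i, ace_d, gains):
    # u = Kp*ACE + Ki*integral(ACE) + Kd*d(ACE)/dt
    kp, ki, kd = gains
    return kp * ace_p + ki * ace_i + kd * ace_d

# Hypothetical trained weights; one of them went negative during training.
gains = fetch_pid_gains([-1.5, 0.3, -0.02])
u = pid_control(ace_p=0.1, ace_i=0.5, ace_d=-0.2, gains=gains)
```

After this step the agent is no longer needed online: the fixed PID gains are deployed as an ordinary controller.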

V. RESULTS AND DISCUSSION
The two-area interconnected system shown in Figure 1 is developed in MATLAB/Simulink, and a TD3 agent is implemented as the controller of each area to obtain the optimal PID gains. The configuration of the proposed scheme is illustrated in Figure 8. During training, the algorithm runs the simulation for every episode, and a single episode continues until it reaches the window length or triggers the threshold limit. After every episode, the episode reward is recorded, as shown in Figure 9.

Algorithm 1: Twin Delayed DDPG
1: Initialize the actor µ(s|θ) and critic Q(s, a|φ) networks
2: Initialize the target actor µ_t(s|θ_t) and target critics Q_t(s, a|φ_t) with the primary actor-critic networks' parameters
3: for each episode = 1, ..., M do
4:   Simulate the environment with a random load-generation disturbance
5:   Observe the current state [ACE(s), ACE(s)/s, s·ACE(s)] and store it in the experience buffer
6:   Initialize random (Gaussian) exploration noise N_t
7:   for t = 1, ..., T do
8:     Choose an action a_t = µ(s|θ) + N_t based on the current observations/state
9:     Execute the action and obtain the reward r_t and next state s_{t+1}
10:    Store the tuple (s_t, a_t, r_t, s_{t+1}) in the experience replay buffer
11:    Sample a random minibatch of stored tuples from the replay buffer
12:    Compute the target value using the clipped minimum of the two target critics
13:    Update each critic's parameters by minimizing the loss function stated in Eq. 21
14:    Update the actor parameters using the deterministic policy gradient
15:    Update the target actor and target critic parameters using the smoothing factor:
       θ_µ' = τθ_µ + (1 − τ)θ_µ',  θ_Q' = τθ_Q + (1 − τ)θ_Q'
16:  end for
17: end for

We have given [0 0 0] as the initial PID gains to initialize the model; therefore, the starting episodes received high negative rewards as an error penalty. Area 2 is more heavily penalized due to the presence of RES. The agent tries to maximize the reward by choosing optimal PID gains as control actions. As shown in the figure, the model performs better after 200 episodes, but to obtain better results and converge to the optimal solution, 800 episodes are carried out. After training the model, the robustness of the proposed scheme is tested under different scenarios, and the results are compared with conventional meta-heuristic and DRL techniques. Table 2 shows the PID controller gains obtained by each algorithm considered for comparison against the proposed approach. While training the model, the lower limit [0 0 0] and the upper limit [5 5 5] of the PID gains were set for every algorithm for a fair comparison. The IAE values in Table 2 show that the proposed TD3 approach gives the minimum error among all listed algorithms.
The random output power fluctuations of the RES are provided for system analysis. The power of the EV fleet is shown in Figure A1, which illustrates the charging and discharging states of the EVs compensating the unscheduled load by exchanging power between the battery and the grid via a controlling charger. To compare the control performance, a 1% step load perturbation (SLP) is first applied in area 1. The results shown in Figure 10 clearly indicate the superiority of the proposed TD3 approach, where -0.006 Hz and 0.0011 Hz are the undershoot (US) and overshoot (OS) of the system's frequency response. The frequency settling time (Ts) is 7.5 s, whereas the other techniques require more than 14 s to stabilize the response. The DDPG's US and OS are -0.011 Hz and 0.003 Hz, respectively. PSO and GA provide nearly identical results, with a minor variation in the frequency responses: -0.01 Hz and 0.0017 Hz are the US and OS of PSO, compared to GA's -0.009 Hz and 0.0017 Hz, respectively. Moreover, the proposed TD3 scheme efficiently damps the oscillations while stabilizing the frequency deviations. A detailed LFC performance comparison of all the considered techniques is given in Table 3.
Furthermore, the robustness of the proposed approach is tested under random step load disturbances in both areas, as shown in Figure A2. The responses of the frequencies and tie-line power are shown in Figures 13-15. The proposed TD3 method performs better than the other techniques in terms of minimum undershoot, overshoot, and settling time. GA, PSO, and DDPG exhibit poor performance when the random load disturbance is applied in area 2. The maximum US, OS, and Ts for TD3 under random SLP in area 2 are -0.023 Hz, 0.008 Hz, and 9 s, respectively. For DDPG, the values are -0.059 Hz, 0.022 Hz, and 13.5 s, respectively, while PSO's US, OS, and Ts are -0.05 Hz, 0.018 Hz, and 14 s, respectively. GA performed poorly under the random step load disturbance, as shown in Figures 13-15; therefore, only PSO is considered in the further results owing to its slightly better performance than GA. Finally, the performances of the three remaining techniques are evaluated under the fluctuating RES output, where for DDPG and PSO the deviation is ±0.01 p.u. and ±0.008 p.u., respectively. Hence, these results verify the superiority of the proposed TD3 approach against the fluctuations of the renewable energy sources integrated into the system.

A. SENSITIVITY ANALYSIS
In this subsection, a sensitivity analysis is carried out to illustrate the robustness of the proposed approach by varying the system parameters and operating conditions. Since changing the conditions may lead to severe disturbances, the controller parameters should be robust enough to tolerate these changes. To test the proposed approach, parameters such as the time constants (T_h, T_gr, T_w), gain constants (K_PV, K_EV), R, and the coefficient D are varied in the range of ±50% from their nominal values. The optimal PID gains obtained at nominal operating conditions are used to evaluate the performance while varying the system parameters. For the sensitivity analysis, only one parameter at a time is varied, in steps of 25%, while the other parameters are kept at their nominal values. Table 4 shows the control performance of Δf_1, Δf_2, and ΔP_tie under the 25% parametric variation steps. The comparison of all listed responses with the nominal case reveals the robustness of the proposed approach against system parameter variations, with the frequency and tie-line responses almost overlapping, showing only minor differences.
The parametric variation responses with T W and R are illustrated in Figures 19 and 20 respectively, which confirm the robustness of the proposed TD3 approach against any system parameter variations.

VI. CONCLUSION
In this paper, a novel approach is proposed to optimally tune the proportional-integral-derivative (PID) controller gains for load frequency control of a renewable integrated hybrid power system using a deep reinforcement learning (DRL) method. Multiple DRL agents based on the twin delayed deep deterministic policy gradient (TD3) algorithm are trained to act as controllers for each area, deciding the optimal PID values via an interactive trial-and-error method. The performance of TD3 was compared with the deep deterministic policy gradient and meta-heuristic techniques such as the genetic algorithm and particle swarm optimization. The results under various scenarios clearly show that our proposed approach outperforms the abovementioned schemes: the TD3 approach yields a reduction of almost 50% to 60% in settling time and under/overshoot deviations. All the considered techniques except TD3 are unable to stabilize the tie-line power under random step load perturbation, which certifies its superiority under dynamic variations. Moreover, the proposed scheme greatly reduces the steady-state error and increases system stability under continuous load-generation variations compared to conventional control schemes. Furthermore, the sensitivity analysis indicates that the obtained gain values are robust enough to withstand system parametric variations.
As the traditional electric grid is undergoing a major transition that incorporates computation in grid operations for better reliability, it brings the key challenge of cyber security. In our future work, we will develop a cyber-attack detection model for LFC to improve the reliability of the system.

APPENDIX
See Figures 21 and 22 for the signals A1 and A2, respectively.