A Reinforcement Learning Approach to Undetectable Attacks Against Automatic Generation Control

Automatic generation control (AGC) is an essential functionality for ensuring the stability of power systems, and its secure operation is thus of utmost importance to power system operators. In this paper, we investigate the vulnerability of AGC to false data injection attacks that could remain undetected by traditional detection methods based on the area control error (ACE) and the recently proposed unknown input observer (UIO). We formulate the problem of computing undetectable attacks as a multi-objective partially observable Markov decision process. We propose a flexible reward function that allows exploring the trade-off between attack impact and detectability, and use the proximal policy optimization (PPO) algorithm for learning efficient attack policies. Through extensive simulations of a 3-area power system, we show that the proposed attacks can drive the frequency beyond critical limits, while remaining undetectable by state-of-the-art algorithms employed for fault and attack detection in AGC. Our results also show that detectors trained using supervised and unsupervised machine learning can both significantly outperform existing detectors.

in a power system, based on power and frequency measurements taken across the interconnected power system. In AGC, the power system is typically divided into several areas, and a separate AGC controller is deployed for each area. The AGC controller attempts to minimize the deviation of the measured power flows across certain transmission lines and the grid frequency from their expected values. This is typically achieved by minimizing the area control error (ACE) metric, which is a weighted sum of the two aforementioned quantities.
The operation of AGC depends on the accuracy and integrity of the deployed sensor measurements. Nevertheless, since modern power systems usually utilize insecure public communication networks, the AGC control loop is vulnerable to a wide range of cyber-attacks. One of the most studied attacks is the false data injection attack (FDIA) [3], in which the attacker uses the communication network to inject false measurements and transmit them to the control center, where the AGC controller typically resides. The false measurements could cause the AGC controller to issue incorrect dispatch commands to the generators, potentially leading to catastrophic consequences in the power system. Therefore, extensive surveillance of the AGC control loop (including the sensor measurements) is an important aspect of the security of any power system.
Conventional solutions for detecting FDIAs against AGC systems depend on simply monitoring the ACE value at each area [1]. However, these methods do not utilize information from the AGC system model. Therefore, a recent promising approach for FDIA detection is utilizing the unknown input observer (UIO) [4], which can accurately estimate the unknown system states affecting AGC operation given (1) the observed sensor measurements and (2) accurate knowledge of the power system topology and parameters. An attack or a fault in AGC operation will usually lead to high estimation residuals, which causes an alarm to be raised. This approach has shown great potential in detecting naively computed FDIAs, such as the scaling, ramp, and random attacks. Nevertheless, the vulnerability of UIOs to targeted FDIAs has not yet been fully explored.
In this paper we investigate the vulnerability of state-of-the-art FDIA detection methods in AGC systems using the framework of reinforcement learning (RL). The contributions of this paper are as follows.
© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
1) We model the problem of finding stealthy FDIAs against AGC from the perspective of the attacker as a multi-objective partially observable Markov decision process (MO-POMDP) [5].
2) We develop a flexible reward function that allows the RL-based attack to maximize the attack impact, while keeping the detection metrics low.
3) We use extensive simulations to evaluate the proposed RL-based attacks and showcase their superiority over several baseline attacks in terms of attack impact and in terms of undetectability.
To the best of our knowledge, this is the first work that considers computing FDIAs that bypass state-of-the-art AGC attack detectors such as the UIO, and the first work that uses RL to compute such attacks against AGC.
The rest of this paper is organized as follows. Section II discusses previous work on attacks against AGC as well as their countermeasures. Section III presents our model of the AGC system, and the capabilities of the attacker. The problem of computing FDIAs is formulated as a MO-POMDP in Section IV. Section V evaluates the performance of the proposed attacks in terms of stability, impact, detectability, as well as sensitivity to model inaccuracy. Finally, Section VI concludes the paper.

II. RELATED WORK
Several recent works have investigated the security of AGC and its vulnerability to attacks. One of these attacks is the time delay attack (TDA), which delays the transmission of measurements sent from the sensors to the control center, or the control commands from the control center to the generators. Recent works have shown that TDAs can degrade the performance of AGC or even disable it [6], [7]. Nonetheless, the most studied attack is by far the false data injection attack (FDIA), where the attacker can compromise the measurements (e.g., power flows or frequency measurements), thus leading the AGC controller to send incorrect dispatch commands to the generators [8]. Such FDIAs have been shown to pose a severe threat to the system frequency [9].
A different line of work has developed improvements to FDIAs against AGC systems. A "mild" version of FDIA that gradually changes the measurements was proposed in [10], and it was shown that these attacks could still cause significant deviations in the system frequency. The authors of [11] developed an FDIA that maximizes the system frequency deviation, while keeping measurement perturbations within limits. The authors of [3] proposed an FDIA against AGC based on a model of the AGC system that minimizes the time until initiating remedial actions by the system operator. Notably, the proposed attack is able to bypass state-of-the-art bad data detection (BDD) methods used in power system state estimation. Another attack that can bypass BDD methods was proposed in [12]. In the first phase of the FDIA, the false measurements are designed to look like un-attacked cases, while the second phase finally drives the frequency beyond the safe range. More recently, [13] designed an attack that minimizes both the attack magnitude and the time until frequency violation, while keeping the attacked measurements and the ACE values within normal limits.
In response to the rising threat of FDIAs against AGC, their detection and mitigation have recently attracted significant research interest. The traditional approach is to monitor the ACE of each area [1], since an increase in the ACE could be a strong indicator of a system fault or an attack. Building on this simple intuition, other approaches utilized the ACE signals for attack detection in more complicated ways. Most notably, [9] proposed an anomaly detector that monitors the ACE values and compares them with predicted values based on load forecasts. Similarly, [14] used load forecasts to predict a range of normal ACE values, which can be used to both detect and compensate for FDIAs. Moreover, [15] used pattern recognition and supervised classification to predict whether the ACE signal is normal or attacked. In addition, [16] proposed two methods based on long short-term memory (LSTM) and discrete Fourier transform (DFT) to detect abnormalities in ACE time series. A multilayer perceptron (MLP) combined with feature selection was trained in [17] to distinguish between attacked and non-attacked ACE signals. Recently, [18] proposed a combination of fuzzy logic and neural networks for the detection of FDIAs, where the input data consisted of the ACE values as well as other measurements.
In contrast to the above ACE-dependent approaches, another common approach is the use of a mathematical model of AGC to detect FDIAs. The most commonly used models are the unknown input observers (UIOs) [19], [20]. In these works, a mathematical model of AGC is formulated and is used to perform a delayed estimation of the system states by observing the sensor measurements. By comparing the received measurements with the measurements expected based on the estimated state, faults and attacks against AGC could be detected. Building on the basic idea of the UIO, [21] includes the attack as a part of the UIO model (i.e., as an unknown input) so that the model learns to estimate the system state as well as the attack, which allows for correcting corrupted measurements. Similarly, [22] designed a UIO for FDIA detection and combined it with a robust adaptive observer and the $H_\infty$ technique to estimate and correct the attacks. A similar idea was developed in [23] for detecting attacks in a decentralized manner by building smaller models that utilize only state variables from a single area.
Several works considered other model-based approaches for detecting FDIAs in AGC systems. The approach in [24] combined state and attack estimation with attack compensation using observer-based output feedback control design. The authors of [25] considered the slightly different AGC problem in hybrid AC/HVDC grids, and designed a residual generator based on the system model to detect and recover attacks. A recent approach is proposed in [26], where the authors designed a set of sliding mode observers (SMOs) and Luenberger observers to detect FDIAs and identify the location of the attacks. Another model-based approach is investigated in [27], where the Kalman filter is proposed for FDIA detection in AGC systems. Moreover, [28] used the Kalman filter to estimate and correct the effect of the attack. Finally, contrary to most works which consider a linearized AGC system model, [29] took system non-linearities into consideration and proposed using a particle filter to detect FDIAs.
Other approaches for detecting FDIAs in AGC systems include [30], which applies dynamic watermarking to measurements fed to the AGC system to detect attacks. More recently, an ensemble method based on supervised machine learning applied to area-level features has been proposed for detecting FDIAs in a decentralized manner [31]. Similarly, the authors of [32] proposed detecting FDIAs by training an unsupervised generative adversarial network (GAN) using historical measurement and load data. Another unsupervised technique is presented in [33], where FDIAs are detected using an autoencoder neural network with LSTM structured neurons.
Apart from the FDIA detection problem, many works focused on the problem of fault-tolerant control in AGC systems. The authors of [34] proposed FDIA-resilient control in AGC systems combining a Luenberger observer, an artificial neural network (ANN), and an extended Kalman filter. Moreover, [35] proposed an $H_\infty$ controller for event-triggered AGC to control the system frequency under DoS attacks and FDIAs. In addition, an LSTM-based regression model was developed in [36] to predict and compensate for the FDIA signals in AGC. Finally, several research works used a game-theoretic approach to model the interaction between the system operator and the attacker. In the game formulated in [37], the attacker chooses between manipulating either half or all of the samples, and the operator chooses between two different configurations of an FDIA detector. In the game proposed in [38], the attacker could either attack both power and frequency measurements or only the frequency, while the defender could switch between two different FDIA detectors, namely support vector machine (SVM) and k-nearest neighbours (KNN).
A significant limitation of most of the aforementioned works studying AGC security is that they considered weak attack models. Simple FDIAs such as ramp, pulse, step, scaling, sine, random, and replay attacks [9], [10], [14], [17], [19], [23], [27], [28], [31] have been commonly utilized either (1) to quantify the impact of FDIAs on AGC systems, or (2) to evaluate FDIA detection approaches. A few works did propose attacks that included a notion of stealthiness: the FDIAs constructed in [11], [13], [25], [33] satisfied simple constraints, e.g., upper and lower bound constraints on the attacked measurements. The above naive attacks do not exploit any knowledge about the attacked AGC system model, nor about the detectors deployed by the system operator. Therefore, available detection methods in the literature could very effectively detect these naive FDIAs. However, it is unclear whether state-of-the-art detection methods could detect more intelligent attacks that can leverage insider information about the AGC system and the employed detectors. Therefore, a thorough study of the security of AGC w.r.t. a strong attack model is highly needed.
It is also worth noting that very few works [3], [12] proposed attacks that can bypass bad data detection (BDD) techniques typically employed with power system state estimation. Nevertheless, these attacks are agnostic of any AGC-specific FDIA detectors, and should thus be detectable by those. In addition, although [32] considered attacks that are stealthy w.r.t. the AGC system model, computing those attacks requires access to the unknown inputs (e.g., loads) and the authors do not provide a clear FDIA computation procedure.
Going beyond the above works, constructing intelligent FDIAs against AGC systems could be regarded as an optimal sequential decision-making problem, with the objective of maximizing the attack impact and stealthiness. To this end, we utilize the framework of reinforcement learning (RL) to compute FDIAs because (1) computing optimal attacks using traditional mathematical optimization tools could be infeasible for large and highly dynamic AGC systems, and (2) the RL approach only requires the availability of a system model and historical data, and the attack procedure could in principle be applied against other cyber-physical control systems.
Moreover, RL has been extensively used in various power systems optimization tasks. Several works have proposed AGC controllers using RL or multi-agent RL (MARL) instead of the widely used PI-controller [1], [2]. One of the first RL-based AGC controllers was proposed in [39], where the authors used the Q-learning algorithm [40] based on discretized actions (generation set points) and observations of either (1) the ACE values, or (2) the power-flow and frequency measurements. More recently, [41] treated AGC as a decentralized multi-agent problem (i.e., each area controller is considered as one agent) and utilized state and action discretization to use the double deep Q-network (DDQN) [42] algorithm with action discovery. MARL has also been used to solve the problem of automatic voltage control (AVC) [43], using the multi-agent deep deterministic policy gradient (MADDPG) [44] algorithm, which leverages centralized training and decentralized execution, and is able to deal with continuous actions and observations. Similarly, a multi-agent actor-critic RL algorithm was proposed in [45] to solve the problem of voltage and frequency control in inverter-based microgrids. Finally, Q-learning has been proposed to compute FDIAs against power system state estimation [46]. Nevertheless, to the best of our knowledge, our work is the first to consider RL-based attacks against AGC, including the question of detectability using state-of-the-art detectors.

A. Automatic Generation Control
We consider an interconnected power system consisting of $N$ areas, connected by power transmission lines called tie lines. We denote by $P^{sch}_{i,j}$ the scheduled (planned) power flow from area $i$ to area $j$ across their corresponding tie line(s), by $P^{tie}_{i,j}$ the actual power flow from area $i$ to area $j$, and by $\Delta P^{tie}_{i,j} = P^{tie}_{i,j} - P^{sch}_{i,j}$ the deviation from the scheduled values. We denote by $f_i$ the AC frequency of area $i$, and its deviation from the nominal grid frequency (e.g., $f_0 = 60$ Hz) by $\Delta\omega_i = \frac{f_i - f_0}{f_0}$. Each area has one or more electric power generators whose generation levels are controlled by the AGC in order to keep the deviations of both the frequency and the tie line power flows close to zero, despite changes $\Delta P^L$ in the electrical loads in each area.
At time instant $t$, the evolution of the frequency deviation $\Delta\omega_i$ is given by the differential equation where $H_i$ is the inertia constant of generator $i$, $\Delta P^m_i$ is the deviation in the mechanical power output of generator $i$, $\Delta P^{tie}_i(t) = \sum_{j=1}^{N} \Delta P^{tie}_{i,j}(t)$, and $D_i$ is the damping coefficient of generator $i$. The power flow on a tie line can be approximated by where $P^s_{ij}$ is the synchronizing power coefficient between areas $i$ and $j$ [2].
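For reference, the two relations described in this paragraph can be sketched in their standard linearized form from the load-frequency control literature (e.g., [2]), using the symbols defined above. This is an assumed reconstruction under the usual per-unit conventions rather than a verbatim statement; in particular, the sign given to $\Delta P^L_i$ and the absence of any $2\pi$ scaling in the tie-line relation are assumptions.

```latex
\begin{align*}
\Delta\dot{\omega}_i(t) &= \frac{1}{2H_i}\Big(\Delta P^m_i(t) - \Delta P^L_i(t)
                           - \Delta P^{tie}_i(t) - D_i\,\Delta\omega_i(t)\Big), \\
\Delta P^{tie}_{i,j}(t) &= P^s_{ij}\left(\int_0^t \Delta\omega_i(\tau)\,d\tau
                           - \int_0^t \Delta\omega_j(\tau)\,d\tau\right).
\end{align*}
```

Differentiating the second relation and summing over $j$ gives $\Delta\dot{P}^{tie}_i = \sum_j P^s_{ij}(\Delta\omega_i - \Delta\omega_j)$, so the coefficient that couples area $i$ to the frequency of a neighbouring area $j$ is $-P^s_{ij}$.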
To drive the power and frequency deviations back to zero, each area's generator governor adjusts the position of the turbine's steam valve $\Delta P^v_i$ based on the differential equation where $\tau^g_i$ is the time constant of the governor in area $i$, $R_i$ is the speed regulation (droop) coefficient of the generator, and $\Delta P^{ref}_i$ is the input reference power generation of area $i$ supplied by AGC. Changing $\Delta P^v$ will in turn control the output mechanical power $\Delta P^m$ as where $\tau^t_i$ is the turbine time constant of area $i$. To regulate the frequency and the tie line power flows, the AGC controller is typically implemented as a PI-controller that controls $\Delta P^{ref}$ using where $k_i$ is the integrator gain of the PI-controller, and $\mathrm{ACE}_i$ is the area control error in area $i$, computed as where $\beta_i$ is the frequency bias of area $i$ computed as
A block diagram of the above equations for two areas is shown in Figure 1, where the transfer function of each block is given in the Laplace domain.
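As a companion to the description above, the governor, turbine, integral controller, ACE, and frequency bias relations can be sketched in the form commonly used for the linearized AGC model (e.g., [2]). The expressions below are an assumed reconstruction from the symbol definitions in this paragraph, not a verbatim statement, and the sign conventions are the conventional ones.

```latex
\begin{align*}
\Delta\dot{P}^v_i(t)     &= \frac{1}{\tau^g_i}\Big(\Delta P^{ref}_i(t)
                            - \frac{1}{R_i}\,\Delta\omega_i(t) - \Delta P^v_i(t)\Big), \\
\Delta\dot{P}^m_i(t)     &= \frac{1}{\tau^t_i}\Big(\Delta P^v_i(t) - \Delta P^m_i(t)\Big), \\
\Delta\dot{P}^{ref}_i(t) &= -k_i\,\mathrm{ACE}_i(t), \\
\mathrm{ACE}_i(t)        &= \Delta P^{tie}_i(t) + \beta_i\,\Delta\omega_i(t), \\
\beta_i                  &= \frac{1}{R_i} + D_i .
\end{align*}
```

Under these assumed forms, each area contributes five state variables ($\Delta P^{tie}_i$, $\Delta\omega_i$, $\Delta P^m_i$, $\Delta P^v_i$, $\Delta P^{ref}_i$), which matches the $5 \times 5$ blocks used in the state-space model.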
The above equations can be converted into the state space model where and $A^c_{ij}$, $i \neq j$, is a $5 \times 5$ matrix whose only non-zero element is $-P^s_{ij}$ in the first row and the second column. Combining the equations for all areas we obtain where The above continuous time model can be converted to discrete time with a discretization time step $T_s$ using the zero-order hold (ZOH) method [47] to obtain where $A$ and $B$ are obtained by the ZOH discretization of $A^c$ and $B^c$ respectively.
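The ZOH discretization step can be illustrated with a small dependency-free sketch. It uses the standard identity that the exponential of the augmented matrix $[[A^c, B^c], [0, 0]] \cdot T_s$ contains $A$ and $B$ in its top block row. The helper names (`mat_mul`, `mat_exp`, `zoh_discretize`) are hypothetical, and the scaling-and-squaring Taylor series is a simple stand-in for a numerical library routine, not the implementation used in the paper.

```python
import math

def mat_mul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_exp(M, terms=20):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    n = len(M)
    norm = max(sum(abs(v) for v in row) for row in M)
    # Scale the matrix so its norm is at most 0.5, then square the result back.
    squarings = max(0, math.ceil(math.log2(norm / 0.5))) if norm > 0 else 0
    scaled = [[v / (2 ** squarings) for v in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def zoh_discretize(Ac, Bc, Ts):
    """Return (A, B) of x[t+1] = A x[t] + B u[t] for dx/dt = Ac x + Bc u under ZOH."""
    n, m = len(Ac), len(Bc[0])
    aug = [[0.0] * (n + m) for _ in range(n + m)]
    for i in range(n):
        for j in range(n):
            aug[i][j] = Ac[i][j] * Ts
        for j in range(m):
            aug[i][n + j] = Bc[i][j] * Ts
    E = mat_exp(aug)
    A = [row[:n] for row in E[:n]]   # top-left block: exp(Ac * Ts)
    B = [row[n:] for row in E[:n]]   # top-right block: integral of exp(Ac * s) ds times Bc
    return A, B

# Toy usage: a first-order lag dx/dt = -x + u sampled with Ts = 2 s.
A_d, B_d = zoh_discretize([[-1.0]], [[1.0]], 2.0)
```

For the full $5N$-state model the same function applies unchanged once $A^c$ and $B^c$ have been assembled; a library routine would normally be preferred for numerical robustness.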

B. Fault and Attack Detection in AGC
As mentioned in Section II, the most commonly used methods for fault and attack detection in AGC are (1) monitoring the ACE values, which is a model-free method, and (2) developing unknown input observers, which is model-based.
1) Area Control Error: The ACE can be computed for each area as in (6), based on the received power-flow and frequency measurements. Since the main objective of AGC is to keep the ACE values small, an increase in ACE could be a strong indicator of a system fault or malicious activity [9]. The simplest ACE-based detector would then monitor the ACE values, and raise an alarm if $\max_i |\mathrm{ACE}_i[t]| > \rho_a$ (11), where $\rho_a$ is a predefined detection threshold. In what follows, we refer to the detector based on (11) as the ACE detector.
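A threshold detector of this kind can be sketched in a few lines. The sketch assumes the conventional ACE definition, i.e., the tie-line deviation plus the frequency deviation weighted by the frequency bias, and an alarm on the largest absolute ACE across areas; the function names and the numerical values are hypothetical.

```python
def area_control_error(dp_tie, d_omega, beta):
    """ACE of one area: tie-line deviation plus bias-weighted frequency deviation (assumed form)."""
    return dp_tie + beta * d_omega

def ace_alarm(dp_tie, d_omega, beta, rho_a):
    """Return True if the largest absolute ACE over all areas exceeds the threshold rho_a."""
    aces = [area_control_error(p, w, b) for p, w, b in zip(dp_tie, d_omega, beta)]
    return max(abs(a) for a in aces) > rho_a

# Toy usage with made-up per-unit values for three areas.
quiet = ace_alarm([0.01, -0.02, 0.0], [0.0001, 0.0, 0.0], [20.0, 16.0, 20.0], rho_a=0.05)
loud = ace_alarm([0.01, -0.02, 0.0], [0.0001, 0.0, 0.0], [20.0, 16.0, 20.0], rho_a=0.015)
```

Because the detector only looks at the ACE, an attack that keeps every area's ACE below $\rho_a$ passes it regardless of what happens to the physical frequency.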
2) Unknown Input Observer: Another commonly used method for fault and attack detection for AGC is based on the idea of the delayed unknown input observer (UIO) for discrete-time linear systems [4], [19]. The UIO is based on the discrete-time state space model of the system, where Note that the output $y$ includes only the variables that could be measured by the operator. For ease of notation we let $n = 5N$, $m = N$, $p = 3N$ denote the total number of states, inputs, and outputs of the system, respectively. Assuming knowledge of the initial state of the system (i.e., $x[0]$), a UIO with a detection delay $\alpha$ can be used to estimate the system state at time $t$ after observing the system measurements $y$ from time $t$ to $t + \alpha$, making use of the relation where The estimated system state $\hat{x}[t]$ by the UIO can then be given by where $L \in \mathbb{R}^{n \times p(\alpha+1)}$ is the UIO gain matrix that should be designed in order to ensure the accuracy and stability of the UIO. It has been shown [19], [48] that for $\alpha \geq 2$, the accuracy and stability of the UIO can be ensured when the following procedure is followed [4], [19]: After estimating the system state $\hat{x}$ using (14), the residual can be computed as and an alarm is raised if where $\rho_r$ is a predefined detection threshold. Recall that despite being the residual of the estimated state for time $t$, $r[t]$ cannot be computed by the UIO before time $t + \alpha$. Furthermore, observe that the knowledge of the load changes (i.e., $u[t]$) is not required to compute $r[t]$. In what follows we refer to the detector based on (16) as the UIO detector.

C. Attack Model
We consider an attacker that has knowledge of the system matrices $A$, $B$, and $C$. This means that the attacker either knows or can accurately estimate the parameters of each area (i.e., $H_i$, $D_i$, $R_i$, $\beta_i$, $\tau^g_i$, $\tau^t_i$, $P^s_{ij}$). Furthermore, the attacker is able to eavesdrop on the system measurements (i.e., $y[t]$) at each AGC cycle. We assume that the attacker knows whether the system operator is using an ACE detector, a UIO detector, none, or both. If the operator is using a UIO, the attacker knows the parameters $\alpha$ and $L$ of the UIO, and can thus predict the effect of its attack on the UIO residual $r$.
We consider that the attacker can inject false measurements of the tie-line power flows as well as the area frequencies, and can thus manipulate $\Delta P^{tie}$ and $\Delta\omega$ as where $a^P_i$ and $a^\omega_i$ represent the perturbation (attack) of the tie-line power flows and frequency in area $i$. Observe that in practice, manipulating the frequency measurements might be harder than manipulating power flow measurements since (1) power flow measurements are typically greater in number than frequency measurements, and are thus harder to secure, and (2) the grid frequency is a variable that can be verified by the system operator from neighbouring buses in the same area [3]. The attacked power-flow and frequency measurements would then affect the ACE computation in (6), and thus the output of the PI-controller in (5).
In practice, the attacker could eavesdrop and inject false measurements through network intrusion. The attack could directly manipulate messages transmitted using communication protocols such as Modbus, DNP3 or IEC 61850 [49], [50], as these protocols do not mandate either authentication or encryption of messages. Although security recommendations exist for these protocols [51], their use is not mandatory. Even if message authentication is used, eavesdropping and injection of measurements would be feasible through the compromise of end devices. An end device (e.g., a remote terminal unit (RTU) or a phasor measurement unit (PMU)) could be compromised by stealing cryptographic credentials or by exploiting software or hardware vulnerabilities, and state estimates based on PMU measurements could also be compromised by time synchronization attacks [52]. Finally, information regarding the system parameters and the used detectors could be obtained by insiders, or could be estimated by an adversary that can eavesdrop on the measurements during an extended period of time.
Overall, the advantage of such a strong attack model is that it allows us to consider the worst case attacks and their potential impact on the system's performance. Such a strong model is not uncommon in the security literature, given the recent success of cyber attacks with a high level of attacker knowledge, e.g., Stuxnet [53] and FDIAs against power system state estimation [54], [55]. We further assume that the attack is constrained by where $a^{P+}$ and $a^{\omega+}$ denote the respective maximum allowed attack magnitudes. The reason for constraint (18) is that an attack that sets the power-flow or frequency measurements too far from their expected values should be easily detectable. Moreover, the attacker is constrained in that $\sum_{i=1}^{N} \Delta P^{tie}_{i,a}$ and $\sum_{i=1}^{N} a^P_i$ must be kept close to zero for any attack. As a result of the attack, the state-space model becomes [20] x where Observe that the only state variable that is directly affected by the attack is $\Delta P^{ref}$, due to the manipulated ACE value. The considered attack model is illustrated in Figure 2. The attacker's goal is to maximize the deviation of the frequency from its nominal value $f_0$ in a certain target area $i^*$. Ideally, the attacker would like to cause the frequency to drift beyond its secure limit, which might cause load shedding schemes to take effect, or in the worst case cause blackouts.

Fig. 2. Block diagram of the AGC system, including the physical power system, the communication network, the control center, and the attacker.
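One way an attack vector could be forced to respect the magnitude bounds and the near-zero-sum condition on the power perturbations is sketched below. The text does not specify how the constraints are enforced, so the clip, re-center, and rescale procedure is purely an illustrative assumption, as are the function names; the sketch also enforces an exactly zero sum, which is stricter than "close to zero".

```python
def clip(value, bound):
    """Limit value to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))

def constrain_attack(a_power, a_freq, a_p_max, a_w_max):
    """Project a raw attack onto the assumed feasible set.

    Power perturbations are clipped, shifted to sum to zero, and scaled back
    inside the bound if the shift pushed any entry outside it.
    Frequency perturbations are only clipped.
    """
    power = [clip(v, a_p_max) for v in a_power]
    mean = sum(power) / len(power)
    power = [v - mean for v in power]          # zero-sum across areas
    peak = max(abs(v) for v in power)
    if peak > a_p_max:
        power = [v * a_p_max / peak for v in power]  # uniform scaling keeps the sum at zero
    freq = [clip(v, a_w_max) for v in a_freq]
    return power, freq

# Toy usage with bounds of 0.3 p.u. (power) and 0.006 p.u. (frequency).
p_att, w_att = constrain_attack([0.5, -0.1, -0.1], [0.01, -0.002, 0.0], 0.3, 0.006)
```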
We consider that the adversary aims to find a sequence $\pi = (a[1], a[2], \ldots, a[T])$ for some time horizon $T$ that maximizes the frequency deviation without being detected by either the UIO detector or the ACE detector. This corresponds to solving the optimization problem An important feature of this seemingly simple problem is that the attacker has limited information about the system at every time step and has no knowledge of the future evolution of the system. Thus, (22) is essentially a sequential decision problem under uncertainty, and hence we propose to adopt a multi-objective POMDP formulation.

IV. RL-BASED ATTACKS ON AGC
In what follows we formulate the problem of computing attacks against AGC that are undetectable w.r.t. the UIO and the ACE as a multi-objective partially observable Markov decision process (MO-POMDP) [5], and propose to use reinforcement learning for obtaining an attack policy. Although we present a solution that specifically targets the two detectors discussed in Section III-B, the proposed approach can easily be extended to target any other model-based or model-free fault and attack detection method that is based on a hypothesis test in the form (11) or (16).

A. Multi-Objective POMDP Formulation
We formulate the problem by first introducing a tuple $M$ and then showing that it is a POMDP. Let $M = (S, A, R, P, O, \gamma)$, where:
• $S$ is the state space, and $s[t] \in S$ is the state at time step $t$. For our problem, this includes the state of the AGC system, the load demand, the current estimated state by the UIO, as well as the delayed measurements needed for estimating the next state. Therefore,
• $A$ is the set of the attacker's possible actions, and $a[t] \in A$ denotes the action at time step $t$ as defined in (21).
• $R[t]$ is the reward function. We propose a reward function that rewards an increase of the frequency deviation at the target area and at the same time includes punishment terms for the UIO residual and the ACE. Particularly, we use a weighted sum of the frequency deviation, the norm of the residual, and the maximum of the ACE values among different areas as the reward, where $\lambda_r$ and $\lambda_a$ are regularization coefficients (note that the frequency deviation $\Delta\omega_{i^*}$, the residual $r$, and $\mathrm{ACE}_i[t]$ are the values that result when the transition $(s[t], a[t]) \to s[t+1]$ occurs). The values of $(\lambda_r, \lambda_a)$ can be used for setting the relative importance of the impact ($\Delta\omega_{i^*}$) and (un)detectability ($r$ and ACE). Observe that the reward function essentially converts three objectives into a scalar objective, which is a widely used approach for dealing with MO-POMDPs.
Note that we assumed that the vector $y$ is observable by the attacker, and the attacked measurement vector $y'$ is the result of the attacker's actions on $y$ and, accordingly, is observable by the attacker.
• $\gamma \in [0, 1)$ is a discount factor.
Proposition 1: The tuple $M$ with the definitions in (23), (24), and (25) is a POMDP.
Proof: To prove this, we need to show that (i) $s[t]$ as defined in (23) is indeed Markovian, and (ii) the transition $(s[t], a[t]) \to s[t+1]$ contains all information needed for computing the reward. In order to prove (i), we need to show that $s[t+1]$ only depends on $s[t]$ and $a[t]$ and not the entire history, i.e., we have to verify that Equations (14), (19), and (20) and both $y'[t - \alpha + 1]$ and
Finally, $\mathrm{ACE}_i[t]$ is computed based on $y'[t]$ using (6), and $y'[t]$ is determined by $s[t]$ and $a[t]$. Hence, writing the reward in (24) as $R[t] = R(s[t], a[t], s[t+1])$ is well-justified.
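The scalarized reward described in Section IV-A, i.e., the impact term minus weighted penalties on the UIO residual and on the largest ACE, can be sketched as follows. The use of absolute values, the Euclidean norm for the residual, and the function name are assumptions made for illustration; the text only states that a norm of the residual and the maximum ACE across areas are used.

```python
import math

def attack_reward(d_omega_target, residual, ace_values, lambda_r, lambda_a):
    """Scalar reward: frequency impact minus detectability penalties (assumed form)."""
    impact = abs(d_omega_target)                             # deviation in the target area
    residual_norm = math.sqrt(sum(r * r for r in residual))  # assumed Euclidean norm
    max_ace = max(abs(a) for a in ace_values)                # worst ACE across areas
    return impact - lambda_r * residual_norm - lambda_a * max_ace

# With both coefficients at zero only the impact is rewarded, which corresponds to DRLA(0, 0).
r_impact_only = attack_reward(0.005, [3.0, 4.0], [0.1, -0.2], 0.0, 0.0)
r_stealthy = attack_reward(0.005, [3.0, 4.0], [0.1, -0.2], 0.001, 0.01)
```

Larger values of $\lambda_r$ or $\lambda_a$ push the learned policy towards attacks with smaller residuals or ACE values at the cost of a smaller frequency deviation.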

B. Attacker's Policy
To solve the above-mentioned POMDP, the attacker seeks to find a policy $\pi : O \to A$ that maximizes the expected discounted average reward. That is, the attacker's objective is finding the solution to the following problem: Note that maximizing the objective in (27) corresponds to solving the following optimization problem: which can be regarded as a relaxed approximation of the problem in (22). This justifies our definition of the reward function in (24). Finding the optimal policy is an RL problem with continuous state and action spaces. We thus propose to use deep RL for finding good policies. In what follows we refer to the attack based on this policy as the deep RL attack DRLA($\lambda_r$, $\lambda_a$).
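The quantity a policy is judged on can be illustrated with the discounted return of a single episode; averaging it over many simulated episodes gives a Monte Carlo estimate of the expectation. Any normalization used in the exact objective is not reproduced here, so the plain discounted sum and the function name below are assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Toy usage: three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
g = discounted_return([1.0, 1.0, 1.0], 0.5)
```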

V. NUMERICAL RESULTS
In this section we evaluate the proposed RL-based attacks and compare them to state-of-the-art FDIAs against AGC. All experiments were carried out on a server with an AMD 7543P CPU with 32 cores @ 2.8 GHz and 64 GB of RAM.

A. Simulation Methodology
We simulated an $N = 3$-area power system operating at a nominal frequency of 60 Hz. The parameters for areas 1 and 2 are the same as for the examples in [2, Ch. 12] and the parameters for area 3 were obtained by slightly perturbing the values for area 1, as shown in Table I. Each area is connected to the other two areas through a tie line. Although seemingly simplistic, the simulated 3-area system can model a wide range of practical systems, since each area does in reality include many generator and load buses. To simulate the dynamics of the system, we assumed a discretization time step of $T_s = 2$ seconds, which is a reasonable value considering the AGC cycle [2]. The load for each area is assumed to follow a random walk given by where $v^L_i$ follows a zero-mean Gaussian distribution with a standard deviation $\sigma^L_i = 0.02$ p.u. for all areas. Furthermore, state noise and measurement noise are added to (12) according to zero-mean Gaussian distributions with a standard deviation of 0.03 Hz for frequency variables and $\sqrt{0.03}$ MW for power variables [19], [56]. The above three factors (i.e., load fluctuation, state and measurement noise) are thus the main sources of randomness in our experiments. For the evaluation we implemented a UIO with an estimation delay of $\alpha = 2$, which is the smallest value that ensures the accuracy and stability of the UIO [48]. To choose the UIO gain matrix $L$, the eigenvalues of the matrix $(A - B S_2) - L_1 S_1$ were chosen to be equidistant values in the range $[-0.5, 0.5]$, which was observed to improve the UIO accuracy.
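The load model can be sketched as a Gaussian random walk per area. The update form (next load deviation equals the current one plus a zero-mean Gaussian increment), the zero initial deviation, and the function name are assumptions for illustration; only the standard deviation of 0.02 p.u. and the number of areas come from the text.

```python
import random

def simulate_load_random_walk(n_steps, n_areas=3, sigma=0.02, seed=0):
    """Return a list of per-area load deviations (p.u.) for n_steps + 1 time steps."""
    rng = random.Random(seed)       # seeded for reproducible experiments
    load = [0.0] * n_areas          # assumed initial load deviation
    trajectory = [load[:]]
    for _ in range(n_steps):
        load = [p + rng.gauss(0.0, sigma) for p in load]
        trajectory.append(load[:])
    return trajectory

# One episode of 150 AGC cycles for the 3-area system.
loads = simulate_load_random_walk(150)
```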
Next, to compute DRLAs, we considered that the maximum allowed deviation of the power-flow is $a^{P+} = 0.3$ p.u. $= 300$ MW, and the maximum allowed deviation of the frequency (in the case frequency measurements are attacked) is $a^{\omega+} = 0.006$ p.u. $= 0.36$ Hz, as in (18). The aforementioned values were chosen based on preliminary experiments such that the deviations in attacked measurements are large enough to affect the AGC system, but not so large as to raise alarms and initiate remedial actions by the operator. The attack objective was to maximize the frequency deviation in area 1, i.e., $i^* = 1$. Since the states and actions are continuous, popular discrete-space RL algorithms such as deep Q-network (DQN) [57] could not be used. Instead, the RL attacks were trained by the proximal policy optimization (PPO) algorithm [58]. PPO was chosen based on the results of preliminary experiments comparing its performance to other state-of-the-art continuous-space RL algorithms such as deep deterministic policy gradient (DDPG) [59] and soft actor-critic (SAC) [60]. Due to its simplicity, ease of tuning, and state-of-the-art performance in various RL tasks, PPO is currently one of the most used RL algorithms. It belongs to the class of actor-critic policy gradient algorithms. The PPO algorithm consists of two interacting neural networks (NNs): an actor network which learns to produce actions based on observations, and a critic network which learns to evaluate the actions generated by the actor network. The actions produced by the actor NN are optimized by maximizing the clipped value of the advantage function, which quantifies the advantage of taking an action compared to the average behavior. The optimization objective could possibly include minimizing the KL-divergence [61] between the policies followed in subsequent optimization steps. In our PPO implementation, we used the default PPO parameters from the RLlib Python library [62]. The discount factor used was $\gamma = 0.99$. The advantage function was estimated using generalized advantage estimation (GAE) [63] with $\lambda_{GAE} = 1$. The KL-divergence was included in the objective with a coefficient of 0.2 and a target of 0.01. The PPO clip parameter used was $\epsilon = 0.3$. The actor and critic NNs were implemented in the TensorFlow Python library [64], and each network included 2 hidden layers with 256 neurons, and tanh activation functions. The NNs were optimized using stochastic gradient descent (SGD) [65] with 30 epochs of training per batch, and a mini-batch size of 128 samples. The number of episodes needed to train each RL agent was 80,000. Each episode's length was 150 AGC cycles (i.e., 300 seconds given $T_s = 2$ s), and the attacks started at the 51st cycle, resulting in $T_e = 100$ attacked AGC cycles per episode. The initial 50 unattacked cycles were simulated to avoid any undesired interaction between the attack and the initial transient behavior of the UIO. We used three attack schemes as baselines for comparison.
a) Random Attack: At each time step, the attack is drawn from a uniform distribution, i.e., $P^{\mathrm{tie}}_{i,a} \sim U(-a^{P+}, a^{P+})$ and $\omega_{i,a} \sim U(-a^{\omega+}, a^{\omega+})$.
b) Regression Attack: Proposed in [3], the attacker fits a linear regression model of the attack impact (i.e., $|\omega_1|$) as a function of the change in the area loads $P_L[t]$ and the attacker's action $a[t]$. The optimal attack is then computed based on the learned model.
c) DRLA(0, 0): The attacker uses RL to maximize the impact, taking neither the UIO residual nor the ACE into consideration. This is achieved by setting $\lambda_r = \lambda_a = 0$ in (24).

For each attack scenario, the simulation procedure at each time step is as follows:
1) Compute the attack $a[t]$ according to the attack policy.
3) Compute the UIO residuals $r$ based on $y[t]$.
4) Simulate the state-space model of the AGC system according to (19).
5) Compute the unattacked measurements $y[t + 1]$ from (12), which will be part of the observation $o[t + 1]$ for the attacker.
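To make the PPO training signal used for the DRLAs more concrete, the following is a minimal sketch of generalized advantage estimation and of the clipped surrogate objective with a KL penalty, using the hyperparameters stated above ($\gamma = 0.99$, $\lambda_{\mathrm{GAE}} = 1$, clip parameter 0.3, KL coefficient 0.2). The function names and the sample-based KL estimate are illustrative assumptions; the sketch does not reproduce the RLlib implementation and omits the adaptive adjustment of the KL coefficient toward its target.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=1.0):
    """Generalized advantage estimation for one finished episode.

    `values` holds the critic estimates V(s_0), ..., V(s_T); the last entry
    is the bootstrap value (0.0 for a terminal state), so that
    len(values) == len(rewards) + 1.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_surrogate(logp_new, logp_old, advantages, clip=0.3, kl_coeff=0.2):
    """Clipped surrogate objective minus a KL penalty (to be maximized)."""
    total = 0.0
    kl = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|o) / pi_old(a|o)
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
        # Assumption: sample-based estimate of KL(pi_old || pi_new).
        kl += lo - ln
    n = len(advantages)
    return total / n - kl_coeff * kl / n
```

With $\lambda_{\mathrm{GAE}} = 1$, the advantage reduces to the discounted return minus the critic's value estimate, which gives low bias at the price of higher variance.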

B. Attack Impact and Detectability
In what follows, we present the results of the evaluation of our proposed DRLAs against AGC. Figure 3 (left) shows the attack impact, measured as the maximum frequency deviation in the target area (i.e., Area 1) during an episode, vs. the maximum $\ell_2$-norm of the UIO residual over one episode. Figure 3 (right) shows the attack impact vs. the maximum $\max_i |\mathrm{ACE}_i|$ over one episode. Both panels concern the case when only the power-flow measurements are attacked ($a^{P+} = 0.3$). Figure 4 shows the corresponding results when both power-flow and frequency measurements are attacked ($a^{P+} = 0.3$, $a^{\omega+} = 0.006$). Each point in the figures represents one episode, and a total of 1000 episodes were simulated per scenario. We identified non-zero $\lambda_r$ and $\lambda_a$ values by numerically exploring the Pareto frontier, and then choosing parameter pairs that achieve significant impact while retaining undetectability.

Focusing on Figure 3, we can first observe that the baselines typically achieve slightly higher impact than the proposed DRLAs. For example, the maximum impact achieved by the baselines is around 0.5 Hz, compared to slightly over 0.4 Hz for DRLAs. However, DRLAs with non-zero regularization coefficients greatly reduce the values of the detection metrics compared to the baselines (e.g., by around two orders of magnitude for the UIO residual and around one order of magnitude for the ACE), and bring the detection metrics close to their values in the no-attack scenario. Furthermore, as expected, DRLA(0.0145, 0), which penalizes high UIO residuals, achieves a good balance between impact and UIO residuals. However, it clearly fails to keep the ACE values low (similar to the baselines). The exact opposite is observed for DRLA(0, 0.02736). Comparing these two attacks, suppressing the UIO residual appears to be easier than suppressing the ACE, which indicates that the ACE might be a better metric for detecting attacks than the UIO residuals in this case. In contrast, DRLA(0.0052, 0.022) succeeds in keeping both detection metrics low, at the cost of lower attack impact.
Comparing the above results with Figure 4, we can observe that attacking the frequency measurements allows the attacker to slightly increase both the attack impact and the stealthiness. For example, DRLA(0.018, 0) can have an impact reaching 0.6 Hz, which is above the security limit of many applications. Furthermore, the same attack yields UIO residuals that are on average much lower than those of the corresponding attack in Figure 3. Note that the values of $(\lambda_r, \lambda_a)$ differ between Figures 3 and 4, since these values were chosen empirically.

The vertical line in the figures shows the detection threshold corresponding to a false positive rate (FPR) of 0.1%. The FPR is defined as the fraction of non-attacked episodes for which the detector raises an alarm, and can be controlled by changing the detection threshold. An FPR of 0.1% corresponds to a time between false alarms of $\mathrm{TBFA} = T_e T_s / \mathrm{FPR} = 100 \times 2 / 0.001 = 200{,}000$ seconds (< 0.5 false alarms per day). Figure 4 shows that the UIO detector can detect 27.6% of the DRLA(0.005, 0.022) attacks and 28.2% of the DRLA(0.018, 0) attacks. For the same FPR, the ACE detector can detect only 1.1% of the DRLA(0.005, 0.022) attacks and 8.3% of the DRLA(0, 0.029) attacks. This suggests that the UIO detector is better than the ACE detector in this case. In general, Figure 4 confirms the earlier observation that DRLA is successful in terms of impact and (un)detectability.

We have also evaluated the performance of an additional detector based on the cumulative sum (CUSUM) [66] of the UIO residuals. The results, shown in the Appendix, suggest that CUSUM does not provide a significant improvement over the above detectors, especially when both power-flow and frequency measurements are attacked.
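The relation between the FPR, the detection threshold, and the TBFA can be illustrated with a short sketch. It assumes that the threshold is set empirically from the per-episode maxima of a detection metric in unattacked episodes, and that an alarm is raised when the metric strictly exceeds the threshold; the function names are hypothetical and the empirical quantile rule is only one possible choice.

```python
import math

SECONDS_PER_DAY = 86_400


def time_between_false_alarms(attacked_cycles, cycle_seconds, fpr):
    """TBFA = T_e * T_s / FPR, in seconds."""
    return attacked_cycles * cycle_seconds / fpr


def threshold_for_fpr(no_attack_maxima, fpr):
    """Empirical threshold exceeded by at most a fraction `fpr` of the
    unattacked episodes (assumed selection rule)."""
    ordered = sorted(no_attack_maxima)
    # Number of unattacked episodes allowed to exceed the threshold.
    allowed = min(math.floor(fpr * len(ordered) + 1e-9), len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed]


def detection_rate(attack_maxima, threshold):
    """Fraction of attacked episodes whose metric exceeds the threshold."""
    return sum(m > threshold for m in attack_maxima) / len(attack_maxima)


# Values from the text: T_e = 100 cycles, T_s = 2 s, FPR = 0.1 %.
tbfa = time_between_false_alarms(100, 2, 0.001)
alarms_per_day = SECONDS_PER_DAY / tbfa
```

For the values in the text, `tbfa` is 200,000 seconds and `alarms_per_day` is 0.432, consistent with the stated bound of fewer than 0.5 false alarms per day.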

C. Training Stability
To assess the stability of DRLA, we further trained 10 separate agents for each $(\lambda_r, \lambda_a)$ tuple, excluding the baseline DRLA(0, 0), and computed the minimum, mean, and maximum reward per episode over the 10 agents as training progressed. Figure 5 shows the resulting reward curves for the trained agents, with and without attacking frequency measurements. To facilitate the comparison, the rewards were scaled over the 10 trained agents using min-max scaling. The figure shows that most agents converge with very low variance after around 10,000 episodes of training. The only exception is DRLA(0, 0.029) (when attacking both $P^{\mathrm{tie}}$ and $\omega$), which indicates that these agents might need further training. To conclude, the trained agents show in general very stable performance.

D. Immediate Response
We further consider the hypothetical scenario in which the operator reacts immediately to the attacks detected by either the UIO or the ACE detector (e.g., by discarding suspected measurements, or by initiating load shedding schemes). In this case, it is reasonable to evaluate the attacks in terms of the highest impact caused until detection, instead of the highest impact over the whole episode. For brevity, all upcoming results concern the scenario where both power-flow and frequency measurements are attacked (i.e., $a^{P+} = 0.3$, $a^{\omega+} = 0.006$), unless otherwise stated.

Figure 6 shows the relation between the attack impact before detection and the average TBFA. Every point is computed using a different value of the detection threshold ($\rho_r$ or $\rho_a$). The figure shows that the effective impact of the baseline attacks is always negligible irrespective of the chosen TBFA, since those attacks are always detected at the beginning of an episode, before they can achieve any significant impact. Interestingly, this is also the case for DRLAs targeting the wrong detection metric, e.g., DRLAs with $\lambda_r = 0$ have negligible effective impact when the UIO detector is used, and vice versa. Among DRLAs, the effective impact of the attacks with non-zero regularization coefficients increases with the TBFA until it approaches the average impact shown in Figure 4. The results in this figure and the previous figures emphasize the importance of the attacker's knowledge of the detector employed by the defender. They also show that even if the operator decides to use both detection metrics, DRLAs with $\lambda_r > 0$ and $\lambda_a > 0$ are expected to remain undetected, even if they are somewhat less impactful.
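The notion of impact before detection can be sketched as follows. The sketch assumes that an alarm is raised at the first time step where the detection metric strictly exceeds the threshold, and that only the frequency deviations observed before that step count toward the effective impact; the function name and this convention are illustrative assumptions.

```python
def impact_before_detection(freq_dev, metric, threshold):
    """Largest absolute frequency deviation observed before the first alarm.

    `freq_dev` and `metric` are per-cycle sequences of equal length for one
    episode: the frequency deviation in the target area and the detection
    metric (e.g., the UIO residual norm or the ACE magnitude).
    """
    worst = 0.0
    for dev, m in zip(freq_dev, metric):
        if m > threshold:
            break  # the operator reacts immediately, so the attack ends here
        worst = max(worst, abs(dev))
    return worst
```

An attack that trips the detector in its first cycle thus has zero effective impact, whereas an attack that never trips it retains its full episode-level impact.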

E. Data-Driven Detectors
To further investigate the detectability of the proposed DRLAs, we examine two machine learning (ML) based detection approaches: (1) an unsupervised autoencoder (AE) neural network, and (2) a supervised deep neural network (DNN) classifier. For both approaches, the input features at each time step are: (a) the measurements $y[t]$, (b) the UIO residuals $r[t - \alpha]$, (c) the norm of the UIO residuals $\|r[t - \alpha]\|_2$, and (d) the ACE in all areas. Thus, for our 3-area system this corresponds to a total of $n_f = 9 + 9 + 1 + 3 = 22$ features. The dimensions of the AE layers were $n_f \times [1, 0.7, 0.5, 0.7, 1]$ (i.e., three hidden layers), and the dimensions of the DNN layers were $n_f \times [1, 4, 0.5, 1]$ (i.e., two hidden layers). Both approaches used ReLU as the activation function, were trained with the Adam optimization algorithm, and were implemented using PyTorch.

To evaluate the data-driven detectors, we used the same simulation data described in Section V-B. The data (7 attack scenarios × 1000 episodes × 100 time steps) were split into 800 training episodes and 200 test episodes. The unsupervised AE was trained on non-attacked training data only, while the supervised DNN was trained using the whole labelled training data. Detection was then performed on the test data using a hypothesis test similar to (11) and (16), where the test statistics for the AE and the DNN were the mean squared error (MSE) of the AE reconstruction (the difference between the input and output layers) and the scalar output of the DNN, respectively.
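As an illustration of the feature construction and of the AE test statistic, consider the following minimal sketch. The function names are hypothetical, the residual vector passed in is assumed to be already delayed by $\alpha$ steps, and rounding the fractional AE layer widths to the nearest integer is an assumption, since the rounding rule is not stated.

```python
import math


def build_features(y, r_delayed, ace):
    """Concatenate measurements, UIO residuals, residual norm, and ACEs."""
    residual_norm = math.sqrt(sum(v * v for v in r_delayed))
    return list(y) + list(r_delayed) + [residual_norm] + list(ace)


def ae_layer_widths(n_features, multipliers=(1, 0.7, 0.5, 0.7, 1)):
    """AE layer widths n_f * [1, 0.7, 0.5, 0.7, 1] (rounding is assumed)."""
    return [round(n_features * m) for m in multipliers]


def reconstruction_mse(x, x_hat):
    """AE test statistic: mean squared error between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# 3-area system: 9 measurements, 9 residuals, 1 norm, 3 ACE values.
features = build_features([0.0] * 9, [3.0, 4.0] + [0.0] * 7, [0.0] * 3)
widths = ae_layer_widths(len(features))
```

Under these assumptions the AE has widths 22, 15, 11, 15, and 22, and an alarm would be raised whenever `reconstruction_mse` exceeds the chosen detection threshold.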
To compare the performance of the ML detectors to that of the UIO and ACE detectors, we use receiver operating characteristic (ROC) curves. The ROC curve shows the trade-off between the fraction of attacked episodes for which a detector raises an alarm (true positive rate, or TPR) on the vertical axis, and the FPR on the horizontal axis, and is obtained by varying the detection threshold (e.g., $\rho_r$ in (16) for the UIO detector). The area under the ROC curve (AUC) is a commonly used evaluation metric that summarizes the performance of the detector. An ideal detector would have AUC = 1, while a detector with AUC = 0.5 performs no better than random guessing.
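The AUC can be computed without tracing the ROC curve explicitly, since it equals the probability that a randomly chosen attacked episode receives a higher detection score than a randomly chosen unattacked episode. The sketch below uses this rank-based formulation; the function name and the per-episode score inputs are illustrative assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(no_attack_scores, attack_scores):
    """Area under the ROC curve via pairwise comparison of episode scores.

    Each score is the per-episode detection statistic (e.g., the maximum
    UIO residual norm). Ties count as half a win.
    """
    wins = 0.0
    for a in attack_scores:
        for n in no_attack_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(attack_scores) * len(no_attack_scores))
```

This formulation also makes clear how an AUC below 0.5 can arise: it means that attacked episodes tend to score lower than unattacked ones under the given detection statistic.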
Table II shows the AUC achieved by each of the detectors, as well as the mean impact of each attack. Observe that the impact was defined as the maximum observed frequency deviation in area 1, and hence the no-attack scenario has non-zero impact. From the table, we can see that the ML detectors (especially the DNN) generally have significantly higher AUC values than the UIO and ACE detectors. Nonetheless, the unsupervised AE performs surprisingly poorly against DRLA(0.005, 0.022), even though the attack was not specifically trained to bypass it. This result indicates the potential generalization power of DRLA against unseen detectors. Finally, it is worth noting that although the supervised DNN can effectively detect all considered attacks, the performance of supervised ML typically degrades against unseen and zero-day attacks [67], [68]. Moreover, the acquisition of accurately labelled data in real scenarios might not always be feasible [69], [70].

F. Impact of Parameter Misestimation
We now consider the case when the operator's model of the AGC system (i.e., $H$, $D$, $\tau_t$, $\tau_g$, $R$) is slightly inaccurate. Model inaccuracy would affect the control accuracy of the PI controllers and the UIO residuals, making an attack potentially more difficult to detect. To simulate this scenario, we consider that the real system parameters are 20% higher than those in Table I, and use them for evaluating the attack and the detection schemes. We consider the case of symmetric information availability, i.e., the operator and the attacker have access to the same inaccurate parameters. The available parameters are the ones shown in Table I, and they are used for computing the UIO matrices and the residuals, and for training DRLAs. There is thus a 20% estimation error, which would not drastically increase the frequency deviations or UIO residuals without an attack, but is large enough to affect detectability.

Table III presents the attack impact and the AUC achieved by the detectors in this scenario. Surprisingly, even though the attacker uses the same inaccurate parameters as the operator, the attack impact is significantly increased for DRLAs compared to Table II, while the attacks remain completely undetectable w.r.t. most detectors. Observe that the AUC values of the UIO and ACE detectors are significantly smaller than 0.5 for some attacks. This means that DRLA learns to yield UIO residuals and ACE values that are on average smaller than in the no-attack case.
Overall, our results indicate that DRLAs are powerful w.r.t. both the impact inflicted on the power grid and the stealthiness against a wide range of detectors. However, the results also suggest potential methods to enhance the security of AGC, including (1) obtaining more accurate system models and information, (2) utilizing supervised ML detectors with rich training data, and (3) securing measurements against physical and network intrusions by, e.g., utilizing redundant frequency measurements.

VI. CONCLUSION
In this paper, we investigated the vulnerability of state-of-the-art AGC to attacks against power and frequency measurements. We formulated the problem of attacking an AGC system equipped with multiple fault and attack detection methods as a POMDP. We proposed an RL solution based on the proximal policy optimization algorithm to compute the attacked sensor measurements. Our results show the superiority of the proposed RL-based attacks over several baseline attacks in terms of stealthiness and attack impact, and show that sophisticated attacks could bypass existing detection schemes and drive the grid frequency to critical trajectories. One direction for future work is to analyze the practical feasibility of the proposed attack under weaker attack models, e.g., attackers without knowledge of the system parameters, or attackers manipulating measurements in only one area.

A. FDIA Detection using UIO and CUSUM
In what follows, we consider that the system operator is using a combination of the UIO detector and CUSUM (referred to as the CUSUM(UIO) detector). The detection metric of the CUSUM(UIO) detector is computed as [66]

$$S_r[t] = \max\left(0,\; S_r[t-1] + \|r[t]\|_2 - b_r\right), \tag{29}$$

where $\|r[t]\|_2$ is the $\ell_2$-norm of the UIO residual at time $t$, $b_r$ is a bias term chosen to be equal to the mean UIO residual in the normal (unattacked) case, and $S_r[0] = 0$. An alarm is then raised by the detector if

$$S_r[t] > \rho_c,$$

where $\rho_c$ is a predefined detection threshold.
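The CUSUM recursion in (29) can be sketched in a few lines. The function names are hypothetical, and the alarm rule assumed here is that the statistic strictly exceeds the threshold $\rho_c$.

```python
def cusum(residual_norms, bias):
    """CUSUM statistic S_r[t] = max(0, S_r[t-1] + ||r[t]||_2 - b_r), S_r[0] = 0."""
    s = 0.0
    stats = []
    for norm in residual_norms:
        # Accumulate the excess over the bias, but never drop below zero.
        s = max(0.0, s + norm - bias)
        stats.append(s)
    return stats


def first_alarm(stats, threshold):
    """Index of the first step where the statistic exceeds the threshold."""
    for t, s in enumerate(stats):
        if s > threshold:
            return t
    return None  # no alarm during the episode
```

Because the statistic only grows when the residual norm exceeds the bias $b_r$, residuals that stay at or below their unattacked mean keep it at zero, which is why accumulating residuals helps only when attacked residuals are on average larger than unattacked ones.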
Using the same simulated data described in Section V-B, we evaluated the CUSUM(UIO) detector against the baseline FDIAs and our proposed DRLAs. The results are shown in Figure 7 and Figure 8. The figures show the trade-off between the attack impact and the maximum CUSUM detection metric ($S_r$) during an episode. For the case when only power measurements are attacked, Figure 7 shows that using CUSUM can improve the separability of the attacked and the non-attacked measurements, compared to the UIO detector, which directly uses the raw residuals (cf. Figure 3). In contrast, when both power and frequency measurements are attacked, Figure 8 shows that the CUSUM(UIO) detector did not bring any performance improvement over the UIO detector (cf. Figure 4). Observe that in the former case, the attacked UIO residuals were on average higher than the non-attacked ones. Using CUSUM in that case allows this difference to accumulate over time, and thus boosts the detection performance. In the latter case, on the other hand, the attacked UIO residuals were on average less than or equal to the non-attacked residuals. Thus, using CUSUM makes little to no difference in the detection performance.
Note that the DRLAs in Figure 8 were capable of bypassing the CUSUM(UIO) detector despite the fact that they were trained to minimize $\|r[t]\|_2$ and not $S_r[t]$. Training DRLAs that target $S_r[t]$ should yield even stealthier attacks. Furthermore, one could also implement a CUSUM(ACE) detector, but our results suggest that such a detector would provide little improvement in detection performance, especially when both power and frequency measurements are attacked.

Fig. 1. Block diagram of automatic generation control of a 2-area power system using ACE, including the locations of FDIAs.

Fig. 3. Trade-off between attack impact and detection metrics for DRLAs and baselines, when only power-flow measurements are attacked.

Fig. 4. Trade-off between attack impact and detection metrics for DRLAs and baselines, when both power-flow and frequency measurements are attacked.

Fig. 6. Trade-off between the highest achieved attack impact before detection and the time between false alarms.

Fig. 7. Trade-off between attack impact and the CUSUM detection metric for DRLAs and baselines, when only power-flow measurements are attacked.

Fig. 8. Trade-off between attack impact and the CUSUM detection metric for DRLAs and baselines, when both power-flow and frequency measurements are attacked.

TABLE I. PARAMETERS OF THE CONSIDERED THREE-AREA POWER SYSTEM

TABLE II. COMPARISON OF THE ATTACKS W.R.T. THEIR IMPACT AND CORRESPONDING AUC SCORES BY THE DIFFERENT DETECTORS

TABLE III. COMPARISON OF THE ATTACK IMPACTS AND CORRESPONDING AUC SCORES, IN THE PRESENCE OF 20% PARAMETER MISESTIMATION