Deep Reinforcement Learning for Cybersecurity Assessment of Wind Integrated Power Systems

The integration of renewable energy sources (RES), and specifically wind and solar PV systems, is rapidly increasing in electric power systems (EPS). While the inclusion of these intermittent RES coupled with the wide-scale deployment of communication and sensing devices is important towards a fully smart and modern grid, it has also expanded the cyberthreat landscape, effectively making power systems vulnerable to cyberattacks. This paper proposes a cybersecurity assessment approach designed to assess the cyberphysical security of EPS. The work takes into consideration the intermittent generation of RES, vulnerabilities introduced by microprocessor-based electronic information and operational technology (IT/OT) devices, and contingency analysis results. The proposed approach utilizes deep reinforcement learning (DRL) and an adapted Common Vulnerability Scoring System (CVSS) score tailored to assess vulnerabilities in EPS, in order to identify the optimal attack transition policy based on N-2 contingency results, i.e., the simultaneous failure of two system elements. The effectiveness of the approach is validated via numerical and real-time simulation experiments, which in turn, demonstrate how the proposed process successfully identifies potential threats that can be utilized by attackers to cause critical EPS disruptions.


I. INTRODUCTION
The power grid is the cornerstone of all critical infrastructures. The safe and secure functionality of electric power systems (EPS) is directly related to every aspect of the economy and society. In the last decades, worldwide energy demand has increased significantly and is estimated to grow by nearly 50% by 2050 [1]. Due to the increasing energy demand as well as the need to enhance system efficiency and asset reliability, the technological modernization of the power grid infrastructure has become an immediate priority for governments and energy stakeholders around the world. This modernization, together with environmental concerns, is a driving factor for the integration of renewable energy sources (RES) into the power grid. For example, the U.S. Energy Information Administration (EIA) indicates that, in 2019, wind was responsible for generating approximately 42% of RES generated power at utility-scale facilities in the U.S., and 7.3% of the total U.S. electricity generation, making it the most popular RES [2].
There are still many challenges that need to be addressed before a harmonious integration of RES can be achieved. Some of these challenges are directly related to the intermittent nature of RES, their effects on power system security and contingency analyses, and the growing number of cybersecurity threats to the electronic devices supporting EPS operation.

A. Paper Motivation
The cybersecurity assessment approach presented in this paper aims to provide an effective way for system operators and stakeholders to assess the cyberphysical security of EPS with high penetration of renewables and a large number of computing, communication, and sensing devices. The proposed approach follows a step-by-step process, from an attacker's point of view, designed to identify the most critical threats an adversary can leverage to compromise the targeted EPS. In particular, our work addresses key challenges identified when performing cybersecurity assessments in modern EPS: (1) it captures the impact that intermittent RES generation, and specifically wind generation, has in contingency analyses, and (2) it considers vulnerabilities that exist in information and operational technology (IT/OT) electronic devices within system nodes that can potentially cause adverse effects in the EPS operation. In addition, the proposed approach performs the cybersecurity assessment without the need for full observability of the cyberphysical system, i.e., the physical (electrical) and the cyber (communication network) systems.
1) Wind Intermittency and Impact on Contingency Analysis: Even though wind integration aids in accommodating the increasing power demand, its intermittent nature introduces challenges related to the mismatch between supply and demand. For instance, short-term wind power fluctuations occur on a second or sub-second timescale during which load balancing methods do not yet operate. Thus, to ensure system stability, critical aspects such as optimal location, power flow, and generation variance must be taken into consideration when interconnecting wind energy systems.
Traditionally, contingency analysis has been used to assess physical power system security in EPS [3]. This is achieved by calculating the power flow of all the lines and elements of the system in the event of a single failure or multiple failures. In essence, a contingency is the failure or loss of any element such as a circuit breaker, generator, or transmission line. Contingencies can be planned or unplanned. Planned contingencies include events resulting from scheduled maintenance and proactive emergency preparedness, while unplanned events include fluctuating wind injections, cyberattacks, human errors, etc. The North American Electric Reliability Corporation (NERC) requires system operators to meet the N − 1 security constraint and classifies systems into four main categories [4]. These categories are shown in Table I.

Table I. Contingency categories defined by NERC [4]. Columns: Category | Contingency Case.
The intermittent nature of wind power creates challenges when performing security studies of power systems based on contingency analysis. This intermittency can rapidly change the most critical contingencies of the system or create a number of contingencies (λ) that exceeds the maximum number of contingencies that the system can handle (k), thus leading to cascading scenarios. A prime example of insufficient security margins is the widespread power outage across the U.K. in 2019 [5]. The near-simultaneous loss of two generation sites, one being an offshore wind farm and the other a gas-fired power station, resulted in a massive under-frequency event. Load shedding mechanisms responded immediately, causing a major disturbance that affected nearly one million people. A detailed study on the effect of intermittent wind power generation on contingency results is presented in Section III.
2) Vulnerabilities of IT/OT Electronic Devices: The power grid is experiencing a rapid move towards a more interconnected system. Currently, OT electronic devices deployed and operated at all scales and levels of the power system are being designed and retrofitted with IT devices to support communication processes and protocols that enhance the controllability and observability of the system. The use of such digital electronic devices with software applications, modules, drivers, commercial-off-the-shelf (COTS) hardware, and network resources is a double-edged sword. On one hand, it assists in the development of the future modern and advanced grid in terms of optimizing asset utilization, addressing disturbances, providing better power quality, and accommodating all storage and generation options with grid-support functions. On the other hand, the coupling between such cyber-electronic devices and physical components in power systems has altered the threat model. In the past, the threat model focused solely on physical threats. However, due to the integration of such network-controlled components, the security challenges need to consider both the cyber and physical nature of the grid, addressing the growing number of emerging threats. Some examples of these potential threats are presented in [6]-[8], which demonstrate that attackers can leverage publicly available sources by using open-source intelligence (OSINT) techniques combined with open-source exploitation methods in order to spoof GPS signals coming from phasor measurement units (PMUs). Another example is presented in [9], where a real-world attack within the Ukrainian power system is accomplished by injecting malicious firmware in serial-to-Ethernet gateways at targeted substations. Attackers were able to trip circuit breakers and cause a blackout that affected approximately 225,000 customers.

B. Related Work
In this part, we explore some of the state-of-the-art approaches proposed by researchers to address issues related to: (1) N − k contingency simulations considering intermittent RES generation, (2) assessing the severity of security vulnerabilities in electronic devices, and (3) vulnerability and risk assessment methods for cyberphysical EPS.
1) Methods for N − k Contingency Analysis: Towards reducing the occurrence of cascading failures, existing research efforts have focused on proposing efficient methods that can perform studies based on N − k contingency scenarios. Due to the size and complexity of power systems, these "what-if" contingency scenarios are based on computationally expensive optimal power flow processes. Research in this area aims to address the computational overhead of N!/[k!(N − k)!] simulations for N − k contingencies. For example, the work in [10] describes a fast-bounding case which requires a small online memory model. Other efforts compute the active power flow change at lines and the voltage change at buses to evaluate the severity of N − 1 and N − 2 contingencies [11]. In [12], a graph-based power model analysis is presented for contingency ranking. In [13], a heuristics pruning approach for identifying N − k contingencies is discussed, while a topology-based algorithm that considers whether the generator or line is in densely populated areas is presented in [14]. The authors use the concepts of closeness and betweenness centrality to determine the component's importance for an N − k criterion.
One of the main challenges of performing contingency studies in power systems with high penetration of wind is that the uncertain nature of wind causes high variability when identifying the most critical N − k contingencies of the system. Existing studies often do not take this variability into account [15]-[17]. The authors in [16] and [18] demonstrate some of the effects that intermittent power generation has on critical contingency identification. They present probabilistic power flow studies that show how the variable nature of power flow, due to wind fluctuations and uncertainties, can alter the number and location of the most critical contingencies recognized by system operators. The correct identification of these critical contingencies is of paramount importance as they can be potentially leveraged by adversaries in order to cause major disruptions in EPS [6], [7], [19], [20].
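To give a sense of the combinatorial growth behind the N!/[k!(N − k)!] term, the following minimal sketch counts N − k contingency cases for a hypothetical system with N = 100 outage candidates. The system size and the function name are illustrative assumptions and are not taken from the cited works.

```python
from itertools import combinations
from math import comb


def contingency_count(n_elements: int, k: int) -> int:
    """Number of distinct N-k contingencies: N! / (k! (N-k)!)."""
    return comb(n_elements, k)


# Hypothetical system size, chosen only for illustration.
N = 100
counts = {k: contingency_count(N, k) for k in (1, 2, 3)}

# Brute-force enumeration of all N-2 pairs agrees with the closed form.
n2_pairs = list(combinations(range(N), 2))
```

Under these assumptions, moving from N − 1 to N − 2 raises the number of cases from 100 to 4,950, and N − 3 raises it to 161,700, which illustrates why pruning and ranking methods are needed.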
2) Severity Assessment of Electronic Devices Security Vulnerabilities: The wide-scale integration of information and communication technologies in the form of digital electronic devices into the electrical grid expands the list of possible attack vectors that adversaries could exploit to cause major disruptive events. Hence, in order to ensure the secure operation of the entire system, it is essential to consider the inherent vulnerabilities introduced by the grid-supporting OT/IT infrastructure.
One scoring system that is widely used for device-level vulnerability assessments in the IT industry is the Common Vulnerability Scoring System (CVSS) [21]. The CVSS can assess the severity of software, hardware, and firmware vulnerabilities by using numerical scores. One example of its use can be found in [22]. Here, the authors utilize CVSS to estimate the probability of successfully exploiting identified independent vulnerabilities, including zero-days, existing in components connected to the LAN of a supervisory control and data acquisition (SCADA) system. Another example of CVSS use can be found in [23], where a CVSS-based cyber asset impact score is presented with the objective of providing a real-time cyber impact severity score that can be used as a basis for processes such as vulnerability management, isolation of cyber assets, and system reconfiguration.
3) Vulnerability and Risk Assessment Methods for Cyberphysical EPS: Several researchers have focused on developing system-wide security assessment tools aimed at identifying possible vulnerabilities and attack vectors, which can subsequently be used to produce optimal control policies designed to guide secure operations of cyberphysical EPS. One example is presented in [24]; the authors propose power system emergency control mechanisms based on Deep Q-Networks (DQNs) to maintain the reliable operation of the system by performing dynamic braking of generation and undervoltage load shedding. An extensive review of the latest Artificial Intelligence/Deep Reinforcement Learning (AI/DRL) models used for cybersecurity purposes can be found in [25]-[27]. These studies provide an overview of how AI/DRL models are being used to improve cybersecurity in EPS operations.
Other works have focused on more traditional ranking mechanisms to improve EPS cybersecurity. For example, the research presented in [14] assesses system vulnerability from the cyberphysical security perspective using contingency ranking methods and a cyber-intrusion ranking methodology. Similarly, in [28], an operational reliability impact assessment framework has been developed. In this study, the authors incorporate cyberphysical threats in the assessment of the EPS operation. Another approach is presented in [29], describing an overload risk assessment method based on N − 1 contingency analysis and wind penetration.

C. Paper Contributions
While the research studies described above have yielded fruitful results, prior efforts have critical shortcomings that restrict their utility and broader applicability in modern cyberphysical EPS as they do not concurrently consider: (1) contingency analysis studies that take into account intermittent wind generation when performing the cyberphysical security assessment of the EPS, (2) threat models that incorporate quantitative evaluations regarding vulnerabilities of OT/IT electronic devices when identifying the optimal attack transition policy, and (3) the use of reinforcement learning models that alleviate the problem of requiring full observability of the system from the attacker's perspective when performing the assessment.
In this work, we provide a cybersecurity assessment approach for wind integrated EPS that leverages the DRL paradigm while considering the inherent vulnerabilities of adopting COTS electronic devices. The proposed approach takes into account physical-based aspects such as contingency analysis and wind uncertainty, together with cyber-based aspects such as quantitative scoring systems of vulnerabilities identified in IT/OT devices supporting the grid infrastructure. Our approach is capable of identifying potential threats that can be utilized by attackers to cause serious disruptions in EPS. Our contributions are summarized as follows:
(1) We propose a cybersecurity assessment approach that considers adversaries who use OSINT modeling techniques to construct models of power systems, which are then used in tandem with contingency analysis that takes into account intermittent wind generation to identify the critical cyber and physical vulnerabilities of the EPS as a cyberphysical system. The assessment process is performed without the need for full observability of the system since it models the state of the power system as a partially observable Markov decision process (POMDP) that is solved using DQNs. The solution given by the proposed DQN reveals the optimal attack transition policy that an adversary would follow to potentially induce cascading failures in the assessed cyberphysical EPS.
(2) We propose an adapted version of CVSS based on contingency analysis results and vital information, from the power and communication networks, that reveal cyber and physical vulnerabilities within system nodes. The adapted CVSS is used to generate a transition graph designed to assess the complexity of each possible attack path based on various adversarial strategies.
(3) We evaluate the performance of the proposed methodology using digital real-time simulations on test power systems highlighting its value in a real-world cyberphysical scenario. We also demonstrate how our algorithm can be utilized by system operators to assess the weaknesses of the power grid.
The rest of the paper is organized as follows. In Section II, we present the methodology of the proposed cybersecurity assessment approach. Section III presents contingency analysis simulation results for various power system test cases. In Section IV, we demonstrate the effectiveness of the proposed cybersecurity assessment approach and compare different transition techniques. Finally, Section V concludes the paper.

II. CYBERSECURITY ASSESSMENT METHODOLOGY
In this section, we provide the methodology of the proposed approach. Fig. 1 shows the step-by-step process that our cybersecurity assessment approach follows. The assessment process first determines the (1) threat model based on adversary objectives and capabilities. Specifically, our work considers an attacker that leverages OSINT techniques to run (2) contingency analysis with the objective of identifying the set of k critical contingencies of the system. Our results focus on two contingencies which assess the power system condition when two components are lost, i.e., k = 2. However, the proposed approach can be extended to consider a higher number of contingencies. To proceed with the assessment process without the need for full system observability, the proposed approach creates a (3) POMDP by defining a transition probability (TP) based on the proposed adapted version of the CVSS score metric. The score evaluates the difficulty of each network transition in the generated system graph. Then, the POMDP is solved using a (4) DRL model designed to find the optimal attack policy between the previously identified contingencies. Finally, the (5) output of the cybersecurity assessment process evaluates the potential threat by revealing the optimal attack transition policy between the identified contingency pair which could cause cascading failures in the physical system. The details of each step are presented in the following subsections.
Figure 1. Graphical depiction of the major steps of the proposed cybersecurity assessment process and the optimal attack transition policy given as output.

A. Step 1: Threat Model
In this work, we consider a threat model in which an attacker can leverage publicly available information using OSINT techniques to collect sufficient system data such as the line parameters and status of circuit breakers. Also, the attacker is able to acquire data to calculate power flow and therefore run contingency analysis [7]. Depending on the degree of system contingencies (e.g., N − 1 secure system), the adversary can leverage the ranking results of contingency algorithms to identify which system elements, if "removed", can lead to an insecure power system state. Although a plethora of public power system information is available, it is unlikely that the attacker will ever have full knowledge and real-time observability of the system [30]. In our approach, it is assumed that the attacker, in spite of having the necessary information to perform contingency analysis via OSINT techniques, does not have the full state information of the system. Specifically, while the adversarial agent is transitioning through the cyber system network to exploit vulnerabilities in the identified double contingency nodes, he/she is unaware of his/her position relative to the contingencies and of the cyber network transition complexities (based on the adapted CVSS) of the different attack paths.
In addition, we assume that the cyber system network graph is isomorphic with the physical system graph, indicating that the topology of the communication network maps onto the topology of the physical system. Therefore, we model the environment as a POMDP in which the agent may only access the current state and make an observation for obtaining possible actions in each state (Step 3). Based on the observation results for each state-action combination, the network TP is calculated. This probability reveals the transition complexity between different states. By leveraging this methodology, a DQN-based algorithm is then utilized to identify the optimal attack transition policy between the critical contingency elements (Step 4).

B. Step 2: Contingency Analysis
In order to find the attacker's optimal attack transition policy, we first need to identify the set of critical double contingencies of the physical power system (e.g., simultaneous N − 2 or consecutive N − 1 − 1). We utilize a fast pruning N − 2 algorithm to find all the thermal constraint violations via linear power flow approximation [31]. The algorithm is initiated based on the set of all N − 2 pairs. The contingency candidate list is pruned using line outage distribution factors (LODFs). LODFs describe the power flow impact on other lines when a line outage occurs. The pruning approach is based on the thermal constraints of lines, running until the number of contingencies included in the set does not change. If the LODF exceeds its thermal constraint, it is added to the contingency candidate set. The line overload condition can be written as $A_{xy} \cdot B_{xc} + A_{yx} \cdot B_{yc} > 1$, where x and y are lines experiencing outages, z is an arbitrary line experiencing power flow changes, and c is a possible constraint. Matrix $A_{xy}$ can be calculated by

C. Step 3: POMDP Transition Model Based on Adapted CVSS
After finding the most critical contingency set, the process advances to create the corresponding POMDP of the cyberphysical-graph environment by calculating the corresponding TP between the different nodes of the system. Generally, POMDPs are used to model the response and outcomes of systems when different actions are performed at specific states. In our environment, observations made by the attacker do not provide full state information, i.e., the agent does not know a priori how many nodes the system has nor their respective states, and he/she needs to observe the environment to determine potential actions, hence the selection of POMDP system modeling. POMDPs can be mathematically modeled as a 6-tuple (S, A, Ω, P, R, O), where S is the set of all possible states in a given environment, A contains all the agent's potential actions, Ω is a set which includes all possible observations, P is the TP for each state, R is the reward function for performing different actions, and O represents conditional observation probabilities. At the current state s, given the TP and observation o, the agent takes action a to move to the next state s'. As a result of this state-action pair, the agent receives reward R. This process repeats until the terminal state is reached. In this POMDP formulation, the TP for each state is an essential factor that must be determined adequately according to the process being modeled. In our case, the TP relies on the cyber system vulnerabilities, i.e., vulnerabilities that exist in electronic devices, and their potential impacts related to the physical system, i.e., the identified power system contingencies.
Figure 2. Overview of the transition probability (TP) assessment.
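The 6-tuple (S, A, Ω, P, R, O) and the state-action-reward loop described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions and not the authors' implementation: the three-node graph, the TP values, the reward values, and the names `POMDP`, `step`, and `toy` are hypothetical, and the observation model O is simplified to a deterministic function that returns the neighbors reachable from the current node.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable


@dataclass
class POMDP:
    states: List[State]                                          # S
    actions: List[Action]                                        # A
    observations: List[tuple]                                    # Omega
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # P (the TP)
    reward: Callable[[State, Action, State], float]              # R
    observe: Callable[[State], tuple]                            # O (deterministic here)
    terminal: State

    def step(self, s: State, a: Action, rng: random.Random):
        """Sample s' from P(. | s, a); return (s', observation, reward, done)."""
        dist = self.transition[(s, a)]
        candidates = list(dist)
        s_next = rng.choices(candidates, weights=[dist[c] for c in candidates])[0]
        return s_next, self.observe(s_next), self.reward(s, a, s_next), s_next == self.terminal


# Hypothetical 3-node line graph 0 - 1 - 2; node 2 is the target contingency node.
neighbors = {0: (1,), 1: (0, 2), 2: (1,)}
# Hypothetical TP values: the move s -> a succeeds with probability p, else the agent stays.
tp = {(0, 1): 0.8, (1, 0): 0.8, (1, 2): 0.4, (2, 1): 0.8}
toy = POMDP(
    states=[0, 1, 2],
    actions=[0, 1, 2],
    observations=list(neighbors.values()),
    transition={(s, a): {a: p, s: 1.0 - p} for (s, a), p in tp.items()},
    reward=lambda s, a, s_next: 1.0 if s_next == 2 else -0.1,
    observe=lambda s: neighbors[s],
    terminal=2,
)

# Random-policy rollout: the agent only sees the actions available at its current node.
rng = random.Random(0)
s, done, trajectory = 0, False, [0]
for _ in range(50):
    a = rng.choice(toy.observe(s))
    s, obs, r, done = toy.step(s, a, rng)
    trajectory.append(s)
    if done:
        break
```

A learning agent such as a DQN would replace the random action choice with a policy trained on the observed rewards; the rollout above only illustrates the interaction loop.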
Considering the cyber network system vulnerabilities as well as the optimal attack transition policy between the identified contingencies (physical vulnerabilities), a TP for each transition step (between cyberphysical system nodes) can be determined. These probabilities aid the traversal agent's decision making since the values reveal the difference in complexity and difficulty for each transition, i.e., how vulnerable the cyberphysical system is at each node (bus) from the point of view of the attacker transition policy. In each step, the node's identified cyber and physical characteristics, including the electronic device vulnerabilities, thermal limits of lines, and power generation, are considered. A graphical illustration of this procedure is shown in Fig. 2. In this work, we compute the TP using an adapted version of CVSS v3.1.
CVSS is a vulnerability scoring system generally used in the IT industry to assess the severity of vulnerabilities identified in computer systems. Although there exist temporal and environmental metrics in CVSS, their main aim is to reflect how vulnerabilities change over time or demonstrate uniqueness to a particular user's environment [21]. For our application, base metrics portray a better picture regarding how the cyber and physical vulnerabilities at each power system node affect the transition difficulty of the threat. More specifically, the base score provides a comprehensive assessment of the intrinsic characteristics of identified vulnerabilities using quantitative Exploitability and Impact metrics as shown in Fig. 3. The range of scores goes from 0 to 10, with 10 being the most severe (maximum) value.
1) Exploitability Metric: This metric describes the difficulty and technical means by which a software, hardware, or firmware vulnerability can be exploited. In our case, the exploitability represents the difficulty of vulnerability exploitation for each electronic device that exists in a particular node of the cyber-layer of the power system. In other words, it represents the complexity of the transition based on the type of node (i.e., PQ or PV power system bus) to which the agent is transitioning. The overall score of this metric is determined by five submetrics, described below.
a) Attack Vector (AV): This metric is defined as one of the following categories: network, adjacent, local, or physical. In a network attack, an adversary exploits a vulnerable device bound to the network stack. This type of attack is conducted through the Open Systems Interconnection (OSI) layer 3. In an EPS, an attacker may conduct a network attack by manipulating TCP-level packets flowing across a substation network. In an adjacent attack, the adversary also exploits vulnerable devices bound to the network stack; however, the attack cannot be performed across an OSI layer 3 boundary. In essence, the attack is limited to the same shared physical or logical network. An example of this type of attack is an Address Resolution Protocol (ARP) flooding attack that leads to a denial-of-service targeted at control and monitoring devices connected to a LAN segment of a microgrid [32]. In a local attack, a direct path to the vulnerable element is required (e.g., via a local terminal, via a remote terminal, or by deceiving legitimate users into executing malicious instructions). In an EPS, this type of attack could be performed by executing malicious code in a local control or monitoring electronic device accessed via local or remote terminal. Finally, for a physical attack, actual physical interaction between the attacker and the target is necessary. In an EPS, this means that the attacker must compromise the targeted electronic devices through physical means (e.g., causing physical damage to the devices).
b) Attack Complexity (AC): This metric represents the amount of effort an attack on the vulnerable electronic device would require. The value of this metric, high or low, depends on the security level of the electronic devices as well as the adversary's capabilities and skills. In EPS, generation buses can be considered of more significance than load buses with regard to power grid operation and, consequently, possible threats. Typically, additional security mechanisms are in place to protect bulk generation infrastructure [33]. This is accomplished by using electronic security devices, physical barriers, or security monitoring equipment. Hence, as part of our CVSS vulnerability scoring, PV and PQ buses are considered of high and low attack complexity, respectively.
c) User Interaction (UI): This metric reveals whether user interaction is required to exploit a certain electronic device. It quantifies the amount of participation required from a human user, different from the attacker, to successfully compromise the targeted device. For example, attackers could attempt to deceive the system operator into giving them access to the control room via phishing or malware attacks. Due to the importance of PV buses, we assume that the attacker will require UI to manipulate a PV bus. On the contrary, it is assumed that attackers would not need to obtain special permission from another human user to access PQ buses. The values for this metric are: required for PV buses, and none for PQ buses.
d) Privileges Required (PR): This metric determines the level of privileges needed to carry out an attack, i.e., it evaluates the level of privileges that are required by the attacker before successfully compromising the vulnerable electronic device. Similarly to the previous metric, we designate its values according to the type of power system bus being evaluated: high for PV buses, and low for PQ buses.
e) Scope (S): This metric indicates whether or not compromising a particular electronic device will have implications beyond its security scope. If the scope metric is defined as changed, attacking the corresponding electronic device will have detrimental implications beyond its security scope, i.e., it will affect other elements in the system. If the scope is defined as unchanged, it will only affect elements under the same security scope. In our context, when a PQ bus is attacked, no major disturbances are observed in other system elements since generation is not directly affected; thus, its scope can be defined as unchanged. However, if a PV bus is compromised, more severe effects on surrounding nodes of the physical EPS network, caused by power stability issues, are observed. In this case, the scenario needs to be characterized as having changed scope.
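The bus-type assignments stated in items b) through e) can be collected in a small lookup table. The sketch below is only an illustration: the class and function names are hypothetical, and the attack vector is omitted because the text does not tie it to the bus type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExploitabilityProfile:
    attack_complexity: str    # "high" or "low"
    user_interaction: str     # "required" or "none"
    privileges_required: str  # "high" or "low"
    scope: str                # "changed" or "unchanged"


# Values as assigned in the text: PV (generation) buses are treated as better
# protected and more consequential than PQ (load) buses.
BUS_PROFILES = {
    "PV": ExploitabilityProfile("high", "required", "high", "changed"),
    "PQ": ExploitabilityProfile("low", "none", "low", "unchanged"),
}


def profile_for(bus_type: str) -> ExploitabilityProfile:
    """Return the submetric values for a bus type; reject anything but PV/PQ."""
    try:
        return BUS_PROFILES[bus_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported bus type: {bus_type!r}") from None
```

Keeping the assignments in one table makes it easy to revisit an assumption, for example the attack complexity of PQ buses, without touching the scoring code.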
Following the description of the exploitability metrics, Table II shows a detailed comparison between the metric values found in different available scoring systems. These scoring systems are CVSS v3.1 [21], CVSS v2.0 [34], and the Industrial Vulnerability Scoring System (IVSS) [35]. CVSS v3.1 is the most up-to-date scoring system and provides the most accurate way of capturing the main characteristics of a vulnerability via numerical scores. IVSS is an outdated scoring system that is not widely used or supported by the community. Other quantitative risk assessment scoring systems, such as CCSS [36] and CMSS [37], were also considered when selecting the appropriate scoring system. However, all of these scoring systems are based on the previous version of CVSS, i.e., CVSS v2.0.
2) Impact Metric: In CVSS, the impact metric is used to evaluate different exploitation methods and capture the effects of successfully exploited vulnerabilities. This metric is determined using three factors: confidentiality (C), i.e., the effect on system information disclosure, integrity (In), i.e., how detrimental the modification of system data would be, and availability (A), i.e., the system accessibility after an adverse effect has occurred. During an attack, an adversary can cause high, low, or no impact in each specified factor. For our study, the impact metric is designed to capture the effect of different exploited vulnerabilities in the EPS. During an attack on a PV or a PQ bus, the system may experience varying degrees of impact related to total loss, some loss, or no loss of the confidentiality, integrity, and availability of certain grid-supporting devices. More specifically, if the attacker is able to attack a PV bus, we assume a worst-case scenario since the attacker has demonstrated enough information and skills to attack a highly secure system and possibly has the means to exploit additional vulnerabilities. This, in turn, may result in a total loss of integrity, confidentiality, and availability. Under this assumption, compromising a PV bus has high impact on confidentiality, integrity, and availability. On the other hand, despite existing research demonstrating the importance of load altering attacks on power system stability [38], manipulation of PQ buses and load change attacks will likely not interrupt the operation of a generator, load, or transmission line in the system due to frequency load shedding protections [39]. Under these circumstances, the impact of a compromised PQ bus will not be significant when compared to the impact of a compromised PV bus [40]. Thus, we assume that compromising PQ buses will have low impact in all three categories.
Finally, the no impact value is used when an attack compromises an electronic device that is not connected to any PV or PQ bus.
Based on the exploitability and impact metrics, the CVSS score is calculated following [21], where E, AV, AC, UI, PR, S, and I represent the exploitability metric, attack vector, attack complexity, user interaction, privileges required, scope, and impact metrics, respectively. The calculated CVSS value is a major factor in the computation of the TP within our transition model. The traditional CVSS scoring method provides a detailed calculation process that assesses the impact of exploiting a vulnerability with different attack vectors. However, it cannot be used directly for our application since it fails to consider important factors when used to evaluate complex cyberphysical systems. In particular, for power systems, it does not take into account features such as system topology, power generation, and line constraints. Since we assume the adversarial agent does not have full topological information, we include power generation and line constraint calculations in our proposed TP calculation. Since generators provide varying amounts of power to a system depending on the current state of the grid, the relative importance of a generator (and hence its attack impact) is determined by its power output. In addition to considering the difficulty of transitioning to certain system nodes, we also examine the overload percentage of the transmission lines. If the power flow across a line is near its thermal constraint, the line could be more easily affected by changes in the surrounding system. Taking each of the aforementioned aspects into consideration, we define the TP as in Eq. (6), where $G$ is the power generation of a connected generator, $n$ is the total number of generation units in the system, $P_f$ represents the power flow through transmission lines, and $\lambda_{critical}$ is the thermal constraint for the connected transmission line.
For a power system operating under normal conditions, $G / \sum_{k=1}^{n} G_k \in [0, 1]$ and $\lambda_{critical} \in [0, 1]$. Since the CVSS score lies in $[0, 10]$, we scale it by dividing by 10. A smaller TP value represents a cyberphysical node vulnerability of low severity, i.e., the node is less likely to be exploited by attackers since it has a lower CVSS score, and it is less important in terms of overload percentage, generation amount, and thermal limits. On the contrary, a larger TP value represents a cyberphysical node vulnerability of high severity.
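For reference, the standard CVSS v3.1 base-score computation from the public specification [21] can be sketched as follows. This is a minimal Python illustration and not the implementation used in this work: the weights are the unmodified specification values, whereas the adapted values of Table II may differ, and the names `cvss_base_score` and `BUS_IMPACT` are hypothetical. The bus-to-impact mapping follows the assumptions stated above (PV bus: high, PQ bus: low, unconnected device: none).

```python
# Standard CVSS v3.1 metric weights (public specification values, not Table II).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}   # attack vector
AC = {"L": 0.77, "H": 0.44}                          # attack complexity
UI = {"N": 0.85, "R": 0.62}                          # user interaction
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}     # privileges required, scope unchanged
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.50}       # privileges required, scope changed
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}               # confidentiality/integrity/availability

# Hypothetical mapping of the impact assumptions in the text to C/In/A levels.
BUS_IMPACT = {"PV": "H", "PQ": "L", "none": "N"}


def roundup(value):
    """Round up to one decimal place, as defined in the CVSS v3.1 specification."""
    scaled = int(round(value * 100000))
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (scaled // 10000 + 1) / 10.0


def cvss_base_score(av, ac, pr, ui, scope_changed, c, i, a):
    """Return the CVSS v3.1 base score in [0, 10]."""
    pr_weight = (PR_CHANGED if scope_changed else PR_UNCHANGED)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    iss = 1.0 - (1.0 - CIA[c]) * (1.0 - CIA[i]) * (1.0 - CIA[a])
    if scope_changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10.0))


def scaled_cvss(bus_type):
    """CVSS/10 for a network-reachable, low-complexity attack (assumed vector)."""
    level = BUS_IMPACT[bus_type]
    return cvss_base_score("N", "L", "N", "N", False, level, level, level) / 10.0
```

With these assumed exploitability settings, a PV bus scores higher than a PQ bus, and a device connected to neither scores zero, which matches the ordering of severities described in the text.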

D. Step 4: Solution of Adversarial Model
After formulating and defining the corresponding POMDP, in this step, we develop an algorithm to solve the model and yield the optimal transition policy for the considered threat. Due to the complexity of EPS, it is important to have a mechanism to solve sequential decision-making problems efficiently. In our studies, we develop a DQN-based DRL algorithm.
1) Q-Learning: Q-learning is an off-policy RL algorithm designed to find the optimal action an agent needs to take in the current state. All the actions that the RL agent can take are evaluated using a Q-value, which determines how good a particular action is in the current state. As shown in Eq. (7), using the learning rate α ∈ [0, 1], the long-term Q-value is updated using the current Q-value, the estimated optimal future value, and the immediate reward. γ ∈ [0, 1] represents a discount factor that determines the importance of immediate rewards compared with potential long-term rewards. A higher Q-value indicates that a series of actions will produce a higher total accumulated reward. The sequence of actions that maximizes this reward is referred to as the optimal policy.
Traditionally, Q-learning is implemented using Q-tables. However, this approach is neither practical nor scalable for large state-action environments. To solve this issue, researchers in [41] proposed the replacement of Q-tables with deep neural networks, also known as DQNs.
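For a small graph, the tabular update described above can be sketched in a few lines. The snippet below is a minimal Python illustration of the standard Q-learning rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s′, a′) − Q(s, a)]; it is an assumption that Eq. (7) takes this textbook form, and the node names, rewards, and the helper `q_update` are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Estimated optimal future value; 0.0 if the next state has no outgoing actions.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Q-table keyed by (state, action); unseen pairs default to 0.0.
q_table = defaultdict(float)

# Hypothetical transitions between cyber nodes A -> K -> J.
first = q_update(q_table, "A", "K", reward=1.0, next_state="K",
                 next_actions=["J"], alpha=0.5, gamma=0.9)
second = q_update(q_table, "K", "J", reward=2.0, next_state="J",
                  next_actions=[], alpha=0.5, gamma=0.9)
# Revisiting A -> K now propagates the value learned for K -> J.
third = q_update(q_table, "A", "K", reward=1.0, next_state="K",
                 next_actions=["J"], alpha=0.5, gamma=0.9)
```

The table grows with every visited state-action pair, which is the scalability limitation that motivates replacing it with a neural network.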
2) Deep Q-Network (DQN): In order to address the computational overhead of Q-learning when dealing with large, uncertain, and dynamic environments, DQNs approximate the Q-value function using artificial neural networks rather than storing every solution in a table. For our application environment modeled as a POMDP, we assume the DQN agent starts in a random initial state s (a node in the cyberphysical network) and transitions to the next state s′ until it reaches both nodes of the contingency pair, regardless of transition order. At every step, the agent makes an observation in order to obtain all possible transitions to the next state. Fig. 4 presents an overview of this process. For example, if the current state is node A, the attacker, through an observation o, could obtain the potential transitions to the next state, which can be one of K, J, or M. As shown in Eq.
Once each potential transition is determined, the TP for each transition needs to be computed (as defined in Eq. (6)). These results are used to determine the security index, $SI_i$, when making a transition from s to s′ as shown in Eq. (9), where γ is a discount factor and $\Delta C_p$ is the line overload difference between the current state and each potential transition state. Finally, the maximum value of the security index, which represents the node with the highest vulnerability score and overload value, is used to compute the corresponding reward function. As shown in Eq. (10), the reward function considers the overall benefit of different transitions, as it takes into account the security index of each potential state, $SI_i$.
State-action-reward tuples are stored in the replay memory set M to record the agent's experiences. This memory set assists in independently training the neural network. The network weights and biases, which encode what the agent has learned about the environment, are stored in the action-value parameter θ. In each step, the DQN combines multi-layered neural networks with the existing Q-learning algorithm to approximate Q(s, a; θ). The target action-value parameter θ⁻ changes as a result of changes in θ. Eq. (11) gives the updated target value for the current state and action, where θ⁻ is equal to θ at the beginning of the iterations. When a preset number of iterations is reached during training, θ⁻ is updated to prevent an obstructed learning process [42]. Using the parameters described, the loss-function value can be calculated as shown in Eq. (12) for each state-action pair. It represents the error between the predicted Q-value and the target Q-value. The goal is to determine an optimal policy that minimizes this error and ensures that the training result is as close as possible to the target value, where the target value is the estimated expected return of the actions taken by the DQN.
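The roles of the replay memory M, the target parameter θ⁻, and the loss can be illustrated with a short sketch. The Python code below assumes the textbook DQN forms, i.e., a target y = r + γ max Q(s′, a′; θ⁻) and a mean squared error between y and Q(s, a; θ); whether Eqs. (11) and (12) match these forms exactly is an assumption, and all class and function names are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a random mini-batch to decorrelate consecutive experiences."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, gamma, done):
    """Target value computed from Q-values predicted with the frozen theta^-."""
    if done or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def mean_squared_loss(targets, predictions):
    """Average squared error between target and predicted Q-values."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def maybe_sync_target(step, sync_every, theta, theta_target):
    """Copy theta into theta^- every `sync_every` steps; otherwise keep it frozen."""
    if step % sync_every == 0:
        return dict(theta)
    return theta_target
```

Keeping θ⁻ frozen between synchronizations gives the network a stable target to regress toward, which is the usual motivation for the delayed update.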
The agent performs an action selected according to the exploration-exploitation (ε-greedy) strategy of Eq. (13). This strategy controls the degree of exploitation over exploration. At each step, with probability ε, the algorithm explores by selecting a random action $a_t$ from the action set; with probability 1 − ε, it exploits by taking the action with the maximum Q-value. The target parameters θ⁻ are only updated once the desired number of iterations has been reached [42]. The overall learning process is presented in Fig. 5. This learning process is repeated until a terminal state is reached, i.e., both nodes of the contingency pair have been "visited" by the agent.

$$a_t = \begin{cases} \text{random } a, & \text{with probability } \epsilon \\ \arg\max_{a'} Q(s', a'; \theta), & \text{with probability } 1 - \epsilon \end{cases} \qquad (13)$$
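The ε-greedy selection can be sketched as follows. This is a minimal Python illustration; the function name `epsilon_greedy`, the candidate nodes K, J, and M, and their Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from `q_values`, a dict mapping action -> estimated Q-value.

    With probability epsilon a uniformly random action is returned (exploration);
    otherwise the action with the largest Q-value is returned (exploitation).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)


# Hypothetical Q-values for the transitions observed from node A.
candidate_q = {"K": 0.2, "J": 0.9, "M": 0.5}
greedy_choice = epsilon_greedy(candidate_q, epsilon=0.0)    # always exploits
explore_choice = epsilon_greedy(candidate_q, epsilon=1.0)   # always explores
```

In practice ε is often decayed during training so that the agent explores early and exploits later; whether such a schedule is used here depends on the hyperparameters of Table VI.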

E. Step 5: Output of the Assessment Process
An attacker with sufficient OSINT can aggregate enough power system information (e.g., power generation, capacity, load consumption, topological data, etc.) to perform contingency analysis and identify critical system elements. These identified critical contingency elements can be leveraged to generate cyberattack transition policies following the process described in previous steps. The generated cyberattack transition policies take into account vulnerabilities in electronic devices that exist in the cyber network layer as well as physical system vulnerabilities related to contingency studies. The DRL algorithm, DQN, provides a solution known as the optimal attack transition policy that can be used to attack the devices controlling the operations of the critical elements (e.g., microprocessor-based relays controlling circuit breakers, protocol translator converters, etc.) and thus result in potential power outages in the EPS. Our methodology can also be leveraged by control center operators and stakeholders to identify vulnerable components in the EPS or investigate potential attack strategies.

III. CONTINGENCY ANALYSIS SIMULATIONS
In this section, we introduce a number of contingency simulation case studies used to demonstrate the effectiveness of the proposed approach. These case scenarios show how the most critical contingency pairs of a system vary when wind energy systems are in place. We provide an analysis of the varying degrees of severity under different contingency scenarios and examine how wind generation impacts critical contingencies. For this validation study, we use a doubly-fed induction generator (DFIG) model for wind power generation modeling and digital real-time simulation (OPAL-RT) for testing the system in a real-time environment.

A. Contingency Scenarios
First, we run the assessment process of Section II up to Step 2 in order to assess multiple contingency scenarios in different test systems. In Table III, we present the number of critical contingencies for N − 1, N − 1 − 1, and N − 2 scenarios in different power system test cases. For example, the IEEE 39 bus system has 13 N − 1, 19 N − 1 − 1, and 71 critical N − 2 contingencies without any wind power injection, while the number of these contingencies varies with different wind penetration levels. The N − 1 contingencies are determined by disconnecting each line and observing system responses. For N − 1 − 1, the most severe N − 1 case is removed from the system, and the process is run again. The N − 2 pruning algorithm is carried out as described in Section II. In the rest of the section, we focus on the N − 2 case as the most severe scenario. It should be noted that the proposed approach can be adapted, based on user requirements, for any number of contingencies k.
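The enumeration behind N − k screening can be sketched as below. This minimal Python example lists every simultaneous outage of k lines and keeps those flagged by a user-supplied criterion. As a stand-in criterion it uses network islanding, which is an assumption made only to keep the example self-contained: the assessment process itself relies on power flow results and the pruning algorithm of Section II. The 4-bus network and all function names are hypothetical.

```python
from itertools import combinations


def is_connected(buses, lines):
    """Return True if every bus is reachable from the first bus."""
    if not buses:
        return True
    adjacency = {bus: set() for bus in buses}
    for u, v in lines:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {buses[0]}, [buses[0]]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(buses)


def critical_outages(buses, lines, k, violates):
    """Enumerate all outages of k lines and keep those for which `violates` is True."""
    critical = []
    for outage in combinations(lines, k):
        remaining = [line for line in lines if line not in outage]
        if violates(buses, remaining):
            critical.append(outage)
    return critical


def islanding(buses, remaining_lines):
    """Toy severity criterion: the outage splits the network into islands."""
    return not is_connected(buses, remaining_lines)


# Hypothetical 4-bus ring with one diagonal line.
buses = [1, 2, 3, 4]
lines = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
n_minus_1 = critical_outages(buses, lines, 1, islanding)
n_minus_2 = critical_outages(buses, lines, 2, islanding)
```

Exhaustive enumeration grows combinatorially with k and the number of lines, which is why a pruning step is needed for large systems such as the Polish 2383 bus case.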

B. Wind Power Generation Modeling using a DFIG Model
A DFIG model consists of a wound rotor induction generator driven by wind turbines and an AC/DC/AC insulated-gate bipolar transistor based pulse width modulated converter. The DFIG model used in our case studies for modeling wind energy systems is developed in MATLAB/Simulink. Using this model, we are able to study the dynamic response of EPS to wind speed variations and investigate the impact of different wind penetration levels. Three DFIGs are modeled and integrated into the IEEE 39 bus system at buses 5, 21, and 26 (Fig. 6) [43]. The wind speed and wind power data for each wind system are collected at one-minute resolution on May 14, 2020 (1440 mins = 24 hrs) from [44]. In the rest of the paper, we investigate two scenarios of wind integrated power systems. In scenario A, the wind data is collected from three locations in Tallahassee, FL, with similar variation and power generation levels. The wind speed and corresponding wind power generation are provided in Fig. 7(a) and Fig. 7(b), respectively. In scenario B, the wind speed and power data are obtained from three locations with different weather characteristics: Boston, MA (Wind 1), Dallas, TX (Wind 2), and Tiffin, OH (Wind 3). For this scenario, the wind speed and power generation are shown in Fig. 8(a) and Fig. 8(b), respectively.

C. Contingency Scenarios with Wind Power Injection
The amount of power produced by wind energy systems fluctuates due to wind's intermittent nature. As the generation changes, power flow varies, which may affect contingency analysis results. Therefore, we simulate wind power injection levels at eight distinct timestamps, for the two simulation scenarios (scenario A and scenario B) throughout one day and observe the changes in reported contingencies with different wind penetration. These tests are performed for the IEEE 39 bus system (Fig. 6). As shown in Table IV and Table V, seven wind power integration simulation cases (SC1-SC7) are simulated for scenario A and scenario B. For each case, we present the amount of power injected by the three DFIG-based wind farms (WF1, WF2, WF3) and the number of identified N − 2 contingencies.
For scenario A in Table IV, the highest number of N − 2 contingency pairs (100) occurs when WF2 and WF3 are integrated into the system (SC5) with generation of 159.20 MW and 155.70 MW, respectively. The fewest pairs occur when WF1 and WF2 are injecting power into the system (SC4), with wind power injections of 77.52 MW and 54.52 MW, respectively. As shown in the results, the number of N − 2 contingencies changes when the same amount of power is injected at different locations. Additionally, injecting varying levels of power at the same location also changes the number of contingencies. As for scenario B in Table V, the highest number of N − 2 contingency pairs (103) occurs when WF2 and WF3 are integrated into the system (SC5), with wind power injections of 136.80 MW and 187.30 MW, respectively. The fewest pairs occur when only WF1 is injecting 126.30 MW into the system (SC1).
Compared with the normal case of the IEEE 39 bus system without wind power injections (71 pairs of N − 2 contingencies in Table III), the number of N − 2 contingencies exceeds 71 in 38 of the 56 cases of scenario A (Table IV). For scenario B, 44 of the 56 cases (Table V) exceed 71. These results demonstrate how the intermittent behavior of wind energy directly affects the number and location of contingencies in EPS with high penetration of RES. A more specific case that shows how the intermittent behavior of wind can alter the number of contingencies can be observed in Table III. The number of N − 2 contingencies can increase or decrease when compared with the case of no wind injection. One scenario that results in a lower number of N − 2 contingencies is SC7 at t = 1400, where the number of N − 2 contingencies decreases from the original 71 to 67, thus making the EPS more secure under contingency conditions. A counterexample can be observed in SC7 at t = 0, where the number of N − 2 contingencies increases from 71 to 84.

D. Digital Real-time Simulation of IEEE 39 Bus System
We further examine the effect of contingency scenarios in a digital real-time simulation environment. We observe the impact of intermittent wind power injections across the IEEE 39 bus system by analyzing the variability of all bus voltages in the system. At t = 0.3s, an N − 2 contingency event is triggered by simultaneously disconnecting two three-phase circuit breakers. To understand the severity of losing critical elements, we disconnect the most critical pair (lines 5 − 8 and 6 − 7) from the N − 2 contingency set of the IEEE 39 bus system. Fig. 9(a) presents the effect that disconnecting lines 5 − 8 and 6 − 7 has on the test system without any wind connected. Fig. 9(b) demonstrates the same contingency scenario (disconnection of lines 5 − 8 and 6 − 7) with wind power being injected into the system; a further case is shown in Fig. 9(c).
In order to understand the effect that different N − 2 contingency pairs may have on the EPS, we perform studies using different N − 2 pairs present in the contingency set. For these studies, we disconnect a less critical contingency pair from the N − 2 contingency set. At t = 0.3s, circuit breakers are tripped at lines 10 − 13 and 16 − 21 in a test case system without any wind power penetration, and the respective voltage variations can be observed in Fig. 9(d). Fig. 9(e) depicts how the voltage variations change when wind penetration (Table IV: SC7 t = 800m) is considered under the same contingency scenario. An additional wind case from Table V, with the same disconnected lines (10 − 13 and 16 − 21), is shown in Fig. 9(f). Besides the voltage variations that different contingency pairs can produce, we can also observe, in some cases, how the intermittent behavior of wind power helps to mitigate the severity of line overloads. Figs. 9(d)-9(f) demonstrate this behavior. For instance, in Fig. 9(e), most buses of the power system have voltage measurements that are closer to the nominal 1.0 p.u. value. On the other hand, the case in Fig. 9(f) shows the opposite, since the voltage values measured at some buses are farther from 1.0 p.u. when compared with the case where no wind power injection is included, i.e., Fig. 9(d). Our results demonstrate how important it is to coordinate the amount of wind power as it penetrates the system. For example, the authors in [17] proposed a scheme for power systems to maintain N − 1 security within different levels of wind power injection. In addition, a dynamic reserve allocation of DFIG wind farms is presented in [45] to sustain system frequency stability. In our case, the results not only demonstrate the variation in the number of N − 2 contingencies but also show how these results can be used to control the penetration level of wind farms to increase the N − 2 secure operational range of power systems.

IV. RESULTS: THE EFFECTIVENESS OF THE PROPOSED CYBERSECURITY ASSESSMENT
This section presents the experimental results that demonstrate the effectiveness of the proposed cybersecurity assessment approach. We evaluate the efficacy of the process according to the optimal attack transition policies given as outputs. We first provide the experimental setup for the presented test cases, the DQN agent model implementation details, and its corresponding hyperparameters. Six test case systems are used to demonstrate the number of transitions needed to identify the optimal attack path for the corresponding case. Furthermore, the performance of the DQN model is evaluated according to the obtained rewards and losses, i.e., convergence for each test case. Finally, the effectiveness of the DQN, used to solve the transition model, is verified by comparing it to other transition-path policy-finding methods, specifically: (i) random policy search, (ii) depth-first search (DFS), (iii) Dijkstra's shortest path algorithm, and (iv) an IVSS-based DQN model.

A. Experimental Setup and DQN Hyperparameters
The RL DQN model is trained and tested on a 64-bit machine with an Intel Core i7-7600U, 2.8GHz, and 16.00GB of memory. The proposed algorithm is implemented in Julia, a high-level, high-performance, dynamic programming language. The DQN solver for POMDP is provided in [46]. The source files and models associated with this work can be found at [47]. The DQN hyperparameters are presented in Table VI.

B. Cybersecurity Assessment: Attack-Path Transition Results
In order to demonstrate the efficacy of the proposed cybersecurity assessment process, we use six test case power systems related to the contingency studies in Section III (Table III), including (c) the IEEE 39 bus system with wind (Table IV SC7 at t = 800m), (d) the IEEE 39 bus system with wind W2 (Table V SC5 at t = 0m), (e) the UIUC 150 bus system, and (f) the Polish 2383 bus system. Based on the identified critical N − 2 pairs, the malicious agent begins at a random initial state and finds the optimal attack-path transition policy to the existing and most critical N − 2 contingencies. A contingency is identified when one of the two buses has been visited by the agent. In Table VII, we show the number of transitions required to reach both critical contingencies as well as the number of PV and PQ buses visited by the agent. For each comparison, five random initial states are selected for each test system, and the average results are presented. For example, the IEEE 39 bus system requires an average of 8.8 transitions to correctly identify the most critical contingency pair. During the transitions, an average of 0.4 generation (PV) and 4.6 load (PQ) buses need to be visited, i.e., compromised, by the agent. $T_{Tr}$ is the training and evaluation time (in seconds) needed for the DQN to 'learn' the optimal attack path for different cases, and $T_{To}$ is the total time (in seconds) required to complete the process. The use of the Polish 2383 bus system in our experiments aids in the evaluation of the proposed process on a realistic large-scale EPS. As seen in Table VII, the process can be used in tandem with medium- and long-term control and planning applications. On the other hand, the proposed approach would require high computing power in order to be integrated into very short-term decision-making processes [48].

C. DQN Rewards and Loss Convergence
As mentioned in Section II, the DQN aims to minimize the loss between the target value and the predicted value. The DQN agent learns the optimal policy as this loss is minimized. Here, we verify and evaluate the performance of our proposed approach by examining the convergence of the DQN loss during the training process. We also show how the average reward gradually increases at each step, for each test case, up to 250 training steps. It should be noted that the total number of training steps used is 500, while the update frequency of the plot is set to 2; thus, only 250 steps can be observed in the graph. Fig. 10 shows the rewards for each test case system, and Fig. 11 shows the corresponding loss for each case. As shown in Fig. 10, the DQN agent progressively 'learns' how to maximize the cumulative rewards in each test case system. At the same time, as the agent 'learns', the loss keeps decreasing until it converges to a minimum value, as depicted in Fig. 11. These results showcase the training process of the DQN agent and its performance on all test case systems.

D. Effectiveness of DQN: Comparison with other Transition Techniques
The effectiveness of using a DQN model in our cybersecurity assessment process is demonstrated by comparing our DQN agent based on the CVSS scoring system with different techniques that could be used to find the optimal attack transition policy in a graph. The techniques used to compare the performance of the proposed DQN are: (i) random policy search, (ii) DFS, (iii) Dijkstra's shortest path algorithm, and (iv) an IVSS-based DQN model. The random transition technique provides a baseline, or naive case, where transitions are performed in a random manner, i.e., without any intelligent control mechanisms. DFS is a search technique that traverses a tree structure by starting from an arbitrary root node and exploring each branch as far as possible before backtracking and continuing with the next branch. Dijkstra's algorithm is a more sophisticated way of finding an optimal path through a graph structure. It solves shortest-path problems in graphs with non-negative weights by finding an acyclic path between a source and a target node with the minimum transition cost. Both DFS and Dijkstra's search policies need full observability of the network; hence, for testing purposes in those two cases, we assume full observability of the system and its corresponding contingency pair. Finally, the IVSS-based DQN model is designed to evaluate the differences between the CVSS and IVSS vulnerability assessment criteria.
The tests are run using the power system test cases presented in Table VII. For each case, five random initial states are selected and the average number of transitions is calculated. The maximum, minimum, and average number of transitions for each case are shown in the box plots presented in Fig. 12. From Figs. 12(a)-12(f), we can observe that, in general, the DQN-based transition techniques tend to require fewer transitions, i.e., are more efficient, when compared with the random and the DFS transition techniques. When compared with Dijkstra's algorithm, our DQN implementation performs slightly worse due to its iterative learning process. However, Dijkstra's shortest path algorithm has the major disadvantage of requiring full system observability. The results demonstrate the advantages of using DQN as the main solver technique for our proposed process. Finally, it can also be observed from Fig. 12 that using CVSS v3.1 has major advantages when compared to the IVSS scoring system. The CVSS-based DQN consistently requires fewer transitions in all evaluated test cases.
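To make the full-observability baseline concrete, Dijkstra's algorithm on a small weighted graph can be sketched as follows. This minimal Python example is illustrative only: the node names, the edge costs, and the function `dijkstra_path` are hypothetical, and the edge weights used for the Dijkstra baseline in the experiments are not specified here.

```python
import heapq


def dijkstra_path(graph, source, target):
    """Return (cost, path) of a minimum-cost path in a graph with non-negative weights.

    `graph` maps each node to a dict of {neighbor: transition cost}. The whole
    graph must be known in advance, which is the full-observability requirement.
    """
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical cyber-layer graph; T stands for a node of the contingency pair.
graph = {
    "A": {"K": 1.0, "J": 4.0, "M": 2.0},
    "K": {"J": 1.0},
    "J": {"T": 1.0},
    "M": {"T": 5.0},
    "T": {},
}
cost, path = dijkstra_path(graph, "A", "T")
transitions = len(path) - 1  # number of transitions along the selected path
```

Unlike this baseline, the DQN agent only observes the transitions available from its current node, so it trades a few additional transitions for not needing the complete graph.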

V. CONCLUSIONS
In this paper, we present a cybersecurity assessment approach designed to assess the cyberphysical security of EPS with high penetration of wind. The proposed process uses OSINT and contingency analysis results to identify exploitable cyberphysical vulnerabilities and generate an optimal attack transition policy, from an adversary perspective, that can be potentially leveraged to cause major outages in an EPS. The results provided by the proposed process are critical to improving cybersecurity visibility for system operators and stakeholders: they provide information regarding the most critical attack path an adversary must follow to severely compromise the system, along with information about the most vulnerable elements in the EPS at a particular time. The proposed approach is tested using digital real-time simulation, realistic data from various actual wind energy systems, and various test case power systems. Additionally, results regarding the training and convergence of the DQN agent, proposed as the main optimal attack-path transition technique, are presented and compared with other competing techniques. These results demonstrate the applicability of the cybersecurity assessment approach in modern EPS.