Cyber Restoration of Power Systems: Concept and Methodology for Resilient Observability

For a cyber-physical power system to function properly, the operational data need to be properly measured, transmitted, and processed. In case of a malicious event on the cyber layer of the power system, such as the wide-area monitoring system, cyber components such as phasor measurement units (PMUs), communication routers, and phasor data concentrators (PDCs) may be compromised, leading to an unobservable power system. This article proposes the concept of cyber restoration of power systems, and an optimal restoration scheme to recover the system observability swiftly after massive interruptions. The cyber restoration problem is formulated as a mixed integer linear programming (MILP) problem considering PMU measurability, communication network connectivity, and PDC processability conditions, as well as cyber restoration resources as constraints. Results in the IEEE 57-bus system validate that the proposed optimization method can provide solutions that recover system observability much faster than heuristic methods, demonstrating the need for systematic cyber restoration planning research and implementation.


I. INTRODUCTION
CYBER-PHYSICAL power systems are the augmentation of conventional electricity delivery infrastructure with information infrastructure for sensing, communication, control, and computation [1], [2]. They provide unparalleled opportunities for reliable, economical, and sustainable grid operation and energy delivery but also create a large vulnerable surface for potential cyber threats [3], [4]. The 2015 Ukraine Blackout was the first confirmed cyber attack event on a power system, impacting a number of components and resulting in a major power outage in the region [5]. As an emerging topic, the cyber-physical resilience of power systems has attracted wide attention from the technical community in recent years (see [6], [7], [8] for a detailed review).
The restoration of power systems from disasters has long been an essential task and an active topic, where research efforts date all the way back to the 1940s [9]. The significance and methodologies of physical restoration of power systems have been extensively investigated by researchers [10], [11]. For cyber-physical power systems, however, besides outages in the physical infrastructure, failures of cyber infrastructure may also occur due to malicious attack [12], [13], [14]. Without proper functioning of cyber infrastructure, effective operation of complex modern power systems is impossible. Therefore, the cyber restoration of power systems from critical conditions will soon become as important a problem as the conventional physical restoration problem.
As shown in the literature, various types of cyber attacks may be launched to interfere with power system operation, such as false data injection (FDI) attacks, denial-of-service (DoS) attacks, and man-in-the-middle attacks [15], [16], [17]. So far, research in this area has been focused on three aspects: 1) attack modeling; 2) impact analysis; and 3) defense development. For attack modeling, viable attack paths have been identified, and optimal attack strategies inflicting maximum damage to the system have been developed (e.g., [18] and [19]). For impact analysis, the adverse effects of different attack models on system operation have been quantified and validated (e.g., [4] and [16]). For defense development, various approaches for prevention or detection of the occurrence and spread of the attacks have been proposed (e.g., [14], [20], and [21]). Although the modeling and defense measures before and during cyber attacks have been extensively investigated, little effort has been made on how to restore the functionality of the cyber infrastructure after the attacks have actually taken place and the damages have been inflicted. In fact, prompt recovery of system functionality is one of the core aspects of resilience [6], [7], [8], and deserves much more attention. Recently, Duan and Dinavahi [22] proposed a hybrid fast path recovery algorithm to recover from a single communication link failure in cyber-physical power systems. However, it is concerned with small disturbances and is unsuitable for restoration following a severe disaster. Liu et al. [23] developed a survivability-aware concurrent rerouting restoration strategy for the communication network of cyber-physical power systems under large-scale network failures. In [24], an optimal recovery strategy of components is formulated for maximizing the resilience of the cyber-physical power system. However, [22], [23], [24] are developed mainly from the communication network standpoint and do not sufficiently incorporate the specific needs of power system monitoring and control. Qu et al. [25] developed a software-defined network management scheme for recovering power system observability. However, it only takes into account compromised data processors and does not consider other types of component failures following a severe disaster.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we will fill the aforementioned gap by formally proposing the concept of cyber restoration of power systems and developing approaches for planning the restoration procedure. We envision a fundamental analogy between cyber and physical restorations of power systems.
1) The physical restoration of power systems aims to recover the energy delivery capabilities of the physical infrastructure after energy blackouts take place.
2) The cyber restoration of power systems aims to recover the information delivery capabilities of the cyber infrastructure after information blackouts take place.
In cyber-physical power systems, the information delivery capabilities of the cyber infrastructure are mainly reflected in two aspects: 1) observability and 2) controllability of the power grid. In this article, we will investigate the restoration strategy of cyber components for recovering system observability after major interruptions. Observability analysis is a crucial task in power systems for estimating the real-time operational state of the system [26], [27], [28]. Key components for enabling observability include sensors, communication networks, and data processors. During a major disturbance such as a cyber attack, any of these components may be compromised, such that they either stop responding or produce unreliable data that cannot be trusted and utilized. Limited resources, such as information technology (IT) crews, need to be dispatched in the most efficient way for quick restoration of the key components that can bring back system observability. In this article, we will take the cyber network of phasor measurement units (PMUs) as an example for development and demonstration of the general concepts and methodologies for cyber restoration of power systems. The significance of restoration decision-making for PMUs has been demonstrated in our recent preliminary work [29]. In this article, a much more comprehensive framework for system observability recovery, which covers the entire cyber network including PMUs, routers, and phasor data concentrators (PDCs), will be presented and demonstrated. The contributions of this article are summarized as follows.
1) The fundamental concepts regarding cyber restoration planning and observability recovery are formally established and illustrated. It is proposed that the cyber restoration process should maximize the integral of the system observability over time constrained on limited restoration resources available to system operators.
2) Cyber restoration planning is mathematically formulated as a mixed integer linear programming (MILP) problem, whose solution provides the optimal sequence of restoration actions for retrieving system observability after an information blackout.
3) The proposed optimization method is shown to significantly outperform rules of thumb, justifying the importance of strategic cyber restoration planning studies.
The problem formulated in this article differs from the well-researched information system restoration problem in that it jointly considers the network properties of the cyber system and the network properties of the physical system. It integrates the observability analysis of power grids with the routing analysis of communication networks, and thus uses the network properties of both systems in a coordinated manner. To achieve the most beneficial restoration strategy for the situational awareness of power systems, the cyber restoration planning problem needs to incorporate the observability effect of PMUs based on the power grid topology, whereas the conventional information system restoration treats end devices (i.e., PMUs in our case) independently and does not consider how they are related in the underlying physical system with a nested structure.
Note that the proposed work advances significantly beyond our preliminary work [29]. In [29], the optimal PMU restoration strategy after a major cyber attack on PMUs was proposed based on the assumption that the communication network and the PDCs were ideal, i.e., when a PMU was restored, it would be automatically connected to a functioning PDC. Therefore, the restoration of the communication network devices to form data transfer paths was completely ignored. In contrast, in this article, it is assumed that not only the PMUs, but also the routers and the PDCs may be compromised due to cyber attacks and need to be restored. Furthermore, the PMU restoration strategy in [29] was solely based on the observability analysis determined by the power grid topology, whereas the cyber device (PMU, router, and PDC) restoration strategy proposed in this article is based on both the observability analysis determined by the power grid topology and the routing analysis determined by the communication network topology. The coordinated analysis and optimization framework across the two different networks (power and communication) constitutes one of the core contributions of this article. It is also worth noting that, because this article models the entire cyber network nesting all PMUs and PDCs, we are able to develop the generic concept of cyber restoration of power systems in analogy to the well-known physical restoration of power systems. Because [29] simply considered PMUs as individual sensors to be restored without modeling the cyber network behind them, such a high-level analogous concept could not be developed there.
It should be noted that this article focuses only on the post-attack restoration planning problem, with the assumption that the attack has been detected and the compromised devices have been identified. Numerous works have been dedicated to the cyber attack detection problem (see, for example, [21], [30], [31], [32], [33]), so it is outside the scope of this article.
The rest of this article is organized as follows. Section II establishes the generic concept of cyber restoration of power systems, and compares it with the well-known physical restoration of power systems. Section III describes the structure of the PMU network, and introduces the three fundamental conditions for power system observability. Section IV exemplifies the cyber restoration methodology by formulating the observability recovery problem in the PMU network as an MILP problem, and presents a scalable avenue for handling large-scale systems. Comparative simulation results are presented in Section V to demonstrate the proposed concept and methodology. Section VI concludes this article.

II. CONCEPTUAL ANALOGY BETWEEN PHYSICAL RESTORATION AND CYBER RESTORATION
The physical restoration of power systems is already a well-researched problem. The cyber restoration of the power system can be conceptualized in parallel with the physical restoration of power systems. In this section, the conceptual analogy between the physical restoration and the cyber restoration of power systems is presented in the following two aspects.

A. Objective of the Restoration Problem
When a system experiences a major disturbance, the level of functionality of the system degrades, and actions must be taken to regain the normal level of functionality after the disturbance. The process of recovering the functionality level is termed the restoration process. The change of the functionality level of a system experiencing a major disturbance over time is depicted in Fig. 1. The intervals before, during, and after the disturbance are marked as Stage I, Stage II, and Stage III, respectively. The restoration process takes place in Stage III, when the functionality level of the system needs to be recovered after the clearance of the disturbance. The objective of the restoration process is to maximize the area enclosed below the curve (shaded), i.e., to maximize the integral of functionality over time. Suppose that the restoration process starts at time $t_2$ and ends at time $t_3$. If $f(\cdot)$ represents the available level of functionality of the system, then the objective of the restoration process is to maximize the integral of this function from time $t_2$ to time $t_3$, which can be mathematically represented as

$$\underset{a(t)}{\text{maximize}} \quad \int_{t_2}^{t_3} f\big(a(t)\big)\, dt \tag{1}$$

where $a(t)$ denotes the restoration actions. Both the physical restoration and the cyber restoration should follow the same concept described above, but with different definitions of the level of functionality. For physical restoration, the level of functionality represents the energy delivery capability of the system, translated into the amount of power generation capacity or load supplied [34], [35]. For cyber restoration, the level of functionality should represent the information delivery capability of the system, translated into the level of observability or controllability of the power system.

B. Restoration Strategy
Besides the objective, there is also an analogy between physical restoration and cyber restoration regarding the system structure and restoration strategy, as illustrated in Fig. 2.
The physical power system generally consists of three parts: generators, power grid, and loads. To restore the physical system after an energy blackout, generators need to be restored first so that they can supply energy to loads [34]. Afterwards, energy transmission paths need to be found between generators and loads so that generators can be connected to the loads [36]. For this purpose, tree or forest structures need to be formed so that loads (leaves) can be connected to generators (roots) and, thus, the loads can be restored [35].
Similarly, the cyber system generally consists of three parts: central data processors, communication network, and sensors/controllers. To restore the cyber system after an information blackout, the central data processors need to be restored first to provide data services to sensors/controllers. Afterwards, information transmission paths need to be found between central data processors and sensors/controllers for effective data transmission. For this purpose, tree or forest structures need to be formed so that sensors/controllers (leaves) can be connected to central data processors (roots) and, thus, the sensors/controllers can become usable.
In the rest of this article, we will exemplify the generic cyber restoration concept by discussing the restoration of the PMU network, aiming at recovering the observability of power grids after major disturbances.

A. System Description
The PMU network can be considered as a special case of the general cyber network shown in Fig. 2(b). In the PMU network, the sensors shown in the cyber network are the PMUs, and the central data processors are the PDCs. Although each PMU typically sends its data to a designated PDC, it is also feasible to route its data to other PDCs, which increases reliability when the primary PDC is at fault [37]. The data sent to the PDC will then be used by advanced applications such as state estimation [38] for system-wide observability.
Fig. 2. Analogy between the structures and strategies of (a) physical restoration and (b) cyber restoration.
Consider a PMU-measured cyber-physical power system consisting of n buses, v PMUs, w routers, and m PDCs, and suppose that full system observability can be recovered after β restoration steps. The power grid topology can be represented by a graph $G_P = (V_P, E_P)$, with a set of vertices (buses) $V_P = \{1, \ldots, i, \ldots, n\}$ and a set of bidirectional edges (power branches) $E_P \subseteq V_P \times V_P$. Similarly, the graph for the communication network topology can be written as $G_C = (V_C, E_C)$, with a set of nodes (routers) $V_C$ and a set of communication links $E_C$. A subset of $V_C$ is defined as the set of nodes in the communication network that have a PDC installed. Define $A = \{(i, j)\} \subseteq V_P \times V_C$ as a set of pairs, where each pair $(i, j)$ represents a power grid bus and a communication network router that are co-located (i.e., in a telecommunicated power substation). The locations of the PMUs can be denoted as a set $B_{PMU} \subseteq A$, which means that the PMUs can only be placed at a telecommunicated power substation.
As defined in the literature, a fully/partially observable power system refers to a system whose voltage phasors at all/some buses can be uniquely determined [38]. Next, the key conditions for observability will be reviewed.

B. PMU Measurability
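In general, by installing a PMU at a bus, measurements of the voltage phasor at that bus and of the current phasors incident to that bus can be obtained. Based on the current phasors and line models, the voltage phasors at the neighboring buses can be obtained. If a bus is connected to neither a generator nor a load, then the bus is called a zero-injection bus. The general measurability conditions for the voltage phasor(s) of a bus (or a set of buses), without and with considering zero-injection buses, are defined as follows.

Condition 1.a (PMU Measurability, Without Zero-Injection Buses):
A properly functioning PMU is installed at the bus or at one of its neighboring buses.

Condition 1.b (PMU Measurability, With Zero-Injection Buses):
Among a zero-injection bus and all of its neighboring buses, at most one of them does not satisfy Condition 1.a.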
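As a minimal sketch of the measurability rules described above, the following Python function marks a bus as measurable when a functioning PMU sits at the bus or at a neighbor, and then applies the zero-injection rule in a single pass. The function name, the adjacency-dictionary representation, and the 5-bus example are assumptions made for illustration; they do not come from the article.

```python
from typing import Dict, Iterable, Set


def measurable_buses(adjacency: Dict[int, Set[int]],
                     pmu_buses: Iterable[int],
                     zero_injection: Iterable[int] = ()) -> Set[int]:
    """Return the buses whose voltage phasor can be obtained.

    adjacency maps each bus to the set of its neighboring buses
    (hypothetical representation of the power grid graph).
    """
    # Rule without zero-injection buses: a bus with a functioning PMU
    # and all of its neighbors are measurable.
    direct = set()
    for bus in pmu_buses:
        direct.add(bus)
        direct |= adjacency[bus]

    # Zero-injection rule, evaluated once against the set above:
    # if exactly one bus among a zero-injection bus and its neighbors
    # is not yet measurable, it becomes measurable as well.
    measured = set(direct)
    for z in zero_injection:
        group = {z} | adjacency[z]
        missing = group - direct
        if len(missing) == 1:
            measured |= missing
    return measured


# Hypothetical 5-bus chain 1-2-3-4-5 with a single PMU at bus 2.
grid = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
without_zi = measurable_buses(grid, pmu_buses=[2])
with_zi = measurable_buses(grid, pmu_buses=[2], zero_injection=[3])
```

In this example the PMU at bus 2 covers buses 1, 2, and 3; treating bus 3 as a zero-injection bus additionally makes bus 4 measurable, because bus 4 is the only unmeasured bus in the group formed by bus 3 and its neighbors.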

C. Communication Network Connectivity
Having properly functioning PMUs alone is not sufficient for making the system observable, as the measurements obtained from the PMUs need to be made available at a PDC. Therefore, a communication network is necessary for sending the measurement data obtained by PMUs to PDCs. This translates into the following condition.

Condition 2 (Network Connectivity):
There is a path consisting of properly functioning (i.e., uncompromised) routers to transmit measurement data from a PMU to a PDC.

D. PDC Processability
PDCs are data processing hubs for collecting, aligning, and proof-checking data provided by PMUs, without which PMU measurements cannot be effectively used by advanced applications in control centers. Hence, the last condition for system observability is stated as below.

Condition 3 (PDC Processability):
There is at least one properly functioning (i.e., uncompromised) PDC that receives the data from the given PMU via the communication network.
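Conditions 2 and 3 can be checked together with a breadth-first search that starts at the functioning PDCs and expands only through functioning routers. The sketch below is illustrative: the function names, the dictionary-based link representation, and the assumption that a PDC is usable only when the router at its node is also functioning are choices made for this example, not statements from the article.

```python
from collections import deque
from typing import Dict, Set, Tuple


def routers_reaching_pdc(links: Dict[str, Set[str]],
                         functioning_routers: Set[str],
                         functioning_pdc_nodes: Set[str]) -> Set[str]:
    """Routers with a path of functioning routers to a functioning PDC."""
    # Assumption: a PDC node acts as a root only if its router also works.
    roots = functioning_routers & functioning_pdc_nodes
    reached = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nbr in links.get(node, ()):
            if nbr in functioning_routers and nbr not in reached:
                reached.add(nbr)
                queue.append(nbr)
    return reached


def effective_pmus(functioning_pmus: Set[Tuple[int, str]],
                   links: Dict[str, Set[str]],
                   functioning_routers: Set[str],
                   functioning_pdc_nodes: Set[str]) -> Set[Tuple[int, str]]:
    """PMUs, given as (bus, router) pairs, whose data can reach a PDC."""
    reached = routers_reaching_pdc(links, functioning_routers,
                                   functioning_pdc_nodes)
    return {(bus, router) for (bus, router) in functioning_pmus
            if router in reached}


# Hypothetical chain of routers J-K-L with a PDC at node J; router K is down.
links = {"J": {"K"}, "K": {"J", "L"}, "L": {"K"}}
usable = effective_pmus({(1, "J"), (3, "L")}, links,
                        functioning_routers={"J", "L"},
                        functioning_pdc_nodes={"J"})
```

Here the PMU at router L is functioning but not effective, because the only route to the PDC passes through the compromised router K.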

E. Illustration of Observability Conditions and Recovery Processes
The three fundamental conditions for the observability of any bus in the power grid are illustrated in Fig. 3. The bus where a properly functioning PMU is installed and all of its neighboring buses satisfy the PMU measurability condition. To satisfy the network connectivity condition and the PDC processability condition, this PMU needs to be connected to a path that consists of properly functioning routers and ends at a properly functioning PDC, so that the measurement data from the PMU can be transmitted to the PDC.
However, it should be noted that there can be multiple combinations of PMUs, routers (communication paths), and PDCs through which a bus can be observable. For instance, Bus 1, marked by the red circle in Fig. 3, can be observable in many possible ways, with two of them being illustrated in the figure and described below. In the first way, the PMU measurability condition is satisfied by the measurement of PMU 1, a PMU located at a neighboring bus of Bus 1, shown by the yellow solid line; the network connectivity condition is fulfilled by establishing a communication path via a set of routers shown by the orange solid line; and the PDC processability condition is satisfied by the processing of data at PDC 1. In the second way, the PMU measurability, network connectivity, and PDC processability conditions are satisfied by the measurement of PMU 2 shown by the yellow dashed line, the communication path via another set of routers shown by the orange dashed line, and the data processing at PDC 2, respectively. In either of these two ways, Bus 1 can become observable. As the observability of each bus may be achieved in multiple ways, and the observability of different buses may be achieved in shared ways, it is important to formulate a cyber restoration planning problem to guide the system operator with a globally optimal restoration strategy to retrieve system observability after major interruptions.
The topological properties of both the physical power grid and the cyber communication network should be integrated in order to develop a truly effective strategy for observability recovery. Take the example of Bus 1 in Fig. 3 again. Without knowing the communication network topology, the restoration decision of PMUs is only determined by the power grid topology. As such, there is no priority between the restorations of PMU 1 and PMU 2, as either of these two PMUs can make Bus 1 observable. However, it is possible that one of these PMUs is closer to a restored PDC or a restored communication subnetwork, so that fewer resources are needed to restore its connection and make Bus 1 observable. On the other hand, without knowing the power grid topology, the communication network considers PMU 1 and PMU 2 as independent end devices, and the objective is to restore the connection of as many end devices as possible. Hence, resources will be spent to restore connections of both PMU 1 and PMU 2. However, as can be seen from Fig. 3, because of the power grid topological relation, restoring either of the PMUs brings identical benefit for the observability recovery of Bus 1. Hence, there is no need to restore both PMUs, and resources could be spent to restore other devices that bring additional benefits to power grid observability. Thus, without integrating the topological knowledge of both the physical grid and the cyber network, the cyber restoration strategy cannot be optimally determined for supporting power grid monitoring and operation.

A. Problem Background and Objective
Suppose that a major disturbance such as a cyber attack occurs in a PMU network. A substantial portion of measurement data cannot be received, trusted, or utilized due to the improper functioning of a number of cyber components, and the system-wide observability is inevitably lost. Under these circumstances, the only way to recover the system observability is to restore the compromised components and bring them back to their normal statuses. Restoration of the compromised components may consume a significant amount of time. For instance, in the case of malware, the malware has to be detected and removed, and the original software, possibly with a security patch, has to be reinstalled. In the case of a DoS attack, the target component (router, switch, PMU, or PDC) cannot perform its primary functionality. To fully restore the component, it can take a few minutes to several hours to implement the updated security measures [39]. In the case of an FDI attack, the falsified data have to be identified and replaced by the authentic data. When the compromised components cannot be identified quickly, the first step is to screen all suspected components, and in the meantime all the suspected components have to be considered as unusable. Most of the actions mentioned above involve sophisticated human intervention, and consume from tens of minutes to hours [40]. With the constraint of available resources (for example, the number of IT crews) for the restoration tasks, a systematic approach should be developed to prioritize the restoration tasks and facilitate a swift observability recovery.
As seen from Fig. 1, the restoration process takes place in Stage III. The focus of this article is on this recovery stage, where the objective is to maximize the area below the curve (shaded area), which is equivalent to maximizing the integral of the functionality level over time. For the particular problem proposed in this article, the functionality means the observability of the power system.

B. Assumptions
The main assumptions of the problem are as follows.
1) For a bus to be observable, PMU measurability, communication network connectivity, and PDC processability conditions should be satisfied as described in Section III.
2) The number of available cyber restoration resources for each restoration step is limited.
3) The time required for the restoration of each device can be estimated before the restoration process.
4) The restoration of PMUs, routers, and PDCs is irreversible, meaning that a cyber component will not be compromised again once it is restored or if it is not affected by the disturbance.

C. Status Definition
If a PMU is restored or unaffected by the disturbance, it is defined as a functioning PMU, and the statuses of functioning PMUs after each restoration step can be tracked using a set of binary decision vectors, $x^{(1)}, x^{(2)}, \ldots, x^{(\beta)}$. For example, the (i, j)th entry in the αth step vector, $x^{(\alpha)}$, is defined in Table I. As has been mentioned, the data of a PMU will be usable only when the PMU itself is restored and its data can be transmitted and processed by restored routers and PDCs. Hence, we define the PMUs whose data are usable as effective PMUs, and correspondingly a set of binary status vectors, $s^{(1)}, s^{(2)}, \ldots, s^{(\beta)}$, are defined. The (i, j)th entry in the αth step vector, $s^{(\alpha)}$, is defined in Table I.
If a router is restored or not affected by the disturbance, then it is defined as a functioning router, and if the functioning router is connected to a PDC, then it is defined as an effective router. However, because of the restoration strategy described in the following section, a functioning router can be naturally considered as an effective router, and there is no need to distinguish between these two concepts. A set of binary decision vectors, $t^{(1)}, t^{(2)}, \ldots, t^{(\beta)}$, are defined to track the statuses of the functioning routers after each restoration step. The jth entry in the αth step vector, $t^{(\alpha)}$, is defined in Table I.
Similarly, if a PDC is restored or not affected by the disturbance, then it is defined as a functioning PDC. For tracking the statuses of functioning PDCs, a set of binary decision vectors, $u^{(1)}, u^{(2)}, \ldots, u^{(\beta)}$, are defined. The jth entry in the αth step vector, $u^{(\alpha)}$, is defined in Table I. Next, we will develop the objective function and the constraints for the optimal cyber restoration planning problem.
D. Problem Formulation
1) Objective Function: As has been mentioned, the objective of the proposed optimization problem is to quickly recover the observability level of the system, which is equivalent to maximizing the integral of the observability level over time. Hence, the objective function can be written as maximizing the number of observable buses over time as follows:

$$\text{maximize} \quad \sum_{\alpha=1}^{\beta} w^{T} y^{(\alpha)} \tag{2}$$

where $y^{(\alpha)} = [y_1^{(\alpha)}, y_2^{(\alpha)}, \ldots, y_i^{(\alpha)}, \ldots, y_n^{(\alpha)}]^{T}$ is a binary status vector for the αth step, indicating whether each bus is observable, with its entry $y_i^{(\alpha)}$ defined in Table I. In (2), w is a weight vector indicating the degree of importance of each bus. If w takes a vector with all 1s, the objective will be the maximization of the shaded area enclosed below the curve in Stage III shown in Fig. 1, with the number of observable buses being the measure of observability level, and the restoration steps being the discretization of time. The objective (2) is a materialization of the general cyber restoration objective function (1), where the level of functionality is defined as the level of observability. It should be noted that the integral of the functionality over time is reduced to the summation of the observability of all time steps, as the proposed optimization problem is mathematically developed as a discrete-time problem.
Fig. 4. Illustration of router restoration strategy in a simple example: starting with neighboring versus starting with remote. (a) PMU network for a simple power system. (b) Observability growth for the first restoration strategy (starting with neighboring PMUs: router J, router K, router L). (c) Observability growth for the second restoration strategy (starting with remote PMUs: router L, router K, router J).
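To make the discrete-time objective concrete, the sketch below evaluates the weighted sum of observable buses over all restoration steps for a given observability trajectory. The function name and the three-bus trajectories are hypothetical and serve only to show that an earlier recovery of observability yields a larger objective value.

```python
from typing import Sequence


def restoration_objective(observability_by_step: Sequence[Sequence[int]],
                          weights: Sequence[float]) -> float:
    """Sum over steps of the weighted number of observable buses.

    observability_by_step[k][i] is 1 if bus i is observable after
    step k + 1 and 0 otherwise (hypothetical input format).
    """
    total = 0.0
    for y_step in observability_by_step:
        total += sum(w * y for w, y in zip(weights, y_step))
    return total


# Hypothetical 3-bus system with unit weights and three restoration steps.
weights = [1.0, 1.0, 1.0]
early = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]  # one more bus observable per step
late = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]   # all buses observable only at the end
early_value = restoration_objective(early, weights)
late_value = restoration_objective(late, weights)
```

Both trajectories end with full observability, but the first accumulates a value of 6 against 3 for the second, which is the discrete counterpart of the larger shaded area in Fig. 1.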
2) Constraints on Restoring Routers and PDCs: To fulfill the PMU measurability, communication connectivity, and PDC processability conditions, the compromised PMUs should be restored and connected to the restored PDCs via a restored communication path. If a PMU is restored and connected to a restored PDC, it can provide an immediate observability growth which will persist throughout the restoration horizon. Conversely, if a PMU is restored while it is not connected to a restored PDC via a communication path of restored routers, then the restoration of this PMU will not contribute to the observability growth until it is connected to a restored communication path at a later step. As the usability of the restored PMU is delayed, the resources used to restore the PMU will not result in the fastest observability growth. In other words, the contribution of the PMU to the shaded area under the curve in Stage III shown in Fig. 1 will not be maximal. Hence, it will be most efficient to take restored PDCs as root nodes, and expand the communication forest from them by restoring their neighboring routers. In this way, after the restoration of a router, a path will be immediately available between the restored router and a restored PDC, and thus the restored router will become immediately usable for transmitting data to a PDC. On the other hand, if routers that are remote from PDCs are restored first, since there are still compromised routers in between, no path can be immediately set up between the restored routers and the PDCs, and thus they will not be usable.
The benefit of prioritizing the neighboring routers of restored routers is illustrated by a simple example in Fig. 4.
Suppose that a small PMU-measured power system consists of three PMUs, three routers, and one PDC, as shown in Fig. 4(a). As mentioned, the restoration of the communication tree should treat the PDC as the root node; hence, suppose that the PDC is restored in an initial step. For simplicity, suppose that the PMUs are also restored, and only the routers remain to be restored. At this point, consider the following two restoration strategies. The first strategy is to restore router J, which is a neighbor of the already restored PDC, followed by router K, and finally router L, the farthest from the PDC. The second strategy is to start by restoring router L, which is the farthest from the already restored PDC, followed by router K, and finally router J, the closest one to the PDC. If the first strategy is followed, then in each restoration step the newly restored router is immediately usable for sending the corresponding PMU data to the PDC. As a result, the observability level increases with the restoration of each router, as shown in Fig. 4(b). On the other hand, if the second strategy is followed, restoring routers L and K does not increase the observability level of the system, as they are not connected to a functioning communication path. Under this strategy, the observability of the system increases only after the restoration of router J, when routers L and K are finally connected to the PDC, as depicted in Fig. 4(c). With the first strategy, although the whole power system does not become observable during the first two steps, the restored devices still provide some degree of observability to the grid operator, which helps them understand the real-time operating condition of the system and take corrective actions. With the second strategy, however, no observability is achieved during the first two steps. Although this strategy finally recovers the same level of observability as the first strategy, it prevents the grid operator from obtaining partial information from the grid in the earlier steps. Accordingly, the area under the observability curve of the first strategy is greater than that of the second strategy. Thus, it can be concluded that restoring the neighboring routers of restored routers is more beneficial than restoring remote routers.
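The comparison of the two strategies can be reproduced with a minimal sketch. It assumes a chain PDC, J, K, L with one restored PMU per router, and it assumes that each router that is restored and connected to the PDC contributes one unit of observability; the function name `observability_curve` and the unit weights are illustrative and are not taken from Fig. 4.

```python
# Toy model of the chain PDC - J - K - L with one (already restored) PMU per router.
# Assumption: every router that is restored AND connected to the PDC through
# restored routers contributes one unit of observability.

CHAIN = ["J", "K", "L"]  # routers ordered by distance from the PDC


def observability_curve(order):
    """Return the observability level after each restoration step."""
    restored = set()
    curve = []
    for router in order:
        restored.add(router)
        # Walk outward from the PDC and stop at the first unrestored router.
        level = 0
        for r in CHAIN:
            if r not in restored:
                break
            level += 1
        curve.append(level)
    return curve


near_first = observability_curve(["J", "K", "L"])  # first strategy
far_first = observability_curve(["L", "K", "J"])   # second strategy

# The sum of each curve is a discrete stand-in for the area under it.
print(near_first, sum(near_first))
print(far_first, sum(far_first))
```

Under these assumptions both strategies end at the same level, but the near-first order accumulates twice the area of the far-first order, which is the effect described above.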
Suppose communication trees consisting of restored PDCs and routers have been formed in previous steps. For a given step, routers which are immediate neighbors of the trees should enjoy the highest priority for restoration. We refer to these routers as the 1st tier routers in a given step. Similarly, 2nd tier routers refer to neighbors of 1st tier routers, 3rd tier routers refer to neighbors of 2nd tier routers, and so forth.
The general principle of router restoration is that: 1) the routers of functioning PDCs will be restored first and 2) in each step, the newly restored routers should be connected to previously restored routers either directly or via other newly restored routers. This can be best illustrated with the help of the example shown in Fig. 5. Suppose that the PDCs and routers marked in black are already restored from previous steps, forming two communication trees. In this particular step, the neighboring routers of the restored routers, marked in blue, are considered as the 1st tier routers. If the 1st tier routers are restored, they will be immediately connected to the restored PDC via the functioning communication tree, and can be used for PMU data transmission. Higher tier routers should be restored only when the associated lower tier routers are restored. For ease of explanation, assume that only router $T_1$ is selected for restoration among the 1st tier routers in Fig. 5. In this condition, routers $T_{11}$ and $T_{12}$ (marked in orange), which are 2nd tier routers associated with $T_1$, can be considered for restoration, since they will be connected to the restored communication tree via $T_1$ immediately. On the other hand, since the 1st tier router $T_2$ is not selected for restoration in this step, restoring its associated 2nd tier router $T_{21}$ will not increase the connectivity of the network. Thus, $T_{21}$ should not be considered as a candidate for restoration in this step.
As mentioned earlier, in order to form communication trees, at least one PDC should be properly functioning at the very beginning to serve as the root. Therefore, if all PDCs are compromised, the initial step always involves the restoration of at least one PDC. During the restoration process, more PDCs can be restored, adding new root nodes and, thus, new trees into the network. In the example shown in Fig. 5, $U_1$ may be the PDC that is initially restored, from which Tree 1 is grown. During the restoration process, another PDC, $U_2$, may be restored, from which another tree, Tree 2, is grown.
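The tier definition and the candidate rule can be sketched as a breadth-first search from the restored trees. The adjacency below is a hypothetical fragment that only mimics the naming of the example (a restored router `R0` with 1st tier routers `T1` and `T2`); it is not the topology of Fig. 5, and the helper names are assumptions made for illustration.

```python
from collections import deque


def router_tiers(adjacency, restored):
    """Map each unrestored router to its tier (hop distance to the restored trees)."""
    tiers = {}
    seen = set(restored)
    queue = deque((r, 0) for r in restored)
    while queue:
        node, tier = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                tiers[nbr] = tier + 1  # lowest tier through which nbr is reachable
                queue.append((nbr, tier + 1))
    return tiers


def candidates(adjacency, restored, chosen):
    """Routers that would be connected to a tree if restored in this step,
    given the routers already chosen in the same step."""
    reach = set(restored) | set(chosen)
    return {n for r in reach for n in adjacency.get(r, ()) if n not in reach}


# Hypothetical fragment: R0 is restored; T1 and T2 are its neighbors.
adjacency = {
    "R0": ["T1", "T2"],
    "T1": ["R0", "T11", "T12"],
    "T2": ["R0", "T21"],
    "T11": ["T1"],
    "T12": ["T1"],
    "T21": ["T2"],
}

tiers = router_tiers(adjacency, {"R0"})
after_t1 = candidates(adjacency, {"R0"}, {"T1"})
print(tiers)
print(sorted(after_t1))
```

With only `T1` chosen, the candidate set contains `T2`, `T11`, and `T12` but not `T21`, in line with the rule that a higher tier router is considered only when its associated lower tier router is restored.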
For the αth step, the constraints for restoring PDCs (including their routers) can be formulated as follows: where $M \in \mathbb{R}^{w \times m}$ is a binary matrix whose entry $(j_1, j_2)$ is defined in Table I. A set of intermediate binary decision vectors is used to track router restoration up to each tier in a given step. Denoting q as the maximum number of tiers allowed to be restored in one step, vectors for the αth step can be denoted as For the pth tier, the jth entry is defined in Table I.
As previously discussed, at the αth step, for any intermediate tier p, only the routers that are neighbors of already restored routers should be candidates for restoration, as expressed by where $D \in \mathbb{R}^{w \times w}$ is a binary connectivity matrix representing the connections between routers in the communication network, whose entry $(j_1, j_2)$ is defined in Table I. For the last tier, the following constraint will be present: where $t^{(\alpha)}$ indicates all the routers that are properly functioning at the completion of the entire αth step restoration.
3) Constraints on Effective PMUs: As discussed, an effective PMU needs to satisfy the PMU measurability, the communication network connectivity, and the PDC processability conditions. It is noteworthy that constraints (3), (4), and (5) already ensure that restored routers are always connected to at least one restored PDC through a functioning communication network. Hence, as long as the PMU and the router at the same location are both restored, this PMU can be considered as effective where $H \in \mathbb{R}^{v \times w}$ is a binary matrix converting the size of the $t^{(\alpha)}$ vectors into the size of the $s^{(\alpha)}$ vectors, and the definition of its entry $(i_1 j_1, j_2)$ is presented in Table I.

4) Constraints on Observability:
The observability of buses after a given step is constrained as where $C \in \mathbb{R}^{n \times v}$ is a binary connectivity matrix showing the connections between buses, whose entry $(i_1, i_2 j_2)$ is defined in Table I. To consider the effect of the zero-injection buses, constraint (7) can be modified using the results in the existing literature [41], [42].
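The effective-PMU and observability conditions can be illustrated with a small set-based sketch. It assumes the common rule that an effective PMU makes its own bus and every adjacent bus observable, it ignores zero-injection buses, and it identifies each PMU and router by its bus number; the five-bus network and the function name `observable_buses` are hypothetical and do not reproduce the matrices of Table I.

```python
def observable_buses(neighbors, pmus_restored, routers_connected):
    """Return (effective PMUs, observable buses).

    neighbors         : dict bus -> list of adjacent buses (power network)
    pmus_restored     : set of buses whose PMU is restored
    routers_connected : set of buses whose router is restored and connected
                        to a restored PDC
    """
    # A PMU is effective only if its co-located router is usable as well.
    effective = pmus_restored & routers_connected
    observable = set()
    for bus in effective:
        observable.add(bus)               # the PMU bus itself
        observable.update(neighbors[bus])  # assumed: all adjacent buses
    return effective, observable


# Hypothetical 5-bus network used only for illustration.
neighbors = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}

effective, observable = observable_buses(
    neighbors, pmus_restored={1, 4}, routers_connected={1, 2, 3}
)
print(sorted(effective), sorted(observable))
```

In this toy case the PMU at bus 4 is restored but contributes nothing, because its router is not yet connected; only the PMU at bus 1 is effective.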

5) Constraints on Restoration Resources:
Since the total amount of resources for cyber restoration is limited for each step, a set of constraints that limit the numbers of PMUs, routers, and PDCs to be restored in a given restoration step α can be formulated as follows: where $d^{(\alpha)}$ is the amount of restoration resources at step α; e, g, and h are vectors whose entries represent the amount of restoration resources required by each PMU, router, and PDC, respectively. If the restoration resources are IT crews and the restoration action of a device spans a full time step, then $d^{(\alpha)}$, e, g, and h should all be (or consist of) integers. However, if the restoration of a device takes only a portion of a time step, then the corresponding entries of this device in the restoration resource vectors e, g, and h can be decimals.
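A per-step budget check along these lines can be sketched as follows. The inequality used here, total cost of the devices newly restored in the step not exceeding $d^{(\alpha)}$, is an assumed reading of the resource constraint rather than a reproduction of it; status vectors are plain 0/1 lists, and the function names are illustrative.

```python
def resources_used(prev, curr, cost):
    """Resources spent to move a 0/1 status vector from `prev` to `curr`."""
    return sum(c * (b - a) for a, b, c in zip(prev, curr, cost))


def within_budget(x_prev, x, t_prev, t, u_prev, u, e, g, h, d):
    """Assumed form of the budget check for one step.

    x, t, u : PMU, router, and PDC status after the step (0/1 lists)
    e, g, h : resources required by each PMU, router, and PDC
    d       : resources available in this step
    """
    used = (
        resources_used(x_prev, x, e)
        + resources_used(t_prev, t, g)
        + resources_used(u_prev, u, h)
    )
    return used <= d


# One PMU, two routers, and one PDC are restored in this step.
args = dict(
    x_prev=[0, 0], x=[1, 0],
    t_prev=[0, 0, 0], t=[1, 1, 0],
    u_prev=[0], u=[1],
    e=[1, 1], g=[1, 1, 1], h=[1],
)
print(within_budget(d=4, **args))  # four unit-cost devices, budget 4
print(within_budget(d=3, **args))  # same plan, budget 3
```

Setting an entry of `g` to 0 lets the corresponding router be picked without consuming resources, and fractional entries model devices whose restoration takes only part of a time step.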

6) Constraints on Maintaining the Normal Status of Already Restored PMUs:
It is assumed that the restoration process is irreversible, namely, if a PMU is already restored in the previous step or not affected by the disturbance, then it remains in normal status in the rest of the restoration process where 0 is a null vector with a proper dimension.

7) Constraints on Maintaining the Normal Status of Already Restored Routers:
Similarly, it is assumed that the restoration of routers is irreversible. In addition, in a particular step, the selection of lower tier routers for restoration cannot be reversed when considering the selection of higher tier routers for restoration

8) Constraints on Maintaining the Normal Status of Already Restored PDCs:
Similar to the PMUs and routers, it is also assumed that the restoration of PDCs is irreversible. The constraints can be formulated as below

9) Optimal Cyber Restoration Planning Problem:
In order to maximize the number of observable buses over time in the restoration process, the optimization problem can be formulated by combining all the concepts and constraints developed so far, as in (14) maximize (2) subject to: In (14), $x^{(0)}$, $t^{(0)}$, and $u^{(0)}$ indicate the system's initial condition. They are constant binary vectors whose entries with a value of 1 indicate the PMUs, routers, and PDCs, respectively, that survive the major disturbance, such as a cyber attack, and are properly functioning from the beginning of restoration. Note that $t^{(0)}$ should only account for the uncompromised routers that can find an uncompromised path to at least one uncompromised PDC, since uncompromised routers that are disconnected from uncompromised PDCs will not be immediately usable for transmitting data. The latter should be accounted for by setting the corresponding amounts of resources taken for restoration to 0 in vector g in (8), such that in the restoration process, they can be picked without taking any resources. When formulating the problem, the number of restoration steps, β, should be greater than or equal to the actual number of steps needed to recover full observability. It can be determined with the help of any heuristic method, as heuristic methods take no fewer steps to recover the system observability than the optimization method (14). Examples of some heuristic methods are discussed in the simulation section. The solution vectors $x^{(\alpha)}$, $t^{(\alpha)}$, and $u^{(\alpha)}$ ($\alpha = 1, 2, \ldots, \beta$) provide the optimal sequential restoration strategy of PMUs, routers, and PDCs that will increase observability most promptly. The vectors $y^{(\alpha)}$ ($\alpha = 1, 2, \ldots, \beta$) give the locations of the observable buses after each restoration step. The optimization problem stated in (14) can be solved with the help of any well-developed MILP solver (see [43], [44] for detailed information).

E. Scalability to Large Power Systems
In addition to the original formulation (14), the proposed optimization problem can be formulated as a rolling-horizon optimization problem [45] to increase the computational efficiency of the cyber restoration problem and achieve scalability. Herein, instead of solving the MILP problem for the entire restoration horizon, the problem is solved only for a few steps ahead. In order to obtain the full restoration strategy, the MILP problem is solved for the reduced restoration horizon iteratively until all the buses become observable. For the μth iteration of the process, the optimization problem (14) is solved for step μ to step (μ + σ). After obtaining the solution to the restoration problem for this restoration horizon, only the immediate step (i.e., the μth step) is taken. For the next iteration, the optimization problem is solved for step (μ + 1) to step (μ + 1 + σ), and so forth. The rolling-horizon subproblem for step μ is as follows: where (μ) represents the objective function, and (μ) includes the constraints of the original problem (14) for the μth iteration.
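The rolling-horizon procedure can be summarized by a short driver loop. In the sketch below, `plan` stands in for the MILP solve over steps μ to μ + σ; here it is a trivial nearest-first rule on a three-router chain, chosen only so that the loop runs without a solver. All function names and the toy state representation are assumptions made for illustration.

```python
def rolling_horizon(state, plan, apply, is_done, sigma, max_steps=100):
    """Plan sigma + 1 steps ahead, commit only the first step, and repeat."""
    schedule = []
    for _ in range(max_steps):
        if is_done(state):
            break
        window = plan(state, sigma + 1)  # subproblem for steps mu .. mu + sigma
        first = window[0]                # only the immediate step is taken
        state = apply(state, first)
        schedule.append(first)
    return schedule, state


# Toy stand-ins (NOT the MILP of the paper): routers J, K, L on a chain,
# one router restored per step, planned in nearest-first order.
CHAIN = ["J", "K", "L"]


def toy_plan(state, horizon):
    remaining = [r for r in CHAIN if r not in state]
    return remaining[:horizon]


def toy_apply(state, router):
    return state | {router}


def toy_done(state):
    return len(state) == len(CHAIN)


schedule, final_state = rolling_horizon(
    frozenset(), toy_plan, toy_apply, toy_done, sigma=1
)
print(schedule)
```

Replacing `toy_plan` with a call that solves the reduced-horizon MILP and returns its per-step decisions would give the iteration described above; the `max_steps` guard is a safety bound added for the sketch.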
The scalability can also be addressed by partitioning the large-scale power system into smaller overlapping areas, and the optimization problem can be solved for each individual area, respectively. In this approach, different areas should overlap at the boundary buses, i.e., each boundary bus should appear in at least two areas, such that all cross-area topological connections can be taken into account.

A. Simulation Settings
The proposed cyber restoration problem is illustrated on the IEEE 57-bus system. We consider a regional PMU network consisting of 37 PMUs, 57 routers, and 2 PDCs (the specific configurations are given with the results presented below). The PMU configuration is generated by selecting an essential set of PMUs (i.e., a minimum set of PMUs that makes the system observable) along with a randomly located set of redundant PMUs to increase resiliency. As for the locations of PDCs, bus 1 is first selected since it is the slack bus and also has a large number of connections. After that, bus 57, which has a relatively small number of connections and is also far from bus 1, is selected in order to improve the representativeness of various practical conditions. For the simulation cases in this article, it is assumed that the topology of the communication network is the same as the topology of the power system. However, the restoration planning method applies to general cases where the communication network topology is arbitrarily different from the power system topology.
The observability loss index for maximizing the number of observable buses can be computed by summing the difference between the total number of buses n and the number of observable buses after each restoration step:

$$R = \sum_{\alpha=1}^{\beta} \left( n - \mathbf{1}^{T} y^{(\alpha)} \right). \tag{16}$$

In the simulation cases, it is considered that all the PMUs, routers, and PDCs are compromised. The restoration of every cyber component requires one resource, and only 1st and 2nd tier routers are considered in each step. It should be noted that the proposed method can be conveniently applied to partial observability blackouts, various requirements of restoration resources, and higher tiers of router restorations.
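As a worked instance of the index in (16), the sketch below evaluates R for two hypothetical three-step recovery sequences on a three-bus toy system. The sequences and the function name `observability_loss` are illustrative and are unrelated to the IEEE 57-bus results.

```python
def observability_loss(n_buses, observable_counts):
    """Sum over steps of (total buses - observable buses after that step)."""
    return sum(n_buses - count for count in observable_counts)


# Hypothetical sequences: number of observable buses after steps 1, 2, 3.
gradual = [1, 2, 3]  # observability grows at every step
late = [0, 0, 3]     # observability appears only at the last step

print(observability_loss(3, gradual))
print(observability_loss(3, late))
```

Both sequences reach full observability in three steps, yet the late sequence incurs twice the loss, which is why the index rewards early partial observability rather than only the completion time.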
The proposed MILP problems (14), (15) are developed and solved using MATLAB 2019b and Gurobi 9.1.1 software. A personal computer with Intel Core i7 CPU @ 2.60 GHz and 16-GB RAM is used to perform the simulations.

B. Baseline Methods
As there is no comparable prior research, the efficacy of the proposed optimization method is validated by comparison against two heuristic methods, namely the random method and the greedy method. These methods are used to imitate the restoration actions that system operators may take in the absence of an optimized cyber restoration algorithm.
1) Greedy Method: For the greedy method, a set of heuristics is set up to find a combination of devices that yields a large observability increase in the immediate step. Throughout the process, the unrestored PMUs with restored routers and the unrestored routers with restored PMUs receive the highest priority for restoration, since they provide immediate observability growth with the fewest resources. This is followed by the restoration of fully unrestored pairs of PMUs and 1st tier routers. If there are more unrestored pairs than available restoration resources, the pairs that recover the highest observability level are selected. If more than one combination of cyber components has a similar effect, one of them is selected at random. After all the unrestored pairs are restored, the single unrestored 1st tier routers without pairing PMUs are chosen for restoration. If restoration resources still remain, the higher tier routers and their pairing PMUs are restored until the resources are exhausted. The pseudocode of the greedy method is given in Algorithm 1 in the supplementary document due to the space limit of this article. In summary, the greedy method looks only one step ahead and seeks to maximize immediate observability growth instead of considering the entire restoration process in decision making.
2) Random Method: For the random method, if all the PDCs are compromised, one PDC from all the candidate PDCs is randomly chosen for restoration in the first step. Afterwards, a random combination of PMUs, routers, and PDCs that exhausts the resources of the particular step is chosen in each step. Here, only the routers that are direct neighbors of previously restored routers are chosen, since restoring remote routers is an obviously unreasonable choice. The pseudocode of the random method is shown in Algorithm 2 in the supplementary document due to the space limit of this article. Next, two simulation cases for testing the performance of the proposed method under different levels of available restoration resources are presented.
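The router-selection rule of the random method can be sketched as below. This is a simplified, router-only illustration, not Algorithm 2 of the supplementary document: PMUs and PDCs are left out, the adjacency is hypothetical, and the function name `random_router_step` is an assumption.

```python
import random


def random_router_step(adjacency, restored, budget, rng):
    """Pick up to `budget` routers at random among the direct neighbors
    of already restored routers."""
    frontier = sorted(
        {nbr for r in restored for nbr in adjacency[r] if nbr not in restored}
    )
    picked = rng.sample(frontier, min(budget, len(frontier)))
    return set(picked)


# Hypothetical communication network with one restored router R0.
adjacency = {
    "R0": ["T1", "T2"],
    "T1": ["R0", "T11"],
    "T2": ["R0", "T21"],
    "T11": ["T1"],
    "T21": ["T2"],
}

rng = random.Random(0)  # seeded only to make the sketch repeatable
step = random_router_step(adjacency, {"R0"}, budget=1, rng=rng)
print(step)
```

Because the frontier is restricted to direct neighbors, remote routers such as `T11` and `T21` can never be drawn before their parent router is restored.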

C. Simulation Case 1
In the first simulation case, the number of restoration resources for each step is limited to 8, which means eight cyber components (PMUs, routers, and PDCs) can be restored in each step. The observability recovery curves obtained by using the optimization method, the greedy method, and the random method are shown in Fig. 6. All the buses are considered to have equal importance; that is, the weight vector w in (2) is set as a vector with all 1s. There are multiple solutions for the greedy method and the random method, resulting in multiple curves in the figure. The reason is that both methods involve some degree of random selection, as mentioned in Section V-B. To compare the results of the heuristic methods with the optimization method on a common basis, the average curves over 10 curves of the random method and the greedy method are plotted, respectively. As seen from Fig. 6, the optimization method takes eight steps to fully recover the system observability. In contrast, the restoration strategies obtained from the greedy method and the random method need 12 steps and 13 steps, respectively, which is much slower than the optimization method. Noticeably, the optimization method performs better than the greedy method throughout the whole restoration process, and the difference becomes more significant in the later steps. The reason is that at the beginning of the restoration process, the number of candidate cyber components to be restored is small for both methods (the communication trees/forest are yet to be grown); as the restored communication trees/forest expand, the number of candidate cyber components also increases, making it difficult for the greedy method to find the restoration strategy that is truly the most beneficial. Moreover, the greedy method only concentrates on gaining the observability increase one step ahead, resulting in suboptimal solutions for the entire restoration process. In contrast, the optimization method solves the multistep decision making problem globally and finds the strategy with the minimum overall observability loss.
To compare the performance of different restoration strategies quantitatively, the observability loss index R for the different restoration strategies is listed in Table II. The observability loss indices for the random method, the greedy method, and the optimization method are presented in the "Rand.," "Greedy," and "Optim." columns, respectively. The percentage reductions of the index brought about by the optimization method with the random method and the greedy method as baselines are shown in the "Optim. versus Rand." and "Optim. versus Greedy" columns, respectively. The proposed optimization method reduces the observability loss by 37.84% compared to the greedy method and by 54.73% compared to the random method. The complete step-by-step restoration strategy for simulation case 1 yielded by the optimization method is detailed in Table III. The "Restored Components" column presents the cyber components restored in each step, where "P" refers to a PMU, "R" refers to a router, and "D" refers to a PDC. For instance, P1, R1, and D1 represent the PMU, router, and PDC located at bus 1, respectively. The buses that become observable after each restoration step are recorded in the "Buses Regaining Observability" column. After the 8th step, all buses in the system have become observable.
To show the computational efficiency of the proposed cyber restoration problem when using the rolling-horizon optimization method, the computational times taken by different lengths of the restoration horizon in the first simulation case are listed in Table IV. The columns "2," "3," "4," "5," and "6" represent the length of the restoration horizon, and the column "Compl. Soln." represents the solution obtained by solving the MILP problem over the entire restoration horizon. For the complete solution, it takes 300.28 s to reach the optimal solution, with the observability loss index being 185.5. For solutions with 2-step, 3-step, 4-step, 5-step, and 6-step restoration horizon, .48 s, and 156.6 s, respectively. As can be observed, when the restoration horizon is 2 steps long, it only takes 2.73 s for computation, but the observability loss index, at 194.5, is higher than that of the complete solution, indicating that the solution is suboptimal. However, starting from the solution with a 3-step restoration horizon (which takes only 4.79 s), the observability loss index becomes equal to that of the complete solution, implying that the solution is already optimal and equivalent to the complete solution. Therefore, it can be concluded that the cyber restoration problem can be solved using the rolling-horizon optimization method to achieve the optimal solution while drastically reducing the computational time.

D. Simulation Case 2
In the second simulation case, all conditions remain the same except that the restoration resources are increased from 8 per step to 12 per step. Fig. 7 illustrates the observability recovery curves of the three methods. As expected, all three methods yield solutions that recover full observability faster than in the first case, taking 6, 9, and 10 steps, respectively. As seen from the observability loss index R for different restoration strategies listed in Table V, the proposed optimization method reduces the observability loss by 39.66% compared to the greedy method and by 55.40% compared to the random method. Although all three methods perform faster than in simulation case 1, the observability loss reduction achieved by the proposed optimization method relative to the greedy method is higher in simulation case 2 than in simulation case 1. This shows that as the number of restoration resources gets larger, it becomes even more difficult for the greedy method to produce effective results simply based on rules of thumb. The complete step-by-step restoration strategy for simulation case 2 yielded by the optimization method is detailed in Table VI. It can be seen that after the 6th step, all buses in the system have become observable.
To demonstrate the computational efficiency obtained by applying the rolling-horizon optimization method to simulation case 2, the computational times taken by different lengths of the restoration horizon in the second simulation case are listed in Table VII. For solutions with 2-step, 3-step, 4-step, 5-step, and 6-step restoration horizons, the times consumed by the solver are only 2.82 s, 3.26 s, 3.37 s, 4.12 s, and 4.44 s, respectively. For the complete solution, the time consumed is 22 s, with the observability loss index being 127.5. As observed from the table, the 2-step restoration horizon consumes only 2.82 s with the observability loss index being 129.5, which is slightly higher than that of the complete solution, indicating that the solution is suboptimal. However, similar to simulation case 1, starting from the 3-step restoration horizon, the observability loss index becomes equal to that of the complete solution, indicating that the solution is optimal and equivalent to the complete solution. Thus, the computational times in simulation case 2 further confirm that the rolling-horizon optimization method can be used to obtain the optimal solution with less computational time.
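The reported times and loss indices can be turned into two simple figures of merit: the speedup of the 3-step horizon over the complete solution, and the relative excess loss of the 2-step horizon. The short calculation below uses only the values quoted in the text for the two cases; the function name `summarize` and the choice of these two ratios are illustrative.

```python
def summarize(t_full, t_three_step, r_full, r_two_step):
    """Return (speedup of the 3-step horizon, % excess loss of the 2-step horizon)."""
    speedup = t_full / t_three_step
    gap_pct = 100.0 * (r_two_step - r_full) / r_full
    return round(speedup, 1), round(gap_pct, 2)


# Case 1: 300.28 s vs 4.79 s; loss index 185.5 (complete) vs 194.5 (2-step).
case1 = summarize(300.28, 4.79, 185.5, 194.5)
# Case 2: 22 s vs 3.26 s; loss index 127.5 (complete) vs 129.5 (2-step).
case2 = summarize(22.0, 3.26, 127.5, 129.5)

print(case1)  # (62.7, 4.85)
print(case2)  # (6.7, 1.57)
```

On these numbers the 3-step horizon is roughly 63 times faster than the complete solution in case 1 and about 7 times faster in case 2, while the 2-step horizon loses about 4.9% and 1.6% in the index, respectively.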
To illustrate the optimized restoration strategy more explicitly, the power grid observability, communication network connectivity, as well as cyber component restoration status after the 1st step of the simulation case 2 are shown in Figs. 8 and 9. It can be seen that in this step, the PDCs located at bus 1 and bus 57 are restored along with the routers at bus 1 and bus 57. In addition, the 1st tier routers, i.e., the routers at buses 15 and 39, and the 2nd tier routers, i.e., the routers at buses 13, 14, and 37, are also restored. Combining the information from Figs. 8 and 9, it is known that the restored PMUs at buses 1, 14, and 57 can now transmit measurement data to the restored PDCs at buses 1 and 57 through the restored communication forest, making a total of 11 buses observable.
Similarly, the power grid observability, communication network connectivity, as well as cyber component restoration status after the final step (6th step) are illustrated in Figs. 10 and 11. The restored cyber components form a properly functioning PMU subnetwork which allows for necessary data sensing, transmission, and processing functionalities for delivering full observability of the power system.

VI. CONCLUSION
This article presents the concept of cyber restoration of power systems with the objective of recovering observability after major disasters such as widespread attacks. By establishing conceptual analogies between physical system restoration and cyber system restoration in terms of the objective and the restoration strategy, the general idea of cyber restoration planning is proposed. As an example of the generic cyber restoration concept, the restoration strategy of a PMU network for recovering the observability of the power grid is provided. An optimal planning method is proposed for maximizing the cumulative observability over time after an information blackout. A rolling-horizon method is used to solve the proposed cyber restoration problem, which demonstrates computational efficiency for guiding restoration actions online. Simulation results on the IEEE 57-bus system show that by applying the proposed optimization method, power system observability can be restored much faster than by using heuristics, such as the random and greedy methods. The comparative study demonstrates the need for systematic cyber restoration planning for enhancing power system resilience in the information era.
This article pioneers the cyber restoration of power systems by investigating the recovery of observability due to its importance to power system operation as well as the well-established definition and quantification methods of observability in literature. Provided clearly defined metrics of the controllability of the power system, the same framework can be extended to investigate the cyber restoration strategy for the recovery of system controllability after major disturbances.