A Recovery Model for Faulty Networked System

In this article, we introduce a novel optimization formulation that captures the main aspects of a networked system affected by faults that reduce its efficiency in terms of flow transmission, and that optimizes the response teams in charge of restoring the faulty subsystems. We propose a nonlinear optimization problem based on the max-flow formulation, which merges, in a comprehensive framework, the aspects related to 1) flow management; 2) fault propagation and its impact on network efficiency; and 3) the scheduling of response team interventions that restore the nominal behavior of the network. Finally, to reduce the computational effort required to solve such a problem, we provide a linearized optimization formulation. The results, computed on a test-case network, confirm the model's applicability to real-time restoration planning; moreover, our methodology is sufficiently general to be applicable to a variety of emergency scenarios, such as crowd evacuation or critical infrastructures subject to domino-effect propagation.


I. INTRODUCTION
Proper scheduling and restoration of the services provided by networked systems, such as critical infrastructures, are crucial to minimize disruptions and reduce the downtime of essential services, which can have serious consequences for communities and companies and can trigger cascading effects [1], [2]. The problem of identifying optimal recovery plans for networked systems is known as the integrated network design and scheduling (INDS) problem [3]; it focuses on finding the best operations over the network so as to recover its performance over time. Within this class of problems, the system is modeled as a network where flows represent the provided amount of service, materials, or people. In more detail, such networks are represented by weighted directed acyclic graphs where the flows are generated by the source nodes and sent to the destinations through a set of intermediate nodes, interconnected via limited-capacity links. Disruption of infrastructure components is usually modeled either by reducing the efficiency or ability of source and intermediate nodes to redirect flows toward their outgoing links, or by removing nodes or edges from the graph, thus reducing the ability of the network to deliver the service.
Several scenarios can be modeled as flow networks; some examples are transportation, the shipping industry, water distribution, and traffic in computer and telephone networks [4], [5]. In all these scenarios, the idea is to maximize the flow through the network toward the destination nodes [6]. In the case of a natural or intentional disaster, finding the best scheduling of recovery interventions is a challenge. In the literature, recovery has often been addressed through optimization approaches that aim to reduce the impact of failures by restoring nodes or links. In Table I, we collect the main features of the optimization approaches for network restoration available in the literature. Notice that all these approaches consider elements that are either completely operative or completely inoperative, and do not consider the case where the efficiency of faulty elements undergoes a continuous process of degradation before a restoration process begins.
Contribution and Paper Outline: In this work (see Section II), we propose a novel nonlinear optimization problem, and its linearization, for the recovery of faulty networked systems. The proposed problem extends the INDS formulation with the following improvements.
1) Each node in the network is characterized by an efficiency level in order to represent partially faulty subsystems.
2) A partially faulty node is unable to operate as in its nominal state; its functionality is, indeed, reduced.
3) We explicitly model the dynamics of efficiency degradation in faulty elements, which continues until the elements are reached by the restoration teams.
4) The efficiency of a faulty element increases when a recovery team starts a restoration process on it.
The aim is to restore the network efficiency so as to maximize the flow provided to the destination nodes by scheduling multiple response team interventions. In Section III, we present a parametric analysis and a simulation campaign that show the effectiveness of our optimization model, with a focus on the computational effort required by the nonlinear and the mixed-integer linear programming (MILP) versions of the proposed problem.

II. INCIDENT RESPONSE SYSTEM
We now propose a nonlinear optimization problem that represents a networked system affected by faults, due to physical or cyber attacks, natural disasters, or mechanical failures, that limit the efficiency of the nodes in the network. Similarly to the max-flow problem [6], the aim is to maximize the incoming flow to the destination nodes. Due to the presence of faults, the capacity of the links leaving a faulty node is reduced. In this formulation, we consider the time dimension of the problem in order to maximize the cumulative flow over a time horizon $T$. Moreover, we take into account the presence of response teams able to recover faulty nodes in order to preserve the maximum flow to the destination nodes.
Let $G = \{V, E\}$ be a directed acyclic graph, where $S \subset V$ is the set of sources, each able to generate a flow $f_{i,k}$ at each time instant $k$, and let $D \subset V$ be the set of destination nodes. In the proposed formulation, each link $(i, j) \in E$ allows, at time $k$, the transit of a maximum flow equal to $\alpha_{i,k} c_{i,j} > 0$, where $c_{i,j} > 0$ is the capacity of the link $(i, j)$ and $\alpha_{i,k} \in [0, 1]$ is an efficiency measure, defined for each node in the network except the destination nodes, that represents the state of node $i$ at time $k$. A node $i$ is perfectly efficient at time $k$ if $\alpha_{i,k} = 1$, while $\alpha_{i,k} < 1$ if a fault occurs on node $i$. When a fault occurs, the efficiency level of the node decreases by a constant amount $\delta^-_i > 0$ at each time instant. To restore the operative level, response teams can operate on such a node and increase its efficiency level by $\delta^+_i > 0$ (for each involved team) at each time instant, so as to restore the nominal capacity of its outgoing links.
The model is composed of three sets of constraints: a first set directly derives from the max-flow problem and describes the behavior of the flow on the network; a second set schedules the recovery operations of the response teams; and a third set characterizes the dynamics of the node efficiencies in the presence of faults or recovery operations. We formulate our nonlinear optimization problem as follows.
Problem 1: Consider a directed acyclic graph $G = \{V, E\}$ and let $\alpha_{i,0}$ be the given initial efficiency of each node. Assuming the presence of $L$ response teams, find the flows $x_{i,j,k}$, the node efficiencies $\alpha_{i,k}$, and the recovery plans $q_{l,i,k}$ that solve problem (1a)-(1j).
Our model is defined on the basis of three types of decision variables: the flow on link $(i, j)$ at time $k$ ($x_{i,j,k} \geq 0$); the presence of response team $l$ on node $i$ at time $k$ ($q_{l,i,k} \in \{0, 1\}$, with $q_{l,i,k} = 1$ if team $l$ is recovering node $i$ at time $k$); and the efficiency of node $i$ at time $k$ ($\alpha_{i,k} \in [0, 1]$). The objective (1a) is to maximize the cumulative incoming flow to the destination nodes over the considered time horizon. Constraint (1b) requires that, for each time instant and for each node other than the destination nodes, the incoming flow be greater than or equal to the outgoing flow, where $f_{i,k}$ represents the flow generated by a source node $i \in S$ at time $k$. Such constraints differ from the classic flow-conservation constraints due to the presence of the node efficiency parameter $\alpha_{i,k}$. This parameter, in (1c), affects the capacity of each outgoing link of node $i$; hence, the maximum allowable flow $x_{i,j,k}$ depends on this parameter and on the nominal capacity $c_{i,j}$ of the link.
Equation (1d) defines the dynamic behavior of a node. If node $i$ is perfectly efficient at time $k$ (i.e., $\alpha_{i,k} = 1$) and no faults occur, then $\alpha_{i,k+1} = 1$. If, instead, node $i$ is affected by a fault (i.e., $\alpha_{i,k} < 1$) and no response team is recovering it (i.e., $\sum_{l=1}^{L} q_{l,i,k} = 0$), then its efficiency level decreases by $\delta^-_i > 0$ at each time instant $k$. Finally, if at least one team $l$ is recovering node $i$ at time $k$ (i.e., $\sum_{l=1}^{L} q_{l,i,k} \geq 1$), then its efficiency level increases, proportionally to the number of teams involved in the recovery operations, by $\delta^+_i \sum_{l=1}^{L} q_{l,i,k}$ at each time instant, as long as the response teams act on the node and until the node becomes perfectly efficient. Equation (1e) ensures that, for each node at each time instant, at most $L$ response teams can simultaneously recover the node, while (1f) ensures that, at each time instant, a response team can operate on a single node. Equation (1g) enforces the condition that a node can be recovered at time $k$ only if it is not perfectly efficient (i.e., $\alpha_{i,k} < 1$). Finally, constraints (1h)-(1j) define the upper and lower bounds of each decision variable.
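The three cases of (1d) described above can be sketched as a one-step update rule. The following is a minimal Python sketch based only on the prose description, not on the authors' equation: the function names `next_efficiency` and `trajectory` are hypothetical, and clipping the result to the interval [0, 1] is an assumption suggested by the bounds on the node efficiency.

```python
def next_efficiency(alpha, teams, delta_minus, delta_plus):
    """One-step efficiency update following the three cases described for (1d).

    alpha       -- current efficiency of the node, in [0, 1]
    teams       -- number of response teams working on the node at this instant
    delta_minus -- per-instant degradation of an unattended faulty node
    delta_plus  -- per-instant, per-team recovery gain
    """
    if teams < 0:
        raise ValueError("the number of teams cannot be negative")
    if teams >= 1:
        # Recovery: the gain is proportional to the number of teams and the
        # efficiency saturates at 1 (assumed clipping).
        return min(1.0, alpha + delta_plus * teams)
    if alpha < 1.0:
        # Faulty and unattended: the efficiency keeps degrading and
        # saturates at 0 (assumed clipping).
        return max(0.0, alpha - delta_minus)
    # Perfectly efficient and no fault: the node stays at 1.
    return 1.0


def trajectory(alpha0, schedule, delta_minus, delta_plus):
    """Efficiency values obtained by applying a per-instant team schedule."""
    alphas = [alpha0]
    for teams in schedule:
        alphas.append(next_efficiency(alphas[-1], teams, delta_minus, delta_plus))
    return alphas


if __name__ == "__main__":
    # Hypothetical schedule: unattended for two instants, then 2, 1, and 1 teams.
    print(trajectory(0.5, [0, 0, 2, 1, 1], delta_minus=0.10, delta_plus=0.15))
```

Under these assumptions, a node that starts at efficiency 0.5 and follows the hypothetical schedule above moves through 0.5, 0.4, 0.3, 0.6, 0.75, and 0.9, up to floating-point rounding. The sketch only evaluates a given schedule; choosing the schedule is the task of the optimization problem.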

A. MILP Formulation
As mentioned in [11], nonlinear optimization problems are considered to be harder than linear problems. To reduce the complexity of Problem 1, in this section, we propose an MILP version of the proposed formulation. Notice that constraint (1d), which describes the dynamics of the node efficiencies, and constraint (1g), which prevents a response team from recovering a perfectly efficient node, are nonlinear. Constraint (1d) distinguishes three cases on the basis of the node efficiency at the previous time instant and of the presence of response teams. Let us introduce the following variables. Let $s_{i,k}$ be a binary variable such that $s_{i,k} = 1$ if node $i$ is perfectly efficient at time $k$, and $s_{i,k} = 0$ otherwise. Let $r_{i,k}$ be a binary variable such that $r_{i,k} = 1$ if at least one recovery team is restoring node $i$ at time $k$, and $r_{i,k} = 0$ if no teams are restoring the node. The variable $s_{i,k}$ is tied to the efficiency value $\alpha_{i,k}$ by $\alpha_{i,k} - 1 \geq M (s_{i,k} - 1)$ and $\alpha_{i,k} - 1 \leq M s_{i,k} - \epsilon$, where $M > 0$ is a sufficiently large constant and $\epsilon$ is a sufficiently small positive constant. Similarly, the variable $r_{i,k}$ is tied to the team-node assignment variables $q_{l,i,k}$ by $M r_{i,k} \geq \sum_{l=1}^{L} q_{l,i,k}$ and $r_{i,k} - \epsilon \leq \sum_{l=1}^{L} q_{l,i,k}$.
On the basis of these variables, (1d) can be rewritten in terms of the saturation function $\mathrm{sat}(x, a, b)$, with $a \leq b$, which denotes the saturation of $x$ between $a$ and $b$ (i.e., $\mathrm{sat}(x, a, b) = \max\{\min\{x, b\}, a\}$). To obtain a linear expression, we introduce the binary variables $a_{i,k}$ and $g_{i,k}$ to decompose the saturation function; the resulting expression (3) can be linearized by introducing two sets of constraints, where $M$ is a sufficiently large constant and $y_{i,k}$, $y^{(1)}_{i,k}$, and $y^{(2)}_{i,k}$ are binary variables. The second constraint we linearize is (1g), which prevents the response teams from restoring perfectly efficient nodes; it is replaced by a linear counterpart. On the basis of these transformations, we propose the following MILP formulation of the problem at hand.
Problem 2: Let a directed acyclic graph $G = \{V, E\}$ be given and let $\alpha_{i,0}$ be the initial efficiency of each node. Find the admissible maximum flows $x_{i,j,k}$, the node efficiencies $\alpha_{i,k}$, the required recovery plans $q_{l,i,k}$, and the support variables introduced above that solve the resulting MILP.
Fig. 1. Case study network with $n = 7$ nodes. Nodes 1, 2, and 3 are source flow generators while node 7 is the flow destination. For each link, the ratio is shown according to the nominal scenario.
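The big-M constraints that tie the indicator variables of Section II-A to the node efficiency and to the team assignments can be checked by brute force. The sketch below is illustrative only: the helper names `feasible_s` and `feasible_r` and the numeric values chosen for the large constant (`big_m`) and the small constant (`eps`) are assumptions, and the small constant is assumed to appear in both pairs of inequalities.

```python
def feasible_s(alpha, big_m=10.0, eps=1e-3):
    """Binary values s that satisfy both big-M constraints for a given efficiency:
    alpha - 1 >= big_m * (s - 1)  and  alpha - 1 <= big_m * s - eps.
    """
    return [
        s for s in (0, 1)
        if alpha - 1 >= big_m * (s - 1) and alpha - 1 <= big_m * s - eps
    ]


def feasible_r(q_sum, big_m=10.0, eps=1e-3):
    """Binary values r that satisfy both constraints for a given number of teams:
    big_m * r >= q_sum  and  r - eps <= q_sum.
    """
    return [
        r for r in (0, 1)
        if big_m * r >= q_sum and r - eps <= q_sum
    ]


if __name__ == "__main__":
    for alpha in (1.0, 0.5, 0.9995):
        print("alpha =", alpha, "-> feasible s:", feasible_s(alpha))
    for q_sum in (0, 1, 2):
        print("teams =", q_sum, "-> feasible r:", feasible_r(q_sum))
```

With these values, a perfectly efficient node admits only the indicator value 1, a node at efficiency 0.5 admits only 0, and zero or more assigned teams force the team indicator to 0 or 1, respectively. An efficiency strictly between 1 minus the small constant and 1 (such as 0.9995 here) admits no indicator value at all, which suggests that the small constant should be chosen below the smallest efficiency step that can occur in the model.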

III. SIMULATION
To analyze the proposed formulation, in this section, we simulate the behavior of a network (see Fig. 1) composed of $n = 7$ nodes and $m = 8$ links. The network includes three source nodes $S = \{1, 2, 3\}$, which generate the flows $f_{1,k} = 5$, $f_{2,k} = 10$, and $f_{3,k} = 10$, respectively, at each time instant $k$, while node 7 is the destination node. The objective is to maximize the incoming flow to node 7 over a time horizon $T = 10$. The simulations were run in Lingo 19, a solver based on a branch-and-bound algorithm, on a machine equipped with a 3.5-GHz dual-core i7 CPU and 16 GB of RAM.
In a nominal scenario (i.e., when no faults occur, hence $\alpha_{i,k} = 1$ for any node $i$ and any time instant $k$), the maximum incoming flow for node 7 over a time horizon $T = 10$ is $z = 275$ for both the nonlinear (elapsed runtime: 1.11 s) and the MILP (elapsed runtime: 0.30 s) formulations. In this nominal scenario, as depicted in Fig. 1, the flow on each link is $x_{i,j,k} = c_{i,j}$ at each time instant $k$, with the exception of the links (5, 6) and (6, 7), where the flow is $x_{5,6,k} = 0$ and $x_{6,7,k} = 10$, respectively, for each time instant $k$. Moreover, response teams' recovery actions are not required; hence, $q_{l,i,k} = 0$ for each $(l, i, k)$. Notice that the solution represented in Fig. 1 is an optimal allocation of flows on the basis of the link capacities, but other optimal solutions may exist.
We now analyze the solution of the two problems considering an attack scenario characterized by multiple faults at $k = 0$. In more detail, with reference to the network in Fig. 1, we consider two faults: Nodes 4 and 5 are affected by a fault and their efficiency is initially reduced to $\alpha_{4,0} = 0.1$ and $\alpha_{5,0} = 0.8$, respectively. In this simulation, we set $T = 10$, $\delta^+_i = 0.15$, and $\delta^-_i = 0.10$ for each node in the network. Also in this scenario, the solutions of the nonlinear (elapsed runtime: 3.75 s) and MILP (elapsed runtime: 0.68 s) formulations are the same. In Table II, we collect the optimal solutions on the instance depicted in Fig. 1.
For the sake of clarity, we omit the optimal values of the auxiliary variables introduced in order to linearize Problem 1. In this simulation, considering the presence of $L = 2$ response teams, the optimal value of the objective function is reduced to $z = 233.40$ due to the faults that affect Nodes 4 and 5.
Concerning Node 4, the effects of the initial fault affect the node until $k = 3$. In this initial time window (from $k = 0$ to $k = 3$), its efficiency level starts from $\alpha_{4,0} = 0.1$ and response teams are involved in recovery operations ($r_{4,k} = 1$). Notice that, in Table II, numbered circles represent the team $l$ associated with the faulty node. More precisely, for $k = 0$ and $k = 1$, both response teams work on this node to restore its efficiency, while for $k = 2$ and $k = 3$, only the first team is involved in the recovery tasks. Finally, at $k = 4$, the node returns to the perfectly efficient state.
Considering Node 5, differently from the nominal scenario, the flow is distributed on both the outgoing links (5, 6) and (5, 7). Notice that, despite the fault and the consequent efficiency reduction, the node is overequipped (i.e., the sum of the capacities of its outgoing links is greater than the sum of the capacities of its incoming links), and the entire incoming flow can be forwarded without any loss for $k = 0$ and $k = 1$; in fact, during these instants, the two response teams are involved in the recovery of Node 4, which requires immediate action in order to reduce the loss of flow. At $k = 2$, the efficiency of Node 5 is further reduced to $\alpha_{5,2} = 0.60$; in this case, although the node is overequipped, this efficiency level strongly reduces its outgoing flow to $x_{5,6,2} = 1.80$ and $x_{5,7,2} = 4.20$, while its incoming flow is equal to 7. From $k = 2$ to $k = 4$, the second recovery team works on Node 5 in order to restore it; finally, at $k = 5$, the node returns to a perfectly efficient state. From the response teams' point of view, in this scenario, the recovery of a non-overequipped node takes priority over that of an overequipped node characterized by a high degree of resilience.
A sensitivity analysis obtained by varying the number of recovery teams $L$ and the recovery speed $\delta^+_i$ is presented in Fig. 2 and Table III. As expected, by increasing the number of recovery teams $L$ and the associated recovery speed $\delta^+_i$, we can increase the incoming flow to Node 7 (see Table III) and reduce the recovery time (see Fig. 2).
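As a cross-check of the fault scenario with two response teams described above, the efficiency trajectories of Nodes 4 and 5 can be worked out by hand. The derivation below is a sketch under two assumptions: the team schedule is the one reported in the text, and a node under recovery gains 0.15 per team per instant without any concurrent degradation, with the result capped at 1.

```latex
\begin{align*}
\alpha_{4,1} &= 0.10 + 2 \cdot 0.15 = 0.40, &
\alpha_{4,2} &= 0.40 + 2 \cdot 0.15 = 0.70, \\
\alpha_{4,3} &= 0.70 + 0.15 = 0.85, &
\alpha_{4,4} &= \min\{0.85 + 0.15,\, 1\} = 1, \\
\alpha_{5,1} &= 0.80 - 0.10 = 0.70, &
\alpha_{5,2} &= 0.70 - 0.10 = 0.60, \\
\alpha_{5,3} &= 0.60 + 0.15 = 0.75, &
\alpha_{5,4} &= 0.75 + 0.15 = 0.90, \\
\alpha_{5,5} &= \min\{0.90 + 0.15,\, 1\} = 1, &
x_{5,6,2} + x_{5,7,2} &= 1.80 + 4.20 = 6.00 < 7.
\end{align*}
```

These values agree with the reported results: Node 4 is perfectly efficient again at $k = 4$, Node 5 reaches $\alpha_{5,2} = 0.60$ and is restored at $k = 5$, and at $k = 2$ Node 5 forwards 6 of its 7 incoming units of flow.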

IV. CONCLUSION
We provided a network restoration approach based on nonlinear and mixed-integer linear programming formulations built on the max-flow problem. The aim is to schedule response teams' interventions in order to restore the nominal efficiency of a networked system affected by faults. Simulation results showed that the maximum flow provided to the destinations is optimally restored by prioritizing the recovery interventions on the most critical nodes in order to avoid the loss of flow.

Fig. 2. Efficiencies $\alpha_4$ (blue) and $\alpha_5$ (red) and the associated recovery teams involved in the restoration process for $T = 10$ and $L = 1, \ldots, 3$. Notice that the color of the team number corresponds to the associated node.

TABLE II OPTIMAL SOLUTION CONSIDERING THE INSTANCE DEPICTED IN FIG. 1 FOR $L = 2$ RESPONSE TEAMS AND $k = 0, \ldots, 7$

TABLE III OBJECTIVE FUNCTION VALUES $z$ BY VARYING THE NUMBER OF RECOVERY TEAMS $L$ AND THE RECOVERY SPEED $\delta^+$