Predicting Resilience of Interdependent Urban Infrastructure Systems

Climate change is increasing the frequency and the intensity of weather events, leading to large-scale disruptions to critical infrastructure systems. The high level of interdependence among these systems further aggravates the extent of disruptions. To mitigate these impacts, models and methods are needed to support rapid decision-making for optimal resource allocation in the aftermath of a disruption and to substantiate investment decisions for the structural reconfiguration of these systems. In this paper, we leverage infrastructure simulation models and Machine Learning (ML) algorithms to develop resilience prediction models. First, we employ an interdependent infrastructure simulation model to generate infrastructure disruption and recovery scenarios and compute the resilience value for each scenario. The infrastructure-, disruption-, and recovery-related attributes are recorded for each scenario and ML algorithms are employed on the synthetic dataset to develop accurate resilience prediction models. The results of the prediction models are analyzed and possible design strategies suggested based on the resilience enhancement attributes. The proposed methodology can support infrastructure agencies in the resource-allocation process for pre- and post-disaster interventions.


I. INTRODUCTION
Modern cities heavily rely on the proper functioning of infrastructure systems such as power, water and transportation networks. Disruptions to infrastructure systems not only threaten safety, security, and health of the inhabitants, but also incur economic losses. Climate change has resulted in more frequent and intense weather events, often causing large-scale disruptions to urban infrastructure systems [1]. These impacts are exacerbated by inadequate urban planning, aging infrastructure, and population growth, which subject the infrastructure to increasing levels of stress [2]. Recent weather-related extreme events, such as 2017 Hurricane Har-The associate editor coordinating the review of this manuscript and approving it for publication was Xinyu Du . vey in the Houston-Galveston region, 2021 Texas winter storm in Central Texas [3], [4], and 2022 European heatwaves have resulted in extensive disruptions to regional infrastructure systems and significant socio-economic impacts [5].
Urban infrastructure systems are highly interdependent and the failure of one system may lead to cascading failures in other systems [6]. Floods or hurricanes may cause the failure of power supply to water pumps operations and result in disruptions to water supply. Power driven transportation systems (e.g., city trains) may also stall due to a power failure. Flooding and debris may prevent roads from providing access to critical components of power and water systems leading to large delays in the restoration of essential services. In such situations, a holistic approach to infrastructure resilience considering both pre-and post-disaster interventions and interdependencies in urban infrastructure systems is imperative for minimizing disruption impacts as well as speeding up recovery [7].
The problem of infrastructure network modeling for resilience analysis has been extensively studied in the last decades [8]. Though initial studies focused on individual infrastructure systems, their growing interdependencies were also brought into the focus of infrastructure systems research [9], [10]. Approaches for resilience analysis can be broadly classified into (i) computational and (ii) empirical [11].
Computational approaches rely on analytical methods to model system operations and interdependencies. Common computational models include graph-based models, systemdynamics models, agent-based models, and domain-specific simulation models [12]. Specifically, co-simulation models were developed to simulate the collective behavior of interdependent urban infrastructure networks [13], [14], [15]. While these models can be used to test the effectiveness of realistic resilience strategies and interventions, they require a significant amount of computational resources and their application in rapid decision-making for emergency management is limited.
Empirical approaches are data-driven and rely on historical datasets and expert judgments to identify relationships between various attributes (i.e., related to system configuration, disruption and recovery) and system performance. In particular, Machine Learning (ML) models have proven effective in learning these nonlinear relationships and can be used for optimal pre-and post-disaster resource allocation [16], [17]. However, while ML models have been mostly applied to individual infrastructure systems, including energy networks [18], [19], [20] and road networks [21], [22], their application to interdependent infrastructure systems is limited due to data availability constraints.
In order to overcome these gaps, computational and empirical models can be combined to achieve desired objectives [23]. In this paper, a simulation model for interdependent infrastructure systems is combined with ML algorithms in order to develop resilience prediction models. Such models support rapid decision-making for optimal resource allocation in the aftermath of a disruption. Furthermore, by learning the inherent system features that make it more resilient, these models can substantiate investment decisions for the structural reconfiguration of the system.
The methodology proposed in this paper consists of three steps. First, a simulation model for the interdependent power-water-transportation network is employed to simulate a large number of disruption and recovery scenarios and compute the resilience value for each scenario. Second, ML models are trained to predict system resilience based on infrastructure-, disruption-and recovery-related features. Third, the results returned by the ML-based resilience predictive models are analyzed and possible design strategies to enhance the resilience of the interdependent infrastructure network suggested.
The rest of the paper is organized as follows: section II discusses the infrastructure simulation and hazard generation methodology, section III elaborates on the simulation-generated dataset, section IV discusses the development of ML models, and section V presents the results of the case study based on the synthetic city of Micropolis. Finally, in section VI, a few design recommendations based on the findings from the study are enlisted and section VII summarizes the key findings.

II. INFRASTRUCTURE MODELING AND HAZARD SCENARIO CREATION
In this section, we describe the simulation model for assessing the resilience of an integrated power-water-transportation network subject to disruptions.

A. SIMULATION PLATFORM OVERVIEW
The simulation model InfraRisk [24] has been deployed to model the interdependent power-water-transportation network. InfraRisk calls three Python packages to run simulation for the individual infrastructure systems. Specifically, Pandapower [25] is used to solve the optimal power flow problem. WNTR (Water Network Tool for Resilience) [26] is used to run pressure-dependent demand simulations in the water network. The static traffic assignment package [27] is used to compute road travel times between origin-destination (OD) pairs under a given set of traffic conditions. The individual infrastructure simulators communicate with each other using an object-oriented interface. In addition to the integrated infrastructure model, modules are also available for hazard generation and infrastructure vulnerability modeling, recovery modeling, and resilience quantification. The InfraRisk model is made publicly available for the research community [28].

B. HAZARD SCENARIO GENERATION
This study focuses on damage to infrastructure system components and the subsequent loss of service due to weatherrelated events, which are simulated via the hazard module in the simulation platform [24]. The failure probability of components is computed as: where p(failure i ) is the failure probability of the ith component, p(hazard) is the probability of the hazard, p(exposure i |hazard) is the probability of the component being exposed to the hazard and p(failure i |exposure i ) is the failure probability of the component given its exposure to the hazard.
InfraRisk embeds methods to simulate point events (such as random failures, fire incidents, acts of sabotage or vandalism) and track-based events (such as floods, cyclones and tornadoes). Fig. 1 represents an example point event and two track-based events. In the InfraRisk model, simulation of a point event requires 3 input parameters: the location of the event, the radius of impact and the intensity of the event. The failure probability p(failure i |exposure i ) depends on the event intensity. The factor p(exposure i |hazard) is a function of the ratio between the distance of the component to the location of the event and the radius of impact. The input parameters for the track-based events are the track of the event, the offset distance of the impact and the event intensity. For the track-based events, p(exposure i |hazard) is calculated as a function of the ratio between the perpendicular distance of the component from the track and the offset distance of the impact. We do not use any hazard-specific models for generating disruptions because the primary objective of the study is to obtain a dataset with a large number of distinct disruptive scenarios, rather than realistically modelling those scenarios.

C. NETWORK RECOVERY
The recovery actions are deterministically sequenced and implemented based on the availability of repair crews and the repair sequence, which is obtained by prioritizing failed components for repair based on a pre-determined recovery strategy and accounting for road-accessibility constraints as shown in Fig. 2 [24]. According to the algorithm in Fig. 2, a repair sequence is first computed based on a pre-defined recovery strategy. If the sequence of components (i.e., in the power, water and road networks) to be repaired is non-empty, the first component in the sequence is selected for repair. If the selected component is accessible by road, then the respective crew is sent to perform the repair and the repair sequence is updated. If the selected component is not accessible by road, then the next most important component in the sequence is selected for repair. The algorithm iterates until all components are repaired.
In order to compute the repair sequence, three criteria, which have been developed for vulnerability analysis [29], [30], are used, namely: 1) Capacity -the components with higher flows of resources through them during normal operating conditions are prioritized for repair. 2) Betweenness centrality -the components that are traversed more often by the flow of resources in the network are prioritized for repair. 3) Zone -the failed components in a certain zone (e.g., industrial, school, etc.) are prioritized for repair.

D. RESILIENCE QUANTIFICATION
This module implements the integrated infrastructure simulation in two steps, namely event table generation and interdependent infrastructure simulation. The event table provides a schedule of all the disruptions and repairs. After the event table is generated, the interdependent network simulation is performed. InfraRisk adopts a sequential simulation approach in which individual infrastructure models are successively updated at every simulation time step according to the actions scheduled in the pre-generated event table. In order to quantify the network impacts of disruption scenarios, we use the Equitable Consumer Serviceability (ECS), which is computed as the average satisfied demand for power/water assuming that all the demand nodes have equal importance. The ECS implicitly accounts for the road network performance, since travel times affect the repair times of components.
Finally the Total Performance Loss (TPL) is used as resilience metric and computed according to the well-known resilience triangle paradigm [8]: where T is the total time taken to recover the system performance [31]. Resilience is inversely related to the TPL value. A lower TPL value corresponds to better system resilience.

III. SIMULATION-GENERATED DATASET
In this section, we describe the benchmark town used as case study and the generation of the synthetic dataset for ML analysis.

A. MODEL TOWN: MODIFIED MICROPOLIS NETWORK
Micropolis, a synthetic town of 5,000 inhabitants, was initially developed by Texas A&M university to study the water distribution system [32]. The network was later extended with power and communication systems [33]. The power, water and transportation networks are adapted for our study and the resulting topology is illustrated in Fig. 3. Buildings are represented in grey blocks in Figure. Details of the power, water and transportation networks are provided in Table 1.

B. SYNTHETIC DATASET
In order to create a large dataset to train and test the ML models, we simulate various disruption and recovery scenarios. Specifically, point and track-based events are randomly generated using the hazard module described in section II-B. The probability of occurrence of point and track-based events is set to 0.75 and 0.25, respectively. For both types of events, the location (or track) are randomly distributed on the network and the radius of a point event (or the offset distance, in the case of a track event) is sampled from a    In order to analyze the effects of recovery-and infrastructure-related attributes on system resilience, multiple recovery scenarios are simulated for every disaster event by varying the number of available repair crews, the recovery strategy (i.e., capacity, betweenness centrality and zone) and the level of meshedness of the network (i.e., low, medium and high). Specifically, low and high meshed networks are computed by removing or adding 20% of the original links from the respective networks, while medium meshed networks correspond to the original network topologies. After eliminating the disruption scenarios that did not impact the infrastructure and combining them with the different recovery and infrastructure scenarios, a total of 26,600 scenarios are generated. VOLUME 10, 2022

C. INPUT FEATURES FOR MACHINE LEARNING
Input features for the ML models can be classified as disruption-, recovery-and infrastructure-related. In order to reduce the computational complexity of the ML models, features only include aggregated quantities, rather that the full list of simulation inputs. The features, their coded names and range/level representing the minimum and maximum observed values over all simulated scenarios are reported in Table 2.
Disruption-related features include the number of power lines, water mains and transportation links failed as well as two topological metrics, i.e., betweenness and degree centrality, to assess their relative importance in the network. Link betweenness centrality is computed by averaging the betweenness centralities of the corresponding start and end nodes. Betweenness centrality (BC) of a node ν is defined as: where σ st is the total number of shortest paths between node s and t and σ st (ν) is the number of those paths that pass through ν. Similarly, link degree centrality is computed by averaging the degree centralities of the corresponding start and end nodes. Degree centrality (DC) of a node ν is defined as: (4) where N is the set of all nodes and l ij denotes the link between node i and j.
Recovery-related features include the number of repair crews available for the power, water and transportation networks, respectively, and the recovery strategy adopted to compute the repair sequence.
One infrastructure-related feature, i.e., meshedness level, is also considered to test the effect of different topologies on system resilience.

IV. APPLICATION OF MACHINE LEARNING MODELS
The simulation-generated dataset is used to train several ML models to predict system resilience for new disruption and recovery scenarios for which they are not trained. A supervised learning approach is adopted to develop suitable ML models for this regression problem. The simulator output, i.e., TPL value which is inversely related to the resilience, and the input features in Table 2 are used to train the ML models.

A. MACHINE LEARNING ALGORITHMS
A single ML model does not always perform the best for different regression problems. Therefore, four commonly used ML models, briefly described below, are adopted for the regression problem.
where β = [β 0 , . . . , β p ] is the vector of coefficient corresponding to the p input features, y i is the actual value, f (x i ) = βx i the predicted value and x i = [x i1 , . . . , x ip ] T the vector of predictors. The LR model is easily interpretable from the coefficient values. However, the data need to satisfy a number of assumptions, such as non-collinearity of the input features, normally distributed errors and a linear relationship between the predicted value and input features [34].
2) SUPPORT VECTOR REGRESSION Support Vector Regression (SVR) fits a set of hyperplanes in a high dimensional space to the data and finds a model of the form f (x) = w T φ(x) + b, where φ(x) T is a linear or non-linear transformation of the feature space [35]. The resolution algorithm is [36]: where is an -insensitive error function [37] and C is the regularization parameter, which controls the trade-off between the margin w and the error. The prediction model returned by the optimization model is of the form: where a i ,â i are positive weights given to each observation and estimated from the data and k(x, x i ) is a kernel function [36]. Specifically, the polynomial kernel is expressed as where γ is equal to 1/number_of_features when the parameter gamma is set to 'auto' and d = 3 by default.

Random Forest Regressor (RFR) uses an ensemble of decision trees and relies on bagging or bootstrap aggregation,
which is a technique for improving the accuracy of the predictor by training an ensemble of trees in parallel [38]. Specifically, RFR fits an ensemble of trees of size n_estimators. Each tree T b in the ensemble is built from a bootstrapped sample of the training data and by randomly selecting a subset of predictor features and identifying the best splitting variables and split-points for tree nodes. Specifically, when the parameter max_feature is set to 'sqrt', features are sampled from a subset of size √ number_of_features. The selection and splitting process is repeated for each tree until the maximum depth (max_depth) is reached or until the number of samples in a node equals the default value of the minimum sample split parameter (i.e., min_sample_split = 2). The prediction model returned by the algorithm is of the form [39]: where θ b characterises the bth random tree and B corresponds to the parameter n_estimators. RFR has proven very effective in terms of prediction accuracy and offers features importance scores based on the reduction in the criterion used to select split points.

4) (EXTREME) GRADIENT BOOSTED TREE REGRESSOR
Similar to RFR, (Extreme) Gradient Boosted Tree Regressor ((X)GBR) uses an ensemble of decision trees. (X)GBR relies on boosting, which is a technique for improving the accuracy of the predictor by training an ensemble of trees sequentially [40]. At each iteration of the algorithm, a tree T m is built that minimizes the error based on a subsample of the training data, the size of which is controlled by the parameter subsample. Therefore, the initial prediction is updated by adding the average error, multiplied by the parameter learning_rate (equal to 0.1 by default), which controls the contribution of each sequential tree and therefore determines the speed at which the model learns the outcome. The prediction model returned by the algorithm is of the form: where M corresponds to the parameter n_estimators and θ m characterises the mth tree. XGBR [41] is a more regularized form of GBR. It uses an advanced regularization technique that enhances the model generalization capabilities and delivers high performance compared to the original GBR. Additionally on subsampling the training data to train each tree, in XGBR also the features used to train each tree and the features used in each node to train each tree are subsampled according to the parameters colsample_bytree and colsample_bylevel, respectively.
In both GBR and XGBR, the importance score associated with each feature indicates how useful or valuable the feature is in the construction of the boosted decision trees within the model.
Model accuracy and interpretability are the main criteria driving our model selection choice. In discussing the results, we focus on the RFR and (X)GBR models, since they have proven more accurate than LR and SVR, as shown by the experimental results. All four models are interpretable, since they include embedded methods for assigning a score to each input feature indicating its relative importance in making a prediction. These scores are used for performing a feature importance analysis in order to understand which factors are most likely to influence system resilience in the current case study.

B. EVALUATION METRICS
The performance of different ML models in our regression problem is assessed using the Root Mean Squared Error (RMSE) and the R 2 or R-squared score.
RMSE of an estimator is mathematically represented as: R 2 , usually expressed in percentage (%), indicates the goodness-of-fit of the prediction models. Mathematically, the score is defined as: whereȳ is the observed mean.

A. RESULT SUMMARY
Prediction models based on the ML algorithms are trained and cross-validated using 75% of the dataset (19,950 data VOLUME 10, 2022 points) and tested using the remaining 25% (6,650 data points). The hyper-parameters associated with each ML algorithm are tuned using the grid search method to obtain the best performance. Open-source scikit-learn libraries are used to implement the ML models. Table 3 shows the summary of prediction results for different ML models and the tuned parameter values for each model (other parameters are set to their default value according to the scikit-learn libraries).
RFR, GBR, and XGBR perform better than the other methods, since they provide more accurate predictions for TPL based on the RMSE and R 2 values. The predicted versus actual TPL values for the RFR and XGBR are displayed in Fig. 4. Both models tend to underestimate large TPL values. This is because large scale disruptions are underrepresented in the training dataset, as it occurs in real world datasets, resulting in lower accuracy in predicting these scenarios. Fig. 5 shows the relative importance of the input features for RFR and XGBR. Importance scores are computed assuming equal weights of the power and water systems (i.e., predicted TPL is the average of the TPL of the power and water networks). The number of failed water mains (water_mains) and the number of crews assigned to repair them (water_crew) have the highest influence on system resilience. This is due to the fact that, unless a particular section of a failed pipe is automatically isolated (e.g., by closing remotely-controlled isolating valves on both sides of the affected pipe section), the leak through the pipe continues until the repair crew reaches the pipe and performs the isolation manually. On the other hand, switches in the power grid automatically open to isolate failed power lines, thus limiting cascading failures within the network. We expect that the relative importance of water-related features to system resilience will diminish if adequate leak detection and automatic isolation mechanisms are implemented. However, such mechanisms require high investments [42] and are seldom deployed in real-world distribution water systems.

B. FEATURE IMPORTANCE STUDY
While other features are negligible according to the RFR, GBR and XGBR could capture the influence of other features, as shown by their relative importance in Fig. 5. Low meshedness levels (meshedness_low) and transportation related features (i.e., number of transportation link failures transpo_links and the crew assigned to repair them transpo_crew) are the next most important attributes for resilience prediction. In fact, a functional transportation network enables faster repair of all component of the integrated network. Power-related features (i.e., degree centrality power_sum_deg_cent and the number of failed power lines power_lines) are relatively less important in predicting system resilience. This is due to the fact that power grids embed methods for fault isolation and are usually designed in order to limit power disruptions. This is especially true for the Micropolis power grid.
In order to identify the most important features in predicting the resilience of the power grid and water distribution system individually, the study is repeated by training the ML models on the same set of input features, but considering two distinct TPL values for the power and water systems (instead of one aggregated value combining TPL values of the two systems). The feature importance study for this second sets of results are plotted in Fig. 6 for the XGBR algorithm. When the power system is considered individually (Fig. 6(a)), power related features as well as the number of failed transportation links (transpo_links) determine system resilience. This is due to the dependency of power components repair on road accessibility.  When the water system is considered individually (Fig. 6(b)), water-, transportation-and power-related features determine system resilience. This indicates the high dependency of the water system on the other two systems.

C. SENSITIVITY ANALYSIS
In order to analyze how system resilience varies under various system designs and recovery scenarios, a sensitivity analysis is conducted and the TPL values estimated under different level of recovery-and infrastructure-related features using the trained ML models. For the sake of simplicity, only the results returned by the RFR are presented. Fig. 7 shows the plots for the TPL values by RFR for 20 disruption scenarios (each line represents one disruption event), which were selected in order to represent the widest range in TPL. For each infrastructure system and disruption scenario, the maximum number of repair crews deployed is limited to the number of components failed plus two in order to limit the number of idle crews. TPL decreases (i.e., resilience increases) as the number of crews increases up to a certain point, after which some repair crews remain idle (e.g., when the number of crews is greater than the number of failed components). This decreasing trend is especially marked for a small number of crews (below 5), since the repair time of water mains is high. The plots representing TPL versus number of power/transportation crews exhibit similar trends to those of the water system, but with a greater number of outliers. These can be attributed to the ML model prediction error and imbalance in training data for some disruption and recovery scenarios.
When TPL values for different repair strategies and meshedness levels are computed, a trend is not clearly identified, as seen in Fig. 8(a) and (b), respectively. This suggests that the best recovery strategy needs to be identified for each disruption scenario and that one strategy does not work best for all disruptions. Moreover, while high meshedness levels enhance system redundancy, and therefore resilience, they also increase the probability of cascading failures within a network. This is even more prominent in water systems, where failed pipes remain connected to the system until the repair crew isolates them manually.

VI. DESIGN RECOMMENDATIONS
In view of the results discussed, some resilient system designs are suggested for the integrated power-water-transportation network: 1) Implementation of isolation valves in the water network. As observed in our experimental results, failure of water lines severely affects infrastructure resilience. Therefore, detection and isolation of leaking pipes through the use of isolation valves need to be implemented in order to avoid cascading failures within the water network. The optimal location of such valves can be determined using InfraRisk by maximizing system resilience subject to cost constraints. 2) Decoupling between the power/water systems and the transportation system. When isolation and repair of failed components is performed manually by repair crews, road accessibility to these components is essential for system recovery. As an alternative, remotely controlled isolation and recovery mechanisms could be implemented (e.g., remotely controlled shut-off valves in the water system and automatic power grid reconfiguration). 3) Decoupling between the power and the water systems.
Water supply may be interrupted due to a power outage, if the power supplied to water pumps is disrupted. In order to avoid cascading failures from the power to the water system, backup power generators or alternative power sources can be installed to supply power to water pumps. The optimal location of such generators can be determined using InfraRisk by maximizing system resilience subject to cost constraints. 4) Efficient use of repair crews. While additional repair crews can speed up system recovery in the event of major disruptions, the benefit-cost ratio decreases for small disruption scenarios since some crews remain idle. This has been observed in our study results discussed in section V. Therefore, for ensuring the efficient use of repair crews, cross-functional repair crews can be deployed between systems after appropriate training (e.g., crews can be assembled with mixed competencies to repair both power lines and road links).

VII. CONCLUSION
This paper proposes a combined methodology to predict the resilience of interdependent infrastructure systems. Specifically, we deploy the InfraRisk simulation model to simulate the performance on the interdependent powerwater-transportation network subject to various disruption and recovery scenarios. We use the resulting simulation-generated dataset to train four ML models for resilience prediction and for identifying contributing resilience factors. The proposed methodology goes beyond state-of-the-art resilience assessments by offering a trade-off between computational complexity and model accuracy and can, therefore, be used for rapid decision-making for optimal pre-and post-disaster resource allocation. In future research, the design improvement solutions proposed in section VI will be implemented in the simulation platform and several disruption scenarios will be tested. Furthermore, these solutions will be tested on a real-world infrastructure network using realistic recovery strategies against.
PARTHA P. BISWAS received the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 2019. He is currently working as a Research Scientist at the Advanced Digital Sciences Center (ADSC), Singapore. Prior to joining ADSC, he worked in the power industry for several years. His research interests include adopting computational intelligence and machine learning techniques in power systems, smart grid security, and resilience.
SRIJITH BALAKRISHNAN received the bachelor's degree in civil engineering from the University of Kerala, the master's degree in civil engineering from the Indian Institute of Technology Madras, and the Ph.D. degree in civil, architectural, and environmental engineering from The University of Texas at Austin, in 2020. He is currently a Postdoctoral Researcher at the Singapore-ETH Centre, Singapore. His research interests include developing and applying generic and specific frameworks, methods, and metrics to evaluate the risk and resilience of interdependent infrastructure systems. He is a Resilience Fellow at the 4TU Federation, the Netherlands.
BENNET NG is received the Graduate degree from the National University of Singapore. He is currently working as a Principal Software Engineer at the Advance Digital Sciences Center (ADSC). He has more than 20 years of experience in infrastructure administration, software development, training, consultancy, project management, and regulatory compliance. His research interests include threat modeling for smart applications, designing of security tools for cyber-physical systems, and machine learning applications in various domains. He is currently a Professor at Reliability and Risk Engineering, ETH Zurich, Zurich, Switzerland. His research interest includes the development of hybrid analytical and computational tools suitable for analyzing and simulating failure behaviors of engineered complex systems (i.e., highly integrated energy supply, energy supply with high penetrations of renewable energy sources, communication, transport, and other physically networked critical infrastructures). He is a member of the Atlantis Dual Doctoral Degree Program.