How Long is a Resilience Event in a Transmission System?: Metrics and Models Driven by Utility Data

We discuss ways to measure duration in a power transmission system resilience event by modeling outage and restore processes from utility data. We introduce novel Poisson process models that describe how resilience events progress and verify that they are typical using extensive outage data collected across North America. Some usual duration metrics show impractically high statistical variability, and we recommend new duration metrics that perform better. Moreover, the Poisson process models have parameters that can be estimated from observed network data under different weather conditions, and are promising new models of typical resilience events.


I. INTRODUCTION
Much of the analysis of electric power system resilience relies on describing the duration and magnitude of resilience events with quantitative metrics [1]-[9]. The resilience events correspond to conditions of unusually high stress such as extreme weather or cascading and are either simulated [3]-[5] or extracted from historical data [6]-[9]. The metrics of duration and extent describe the performance of the power system as it responds to the high stress and, sometimes indirectly, the impact of the event on our society. The metrics are broadly useful in improving the engineering of power system resilience, as evidenced by the engineering references of this paper. This paper addresses electric power transmission system metrics for the duration of resilience events and the durations of the outage and restore processes occurring within resilience events. Here "outage" refers to a component being removed from service, and "restore" refers to re-energizing a component to return it to service.
The duration of a resilience event would appear to be straightforward: the event starts with the first transmission outage at time o_1 and ends with the last restore at time r_n, so that the event duration metric is simply D_E = r_n − o_1. However, we will show that the timing r_n of the last restore is so highly statistically variable that it is not meaningfully representative of the power system restoration. (A metric is highly statistically variable if its value is likely to differ substantially from its estimated value, and we quantify this by the size of a confidence interval containing the estimate.) Moreover, given the redundancy that is designed into power transmission systems, the last restore may have little or no impact on the power flowing to the distribution system and then to the customers. Therefore we analyze a variety of duration metrics to find new metrics that are less variable and more representative.
Our main approach is to develop new Poisson process models for the outage and restore processes. The new models are driven by seven years of automatic outage data collected across North America by the North American Electric Reliability Corporation (NERC) in its Transmission Availability Data System (TADS). These statistical models enable the variability of the metrics to be quantified. Moreover, parameters of the new models are closely related to some of the duration metrics.
This paper addresses the durations associated with transmission system resilience events in which there are substantial outages of transmission system elements. In particular, the paper does not address resilience events in which there are no outages or minimal outages, such as an extended heat wave that significantly limits transmission flows but causes no outages. More generally, the paper is driven by outage and restore data for transmission system elements, and therefore does not address outages of generation, distribution system elements, and loads.

A. Literature review
Much of the previous work on statistical models of power system resilience events addresses distribution systems. Zapata [10] models distribution system reliability with outages as a power-law Poisson process arriving at a queue that is serviced by a power-law repair process to produce a restore process. Wei and Ji [7] analyze distribution system resilience to particular severe hurricanes with a Poisson outage process arriving at a queue that repairs the outages to produce a restore process. Both the outage process rate and the repair time distribution vary in time as the hurricane progresses. Carrington [9] shows how to extract outage and restore processes from standard distribution utility data.
Both [7] and [10] statistically model the outage process and the component repair process, and then calculate the restore process with a first-in-first-out queue model, whereas we follow the insight of [9] in extracting and directly modeling the outage and restore processes. Modeling the restore process directly from the data avoids the complexities in queuing models of explicitly modeling the component repair and assuming an order of component repair. While [9] fits the mean and standard deviation of the distribution system outage and restore processes to give a gamma distribution of restore times, it does not give statistical process models as we do in this paper. Moreover, the forms of the outage and

B. Summary of paper contributions
This paper: 1) proposes new statistical models of outage and restore processes in transmission systems, and shows that the new models describe typical North American data; 2) analyzes the statistical variability and interpretation of a variety of duration metrics; 3) recommends novel and more useful duration metrics; 4) reports typical values for model parameters and duration metrics for North American transmission resilience events. The previous conference papers and NERC reports [28]-[31] extract resilience events from transmission system outage data and report the two duration metrics D_≥95% and D_n for the larger or largest events. The fruitful previous applications of these duration metrics motivate the extensive new analysis in this paper of a range of duration metrics and the recommendations, backed by this analysis, of better performing duration metrics. The extraction of the transmission system resilience events developed in [28], [29] is not the subject of this paper, but since it is used in the data processing of this paper, we specify in section II the precise version of the event extraction used. Section II also summarizes the outage data used in the paper and states and briefly comments on the definitions of the outage and restore processes [9], since this paper uses these processes.
The duration of resilience events has clear importance to the public, engineers, regulators, and policy makers. This motivates our consideration of the performance of a range of duration metrics. We are not aware of another paper addressing the question of how duration metrics perform, and we approach the question with novel methods. In particular, the stochastic models of typical transmission system resilience processes proposed and validated with extensive data in this paper are novel, and we expect that these new models will be useful well beyond this paper's more immediate goal of proposing and analyzing better duration metrics.

II. RESILIENCE EVENTS AND PROCESSES
To obtain resilience metrics from utility outage data, we first need to automatically extract resilience events and the outage and restore processes for each event. This section explains how to do this based on previous work [9], [28], [29] and establishes the notation needed for the paper.
A. Utility data and extracting resilience events

NERC's TADS collects outage and inventory data for the following four types of transmission elements: AC circuits, transformers, AC/DC back-to-back converters, and DC circuits [34], which are part of the North American Bulk Power System (i.e., operated at 100 kV or higher) [35]. The detailed automatic outage data include the outage and restore time to the nearest minute, the initiating cause code for each outage, and the sustaining cause code for sustained outages. In this paper we analyze the approximately 62 000 automatic outages for all elements reported in TADS from 2015 to 2021 for the Eastern, Western, and ERCOT interconnections.
A key step in resilience analysis of real data is automatically extracting resilience events. For each interconnection, the automatic outages are grouped together into resilience events based on the bunching and overlaps of their starting times and durations. We quote from [29] the algorithm used: "Every outage in an event has to either start within five minutes of a previous outage in the event or overlap in duration with at least one previous outage in the event that has a difference in starting time not exceeding one hour. In applying this algorithm, repeated momentary outages of the same element are neglected if they occur within 5 minutes of each other." We use this algorithm to automatically group outages into resilience events (their sizes vary from 1 to 352 outages) and then analyze all the resilience events with 10 or more outages. An event that contains at least one outage with a weather-related initiating or sustained cause code is defined as a weather-related event.
The weather-related TADS cause codes are lightning, weather excluding lightning, fire, and environmental. This procedure identified 352 transmission events with 10 or more outages, 329 of which are weather-related. Note that events are defined so that if an outage is included in an event, then so is its corresponding restore. Therefore the number of outages in an event equals the number of restores.
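As a concrete sketch of the quoted grouping rule: the following minimal illustration assumes outages arrive as (start, end) timestamp pairs sorted by start time, and omits the momentary-outage deduplication step; `group_into_events` is a hypothetical helper, not code from TADS or [29].

```python
from datetime import datetime, timedelta


def group_into_events(outages, gap=timedelta(minutes=5),
                      overlap_start_max=timedelta(hours=1)):
    """Group (start, end) outage pairs, sorted by start, into events.

    An outage joins the current event if it starts within `gap` of a
    previous outage in the event, or overlaps in duration with a previous
    outage whose start time differs by at most `overlap_start_max`.
    """
    events = []
    for start, end in outages:
        joined = False
        if events:
            for s, e in events[-1]:
                within_gap = timedelta(0) <= start - s <= gap
                overlaps = start < e and start - s <= overlap_start_max
                if within_gap or overlaps:
                    events[-1].append((start, end))
                    joined = True
                    break
        if not joined:
            events.append([(start, end)])
    return events


t0 = datetime(2020, 1, 1)
h = timedelta(hours=1)
outs = [(t0, t0 + 2 * h),                      # first outage
        (t0 + timedelta(minutes=3), t0 + h),   # starts 3 min later: same event
        (t0 + 5 * h, t0 + 6 * h)]              # no bunching or overlap: new event
events = group_into_events(outs)
```

This sketch only compares a new outage against the most recent event; the production algorithm in [29] may differ in such details.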

B. Outage, restore, and performance processes
Suppose that the resilience event has n outages at times o_1 ≤ o_2 ≤ … ≤ o_n and n restores at times r_1 ≤ r_2 ≤ … ≤ r_n. Note that the outages are sorted into the order in which the outages occur, and the restore times are sorted into the order in which the restores occur. This sorting implies that the kth restore time r_k is not usually the restore of the kth outage o_k.
For each event, the outage process O(t) is the cumulative number of outages at time t and the restore process R(t) is the cumulative number of restores at time t. Both processes start at zero at the beginning of the event and increase to the total number of outages n, as can be seen in the example in Fig. 1.
Resilience studies [1], [3]-[5] often define for each event a performance (or resilience) curve P(t), which is the negative of the cumulative number of unrestored outages at time t. The performance curve decrements for each outage and increments for each restore as shown in Fig. 1. Indeed, the performance curve is related to the outage and restore processes by P(t) = R(t) − O(t). The performance curve can be uniquely decomposed into its outage and restore processes, and it contains the same information as the outage and restore processes [9].
The outage and restore processes, while straightforward, are fundamental to analyzing real outage data, and they have several distinctive features [9]: (a) The outage and restore processes routinely overlap in time in real data; this differs from the customary idealized outage and restore phases of resilience that are separated in time [1], [3]-[5], [8]. (b) The analysis is at a systems level and is not focused on tracking individual elements: it only counts the numbers of outages and restores, and it does not track which outaged element restored when or the order in which elements restore. (c) The forms of the outage and restore processes and performance curve readily lead to resilience metrics that describe each process; in particular, it is useful to have separate metrics describing the outage process and the restore process.
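Since the outage and restore processes are simply cumulative counts, they and the performance curve are easy to construct from the sorted times. The following minimal sketch uses a hypothetical five-outage event with times in hours; `step_counts` is an illustrative helper, not from the paper.

```python
import numpy as np


def step_counts(times, t):
    """Cumulative count of occurrences at or before each time in t."""
    times = np.sort(np.asarray(times, dtype=float))
    return np.searchsorted(times, t, side="right")


# Hypothetical event: sorted outage times o and restore times r, in hours.
o = np.array([0.0, 0.2, 0.5, 0.6, 1.0])
r = np.array([1.5, 2.0, 2.2, 4.0, 9.0])

t = np.linspace(0.0, 10.0, 101)
O = step_counts(o, t)   # outage process O(t)
R = step_counts(r, t)   # restore process R(t)
P = R - O               # performance curve P(t) = R(t) - O(t)
```

The decomposition is visible directly: P starts and ends at zero and reaches its minimum when all five outages are simultaneously unrestored.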

III. POISSON PROCESS MODELS OF OUTAGE AND RESTORE
This section introduces new Poisson process models that describe typical outage and restore processes in our transmission system data. Fig. 2 shows examples. The mean values of these Poisson processes are a useful approximation of the outage and restore processes. Moreover, parameters of the Poisson process models yield resilience metrics, and section VIII uses the Poisson process models to quantify the variability of the metrics. We consider two different Poisson models for the restore process, based on lognormal and exponential rates respectively. The fit of the Poisson models to the data is discussed in section VII, where it is shown that the model with a lognormal rate typically fits the restore process better than the model with an exponential rate.

A. Poisson process of outage times with constant rate
The data for each event specify that there are n outages in the event and that the outages start at time o_1 and end at time o_n. Given this information, and assuming a constant rate Poisson process, we model the outage times as occurring randomly and at a constant rate λ_O in the time interval (o_1, o_n). In particular, given that there are n outages in (o_1, o_n), the n − 2 outage times o_2, ..., o_{n−1} are independent samples from a uniform distribution on (o_1, o_n) sorted into ascending order². A metric characterizing the outages is their rate λ_O, which is estimated for each event as³

λ_O = (n − 1)/(o_n − o_1)    (3)

The average or expected cumulative number of outages Ō(t) approximates the outage process O(t):

Ō(t) = 1 + λ_O (t − o_1),    o_1 ≤ t ≤ o_n    (4)

We see in Fig. 2 some typical examples in which the cumulative number of outages increases in the linear way given by (4). The total number of outages is n = Ō(o_n) = 1 + λ_O (o_n − o_1). For each event, λ_O can be estimated from (3), and then the averaged outage process (4) approximates and describes the outage process O(t).
³ Since there are n − 1 time differences between the n outages, the estimated average time difference between successive outages is (o_n − o_1)/(n − 1), and then the estimated rate λ_O is the reciprocal of the average time difference.

[Fig. 4 caption: The restore process R(t) is the red stepped line. R(t) is approximated by the average restore process R̄(t), which is the dashed curve. R̄(t) is proportional to the CDF of the lognormal distribution and its slope is the Poisson process rate.]
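A minimal sketch of the constant-rate outage model, assuming the rate estimate λ_O = (n − 1)/(o_n − o_1) described in the footnote above; the function names are illustrative.

```python
import numpy as np


def outage_rate(o):
    """Estimate lambda_O = (n - 1) / (o_n - o_1) from sorted outage times."""
    o = np.sort(np.asarray(o, dtype=float))
    return (len(o) - 1) / (o[-1] - o[0])


def mean_outage_process(t, o):
    """Averaged outage process: 1 + lambda_O * (t - o_1), clipped to [0, n]."""
    o = np.sort(np.asarray(o, dtype=float))
    lam = outage_rate(o)
    return np.clip(1.0 + lam * (np.asarray(t, dtype=float) - o[0]), 0.0, len(o))


# Five outages over four hours: one outage per hour on average.
o = [0.0, 1.0, 2.0, 3.0, 4.0]
lam = outage_rate(o)                     # 1 outage per hour
end_count = mean_outage_process(4.0, o)  # reaches n = 5 at o_n
```
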

B. Poisson process of restore times with lognormal rate
The data for each event specify that there are n restores in the event and that the restores start at time r_1. We work with the restore times relative to r_1; that is, r_j − r_1, j = 1, 2, ..., n. The first restore time relative to r_1, and any other simultaneous restores at r_1, become r_1 − r_1 = 0. Suppose that the first restore that occurs at a time later than r_1 is r_{z+1}. Usually r_2 > r_1 and z = 1.
The restore times typically happen with a rate that varies, as can be seen in the examples in Fig. 2. In particular, the rate of restores typically slows dramatically for the final restores. We model the n − z positive restore times r_j − r_1, j = z + 1, z + 2, ..., n as occurring randomly in a nonhomogeneous Poisson process at a rate proportional to a lognormal probability density. In particular, given that there are n − z restores in the time interval (r_1, ∞) = {t | t > r_1}, the n − z restore times r_{z+1} − r_1, ..., r_n − r_1 are independent samples from a lognormal distribution sorted into ascending order. There are some extremely long restore times r_n in the data (up to a year is recorded), and this is reflected in the modeling of the process as unbounded in (r_1, ∞).
Let the lognormal distribution have parameters µ and σ and probability density function f_{µ,σ}(t). Then the Poisson process rate is proportional to the probability density function:

λ_R(t) = (n − z) f_{µ,σ}(t − r_1),    t > r_1    (5)

By definition of the lognormal distribution, since the restore times r_{z+1} − r_1, r_{z+2} − r_1, ..., r_n − r_1 are independent samples from a lognormal distribution, the natural logarithms of the restore times ln(r_{z+1} − r_1), ln(r_{z+2} − r_1), ..., ln(r_n − r_1) are independent samples from a normal distribution. The standard parameters characterizing the lognormal distribution are the mean µ and standard deviation σ of the normal distribution. Therefore we estimate µ and σ for each event as the sample mean and sample standard deviation of the logged positive restore times:

µ = (1/(n − z)) Σ_{j=z+1}^{n} ln(r_j − r_1)    (6)

σ² = (1/(n − z − 1)) Σ_{j=z+1}^{n} (ln(r_j − r_1) − µ)²    (7)

The Poisson process restore rate λ_R(t) is proportional to the lognormal distribution as shown in (5). Then the average or expected cumulative number of restores R̄(t) is

R̄(t) = z + (n − z) F_{µ,σ}(t − r_1)    (8)

     = z + (n − z) Φ((ln(t − r_1) − µ)/σ),    t > r_1    (9)

where F_{µ,σ} is the CDF of the lognormal distribution and Φ is the CDF of the standard normal distribution. Equation (8) shows that R̄(t) − z is proportional to the CDF of the lognormal distribution, and (9) expresses R̄(t) in terms of the parameters µ and σ. R̄(t) approximates the restore process R(t) as shown in Fig. 4.
The lognormal model has parameters µ, σ, z, and n. For each event, µ and σ can be estimated from (6) and (7), and then the averaged restore process R̄(t) in (9) approximates and describes the restore process R(t). Examples of the approximating restore curves are shown by red dashed lines in Fig. 2.
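A sketch of fitting the lognormal restore model and evaluating the averaged restore process, assuming µ and σ are the sample mean and sample standard deviation of the logged positive restore times as in (6) and (7); the function names are illustrative.

```python
import numpy as np
from math import erf, sqrt, log


def fit_lognormal_restores(r):
    """Estimate mu, sigma from sorted restore times r (z = count tied at r_1)."""
    r = np.sort(np.asarray(r, dtype=float))
    z = int(np.sum(r == r[0]))
    x = np.log(r[z:] - r[0])          # logged positive restore times
    return float(x.mean()), float(x.std(ddof=1)), z, len(r)


def mean_restore_process(t, mu, sigma, z, n, r1=0.0):
    """R_bar(t) = z + (n - z) * Phi((ln(t - r1) - mu) / sigma), for t > r1."""
    Phi = lambda u: 0.5 * (1.0 + erf(u / sqrt(2.0)))  # standard normal CDF
    return z + (n - z) * Phi((log(t - r1) - mu) / sigma)


# Hypothetical event with restore times chosen so the logs are 0, 1, 2.
r = [0.0, 1.0, float(np.e), float(np.e) ** 2]
mu, sigma, z, n = fit_lognormal_restores(r)
# At t = e hours the averaged process is at its "median" point z + (n-z)/2.
mid = mean_restore_process(float(np.e), mu, sigma, z, n)
```
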

C. Poisson process of restore times with exponential rate
We can substitute the exponential distribution for the lognormal distribution of subsection III-B to obtain a Poisson restore process with exponential rate. That is, given that there are n − z restores in (r_1, ∞), the n − z restore times r_{z+1} − r_1, ..., r_n − r_1 are independent samples from an exponential distribution sorted into ascending order. We analyze the exponential restore rate because it is an analytically convenient choice to try to describe the slowing rate of restores.
Let the exponential distribution have time constant τ and probability density function τ^{−1} e^{−t/τ} for t ≥ 0. Then the Poisson process rate is

λ_R(t) = (n − z) τ^{−1} e^{−(t − r_1)/τ},    t > r_1

and the expected cumulative number of restores is

R̄_exp(t) = z + (n − z)(1 − e^{−(t − r_1)/τ}),    t > r_1    (12)

We estimate the exponential time constant by

τ = (1/(n − z)) Σ_{j=z+1}^{n} (r_j − r_1)    (13)

that is, τ is the arithmetic mean of the positive restore times relative to r_1. The exponential model has parameters τ, z, and n. For each event, τ can be estimated from (13), and then the averaged restore process R̄_exp in (12) approximates and describes the restore process R(t). Examples of the approximating restore curves are shown by gray dashed lines in Fig. 2.
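The exponential alternative only changes the fitted parameter and the CDF; a sketch under the same assumptions as before, with τ the arithmetic mean of the positive restore times as in (13):

```python
import numpy as np


def fit_exponential_restores(r):
    """tau = arithmetic mean of positive restore times relative to r_1."""
    r = np.sort(np.asarray(r, dtype=float))
    z = int(np.sum(r == r[0]))
    return float(np.mean(r[z:] - r[0])), z, len(r)


def mean_restore_process_exp(t, tau, z, n, r1=0.0):
    """R_bar_exp(t) = z + (n - z) * (1 - exp(-(t - r1)/tau)), for t > r1."""
    return z + (n - z) * (1.0 - np.exp(-(t - r1) / tau))


# Hypothetical event: restore times 0, 1, 2, 3 hours after r_1.
tau, z, n = fit_exponential_restores([0.0, 1.0, 2.0, 3.0])
value = mean_restore_process_exp(2.0, tau, z, n)   # one time constant in
```
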

IV. DURATION METRICS
There are many possible metrics describing durations in resilience events. This section defines and describes a variety of these metrics.

A. Straightforward duration metrics

outage duration D_O: The outage process starts at the first outage o_1 and ends at o_n, so that the outage duration D_O = o_n − o_1. The first restore is at time r_1 and the time to the first restore is D_r1 = r_1 − o_1. That is, D_r1 quantifies how much the start of the restore process is delayed. The restore process starts at r_1 and ends at the last restore r_n, so that the restore duration D_n = r_n − r_1. The event starts at time o_1 and ends at time r_n. The event duration D_E = r_n − o_1 can be split into the time to the first restore and the restore duration:

D_E = D_r1 + D_n    (14)

This section discusses restore duration, but the corresponding metrics describing event duration are easily obtained from the metrics for restore duration by adding the time to first restore D_r1 as in (14). The outage duration D_O and time to first restore D_r1 are useful metrics, but section V explains that the restore duration D_n and the event duration D_E suffer from high variability.

B. Restore metrics based on quantiles
It is of interest to quantify the time to reach a given percentage x of restoration, or, equivalently, the x/100 quantile of the restore times 0, r_2 − r_1, r_3 − r_1, ..., r_n − r_1. There are many different definitions of quantiles ([36] analyzes 10 definitions used in statistics), and correspondingly many ways to define restore metrics based on quantiles. This subsection discusses two metrics of restore duration based on quantiles; the first metric quantizes to a restore time while the second metric interpolates between restore times.

time to first restore with at least x% restoration:

D_≥x% = r_⌈xn/100⌉ − r_1    (15)

The ceiling function ⌈u⌉ is the smallest integer ≥ u. For example, D_≥95% is the time between the first restore r_1 and the first restore r_⌈0.95n⌉ at which at least 95% of the restores are completed. It follows that D_≥95% = D_n for n < 20, D_≥95% = D_{n−1} for 20 ≤ n < 40, and D_≥95% = D_{n−2} for 40 ≤ n < 60. For example, for n = 16, ⌈0.95n⌉ = ⌈15.2⌉ = 16 and D_≥95% = D_16. These quantum jumps in D_≥95% as n varies, which also occur as x varies, are unsatisfactory when analyzing a range of events. This can be fixed with the following more elaborate quantile definition.

restore time to x% of restoration:

D_x% = (⌈u⌉ − u) D_⌊u⌋ + (u − ⌊u⌋) D_⌈u⌉    (16)

where u = min{xn/100 + 1/2, n}    (17)

and D_j = r_j − r_1. The ceiling function ⌈u⌉ is the smallest integer ≥ u, the floor function ⌊u⌋ is the largest integer ≤ u, and u − ⌊u⌋ is the fractional part of u.
Eqn. (16) shows that D_x% linearly interpolates between restore times D_⌊u⌋ and D_⌈u⌉. D_x% uses the median-based quantile definition recommended by [36], but also limits u to a maximum of n in (17). When limiting applies, D_x% = D_n.
In contrast to D_≥x%, D_x% changes continuously as x varies and with much smaller jumps as n varies. For this reason, we strongly prefer D_x% to D_≥x%. D_50% evaluated with (16) reduces to the usual median. That is, letting ℓ = ⌊n/2⌋,

D_50% = D_{ℓ+1} for n odd, and D_50% = (D_ℓ + D_{ℓ+1})/2 for n even    (18)

C. Metrics related to restore process models

These metrics work with the positive restore times relative to r_1; that is, r_j − r_1, j = z + 1, z + 2, ..., n. Usually z = 1 as explained in section III-B.

geometric mean of positive restore times:

D_GM = ((r_{z+1} − r_1)(r_{z+2} − r_1) ⋯ (r_n − r_1))^{1/(n−z)} = e^µ

All the duration metrics in the paper (labelled with D) are given in hours, so that the time unit t_u = 1 hour. We now discuss the units of µ and σ. A more precise version of µ = ln D_GM is µ = ln(D_GM/t_u) (or D_GM = t_u e^µ). Dividing D_GM in hours by t_u = 1 hour gives the required nondimensional argument of the logarithm [39]. Changing t_u will change the value of µ. σ does not depend on the units used and gives the same value for any choice of t_u.
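A sketch of computing D_x% and D_GM, assuming the interpolated quantile uses u = min(xn/100 + 1/2, n), which makes D_50% the usual median; `d_x_percent` and `d_gm` are illustrative names.

```python
import numpy as np


def d_x_percent(r, x):
    """Interpolated quantile restore metric D_x%.

    r: restore times; D_j = r_j - r_1 with 1-based index j.
    u = min(x*n/100 + 1/2, n); linearly interpolate between
    D_floor(u) and D_ceil(u).
    """
    r = np.sort(np.asarray(r, dtype=float))
    D = r - r[0]
    n = len(r)
    u = max(min(x * n / 100.0 + 0.5, float(n)), 1.0)
    lo, hi = int(np.floor(u)), int(np.ceil(u))
    frac = u - lo
    return (1.0 - frac) * D[lo - 1] + frac * D[hi - 1]


def d_gm(r):
    """Geometric mean of positive restore times relative to r_1 (= e^mu)."""
    r = np.sort(np.asarray(r, dtype=float))
    z = int(np.sum(r == r[0]))
    return float(np.exp(np.mean(np.log(r[z:] - r[0]))))


# D_50% reproduces the usual median for even and odd n.
even_case = d_x_percent([0.0, 1.0, 3.0, 10.0], 50)   # (1 + 3) / 2 = 2
odd_case = d_x_percent([0.0, 2.0, 5.0], 50)          # middle value 2
tail_case = d_x_percent([0.0, 1.0, 3.0, 10.0], 95)   # limited to D_n = 10
gm = d_gm([0.0, 1.0, 4.0])                           # sqrt(1 * 4) = 2
```
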
V. DISCUSSING RESTORE METRICS D_n, D_GM, D_95%, D_ln95%

All duration metrics of the restore process are subject to substantial statistical variability that can undermine their usefulness, especially for smaller values of event size n. The variabilities of the restore metrics are analyzed in section VIII by calculating the size of their confidence intervals, and only the conclusions about their variability are stated here.
The restore duration metric D_n is straightforward, but it is typically too highly variable to be a reliable estimate. Moreover, D_n depends strongly on the last or last few restores, preventing D_n from describing the performance throughout the entire restore process. This dependence also makes D_n relate poorly to transmission performance, because these last restores may be unimportant for customers, or may be excessively delayed by factors out of the control of the utility, such as the difficulty of repairing transmission lines in the mountains in winter or structural damage caused by a hurricane or tornado.
The geometric mean of the positive restore times D_GM is the best estimate of restore performance in terms of having the least variability. It is also clear that D_GM depends on all the restores throughout the restore process. We now discuss how D_GM also estimates a median of the restore process. Since the normal distribution is symmetric about its mean value, the mean µ also estimates the median of the normal distribution, and therefore D_GM = e^µ estimates the median of the lognormal distribution. In fact, D_GM is a better estimate (less variance) of the median than applying the standard formula (18) for the median. The detailed correspondence is that D_GM estimates the median of r_j − r_1, j = z + 1, z + 2, ..., n, which is modestly greater than the median of all the restore times r_j − r_1, j = 1, 2, ..., n calculated in (18). That is, under the lognormal model, D_GM is a good estimate of the median of the positive restore times relative to r_1, and approximates from above the median D_50% of all restore times relative to r_1.
While D_GM is an informative metric with the lowest variability, D_95% and D_ln95% can be used as more representative of the almost complete duration of the restore process, with the compromise of higher variability than D_GM. D_95% is a more smoothly varying quantile metric indicating the 95% completion of the restore process. D_ln95% is also smoothly varying. D_95% is a bit more variable than D_ln95%, particularly for small n. Overall, we slightly prefer D_95% to D_ln95% because the quantile approach is less model dependent, whereas D_ln95% will work best in the typical lognormal restore case.
Table I summarizes the metrics and our recommendations.

VI. TYPICAL VALUES OF METRICS & MODEL PARAMETERS
Typical values of metrics and parameters are given for all the data in Table I and for each interconnection in Table II; these values are expected to be useful for modeling and assessing interconnection-specific transmission events. Due to the heavy tails in their distributions, some quantities in Table II such as D_n have mean values that greatly exceed the median, and large standard deviations. In these cases, the estimated mean has substantial statistical variation and poorly indicates a typical value; the median is a better typical value. The large standard deviations arise from both the statistical variability of the metric and the variation of the metric between events.
On average, events in the Eastern interconnection are larger than in the West and ERCOT. This can be explained by the fact that the largest transmission events were caused by hurricanes, and all of these events occurred in the East. For all interconnections, the mean and median outage process durations D_O are similar, and very short compared to event durations D_E. The mean outage rate in the West is much higher due to several events (wildfires and a lightning storm) for which all outages started almost simultaneously. This extremely short outage duration D_O results in huge outage rates (see (3)).

⁷ For z = 1, the difference in the medians is (r_{ℓ+1} − r_ℓ)/2, where ℓ = ⌊n/2⌋.
The restoration usually starts very quickly after the event starts, as the time to first restore D_r1 indicates. In ERCOT the average time to first restore, 1 hour 17 minutes, is statistically significantly larger than in the East and the West, where restoration typically starts within one hour. Overall, the time to first restore is negligible compared to the event duration; this makes the event duration D_E and the restore process duration D_n effectively equal. In contrast, the time between the (n−1)th and nth restores, D_n − D_{n−1}, is sizeable and often comprises a substantial share (41% on average) of D_n. This observation again underscores the impact of the last few restores on the event and restore durations.
The geometric mean of the positive restore times, D_GM, is a simple and stable metric. D_GM is also an approximate estimate of the time to one half of the restores for events with lognormal restore times. The largest difference between these metrics, observed for the ERCOT events, can be attributed to the poorer lognormal fit for the ERCOT events. On average, D_GM is 12% of the entire restore process duration D_n.
It is interesting to compare in Table II the sample quantile restore time D_95% with the lognormal and exponential quantiles D_ln95% and D_exp95%. D_ln95% often overestimates D_95% due to the heavy tail of the lognormal distribution, whereas D_exp95% often underestimates D_95% due to the light tail of the exponential distribution.
The parameters µ and σ for the fitted lognormal distributions and τ for the fitted exponential distributions are consistent within each interconnection and across interconnections. Table V shows that µ increases and σ decreases with event size n.
Only 23 of the 352 resilience events in the dataset are not weather-related. These 23 events vary in size from 10 to 26 outages. Except for D_r1, the medians of the duration metrics in Table III are statistically significantly higher for weather-related events than for non weather-related events. Table III also shows for each weather type the median metrics for the 95 weather-related events with at least 18 outages. There are some statistically significant differences among the extreme weather types: the medians of D and D_95% for hurricanes are greater than for other weather types, and D_GM and µ for hurricanes and tornadoes are greater than for other weather types. The means of the times to first restore D_r1 are similar for all weather types except tornadoes; the mean D_r1 for tornadoes is 1.7 hours, which is at least double the mean D_r1 for the other weather types. Our analysis confirms the well-known fact that a type of extreme weather can be more typical and impactful for one interconnection than another. Among the 11 named hurricanes that caused the 17 transmission outage events shown in Table III (the largest, longest, and most impactful events in the data set), all except one hit the Eastern Interconnection; the exception was Hurricane Harvey (ERCOT, August 2017). Wildfires causing large transmission events usually occur in the West. These examples demonstrate a possible reason for metric variability across systems and, more importantly, the impracticality of using duration metrics to compare the resilience of transmission systems in different interconnections. These metrics should instead be used to track differences in resilience and restoration for the same grid (changes over time, between different types of events, etc.).

VII. FIT OF POISSON PROCESS MODELS TO UTILITY DATA
This section discusses the fit of the Poisson models to the observed utility data by a goodness of fit test, which allows for analysis of each of the 352 events, and by probability plots for the combined normalized data, which also show where the fit deviates. For the goodness of fit tests, there is some arbitrariness in the threshold amount of deviation corresponding to the significance level, as well as some dependence on the event size n, but they do give an indication of fit.

A. Outage process fit with uniform distribution
The Poisson process model with constant outage rate implies that for each event the n − 2 outage times o_k, k = 2, 3, ..., n − 1 should be independent samples from a uniform distribution on the interval (o_1, o_n). We evaluated the fit of these outage times for each event to the uniform distribution as shown in Table IV. Satisfying the test means that the ideal model is not rejected at the significance level 0.05. Table IV shows that a majority of events satisfy the model. The normalized outage times (o_k − o_1)/(o_n − o_1), k = 2, 3, ..., n − 1 should be independent samples from the standard uniform distribution on the interval (0, 1). The fit of the normalized outage times for all of the events to the standard uniform distribution is shown by the QQ plot in Fig. 5. The fit in Fig. 5 is quite close over the middle range, and the main deviations occur at the ends of the distribution and correspond to simultaneous multiple outages recorded at the beginning or end of the outage process.
The fits of this subsection indicate that the Poisson model with constant rate is a typical case (a majority of all events) usefully approximating the outage process.
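The paper does not name the goodness-of-fit test it applies; as one common choice, a Kolmogorov-Smirnov check of the normalized outage times against the standard uniform distribution can be sketched as follows. The 5% large-sample critical value 1.36/√m is an assumption of this sketch, not taken from the paper.

```python
import numpy as np


def uniform_outage_gof(o, crit_coeff=1.36):
    """KS check of the constant-rate outage model for one event.

    Normalizes the n - 2 interior outage times to (0, 1) and computes the
    one-sample KS statistic against the standard uniform distribution.
    Returns (statistic, model_not_rejected) using the approximate
    large-sample 5% critical value crit_coeff / sqrt(m).
    """
    o = np.sort(np.asarray(o, dtype=float))
    u = np.sort((o[1:-1] - o[0]) / (o[-1] - o[0]))
    m = len(u)
    i = np.arange(1, m + 1)
    d = float(max(np.max(i / m - u), np.max(u - (i - 1) / m)))
    return d, d <= crit_coeff / np.sqrt(m)


# Evenly spread interior times: close to ideal uniform, so not rejected.
o = np.concatenate(([0.0], (np.arange(1, 51) - 0.5) / 50.0, [1.0]))
stat, ok = uniform_outage_gof(o)
```
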

B. Restore process fit with lognormal distribution
As explained in section III-B, the Poisson process model with lognormal rate for the restores implies that for each event the restore times r_{z+1} − r_1, r_{z+2} − r_1, ..., r_n − r_1 should be independent samples from a lognormal distribution. We evaluated the fit of these restore times for each event to the lognormal distribution with parameters µ, σ estimated using (6), (7) at the significance level 0.05, as shown in Table IV. Table IV shows that a majority of all events satisfy the model, and this also holds for the East and West interconnections.
For each event, the normalized restore times (ln(r_k − r_1) − µ)/σ, k = z + 1, z + 2, ..., n should be independent samples from the standard normal distribution N(0, 1). The fit of the normalized restore times for all events to the standard normal distribution is shown by the CDF and QQ plots in Fig. 6, which show a reasonably good fit with some modest deviations.
The fits described in this subsection indicate that the Poisson process model with lognormal rate is a typical case usefully approximating the restore process. The typical lognormal case has a heavy tail that can describe some extremely delayed final restores.

C. Restore process fit with exponential distribution
As explained in section III-C, the Poisson process model with exponential rate for the restores implies that for each event the restore times r z+1 − r 1 , r z+2 − r 1 , ..., r n − r 1 should be independent samples from an exponential distribution with time constant τ. We evaluated the fit of the restore times for each event to the exponential distribution with time constant τ estimated using (13), as shown in Table IV. Table IV shows that only a minority of events satisfy the model.
For each event, the normalized restore times τ −1 (r k −r 1 ), k = z +1, z +2, ..., n should be independent samples from the standard exponential distribution with time constant 1. The fit of the normalized restore times for all events to the standard exponential distribution is shown by the survival function and QQ plots in Fig. 7. There is a clear discrepancy between the exponential model and the data in the initial portion and the tail of the distribution. The tail in the data is much heavier than exponential, and this discrepancy in the tail is particularly significant for our purpose here of estimating restore durations.
The fits described in this subsection indicate that the Poisson process model with exponential rate fits only a minority of the events and is a noticeably poorer approximation of the typical restore process than the model with lognormal rate.
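The two competing restore models can be compared on the same event data, as in this sketch (our own illustration; KS tests stand in for the paper's goodness-of-fit test, and the parameters are estimated by the log-sample moments and the arithmetic mean from (13)):

```python
import numpy as np
from scipy import stats

def restore_fit_pvalues(restore_times, z=1):
    """KS p-values of the nonzero restore times under the lognormal and
    exponential models, with parameters estimated from the data."""
    r = np.asarray(restore_times, dtype=float)
    d = r[z:] - r[0]                        # nonzero restore times
    mu, sigma = np.log(d).mean(), np.log(d).std(ddof=1)
    tau = d.mean()                          # (13): tau is the arithmetic mean
    p_ln = stats.kstest((np.log(d) - mu) / sigma, "norm").pvalue
    p_exp = stats.kstest(d / tau, "expon").pvalue
    return p_ln, p_exp

# Heavy-tailed lognormal restores: the exponential model should be rejected.
rng = np.random.default_rng(3)
r = np.concatenate(([0.0], np.sort(rng.lognormal(1.0, 1.2, size=1000))))
p_ln, p_exp = restore_fit_pvalues(r)
```

On data with a lognormal-like heavy tail, p_exp is driven down while p_ln remains comfortably above the rejection level, mirroring the pattern reported in Table IV.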

VIII. STOCHASTIC VARIABILITY OF RESTORE METRICS
The restore duration metrics vary due to variation of the restore processes between events (and this of course is what we want to quantify), but also due to the inherent statistical variability of the metric used (which we want to minimize by selecting a better metric). The statistical variability makes the metric vary between events, even if the events have the same characteristics, because of random variations in the progress of the restores.
We assess the inherent statistical variability of the metrics by assuming the lognormal Poisson model for average values of µ and σ, which vary as functions of n and are estimated using (6) and (7). In this section we assume that z = 1.
A. Variability of D GM , D ln x% , and τ

We measure the size of the D GM confidence interval (20) by the multiplicative factor exp(σ z c /√(n−1)), which we call the "multiplicative half-width" of the confidence interval. More generally, we define the multiplicative half-width of a confidence interval with endpoints c 1 , c 2 as √(c 2 /c 1 ) (21).

Now we obtain the size of the confidence interval for D ln x% . From (19), taking z = 1, ln D ln x% = µ + φ x,n σ, where φ x,n = Φ −1 [(nx/100 − 1)/(n − 1)] (22). The sample standard deviation σ has distribution (σ/√(n−2)) χ n−2 , where χ n−2 is the chi distribution with n−2 degrees of freedom 10 .
Using (22) and the independence of µ and σ, the probability density function of ln D ln x% is the convolution of the densities of µ and φ x,n σ (23), and the CDF of ln D ln x% is its integral (24). We use (24), numerically integrating to evaluate the convolution, to find the 100(1 − c)% confidence interval for ln D ln x% as {F −1 (c/2), F −1 (1 − c/2)}, and then use (21) to find the multiplicative half-width of the confidence interval for D ln x% .

Equation (13) shows that the exponential time constant τ is also the arithmetic mean of the nonzero restore times. In this section these n − 1 restore times are assumed to be sampled from a lognormal distribution. Using Cox's approximate method [41], we obtain the multiplicative half-width of the confidence interval of τ.

10 The definition of σ uses µ, so that the number of degrees of freedom is one fewer than the number of samples n−1.
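These interval sizes can be sketched numerically. The following assumes the multiplicative half-width of an interval with endpoints c 1 , c 2 is √(c 2 /c 1 ), consistent with the D GM factor exp(σ z c /√(n−1)); the function names are ours:

```python
import numpy as np
from scipy import stats

def half_width(c1, c2):
    """Multiplicative half-width sqrt(c2/c1) of an interval (c1, c2)."""
    return np.sqrt(c2 / c1)

def dgm_half_width(sigma, n, conf=0.90):
    """Multiplicative half-width exp(sigma * z_c / sqrt(n-1)) of the
    D_GM confidence interval, where z_c is the standard normal critical
    value for the given confidence level."""
    z_c = stats.norm.ppf(0.5 + conf / 2.0)
    return np.exp(sigma * z_c / np.sqrt(n - 1))
```

For a symmetric interval on the log scale, exp(µ ± w), the half-width √(c 2 /c 1 ) reduces to e^w, which is why the D GM interval has half-width exp(σ z c /√(n−1)).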

B. Variability of D k and D x%
Since the restore times r 1 , r 2 , ..., r n are sorted in increasing order, D k = r k −r 1 corresponds to the kth smallest restore time and, assuming that z = 1 and k ≥ 2, D k is the (k−1)th order statistic of the n−1 lognormally distributed restore times r 2 − r 1 , ..., r n − r 1 . We evaluate in Mathematica the inverse CDF F −1 D k of the (k−1)th order statistic of n−1 samples of the lognormal distribution with parameters µ and σ. Then we find the 100(1 − c)% confidence interval {F −1 D k (c/2), F −1 D k (1 − c/2)} and its multiplicative half-width from (21).
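The inverse CDF of the order statistic (evaluated in Mathematica in the paper) can also be sketched in Python, using the standard fact that the jth order statistic of m iid samples has CDF B j,m−j+1 (F(x)), where B is the Beta CDF; the function name and the 90% default are our illustration:

```python
import numpy as np
from scipy import stats

def dk_confidence_interval(k, n, mu, sigma, conf=0.90):
    """Confidence interval for D_k = r_k - r_1, the (k-1)th order statistic
    of the n-1 lognormal(mu, sigma) restore times (z = 1 assumed).

    The pth quantile of the jth order statistic of m iid samples is
    F^{-1}(B^{-1}(p)), where B is the Beta(j, m-j+1) CDF and F the
    lognormal CDF.
    """
    j, m = k - 1, n - 1
    lo_u = stats.beta.ppf((1 - conf) / 2.0, j, m - j + 1)
    hi_u = stats.beta.ppf((1 + conf) / 2.0, j, m - j + 1)
    F_inv = lambda u: stats.lognorm.ppf(u, s=sigma, scale=np.exp(mu))
    return F_inv(lo_u), F_inv(hi_u)
```

For example, dk_confidence_interval(45, 50, 3.0, 1.0) gives the interval for D 45 with n = 50, and np.sqrt(hi/lo) then gives its multiplicative half-width.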
To evaluate the variability of D x% , we approximate its inverse CDF with a linear interpolation in which u is given by (17). We then obtain the 100(1 − c)% confidence interval {F −1 D x% (c/2), F −1 D x% (1 − c/2)} and use (21) to obtain its multiplicative half-width.

C. Results for variability of metrics
The size of the 90% confidence interval, measured by the multiplicative half-width (21), indicates the inherent statistical variability of the metrics. For example, a multiplicative half-width of 2 indicates that the interval spans from half to double of a point inside the interval. Table V shows results for metric variability, and there are some overall trends: All the metrics become much more variable as the event size n decreases. Metrics estimating a larger fraction of the entire restore duration are much more variable (consider the sequence D GM = D ln 50% , D ln 90% , D ln 95% or D 50% , D 90% , D 95% , D 100% = D n ). The quantile metrics (D 50% , D 90% , D 95% ) are always more variable than the corresponding metrics related to lognormal restore (D GM , D ln 90% , D ln 95% ), but the increase in variability is modest or small for n ≥ 50.
Metric variability is worst and unacceptably large for D n , which always has a confidence interval size of more than a factor of 2. The high variability of the last restore r n and of D n is expected due to the heavy tail of the lognormal distribution. Fig. 8 shows that the variability of D k is sharply reduced for k/n = 0.95, at least for larger n, and further reduced for k/n = 0.90. This motivates avoiding D n and considering the use of D ln 90% , D ln 95% , D 90% , D 95% , which have confidence intervals with size less than a factor of 2 for n ≥ 50 and which perform more continuously by interpolating the D k metrics. The arithmetic mean τ is highly variable for smaller values of n; it has a confidence interval size of more than a factor of 2 for n < 30.
The pervasive problem of duration metric variability is best mitigated by D GM , which has a confidence interval size of less than a factor of 2 for n ≥ 17.
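A small Monte Carlo experiment illustrates why D n is so much more variable than D GM under the lognormal restore model (illustrative parameter values, not the Table V estimates):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 3.0, 1.0, 50, 2000

d_n, d_gm = [], []
for _ in range(trials):
    d = rng.lognormal(mu, sigma, n - 1)    # nonzero restore times, z = 1
    d_n.append(d.max())                    # last restore, D_n
    d_gm.append(np.exp(np.log(d).mean()))  # geometric mean, D_GM

# Spread measured as the ratio of the 95th to 5th percentile across trials
spread = lambda v: np.percentile(v, 95) / np.percentile(v, 5)
```

Across the simulated events, the spread of D n is several times that of D GM : the maximum inherits the heavy lognormal tail, while the geometric mean averages over all n − 1 restores.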
This section assesses metric variability assuming the lognormal model of restores. This is a good assumption for a majority of cases, and can be regarded as a stringent assumption for the remaining minority of cases due to the heavy tail of the lognormal distribution.

IX. CONCLUSIONS
We use extensive North American transmission system data to analyze the statistical variability and interpretations of a variety of metrics for the duration of processes in resilience events. Some metrics, such as the outage duration D O , outage rate λ O , and the time delay before the first restore D r1 , are useful. Other duration metrics can suffer from excessive statistical variability, with estimated values lying in confidence intervals so large that the estimates are not representative. This variability is quantified using new Poisson models for outage and restore processes. The variability is worse for small events.
The apparently straightforward metrics of restore process duration D n and event duration D E are extremely statistically variable and do not adequately describe the restore process, so we recommend new duration metrics D GM and D 95% (or D 90% ) with better performance. In particular, the geometric mean of restore times D GM has the least statistical variability, summarizes all of the restore process, and approximates a time at which half the restores are completed. The quantile-based metric D 95% indicates the time at which restoration is 95% complete; it uses interpolation to vary more continuously as the data changes, but has greater variability than D GM . Table I summarizes the metrics and their recommendations, and Tables II and III give typical values of the metrics for three interconnections and different weather conditions.
Since our paper is driven by North American bulk electric transmission system outage data, strictly speaking the results describe aspects of resilience only in North American transmission grids.However, since similar transmission outage data is routinely collected worldwide, the methods of the paper are readily applicable to other transmission systems to test or confirm the models and conclusions of the paper.
We introduce novel Poisson process models for the outage and restore processes in resilience events. These new stochastic models describe how resilience events progress in North American transmission systems, and are verified with extensive utility data to be good approximations for the majority of cases. The outages occur uniformly over a short time interval, whereas the restores occur at a lognormal rate that slows to produce the long delays often observed for the last few restores. The lognormal model for the restores is a noticeably better fit than an exponential model. We give typical values of the model parameters for three interconnections and for different weather conditions to make the new models more specific and useful to other researchers.
The Poisson process models describe probabilistic outages and restores occurring according to specified rates. Averaging the Poisson process models produces formulas for smooth, deterministic curves that approximate typical outage and restore processes. These deterministic averaged models are of considerable interest for future work describing how resilience events progress in transmission systems. For example, one can derive formulas for the area, duration, and nadir metrics of mean performance curves in terms of the Poisson process parameters [42]. The formulas for the area under the mean performance curve are simple and intuitive, and sometimes also apply to the area under empirical performance curves obtained from observed data.

Fig. 1. Processes for a transmission system resilience event with 12 outages.

Fig. 2. Examples of outage processes (dark blue) and restore processes (red) for events. The red dashed line is the lognormal restore approximation; the gray dashed line is the exponential restore approximation. The p-value is from the Anderson-Darling test of the lognormal fit to the restore process.

Fig. 3. Horizontal axis ticks show eight outage times o 1 , o 2 , ..., o 8 produced by a Poisson process with constant rate λ O . The resulting outage process O(t) is the dark blue stepped line. O(t) is approximated by the average outage process Ō(t), which is the dashed line of slope λ O .

Fig. 4. Horizontal axis ticks show eight restore times r 1 , r 2 , ..., r 8 produced by a Poisson process with lognormal rate. The resulting restore process R(t) is the red stepped line. R(t) is approximated by the average restore process R̄(t), which is the dashed curve. R̄(t) is proportional to the CDF of the lognormal distribution and its slope is the Poisson process rate.
Restore time to x% restoration assuming lognormal restores: D ln x% satisfies nx/100 = R̄(D ln x% + r 1 ), that is, nx/100 − z = (n − z)Φ[(ln D ln x% − µ)/σ], so that

D ln x% = exp{µ + σ Φ −1 [(nx/100 − z)/(n − z)]}.   (19)

Note that D ln (50+50z/n)% = e µ = D GM .

Arithmetic mean of the nonzero restore times: τ = (1/(n−z)) Σ n k=z+1 (r k − r 1 ).

Restore time to x% restoration assuming exponential restores: D exp x% satisfies nx/100 = R̄ exp (D exp x% + r 1 ), that is, n(1 − x/100) = (n − z) exp[−D exp x% /τ], so that D exp x% = τ ln[(n − z)/(n(1 − x/100))]. The average restoring half-life D exp 50% = τ ln[2(n−z)/n] is the average time for the number of unrestored outages to halve, averaged over the restore process assuming exponential decay.

There are variants of D ln x% and D exp x% with slightly simpler formulas that describe the time to restoration of x% of the n − z nonzero restore times. For these variants, D ln x% becomes exp[µ + σ Φ −1 (x/100)] and D exp x% becomes τ ln[1/(1 − x/100)]. We prefer the definitions of D ln x% and D exp x% above because the time to restoration of x% of all n restore times seems more straightforward.
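These metric formulas can be sketched as code (our own helper; the estimators for µ, σ, and τ are the log-sample moments and the arithmetic mean of the nonzero restore times):

```python
import numpy as np
from scipy.stats import norm

def metrics(restore_times, x=95.0, z=1):
    """Duration metrics D_ln x%, tau, and D_exp x% from the definitions above.

    restore_times: sorted r_1 <= ... <= r_n with z zero restore times.
    """
    r = np.asarray(restore_times, dtype=float)
    n = len(r)
    d = r[z:] - r[0]                         # nonzero restore times
    mu, sigma = np.log(d).mean(), np.log(d).std(ddof=1)
    tau = d.mean()
    d_ln = np.exp(mu + sigma * norm.ppf((n * x / 100 - z) / (n - z)))  # (19)
    d_exp = tau * np.log((n - z) / (n * (1 - x / 100)))
    return d_ln, tau, d_exp
```

Setting x = 50 + 50z/n makes the argument of Φ −1 equal to 1/2, so d_ln reduces to e µ = D GM , as noted above.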

Fig. 7. Fit of normalized restore data to the standard exponential distribution. Above: comparison of log survival functions; below: QQ plot.


Fig. 8. Size of the 90% confidence interval for the kth order statistic D k as the fraction k/n varies. n is the number of restores. Confidence interval size is the multiplicative half-width. Lognormal restore is assumed with parameters µ and σ from Table V.