The Most Frequent N-k Line Outages Occur in Motifs That Can Improve Contingency Selection

Multiple line outages that occur together show a variety of spatial patterns in the power transmission network. Some of these spatial patterns form network contingency motifs, which we define as the patterns of multiple outages that occur much more frequently than multiple outages chosen randomly from the network. We show that choosing N-k contingencies from these commonly occurring contingency motifs accounts for most of the probability of multiple initiating line outages. This result is demonstrated using historical outage data for two transmission systems. It enables N-k contingency lists that are much more efficient in accounting for the likely multiple initiating outages than exhaustive listing or random selection. The N-k contingency lists constructed from motifs can improve risk estimation in cascading outage simulations and help to confirm utility contingency selection.


I. INTRODUCTION
It is routine to choose initial line outage contingencies to assess power transmission system security with simulation.Single line contingencies, known as N-1, are tractable and their impact is tested simply by applying each outage in turn.This paper analyzes the probabilities of the more challenging N-k initial line contingencies with k>1 lines outaged at once.These multiple line contingencies, generally of higher impact and lower frequency, do occur in practice, and are simulated when assessing the risk of more extreme events such as cascading, or ensuring robustness to a list of contingencies that goes beyond N-1.We now explain how these applications motivate our analysis of the probability of N-k initial outages based on outage data routinely collected by utilities.
When assessing the risk of cascading outages with a simulation, it is usual to sample multiple initial line outages with some sort of equal probability assumption [1], [2], such as independent, equal probabilities for the individual line outages that make up each multiple outage, or equal probabilities for all N-2 outages.These equal probability assumptions are pragmatic but unrealistic, and this systematic skewing of the sampling towards contingencies that are unlikely in practice makes the resulting risk estimates less credible.We use contingency motifs to sample the multiple contingencies that have significantly higher probability.When this improved sampling K. Zhou is with the School of Mechanical and Electrical Engineering, Soochow University, Suzhou, China (email: kzhou@suda.edu.cn).
I. Dobson and Z. Wang are with the Department of Electrical and Computer Engineering, Iowa State University, Ames IA USA (email: dobson@iastate.edu;wzy@iastate.edu).
I. Dobson gratefully acknowledges support from NSF grant 2153163.
of initial contingencies is combined with the simulated cascading impact, better estimates of cascading risk can be obtained.
The power industry routinely tests and maintains power grid reliability by simulating the impacts of a list of credible initial outage contingencies.The credible contingencies include all single line contingencies and a judicious selection of the huge number of multiple contingencies that are theoretically possible.As summarized in the literature review, the credible multiple contingencies are largely chosen by their impact or by engineering judgment.The contingency motifs give sets of multiple outages that have a much higher probability based on the historic outage data.Quantifying the extent to which these sets of multiple outages have occurred much more frequently in the past can help confirm and augment the contingency list.
This paper originates from the observation from real outage data that multiple contingencies occur much more frequently in spatial patterns that we call contingency motifs.We illustrate this idea for the simplest case of N-2 outages.For N-2 outages, the contingency motif is all the outages of two lines that share a common bus; that is, they have the spatial pattern .In our first power grid example of N = 528 lines, there are N(N-1)/2 = 139 128 possible double contingencies, but only 2116 of these double contingencies are the contingency motif.If we assume each double contingency occurs with equal probability, then the probability of a contingency motif occurring would be 2116/139 128 = 1.5%.However, observing the system for 14 years, we find that 81% of the double contingencies that actually occurred are the contingency motif .Since the contingency motif is much more likely to occur, we can efficiently capture much more of the probability of the realistically occurring double initial outages with the contingency motif than by random selection from all double outages.
On reflection, it is not surprising that the contingency motif of two lines with a common bus occurs much more often, since the status of the two lines can be linked by both proximity and various details of the protection system and substation layout.But we can generalize this insight to N-3 and N-4 and quantify it statistically in order to capture most of the probability of N-k outages.The reason is that N-3 and N-4 do occur in practice and their outage data is enough for making statistically meaningful inference; however, N-k for k>4 is rare and they only account for a small portion of multiple line outages, and thus there is not enough historical data for statistical analysis.
This paper • defines contingency motifs of the power network and finds that multiple initial line outages occur much more frequently in contingency motifs.
• develops a probabilistic model and sampling schemes for multiple contingencies.• shows that the new sampling schemes and contingency lists account for most of the probability of multiple initial line outages.• applies to standard utility data, and analyzes historical contingencies from two large North American utilities.
II. LITERATURE REVIEW Researchers have proposed various model-based methods for contingency selection.The Performance Index (PI) method approximates an index for a contingency that reflects the impact on the violation of line flows or voltages [3].Then contingencies are ranked based on PIs, and those having large PIs are put on the contingency list.The PI method has a trade-off between approximation accuracy and computation speed.For improving approximation accuracy, [4] forms a PI based on three margin indices for unbalanced systems in terms of currents, voltages, and reactive power.The margin indices use the deviation beyond limits instead of the absolute values.[5] proposes two PI-based methods considering distribution networks.For speeding up computation, [6] presents an algorithm of fast N-2 contingency selection.Based on the linear power system model, it derives a set of constraints that describe the line flow overload; whenever some contingencies are considered to be credible, the algorithm constructs new constraints to identify more credible contingencies.
Besides the PI method, some works [7]- [9] first select individual critical components and then extend them to multiple contingencies.The final list is generated by screening for risky contingencies based on topological metrics or time-domain simulation.[7] describes a security assessment tool that incorporates a probabilistic contingency selection module.This method identifies critical individual components according to the conditional probabilities of individual component outages given different threats (such as bad weather, environment, aging, sabotage).It then enumerates single and multiple contingencies based on critical individual components and computes topological metrics to screen for contingencies.Instead of conditional probabilities, [8] uses a metric based on Line Outage Distribution Factors to select critical individual components.Then, line candidates for multiple contingencies are those within a specified distance from the selected critical individual components, and multiple contingencies are ranked according to the betweenness centrality.Compared to the above two methods, [9] extends pre-selected contingencies only when the post-contingency state is stable, and candidates are the components that are impacted most by the pre-selected contingencies.This method uses time-domain simulation to evaluate the system voltage stability.[10] proposes a method of forming an N-k contingency list based on substation configurations.The idea is that the protection system forms functional groups, in which components outage together because of bus configurations and protection schemes.
Moreover, statistical sampling and optimization are proposed to select multiple contingencies.[11] proposes a Random Chemistry sampling to identify large collections of multiple contingencies that initiate cascading outages.Specifically, Random Chemistry starts with a relatively large random subset of components that cause cascading outages, and then reduces the size of that subset recursively until a minimal subset is found.This minimal subset is a multiple contingency that leads to cascading, and it is minimal because any subset of it does not cause cascading.In a different approach, [12] studies the statistical properties of critical N-2 contingencies in terms of their locations in the power network, which can be used to identify critical lines.Mixed-integer linear programming is also used to identify credible contingencies.The objective is to maximize either the risk [13] or the incremental risk [14] of contingencies, and a recursive algorithm is used to select a list of credible contingencies.However, the optimization method is inadequate to generate a large collection of contingencies in a limited time.[15] formulates a mixed-integer non-linear programming problem to identify multiple contingencies that cause a large load shed.Two algorithms using power flow sensitivity and a topological metric reduce the search space to speed up computation.
Instead of assessing cascading risk, one can pose a different question that starts from a large blackout caused by a very large contingency and then asks: what is the minimal multiple initial contingency that causes the large blackout?This question is addressed by a Random Chemistry algorithm in [11].
Industry practice for contingency selection requires all single contingencies and some of multiple contingencies.The North American Electric Reliability Corporation (NERC) established a standard TPL-001-4 [16] about categories of contingencies to be adopted in transmission system planning.[17] discusses in detail and models the seven categories of contingencies in TPL-001-4.A conventional continuous Markov Chain is used.Its parameters, such as failure and repair rates, are estimated from outage data collection systems, such as the Transmission Availability Data System (TADS).The model evaluates the probabilities of different categories of contingencies; it does not consider specific contingencies in each category.[18] also discusses the standard as well as the practice of contingency analysis.It presents the experience of selecting multiple contingencies for analysis in power system planning.ISO New England [19] is developing a tool that calculates probabilities of multiple contingencies given weather conditions.Multiple contingencies are constructed from independent single contingencies.These single contingencies either have high probabilities or are selected by operators because they are in major power grid interfaces.
Commercial software, such as TRELSS [20] and PSSE [21], have user-specified contingencies and automatic multiple contingency selection modules.User-specified contingencies could have common-mode contingencies, protection control group contingencies, and other specified contingencies.The automatic multiple contingency selection assumes independent individual outages in a multiple contingency, and the PI method is used for selection.For example, in the case of an N-2 contingency, the first outage is enumerated and ranked according to PI, and the second outage is enumerated and ranked according to PI in a subnetwork without the first outaged component; then, a combination of the first outage and the second outage is formed as an N-2, and the rank of N-k contingencies is determined by the PI of the first outage and then the PI of the second outage.
The definition of multiple contingencies varies in different contexts: the meaning of k in N-k is different.NERC considers both primary and secondary devices, and k is the number of outaged devices.On the other hand, in the definition of PSSE and [19], N-1 could also be a multiple contingency under its protection scheme, which is a protection control group in [10].However, [10] considers this contingency as an N-k, where k is the number of outaged circuits.In this paper, N-k represents a contingency involving k transmission lines.
As utilities are routinely recording outage data, data-driven and probabilistic methods for contingency selection are possible and promising but are not studied as much.One datadriven approach [22] proposes a Bayesian hierarchical model to estimate outage rates of individual transmission lines considering line dependencies.Expert knowledge is of course distilled from the experience of real outages, but there is an opportunity for statistical analysis to not only confirm and quantify the expert knowledge but also reveal more hidden findings that are not easily learned from experience.Motivated by this opportunity, in this paper we analyze real outage data to find historical contingency patterns and propose systematic sampling schemes for multiple contingency selection.
It is proven in cascading simulation that initial outage spatial correlation has a substantial impact on assessing cascading risk [23], [24].In general, the increased correlation of close initial outages increases the cascading risk.This finding also motivates us to examine initial outage spatial patterns in real outage data.
By analyzing outage data recorded over ten years in two large power transmission systems, we find that multiple line outages occur more frequently in some spatial patterns.This idea is inspired by the network motif concept.Network motifs are recurrent and statistically significant subgraphs of a network that are first introduced by complex network and biology researchers to analyze gene regulation networks [25], [26].Network motifs are widely used in gene regulation networks in systems biology and successfully applied in ecological, sociological, and epidemiological networks [27].
There are studies applying network motifs to power systems.Ren et al. propose network motifs as an indicator of cascading outage risk [28].They show that cascading outages exhibit three phases as the load level increases, and the phases correspond to the decrease of the frequency of network motifs.The frequency of motifs reflects the connectivity of the power grid; hence, it can be a warning sign of the cascading outage risk.Other researchers have studied network motifs as an indicator of power grid robustness and reliability using techniques from network science [29]- [32].Specifically, they carry out attacks on the power grid by removing nodes according to some order, and monitor network motif properties such as concentration, z-score, and lifetime.Then they determine the robustness and reliability of the network based on the idea that a robust network tends to preserve longer its motif-based measurements.
Researchers also study outage patterns using influence/interaction graphs [33]- [38].An essential difference from this work is that they study propagation patterns of cascading outages, while this work aims at revealing spatial patterns of initial simultaneous outages for better contingency selection and risk estimation.However, influence/interaction graphs generated from simulated cascades could be improved by using better contingency lists for the simulated cascades.Influence/interaction graphs generated from utility data can empirically account for the frequencies of initial outages [37].It is also feasible [38] to simulate an influence graph from assumed initial conditions, and this could use better contingency lists.
The previous work on network motif applications in power systems uses the conventional definition of network motifs, which defines motifs as connected subgraphs in a network that occur significantly more frequently than in a random network [26].However, this definition is not well suited to contingency selection because the power network is not a random network; it is a particular network of known structure.Moreover, multiple contingencies can also be disconnected subgraphs.Therefore, in this paper, we newly define contingency motifs as connected or disconnected subgraphs that occur significantly more frequently than random subgraphs of the particular power network under consideration.

III. MULTIPLE LINE INITIAL OUTAGES FREQUENTLY OCCUR IN CONTINGENCY MOTIFS
We analyze 19 years of historical outage data recorded by Bonneville Power Administration (BPA) and publicly available at [39].The first 14 years of data are used for analysis and the last 5 years of data are used for testing.The power transmission network is deduced from the outage data itself using the method in [40].The automatic line outages are grouped into cascades and then generations according to the outage times 1 .Then, multiple initial line outages are extracted from the first generations of cascades and represented as subgraphs of the power network.Some patterns are frequently recurrent, and we adapt the network motif concept to represent these frequently occurring patterns as contingency motifs.
This section first describes the statistics of random patterns of the power network and statistics of patterns observed in historical outages, and then it gives the definition and identification of contingency motifs.

A. Subgraphs and patterns in the power transmission network
The BPA power transmission network is shown in Figure 1.Substations correspond to network nodes, and transmission lines correspond to network edges.The power grid has multiple transmission lines between some substations, and they are represented by one line in this network.
A k-edge subgraph is an edge-induced subgraph, which is a subset of edges of a graph together with nodes that are their endpoints.For example, {1−3, 1−6} is a two-edge subgraph of the graph in Figure 2. When an N-k contingency occurs, we can imagine that the k outaged lines in the power network are highlighted, and we observe a subgraph.Thus, each N-k contingency corresponds to a subgraph.
Two subgraphs are isomorphic when there is a mapping between their nodes such that two nodes adjacent in one subgraph implies that the corresponding two nodes in the other subgraph are also adjacent.We say that two subgraphs are the same when their nodes and edges are exactly the same.For example, in Figure 2, subgraphs {1 − 3, 1 − 6} and {1 − 5, 4 − 5} are isomorphic, but are not the same subgraph.A pattern is a set of isomorphic edge subgraphs.S k,i denotes a pattern that is a set of subgraphs s k,i , where k is the number of edges and i is the pattern identifier.An exception is S 4, * , which denotes the set of 4-edge subgraphs that are not members of S 4,i for i = 1, 2, 3, 4. Table I shows patterns of the BPA network and the number of distinct subgraphs in each pattern (the size |S k,i | of the pattern).As contingencies are always grouped according to the number of outaged components k, subgraphs are also grouped this way.

B. Probability of patterns
Let P (S k,i |k) be the probability that a pattern S k,i appears in contingency subgraphs given the number of lines k.Two  1) Uniform assumption: Suppose the network has N nodes.If no other information is available, it is natural to assume that contingency subgraphs occur uniformly in all the N k possible k-edge subgraphs of the network.That is where p uni s k,i denotes the probability of a particular subgraph s k,i given k, and "uni" indicates uniform.We call this the uniform assumption.
|S k,i | is the number of subgraphs of the network in S k,i .Then P (S k,i |k) under the uniform assumption is as shown in the fourth column of Table I.
2) Empirical probability: P (S k,i |k) estimated from the outage data is where n k,i is the number of contingency subgraphs s k,i appearing in the outage data, and n k is the number of k-edge contingency subgraphs in the network.Note i n k,i = n k .P (S k,i |k) is shown in the last column of Table I.
3) Discussion: Table I shows that the probabilities from outage data differ greatly from the uniform assumption.For example, P (S 2,1 |2) is much greater than P uni (S 2,1 |2), and P (S 3,1 |3) is much greater than P uni (S 3,1 |3).This implies that some patterns recur much more frequently than indicated by the uniform assumption.These frequently recurrent patterns are the contingency motifs discussed in the next subsection.

C. Contingency motif definition
The conventional definition of a network motif [26] considers connected subgraphs with a specific number of nodes.For example, possible size-3 motifs in Figure 2 are subgraphs {1 − 3, 1 − 6, 3 − 6} and {1 − 5, 4 − 5}.Conventional motifs are detected by computing the frequency of each pattern and comparing it with the frequency of the same pattern in random networks with the same global property, such as degree distribution, as the original network [26], [42].However, the power network is a particular, non-random network, contingency subgraphs can be disconnected subgraphs, and contingencies are grouped according to the number of lines, not the number of nodes.Therefore, the conventional definition of the network motif cannot be directly applied, and we define contingency motifs as follows.
Instead of comparing the frequency of a pattern in outage data to that in a random network, we compare the frequency of the pattern to that in subgraphs sampled randomly from the particular power network under consideration.We define a k-edge contingency motif in a power network as a k-edge pattern whose probability of occurrence is significantly greater than that when all k-edge subgraphs in the power network are assumed to have the same probability of occurrence.That is, where a ≥ 1 is large enough to indicate a significant difference.We choose a = 10 in this paper.For example, to determine 3-edge motifs, we estimate the probability of S 3,i for all i from outage data, and then compute the probability of S 3,i in the network under the uniform assumption.If the probability of a pattern in outage data is significantly statistically greater than that under the uniform assumption according to (4), the pattern is a contingency motif.

D. Contingency motifs in the power network
To detect a contingency motif from data, we compare the probability of the contingency pattern observed in outage data and the probability of that pattern under the uniform assumption.This problem can be formulated as a hypothesis test: We use frequentist and Bayesian methods to do the hypothesis test as detailed in Appendix A, and both tests identify the same motifs: S 2,1 is a 2-edge contingency motif, S 3,1 , S 3,3 , S 3,4 are 3-edge contingency motifs, and S 4,1 , S 4,2 , S 4,3 , S 4,4 are 4-edge contingency motifs.The test results are shown in Table II, including the p-value in the frequentist hypothesis test and the posterior probability P (H 0 |n k,i ) in the Bayesian hypothesis test.

E. Discussion
We only consider transmission line outages.In terms of physical elements in power systems, multiple contingencies involve primary devices (generators, lines, transformers, compensators, circuit breakers, bus-bar sections) and secondary devices (protections and telecommunication equipment).Outages of these devices can result in both single and multiple contingencies of transmission lines.NERC standard TPL-001-4 describes seven categories of contingencies related to various devices [16], and they can be further grouped into four types: N-1, N-1-1, N-2, and N-k for k>2.For example, category P3 are single-phase short circuit to ground faults of a bus-bar section; if the bus-bar section connects k lines, then an N-k line contingency occurs.
Multiple line outages can be divided into dependent and independent contingencies.Dependent outages are closely related to bus configurations and protective relays.It needs a lot of effort to build a detailed power system model including relays [43].Scheduled maintenance and forced outages change the topology of the power network, and hidden failures in the protective relay system are inevitable.There are also commonmode multiple contingencies2 that are caused by extreme weather or other external factors [44].
Two-edge stars in S 2,1 could be two transmission lines connected to the same substation faulted simultaneously by coincidence, common-mode contingencies of two lines, a circuit breaker or a tie break stuck in a breaker and a half substation, primary protection fails and zone 2 protection is activated, a fault in a bus-bar section connecting two transmission lines, or a hidden relay system failure, etc..In the first cause, the two line outages are independent because one line outage does not cause the other line outage; while for the rest of the causes, the two line outages are dependent on physical or engineered structure.Thus, S 2,1 as a motif usually reflects some inherent dependence of two lines.
The causes for three-edge and four-edge stars could include faults of transmission lines connected on the same bus-bar section, faults of bus-bar sections, transformers outages, breaker stuck, etc. S 3,3 is composed of three lines in a row.A possible cause is that these three lines are in a protection control group.S 3,4 is a triangle, which is a special local structure in the power network that is limited in number.
The precise physical or engineering dependencies causing a specific motif are not clear without detailed knowledge of a system, but the existence of the motif underlines the importance of studying multiple contingency mechanisms in detail.

A. A probabilistic model for multiple line outages
As multiple line outages show different patterns and some are contingency motifs, we partition the whole contingency space accordingly.Moreover, some patterns are disconnected subgraphs, and they have different network diameters 3 .As the network diameter follows a Zipf distribution as discovered in [46], we further partition disconnected patterns according to their diameter.The Zipf distribution has a heavy tail, implying that multiple line outages containing far-away lines do occur.
The partition is illustrated in Figure 3.The ellipse represents the space of contingency subgraphs, including N-2, N-3, and N-4.According to the different patterns, N-k contingencies are further divided into groups S k,i .Furthermore, disconnected S k,i are divided into subgroups according to their diameters.Each cell represents a S k,i with a specific diameter d.A key assumption is that multiple contingencies s k,i in each cell have equal probabilities.
We build a probabilistic model to estimate the probability of the multiple line outage s k,i with k lines based on the statistics of outage data.That is, where P (k) is the probability of k line outages, P (d|S k,i ) is the probability that pattern S k,i has diameter d, and P (s k,i |d, S k,i ) is the probability of a specific multiple contingency given its pattern and diameter.
1) Probability of the number of line outages: It is natural to estimate the probability P (k) by The distribution of k for the BPA data is P (k = 2) = 0.72, P (k = 3) = 0.24, P (k = 4) = 0.04.N-k contingencies for k>4 are not considered because of their very rare occurrence.
2) Probability of a pattern given k line outages: P (S k,i |k) is estimated during the detection of contingency motifs and is shown in Table III.

3) Probability of the contingency diameter given its pattern:
The connected contingency subgraphs have patterns S 2,1 , S 3,1 , S 3,3 , S 3,4 , S 4,1 , S 4,3 , S 4,4 .For the connected contingency subgraphs, the diameter is constant or very nearly constant 4 .Therefore, we take the diameter distribution of the connected contingency subgraphs to have probability 1 at a constant diameter.
The disconnected contingency subgraphs have patterns S 2,2 , S 3,2 , S 3,5 , S 4,2 .For all the disconnected contingencies combined together, we empirically estimate from the outage data the distribution of diameter P (d|disconnected).The disconnected contingencies are all combined together to calculate the single distribution P (d|disconnected) because of the limited outage data for these subgraphs.P (d|disconnected) is the number of disconnected subgraphs with diameter d divided by the total number of disconnected subgraphs.
S k, * has both connected and disconnected subgraphs, but a small probability.Therefore we set the diameter distribution to have probability 1 as for the connected subgraphs.However, we can only determine the value of d when a specific s k, * is given to make it a valid probability distribution, and if s k, * is given, d equals the diameter of this specific s k, * .
In summary, the diameter distribution conditional on pattern S k,i is estimated by Note that (7) does not explicitly express the support (all possible values) of d for the different S k,i because it is obvious.For example, S 2,1 has d ∈ {1}; 4) Probability of a contingency given its pattern and diameter: Finally, assume that subgraphs of a pattern S k,i with 4 s 2,1 , s 3,1 , s 3,4 , s 4,1 have diameter 1 and s 3,3 , s 4,3 have diameter 2. Although the diameter of s 4,4 is 2 or 3, 98% of s 4,4 have a diameter of 3 in the BPA network.
diameter d are uniformly distributed.That is, P (s k,i |S k,i , d) is the discrete uniform distribution

B. Probabilities of multiple line outages
Given a specific multiple contingency s k,i , we can estimate its probability with ( 5) by substituting values in Table III, and computing probabilities with ( 6), ( 7), (8).Table V shows the probability of any contingency s k,i in a pattern with diameter d. S 4, * is not included because it has a great number of distinct s 4, * , and thus each s 4, * has an extremely small probability.Table V confirms that motifs have higher probabilities than other patterns.V. FORMING A CONTINGENCY LIST A contingency list is a sample of contingencies to assess the risk of cascading outages and other system violations such as line flows and voltages exceeding limits.The risk of cascading outages is often defined as the expected value of the impact [1].Three factors are considered in estimating the risk: (1) the probability of a contingency; (2) the probability distribution of cascading outage sizes, whose uncertainty also comes from pre-contingency system states, model parameters, and how the cascade evolves; (3) the size and impact of the blackout.The cascade size and impact are usually estimated through power system simulation.The risk R(s) of contingency s with impact c(s) is R(s) = P (s)E(c(s)), where P (s) is the probability of contingency s and E(c(s)) is the expectation of the cascade impact 5 , which can be estimated by Monte Carlo simulation given the initial contingency sample s.The overall system risk is then the average of the risk of individual contingencies.A contingency list that efficiently samples from a large fraction of the probable contingencies is fundamental to this risk calculation and is the subject of this paper.Therefore, this section forms a contingency list from the contingency motifs for the BPA data.

A. Straightforward sampling of contingencies
The probabilistic model of ( 5) implies a straightforward sampling scheme for multiple line outages with four steps: The first three steps are straightforward as the corresponding random variables have a small number of discrete values.The fourth step is tricky because there is no effective way to find all subgraphs s k,i with pattern S k,i and diameter d, and randomly draw one of these subgraphs.For example, it is difficult to describe the 1 083 833 subgraphs in S 3,2 .Instead, we sample a s k,i by drawing lines sequentially.For N-2, first draw a line randomly; then find all lines that are at distance d from the first line; finally, randomly draw a second line so that the first line and the second line form an N-2. Figure 5a illustrates the steps of sampling a s 2,2 with diameter 2 using the small system in Figure 2.For N-3, draw the first line randomly and draw the second line that is at distance d from the first line, as we do for N-2; then randomly draw the third line from lines that have a distance not greater than d from either the first or the second line and form the desired pattern together with the previous two lines.Figure 5b illustrates the steps of sampling a s 3,2 with diameter 2. For connected N-4, the distance is fixed.We first sample an N-3 that is a subgraph of the desired pattern, then sample the last line randomly from lines that can form the desired N-4.For s 4,2 , we first draw a 3-edge star s 3,1 and then draw a line that is maximum distance d from any of the three lines in the star.For s 4, * , we randomly sample 4 lines; if they do not form s 4, * , we sample again until they form a s 4, * .
We will get a s 4, * with only a few trials because its probability under the uniform assumption is high as shown in Table I.
Using the sampling scheme, we draw 10 000 N-2, N-3, and N-4.It is possible that we sample a contingency that is already sampled.In this case, we discard this contingency so that there is no repetition in the samples.This is actually sampling without replacement, which is less variable than sampling with replacement [47].The three most likely contingencies at the top of the contingency list are {{348−365, 348−385}, {342− 378, 350 − 378}, {340 − 353, 340 − 354}}.They are in S 2,1 and have the same probability 0.0003.
Let M (r) be the percentage of outages in the test data (last 5 years of outage data) that are covered by a contingency list with r contingencies.To evaluate the performance of the straightforward sampling scheme, we compute M (r) of the proposed contingency list and compare it with the same size list produced by a random scheme that treats all N-k as equally likely.That is, the random scheme samples an N-k by drawing a k according to P (k) and then drawing k lines randomly from all lines.
Figure 6 shows how M (r) increases as r increases.The straightforward sampling is much more efficient than the random sampling.Since we are using a sampling method, we draw ten lists with size r to estimate the mean and standard deviation of M (r).For the straightforward sampling, the average M (10 000) is 82% with standard deviation 2%; while for the random sampling, the average M (10 000) is only The straightforward sampling scheme is designed so that outages with high probabilities are more likely to be drawn at an early stage.As a result, the solid curve in Figure 6 has a high slope at the beginning and then the curve flattens near the kink at r = 3000.When the slope of the straightforward sampling is higher than the random sampling, we are in a region where increments of effort in further sampling perform better than the random sampling.The kink is one possible indicator of stopping further sampling.Thus, we can use the first 3000 multiple contingencies to form a contingency list for detailed analysis.They cover about 70% of the outages in the test data; in contrast, the first 3000 random contingencies only cover about 4%.
The straightforwardly sampled contingencies have high coverage of outages in test data.It shows that the contingency motifs capture the spatial statistics of multiple line outages.

B. Stratified sampling of contingencies
The straightforward sampling scheme can be improved to stratified sampling, using motifs as strata.It is easy and flexible to implement and leads to more precise risk estimates than straightforward sampling.
Three contingency motifs (S 2,1 , S 3,1 , and S 3,4 ) in the sampled contingencies in Section V-A account for 78% of the probability of multiple outages in the test data.As shown in Table V, any individual s 2,1 , s 3,1 , or s 3,4 also has a higher probability than others.Another reason that we only consider these three motifs is that other motifs have a large number of distinct subgraphs (see Table I) and these three motifs can explain most of the probability.Therefore, we choose each of the motifs S 2,1 , S 3,1 , or S 3,4 as a stratum and all other patterns as a single stratum.Accordingly, ( 5) is rewritten as P (s k,i ) =P (s k,i | S 2,1 )P (S 2,1 ) + P (s k,i | S 3,1 )P (S 3,1 )+ P (s k,i | S 3,4 )P (S 3,4 ) + P (s k,i | S * )P (S * ) (9) where S * represents any pattern that is not S 2,1 , S 3,1 , or S 3,4 .Note that only one of the four terms on the right-hand side of ( 9) is not zero for a specific s k,i , so P (S k,i ) = P (S k,i |k)P (k), and P (S * ) = 1 − P (S 2,1 ) − P (S 3,1 ) − P (S 3,4 ).Directly calculating with the probability estimates P (S k,i ) in (9) instead of sampling proportional to these probabilities in the straightforward sampling scheme gives more precise estimates of P (s k,i ).
There are various choices of the number of samples in a stratum.One way is allocating samples according to their probabilities.That is, the number of samples in each stratum is proportional to its probability P (S k,i ).If a list needs 3000 contingencies, then we would have 3000×P (S 2,1 ) = 1748 , 3000×P (S 3,1 ) = 420 , 3000×P (S 3,4 ) = 18 , and 814 s * .As shown in Section V-A, this list accounts for 70% of outages in the test data.For strata of the three motifs, we randomly sample contingencies uniformly according to the fourth step of the straightforward sampling; for the stratum S * , we can use the full straightforward sampling but exclude the previous three motifs when evaluating conditional probabilities in (5).
The flexibility of the stratified sampling allows us to give more consideration to other factors.We may sample more contingencies from strata that we are interested in, or sample more contingencies from strata that generally have a high impact.For example, we could sample more contingencies from N-3 because N-3 generally has a higher impact than N-2.Stratified sampling reduces the variance of the estimate when there are more samples in the stratum with high probability than with low probability when the stratum variance is the same [47,Ch. 5.5].
Another example of the flexibility of stratified sampling is that it could be used in future work to better estimate risk.To evaluate the risk, the expected impact of contingencies in each stratum can be estimated by the average impact of contingency samples in that stratum, and the system risk is the weighted sum of the strata impacts with stratum probabilities as weights.Stratified sampling can reduce the variance of the risk estimates because contingencies are more homogeneous in each stratum than between strata, fewer samples are needed to obtain a precise estimate for each stratum, and combining these estimates for the whole population can be less variable.

C. Deterministic contingency list
A fixed or deterministic contingency list that involves no sampling could also be used.As shown in Table I, there are 2116 s 2,1 , 4653 s 3,1 , and 62 s 3,4 .The total number of these three motifs is 6831, which can be simulated in an acceptable time.Therefore we can make a contingency list that samples all the contingencies in the three motifs.This neglects the S * contingencies, but gives a deterministic contingency list that accounts for 78% of multiple outages in the test data.

VI. TEST RESULTS ON A SECOND SYSTEM
This section applies the analysis of the previous sections to a second transmission system with outages recorded by the New York Independent System Operator (NYISO) and summarizes the results, which turn out to be similar.The NYISO transmission system outage records cover New York State and parts of neighboring US states and Canadian provinces, with more network detail in New York State.The NYISO outage data is publicly accessed from its website [48] and processed according to Carrington's method in [49].Twelve years of data are used here, spanning from 2008 to 2020.The NYISO network formed from the outage data using the method of [40] has 1695 lines.The outage data do not have enough samples of N-4.Therefore, only N-2 and N-3 are considered.The first 10 years of data are used for training and the last 2 years of data are used for testing.
The contingency motifs identified in the NYISO data are S 2,1 , S 3,1 , S 3,3 and S 3,4 , which are the same 2-edge and 3-edge motifs as in the BPA data.Then, we form the probabilistic model and sample contingencies according to the straightforward sampling scheme.10 000 contingencies turn out to cover 74% of outages in the test data (0.9% standard deviation).This shows that these contingency motifs also capture the spatial characteristics of the NYISO multiple initial outages.
A contingency list is formed by the stratified sampling scheme.Motifs , , , and all remaining patterns S * compose four strata.The consideration is the same as the BPA case: any individual subgraph in the three motifs has a higher probability than others; these three motifs already account for the most probability (75%) of multiple outages; and the remaining motif have a large number (19 911) of distinct subgraphs (moreover, there are 6305 s 2,1 , 14138 s 3,1 , and 247 s 3,4 ).Since the NYISO network has more transmission lines than the BPA network, we form a contingency list with 10 000 contingencies for demonstration.Allocating samples proportional to probabilities, this list contains 6305 s 2,1 , 809 s 3,1 , 247 s 3,4 and 2639 s * , which accounts for 72% of multiple initial outages in the test data.

VII. DISCUSSION
The contingency motifs could confirm or inform industry contingency selection by indicating multiple contingencies with high probabilities.Given motifs in a power network, engineers with field knowledge can better identify vulnerable locations in the network for further analysis.In practical contingency analysis, contingencies selected from the contingency motifs could be refined by incorporating engineering considerations such as substation bus configurations.
Because transmission line automatic outage data is sparse, it is routine to group together lines by characteristics such as voltage rating and component type to determine outage probabilities of the lines in the group.The grouping of multiple initial line outages into contingency motifs is similar, except that the groups are formed by a spatial pattern the multiple outages share, and it is a similarly pragmatic way to mitigate the sparsity of multiple line outage data when estimating the probabilities of multiple initial outages.
The method of this paper relies on a single database of detailed historical outage data that is routinely collected by transmission utilities in North America and also by many utilities worldwide.The initial multiple outages can be readily extracted from the data and located on the network to find the contingency motifs and estimate their probabilities, so that the most probable contingencies can be sampled first.The network can be obtained from an inventory associated with outage data or, as we do in this paper, from the outage data itself.Since their own data is available to each utility and the computations are not difficult, we would first recommend that contingency motifs be found and applied with specific utility data to get improved contingency lists to estimate cascading risk from simulations.However, specific utility historical outage data may not be available, since the simulated system may not have associated historical data, or the study is done outside a utility.Then the risk assessment could still realize some similar benefits by relying on the overall similarities in power transmission system physics and engineering to use the contingency motifs , , and that we observed in two different transmission systems.
This paper proposes an improved method of sampling multiple initial line outages, which is only one part of assessing cascading risk.We briefly comment on how the new method fits into assessing cascading risk.The new method can simply be substituted for the various sampling schemes with explicit or implicit uniform probability assumptions, and then combined with the sampling of the grid conditions and stresses, and the sampling of any cascading outages and interactions that may occur after the initial line outages to evaluate a cascading outcome.We note that for risk analysis it is necessary to sample likely as well as less likely possibilities across as much of the sample space as possible.For comprehensive reviews of cascading risk assessment we refer the reader to [1], [50].Note that [51] explains the close relation between deterministic and probabilistic framings of cascading.
One feature of the new method that arises from it being directly driven by observed line outage data is that the initial line outage underlying causes need not be modeled and analyzed.The pattern of initial line outages in a given motif can arise from a variety of causes and mechanisms, but for the purpose of sampling initial line outages better, we only need to know the observed outcome of all these causes and mechanisms as it is expressed in the frequency of the motif.This point is specific to sampling initial outages for risk analysis, we are not suggesting in any general way that the underlying causes are unimportant; indeed the underlying causes are vital to good engineering to mitigate the risk of specific situations.
There are some specific threats to grid security that can have atypical patterns of initial outages related to the specific threat, such as terrorism, war, and solar geomagnetic disturbances.In a grid in which one of these threats is rare, these atypical patterns will be rare in the historical record, and the motifs extracted from the historical record will tend not to include these atypical patterns.The extracted motifs correctly summarize the historical probability structure of the multiple outages in that grid and are appropriate for use in risk analysis on that basis.On the other hand, one might choose to defend the grid against one of these threats that has rarely occurred in the past, and in that case we would suggest that each of these threats requires separate analysis with their own initial contingency lists.However, our improved contingency lists for multiple initial outages based on historical data can be expected to have some amount of overlap with these special contingency lists for two reasons.First, in general, better screening of multiple initial outages is broadly helpful because all threats share the same grid topology and protection systems.Second, in particular, many motifs correspond to outages related to a single substation, and efficiently augmenting the contingency analysis with these motifs tends to cover some of the starting impact of these threats if that threat starts with damage to a single substation.Earthquakes are another threat that may require separate analysis because they tend to have an unusually large number of initial outages occurring in the same minute in a specific area of the grid, but can be rare in any given specific area, giving only a few cases in recorded data.
In implementation, we expect the new sampling methods to be used offline to generate a fixed contingency list.Then it takes the same time online or offline as any other contingency list, which is a small part of the time required for cascading simulation.

VIII. CONCLUSION
In going beyond N-1 security, contingency lists of multiple initial line outages are foundational for assessing cascading risk and the security of power transmission systems.We analyze the spatial patterns of multiple automatic line outages that occurred in the same minute at the start of a cascade from historical outage data.Some patterns occur significantly more frequently in outage data than in random subgraphs of the particular power network under consideration.We call these patterns contingency motifs.The existence of contingency motifs is the result of complex physical and engineering dependencies in power systems.
Three contingency motifs ( , , and ) account for most of the probability of multiple line outages in both BPA and NYISO historical data.A contingency list formed from these contingency motifs is much more efficient than random selection or exhaustive listing.This improved contingency list can easily improve simulations that evaluate cascading risk.Specifically, to assess the cascading risk, we can use all contingencies of the contingency motifs or sample a desired number of multiple initial outages according to a sampling scheme.A stratified sampling scheme is flexible and effective.
We show that sampling based on motifs works on the outage datasets of two transmission systems.We expect that it will also be applicable to other transmission systems, and we hope that others will be able to confirm this with the outage data available to them.
Contingency motifs can substantially improve the contingency lists and the risk estimates obtained when assessing cascading risk with respect to N-k contingencies with simulations.

IX. ACKNOWLEDGMENTS
The authors gratefully thank Bonneville Power Administration and New York Independent System Operator for making publicly available the data on which this paper is based.The analysis and any conclusions are strictly those of the authors and not of BPA or NYISO.Support from NSF grant 2153163 is gratefully acknowledged.

Fig. 1 :
Fig. 1: BPA power transmission network (528 lines) derived from the outage data [40].Highlighted subgraphs are five examples of multiple line outages.Layout is not geographic.
methods are used to estimate P (S k,i |k): one based on the uniform assumption and the other based on outage data.

Fig. 3 :
Fig. 3: Partition of the contingency subgraph space.Each cell represents a pattern S k,i with a specific diameter d.Multiple contingencies s k,i in each cell are assumed to have equal probabilities.

Fig. 6 :
Fig. 6: M (r) for the straightforward sampling and the random sampling.The curves do not start at 0 because we compute M (r) for r = 100, 200, 300, ...

TABLE I :
Probabilities of patterns in BPA data and random subgraphs * is the set of 4-edge subgraphs that are not in S 4,i for i = 1, 2, 3, 4.

TABLE II :
Contingency motifs of the BPA data

TABLE III :
Distribution of patterns P (S k,i |k) for BPA data.
* P (S k,i |k) 0.391 0.217 0.217 0.087 0.087 |S dk,i | denotes the number of subgraphs in S k,i with diameter d. |S d k,i | is approximated by uniformly sampling a large number of s k,i and computing their diameters, as shown in TableIV.

TABLE V :
Probability of outages with different patterns and diameters