Identifying Critical Regions in Industry Infrastructure: A Case Study of a Pipeline Network in Kansas, USA

In the face of budget cuts and the increasing size of industry infrastructure, one of the top priorities of industry infrastructure protection is to identify critical regions through vulnerability analysis so that limited resources can be allocated to those regions. Unfortunately, existing approaches to vulnerability analysis face difficulties: some are unavailable due to insufficient data, while others are susceptible to human biases. Here, we propose an approach that overcomes these difficulties based on the location data of failure events. Critical geographic regions are determined by ranking the risks of different candidate regions. Risk is calculated by integrating the probability that a failure event occurs (risk uncertainty) and the total failure cost (the severity of the failure consequences) in each candidate region. By changing the modeled object from individual components to the region in which the whole industry infrastructure is located, the approach aggregates the rare failure events dispersed across different positions of the infrastructure, providing sufficient data; the probability can then be obtained using a Poisson point process and kernel density estimation. Meanwhile, hypothesis testing verifies the correctness of the assumptions used in the approach, which shields it from human biases. Finally, a case study of this approach is performed on a pipeline network in Kansas, USA. In addition to validating the feasibility of our approach, the study shows that risk uncertainty is less instructive for identifying critical regions than the severity of failure consequences.


I. INTRODUCTION
The stable, continuous and safe operation of critical industry infrastructure is threatened by many types of hazards, which have caused numerous failure events. These failure events disrupt the provision of infrastructure services, seriously affecting economic stability and national defense. Therefore, critical industry infrastructure protection has attained great importance in different countries [1]-[4].
As an important part of industry infrastructure protection, the identification of critical regions aims to determine the regions where the failure of elements (assets or components) of an industry infrastructure constitutes a grave threat to modern life [5]. However, this is not an easy task in practice. With the increasing size and complexity of industry infrastructure, many types of factors and many elements must be considered, so identifying critical regions becomes a costly and time-consuming process for a whole infrastructure. Therefore, a screening methodology is proposed.
In the screening methodology, the fragilities of an industry infrastructure are modeled with consideration of different hazards [6]. According to reference [7], widely used modeling approaches include empirical approaches and network-based approaches. Based on the research paradigm described in figure 1, internal hazards (such as corrosion, fatigue and damage) and external hazards (loading, the operating environment and accidents) act on the assets (or components) of an industry infrastructure once those assets are exposed to them. These actions may impact the elements, triggering deterioration, fatigue or other failure mechanisms that result in failure events over time. Finally, the consequences are an unintentional release of hazardous contents, the loss of production and so on. Clearly, the consequences caused by failures are effective, measurable symptoms of the vulnerability of industry infrastructure, and from them critical regions or elements can be identified. Currently, the main difference between empirical approaches and network-based approaches lies in how the consequences are measured.
In empirical approaches [8]-[11], risk is used to measure the failure consequences of components in industry infrastructure. Risk comprises risk uncertainty and severity: risk uncertainty is measured by the probability of component failure, and severity describes the seriousness of the consequence caused by the failure. Based on the assessed risks of different components, the critical components can be determined, and the regions where those components are located are the critical regions. In network-based approaches, the industry infrastructure is modeled as a topological network whose nodes denote the elements of the infrastructure. Under the assumption that the operation of the infrastructure depends on the interdependence of its components (full network connectivity), the action of a hazard is the removal of the node denoting the failed component, the impact is that this removal significantly degrades network connectivity, and the consequence is the severity of the disconnection of the network. In general, disconnection is measured by the degree of the nodes, the global clustering coefficient, and the average shortest path length [12]-[15]. The approach proposed in this paper is an empirical approach; therefore, some recent empirical approaches are reviewed in the following pages.
Despite making a significant contribution to the identification of critical regions, empirical approaches largely rely on empirical data, such as expert judgments [9], [11], [16]-[26]. For this reason, they depend deeply on the judgment of engineers, which makes them subject to human biases and errors. To overcome this problem, approaches based on machine learning have been proposed [27]-[30]. With the help of machine learning, empirical approaches remove the influence of subjective judgment by refining insights from big data. Unfortunately, insufficient data is a barrier to applying machine learning: in practice, the number of failures of any single component or asset in many industry infrastructures is small, so there is not enough data to build a usable machine learning model for each component.
Against this backdrop, an improved empirical approach is proposed. By changing the modeled object from the components of the industry infrastructure to the regions in which the whole infrastructure is located, the failure data become the aggregation of the rare failure events dispersed across different positions of the infrastructure. Therefore, there are sufficient data to apply the proposed approach.
Hypothesis testing is used to verify, before modeling, that the properties of the chosen model (a Poisson point process) are satisfied, which avoids the susceptibility of the analysis to human biases. Kernel density estimation is used to learn the model parameter from the location data of failure events. Therefore, no empirical data are used, which effectively reduces the impact of the subjective preferences embedded in such data.
Using the built model, the probability that a failure event occurs is estimated for the different geographic regions in which the industry infrastructure is located. At the same time, the severity of the failure consequences is measured by the total failure cost. On this basis, risks are assessed in those regions and are used to identify the critical regions.
The rest of this paper includes four sections. Section II explains the proposed approach. Section III presents a case study, which is conducted through a group of experiments. Section IV details the results and discusses the findings. Finally, conclusions are provided in section V.

II. METHOD
The proposed approach includes four steps: verification of the availability of the Poisson point process, model selection, model building and risk assessment. Figure 2 describes the process of the proposed approach.

A. POISSON POINT PROCESS
As a widely used spatial statistical technique, a Poisson point process N can model spatial point data. Such data form a spatial point pattern X, which describes the outcome of a Poisson point process [31]. In theory, a Poisson point process N is defined on a region W ⊂ R², called the observation window, in which we observe the spatial point pattern X = {x_1, x_2, ..., x_n} with points x_i ∈ R². In our research, a point is a failure event that occurs in an industry infrastructure. The number of times a failure event occurs, n(B_i), is a well-defined random variable in a region B_i ⊂ W, where B_i can have an arbitrary shape and location within the observation window W. All the elements of a Poisson point process are intuitively illustrated in figure 3. Here, μ(B_i) denotes the mean number of times the failure event occurs in region B_i and can be calculated by equation (1), where the intensity λ or λ(x) is the number of events per unit area, |B_i| is the area of region B_i, and x ∈ B_i. For a homogeneous Poisson point process, the intensity does not vary spatially and is denoted by λ; for an inhomogeneous Poisson point process, the intensity varies spatially and is denoted by λ(x).
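To make the counting mechanics concrete, the following sketch simulates a homogeneous Poisson point process in a rectangular window and counts the events falling in a subregion. The window size, intensity and subregion are hypothetical illustration values, not figures from the case study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observation window W = [0, 100] x [0, 100] km and intensity
# lambda = 0.01 events per km^2 (illustration values only).
width, height, lam = 100.0, 100.0, 0.01

# For a homogeneous Poisson point process, the total count n(W) is Poisson
# with mean lambda * |W|, and given the count the points are uniform in W.
n_points = rng.poisson(lam * width * height)
xs = rng.uniform(0.0, width, n_points)
ys = rng.uniform(0.0, height, n_points)

# Count n(B) for the subregion B = [0, 50] x [0, 50]; by equation (1) its
# mean is mu(B) = lambda * |B| = 25.
n_B = int(np.sum((xs < 50.0) & (ys < 50.0)))
```

Repeating the simulation many times and averaging n_B recovers the mean μ(B) = λ|B|, which is the quantity equation (1) computes analytically.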

B. VERIFICATION OF THE AVAILABILITY OF THE POISSON POINT PROCESS
This step ensures that a Poisson point process is suitable for modeling the spatial point pattern. Following prior research [32], a null hypothesis is proposed which states that if the proposed approach is applied to the vulnerability analysis properly, then the properties of the Poisson point process should not be violated. Therefore, two hypothesis tests are applied to determine whether this hypothesis should be accepted or rejected.

1) TESTING THE PROPERTY OF THE POISSON DISTRIBUTION OF POINT COUNTS
To verify the first property, we need to determine whether there is an obvious difference between a Poisson distribution and the samples of n(W). First, select the observation window W for collecting samples so that the records of failure events are sufficient to conduct most statistical tests. Based on these records, many samples of n(W) can be obtained for a given period t. Then, the frequency of each value of n(W) can be counted; the frequency count is shown in figure 4.
Next, the significance of the difference between a Poisson distribution and the samples of n(W) is tested with a chi-square goodness-of-fit test. The null hypothesis H_0 states that n(W) follows a Poisson distribution; the alternative hypothesis H_1 states that it does not. The mean μ(W) must first be estimated by equation (2), where c is the number of distinct values of n(W). With the estimated mean μ(W), the Poisson probability p(m_i) can be calculated for each m_i by equation (3). Next, the theoretical frequency f_{e_i} is obtained by equation (4), and the χ² test statistic is calculated by equation (5). According to the level of significance α and the degrees of freedom c−1, the critical value χ²_α(c−1) can be obtained from the chi-square distribution table. If χ² ≥ χ²_α(c−1), we reject H_0; otherwise, we do not reject H_0.
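The test procedure above can be sketched as follows. The monthly counts are hypothetical illustration values, and the degrees of freedom follow the paper's convention of c−1 (some textbooks subtract one more degree of freedom for the estimated mean).

```python
import numpy as np
from scipy.stats import poisson, chi2

# Hypothetical monthly counts n(W) of failure events (illustration only).
counts = np.array([0, 1, 2, 1, 0, 3, 2, 1, 1, 0, 2, 4, 1, 0, 2, 1, 3, 1, 0, 2])

values, freqs = np.unique(counts, return_counts=True)  # observed frequencies
mean_W = counts.mean()                                 # estimated mean, eq. (2)

# Theoretical frequencies f_e from the Poisson pmf, eqs. (3)-(4).
f_e = poisson.pmf(values, mean_W) * counts.size

chi2_stat = float(np.sum((freqs - f_e) ** 2 / f_e))    # eq. (5)
c = values.size                                        # number of distinct counts
critical = chi2.ppf(0.95, df=c - 1)                    # alpha = 0.05
reject_H0 = chi2_stat >= critical
```

For these illustrative counts the statistic is well below the critical value, so H_0 is not rejected and the first property is considered satisfied.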

2) TESTING THE PROPERTY OF INDEPENDENT SCATTERING
In the verification of independent scattering, two sets of counts must be tested with a permutation test [33], [34]. The first set is n(B_1), ..., n(B_m), where the regions B_1, ..., B_m are the candidate regions of the industry infrastructure, some of which will be identified as critical regions. The second set is n(D_1), ..., n(D_j), where the regions D_1, ..., D_j are divided from the whole observation window and will be used in model selection.
First, an initial Moran's I and an expected Moran's I, E[I], must be calculated for each set of counts using equations (6)-(9). In general, if I > E[I], the counts are positively autocorrelated in the spatial dimension; if I < E[I], they are negatively autocorrelated. If there is no significant difference between them, the counts are spatially independent [35]. To test the significance of the difference, a permutation test is applied.
The null hypothesis is that the counts n(B_1), ..., n(B_m) (or n(D_1), ..., n(D_j)) are spatially independent; its negation is the alternative hypothesis. Under the null hypothesis, we rearrange the counts over the regions B_1, ..., B_m (or D_1, ..., D_j) and recalculate Moran's I for every possible rearrangement. This yields many values of Moran's I, whose number is denoted by C. Then, we count the number of these values that are not less than the initial Moran's I; this number is denoted by z. The P-value is z/C. If z/C is less than the level of significance, we reject the null hypothesis.
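A minimal sketch of this exhaustive permutation test is given below. The four counts and the binary contiguity weight matrix are hypothetical; the Moran's I formula is the standard one, which equations (6)-(9) are assumed to follow.

```python
import numpy as np
from itertools import permutations

def morans_i(x, w):
    """Moran's I for values x under spatial weight matrix w."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return (len(x) / w.sum()) * np.sum(w * np.outer(z, z)) / np.sum(z ** 2)

# Hypothetical counts for four candidate regions and a binary weight matrix
# (1 when the centroid distance is within the threshold distance).
counts = [3, 7, 2, 5]
w = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

i_obs = morans_i(counts, w)
expected_i = -1.0 / (len(counts) - 1)   # E[I] under spatial independence

# Exhaustive rearrangement: recompute Moran's I for all C permutations and
# count the z values that are not less than the observed one.
perms = [morans_i(p, w) for p in permutations(counts)]
p_value = sum(i >= i_obs for i in perms) / len(perms)
reject_independence = p_value < 0.05
```

With only four regions, all 24 rearrangements can be enumerated; for larger m, a random subset of permutations is the usual substitute.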

C. MODEL SELECTION
In this step, we answer the question: which kind of Poisson point process is the suitable model, the inhomogeneous one or the homogeneous one? For this purpose, we need to determine the spatial variability of the intensity over the observation window W. First, W is divided into parts D_1, ..., D_j (for example, the division in figure 5), all of which must have equal area a. Then, we obtain the counts n(D_1), ..., n(D_j). Given the conclusion of step 1 that a Poisson point process is the suitable model, we have no reason to doubt that n(D_1), ..., n(D_j) are independent Poisson random variables. If the homogeneous model is appropriate, then n(D_1), ..., n(D_j) should follow a Poisson distribution with mean λa, and there is no spatial variability of intensity. Therefore, we can test whether these counts follow a Poisson distribution to determine the spatial variability of the intensity. For this purpose, we choose the chi-square goodness-of-fit test for its availability compared with other tests.
The χ² test statistic is calculated by equation (10), where n_e(D_j) is the expected count in region D_j, given by equation (11). Using equation (12), the intensity λ is estimated from the area of the observation window |W|. Then, the critical value χ²_α(j−1) is determined from the chi-square distribution table. If χ² ≥ χ²_α(j−1), the counts do not follow a Poisson distribution with mean λa, and the inhomogeneous Poisson point process should be selected; otherwise, the homogeneous one is selected.
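The model-selection test can be sketched as below. The eight quadrat counts are hypothetical, deliberately shaped to mimic strong spatial variability; under homogeneity, each equal-area part has expected count equal to the total divided by the number of parts.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical counts in j equal-area parts D_1, ..., D_8 of the window.
n_D = np.array([2, 31, 4, 28, 3, 35, 1, 37])
j = n_D.size

# Under homogeneity, each part of area a has expected count
# n_e = lambda_hat * a = total / j (in the spirit of eqs. (11)-(12)).
n_e = n_D.sum() / j

chi2_stat = float(np.sum((n_D - n_e) ** 2 / n_e))   # eq. (10)
critical = chi2.ppf(0.95, df=j - 1)

# Rejecting homogeneity selects the inhomogeneous Poisson point process.
inhomogeneous = chi2_stat >= critical
```

Here the statistic far exceeds the critical value, so the inhomogeneous model would be selected, mirroring the outcome reported for the case study.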

D. MODEL BUILDING
In the model selection step, the suitable model was determined. In this step, we need to estimate its parameters.

1) KERNEL DENSITY ESTIMATION TECHNOLOGY
Estimating the intensity λ or λ(x) can intuitively be interpreted as creating a smooth intensity surface over a 2-D geographic space. Because the intensity of a homogeneous Poisson point process is a constant, its surface is flat, and the intensity can be calculated by equation (13).
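In the homogeneous case, equation (13) reduces to the total event count divided by the window area. Using the case-study figures reported later in the paper (141 events over a 348,078 km² window), this recovers the estimated intensity quoted in section IV:

```python
# Homogeneous intensity (equation (13)): events per unit area.
n_events = 141
area_W = 348_078 * 1e6        # window area in m^2 (348,078 km^2)
lam = n_events / area_W       # approximately 4.0508e-10 events/m^2
```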
For the intensity of an inhomogeneous Poisson point process, its surface is uneven (as shown in figure 6), and its calculation needs kernel density estimation technology.
Kernel density estimation is widely used for clustering and regression tasks in machine learning [36]-[38]. Based on the locations of failure events x_1, ..., x_n ∈ R², the general form of the kernel density estimator is given by equation (14), where λ(x) is the intensity at location x ∈ W, d_ix is the distance from x to x_i, and r ∈ R⁺ is called the bandwidth. With the help of the bandwidth, only the failure observations within r are used to estimate λ(x). κ(·) is called the kernel function and allocates a weight to each failure observation x_i. By using the kernel function, the estimation of intensity accounts for distance decay: in general, the farther location x is from failure observation x_i, the less that observation is weighted in calculating the intensity λ(x). In our approach, a Gaussian function is selected as the kernel function, as given by equation (15).
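The estimator of equations (14)-(15) can be sketched as follows. The event locations and query points are hypothetical, and this sketch uses an untruncated Gaussian kernel (the common practical choice), whereas the paper's formulation restricts the sum to observations within the bandwidth r.

```python
import numpy as np

def gaussian_kernel(d, r):
    """Gaussian kernel in the spirit of eq. (15): weight decays with
    distance d at bandwidth r."""
    return np.exp(-d ** 2 / (2.0 * r ** 2)) / (2.0 * np.pi * r ** 2)

def intensity(x, events, r):
    """Kernel intensity estimate of eq. (14) at 2-D location x."""
    d = np.linalg.norm(events - x, axis=1)   # distances d_ix
    return float(gaussian_kernel(d, r).sum())

# Hypothetical failure-event locations (km) and two query locations.
events = np.array([[1.0, 1.0], [1.5, 0.5], [10.0, 10.0]])
lam_near = intensity(np.array([1.2, 0.8]), events, r=2.0)
lam_far = intensity(np.array([20.0, 20.0]), events, r=2.0)
```

As expected from distance decay, the estimated intensity near the cluster of events exceeds the estimate at a remote location.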

2) SELECTION OF OPTIMAL BANDWIDTH
The selection of the bandwidth determines the performance of kernel density estimation [39]-[42]. If a small value of r is selected, only a small number of x_i are considered, which causes higher variance in the estimation. If a large value of r is selected, the variance is reduced, but the bias becomes high because unnecessary noise is introduced. In general, the bandwidth is determined by experts; in the proposed approach, the optimal bandwidth is obtained by an optimization calculation.
The optimal bandwidth is the one that minimizes equation (16). MISE[λ̂(x)] is the mean integrated square error (MISE) of the estimated intensity λ̂(x), which measures the difference between the estimated intensity λ̂(x) and the true intensity λ(x). Due to its tractability, the MISE is by far the most widely used global measure of bandwidth performance [43].
An approximate formula for the MISE is used to build the objective of the optimization [43] and is represented in equation (17). Its analytical solution is given by equation (18) [43]. Due to the application of the Gaussian kernel, equation (18) can be reduced to equation (19), where n is the number of failure events that occurred in the observation window W and σ can be calculated by equations (20)-(22), in which w_i and v_i are the geographic coordinates of failure observation x_i.
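As an illustration of this kind of plug-in solution, the sketch below applies a Silverman-style rule for a 2-D Gaussian kernel: a pooled coordinate standard deviation scaled by a power of n. This is an assumed stand-in for the closed form of equation (19), whose exact expression is not reproduced here, and the projected coordinates are hypothetical.

```python
import numpy as np

# Hypothetical failure-event coordinates (projected, metres).
coords = np.array([[1200.0, 800.0], [1500.0, 950.0], [900.0, 400.0],
                   [2100.0, 1300.0], [1700.0, 600.0]])
n = coords.shape[0]

# Pooled standard deviation over the two coordinate axes, in the spirit of
# eqs. (20)-(22); the paper's exact combination may differ.
sigma = np.sqrt(0.5 * (coords[:, 0].var(ddof=1) + coords[:, 1].var(ddof=1)))

# Silverman-style plug-in bandwidth for a 2-D Gaussian kernel
# (assumed stand-in for eq. (19)).
r = float(sigma * n ** (-1.0 / 6.0))
```

The bandwidth shrinks as n grows, matching the intuition that denser data support a more local estimate.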

E. RISK ASSESSMENT
In this step, we first estimate the probability that a failure event occurs in each candidate region of the industry infrastructure using the built model. Then, the severity of the failure consequences is measured in each candidate region. On this basis, the risks of all the candidate regions are evaluated.

1) ESTIMATING THE PROBABILITY THAT THE FAILURE EVENT OCCURS IN EACH CANDIDATE REGION
Based on the above conclusions and the finite-dimensional distribution property [44], the Poisson random variable n(B_i) follows a Poisson distribution with mean μ(B_i). Using the estimated intensity λ (or λ(x)) and equation (1), we can calculate the mean μ(B_i). Through equations (23) and (24), we obtain P_{B_i}, the probability that a failure event occurs in region B_i.
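Since equations (23)-(24) are not reproduced here, the sketch below assumes the standard form for a Poisson count: the probability that at least one failure occurs in B_i is one minus the probability of a zero count. The intensity matches the order of magnitude reported in the case study, but the region area is a hypothetical illustration value.

```python
import math

def failure_probability(mu_B):
    """Probability that at least one failure event occurs in a region with
    Poisson mean mu(B_i): P = 1 - P(n(B_i) = 0) = 1 - exp(-mu)."""
    return 1.0 - math.exp(-mu_B)

# Homogeneous case: mu(B) = lambda * |B| (equation (1)); values illustrative.
lam = 4.05e-10          # events per m^2 (order of the paper's estimate)
area = 2.18e10          # m^2, roughly one sixteenth of the Kansas window
p = failure_probability(lam * area)
```

With these magnitudes the mean count per region is well above 1, so the probability of at least one failure is close to 1, consistent with the high probabilities reported in experiment 2.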

2) CALCULATING THE SEVERITY OF THE FAILURE CONSEQUENCE IN EACH CANDIDATE REGION
In our approach, the consequence of a failure is defined as the total negative impact on the environment, finances and humanity. This should include the deaths caused by the failure events and subsequent events, the damage to the environment, the injury to the public, and the impact on businesses. Accordingly, the total failure cost C_{B_i} is used to measure the severity of the failure consequence in region B_i. Table 1 describes the composition of the total failure cost.

3) ASSESSMENT OF THE RISKS OF DIFFERENT CANDIDATE REGIONS AND IDENTIFYING THE CRITICAL REGIONS
Using the estimated probability and the calculated total failure cost, the risk of each candidate region can be assessed. Our approach measures the risk of a candidate region as the product of its estimated probability and its calculated total failure cost. A criticality ranking is then obtained by sorting the candidate regions by risk.
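The ranking step amounts to a product and a sort, as sketched below with hypothetical probabilities and costs (the region labels mirror the case study, but the numbers are illustrative).

```python
# Risk per candidate region: product of the estimated probability that a
# failure occurs and the total failure cost (hypothetical values).
regions = {
    "B1": {"prob": 0.97, "cost": 1.2e6},
    "B2": {"prob": 0.99, "cost": 8.5e6},
    "B3": {"prob": 0.95, "cost": 0.4e6},
    "B4": {"prob": 0.98, "cost": 3.1e6},
}

risks = {name: v["prob"] * v["cost"] for name, v in regions.items()}
ranking = sorted(risks, key=risks.get, reverse=True)   # most critical first
```

Because the probabilities are all close to 1 while the costs span an order of magnitude, the cost term dominates the ordering, which foreshadows the paper's finding that severity is more instructive than risk uncertainty.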

III. EXPERIMENTS

A. OBSERVATION WINDOW
The critical industry infrastructure used in the experiments is a pipeline network in Kansas, USA. We choose the whole state of Kansas as the observation window W, which is shown in figure 3. Its area is 348,078 km².

B. DATASET
The dataset contains 141 failure events collected from January 2010 to January 2017. The information included in the dataset is shown in table 2.

C. EXPERIMENTAL DESIGN

1) EXPERIMENTAL DESIGN 1
The feasibility of the proposed approach is first verified to show that it can identify the critical region among four candidate regions. The four candidate regions differ in shape, area and location, as illustrated in figure 7. Table 3 gives the frequency of each value of n(W), obtained with a period of one month from the samples of n(W). On this basis, the first test of the availability verification can be conducted. For the second test, the regions D_1, D_2, D_3, D_4 are obtained by dividing the observation window W; all four are shown in figure 5. Both sets of counts are listed in table 4. For the candidate regions, the centroid distances between every two regions are shown in figure 8, and the threshold distance is 219,465 m. For regions D_1, D_2, D_3, and D_4, the centroid distances between every two regions are shown in figure 9, and the threshold distance is 205,360 m. From these values, the spatial weight matrices can be obtained, and the property of independent scattering is tested. After that, we determine the category of the model. In the model building procedure, the locations of the failure events (latitude and longitude) are used to estimate the intensity; the number of failure events n is 141. After the bandwidth of the kernel density estimator is calculated, the intensity is estimated.
Using the intensity estimated in model building, the probability is estimated for each candidate region. The total failure cost of each candidate region is obtained by summing the total failure costs of all the failure events that occurred in it and can be observed in table 5. Finally, the risks of the candidate regions are calculated, and the criticality of each candidate region is determined.

2) EXPERIMENTAL DESIGN 2
In experiment 2, the proposed approach is applied to observe the distribution of risk over the whole state of Kansas. The objective is to identify critical regions under the assumption that there are no predefined candidate regions. The whole state is divided into sixteen regions, each with an approximately equal area of 21,754.93 km²; all the divided regions are shown in figure 10. Following the proposed approach, the estimated probability and calculated total failure cost of each region are obtained based on the intensity estimated in experiment 1. Then, the risks of all sixteen regions are obtained, allowing us to observe the distribution of risk across the state.
Additionally, spatial dependency is investigated based on the counts, the total failure costs and the probabilities of failure occurrence in the sixteen regions. The significance test based on Moran's I, also used in step 1 of the proposed approach, is applied for this purpose. In experiment 2, the significance test is conducted on three sets of data: the counts (n(E_1), ..., n(E_16)), the probabilities that a failure event occurred in each region (P_{E_1}, ..., P_{E_16}) and the total costs of the sixteen regions (C_{E_1}, ..., C_{E_16}). The centroid distances between every two regions are provided in figure 11, and the threshold distance is 113,158 m. With the results of the tests, the spatial dependency can be reviewed.

IV. RESULTS AND DISCUSSION
A. RESULTS OF EXPERIMENT 1

In the validation of the Poisson distribution of point counts, the calculated χ² test statistic is 9.97, and the critical value of χ²_{0.05}(7) is 14.067. By comparison, the decision is not to reject H_0: n(W) follows a Poisson distribution, and the first property is not violated.
For the validation of the property of independent scattering, the two spatial weight matrices are illustrated in figures 12 and 13. The results are listed in tables 6 and 7 and indicate that neither set of counts differs significantly from spatially independent counts. Therefore, there is no reason to doubt that the second property is satisfied in this case. In model selection, the calculated χ² test statistic is 46.887 with the estimated intensity λ̂ = 4.0508E-10 times/m². The critical value of χ²_{0.05}(3) is 7.815. By comparison, the decision is to reject H_0; therefore, we choose the inhomogeneous Poisson point process.
In model building, the calculated bandwidth is 24.905 km. In figure 6, we can observe the estimated intensity of any location in the observation window.
After that, the probability that the failure event occurred is estimated in each candidate region. Then, the risk of each candidate region is assessed with the total failure cost. All of the results are shown in table 8. According to the assessed risk, region B 2 is the most critical among the regions.

B. THE RESULTS OF EXPERIMENT 2
According to the requirements of experiment 2, all the results are obtained and shown in table 9. Figures 14, 15 and 16 illustrate the distributions of the probability, the total failure cost and the risk in the spatial dimension.
Based on the centroid distances between every two regions in experiment 2, the spatial weight matrix is determined, as shown in figure 17. Then, significance tests are conducted on the counts (which can be observed in table 9), the total failure costs and the probabilities. All the results are listed in table 10.

C. DISCUSSION
According to the intensity estimated in experiment 1, some hotspots can be observed in figure 6 in the southeast part of Kansas. These hotspots intuitively suggest that the areas where they are located have higher concentrations of failure events due to the vulnerability of the pipeline network there. However, the clusters of failure events within a limited geographical region are not statistically significant: although a comparison between the calculated Moran's index and the expected Moran's index (0.089246 > −0.066667) appears to support these hotspots, the high P-value (0.333988) indicates that they are not statistically significant. Therefore, stakeholders cannot target these hotspot regions with any assurance that their limited resources will achieve a positive effect. By observing the spatial distribution of the probability of a failure event in experiment 2, we find that the probability of a failure event occurring in each region is very high and that the variation of those probabilities is very small in the spatial dimension. According to the probability results in table 9, thirteen of the sixteen values are larger than 0.95, while the probabilities of regions E_1, E_2, and E_9 are less than 0.95. The reason for the low-probability regions may be an inappropriate division of the observation window: because the deployment locations of the pipeline are unknown, our division may make some regions cover more area with no pipeline deployed, which reduces the probability of a failure event occurring in those regions. Given this limitation, we regard the thirteen high-probability regions as providing an objective estimate of the probability; the mean of their probabilities is 0.918187. For the industry infrastructure, this mean implies that the likelihood of occurrence of a failure event is very high.
Additionally, the coefficient of variation is calculated from the thirteen high probabilities; the result is 0.022236. For the total failure cost, the coefficient of variation is 0.955012. There is an obvious difference between the two coefficients of variation: by comparison, the spatial variability of risk uncertainty is very small. Moreover, according to the results for the probabilities in table 10, there is insufficient evidence to support a cluster of high or low probability values relative to the average in the spatial dimension. Therefore, we conclude that the probability of occurrence of a failure event is very high in the regions where the pipeline is located, but its spatial variability is very small. This implies that risk uncertainty is less instructive for identifying critical regions than the severity of failure consequences, and that the total failure cost is more affected by location.
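The coefficient-of-variation comparison is straightforward to reproduce in sketch form. The probabilities and costs below are hypothetical values chosen to mimic the contrast in the case study (near-constant probabilities versus widely varying costs), not the actual case-study data.

```python
import numpy as np

def coef_variation(x):
    """Coefficient of variation: sample standard deviation over the mean."""
    x = np.asarray(x, dtype=float)
    return float(x.std(ddof=1) / x.mean())

# Hypothetical per-region values mimicking the paper's contrast.
probs = [0.97, 0.99, 0.95, 0.98, 0.96]
costs = [1.2e6, 8.5e6, 0.4e6, 3.1e6, 0.9e6]

cv_prob = coef_variation(probs)   # small: probabilities barely vary
cv_cost = coef_variation(costs)   # large: costs vary strongly with location
```

The large gap between the two coefficients is what licenses the conclusion that severity, not risk uncertainty, drives the spatial ordering of risk.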
The investigation of the spatial dependencies of the total costs and the probabilities of failure occurrence in the sixteen regions in experiment 2 provides no evidence that geospatial interdependency exists in the industry infrastructure. As an interdependency type, geospatial interdependency supposes that the proneness to failure of adjacent components is similar because of their proximity in the spatial dimension [45]; it is widely used as an assumption in the vulnerability analysis of industry infrastructure [14], [46]-[48]. However, the investigation reveals that the total failure costs and the probabilities of failure occurrence in the sixteen regions are both spatially independent, as can be reviewed in table 10. This means that the proneness to failure of components is not similar even when the components are located in adjacent regions.

D. APPROACH COMPARISONS
To help researchers observe the differences between our approach and existing methods, a comparison is made in five aspects.
The first aspect is the difficulty of acquiring the required input data or information. The locations and losses of failure events used in our approach are easily obtained from news reports. Other approaches require input such as the constitution of the system and the design files of components, which are sensitive not only for stakeholders but also for homeland security. Therefore, our approach has an advantage over other approaches in this aspect.
The second aspect is the amount of data needed for modeling. Compared with network-based approaches, our approach requires more data. However, compared with empirical approaches, particularly those that apply other machine learning technologies, our approach does not need large amounts of data. Therefore, we consider our approach acceptable in this aspect.
The third aspect is the ability to consider system characteristics in modeling. In this aspect, our approach is weak: geographic coordinates are the only variables of risk, which means no system characteristics are considered in the risk assessment. Therefore, our approach has no ability to consider system characteristics in modeling.
The fourth aspect is computational cost. Here, the comparison is limited to the empirical approaches that also use machine learning technologies [49]. Kernel density estimation is among the machine learning technologies with very low computational complexity, and the small amount of input data further indicates that our approach does not need a large computational budget.
The fifth aspect is the falsifiability of the approach, in which our approach excels. Because a common standard for evaluating the accuracy of vulnerability analysis is lacking, the results of almost all approaches cannot be falsified. For this reason, our approach applies hypothesis testing before modelling, which effectively improves falsifiability.

V. CONCLUSION
This paper proposes an improved empirical approach to the vulnerability analysis of industry infrastructure for identifying critical regions. This is accomplished by assigning a risk value to each region where an industry infrastructure is deployed. The risk value is characterized by risk uncertainty and the severity of the consequences caused by failures: risk uncertainty is measured by the probability of a failure event occurring in a geographic region, and the total failure cost measures the severity of the consequences. The assessed risks yield a ranking of the geographic regions, from which their criticalities are determined.
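The ranking step can be sketched in a few lines. The region names, probabilities, and costs below are illustrative only, and the combination of the two factors as a simple product is an assumption for the sketch; the paper states only that risk integrates the probability of occurrence and the total failure cost.

```python
# Hypothetical per-region probabilities of a failure event and
# total failure costs (USD); all values are made up for illustration.
regions = {
    "R1": (0.12, 4.0e6),
    "R2": (0.10, 9.5e6),
    "R3": (0.11, 1.2e6),
}

# Assumed combination rule: risk = probability x total failure cost
# (probability captures risk uncertainty, cost captures severity).
risk = {name: p * cost for name, (p, cost) in regions.items()}

# Rank regions from most to least critical.
ranking = sorted(risk, key=risk.get, reverse=True)
print(ranking)  # → ['R2', 'R1', 'R3']
```

Note how R2 ranks first despite having the lowest probability: with small spatial variability in probability, the cost term dominates the ordering, consistent with the paper's finding that severity is the more instructive factor.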

A. THE CONTRIBUTIONS OF THE PROPOSED APPROACH
The first contribution is in application. As a spatial statistical technique, the Poisson point process is innovatively applied to industry infrastructures to analyze their vulnerability. By overcoming the difficulty of insufficient data and the impact of human biases, it effectively and objectively estimates the probability that a failure event occurs in different geographic regions. On top of that, the following benefits are obtained: 1) Compared with existing approaches, the rarely used location information in the records of failure events is exploited for modeling, which means that our approach extracts more value from these records.
2) The difficulty of insufficient data, which most approaches face, is overcome by transforming the modeled object. In essence, this transformation is achieved by the Poisson point process: it changes the modeled object from individual components to the region where the whole industry infrastructure is located, so that all the infrequent failures of different components are aggregated into a large amount of data. Therefore, sufficient data are available for the proposed approach.
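The aggregation idea can be made concrete with a small sketch. All numbers below are synthetic: each of 400 hypothetical components fails so rarely that per-component modelling is hopeless, but summing over the region yields a usable count series from which a regional Poisson rate can be estimated.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical yearly failure counts: 10 years x 400 components, each
# component failing rarely (about once every 20 years on average).
per_component = rng.poisson(lam=0.05, size=(10, 400))

# Aggregating over the whole region turns sparse per-component records
# into a dense count series -- the transformation of the modeled object.
regional_counts = per_component.sum(axis=1)

# Maximum-likelihood estimate of the regional Poisson rate (events/year).
lam_hat = regional_counts.mean()

# Probability that at least one failure occurs in the region next year,
# under the Poisson model: P(N >= 1) = 1 - exp(-lambda).
p_failure = 1.0 - np.exp(-lam_hat)
print(f"rate = {lam_hat:.1f}/yr, P(>=1 failure) = {p_failure:.3f}")
```

Each component contributes almost nothing on its own, yet the regional series has roughly 20 events per year, comfortably enough to fit a rate.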
The second is the contribution in methodology. A hypothesis test is applied to verify the assumptions used in our approach, which increases the falsifiability of the results. Current research on vulnerability analysis lacks a common standard for evaluating the correctness of the analyses produced by existing approaches, so there is no foundation for comparing different approaches. For this reason, we evaluate the correctness of the analysis based on the approach itself: by using hypothesis testing, the correctness of the applied assumptions and hypotheses can be determined, which effectively improves the falsifiability of the proposed approach.
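One standard way to test a Poisson assumption before modelling is the conditional chi-square dispersion test, which checks the equidispersion property (variance equal to mean) of Poisson counts. The sketch below uses synthetic yearly counts and is offered as one possible check, not necessarily the specific test used in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical yearly failure counts for one region; the Poisson model
# requires equidispersion, i.e., variance equal to the mean.
counts = rng.poisson(lam=6.0, size=30)

# Conditional chi-square dispersion test: under H0 (Poisson), the index
# of dispersion times (n - 1) is chi-square with n - 1 degrees of freedom.
n = counts.size
dispersion = counts.var(ddof=1) / counts.mean()
stat = dispersion * (n - 1)
p_value = 1.0 - stats.chi2.cdf(stat, df=n - 1)

# A small p-value would falsify the Poisson assumption, in which case
# the vulnerability analysis should not proceed for that region.
print(f"dispersion = {dispersion:.2f}, p = {p_value:.3f}")
```

This is exactly the kind of pre-modelling check that gives the approach its falsifiability: a failed test rules the model out rather than letting a wrong assumption pass silently.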
In theory, the proposed approach makes two contributions. The first is the finding on risk uncertainty: the variability of risk uncertainty across the geographic regions where an industry infrastructure is located is very small, so the diversity of risk is mainly caused by the severity of failure consequences. The second is that the significant spatial variability of the severity of failure consequences indicates that it is more sensitive to the geographical position of failure events.

B. THE LIMITATIONS OF THE PROPOSED APPROACH
The first limitation is the limited modelling capability. The variables reflecting the characteristics of industry infrastructures are replaced by geographic coordinates, so the proposed approach cannot consider many system characteristics in vulnerability analysis. The second limitation is the strict constraint on the approach's applicability. Because the properties of the Poisson point process must be validated before modelling, the vulnerability analysis cannot be performed when those properties do not hold. Therefore, the application of the proposed approach is limited.

APPENDIX
The dataset can be acquired online at: https://www.kaggle.com/usdot/pipeline-accidents/metadata. It was collected and published by the United States Department of Transportation's Pipeline and Hazardous Materials Safety Administration.