Identification of Traffic Accident Patterns via Cluster Analysis and Test Scenario Development for Autonomous Vehicles

Increased safety is one of the main motivations for traffic research and planning. The task has two components: (i) improving existing traffic policies based on a good understanding of the risk factors related to trends in traffic accidents, and (ii) underpinning the emerging technologies that will advance the safety of vehicles. For the latter route, the introduction of connected and automated vehicles (CAVs) is a promising option, as CAVs can potentially reduce the number of accidents. To reap their benefits, however, CAVs need to be introduced in a safe manner and tested for their ability to deal safely with risky scenarios, and the identification of such test scenarios remains a key challenge for the industry. This study contributes to both routes to increased safety by (i) analyzing the UK's STATS19 accident data to identify patterns in past traffic accidents, and (ii) utilizing this information to systematically generate scenarios for CAV testing. For task (i), the patterns in the accidents were identified in terms of static and time-dependent internal and external factors. For this purpose, the study employed a clustering algorithm, COOLCAT, which is particularly suitable for high-dimensional categorical data. Six clusters emerged naturally from the algorithm. To interpret the findings, we applied a frequency analysis to each cluster. The frequency tests showed that, in each cluster, certain distinct real-world situations were significantly overrepresented compared with the non-clustered reference case; these situations serve as the markers of each cluster. Task (ii) complemented the first task by synthesizing the relationships between attributes through association rule mining, using the market basket analysis approach. Drawing on the characteristics of the clusters, the method enabled us to develop non-trivial test scenarios that can be used in the testing of CAVs, especially in virtual testing.

creating a state of "informed safety" which, in turn, leads to the development of trust in CAVs [5]. However, owing to the increased complexity of CAVs [6], ensuring and evaluating their true capabilities and limitations remains a challenge [7]. It has been suggested that, to prove that CAVs are safer than human drivers, they need to be driven for over 11 billion miles [8]. This might seem an unrealistic proposition, but an alternative school of thought, Hazard Based Testing, focuses on the quality of miles and suggests testing for "how a system fails" rather than "how a system works" [9]. Understanding how a system may fail can be done either proactively (e.g., via safety assessments involving hazard identification) [10] or reactively (e.g., by analyzing road accident databases) [11]. While the former would be intrinsic to the system, the latter would yield extrinsic factors that may lead to hazards. Identifying extrinsic factors, even for normal, human-driven systems, requires a deep understanding of the relationships between them. Once such an understanding is achieved for human-driven systems, it can serve as a basis for developing tests and test scenarios to help train CAVs.
The goal of this study is to devise a systematic way of supporting the aforementioned reactive path by creating realistic scenarios that are archetypal of high-risk traffic situations. This is a two-stage problem requiring an approach that is capable of (i) detecting patterns in a wealth of accident data and (ii) synthesizing scenarios based on the significant relationships within these patterns. In this study, improving on [12], we used a cluster analysis approach for stage (i) and association rule mining for stage (ii). We demonstrate our approach using the UK traffic accident database.
The approach presented in this study offers several prospects. First, cluster analysis can provide an efficient way to cast scattered accidents into natural groups which exhibit collective characteristics. These groups can sometimes be of very small sizes (or have very small subgroups), which depict rare but distinct traffic situations that might be omitted using other traditional methods such as regression. Second, many existing traffic data analysis methods, a priori, categorize variables as dependent and independent. Our methodology does not require such assumptions and allows the extraction of naturally occurring relationships within the data (i.e., stage (ii)). Third, thanks to the particular clustering algorithm used in this study, streams of new incoming scenarios can be classified appropriately and efficiently, helping with maintenance of large databases.
Applying the suggested methodology, it was found that the accident dataset can be differentiated into six distinct clusters, each of which shows different characteristics. These are (i) fatal, late night, off-junction accidents on motorways with high speed limits; (ii) two-wheeler (bicycle and motorbike) accidents on minor roads at a junction while turning left or right; (iii) fatal, two-wheeler accidents on slip roads connecting to major roads in foggy weather; (iv) off-junction accidents involving buses on unclassified roads; (v) accidents on private drives involving reversing and parked vehicles; and (vi) night accidents at multi-armed junctions of major roads with low speed limits involving buses and bicycles. Following the identification of these clusters, market basket analysis was applied to each cluster to ascertain the quantitative relationships between the in-cluster attributes, which can be regarded as proto-scenarios. These rules are then combined to obtain scenarios that represent the corresponding clusters.
The remainder of this paper is organized as follows. Section 2 provides a brief review of the literature on accident data analysis concentrating on data mining methods. Section 3 provides an overview of the data format and how the data was processed into the form that was used in the study. Section 4 introduces our analysis method and the algorithms used. In Section 5, we present our findings. In Section 6, these findings are interpreted in the context of scenario generation and are utilized to systematically develop natural pre-crash exemplary scenarios. Finally, Section 7 concludes the paper.

II. BACKGROUND
A vast amount of research exists on the relationships between accidents and surrounding conditions [13]. A commonly used approach is to formulate the relationships in a cause-and-effect form using classical or contemporary techniques, including various types of regression models [14]-[20], [11], [57], [58]; Bayesian analysis [22]-[25]; and neural-network models [21], [26]-[29].
An alternative approach is not to assume a pre-set relationship and to let the data reveal itself. In this spirit, data mining strategies such as association rule mining [30][31][32] and decision trees [33][34][35][36][37] have attracted increased attention in recent years in safety research and automated driving systems (ADSs). The main advantage of the data mining approach is that, unlike most machine learning methods (such as regression), it does not presume an a priori differentiation of variables according to a pre-set model. Hence, it arguably provides more flexibility and fidelity.
One type of data mining strategy that has been explored to a lesser extent in the context of traffic accident data is cluster analysis [38]. The crux of this technique is to group traffic accidents according to microscopically or macroscopically defined criteria, which allows for comparative examination of these groups [39]. Among past studies, the k-means clustering method was used in [40] to analyze accident hotspots, whereas in [41] and [42] the same method was used for the severity prediction of accidents.
More recently, related k-means clustering methods were used by [12] for crash analysis at road junctions, by [43] for pedestrian pre-crash scenarios and by [44], [45] for the assessment of automated emergency braking systems in accidents.
To leverage clustering methods, one needs to be mindful of the algorithms' data processing procedures. To this end, the first consideration is the suitability of the method for the data type under study. Most clustering methods that have been employed in traffic research use the k-means algorithm [46] or its variants, k-medoids [47] and k-modes [48]. While k-means is a popular and well-established clustering method, it is not well suited to categorical data, as the mean of a categorical variable is not meaningful. K-medoids and k-modes, on the other hand, can handle categorical data. However, they are known to perform poorly on high-dimensional data [12] and may not be ideal if one intends to analyze datasets with a large number of attributes, which is one of the central aims of this study. This problem can be partially circumvented by reducing the dimensions (i.e., discounting certain variables on the basis of educated decisions or guesses), as was done in some recent works [12]. However, one should be wary of resorting to approaches such as handcrafted feature selection for cluster analysis, as they may be prone to error or bias [53]. Considering that most traffic accident data, especially the UK STATS19 database, consist of attributes that are predominantly categorical, it is advisable to use an algorithm designed for categorical data clustering, such as COOLCAT [49], ROCK [50], DBSCAN [51], SQUEEZER [52], or LIMBO [63]. A second point of consideration in deciding on an algorithm is the criterion for distinguishing clusters. Most clustering methods that have been employed in traffic research rely on distance-based algorithms that use a microscopic (local) criterion for assignment to clusters, such as DBSCAN and its many more recent variants [62].
However, employing an algorithm whose criteria are based on global properties (such as entropy) of the data groups can provide new insights into the trends in the data, and this is the approach preferred in this study. Another issue to take into account is speed. For instance, even though ROCK is a categorical clustering algorithm that connects clusters with links (hence preserving some level of non-local properties), its agglomerative nature makes it slow and not scalable. SQUEEZER, on the other hand, is fast, but its clustering is very sensitive to the ordering of the data, as the clusters are built incrementally from a single element. Considering these aspects, in this paper we use an entropy-based algorithm, COOLCAT, which is by design suitable for categorical data clustering [49]. Moreover, COOLCAT can work with high-dimensional data without compromising on quality. It distinguishes clusters based on entropy, which is a global feature of the data. COOLCAT is also efficient and can handle streams of incoming data with ease. Furthermore, clustering with COOLCAT is relatively less dependent on the data ordering, since the initial cluster seeds are independent of the order of the data. One downside of COOLCAT is its initialization stage, which has quadratic complexity and may increase the overall time cost. This is the price paid for a more stable and consistent clustering, and it is comparable to the cost of similar clustering algorithms such as LIMBO.
While providing useful insight for understanding accident patterns, cluster algorithms alone may not immediately convey a meaning to the clusters formed. In other words, one needs to understand what these clusters represent. For small clusters with a small number of attributes, this can be achieved by eyeballing the clusters. However, for clusters with a large number of data points and attributes, one needs a systematic way to interpret what each cluster signifies. Furthermore, even after a cluster obtains meaning in terms of its indicator attributes, this does not provide much clue on the relationship between these variables, which is crucial in understanding the development of individual scenarios. For this purpose, in this paper, we propose a two-step procedure that identifies the key attributes that distinctively describe each cluster and then extracts the previously unknown relationships between the attributes within those clusters. The first step is to run comparative frequency tests between the clusters and the reference distribution of the attributes. The second step involves employing the association rule mining method (i.e., market basket analysis) on the distinguished attributes.

A. FORM OF THE DATA AND PRE-PROCESSING
This study is based on an analysis of publicly available data collected from police reports in the UK [1]. Accidents from the 2016-2018 period were taken as the base data, which amounts to 389238 accidents. In its raw form, the data is stored in different files that describe the accidents from the perspective of either common attributes (e.g., weather condition, light condition) or specific attributes (e.g., sex of the driver, vehicle type). Not all attributes recorded in the datasets were regarded as relevant for the analysis. For instance, the effects of cultural origin were discounted. Likewise, variables that were thought to be unimportant, such as local authority district and police officer attendance, were disregarded. As the main goal of this study is scenario development, only those attributes (or variables) that were thought to have a direct influence on accidents were kept. After this, the data were reorganized from the perspective of the driver, which meant duplicating the common variables. Furthermore, only those accidents involving one vehicle, or two vehicles with physical impact, were considered. The reason for this is to keep the scope of the paper focused on test scenario generation for AVs. Since the overwhelming majority of traffic accidents involve one or two vehicles, it was decided to restrict the analysis to such accident types.
Another important point is that most of the attributes recorded in the STATS19 database were categorical with many superfluous values. Therefore, certain variables are restructured, for instance, by merging cases. An example of this is provided in the appendix. The full dictionary can be found in the STATS19 database [1]. Furthermore, for each accident with a missing value, a random value from the possible set of values from the respective category was assigned.
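The random assignment of missing values described above can be illustrated with a minimal sketch. It assumes uniform sampling from the allowed categories, which the text does not specify; the function name `impute_missing`, the variable names, and the category values are hypothetical and are not taken from the STATS19 dictionary.

```python
import random

def impute_missing(records, allowed_values, missing=None, seed=0):
    """Replace each missing entry with a random allowed category.

    Assumption: categories are drawn uniformly; the paper does not
    state the sampling distribution.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    imputed = []
    for record in records:
        filled = {}
        for variable, value in record.items():
            if value == missing:
                filled[variable] = rng.choice(allowed_values[variable])
            else:
                filled[variable] = value
        imputed.append(filled)
    return imputed

# Hypothetical example values, for illustration only.
allowed = {
    "Weather": ["Fine", "Rain", "Fog"],
    "Light Conditions": ["Daylight", "Darkness"],
}
raw = [
    {"Weather": "Rain", "Light Conditions": None},
    {"Weather": None, "Light Conditions": "Daylight"},
]
clean = impute_missing(raw, allowed)
```

Observed values are left untouched, so only the gaps are filled.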

B. ODD AND BEHAVIOUR COMPETENCIES
As mentioned earlier, a major challenge in the CAV industry is the development of test scenarios. Considering the high demand in this domain, an established format for scenario description is instrumental for easy and standardized exchange of scenarios. This gave rise to the operational design domain (ODD) concept detailed in (BSI, 2020) and defined as "Operating conditions under which a given driving automation system or feature thereof is specifically designed to function, including, but not limited to, environmental, geographical, and time-of-day restrictions, and/or the requisite presence or absence of certain traffic or roadway characteristics". An ODD consists of three main classes of descriptors: scenery (such as drivable area, junctions, and physical structures), environmental conditions (such as weather and light conditions), and dynamic elements (such as traffic conditions and speed of the vehicle). As shown below, many of the attributes from the STATS19 dataset can be easily mapped onto the attributes in the ODD. A complementary concept that is used in this paper (and included in the STATS19 variables) is that of "behavior competencies" (e.g., vehicle maneuver), which describe driving behavior [55]. Together, the ODD and behavior competencies constitute the backbone for scenario development.

C. CRASH DATA VARIABLES
This study takes the perspective that traffic accidents can be described solely in terms of local effects, that is, factors and outcomes that are immediately present at the time and location of the accident. Overall, 21 variables from the STATS19 database were selected for the analysis, among them Accident Severity and Skidding and Overturning. These variables were chosen because they (i) provide information about the outcome of the accident (e.g., Accident Severity and Skidding and Overturning), (ii) provide information on the conditions around the accident (e.g., Light Conditions and Road Surface Conditions), or (iii) give details of the accident scenario (e.g., Vehicle Maneuver and Sex of Driver). Variables that were superfluous to these purposes, such as local authority district, were removed.
Most variables included in the analysis are self-explanatory. We only describe the 1st Road Class variable, which shows the road type. It can take the values Motorway, A, B, C, or Unclassified. These are the standard UK road classes. Motorways and A roads are major roads, while B and C roads are minor roads. Unclassified roads are roads that do not fit into the other classifications and are usually local roads intended for local traffic.

IV. DATA ANALYSIS
After cleaning and organizing the data, we discuss the method of analysis. As noted previously, the rationale for using unsupervised learning approaches is that these techniques allow one to extract important information from the data without making any prior assumptions on the relationships between data attributes, which is a significant advantage.
We used a combination of complementary learning techniques. The first step involved clustering the data. Once this step is complete, the second step of the analysis is to understand what these clusters mean. The following subsections discuss these steps in detail.

A. CLUSTERING OF ACCIDENT DATA
This was the first step in the analysis. As mentioned earlier, clustering analysis has a long history, but its use in accident data is a relatively recent development. Two considerations guided the choice of algorithm. First, although there are dozens of clustering algorithms available for general purposes, the accident data under consideration are exclusively categorical, and general-purpose clustering algorithms such as k-means (which is designed for continuous variables) are less likely to yield high-quality clustering. Second, for the purposes of this study, we are more interested in differentiating clusters based on the global features of the attributes in each cluster than on individual similarity relationships between the data points in those clusters. This choice makes a marked difference in the type of algorithm to be used.

A.1. COOLCAT Categorical Clustering Algorithm
The COOLCAT algorithm was first proposed in [49]. It was designed specifically for categorical datasets. Unlike most other clustering algorithms (such as k-medoids and k-modes) that have been used in accident analysis research, COOLCAT is not based on a distance metric. Rather, central to COOLCAT is the concept of entropy, which is borrowed from physics and information theory and measures the disorder in a given system. The goal of the algorithm is to group the data points of the system into clusters in a configuration that minimizes the average entropy. In this setting, the entropy of a cluster can be quantified in terms of the normalized frequencies of the attributes within the cluster, treating each variable independently of the others. This crucial difference, that is, distinguishing clusters with respect to globally defined differences instead of local metric distances, is one of the advantages of COOLCAT when dealing with categorical data and can help better describe the clusters in the interpretation stage. Another advantage of COOLCAT over more classical algorithms (such as k-means and k-medoids) is that COOLCAT performs incremental clustering and hence can handle streams of new incoming data without the need for clustering from scratch.
Given the number of clusters, the algorithm begins by forming cluster seeds that are chosen as the most different elements from each other in the dataset. Then, the remaining data points are combined with the seed clusters one by one according to the average reduction in the entropy of the system. Once one iteration is completed, a portion of the data points are redistributed among the clusters (provided that the new assignments decrease the overall entropy) to minimize path dependence effects.
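The seeding and incremental assignment steps described above can be sketched in a few lines. This is a simplified illustration, not the MATLAB implementation used in the study: it searches all pairs for the seeds (practical only for small samples), omits the re-distribution of a portion of the points, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cluster_entropy(points):
    """Sum of per-attribute Shannon entropies, attributes treated independently."""
    if not points:
        return 0.0
    n = len(points)
    total = 0.0
    for column in zip(*points):  # one column per attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total

def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

def coolcat_sketch(points, k):
    def pair_entropy(i, j):
        return cluster_entropy([points[i], points[j]])

    # Seeding: start from the most dissimilar pair, then repeatedly add the
    # point whose minimum pairwise entropy to the chosen seeds is largest.
    seeds = list(max(combinations(range(len(points)), 2),
                     key=lambda ij: pair_entropy(*ij)))
    while len(seeds) < k:
        rest = [i for i in range(len(points)) if i not in seeds]
        seeds.append(max(rest, key=lambda i: min(pair_entropy(i, s) for s in seeds)))
    clusters = [[points[s]] for s in seeds]

    # Incremental step: place each remaining point in the cluster that
    # yields the lowest expected entropy of the whole configuration.
    for i, point in enumerate(points):
        if i in seeds:
            continue
        best = min(range(k), key=lambda c: expected_entropy(
            clusters[:c] + [clusters[c] + [point]] + clusters[c + 1:]))
        clusters[best].append(point)
    return clusters

# Toy data with two obvious groups (values are illustrative only).
toy_points = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("a", "x"), ("b", "y")]
toy_clusters = coolcat_sketch(toy_points, k=2)
```

On this toy input the two groups are separated exactly, so the expected entropy of the final configuration is zero.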

B. INTERPRETATION OF CLUSTERS
The second step of the analysis focuses on ascertaining the meanings of the clusters formed by the clustering algorithm. This involves determining the significant variables that describe the clusters most distinctively and extracting the a priori unknown relationships or rules between these significant variables.

B.1. Frequency Analysis for Identification of Significant Variables
Because the COOLCAT method is not metric-based, another approach for identifying the meaning of the clusters is needed. A frequency analysis was used to determine which variables appear significantly more than expected in each cluster compared with how frequently they are in the rest of the data. This is possible because the data is categorical and frequencies exist, whereas in continuous data, they would not.
Significant variables in each cluster were identified using the chi-square test. As the data is in binary form, for every data point, each variable has a value of 1 if it was present in that accident and 0 if it was not. The chi-square value for each variable is given by

$$\chi^2(\mathrm{var}) = \frac{\left(O_1(\mathrm{var}) - E_1(\mathrm{var})\right)^2}{E_1(\mathrm{var})} + \frac{\left(O_0(\mathrm{var}) - E_0(\mathrm{var})\right)^2}{E_0(\mathrm{var})}$$

where var represents an arbitrary variable, $O_1$ is the observed number of 1's in the cluster, $E_1$ is the expected number of 1's in the cluster, $O_0$ is the observed number of 0's in the cluster, and $E_0$ is the expected number of 0's in the cluster.
The expected number of 1's is given by the size of the cluster multiplied by the frequency of the variable in a comparison set divided by the size of the comparison set. This comparison set contains the full data (representing the distribution of the entire population). $E_1$ is then given by

$$E_1(\mathrm{var}) = n_c \, \frac{\mathrm{frq}(\mathrm{var})}{N}$$

where $n_c$ denotes the size of the cluster, and $N$ and frq are the total number of data points in the full data and the frequency of the variable in question, respectively. A variable is deemed significant if its observed frequency in the cluster differs from the expected frequency at a significance level of p < 0.05, the null hypothesis being that the two do not differ. After the significant variables are found, the index relative frequency = observed/expected is calculated to identify which variables are most overrepresented in the cluster. In the sequel, a variable is deemed a signifier or indicator of a cluster only if its relative frequency exceeds a set threshold (see Section 5).
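The test above can be made concrete with a short sketch. It assumes the standard Pearson form of the chi-square statistic with one degree of freedom, for which the critical value at p = 0.05 is approximately 3.841, and it borrows the 1.25 relative-frequency threshold reported later in the paper. The function name and the example counts are hypothetical.

```python
CHI2_CRITICAL_1DF_P05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

def indicator_stats(observed_ones, cluster_size, frq, n_total, min_rel_freq=1.25):
    """Return (chi-square, relative frequency, is_indicator) for one binary variable.

    Assumes 0 < frq < n_total so that both expected counts are non-zero.
    """
    expected_ones = cluster_size * frq / n_total
    expected_zeros = cluster_size - expected_ones
    observed_zeros = cluster_size - observed_ones
    chi2 = ((observed_ones - expected_ones) ** 2 / expected_ones
            + (observed_zeros - expected_zeros) ** 2 / expected_zeros)
    rel_freq = observed_ones / expected_ones
    is_indicator = chi2 > CHI2_CRITICAL_1DF_P05 and rel_freq >= min_rel_freq
    return chi2, rel_freq, is_indicator

# Made-up counts: a variable present in 2,000 of 20,000 accidents overall
# and in 150 accidents of a cluster containing 1,000 accidents.
chi2, rel_freq, is_indicator = indicator_stats(150, 1000, 2000, 20000)
```

With these made-up counts the expected number of 1's is 100, the relative frequency is 1.5, and the statistic is about 27.8, so the variable would qualify as an indicator.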

B.2. Market Basket Analysis
Market Basket Analysis (MBA) (Agrawal, 1993) is a method that is mainly used on transactional data to identify which products are found together in customers' purchases.
In general, the idea is to find association rules between variables that appear together unusually frequently. The first step in MBA is to find frequent itemsets using the Apriori algorithm. A k-itemset is a subset of length k of all possible variables. For example, in a shopping context, an itemset could be {Bread, Milk, Eggs, Cheese}, while in a traffic accident context, an itemset could be {Motorbike, Entering Junction, Turning Left}. An itemset is said to be frequent if its support exceeds a given threshold. The support of an itemset X is given by the frequency of X, that is, the number of data points that contain all members of the itemset, divided by N, the total number of data points:

$$\mathrm{supp}(X) = \frac{\mathrm{frq}(X)}{N}$$

Support is essentially a measure of how rare or common an itemset is.
The second step, once the frequent itemsets are found, is to identify association rules within them. This is done by partitioning the itemset into two disjoint subsets, the antecedent and the consequent, which then gives the association rule antecedent → consequent. For example, an itemset X = {x1, x2, x3} can be split into antecedent A = {x1, x2} and consequent C = {x3}, which would give the rule A → C.
Two metrics were used to identify the strength of the association: confidence and lift. Confidence is given by the frequency of the union of the antecedent and the consequent (the joint itemset), which corresponds to the intersection of the data points, divided by the frequency of the antecedent, i.e.,

$$\mathrm{conf}(A \rightarrow C) = \frac{\mathrm{frq}(A \cup C)}{\mathrm{frq}(A)}$$

Intuitively, for the rule A → C, this is the probability that C occurs given that A occurs. The lift is given by the support of the entire itemset divided by the product of the support of the antecedent and the support of the consequent:

$$\mathrm{lift}(A \rightarrow C) = \frac{\mathrm{supp}(A \cup C)}{\mathrm{supp}(A)\,\mathrm{supp}(C)}$$
For a rule A → C, the lift compares how often A and C actually appear together with how often they would be expected to appear together if they were independent, based on their support within the dataset. If the lift is 1 or less, then even if the rule has high confidence, A is not associated with C any more strongly than would be expected from coincidental co-occurrence. On the other hand, if the lift is higher than 1, the items appear together more often than coincidence would explain. A summary of the concepts is given in figure 1.

In this section, we present the main findings of this study in two stages. First, the previously explained COOLCAT clustering method was applied to a sample of 20,000 data points that were randomly selected from the collection of accident records. As COOLCAT is robust against high dimensionality, no attempt was made to reduce the number of attributes further. In the second stage, a frequency analysis followed by MBA was carried out to extract the significant associations for each cluster, which formed the scenarios obtained from those clusters. The COOLCAT clustering algorithm was coded and executed in MATLAB 2019a, while the MBA method was implemented in Python 3.7 using the mlxtend package [59].
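As a worked illustration of the support, confidence, and lift measures defined for MBA, the following pure-Python sketch computes them on a handful of made-up transactions. It does not reproduce the mlxtend-based implementation used in the study; the helper names and the toy accident records are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """supp(A ∪ C) / supp(A), which equals frq(A ∪ C) / frq(A)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """supp(A ∪ C) / (supp(A) * supp(C))."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# Eight made-up accident records, each a set of attributes that are present.
toy_accidents = [
    {"Motorbike", "Entering Junction", "Turning Left"},
    {"Motorbike", "Entering Junction", "Turning Left"},
    {"Motorbike", "Entering Junction"},
    {"Car", "Turning Left"},
    {"Car"}, {"Car"}, {"Car"}, {"Car"},
]
A = {"Motorbike", "Entering Junction"}
C = {"Turning Left"}
rule_confidence = confidence(A, C, toy_accidents)  # 2 of the 3 records containing A
rule_lift = lift(A, C, toy_accidents)
```

Here the joint itemset appears in 2 of 8 records, A in 3, and C in 3, so the confidence is 2/3 and the lift is 16/9, which is above 1.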

A. RESULTS FOR COOLCAT CLUSTERING
In this section, we present the results of the clustering method. After the cleaning process, the data, which is entirely categorical, was converted into binary form (or business transaction form), where each category of a variable was treated as a new variable. The proposed algorithm was applied to a random sample of 20,000 accidents that were selected from the reference list of 549,575 accidents that took place between 2016-2018.
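The conversion to binary (transaction) form, in which each category of a variable becomes its own 0/1 variable, can be sketched as follows. The function name, the "variable=value" column naming, and the category values are assumptions made for illustration.

```python
def to_binary_form(records):
    """One-hot encode categorical records into named 0/1 columns."""
    items = sorted({(var, val) for rec in records for var, val in rec.items()})
    names = [f"{var}={val}" for var, val in items]
    rows = [[1 if rec.get(var) == val else 0 for var, val in items]
            for rec in records]
    return names, rows

# Two made-up accident records.
records = [
    {"Light Conditions": "Daylight", "Road Surface Conditions": "Dry"},
    {"Light Conditions": "Darkness", "Road Surface Conditions": "Wet"},
]
names, rows = to_binary_form(records)
```

Each row then has exactly one 1 per original variable, which is the form expected by both the frequency analysis and the MBA step.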
For the differentiability and quality of the clusters, the goodness of separation needs to be assessed in the post-clustering stage, as the total number of clusters is pre-specified in the COOLCAT algorithm. There is no scientific consensus on how to choose the ideal number of clusters, simply because clustering assessments depend on the measure that one uses. Commonly used measures include the average silhouette (AS) score, the Dunn index (DI), and the DB index, which are all based on distance functions imposed on the data. However, COOLCAT does not use a distance function for clustering, so distance-based assessments may not be ideal. Alternatively, one can use normalized mutual information (NMI), which is an information-theoretic measure of the level of clustering. To select the best clustering, we compared the scoring indices mentioned above and applied the majority rule to choose the ideal cluster number. Table 1 shows that the NMI values tend to increase as the cluster number increases (with occasional drops). By contrast, the AS and DI scores tended to decrease with increasing cluster number (all computations were done with the Hamming distance). The DB score mostly stabilized for k > 3 and was somewhat insensitive to the cluster number. In these respects, k = 2 and k = 3 do significantly better in obtaining high AS and DI scores. However, the NMI scores are very low for k = 2, 3 (and AS has a theoretical bias towards configurations with low cluster numbers). For k > 5, the NMI scores were considerably higher than for k < 6; however, the AS and DI scores were substantially lower. Therefore, considering all aspects, the optimal cluster number was determined to be k* = 6.

B. INTERPRETATION OF CLUSTERS
As discussed in the introduction, the advantage of the clustering algorithm is that it groups the data into distinct homogenous clusters without making any assumptions about the relationships among the variables. However, this does not inform us about what each cluster represents. Here, we systematically investigated and interpreted the clusters at varying levels of detail.

B.1. Frequency analysis of cluster attributes
The first level of analysis reveals which variables are over- or under-expressed in a particular cluster; these are then interpreted as indicators of what that cluster is and what it is not. Here, the reference measure is the entire data (all accidents between 2016-2018), which has its own distribution. Significant deviations from the reference distribution are therefore interpreted as signifiers of the cluster under consideration. This deviation was assessed using the chi-square test for each variable, as introduced in the previous section. The advantage of this approach is that it is free from human bias and provides a simple, natural interpretation for each cluster, provided that the clustering algorithm is capable of distinguishing data patterns from each other.
The frequencies of the variables in the six clusters were compared with the reference frequencies (the whole data), and those variables that showed significant differences (p<0.05) were noted. To further strengthen the interpretation, only those variables (among the significant ones) that are overexpressed by a factor of at least 1.25 relative to the reference are designated as the cluster signifiers or indicators. Tables 2-4 show, for each cluster, the indicator variables and their relative frequencies (the ratio of the frequency of a variable within a cluster to its frequency in the reference set). A thorough discussion of each cluster is provided in Section 6.

B.2. Market Basket Analysis of Clusters with signifiers
The first-level investigation by frequency analysis is complemented by the second-level investigation, market basket analysis (MBA), which runs on significant variables in each cluster. This is motivated by the idea that although these significant variables are clustered together, they are not necessarily directly linked to each other. MBA helps the variables that are strongly associated with each other to be more precisely identified and provides more arguments to make inferences on the signifiers. Note that it is possible to run the MBA on each cluster with the full set of variables, which has been adopted by some of the previous studies (Pande and Abdel-Aty, 2009). However, we believe that restricted MBA is more meaningful. This is because, on the theoretical side, one is really after those associations that are cluster specific, which describe, with more fidelity, the traffic scenarios that are more likely to occur in that particular cluster. In fact, this has been the whole point of the clustering method to start with, that is, a deeper and more focused analysis of patterns. On the practical side, narrowing down the number of variables significantly reduces the computational time, which will prove profitable if one tries to perform MBA on larger samples.
When applying the MBA, we adjusted the thresholds for the parameters depending on the cluster. The values of the minimal support, confidence, and lift for each cluster are presented in Tables 3-8, along with the set of multi-item associations obtained from the Apriori algorithm. After testing, the threshold values of support = 0.00001, confidence = 0.3, and lift = 1.5 were chosen. Such a low support threshold was used to allow almost all of the rarest variables to potentially appear in the output rules, as identifying edge cases is important in scenario testing. The confidence and lift thresholds were chosen because they provided a good number of strong rules. They also guarantee that, for every rule, the consequent appears in at least 30% of the accidents in the cluster that contain the antecedent (from the confidence of 0.3) and that the rule is observed at least 50% more often than would be expected from random occurrence (from the lift value of 1.5).
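To show how the three thresholds act together, the sketch below enumerates rules by brute force over small itemsets and keeps those that pass the support, confidence, and lift cut-offs quoted above. It is not the Apriori algorithm and not the mlxtend implementation used in the study: exhaustive enumeration is only feasible for a handful of variables, and the function name, the `max_len` limit, and the toy records are hypothetical.

```python
from itertools import combinations

MIN_SUPPORT, MIN_CONFIDENCE, MIN_LIFT = 0.00001, 0.3, 1.5  # thresholds quoted in the text

def mine_rules(transactions, max_len=3):
    """Brute-force rule enumeration with threshold filtering (not Apriori)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    rules = []
    for size in range(2, max_len + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            joint = supp(itemset)
            if joint < MIN_SUPPORT:  # also skips itemsets that never occur
                continue
            # Split the itemset into every antecedent / consequent pair.
            for r in range(1, size):
                for ante in combinations(combo, r):
                    antecedent = frozenset(ante)
                    consequent = itemset - antecedent
                    conf = joint / supp(antecedent)
                    lift = conf / supp(consequent)
                    if conf >= MIN_CONFIDENCE and lift >= MIN_LIFT:
                        rules.append((antecedent, consequent, joint, conf, lift))
    return rules

# Made-up cluster records, each a set of the significant attributes present.
toy_cluster = [
    {"Motorbike", "Entering Junction", "Turning Left"},
    {"Motorbike", "Entering Junction", "Turning Left"},
    {"Motorbike", "Entering Junction"},
    {"Car", "Turning Left"},
    {"Car"}, {"Car"}, {"Car"}, {"Car"},
]
toy_rules = mine_rules(toy_cluster)
```

On this toy input the rule {Motorbike, Entering Junction} → {Turning Left} passes all three thresholds, whereas every rule involving Car is rejected by either the confidence or the lift cut-off.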
To give a high-level understanding of the generated associations, a plot showing the strongest links between variables was generated for each cluster using the Python package pyvis. These plots are shown in the appendices (figures 8-13).
Significant variables for Clusters 1-2 and their relative frequencies with respect to the reference (unclustered) full data.

A. UNDERSTANDING CLUSTERS WITH COOLCAT
For Cluster 1, one reads from Table 2 that it is a severe (i.e., serious and fatal) accident cluster. It is also a non-junction cluster depicting accidents that took place on motorways with high speed limits (50-70 mph) late at night in dark, unlit places. The accidents in this cluster appear to involve pedestrians or objects on the road, which might be one of the reasons why fatal and serious accidents are over-expressed in this cluster. Adverse weather and road conditions such as high winds, snowy weather, and frosty surfaces seem to have contributed to drivers losing control of their vehicles and hitting the nearside or offside of the road, causing such severe accidents. As this is a non-junction cluster with a high speed limit, the related maneuvers are, as expected, overtaking and changing lanes.
The significant variables of Cluster 2 (Table 3) suggest that this is a minor-road cluster (C roads and unclassified roads) at junctions with low speed limits (20-30 mph), predominantly involving two-wheelers (bicycles and motorbikes). As this is an at-junction cluster with two-wheelers, the key maneuver types leading to accidents appear, as one would expect, to be left turns and right turns.
Cluster 3 is also a severe accident cluster, as indicated by the fatal accidents attribute. The main differences from Cluster 1 are that Cluster 3 is a junction cluster and that the accidents in this cluster mostly occur on A roads instead of motorways. Among the junctions, slip roads deserve special attention as they are highly over-expressed (rel. freq. = 5.44). Adverse weather and road conditions also play a significant role in this cluster. Driving on roads with high speed limits under adverse weather with risky maneuver types at a junction (such as changing lane to the left, changing lane to the right, or going ahead on a bend) seems to have led to vehicles losing control and leaving the carriageway (i.e., hitting the roadside and rebounding), which may have resulted in severe outcomes. This cluster also has an interesting element, namely accidents involving left-hand drive (LHD) vehicles (European vehicles), which are generally ignored in most accident analyses because they are rare (but are nevertheless important, as we shall see later in this section).
Like Cluster 1, Cluster 4 describes off-junction accidents. However, there are important differences. First, accidents in this cluster occur on roads with low speed limits. Second, most of these accidents occur on unclassified minor roads where one can see parked or reversing vehicles. Interestingly, the accidents frequently involve buses and trams.
Cluster 5 is another junction cluster but without adverse weather conditions. It predominantly involves roundabouts, multi-arm junctions and private drives. The accidents in this cluster mostly take place at junction entrances. Interestingly, maneuvers that would normally be regarded as safe, such as parked, reversing and waiting, are substantially over-expressed in this cluster. Therefore, a deeper analysis of this cluster can yield unexpected associations between these accident attributes.
Finally, Cluster 6 is a night cluster describing accidents that take place on A roads. The difference from Cluster 3 is that the accidents happen on roads with very low speed limits (20 mph). In contrast to Clusters 2, 3 and 5, mid-junction accidents are more prevalent in this cluster. Another distinctive feature is that this is the only cluster in which the crossroads junction type is significant. Curiously, bicycles and buses are more commonly represented in this cluster.
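The relative frequencies quoted above (e.g., rel. freq. = 5.44 for slip roads) compare how often an attribute value occurs within a cluster with how often it occurs in the unclustered reference data. A minimal sketch of that ratio is given below; the function name `relative_frequencies` and the toy counts are assumptions made for illustration, and the criterion used in the study to declare a variable significant is not reproduced here.

```python
from collections import Counter


def relative_frequencies(cluster_values, reference_values):
    """Ratio of the share of each value in the cluster to its share in the reference data."""
    cluster_counts = Counter(cluster_values)
    reference_counts = Counter(reference_values)
    n_cluster, n_reference = len(cluster_values), len(reference_values)
    return {
        value: (cluster_counts[value] / n_cluster) / (reference_counts[value] / n_reference)
        for value in cluster_counts
        if reference_counts[value] > 0
    }


# Hypothetical junction-type column: 5% slip roads overall, 20% within the cluster.
reference = ["slip_road"] * 5 + ["t_junction"] * 95
cluster = ["slip_road"] * 4 + ["t_junction"] * 16

rel = relative_frequencies(cluster, reference)
print(rel)
```

A ratio above 1 means the value is over-expressed in the cluster relative to the reference case (here slip roads, at about 4), and a ratio below 1 means it is under-expressed.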

B. ASSOCIATION OF ATTRIBUTES IN CLUSTERS AND TEST SCENARIO GENERATION
As emphasized in the introduction, one of the main motivations of this study is the identification of the test scenarios (temporal/spatial conditions) for CAVs that are correlated with important outcomes. The clusters formed in Section 5, along with the significant variables identified, enabled us to find such conditions. In this section, we explicitly demonstrate how this is done using MBA. It should be noted at the outset that the MBA procedure is operated only on the significant variables of each cluster to extract the most relevant scenarios. This means that, based on the analysis of Section 5, no scenario will have the Weekend or Weekday and Urban or Rural Area variables as these variables were not found to be significant. Such information can either be deduced from the context or be generated randomly if they were to be included in a simulation.
In order to mine the most interesting associations, the MBA parameters are set according to the characteristics of each cluster (e.g., by varying the support threshold of variables in the respective clusters). Here we first display and discuss, in Tables 5-10, the top-ranking associations in terms of their confidence or lift values. For the purposes of scenario generation, the standard MBA procedure is modified considerably. First, repeated rules (from each cluster) are removed. Second, associations that are not mutually exclusive are combined in a consistent way to yield longer associations. The longer the association rule, the more detailed the concrete scenario. The rationale is that each independent rule depicts the strong tendency of a set of variables to appear together. A natural combination of such rules forms the conditions/characteristics (environment-related or driver-related) of a scenario. We note here that we do not require an order or direction for the associations of attributes, which allows the flexibility to focus on different accident settings. It should be emphasized that no hard rules (except for the requirement of a maneuver) are imposed to derive the scenarios; in principle, any compatible combination of rules and attributes with high confidence and high lift could be a scenario candidate.
Also, no claim is made on the presented exemplary scenarios being unique (they probably are not). Each exemplary scenario represents a non-trivial, interesting situation that is present in the respective cluster and leads to important consequences. For each exemplary scenario a diagram was created using SUMO (Simulation of Urban Mobility) to aid with visualization [60].
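The two modifications described above (removing repeated rules and combining compatible ones) can be sketched as follows. This is only one possible reading, under stated assumptions: each item is represented as an (attribute, value) pair, rule direction is ignored, two rules are treated as compatible when no attribute takes two different values, and merging is done greedily in the order the rules are listed. The function names and the toy rules are hypothetical.

```python
def as_itemset(antecedent, consequent):
    """Drop the rule direction: a rule becomes an unordered set of (attribute, value) items."""
    return frozenset(antecedent) | frozenset(consequent)


def compatible(a, b):
    """Two itemsets are compatible if no attribute is assigned two different values."""
    values = {}
    for attribute, value in a | b:
        if values.setdefault(attribute, value) != value:
            return False
    return True


def combine_rules(rules, required_attribute="maneuver"):
    """Deduplicate rules, greedily merge compatible ones, keep those that include a maneuver."""
    unique = list(dict.fromkeys(as_itemset(a, c) for a, c in rules))  # order-preserving dedupe
    scenarios = []
    for itemset in unique:
        for i, scenario in enumerate(scenarios):
            if compatible(scenario, itemset):
                scenarios[i] = scenario | itemset
                break
        else:
            scenarios.append(itemset)
    return [dict(s) for s in scenarios if any(attr == required_attribute for attr, _ in s)]


# Hypothetical rules as (antecedent, consequent) pairs of (attribute, value) items.
rules = [
    ([("maneuver", "overtaking_offside")], [("surface", "wet")]),
    ([("surface", "wet")], [("maneuver", "overtaking_offside")]),  # repeat of the first rule
    ([("surface", "wet"), ("road", "motorway")], [("outcome", "skidded")]),
    ([("maneuver", "changing_lane_left")], [("weather", "high_winds")]),
]

scenarios = combine_rules(rules)
for scenario in scenarios:
    print(scenario)
```

With these toy rules, the first three collapse into one longer scenario (overtaking offside on a wet motorway, ending in a skid), while the fourth stays separate because its maneuver conflicts with the first.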
For Cluster 1, recalling that this is a serious or fatal accident cluster on a motorway and away from a junction, lane-changing maneuvers (to right or left) and overtaking combined with negative environmental conditions are associated with serious outcomes such as leaving the carriageway and overturning (rule #2) or skidding/jackknifing (rule #3). From the association rules, it can also be inferred that goods vehicles are more at risk of getting involved in motorway accidents than other vehicles. Other rules can be interpreted in a similar manner (Table 5).
Exemplary scenario 1. A vehicle overtakes another vehicle on the offside on a motorway with a wet surface. A possible outcome of this scenario is an accident in which the vehicle skids and rebounds from the nearside (rule #3), as shown in figure 2.
For Cluster 2, Table 6 lists some of the main associations. Cluster 2, being a two-wheeler cluster, comprises traffic situations for bicycles or motorbikes. Rule #1 indicates that accidents at a private drive or entrance, when clearing the junction onto an unclassified road, are strongly linked to right-turn maneuvers. Rule #2 illustrates a scenario for bicycles on unclassified roads, this time turning left onto an unclassified road while clearing a junction. Both rules have high lifts.
Exemplary scenario 2. A bicycle on an unclassified road at a T or staggered junction makes a left turn and, when about to clear the junction, gets into an accident (rule #2), as shown in figure 3.
For Cluster 3, a number of interesting scenarios can be generated (Table 7). Again, we describe the first few interesting rules (giving scenarios); the others can be interpreted in the same way. Rule #1 describes a situation in which a vehicle going ahead on a bend at a T or staggered junction on an A road on a windy day gets into a crash and is hit on the nearside. Such accidents are strongly linked to the road surface being wet/damp. Rule #2 suggests that drivers should be careful at T or staggered junctions, as going ahead on a bend while clearing the junction on frosty/icy roads is strongly associated with accidents at such junctions.
Exemplary scenario 3. During high winds, a vehicle at a roundabout on an A road goes ahead on a bend in the middle of the junction and gets into an accident. The road surface is wet (rule #3), as shown in figure 4.
Cluster 4 describes accidents with back impact points, which are generally found on minor (unclassified) roads in urban areas where drivers often need to reverse to park or to join the road. The level of detail provided by the corresponding rule is low. As discussed earlier, in such cases, other relevant variables defining a scenario can be generated randomly for scenario development.

Exemplary scenario 4. A vehicle reverses on an unclassified road and gets hit from the back (rule #1), as shown in figure 5.
Cluster 5 is a true junction cluster that mostly involves female drivers. Here we discuss the most strongly associated conditions. Rule #1 describes situations in which vehicles are hit from the back while moving off and entering a roundabout. Rule #2 also describes a junction-entry situation, but at a private drive or entrance.

Exemplary scenario 5. A vehicle reverses into a private drive or entrance and gets hit from the back while entering the junction (rule #2), as shown in figure 6.
Finally, a set of association rules mined from Cluster 6, focusing on accidents involving buses/trams and bicycles at crossroads, is described in Table 9. Rule #1 indicates that, of the accidents that involve buses/trams trying to turn right at mid-junction, a significant portion happen at crossroads. In addition, buses/trams that try to change lane to the left end up, almost certainly, in crashes with a nearside impact (rule #2). For buses/trams driving at night, roundabouts pose risks, especially when clearing the junction (rule #3). Rule #4 depicts general situations linking crossroad accidents to left-turn maneuvers that resulted in nearside crashes at low speeds at mid-junction. On the other hand, accidents at junctions with more than four arms that involve turning right happen almost always when clearing the junction (rule #5). Furthermore, rule #7 suggests that cyclists who are turning right and clearing junctions are linked to accidents at roundabouts, whereas rule #8 suggests that accidents in which bicycles change lane to the left almost certainly take place at crossroads.
Association rules via application of MBA procedure on Cluster 6.

Exemplary scenario 6. A bus driving in darkness with lights lit makes a right turn at a junction with more than 4 arms and, when clearing the junction, gets into an accident and is hit on the nearside (rule #6), as shown in figure 7.

VII. CONCLUSION
This study aims to achieve two high-level objectives. The first objective was to underpin the research on safety analysis of traffic accidents by identifying patterns based on past accident records. This was performed using a cluster analysis method. This approach reveals the natural patterns in the data without making any prior modelling assumptions, which is advantageous considering the complexity of factors that can affect the outcomes. The second objective was to develop a method based on the information obtained from accident clusters, which will help design test case scenarios for AVs, thus filling an important gap in the industry. To achieve both objectives, several novel approaches were taken to deepen some of the existing methods to obtain more useful results while considering possible future challenges in industrial applications (such as handling of continuously growing large datasets).
For the first objective, the COOLCAT clustering algorithm was used on the processed STATS19 dataset to determine the natural grouping of accidents. COOLCAT employs a natural global clustering criterion (entropy), which is particularly well suited to clustering noisy categorical data and handles high-dimensional data with ease. To the best of our knowledge, this is the first application of the COOLCAT algorithm in traffic accident research. Using various cluster quality metrics, six clusters were obtained from the algorithm. The frequency tests conducted on each cluster indicated that Cluster 1 was described by nighttime serious/fatal accidents on motorways away from junctions, which involved changing lanes (right/left) and ended with a skidding/overturning vehicle; Cluster 2 was described by minor-road accidents involving two-wheelers at junctions on roads with low speed limits, with right/left turns; and Cluster 3 by fatal/serious accidents on A roads but at junctions (especially slip roads) involving left-hand drive vehicles. Similarly, Cluster 4 can be represented by accidents on unclassified roads with low speed limits (likely to be narrow streets) away from junctions involving U-turn or reversing maneuvers, which often ended in hits from the back; and Cluster 5 depicts relatively minor accidents at junctions with 'gentle' maneuvers such as parked, waiting, and moving off. Finally, Cluster 6 describes accidents at junctions on A roads with a low speed limit where the main maneuver types were turning right/left or moving off. The results suggest that particular care should be taken when making policies/regulations for the elements described in the clusters.
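As a reminder of what an entropy-based clustering criterion looks like for categorical records, the sketch below computes the size-weighted expected entropy of a clustering (summing per-attribute entropies, i.e., treating attributes as independent) and places a new record into the cluster that keeps this quantity lowest. It is a simplified illustration of the incremental step only: the seeding of initial clusters and any re-processing of records are omitted, and the function names and toy records are assumptions, not the implementation used in this study.

```python
from collections import Counter
from math import log2


def cluster_entropy(records):
    """Sum of per-attribute Shannon entropies (attributes treated as independent)."""
    if not records:
        return 0.0
    n = len(records)
    total = 0.0
    for column in zip(*records):  # one column per attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average of the cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


def assign(record, clusters):
    """Add the record to the cluster that minimizes the resulting expected entropy."""
    def entropy_if_added(k):
        trial = clusters[:k] + [clusters[k] + [record]] + clusters[k + 1:]
        return expected_entropy(trial)

    best = min(range(len(clusters)), key=entropy_if_added)
    clusters[best].append(record)
    return best


# Hypothetical records: (road class, light condition).
clusters = [[("motorway", "dark")], [("unclassified", "daylight")]]
chosen = assign(("motorway", "dark"), clusters)
print(chosen, expected_entropy(clusters))
```

In the toy example the new record is identical to the one in the first cluster, so adding it there leaves the expected entropy at zero, whereas adding it to the second cluster would mix values in both attributes.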
For the second objective, based on the information obtained from the clusters, the MBA methodology was applied for association rule mining. As the standard MBA produces repetitive rules (when ordering is not counted), which may only partially describe accidents, we extended the method considerably by systematically combining nonconflicting rules that provided much higher details for the test scenarios. As expected, scenarios obtained from this procedure reflect the characteristics of the cluster that they come from. Once the scenarios are obtained, they can be used in real or virtual environments for CAV training by varying the unspecified attributes as free variables. This will significantly speed up the training processes of CAVs, as they will be driven on quality miles rather than on random routes.
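Treating the unspecified attributes as free variables can be sketched as sampling each missing attribute while keeping the mined attributes fixed. The attribute names, candidate values, and the uniform sampling below are illustrative assumptions; in practice, the values could equally be deduced from context or drawn from the empirical distribution in the data.

```python
import random

# Hypothetical free attributes and their admissible values.
FREE_ATTRIBUTE_VALUES = {
    "day_type": ["weekday", "weekend"],
    "area": ["urban", "rural"],
}


def instantiate(scenario, free_values, rng):
    """Copy the scenario and fill in each unspecified attribute with a randomly chosen value."""
    concrete = dict(scenario)
    for attribute, options in free_values.items():
        if attribute not in concrete:
            concrete[attribute] = rng.choice(options)
    return concrete


rng = random.Random(0)  # seeded so that the generated variants are reproducible
base = {"maneuver": "reversing", "road": "unclassified", "impact": "back"}
variants = [instantiate(base, FREE_ATTRIBUTE_VALUES, rng) for _ in range(5)]
for variant in variants:
    print(variant)
```

Each variant preserves the attributes fixed by the association rules and differs only in the free variables, which is what allows one mined scenario to seed many concrete test cases.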
There are theoretical and practical implications of this work. First, clustering, as a method for accident analysis, is underexploited. It can be used along with other existing methods (e.g., regression) and enhance them by homogenizing the data. Furthermore, data-specific clustering models, such as COOLCAT, can yield higher quality results than more generic algorithms. On the practical front, the output of this work has immediate industrial applications. The proposed approach provides an end-to-end methodology to generate, in a nearly automated manner, high-quality test scenarios that can be used in simulations by manufacturers. In fact, test scenarios obtained via the proposed method are now being prepared (data format adjustments) to be deposited into the recently launched, world's largest scenario repository, SafetyPool™ [61].
There are also apparent limitations of this work, mostly due to the scope of the data that was used. The analysis can provide details to the extent that the data can provide, but not more. Although we tried to keep the number of attributes high, the real world contains conditions that may be important but not covered in the present data (such as the position of the sun and curvature of the road). In future studies, multiple data sources can be combined to provide a more detailed description of each accident, which will affect the formation of accident clusters and the association rules extracted from those clusters (i.e., more detailed test scenarios).

A. PLOTS FOR ASSOCIATION RULES IN CLUSTERS
Below are the plots of the association rules, represented by arrows between variables, along with their corresponding confidence values.

B. RESTRUCTURING OF STATS19 TRAFFIC VARIABLES
Here we provide an example of the re-categorization of the data for the traffic variable Vehicle type. For simplicity of the analysis, the original categories of the raw data (Table 3) are restructured to give the new ones.
Code   Vehicle type
…      Motorcycle over 125cc and up to 500cc
5      Motorcycle over 500cc
8      Taxi/Private hire car
9      Car
10     Minibus (8-16 passenger seats)
11     Bus or coach (17 or more pass seats)
16     Ridden horse
17     Agricultural vehicle
18     Tram
19     Van / Goods 3.5 tonnes or under
20     Goods over 3.5t. and under 7.5t
21     Goods 7.5 tonnes and over
22     Mobility scooter
23     Electric motorcycle
90     Other vehicle
97     Motorcycle - unknown cc
98     Goods vehicle - unknown weight
-1     Data missing or out of range
PAUL JENINGS is a physicist who has been with WMG for over 25 years working on research with industrial and academic partners. He has built groups in Intelligent Vehicles, Energy Storage and Management, and Experiential Engineering through significant research and capital funding. Paul and his team deliver research which has a tangible impact on industrial partners' competitiveness. One of his early successes was in the area of automotive sound quality, in collaboration with Jaguar Cars (now JLR). The research resulted in a fundamental change in the way JLR benchmark their cars. Since 2014 Paul has led WMG's multidisciplinary Intelligent Vehicles research activity, which draws in capability from across the department including Complex Electrical Systems, Communications, Experiential Engineering, Cyber Security, Modelling and Simulation, Visualization and Business and Operations.