Cause Analysis of Traffic Accidents on Urban Roads Based on an Improved Association Rule Mining Algorithm

Traffic accidents on urban roads result from the joint action of multiple factors, namely, human, vehicle, road and environment. To identify the key causes of such accidents, it is necessary to mine the association rules between relevant risk factors from accident statistics. Considering the multiple layers and dimensions of accident data, this paper improves the Apriori algorithm to mine the association rules between risk factors, and probes into the causes of traffic accidents on urban roads. According to the layer and dimension of specific attributes, parameters such as support, confidence and lift were adjusted to find the qualified association rules between risk factors. The results were further screened to obtain a series of meaningful association rules. The research results enable the traffic department to formulate targeted accident control measures and promote traffic safety on urban roads.


I. INTRODUCTION
Traffic accidents on urban roads are an inevitable problem in urban transport, causing massive casualties and property losses. China has made progress on traffic safety and has kept road traffic safety from deteriorating further. However, the traffic safety situation in urban areas of China still gives little cause for optimism. Statistics from the World Health Organization (WHO) show that China ranks 98th among 178 countries in road traffic death rate (10.45 deaths per 10,000 vehicles). According to China's statistics on traffic accidents in 2015, an average of 23 accidents occur on every 100 km of urban roads in China, which is far greater than that on expressways, first-class highways and second-class highways.
Considering the severity of traffic accidents on urban roads, Chinese and foreign scholars and experts have explored the influencing factors in depth and analyzed the causes of urban road accidents. Concerning the influencing factors, 80-90% of traffic accidents are related to human factors like the driver [1], [2]. For instance, Larsen and Kines [3] drew the following conclusions after investigating specific types of accidents: the main causes of head-on collisions include excessive speed, drunk driving and driving under the influence of illegal drugs; the drivers involved in these accidents are all males; in the left-turn accidents, the most common causes are attention errors and advanced age. McGwin and Brown [4] discussed the types and causes of traffic accidents that are common among drivers in different age groups, revealing that old drivers are prone to accidents when turning and changing lanes due to slow reaction and lack of observation, while young drivers may face accidents because they are risk-taking and unskillful. Ballesterors et al. [5] found that the youngest drivers have the highest rate of traffic accidents under safe road conditions.
The associate editor coordinating the review of this manuscript and approving it for publication was Dalin Zhang.
With the application of mathematical statistics in traffic accident analysis, many scholars have attempted to identify the causes of traffic accidents from multiple perspectives. Based on the records of 1,606 accidents, Abdel-Aty and Radwan [6] disclosed the influence of multiple factors (i.e. traffic volume, vehicle speed, number of lanes and lane width) on the frequency of traffic accidents through negative binomial modeling, and evaluated how the influence of each factor varies with the genders and ages of the drivers. Chen et al. [7] applied a hierarchical Bayesian logistic model to examine the factors that significantly affect the severity of driver injury, and concluded that injury severity is increased by road curve, type of accident, as well as the gender, age and alcohol/drug involvement of the driver, and reduced by wet road surface, male drivers and seatbelt use.
Concerning the cause analysis, the causes of traffic accidents are often examined by aggregate methods like linear regression [8], clustering [9], [10] and time series [11], and disaggregate methods like Bayesian network [12] and discrete choice model [13]. Treating all accidents as a whole, aggregate methods are easy to implement, but not comprehensive enough to consider the features of each accident. Meanwhile, disaggregate methods are complicated in computation, but excel in modelling effectiveness and prediction accuracy.
In addition, many intelligent models, which are based on machine learning, have been introduced to risk modelling of traffic accidents on urban roads, namely, neural network (NN) [14], Bayesian neural network (BNN) [15], and support vector machine (SVM) [16]. Compared with traditional models, these models are good at linear or nonlinear approximation. Nevertheless, the generalization ability of such models is rather limited, calling for verification against massive measured data [17], [18].
The above analysis shows that the distribution laws of influencing factors can be obtained through statistical analysis of traffic accidents on urban roads, shedding light on the causes of such accidents [19]. However, it is impossible to identify the associations between multiple risk factors of traffic accidents through data analysis on the statistical distribution level, not to mention mining out the key accident chains from accident data [20].
To address these shortcomings, the classic Apriori algorithm was improved from the angles of data structure, measuring indices, and subjective constraints, in the light of the statistical features of traffic accidents on urban roads. The algorithm flow was optimized using the R programming language. In the improved algorithm, the association rules were mined in four steps: data processing, modelling of multidimensional data, algorithm mining and rule interpretation. Then, the improved Apriori algorithm was adopted to identify the key causes of traffic accidents on urban roads, through multidimensional and multilayered mining of accident data. The improved algorithm extends the set of methods available for analyzing the causes of urban road traffic accidents, and is of practical significance.
The remainder of this paper is organized as follows: Section 2 explains how to mine the association rules between risk factors of traffic accidents on urban roads with the improved Apriori algorithm; Section 3 verifies the improved Apriori algorithm with actual data; Section 4 puts forward the conclusions.

II. METHODOLOGY
Traffic accidents on urban roads are attributable to many risk factors, which are associated with each other in complex ways. To understand the accident causes, it is helpful to first identify the association rules between these risk factors.
Many analysis strategies have emerged for risk factors of traffic accidents on urban roads. Most of them only evaluate the accident causes by a single, static index, considering road alignment factors (e.g. slope, horizontal curve radius, and turning angle), driver factors (e.g. gender and age), and traffic conditions (e.g. traffic flow, vehicle speed, and road occupancy).
On actual roads, however, traffic accidents result from the joint action of multiple factors, namely, human, vehicle, road and environment. These factors often have complicated associations. At present, most studies on the causes of urban road traffic accidents have not been able to disentangle these interrelated factors effectively. An association rule mining algorithm based on the statistical characteristics of urban road traffic accidents is therefore needed to improve the efficiency and accuracy of accident cause analysis.
Therefore, this paper attempts to mine the causes of traffic accidents on urban roads based on association rules. Firstly, the statistics on such accidents were processed into structured data. Next, the Apriori algorithm was improved to extract the association rules between influencing factors of the accidents. The extracted rules lay the basis for the traffic department to formulate risk control measures.

A. APRIORI ALGORITHM
The Apriori algorithm is a famous association rule mining method proposed by Agrawal et al. in 1993. The algorithm adopts a two-phase iterative search for frequent itemsets [21]: (1) Find all itemsets for which the support is greater than the threshold support in the entire database through level-wise search; (2) Create rules from each frequent itemset by binary partition, and look for the ones with high confidence [22].
The Apriori algorithm, a milestone of data mining, provides a level-wise 1D data mining method that finds frequent itemsets in a dataset for Boolean association rules [23]. The algorithm is simple in structure and easy to understand, involving no complex formula derivation. The relevant concepts and formulas are defined as follows: Let $T = \{t_1, t_2, \ldots, t_n\}$ be the set of transactions called the database (the database of traffic accidents on urban roads), and $I = \{i_1, i_2, \ldots, i_n\}$ a set of $n$ attributes called items (the accident attributes). Each transaction $t_i$ $(i = 1, 2, \ldots, n)$ corresponds to a subset of $I$ ($t_i \subseteq I$) (every record of a traffic accident contains several accident attributes).
A rule can be defined as an implication, $A \Rightarrow B$, where $A$ (with items $A_i$, $i = 1, 2, \ldots, n$) and $B$ (with items $B_j$, $j = 1, 2, \ldots, m$) are two subsets of $I$ (two sets of accident attributes in a record of a traffic accident), with $A \subseteq I$, $B \subseteq I$ and $A \cap B = \emptyset$. $A \Rightarrow B$ implies that, if $A$ occurs in an accident, $B$ tends to occur as well.
There are three key indices in the Apriori algorithm, namely, support, confidence and lift. In the formulas below, $P(X)$ denotes the proportion of accidents in $T$ that contain all items of $X$. The support of association rule $A \Rightarrow B$ in database $T$ refers to the proportion of accidents involving both $A$ and $B$ in all accidents:
$$\mathrm{sup}(A \Rightarrow B) = P(A \cup B) = \frac{|\{t \in T : A \cup B \subseteq t\}|}{|T|} \tag{1}$$
The confidence of association rule $A \Rightarrow B$ in database $T$ refers to the probability for an accident containing $A$ to contain $B$, i.e. the proportion of accidents involving both $A$ and $B$ to those containing $A$:
$$\mathrm{conf}(A \Rightarrow B) = P(B \mid A) = \frac{P(A \cup B)}{P(A)} \tag{2}$$
The lift of $A \Rightarrow B$ in database $T$ [24] refers to the ratio of the confidence of $A \Rightarrow B$ to the probability for an accident to contain $B$:
$$\mathrm{lift}(A \Rightarrow B) = \frac{P(B \mid A)}{P(B)} = \frac{P(A \cup B)}{P(A)\,P(B)} \tag{3}$$
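To make the three indices concrete, the following minimal Python sketch computes support, confidence and lift on a toy set of records. It is only an illustration: the four records, the `attribute=value` item names and the function names are hypothetical and are not taken from the paper's dataset or from the R implementation used by the authors.

```python
def support(transactions, itemset):
    """Proportion of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of A and B together, divided by the support of A."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of A => B divided by the support of B."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical toy records (assumption for illustration, not the paper's data).
records = [
    frozenset({"cause=tailgating", "form=moving_vehicle", "type=injury"}),
    frozenset({"cause=tailgating", "form=moving_vehicle", "type=fatal"}),
    frozenset({"cause=speeding", "form=moving_vehicle", "type=injury"}),
    frozenset({"cause=speeding", "form=stationary_vehicle", "type=fatal"}),
]

A = {"cause=tailgating"}
B = {"form=moving_vehicle"}
print(support(records, A | B), confidence(records, A, B), lift(records, A, B))
```

In this toy example, both tailgating records involve a moving vehicle, so the confidence is 1, while three of the four records contain the consequent, so the lift is 1/0.75, i.e. above 1.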

B. IMPROVED APRIORI ALGORITHM
According to the above analysis on association rule mining, all the attributes must be converted into the same dimension before the Apriori algorithm can be applied to mine the association rules between risk factors of traffic accidents on urban roads. This normalization process might reduce the scanning efficiency of the database in the search for frequent itemsets, and output meaningless search results. In other words, the Apriori algorithm is not directly applicable to the multilayer, multidimensional dataset of traffic accidents on urban roads. Therefore, the Apriori algorithm was improved to suit association rule mining on such data. Specifically, each accident attribute was regarded as a predicate. Since the accident data are multilayered, the layers of attributes related to the mining demand were selected for association analysis. In this way, the multilayer mining problem is converted into a single-layer mining problem, and the associations between different layers of attributes can be revealed. This method avoids the large number of meaningless rules that the traditional association rule algorithm outputs when too many factors are involved, and improves the efficiency of data mining to some extent. The flow of the improved Apriori algorithm is explained as follows:
Step 1. Generate the frequent itemset.
Input: an n-dimensional database; the minimum threshold support, min-sup.
Output: an n-dimensional frequent itemset.
(1) Initialize the parameters: $k = 1$, $L = \emptyset$ and $C_1 = \emptyset$;
(2) Generate the 1st n-dimensional candidate itemset $C_1$;
(3) Generate the 1st frequent itemset $L_1$;
(4) Generate the k-th candidate itemset $C_k$ based on the (k-1)-th frequent itemset $L_{k-1}$;
(5) Generate the k-th frequent itemset $L_k$ based on the k-th candidate itemset $C_k$: calculate the sup of each itemset in $C_k$ by formula (1), and add all the itemsets whose sup > min-sup into $L$;
(6) Repeat (4) and (5) with $k$ increased by 1 until no new frequent itemset is generated; obtain the n-dimensional frequent itemset $L$, and go to Step 2.
Step 2. Generate the set of multidimensional association rules $R$.
Input: frequent itemset $L$; the minimum threshold confidence, min-conf.
Output: a set of association rules $R$.
(7) Initialize $R = \emptyset$;
(8) Calculate the conf and lift between the nonempty subsets of each frequent itemset by formulas (2) and (3), respectively;
(9) Add all the association rules whose conf > min-conf and lift > 1 into $R$.
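The two steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration of level-wise frequent itemset generation followed by rule generation with the conf > min-conf and lift > 1 constraints; it is not the authors' R code, and the function names, the toy records and the thresholds (0.4 and 0.7) are assumptions chosen only so that the example is small enough to check by hand.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Step 1: level-wise search; returns {itemset: support}."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets (L_1).
    items = {i for t in transactions for i in t}
    level = {}
    for i in items:
        s = sup(frozenset([i]))
        if s > min_sup:
            level[frozenset([i])] = s

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join: unions of two (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            s = sup(c)
            if s > min_sup:
                level[c] = s
    return frequent


def association_rules(frequent, min_conf):
    """Step 2: rules (A, B, sup, conf, lift) with conf > min_conf and lift > 1."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                a = frozenset(ante)
                b = itemset - a
                conf = s / frequent[a]     # subsets of a frequent itemset are frequent
                lift = conf / frequent[b]
                if conf > min_conf and lift > 1:
                    rules.append((a, b, s, conf, lift))
    return rules


# Hypothetical toy records (assumption for illustration, not the paper's data).
records = [
    frozenset({"cause=tailgating", "form=moving_vehicle", "type=injury"}),
    frozenset({"cause=tailgating", "form=moving_vehicle", "type=fatal"}),
    frozenset({"cause=speeding", "form=moving_vehicle", "type=injury"}),
    frozenset({"cause=speeding", "form=stationary_vehicle", "type=fatal"}),
]
freq = frequent_itemsets(records, min_sup=0.4)
rules = association_rules(freq, min_conf=0.7)
for a, b, s, conf, lift in rules:
    print(sorted(a), "=>", sorted(b), round(s, 2), round(conf, 2), round(lift, 2))
```

The sketch recomputes support by scanning all records for every candidate, which is acceptable for a few hundred records but would need indexing or a library implementation for large databases.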

A. DATA COLLECTION AND PROCESSING
The mining of association rules is greatly affected by the structure of data samples on traffic accidents on urban roads. The mining efficiency can be enhanced if the data are structured. However, the traffic accident records have many missing items. Before being applied to association rule mining, the original records should be processed into structured data. Based on the original records, the structured data are often obtained in three steps (Figure 1): data cleaning, data transformation and data reduction. After visiting traffic departments, the authors learned that the statistics of traffic accidents on urban roads in China are currently collected and archived by the traffic police division, and most data are kept confidential.
VOLUME 8, 2020
Considering data availability, the authors paid a visit to the traffic police division in a city in northern China, and acquired 589 records on traffic accidents in that city, which happened between January and September, 2018.
The 589 records were processed by the workflow in Figure 1, creating 576 valid records of structured data. Each valid record contains 15 attributes. The serial number, name, type, and meaning of each attribute are listed in Table 1 below.
The data of traffic accidents on urban roads belong to multiple layers and dimensions. It is unwise to directly apply all the accident attributes to the association rule mining between risk factors. For one thing, the numerous attributes will drag down the mining efficiency. For another, it is difficult to find useful association rules with high confidences between the multidimensional data on different layers.
Based on multilayer multidimensional association rules, this paper puts forward an accident risk analysis approach to mine the data of traffic accidents on urban roads. To facilitate the mining of association rules, the accident attributes in the dataset of traffic accidents on urban roads were divided into three layers (Figure 2).
As shown in Figure 2, the data samples on traffic accidents of urban roads are divided into three layers. On the first layer, there are four attributes, namely, accident, time, external environment, and road condition.
On the second layer, the four primary attributes were broken down into 15 secondary attributes: accident was split into four parts (accident form, accident cause, accident type, and accident location); time was split into three parts (season, week, and hour); external environment was split into four parts (terrain, weather, illumination and visibility); road condition was split into four parts (pavement condition, intersection/section type, road shape and isolation facility).
On the third layer, each secondary attribute was further split into a number of tertiary attributes. The full list of all attributes is provided in Table 1.
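As an illustration of how the layered attributes might be turned into single-layer transactions, the Python sketch below stores the primary-to-secondary hierarchy described above and encodes each selected secondary attribute of a record as an `attribute=value` item. The attribute names come from the text; the record values, the function names and the item encoding are assumptions made for illustration and do not reproduce Table 1 or the authors' R implementation.

```python
# Primary attributes (layer 1) mapped to their secondary attributes (layer 2).
HIERARCHY = {
    "accident": ["accident form", "accident cause",
                 "accident type", "accident location"],
    "time": ["season", "week", "hour"],
    "external environment": ["terrain", "weather",
                             "illumination", "visibility"],
    "road condition": ["pavement condition", "intersection/section type",
                       "road shape", "isolation facility"],
}


def select_attributes(primaries):
    """Secondary attributes that belong to the chosen primary attributes."""
    return [s for p in primaries for s in HIERARCHY[p]]


def to_transaction(record, attributes):
    """Encode the chosen attributes of one record as 'attribute=value' items.

    Missing values (None or absent keys) are skipped.
    """
    return frozenset(f"{a}={record[a]}" for a in attributes
                     if record.get(a) is not None)


# Hypothetical record; the tertiary values are placeholders.
record = {
    "accident type": "fatal",
    "accident form": "collision with moving vehicle",
    "season": "summer",
    "week": "workday",
    "hour": None,
}
print(to_transaction(record, select_attributes(["time"])))
```

Selecting only the primary attributes of interest before encoding keeps the item universe small, which is the sense in which the multilayer problem is reduced to a single-layer one.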

B. ASSOCIATION RULE MINING
1) ASSOCIATION RULES BETWEEN SECONDARY ATTRIBUTES UNDER ACCIDENT
The Apriori implementation in the arules package for the R programming language was adopted and improved to suit multilayer, multidimensional data, according to the procedure in Subsection 2.2.
Firstly, the targets of association rule mining were selected, considering the features of attributes on different layers. Here, the mining focuses on the association rules between tertiary attributes of accident and those of other secondary attributes. Secondly, the parameter thresholds of the association rule mining algorithm were adjusted for each target. Finally, the association rules thus acquired were analyzed, in view of the actual meaning of each attribute. The causes of traffic accidents on urban roads can be analyzed in the following steps:
Step 1. Selection of analysis targets: The attributes to be analyzed are selected based on their levels, forming a dataset of relevant attributes.
Step 2. Parameter setting: The minimum support (min-sup), minimum confidence (min-conf) and lift threshold are determined according to the accident attributes for the association rule mining.
Step 3. Association rule mining: Following the flow of the improved Apriori algorithm, the frequent itemset satisfying the minimum support is solved, and the qualified association rules are then identified based on the minimum threshold confidence and the lift constraint.
Step 4. Interpretation and application of association rules: The qualified association rules are further analyzed and screened, and the rules with meaningful consequents are extracted as the association rules between accident form, accident type and accident cause. Next, the practical meanings of these association rules are interpreted to guide the application.
According to the abovementioned flow of association rule mining for risk factors, the authors probed into the relationship between the three secondary attributes under accident, namely, accident form, accident type and accident cause.
Firstly, the urban road traffic accident dataset, restricted to the three secondary attributes accident form, accident type and accident cause, was input to the improved algorithm. Next, the initial parameter thresholds were determined by analyzing the results of several trials. The minimum threshold support (min-sup) and minimum threshold confidence (min-conf) were initialized as 0.1 and 0.25, respectively, and 21 initial association rules were found to satisfy these initial thresholds.
The visualization package arulesViz for the R programming language was adopted to prepare the scatter plot for the 21 rules. The scatter plot is presented as Figure 3, where the x-axis is support, the y-axis is confidence, and the shade of green indicates the lift.
It can be seen from Figure 3 that the supports of some association rules fell between 0.1 and 0.2, and those of other association rules ranged between 0.25 and 0.35, indicating the coexistence of high- and low-frequency qualified association rules. In terms of confidence, most association rules fell between 0.4 and 0.7, and only a few remained below 0.4. This means the initial threshold confidence is too small. Thus, the confidence was increased to 0.4 to facilitate rule screening. Moreover, half of all rules were below 1 in terms of lift, revealing that many rules fail to satisfy the constraint that the lift should be greater than 1.
The above analysis shows the necessity of parameter adjustment before further screening for effective association rules. Given the severity of the consequences of traffic accidents, the association rules between important yet low-frequency risk factors might be filtered out if the threshold support is too high. Hence, the minimum threshold confidence was reset to 0.4 and lift to 1, without changing the minimum threshold support. In this way, a total of 10 association rules were found to satisfy the constraints: sup>0.1, conf>0.4 and lift>1.0.
Considering the meanings of secondary attributes under accident, the consequents cannot be accident causes; they can only be the form and type of accident. In addition, the relationship between accident form and accident type is only meaningful when accident form is taken as the antecedent and accident type as the consequent. Thus, the inspect function of the arules package was called to extract the association rules for the consequents of accident type and accident form. The association rules with practical value were thus obtained (Table 2).
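The two subjective constraints described here can be expressed as a simple predicate on a rule. The Python sketch below is one possible reading of those constraints, assuming items are encoded as `attribute=value` strings; the function names and the example items are hypothetical, and the authors' actual screening was done with the arules package in R.

```python
def attribute_of(item):
    """Attribute name of an 'attribute=value' item."""
    return item.split("=", 1)[0]


def is_meaningful(antecedent, consequent):
    """Apply the two constraints on rules among form, type and cause.

    1. An accident cause may not appear in the consequent.
    2. Accident type may not imply accident form (only form => type).
    """
    ante = {attribute_of(i) for i in antecedent}
    cons = {attribute_of(i) for i in consequent}
    if "accident cause" in cons:
        return False
    if "accident type" in ante and "accident form" in cons:
        return False
    return True


# Hypothetical rules written as (antecedent, consequent) pairs.
form = "accident form=collision with moving vehicle"
kind = "accident type=injury"
cause = "accident cause=tailgating"

print(is_meaningful({form}, {kind}))    # form => type
print(is_meaningful({kind}, {form}))    # type => form
print(is_meaningful({form}, {cause}))   # cause as consequent
print(is_meaningful({cause}, {form}))   # cause => form
```

Only the first and last of the four example rules pass the predicate, since the second reverses the form-to-type direction and the third places a cause in the consequent.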
As shown in Table 2, a total of 5 meaningful association rules were obtained between the secondary attributes under accident. The following laws can be extracted from these rules: (1) The fatal traffic accidents on urban roads are mostly caused by driving with other behaviors that undermine driving safety; (2) Collisions with moving vehicles often lead to injury accidents; (3) The collisions with moving vehicles, which arise from driving with other behaviors that undermine driving safety, are very likely to cause fatal accidents; (4) Crashes into stationary vehicles often result in fatal accidents; (5) Tailgating can easily bring about collisions with moving vehicles. Based on these rules, the traffic department can prepare countermeasures to reduce the accident risk.

2) ASSOCIATION RULES BETWEEN ATTRIBUTES UNDER ACCIDENT AND THOSE UNDER THE OTHER PRIMARY ATTRIBUTES
In the preceding subsection, the association rules between secondary attributes under accident were mined out, reflecting the correlations between accident type, accident form and accident cause. This subsection aims to identify how the causes of traffic accidents on urban roads correlate with the attributes under other primary attributes. For this purpose, four primary attributes (accident, time, road condition and external environment), 15 secondary attributes and 74 tertiary attributes were taken into account.
The improved Apriori algorithm was applied to mine the association rules between attributes under accident and those under the other primary attributes. Firstly, the thresholds (minimum support, minimum confidence and lift) were adjusted to find the effective association rules. A large number of effective rules were obtained, due to the sheer number of tertiary attributes. To facilitate subsequent analysis, the subset function and sort function of the arules package were called to further screen the set of effective rules. Finally, the useful association rules were obtained, and used to interpret the correlations between attributes under accident and those under the other primary attributes.
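The subset-then-sort screening can be mimicked in a few lines of Python. The sketch below is only an analogue of that workflow, not a call into arules: rules are assumed to be plain dictionaries, and the field names, the helper `top_rules` and the example values are hypothetical.

```python
def top_rules(rules, keep, n=10):
    """Keep the rules accepted by `keep`, then return the n with the highest confidence."""
    selected = [r for r in rules if keep(r)]
    return sorted(selected, key=lambda r: r["confidence"], reverse=True)[:n]


# Hypothetical rules (assumption for illustration, not values from the paper's tables).
rules = [
    {"lhs": {"week=workday"}, "rhs": {"accident type=fatal"},
     "support": 0.20, "confidence": 0.55, "lift": 1.10},
    {"lhs": {"week=holiday"}, "rhs": {"accident type=injury"},
     "support": 0.12, "confidence": 0.62, "lift": 1.25},
    {"lhs": {"accident type=fatal"}, "rhs": {"week=workday"},
     "support": 0.20, "confidence": 0.70, "lift": 1.10},
]


def consequent_is_accident(rule):
    """True when every consequent item is an attribute under accident."""
    return all(item.startswith("accident ") for item in rule["rhs"])


best = top_rules(rules, consequent_is_accident, n=10)
for r in best:
    print(sorted(r["lhs"]), "=>", sorted(r["rhs"]), r["confidence"])
```

Filtering before sorting matters here: the third rule has the highest confidence, but its consequent is a time attribute, so it is dropped before the ranking is made.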
(1) Association rules between attributes under time and those under accident
If all attributes were considered together, the extracted association rules would change with the frequency of multiple attributes. To disclose how the attributes under time affect those under accident, three secondary attributes under time (season, week, and hour) and the secondary attributes under accident (e.g. accident type and accident form) were selected as the targets of association rule mining. A total of 37 effective association rules were obtained through the abovementioned procedure. For lack of space, the ten association rules with the highest confidence (Table 3) were selected for further analysis.
The following laws can be extracted from these rules: (1) On workdays in summer, the traffic accidents occurring late at night are often fatal, most of which are collisions with moving vehicles; (2) The accidents on workdays are often fatal, most of which are collisions with moving vehicles; (3) The accidents on holidays are often injury accidents. In addition, more accidents tend to occur on workdays than on holidays, due to the high travel frequency on workdays.
(2) Association rules between attributes under external environment and those under accident
To disclose how the attributes under external environment affect those under accident, four secondary attributes under external environment (weather, terrain, illumination and visibility) and the secondary attributes under accident (e.g. accident type and accident form) were selected as the targets of association rule mining. A total of 99 effective association rules were obtained through the abovementioned procedure. The ten association rules with the highest confidence (Table 4) were selected for further analysis.
Since the minimum threshold support was set to 0.1, the low-frequency attributes were not included in the frequent itemset, and are thus absent from the effective association rules. However, the following laws can be derived from the high-confidence rules: (1) On a sunny day, fatal accidents are likely to occur on plains in the evening without street lighting; most of these accidents are collisions with moving vehicles; (2) During daytime, the collisions with moving vehicles are often injury accidents; (3) On a sunny day, most injury accidents occurring on plains with high visibility result from collisions with moving vehicles.

(3) Association rules between attributes under road condition and those under accident
To disclose how the attributes under road condition affect those under accident, four secondary attributes under road condition (pavement condition, road shape, intersection/section type and isolation facility) and the secondary attributes under accident (e.g. accident type and accident form) were selected as the targets of association rule mining. A total of 122 effective association rules were obtained through the abovementioned procedure. The ten association rules with the highest confidence (Table 5) were selected for further analysis.
The following laws can be derived from these strong association rules: (1) Fatal accidents are likely to occur at ramp entrances/exits, if the pavement is dry and the road is straight; (2) Collisions with moving vehicles often take place on a dry road section with corrugated beam barrier; (3) Fatal accidents are likely to occur on a dry road section, due to collisions with moving vehicles; (4) Injury accidents are likely to occur on a dry and straight road section with corrugated beam barrier, due to collisions with moving vehicles.

IV. CONCLUSION
The cause analysis of traffic accidents helps the traffic department to identify the key influencing factors of accidents, and disclose the correlations between these factors, laying the basis for the formulation of risk control measures. Based on association rule mining, this paper proposes a cause analysis strategy for traffic accidents on urban roads. Considering the multiple layers and dimensions of the data on such accidents, the authors improved the flow of the Apriori algorithm, and applied the improved algorithm to mine the association rules between risk factors of traffic accidents on urban roads. According to the layer and dimension of specific attributes, parameters such as support, confidence and lift were adjusted to find the qualified association rules between risk factors. The results were further screened to obtain a series of meaningful association rules. Although this research uses sample data of urban road traffic accidents as its example, the improved algorithm can be extended to other practical problems whose sample data have multiple layers and dimensions.