Using Data Mining to Estimate Patterns of Contagion-Risk Interactions in an Intercity Public Road Transport System

The COVID-19 pandemic has had very negative effects on public transport systems. These effects have compromised the role these systems should play as enablers of social equity and environmentally sustainable mobility, and have caused serious economic losses for public transport operators. For this reason, in the context of pandemics, meaningful epidemiological information gathered in the specific framework of these systems is of great interest. This article presents the findings of an investigation into the risk of transmission of a respiratory infectious disease in an intercity road transport system that carries millions of passengers annually. To achieve this objective, a data mining methodology was used to generate the data required to ascertain the level of risk. Using this methodology, the occupancy of vehicle seats by passengers was simulated using two different strategies. The first is an empirical approach to the behaviour of passengers when occupying a free seat, and the second attempts to minimise the risk of contagion. For each of these strategies, the interactions with risk of infection between passengers were estimated, the patterns of these interactions on the different routes of the transport system were obtained using the k-means clustering technique, and the impact of the strategies was analysed.

public perception of the risk of infection associated with the use of public transport, has on occasion caused a considerable loss of passengers on public transport systems, up to 90% in some cases [1], [2], [3], jeopardising their role as enablers of social equity [4]. Therefore, in the context of epidemics or pandemics, it is of interest to have information that allows for an objective assessment of infection risk on public transport, as this knowledge can be used to develop and assess effective measures to mitigate it.

The associate editor coordinating the review of this manuscript and approving it for publication was Razi Iqbal.

This article presents the findings of a research study designed with the aim of determining the infection risk among passengers on the different routes of a public road transport system (PRTS). This information can be used to identify the routes with the highest risk of infection and to assess the impact of different measures to minimise the potential risk. Such measures include devoting more resources to those routes with the highest risk, or other measures that do not involve more resources, such as using certain seating strategies for passengers. The latter type of measure provides alternatives to those that have been routinely implemented [...]

[...] small droplets, which evaporate before they are deposited, forming residual particulates from the dried material, called aerosols or droplet nuclei. Building on this contribution, it was considered useful to collect data reflecting human-to-human contacts, as these data provide patterns of disease spread and enable effective disease control measures to be implemented [8], [9], [10]. Specifically, when studying the dynamics of disease spread, the social contact hypothesis is used. This assumes that the number of potentially infectious contacts between people is proportional to the number of social contacts, with this proportionality factor being an indicator of the infectivity of the disease [11]. A mathematical model used in these studies [12] uses the next-generation matrix (N) to estimate how many people in different age groups will become infected as a result of contact with an infected person in a given group. This matrix is of such relevance that a considerable number of studies have been carried out to obtain it using different methodologies.

Modelling infectious interactions between people in large populations is a scientific challenge of interest. To do this, network theory is used, representing the interactions in a network called a contact network. As such, techniques that attempt to synthesise these networks have been developed. These techniques can be classified into two types: those that generate the interaction network using real or simulated egocentric data, and those that generate the network from the simulated behaviour of individuals. In works based on egocentric data, these data are provided by people (egos) whose identity is known and refer to interactions they have had with other people (alters) whose identity is unknown, although some data about them are provided, such as their approximate age. The result is a set of interaction networks with a star topology, in which egos are connected to their alters, which provides valuable information about the heterogeneity of the contact network and the patterns of interactions between different population groups. Studies of this type also provide the patterns that these interactions follow and the probability of interactions between different alters [13].

Ferguson et al. [14] and Longini et al. [15] describe how to estimate patterns of social contacts from census data, assuming that they reflect the distribution of groups in the population and household size. There is a comprehensive set of studies inferring these patterns from data obtained from surveys of the populations under study. The methodology used in these studies is often organised in three stages. The first stage consists of a survey of selected individuals, in which they are asked to provide information on their close contacts.
The second stage consists of obtaining a representation of the network of contacts between different population groups, using a contact matrix (C). Finally, the third stage consists of analysing the dynamics of the disease using the next-generation matrix (N), which is obtained from the C matrix and available epidemiological data.

VOLUME 10, 2022

Wallinga et al. [11] present a study on how to obtain transmission parameters by age group from a social contact survey on the conversational partners of the participants. The survey was conducted by face-to-face interviews in Utrecht, the Netherlands, in 1986, where 3084 people were invited, 2106 completed a questionnaire and 1813 met the crite- [...]

In the context of the COVID-19 pandemic, different studies have been conducted on contact patterns in pre-pandemic and pandemic periods in Luxembourg [28], China [29], the UK [30], the Netherlands [31] and the USA [32]. These studies show that, as a result of various measures to ensure social distancing, social contacts were reduced by between 40% and 85%, depending on the country. In [33] a study is carried out on the patterns [...]

[...] to detect the presence of two people in the same place at the same instant in time, to generate a static network of contacts, and to predict the number of contacts that occur at each location. The researchers found that the network was highly heterogeneous in terms of the number of contacts and had properties analogous to "small-world" networks. In order to better predict outbreaks of SARS (Severe Acute Respiratory Syndrome), Meyers et al. [36] obtained the contact network of an urban population using different mathematical models and through a stochastic simulation of the behaviour of the people in the population, where contacts occur randomly in homes, schools, workplaces, hospitals and other public places.
The researchers drew on population data from the city of Vancouver, British Columbia. Stochastic simulation of the behaviour of individuals belonging to large populations was also used in [37]. The researchers found that the dynamics of influenza epidemics modelled using the contact network generated from the simulation was consistent with epidemiological data from the 1957-1958 and 2009 influenza pandemics.

Technological advances in mobile communications and sensor networks have also been applied by researchers to epidemiological monitoring. In this context, and more specifically in the epidemiological monitoring of airborne diseases, close contact is defined as two persons spending a certain amount of time at a distance of less than a given threshold. The following is a review of literature on contact networks generated in different contexts of social relationships, using different types of sensors. The methodologies followed by all these studies have the same objective, which is to obtain data useful for modelling the dynamics of infectious disease, using a compartmental SIR (Susceptible-Infected-Recovered) model [38], or to evaluate the impact of epidemiological control measures. These data are: frequency of contacts, duration, location of contacts, contact network, and contact matrices between different clusters of participants. Because they do not coincide with the aims of the research presented in this article, we have not considered studies on the tracing of contacts for epidemiological control in health crisis situations.

Isella et al. [39] analysed contact data from a scientific conference and a museum exhibition using RFID technology. The number of contact records analysed was 10 000 for a scientific conference and 230 000 for an exhibition. Cattuto [...]

[...] presented by Goscé and Johansson [53].
The study used epidemiological parameters of influenza-like illnesses and mobility data from the London Underground obtained from travel records generated by its passengers using an automatic payment system based on a contactless card. Troko et al. [54] examined whether the use of public transport is a risk factor for acute respiratory infection. The authors used epidemiological data obtained in the 2008-2009 influenza season and related it to data on bus and tram usage using multiple regression techniques.

Recently, in the context of the COVID-19 pandemic, Luo et al. [55] described a contact-tracing study on an outbreak in Hunan Province, China, involving 10 passengers on two public transport buses. A case of community transmission among bus passengers was reported by Shen et al. [56]. The authors suggested that this outbreak was due to poor vehicle ventilation. A study on the risk of COVID-19 transmission among passengers on a high-speed train system in China was presented by Hu et al. [57]. In this study, the authors developed a model that quantifies the risk of transmission on the basis of travel time and distance between passengers. Severo et al. [58] analysed the role that the urban rail system in the city of Lisbon played in the transmission of COVID-19 in that city. The authors used confirmed SARS-CoV-2 data in this city for the period from 2 March to 5 July 2020 and, using geographical data, linked the cases to the train stations closest to the homes of the infected passengers. The authors concluded that there is no relationship between proximity to train stations and illness, suggesting that socioeconomic factors affect infection dynamics.

The objective of this study is to acquire information on the risk of infection on the routes operated in a PRTS. The knowledge gained can be used to identify the routes with the highest risk and to evaluate the impact of different measures to minimise this risk. To achieve this objective, data are required which, on many occasions, are not available and therefore have to be estimated by processing a large volume of data. For this reason, a data mining methodology was used. The formal framework used in the methodology, and then the methodology itself, which consists of two stages, as illustrated in Fig. 1, are set out below.

This methodology differs significantly from the methodologies employed in the studies cited in the previous section on related works. With regard to the studies that use surveys to infer the network of contacts, this methodology makes it possible to obtain a large number of samples without first having to select the elements that form the sample to be analysed. Potentially, all passengers who use the public transport system under consideration contribute their trips to the initial sample.

Compared to studies that infer the contact network by simulating the behaviour of the study population, this methodology uses data that reflect the real movements of people and does not simulate these movements. This avoids the high computational cost involved in such simulation techniques.

As for the studies that use sensors to determine the contacts between people, the methodology presented herein estimates these contacts without the need for any technological implementation, as it uses data taken from the transport operations. [...]

[...] is not to estimate the number of close contacts in the PRTS, since no personal passenger data are collected and therefore it is not known whether or not the passenger is infected, but to estimate the number of close interactions between passengers. In the formal framework used, an interaction is defined as the event in which two passengers physically remain in the same public transport vehicle for a period of time, at a distance of less than value $v_1$. When one or more interactions with a cumulative duration equal to or greater than $v_2$ occur between two passengers in a period of duration $v_3$, then a close interaction event occurs between them.
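The definition of a close interaction event can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the tuple layout and the use of minutes as the time unit are assumptions made for illustration, and every interaction passed in is assumed to already satisfy the distance condition given by $v_1$ and not to overlap in time with other interactions of the same pair.

```python
from collections import defaultdict


def close_interaction_pairs(interactions, v2, v3):
    """Return the passenger pairs that have a close interaction event.

    interactions: iterable of (passenger_a, passenger_b, start, duration),
    in minutes, each already at a distance below v1 (assumed input format).
    v2: minimum cumulative duration; v3: length of the observation period.
    """
    spans_by_pair = defaultdict(list)
    for a, b, start, duration in interactions:
        spans_by_pair[frozenset((a, b))].append((start, start + duration))

    close_pairs = set()
    for pair, spans in spans_by_pair.items():
        # Try a period of length v3 beginning at each interaction start.
        for window_start, _ in spans:
            window_end = window_start + v3
            total = sum(
                max(0, min(end, window_end) - max(start, window_start))
                for start, end in spans
            )
            if total >= v2:
                close_pairs.add(pair)
                break
    return close_pairs


# Hypothetical data: two 10-minute interactions between p1 and p2 on one day.
records = [("p1", "p2", 0, 10), ("p1", "p2", 100, 10), ("p1", "p3", 0, 5)]
pairs = close_interaction_pairs(records, v2=15, v3=1440)
```

With these hypothetical values, only the pair (p1, p2) accumulates at least 15 minutes within a 1440-minute period; shortening the period to 60 minutes leaves no close pairs.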

For the purposes of this research, the entities of interest for the PRTS are: the transport network, the routes defined in this network operated by public transport vehicles, and the vehicle journeys made by these vehicles along these routes. The transport network is represented as a directed graph $G = G(N, A)$, where $N = \{n_i\}$ is the set of nodes of the network, each node representing a point in the transport network where passengers can board or alight from the vehicles and subscript $i$ being the point identifier, and $A = \{a_i\}$ is the set of simple arcs linking two nodes, where subscript $i$ is the arc identifier. The next entity to be defined is the route, that is, the path followed by vehicles carrying passengers. Considering graph $G$, a route is defined as an ordered sequence of arcs $(a_i, \ldots, a_n)$, where $a_i, \ldots, a_n \in A$. The set of routes defined in the transport network is represented by $R = \{r_i\}$, where subscript $i$ is the route identifier. A segment of route $r_i$ is defined as an ordered sequence of arcs $(a_p, \ldots, a_q)$ along route $r_i$. The entity associated with the planning of operations performed in the transport network is the vehicle journey. The set of completed vehicle journeys is represented by $J = \{J_i\}$, where $J_i$ is the set of journeys completed on the route identified by subscript $i$. Alternatively, the set of vehicle journeys, irrespective of the route followed, that are completed in a time period $T$ is represented by the notation $J_T$. The set of vehicle journeys that consist of carrying passengers on route $i$ during time period $T$ is represented by $J_{i,T}$. If instead of time period $T$ we have moment of time $t$, then $J_{i,t}$ represents the set of vehicle journeys on route $i$ for which the start time is $t$.
Finally, if $v$ identifies a vehicle, then $J_{i,t,v}$ represents a vehicle journey on route $i$ that begins at time $t$ and is performed by vehicle $v$. The trip taken by a passenger on vehicle journey $J_{i,t,v}$ is defined as the route segment $(a_p, \ldots, a_q)$ that the vehicle has travelled while the passenger is on the vehicle. The duration of the trip is the time elapsed between the passenger boarding the vehicle at the origin node of arc $a_p$ and alighting at the destination node of arc $a_q$.
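As an illustration of these entities, the following Python sketch models arcs and a vehicle journey and derives a passenger trip as a route segment. The class names, field names and per-arc travel times in minutes are hypothetical choices made for this example, not part of the formal framework.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Arc:
    arc_id: str
    origin: str        # node where the arc starts
    destination: str   # node where the arc ends


@dataclass(frozen=True)
class VehicleJourney:
    route_id: str                   # i
    start_time: int                 # t, minutes from midnight (assumption)
    vehicle_id: str                 # v
    arcs: Tuple[Arc, ...]           # ordered arcs of the route
    arc_minutes: Tuple[float, ...]  # travel time of the vehicle on each arc


def trip_segment(journey, boarding_node, alighting_node):
    """Return the arcs (a_p, ..., a_q) travelled by a passenger and the trip
    duration. Raises StopIteration if a node is not found on the route."""
    p = next(i for i, arc in enumerate(journey.arcs)
             if arc.origin == boarding_node)
    q = next(i for i in range(p, len(journey.arcs))
             if journey.arcs[i].destination == alighting_node)
    return journey.arcs[p:q + 1], sum(journey.arc_minutes[p:q + 1])


# Hypothetical route A -> B -> C -> D with made-up travel times.
arcs = (Arc("a1", "A", "B"), Arc("a2", "B", "C"), Arc("a3", "C", "D"))
journey = VehicleJourney("r1", 480, "bus7", arcs, (5.0, 7.0, 4.0))
segment, minutes = trip_segment(journey, "B", "D")
```

Here a passenger boarding at B and alighting at D travels the segment formed by the second and third arcs.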

At this point, the concept of an interaction event between two passengers, $p_1$ and $p_2$, on the PRTS used in the methodology can be formalised. Specifically, an interaction event is said to occur if the following three conditions are met: [...]

Table 1 summarises the entities used in this formal framework.

To study the interaction events between passengers in the [...]

The objective of this stage is to generate the data records representing the interaction events that may occur on each of the routes of the PRTS during the selected study period $T$. The data structures and procedures are shown in Fig. 1. The main source of the data is the Transport Data Base (TDB), which contains all data relating to the definition of the transport network, the planning of operations and the provision of services. The Transport System Graph (TSG) is a graph database that contains all the entities mentioned in the previous section, completed, consolidated and coherent for the study period (fundamental aspects when handling a large volume of data), in order to facilitate the process of estimating interactions that are meaningful and persistent.

This stage comprises four processes. The first two processes (final node estimation, and selection, filtering and loading) generate and complete the set of entities and relationships to be represented in the TSG. The first, final node estimation, estimates the destination node of the trips made by the users when necessary and will be explained in detail in Section III-B1. The second, selection, filtering and loading, encompasses all the tasks related to the generation and loading of the TSG from, on the one hand, the records contained in the TDB relating to the transport network, vehicles, users, cards, services and trips made and, on the other, the destination stops as estimated by the previous procedure, guaranteeing the reliability, accuracy, completeness and consistency of all the data. The third, seat identification, obtains, for each seat of each type of bodywork in the fleet of vehicles, the set of seats that are at a distance less than or equal to a parameter called the safety distance, based on a two-dimensional representation of the vehicle bodywork (location of seats). This safety distance may correspond both to the epidemiological parameter $v_1$ and to the distance threshold of the different seat allocation policies. Once the three processes described above have been executed, the data necessary for the estimation of the interaction events that take place in the vehicle journeys are generated.
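The seat identification process can be sketched in Python as follows. This is an illustrative sketch only: the seat labels and coordinates (in metres, taken as seat centres on a two-dimensional plan of the bodywork) are invented, and the function name is hypothetical.

```python
from math import dist


def affected_seats(seat_positions, safety_distance):
    """For each seat, list the other seats whose centre lies at a distance
    less than or equal to the safety distance."""
    return {
        seat: sorted(
            other
            for other, other_pos in seat_positions.items()
            if other != seat and dist(pos, other_pos) <= safety_distance
        )
        for seat, pos in seat_positions.items()
    }


# Hypothetical seat layout: (x, y) coordinates of seat centres in metres.
layout = {"1A": (0.0, 0.0), "1B": (0.5, 0.0), "2A": (0.0, 0.8), "5A": (0.0, 3.2)}

adjacent_only = affected_seats(layout, safety_distance=0.5)
two_metres = affected_seats(layout, safety_distance=2.0)
```

With the smaller threshold only the neighbouring seats 1A and 1B affect each other; with a 2-metre threshold the first two rows affect each other while seat 5A remains isolated.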
This estimate is obtained by means of the fourth process in this stage, interaction generation, which, based on parameter $v_1$ and the seat allocation simulation, which will be explained in Section III-B2, generates a record of the total estimated interactions for each of the completed vehicle journeys, composed of the fields shown in Table 2, where field $NI_1$ is the total number of interactions lasting 1 minute, $NI_2$ the total number lasting 2 minutes, and $NI_m$ the total number of estimated interactions of the longest duration in the vehicle journey. As this is an estimation process that under certain conditions performs a random allocation of vacant seats, repeated execution of this process will generate different sets of records, which are of interest in the modelling phase. [...]

[...] Liew [61], who proposed a method based on decision trees. Considering previous works that address how to estimate the destination stop [62], [63], a procedure was developed to infer the final destination of the trips made by passenger $p$, belonging to one of the two categories above, when this information has not been recorded.

With the technologies commonly used by intercity road transport services, it is possible to obtain information about the trip made by passengers (at which node they started, which vehicle they used and at which moment in time they boarded the vehicle), but the end point and the duration of their trip are not always recorded. This problem can be overcome in the case of frequent travellers because they generally use specific personal payment systems, such as contactless cards, which automatically record payment transactions and identify the user. There are several types of frequent users, among which the most common are:

• Passengers that make multi-stage trips, such that the end node of one stage (transfer node) is close to the start node of the next stage.

• Passengers who make single-stage trips to their place of work, study, public service or leisure and who also return using the PRTS.

These types of trips exhibit a common pattern: on two consecutive trips made by the same passenger, the destination node of the first is located within a short distance of the origin node of the second. This proximity is determined by a distance threshold that depends on the type of transport network: smaller in the case of urban transport and larger in the case of intercity transport. The procedure is based on the known data for two consecutive trips made by $p$. For each trip made by $p$ on vehicle journey $J_{i,t,v}$, node $n$ at which $p$ started the trip and time $t'$ of the beginning of the trip are known, where node $n$ is an origin node of one of the arcs forming the sequence of arcs $(a_p, \ldots, a_q)$ that form the segment of route $i$ travelled on $J_{i,t,v}$. Moreover, $t \le t'$, meaning that the start of the user's trip $t'$ is equal to or later than the start of vehicle journey $t$. The purpose of the procedure is to ascertain the final stop of the trip made by $p$ on $J_{i,t,v}$ and, therefore, the sequence of arcs that form the segment of route $i$ travelled by $p$. To estimate final stop $q$ of the trip on journey $J_{i_1,t_1,v_1}$, the procedure uses the known data for the next trip made by $p$. If $J_{i_2,t_2,v_2}$ is the vehicle journey of the next trip made by $p$, then node $n_2$ and the time at which he or she started that trip are known. If nodes $n_1$ and $n_2$, the starting nodes of the two trips, are not the same and are not within the distance threshold that determines that they are similar (on both sides of a two-way road, at an intersection, or close consecutive nodes on the same route), then final stop $q$ of the trip made by $p$ on $J_{i_1,t_1,v_1}$ is taken to be the stop on route $i_1$ closest to stop $n_2$ at which $p$ started the trip on $J_{i_2,t_2,v_2}$, provided that this final node $q$ is at a distance from $n_2$ not greater than the proximity threshold indicated above, that is, it is not too far away.
Once the final stop has been deduced, the duration of the trip made by $p$ is the sum of the times taken by $v$ to traverse the arcs of the sequence $(a_{n_1}, \ldots, a_{q_1})$.

Fig. 3 illustrates this procedure. It represents, by means of a graph, a generalisation of the procedure in the case of [...]

Table 3 presents the numbers of trips for which the destina- [...]

- Vehicle journey $J_{i_1,t_1,v_1}$ taken by passenger $p$
- Node $n_1$ at which $p$ started journey $J_{i_1,t_1,v_1}$
- Time $t$ at which $p$ started journey $J_{i_1,t_1,v_1}$
- Next vehicle journey $J_{i_2,t_2,v_2}$ made by the passenger
- Node $n_2$ at which $p$ started journey $J_{i_2,t_2,v_2}$
- Maximum distance DPmax at which two nodes are considered to be close
- Maximum distance DSmax at which two nodes are considered to be similar
Goal:
- Node $q$, estimated destination of $p$ on vehicle journey $J_{i_1,t_1,v_1}$

if Euclidean distance between $n_1$ and $n_2$ > DSmax then
    Obtain the sequence of arcs of route $i_1$ starting at node $n_1$.
    Output data for this step: sequence of arcs $(a_{n_1}, \ldots, a_{q_1})$ that form the largest possible segment of route $i_1$ travelled by $p$
    for each route arc of the sequence $(a_{n_1}, \ldots, a_{q_1})$ do
        Obtain the Euclidean distance between the destination node of the route arc and node $n_2$.
        Output data for this step: sequence of distances $d_{a_{n_1}}, \ldots, d_{a_{q_1}}$
    end for
    Obtain the minimum value dmin of the sequence $d_{a_{n_1}}, \ldots, d_{a_{q_1}}$ and arc $a_{j_1}$ in which this value has been obtained.
    Output data for this step: destination node $q$ of arc $a_{j_1}$
    if (dmin < DPmax) and (Euclidean distance between $n_1$ and $q$ > DSmax) then
        The estimated destination stop of $p$ on journey $J_{i_1,t_1,v_1}$ is the final stop of arc $a_{j_1}$
    else
        The destination stop cannot be determined. There is no stop near the starting stop of the next journey, or it is similar to the starting stop of the previous journey
    end if
else
    The destination stop cannot be determined. The starting stops are the same or similar
end if

[...] to be similar, was 500 metres; the value of the DPmax parameter, which indicates when two stops are close to each other, was 1 km. Considering that in the case of intercity transport, stops are spaced along the length of a route, these distance thresholds are reasonable and conservative.
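The destination estimation procedure can be sketched in Python under some simplifying assumptions: the route is given as an ordered list of stops rather than arcs, stop coordinates are planar and expressed in metres, and the function and variable names are hypothetical. It is meant as a reading aid for the pseudocode, not as the authors' implementation.

```python
from math import dist


def estimate_destination(route_stops, coords, n1, n2, ds_max, dp_max):
    """Estimate the alighting stop of a trip that started at n1 on a route,
    given the boarding stop n2 of the passenger's next trip.

    Returns the estimated stop, or None when it cannot be determined.
    """
    # Same or similar starting stops: no inference is possible.
    if dist(coords[n1], coords[n2]) <= ds_max:
        return None

    # Stops reachable on the route after boarding at n1.
    candidates = route_stops[route_stops.index(n1) + 1:]
    if not candidates:
        return None

    # Stop of the remaining route segment closest to the next boarding stop.
    q = min(candidates, key=lambda stop: dist(coords[stop], coords[n2]))
    d_min = dist(coords[q], coords[n2])

    if d_min < dp_max and dist(coords[n1], coords[q]) > ds_max:
        return q
    return None


# Hypothetical stops with planar coordinates in metres.
coords = {"A": (0, 0), "B": (2000, 0), "C": (5000, 0), "D": (9000, 0),
          "X": (5200, 300), "A2": (100, 0), "FAR": (20000, 0)}
route = ["A", "B", "C", "D"]
estimated = estimate_destination(route, coords, "A", "X", ds_max=500, dp_max=1000)
```

In this made-up example the passenger boards at A and next boards at X, which lies about 360 m from C, so C is returned as the estimated destination; a next boarding at A2 (similar to A) or at FAR (no stop within 1 km) yields no estimate.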

As can be seen in Table 3, for 71.6% of the trips for which the destination stop was estimated, the estimated destination stop was less than 1 km from the actual destination stop. Considering the results of this validation test, this parameterisation of the DSmax and DPmax values makes it possible to obtain an estimate of the destination stop for a [...]

TABLE 3. Numbers of trips for which the destination stop was estimated. The first column shows the distance between the estimated stop and the actual stop (D). The second column shows the number of trips (NT) for which the destination stop was estimated as a function of the value of D.

[...]

In this research, two alternative seat allocation policies were considered. The first is the Empirical Policy (EP). This policy is based on observed behaviour whereby a passenger prefers not to sit next to another passenger, without any other consideration. The second policy aims to reduce the risk of infection and is called the Minimise Risk Policy (MRP). It consists of assigning the user to the free seat that is more than 2 m away from the largest number of passengers, in order to avoid as many interactions as possible with passengers on board the vehicle when boarding. In both policies, if the occupancy of the vehicle does not permit strict application of the allocation criterion, then a seat is randomly allocated from the vacant seats that are in the best circumstances according to the allocation policy used.

The allocation procedure is based on three parameters, the values of which vary according to the allocation policy. The first parameter is the safety distance, which is determined by the allocation strategy. The second parameter is the affected seats list, which is a list associated with each seat of each type of bodywork in the vehicle fleet that contains the seats that are affected by its occupancy, and which is directly dependent on the value of the safety distance parameter. The third parameter is the risk potential, which is a value assigned to each of the vacant seats in the vehicle during the course of a vehicle journey and which determines its potential risk: it increases as the seats in whose affected seats list it appears are occupied and decreases when any of these seats are vacated.

The procedure simulates seat occupancy by passengers on each vehicle journey $J_{i,t,v}$, taking as input parameters the affected seats list pertaining to the vehicle bodywork type, the safety distance of the policy to be applied, and the origin and destination stops of each of the trips made by the passengers on that vehicle journey. Following the route order established for that vehicle journey, each stop is treated by the procedure in the following way: first, it vacates the seats of the passengers arriving at their destination, assigns each vacated seat the risk potential corresponding to the occupancy of its affected seats, and lowers the risk potential of the vacant seats in its affected seats list; then, it randomly allocates each passenger starting their trip one of the vacant seats with the lowest risk potential, increasing the risk potential of the vacant affected seats. The seating procedure is described in Algorithm 2.
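A compact Python sketch of this seating simulation is given below. It makes several assumptions that are not stated in the text: the risk potential of a seat is taken to be the number of occupied seats in its affected seats list, the affected seats lists are symmetric, standing passengers are ignored, and all names are hypothetical.

```python
import random


def simulate_seating(stops, trips, affected, rng):
    """Simulate seat occupancy on one vehicle journey.

    stops: stops of the route in travel order.
    trips: list of (passenger, origin_stop, destination_stop).
    affected: dict seat -> list of seats affected by its occupancy.
    rng: random.Random instance used to break ties.
    Returns a dict passenger -> seat occupied during the trip.
    """
    risk = {seat: 0 for seat in affected}  # occupied affected seats per seat
    on_board = {}                          # passenger -> seat
    assignment = {}

    for stop in stops:
        # Alighting: free the seat and lower the risk of its affected seats.
        for passenger, _, destination in trips:
            if destination == stop and passenger in on_board:
                seat = on_board.pop(passenger)
                for other in affected[seat]:
                    risk[other] -= 1
        # Boarding: pick at random among the free seats with the lowest risk.
        for passenger, origin, _ in trips:
            if origin == stop:
                taken = set(on_board.values())
                free = [seat for seat in affected if seat not in taken]
                if not free:
                    continue  # no seat available; standing is not modelled
                lowest = min(risk[seat] for seat in free)
                seat = rng.choice([s for s in free if risk[s] == lowest])
                on_board[passenger] = seat
                assignment[passenger] = seat
                for other in affected[seat]:
                    risk[other] += 1
    return assignment


# Hypothetical row of four seats; each seat affects its direct neighbours.
row = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
seats = simulate_seating(["A", "B", "C"],
                         [("p1", "A", "C"), ("p2", "A", "C")],
                         row, random.Random(0))
```

Whatever seat the first passenger receives, the second is placed on a seat that is not adjacent to it, since such a seat always has the lowest risk potential in this small layout.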

In general, in a data mining project, the modelling phase is designed to generate new knowledge, applying techniques of varying nature, both statistical and machine learning, depending on the type of problem posed. As has already been noted, the objective of the methodology is to obtain information by detecting the patterns followed by interaction events between passengers on the different routes of the PRTS over a given period of time. To obtain these patterns, a clustering process was implemented (see Fig. 1), [...]

Algorithm 2 Assignment of Seats in a Vehicle During a Vehicle Journey
Input data:
- Safety distance. In the case of EP, this is the minimum distance between the centres of two adjacent seats, and in the case of MRP, it is 2 metres.
- Affected seats list. This is a list for each seat in each bodywork type in the fleet, showing the seats that are affected by occupancy of the seat. This list depends directly on the value of the safety distance parameter as determined by the allocation policy used.
Goal:
- Potential risk of a seat. This is a value that is assigned to each of the free seats in the vehicle during the course of a vehicle journey. The value increases as the seats that appear in the affected seats list are occupied and decreases when any of these seats are vacated.

When a vehicle journey, $J_{i,t,v}$, begins, the initial risk potential value is assigned to all the seats in the vehicle. This initial value is the minimum, as it is assumed that there are no passengers in the vehicle.
At each stop the vehicle makes during the vehicle journey:
for each user that alights from the vehicle do
    Their seat ap is vacated and the minimum risk potential value is assigned.
    for each seat in its affected seats list do
        if the seat is occupied then
            The risk potential of the newly vacated seat ap increases.
        end if
    end for
end for
for each user boarding the vehicle do
    They are randomly assigned one of the seats with the lowest risk potential on the vehicle.
    for each seat af in its affected seats list do
        if the seat af is free then
            The risk potential of seat af increases.
[...]

[...] to the discretisation of the duration of the interactions: the data records of the estimated interactions have a temporal granularity of 1 minute, but the analysis can be carried out with a coarser granularity (5 minutes, 10 minutes, and so on), depending on the type of routes or the ultimate objective of the study.
Therefore, the interaction events on vehicle journey $J_{i,t}$, that is, each field of record $E_{i,t}$, are accumulated in intervals of $k$ minutes, giving rise to an array of $n$ integer values, $E_{i,t}[n]$. A second relevant parameter is that which determines the number of generations of estimated interactions to be considered at this stage. If there is more than one, the final array $\hat{E}_{i,t}[n]$ will be calculated as the arithmetic mean of the records created for each vehicle journey, that is, if $G$ is the number of generations to be processed and $\hat{E}^{g}_{i,t}[n]$ corresponds to the estimated events in generation $g$, then the final interaction record will be:

$$\hat{E}_{i,t}[n] = \frac{1}{G} \sum_{g=1}^{G} \hat{E}^{g}_{i,t}[n] \qquad (1)$$

Finally, if in period $T$ there have been $N$ vehicle journeys on route $i$, at moments of time $t_1, t_2, \ldots, t_N$, then the overall representation of the interaction events of that route in that period, $E_{i,T}$, is obtained from the expression:

$$E_{i,T}[n] = \frac{1}{N} \sum_{j=1}^{N} \hat{E}_{i,t_j}[n] \qquad (2)$$

That is, it is obtained by dividing the estimated number of interactions in all vehicle journeys by the number of completed journeys.
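The accumulation into $k$-minute intervals and the two averaging steps described above can be sketched in Python as follows. The function names and the sample counts are invented for illustration; the per-minute list is assumed to hold, at index $d-1$, the number of estimated interactions lasting $d$ minutes.

```python
def bin_durations(counts_per_minute, k):
    """Accumulate per-minute interaction counts into intervals of k minutes."""
    return [sum(counts_per_minute[i:i + k])
            for i in range(0, len(counts_per_minute), k)]


def mean_arrays(arrays):
    """Element-wise arithmetic mean of equal-length arrays."""
    count = len(arrays)
    return [sum(values) / count for values in zip(*arrays)]


# Two generations (repeated simulations) of the same vehicle journey.
generation_1 = bin_durations([4, 2, 0, 1, 3, 0], k=5)
generation_2 = bin_durations([3, 1, 2, 1, 1, 2], k=5)
journey_record = mean_arrays([generation_1, generation_2])

# Route-level record: mean over the vehicle journeys completed in period T.
other_journey_record = [3.0, 1.0]
route_record = mean_arrays([journey_record, other_journey_record])
```

The first call to `mean_arrays` corresponds to averaging over generations for a single vehicle journey, and the second to averaging over the journeys of a route in the study period.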

The objective of this stage of the methodology is to obtain information to assess the risk of infection on the different routes of the transport network, based on the interaction event records $E_{i,T}$ described above. From the definition of the data record $E_{i,T}[n]$ expressed in (2), epidemiological information of interest can be extracted for each route for period $T$. Specifically, $ME_{i,T}$, which is the estimated number of interaction events on route $i$, will be determined by (3). [...] The technique chosen to obtain patterns for the interaction [...]

The proposed methodology was applied to the intercity PRTS on the island of Gran Canaria (Canary Islands, Spain). This [...]

Three tools were used to implement the methodology: a relational database, holding the relevant data required for this study from the operator's transport database; Neo4j, to implement the graph database used by the methodology; and the RStudio development environment [64], for programming the procedures used in the data preparation and modelling stages.

In the study period, 440 different routes were identified on the transport network, with a total of 70 734 vehicle journeys made. The number of passenger trips made in this period was 2 260 744. Of these trips, 1 101 338 recorded the origin stop and the destination stop, and 1 159 406 did not, so the process of estimating the destination stop described in Section III-B1 was applied to this set of trips. As a result of this process, an estimation of the destination stop could be completed on 860 909 trips; this was not possible on 298 497 trips. Finally, the process of selection and filtering of records resulted in a total of 1 797 107 trips being loaded into the TSG, and these were used to estimate passenger interactions according to the two seat assignment policies described in Section III-B2. Table 4 illustrates these data by associating them with the entities defined in the formalisation described in Section III-A.
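The trip counts quoted above can be cross-checked with simple arithmetic. The sketch below uses only the figures stated in the text; the variable names are illustrative, and the reading that filtering was applied to the trips with a recorded or estimated destination is an assumption made here, not a statement from the study.

```python
# Figures quoted in the text for the study period.
total_trips = 2_260_744
with_destination = 1_101_338         # origin and destination both recorded
without_destination = 1_159_406      # destination had to be estimated
destination_estimated = 860_909
destination_not_estimated = 298_497
loaded_into_tsg = 1_797_107

# The two partitions quoted in the text should add up exactly.
assert with_destination + without_destination == total_trips
assert destination_estimated + destination_not_estimated == without_destination

# Assumption: selection and filtering acted on trips with a known destination.
known_destination = with_destination + destination_estimated
removed_by_filtering = known_destination - loaded_into_tsg
share_loaded = loaded_into_tsg / total_trips

print(known_destination, removed_by_filtering, round(100 * share_loaded, 1))
```

Under that assumption, 1 962 247 trips had a recorded or estimated destination, and 165 140 of them were discarded by the selection and filtering process.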

Once a complete set of transport activity data was obtained and represented in the TSG, the remaining processes of the methodology were implemented by adopting a series of decisions based on aspects related to the transport network, epidemiological aspects and the modelling technique used.

In relation to the transport network, the routes were first classified according to the time taken to complete them, generating four categories of routes, $R_1$, $R_2$, $R_3$ and $R_4$, with the following characteristics: $R_1$ contains routes which take less than 25 minutes to complete; $R_2$, routes which take 25 minutes or more and less than 35 minutes; $R_3$, routes which take between 35 and 47 minutes; and $R_4$, routes which take 47 minutes or more to complete. The maximum duration of a route in the transport network is 137 minutes.
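The classification by duration can be written as a small function. This is a sketch with a hypothetical function name; it assumes that a route of exactly 47 minutes belongs to the last category, since the text assigns routes of 47 minutes or more to that category.

```python
def route_category(duration_min: float) -> str:
    """Return the duration category (R1 to R4) of a route, given minutes."""
    if duration_min < 0:
        raise ValueError("duration must be non-negative")
    if duration_min < 25:
        return "R1"
    if duration_min < 35:
        return "R2"
    if duration_min < 47:  # assumption: the upper bound of R3 is exclusive
        return "R3"
    return "R4"


# The longest route in the network (137 minutes) falls in the last category.
longest_route_category = route_category(137)
```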

The duration and number of routes in each group are shown in Table 5. Thus, the number of interaction event duration [...]

The reason for this decision is to analyse the patterns of interaction events according to the geographical areas through which the route services pass. The total number of routes in each area is shown in Table 6.

[...] unsupervised algorithm that appears to give partitions which are reasonably efficient in the sense of within-class variance, is easily programmed and is computationally economical [65]. The process subdivides the $n$ input data records into $k$ partitions, associating each record with the partition whose mean is nearest to it; the mean of each partition is its most significant element and its centroid, the profile that characterises it.
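The assignment-and-update process described above can be sketched as follows. This is a minimal pure-Python illustration of the standard k-means iteration, not the implementation used in the study, which was programmed in the RStudio environment; the function names, the seeded random initialisation and the `max_iter` cap are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Profile = Sequence[float]


def squared_distance(a: Profile, b: Profile) -> float:
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(members: Sequence[Profile]) -> List[float]:
    """Position-wise mean of a non-empty group of profiles (its centroid)."""
    return [sum(column) / len(members) for column in zip(*members)]


def k_means(profiles: Sequence[Profile], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Partition profiles into k clusters; return labels and centroids."""
    if not 1 <= k <= len(profiles):
        raise ValueError("k must be between 1 and the number of profiles")
    rng = random.Random(seed)
    # Assumption: initial centroids are k input records chosen at random.
    centroids = [list(p) for p in rng.sample(list(profiles), k)]
    labels = [-1] * len(profiles)
    for _ in range(max_iter):
        # Assignment step: each record goes to the nearest centroid.
        new_labels = [min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_profile(members)
    return labels, centroids


# Hypothetical route profiles forming two well-separated groups.
labels, centroids = k_means([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

The result of k-means generally depends on the initial centroids, so in practice several initialisations are usually compared; the fixed seed here only makes the example reproducible.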

To evaluate the quality of the clusters that were obtained, the [...]

[...] In these plots, the centroids of clusters $C_1$, $C_2$ and $C_3$ are represented by green, blue and red curves respectively.
The same criterion of presentation has been used in all of them: each column represents the three clusters obtained for each policy, ordered by the number of routes they contain. In each cluster, the curve representing the centroid obtained by applying the k-means algorithm is drawn. In the k-means algorithm, the centroid of a cluster represents its most significant value and corresponds to the mean value of the elements that form the cluster. The cluster with the green centroid is the most numerous, the cluster with the blue centroid is the second most numerous, and the cluster with the red centroid is the least numerous. In all the graphs, the horizontal axis represents the discretised duration of the average number of interactions per vehicle journey. The red vertical line marks the boundary at a duration of 15 minutes; events lasting 15 minutes or more are considered close interactions. In addition, the legend of each of the graphs includes four values that are considered significant for analysis purposes: the total number of routes belonging to the cluster (size); the value of its silhouette (sil), which quantifies the coherence of the cluster; the maximum value of average interactions of the profile obtained (max); and finally, the sum of its average interactions with a duration greater than or equal to 15 minutes, which may be considered a metric for quantifying the total number of close interactions (CI) that may occur in each cluster.
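Two of the legend values, max and CI, follow directly from a centroid profile. The sketch below assumes that each interval of the profile is identified by its lower bound in minutes and that an interval counts towards CI when that lower bound is 15 minutes or more; the function name and the sample profile are hypothetical.

```python
from typing import Dict, Sequence


def centroid_summary(bin_starts_min: Sequence[int],
                     centroid: Sequence[float],
                     threshold_min: int = 15) -> Dict[str, float]:
    """Return the max and CI legend values of a centroid profile."""
    if len(bin_starts_min) != len(centroid) or not centroid:
        raise ValueError("profile and interval bounds must match and be non-empty")
    # CI: sum of average interactions in intervals at or above the threshold.
    close = sum(value for start, value in zip(bin_starts_min, centroid)
                if start >= threshold_min)
    return {"max": max(centroid), "CI": close}


# Hypothetical centroid with 5-minute intervals starting at 0, 5, 10, 15, 20.
summary = centroid_summary([0, 5, 10, 15, 20], [8.0, 4.0, 2.0, 1.0, 0.5])
```

For this sample profile, only the intervals starting at 15 and 20 minutes contribute to CI.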

The plots in Fig. 5(a) show the results of the data clustering procedure for the $R_1$ set of routes (routes which take less than 25 minutes to complete), when interactions were estimated [...]

FIGURE 6. Plots of clusters and centroids obtained by applying (a) policy EP and (b) policy MRP to group of routes $R_2$. In these plots, the centroids of clusters $C_1$, $C_2$ and $C_3$ are represented by green, blue and red curves respectively.

[...] The second is that the maximum centroid values decrease by 22% in $C_1$, the largest cluster, and by about 12% in $C_2$ and $C_3$ respectively. And the third, closely related to the preceding observation, is that the values characterising the centroids also decrease in $C_1$, $C_2$ and $C_3$, by 20%, 24% and slightly more than 16% respectively. Again, cluster $C_1$ has the highest coherence and $C_2$ contains the most disparate route profiles.

Plots (a) and (b) in Fig. 6 show the results of clustering the $R_2$ category data (routes which take 25 minutes or more and less than 35 minutes) using the two defined policies. In this case, there is hardly any reduction in the total number of routes affected by interactions, but there is a significant reduction in the estimated close interactions per vehicle journey in the results in (b) compared to those in (a), which is around 47% in the largest cluster $C_1$, 33% in cluster $C_2$ and 11% in cluster $C_3$.

The results for set $R_3$ (routes which take between 35 and 47 minutes), with 106 routes, are shown in Fig. 7. In this case, between 3 and 5 routes have no estimated interactions, and in cluster $C_1$ a 31% reduction in interactions is observed when the MRP seat assignment policy is applied. In clusters $C_2$ and $C_3$ there is a regrouping of routes, all of them with a rather low coherence.

Finally, the results for the 118 routes in the last set $R_4$ (routes which take 47 minutes or more to complete), which contains the routes with the longest journey times, are presented in Fig. 8. [...]

Tables 7 and 8 show these results for the different geographical areas into which the transport network was subdivided, also distinguishing between the two policies applied.

As a first approximation, Table 7 shows the total number of [...] interactions are not estimated when the more conservative seat allocation policy is applied, with the exception of the $R_1$ route category in the northern part of the transport network, where the number of routes with interactions decreases by just over 17%, from 40 to 33.

Table 8, by contrast, shows the distribution of the routes in each of the geographical areas and each category in the clusters obtained. Although no substantial decreases are observed when applying the different policies, it does reflect data concerning the type of route in each area of the transport network, such as, for example, the fact that almost half of the routes in the south zone have a profile with a high number of close interactions.

The estimated interactions, as presented in this paper, provide new knowledge in two ways: on the one hand, about the interactions that may be occurring in the transport network, and on the other hand, the extent to which these are affected by applying different seating policies. This provides a way of measuring the effect of implementing rules or procedures to determine passenger locations in order to reduce contact between people. It should be noted that the results refer to estimated interactions over the entire study period, without distinguishing between different types of day (e.g. working or non-working) or between different time bands, which is a higher level of detail and is covered by the proposed methodology.

The EP policy, where a passenger prefers to sit in a seat where the surrounding seats are unoccupied, determines the minimum threshold of interactions in systems where no seat allocation is applied, as it does not take into account people travelling together or the preferences of certain age groups. For this reason, the results obtained by applying this policy can be considered a measure of the interactions that, at the very least, are occurring in the vehicle journeys, both at network level and at the level of individual routes. From the results obtained with the records of the three simulations carried out with this policy, Table 7 shows that no interaction is estimated in 26 of the 440 routes of the transport network, which represents 6%, and that area C has the highest proportion of routes with no interactions, more than 25%. In general, these are routes with a low number of passengers and vehicle journeys, and almost all of them are short routes, with journeys of less than 25 minutes.

In Table 8, of the 414 routes with estimated interactions, firstly, area N stands out, with a generally low interaction profile, since more than 90% of its routes are grouped in $C_1$. [...]

[...] followed would not, however, be applicable to the case of urban public road transport, where standing is permitted and is common. In the context of a pandemic, it is common to limit vehicle occupancy in this type of transport using criteria that are not based on objective parameters. The proposed methodology could therefore be applied to obtain information that would facilitate the planning of transport services with the aim of reducing the risk of infection based on a calculation of capacity using objective parameters, as opposed to simply reducing capacity by an arbitrary amount.

Another limitation is that it is assumed that there is a risk of infection in vehicles when two passengers are on the same vehicle at the same time. Therefore, the presence of two passengers at the same stop on the transport network has not been considered. In the case of intercity public road transport, this limitation is of relative importance for two reasons. The first reason is that this type of transport is planned around timetables, which means that passengers arrive at a stop a few minutes before catching the vehicle in which they will be travelling, and it is not common for them to spend long periods of time at the stops. The second is that most of the stops on this type of transport system are located outdoors, thus reducing the risk of infection.

The final limitation is that since the passenger's seat in the vehicle is not known, the location of the passenger was simulated based on a seating allocation policy. The importance of this limitation is also relative, since the objective of the study was to learn on which routes and at what times the risk of infection is greatest. In this study, the policy applied was an EP policy, the aim of which is to approximate the passenger's seating behaviour. In reality, close interactions are likely to be greater, as the possibility that passengers may be travelling together is not taken into account. However, for the purposes intended, this limitation does not invalidate the information obtained. Moreover, by simulating the location of passengers in vehicles, it is possible to assess the impact of different seating strategies designed to minimise the risk of infection and maximise the available vehicle capacity.
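To make the simulation of passenger locations more concrete, the following is a minimal sketch of a seat assignment driven by risk potential, in the spirit of the procedure described in the methodology. Several details are assumptions made for the example rather than facts from the study: the class and method names, the unit increments and decrements of the risk potential, a minimum value of 0, and a symmetric affected-seats relation.

```python
import random
from typing import Dict, List, Optional, Set


class SeatMap:
    """Tracks occupancy and risk potential for the seats of one vehicle."""

    def __init__(self, affected: Dict[int, List[int]],
                 seed: Optional[int] = None) -> None:
        self.affected = affected             # seat -> seats it affects
        self.occupied: Set[int] = set()
        # Every seat starts at the minimum risk potential (assumed to be 0).
        self.risk: Dict[int, int] = {seat: 0 for seat in affected}
        self.rng = random.Random(seed)

    def board(self) -> int:
        """Seat a boarding passenger on a random lowest-risk free seat."""
        free = [s for s in self.affected if s not in self.occupied]
        if not free:
            raise RuntimeError("no free seat available")
        lowest = min(self.risk[s] for s in free)
        seat = self.rng.choice([s for s in free if self.risk[s] == lowest])
        self.occupied.add(seat)
        for neighbour in self.affected[seat]:
            if neighbour not in self.occupied:
                self.risk[neighbour] += 1    # free affected seats get riskier
        return seat

    def alight(self, seat: int) -> None:
        """Vacate a seat and update the risk potentials around it."""
        self.occupied.remove(seat)
        # The vacated seat restarts at the minimum and gains one unit per
        # occupied seat in its affected seats list.
        self.risk[seat] = sum(1 for n in self.affected[seat]
                              if n in self.occupied)
        # Assumption: free affected seats lose the unit this seat contributed.
        for neighbour in self.affected[seat]:
            if neighbour not in self.occupied:
                self.risk[neighbour] -= 1


# Hypothetical row of four seats in which each seat affects its neighbours.
affected = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
bus = SeatMap(affected, seed=1)
first = bus.board()
second = bus.board()
```

With these assumptions, the risk potential of a free seat always equals the number of occupied seats in its affected seats list, so in this small example the second passenger is never placed next to the first.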

This article presents the results of a research project designed to gather information about the risk of infection on the routes of an intercity road transport system. This information can be used to identify the routes with the highest risk and to assess the impact of different measures to minimise this risk. To achieve this objective, a data mining methodology was used. The results were obtained by analysing a real case: the data from an intercity transport operator on the island of Gran Canaria for the month of December 2019.

The results provide new insights into the interactions that occur between passengers in a public transport network, useful both for epidemiological control by health authorities and for the transport operator when implementing effective measures to reduce the risk of infection. Specifically, the effects of two seat allocation policies were analysed. The first of these policies is an approximation of the usual behaviour of passengers when occupying a free seat, and the second is a strategy that aims to minimise the risk of infection. The methodology used to obtain these results was parameterised in accordance with epidemiological aspects and entities related to transport activity. To be precise, the definition of close contact for COVID-19 was used, together with the duration of the routes analysed and the geographical area in which they operate. Given the fact that the parameters of the methodology can be adapted, it could be applied to other diseases and use other transport-related aspects, such as the type of route, time bands, periods of time, etc. This is made possible by the fact that the initial transport activity data can be used to generate a coherent and robust data set structured in the form of a graph. In order to obtain information about the interactions that occur on the transport system, the k-means [...]

Since 1987, he has been a Professor with the Informatics and Systems Department, University of Las Palmas de Gran Canaria, where he is currently the Director. His research interests include ubiquitous computing, intelligent transport systems, data mining, and technologies for education.