Autonomous and Human-Driven Vehicles Interacting in a Roundabout: A Quantitative and Qualitative Evaluation

Optimizing traffic dynamics in an evolving transportation landscape is crucial, particularly in scenarios where autonomous vehicles (AVs) with varying levels of autonomy coexist with human-driven cars. While optimizing Reinforcement Learning (RL) policies for such scenarios is becoming more and more common, little has been said about realistic evaluations of such trained policies. This paper presents an evaluation of the effects of AVs penetration among human drivers in a roundabout scenario, considering both quantitative and qualitative aspects. In particular, we learn a policy to minimize traffic jams (i.e., minimize the time to cross the scenario) and to minimize pollution in a roundabout in Milan, Italy. Through empirical analysis, we demonstrate that the presence of AVs can reduce time and pollution levels. Furthermore, we qualitatively evaluate the learned policy using a cutting-edge cockpit to assess its performance in near-real-world conditions. To gauge the practicality and acceptability of the policy, we conduct evaluations with human participants using the simulator, focusing on a range of metrics like traffic smoothness and safety perception. In general, our findings show that human-driven vehicles benefit from optimizing AVs dynamics. Also, participants in the study highlight that the scenario with 80% AVs is perceived as safer than the scenario with 20%. The same result is obtained for traffic smoothness perception.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.
Modern society grapples with a large number of societal challenges. Among the most pressing is the constant increase in levels of pollution and related traffic congestion [1], both of which threaten the sustainability and livability of our urban environments [2], [3]. At the same time, at the forefront of the technological revolution are autonomous vehicles (AVs). These vehicles, driven by advanced algorithms and sensors, promise to redefine how we perceive transportation [4], [5], [6]. They can potentially alleviate some of the most persistent issues of modern urban transportation, from increasing safety [7], [8], [9], [10], [11] to optimizing traffic flow [12], [13], [14], [15]. However, as with any nascent technology, the real-world implementation of AVs is laden with challenges. Among the most significant are the safety concerns and prohibitive costs associated with testing and validating the efficiency of these vehicles in real conditions. One aspect that is fundamental to addressing these challenges is the evaluation of driving scenarios in which AVs are present together with human drivers, in order to assess not only the performance attained by the AVs, but also the behavior of the whole hybrid multi-agent system of drivers, together with the perception of driving comfort and safety experienced by the human drivers involved.
In our paper, in order to conduct such an analysis while combining safety with realistic conditions, we bypass the logistical and financial constraints of real-world testing by harnessing state-of-the-art simulation tools. Our primary focus is on a small-scale yet intricately complex scenario: a roundabout in Milan, Italy. Utilizing Simulation of Urban MObility (SUMO) [16], a cutting-edge traffic simulator, we create a realistic environment where both AVs and human-driven vehicles (HVs) coexist, navigating the roundabout under realistic traffic loads. Also, we bridge the gap between static simulation and real-world experience by integrating SUMO with VI-WorldSim [17]. This user-friendly, fully integrated graphic environment not only accelerates vehicle development offline but also facilitates a more immersive experience on driving simulators. To enhance the realism of our study, we leverage a high-fidelity cockpit that replicates real-world driving conditions, installed at the DriSMi Lab at the Polytechnic University of Milan, enabling us to evaluate the scenario from a qualitative perspective, to the best of our knowledge for the first time. In our simulation, HVs adhere to realistic dynamics as simulated by SUMO, while AV actions are dictated by a policy learned via Reinforcement Learning (RL).
In general, the integration of microscopic traffic simulators with driving simulators is not a novel concept [18], [19]. However, previous studies have presented certain challenges, which our paper addresses and resolves [20], [21], [22], [23]. Specifically, our approach ensures synchronized co-simulation in real time, utilizing the same road network in both simulators and achieving a delay of only 5 ms.
Our findings, as detailed in this paper, shed light on the multifaceted impact of AV integration. While the benefits of AVs in reducing pollution and alleviating congestion are evident, interestingly enough, the presence of AVs also augments the efficiency of human-driven vehicles. As we show in the Tables in Section V-B, as the AV penetration rate increases, the ripple effects are felt across the entire traffic ecosystem, offering insights into a future where harmonious coexistence between AVs and HVs might redefine urban transportation. More precisely, we evaluate the learned policy and the interaction between AVs and HVs qualitatively using traffic smoothness perception and safety perception as metrics. The evaluation is carried out by surveying the perception of individuals who use the cockpit installed at the DriSMi Lab. While a number of participants felt a difference between the presence of 20% and 80% of AVs on the streets, most of them preferred the scenario with an 80% penetration rate of AVs in terms of both safety and smoothness, as highlighted in the Tables of Section V-B.
The advances presented in our study can be summarized as follows:
• We propose a framework that consists of the integration of three realistic simulators: SUMO, VI-WorldSim, and a cockpit. This framework enables quantitative and qualitative evaluation of AV and HV interactions.
• We employ RL to learn AV behaviors in a real-world scenario (i.e., a roundabout in Milan, Italy) with realistic traffic loads.
• We measure the reduction of crossing time (up to −10.72% for AVs and −8.52% for HVs), emissions (up to −38.98% for AVs and −39.13% for HVs) and consumption (up to −35.82% for AVs and −35.15% for HVs) as quantitative metrics for policy evaluation, and traffic smoothness perception and safety perception as qualitative metrics.
The rest of the paper is structured as follows. In Section II, we briefly introduce SUMO, VI-WorldSim, the cockpit and RL concepts as background notions. We then conduct a literature review of works that study mixed-traffic scenarios, and of works leveraging RL for AV and traffic simulation, in Section III. In Section IV, we first discuss how we turn SUMO into an RL environment, the algorithm we use, namely Proximal Policy Optimization (PPO) [24], how the test environment (i.e., the roundabout in Milan) is designed, and how the data for policy fine-tuning and traffic loads are collected. In Section V, we present both the quantitative and qualitative results, and we comment on some of the findings. In Section VI, we summarize the paper and present some interesting future directions to follow.

II. BACKGROUND

A. SIMULATIONS IN SUMO
Simulations are extensively used in a variety of fields such as urban planning [25], transportation [26], [27], [28], robotics [29], epidemiology [30], gaming [31], and others. Experiments in the real world are often costly, dangerous, or infeasible. Thus, simulators provide a solution for evaluating hypotheses and methodologies in silico, where certain aspects of the behavior faithfully mirror the real world.
SUMO [16] is a state-of-the-art multi-agent simulator for transportation systems that reproduces realistic behaviors of drivers. In SUMO, it is possible to deploy multiple agents that use different transportation means (e.g., cars, public transit, bicycles) to reach different goals. The agents can move within a street network that defines the environment in which they can operate. Interestingly enough, SUMO is also a microscopic traffic simulator, i.e., each agent is modeled as an individual based on separate and different car-following and lane-changing models.
SUMO's workflow to simulate realistic traffic is organized as follows. First, the road network is defined to match the real world, and road loads (e.g., a realistic number of agents using a specific street) are specified. Second, the agents are executed in the road network through high-fidelity simulations and scored according to a cost function that measures whether certain goals have been reached. Next, since in this work we use a learned Reinforcement Learning policy trained in collaboration among all the AVs involved, SUMO re-plans, re-executes, and re-scores the actions taken by the agents using a co-evolutionary algorithm until nobody can unilaterally improve their trips. At that point, the system has reached an equilibrium, and we can inspect the individuals' typical behaviors.

B. VI-WorldSim
VI-WorldSim is an innovative software solution designed to facilitate creating and testing lifelike driving scenarios. These scenarios include traffic flow, pedestrians, weather conditions, and sensor feedback. By utilizing VI-WorldSim, individuals can immerse themselves in many realistic environments, from bustling urban streets to expansive highways and specialized test sites. This versatility allows for an in-depth evaluation of how a vehicle model responds to many situations, so that potential challenges can be addressed during development. Also, the realism of the simulations, combined with the integration with realistic cockpits, can be used to evaluate scenarios qualitatively in a safe and relatively cheap way. An example of what a VI-WorldSim simulation looks like can be seen in Figure 1 B.

C. VI-WorldSim AND SUMO INTERACTION
The traffic scenario is generated thanks to a co-simulation between SUMO, which is used as a traffic engine, and VI-WorldSim, which simulates the motion of the vehicle driven by the human in the loop and provides a graphical representation of the traffic scenario. SUMO is in charge of simulating all virtual vehicles in the roundabout, comprising HVs and AVs. In particular, SUMO receives the data about the car driven by the human in the loop (ego-vehicle) from the driving simulator through VI-WorldSim. The current position of all simulated vehicles is fed to VI-WorldSim, which is in charge of the whole graphical environment of the driving simulator, of the interface with the human in the loop, and of the simulation of the motion of the ego-vehicle according to the requests of the real human driver. All these simulations are performed in real time and the corresponding data are stored in a real-time database. Additional information about the communication schema between SUMO and VI-WorldSim can be found in [17].

D. THE COCKPIT
The human in the loop drives the vehicle from the cockpit of the driving simulator, which moves according to the simulated motion of the ego-vehicle. The cockpit can be seen in Figure 1 A. The cockpit is equipped with a telematic box, which is connected to the Controller Area Network bus of the cockpit, reads the dynamic data of the car, and transmits them to a 5G radio platform. The 5G radio platform transmits the data to and from an edge server and is controlled by a Next Unit of Computing (NUC) where the services (e.g., learned policy) are installed. Finally, the edge server hosts the RL infrastructure, including the policy, which is used to control the connected automated vehicles simulated by SUMO. The characteristics of the cockpit are summarized in Table 1.

E. REINFORCEMENT LEARNING
Reinforcement Learning (RL) constitutes a branch within the domains of Artificial Intelligence and Machine Learning that draws its initial inspiration from the Pavlovian paradigm of conditioning, wherein organisms adapt their behavioral choices based on the gratifications (rewards) and aversions (punishments) received. Analogously, in the classic RL setup an artificial agent (sometimes referred to as an actor) is engaged in dynamic interactions with its environment, selecting its course of action at discrete time intervals. RL algorithms aim at training the agent, enabling it to independently attain optimal behaviors within a designated environment, in alignment with the objectives of the specific problem at hand. The remaining part of this section summarizes the fundamental RL definitions (see [32], [33], [34] for a more detailed introduction).
The RL problem can be mathematically formalized as a Markov Decision Process, by considering a tuple $(S, A, p, r)$ such that $S$ and $A$ are the sets of possible agent states and actions, respectively, while $p : S \times A \times S \to [0, 1]$ indicates the probability of transitioning from state $s$ to state $s'$ by taking action $a$, i.e.,
$$p(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a).$$
The reward function $r : S \times A \to \mathbb{R}^+_0$ maps a state-action pair to its immediate, intrinsic desirability in relation to the agent's task at hand. In this context, the agent's decision-making process is modeled as a policy function $\pi : S \times A \to [0, 1]$, such that $\pi(s, a)$ represents the probability of taking action $a$ while in state $s$, determining de facto the agent's behavior. It is possible to define a value function $V^\pi : S \to \mathbb{R}^+_0$ associated with any policy $\pi$, which represents the expectation with respect to $p$ of the cumulative discounted rewards obtained by $\pi$ over time from each state, considering $\gamma \in (0, 1]$ as a discounting factor penalizing rewards obtained further on in time, i.e.,
$$V^\pi(s) = \mathbb{E}_{s_{t+1} \sim p_t,\, a_t \sim \pi_t} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s \right],$$
where we define for each step $t$ the distributions $p_t = p(s_t, a_t, \cdot)$ and $\pi_t = \pi(s_t)$. It is sometimes useful to consider the action-value function $Q^\pi : S \times A \to \mathbb{R}^+_0$ associated with $\pi$, where $Q^\pi(s, a)$ indicates the expected cumulative discounted rewards obtained by following policy $\pi$ over an infinite horizon, starting from state $s$ and performing $a$ as a first action, i.e.,
$$Q^\pi(s, a) = \mathbb{E}_{s_{t+1} \sim p_t,\, a_t \sim \pi_t} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a \right],$$
keeping in mind that the following property holds:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi(s)} \left[ Q^\pi(s, a) \right].$$
The agent's overall goal is to learn a policy $\pi^*$ that maximizes the expected long-term desirability of each state, i.e., such that for each $s \in S$
$$V^{\pi^*}(s) = \max_{\pi} V^\pi(s). \tag{1}$$
In many cases it is common to model $r$ as a function linking states and actions with an associated cost (instead of a reward), without loss of generality, and to substitute the maximization with a minimization in (1).
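As an illustration of the definitions in this section, the following minimal sketch evaluates a fixed policy on a toy Markov Decision Process and checks numerically that the value function equals the policy-weighted average of the action-value function. The two-state, two-action MDP, its transition probabilities, rewards, and the discount factor are all hypothetical values chosen for illustration; they are not taken from the roundabout scenario.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions).
# P[s, a, s2] = p(s, a, s2), R[s, a] = r(s, a) >= 0, PI[s, a] = pi(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
PI = np.array([[0.5, 0.5],
               [0.1, 0.9]])
GAMMA = 0.9  # chosen below 1 so that the infinite-horizon sum converges


def evaluate_policy(P, R, PI, gamma):
    """Return (V, Q) for policy PI by solving the linear Bellman equations."""
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under PI.
    P_pi = np.einsum("sa,sax->sx", PI, P)
    r_pi = np.sum(PI * R, axis=1)
    # V = r_pi + gamma * P_pi V  <=>  (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Q(s, a) = r(s, a) + gamma * sum_s2 p(s, a, s2) V(s2)
    Q = R + gamma * (P @ V)
    return V, Q


V, Q = evaluate_policy(P, R, PI, GAMMA)
# Property linking the two functions: V(s) is the average of Q(s, a) under pi(s, .).
consistent = np.allclose(V, np.sum(PI * Q, axis=1))
```

Solving the linear system is only practical for small, fully known MDPs; in the setting of this paper the transition model is not available in closed form, which is why a sampling-based method such as PPO is used instead.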

III. LITERATURE REVIEW
This section reviews significant contributions in the literature regarding the study of mixed-traffic scenarios and the design of RL policies for AVs. In this work we do not focus on the development of new RL algorithms for AV training. On the contrary, in order to realistically study the interaction and coexistence of AVs and HVs, we find it more sensible to employ established and, hence, more reliable RL techniques that are closer to the deployment phase, rather than extremely recent methods. Hence, we introduce the related literature not from the algorithmic point of view, but by summarizing some of the most common problem design choices.

A. AVs-HVs INTERACTION
Predictions suggest a gradual market integration of AVs, with estimates ranging from 24% to 87% market share by 2045 [35], [36]. Most of the studies about the role of AV integration explore how such vehicles will lead to a fundamental transformation in mobility [37], [38], [39], [40], [41]. A number of research papers have delved into the advantages of AVs, and in particular Connected AVs (CAVs), concluding that they can notably decrease traffic incidents, alleviate congestion, lower fuel use, and offer essential mobility options [37], [38], [39], [42], [43], [44], [45], [46], [47]. Researchers highlighted that even a small penetration of AVs may significantly impact, for example, the number of parking requirements and the number of accidents [38], [42]. Other researchers also investigated how street features like Variable Speed Limits and road capacities impact urban scenarios under different levels of AV integration [40], [41].
Focusing on the interaction between HVs and AVs, in [48] researchers propose considerations on the design of AVs that can safely and intuitively interact with other traffic participants, based on common human interaction strategies. A driving simulator experiment was conducted in [49] on 34 individuals, to investigate the behavior of HVs exposed to different road designs of lanes dedicated to AVs on motorways.
The coexistence and interaction of AVs and HVs was investigated as well in [50], through a field test conducted on 18 participants, focusing on gap acceptance, car-following, and overtaking behaviors, showing that drivers interacting with recognizable AVs adopt smaller critical gaps and, after overtaking, merge closer in front of them. The work in [51] is devoted to understanding HV behavior in mixed traffic at unsignaled priority T-intersections, through a driving simulator experiment on 95 human drivers, whose findings suggest that human drivers change their gap acceptance behavior in mixed traffic depending on AV recognizability and driving style. Finally, the authors of [52] aimed at quantifying the behavioral changes caused by human drivers following either an AV or an HV, and their impact on safety, fuel consumption and pollution, by analyzing data from a field experiment on 9 drivers. Their work shows that, when human drivers follow AVs, lower driving volatility in terms of speed and acceleration can be achieved, and consequently more stable traffic flow behavior, lower crash risk, and lower fuel consumption and emissions are experienced. However, to the best of our knowledge, there is no study that evaluates how the penetration rate of AVs may impact the human driver's perception of traffic smoothness, safety and comfort in a scenario in which AVs and HVs coexist. Such evaluation is the main goal of our study.

B. RL DESIGN FOR AVs
In the context of RL policy training for AVs, literature can be found detailing the most common choices related to the design of the RL problem to be solved, in particular regarding the design of states, actions and rewards. It is frequent to consider a state space that includes the position, heading, and velocity of the vehicle, and the presence of obstacles within the sensor's view, possibly employing a Cartesian or Polar occupancy grid centered around the vehicle. This grid is often enhanced with lane-related details such as lane number, path curvature, historical and predicted trajectory of the vehicle, longitudinal measures such as time-to-collision, and broader scene-related information such as traffic regulations and the locations of traffic signals [53], [54]. The action space, instead, is by definition tied to the actuators present on the vehicle and devoted to the vehicle control task. Multiple actuators come into play, both continuous (for instance steering angle, throttle, and brake) and discrete (e.g., gear changes). A framework incorporating temporal abstractions, such as options (sub-policies that extend primitive actions over multiple time steps, to be chosen instead of low-level actions), can simplify action selection [55]. Concerning the design of suitable reward functions for RL agents in the context of autonomous driving, researchers follow a variety of approaches, in order to tackle the many different sub-skills that together characterize the general goal of autonomously driving a vehicle. Examples include measures such as distance traveled towards a destination [56], speed of the vehicle [56], [57], [58], maintaining the ego vehicle at a standstill [59], avoiding collisions with other road users or scene objects [56], [57], adherence to sidewalk rules [56], staying in the lane, ensuring comfort and stability while avoiding extreme acceleration, braking, or steering [58], [59], and adherence to traffic regulations [57].

IV. METHODOLOGY
This section describes all the steps and experimental setups adopted to carry out the study. First, we describe how SUMO can be turned into a realistic RL environment using Python libraries like Flow [60], Ray RLlib [61] and OpenAI Gym [31].

A. TURNING SUMO INTO A REALISTIC REINFORCEMENT LEARNING-BASED ENVIRONMENT
In this work, we transform SUMO, a high-fidelity multi-agent transportation simulator, into a realistic Reinforcement Learning environment to optimize and evaluate policies for AVs. The ultimate goal is to use SUMO to learn policies in which AVs take optimal actions to reduce emissions and to minimize the time to cross a real roundabout in Milan, Italy.
To transform SUMO into a Reinforcement Learning environment, we integrate Flow [60], an open-source Python package that can be used to create a communication layer between SUMO and Ray RLlib [61]. Remarkably, Flow can be used to investigate the so-called mixed autonomy scenarios, where only a portion of the deployed cars are AVs and the others are controlled by car-following models (CFMs), a set of ordinary differential equations that realistically mimic basic traffic dynamics on single-lane roads. This ability to study mixed autonomy scenarios is paramount for policymakers and traffic engineers, as it represents a more realistic short-term scenario. In Flow, SUMO is connected with Ray RLlib and OpenAI Gym [31] through TraCI (Traffic Control Interface), the interface that gives external programs access to a running SUMO simulation over a network connection. The environment in which agents operate consists of a realistic network representing a physical road layout (e.g., speed limits, lanes, length, shape). The actors are the deployed cars. Some of them (marked with "rl_agent") are controlled by a learned policy and make decisions according to a specific objective they have to minimize. Other cars (marked with "human_agent") base their decisions on pre-defined driver models.
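To make the notion of a car-following model concrete, the sketch below integrates one such ordinary differential equation with a simple Euler scheme. The text does not state which CFM governs the human-driven vehicles, so the Intelligent Driver Model (IDM) is used here purely as a well-known example; the parameter values, the function names, and the lead-vehicle scenario are illustrative assumptions and are not the calibrated values of this study.

```python
import math


def idm_acceleration(v, v_lead, gap, v0=13.9, T=1.0, a_max=1.0, b=1.5, s0=2.0, delta=4):
    """Intelligent Driver Model acceleration (illustrative parameters, SI units).

    v: follower speed, v_lead: leader speed, gap: bumper-to-bumper distance.
    v0: desired speed, T: desired time headway, a_max: maximum acceleration,
    b: comfortable deceleration, s0: minimum standstill gap.
    """
    dv = v - v_lead  # approach rate (positive when closing in on the leader)
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


def simulate_follower(steps=600, dt=0.1):
    """Euler integration of a follower behind a leader driving at constant speed."""
    x_lead, v_lead = 30.0, 8.0  # hypothetical leader: 30 m ahead, 8 m/s
    x, v = 0.0, 0.0             # follower starts at rest
    for _ in range(steps):
        acc = idm_acceleration(v, v_lead, gap=x_lead - x)
        v = max(0.0, v + acc * dt)  # vehicles do not drive backwards
        x += v * dt
        x_lead += v_lead * dt
    return x_lead - x, v


final_gap, final_speed = simulate_follower()
```

In a mixed-autonomy setup, vehicles of the "human_agent" kind would be advanced by a model of this type at every simulation step, while vehicles of the "rl_agent" kind would receive their acceleration from the learned policy instead.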
Flow makes it possible to rely on observer functions to map the set of states $S$ to the observations $O$. This allows us to tailor the information provided to the controller by choosing a subset of the SUMO states of the vehicle. We structure the observation vector $o^n_t \in O$ available at instant $t$ to vehicle $n$ so that it contains the last measured value of position $x^n_t$ and acceleration $\ddot{x}^n_t$ of the vehicle. Moreover, we include in $o^n_t$ information obtained from the estimated position and acceleration of its front (F), back (B), left (L) and right (R) neighbors in the scenario, indicated respectively as $\{x^k_t, \ddot{x}^k_t\}_k$ with $k = F, B, L, R$, if such vehicles are present at instant $t$ around vehicle $n$. The values associated with the neighbors that are missing at time $t$ are replaced by a suitable placeholder. The described components of the observation vector are divided by quantities that are characteristic of the chosen scenario, such as the scenario's dimension $x_{\max}$ and the maximum acceleration $\ddot{x}_{\max}$. Summarizing, we consider as observation $o^n_t$ the vector
$$o^n_t = \left( \frac{x^n_t}{x_{\max}}, \frac{\ddot{x}^n_t}{\ddot{x}_{\max}}, \left\{ \frac{x^k_t}{x_{\max}}, \frac{\ddot{x}^k_t}{\ddot{x}_{\max}} \right\}_{k = F, B, L, R} \right). \tag{2}$$
The action $a^n_t \in A$ decided by agent $n$ at time $t$, instead, consists of the next acceleration value $\ddot{x}^n_{t+1}$ to actuate, and of a discrete decision $c^n_{t+1} \in \{0, 1\}$, corresponding to changing lane or maintaining the current one, that is
$$a^n_t = \left( \ddot{x}^n_{t+1}, c^n_{t+1} \right). \tag{3}$$
Our work aims at leveraging the presented infrastructure to allow AVs to learn via RL techniques the optimal control law to cross a roundabout as fast as possible. In this sense, we design a stage-cost function $r : O \times A \to \mathbb{R}_{\geq 0}$ measuring the deviation $d$ of the AV velocity $\dot{x}^n_t$ from a user-defined desired velocity $v$. In order to penalize the early termination of roll-outs due to collisions or other failures, we subtract such deviation from the peak allowable deviation $d_{\max}(v)$. Finally, to ensure non-negativity, the cost is bounded below by 0, i.e.,
$$r(o^n_t, a^n_t) = \max\left\{ d_{\max}(v) - d(\dot{x}^n_t, v),\, 0 \right\}. \tag{4}$$
Simultaneously, we are interested in reducing polluting emissions. Hence, we combine the previously described velocity-based stage-cost with an analogous one, punishing the deviation from a target level of pollution $P$, i.e., $d(p^n_t, P)$, where the pollution levels $p^n_t$ are measured as a function of the actuated decisions $a^n_t$, and $P$ is estimated by considering a single vehicle in the scenario, running at constant velocity. The two components of the cost are summed and normalized, and then assigned as a stage-cost to the AV. We use the well-known Proximal Policy Optimization [24] algorithm (see Section IV-B for more details) to learn a policy allowing the AVs to decide at each instant which action to take, based on the agent's individual observation vector $o^n_t \in O$. All the AVs involved in the roundabout scenario collaborate in training a central policy by providing their simulated experience in order to update the policy parameters. Moreover, during training and deployment, when queried, the policy exploits both the local observations of the agent taking the decision and information related to the position and acceleration of other vehicles in the roundabout that are received through a central communication scheme.
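The observation layout and the two-component stage cost described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the placeholder value for missing neighbors, the use of an absolute difference as the deviation $d$, and the way the two terms are normalized (each scaled to [0, 1] and then averaged) are all hypothetical choices, since the text does not specify them.

```python
import numpy as np

PLACEHOLDER = -1.0  # assumed stand-in for a missing neighbor (value not given in the text)
NEIGHBORS = ("F", "B", "L", "R")  # front, back, left, right


def build_observation(ego, neighbors, x_max, acc_max):
    """Normalized observation: ego (position, acceleration) plus four neighbor pairs.

    ego: tuple (position, acceleration).
    neighbors: dict mapping a key in NEIGHBORS to (position, acceleration);
               keys that are absent are treated as missing vehicles.
    """
    obs = [ego[0] / x_max, ego[1] / acc_max]
    for k in NEIGHBORS:
        if k in neighbors:
            x_k, acc_k = neighbors[k]
            obs += [x_k / x_max, acc_k / acc_max]
        else:
            obs += [PLACEHOLDER, PLACEHOLDER]
    return np.array(obs, dtype=float)


def clipped_term(value, target, max_deviation):
    """max(d_max - d, 0), with d taken as |value - target| (an assumption)."""
    return max(max_deviation - abs(value - target), 0.0)


def stage_cost(speed, emission, v_des, p_target, d_max_v, d_max_p):
    """Velocity and pollution terms, each scaled to [0, 1], then averaged (assumed normalization)."""
    term_v = clipped_term(speed, v_des, d_max_v) / d_max_v
    term_p = clipped_term(emission, p_target, d_max_p) / d_max_p
    return 0.5 * (term_v + term_p)


# Example: ego vehicle with only a front neighbor present.
obs = build_observation((50.0, 1.0), {"F": (60.0, -0.5)}, x_max=100.0, acc_max=2.0)
```

With this convention the quantity is largest when the vehicle travels exactly at the desired velocity and at the target pollution level, and it drops to zero once either deviation exceeds its allowed maximum, which is what penalizes roll-outs that terminate early.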
Unlike state-of-the-art studies, we evaluate the learned policy quantitatively and qualitatively, analyzing results obtained through tests carried out at the DriSMi Laboratory of the Polytechnic University of Milan. This laboratory has a high-fidelity, last-generation, cable-driven driving simulator (see Figure 1 and [62], [63]). The traffic scenario is generated thanks to a co-simulation between SUMO (traffic engine) and VI-WorldSim (https://www.vi-grade.com/en/products/vi-worldsim/), which simulates the motion of the vehicle driven by the human in the loop and provides a graphical representation of the traffic scenario. The combination of SUMO and VI-WorldSim allows us to evaluate the learned policy in terms of the traffic smoothness and safety perception of the passengers (i.e., in our case, real humans who agreed to participate in the experiments). Further details about the evaluation procedure are shared in Sections V-A and V-B.

B. PROXIMAL POLICY OPTIMIZATION
To learn the policy parameters, we employ the Proximal Policy Optimization (PPO) method [24]. PPO is a prominent policy-optimization algorithm, deriving from the Trust Region Policy Optimization (TRPO) algorithm [64] and improving on it in terms of flexibility and computational complexity. Both PPO and TRPO are designed to stably and efficiently optimize decision policies, focusing on refining policies by iteratively adjusting their parameters while limiting the extent of these updates in order to maintain stability during learning. This is realized by PPO within a two-step process: policy evaluation and policy improvement. During policy evaluation, data are gathered by executing the current policy within the environment. Subsequently, the advantages of actions taken with respect to expected returns are computed. The advantages serve as a measure of how favorable the chosen actions were with respect to expected outcomes. Following policy evaluation, policy improvement is performed through several epochs of optimization. In each epoch, PPO computes surrogate objectives that quantify the change in the policy's performance with respect to the previously considered set of policy parameters $\theta_{old}$, guided by the advantage values, i.e.,
$$L(\theta) = \mathbb{E}_t \left[ \min\left( \frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{old}}(s_t, a_t)} \hat{A}_t,\; \mathrm{clip}\left( \frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{old}}(s_t, a_t)}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right],$$
where $\hat{A}_t$ denotes the advantage estimated at step $t$. These surrogate objectives facilitate the optimization of the policy in a manner that promotes positive action shifts while maintaining a threshold $\epsilon$ on the magnitude of policy updates. This threshold, referred to as the "clip parameter," keeps policy updates from straying too far from the original distribution, preventing drastic policy shifts that could lead to instability. PPO's distinctive feature lies in its capacity to strike a balance between exploiting the advantages of updated policies and maintaining a controlled adjustment process. By constraining policy updates and employing the surrogate objective, PPO achieves stable and incremental policy improvements, contributing to efficient and reliable RL. These characteristics, together with the performance attained by PPO in many benchmark examples and applications [65], motivated us to choose PPO as the learning algorithm for AV policy optimization. Moreover, in order to conduct an objective evaluation of the effects of deploying AVs in realistic scenarios in the presence of human drivers, technologies that are both well-performing and well established, such as PPO, should be preferred to algorithms that, although cutting-edge, are not yet mature and hence are less commonly deployed in real applications.
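The effect of the clip parameter can be seen in a few lines of code. The sketch below computes the clipped surrogate objective of PPO [24] from log-probabilities and advantage estimates using NumPy; the function name, the default clip value of 0.2, and the sample inputs are illustrative assumptions, and the study itself relies on the Ray RLlib implementation rather than on code of this kind.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective averaged over a batch of samples.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the previous policy parameters; advantages: advantage estimates.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # The element-wise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction favored by the advantage.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Probability of a good action doubled: the gain is capped at (1 + eps) * advantage.
capped_gain = ppo_clip_objective([np.log(2.0)], [0.0], [1.0])
# Probability of a bad action halved: the objective is capped at (1 - eps) * advantage.
capped_loss = ppo_clip_objective([np.log(0.5)], [0.0], [-1.0])
```

In both sample cases the objective stops changing once the probability ratio leaves the clipping interval, so the gradient vanishes and the update cannot move the policy much further away from the previous one.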

C. ROUNDABOUT DESIGN AND DATA COLLECTION
To operate in an environment that is as realistic as possible, we designed a roundabout inspired by a real-world one in Milan, Italy. Panel A) of Figure 2 shows how the roundabout looks in the real world, while panel B) shows its SUMO counterpart. It is a four-leg mini-roundabout with medium-high traffic, which makes it a challenging environment for the AV policy. Moreover, it has some important details that make this particular scenario of general interest, specifically:
• Every leg has pedestrian crosswalks immediately before the entrance of vehicles into the circulatory roadway.
• Two of the legs are central arteries of the city, greatly increasing traffic on the roundabout.
• The roundabout has a standard configuration widely distributed in European urban areas [66] with significant flows.
A calibration procedure was conducted to replicate the number of vehicles approaching the intersection and their positions during the simulation. Firstly, measurements were taken for the maximum queue length and for the upstream and downstream flows of each leg, considering road vehicles, pedestrians, and bicycles on the actual roundabout. This process was repeated for six consecutive time slots, each lasting 10 minutes.
Subsequently, the results of these measurements were compared with simulations conducted in SUMO to calibrate the seven most relevant parameters that define the traffic conditions in the considered scenario. The parameters under consideration include the distance at which a pedestrian is considered, the minimum time interval to cross the path of another vehicle when entering the roundabout, the time before a driver enters the roundabout even if obstructing the way of an incoming vehicle, the maximum acceleration and deceleration of the vehicles, the time interval between vehicles, and the drivers' reaction time. For each simulation run, a cost function was constructed to compare the mean and maximum queue lengths between the measured and simulated data. The calibrated parameters are those that minimize the differences between the two sets of data. Specifically, Table 2 lists the SUMO parameters detailed below and their values:
• jmCrossingGap: minimum distance between the vehicle and the pedestrian that is heading toward the point of conflict of its trajectory with that of the vehicle;
• jmTimegapMinor: minimum time interval for a vehicle to enter an intersection where it does not have the right-of-way, before a vehicle with right-of-way;
• impatience: driver's intent to obstruct a vehicle with the right-of-way;
• accel: maximum acceleration for the selected vehicle type;
• decel: maximum deceleration for the selected vehicle type;
• tau: minimum time interval between consecutive vehicles;
• actionStepLength: driver reaction time.
As can be noted from Figure 2 B), although present in the real roundabout, restricted lanes and pedestrian crosswalks do not appear in the final network. Such elements were removed after the calibration process, as they are not within the scope of the AI@EDGE project. Correction coefficients, taken from the literature, have been used to take these elements into account in the modified model. For the same reason, only cars are simulated; other vehicles and pedestrians are accounted for via equivalent coefficients [67].
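The calibration loop described in this section can be illustrated with a small sketch. Everything in it is hypothetical: the squared-error form of the cost, the toy stand-in for a SUMO run, the grid of candidate values, and the restriction to two parameters (tau and accel) instead of the seven that were actually calibrated. It only shows the general pattern of scoring each candidate setting by the mismatch in mean and maximum queue length and keeping the best one.

```python
from itertools import product


def calibration_cost(measured, simulated):
    """Mismatch in mean and maximum queue length (squared-error form is an assumption)."""
    mean_m = sum(measured) / len(measured)
    mean_s = sum(simulated) / len(simulated)
    return (mean_m - mean_s) ** 2 + (max(measured) - max(simulated)) ** 2


def toy_simulator(tau, accel):
    """Hypothetical stand-in for a SUMO run: queue lengths for six 10-minute slots."""
    base = [4.0, 6.0, 9.0, 7.0, 5.0, 3.0]
    return [q * tau + 2.0 / accel for q in base]


def grid_search(measured, taus, accels):
    """Return (cost, tau, accel) for the candidate pair with the lowest cost."""
    best = None
    for tau, accel in product(taus, accels):
        cost = calibration_cost(measured, toy_simulator(tau, accel))
        if best is None or cost < best[0]:
            best = (cost, tau, accel)
    return best


# Synthetic "field measurements" generated from a known setting, then recovered.
measured_queues = toy_simulator(1.2, 2.0)
best_fit = grid_search(measured_queues, taus=[1.0, 1.2, 1.4], accels=[1.5, 2.0, 2.5])
```

In the actual procedure each candidate would require a full SUMO simulation, so an exhaustive grid over seven parameters quickly becomes expensive; the text does not state which search strategy was used.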

A. QUANTITATIVE EVALUATION
In our work, we focus on evaluating the efficacy of the learned policy in reducing the time AVs need to cross the roundabout, as well as emissions and fuel consumption. Concerning the former, given a simulation $S$ of 3600 seconds (i.e., 1 hour) with a given percentage of AVs and a total of $n$ vehicles $v_1, v_2, \ldots, v_n$, we associate to each vehicle $v_i$ its entering time $t^{in}_{v_i}$ and exiting time $t^{out}_{v_i}$. The average time needed by the cars in the simulation $S$ to cross the roundabout is then computed as

$$\bar{t}(S) = \frac{1}{n} \sum_{i=1}^{n} \left( t^{out}_{v_i} - t^{in}_{v_i} \right) \qquad (5)$$

Note that by restricting $V = \{v_1, v_2, \ldots, v_n\}$ to the AVs or to the HVs only, it is possible to estimate the average time for each vehicle type separately. As a reminder, the simulation is carried out to have a total of 1540 vehicles passing through the roundabout over one hour, as emerged from the field measurements. Time is measured in seconds.
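The average crossing time described above (exit time minus entry time, averaged over the vehicles considered) can be illustrated with a minimal Python sketch. The `Trip` record, its field names, and the sample values are assumptions made for this example; they are not taken from the SUMO output format or from the paper's data.

```python
from dataclasses import dataclass


@dataclass
class Trip:
    vehicle_id: str
    is_av: bool    # True for an autonomous vehicle, False for a human-driven one
    t_in: float    # time the vehicle enters the roundabout, in seconds
    t_out: float   # time the vehicle exits the roundabout, in seconds


def average_crossing_time(trips):
    """Mean of (t_out - t_in) over the given trips, in seconds."""
    if not trips:
        raise ValueError("at least one trip is required")
    return sum(t.t_out - t.t_in for t in trips) / len(trips)


# Made-up trips, for illustration only.
trips = [
    Trip("v1", True, 0.0, 16.0),
    Trip("v2", False, 5.0, 23.0),
    Trip("v3", False, 10.0, 30.0),
]

overall = average_crossing_time(trips)
av_only = average_crossing_time([t for t in trips if t.is_av])
hv_only = average_crossing_time([t for t in trips if not t.is_av])
```

Filtering the trip list before averaging is what yields separate AV and HV averages from a single simulation.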
Table 3 and the right panel of Figure 3 report the results obtained with different shares of AVs leveraging the learned policy.
TABLE 3. Given a certain penetration rate of AVs (column %AVs), we measure the average time that AVs and HVs need to cross the roundabout as in Eq. (5). The measurements are in seconds. We observe that as the number of AVs (column #AVs) increases, both AVs and HVs, on average, need less time to cross the roundabout, highlighting how HVs may also benefit from the optimization and diffusion of AVs.
At 0% AV penetration, only HVs are present. The average time taken by HVs to cross the roundabout stands at 17.94 seconds. As we introduce AVs into the system, there is a slight improvement in the average crossing times for both types of vehicles. At 10% AV penetration, the average crossing time for AVs is 17.15 seconds. The HVs also experience a marginal decrease in their average time, clocking in at 17.32 seconds. As the percentage of AVs increases from 10% to 50%, there is a gradual and consistent decrease in the average crossing times for both AVs and HVs. At the midpoint, with 50% AVs and 50% HVs, AVs take an average of 16.58 seconds, while HVs take a slightly longer 16.99 seconds. Beyond the 50% mark, as AVs become more dominant, their average crossing times continue to decrease, reaching 15.31 seconds at 100% penetration. Interestingly, the HVs also benefit from the increased AV presence. At 90% AV penetration, when only a small fraction of HVs remain, their average crossing time reduces to 16.41 seconds, the minimum value reached. Throughout the progression, AVs consistently show reduced average crossing times as their presence increases. The data suggest that as the percentage of AVs in traffic increases, roundabout navigation becomes more efficient for both vehicle types. This could be attributed to the predictability and coordination of AVs, which seem to benefit not only other AVs but also the flow of human drivers.
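As a quick arithmetic illustration of the values quoted above, the relative reductions in average crossing time can be computed as follows. The percentages are derived here from the reported averages and rounded to one decimal; they are not figures reported separately in the tables.

$$\text{HVs, from 0\% to 90\% AVs:}\quad \frac{17.94 - 16.41}{17.94} = \frac{1.53}{17.94} \approx 8.5\%$$

$$\text{AVs, from 10\% to 100\% AVs:}\quad \frac{17.15 - 15.31}{17.15} = \frac{1.84}{17.15} \approx 10.7\%$$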
The policy should also optimize fuel consumption and pollution. Both quantities can be extracted from each vehicle's properties as computed by SUMO. To give an idea of how the model is performing, instead of using the values provided by SUMO we report a more interpretable score normalized between 0 and 1. To this end, for each scenario $S$, we take the worst-performing vehicle ($\text{score}_{max}$) and the best-performing vehicle ($\text{score}_{min}$) as normalizing factors. We then perform a min-max normalization, ending up with a score between 0 and 1 for each vehicle, with 0 indicating lower emission or lower fuel consumption and 1 indicating the worst performance. In particular, the adopted formula is:

$$\overline{\text{score}}(S) = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{score}(v_i) - \text{score}_{min}}{\text{score}_{max} - \text{score}_{min}} \qquad (6)$$

where $n$ is the number of vehicles in simulation $S$ and the term $\text{score}(v_i)$ can indicate either the emission or the consumption level of the $i$-th vehicle $v_i$. The results can be seen in Table 4 and in the left (emissions) and center (consumption) panels of Figure 3. When no AVs are present (0% AV penetration), the average fuel consumption score for HVs stands at 0.74, with an emissions score of 0.69. As AVs are introduced into the system at 10% penetration, their fuel consumption and emissions are recorded at 0.67 and 0.59, respectively. Interestingly, even with a modest AV penetration, there is an observable improvement in HV metrics as well. The consumption for HVs drops to 0.69, and their emissions decrease to 0.63. As the percentage of AVs in the system grows from 10% to 50%, there is a consistent improvement in both consumption and emissions for both vehicle types. At 50% AV penetration, AVs record consumption and emission values of 0.53 and 0.46, respectively, while HVs show values of 0.56 and 0.51. Beyond 50% AV penetration, the trend of decreasing consumption and emissions continues for AVs, reaching their lowest values of 0.43 and 0.36, respectively, at 100% AV penetration. HVs also see continuous improvement. Throughout the entire range, AVs exhibit lower consumption and emission values compared to HVs. The difference becomes more pronounced as the percentage of AVs increases, highlighting the efficiency of autonomous driving systems. In conclusion, not only do AVs demonstrate better fuel efficiency and lower emissions, but their presence also seems to positively influence the performance of human-driven vehicles, similarly to what was observed for travel time. This could be attributed to the smoother and more predictable traffic flow learned by the policy, leading to less stop-and-go traffic and more consistent driving speeds.

TABLE 4. Given a certain penetration rate of AVs (column %AVs), we measure the consumption score and emission score of both AVs and HVs, following Eq. (6). Similarly to what happened with the crossing time, we observe that as the number of AVs (column #AVs) increases, both AVs and HVs, on average, consume less fuel and reduce their emissions.
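The min-max normalization described above (best vehicle mapped to 0, worst vehicle mapped to 1, then averaged) can be sketched as follows. The function names and the raw values are hypothetical, and the handling of the degenerate case where all vehicles have the same raw value is an assumption of this sketch rather than something the text specifies.

```python
def normalized_scores(raw):
    """Min-max normalize raw per-vehicle values to the range [0, 1].

    0 corresponds to the best (lowest) vehicle in the scenario, 1 to the worst.
    If all values are equal, every vehicle gets 0.0 (assumed convention).
    """
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0 for _ in raw]
    return [(x - lo) / (hi - lo) for x in raw]


def scenario_score(raw):
    """Average of the normalized per-vehicle scores for one scenario."""
    scores = normalized_scores(raw)
    return sum(scores) / len(scores)


# Made-up raw values (e.g., fuel consumed per vehicle), for illustration only.
raw_values = [2.0, 4.0, 6.0]
per_vehicle = normalized_scores(raw_values)
average = scenario_score(raw_values)
```

Averaging the normalized scores over only the AVs or only the HVs, while keeping the scenario-wide minimum and maximum, would give per-type scores of the kind reported for each penetration rate.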

FIGURE 3. Concerning the quantitative analysis, we measured the impact of the AV penetration rate on emissions, fuel consumption, and average time to cross the scenario. On the left, we have the results related to emissions. In the middle, we show the results for fuel consumption, while on the right we have the average crossing time. As we can observe, the best results are obtained for the scenario with 100% AVs. However, having some AVs (e.g., a 10% to 90% penetration rate) already provides some benefits. Interestingly, HVs also benefit from the penetration of AVs.

B. QUALITATIVE EVALUATION
As previously mentioned, tests have been conducted at the DriSMi Laboratory of the Polytechnic University of Milan to qualitatively evaluate indicators of the passengers' comfort in the described driving scenario, considering different AV penetration levels. The laboratory has a high-fidelity, last-generation driving simulator as described in Section II-D. The integration between SUMO, VI-WorldSim and the simulator itself allows us to evaluate, to the best of our knowledge for the first time, traffic smoothness and safety perception of passengers as qualitative metrics to analyze the passengers' comfort. To collect the necessary information for our study, a panel of ten participants was selected for the preliminary tests. The participants were chosen from individuals without previous experience with driving simulators. The panel consists of 5 females and 5 males, aged between 22 and 33 years, with driving experience ranging from 1 to 15 years. Before the test, each participant was given instructions on how to operate the driving simulator and signed an informed consent form. Additionally, each participant spent about ten minutes driving in a simple motorway scenario to become familiar with the driving simulator before the actual test. Participants were then exposed in simulation to two driving scenarios, characterized by 20% and 80% AV penetration rates, respectively. Following the simulated ride, the participants were asked to fill out a quick survey consisting of three close-ended questions:
• Regarding traffic smoothness, which of the following statements do you agree with the most?
• Regarding safety perception, which of the following statements do you agree with the most?
• Globally, which of the two scenarios did you prefer?
For each question, the participants were asked to select one among five possible alternatives, summarized in the following three tables (Tables 5, 6, and 7, associated with the first, second, and third question, respectively). In total, we collected feedback from 10 individuals. While the sample size is limited, a few tentative conclusions can be drawn. In Table 5, half of the respondents (5 out of 10) felt that the traffic was smoother in the scenario with 80% AVs, to varying degrees. A smaller segment (4 out of 10) felt the opposite, indicating that the 20% AV scenario was smoother. Just 1 out of 10 respondents did not perceive any significant difference in traffic smoothness between the two scenarios. The results suggest that as AV penetration increases, the perceived smoothness of traffic might improve. However, the divergent opinions underscore the complexity of human perceptions and the subjective nature of such assessments.

TABLE 5. As we can see, half of the participants say that the scenario with 80% of AVs was smoother than the one with 20% of AVs. One participant did not perceive differences between the two scenarios and 40% of participants preferred the scenario with 20% of AVs, although the majority of them only partially.
Table 6 shows that a combined total of 8 respondents felt that the scenario with 80% AVs was safer to some degree compared to the 20% AV scenario. In contrast, only 2 out of 10 respondents felt that the 20% AV scenario was safer in any capacity. All the participants perceived a difference in safety between the two scenarios. These results suggest an inclination towards perceiving higher AV penetration as safer. As observed for the perception of traffic smoothness, the diversity in opinions highlights that safety perceptions are subjective and can vary among individuals.

TABLE 6. A summary of the answers to the second question of the survey. According to 80% of the participants, the scenario with 80% of AVs is perceived as safer than the scenario with 20% of AVs. No participant reported perceiving no difference between the two scenarios, and 20% of participants partially preferred the scenario with 20% of AVs.

TABLE 7. A summary of the answers to the third question of the survey. From a general perspective, 70% of participants preferred the scenario with 80% of AVs while 30% preferred the scenario with 20% of AVs, although only partially.
Finally, Table 7 shows an overall leaning towards the 80% AV scenario: 7 out of 10 individuals preferred it either definitely or partially, versus only 3 who partially preferred the 20% AV scenario. This suggests that most participants might perceive benefits in scenarios with higher AV penetration. Naturally, even in this case the variety of responses emphasizes the subjective nature of such preferences, and individual perceptions can vary based on personal experiences or beliefs. Further research could delve deeper into the reasons behind these preferences, providing a more comprehensive understanding of public sentiment toward AV integration, possibly by drawing on a larger sample of participant feedback.
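The aggregated survey outcomes discussed above can be tallied into shares with a short Python sketch. The counts are the aggregated ones stated in the text (out of 10 participants); the grouping into three outcome labels, the dictionary layout, and the function name `shares` are choices made for this illustration, and the split between "definitely" and "partially" is omitted because it is not fully given in the text.

```python
# Aggregated counts per question, as stated in the text (10 participants).
counts = {
    "traffic smoothness": {"80% AVs": 5, "no difference": 1, "20% AVs": 4},
    "safety perception": {"80% AVs": 8, "no difference": 0, "20% AVs": 2},
    "overall preference": {"80% AVs": 7, "no difference": 0, "20% AVs": 3},
}


def shares(question_counts):
    """Convert raw answer counts into percentages of all respondents."""
    total = sum(question_counts.values())
    return {answer: 100 * n / total for answer, n in question_counts.items()}


percentages = {question: shares(c) for question, c in counts.items()}
```

With only ten respondents, each answer moves a share by ten percentage points, which is one concrete reason the conclusions above are stated as tentative.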

VI. CONCLUSION
This research has brought to light significant insights into how human drivers may perceive, in terms of safety and comfort, the integration of AVs in complex urban traffic scenarios, specifically examining a roundabout in Milan, Italy. Our approach leveraged state-of-the-art simulation tools such as SUMO and VI-WorldSim, well-established RL methods, and a high-fidelity cockpit to understand the dynamics of AV penetration alongside HVs.
The findings of our study underscore the substantial benefits that AVs offer in mitigating common urban challenges such as pollution and traffic congestion. By employing established RL techniques, notably the PPO algorithm, we were able to model and analyze the behavior of AVs in realistic traffic settings. The simulation results are promising, showing that AVs enhance their operational efficiency and positively influence the overall traffic flow, benefiting HVs in the process.
Our primary objective included conducting qualitative assessments with human participants. These assessments revealed a notable preference for environments where AVs are more prevalent, with participants perceiving such environments as safer and as having smoother traffic flow. Conducting such evaluations is crucial, as AVs and HVs will increasingly share roads in the future. Notably, to the best of our knowledge, previous research has not explored human drivers' perceptions of AVs and HVs coexisting in a realistic, comprehensive manner, as our proposed framework does. Our findings are in line with the quantitative data, indicating a future where AVs and HVs can coexist more seamlessly, leading to safer and more efficient urban environments.
While our study has made strides in understanding the potential of AVs in urban settings, it also opens avenues for future research. Further exploration into alternative reinforcement learning algorithms could provide deeper insights into the optimization of AV behavior. Additionally, expanding the scope of human-in-the-loop evaluations with a larger and more diverse participant pool would be invaluable in enriching our understanding of public perception and acceptance of AVs. Such studies could also explore other aspects of urban mobility impacted by AV integration, including pedestrian safety and public transportation systems.

FIGURE 1. A) The cockpit installed at the DriSMi Laboratory of the Polytechnic University of Milan. B) An example of the output of VI-WorldSim after the integration with SUMO.

FIGURE 2. A) The selected roundabout in Milan as seen in OpenStreetMap. B) The corresponding design of the roundabout in SUMO. Realistic traffic loads for fine-tuning have also been obtained through field measurements.

TABLE 1. A summary of the physical characteristics of the driving simulator (cockpit).

TABLE 5. A summary of the answers to the first question of the survey.