Sim-on-Wheels: Physical World in the Loop Simulation for Self-Driving

We present Sim-on-Wheels, a safe, realistic, vehicle-in-the-loop framework to test autonomous vehicles' performance in the real world under safety-critical scenarios. Sim-on-Wheels runs on a self-driving vehicle operating in the physical world. It creates virtual traffic participants with risky behaviors and seamlessly inserts the virtual events into images perceived from the physical world in real time. The manipulated images are fed into the autonomy stack, allowing the self-driving vehicle to react to such virtual events. The full pipeline runs on the actual vehicle and interacts with the physical world, but the safety-critical events it sees are virtual. Sim-on-Wheels is safe, interactive, realistic, and easy to use. The experiments demonstrate the potential of Sim-on-Wheels to facilitate the process of testing autonomous driving in challenging real-world scenes with high fidelity and low risk.


I. INTRODUCTION
Evaluating how a self-driving car performs in dangerous scenarios is hard. Pure real-world evaluations create situations that are dangerous to participants, while pure simulation evaluations may simulate various scenarios inaccurately, such as cases in which the vehicle has extreme control inputs. This paper describes a mixed method, Sim-on-Wheels. In Sim-on-Wheels, we run the actual autonomy stack on real cars, but create scenarios by inserting people and objects into the sensor feed in real time. This means we can evaluate the autonomy stack in scenarios known to be dangerous to pedestrians without risking harm, because the pedestrians are simulated. Furthermore, we apply the control inputs to a real vehicle. If the autonomy could cause an uncontrolled skid, we will be able to measure that. Fig. 1 illustrates our evaluation pipeline and Tab. I compares it to previous methods. In contrast to previous approaches, Sim-on-Wheels is simultaneously safe, interactive, realistic, and easy to use.
There is no current consensus on evaluation protocols for autonomous vehicles. Safety evaluation is typically through a combination of real-world road tests, off-policy data collection, and computer simulation. Real-world testing is a resource-intensive and risky process, and testing in some scenarios is unethical (because there is a strong chance of hurting a participant). Annoyingly, these are the cases where evaluation is particularly important. Off-policy data can be an effective tool for training and evaluating perception algorithms, but does not yield a closed-loop evaluation of the safety of the entire autonomy stack. Computer simulation is safe and scalable, but is not currently reliable in extreme physical and mechanical situations. Sim-on-Wheels is a mashup of real-world road tests (so we can observe true vehicle behavior) and computer simulation (so we don't have to risk harm to participants).
An ideal self-driving evaluation environment should be safe, realistic, and closed-loop. Achieving safe evaluation is challenging, because one should be evaluating dangerous scenarios but experiments that pose a risk to life are unethical. Realistic evaluation is essential: we need to be sure that evaluation predictions reflect real-life behavior. Finally, closed-loop evaluation is essential because we must evaluate interactions between the environment and the end-to-end perception-to-action behavior of the controller. Sim-on-Wheels is intrinsically safe and closed-loop. We show that Sim-on-Wheels results are realistic both by evaluating the realism of the inserted objects and by comparing conclusions on real and Sim-on-Wheels scenarios (Section IV-D). We use Sim-on-Wheels to "spoof" a total of 40 variations of safety-critical scenarios using two different autonomous vehicle pipelines (Section IV-C). With the capability of testing scenarios configured at the system limit, our Sim-on-Wheels framework reveals that our modular agent is more cautious than our end-to-end learned agent in terms of obstacle avoidance, achieving a lower collision rate but taking longer to reach the goals.

Fig. 1: Sim-on-Wheels Pipeline. In Sim-on-Wheels' evaluation paradigm, vehicle autonomy is evaluated on images that are perceived in the real world but transmitted to the onboard simulator and manipulated in real time to show important and dangerous traffic scenarios. The autonomy is asked to react to the manipulated sensory input as if the scenario actually happened. Onboard evaluation can be conducted in real time to verify the safety and effectiveness of the autonomy.
assess a full autonomy stack. Another is to use real-world road tests [3], [4], [5]. These are expensive and pose large risks to safety [32], and so are necessarily limited in scope.
Road tests on test tracks [6], [7] are somewhat safer than actual road tests, but are expensive to set up and necessarily provide relatively little environment diversity.
Yet another is to use a simulator, which is safe and convenient. Simulated sensor inputs (as in [8], [9], [10]) face a sim2real gap, despite significant literature on improving the realism of simulation (e.g., data-driven simulation in [11], [12], [14], [15]; dynamic models in [33], [34], [35]; environments in [13], [36], [37], [38]). It is extremely difficult to be sure that a simulator captures all relevant physical modeling. This is particularly important in dangerous scenarios, where one expects extreme control inputs and odd physics may become important. For example, reverted rubber hydroplaning is an effect where very aggressive braking causes tire rubber to break down and capture a surface water film that breaks contact with the road; this and similar effects can make an important contribution to whether a stack is safe, but may not appear in simulators. Though this could be dealt with by adding modeling capacity to simulators, it remains difficult to know what to add and when to stop. In contrast, Sim-on-Wheels uses a real vehicle (and so relies on nature for these effects) but simulates dangerous scenarios (and so does not endanger participants).
Vehicle-in-the-loop simulation. Sim-on-Wheels is, broadly, a vehicle-in-the-loop simulation, because it incorporates the entire vehicle into the test. Early such methods use a simulated driving environment [39], [40], [41], [42], with attendant sim2real problems. MiRE [43] improves realism by using a body tracking system to map a human into the scene to act as a pedestrian, but the environment is far from realistic. AR on LiDAR [44] inserts objects in a perceptually realistic manner into LiDAR point clouds (but not RGB images). WIL [45] is a general framework for integrating simulated sensor inputs and real inputs, but does not attend to rendering realism. In contrast, Sim-on-Wheels provides realistic rendering aimed at specific, safety-critical scenarios.

III. SIMULATION IN THE PHYSICAL WORLD
Sim-on-Wheels operates by inserting actors, objects, and their animations into the camera stream observed by a controller for a physical autonomous vehicle platform (the "ego-vehicle") moving in a real test space. Fig. 1 depicts the entire pipeline of our framework. There are three main components. Authoring: one must first author a driving scenario to be evaluated (which is likely to be safety critical), including defining appropriate real-world waypoints and animation sequences for virtual actors/objects, and determining the planned path for the ego-vehicle to follow (Section III-A). Insertion rendering: a real-time procedure takes raw RGB-D images, composites the simulated events into the image stream, and re-publishes the composite images to the agent. Sufficiently realistic insertion rendering means the ego-vehicle should react in real time to the inserted objects as if they were truly present (Section III-B). Evaluation: metrics such as time to goal, stopping distance, and collision rate are computed onboard to evaluate the effectiveness and safety of different autonomy agents (Section III-C).

Fig. 2: Illustration of Testing Scenarios and 3D Assets (panels: Jaywalker; Jaywalker w/ Occlusion; Traffic light violation; 3D assets). Left: Our scenarios depict common pre-collision events such as road obstacles, jaywalking, jaywalking with occlusions, and traffic light violations. These scenarios are represented as customizable and reproducible spatial-temporal waypoint trajectories for all actors, with triggers, and can be easily expanded upon. Right: We show a subset of our 3D assets, generated from real-world 3D scans using an iPhone equipped with LiDAR.

A. Authoring Safety-Critical Scenarios
We choose safety-critical pre-crash scenarios based on the NHTSA pre-crash event report [46]. Our scenarios encompass common traffic events such as static obstacles, jaywalking, jaywalking with occlusions, and traffic light violations. They are modeled as spatial-temporal waypoint trajectories for all actors, allowing for full reproducibility of each scenario. The testing is conducted to mimic both rural and urban environments, either on a straight road segment or at a four-way intersection. A selection of the scenarios is depicted in Fig. 2.
Authoring involves selecting from a rich collection of 3D assets, including artist-designed assets from SketchFab [47] and in-house assets reconstructed using an iPhone and multi-view reconstruction software [48]. For each scenario, the human actors are animated by Mixamo [49] with realistic and diverse human animations, such as walking and running.
The evaluation procedure involves triggering each scenario as the ego-vehicle reaches the trigger zone within a certain speed range. A crash will occur if the vehicle fails to take any evasive action. The hyper-parameters of each scenario can be adjusted to control the level of difficulty. These include the type and trajectory of static objects for the static obstacle scenarios; the type and trajectory of traffic light runners and the trigger distance for the intersection scenarios; and the walking speed, type, and number of jaywalkers and the trigger distance for the jaywalking scenarios. Our scenario bank can be easily expanded to cover additional safety-critical events. One unique advantage of Sim-on-Wheels is that it enables setting aggressive hyper-parameters without any risk of physical harm to any vehicles or pedestrians, and the results of the evaluations provide a comprehensive understanding of the performance limits of each autonomy stack. Another is that the effect of (say) actor motion or dress on outcomes can be assessed by evaluating the same scenario for different instances of each actor.
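As an illustration of how such a scenario could be represented, the following minimal Python sketch models an actor as a timed waypoint trajectory and a scenario as a set of actors plus a trigger. All names (`Waypoint`, `Actor`, `Scenario`, `trigger_distance_m`, `speed_range_mps`, the asset identifier) are hypothetical and not taken from the Sim-on-Wheels code; linear interpolation between waypoints and a circular trigger zone are assumptions made for the sketch.

```python
from bisect import bisect_right
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Waypoint:
    t: float  # seconds since the scenario was triggered
    x: float  # world-frame position in meters
    y: float


@dataclass
class Actor:
    asset: str                 # hypothetical asset identifier
    waypoints: list[Waypoint]  # strictly increasing in time

    def position(self, t: float) -> tuple[float, float]:
        """Linearly interpolate the actor position at time t (clamped)."""
        wps = self.waypoints
        if t <= wps[0].t:
            return wps[0].x, wps[0].y
        if t >= wps[-1].t:
            return wps[-1].x, wps[-1].y
        i = bisect_right([w.t for w in wps], t)
        a, b = wps[i - 1], wps[i]
        s = (t - a.t) / (b.t - a.t)
        return a.x + s * (b.x - a.x), a.y + s * (b.y - a.y)


@dataclass
class Scenario:
    name: str
    actors: list[Actor]
    trigger_xy: tuple[float, float]       # center of the trigger zone (m)
    trigger_distance_m: float             # assumed circular trigger zone
    speed_range_mps: tuple[float, float]  # ego speed range that arms the trigger

    def is_triggered(self, ego_xy: tuple[float, float], ego_speed: float) -> bool:
        """True when the ego-vehicle is inside the zone at an admissible speed."""
        d = hypot(ego_xy[0] - self.trigger_xy[0], ego_xy[1] - self.trigger_xy[1])
        lo, hi = self.speed_range_mps
        return d <= self.trigger_distance_m and lo <= ego_speed <= hi


# A jaywalker crossing the road 20 m ahead of the trigger point.
jaywalker = Actor("pedestrian_01", [Waypoint(0.0, 20.0, -4.0), Waypoint(4.0, 20.0, 4.0)])
scenario = Scenario("jaywalker", [jaywalker], (0.0, 0.0), 5.0, (1.0, 3.0))
```

Under this representation, difficulty could be varied by editing the waypoint times (walking speed), the number of actors, or `trigger_distance_m`, which mirrors the hyper-parameters listed above.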

B. Insertion Rendering
Insertion rendering involves producing realistic frames of a scene by inserting assets (image fragments; 3D models; etc.) into a target image (variants in [50], [51], [52], [53], [54]). Realistic frames can be produced very fast if difficulties presented by lighting, shadows, and geometrical consistency can be managed. Once the scenario is triggered, we render the simulated scenarios and composite them into images in real time. This requires the insertion rendering to be realistic, efficient, and geometrically consistent. To achieve this, we adopt a real-time OpenGL-based rasterization pipeline [55].
We first place the object accurately at its predefined world coordinates, and the camera pose is acquired in real time through an RTK-INS localization module. The lighting consists of the skybox and the sunlight; parameters are inferred from real-time weather, GPS, and the time of day.
The rendering process is then conducted using a customized physically based rendering (PBR) shader that follows the split-sum shading model. In its shading equation, x is the observed point, ω_o is the outgoing ray, and ω_s is the incoming ray; L(x, ω_o) is the observed radiance; L_a is the ambient sky color and L_s is the directional sunlight; f_r is the Cook-Torrance reflection model [56] under sunlight; and f_s describes specular reflection, which accesses the base color, roughness, and metallic textures of the object. The resulting output, as shown in Fig. 3, exhibits a visually appealing surface appearance.
Shadows cast by the inserted objects contribute to the perceived realism. In our framework, a two-pass shadow mapping procedure is applied [57]. The first pass renders a depth buffer from the lighting source to the visible surface, and the second pass renders per-view depth from the camera perspective. The inconsistent depth between the two identifies shadowed areas, and Poisson sampling is used to reduce aliasing effects. In addition, occlusion reasoning is conducted by comparing the rendered depth and perceived depth from the stereo cameras, ensuring the correct depth ordering between the foreground objects and the background scene. Finally, the rendered objects are composited into the perceived image through alpha-channel blending.
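The occlusion test and alpha blending described above can be sketched per pixel as follows. This is a minimal pure-Python illustration, not the OpenGL implementation used on the vehicle: the function names and the nested-list image layout are hypothetical, shadow mapping is omitted, and invalid stereo depth (holes) is not handled.

```python
def composite_pixel(real_rgb, real_depth, rend_rgb, rend_depth, rend_alpha):
    """Blend one rendered pixel over one camera pixel with a depth test."""
    # The virtual object is hidden wherever the real scene is closer to the
    # camera, or where the renderer produced no coverage (alpha of zero).
    if rend_alpha <= 0.0 or rend_depth >= real_depth:
        return real_rgb
    # Standard "over" alpha blending of the rendered color on the camera color.
    return tuple(
        rend_alpha * r + (1.0 - rend_alpha) * c for r, c in zip(rend_rgb, real_rgb)
    )


def composite_image(real_rgb, real_depth, rend_rgb, rend_depth, rend_alpha):
    """Apply composite_pixel over images stored as nested lists [row][col]."""
    return [
        [
            composite_pixel(
                real_rgb[v][u],
                real_depth[v][u],
                rend_rgb[v][u],
                rend_depth[v][u],
                rend_alpha[v][u],
            )
            for u in range(len(real_rgb[v]))
        ]
        for v in range(len(real_rgb))
    ]


# One row, two pixels: the first is visible, the second is occluded by a wall.
out = composite_image(
    real_rgb=[[(0, 0, 0), (10, 20, 30)]],
    real_depth=[[10.0, 3.0]],
    rend_rgb=[[(255, 255, 255), (255, 0, 0)]],
    rend_depth=[[5.0, 5.0]],
    rend_alpha=[[0.5, 1.0]],
)
```

In the example, the first pixel is a half-transparent blend because the rendered surface (5 m) is in front of the real scene (10 m), while the second pixel keeps the camera color because the real surface (3 m) occludes the virtual object.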

C. Evaluation
Fig. 3: Rendering Quality (panels: Real; Simulation). We evaluate the quality of insertion rendering by comparing reconstructed 3D humans/objects in Sim-on-Wheels with their real-world counterparts under the same pose. The results demonstrate that our real-time insertion rendering can produce realistic, high-fidelity appearances and cast shadows. Tab. IV quantitatively measures the sim2real gap.
The performance of an autonomous driving stack is evaluated in each simulated safety-critical scenario using recorded behaviors. We adopt the evaluation metrics from the CARLA platform [8], which include the collision rate and the trip completion time. The collision rate represents the percentage of scenarios where the ego-vehicle experiences at least one collision, which is determined by checking for overlap between the oriented bounding box of the vehicle and other virtual objects. In addition to ensuring safety, we aim for our autonomous vehicle to be as efficient as possible, which is reflected in the goal-reaching time. During each run, the goal-reaching time is measured until the vehicle is within 5 m of a static obstacle or, in other scenarios, within 1.5 m of the end of its planned path. In any scenario, if the vehicle is not able to reach its goal, we penalize that run by recording its time metric as 100 s. Furthermore, in order to account for real-world uncertainties, we report the mean collision rate and goal-reaching time over multiple runs under different hyper-parameters for one safety-critical scenario.
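The collision check and the aggregated metrics of this section can be sketched as below. This is a hedged illustration: the separating-axis test is one common way to check oriented bounding box overlap in the ground plane and is an assumption here, as are the function names, the box dimensions in the example, and the use of `None` to mark a run that did not reach its goal. Only the 100 s penalty comes from the text.

```python
from math import cos, sin

FAILURE_TIME_S = 100.0  # time recorded when the goal is not reached (from the text)


def obb_corners(cx, cy, length, width, yaw):
    """Corners of a ground-plane oriented box centered at (cx, cy)."""
    c, s = cos(yaw), sin(yaw)
    hl, hw = length / 2.0, width / 2.0
    offsets = ((hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw))
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]


def obb_overlap(a, b):
    """Separating-axis test for two convex quadrilaterals (corner lists)."""
    for poly in (a, b):
        for i in range(4):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % 4]
            ax, ay = y1 - y2, x2 - x1  # normal of the current edge
            pa = [ax * x + ay * y for x, y in a]
            pb = [ax * x + ay * y for x, y in b]
            if max(pa) < min(pb) or max(pb) < min(pa):
                return False  # found an axis that separates the two boxes
    return True


def summarize(runs):
    """runs: list of (collided, goal_time_s or None); returns (rate, mean time)."""
    n = len(runs)
    collision_rate = sum(1 for collided, _ in runs if collided) / n
    mean_time = sum(FAILURE_TIME_S if t is None else t for _, t in runs) / n
    return collision_rate, mean_time


ego = obb_corners(0.0, 0.0, 4.0, 2.0, 0.0)
near = obb_corners(3.0, 0.0, 4.0, 2.0, 0.0)   # overlaps the ego box
far = obb_corners(10.0, 0.0, 4.0, 2.0, 0.5)   # well separated
stats = summarize([(False, 20.0), (True, 30.0), (False, None), (False, 10.0)])
```

With the four example runs, one collision gives a collision rate of 0.25, and the failed run contributes 100 s to a mean goal-reaching time of 40 s.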

IV. EXPERIMENTS
The goal of the experiments in this section is to address the following three crucial questions: (1) Can the Sim-on-Wheels framework, as proposed, be utilized as a rigorous and comprehensive benchmark for evaluating the performance of various autonomous stacks? (2) Can we empirically validate the authenticity of our simulation? (3) To what extent does the onboard simulation result in an increase in latency?
In this section, we first provide an overview of the hardware platform and the test track used for our experiments. We then benchmark the performance of two self-driving agents in various safety-critical scenarios using the Sim-on-Wheels framework and conduct a comprehensive analysis of their performance. To quantify the reality gap between the simulation and the real world, we conduct an empirical analysis of the gaps in sensory data and their impact on the perception and action output of the onboard autonomy. Finally, we provide analysis and discussions of our framework.

A. Real-World Testbed
All of our experiments are carried out on the Polaris GEM e2, a street-legal, two-seater electric vehicle with a top speed of 25 mph. The sensor stack of the vehicle includes a Velodyne-16 lidar, a Novatel RTK GNSS+INS unit, a ZED 2 Stereo Camera, and a Delphi ESR 2.5 Radar. The vehicle also supports drive-by-wire through the PACMod kit, enabling steering, acceleration, and braking through software. Our experiments utilize the AStuff Spectra 2 [58], an industrial-grade edge computing platform equipped with an Nvidia A4000 GPU. This computer is connected to a built-in monitor located on the passenger-side dashboard, making it easy to run and debug software directly in the vehicle.
The experiments were carried out in a shared testing track facility, where the testbed area was secured using cones and traffic tape to restrict public access and guarantee the safety of all individuals involved. At all times, a designated safety driver was present behind the wheel of the vehicle, while another team member was stationed outside the vehicle as a lookout, prepared to respond in the event of an emergency.

B. Evaluated Autonomous Agents
We evaluate two distinct autonomous agents using the Sim-on-Wheels framework: (1) a modular autonomy stack, and (2) an end-to-end imitation learning stack.
1) Modular autonomy: Our modular autonomy pipeline takes as input the RGB-D image stream and a coarse planned path. It is composed of four components: detection, tracking, motion prediction, and rule-based longitudinal planning.
Obstacle detection is split into two modules: static obstacles and dynamic traffic participants. Both rely on pretrained segmentation models without any fine-tuning. Static obstacles are detected by applying a pretrained foreground segmentation model [59]. Dynamic obstacles are detected by applying a pretrained instance segmentation model [60] to extract instance masks for pedestrians and cars. Each instance mask is converted to a 3D position by taking the median x and y coordinates from the associated depth.
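One possible reading of the mask-to-position step is sketched below: masked pixels with valid depth are back-projected with a pinhole camera model and the per-axis median is taken. The intrinsics `fx` and `cx`, the choice of lateral and forward axes, and the convention that zero or missing depth is invalid are all assumptions for illustration; the text only states that median coordinates are taken from the associated depth.

```python
from statistics import median


def mask_to_position(mask, depth, fx, cx):
    """Median (lateral, forward) position in meters of the masked pixels.

    mask:  2D list of booleans, True inside the instance mask.
    depth: 2D list of metric depth in meters; 0 or None marks invalid depth.
    fx, cx: assumed pinhole focal length and principal point (pixels).
    Returns None when the mask contains no pixel with valid depth.
    """
    xs, zs = [], []
    for row_mask, row_depth in zip(mask, depth):
        for u, inside in enumerate(row_mask):
            z = row_depth[u]
            if inside and z and z > 0.0:
                xs.append((u - cx) * z / fx)  # back-projected lateral offset
                zs.append(z)                  # forward distance from the camera
    if not zs:
        return None
    return median(xs), median(zs)


mask = [[False, True, True], [False, True, True]]
depth = [[0.0, 4.0, 4.0], [0.0, 4.0, 6.0]]
position = mask_to_position(mask, depth, fx=2.0, cx=1.0)
```

Using the median rather than the mean makes the estimate less sensitive to mask pixels that leak onto the background, such as the 6 m sample in the example.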
At the tracking stage, greedy matching [61] is performed to associate the latest detected objects with existing tracks in the bird's-eye view. We then estimate the state (velocity and position in the bird's-eye view) through a linear motion model. Using these velocities and the ego-car's current speed and planned trajectory, we predict the positions of each entity at every 0.2 s step up to 10 s into the future and identify potential collisions. At each future time step, we search for collisions within a fixed collision radius (3 m) and travel distance threshold (5 m). This threshold can be increased to compensate for latency. If a collision is found, we output a desired speed of zero. Otherwise, we output 2 m/s.
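The prediction and rule-based planning steps can be sketched as follows, using the constants given in the text (0.2 s steps, 10 s horizon, 3 m collision radius, 5 m travel distance threshold, 2 m/s cruise speed). The function names and the polyline path format are hypothetical. The sketch also assumes one particular interpretation of the travel distance threshold, namely that a predicted collision only triggers a stop if the ego-vehicle would travel at most 5 m along its path before reaching it.

```python
from math import hypot

DT_S = 0.2                  # prediction step
HORIZON_S = 10.0            # prediction horizon
COLLISION_RADIUS_M = 3.0    # fixed collision radius
TRAVEL_THRESHOLD_M = 5.0    # travel distance threshold (interpretation assumed)
CRUISE_SPEED_MPS = 2.0      # desired speed when no collision is predicted


def point_along(path, s):
    """Point at arc length s (meters) along a polyline, clamped to its end."""
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        seg = hypot(x1 - x0, y1 - y0)
        if seg > 0.0 and s <= seg:
            r = s / seg
            return x0 + r * (x1 - x0), y0 + r * (y1 - y0)
        s -= seg
    return path[-1]


def desired_speed(path, ego_speed, tracks):
    """Return 0.0 if a collision is predicted, else the cruise speed.

    path:   planned ego path as a list of (x, y) points, starting at the ego.
    tracks: list of (x, y, vx, vy) states under a linear motion model.
    """
    steps = int(round(HORIZON_S / DT_S))
    for k in range(steps + 1):
        t = k * DT_S
        travelled = ego_speed * t
        if travelled > TRAVEL_THRESHOLD_M:
            break  # collisions farther along the path are ignored for now
        ex, ey = point_along(path, travelled)
        for x, y, vx, vy in tracks:
            if hypot(x + vx * t - ex, y + vy * t - ey) <= COLLISION_RADIUS_M:
                return 0.0
    return CRUISE_SPEED_MPS


path = [(0.0, 0.0), (50.0, 0.0)]
crossing = [(4.0, -3.0, 0.0, 1.5)]   # pedestrian walking into the lane ahead
distant = [(40.0, 0.0, 0.0, 0.0)]    # static object far down the road
```

Here `desired_speed(path, 2.0, crossing)` returns a stop command because the pedestrian comes within 3 m of the predicted ego position about one second ahead, whereas the distant object does not yet affect the output.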

2) Imitation learning (IL) agent: We also train an end-to-end neural controller agent using behavior cloning. The network takes as input the latest eight frames of speed, position, and RGB image information, which are separated by 0.2 seconds. The controller outputs a continuous brake command to be executed 0.2 s in the future, accounting for the latency of the simulation and autonomy pipeline.
The network follows a spatial-temporal recurrent neural network architecture [62], [63]. We use an FCN [64] backbone pre-trained on instance segmentation on the COCO dataset [65] to encode object-level information. The visual features are concatenated with other meta-inputs and fed into a gated recurrent unit (GRU) to incorporate temporal information. The decoder is a multi-layer perceptron outputting a brake value.
To train the network, diverse real-world data is collected from human driving where the driver brakes for static objects and jaywalking pedestrians. The training data contains 195 sequences of static-object braking and approximately 200 sequences of dynamic jaywalking. The network learns to minimize the difference between human driving and its own actions using an L1 regression loss, with the Adam optimizer and a learning rate of 5e-4. Weighted sampling is applied to focus on the relevant samples (e.g., snippets right before brakes), and data augmentation techniques, such as random color jittering and random cropping, are used to increase robustness. We also apply a dropout rate of 0.8 during training.
3) Vehicle controller: The output from both the modular agent and the IL agent is sent to the same vehicle controller to produce the final vehicle command. The longitudinal direction is controlled by a proportional-integral (PI) speed controller [66]. Meanwhile, the lateral direction is controlled by the Stanley controller utilizing a bicycle model [67].
4) Human driver reference: We also benchmark human driver performance within our Sim-on-Wheels system as a reference. The human driver operates the vehicle, navigating it along the planned path and avoiding collisions within simulated scenarios by monitoring real-time augmented video feeds displayed on the GEM vehicle's screen. To prevent the human driver from prematurely reacting to the expectation of an obstacle, we randomized the spawn point of the moving agent, forcing the driver to react on the fly.
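For readers unfamiliar with the two control laws, a minimal sketch is given below. The PI controller accumulates the speed error over time, and the Stanley law adds the heading error to the arctangent of the gain-scaled cross-track error divided by speed. The gains, output limits, sign conventions, and the small softening constant `eps` (which avoids division by zero at standstill) are hypothetical values chosen for illustration, not the parameters used on the vehicle.

```python
from math import atan2


class PISpeedController:
    """Proportional-integral controller on the speed error."""

    def __init__(self, kp, ki, out_min=-1.0, out_max=1.0):
        self.kp = kp
        self.ki = ki
        self.out_min = out_min
        self.out_max = out_max
        self.integral = 0.0

    def step(self, target_speed, speed, dt):
        """Return a clamped longitudinal command for one control period."""
        error = target_speed - speed
        self.integral += error * dt
        command = self.kp * error + self.ki * self.integral
        return max(self.out_min, min(self.out_max, command))


def stanley_steering(heading_error, cross_track_error, speed,
                     k=1.0, eps=0.1, max_steer=0.6):
    """Stanley law: heading error plus arctan(k * e / (v + eps)), clamped."""
    delta = heading_error + atan2(k * cross_track_error, speed + eps)
    return max(-max_steer, min(max_steer, delta))


controller = PISpeedController(kp=0.5, ki=0.1)
first_command = controller.step(target_speed=2.0, speed=0.0, dt=0.1)
steer = stanley_steering(heading_error=0.1, cross_track_error=0.0, speed=2.0)
```

In the example, the large initial speed error saturates the longitudinal command at its upper limit, and with zero cross-track error the steering command reduces to the heading error.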

C. Autonomy Benchmark Results
Tab. II reports the performance metrics of our driving agents using the Sim-on-Wheels evaluation framework. Qualitative examples of representative scenarios are depicted in Fig. 4. Our analysis, based on the evaluation results and log replays, highlights a few key findings.
Our results indicate that the modular agent generally takes longer to reach the destination due to false positive detections of obstacles, leading to intermittent braking. Nevertheless, it is capable of safely reaching the goal in most cases involving static objects and standard jaywalkers. However, the jaywalker-with-occlusion scenario presents a challenge for the modular agent, as there is limited reaction time between the first appearance of the jaywalker and a potential collision. This results in the failure of the agent to brake in time. Furthermore, the challenging textures of the occluding walls cause the perception module to miss detections.
In contrast, our results indicate that the imitation learning agent does not experience intermittent braking and achieves a shorter time to reach the goal. Furthermore, it performs acceptably well in the intersection scenario with vehicles, which was not part of the training data. However, the agent tends to react late, resulting in a higher jaywalker collision rate. This may be due to latency differences between training and onboard deployment. It is important to note that such drawbacks were not frequently observed during offline validation, highlighting the importance of real-world vehicle-in-the-loop testing.
Note that none of the agents, including the human driver (Tab. III), achieves a zero collision rate across all the scenarios, which highlights the difficulty of our designed safety-critical scenarios.

D. Reality Gap Analysis
In Fig. 3, we qualitatively assess our insertion rendering by comparing real vs. simulated results. The image pairs appear quite similar overall, demonstrating the robust realism of the framework. However, some minor differences do exist, such as incomplete 3D reconstructed shapes, slight differences in shaded color, and variations in sunlight intensity, shadow shape, and cloud patterns due to the two images being taken in a windy outdoor environment at different times.
Tab. IV reports quantitative measures using the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [68], and a learned perceptual similarity metric (LPIPS) [69]. The mean absolute error (MAE) of each pixel is calculated, and the percentage of outlier pixels with errors over a threshold of 25.5 on the RGB intensity range [0, 255] is reported. As indicated in Tab. IV, the results show that the virtual object insertions exhibit high fidelity and realism with a low percentage of outlier pixels.
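The pixel-level quantities above (MAE, outlier percentage at a threshold of 25.5, and PSNR) can be computed as in the following sketch. It is illustrative only: the function name and the flat-list image format are hypothetical, and it treats every intensity value as one sample, whereas the actual evaluation may aggregate over color channels differently. SSIM and LPIPS are omitted because they require dedicated implementations.

```python
from math import log10


def image_gap_metrics(real, sim, outlier_threshold=25.5, peak=255.0):
    """Compare two images given as equal-length flat sequences in [0, 255]."""
    errors = [abs(a - b) for a, b in zip(real, sim)]
    n = len(errors)
    mae = sum(errors) / n
    mse = sum(e * e for e in errors) / n
    # PSNR is unbounded for identical images, so report infinity in that case.
    psnr = float("inf") if mse == 0 else 10.0 * log10(peak * peak / mse)
    outlier_pct = 100.0 * sum(1 for e in errors if e > outlier_threshold) / n
    return {"mae": mae, "psnr": psnr, "outlier_pct": outlier_pct}


real_pixels = [0, 100, 200, 255]
sim_pixels = [0, 110, 200, 205]
gap = image_gap_metrics(real_pixels, sim_pixels)
```

The threshold of 25.5 corresponds to 10% of the intensity range, so in the example only the last sample (error of 50) counts as an outlier.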
We also evaluate the impact of the reality gap on the performance of our autonomy pipelines. First, we measure the mean Intersection over Union (mIoU) between the static obstacle segmentation network's outputs on real and simulated images. Although the silhouettes of the obstacles match closely, the mIoU is not perfect, which is likely due to the sensitivity of the network to slight variations in color and the background, such as cloud movement. This limitation is a result of the time required to physically arrange obstacles.
Additionally, we evaluate the action-level reality gap by comparing the agent's behavior in real and simulated scenarios. For this, we set the initial vehicle position, planned path, obstacle type, and obstacle position to be identical in both scenarios, with only the obstacle being different (real object vs. its digital twin). Our experiments show that the modular agent can stop without collision in both real and Sim-on-Wheels experiments. However, we do observe a small discrepancy in its actual trajectory, indicating that the reality gap is not fully closed yet. This may also be due to other factors, such as changes in background illumination as mentioned above.
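As a reference for how such a perception-level comparison can be computed, the sketch below evaluates the IoU between two binary masks and averages it over real/simulated image pairs. The function names and nested-list mask format are hypothetical, averaging over pairs (rather than over classes) is an assumption, and two empty masks are scored as a perfect match by convention.

```python
def mask_iou(a, b):
    """Intersection over union of two equally shaped binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += bool(pa and pb)
            union += bool(pa or pb)
    # Two empty masks agree perfectly (a convention assumed for this sketch).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (real_mask, sim_mask) pairs."""
    return sum(mask_iou(a, b) for a, b in pairs) / len(pairs)


real_mask = [[1, 1, 0], [0, 0, 0]]
sim_mask = [[0, 1, 1], [0, 0, 0]]
score = mean_iou([(real_mask, sim_mask), (real_mask, real_mask)])
```

In the example, the first pair shares one of three foreground pixels (IoU of 1/3) and the second pair is identical (IoU of 1), giving a mean of 2/3.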

E. Generalization to other environments
We can easily adapt to new environments as long as the scenario is compatible with the surroundings. As one example, we generalize to KITTI-360 data [70], as shown in Fig. 6.

F. Runtime Analysis
The runtime performance of the Sim-on-Wheels system was evaluated on an onboard computer equipped with a single Intel Xeon(R) E-2278G CPU @3.40GHz ×16 and a single Nvidia RTX A4000 GPU. The results are presented in Tab. V. The standalone rendering component of the system processes pre-recorded sensor data at a frame rate of 23 FPS. Upon integration with ROS, the frame rate is impacted by the consumption of system resources by hardware drivers, communication modules, the autonomy agent, and the controller. Nonetheless, the system is able to publish rendered images at a minimum frame rate of 10 FPS, even when the autonomy agents run concurrently with Sim-on-Wheels.
Our results show that the end-to-end latency from the camera capture time to the vehicle command increases by 18-21% with the integration of Sim-on-Wheels, which could be mitigated by equipping the vehicle with two onboard GPUs. Additionally, both autonomy agents can predict future states and actions to compensate for such latency. Tab. V also reveals that the raw ZED2 stereo camera streaming, decoding, and post-processing consume a substantial amount of time. Despite using separate callback threads to process incoming ROS messages in parallel, a delay still persists due to high-bitrate camera streaming. We plan to further investigate a faster sensor-computer interface to address this issue.

V. CONCLUSION
We propose Sim-on-Wheels, a vehicle-in-the-loop framework for evaluating the performance of autonomous vehicles in real-world scenarios in a safe and realistic manner. To the best of our knowledge, Sim-on-Wheels is the first framework of its kind to support the integration of simulation and real-world testing practices for the safe and realistic evaluation of autonomous vehicles. Our results demonstrate the versatility and reliability of Sim-on-Wheels as a framework for evaluating various agents. To further support the research and development of autonomous driving, we will open-source Sim-on-Wheels to the community and establish a safe, closed-loop, end-to-end, real-world benchmark.
Fig. 4: Qualitative driving results. We show bird's-eye view layouts of four different scenarios across both agents in ego-vehicle coordinates (meters). The images are captured at the first braking point before approaching an obstacle. We can see that the modular agent tends to be more cautious and takes longer to reach the goal, while the imitation learner drives smoothly but brakes too early/late in certain scenarios. Panel outcomes: Reached Goal: No, Collision: No; Reached Goal: Yes, Collision: Yes; Reached Goal: Yes, Collision: No; Reached Goal: Yes, Collision: No.

Fig. 5: Human Driving Failure Case. The one episode in which the human driver collided was the variation of the Jaywalker with Occlusion scenario that involved multiple fast jaywalkers. The left image is captured at the braking point.

Jaywalker with Occlusion
Agent Type | Collision Rate ↓ | Time To Goal ↓
Human | 0.08 | 31.79

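TABLE II: Sim-on-Wheels Benchmark Results. These metrics are computed as described in Section III-C. All time metrics above are measured in seconds. Findings: The results show that the modular pipeline is better overall at avoiding collisions, but the imitation learning pipeline reaches the goal faster.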

TABLE III: Human Driver Benchmark on Sim-on-Wheels. These metrics are computed as described in Section III-C.
It is worth mentioning that many of the test cases, particularly the jaywalking ones, are impractical to test in the real world due to safety concerns and can only be physically evaluated with Sim-on-Wheels.

TABLE IV: Reality Gap. The reality gap is assessed using three metrics: 1) sensor image fidelity between reality and simulation, 2) mIoU of perception algorithm output, and 3) similarity of final actions (brake values). Results indicate a small reality gap for Sim-on-Wheels, validating its efficacy and reliability as an evaluation framework.

TABLE V: System Runtime Breakdown. IL stands for imitation learning agent, and modular for modular agent.