Addressing Limitations of State-Aware Imitation Learning for Autonomous Driving

Conditional Imitation Learning is a common and effective approach to train autonomous driving agents. However, two issues limit the full potential of this approach: (i) the inertia problem, a special case of causal confusion where the agent mistakenly correlates low speed with no acceleration, and (ii) low correlation between offline and online performance due to the accumulation of small errors that brings the agent into a previously unseen state. Both issues are critical for state-aware models, yet informing the driving agent of its internal state as well as the state of the environment is of crucial importance. In this paper we propose a multi-task learning agent based on a multi-stage vision transformer with state token propagation. We feed the state of the vehicle along with the representation of the environment as a special token of the transformer and propagate it throughout the network. This allows us to tackle the aforementioned issues from different angles: guiding the driving policy with learned stop/go information, performing data augmentation directly on the state of the vehicle and visually explaining the model's decisions. We report a drastic decrease in inertia and a high correlation between offline and online metrics.


I. INTRODUCTION
Autonomous driving is becoming a reality. To make this possible, several problems have to be solved, such as perception [1], planning [2], and forecasting [3]. A recent trend that has obtained remarkable results is to directly train driving agents from raw observations with Imitation Learning (IL) [4], [5], i.e. learning to mimic demonstrations from expert human drivers. In this way, the autonomous driving problem is tackled holistically, without having to rely on different heterogeneous modules.
Imitation learning, however, has some limitations. Since the driving capabilities are learned by behavioral cloning, IL models usually lack explicit causal understanding. Rather than rules, relations between patterns are learned, thus making the agent vulnerable to spurious correlations in the data. This phenomenon is known in the literature as causal confusion [6]. In particular, when training IL agents for automotive applications, there is evidence of a special case of causal confusion referred to as the inertia problem [5], [7], [8]. The inertia problem stems from a spurious correlation between low speed and no acceleration in the training data, making the driving agent likely to get stuck in a stationary state. As a consequence, when a state-aware agent halts (e.g. at a traffic light or in a traffic jam), it may not move again when it should. By state-awareness we refer here to any source of information that can inform the agent about its halted state, such as a state variable, either explicitly modeled or implicitly inferred, that encodes velocity.
A second issue that limits the applicability of IL is the gap between offline and online driving capabilities [9], [10]. Codevilla et al. [10] showed that there is a low correlation between offline evaluation metrics (e.g. frame-wise Mean Squared Error in steer angle prediction) and the success rate in online driving benchmarks. In online driving, the output of the model influences future inputs, violating the i.i.d. assumption made by the learning framework [11]. Accumulation of small errors thus brings the vehicle into new states, never observed at training time [12]. Similarly to the inertia problem, this issue manifests itself the most in state-aware models: the more variables are observed by the model, such as ego-velocity or previous driving commands, the sparser the coverage of the training data gets, making it more likely to end up in underrepresented configurations at driving time.
To summarize, IL agents suffer from ill-distributed training data that presents spurious correlations and domain shift compared to the test set. These issues make it particularly hard to train state-aware agents: using multiple input sources increases the chances of discovering unwanted correlations in the data or of observing under-represented inputs at inference time, for which the agent does not know how to act confidently [11]-[13]. In this paper, we address these difficulties in training state-aware IL models.
Some attempts to identify and solve these issues have been made in the literature. The inertia problem has been addressed by regularizing training through vehicle speed prediction [5], whereas [10] demonstrated the usefulness of two data augmentation approaches to improve the correlation between offline and online driving capabilities: first, augmenting the training set with lateral cameras, thus simulating a vehicle with an unusual trajectory, and second, perturbing the driving policy to record samples where the vehicle recovers from anomalous states. In this paper, we build on these ideas without collecting additional data. We propose an IL agent that propagates the state of the vehicle through the model and uses it as the core of a multi-task architecture. On the one hand, this allows us to explicitly train the model to avoid issues such as the inertia problem. On the other hand, this allows us to perform data augmentation on all the observed data, reducing the distribution shift between training samples and what the agent may see at driving time.
Our IL agent is designed as a hierarchical transformer model with state token propagation. The vehicle's state is encoded in a special token of a vision transformer [14] and is enriched with new information at each stage of the architecture. At first, we predict whether the vehicle must stop or go, directly tackling inertia. This information is passed to the next stage, which predicts the driving commands (namely steer, throttle, and brake). Finally, the model leverages a differentiable Command Coherency Module (CCM), encouraging the model to correctly bring the vehicle to the desired future state by generating non-conflicting controls. This module is used only at training time and acts as a regularizer. Since our architecture is based on a transformer encoder [15], it heavily relies on attention. We leverage such attention to gain insights about what the model is focusing on to make its decisions (e.g., the vehicle's state or visual patterns), following the recent trend of designing explainable driving models [16]-[18].
Interestingly, the ability to explain the model's decisions provides us with a better understanding of the inertia problem. Inertia makes an IL model halt and stay still whenever the speed of the vehicle is close to zero. However, it is hard to discriminate this phenomenon from other kinds of failures that make the vehicle stop indefinitely. For instance, if part of the environment is mistakenly interpreted as a crossing pedestrian or a red traffic light, the vehicle will wait indefinitely for the state of the surrounding environment to change. Whenever this happens, a different solution must be sought, one that strengthens the visual backbone of the model rather than its causal inference capabilities. By combining the model's attention with a retrieval-based explainability method, we are able to highlight these differences and isolate instances of inertia from backbone failures.
The main contributions of our paper are the following:
• We propose a state-aware conditional imitation learning model for autonomous driving. The model is multi-stage and exploits state propagation through different transformer layers, breaking down the generation of driving commands into coarse-to-fine tasks.
• We specifically address issues in state-aware imitation learning such as the inertia problem and the offline/online performance gap. Inertia is drastically limited by state token propagation and multi-stage learning, whereas the correlation between online success rate and offline metrics is enforced via data augmentation on the vehicle's state.
• We propose a combination of the transformer's self-attention with an ex-post semantic explainability method that we use for inspecting model failures. This points out interesting "hallucinations" of the visual backbone that cause behaviors mistakenly confused with inertia.
II. RELATED WORKS
Imitation Learning (IL) is based on the idea that, to learn a complex task, a model can observe the demonstrations of an expert performing it [19], [20]. This paradigm has been successfully applied to autonomous driving. One of the first approaches based on IL predicted steering commands for lane following and obstacle-avoiding tasks [21]. The task soon evolved into the so-called Conditional Imitation Learning (CIL), in which predictions are conditioned on high-level commands such as turn or go straight. Several works followed this approach [4], [17], [22]-[24], also combining it with reinforcement learning [25]-[27].
To obtain better driving capabilities, several sensors and additional synthetic data are often used [4], [28]-[31]. Prior work makes extensive use of environmental information in the form of semantic segmentations [26], [32]-[34], top-view maps [30], or both [27], [35]-[38]. Similarly, other methods leverage depth information [23], [38], LiDAR data [24], [32], [38] or cues such as traffic light states [26], [33], lane position [26], and intersection presence [26], [33]. Such data has also been used in the form of affordances, low-dimensional representations of environmental attributes [22], [26], [39]. Differently from all the aforementioned methods, we rely on a purely RGB-based approach. Whereas these methods have access to environmental data, either as inputs or as additional sources of supervision, we assume access only to the RGB stream and the state of the vehicle (i.e., current speed, steer, acceleration, and brake), which is a direct consequence of the driving policy. A similar assumption is made in recent works such as [5], [40], [41].
Training a state-aware imitation learning agent poses some challenges [13]. Despite its simplicity and effectiveness, imitation learning breaks the i.i.d. assumption made by any statistical supervised learning framework, since current decisions influence future inputs [11], [12]. The main difficulty that needs to be addressed is keeping the model in a state close to what has been observed at training time [13]. When this does not happen, online errors tend to accumulate over time, generating less accurate behaviors [11], [42]. As a result, online capabilities do not correlate with offline error metrics measured on a validation set [10], which makes the agent difficult to train. A solution to bridge this gap is to perform data augmentation. Codevilla et al. [10] showed that collecting data from three different cameras while adding noise to the driving policy helps in recovering from unexpected scenarios. This, however, requires collecting hours of additional data. Image-level data augmentations such as changes in contrast, brightness and tone also have beneficial effects, especially for generalizing to similar scenarios with different conditions (e.g. weather) [5], [10]. Nonetheless, augmenting the pixel space has a limited effect on state-aware models, where predicted quantities are provided as input. Differently from prior work, we perform augmentation on the vehicle's state, injecting it into the model as a special token of a transformer [15]. Augmenting the state leads to a better coverage of the state space during training.
The presence of the state token allows us to address another well-known issue with imitation learning in automotive applications: the inertia problem [5], [7]. This has been addressed in the literature by predicting the current speed of the vehicle [5] or via causal imitative learning [8], also based on speed prediction. A memory-based approach for retrieving previously observed scenarios has also been exploited recently [43]. The common speed-prediction solution proposed in [5] suffers from a high collision rate, likely due to overcompensation of inertia. Instead of making the network predict its current velocity, we leverage a multi-stage architecture, where a stop/go loss based on the actual causes for stopping (presence of pedestrians, traffic lights, other vehicles) conditions the command generation. In this way, we inform the model about external elements that should be taken into account while driving. We find that this solution almost entirely eradicates inertia.

III. OVERVIEW
Imitation Learning (IL) trains an agent by observing a set of expert demonstrations to learn a policy [20]. In the simplest scenario, IL is a direct mapping from observations to actions [19]. In automotive applications, the expert is a driver, the policy is "safe driving" and the demonstrations are a set of (frame, driving controls) pairs. In this paper, we address Conditional Imitation Learning (CIL), a variant of imitation learning where the policy must reflect a given high-level command, such as turn right or follow lane. As in prior work (e.g. [4], [17], [22]), we divide our architecture into multiple branches, with separate heads learning command-specific policies. However, differently from prior work, we structure our model as a hierarchy of stages, each of which is dedicated to addressing different aspects of driving, as depicted in Fig. 1.
The proposed model is state-aware, in the sense that it takes as input the speed and the steer, acceleration and brake values predicted at the previous timestep. In principle, informing the model of the current state of the vehicle could ensure temporal smoothness and coherency in the driving policy (i.e., the predicted driving controls). In practice, this makes the model vulnerable to spurious correlations in the data, bringing out the inertia problem. To address this issue, we propose a multi-stage transformer model with state token propagation. We feed the vehicle state to the model as a special token of a vision transformer (ViT) [14]. Operationally, the state token fulfills the same role as the CLS token in standard ViTs. However, by encoding vehicle measurements in it, we can inject information into the model and let it correlate with relevant spatial features via self-attention. After each stage, the state token is enriched with spatial information and is decoded into coarse-to-fine driving commands, depending on the stage. The coarsest of such commands is a decision on whether the vehicle should stop or go, thus explicitly addressing inertia. Injecting the state token into the model has the additional benefit of enabling data augmentation on the state values themselves, addressing what is arguably the biggest limitation of imitation learning, i.e., the inability to perform well in previously unseen states [9], which is also responsible for the accuracy gap between offline and online driving. We also introduce a regularizer that ensures coherency in the generated driving commands. Unlike similar solutions adopted in prior work, where speed is predicted to reduce inertia [5], we use this regularizer to reduce the gap between online and offline evaluation.

A. State Token Propagation
Our model exploits a multi-stage transformer encoder architecture. The hierarchy of layers reflects a coarse-to-fine learning where each stage generates a different output. The rationale is that the i-th stage can inform stage i + 1 by taking the output of the encoder corresponding to the state token and propagating it as the new state token. To enrich the token with increasingly complex semantics, at each stage we decode it into a different output with a Feed Forward Network (FFN), specific for separate tasks.
We define our multi-task hierarchy as follows. The first stage predicts whether the agent should halt the car or keep it going. This stage is specifically designed to address the inertia problem. It does not produce any driving control and is expected to focus on traffic lights and other agents. The second stage generates the actual driving commands: throttle, brake, and steer. This second stage of the model should instead learn and understand road topology and ego-motion patterns. Thanks to the propagated state token, the generation of the driving commands is conditioned on the stop/go decision of the previous stage. The third and final stage is the command coherency module, which acts as a regularizer and is therefore used only at training time. The initial state token is the embedding of steer, throttle, brake and speed at time t − 1.
To cope with the non-uniform distribution of vehicle states in the training set (see Sec. VII-A), we introduce a data augmentation strategy based on noise injection to perturb the state token. We inject zero-mean Gaussian noise with σ = 0.1 for driving controls, since they are all in [0, 1]. For the speed, which takes values in [0, 10], we use σ = 1 instead.
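The noise injection described above can be sketched in a few lines. This is a minimal illustration rather than the training code: the function name `augment_state`, the argument order, and the choice of perturbing the raw scalars before they are embedded are assumptions. The text does not say whether perturbed values are clipped back to their valid range, so no clipping is applied here.

```python
import random

# Standard deviations taken from the text; the names are hypothetical.
CONTROL_SIGMA = 0.1  # steer, throttle and brake lie in [0, 1]
SPEED_SIGMA = 1.0    # speed lies in [0, 10]


def augment_state(steer, throttle, brake, speed, rng=random):
    """Return a noisy copy of the vehicle state (training time only)."""
    noisy_controls = [
        value + rng.gauss(0.0, CONTROL_SIGMA)
        for value in (steer, throttle, brake)
    ]
    noisy_speed = speed + rng.gauss(0.0, SPEED_SIGMA)
    return (*noisy_controls, noisy_speed)


# Example: perturb the state observed at time t - 1.
rng = random.Random(0)
noisy_state = augment_state(0.5, 0.7, 0.0, 4.2, rng)
```

At inference time the state would be fed to the model unperturbed, since the augmentation only serves to widen the coverage of the state space seen during training.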

B. Pixel-State Attention
Every stage of the model performs token-to-token attention, thanks to the transformer's self-attention. The advantages are twofold: on the one hand, prior work has shown that explicitly modeling attention improves driving capabilities [17], [40]; on the other hand, it provides a built-in interpretability mechanism that can be used to visually explain decisions.
In our model, the attention involves not only visual patches as in [17], [40] but also the state of the vehicle. First, the output of the convolutional backbone, i.e. a feature map f of size H × W × C, is flattened into N = H × W tokens, corresponding to 1 × 1 spatial patches in the feature map. Patches are then linearly projected into a D-dimensional space to adapt them to the input size of the transformer. The four scalar quantities that compose the state of the vehicle (speed, steer, acceleration and brake) are lifted to a dimension of D/4 and concatenated into the D-dimensional state token, which we refer to as x_state. As in [14], a learnable positional embedding E_pos is added to all the N + 1 tokens. To summarize, the set of N + 1 tokens fed to the encoder is composed as follows:
$$ z_0 = [\, x_{state};\; f_1 E;\; f_2 E;\; \dots;\; f_N E \,] + E_{pos} $$
where f_p is the p-th spatial patch of the feature map, E is the linear projection of the patches into the D-dimensional space, and E_pos contains one D-dimensional positional embedding for each of the N + 1 tokens.
The self-attention carried out in every layer of the transformer is thus a pixel-state attention, where every pixel of the feature map can attend to every other pixel as well as to the state token. This allows us to inspect at each stage which information is privileged by the model: when the state token carries relevant information from the previous stage (e.g., if the vehicle must stop), the model will give it high importance; vice versa, if the image carries meaningful cues (e.g., an intersection), the model will focus on the relevant pixels.
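As an illustration of how the N + 1 input tokens might be assembled, the sketch below lifts each of the four state scalars to D/4 dimensions, concatenates them into the state token, and adds a positional embedding to every token. It is only a sketch under stated assumptions: the lifting is modeled as a per-scalar affine map, the state token is placed first in the sequence, and all parameters are random stand-ins for learned weights. The function names are hypothetical.

```python
import random

D = 64  # transformer input size
N = 72  # number of visual tokens


def lift(value, weights, biases):
    """Lift one scalar to a vector (assumed here to be an affine map)."""
    return [w * value + b for w, b in zip(weights, biases)]


def build_state_token(state, lift_params):
    """Concatenate the four lifted scalars into one D-dimensional token."""
    token = []
    for value, (weights, biases) in zip(state, lift_params):
        token.extend(lift(value, weights, biases))
    return token


def build_tokens(state_token, visual_tokens, pos_embedding):
    """Prepend the state token and add the positional embedding."""
    tokens = [state_token] + visual_tokens
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, pos_embedding)
    ]


rng = random.Random(0)


def rand_vec(n):
    # Random stand-in for learned parameters or projected patch features.
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


lift_params = [(rand_vec(D // 4), rand_vec(D // 4)) for _ in range(4)]
visual_tokens = [rand_vec(D) for _ in range(N)]      # already projected to D
pos_embedding = [rand_vec(D) for _ in range(N + 1)]

# State order follows the text: speed, steer, acceleration, brake.
x_state = build_state_token((3.0, 0.1, 0.6, 0.0), lift_params)
tokens = build_tokens(x_state, visual_tokens, pos_embedding)
```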

C. Command Coherency Module
A possible cause for the low correlation between offline error and online driving performance [10] can be found in throttle, brake and steer being predicted independently. What is missing is the optimization of a common goal that brings the vehicle from one initial state to a desired one, considering all three quantities. Furthermore, individual biases may interfere with the quality of the overall policy.
To generate the appropriate driving behavior, the predicted commands must be compatible with each other. To this end, we introduce the Command Coherency Module (CCM). The CCM takes as input steer, throttle, brake and speed at time t and predicts the future speed at time t + 1. We first train the command coherency module on training measurements to learn how such quantities affect the speed of the vehicle. Once the module is trained, we freeze it and use it as a regularizer while training the driving agent. To implement the CCM, we use a lightweight multi-layer perceptron with three layers and ELU activations.
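A forward pass of such a module could look like the sketch below. The text only specifies a three-layer perceptron with ELU activations, so the hidden width of 16, the random (untrained) weights, the absence of an activation on the output layer, and the function names are all assumptions made for illustration.

```python
import math
import random


def elu(x):
    """ELU activation with alpha = 1."""
    return x if x > 0 else math.exp(x) - 1.0


def make_layer(n_in, n_out, rng):
    """Random stand-in for a trained linear layer: (weights, biases)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def ccm_forward(layers, steer, throttle, brake, speed):
    """Predict the speed at time t + 1 from the commands and speed at time t."""
    h = [steer, throttle, brake, speed]
    for layer in layers[:-1]:
        h = [elu(v) for v in linear(layer, h)]
    return linear(layers[-1], h)[0]  # assumed: no activation on the output


rng = random.Random(0)
layers = [make_layer(4, 16, rng), make_layer(16, 16, rng), make_layer(16, 1, rng)]

predicted_speed = ccm_forward(layers, steer=0.5, throttle=0.8, brake=0.0, speed=3.0)
target_speed = 3.4  # hypothetical ground-truth speed at t + 1
ccm_loss = abs(predicted_speed - target_speed)  # L1 penalty
```

When the module is used as a frozen regularizer, the commands fed to it would be the ones predicted by the driving agent, so that the L1 penalty propagates back to the command predictions rather than to the module's own weights.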
Our CCM shares some traits with the speed prediction module of [5]. In [5], the authors feed a frame-based estimate of the speed to the model. Instead of feeding the predicted speed as an additional input, we optimize it to regularize the outputs and reconcile the driving commands.

D. Architecture and Training
The proposed model is composed of a shared convolutional backbone plus four parallel branches, one for each high-level command. The shared backbone consists of 5 convolutional layers with ELU activations. The first three layers have respectively 24, 36, and 48 kernels of size 5 × 5 with stride 2, followed by two other layers with 64 filters of size 3 × 3 with stride 1. Input images are resized to 200 × 88 px, yielding a 4 × 18 × 64 feature map. After flattening we obtain N = 72 visual tokens. Each branch is a multi-stage transformer encoder with input size D = 64. We use 3 heads with a depth of 4 for each encoder stage.
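The stated feature map size can be checked with the standard output-size formula for convolutions. The short computation below assumes that no padding is used, which the text does not state explicitly; under that assumption the numbers match the reported 4 × 18 × 64 map and N = 72 tokens. The helper names are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output size of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


# (output channels, kernel size, stride) for the five backbone layers.
LAYERS = [(24, 5, 2), (36, 5, 2), (48, 5, 2), (64, 3, 1), (64, 3, 1)]


def feature_map_shape(width, height):
    """Propagate the input resolution through the backbone."""
    channels = 3  # RGB input
    for channels, kernel, stride in LAYERS:
        width = conv_out(width, kernel, stride)
        height = conv_out(height, kernel, stride)
    return height, width, channels


h, w, c = feature_map_shape(width=200, height=88)
n_tokens = h * w
print(h, w, c, n_tokens)  # prints: 4 18 64 72
```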
The first stage of the transformer takes the state token x_state along with the N visual tokens. The stage outputs N + 1 transformed tokens, among which the enriched state token is used to predict whether the vehicle should stop or go. To optimize the stop/go prediction we use an L1 loss, denoted as L_SG, computed over three stop signals, respectively for traffic light stop, pedestrian stop, and stop due to other vehicles. The second stage is in charge of generating driving commands. Similarly to the first stage, the propagated state token taken from the output of the stage is fed to a feed-forward regressor to predict steer, throttle and brake. We use an L1 loss for driving command prediction:
$$ L_c = \alpha \, |s - \hat{s}| + \beta \, |t - \hat{t}| + \gamma \, |b - \hat{b}| $$
where \hat{s} ∈ [0, 1], \hat{t} ∈ [0, 1] and \hat{b} ∈ [0, 1] are respectively the predicted steer, throttle and brake values, s, t and b the corresponding ground truth values and α, β, and γ are weights with values 0.5, 0.45, and 0.05, as in [23]. For the Command Coherency Module, we also use an L1 loss. The CCM loss L_CCM and the stop/go loss L_SG contribute to the total loss according to:
$$ L_{Total} = \lambda L_c + \kappa L_{CCM} + \tau L_{SG} $$
where λ = 0.8, κ = 0.1 and τ = 0.1. We train our model end to end with the Adam optimizer for 100 epochs with a batch size of 64 and a learning rate of 0.0001.
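To make the weighting of the loss terms concrete, the following sketch evaluates the command loss and the total loss for a single sample, using the weights given above. The function names and the sample values are hypothetical, and the CCM and stop/go losses are passed in as precomputed numbers.

```python
# Weights reported in the text.
ALPHA, BETA, GAMMA = 0.5, 0.45, 0.05  # steer, throttle, brake
LAMBDA, KAPPA, TAU = 0.8, 0.1, 0.1    # command, CCM, stop/go


def command_loss(predicted, target):
    """Weighted L1 loss over (steer, throttle, brake)."""
    s_hat, t_hat, b_hat = predicted
    s, t, b = target
    return ALPHA * abs(s - s_hat) + BETA * abs(t - t_hat) + GAMMA * abs(b - b_hat)


def total_loss(l_c, l_ccm, l_sg):
    """Combine the three loss terms into the training objective."""
    return LAMBDA * l_c + KAPPA * l_ccm + TAU * l_sg


# Made-up sample: the prediction is off by 0.1 in steer and 0.2 in throttle.
l_c = command_loss(predicted=(0.6, 0.5, 0.0), target=(0.5, 0.7, 0.0))
loss = total_loss(l_c, l_ccm=0.2, l_sg=0.1)
```

With these weights, steer and throttle errors dominate the command loss, while brake contributes only marginally.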

V. MODEL EXPLAINABILITY
The self-attention of the transformer stages in our model allows us to inspect the behavior of the model, thus providing explanations for the predictions. We refer to this as built-in explainability. Since we have dedicated each stage of the model to different tasks, we can leverage such information to gain insights about what is important for different aspects of the learned policy. We combine the built-in explainability with ex-post explainability, i.e. an approach specifically designed to provide an additional interpretation of the model's behavior at inference time.

A. Built-in Explainability
In both stages of the transformer model, we can obtain visual explanations in the form of attention maps. The maps are obtained by considering the attention between the state token and the image patches. The first stage provides information on what the model looks at for stop/go prediction, whereas the second identifies relevant image regions for correct navigation.
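A minimal sketch of how such a map could be formed from the attention weights between the state token and the visual tokens is given below. It assumes that the 72 visual tokens were flattened in row-major order from a 4 × 18 feature map and that the weights are rescaled by their maximum for visualization. Both choices, as well as the function name, are assumptions, and upsampling the grid to the image resolution is left out.

```python
def attention_to_grid(state_to_patch_attention, height=4, width=18):
    """Arrange per-patch attention weights into a 2D grid scaled to [0, 1]."""
    if len(state_to_patch_attention) != height * width:
        raise ValueError("expected one attention weight per visual token")
    peak = max(state_to_patch_attention)
    scale = peak if peak > 0 else 1.0
    normalized = [a / scale for a in state_to_patch_attention]
    # Row-major layout: token index = row * width + column (assumed).
    return [normalized[r * width:(r + 1) * width] for r in range(height)]


# Made-up attention weights, one per visual token.
weights = [float(i) for i in range(72)]
grid = attention_to_grid(weights)
```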

B. Ex-Post Semantic Explainability
Built-in explainability only explains which regions are taken into account. However, it does not provide information about how these regions are interpreted by the model. We propose an ex-post semantic explainability method that combines visual attention with k-NN search of image features.
We gather offline a set of m feature maps from the training set and collect the D-dimensional descriptors of each spatial location. In this way, we obtain a total of M = m × N feature vectors, N being the number of image spatial patches. We denote the i-th feature in the set as y_i. At inference time, we extract the feature map f of the input image and, for any spatial location of interest (e.g., the most attended ones by built-in attention), we perform a k-NN search with FAISS [44] using the L2 distance:
$$ \mathrm{kNN}(f_p) = \underset{i \in \{1, \dots, M\}}{\operatorname{arg\,min}^{k}} \; \lVert f_p - y_i \rVert_2 $$
where f_p is the p-th feature vector of the input image (p ∈ {1, ..., N}) and the operator returns the indices of the k features closest to f_p.
For each k-NN we reproject the feature back onto the original image and take the semantic segmentation of the corresponding region¹. This allows us to inspect what the model is hallucinating by finding the dominant semantic category in the neighbors and allows us to interpret failures.
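The retrieval step and the vote over semantic categories can be illustrated as follows. This sketch replaces FAISS with an exhaustive search in pure Python to stay self-contained, uses a tiny made-up feature bank with two-dimensional features, and assumes that each stored feature already carries the semantic label of its image region. The function names and labels are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, bank, k):
    """Indices of the k bank features closest to the query (exhaustive search)."""
    order = sorted(range(len(bank)), key=lambda i: l2(query, bank[i]))
    return order[:k]


def dominant_label(indices, labels):
    """Most frequent semantic category among the retrieved neighbors."""
    counts = {}
    for i in indices:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    return max(counts, key=counts.get)


# Made-up bank of features y_i with the semantic class of their region.
bank = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
labels = ["red_light", "red_light", "pedestrian", "vehicle"]

f_p = [0.1, 0.0]  # feature of the most attended location
neighbors = knn(f_p, bank, k=3)
category = dominant_label(neighbors, labels)
```

In this toy example the query retrieves two red-light features and one pedestrian feature, so the dominant category is the red light. This mirrors the kind of evidence used to tell a backbone hallucination apart from inertia.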

A. Dataset
For training and evaluating our model, we use the Corl2017 [45] and NoCrash [5] datasets, both based on the CARLA simulator [45]. The Corl2017 dataset has expert demonstrations driving across the same town with a set of different weather conditions. Testing is performed by driving in different conditions: same town and weather as training; same town and new weather; new town; new town and new weather. Testing also includes 4 tasks: go straight, one turn, navigation, navigation dynamic. The navigation tasks require driving between two distant waypoints and the dynamic scenario includes other vehicles and pedestrians. NoCrash has been designed to evaluate advanced driving skills such as stopping at traffic lights, avoiding collisions and driving in dense traffic environments. The evaluation involves 25 episodes on three navigation tasks, spanning from an empty town scenario to a dense traffic one. Corl2017 has 657,601 frames, whereas NoCrash has 1,279,738 frames, divided into frontal and two lateral cameras (−30°, +30°). For both datasets, the agent must comply with a given high-level command among go straight, turn right, turn left and follow lane. As in [5] we train on a subsample of 10% of the data, comprising 10 hours out of a total of 100 hours of driving. Both datasets provide metadata including the current state of the autonomous vehicle and environmental information such as driving commands, high-level commands and position.
¹ Ground truth segmentations are available in the NoCrash dataset [5].

B. Results
We report in Tab. I the results on the Corl2017 dataset. For a fair analysis, we compare our method directly against other RGB-based methods. We also report methods that leverage additional sources of supervision such as depth and semantic segmentation or additional data to train the model. The results show that our method obtains better or on-par results when compared to other RGB-based models. Per-task success rates are in the supplementary material.
Compared to Corl2017, where traffic light violations and collisions are not considered, the NoCrash benchmark is considerably more challenging since environmental cues must be taken into account. We report results in Tab. II. Our approach outperforms RGB methods, with the only exception of CILRS [5], which performs slightly better in some empty scenarios. In the more challenging scenarios with regular and dense traffic, our approach performs better than the competitors, highlighting the capacity of the model to interpret patterns relative to traffic lights and other agents.
In Tab. III we show the percentage of traffic light violations committed by our model. These results are computed on the Empty task both for Training Conditions and for New Weather & New Town. As a baseline, we also report the results for a Single Stage model, i.e. a simplified version of our approach without the first stage. This model is state-aware like the full model, but does not exploit the stop/go loss, which we designed to prevent inertia. Interestingly, our model outperforms the single-stage baseline by a large margin, showing the usefulness of the stop/go loss to correctly focus on traffic lights. At the same time, we significantly lower the traffic light violations compared to CILRS, even though CILRS obtained a higher success rate in the empty tasks for New Weather & New Town (see Tab. II). We attribute this difference to two factors: (i) the strong ResNet vision backbone of CILRS yields better generalization across weather conditions; (ii) our model has a higher capacity to focus on traffic lights thanks to attention and the stop/go loss.
Attention plays an important role in identifying relevant cues. Since we employ transformer encoders in every stage of the model, we can visually inspect self-attention for every stage. We create heatmaps by reprojecting on the image the attention value relative to the state token against every visual token (Fig. 2). The heatmaps for the two stages reflect the tasks that are addressed at the corresponding levels: stop/go decision and driving command generation. The first stage focuses on small scene details such as traffic lights or pedestrians (additional qualitative examples for the first stage of our model are shown in Fig. 9), while the second stage attention is scattered and attends to regions that are important for correct navigation such as intersections and roadsides.

VII. STATE TOKEN AND INERTIA PROBLEM
We inspect the relative importance of the state token and the image content. The state token emitted by the first stage is used to predict a stop/go decision with a dedicated loss. This makes the token carry useful information to the second stage, which is in charge of generating the actual driving commands. In the presence of a halt cue (e.g. a red traffic light) encoded in the state token propagated to the second stage, the attention scheme of the second stage focuses on the state token rather than on the image patches. When the state token indicates that the vehicle can advance, the attention focuses instead on the image patches to generate appropriate driving commands. Fig. 3 shows examples of stage 2 attention, with values of the state token and of the whole image, accumulated for each visual token.
The stop/go loss has a great impact on driving performance. In Tab. IV we show the effect of removing such loss on the NoCrash benchmark. In densely trafficked environments, the success rate is almost halved when removing the loss. Similar results are obtained with the single-stage baseline. We also test a model trained using a random vector as state token (w/o ST), yet keeping the stop/go loss: the success rate drops heavily, especially in crowded environments. By feeding the state of the vehicle, the agent becomes aware of its speed and momentum, e.g. indicating whether and how a turn is taking place. This is hard to deduce from a single image. Furthermore, the use of the state token and the stop/go loss has a direct effect on addressing the inertia problem, since the first stage is explicitly trained to predict movement. Tab. V shows the failure rate due to inertia. As in [5] we identify the inertia problem when an agent is still for 8 seconds before time out. Most failures of the single-stage baseline can be traced back to inertia and these are almost completely eliminated with the multi-stage model. Surprisingly, the NewWeather-Empty configuration in the NoCrash benchmark exhibits the highest failure rate, attributed to inertia (as indicated in Tab. V). In this context, when the vehicle comes to a halt, it becomes trapped in a stationary state due to inertia. Notably, in an empty scenario, the sole discernible visual cue is the traffic light.
A more general analysis of the causes of failure is also provided in Tab. VI. The multi-stage model considerably reduces collisions with pedestrians and vehicles, compared to the single-stage baseline. Interestingly, failures due to time out (which include inertia) are almost eliminated. Tab. V and Tab. VI indicate that, despite addressing the inertia problem very effectively, the model still suffers from a few inertia failures. We exploit the Ex-post Semantic Explainability approach presented in Sec. V to inspect 50 episodes of the NoCrash benchmark where the inertia problem still occurs at traffic lights. In 56% of the cases where the vehicle is stuck at a green light, the k most similar features to the attended one contain a red traffic light, in 18% a pedestrian crossing, and in 3% a vehicle (Tab. VII).
In Fig. 4 we show the top 10 nearest samples of the image region with the highest attention value (first transformer stage). The first two rows show failure cases: the model correctly focuses on the traffic light but, although it is green, the model maps it to a region of the latent space densely populated by red traffic lights. We also show a sample of correct driving, where the vehicle accelerates as soon as the light turns green: retrieved images all depict green lights. This suggests that what may appear as inertia might instead be a failure of the backbone, which mistakenly "hallucinates" halt cues.

A. Online/Offline Evaluation and Noise Injection
To address the offline/online shift, exhaustive coverage at training time of the possible input configurations (observed environment + internal state) could be a solution, yet it is difficult to achieve. For instance, the NoCrash dataset is unbalanced, and throttle and steer values are extremely biased (Fig. 5). This limits the possibility of effectively inputting the vehicle state into the model at driving time. Our data augmentation strategy, which injects noise on the state token (Sec. IV-A), is intended to address this limitation. We add zero-mean Gaussian noise with σ = 0.1 to the driving controls (which are in [0, 1]) and with σ = 1 to the speed. This lets the model see different combinations of state values at training time. In Fig. 6 we show the joint distribution of steer and throttle values with and without noise injection. Two modes can be observed for throttle, corresponding to the over-represented stationary and full-throttle scenarios. At the same time, steer has a Gaussian distribution centered at zero (indicating no steer). With noise injection, we get a more uniform distribution in the low-steer interval [-0.25, 0.25] across all throttle values. Higher steer values also obtain a more uniform coverage.
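A minimal sketch of this augmentation, using the σ values stated above, is given below. The dictionary layout of the state (including the `brake` entry) and the function name are assumptions made for illustration. No clipping is applied, since none is described here.

```python
import random


def inject_state_noise(state, rng, sigma_controls=0.1, sigma_speed=1.0):
    """Return a copy of `state` with zero-mean Gaussian noise added to each entry.

    state: dict of scalar state values; the key 'speed' uses sigma_speed,
           every other key is treated as a driving control (hypothetical layout).
    rng: a random.Random instance, so that augmentation is reproducible.
    """
    noisy = {}
    for key, value in state.items():
        sigma = sigma_speed if key == "speed" else sigma_controls
        noisy[key] = value + rng.gauss(0.0, sigma)
    return noisy


rng = random.Random(0)
state = {"steer": 0.0, "throttle": 0.5, "brake": 0.0, "speed": 3.0}
print(inject_state_noise(state, rng))
```

Applying the function once per training sample exposes the model to state combinations that the biased dataset rarely contains.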
In Fig. 7 we quantify the correlation between online success rate and offline validation MAE using the sample Pearson correlation coefficient, as done in [10]. We plot the results without data augmentation via noise injection (corr: -0.64) and with it (corr: -0.92). Although noise injection does not have a large impact on the results in training conditions, as shown in Tab. IV, it brings noticeable benefits in generalization conditions. The plots in Fig. 7 show that, without data augmentation, similar MAE values can correspond to very different success rates (e.g., a 20% success rate gap for a difference of only 0.0001 in MAE).
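For reference, the sample Pearson correlation coefficient can be computed as in the sketch below. The function name and the numbers in the usage example are made up for illustration and are not results from this paper.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up checkpoints: offline validation MAE vs. online success rate (%).
mae = [0.0120, 0.0115, 0.0110, 0.0105]
success = [40.0, 55.0, 60.0, 72.0]
print(pearson_r(mae, success))  # strongly negative: lower MAE, higher success
```

A coefficient close to -1 means that the offline error is a reliable proxy for online driving performance, which is the property the noise injection is meant to improve.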
Another component that helps the agent act like the expert demonstrator is the command consistency module (CCM, see Sec. IV-C). This module acts as a regularizer, encouraging the model to generate driving commands that do not conflict with each other, and thus preventing unwanted behaviors at driving time. The need for CCM also stems from the fact that a maneuver (e.g. a right turn) can be performed in different ways (e.g. a slow and narrow turn or a fast and wide one). Results in Tab. IV confirm the usefulness of CCM.

Fig. 1. A convolutional backbone extracts a feature map, which is fed to a multi-stage transformer architecture. The first stage (E1) takes the feature map and a state token, which is propagated across the network. The output of E1 corresponding to the state token is decoded into a stop/go prediction with a Feed Forward Network (FFN). The second stage (E2) uses the propagated state to predict driving commands. Finally, the Command Coherency Module is used as a loss regularizer to ensure consistency between driving commands.

Fig. 2. Each row shows visual attention for the two stages of the model w.r.t. an input image. The two stages reflect important cues for the stop/go and command generation losses, respectively.

Fig. 4. Top 10 neighbors for the highest scoring attention after a traffic light turns green. We show examples of both successful crossings of the traffic light (framed in green) and failures due to red light "hallucination" (framed in red).

Fig. 6. Distribution of Steer-Throttle without (left) and with (right) Noise Injection on NoCrash. Noise Injection gives a better coverage of the space.

Fig. 7. Pearson correlation between online success rate and offline validation MAE obtained by training the model multiple times without (left) and with (right) data augmentation on the state token. Dot size corresponds to different training epochs.

TABLE II. SUCCESS RATE ON NoCrash. METHODS ABOVE THE LINE HAVE ACCESS TO PRIVILEGED INFORMATION NOT USED BY OTHER RGB-BASED METHODS: TOP-VIEW MAPS (LBC, WOR), SEMANTIC SEGMENTATIONS (LBC, WOR) AND ADDITIONAL DATA (FASNET, MODE, CADRE, GRIAD).

TABLE III. TRAFFIC LIGHT VIOLATIONS ON NoCrash BENCHMARK (Empty).

TABLE IV. ABLATION STUDY SWITCHING OFF MODEL COMPONENTS ON NoCrash. WE REMOVE: NOISE INJECTION FUNCTION (W/O NI), COMMAND COHERENCY MODULE (W/O CCM), STATE TOKEN (W/O ST) AND STOP/GO LOSS (W/O S/G LOSS). WE ALSO SHOW RESULTS FOR THE SINGLE STAGE BASELINE.

TABLE VI. SUCCESS RATE AND NUMBER OF FAILED EPISODES BY TERMINATION CAUSE ON NOCRASH. WE COMPARE WITH THE BASELINE AND CILRS IN ALL TASKS AND WEATHER CONDITIONS IN TOWN01. FOR SUCCESS RATE WE UNDERLINE THE METHOD WITH THE BEST PERFORMANCE (HIGHER IS BETTER ↑); FOR FAILURE CASES WE HIGHLIGHT (BOLD) THE METHOD WITH THE LOWEST NUMBER OF OCCURRENCES FOR THAT TYPE OF FAILURE (LOWER IS BETTER ↓).

TABLE VII. % OF DETECTED ENTITIES IN FEATURES WHEN THE VEHICLE IS STOPPED AT A GREEN TRAFFIC LIGHT ON NOCRASH.