Multi-Task Adaptive Gating Network for Trajectory Distilled Control Prediction

End-to-end autonomous driving is often categorized based on output into trajectory prediction or control prediction. Each type of approach provides benefits in different contexts, resulting in recent studies on how to combine them. However, the current proposals are based on heuristic choices that only partially capture the complexities of varying driving conditions. How to best fuse these sources of information remains an open research question. To address this, we introduce MAGNet, a Multi-Task Adaptive Gating Network for Trajectory Distilled Control Prediction. This framework employs a multi-task learning strategy to combine trajectory and direct control prediction. Our key insight is to design a gating network that learns how to optimally combine the outputs of trajectory and control predictions in each situation. Using the CARLA simulator, we evaluate MAGNet in closed-loop settings with challenging scenarios. Results show that MAGNet outperforms the state-of-the-art on two publicly available CARLA benchmarks, Town05 Long and Longest6.


Shoaib Azam, Member, IEEE, and Ville Kyrki, Senior Member, IEEE
Index Terms: Autonomous agents, end-to-end autonomous driving, gating network, imitation learning, intelligent transportation systems.

I. INTRODUCTION
Learning effective driving policies is pivotal for the development of end-to-end autonomous driving solutions. Typically, these driving policies are distinguished by their outputs, falling into either the trajectory prediction [1], [2], [3], [4] or the direct control prediction [5], [6], [7] category. Trajectory prediction aims to forecast the vehicle's future motion over a specified horizon and uses separate controllers, for instance, PID or model predictive controllers (MPC), to translate the planned trajectories into commands for the vehicle actuators. Conversely, control-based methods optimize the control signal directly. Both trajectory and direct control prediction have merits and demerits. In particular, trajectory prediction outcomes can be integrated with other tasks, such as semantics and occupancy prediction [8] or multi-agent interactions [9], enhancing safety and refining the planned trajectory. However, since trajectory prediction relies on controllers to convert planned trajectories into control signals, the type of controller used may constrain its performance. On the other hand, control-prediction methods often result in discontinuous and unstable behavior because they make independent predictions at different steps. As a result, a clear consensus on which paradigm is superior remains elusive.
The underlying research question, rarely studied in the literature, is how to combine trajectory and control prediction based on the observed situation. The pioneering work in this direction is Trajectory-guided control prediction (TCP) [10], which developed a multi-task learning framework that combines both prediction methods by heuristically determining a situation-fusing parameter. However, this heuristic parameter cannot fully capture the situation-dependency of the optimal combination of trajectory and control predictions.
To fill this gap, we introduce MAGNet (Multi-Task Adaptive Gating Network for Trajectory Distilled Control Prediction), which uses a gating network to learn the situation-fusing parameter from the perception of the environment. MAGNet employs a multi-task learning strategy to perform trajectory and control prediction simultaneously. Direct control methods predominantly focus on immediate low-level actions and often do not fully capture the complexities of end-to-end autonomous driving. To address this limitation, MAGNet incorporates a self-attention mechanism that distills trajectory guidance into the control prediction branch. Moreover, our method dynamically learns the situation-fusing parameter, adapting to the environmental input representation, for fusing the trajectory and control prediction outputs. By doing so, we achieve a more dynamic integration of trajectory and control predictions, enhancing the vehicle's situational awareness.
The main contributions of this letter can be summarized as follows:
1) We developed MAGNet with a novel gating network that dynamically fuses control and trajectory predictions, distinguishing it from traditional methods that typically depend on static or heuristic-based approaches for integration. This methodological advancement empowers the model to adapt its integration strategy to suit each unique driving scenario. This flexibility is anticipated to enhance the robustness and accuracy of driving policies.
2) We have integrated a self-attention mechanism into the control prediction branch of MAGNet, primarily due to its ability to enhance the model's focus on the most pertinent features derived from trajectory prediction. The use of self-attention in this context is novel because it allows MAGNet to selectively emphasize critical aspects of the input data, which is crucial for making precise and efficient control decisions in dynamic and complex driving environments.
3) Our evaluations and ablation studies demonstrate that MAGNet's situation-based fusing parameter outperforms heuristic methods, with experimental results on CARLA benchmarks confirming its efficacy over state-of-the-art models in closed-loop settings.

A. End-to-End Autonomous Driving
End-to-end autonomous driving methods, classified into trajectory and direct control prediction approaches, learn to map sensor data to actions via imitation learning (IL) or reinforcement learning (RL) [11]. RL, particularly model-free reinforcement learning, is effective in autonomous driving: it adapts well to data shifts and has proven successful in vehicle control [12]. Furthermore, model-based methods learn the world model using pre-recorded trajectories and compute action-value functions, which, with sensor inputs, train a policy for error-correct navigation [13]. Some studies separate perception from the RL process in driving policy learning [14], [15], [16].
In the literature, end-to-end driving policies are often learned through imitation learning, particularly behavior cloning. This involves feature representation steps such as mapping BEV semantics to waypoint prediction [2] or incorporating global and temporal reasoning [17]. Some studies also focus on a unified framework that integrates perception, prediction, and planning using intermediate representations [8], [18]. Sensor fusion techniques are increasingly used in driving policy learning, such as combining Lidar and image data with self-attention and GRU-based decoders for trajectory prediction [3], [19]. Additionally, some methods learn policies from both the ego vehicle's and other vehicles' perspectives using viewpoint-invariant representations [9], while others improve the decoder for trajectory learning [20].

B. Multi-Task Learning and Knowledge Distillation
Multi-task learning trains networks on related tasks to boost performance and generalization, a technique increasingly applied in end-to-end autonomous driving systems [27], [28], [29]. FASNet, within a multi-task learning framework, forecasts future states and actions using deep-predictive coding and vehicle kinematics, with control signals produced from a weighted average of predicted actions [30]. Similar to our work, Trajectory-guided control prediction (TCP) follows a multi-task learning framework for trajectory and control prediction and then adopts a heuristic approach to fuse them [10]. Unlike TCP's heuristic integration of trajectory and control predictions, our method employs a learned fusion strategy via a situation-aware gating network, adjusting fusion coefficients for contextual precision. We also enhance branch interaction with a self-attention mechanism, optimizing knowledge distillation by prioritizing salient feature integration.
Knowledge distillation has been used in autonomous driving by training a privileged agent with extensive data and then using it to train a sensorimotor agent with limited data [1]. Some studies add an alignment module to better transfer knowledge from teacher to student, optimizing learning through a coaching approach [31].

A. Problem Setting
In end-to-end autonomous driving, the objective is to translate an input representation $x$ into a corresponding control action $u$. In this letter, the input representation comprises the sensor signal $s_i$, the vehicle speed $\upsilon$, a high-level navigation command $\rho$, and a goal point $(x, y)$. The goal point provides a target location for the vehicle's navigation and is integral to the driving task. The resulting control action consists of the longitudinal control signals (throttle and brake) and the lateral control signal (steer). In our research, we explore methods to contextually and adaptively merge the outputs of trajectory and control prediction in a learnable manner. For trajectory prediction, a point-to-point navigation approach is adopted: a driving policy $\pi$ is learned to imitate the behavior of an expert policy $\pi^*$ in a supervised manner with a loss function $L$, where $W$ are the ground-truth waypoints and $\pi(x)$ is the learned policy for predicting the waypoints over the horizon $T$. Similarly, the control branch is trained in a manner consistent with behavior cloning in imitation learning, where expert-provided control signals directly supervise the model's current control predictions over the dataset $D$. The dataset $D$ is collected by rolling out the expert policy $\pi^*$ in the simulated world. Each trajectory $\tau = (x_0, u^*_0, x_1, u^*_1, \ldots, x_T)$ comprises state-action pairs $\{(x_i, u^*_i)\}_{i=0}^{T}$, where $u^*$ includes the control signals and waypoint information, along with the goal point data.

B. Architecture
Fig. 1 provides an overview of the MAGNet architecture, which consists of four main components: an encoding stage for feature extraction, trajectory prediction and control prediction branches, and a situation-based gating network for fusing the outputs of these branches. The encoding stage is further divided into two encoders. The image encoder ($E^c_I$), built on a ResNet [32] architecture, is responsible for extracting the feature embeddings ($I^C_{emb}$) and the feature vector ($I^C_{feat}$) from the input image. The measurement encoder produces the measurement feature vector ($I^M_{feat}$). The combined feature vector $F$ is propagated to the subsequent two branches and the gating network. The following sections detail the trajectory branch, control branch, and situation-based gating network.
1) Trajectory Prediction Branch: Unlike control prediction, which directly predicts the control action, the trajectory prediction branch, as illustrated in Fig. 2, predicts the planned trajectory over the horizon $K$. The predicted waypoints are then processed by low-level controllers, $u_{traj} = I(W)$, where $I$ denotes the low-level controller and $W$ denotes the waypoints. In the proposed method, the trajectory prediction branch takes the combined feature vector $F$ as input and down-samples it to a 256-dimensional feature vector $f$ by passing it through a series of linear layers. For predicting the next waypoints, we employ an auto-regressive GRU [33] model whose hidden state is initialized with the feature vector $f$. The auto-regressive model uses the current position and goal location as inputs. This design enables the network to concentrate on pertinent contextual information within its hidden states, thereby enhancing its ability to predict subsequent waypoints. Finally, a linear layer followed by GRU layers is used to predict the next waypoints $(w_0, w_1, \ldots, w_K)$ over horizon $K = 4$. Two PID controllers, one for longitudinal and another for lateral control, process the predicted waypoints to generate control actions in the form of throttle, brake, and steer.
Fig. 3. Control prediction branch. The architecture for multi-step control prediction through trajectory branch supervision. A GRU is used for multi-step control prediction and self-attention for knowledge distillation between trajectory and control.
2) Control Prediction Branch: As illustrated in Fig. 3, we designed the control prediction branch to predict multi-step future control actions by distilling information from the trajectory branch. Traditional control prediction methods follow the behavior cloning approach, which relies on the assumption that the data are independent and identically distributed (i.i.d.); this assumption does not hold in closed-loop settings. To address this limitation, we employ self-attention to design a trajectory distilled control prediction branch.
The control prediction branch comprises two heads: a value head and a policy head. An initial feature vector $F$ undergoes processing through a series of linear layers to produce a down-sampled feature vector $x$, which is then utilized by both the value and policy heads. In the trajectory distilled control prediction, self-attention is used initially to compute the attention matrix $A \in \mathbb{R}^{m \times n}$, as shown in (3), where the $Q$, $K$, and $V$ matrices are derived from the measurement features ($I^M_{feat}$). The rationale behind employing the self-attention mechanism in our model lies in its capability to independently evaluate and integrate input features, both measurement and image data. This approach ensures contextually informed and temporally coherent feature integration, which is critical for making accurate decisions in dynamic driving environments. Note that the self-attention employed in our trajectory distilled control prediction branch differs from that of TCP. TCP uses trajectory-guided attention to focus on specific regions of the sensor input, creating an attention map that aggregates 2D image features for control prediction. In contrast, MAGNet employs a self-attention mechanism that merges measurement features with image features, enhancing the model's capability to focus on the most relevant aspects of the input for control prediction. This approach is more dynamic and context-aware, allowing different types of input features to be integrated.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
This initial attention matrix is used to compute the feature embedding for the control prediction branch by taking the dot product with the image feature vector ($I^C_{feat}$). The core logic unfolds within a temporal loop implemented as a GRU with a prediction horizon $K$. For each iteration $t \in [0, K-1]$, the model ingests a concatenated vector $x_{in} \in \mathbb{R}^{n+2p}$, where $n$ and $p$ denote the dimensions of the current control state $x$ and of the parameters $\mu$ and $\sigma$, respectively. The hidden state $h \in \mathbb{R}^{q}$ is updated using the GRU decoder as given in (4), where $h_{t-1}$, which serves as $h_{in}$, is the input hidden state for the GRU at the current time step $t$.
The hidden state $h$ and a trajectory-guided hidden state $u^{traj}_{hidden}$ are subsequently used to compute a waypoint-based attention map $wp_A$ using another self-attention mechanism. This attention map is applied to $I^C_{feat}$ to produce a new feature embedding, which is then combined with $h$ to obtain the merged feature. This merged feature updates $x_{in}$ as shown in (5). Throughout the loop, the model refines these variables iteratively, generating a sequence of multi-step control predictions $(u_0, u_1, \ldots, u_K)$ that are temporally coherent. Incorporating self-attention mechanisms into the architecture enhances the model's capacity for sequential decision-making.
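To make the attention step concrete, the following is a minimal sketch of generic scaled dot-product attention in plain Python. It is not the MAGNet implementation: the learned projections that produce Q, K, and V, the feature dimensions, and the function names `softmax` and `attention` are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Generic scaled dot-product attention (illustrative, not the paper's code).

    Q, K, V are lists of row vectors. Each query row is compared against
    every key row; the resulting weights mix the value rows.
    """
    d = len(K[0])  # key dimension used for scaling
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # one weight per key, summing to 1
        out.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return out


# A zero query scores both keys equally, so the output is the mean of the values.
print(attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 4.0]]))
```

In the architecture described above, the weighted combination would be applied to the image features rather than to a toy value matrix.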
3) Gating Network: The gating network, as illustrated in Fig. 1, performs the high-level decision-making of fusing the trajectory and control prediction outputs to yield an optimized and context-aware command for the vehicle actuators. The primary objective in designing the gating network is to equip it with situational awareness. It dynamically weighs the trajectory-based controls $u_{traj} = I(W)$ against the direct control signals $u_{ctrl} = (u_0, u_1, \ldots, u_K)$ from the control prediction branch. This enables the gating network to make context-sensitive decisions in various driving scenarios, such as navigating intersections, executing turns, or overtaking other vehicles.
To this end, the gating network receives the combined feature vector $F$ as input and generates two outputs: a high-level command $g_\Phi$ and a situation-fusing parameter $g_\alpha$. The high-level command $g_\Phi$ takes one of the values 'straight', 'left turn', 'right turn', 'lane-following', 'change lane to the left', and 'change lane to the right'. $g_\Phi$ is auxiliary information predicted by the proposed MAGNet framework. Mathematically, let $F$ represent the situation context derived from the sensor information (e.g., image and measurements); the gating network is then expressed as in (6). The parameter $g_\alpha$ is a function of $F$ and the outputs $u_{traj}$, $u_{ctrl}$ from the trajectory and control branches, as given in (7), where $W_{g_\alpha}$ and $b_{g_\alpha}$ are learnable parameters. The softmax function ensures that $g_\alpha$ is a probabilistic weighting factor in the $[0, 1]$ range. The high-level command network $g_\Phi$ outputs the discrete high-level commands and is expressed in (8). Finally, the output control action $P$ is a weighted sum of $u_{traj} = I(W)$ and $u_{ctrl} = (u_0, u_1, \ldots, u_K)$, modulated by $g_\alpha$, as given by (9). The model can thus adaptively balance long-term planning and immediate reactive behaviors, which makes it robust to a variety of dynamically changing environments.
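The gating idea can be sketched as follows. This is a hedged illustration rather than the published formulation: it assumes a single linear layer producing two logits (one per branch), a softmax over them, and a convex combination in which the weight on the trajectory branch is $g_\alpha$ and the weight on the control branch is $1 - g_\alpha$. The names `gate_alpha` and `fuse`, and all numeric values, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_alpha(features, weights, bias):
    """Hypothetical gate: linear layer with two logits (traj, ctrl), then softmax.

    Returns the softmax weight assigned to the trajectory branch, in [0, 1].
    """
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)[0]


def fuse(alpha, u_traj, u_ctrl):
    """Assumed convex combination of the two branch outputs, per control signal."""
    return {k: alpha * u_traj[k] + (1.0 - alpha) * u_ctrl[k] for k in u_traj}


# Toy gate input; in the text it would combine F with both branch outputs.
z = [0.2, -0.1, 0.4]
alpha = gate_alpha(z, weights=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], bias=[0.0, 0.0])
u_traj = {"steer": 0.10, "throttle": 0.60, "brake": 0.0}
u_ctrl = {"steer": 0.30, "throttle": 0.20, "brake": 0.0}
print(alpha, fuse(alpha, u_traj, u_ctrl))
```

Under this assumed form, a larger `alpha` moves the command toward the trajectory branch and a smaller one toward the control branch.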

4) Loss Design: The MAGNet framework includes the trajectory planning loss $L_{traj}$, the control prediction loss $L_{ctrl}$, the auxiliary loss $L_{aux}$, and the gating loss $L_G$. Since MAGNet focuses on incorporating trajectory and control prediction in a unified framework with situation-based fusion, the proposed method is trained in two phases. In phase one, the trajectory and control prediction branches are trained end-to-end without the gating network; they are then frozen while the gating network is trained in the second phase.
The trajectory loss $L_{traj}$ is given in (10), where $w_i$ and $\hat{w}_i$ denote the predicted and ground-truth waypoints at time $i$, respectively. $\lambda_F$ is a tunable weight for the feature loss $L_F$, which computes the $L_2$ distance between $f^{(0)}_{traj}$ and $f^{(0)}_{Expert}$ at the current time step, thereby acting as an auxiliary supervisory signal. Here, $f^{(0)}_{traj}$ is the feature representation of the predicted trajectory at the initial state and $f^{(0)}_{Expert}$ is the feature representation from the expert demonstration. For control prediction, the loss $L_{ctrl}$ is given in (11). The loss function $L_{ctrl}$ comprises four terms. It uses the Kullback-Leibler (KL) divergence to measure the difference between the predicted and ground-truth Beta distributions, both at the current step and over future time steps $i$. A feature loss, weighted by $\lambda_F$, enhances the model's learning at each time step. The aggregated loss $L$ for phase one is given in (12), where $L_{aux}$ is the weighted sum of the $L_1$ loss for speed prediction and the $L_2$ loss for value prediction.
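One plausible reading of the phase-one objective is a weighted sum of the three losses. The form below is an assumption inferred from the hyper-parameters $\lambda_{traj}$, $\lambda_{ctrl}$, and $\lambda_{aux}$ reported in the training details; it is not a reproduction of the published equation.

```latex
L = \lambda_{traj} \, L_{traj} + \lambda_{ctrl} \, L_{ctrl} + \lambda_{aux} \, L_{aux}
```

With the reported values $\lambda_{traj} = \lambda_{ctrl} = 1$ and $\lambda_{aux} = 0.05$, the auxiliary speed and value terms would act as a light regularizer next to the two main branch losses.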
After phase one, the trajectory and control prediction branches are fixed, and the gating network $G$ is trained end-to-end. The loss function $L_G$ is expressed in (13). Here, $\lambda_i$, $\lambda_{command}$, and $\lambda_{L_1}$ are the hyper-parameters. $L_{command}$ is the loss between the predicted high-level command and the ground truth, as given by (14), whereas $L_1$ is the regularization term given by (15). The combined output $o_{combined,i}$ for each control signal $i \in \{\text{steer}, \text{throttle}, \text{brake}\}$ is computed as a weighted average of the outputs from the trajectory and control prediction branches, denoted $o_{traj,i}$ and $o_{ctrl,i}$, respectively. The weights $\alpha_{traj,i}$ and $\alpha_{ctrl,i}$ modulate these contributions, as expressed in (16).
IV. EXPERIMENTS

A. Benchmark
In this work, the CARLA simulator [34] is used for the closed-loop evaluation of the proposed method. We use two widely used benchmarks, Town05 Long and Longest6 [3], where the Longest6 benchmark uses the six longest routes of each town (Town01-Town06), comprising 36 routes in total. In each benchmark, the routes are defined by a sequence of navigation points together with sensor and high-level command data (turn right/left, lane changing and following, straight). The task in closed-loop driving is to drive the autonomous agent to the desired destination under simulated traffic, which also includes challenging scenarios such as obstacle avoidance, crossing unprotected intersections, and sudden control loss.

B. Data Collection
In our experiments, we choose Roach [16] as the expert model for supervision. Roach is an RL-trained model incorporating privileged information, including roads, routes, lanes, vehicles, pedestrians, and traffic elements, rendered into a 2D bird's-eye-view (BEV) image. This learning-based expert offers advantages over rule-based experts by providing a richer set of information beyond just direct supervision signals.
For data generation, we adhere to the protocol outlined in [10], rolling out an expert policy with privileged information to gather the dataset using the CARLA simulator. Our data collection setup uses a front-facing monocular camera, IMU, GPS, and speedometer. We collected data in Town01, Town03, Town04, and Town06 under various environmental conditions, resulting in 189K data points for training.

C. Evaluation Metrics
Our model's performance is assessed using CARLA Leaderboard metrics, focusing on Route Completion (RC) for measuring route success, Infraction Score (IS) for traffic rule adherence, and Driving Score (DS) as the primary metric combining RC and IS for a holistic performance evaluation [3], [9], [34].
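As a small illustration of how these metrics relate, the sketch below follows the CARLA Leaderboard convention in which the driving score is the per-route product of route completion and infraction score, averaged over routes. The function name `driving_score` and the two example routes are hypothetical and are not results from this letter.

```python
def driving_score(routes):
    """Mean over routes of route_completion (percent) times infraction score.

    routes: list of (route_completion_percent, infraction_score) pairs.
    """
    return sum(rc * inf for rc, inf in routes) / len(routes)


# Two made-up routes: one completed with heavy penalties, one half-completed cleanly.
routes = [(100.0, 0.5), (50.0, 1.0)]
print(driving_score(routes))  # 50.0

# The product of the averages differs (75.0 * 0.75 = 56.25),
# so DS cannot be recovered from the mean RC and mean IS alone.
mean_rc = sum(rc for rc, _ in routes) / len(routes)
mean_is = sum(inf for _, inf in routes) / len(routes)
print(mean_rc * mean_is)  # 56.25
```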

D. Training Details
MAGNet is trained in two phases. In the first phase, the trajectory and control prediction branches are trained end-to-end. For this, the image encoder adopts a ResNet architecture pretrained on ImageNet [35]. The size of the input RGB image is 900 × 256, with the camera FOV set to 100 degrees. In the trajectory and control branches, $T = 4$ corresponds to the next four future steps at 2 Hz. For the PID settings, we follow the settings proposed in [3]: $K_p = 5.0$, $K_d = 1.0$, and $K_i = 0.5$ for longitudinal control, and $K_p = 0.75$, $K_d = 0.3$, and $K_i = 0.75$ for lateral control. The hyper-parameters used for phase one are $\lambda_{traj} = 1$, $\lambda_{ctrl} = 1$, $\lambda_F = 0.05$, and $\lambda_{aux} = 0.05$. For training the gating network ($G$), the hyper-parameters are $\lambda_{steer} = 1$, $\lambda_{throttle} = 1$, $\lambda_{brake} = 1$, $\lambda_{command} = 1$, and $\lambda_{L_1} = 0.5$. Both phases are trained on two Nvidia V100 GPUs with 32 GB of memory each. The Adam optimizer [36] is used in each training phase with a learning rate of $5 \times 10^{-4}$ and a weight decay of $1 \times 10^{-7}$. In both training phases, the models are trained for 60 epochs with a batch size of 64.
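For readers unfamiliar with the low-level controllers, here is a textbook discrete PID sketch using the gains quoted above. It is an illustration only: the class `PID`, the time step `dt = 0.5` (chosen to match the 2 Hz waypoint spacing), and the way the error would be derived from the waypoints are assumptions, and the controller in [3] may compute its integral and derivative terms differently.

```python
class PID:
    """Textbook discrete PID controller (illustrative, not the paper's code)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        """Return the control output for the current tracking error."""
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0  # no history yet on the first call
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Gains as reported in the training details; dt is an assumption.
longitudinal = PID(kp=5.0, ki=0.5, kd=1.0, dt=0.5)
lateral = PID(kp=0.75, ki=0.75, kd=0.3, dt=0.5)

# Hypothetical speed error of 1.0 m/s held for two steps.
print(longitudinal.step(1.0))  # 5.25
print(longitudinal.step(1.0))  # 5.5
```

In practice the raw outputs would be clipped to the valid actuator ranges before being sent to the vehicle.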

E. Results
We compare the proposed method MAGNet with other state-of-the-art methods on two publicly available benchmarks, Town05 Long and Longest6, in closed-loop settings. Table I presents the quantitative results of MAGNet and the state-of-the-art methods on the Town05 Long benchmark. In our quantitative evaluation, the proposed method is compared with both camera-based and Lidar-based state-of-the-art methods. MAGNet, which employs a monocular camera for predicting the driving policies, obtains better driving, route completion, and infraction scores than the other camera-based driving agents. Specifically, MAGNet achieves a driving score of 73.3 ± 3.9, a route completion of 98.5 ± 1.18, and an infraction score of 0.69 ± 0.05, outperforming ThinkTwice [20] (a camera-based driving agent) by 12.8% in driving score, 3.1% in route completion, and 14.5% in infraction score. Similarly, MAGNet also performs better than Lidar-based methods; for instance, MAGNet outperforms LAV [9] (a Lidar-based driving agent) by a margin of 57.6% in driving score, 41.1% in route completion, and 8.2% in infraction score. Since MAGNet follows a multi-task learning framework, we also compare our method with the baseline method TCP [10], which follows the same framework. MAGNet outperforms the TCP [10] baseline by 28.2% in driving score, 22.5% in route completion, and 8.2% in infraction score. On the Longest6 benchmark, MAGNet also performs better than the state-of-the-art methods, as shown in Table II. For instance, MAGNet achieves a driving score of 71.43 ± 2.3, a route completion of 84.54 ± 1.5, and an infraction score of 0.87 ± 0.05, whereas TCP [10] achieves a driving score of 42.86 ± 0.63, a route completion of 61.83 ± 4.19, and an infraction score of 0.71 ± 0.04. Thus, MAGNet outperforms TCP [10] by a margin of 66.7% in driving score, 36.7% in route completion, and 22.5% in infraction score on the Longest6 benchmark. Similarly, when the proposed MAGNet is compared with camera-based and Lidar-based methods, it performs better in the driving, route completion, and infraction scores, as shown in Table II for the Longest6 benchmark.
The efficacy of MAGNet is illustrated in Fig. 4, which showcases its adaptability in various driving scenarios. The qualitative findings align well with the quantitative benchmarks, substantiating its effectiveness relative to state-of-the-art methods.
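The Longest6 margins quoted above are relative improvements over the TCP mean scores, which can be checked with a few lines of arithmetic. The helper name `relative_gain` is hypothetical; the input numbers are the mean scores reported in the text.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Mean Longest6 scores quoted in the text (DS, RC, IS).
magnet = {"DS": 71.43, "RC": 84.54, "IS": 0.87}
tcp = {"DS": 42.86, "RC": 61.83, "IS": 0.71}

gains = {k: round(relative_gain(magnet[k], tcp[k]), 1) for k in magnet}
print(gains)  # {'DS': 66.7, 'RC': 36.7, 'IS': 22.5}
```

The computed values match the reported 66.7%, 36.7%, and 22.5%, confirming that the percentages are relative rather than absolute differences.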

F. Ablation Study
This section presents a quantitative analysis of the control-only model, the trajectory-only model, and our proposed method, all using the same feature extraction process with a ResNet-based image encoder and a measurement encoder. The control-only model uses only the control branch, while the trajectory-only model uses only the trajectory branch. Control-only predictions use the feature vector $F$ directly, whereas trajectory-only predictions down-sample $F$ for the GRU decoder to forecast future waypoints. As shown in Table III, the control-only model exhibits higher reactivity but more infractions, and the trajectory-only model shows lower route completion. Both under-perform our proposed method, which combines the two approaches with a situational gating network. Additionally, we have extended our ablation study to include a heuristic-based combination of the control-only and trajectory-only modules. For a fair comparison, we adopted the same heuristic-based approach used in TCP. While the heuristic approach improves over the individual control-only and trajectory-only models, it still does not reach the performance of our integrated MAGNet approach, as shown in Table III.
We have also conducted a statistical analysis of MAGNet's efficacy, focusing on the $g_\alpha$ parameter within its gating network and comparing it to TCP's heuristic approach. We investigated the impact of these parameters on throttle, brake, and steer controls. Moreover, we demonstrate MAGNet's adaptability to environmental changes through attention maps, as shown in Fig. 5. Tables I-II show a quantitative comparison between MAGNet without attention and the proposed MAGNet with attention.
Fig. 6(a)-(d) details the results of this analysis, comparing MAGNet with TCP across different routes and driving conditions. It highlights instances where the agent alternates between 'traj' (trajectory) and 'ctrl' (control) modes in response to varying situations. Notably, we found that MAGNet's throttle, brake, and steering profiles are significantly smoother than those of TCP. Additionally, the analysis reveals the adaptive behavior of the $g_\alpha$ parameter in MAGNet, which adjusts dynamically to the driving context. We also present the distribution of 'traj' and 'ctrl' modes across various routes in Fig. 7. This distribution reveals that 'ctrl' mode is favored at lower alpha values and 'traj' mode at higher ones, indicating their respective suitability for different driving scenarios.

V. CONCLUSION
In this work, we present MAGNet, a framework designed to learn situational fusion strategies that integrate trajectory and direct control predictions. We also develop a trajectory-distilled control prediction technique that leverages self-attention for multi-step control output predictions. Our findings indicate that the situational fusion parameter can be effectively learned without resorting to heuristic methods for merging trajectory and control predictions. Notably, our proposed approach surpasses the leading TCP method in a closed-loop setting across two widely recognized benchmarks. Furthermore, compared to state-of-the-art methods, including those using camera and Lidar-based agents, MAGNet performs better in driving score, route completion, and infraction score metrics.
The challenge of effectively fusing situation-based parameters in autonomous driving remains an open issue.While our proposed work takes a significant step forward by adaptively learning the situation-based fusing parameter, it still needs to incorporate rules-based methods.Specifically, combining signal-temporal-logic (STL) with adaptive learning introduces complexities in harmonizing these adaptive approaches with established rules.The key challenge lies in ensuring their cohesive operation to improve system safety and efficiency, presenting a promising avenue for future research.

Fig. 1. Overview of architecture. The architecture comprises three modules: the trajectory prediction branch, the control prediction branch, and the gating network. The encoded features are shared by all three modules. The gating network receives the outputs of both the trajectory and the trajectory-distilled control prediction branches and fuses them by learning the situation-based fusing parameter.

Fig. 2. Trajectory prediction branch. The architecture receives the encoded features $F$, which are down-sampled and passed to a GRU-based decoder for predicting the next waypoints.

Fig. 5. Attention map visualizations for MAGNet: (a) 'traj' mode selection, highlighted by focused attention regions, and (b) 'ctrl' mode selection, where attention disperses over regions relevant to control adjustments. These attention maps illustrate that the model is learning the representations.

TABLE I: COMPARISON OF MAGNET WITH STATE-OF-THE-ART METHODS ON TOWN05 LONG BENCHMARK IN TERMS OF DRIVING SCORE (DS), ROUTE COMPLETION (RC) AND INFRACTION SCORE (IS)

TABLE II: COMPARISON OF MAGNET WITH STATE-OF-THE-ART METHODS ON LONGEST6 BENCHMARK IN TERMS OF DRIVING SCORE (DS), ROUTE COMPLETION (RC) AND INFRACTION SCORE (IS)