Evaluating Pedestrian Trajectory Prediction Methods With Respect to Autonomous Driving

In this paper, we assess the state of the art in pedestrian trajectory prediction within the context of generating single trajectories, a critical aspect aligning with the requirements in autonomous systems. The evaluation is conducted on the widely-used ETH/UCY dataset where the Average Displacement Error (ADE) and the Final Displacement Error (FDE) are reported. Alongside this, we perform an ablation study to investigate the impact of the observed motion history on prediction performance. To evaluate the scalability of each approach when confronted with varying amounts of agents, the inference time of each model is measured. Following a quantitative analysis, the resulting predictions are compared in a qualitative manner, giving insight into the strengths and weaknesses of current approaches. The results demonstrate that although a constant velocity model (CVM) provides a good approximation of the overall dynamics in the majority of cases, additional features need to be incorporated to reflect common pedestrian behavior observed. Therefore, this study presents a data-driven analysis with the intent to guide the future development of pedestrian trajectory prediction algorithms.


I. INTRODUCTION
Avoiding accidents involving vulnerable road users (VRU), such as pedestrians, is one of the paramount objectives and ongoing challenges for autonomous vehicles. According to the road safety report published by the World Health Organization (WHO) in 2018, pedestrians account for 23% of the 1.35 million deaths globally caused by road traffic accidents [3]. Consequently, the prediction of pedestrian behavior remains an active area of research as evidenced by recent publications [4], [5], [6], [7], [8]. This line of research, known as pedestrian trajectory prediction, focuses on the modeling of their future trajectories and enables autonomous systems to consider their behavior [9]. The trajectories are estimated with a bird's-eye-view of the scene and the motion history of each pedestrian as visualized in figure 1. In addition, interactions can be explicitly considered. Initially modeled through simple rule-based methods [10], various architectures based on neural networks [11], [12], [4] as well as hybrid approaches [5], [6] have emerged in recent years, trying to improve the overall accuracy of the predicted paths on widely-used benchmarks like ETH/UCY [13], [14] and SDD [15]. Although datasets tailored for traffic environments exist that are better suited for autonomous driving applications [16], their evaluation predominantly revolves around vehicles. Hence, the development and assessment of pedestrian trajectory prediction methods primarily still occurs on the former. While over the years this development has undoubtedly resulted in significant advancements in the field, the adopted Best-of-N (BoN) evaluation approach falls short in quantifying their suitability for autonomous systems. In this regard, it is crucial to determine the accuracy of the most probable prediction, the extent to which the motion history influences its precision, and how effectively these methods scale as the number of observed pedestrians increases.
Therefore, in this paper we want to build upon the work outlined in previous publications [4], [17], [18] by evaluating the state of the art with respect to the accuracy, feature requirements, and computational efficiency when generating single trajectories. The contributions of this work are therefore fourfold:
• First, we evaluate the overall performance by reporting the ADE and FDE metrics when sampling a single trajectory. This is a critical measure for practical applications where usually only one prediction per agent is utilized to plan safe and collision-free paths.
• Second, we examine the sensitivity to input features by limiting the observed motion history to a maximum of two timesteps. While this resembles an extreme case, a prediction based on an initial observation is crucial in practical applications. Furthermore, deeper insights into the inner workings of each method can be gained.
• Third, we benchmark all contributions on a GPU while measuring runtimes to gauge how well each approach scales with an increasing number of agents.
• Fourth, by analyzing the generated predictions, shortcomings of the investigated approaches are derived and directions for future work provided.
The paper is structured as follows: In chapter II we refer to related work already conducted in this field and evaluation methods employed. Following that, our overall evaluation procedure is introduced in chapter III. Based on the findings outlined in chapter IV, we discuss the results obtained in chapter V and derive potential steps to guide future research and development in chapter VI.

II. RELATED WORK
Pedestrian trajectory prediction can be categorized into two different approaches. The first category encompasses knowledge-based methods, with the simplest model being the constant velocity model (CVM) [17]. This model predicts linear trajectories based on an agent's most recent observation and is used as a baseline for trajectory prediction. A more sophisticated approach is the social force model [10], which models human motion by considering attractive and repulsive forces that characterize the agent's interaction with the environment. It has found extensive usage in simulation environments for controlling interactive agents. Expanding on this model, advanced algorithms such as ORCA [19] and BRVO [20] have been developed and adapted for predicting human motion. These models consider the velocities of surrounding obstacles to estimate collision-free paths. The introduction of deep learning frameworks has led to a shift in focus, and the research community has employed neural networks to better approximate pedestrian trajectories. This second group of approaches is known as learning-based methods. In 2016, Social-LSTM [21] was published and has since remained one of the most influential approaches in the field. It utilizes a Recurrent Neural Network (RNN) to consider the past trajectory of observed agents and incorporates them into a pooling module to model interactions. Building upon the same architecture, Social GAN (SGAN) [11] was introduced in 2018, replacing the RNN architecture with a Generative Adversarial Network (GAN) and improving model accuracy by sampling reasonable and diverse trajectories. In 2020, Trajectron++ [12], a Graph Neural Network (GNN), pushed the boundaries further. It leverages graph representations to account for cross-class interactions, such as between a car and a pedestrian. Currently, Y-Net [22] holds the second position in the common ETH/UCY benchmark [21] and, to the best of our knowledge, is the best-performing open-source contribution in this field. Y-Net utilizes an Encoder-Decoder setup, incorporating past trajectories and semantic maps to generate multimodal trajectories. AgentFormer [23] was published shortly afterward, using a Transformer architecture with a specifically designed attention mechanism to model interactions. Since then, limited progress in enhancing the prediction accuracy has prompted researchers to explore other crucial factors of trajectory prediction methods. As a result, Social-Implicit [4] was introduced in 2022, leveraging feed-forward neural networks in place of recurrent ones. This shift yielded enhancements in computational efficiency without compromising on the model's accuracy, thereby contributing to the progression in this field.
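To make the constant velocity baseline mentioned at the start of this section concrete, the following is a minimal sketch rather than the implementation used in [17] or in this study. It assumes that the velocity is estimated from the last two observed positions and that positions are 2D points in meters; the function name `predict_constant_velocity` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def predict_constant_velocity(history: List[Point], horizon: int = 12) -> List[Point]:
    """Extrapolate the most recent displacement linearly for `horizon` timesteps.

    Assumption: the velocity is the difference between the last two positions.
    """
    if len(history) < 2:
        raise ValueError("need at least two positions to estimate a velocity")
    (x_prev, y_prev), (x_last, y_last) = history[-2], history[-1]
    vx, vy = x_last - x_prev, y_last - y_prev  # displacement per timestep
    return [(x_last + vx * k, y_last + vy * k) for k in range(1, horizon + 1)]
```

A pedestrian whose last two positions coincide yields a prediction that stays at the last observed position, which is the behavior a constant velocity assumption implies for static agents.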

The selection of models for this evaluation was based on their strong performance, either recently or historically, establishing them as state of the art in various aspects. Each model represents a distinct neural network architectural type addressing the same problem, enabling a more insightful comparison. The chosen models include SGAN [11], Trajectron++ [12], AgentFormer [23], Y-Net [22], and Social-Implicit [4]. To establish a deterministic baseline, these models were benchmarked against a CVM. With the exception of the CVM, each architecture incorporates the motion history of multiple agents. SGAN and Trajectron++ sequentially process individual timesteps through their underlying RNN-based architecture, while the remaining ones process them in parallel. All models utilize positional data from the dataset with some also incorporating velocity (AgentFormer, Trajectron++, Social-Implicit) and, in the case of Trajectron++, even acceleration information. Additionally, explicit modeling of interactions with other agents in the scene is common among the investigated approaches, except for CVM and Y-Net. The CVM disregards interactions entirely, while Y-Net indirectly considers interactions by encoding the observed motion history in a heatmap. The mechanisms employed for modeling interactions directly are social pooling (SGAN), including spatial information within a spatial-temporal graph (Trajectron++, Social-Implicit), or the consideration in a specially designed attention layer (AgentFormer). The last feature used for the prediction is spatial information about the environment which is represented through simple obstacle maps or semantic maps containing roads, walkways, and non-walkable areas. Among these models, only Y-Net actively employs an obstacle map with two distinct classes. Trajectron++ and AgentFormer have the capability to include this type of information but have not utilized it in the given benchmark. Nevertheless, this marks them as the only two models to explicitly attempt
To assess the effectiveness of a given method, various metrics are commonly employed, with the Average Displacement Error and Final Displacement Error being the most prevalent ones [9], [24]. ADE measures the average Euclidean distance between predicted trajectories and observed ground truth trajectories, whereas FDE focuses solely on the last position, representing a measure of the error accumulation over the entire prediction horizon. In this evaluation, the models predict 12 future timesteps based on the motion history derived from the last eight positional observations. To account for the multimodal behavior of humans, a Best-of-N (BoN) approach is typically used, sampling 20 potential trajectories [11], [12], [22], [23]. The trajectory with the smallest ADE is then selected to determine the overall performance. While this evaluation procedure has yielded significant progress in recent years, it has also faced criticism. Firstly, Schöller et al. [17] addressed the overall complexity of neural networks by demonstrating that a CVM can yield comparable results when the same BoN evaluation is applied to it. Secondly, concerns have been raised regarding the metric itself, as small error reductions during the evaluation on the ETH/UCY benchmark might also be caused by the inherent non-deterministic nature of neural networks [4]. To address this, the authors propose a confidence-based metric for evaluating trajectory distributions. Lastly, the applicability of prediction algorithms in real-world scenarios, partially addressed in [17], [18], remains a pressing concern. Consequently, the primary focus of this work is the evaluation of open-source state-of-the-art prediction models while generating a single trajectory for each pedestrian in the given scenario.
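The metrics described above can be sketched in a few lines. This is an illustrative sketch for a single pedestrian, not the evaluation scripts of the individual contributions; the function names `ade`, `fde`, and `best_of_n` are hypothetical, and trajectories are assumed to be equal-length lists of 2D positions in meters.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average Euclidean distance between predicted and ground-truth positions."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Euclidean distance between the final predicted and ground-truth position."""
    return math.dist(pred[-1], gt[-1])


def best_of_n(samples: List[Sequence[Point]], gt: Sequence[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) of the sample with the smallest ADE, as in the BoN protocol."""
    best = min(samples, key=lambda s: ade(s, gt))
    return ade(best, gt), fde(best, gt)
```

With a single sample, `best_of_n` reduces to the plain ADE and FDE of that sample. The single-trajectory evaluation this work focuses on is therefore the special case of one sample, with the difference that the most likely trajectory is selected where a model provides it.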

III. METHODOLOGY
In terms of practical implementation for autonomous driving, the most significant factors of a prediction algorithm encompass the precision of the most likely trajectory, the prerequisite for an available trajectory from an initial observation, and the runtime required for generating said trajectories for the whole scenario. Consequently, the evaluation is performed across the following three criteria: accuracy (BoN, most likely), feature requirements (limiting the number of temporal observations), and the inference time of each model. Whenever possible, the reported results were cross-checked against the values presented in the original publications or contributions found within the issues of the respective repositories. The following three sections outline our evaluation procedure, describe the data utilized for testing and training, and provide details about the explicit implementation of said experiments.

A. Evaluation procedure
To provide a measure for an overall comparison, the ADE and FDE were determined with the evaluation scripts provided by each individual contribution. Although these metrics do not provide a sufficient measure when a BoN evaluation is employed, as has been previously addressed by Mohamed et al. [4], they provide valuable insight into the overall accuracy when single trajectories are predicted. Since the majority of models are non-deterministic and, in some cases, trajectories are randomly sampled rather than selecting the most likely one, each of the five scenes in the ETH/UCY dataset was evaluated five times. Afterwards, the average for each scene was calculated before combining all values to the overall results presented in figures 2 and 3. To sample a varying number of trajectories, the respective variables in each contribution were adjusted to either one or 20. Additional details are provided in section III-C.
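The repeated evaluation described above can be illustrated as follows. This is a sketch under stated assumptions: the overall value is assumed to be the unweighted mean of the per-scene averages, which the text does not specify, and the function name `aggregate_runs` is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def aggregate_runs(results: Dict[str, List[float]]) -> Tuple[Dict[str, float], float]:
    """Average repeated runs per scene, then combine the scene averages.

    `results` maps a scene name to the metric values (e.g. ADE in meters)
    obtained in the repeated evaluation runs of that scene.
    Assumption: the overall value is the unweighted mean over scenes.
    """
    per_scene = {scene: mean(values) for scene, values in results.items()}
    overall = mean(list(per_scene.values()))
    return per_scene, overall
```

Averaging per scene first keeps a scene with many pedestrians from dominating the run-to-run noise estimate. Whether scenes should then be weighted by their size is a separate design choice.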
As previously stated, almost all input features employed are present in real-world scenarios. Among the investigated approaches, Y-Net is the only model utilizing a semantic map on the given dataset and heavily relies on it due to its encoding mechanism. For this reason, an evaluation of the spatial information's influence on the model's performance was not conducted. Instead, the focus was shifted to the temporal information since all algorithms incorporate a motion history of eight timesteps for each agent within a scene. To investigate the impact of this feature, the number of available observations was limited and the corresponding metrics were computed. Specifically, the input tensor of each model was modified to contain either only the last timestep or the last two. Here, a timestep refers to an observation at any given time, encompassing both positional information and velocity data. While this limitation represents an extreme condition since not much information about an agent's movement can be derived, in practical applications a prediction might be needed right from the initial observation. Moreover, by evaluating these extreme cases, further insight into the workings of each method can be gained. Although experiments with more than two timesteps were conducted, these variations did not yield significant improvements and are therefore not reported in this study. For the most part, the missing entries were represented by a zero value within the respective tensor which applies to the positional data as well as the relative displacements and velocities. All other adjustments differing from this approach are listed in the implementation details below.
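The truncation of the motion history can be sketched as below. This is an illustration of the zero-filling idea only: the actual modification was applied to each model's own input tensor, whose layout differs between contributions. The function name `limit_history` and the representation of an observation as a 2D tuple are assumptions.

```python
from typing import List, Tuple

Observation = Tuple[float, float]


def limit_history(history: List[Observation], keep: int,
                  fill: Observation = (0.0, 0.0)) -> List[Observation]:
    """Keep only the last `keep` observations and zero-fill the earlier ones.

    The sequence length stays unchanged so that a model expecting a fixed
    number of timesteps (eight in this study) can still consume the input.
    """
    if not 1 <= keep <= len(history):
        raise ValueError("keep must be between 1 and the history length")
    n_masked = len(history) - keep
    return [fill] * n_masked + list(history[n_masked:])
```

For example, calling `limit_history(history, keep=2)` on an eight-step history leaves the two most recent observations intact and replaces the six earlier ones with the fill value.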
In order to evaluate the scalability of each approach as the number of observed agents increases, we measured the execution time of each model using the built-in CUDA-API for PyTorch [25]. This API offers precise time measurements through an accurate event recording. Specifically, we measured the execution time of a model starting from the point of providing individual scenario inputs until receiving predictions for all agents within the scene. The data preprocessing time was excluded from these measurements. To allow for a fair comparison, all experiments were performed on an NVIDIA Tesla V100 device with CUDA version 10.1 and Python 3.8.10. To ensure reliable estimates of the average scene prediction time, a batch size of one was chosen, and each model was evaluated five times. The median execution time and the interquartile range (IQR) were calculated across all scenes, providing insight into the range of runtimes for different numbers of agents. To visualize these results, Appendix A contains a detailed overview of the two models that exhibited the largest variations across all employed scene sizes.
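The summary statistics named above can be computed with the Python standard library, as in the sketch below. It assumes that the per-scene inference times have already been collected as a list of milliseconds and that the IQR follows the conventional definition as the distance between the 25th and 75th percentiles. The function name `summarize_runtimes` is hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize_runtimes(times_ms: Sequence[float]) -> Dict[str, float]:
    """Return the median and interquartile range of measured runtimes.

    Assumption: IQR = 75th percentile - 25th percentile, computed with the
    'inclusive' method, which treats the data as the full population.
    """
    q1, _, q3 = statistics.quantiles(times_ms, n=4, method="inclusive")
    return {"median": statistics.median(times_ms), "iqr": q3 - q1}
```

The median and IQR are less sensitive to single slow passes (for example a warm-up iteration) than the mean and standard deviation, which is one reason to prefer them for runtime reporting.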

B. Dataset and Training
This study focuses on the assessment of short-term pedestrian trajectory prediction algorithms for time horizons up to 4.8 s. To facilitate an effective comparison and encompass scenarios of diverse complexity, the widely used ETH/UCY dataset [13], [14] was selected. This dataset includes situations with over 50 agents as well as scenarios with predominantly non-linear trajectories [26], enabling an extensive evaluation of the chosen models. The data splits employed are the ones initially introduced by [11]. In the conducted Leave-One-Out Cross-Validation, four out of five scenes (Eth, Hotel, Univ, Zara1, and Zara2) were used for training and the remaining one for testing. In cases where the network weights were not provided alongside the code for the selected method, the models were trained according to the procedure outlined in each publication. An exception was made for Y-Net, as specific parameters were not given, and the provided code did not yield the expected results. Thus, a hyperparameter tuning was necessary, with the optimal parameter settings being 5.0 for Resize, 0.5 for Temperature, 31 for Kernel size, and 4 for Sigma Gaussian Kernel. Afterwards, the training was conducted for 150 epochs employing a batch size of 16, two semantic classes and a learning rate of 0.0001. Besides this, no retraining was done between the individual evaluation steps, adjusting only the necessary variables in the code to vary the number of generated trajectories.
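The leave-one-out protocol described above amounts to five train/test splits, one per scene. The sketch below only enumerates the scene names; it is not the data loading code of [11], and the lowercase identifiers and the function name `leave_one_out_splits` are hypothetical.

```python
from typing import List, Tuple

SCENES = ["eth", "hotel", "univ", "zara1", "zara2"]


def leave_one_out_splits(scenes: List[str]) -> List[Tuple[List[str], str]]:
    """Return one (training scenes, held-out test scene) pair per scene."""
    return [
        ([s for s in scenes if s != held_out], held_out)
        for held_out in scenes
    ]
```

Each model is thus trained five times, and every reported per-scene value stems from a model that has never seen that scene during training.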

C. Implementation details
Alongside the steps described in the overall evaluation procedure, additional implementation details are provided in this section.
SGAN. Both in our evaluation and as indicated in the paper itself [11], the model without the pooling module delivered slightly better results, offering at least the same performance with less complexity. Therefore, results are reported for this configuration.
Trajectron++. The model provides different evaluation modes to assess the quality of the generated predictions. For this study, the Z-Mode was used over the 'most likely'-mode employed in the paper since it delivered almost identical results while decreasing the runtime drastically.
Y-Net. The default sampling of 20 potential trajectories employs two modes, TTST and CWS. However, since clustering cannot be applied to a single trajectory, the TTST mode was disabled, reducing inference time. To limit the observed motion history, missing timesteps were overwritten with the earliest observation since Y-Net uses positional data to encode the trajectory into the provided map.
AgentFormer. For the sampling of a single trajectory, the provided pre-trained models with the suffix pre were used since the diverse sampling technique DLow was not applicable. For the feature evaluation, non-existent entries are set to −1 [27].
Within the attention matrix, the earliest observation was repeated for the remaining entries, delivering the best results.
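Where missing timesteps are overwritten with the earliest remaining observation instead of a constant, the padding can be sketched as follows. This is an illustration only, assuming observations are 2D positions; the function name `pad_with_earliest` is hypothetical and the actual changes were made inside each model's own input pipeline.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pad_with_earliest(history: List[Point], keep: int) -> List[Point]:
    """Keep the last `keep` positions and repeat the earliest of them in front.

    Unlike a constant fill, every padded entry is a position the pedestrian
    actually occupied, so a map-based encoding receives no off-path points.
    """
    if not 1 <= keep <= len(history):
        raise ValueError("keep must be between 1 and the history length")
    kept = list(history[-keep:])
    return [kept[0]] * (len(history) - keep) + kept
```

With `keep=1`, the padded history describes a pedestrian standing still at the last observed position, which carries no directional information.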

IV. RESULTS
In this section, the results of the conducted evaluation are reported. In the first part, the general performance of each model is compared between the BoN approach and the most likely trajectory sampling. Subsequently, the second and third paragraphs focus on the temporal feature requirements and the execution time for predicting an entire scenario. Within this section, the variable K refers to the number of trajectories sampled whereas N stands for the number of timesteps provided to the network.

A. Accuracy
While some methods benchmark the performance against a linear regression model as a baseline [11], [12], fitting a straight line to the observed motion history, a CVM has not been utilized as a reference in the investigated contributions before. However, due to its proven performance in previous studies [17], the CVM was selected as a baseline for this evaluation. The average values of the ADE and FDE for one and 20 generated future trajectories are presented in Fig. 2, with the exact values for each scene being contained in Appendix B. The input configuration for each model is described in Table I, where the motion history consists of eight and the prediction horizon of 12 timesteps.
In Fig. 2, blue shades represent the ADE metric and orange ones highlight the FDE scores. Lighter colors were chosen for the evaluation of a single trajectory whereas darker ones visualize the results obtained for the BoN approach, evaluating the best of K = 20 sampled trajectories against the ground truth. The horizontal, dashed lines indicate the results of the CVM, following the same color scheme. When sampling 20 distinct trajectories, the investigated models demonstrate the anticipated strong performance on the ETH/UCY dataset, outperforming the simple CVM baseline with an ADE of 0.52 m and an FDE of 1.141 m. However, we were unable to replicate the reported results of Y-Net despite extensive parameter tuning, and the values for Trajectron++ are slightly worse due to previously reported issues with the velocity and acceleration derivatives. Taking these factors into account, AgentFormer achieves the best performance in our BoN evaluation with 0.233 m and 0.393 m, respectively. With an increase in the average displacement of 3.8 cm and 10.1 cm for the FDE, Trajectron++ scores second. This changes when sampling only one trajectory and selecting the most probable one whenever possible. In this configuration, while all Final Displacement Errors now exceed one meter, the SGAN architecture is among the least affected and delivers competitive results. Conversely, AgentFormer exhibits a significant decline in performance, ranking second last in the provided overview. Trajectron++ yields the most precise predictions with 0.555 m in average and 1.162 m in final displacement. Interestingly, none of the featured models surpass the performance of the CVM which only considers information from the last timestep. Therefore, the influence of individual timesteps on the performance of each model will be examined in the following section.

B. Feature requirements
Previous studies suggest that the motion history of an agent may not be as crucial as commonly assumed [17]. Therefore, in order to gain insights into the inner workings of each method, we examined the influence of one, two, and eight observed timesteps provided to each model. Details on the necessary modifications for this analysis can be found in section III, and the summarized results are presented in figure 3.
Similar to the previous evaluation, the figure highlights ADE values in blue whereas FDE values are represented with shades of orange. The brighter the tone, the fewer observations were provided to the network. Again, the respective metrics corresponding to the performance of the CVM are indicated through dashed, horizontal lines. When focusing on the overall trend between the metrics, it can be seen that first, both metrics express a similar behavior across all investigated methods, and second, the lowest errors are achieved when all observations are considered. When the available information is reduced to one timestep, not all models degrade to the same degree. Whereas AgentFormer and Y-Net exhibit the worst performance in our evaluation with Average Displacement Errors up to 1.941 m and 3.757 m respectively, the remaining three models show almost no difference when provided with only one or two timesteps. While the ADE for Social-Implicit improves from 1.11 m to 0.745 m when adding an additional observation, the change in accuracy for Trajectron++ and SGAN is less than 1 cm. These displacements are almost identical to the ones obtained when considering the whole motion history with gains being as little as 0.4 cm in the case of Trajectron++. However, when an additional timestep is provided to both Y-Net and AgentFormer, a substantial decrease in both Average and Final Displacement Errors can be noted, with differences as high as 2.261 m and 3.635 m, respectively. Further improvements in accuracy can be observed when considering the full motion history, although the gain is significantly smaller in comparison.

C. Runtime
The runtime analysis shows a significant difference in the inference time of the compared models as visualized in figure 4. According to this, the CVM has the lowest runtime with a median inference time of 0.15 ms, while the highest median inference time was measured for Trajectron++ with 131.34 ms. However, not just the central tendency but also the runtime variability is an important measure for real-time applications. In this respect, the CVM has the smallest interquartile range with a deviation of 0.01 ms between the 90th and 10th percentiles. In contrast, Trajectron++ has the highest variability with an IQR of 197.54 ms. For the remaining models, the median runtime is 1.69 ms, 4.02 ms, and 49.01 ms for Social-Implicit, SGAN, and AgentFormer, respectively. Furthermore, it can be observed that both Social-Implicit and SGAN have a low variability with an IQR of 0.11 ms and 0.22 ms, indicating a weak runtime dependency on the number of agents in the scene. In contrast, Y-Net has a median runtime of 82.39 ms and shows a large variability with an IQR of 96.87 ms, similar to Trajectron++. In comparison, AgentFormer has an IQR of 4.92 ms. An evaluation of the relationship between the runtime and the number of agents in a scene is provided in Appendix A.
The results also indicate no direct correlation between the model runtime and the achieved ADE. Although the CVM has both the lowest runtime and the smallest ADE, the second most accurate network (Trajectron++) has the highest median runtime overall. Moreover, both Social-Implicit and SGAN achieve comparable accuracies, but their inference time deviates by more than an order of magnitude from the runtime of the CVM. While this observation holds true for AgentFormer, the Y-Net model has the highest ADE, but only the second-highest median runtime (82.39 ms).
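The runtime figures reported in this section can be collected and compared programmatically. The sketch below uses only the median and IQR values stated in the text; the table name `RUNTIMES_MS` and the helper functions are hypothetical and serve purely as a worked check of the stated relations.

```python
from typing import Dict, List, Tuple

# (median, IQR) in milliseconds, as reported in the text of this section.
RUNTIMES_MS: Dict[str, Tuple[float, float]] = {
    "CVM": (0.15, 0.01),
    "Social-Implicit": (1.69, 0.11),
    "SGAN": (4.02, 0.22),
    "AgentFormer": (49.01, 4.92),
    "Y-Net": (82.39, 96.87),
    "Trajectron++": (131.34, 197.54),
}


def ranked_by_median(table: Dict[str, Tuple[float, float]]) -> List[str]:
    """Model names sorted from the highest to the lowest median runtime."""
    return sorted(table, key=lambda name: table[name][0], reverse=True)


def slowdown_vs(baseline: str, table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Ratio of each model's median runtime to the baseline's median runtime."""
    base = table[baseline][0]
    return {name: median / base for name, (median, _) in table.items()}
```

Calling `slowdown_vs("CVM", RUNTIMES_MS)` gives a factor of roughly 11 for Social-Implicit and roughly 27 for SGAN, consistent with the statement that both deviate from the CVM by more than an order of magnitude.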
V. DISCUSSION

A. Quantitative analysis
Based on the results gathered in the previous chapter, the strengths and weaknesses of investigated approaches are revealed when analyzed for practical applicability. In terms of accuracy, AgentFormer and Trajectron++ perform best in the Best-of-N evaluation, indicating their ability to generate plausible and multimodal trajectories across diverse scenarios. For the most part, this might be attributed to the Conditional Variational Autoencoder (CVAE) employed in both architectures, as this has been shown to result in more diverse behavior [28]. Although both models additionally incorporate an attention-based mechanism to encode relevant interactions in the scene, the mechanism implemented in AgentFormer had minimal to no impact on the overall accuracy when evaluated in this study. When considering only the most likely trajectory, the accuracy of AgentFormer decreases drastically while Trajectron++ still manages to score first among the investigated approaches. This behavior can be partially attributed to the different evaluation procedures included in the latter, returning the most likely path from the employed Gaussian Mixture Model. Following Trajectron++, Social-Implicit scores second without this additional evaluation step. Notably, both of these methods employ graph-based networks, highlighting the potential advantages of utilizing this architecture to consider spatial interactions. The temporal modeling on the other hand does not seem to have a significant impact as both models utilize a maximum of two timesteps. Hence, the recurrent architecture as well as the 1D temporal graph convolution do not unfold their full potential, leaving room for improvement. Based on the results obtained from the CVM it seems disproportionate to incorporate the entire motion history in the design of future models, but we argue that the consideration is still relevant as will be shown in section V-B. Since Y-Net and AgentFormer are the only models not explicitly processing relative displacements or velocities, their differing behavior can be attributed to the missing orientational information when provided with a single timestep. To make reasonable predictions, at least two timesteps should be provided here, resulting in performance more similar to the single-timestep case for the remaining models. Therefore, architectures that incorporate object velocities alongside positional information can be seen as beneficial for practical applications, but it remains a challenge to estimate these precisely based just on raw sensor data.
In terms of execution time, we have analyzed the average inference of the models and evaluated their scalability by applying them to scenarios with varying pedestrian counts. While Trajectron++ has demonstrated the best accuracy, it lags behind in terms of inference time primarily due to its reliance on an LSTM-based architecture. On the contrary, feed-forward neural networks (FNN) hold the promise of significantly faster execution times. This is precisely why Social-Implicit stands out as the network with the fastest inference speed among the investigated models. In addition to this architectural choice, its graph representation ensures robust scalability, making it naturally adaptable to varying agent counts. Nevertheless, while Y-Net employs CNNs within its Encoder-Decoder architecture as well, it exhibits considerably longer runtimes and a less favorable scaling behavior as indicated in Appendix A. While the slow inference is caused by the model size, counting over 50 million parameters in total, its runtime spread can be attributed to the prediction of a single pedestrian at a time, leading to a linear increase in runtime. Similarly, the transformer architecture employed in AgentFormer counts over 6 million parameters, resulting in an equally high runtime but offering better scalability due to the generation of multiple trajectories at a time. Despite SGAN using a recurrent architecture based on LSTM cells, it still shows a fast inference across all scenarios due to its small size, highlighting the interdependency between model size, output generation, and architecture once again. Overall, while a parallel generation of trajectories is more suitable as it requires only a single inference pass and therefore leads to faster runtimes, a sequential approach provides flexibility when adapting to changes in the prediction horizon.
Nevertheless, when considering the trade-off between accuracy and runtime visualized in Fig. 4, the CVM emerges as the most suitable method on the dataset used. This is in contrast to the current development outlined in chapter II where Y-Net, Trajectron++, and AgentFormer were considered state-of-the-art models based on the Best-of-N evaluation approach. As these results suggest that simplistic models, such as the CVM, continue to remain competitive compared to more sophisticated, learning-based approaches, the next chapter will focus on a qualitative analysis, trying to explain this discrepancy.

B. Qualitative analysis
For the analysis provided, representative samples of the ETH/UCY dataset were selected and categorized into four commonly observed prediction scenarios highlighted in figure 5. The first category visualized with figure 5a captures situations where pedestrians move in an overall linear manner within the prediction horizon. For these scenes, a directional spread for the individual approaches is noticeable, whereas the constant velocity model and Trajectron++ manage to generate the most accurate predictions. In general, the simple CVM already provides a good approximation for these cases, but they only account for a small fraction of scenarios within the dataset [26]. Hence, non-linear predictions represent the majority of cases with the second category where a challenging scene is visualized in figure 5b. Although neural networks offer a distinct advantage when generating non-linear trajectories [4], none of the investigated models fully capture the dynamics shown. While these situations are considered unpredictable due to the inherent epistemic uncertainty involved [22], all models manage to capture the overall movement direction of the scene. Along with the first category, these types of errors occur on an operational level, where the overall intention is detected and only minor variations remain [29]. To improve the prediction in these situations, the inner state of the agent would need to be considered, which is difficult to determine by any outside observer [1]. Moreover, drawing from the field of evacuation dynamics, cultural factors, among many more, might play an important role as well [30]. While within this dataset, a linear prediction often provides a reasonable approximation for non-linear cases, neural networks have a distinct advantage in situations where environmental factors have a strong influence on decision-making [2].
As the categories a) and b) solely focus on dynamic agents, the third one comprises static ones as displayed in figure 5c. These are commonly encountered in urban scenarios and are the most prominent cause of the performance differences observed. Notably, all approaches analyzed tend to generate trajectories even in the absence of any positional changes between successive timesteps. Furthermore, the absence of any directional information results in the prediction of linear trajectories with various orientations. This kind of error is different from the one observed in the first two categories as it reflects the inability of some methods to reason about a pedestrian's goal or intent. Since human behavior is inherently goal-oriented, the resulting trajectory is strongly influenced by this aspect [29], [22]. On a data level, this bias can be explained by the under-representation of static scenarios within the given dataset. Although both AgentFormer and Y-Net consider additional timesteps in contrast to the other approaches, and Y-Net additionally reasons about a pedestrian's goal, they do not appear to leverage this information effectively. In the sample presented, the CVM provides the most precise estimation, followed by Trajectron++, which might hint at its capability to reason about different velocity profiles. In comparison, Social-Implicit is designed to explicitly handle pedestrians with varying velocity profiles, consisting of an ensemble architecture trained on subsets of the original dataset for different speeds [4]. Nevertheless, although this architecture models the motionless group, it still exhibits this bias.
The fourth and last prediction category is represented by state changes, where an agent changes its motion state between the observation and the ground truth as highlighted in figure 5d. These situations are often found in traffic environments, and approaches intended to be used in autonomous systems should be capable of handling them. Here, all investigated approaches fail to identify the intention of the pedestrians in the presented case. Although such behavior might be predictable when taking into account spatial information in urban environments, it still poses a challenge even for human drivers [31]. To the best of our knowledge, these situations have not been explicitly addressed by trajectory prediction research, and they highlight the significance of considering additional information like motion history and semantic clues besides just the last timestep [1]. While the constant velocity model provides a reasonable approximation for most dynamic and static cases, it reaches its limits for the non-homogeneous movement represented by this category as well as for the related real-world urban situations [2].

C. Limitations
Given the results presented, the question can be raised whether pedestrians predominantly move linearly and whether, for the dynamic cases, the CVM therefore delivers a good approximation of the average walking behavior observed. This hypothesis seems to be supported by the results showcased for Trajectron++. Nevertheless, it remains uncertain whether the findings presented in this study can be generalized to other datasets such as SDD [32] or nuScenes [16], as contextual information plays an important role in decision-making [2]. Hence, it is to be expected that when evaluating more complex traffic environments, the CVM will perform significantly worse for non-linear cases. As the ETH/UCY dataset exclusively comprises pedestrian data, this paper presents a general evaluation of the ability of certain methods to generate single pedestrian trajectories when confronted with the requirements of autonomous systems. To apply trajectory prediction algorithms to real-world applications, the consideration of other road users as well as the road infrastructure is vital and requires a deeper investigation [1]. Regardless, the consideration of non-homogeneous movements represents a significant challenge and an open research topic. In addition, it needs to be highlighted that with the employed Leave-One-Out Cross-Validation, five models were individually trained and tested, whereas in practical applications only one model should be utilized. Furthermore, the transferability of prediction methods across different scenes and datasets remains a challenge and an active area of research [33].
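The Leave-One-Out Cross-Validation mentioned above can be sketched as follows: each of the five ETH/UCY scenes is held out once for testing while a separate model is trained on the remaining four. The helper name `leave_one_out_splits` and the lowercase scene identifiers are illustrative assumptions, not names from the evaluated code bases.

```python
# The five scenes that make up the ETH/UCY benchmark.
SCENES = ["eth", "hotel", "univ", "zara1", "zara2"]

def leave_one_out_splits(scenes):
    """Yield (train_scenes, test_scene) pairs, one per held-out scene."""
    for held_out in scenes:
        train = [s for s in scenes if s != held_out]
        yield train, held_out

# Usage: one model would be trained and evaluated per split.
for train_scenes, test_scene in leave_one_out_splits(SCENES):
    print(f"train on {train_scenes}, test on {test_scene}")
```

This makes explicit why five separately trained models result from the protocol, in contrast to a deployed system that would rely on a single model.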

D. Future research
As shown by the quantitative and qualitative analysis outlined in this chapter, open challenges remain that need to be addressed before applying these systems in an autonomous vehicle. While the CVM seems to provide a good tool for approximating the most likely trajectory for dynamic pedestrians, it shows weaknesses for state changes and non-homogeneous cases. Such situations frequently occur in urban scenarios and require a deeper scene understanding which can only be reflected through learning-based approaches. Consequently, the focus should be shifted to the development of approaches that fulfill two key criteria: first, the integration of spatial information alongside an agent's motion history, and second, the effective utilization of these features, as not all architectures currently achieve this. These considerations also play an important role in improving the overall reliability of these methods, as the detections used are usually derived from noisy sensor data. In addition, the focus should be shifted towards traffic-oriented datasets to better align with situations encountered in real-world applications. As for the model architecture, the development of hybrid approaches can also be considered, training a network explicitly for the detection of state changes and static cases while using the CVM to approximate the overall dynamics.
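One possible shape of such a hybrid approach is sketched below. It is an illustration under stated assumptions rather than a proposed implementation: the learned detector for static cases is replaced by a simple speed threshold (or any user-supplied callable `classifier`), and the function name `hybrid_predict` and the value of `static_threshold` are hypothetical.

```python
import numpy as np

def hybrid_predict(history, horizon=12, static_threshold=0.05, classifier=None):
    """Predict one trajectory: hold position if static, else use the CVM.

    history: array-like of shape (T_obs, 2), T_obs >= 2.
    classifier: optional callable returning True for a static agent; it is a
        stand-in for a trained network (assumption, not from the paper).
    static_threshold: mean displacement per timestep below which the agent is
        treated as static when no classifier is given (illustrative value).
    """
    history = np.asarray(history, dtype=float)
    if classifier is not None:
        is_static = bool(classifier(history))
    else:
        step_lengths = np.linalg.norm(np.diff(history, axis=0), axis=1)
        is_static = step_lengths.mean() < static_threshold
    if is_static:
        # Static case: repeat the last observed position over the horizon.
        return np.repeat(history[-1][None, :], horizon, axis=0)
    # Dynamic case: constant velocity extrapolation from the last two positions.
    velocity = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * velocity
```

A learned classifier could be passed in via `classifier` to also flag imminent state changes; the threshold fallback only covers agents that are already standing still.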

VI. CONCLUSION
In this study, we have conducted a comprehensive evaluation of state-of-the-art pedestrian trajectory prediction methods when confronted with the requirements found in autonomous systems. Our analysis focused on measuring the ADE and FDE metrics based on single trajectories, investigating the impact of individual timesteps within the motion history, and measuring the overall runtime for different scenario sizes. Our findings demonstrate that Trajectron++ and Social-Implicit, which leverage graph-based interaction modeling, yield the most accurate results among the investigated architectures. Furthermore, we discovered that many models underutilize the available motion history, with the majority of approaches only considering the first two timesteps.
When evaluating the trade-off between accuracy and runtime, Social-Implicit demonstrates the best performance overall, ranking second only to the CVM on the ETH/UCY dataset. This result can largely be attributed to the shortcomings of learning-based approaches in handling static scenarios. In addition, state changes represent a challenge for all investigated approaches and require the inclusion of additional features such as semantic maps. Consequently, although the CVM provides a good approximation for most dynamic cases, improvements can still be achieved when considering a pedestrian's intention. Moving forward, this study offers valuable insights for three distinct groups: researchers in the autonomous driving domain intending to implement these algorithms in autonomous systems, scientists dedicated to enhancing existing approaches in terms of runtime and accuracy, and those seeking to devise a new metric that accurately represents the requirements of real-world applications. While future research directions are given by the development of hybrid approaches or the incorporation of intention recognition into the trajectory prediction framework, it remains an open question whether some pedestrian behavior might simply not be predictable.

ACKNOWLEDGMENT
As the first author, Nico Uhlemann initiated the idea of this paper and contributed essentially to its conception, implementation, and content. Felix Fent contributed to the conception of this research, the analysis of the generated data, and the revision of the research. Markus Lienkamp made an essential contribution to the conception of the research project and revised the paper critically for important intellectual content. He gave final approval of the version to be published and agreed with all aspects of the work. As a guarantor, he accepts responsibility for the overall integrity of the paper. The authors would like to thank their project partner Enway GmbH, as well as the Munich Institute of Robotics and Machine Intelligence (MIRMI) for their support.
Nico Uhlemann, Felix Fent and Markus Lienkamp are with the Technical University of Munich, Germany; School of Engineering & Design, Institute of Automotive Technology and Munich Institute of Robotics and Machine Intelligence (MIRMI)

Fig. 1. The three features commonly employed for pedestrian trajectory prediction: the motion history of each agent, spatial information, and interactions.

Fig. 4. Runtime of the pedestrian trajectory prediction models, represented by their median inference time over all scenarios and their variability as interquartile range (IQR) between the 90th and 10th percentiles, over the Average Displacement Error for K = 1 and N = 8.

Fig. 5. Selected samples from the ETH scenario, showcasing four commonly observed pedestrian motion patterns alongside the predictions of each investigated model.

Fig. 6. Runtime over the number of agents in a scene for the a) Y-Net and b) Trajectron++ model.