Quantifying the Complexity of Standard Benchmarking Datasets for Long-Term Human Trajectory Prediction

Methods to quantify the complexity of trajectory datasets are still a missing piece in benchmarking human trajectory prediction models. In order to gain a better understanding of the complexity of trajectory prediction tasks, and following the intuition that more complex datasets contain more information, an approach for quantifying the amount of information contained in a dataset from a prototype-based dataset representation is proposed. The dataset representation is obtained by first employing a non-trivial spatial sequence alignment, which enables a subsequent learning vector quantization (LVQ) stage. A large-scale complexity analysis is conducted on several human trajectory prediction benchmarking datasets, followed by a brief discussion of implications for human trajectory prediction and benchmarking.


I. INTRODUCTION
With the emergence of autonomous vehicles and advances in the field of intelligent robots in general, the task of human trajectory prediction has gained a significant amount of research interest in recent years. Besides more classical, physics-based prediction approaches, e.g. building on the Kalman filter [1] or the social forces model [2], a range of deep learning approaches have been proposed to tackle the problem. The most common deep learning models build around long short-term memory networks (e.g. [3]), convolutional neural networks (e.g. [4]), generative adversarial networks (e.g. [5]) or transformers (e.g. [6]) and vary in the contextual cues considered for prediction. Commonly used contextual cues include social (e.g. [7] [8]) and environmental (e.g. [9] [10]) cues. For a comprehensive overview of existing human trajectory prediction approaches, the reader is referred to [11] [12] [13]. Inherent to prediction model development is the need for proper benchmarks aimed at measuring a model's prediction performance. Due to the direct relation between dataset complexity and model capacity, creating a benchmark for human trajectory prediction that is neither too simple nor too hard to solve is still a difficult task. On large datasets, even simple models overfit, while in other cases the prediction performance is poor on individual samples, even for high capacity models. Here, one of the difficulties is the open question of how the complexity of a given dataset for trajectory prediction can be quantified. As a consequence, current attempts at standardized benchmarking originate from heuristics or experience-based criteria when assembling the data basis. Recent examples are the TrajNet challenge [14] or its extension TrajNet++ [11] [13].
Currently, in human trajectory prediction, the analysis of benchmarking data only plays a minor role and focuses on a specific aspect of the data. The most common subject of analysis is the existence of social interaction resulting in nonlinear behavior with respect to motion. For such analyses, social force models and collision avoidance methods [15] or deviations from regression fits [3] [16] are employed, for example. Such methods are also used in TrajNet++ for splitting up the benchmark into interaction and non-interaction tasks. Besides that, basic analyses include dissecting velocity profiles or positional distributions [17]. Consequently, there is still a lack of approaches trying to analyze datasets as a whole.
When targeting a qualitative analysis of data complexity, a common approach is to use low-dimensional embeddings for data visualization, for example t-SNE [18] or variations of PCA [19]. While such approaches are viable for non-sequential, high-dimensional data, a prototype-based clustering approach seems more viable for sequential data. This is especially true for trajectory datasets, where each dataset can be reduced to a small number of prototypical sub-sequences specifying distinct motion patterns, and each sample can be assumed to be a variation of these prototypes. Additionally, in the context of statistical learning, the complexity of a dataset can be closely related to its entropy. Measuring the dataset entropy, i.e. the amount of information contained in a dataset, is still an open question in the context of trajectory datasets, which has gained interest recently [20] [21].
Towards this end, an approach for estimating an entropy-inspired measure, the pseudo-entropy, building on a dataset decomposition generated by an adequate pre-processing and clustering method is proposed. For decomposing a given dataset into clusters of distinct velocity-agnostic motion patterns, a spatial alignment¹ step followed by vector quantization is applied. Given the dataset decomposition, the pseudo-entropy is estimated by analyzing the prediction performance of a simple trajectory prediction model when gradually enriching its training data with additional motion patterns. This paper is an extension of [20] and focuses mainly on dataset complexity. The main contributions are:
1) A learning and heuristics-based approach for finding a velocity-agnostic, prototype-based representation of trajectory datasets.
2) An approach for estimating trajectory dataset complexity in terms of an entropy-inspired measure.
3) A coarse complexity-based ranking of standard benchmarking datasets for human trajectory prediction.
The paper is structured as follows. Sections II and III present a data pre-processing approach necessary for the actual dataset entropy estimation detailed in section IV. In addition to a coarse dataset ranking, the evaluation section V discusses the ranking, as well as the approach and methods used throughout this paper. Further, some interesting findings resulting from the analysis are discussed. Section VI summarizes the paper, lists potential implications for the state of benchmarking and gives a brief discussion on potential future research directions.
For convenience, several definitions and notations used throughout the paper are listed below:
• A trajectory $X$ is defined as an ordered set² of points $\{x_1, ..., x_N\}$.
• The length of a (sub-)trajectory $X$ always refers to its cardinality $|X|$, rather than the spatial distance covered.
• The distance between two trajectories $X$ and $Y$ of the same length $|X| = |Y|$ is defined as $d_{tr}(X, Y) = \sum_{j=1}^{|X|} \lVert x_j - y_j \rVert_2$.
• The number of samples, the trajectory length and the number of prototypes are denoted as $N$, $M$ and $K$, respectively. For indexing, $i$, $j$ and $k$ are used.
• The $q$-quantile, with $q \in [0, 1]$, of a set of numbers $\{\cdot\}$ is denoted as $Q_q(\{\cdot\})$.
¹ Not to be confused with a temporal sequence alignment.
² Strictly speaking, it is not a true mathematical set, as it might contain duplicates.
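The trajectory distance and the quantile notation can be illustrated with a minimal Python sketch. The function names `trajectory_distance` and `quantile` are illustrative and not taken from the paper, and linear interpolation between order statistics is an assumed convention, since the text does not specify one.

```python
import math

def trajectory_distance(X, Y):
    """d_tr: sum of point-wise Euclidean distances of two equal-length trajectories."""
    if len(X) != len(Y):
        raise ValueError("trajectories must have the same length")
    return sum(math.dist(x, y) for x, y in zip(X, Y))

def quantile(values, q):
    """q-quantile Q_q with linear interpolation (an assumed convention)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
```

Trajectories are represented here as lists of 2D point tuples, which is only one of several reasonable choices.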

II. SPATIAL SEQUENCE ALIGNMENT
With the goal of reaching a velocity-agnostic, prototype-based trajectory dataset representation in mind, the trajectory alignment approach proposed in this section fulfills two integral roles as a pre-processing step for the subsequent clustering. On the one hand, it aligns the data, such that similar patterns are pooled together. On the other hand, it removes variations in velocity among trajectories, thereby generating a dataset with normalized velocity. This, in turn, is essential for obtaining a velocity-agnostic dataset representation. In addition, velocity normalization ensures that similar motion patterns which only vary in velocity, i.e. in scale, can be pooled together.
Given a set of trajectories (samples) $\mathcal{X} = \{X_1, ..., X_N\}$, as sequences of $M$ subsequent points $X_i = \{x_1^i, ..., x_M^i\}$, each sample is first normalized by moving it into an arbitrary reference frame and scaling it to unit length, yielding $X_i^{norm}$. Here, $\bar{x}$ denotes the centroid of a trajectory. It has to be noted that this normalization solely serves the purpose of moving all samples into a common value range, and therefore it is not a good normalization in terms of pooling similar samples.
Then, all samples are aligned relative to a single learned prototype $\hat{Y} = \{\hat{y}_1, \hat{y}_2, ..., \hat{y}_M\}$ by using similarity transformations, which are retrieved from a regression model $\phi : X \rightarrow \{t, \alpha, s\}$ with translation $t$, rotation angle $\alpha$ and scale $s$. $\hat{Y}$ and $\phi$ are learned by minimizing the mean squared error between each aligned sample $X_i^\phi = \phi(X_i^{norm})$ and the prototype $\hat{Y}$ using stochastic gradient descent. This is different from linear factor models, where the whole training set has to be considered. With respect to equation 2 and the similarity transformation, the trivial solution that maps all samples onto the zero-vector has to be avoided. A brute force approach to this problem is to enforce a minimum scale for the prototype.
These steps result in a minimum variance alignment of all samples with respect to the learned prototype. Further, by learning the prototype and the transformation concurrently, the prototype adapts to the most dominant motion pattern and the normalized data is aligned accordingly. An exemplary result of this alignment approach is depicted in figure 1. By aligning all samples with a single prototype, aligned samples have a common orientation and form clusters of similar samples.
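The normalization and the similarity transformation can be sketched as follows. This is a minimal illustration under stated assumptions and not the paper's implementation: "unit length" is read as unit path length, the transformation is applied in the order rotate, scale, translate, and the parameters $t$, $\alpha$ and $s$ are passed in directly instead of being predicted by a learned regression model $\phi$. All function names are hypothetical.

```python
import math

def normalize(X):
    """Center a trajectory on its centroid and scale it to unit path length.

    Assumption: "unit length" means the summed distance between
    consecutive points equals 1; the text does not spell this out.
    """
    n = len(X)
    cx = sum(p[0] for p in X) / n
    cy = sum(p[1] for p in X) / n
    path = sum(math.dist(X[j], X[j + 1]) for j in range(n - 1))
    scale = 1.0 / path if path > 0 else 1.0
    return [((x - cx) * scale, (y - cy) * scale) for x, y in X]

def apply_similarity(X, t, alpha, s):
    """Rotate by alpha, scale by s, then translate by t (assumed order)."""
    c, si = math.cos(alpha), math.sin(alpha)
    return [(s * (c * x - si * y) + t[0], s * (si * x + c * y) + t[1])
            for x, y in X]

def alignment_error(X_aligned, prototype):
    """Mean squared point-wise error between an aligned sample and the prototype."""
    return sum(math.dist(x, y) ** 2
               for x, y in zip(X_aligned, prototype)) / len(prototype)
```

In the approach described above, the parameters fed to `apply_similarity` would come from the regression model, and `alignment_error` would be averaged over mini-batches and minimized with stochastic gradient descent.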

III. LEARNING VECTOR QUANTIZATION
Clustering approaches can be applied after the spatial alignment, as it distributes random errors homogeneously over the sequence and exposes clusters of motion patterns. In the landscape of clustering approaches, there exists a wide range of approaches to choose from, ranging from simple established approaches (e.g. k-means [22], DBSCAN [23] or learning vector quantization [24]) to more sophisticated or specialized neural models (e.g. [25] [26] [27]). In the context of this paper, the choice of the clustering approach itself only plays a minor role, thus a learning vector quantization (LVQ) approach is employed, as it can directly be integrated into a deep learning framework. For a more comprehensive review on clustering approaches, the reader may be referred to recent surveys, for example [28] or [29].
The resulting pipeline corresponds to the encoder part of an auto-encoding architecture for representation learning inspired by [30], which is capable of learning meaningful representations. Here, aligned samples are mapped onto $K$ prototypes³ $\mathcal{Z} = \{Z_1, ..., Z_K\}$, with $Z_k = \{z_1^k, ..., z_M^k\}$, in quantized space. This results in a concise set of prototypes representing the given dataset. The prototypes are learned by minimizing a loss $\mathcal{L}$, which combines the mean squared error $\mathcal{L}_{LVQ}$ between the aligned samples $\mathcal{X}^\phi = \{X_1^\phi, ..., X_N^\phi\}$ and the respective closest prototype $Z_{z(i)}$ in quantized space with a regularization term $\mathcal{L}_{reg}$, which is discussed in section III-B. Here, $z(i)$ denotes the index of the closest prototype for a sample $X_i^\phi$. Note that due to the fixed trajectory length, the mean squared error is a suitable similarity measure. If the length varied, a more sophisticated measure would be necessary [31].
When using $\mathcal{L}$ for learning the LVQ parameters, two aspects have to be considered:
1) As the value for $K$ is unknown a priori, it should in general be chosen larger than expected.
2) Due to the winner-takes-all strategy, $\mathcal{L}_{LVQ}$ only updates prototypes that have supporting samples.
In order to achieve consistent training and quantization results under these conditions, the following sections present approaches for initialization (III-A), regularization (III-B) and refinement (III-C). While the initialization and regularization primarily address 2), the refinement, which builds upon the expected results of the proposed initialization and regularization approaches, addresses 1).
³ These are distinct from the alignment prototype $\hat{Y}$.
For describing these approaches, the support of a prototype plays an integral role. The support $\pi(\mathcal{Z})_k$ of each prototype $Z_k \in \mathcal{Z}$ is aggregated over the aligned dataset $\mathcal{X}^\phi$.
A resulting set of prototypes, obtained for the dataset shown in figure 1 using the approach described in this section and the following subsections, is depicted in figure 2. It can be seen that the prototypes cover a certain range of motion patterns: constant velocity, curvilinear motion, acceleration and deceleration.
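The winner-takes-all assignment and the notion of support can be sketched in a few lines of Python. The sketch assumes that the support of a prototype is the number of aligned samples for which it is the closest prototype, which matches how the support is later compared against a sample-count threshold; the function names are illustrative.

```python
import math

def squared_error(X, Z):
    """Mean squared point-wise error between a sample and a prototype."""
    return sum(math.dist(x, z) ** 2 for x, z in zip(X, Z)) / len(X)

def closest_prototype(X, prototypes):
    """Index z(i) of the prototype with the smallest error (winner takes all)."""
    return min(range(len(prototypes)),
               key=lambda k: squared_error(X, prototypes[k]))

def support(samples, prototypes):
    """Number of samples assigned to each prototype (assumed definition)."""
    counts = [0] * len(prototypes)
    for X in samples:
        counts[closest_prototype(X, prototypes)] += 1
    return counts
```

A prototype with a support of zero never wins and therefore receives no update from the winner-takes-all loss, which is the problem the initialization and regularization steps are meant to mitigate.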

A. INITIALIZATION
The main objective of the initialization step is two-fold. On the one hand, the number of out-of-distribution prototypes should be reduced in the initial set of prototypes $\mathcal{Z}_{init}$. On the other hand, $\mathcal{Z}_{init}$ has to be spread across the data $\mathcal{X}^\phi$, in order to identify different motion patterns more consistently.
VOLUME 4, 2016
Taking this into account, the alignment prototype $\hat{Y}$ is set as the first prototype $Z_1$, as it should resemble the most dominant motion pattern. Under the assumption that other relevant motion patterns are dissimilar to $\hat{Y}$, a Forgy initialization [32] is applied for initializing the remaining $K - 1$ prototypes. Accordingly, the remaining prototypes are randomly selected from a subset $\mathcal{X}' \subset \mathcal{X}^\phi$, where $\mathcal{X}'$ contains all samples whose distance to $\hat{Y}$ lies between a lower threshold $\tau_{i_{low}}$ and an upper threshold $\tau_{i_{high}}$. The thresholds $\tau_{i_x}$ are defined as the $q_{i_x}$-quantiles of all sample distances with respect to $\hat{Y}$. The upper bound $\tau_{i_{high}}$ is employed to reduce the risk of initializing a prototype with out-of-distribution samples from $\mathcal{X}^\phi$. Depending on the choices for $q_{i_{low}}$ and $q_{i_{high}}$, there might not be enough samples to choose from ($|\mathcal{X}'| < K - 1$). In this case, $q_{i_{low}}$ can be gradually reduced until $|\mathcal{X}'| \geq K - 1$. An example for $\mathcal{X}^\phi$ and $\mathcal{X}'$ with $q_{i_{low}} = 0.9$ and $q_{i_{high}} = 0.95$ is given in figure 3.
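The initialization can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the candidate set is assumed to contain all samples whose distance to the alignment prototype lies between the two quantile thresholds, the step size used for relaxing the lower quantile is a hypothetical parameter, and the quantile uses linear interpolation.

```python
import math
import random

def trajectory_distance(X, Y):
    """d_tr: sum of point-wise Euclidean distances."""
    return sum(math.dist(x, y) for x, y in zip(X, Y))

def quantile(values, q):
    """q-quantile with linear interpolation (assumed convention)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def init_prototypes(samples, align_proto, K,
                    q_low=0.9, q_high=0.95, step=0.05, seed=0):
    """Alignment prototype first, then K - 1 Forgy picks from the quantile band."""
    rng = random.Random(seed)
    dists = [trajectory_distance(X, align_proto) for X in samples]
    tau_high = quantile(dists, q_high)  # upper bound against outliers
    while True:
        tau_low = quantile(dists, q_low)
        candidates = [X for X, d in zip(samples, dists)
                      if tau_low <= d <= tau_high]
        if len(candidates) >= K - 1 or q_low <= 0.0:
            break
        q_low = max(0.0, q_low - step)  # gradually relax the lower bound
    chosen = rng.sample(candidates, min(K - 1, len(candidates)))
    return [align_proto] + chosen
```

Relaxing only the lower quantile keeps the upper bound in place, so samples far from the alignment prototype remain excluded even when few candidates are available.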

B. REGULARIZATION
While the initialization helps in increasing the average support for each prototype⁴, some out-of-distribution samples⁵ might be assigned to individual prototypes, resulting in little support from other samples.
To ensure optimization of all prototypes, a regularization term $\mathcal{L}_{reg}$ is employed, which is designed to move out-of-distribution prototypes closer to more relevant samples or clusters of samples. Following this, different definitions for $\mathcal{L}_{reg}$ can be used. On the one hand, $\mathcal{L}_{reg}$ could move all prototypes $Z_k \in \mathcal{Z} \setminus \{Z^*\}$ towards the most supported prototype $Z^* = Z_{k^*}$, where $k^* = \arg\max_k \pi(\mathcal{Z})_k$. Under ideal conditions, $Z^*$ should be equal to the alignment prototype $\hat{Y}$, which roughly represents the overall mean of the dataset, and $\mathcal{L}_{reg}$ behaves accordingly. In practice, however, this assumption might not hold due to noise, which makes the optimization less predictable. Hence, in the following, $\mathcal{L}_{reg}$ is defined to move all prototypes towards the global mean by minimizing the global error. Intuitively, by choosing an appropriate value for the regularization weight $\gamma$, this definition of $\mathcal{L}_{reg}$ moves low-support prototypes into more reasonable areas within quantized space, and the winner-takes-all loss function $\mathcal{L}_{LVQ}$ keeps them within range of relevant sample clusters. Additionally, when $K$ is too large, superfluous prototypes are very similar after optimization.
⁴ Compared to a simple random initialization.
⁵ Outliers or trajectories with annotation errors.
As a side note, very imbalanced prototype sets, in terms of many low-support prototypes, can also be measured by the perplexity score $P_Z$. As $P_Z$ is not directly derived from $\mathcal{Z}$, it is not a good term for optimization, and thus for $\mathcal{L}_{reg}$. Nevertheless, $P_Z$ can be used later on when assessing the complexity of a dataset.
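For illustration, a perplexity score over prototype supports can be computed as below. The sketch assumes the standard definition of perplexity as the exponential of the Shannon entropy of the assignment distribution; the exact formula used for $P_Z$ is not reproduced in this text, so this is an assumption, and the function name is hypothetical.

```python
import math

def perplexity(support_counts):
    """exp(H) of the prototype assignment distribution (assumed definition)."""
    total = sum(support_counts)
    if total == 0:
        raise ValueError("at least one sample must be assigned")
    # Prototypes without support contribute nothing to the entropy.
    probs = [c / total for c in support_counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition the score equals the number of prototypes when all supports are equal and approaches 1 when a single prototype dominates, which is why it can indicate imbalanced prototype sets.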

C. REFINEMENT
Finally, a heuristic refinement scheme, building on the expected results when using the regularization approach presented in section III-B, is employed in order to remove unnecessary prototypes when $K$ was chosen too large. The refinement step consists of two phases. In the first phase, low-support prototypes are removed from $\mathcal{Z}$ by using a dataset-dependent threshold $\tau_{phase1} = \epsilon_{phase1} \cdot |\mathcal{X}|$.
The second phase revolves around removing prototypes similar to the most supported prototype $Z^*$. It is assumed that $Z^*$ is close to the global mean of the dataset. This implies that superfluous prototypes are driven towards $Z^*$ because of the global mean regularization, which allows these prototypes to be detected and removed. For assessing similarity, prototypes $Z_k \in \mathcal{Z} \setminus \{Z^*\}$ are first aligned with $Z^*$ in terms of their starting points ($z_1^k = z_1^*$) and initial orientations. An aligned prototype $Z_k$ is then considered similar to $Z^*$ when at least $\epsilon_{phase2} \cdot 100\,\%$ of its points are in close proximity to the respective points of $Z^*$. Here, $\tau(Z^*)_j$ is the per-point distance threshold of the $j$-th trajectory point, calculated from the supporting samples of $Z^*$. The 0.99-quantile is used instead of the maximum to exclude outliers in the data. A visual example for determining similarity is given in figure 4.
FIGURE 4. Example of assessing prototype similarity in the second phase of the refinement scheme. The first row shows the alignment of a prototype $Z_k$ (blue) with the highest-support prototype $Z^*$ (red). The second row shows how the maximum distance per point is estimated for determining similarity. In this example, as seen on the right, $Z_k$ is only within similarity range for 4 of 8 points, thus it is determined as dissimilar when choosing an overlap factor $\epsilon_{phase2} > 0.5$.
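Both refinement phases can be sketched compactly. The sketch makes several assumptions that are not confirmed by the text: a prototype survives phase one when its support is at least the threshold, the per-point thresholds are passed in precomputed, and the prototype is assumed to be already aligned with the most supported prototype. The names `phase1_keep`, `is_similar`, `eps_phase1` and `eps_phase2` are illustrative.

```python
import math

def phase1_keep(support_counts, n_samples, eps_phase1=0.04):
    """Indices of prototypes whose support reaches eps_phase1 * n_samples."""
    tau = eps_phase1 * n_samples
    return [k for k, c in enumerate(support_counts) if c >= tau]

def is_similar(Z_k, Z_star, tau_star, eps_phase2=0.9):
    """True if at least a fraction eps_phase2 of points lies within tau_star.

    Assumes Z_k was already aligned with Z_star (start point and initial
    orientation); that alignment step is omitted in this sketch.
    tau_star holds one distance threshold per trajectory point.
    """
    close = sum(1 for z, zs, tau in zip(Z_k, Z_star, tau_star)
                if math.dist(z, zs) <= tau)
    return close / len(Z_star) >= eps_phase2
```

With 4 of 8 points inside the threshold, `is_similar` returns False for any overlap factor above 0.5, mirroring the example described for figure 4.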

IV. ESTIMATING TRAJECTORY DATASET ENTROPY
This section discusses an attempt to move towards a thorough complexity analysis of human trajectory prediction benchmarking datasets. While previous work (e.g. [15], [17], [3]) focuses on statistics directly derived from the datasets, like histograms or deviations from linear prediction, the approach proposed in the following relies on a dataset decomposition $\mathcal{X}^\phi_{decomp} = \{\mathcal{X}_1, ..., \mathcal{X}_k\}$ learned by the LVQ model introduced in section III, i.e. the clusters $\mathcal{X}_k \subset \mathcal{X}^\phi$ assigned to each prototype $Z_k \in \mathcal{Z}$ after refinement. This decomposition is used for estimating an entropy-inspired measure for a given trajectory dataset $\mathcal{X}$, referred to as the pseudo-entropy $H_{pseudo}(\mathcal{X})$. Given a dataset, it is assumed that a high information content yields a high entropy value, which in turn gives a direct quantification of complexity. Compared to the OpenTraj approach described in [21], which was developed independently at the same time and mainly focuses on analyzing raw trajectory data, the approach presented in this paper focuses on the analysis of more abstract trajectory data in alignment space and clusters of motion patterns extracted from aligned data.
As an initial proof of concept, $H_{pseudo}(\mathcal{X})$ is estimated using the change in prediction performance of a simple machine learning-based trajectory prediction model when gradually increasing the amount of information in the training data. The amount of information in the training data can be controlled to a certain extent by using the dataset decomposition in alignment space generated by the LVQ model. In general, most current trajectory datasets are strongly biased towards linear motion patterns, thus other, more complex patterns are less relevant and carry more information. Following this, by ordering the clusters in $\mathcal{X}^\phi_{decomp}$ by decreasing relevance, a training set initially consisting only of $\mathcal{X}_1 \in \mathcal{X}^\phi_{decomp}$ can be gradually enriched with information by first adding $\mathcal{X}_2$, then $\mathcal{X}_3$ and so on. Then, a simple prediction model $M$ is trained on each composed training dataset. In this paper, $M$ consists of a learned linear input transformation with weights $W$ and bias $b$, both learned from data, applied to $X_f$, a flattened vector given by concatenating all points of a trajectory $X$. This model is assumed to be able to model each relevant motion pattern in isolation. Due to the lack of modeling capacity, it is expected that the prediction error increases with the increase of information present in the data. Finally, $H_{pseudo}(\mathcal{X}^\phi)$ is calculated by counting significant prediction error increases as depicted in algorithm 1.
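The counting procedure can be sketched as follows. This is a hedged illustration of the idea only and not a reproduction of algorithm 1: the significance criterion (a relative increase over the last accepted error level, controlled by the hypothetical parameter `min_rel_increase`) and the callable `train_and_eval`, which stands in for training and evaluating the simple model, are assumptions.

```python
def pseudo_entropy(clusters, train_and_eval, min_rel_increase=0.1):
    """Count significant error increases while enriching the training set.

    clusters: clusters ordered by decreasing relevance (support).
    train_and_eval: hypothetical callable that trains the simple model on a
        list of samples and returns its prediction error.
    min_rel_increase: assumed significance criterion (relative increase).
    """
    train_set = []
    reference = None
    count = 0
    for cluster in clusters:
        train_set = train_set + list(cluster)  # add the next motion pattern
        error = train_and_eval(train_set)
        if reference is None:
            reference = error  # error on the most relevant cluster alone
        elif error > reference * (1.0 + min_rel_increase):
            count += 1
            reference = error  # new error level to compare against
    return count
```

Comparing against the last accepted error level rather than the immediately preceding error is a design choice of this sketch; it prevents a sequence of small increases from being counted individually.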

Algorithm 1 Estimate Dataset Pseudo-Entropy

V. EVALUATION
This section starts with a setup common to all experiments, followed by a qualitative evaluation of the simple prediction model described in section IV and the LVQ model for dataset decomposition. Next, a coarse ranking of standard benchmarking datasets for long-term human trajectory prediction based on pseudo-entropy is given. The section closes with a discussion of the approach and methodology itself, other possible factors contributing to dataset complexity and interesting findings.
Evaluations are conducted on scenes taken from the following frequently used benchmarking datasets: BIWI Walking Pedestrians ( [33], abbrev.: biwi), Crowds by example (also known as the UCY dataset, [34], abbrev.: crowds) and the Stanford Drone Dataset ( [15], abbrev.: sdd). Besides being typically used for evaluating human trajectory prediction models in the literature, the original TrajNet challenge was built around these datasets. The scenes in the datasets are denoted as Dataset: Scene Recording, e.g. recording 01 of the zara scene in the crowds dataset is denoted as crowds:zara01. Note that for sdd, different recordings of the same scene do not necessarily capture the same campus area (but there might be some overlap). An overview of statistical details of the datasets is given in appendix A.
For trajectory prediction tasks targeted in this section, trajectories are split in half in order to obtain observation and target sequences. The prediction error is reported in terms of the average displacement error.
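The split into observation and target sequences and the average displacement error can be written down directly. In this minimal sketch, rounding down for odd trajectory lengths is an assumption, and the function names are illustrative.

```python
import math

def split_in_half(X):
    """Split a trajectory into observation and target halves."""
    half = len(X) // 2  # rounds down for odd lengths (assumption)
    return X[:half], X[half:]

def average_displacement_error(predicted, target):
    """Mean Euclidean distance between predicted and ground truth points."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(target)
```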

A. SETUP
First, the datasets are augmented to have a common sample frequency. The biwi and crowds scenes already have the same sample frequency of 2.5 samples per second, thus the sample frequency of all the sdd scenes is adjusted accordingly.
Next, as the prototype-based representation only works with trajectories of the same length, an appropriate sequence length has to be chosen for each dataset in the evaluation. The most commonly used sequence length in recent benchmarks is M = 20 (8 for observation and 12 for predicting) points per trajectory. Setting M = 20 for all datasets might, however, lead to smaller datasets having only few trajectories left for learning the LVQ, as the average trajectory length varies greatly between datasets. Because of this, the q-quantile of trajectory lengths per dataset is chosen, i.e. a common but dataset-dependent sequence length. After choosing M , the training datasets are assembled by collecting all possible (sub-)trajectories with length M from each respective dataset, in order to provide as much data as possible. Additionally, for achieving a more meaningful result, the average of multiple sequence lengths, i.e. time scales, is used for calculating the pseudo-entropies. Thus, q = 0.1 and q = 0.25 are used for evaluation, ensuring that a greater portion of the dataset remains while removing less interesting trajectories in terms of long-term trajectory prediction.
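The choice of a dataset-dependent sequence length and the collection of all sub-trajectories can be sketched as below. The sketch assumes that "all possible (sub-)trajectories" refers to a sliding window with a stride of one point and that the quantile of the lengths is taken by rounding the rank down; both are assumptions, as are the function names.

```python
def choose_length(trajectories, q):
    """q-quantile of the trajectory lengths, rounding the rank down (assumption)."""
    lengths = sorted(len(X) for X in trajectories)
    return lengths[int(q * (len(lengths) - 1))]

def sub_trajectories(X, M):
    """All contiguous sub-trajectories of length M (sliding window, stride 1)."""
    return [X[i:i + M] for i in range(len(X) - M + 1)]

def assemble_dataset(trajectories, q):
    """Collect all length-M sub-trajectories, with M chosen per dataset."""
    M = choose_length(trajectories, q)
    samples = []
    for X in trajectories:
        samples.extend(sub_trajectories(X, M))  # shorter trajectories yield none
    return M, samples
```

A small value of q keeps most trajectories in the dataset, since only those shorter than the chosen length contribute no samples.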
Then, trajectories not exceeding a dataset-dependent minimum speed⁶ $s_{min}$ are filtered out. The reason for this is twofold. First, statistical models are worse at modeling trajectories of slow-moving persons, as their behavior becomes less predictable [35]. Second, and as a consequence, these trajectories generally do not contain viable motion patterns to extract. For this evaluation, the minimum speed is calculated heuristically for a given training dataset $\mathcal{X}$ containing all possible (sub-)trajectories of length $M$. Here, $m_{speed}(i)$ denotes the average speed of the $i$-th trajectory $X_i \in \mathcal{X}$. Finally, for each dataset and sequence length, the training of the alignment and LVQ networks is run 10 times. The number of initial prototypes is set to $K = 10$ for all datasets. If not stated otherwise, the refinement parameters are set to $\epsilon_{phase1} = 0.04$ and $\epsilon_{phase2} = 0.9$.
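The speed-based filtering can be illustrated with a short sketch. Speed is measured as the average distance between subsequent trajectory points, as stated in the text; the heuristic for computing the minimum speed itself is not reproduced here, so `s_min` is simply passed in as a parameter, and the function names are illustrative.

```python
import math

def average_speed(X):
    """Average distance between subsequent trajectory points."""
    steps = [math.dist(X[j], X[j + 1]) for j in range(len(X) - 1)]
    return sum(steps) / len(steps)

def filter_slow(trajectories, s_min):
    """Keep only trajectories whose average speed exceeds s_min."""
    return [X for X in trajectories if average_speed(X) > s_min]
```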

B. SIMPLE PREDICTION MODEL CAPABILITIES
The pseudo-entropy estimation approach assumes that the learned linear transformation prediction model $M$ is capable of modeling basic motion patterns in isolation. In order to verify this, $M$ has been trained on samples of three clusters corresponding to constant, accelerated and curvilinear motion taken from the decomposed sdd:hyang04 dataset. Then, $M$ was tasked to predict the remainder of each cluster prototype, given its first half (9 trajectory points in this case). The results, depicted in figure 5, show its viability. The ground truth is depicted in red and the prediction in blue.
⁶ The average distance between subsequent trajectory points.


C. LVQ DATASET DECOMPOSITION: QUALITY, CONSISTENCY AND SENSITIVITY
In this section, the viability of the LVQ model given an aligned dataset is evaluated, as it builds the basis of the proposed approach for estimating the information content. This evaluation employs three exemplary datasets of varying assumed complexity, the biwi:eth, crowds:zara01 and the sdd:hyang04 dataset. The evaluation targets the quality and consistency of resulting dataset decompositions, as well as the approach sensitivity to its refinement parameters. Due to the similarities of datasets used throughout the evaluation, it is assumed that the findings of this section will carry over to the other datasets. Starting with consistency, the 10 available training iterations are examined. The first row of figure 6 depicts the number of components identified by the LVQ model before (blue) and after (orange) refinement. Although there are some small fluctuations, there are no strong deviations which cannot be compensated by averaging.
The influence of the heuristic refinement when varying its parameters $\epsilon_{phase1}$ and $\epsilon_{phase2}$ is depicted in the second and third row of figure 6. Looking at the second row, the number of remaining components decreases gradually, as expected, when increasing $\epsilon_{phase1}$ from an initial 0% up to 10% necessary support by the data basis. The third row in figure 6 indicates the small impact of $\epsilon_{phase2}$, and of the second phase in general. When decreasing the minimum fraction of overlapping trajectory points required for being filtered out from 100% to 75%, the final number of components only decreases when there are unnecessary components left, or $\epsilon_{phase2}$ is too restrictive. In fact, considering all datasets and training iterations, the second refinement phase removed at least one component in only 185 of 1823 cases.
Finally, analyzing the quality of the resulting decomposition, two things have to be considered: the motion patterns represented by each prototype and the significance of each cluster. As the first point is hard to verify quantitatively, a visual inspection is employed. Looking at, for example, figure 2 or 7, the learned prototypes and identified motion patterns appear reasonable for bird's eye view datasets. This is also discussed briefly in section V-E. For evaluating the quality of the decomposition itself, an approach using a simple prediction model similar to the one described in algorithm 1 can be employed. While models are still trained on increasingly complex combinations of clusters, the test set errors are now compared to the prediction errors on the remaining clusters. Then, ideally, a significant difference between these errors justifies the existence of the remaining clusters next to the ones combined in the training dataset. Table 1 lists all mean test set and prediction errors (including standard deviations) for clusters generated for sdd:hyang04. Here, all pairwise differences in each row are significant, thus verifying the learned decomposition. Significance is determined by using a t-test for independent samples with a significance level α = 0.05. Variance homogeneity has been tested using Levene's test and considered when choosing a t-test variant (regular vs. Welch's).
FIGURE 6. Number of resulting clusters after multiple training runs before (blue) and after (orange) refinement (1st row) and the influence of different values for the parameters $\epsilon_{phase1}$ and $\epsilon_{phase2}$ (2nd and 3rd row) of the LVQ refinement procedure for exemplary datasets (panels: biwi:eth, crowds:zara01, sdd:hyang04).
[...] alignment and LVQ models. The evaluation resulted in a coarse ranking of datasets, with datasets being assigned to one of four groups of similar pseudo-entropy. It can be seen that the biwi scenes are among the datasets containing the least information, while the most informative scenes are found in the sdd dataset. For demonstration purposes, the datasets and prototypes for biwi:eth and sdd:nexus01 are illustrated in figure 7 for similar sequence lengths (12 and 15). As opposed to biwi:eth, sdd:nexus01, which is ranked as containing more information, comprises three times as many motion patterns, including constant velocity, curvilinear motion, acceleration and deceleration, as well as a mixed motion pattern. This mixed pattern consists of constant velocity, decelerating and accelerating parts, and might occur due to the rather high sequence length with respect to the covered time span.
Coarse complexity ranking of standard trajectory benchmarking datasets based on their estimated information content (pseudo-entropy). Higher pseudo-entropy implies higher dataset complexity. The datasets are abbreviated in terms of the actual dataset (e.g. sdd), the name of the scene included in the dataset (e.g. hyang) and the recording number in case there exist multiple recordings for the respective scene.

E. DISCUSSION
To conclude the evaluation, selected aspects of the proposed approach and the employed methodology are discussed first. Then, potential factors for creating a more fine-grained ranking and some insights gained from the analyses are discussed.

1) Approach and Methodology
In the context of dataset complexity assessment, normalizing the velocities using the alignment model should be discussed. Given a prototypical motion pattern, variations in velocity can be generated by scaling it accordingly, thus it is assumed that the pattern itself is the main contributor to a higher dataset entropy. In case the original motion patterns are required, recall that the clustering pipeline corresponds to the encoding part of a well-established auto-encoding architecture, thus the original velocities can be recovered from the dataset representation when employing the full autoencoding architecture. Further, a dataset-dependent trajectory length M was chosen instead of a common value for all datasets. This decision is motivated by large differences in trajectory lengths occurring in existing datasets as well as unknown ground resolutions. While differences in length may only lead to some datasets becoming very small if M is too high, it is not clear if two trajectories of the same length from different datasets are comparable due to unknown ground resolutions. Even if the average offset between subsequent trajectory points is equal, both trajectories might represent different real-world speeds, thus covering a longer/shorter distance in reality, which directly impacts occurring motion patterns and the influence of agent-agent interaction.
The pseudo-entropy aims to reveal the average amount of information contained in a given dataset. Looking at the coarse ranking in section V-D, the results appear reasonable from an experience point of view. For verifying this coarse ranking in an experimental setup using state-of-the-art trajectory prediction models, it would be necessary to put all datasets in a common reference frame and re-sample trajectories to achieve a matching ground resolution, i.e. the distance between subsequent trajectory points of objects moving at the same real-world speed must be equal for all datasets. This can be an interesting experiment for future work on the topic of dataset complexity analysis.
Lastly, the presented approach only allows for a coarse complexity ranking of given datasets. For achieving a more fine-grained ranking, additional factors need to be considered. Possible factors are discussed next.

2) Potential Factors Affecting Complexity
Multiple factors contributing to dataset complexity could be derived from a learned dataset decomposition. First, the diversity between motion patterns, covered implicitly in the pseudo-entropy, could be considered explicitly. In case of distinguishable patterns, statistical models need to be capable of capturing multiple modes in the data, requiring a higher modeling capacity. Thus, a higher pattern diversity is expected to correspond to a higher dataset complexity. The second factor considers occurring variations of the same pattern. This factor looks promising, as a higher variation implies a higher uncertainty when modeling specific motion patterns, making them harder to capture with statistical models. Lastly, the relevance distribution of the identified motion patterns could be considered. This mainly focuses on biases in the data, and thus answers the question of whether there is a prevalent motion pattern or whether all patterns are equally likely to occur. Then, less biased datasets can be considered more complex, as less biased data enables statistical models to capture different patterns in the first place.
Beyond that, agent-agent interaction as well as environmental cues can play an integral role in assessing trajectory dataset complexity. The influence of interaction on the shape of single trajectories can be significant, though this heavily depends on the chosen sampling frequency as well as the ground resolution of a given dataset. More specifically, the more sparsely a trajectory is sampled, the less relevant the influence of agent-agent interaction becomes, because interactions mainly occur on short time scales. The same applies to the ground resolution, where interactions become less visible when the spatial distance between subsequent trajectory points increases. As a final note, sensor noise must be considered, as there is a risk of interactions being indistinguishable from noise. For environmental cues, positional biases caused for example by junctions can heavily impact the occurrence of specific motion patterns, especially curvilinear motion. This leads to more diverse, and thus potentially more complex, datasets.
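To make the relevance distribution factor from this section more concrete, the following minimal sketch computes the normalized Shannon entropy of per-pattern support counts. It is an illustration only and not the pseudo-entropy defined in this paper; the function name and the example counts are hypothetical. Values close to 1 indicate that all patterns are roughly equally supported, while values close to 0 indicate a single prevalent pattern.

```python
import math


def normalized_support_entropy(support_counts):
    """Shannon entropy of the pattern support distribution, scaled to [0, 1].

    support_counts: number of trajectories assigned to each motion pattern
    (hypothetical input, e.g. obtained from a prototype-based decomposition).
    """
    total = sum(support_counts)
    if total <= 0 or len(support_counts) < 2:
        return 0.0
    # Relative support of each pattern; empty patterns contribute nothing.
    probs = [c / total for c in support_counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Divide by the maximum entropy reachable with this number of patterns.
    return entropy / math.log2(len(support_counts))


# Hypothetical counts: one dominant pattern vs. equally supported patterns.
biased = normalized_support_entropy([90, 5, 3, 2])
uniform = normalized_support_entropy([25, 25, 25, 25])
```

Under these assumed counts, the dataset dominated by one pattern yields a clearly lower value than the uniformly supported one, which mirrors the notion of bias discussed above.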

3) Interesting Findings
All datasets in this comparison are recorded from a bird's-eye view. Inherent to this perspective is the expectation that there are common motion patterns in all datasets, independent of the time horizon. This expectation has been confirmed, with a few exceptions, in that almost all scenes contain slight variations of at least one basic motion pattern, including constant, accelerated, decelerated and curvilinear motion. Some datasets contain multiple variations of the same basic motion pattern or even mixed motion patterns, enabled by the in some cases high sequence length M. This can be seen in figure 7 (g.-l.). Another aspect related to the motion patterns found in the data is that, in all datasets, the constant velocity pattern is the dominant, i.e. most supported, motion pattern, covering a large fraction of the entire dataset (see figure 8 for exemplary fractions for low to high complexity scene datasets). This has multiple implications for common evaluation methodology in current state-of-the-art publications. On the one hand, it is a plausible explanation for the difficulties in beating a simple linear extrapolation model in the task of human trajectory prediction. This phenomenon could, for example, be observed during the TrajNet challenge [14], where several of the first submissions failed to beat the linear model. On the other hand, it indicates that an arbitrarily assembled benchmarking data basis poses the risk of rendering corner cases, i.e. motion patterns different from the constant velocity pattern, statistically irrelevant. This leads to statistical models that are incapable of modeling more complex motion and also struggle to beat a linear model.
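For reference, a linear extrapolation baseline of the kind mentioned above can be sketched in a few lines. This is a minimal illustration under the assumption that the velocity is estimated from the last two observed points; it is not necessarily identical to the linear model used in the TrajNet challenge, and all function names are hypothetical.

```python
import math


def constant_velocity_predict(observed, n_future):
    """Extrapolate a 2D trajectory linearly from its last two observed points."""
    if len(observed) < 2:
        raise ValueError("need at least two observed points")
    (x0, y0), (x1, y1) = observed[-2], observed[-1]
    vx, vy = x1 - x0, y1 - y0  # displacement per time step
    return [(x1 + k * vx, y1 + k * vy) for k in range(1, n_future + 1)]


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between predicted and true positions."""
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical trajectory moving at constant velocity.
observed = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
future = [(3.0, 1.5), (4.0, 2.0), (5.0, 2.5)]
prediction = constant_velocity_predict(observed, n_future=3)
error = average_displacement_error(prediction, future)
```

When most samples in a dataset follow the constant velocity pattern, such a baseline already achieves a low average error, which is why more complex models gain little on aggregate metrics.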

4) Comparison with OpenTraj
Dataset complexity estimation in the context of human trajectory prediction is still an unexplored topic in the literature. Because of that, there is no quantitative approach for measuring and comparing the performance of different approaches for dataset complexity estimation. As a result, this section resorts to a qualitative comparison of the pseudo-entropy approach presented in this paper and the OpenTraj approach and aims to serve as a verification of results. Using the results provided in [21], a superficial comparison can be made by comparing the coarse dataset ranking based on pseudo-entropy with the clustering and entropy analyses in OpenTraj (figure 3 in [21]). It has to be noted that in this paper every scene of each dataset has been treated independently (e.g. sdd:deathcircle - scene 1), while OpenTraj pooled the results for each individual dataset. Being common to both evaluation sections, the eth hotel, zara and sdd:deathcircle datasets can be used as a sample for the comparison. In the pseudo-entropy based coarse ranking, these datasets are of low, medium and high entropy, respectively. This relative ranking is consistent with the findings provided by OpenTraj, supporting the plausibility of both approaches.

VI. CONCLUDING REMARKS
In the context of statistical learning, dataset complexity is closely related to the entropy of a given dataset. Thus, an approach for estimating the amount of information contained in trajectory datasets was proposed. The approach relies on a velocity-agnostic dataset representation generated by an alignment followed by vector quantization. Using this approach, a coarse complexity ranking of commonly used benchmarking datasets has been generated. A following discussion addressed the results and methods used, as well as interesting findings based on the analyses, stressing the importance of a well-rounded data basis.

1) Implications for the State of Benchmarking
The approach, methods and results presented in this paper can be valuable in the context of dataset and prediction model analysis, as well as benchmarking in general. First of all, the spatial sequence alignment combined with the LVQ approach can be used for analyzing datasets on different timescales, e.g. for extracting underlying motion patterns. This analysis especially benefits the selection of observation and prediction sequence lengths in benchmarking, as well as the selection of an appropriate prediction model in terms of model capacity. The latter is motivated by the fact that low-capacity models usually suffice for less complex data, which might in turn reduce cases of overfitting and unnecessary computational effort. Further, the resulting dataset decomposition can be used to enhance qualitative analyses of prediction model capabilities in cases where the model might struggle with specific subsets of the data. Lastly, a hierarchy of tasks with increasing difficulty could be built within a benchmark, based on the dataset decomposition in combination with the presented coarse dataset ranking.

2) Future Research Direction
This paper aims to constitute a step towards thorough dataset complexity analysis. The following paragraphs outline some open research directions for expanding on the approach and findings of this paper.
Consider Model Uncertainty. Currently, model uncertainty is only considered implicitly by averaging multiple instances of the presented pipeline when estimating the dataset pseudo-entropy. However, the variance of the ensemble is disregarded in this proof of concept. Thus, it could be interesting to examine whether explicitly incorporating the model's uncertainty about its output could benefit the entropy estimation.
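As a first step in this direction, the spread of the ensemble could simply be reported alongside its mean. The following sketch assumes that each pipeline instance yields one scalar pseudo-entropy estimate; the function name is hypothetical and the listed values are placeholders, not results from this paper.

```python
import statistics


def summarize_ensemble(estimates):
    """Return mean and sample standard deviation of per-instance estimates."""
    mean = statistics.fmean(estimates)
    # The sample standard deviation is undefined for a single instance.
    std = statistics.stdev(estimates) if len(estimates) > 1 else 0.0
    return mean, std


# Placeholder pseudo-entropy estimates from five pipeline instances.
instance_estimates = [3.1, 2.9, 3.4, 3.0, 3.2]
mean_estimate, std_estimate = summarize_ensemble(instance_estimates)
```

A large standard deviation relative to the gaps between datasets would suggest that the corresponding positions in the coarse ranking are not well separated.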
Bird's-eye View. So far, all compared datasets are bird's-eye view datasets. While this view is the common case for long-term human trajectory prediction datasets, trajectory complexity analysis is also relevant for other views (e.g. frontal view). Considering the structure of the presented approach, the entropy estimation should work as long as the spatial sequence alignment is applicable to the scenario of interest. In the current state, the alignment model expects complete tracklets as input and thus does not have to cope with missing observations, which arise for example through occlusions when using a frontal view. Consequently, the alignment model should be extended accordingly and evaluated on a range of different datasets with varying views and object types.

APPENDIX A. DATASET DETAILS
In order to give more details on the datasets used throughout this paper, table A.1 lists the number of samples included in each dataset, the recording conditions (location and acquisition) as well as the average trajectory length (with standard deviation). In accordance with the evaluation section V, the annotation rate has been aligned for all datasets to a fixed rate of 2.5 annotations per second.
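Aligning the annotation rate can be sketched as follows. The example assumes timestamped 2D points with strictly increasing timestamps and uses linear interpolation onto a uniform grid; whether the datasets in this paper were aligned by interpolation or by subsampling is not stated here, so this is one possible approach only. The function name is hypothetical.

```python
def resample_trajectory(times, points, rate_hz=2.5):
    """Linearly interpolate (x, y) points onto a uniform time grid.

    times: strictly increasing timestamps in seconds (assumption).
    points: list of (x, y) positions, one per timestamp.
    """
    if len(times) != len(points) or len(times) < 2:
        raise ValueError("need at least two timestamped points")
    step = 1.0 / rate_hz
    # Number of whole steps that fit into the recorded time span.
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    resampled = []
    j = 0  # index of the segment [times[j], times[j + 1]] containing t
    for k in range(n_steps + 1):
        t = times[0] + k * step
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        # Clamp the weight to guard against floating-point overshoot.
        w = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        (x0, y0), (x1, y1) = points[j], points[j + 1]
        resampled.append((x0 + w * (x1 - x0), y0 + w * (y1 - y0)))
    return resampled


# Hypothetical trajectory annotated at 5 Hz, resampled to 2.5 Hz.
times_5hz = [0.0, 0.2, 0.4, 0.6, 0.8]
points_5hz = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
aligned = resample_trajectory(times_5hz, points_5hz, rate_hz=2.5)
```

Note that a common annotation rate alone does not guarantee a matching ground resolution; the coordinates would additionally have to be expressed in the same real-world units.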