CoL-GAN: Plausible and Collision-Less Trajectory Prediction by Attention-Based GAN

Predicting plausible and collision-free trajectories is critical in various applications, such as robotic navigation and autonomous driving. This task is challenging for two major reasons. First, it is difficult for deep neural networks to understand how pedestrians move to avoid collisions and how they react to each other. Second, given the observed trajectories, there are multiple plausible future trajectories that pedestrians may follow. Although an increasing number of previous works have focused on modeling social interactions and multimodality, the trajectories generated by these methods still lead to many collisions. In this work, we propose CoL-GAN, a new attention-based generative adversarial network using a convolutional neural network as the discriminator, which is able to generate trajectories with fewer collisions. Through experimental comparisons with prior works on publicly available datasets, we demonstrate that CoL-GAN achieves state-of-the-art performance in terms of accuracy and collision avoidance.


I. INTRODUCTION
Pedestrian trajectory prediction is essential and critical in various applications, such as autonomous driving and robotic navigation. When autonomous agents move in a crowd, they should understand social behaviors to avoid potential collisions with other agents. Social interactions and multimodality are two major inherent challenges of pedestrian trajectory prediction. There are several kinds of human-human interactions, such as avoiding collisions, grouping, and keeping a suitable distance from neighbors. Multimodality means that pedestrians can follow diverse trajectories even when similar historical trajectories are given.
Recently, data-driven methods [1]-[8] have been proven to capture more diverse and complicated social interactions than traditional methods [9], [10], which model social interactions with predefined physical rules. However, unlike data-driven methods, these traditional methods can guarantee collision-free trajectories. Indeed, pedestrians walking in crowds try to avoid collisions with others, so being collision-free is a fundamental characteristic of real trajectories. In some scenarios of particular interest, producing trajectories without collisions is necessary and critical. Maintaining a minimal distance, namely, a collision distance threshold, from others is necessary, even when pedestrians have to shorten this distance in an extremely crowded scenario. If autonomous driving systems predict trajectories with a large number of collisions, they are likely to make wrong decisions. For example, if an autonomous vehicle forecasts that several pedestrians will occupy the same location and plans to drive past them, ignoring the fact that they will actually spread over a larger area, an accident is likely to happen.
Therefore, predicting collision-free trajectories is necessary. However, the existing data-driven trajectory prediction models [1], [4], [5], [8] lack quantitative evaluation metrics to measure collision avoidance performance and judge the plausibility of predicted trajectories. These models only use qualitative checks, such as inspecting visualized trajectories in the prediction period, which is subjective, time-consuming, and not convincing enough. Here, we introduce the Average Collision Times (ACT), an intuitive quantitative metric to evaluate the plausibility of trajectories more convincingly.
As mentioned above, previous data-driven models predict trajectories with many collisions. Therefore, we propose CoL-GAN, a novel attention-based generative adversarial network to predict pedestrian trajectories with fewer collisions. Experimental results show that CoL-GAN achieves higher accuracy and fewer collisions. Existing models utilize different schemes to model social behaviors. In contrast to prior works [2], [5], [7], [8] that compute attention in complex ways, CoL-GAN adopts a novel social attention module with a simple structure to capture human-human interactions, where the attention scores are inferred from relative positions and relative velocities. In contrast to works [1], [3], [5] that consider only the neighboring pedestrians in local areas, CoL-GAN attends separately to all the pedestrians in a scenario. While Social GAN (SGAN) [4] uses its pooling mechanism only once to obtain social interaction features as part of the initial hidden state of the decoder, the attention module of CoL-GAN works at all time steps in the prediction period. In other words, our proposed attention module infers attention scores between the target pedestrian and all pedestrians (including the target pedestrian himself) based on the corresponding relative positions and relative velocities at each time step in the prediction period.
In addition to the complexity of social interactions between humans, another inherent property of trajectories is multimodality. Zhang et al. [11] used a Gaussian mixture model (GMM) to estimate the probability distribution of a future position. In contrast, a generative adversarial network can generate multimodal samples at the whole-trajectory level. Similar to prior works [4], [6], [12], CoL-GAN also leverages a generative adversarial network architecture to produce multimodal trajectories. However, unlike previous LSTM-based discriminators, inspired by PatchGAN and PixelGAN [13], we introduce the Motion Discriminator, a CNN-based discriminator that splits a whole trajectory into several segments and estimates the probability that each segment is real.
Our contributions can be summarized as follows. (1) We introduce CoL-GAN, a novel GAN model to predict human trajectories based on a new social attention mechanism; CoL-GAN exploits a CNN-based network as the trajectory discriminator. (2) We introduce the ACT, a quantitative metric to evaluate collision avoidance performance statistically and objectively. (3) Through experimental comparisons with state-of-the-art methods, we show that, in addition to superior accuracy, the trajectories predicted by CoL-GAN achieve the best ACT.
To better present our work, the rest of this paper is arranged as follows. We describe related works in Section 2. Then, our method is introduced in detail in Section 3. We present our experimental results in Section 4. Finally, we conclude the paper in Section 5.
II. RELATED WORK
A. SOCIAL INTERACTION MODELING
Alahi et al. [1] proposed modeling human-human interactions by an LSTM with a social pooling module, which led to the trend of data-driven methods. Because pedestrians pay different attention to others depending on their motions, more methods try to model this phenomenon with various attention modules [4]-[6], [8], [18]. The key differences among these attention modules involve three aspects. First, they compute the weights of pedestrians based on different source information. Second, they combine the motion mode of target pedestrians with the interactive information of others in different ways. Third, they process human-human interactions at different frequencies.

B. COLLISION AVOIDANCE
Being collision-free is the most intuitive property of plausible trajectories. Traditional trajectory prediction methods built their hand-crafted functions mainly by modeling the collision avoidance behavior of pedestrians [9], [21], [22]. Such methods are strictly collision-free: at each time step, they predict trajectories with the explicit purpose of avoiding collisions. In contrast, data-driven methods cannot guarantee collision-free trajectories. They implicitly learn to predict plausible trajectories from training data, which lets them model more diverse interactions such as group forming and pedestrian following. S-LSTM [1] uses a pooling module to capture the interactions of pedestrians with the anticipation of helping predict trajectories with fewer collisions. Other attention-based methods [5]-[8], [18] also keep this in mind. Each method exhibits different performance in collision avoidance, and experimental results show that our method does especially well in avoiding collisions.

C. GENERATIVE ADVERSARIAL NETWORK
The movements of pedestrians are multimodal in nature. Because generative adversarial networks (GANs) are able to generate multimodal samples, they are well suited to this problem. Gupta et al. [4] emphasized the multimodality of trajectories and introduced SGAN. Later, other works [6], [8], [12] also tried to generate more diverse trajectories. Great advances have been made with GANs in many fields. A GAN is composed of a generator and a discriminator. Many works have focused on designing task-fit generators to make improvements [13], [23]-[25]. Additionally, other methods have made progress by inventing effective discriminators [13], [26]. The discriminator used in PatchGAN and PixelGAN [13] has been proven more effective at generating realistic pictures than a traditional whole-image discriminator. Inspired by these works, we use a CNN-based motion discriminator rather than a traditional LSTM-based trajectory discriminator.

III. METHOD
Pedestrians walking in a crowd perform diverse social interactions with others at every time step. They respond differently to others depending on the relative positions and relative velocities between them. They keep pace with nearby pedestrians who have the same destination while carefully maintaining a proper distance so as not to cause a collision; however, different pedestrians prefer different proper distances. If other pedestrians walk toward them, they may adjust their directions and speeds in advance to avoid collisions, but they will choose different avoidance directions and speeds. In other words, pedestrians pay different attention to others in various scenarios, and there are many ways for them to avoid collisions. This motivates us to build a model that integrates the aforementioned characteristics to predict multimodal future trajectories with fewer collisions. Therefore, we present a seq2seq-based GAN model that predicts diverse trajectories, with an attention module that handles interactions at each prediction time step and computes different attention scores for pedestrians based on their relative positions and relative velocities to the target pedestrian. We call the model collisionless GAN, or CoL-GAN. The architecture of CoL-GAN is presented in Fig. 2.

A. PROBLEM DEFINITION
In this paper, we address the problem of predicting accurate pedestrian trajectories with fewer collisions in crowded scenarios. Given the historical trajectories of all pedestrians in the scenario, our task is to predict their future trajectories simultaneously. The pedestrians in the scenario are represented as p_1, p_2, ..., p_N. The position of a specific pedestrian p_i at time step t is denoted X_i^t = (x_i^t, y_i^t), observed for t ∈ [1, T_obs]. Our goal is to predict the positions of all pedestrians at any future time step t ∈ [T_obs + 1, T_obs + T_pred]; for a specific pedestrian p_i, the predicted position is denoted Ŷ_i^t.

B. OVERALL MODEL
CoL-GAN is a typical generative adversarial network composed of a generator and a discriminator. The generator generates future trajectories of pedestrians, and the discriminator estimates the probability that they are true. Our generator is based on a seq2seq architecture with a historical trajectory encoder and a future trajectory decoder.
The key difference between the LSTM-based encoder and decoder is that the decoder has a social attention module. The encoder is used to encode historical trajectories. It captures each pedestrian's historical motion patterns and encodes them into the hidden state h_en^t and the cell state c_en^t. At time step T_obs, the encoder provides h_en^{T_obs} and c_en^{T_obs} to the LSTM of our decoder as the initial hidden state and cell state, respectively. The decoder is the key component of our design; it consists of three components: an LSTM (LSTM_de), a social attention module, and a linear layer. Pedestrians walking in a crowd avoid collisions with others at every time step. They observe the movements and velocities of others and then plan their own routes. Finally, they follow their planned paths, with some urgent collision avoidance. Therefore, our decoder also uses the social attention module to process interactions at each prediction time step. For the target pedestrian p_i, the attention module uses an MLP to infer all pedestrians' corresponding weights from their relative positions and relative velocities to p_i and then computes a weighted sum of the corresponding outputs of LSTM_de. Finally, the weighted sum is used by the linear layer to predict future movements. In addition, noise and the prediction of the last time step are concatenated as the input of LSTM_de. The noise differs across trajectories but is time-invariant.
After the generator predicts all trajectories, our CNN-based discriminator determines whether they are true or false to force the generator to predict more realistic trajectories. In the following subsections, we elaborate on each module in detail.

C. TRAJECTORY ENCODER
From the historical trajectories of pedestrians, the encoder captures their respective motion patterns. We do not directly use the coordinates X_i^t = (x_i^t, y_i^t) as the input of the encoder. Following the process of SGAN [4] and STGAT [8], we use the displacement ΔX_i^t = (x_i^t - x_i^{t-1}, y_i^t - y_i^{t-1}) as the input, which is equivalent to the velocity given the fixed sampling interval. Then ΔX_i^t is embedded into a fixed-length vector e_(en,i)^t = φ_en(ΔX_i^t; W_(en,em)) at every time step. LSTM_en takes e_(en,i)^t as input and produces a new hidden state h_(en,i)^t = LSTM_en(h_(en,i)^{t-1}, e_(en,i)^t; W_(en,lstm)), where φ_en(·) is an embedding function and W_(en,em) are the embedding weights. The LSTM weights of the encoder are denoted by W_(en,lstm).
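As a minimal sketch of this preprocessing step (the function name and list-based representation are our own, not the authors' code), the conversion from absolute coordinates to per-step displacements can be written as:

```python
def to_displacements(traj):
    """traj: list of (x, y) absolute positions at consecutive time steps.
    Returns the per-step displacements (Delta x, Delta y), i.e. velocities
    given the fixed 0.4 s sampling interval used in the paper."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(traj, traj[1:])]

path = [(0.0, 0.0), (1.0, 0.5), (3.0, 0.5)]
print(to_displacements(path))  # [(1.0, 0.5), (2.0, 0.0)]
```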

D. ATTENTION BASED TRAJECTORY DECODER
At every time step t, our decoder takes the position Ŷ_i^{t-1} predicted at the last time step as its input. Specifically, at the first prediction time step T_obs + 1, the input is Y_i^{T_obs}, and LSTM_de takes [h_(en,i)^{T_obs}, c_(en,i)^{T_obs}] as its initial hidden state and cell state, respectively. Similar to the encoder, Ŷ_i^t is embedded into a fixed-length vector e_(de,i)^t at every time step. Then, e_(de,i)^t and the noise vector z are concatenated together and fed into the LSTM of the decoder, which introduces the recursion h_(de,i)^t = LSTM_de(h_(de,i)^{t-1}, [φ_de(Ŷ_i^{t-1}; W_(de,em)), z]; W_(de,lstm)), where φ_de(·) is the embedding function of the decoder and W_(de,em) are its embedding weights. The LSTM weights of the decoder are denoted by W_(de,lstm). The noise vector z is sampled from a Gaussian distribution; it is person-specific but time-invariant.
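A minimal PyTorch sketch of one decoder step under assumed shapes (the noise dimension and all variable names are illustrative; the authors' released implementation may differ):

```python
import torch
import torch.nn as nn

emb_dim, noise_dim, hid_dim = 32, 8, 64  # embedding/hidden sizes from the paper; noise size assumed

embed = nn.Linear(2, emb_dim)                  # phi_de with weights W_(de,em)
lstm_de = nn.LSTMCell(emb_dim + noise_dim, hid_dim)

n_peds = 5
y_prev = torch.randn(n_peds, 2)                # Y-hat at time step t-1
z = torch.randn(n_peds, noise_dim)             # person-specific, reused at every step
h = torch.zeros(n_peds, hid_dim)               # in the full model, taken from the encoder
c = torch.zeros(n_peds, hid_dim)

e = embed(y_prev)                              # e_(de,i)^t
h, c = lstm_de(torch.cat([e, z], dim=1), (h, c))
print(h.shape)                                 # torch.Size([5, 64])
```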

1) SOCIAL ATTENTION MODULE
At every time step, pedestrians consider the movements of others and make a plan to avoid collisions with others while walking toward their own destinations. The main factors that influence their decisions are the relative positions and relative velocities between them and others. Clearly, a naive LSTM is not able to capture interactions between pedestrians, but our social attention module is. Unlike S-LSTM [1], which limits the local area in which interactions are considered, our attention module considers all pedestrians in the scenario. Unlike SGAN [4] and STGAT [8], which capture social interaction features only to initialize the hidden state of the decoder, our attention module works throughout the prediction period, handling social interactions at each time step. While CIDNN [19] only uses absolute positions to infer attention weights, we take into account both positions and velocities. We denote the relative position from pedestrian p_i to target p_j as RP_ij and the relative velocity as RV_ij, calculated as RP_ij^t = X_i^t - X_j^t and RV_ij^t = V_i^t - V_j^t, where V_i^t is the velocity of p_i at time step t. Then, RP_ij and RV_ij are concatenated as R_ij, which is the input of an MLP. For target pedestrian p_j, every pedestrian in the scenario (including p_j) is assigned an attention score α_ij^t computed by the MLP, with a softmax layer normalizing the scores at time step t: α_ij^t = softmax_i(MLP(R_ij^t)). The feature H_j^t of target p_j is then calculated as the weighted sum of the hidden states h_i^t (i ∈ [1, N]): H_j^t = Σ_{i=1}^{N} α_ij^t h_i^t. Finally, H_j^t is used by the linear layer to predict Ŷ_j^{t+1}.
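The attention computation above can be sketched in PyTorch as follows; the MLP layer sizes [16, 32, 1] follow the implementation details later in the paper, while the 4-d input (relative position concatenated with relative velocity) and ReLU activations are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_peds, hid_dim = 4, 64
pos = torch.randn(n_peds, 2)      # positions X_i^t
vel = torch.randn(n_peds, 2)      # velocities V_i^t
h = torch.randn(n_peds, hid_dim)  # decoder LSTM hidden states h_i^t

mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                    nn.Linear(16, 32), nn.ReLU(),
                    nn.Linear(32, 1))

rp = pos.unsqueeze(1) - pos.unsqueeze(0)   # rp[i, j] = RP_ij = X_i - X_j
rv = vel.unsqueeze(1) - vel.unsqueeze(0)   # rv[i, j] = RV_ij
r = torch.cat([rp, rv], dim=-1)            # R_ij, shape (n_peds, n_peds, 4)

scores = mlp(r).squeeze(-1)                # (n_peds, n_peds)
alpha = F.softmax(scores, dim=0)           # normalize over pedestrians i per target j
H = alpha.t() @ h                          # H[j] = sum_i alpha_ij * h_i
print(H.shape)                             # torch.Size([4, 64])
```

The weighted sum `H[j]` is what the linear layer would consume to predict the next displacement of target `p_j`.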

E. CNN-BASED MOTION DISCRIMINATOR
We are inspired by PatchGAN and PixelGAN [13], which do not classify whether a whole image is real or fake but classify whether each N × N patch of an image is real or fake. Therefore, we abandon the typical discriminators of SGAN [4] and SoPhie [6], which classify a whole trajectory as real or fake; our Motion Discriminator classifies whether each segment of a trajectory is real or fake. Because LSTMs may suffer from vanishing gradients, partly neglecting previous information, we use a fully convolutional network as the discriminator. As mentioned before, our generator outputs the sequence of velocities (per-step displacements) of a pedestrian, which is the input of the discriminator. We assume that the movements of pedestrians at any time step are equally important and that pedestrians make correct decisions and move reasonably at every time step. Therefore, the probability of a trajectory being real is the average probability of all its segments. Our discriminator has three 1-D convolutional layers. Their kernel sizes are all set to 1, the convolution stride is fixed to 1, and the padding is set to 0. For the first two layers, LeakyReLU is used as the non-linear activation function. Between the second and the last convolutional layers, batch normalization is applied. Finally, a sigmoid function computes the probability. After the Motion Discriminator obtains the probability of each segment, we average them as the overall probability of the trajectory.
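A hedged PyTorch sketch of this discriminator follows: the kernel size 1, stride 1, padding 0, LeakyReLU activations, batch normalization placement, sigmoid output, and final averaging mirror the text, while the channel widths (2 → 32 → 64 → 1) and the LeakyReLU slope are our assumptions:

```python
import torch
import torch.nn as nn

class MotionDiscriminator(nn.Module):
    """Sketch of a CNN-based motion discriminator: scores each time-step
    segment of a velocity sequence, then averages the per-segment
    probabilities into one trajectory-level probability."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=1, stride=1, padding=0), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=1, stride=1, padding=0), nn.LeakyReLU(0.2),
            nn.BatchNorm1d(64),
            nn.Conv1d(64, 1, kernel_size=1, stride=1, padding=0), nn.Sigmoid())

    def forward(self, vel_seq):              # (batch, 2, T) velocity sequence
        per_segment = self.net(vel_seq)      # probability per time step
        return per_segment.mean(dim=2)       # average -> trajectory probability

d = MotionDiscriminator()
p = d(torch.randn(8, 2, 20))                 # batch of 8 velocity sequences of length 20
print(p.shape)                               # torch.Size([8, 1])
```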

F. IMPLEMENTATION AND TRAINING DETAILS
We use an embedding dimension of 32, and the LSTM hidden state dimensions of both the encoder and decoder are 64. All LSTMs have 1 layer. The dimensions of the MLP are [16, 32, 1]. When training the model, we use both the original GAN loss and the variety loss proposed by SGAN [4]; however, for the variety loss, we use L1 instead of L2. In consideration of the training time, we do not set the hyper-parameter k of the variety loss to 20 as in SGAN [4] and STGAT [8] but to only 5. We iteratively train the generator and the discriminator with a batch size of 32 for 200 epochs using Adam with an initial generator learning rate of 0.001 and a discriminator learning rate of 0.00001. After 20 epochs, the generator learning rate is set to 0.0001. Our implementation is based on the PyTorch library. The model is trained on one Nvidia GeForce GTX 1080 Ti graphics card.
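To illustrate the L1 variety loss mentioned above, here is a pure-Python sketch (the helper names are ours, not the authors'): of the k generated samples, only the one closest to the ground truth is penalized, which encourages diverse predictions without punishing every sample.

```python
def l1_error(pred, gt):
    """Sum of absolute coordinate differences over a trajectory of (x, y) points."""
    return sum(abs(px - gx) + abs(py - gy)
               for (px, py), (gx, gy) in zip(pred, gt))

def variety_loss(samples, gt):
    """samples: k predicted trajectories for one pedestrian (k = 5 here);
    only the best sample contributes to the loss."""
    return min(l1_error(s, gt) for s in samples)

gt = [(0.0, 0.0), (1.0, 1.0)]
samples = [[(0.0, 0.0), (1.0, 2.0)],   # off by 1 in y
           [(0.0, 0.0), (1.0, 1.0)],   # exact match
           [(2.0, 0.0), (1.0, 1.0)]]   # off by 2 in x
print(variety_loss(samples, gt))       # 0.0
```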

IV. EXPERIMENTS
A. DATASETS AND METRICS
In this section, we evaluate our method on two public pedestrian-trajectory datasets: ETH [27] and UCY [28]. The ETH dataset consists of two scenarios named ETH and HOTEL. UCY includes two scenarios that are divided into 3 parts, named ZARA1, ZARA2 and UNIV. Among these five subdatasets, UNIV is the most crowded. We follow the same data preprocessing strategy as SGAN [4] and STGAT [8], and there are no other processes. All data are converted to a world coordinate system and then interpolated to obtain values every 0.4 s. We observe 8 time steps (T obs = 8) of trajectories to predict the following 12 time steps (T pred = 12). In addition, we follow the leave-one-out evaluation methodology in SGAN [4], training on 4 subdatasets and testing on the remaining subdataset.

1) QUANTITATIVE EVALUATION METRICS
Two metrics have been widely used by previous works to evaluate trajectory prediction performance: the Average Displacement Error (ADE) and the Final Displacement Error (FDE). ADE: the average Euclidean distance between the ground truth and the predicted trajectories over all predicted time steps. FDE: the Euclidean distance between the ground truth and the predicted position at the final time step (T_obs + T_pred). The ADE and FDE quantitatively evaluate the accuracy of predicted trajectories, but there has been no quantitative metric to evaluate the reasonability and sociality of predicted trajectories. Since being collision-free is the most significant manifestation of human-human interactions [9], [10], we introduce the Average Collision Times (ACT) to quantitatively evaluate the reasonability of trajectories. It is calculated as ACT = (1/M) Σ_{m=1}^{M} Σ_t C_m^t, where C_m^t represents the collision times in scene m at frame t and M is the number of scenes. At time step t, if the distance between pedestrian p_i and pedestrian p_j is less than a threshold D_thr, there is a collision, and for p_i and p_j, the collision is counted only once. In our experiments, D_thr is set to 0.3 m, and we also compare the performance of different methods for other D_thr values from 0.08 m to 0.35 m.
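A minimal sketch of the ACT computation as we read it (the helper names and the scene/frame data layout are illustrative assumptions):

```python
import math

def collision_times(frame, d_thr=0.3):
    """frame: list of (x, y) pedestrian positions at one time step.
    Each colliding pair is counted once (C_m^t in the text)."""
    count = 0
    for i in range(len(frame)):
        for j in range(i + 1, len(frame)):
            if math.dist(frame[i], frame[j]) < d_thr:
                count += 1
    return count

def act(scenes, d_thr=0.3):
    """scenes: list of M scenes, each a list of frames over the
    prediction period. ACT = total collision count averaged over scenes."""
    return sum(collision_times(f, d_thr) for s in scenes for f in s) / len(scenes)

frames = [[(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]]  # one frame with one close pair
print(act([frames]))  # 1.0
```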
Following SRLSTM [5], we also use the frame rate-corrected version of ETH, ETH-SR. All methods observe the trajectories of 8 time steps to predict the next 12 time steps. We draw 20 samples for all stochastic methods. We obtained the results of SGAN and SGAN-P by evaluating their trained models released on GitHub.
Baselines. We compare against the following baselines:
• LSTM: A vanilla LSTM without an interaction mechanism.
• S-LSTM: The method proposed by Alahi et al. [1]. Each pedestrian is modeled with an LSTM, and the hidden state is pooled with neighbors at each time step.
• SGAN: The first method using a GAN to deal with the multi-modality of trajectories [4].
• SoPhie: A GAN-based method that leverages both social and physical information [6].
• STGAT: A method that can model spatial-temporal interactions for pedestrian trajectory prediction by using graph attention networks [8].

B. QUANTITATIVE EVALUATION
We compare our method against different baselines on three metrics, ADE, FDE, and ACT (D_thr = 0.3 m), in Table 1 and Table 2. Fig. 5 and Fig. 6 show the collision avoidance performance of different methods for D_thr values from 0.08 m to 0.35 m. When computing the ADE and FDE for stochastic models, we draw 20 samples following SGAN [4] and STGAT [8] and use the same evaluation code. For the ACT results, we compare two variants, ACT-best and ACT-avg: the former selects the best sample, and the latter averages the ACTs of the 20 samples.
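The best-of-20 evaluation protocol with ADE and FDE can be sketched as follows (pure Python with illustrative names; the actual evaluation code is shared with SGAN and STGAT):

```python
import math

def ade(pred, gt):
    """Average Euclidean distance over all predicted time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Euclidean distance at the final time step."""
    return math.dist(pred[-1], gt[-1])

def best_of_k(samples, gt, metric):
    """Best-of-k protocol: score only the sample closest to the ground truth."""
    return min(metric(s, gt) for s in samples)

gt = [(0.0, 0.0), (3.0, 4.0)]
samples = [[(0.0, 0.0), (0.0, 0.0)],   # misses the final position by 5 m
           [(0.0, 0.0), (3.0, 4.0)]]   # exact match
print(best_of_k(samples, gt, ade), best_of_k(samples, gt, fde))  # 0.0 0.0
```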

1) ADE & FDE
For the ADE and FDE, Table 1 shows the performance of different models on the subdatasets. SoPhie outperforms the others on ZARA1. STGAT obtains the lowest ADE and FDE on ETH and UNIV. Our model performs best on ETH-SR, HOTEL, and ZARA2. On the whole, CoL-GAN obtains results comparable to other state-of-the-art methods and outperforms them on several subdatasets. As shown in Table 1, SGAN is better than SGAN-P, so its social pooling module does not seem to improve prediction accuracy. In contrast, STGAT and CoL-GAN show superior accuracy.
Our model achieves poor performance on ETH but the best performance on ETH-SR. This is mainly because the original ETH video is an accelerated version, as mentioned in SRLSTM [5], while ETH-SR is the frame rate-corrected ETH. Our Motion Discriminator seems to be sensitive to the resulting unusual Ŷ_i^t.

2) ACT
As shown in Table 2, CoL-GAN exhibits excellent collision avoidance performance as evaluated by both ACT-best and ACT-avg. For ACT-best, CoL-GAN outperforms the other methods on all subdatasets. For ACT-avg, CoL-GAN outperforms the other methods on nearly all subdatasets except ZARA1. While ACT-best presents the upper-limit performance of stochastic models, ACT-avg shows the average performance. As shown in Table 2, for ACT-avg, SGAN-P is once again worse than SGAN on all subdatasets. Compared with SGAN, STGAT does not perform better in collision avoidance, while CoL-GAN performs better than SGAN.
To better compare the collision avoidance performance of different stochastic methods, we plot Fig. 5 and Fig. 6 to show the ACT-best and ACT-avg values for different distance thresholds. As Fig. 5 and Fig. 6 show, CoL-GAN stably outperforms the other methods.

3) ABLATION STUDY
We conduct an ablation study to confirm the effectiveness of our social attention module and Motion Discriminator. We present three variations of CoL-GAN: CoL-GAN without a social attention module (CoL-GAN-noAttn), CoL-GAN with the discriminator of SGAN (CoL-GAN-tradD), and CoL-GAN with a self-attention module [25] (CoL-GAN-selfAttn). CoL-GAN-selfAttn replaces our attention module with the self-attention module of Self-Attention-GAN [25], which computes the attention score α by measuring the intrinsic similarity of the LSTM hidden states of different pedestrians. We also present SGAN-ourD, an SGAN trained with our Motion Discriminator. Table 3 demonstrates that CoL-GAN performs better than all the other methods in terms of ADE, FDE, ACT-best, and ACT-avg on most of the subdatasets.

4) SGAN VS SGAN-ourD AND CoL-GAN-tradD VS CoL-GAN
As Table 3 shows, for the ADE and FDE, SGAN-ourD achieves a considerable improvement over SGAN on nearly all subdatasets, and CoL-GAN outperforms CoL-GAN-tradD, which indicates the superiority of the Motion Discriminator. Compared with the other subdatasets, our discriminator improves accuracy the most on HOTEL.

5) CoL-GAN VS CoL-GAN-noAttn AND CoL-GAN-selfAttn
Compared with CoL-GAN-selfAttn and CoL-GAN-noAttn, as shown in Table 3, CoL-GAN has better ADE, FDE and ACT-best values on most datasets. For ACT-avg, CoL-GAN outperforms the other two methods on every subdataset.

C. QUALITATIVE EVALUATION
The qualitative results are shown in Fig. 7, where we choose several different scenarios. In Fig. 7(a), CoL-GAN predicts that pedestrian #2 slows down to avoid a collision with pedestrian #1, while collisions occur in the results of SGAN and STGAT. As shown in Fig. 7(b), the trajectories predicted by CoL-GAN are closer to the ground truth, and they maintain the relative positions of the pedestrians, successfully avoiding collisions. As shown in Fig. 7(c), CoL-GAN-noAttn predicts much more linear results than STGAT and CoL-GAN. In the results of STGAT, when pedestrian #1 and pedestrian #4 try to avoid each other, pedestrian #1 collides with pedestrian #2; CoL-GAN, however, predicts the most accurate trajectories without collisions. As shown in Fig. 7(d), in the results of STGAT and CoL-GAN-noAttn, pedestrians collide with other pedestrians standing still, while this does not occur in the results of our method. Fig. 8 shows visualization examples of the learned attention weights in the social attention module. As shown in Fig. 8(a)-(d), CoL-GAN assigns different attention weights to the surrounding pedestrians. The target pedestrians themselves receive the highest weight; the relative importance of others depends on their distance and the similarity of their velocities. As shown in Fig. 8(b), as the distance of the same pedestrian from the target pedestrian increases between two time steps, that pedestrian's attention weight decreases.

V. CONCLUSION
In this work, we propose an attention-based GAN with a CNN-based discriminator to predict pedestrian trajectories. Experimental results demonstrate that CoL-GAN is able to predict trajectories with higher accuracy and fewer collisions. Our attention module assigns different weights to the corresponding pedestrians, including the target pedestrians, to fuse human-human interactive information with the motion patterns of the target pedestrians. Our Motion Discriminator classifies whether a trajectory is real or fake by classifying each of its segments. To quantitatively evaluate the collision avoidance performance of data-driven methods, we introduce a new metric, the ACT. Experimental results for the ADE, FDE, and ACT demonstrate that our model outperforms other methods in terms of accuracy and collision avoidance. Our attention module shows state-of-the-art effects on avoiding collisions; furthermore, our discriminator helps the generator learn to predict more accurate results.