STI-GAN: Multimodal Pedestrian Trajectory Prediction Using Spatiotemporal Interactions and a Generative Adversarial Network

Predicting the future trajectories of multiple pedestrians in certain scenes has become a key task for ensuring that autonomous vehicles, socially interactive robots and other autonomous mobile platforms can navigate safely. The social interactions between people and the multimodal nature of pedestrian movement make pedestrian trajectory prediction a challenging task. In this paper, the problem is solved using a generative adversarial network (GAN) and a graph attention network (GAT) based on the spatiotemporal interaction information about pedestrians. Our method, STI-GAN, is based on an end-to-end GAN model that simulates the pedestrian distribution to capture the uncertainty of the predicted paths and generate more reasonable future trajectories. The complex interactions between people are modeled by a GAT, and spatiotemporal interaction information is used to improve the performance of trajectory prediction. We verify the robustness and improvement of our framework by evaluating its results on various datasets and comparing them with the results of several existing baselines. Compared with the existing pedestrian trajectory prediction methods, our method reduces the average displacement error (ADE) and final displacement error (FDE) by 21.9% and 23.8% respectively.


I. INTRODUCTION
Because of its importance in video monitoring [1], planning and control of automatic driving [2], and robot navigation [3], pedestrian trajectory prediction has long been a popular focus of research in the field of computer vision. However, the prediction of pedestrian trajectories in a congested environment still presents many challenges, such as modeling the interactions between pedestrians and the surrounding environment, pedestrian trajectory uncertainty, and the capture of pedestrian intentions.
Due to the widespread application of machine learning and especially the rapid development of deep learning in recent years, researchers have mainly addressed the above challenges through related methods based on recurrent neural networks (RNNs), which serve as the background for our research. Zhu et al. [4] proposed an efficient method of describing interpersonal interactions through a topological star structure by observing all pedestrian trajectories and extracting a comprehensive description; however, this method ignores the impact of the surrounding environment on people. Haddad et al. [5] used spatiotemporal graphs to capture both the temporal and spatial correlations of pedestrian predictions and considered physical cues in a scene and the interactions between pedestrians, thereby improving the performance of trajectory prediction. In addition, Liang et al. [6] and Liu et al. [7] considered pedestrian-scene and pedestrian-object relationships simultaneously and incorporated pedestrian intentions to model future paths and predict human activities and locations. However, their work ignored the multimodal nature of the prediction of future pedestrian trajectories. As shown in Figure 1, due to the uncertainty of the future trajectories of pedestrians, compared with a multimodal trajectory prediction model, a unimodal trajectory prediction model suffers from larger errors in predicting the future trajectory distribution.
FIGURE 1. Different generative models generate different path errors when considering pedestrians walking towards each other. As shown in Fig. 1(a), when a multimodal trajectory prediction model is used to predict a target pedestrian's trajectory, the pedestrian may go straight, turn left or turn right with a certain probability. This approach enables the prediction of a trajectory that is closer to the real trajectory (red trajectory) and has a smaller trajectory forecasting error. By contrast, as shown in Fig. 1(b), when a unimodal trajectory prediction model is used for trajectory prediction, there may be a large error between the predicted and real trajectories of the target pedestrian. In the illustrated case, the average distance between the future trajectory of the target pedestrian (blue dashed line) and the real trajectory (red solid line) is relatively large.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
In contrast, Gupta et al. [8] and Amirian et al. [9] used a generative adversarial network (GAN) structure to model all pedestrian trajectories in a scene. These models fully consider the multimodal properties of the global scene and the trajectories, but they do not address the issue of capturing pedestrian interaction information. By comparing the global pooling and attention pooling approaches used in these two models, recent research [10] has shown that using a graph attention network (GAT) to capture pedestrian interaction information can improve the predictive performance of pedestrian interaction models.
To overcome the limitations of previous work, we propose a spatiotemporal interaction graph attention GAN (STI-GAN) model to learn the multimodal properties of the trajectories to be predicted. First, we use a graph attention network to model the social interactions of pedestrians and assign a different attention weight to each neighbor to identify neighbors of higher importance. Unlike other pooling mechanisms, the GAT allows all pedestrians in the scene to interact. Second, we implement a graph attention model based on spatiotemporal characteristics in combination with a GAN structure to generate interpretable multimodal paths in the form of end-to-end sequences, and we use the GAN discriminator to compare the generated paths with the real trajectories to determine how realistic the generated trajectories are. We present an experimental error analysis conducted on two publicly available real-scene pedestrian trajectory prediction datasets, and the experimental results demonstrate the effectiveness of our proposed model.

Contributions:
1) Based on spatiotemporal information, a graph attention mechanism is extended to a GAN model to generate more accurate and interpretable multimodal path distributions.
2) Our model incorporates temporal and spatial information about social interactions to predict the future path of each pedestrian.
3) We propose an improved feature extraction method to encode the social interactions between pedestrians.

II. RELATED WORK
Our work focuses on pedestrian trajectory prediction. In the past few decades, much research has focused on traditional methods of predicting the future trajectories of pedestrians by relying on handcrafted functions [11]-[14]. Recently, however, data-driven deep learning methods have enabled great progress in this context. In this section, we discuss the existing work on RNN-related sequence prediction, graph attention networks, and GAN models.

A. RECURRENT NEURAL NETWORKS (RNNS) FOR SEQUENCE PREDICTION
Pedestrian trajectory prediction is a typical sequence problem in which historical trajectory information is used to predict future trajectories. RNNs, such as long short-term memory (LSTM) networks [15] and gated recurrent unit (GRU) networks [16], are often used to process such sequence problems. In recent years, as a variant of RNNs, LSTM networks in particular have been widely used in pedestrian trajectory prediction [8], [17]-[19]. Alahi et al. [17] first proposed a "social pooling layer", which allows nearest-neighbor pedestrians to share hidden states, to solve interactive problems. Xue et al. [18] used three different LSTM networks to capture pedestrian, social, and scene size information separately and innovatively introduced factors representing the influence of the scene layout on pedestrian behavior to improve the ability to predict pedestrian trajectories. Gupta et al. [8] first introduced a GAN for generating multiple possible future paths for pedestrians and used a global pooling layer to accelerate the calculations. Zhang et al. [19] proposed an LSTM-based data-driven state refinement module, which activates the current intentions of neighbors and jointly iteratively refines the current states of all pedestrians in a crowd through a message passing mechanism.
VOLUME 9, 2021
FIGURE 2. Our proposed spatiotemporal-attention-based multimodal network architecture. The network structure is based on a GAN model and consists of three key components: a generator (G), a graph attention model (see Figure 5), and a discriminator (D). The generator uses an attention network and two LSTM modules to model the spatiotemporal correlations of interacting pedestrians and realize the fusion of spatiotemporal information.

B. GRAPH ATTENTION NETWORKS (GATS)
Recently, graph neural networks (GNNs) have been widely used in various fields, including computer vision [20], [21], recommendation systems [22], transportation networks [23], [24], and even materials chemistry [25]. The reason they are so widely used is that the graph structure can provide an explicit high-level representation of the environment. Networks incorporating a graph attention mechanism (graph attention networks, GATs) have also been developed based on GNNs [26]. In a GAT, different attention weights are assigned to different neighbors when aggregating feature information. Notably, the problem of pedestrian trajectory prediction has both temporal and spatial characteristics because of the changes in pedestrian movement over time and the complex interactions among different pedestrians. Accordingly, Haddad et al. [5] proposed an attention model based on spatiotemporal graphs that can consider the influence of surrounding pedestrians on the target pedestrians in both time and space. In our work, we use a spatiotemporal graph model [5] and a GAT model to jointly model such complex interaction information. In each time step, we represent the interactions between pedestrians in the form of a graph, in which the pedestrians in a crowded scene correspond to the nodes of the graph and the interactions between pedestrians are described by the edges of the graph. We also assign different attention weights to different neighboring pedestrians.

C. GENERATIVE ADVERSARIAL NETWORKS (GANS)
The prediction of future pedestrian trajectories is a multimodal generation problem. Because GANs are capable of generating multimodal samples, a GAN model is suitable for solving this problem. GAN models are widely used in image translation [27], [28] and data enhancement [29]-[31] and have enabled remarkable breakthroughs in those areas. The structure of a GAN consists of a generator and a discriminator. Gupta et al. [8] introduced a GAN for solving the multimodal trajectory representation problem. However, the global pooling method adopted in this model uses a uniform weight for all surrounding pedestrians; thus, it cannot distinguish the different effects exerted on a target pedestrian by pedestrians at different distances and traveling at different speeds. Sadeghian et al. [32] improved this model by adding an attention mechanism. This improved model can assign different soft attention weights to the surrounding pedestrians and the static environment, helping the model learn the interaction information of different agents and extract the most important information from the neighbors. In addition, Amirian et al. [9] used InfoGAN [33] to perform unsupervised learning based on data with potential categories. For the pedestrian prediction problem, our work integrates a GAT-based spatiotemporal interaction encoder into a GAN to model complex interactive behaviors in both time and space, thereby further improving the performance of trajectory prediction.

A. PROBLEM DEFINITION
In a scene with a changing background, pedestrian position information can be obtained by an accurate target detection algorithm and used as model input to predict the future trajectories of pedestrians. In this paper, the pedestrian positions are given in the dataset, and we address the prediction of the future trajectories of all pedestrians based on given pedestrian trajectories in a crowded scene. Our goal is to predict the pedestrian trajectories in future time steps $t = T_{obs}+1, \ldots, T_{pred}$ based on the observed trajectories $X = X_1, X_2, \ldots, X_N$ of the $N$ pedestrians in the scene in previous time steps $t = 1, \ldots, T_{obs}$. The real trajectory points of pedestrian $i$ at time $t$ are denoted by $X_i^t = (x_i^t, y_i^t)$, and similarly, the predicted future trajectory points are denoted by $\hat{Y}_i^t = (\hat{x}_i^t, \hat{y}_i^t)$.
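The observation and prediction windows defined above can be illustrated with a short sketch. It assumes that one pedestrian's track is stored as a list of (x, y) tuples; the function names `split_trajectory` and `relative_displacements` are hypothetical helpers introduced only for this illustration and are not part of the proposed model.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def split_trajectory(traj: List[Point], t_obs: int, t_pred: int):
    """Split one track into observed steps 1..t_obs and future steps t_obs+1..t_pred."""
    if not 0 < t_obs < t_pred:
        raise ValueError("expected 0 < t_obs < t_pred")
    if len(traj) < t_pred:
        raise ValueError("trajectory is shorter than t_pred time steps")
    return traj[:t_obs], traj[t_obs:t_pred]


def relative_displacements(traj: List[Point]) -> List[Point]:
    """Per-step displacement (dx, dy) between consecutive positions."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(traj, traj[1:])]


if __name__ == "__main__":
    # Hypothetical straight-line track with 20 time steps.
    track = [(float(t), 2.0 * t) for t in range(20)]
    observed, future = split_trajectory(track, t_obs=8, t_pred=20)
    print(len(observed), len(future))
    print(relative_displacements(observed)[:2])
```

With 8 observed steps and a horizon ending at step 20, the model would see 8 positions and be asked to predict the remaining 12.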

B. OVERALL MODEL
This paper proposes a new pedestrian trajectory prediction method that can accurately predict pedestrian trajectories by comprehensively considering each pedestrian's state, movement history, and interactions with surrounding pedestrians. The network structure is shown in Figure 2. The model includes two main network components: a generator and a discriminator. The generator includes three key parts: a spatiotemporal feature coding module, a GAT module, and a decoder module. First, the spatiotemporal feature coding module takes the historical trajectory of each pedestrian as input for feature coding and uses a combination of LSTM and GAT structures to learn the most important information about the spatiotemporal interactions between pedestrians for generating future trajectories. The learned features are then passed to the next module. The GAT module estimates the different levels of importance of the surrounding pedestrians with respect to the target pedestrian and learns the interactions between pedestrians. Subsequently, the decoder module takes the spatiotemporal interaction features along with noise as input and generates a series of reasonable future trajectories for each pedestrian. Finally, the LSTM-based discriminator compares the generated trajectories with the real trajectories and determines the probability that each generated trajectory is a real trajectory. The discriminator is mainly used to improve the predictive performance of the generator model, forcing the generator to generate more realistic samples.

C. TRAJECTORY FEATURE EXTRACTOR
The trajectory feature extraction module mainly uses an LSTM structure to extract feature representations of the observed pedestrian trajectories. We extract the nodal features of all pedestrians' past trajectories and embed the relative displacement of each pedestrian into a higher-dimensional fixed-length vector $e_i^t$ through a multilayer perceptron (MLP), where $W_e$ represents the embedding weight. Then, we use an LSTM to capture the time dependence between all states of the pedestrian, for which $e_i^t$ is used as the input to the encoder LSTM unit at time $t$ for pedestrian $i$. We denote this LSTM as V-LSTM, where $v_i^t$ is the hidden state of the V-LSTM unit at time step $t$ and $W_v$ is the weight of the V-LSTM unit, which is shared among all pedestrians in the scene.
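In the notation of this section, the two steps of the feature extractor can be written compactly. The following is a sketch rather than a definitive statement of the model: it assumes the embedding-then-LSTM convention used in [8], and the symbol $\phi$ for the MLP and the displacement symbols $\Delta x_i^t$, $\Delta y_i^t$ are assumptions introduced here.

```latex
\[
\begin{aligned}
\Delta x_i^t &= x_i^t - x_i^{t-1}, \qquad \Delta y_i^t = y_i^t - y_i^{t-1}, \\
e_i^t &= \phi\left(\Delta x_i^t, \Delta y_i^t; W_e\right), \\
v_i^t &= \text{V-LSTM}\left(v_i^{t-1}, e_i^t; W_v\right).
\end{aligned}
\]
```

Under this reading, $W_e$ and $W_v$ are the only trainable parameters of the extractor, and both are applied identically to every pedestrian.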

D. GAT ENCODER
GNNs are an important supplement to traditional deep learning methods because they can handle irregularly structured objects well. In this work, we extend a spatiotemporal interaction encoder based on a graph attention mechanism to a GAN. This new model can simulate the social interactions between all pedestrians in a scene from the two perspectives of spatial motion patterns and temporal correlations.
GAT and Pedestrian Construction: The "pooling" function and the "attention mechanism" mentioned in [17]-[19] cannot be used to effectively model irregularly structured objects. To model objects with irregular structures, we aggregate the information of the surrounding neighbors by adding graph attention and assigning different importance to different surrounding nodes. When calculating the spatial interactions between pedestrians in each time step, the GAT considers the adjacent nodes and computes the corresponding hidden information of each target pedestrian node. The GAT introduces a self-attention mechanism to calculate the features of each neighbor node and then combines these features to obtain the influence of different neighbor nodes on the hidden state of each target pedestrian node. The GAT network is implemented by stacking multiple graph attention layers. Figure 3 shows a single graph attention layer. The input to the layer consists of the features of the nodes, where $N$ and $F$ represent the number of nodes and the feature dimension, respectively, and the layer outputs a feature vector for each node.
In our method, a GAT is used to model the spatial relationships between pedestrians in the same time step, and another LSTM is used to capture the temporal correlations of pedestrians. Figure 4 shows the graph structure of pedestrians in two consecutive time steps, which mainly includes three key parts: nodes, spatial edges (black solid lines) and temporal edges (blue dotted lines). The nodes in the graph structure represent the pedestrians in the scene at each time step, the black solid lines represent the spatial edges that describe the spatial social relationships between pedestrians, and the blue dotted lines represent the temporal edges that describe the temporal correlation of the same pedestrian in adjacent time steps.
Spatiotemporal Interactive Encoder Based on GAT: To model the pedestrian interaction network in a crowded scene, we introduce a spatiotemporal interaction encoder based on graph attention, which can model the social interactions of all pedestrians in the scene. Figure 5 describes in detail the spatiotemporal interaction input characteristics of a single node based on graph attention. For pedestrian $i$, we use the pedestrian spatial encodings $v_i^t$ and $v_j^t$ ($t = 1, \ldots, T_{obs}$) as the input to the softmax layer, and $\alpha_{i,j}^t$ is used to scale the influence of the hidden state of each surrounding pedestrian $j \in N \setminus \{i\}$ on the target pedestrian. Finally, the influence of all the surrounding pedestrians is summed to form a graph, and the attention layer output is $g_j^t$. $W_g$ and $W_u$ are parameters corresponding to the pedestrian. In these formulas, $a$ denotes the shared attention mechanism, and $\phi$ is a linear embedding function.
FIGURE 5. The spatiotemporal node characteristics of pedestrian $i$ based on the graph attention mechanism. The two red dashed lines indicate the spatial interactions of pedestrians $j_1$ and $j_2$ with target pedestrian $i$ at time $t$. The green parts in front of pedestrians $j_1$ and $j_2$ indicate the temporal influence of the surrounding pedestrians on the continuity of target pedestrian $i$ from time $t$ to time $t+1$. The LSTM and GAT networks are used to capture the different spatial interactions of the surrounding pedestrians with the target pedestrian, and then another LSTM is used to capture the influence of the historical trajectories of the other neighbors on target pedestrian $i$, that is, the temporal correlation of the motion interaction. Finally, the output $g_i^t$ of the encoder network model is obtained as shown in formula 5.
Reference [10] used the hidden state for the target pedestrian as the input to the GAT. In contrast, we not only use the current historical trajectory of the target pedestrian and the spatial interactions between pedestrians in the same time step but also incorporate the historical trajectories of the other pedestrians to jointly predict their future paths.
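The attention step described in this subsection can be illustrated with a single-head graph attention layer in the standard form of [26]. This is a minimal sketch under stated assumptions, not the exact encoder of this work: the parameter names `W` and `a`, the LeakyReLU slope of 0.2, and the helper names are assumptions, and the parameters $W_g$ and $W_u$ of the full model are not represented. Following the text, attention for pedestrian $i$ is taken over all other pedestrians $j \neq i$.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, h: Vector) -> Vector:
    """Multiply matrix W (rows = output dims) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_weights(wh: List[Vector], a: Vector, i: int) -> Dict[int, float]:
    """Softmax-normalised weights alpha[j] of every neighbour j != i for node i."""
    scores = {
        j: leaky_relu(sum(ak * x for ak, x in zip(a, wh[i] + wh[j])))
        for j in range(len(wh)) if j != i
    }
    if not scores:  # a pedestrian with no neighbours receives no messages
        return {}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


def gat_layer(h: List[Vector], W: Matrix, a: Vector) -> List[Vector]:
    """Aggregate transformed neighbour states, weighted by attention."""
    wh = [matvec(W, v) for v in h]
    outputs = []
    for i in range(len(h)):
        aggregated = [0.0] * len(W)
        for j, weight in attention_weights(wh, a, i).items():
            for d in range(len(aggregated)):
                aggregated[d] += weight * wh[j][d]
        outputs.append(aggregated)
    return outputs


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical hidden states
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gat_layer(hidden, identity, [0.1, 0.2, 0.3, 0.4]))
```

In the full model, a layer of this kind would be applied once per time step to the V-LSTM hidden states, and the resulting sequence would then be passed to the second LSTM.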
Once the spatial interaction influence $g_j^t$ has been obtained for each pedestrian in the crowded scene, the temporal interaction influence $s_i^t$ on the movement of each pedestrian and the representations of the movement histories of the other pedestrians are obtained by means of another LSTM module, which we denote as S-LSTM. Then, we incorporate the spatial interaction influence for trajectory prediction, where $\|$ denotes concatenation and $W_s$ denotes a parameterized shared linear transformation.

E. LONG SHORT-TERM MEMORY (LSTM)-BASED GAN
As stated in the introduction, pedestrian trajectory prediction can be characterized as a multimodal problem. Accordingly, an LSTM-based GAN can be used to generate multiple reasonable trajectories. We adopt this approach to capture the uncertainty of the possible future paths. In general, a GAN is composed of two models: a generative model and a discriminative model. The goal of the generative model is to deceive the discriminative model by generating samples that are as realistic as possible, while the goal of the discriminative model is to accurately distinguish the generated samples from the real samples. This "two-model game" ultimately enables the generative model to generate fake samples that are difficult to distinguish from real ones. In our model, a variety of reasonable trajectory samples are learned and predicted by a GAN.
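The game between the two models is commonly written as a minimax objective. As a reference point, the standard GAN formulation is shown below; it is the generic form, and the adversarial loss used in this work may differ in its conditioning and notation. The symbols $p_{data}$ (distribution of real samples) and $p_z$ (noise distribution) are assumptions introduced for this illustration.

```latex
\[
\min_G \max_D \; V(G, D) =
\mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
\]
```

The discriminator $D$ maximizes this value by scoring real samples high and generated samples low, while the generator $G$ minimizes it by producing samples that $D$ scores as real.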
Generator (G): The generator (G) obtains and encodes the spatiotemporal interaction information of trajectories through an LSTM-based spatiotemporal encoder and then uses an LSTM-based decoder for feature vector decoding and trajectory generation.
As shown in Figure 2, the encoder obtains the spatiotemporal interaction encoding vector $m_i^t$ for the target pedestrian through formula 7. Following [8], the decoder takes as its input a noise vector $z$ sampled from a multivariate normal distribution in combination with the encoding vector representing the spatiotemporal history of a pedestrian. Next, we use an LSTM to generate the future trajectory $\hat{Y}_i^t$ of the pedestrian across multiple time steps $t = T_{obs}+1, \ldots, T_{pred}$. We refer to this LSTM as G-LSTM. In the decoder, the weights $W_d$ and $W_{d1}$ are shared among all pedestrians in the scene, and $e_i^{T_{obs}}$ is obtained from formula 2.
Discriminator (D): As shown in Figure 2, we use a separate encoder to learn the rules of social interaction and identify unreasonable trajectories as false. In detail, any ground-truth or generated trajectory sample may be used as the input to the discriminator, and an MLP is applied to the last hidden state of the encoder to obtain a classification score. Thus, each path is classified as either real or fake.
Losses: We use two different loss functions to train the network: $L_{adv}$ and $L_2$. $L_{adv}$ represents an adversarial loss, whereas $L_2$ is a diversity loss function applied in the trajectory generation part of the model to encourage the network to generate $k$ different samples. In the loss definitions, $Y_i$ represents the ground-truth trajectory of pedestrian $i$, $\hat{Y}_i$ denotes the future trajectory of pedestrian $i$ generated by our model, $k$ is a hyperparameter, and $D$ denotes the discriminator. Finally, we combine the losses to find the best discriminator $D^*$ and generator $G^*$ and choose a weight $\lambda_1$ as the final hyperparameter for combining these two loss functions.
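The effect of the diversity loss can be illustrated with a short sketch. It assumes the best-of-k form used in [8], in which $k$ trajectories are sampled for a pedestrian and only the sample closest to the ground truth in the L2 sense is penalised; whether this work uses exactly that form is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def l2_distance(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Euclidean norm of the difference between two equally long trajectories."""
    if len(pred) != len(truth):
        raise ValueError("trajectories must have the same number of steps")
    return math.sqrt(sum((px - tx) ** 2 + (py - ty) ** 2
                         for (px, py), (tx, ty) in zip(pred, truth)))


def variety_loss(samples: List[Sequence[Point]], truth: Sequence[Point]) -> float:
    """Best-of-k loss: the smallest L2 distance among the k generated samples."""
    if not samples:
        raise ValueError("at least one generated sample is required")
    return min(l2_distance(sample, truth) for sample in samples)


if __name__ == "__main__":
    ground_truth = [(0.0, 0.0), (1.0, 0.0)]
    generated = [                       # k = 2 hypothetical samples
        [(0.0, 1.0), (1.0, 1.0)],
        [(0.0, 0.0), (1.0, 0.5)],
    ]
    print(variety_loss(generated, ground_truth))
```

Because only the best sample contributes to the loss, the remaining samples are free to cover other plausible paths, which is what allows the generator to spread its outputs over several modes.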

F. IMPLEMENTATION DETAILS
In our model, an LSTM network structure is used as the RNN structure for both the generator and the discriminator. The numbers of hidden state dimensions of the generator's LSTM encoder and decoder are both 32 and that of the discriminator's LSTM encoder is 64. The input coordinates are encoded as 16-dimensional vectors and embedded into the LSTM part of the spatial encoder. During network training, only the mean square error is used for the first 250 cycles, and then, the last 250 cycles of adversarial training are conducted using both the cross-entropy loss and the mean square error. Through this training method, the generator can be encouraged to produce more reasonable results before the discriminator performs comparisons with the ground truth, thereby reducing the number of experimental iterations. In formula 8, we set λ 1 to 1. During training, the Adam optimizer is used to train the generator and discriminator. The batch size is set to 64, the number of iterations is 500, and the initial learning rate is 0.01.
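The settings listed above can be collected in a small configuration object, together with the two-phase training schedule. This is a sketch for illustration only: the names `TrainConfig` and `active_losses` are hypothetical, and it assumes that the 250 "cycles" of each phase and the 500 "iterations" both refer to training epochs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    embedding_dim: int = 16             # size of the coordinate embedding
    generator_hidden_dim: int = 32      # generator encoder and decoder LSTMs
    discriminator_hidden_dim: int = 64  # discriminator encoder LSTM
    batch_size: int = 64
    num_epochs: int = 500               # assumption: "iterations" means epochs
    warmup_epochs: int = 250            # MSE-only phase
    learning_rate: float = 0.01         # initial learning rate for Adam
    lambda_1: float = 1.0               # weight combining the two losses


def active_losses(epoch: int, cfg: TrainConfig = TrainConfig()) -> Tuple[str, ...]:
    """Return which loss terms are used in a given (0-based) epoch."""
    if not 0 <= epoch < cfg.num_epochs:
        raise ValueError("epoch outside the training schedule")
    if epoch < cfg.warmup_epochs:
        return ("mse",)
    return ("mse", "cross_entropy")


if __name__ == "__main__":
    for epoch in (0, 249, 250, 499):
        print(epoch, active_losses(epoch))
```

Keeping the schedule in one function makes the switch from the warm-up phase to adversarial training explicit and easy to change.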

IV. EXPERIMENTS
In this section, we introduce the two datasets used in our experiments and the two types of prediction errors reported to evaluate the results. Then, we compare the proposed method with five baseline models. Quantitative and qualitative results, including results obtained by analyzing the validity of our model and visualizing the differences between trajectories, are shown.

A. DATASETS
Experiments were conducted on two public pedestrian trajectory prediction datasets: ETH [34] and UCY [35]. These two public datasets include four scenarios and five subsets: the ETH dataset includes two scenarios, namely, ETH and HOTEL, and UCY is divided into three subsets, namely, ZARA1, ZARA2, and UCY. These datasets contain 1536 pedestrians, complex social scenes, and information about the interactions between pedestrians. To make full use of the datasets when training the model, the "leave one out" method was used; i.e., the model was trained on four subsets and tested on the remaining subset. For model training, we took the first 3.2 seconds of each trajectory as the observed trajectory and predicted the trajectory over the next 3.2 seconds or 4.8 seconds. Following the setting used by the authors of S-LSTM [17], the data over the next 8 and 12 time steps were predicted by observing the data from the first 8 time steps, with an interval of 0.4 seconds between time steps.
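The "leave one out" protocol and the relation between time steps and seconds can be sketched as follows. The subset names are taken from the text; the function names and the constant `STEP_SECONDS` are hypothetical and serve only this illustration.

```python
from typing import Iterator, List, Sequence, Tuple

SUBSETS = ("ETH", "HOTEL", "UCY", "ZARA1", "ZARA2")
STEP_SECONDS = 0.4  # interval between consecutive time steps


def leave_one_out(subsets: Sequence[str] = SUBSETS) -> Iterator[Tuple[List[str], str]]:
    """Yield (training subsets, held-out test subset) for every subset in turn."""
    for held_out in subsets:
        training = [name for name in subsets if name != held_out]
        yield training, held_out


def steps_to_seconds(num_steps: int) -> float:
    """Duration covered by a number of time steps."""
    return num_steps * STEP_SECONDS


if __name__ == "__main__":
    for training, held_out in leave_one_out():
        print("train on", training, "test on", held_out)
    # 8 observed steps cover 3.2 s; 8 or 12 predicted steps cover 3.2 s or 4.8 s.
    print(steps_to_seconds(8), steps_to_seconds(12))
```

Each of the five subsets serves once as the test set, so five models are trained in total and the reported averages are taken over these five runs.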

B. BASELINES AND METRICS
Baselines: To test the effectiveness of the proposed model, we compared its performance with that of five baseline models:
• Linear: This model is a linear regressor that estimates the linear parameters by minimizing the least-squares error.
• LSTM: The conventional LSTM model does not include a pooling mechanism, and all trajectories are considered independent of each other [15].
• S-LSTM: This model was proposed by Alahi et al. [17]. An LSTM model is used to model each pedestrian. The hidden states for different pedestrians are shared between the LSTM models through a pooling mechanism. The pedestrian trajectories are predicted by modeling the interactions between different pedestrians.
• SGAN: This model is based on an LSTM-based codec framework that uses a GAN for training and captures the multimodal distribution of the future trajectories [8].
• SoPhie: This model was proposed by Sadeghian et al. [32]. An attention-mechanism-based GAN codec model is used to model social interactions, and a physical attention mechanism is used to achieve interpretable predictions.
• STI-GAN: This is the spatiotemporal multimodal GAN model proposed in this work. Following parameter settings similar to those in [8], ... = (1, ..., n) in the scene.
TABLE 1. For a given trajectory over 8 time steps, the quantitative results predicted by all benchmark models over the next 8 and 12 time steps on the public datasets ETH and UCY. STI-GAN is always superior to the baseline models due to the combination of pedestrian spatiotemporal information and the graph attention mechanism in the model.

2) Final displacement error (FDE):
The distance between the predicted final destination and the real final destination at the end of the predicted trajectory. Compared with the ADE, the FDE places more emphasis on the accuracy of destination prediction.
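Both error measures can be computed in a few lines. The sketch below assumes the usual definition of the ADE as the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, together with the FDE as described above; the function names `ade` and `fde` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _check(pred: Sequence[Point], truth: Sequence[Point]) -> None:
    if not pred or len(pred) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")


def ade(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average displacement error: mean point-wise Euclidean distance."""
    _check(pred, truth)
    distances = [math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(distances) / len(distances)


def fde(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Final displacement error: distance between the two final positions."""
    _check(pred, truth)
    (px, py), (tx, ty) = pred[-1], truth[-1]
    return math.hypot(px - tx, py - ty)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]      # hypothetical ground truth
    predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # hypothetical prediction
    print(ade(predicted, truth), fde(predicted, truth))
```

In this toy case the prediction drifts steadily away from the ground truth, so the FDE is larger than the ADE; a model can also have a low ADE with a high FDE if it only misses near the end of the horizon.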

1) COMPARISON WITH EXISTING WORKS
In Table 1, our proposed model is compared with other existing models. We can see that LSTM and S-LSTM perform worse than SGAN and our model because a GAN can effectively capture the multimodal path distribution. In addition, the proposed adversarial method based on pedestrian spatiotemporal information and a graph attention mechanism is significantly better than the previous adversarial methods [8] and [32], showing that the graph attention mechanism and the consideration of the spatiotemporal characteristics of pedestrian interactions in the model can improve its prediction performance. We also observe that SoPhie differs from the other methods: it uses not only the historical paths of all agents in the scene but also scene context information to predict the pedestrian paths. This method performs well on the ETH and ZARA2 datasets, further demonstrating the importance of considering the static scene context for prediction. Notably, when the prediction horizon is 12 time steps, the Linear model performs best in both ADE and FDE in the HOTEL scenario. This is due to less pedestrian interaction and more linear trajectories in the HOTEL scene. As shown in Table 1, the STI-GAN-20V-20 model has the smallest average error among all of the compared models. Compared with the SGAN model, its average ADEs over the next 8 and 12 time steps are reduced by 21.9% and 9.4%, respectively, and the corresponding FDEs are reduced by 23.8% and 22.9%.

2) ABLATION STUDY
Analyses were performed to evaluate the effects of the different components of the proposed model, including the diversity loss, the graph attention mechanism, and the spatiotemporal information module, as well as to evaluate the spatial consumption. The quantitative results of the different model variants are shown in the following three tables.
Evaluation of The Effect of The Diversity Loss: Due to the multimodal nature of the pedestrian movement, we generate multiple socially acceptable trajectories based on diversity loss [8]. Compared with STI-GAN-1V-1 and STI-GAN-1V-20, our final STI-GAN-20V-20 model can generate more reasonable predictions of future trajectories by means of diversity loss. The ADEs of the STI-GAN-20V-20 model for prediction over 8 and 12 future time steps are reduced by 18.8% and 19.0%, respectively, and the corresponding FDEs are reduced by 8.6% and 14.6%. The results show that the diversity loss can encourage the model to produce different predicted trajectory samples, which is helpful for improving the trajectory prediction performance of the model.
Evaluation of The Effect of The Spatiotemporal Interaction Module: To verify the effectiveness of considering spatiotemporal information, a network considering only the spatial interaction information of the crowd was also trained, that is, the SI-GAN model, which does not contain the time-dependent interaction module. As shown in Table 1, compared with SI-GAN, the model without the spatiotemporal interaction module, our full method STI-GAN-20V-20 has an ADE and FDE that are reduced by 6.3% and 4.8%, respectively, when predicting the trajectories over the next 8 time steps. This is because the spatiotemporal interaction module allows the model to consider not only the spatial interactions between pedestrians but also the influence of the continuous movement histories of the other pedestrians on the target pedestrian. The results indicate that considering the spatiotemporal information of pedestrian interactions can help the model predict more reasonable paths.
Evaluation of The Effect of The GAT Module: To evaluate the robustness of the graph attention mechanism, we compared two models: the SI-GAN model and the SGAN model. SI-GAN mainly uses a graph attention mechanism for modeling pedestrian interactions, while SGAN uses a pooling mechanism. From Table 1, we can see that the SI-GAN model performs slightly better than the SGAN model because the graph attention mechanism (GAT) allows the model to capture the most important pedestrian interaction information more accurately than the pooling mechanism does.
Evaluation of The Effect of The GAN Structure: To evaluate the effectiveness of the GAN discriminator, two models with different generation methods were compared: the S-LSTM model and the SI-GAN model. Of the two, only the SI-GAN model relies on adversarial training to cause the output of the pedestrian trajectory prediction model, i.e., the generated distribution, to converge to the real distribution. Compared with those of the baseline S-LSTM model, the ADE and FDE of the SI-GAN model are reduced by 32.4% and 33.3%, respectively, when predicting the trajectories over the next 8 time steps, and they are reduced by 37.9% and 58.8%, respectively, when predicting the trajectories over the next 12 time steps. This is because our model uses a GAN structure to conduct adversarial training to predict reasonable future pedestrian trajectories.
Inference Speed and Spatial Consumption: The speed of pedestrian trajectory prediction is very important in practical applications such as self-driving cars. The more pedestrians there are in a real scene, the more complex the graph structure between pedestrians, and the more memory and computation are required. On the public real-world datasets UCY and ETH, the maximum number of pedestrians per frame is 65, and the model can still accurately predict the future trajectories. Therefore, the number of pedestrians has little effect on the accuracy of trajectory prediction, but it increases the amount of computation.
We compared our two methods with the baseline models S-LSTM and SGAN. We refer to our complete model STI-GAN-20V-20 as STI-GAN for simplicity. As shown in Table 2, in terms of inference speed, STI-GAN is slower than SGAN because our GAT scheme is more time-consuming than SGAN's pooling module. Table 3 lists the CUDA memory usage of our model and of the publicly available models that we could benchmark against. The memory usage of SGAN is twice as high as that of S-LSTM during training, which indicates that adversarial training can significantly increase memory usage. A comparison of SI-GAN and STI-GAN indicates that considering the temporal continuity of interactions does not affect the inference speed of the model but increases its memory footprint.
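To give a rough sense of how graph size grows with crowd size, the sketch below counts directed edges under the assumption that every pedestrian in a frame attends to every other pedestrian. That fully connected assumption is ours for illustration; this section does not state how the graph is connected, and the helper name is hypothetical.

```python
def directed_edges(num_pedestrians):
    """Directed edges in a fully connected graph without self-loops (assumed)."""
    return num_pedestrians * (num_pedestrians - 1)


# Edge counts for a few crowd sizes, up to the 65-pedestrian maximum
# reported for the UCY and ETH datasets.
edge_counts = {n: directed_edges(n) for n in (2, 10, 65)}
```

Under this assumption the number of pairwise interactions grows quadratically, which is consistent with computation rising with crowd size.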

D. QUALITATIVE RESULTS
In this section, we qualitatively evaluate the output predictions of SGAN, SoPhie, and our complete model under four different real scenarios on the ZARA dataset. By considering spatiotemporal interaction information and a graph attention mechanism in a GAN architecture, STI-GAN can better model the relationships between pedestrians, allowing it to more accurately predict the trajectories they will follow to avoid collisions. When pedestrians walk side by side or follow each other, our model can make correct predictions. In addition, when pedestrians are walking in opposite directions, our model can better model the relationships between them to deal with such situations.
Pedestrians Walking Side by Side: On the road, it is common for pedestrians to walk side by side to the same destination while maintaining a certain distance between them. Fig. 6(a) shows a pair of friends walking side by side in the same direction and at the same speed. SGAN and SoPhie pay too much attention to short-term social information in the pooling process; consequently, their performance is poor. Because of the spatiotemporal interaction mechanism used in STI-GAN, however, the trajectories predicted by the STI-GAN model are roughly consistent with the real trajectories.
Person Following: On a crowded road, when a target pedestrian is following the pedestrian in front of him or her, he or she will usually keep a certain distance from the pedestrian ahead and walk in the same direction and at the same speed as that pedestrian. He or she may also deflect in a certain direction and walk forward with the pedestrian ahead. Fig. 6(b) shows the trajectories of a pair of pedestrians following another pair of pedestrians. In this situation, the target pedestrian needs to pay attention to the speed and direction of the pedestrians in front and on the left and right sides at the same time. SGAN uses a max pooling mechanism, which focuses only on the most important features that affect pedestrian trajectories, so it generates trajectories with large prediction errors. STI-GAN uses its spatiotemporal interaction mechanism to aggregate and capture global pedestrian information, taking into account the influence of the spatial relations and historical trajectories of other pedestrians on the target pedestrian. Thus, the trajectories predicted by STI-GAN are closer to the real trajectories.
FIGURE 6. Trajectory prediction results of SGAN, SoPhie, and our proposed model in four different scenarios. Each column of images presents the trajectories predicted by the three prediction models in the same scene, i.e., the socially acceptable trajectory outputs of the different models. The blue solid lines in each image represent the historically observed trajectories, the red solid lines are the real future trajectories, and the green dotted lines are the predicted trajectories. In addition, we show some cases of prediction failure.
Group Avoidance: When two groups of pedestrians face each other in a crowded scene, the pedestrians usually adjust their direction and speed in time to avoid a collision. As shown in Fig. 6(c), two groups of pedestrians are walking towards each other and must adjust their directions in time to avoid colliding. In this crowded environment, the key to accurate modeling is to capture the information about the interactions of the surrounding pedestrians. The trajectory predicted by SGAN is quite different from the ground-truth trajectory. SoPhie does use an attention mechanism to extract the most important trajectory information from the surrounding pedestrians, but it is still insensitive to the unstructured features of the pedestrian interactions. By virtue of its graph attention mechanism, STI-GAN can capture the changes in other people's intentions more successfully, learn more reliable unstructured feature representations, and thus avoid collisions.
Failure Scenario: Another common scenario is that pedestrians suddenly change direction during the process of moving forward. Fig. 6(d) shows a pair of friends who suddenly change their direction after passing a vehicle. In this case, the prediction results of neither the proposed model nor the baseline models are ideal. SGAN shows the worst performance, while STI-GAN can better model complex pedestrian interactions by means of the graph attention mechanism and therefore still predicts trajectories that are closer to the real trajectories than SGAN does.

V. CONCLUSION
To model pedestrian motion patterns and accurately predict future pedestrian trajectories, this paper proposes a multimodal end-to-end trajectory prediction model that combines spatiotemporal interaction information based on a graph attention mechanism with the multimodal characteristics of a GAN to predict trajectories that exhibit good rationality in terms of social interactions. Our spatiotemporal graph attention model can combine spatial and temporal information to rationally assign different weights to different pedestrians in order to better capture the complex interactions between pedestrians. In addition, our GAN can produce diverse samples that conform to social rules. Our proposed model was tested on two public video datasets. The experimental results show that compared with baseline methods, the new model combining a spatiotemporal attention mechanism with a GAN can better capture the complex interactions between pedestrians to predict pedestrian trajectories in various real scenes, thereby improving the performance of pedestrian trajectory prediction.
Our work focuses on the study of social interactions between pedestrians. In the future, we can jointly model the spatiotemporal social interactions between pedestrians and other pedestrians, pedestrians and vehicles as well as vehicles and vehicles, and further improve the accuracy of trajectory prediction through joint modeling.