Multimodal Pedestrian Trajectory Prediction Based on Relative Interactive Spatial-Temporal Graph

Predicting and understanding pedestrian intentions is crucial for autonomous vehicles and mobile robots navigating in crowds. However, pedestrian movement is stochastic. Pedestrian trajectory modeling must account not only for the past movement of pedestrians, the interactions between pedestrians, and the constraints imposed by static obstacles in the scene, but also for the multi-modal nature of human trajectories, which makes pedestrian trajectory prediction challenging. Most existing trajectory prediction methods consider only the interactions between pedestrians in the scene and ignore the fact that static obstacles can also affect pedestrian trajectories. In this paper, a scalable relative interactive spatial-temporal graph generative adversarial network architecture (RISTG-GAN) is proposed to generate reasonable multi-modal predicted trajectories by considering the interaction effects of all agents in the scene. Our method extends recent work on trajectory prediction. First, LSTM nodes are used to model the spatial-temporal graph of human-environment interactions, the spatial-temporal graph is converted into a feed-forward differentiable feature coding, and a time attention module is proposed to capture trajectory information in the time domain and learn temporal dependencies over long time ranges. Then, we capture the relative importance of the interactions of all agents in the scene for the pedestrian trajectory through an improved relative scaled dot product attention, and train with a generative adversarial network architecture to generate a reasonable distribution of future pedestrian trajectories. Experiments on five commonly used real-world public datasets show that RISTG-GAN outperforms previous work in terms of inference speed, accuracy, and the plausibility of the predicted trajectories.


I. INTRODUCTION
With the development of society, autonomous navigation platforms such as autonomous vehicles and social robots are becoming increasingly common, and it is critical that they be able to predict the movements of pedestrians in order to prevent collisions with them [1], [2], [3], [4]. Modeling pedestrian behavior is therefore an essential step in research on autonomous platform applications, such as detecting suspicious trajectories on autonomous video surveillance platforms [5], [6], [7], socially aware robots for visual navigation [8], [9], and safety decisions for self-driving platforms [10], [11], [12]. Pedestrian trajectory prediction is defined as predicting the movement trajectory of a pedestrian over some future period based on the pedestrian's past trajectory. Accurate trajectory prediction can help autonomous vehicles and social robots navigate better.
The associate editor coordinating the review of this manuscript and approving it for publication was Jjun Cheng.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 1. An example of multi-modal pedestrian trajectories in a crowded scene. Pedestrians pay more attention to the people or objects in front of them and less attention to what is behind them. When making navigational decisions, pedestrians judge how important the dynamic participants and static obstacles in the scene (such as trees, stationary vehicles, and streetlights) are for their future trajectory. The scene changes dynamically, and so does the trajectory of each pedestrian at every moment, with some randomness. Therefore, we should consider not only the spatio-temporal structure of human interaction with the current scene and visual attention, but also the random multi-modal nature of human walking.
Predicting pedestrian trajectories in a crowded scene is a challenging task. It is necessary to consider the spatial-temporal structure of human interaction with the current scene, the visual attention [13] that lets humans quickly analyze the environment, and the random multi-modal [14] nature of human walking. While walking, pedestrians adjust their own trajectories according to the intentions of surrounding neighbors and the positions of static obstacles in order to avoid collisions. As time passes, the pedestrians in the scene keep moving and the types and relative positions of obstacles keep changing, so pedestrians must constantly readjust their trajectories. The interaction between humans and the dynamic environment therefore has a spatial-temporal structure [15]. In addition, humans have an innate ability to analyze and judge: nearby people or objects are more likely to attract the target pedestrian's attention than distant ones, and the target pedestrian pays more attention to people in front than to people behind. This is known as the ''attention mechanism.'' Laurent Itti et al. [13] proposed a visual attention model that explains this behavior. Finally, in real life the movement trajectories of pedestrians are random and uncertain because of the influence of other pedestrians or obstacles in the scene, as shown in Fig. 1. Compared with the deterministic trajectory predictions proposed previously, multi-modal trajectory prediction output is therefore more reasonable.
However, most existing trajectory prediction methods do not consider these three aspects simultaneously. Early work on pedestrian trajectory prediction focused mainly on model-driven methods. Helbing et al. [16] pioneered the social force model, which predicts pedestrian behavior in terms of attractive and repulsive forces. Morris et al. [17] used a Hidden Markov Model to predict pedestrian trajectories with spatial-temporal probabilities. However, these models adapt poorly and are overly sensitive to their parameters, so they cannot solve the problem of pedestrian trajectory prediction in crowded scenes. In recent years, data-driven methods have become popular. These methods mainly treat pedestrian trajectory prediction as a time series generation task, and some recent works have used recurrent neural networks (RNNs) to solve it. Alahi et al. [18] proposed the Social-LSTM model, which uses a social pooling layer that divides the space around pedestrians into rectangular grid cells in order to capture the interaction information of adjacent pedestrians. Bisagno et al. [19] proposed the Group-LSTM model, an improvement on Social-LSTM that uses motion consistency to cluster trajectories with similar movement trends and thereby group pedestrians. However, these methods consider only simple interactions between pedestrians and fail to capture the spatial-temporal interaction between humans and the current scene.
To address the spatial-temporal interaction between humans and the environment, [20], [21], [22] proposed modeling methods based on the spatial-temporal graph (STG), through which the temporal and spatial connections between target subjects can be modeled explicitly. Mohamed et al. [23] proposed the social spatial-temporal graph convolutional neural network (social-STGCNN), which models the interactions between pedestrians as a spatial-temporal graph in place of aggregation methods, but it does not model the scene. Sirin Haddad et al. [24] proposed a long short-term memory (LSTM) network based on a spatial-temporal graph that considers the interactions of all agents in the scene. Although these methods model the spatial-temporal interaction between humans and the environment through a spatial-temporal graph, they output only deterministic trajectory predictions. Given the randomness and uncertainty of pedestrian movement, multi-modal trajectory prediction output is more realistic and reasonable.
Since generative adversarial networks (GANs) [25] have achieved remarkable results in probability calculation and behavioral reasoning, researchers have begun to turn their attention to GANs. Agrim Gupta et al. [26] proposed the pioneering social-GAN model, which introduced GANs into the pedestrian trajectory prediction task. This model combines sequence prediction with a generative adversarial network to generate diverse trajectories, and produces reasonable trajectory predictions through repeated adversarial training. However, the model does not consider global scene information. Stuart Eiffert et al. [27] proposed the probabilistic crowd GAN (PCGAN) trajectory prediction method, which combines a recurrent neural network with a mixture density network. This method considers not only the interactions between people but also the interactions between people and vehicles; however, it does not model temporal information, and its training process is very complicated.
To address the limitations of the above methods, we extend our previous work, Zou et al. [28], and introduce a scalable relative interactive spatial-temporal graph generative adversarial network architecture (RISTG-GAN), which comprehensively considers the interaction effects of all agents in the scene to generate reasonable multi-modal predicted trajectories. First, LSTM nodes are used to model the spatial-temporal graph of human-environment interactions, and the spatial-temporal graph is converted into a feed-forward differentiable feature coding. Then, we introduce the time attention module to assign different weights to the past trajectory sequence of pedestrians, extract important information at different moments, and weaken the speed deviation between different pedestrians. Finally, an improved relative scaled dot product attention is used to capture the relative importance of the various interactions in the scene for the pedestrian trajectory, and recurrent sequence modeling and a generative adversarial network are trained jointly to generate reasonable future trajectory predictions. The main contributions of this paper are as follows:
1) This paper proposes a scalable RISTG-GAN architecture in which the number of nodes can change dynamically from scene to scene. The framework models all interactions in the scene and trains recurrent sequence modeling jointly with a generative adversarial network architecture to generate multi-modal pedestrian trajectory predictions, which match the randomness and uncertainty of pedestrian walking in real scenes.
2) In the feature coding stage, a time attention module is introduced to assign different weights to the past trajectory sequence of pedestrians, extract important information at different moments, align the pedestrians in space, and weaken the speed deviation between different pedestrians.
3) In the interaction stage, an improved scaled dot product attention is introduced to capture the relative importance of the impact of every interaction in the scene on the pedestrian trajectory, which better matches the innate human ability to screen information.
The rest of this paper is arranged as follows. In Section II, we review recent work on pedestrian trajectory prediction. In Section III, we explain the principle of the RISTG-GAN pedestrian trajectory prediction model in detail. In Section IV, we run comparative experiments against other models on public datasets and analyze the experimental results. In Section V, we summarize the work of this paper.

II. RELATED WORK
The focus of our work is predicting pedestrian trajectories. Over the past decades, many researchers have studied pedestrian trajectory prediction and proposed their own methods. Earlier work focused on modeling with hand-crafted feature functions [16], [29], [30], [31]. With the rapid development of deep learning, however, data-driven methods have recently made great progress in trajectory prediction. In this section, we review the RNN-based sequence prediction, attention mechanisms, and GAN models related to our work.

A. RECURRENT NEURAL NETWORKS FOR SEQUENCE PREDICTION
Pedestrian trajectory prediction is defined as predicting the future movement trajectory of a pedestrian from the past trajectory, which is a typical sequence generation problem. In recent years, recurrent neural networks (RNNs) have achieved great success in sequence prediction tasks. As a variant of the RNN, the long short-term memory network (LSTM) [32] can learn long-term dependencies. The LSTM uses three ''gate'' structures to control the cell state: the forget gate, the input gate, and the output gate. The forget gate decides what information to discard from the cell state, which limits the accumulation of irrelevant information and noise. The input gate is the selective memory stage: it decides how much of the current input to write into the cell state, so that important information is recorded and less important information is retained only weakly. The output gate decides what information to output from the cell state. Following the success of LSTMs in machine translation [33] and speech recognition [34], researchers began to apply them widely to pedestrian trajectory prediction. Alahi et al. [18] first proposed the Social-LSTM model, which divides the space around pedestrians into rectangular grid cells and uses a social pooling layer to capture the interactions between pedestrians, allowing neighboring pedestrians to share hidden states. Huynh Manh et al. [35] proposed scene-LSTM, which combines scene information with the historical trajectory of a pedestrian to predict the pedestrian's future trajectory in static crowded scenes. Xue et al. [36] proposed SS-LSTM, which uses three different LSTM networks to capture person-, social-, and scene-scale information, respectively, to improve pedestrian trajectory prediction.
Although the above methods improve pedestrian trajectory prediction to varying degrees, they do not account for the fact that different pedestrians or objects affect the target pedestrian to different degrees.
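The gate structure of the LSTM described in this subsection can be illustrated with a minimal scalar cell. This is a sketch of the standard LSTM update, not code from any of the cited models; the function name `lstm_step` and the parameter dictionary layout are hypothetical choices made for the example.

```python
import math


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each of "forget", "input", "candidate", "output" to a
    (w_x, w_h, b) triple; these names are illustrative, not from the paper.
    """
    def unit(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = unit("forget", sigmoid)        # how much of the old cell state to keep
    i = unit("input", sigmoid)         # how much of the new candidate to write
    g = unit("candidate", math.tanh)   # candidate content for the cell state
    o = unit("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # updated hidden state
    return h, c


# With all-zero parameters every gate equals 0.5 and the candidate is 0,
# so the previous cell state is simply halved.
zero = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero)
```

In a trajectory model the scalars would be vectors and the weights matrices, but the roles of the forget, input, and output gates are the same.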

B. ATTENTION MECHANISMS FOR TRAJECTORY PREDICTION
Humans have an innate ability to analyze and judge. When walking in a crowded scene, pedestrians pay more attention to nearby people and obstacles in front of them than to pedestrians or obstacles behind them or far away. This is because humans use limited visual attention to quickly screen useful information from the scene, which motivates the attention mechanism. Following the successful application of attention mechanisms in natural language processing [37], some researchers have introduced them into pedestrian trajectory prediction to capture the relative importance of the neighbors and obstacles around pedestrians in the scene. Vemula et al. [38] proposed a social attention mechanism that captures the relative importance of the other pedestrians in the scene for the current pedestrian's navigation. Fernando et al. [39] proposed a method combining soft attention and hard attention: soft attention is used to evaluate the significance of interactions in the scene area, and hard attention is used to assign different weights to pedestrians at different distances. Velickovic et al. [40] proposed a graph attention mechanism, in which stacked layers let nodes attend over the features of their neighborhoods and assign different weights to different nodes in a neighborhood. Sirin Haddad et al. [24] proposed a spatial-temporal attention mechanism, a variant of the multi-head method, which retains the global interaction information of all pedestrians in the scene and the local interaction information of static objects through accumulation and averaging. Stuart Eiffert et al. [27] used the Graph Vehicle-Pedestrian Attention Network (GVAT) to address a wider problem setting involving both pedestrians and vehicles. The network models social interactions and allows shared vehicle features as input.
These methods indicate that introducing an attention mechanism can indeed improve the accuracy of pedestrian trajectory prediction. In our work, we capture pedestrian trajectory information in the time domain and learn temporal dependencies over long time ranges by introducing temporal attention. Transformer networks have recently made great strides in natural language processing [41], [42]. We borrow this approach and introduce relative scaled dot product attention to capture the relative importance of the various interactions in the global scene that affect pedestrian trajectories.

C. GENERATIVE ADVERSARIAL NETWORKS (GANS)
The methods above produce only deterministic trajectory predictions. In real life, however, pedestrian trajectories exhibit randomness and uncertainty due to the influence of other pedestrians or obstacles in the scene, so multi-modal trajectory prediction output is more consistent with reality. Generative adversarial networks (GANs) [25] were initially used in probability calculation and behavioral reasoning. Agrim Gupta et al. [26] introduced GANs into the pedestrian trajectory prediction task for the first time and proposed the social-GAN model. Its generator is composed of an LSTM-based encoder-decoder with a social pooling layer that models the relationships between pedestrians. Its discriminator distinguishes whether a trajectory is real (ground truth) or fake (generated), and reasonable trajectory predictions are obtained through repeated adversarial training. Amir Sadeghian et al. [43] extended this idea and improved the model by adding physical and social attention mechanisms. The improved model can extract the most important trajectory information from the neighbors and assign different soft attention weights to the static environment. Vineet Kosaraju et al. [44] proposed the Social-BiGAT model, a graph-based generative adversarial network whose flexible graph structure better models the social interactions of pedestrians in the scene in order to generate reasonable multi-modal trajectory predictions.

III. PROBLEM REPRESENTATION AND MODEL
In this section, we first define the pedestrian trajectory prediction problem. Next, we introduce the RISTG-GAN framework and describe the working principle of the spatial-temporal graph feature coding based on relative interaction. Finally, we illustrate how recurrent sequence modeling and a generative adversarial network are trained jointly to output reasonable trajectory predictions.

A. PROBLEM DEFINITION
In this paper, we aim to predict pedestrian trajectories (x and y coordinates on a 2D map) in a fixed scene, taking into account both the previous movement of pedestrians and the positions of fixed obstacles in the scene (including stationary vehicles, lamp posts, etc.). At every moment, pedestrians regard the positions of the other pedestrians and obstacles around them as a static ''map''. As time passes, these static maps form a dynamic map with temporal sequence information. Therefore, the interaction between people and the environment has a spatial-temporal structure. The observable historical trajectory of pedestrian i is defined as $X_i = \{(x_i^t, y_i^t) \mid t = 1, \ldots, t_{obs}\}$, and the real future trajectory as $Y_i = \{(x_i^t, y_i^t) \mid t = t_{obs}+1, \ldots, t_{pred}\}$. Similarly, the predicted future pedestrian trajectory is defined as $\hat{Y}_i = \{(\hat{x}_i^t, \hat{y}_i^t) \mid t = t_{obs}+1, \ldots, t_{pred}\}$.

B. OVERALL MODEL
This paper proposes a new pedestrian trajectory prediction method, RISTG-GAN, which jointly considers each pedestrian's historical trajectory and state, the interactions with surrounding pedestrians, and the fixed obstacles in the scene in order to predict pedestrian trajectories accurately. The overall architecture is shown in Fig. 2. It can be divided into three modules: the feature encoder module, the generator/decoder module, and the discriminator module. The feature encoder module includes the time attention module and the relative scaled dot product attention module. First, the interactions between dynamic participants and fixed obstacles in the scene are modeled with the spatial-temporal graph, and an LSTM is used to extract node feature codings from the historical trajectory information of pedestrians. Next, the extracted feature codings are input into the time attention module, which assigns different weights to them at each time step to obtain the time feature coding. Finally, we improve the scaled dot product attention proposed in [45] and propose the relative scaled dot product attention. The historical trajectory information of pedestrians, the location information of fixed obstacles, and the time feature coding are input into the relative scaled dot product attention module to obtain the relative interactive feature coding, which captures the impact of the global scene on the pedestrian trajectory.
In the generator/decoder module, random noise, the time feature coding, and the relative interactive feature coding are combined as the input. Based on these features, the generator produces a distribution of diverse trajectories that conform to social rules. In the discriminator module, the discriminator is trained repeatedly to distinguish whether a trajectory is real (ground truth) or fake (generated). When the discriminator can no longer clearly tell generated trajectories from real ones, the generated output is considered reasonable.

C. SPATIAL-TEMPORAL GRAPH ARCHITECTURE
In this paper, we describe the dynamic spatial-temporal structure of the interaction between pedestrians and the environment in the scene using a spatial-temporal graph. We express the spatial-temporal graph as $G = (\nu, \varepsilon_S, \varepsilon_T)$, where $\nu$ is the set of instance nodes, $\varepsilon_T$ is the set of time edges, and $\varepsilon_S$ is the set of spatial edges. Its abstract network architecture is shown in Fig. 3. Instance nodes include pedestrian nodes P and fixed obstacle nodes O, and the number of nodes is variable. The spatial edges connect all instance nodes, while the time edges connect the same pedestrian node at adjacent time steps. Note that obstacle nodes do not need to be connected across adjacent time steps, because the position of an obstacle does not change with time.
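The graph structure just described can be sketched as a small data structure. This is an illustrative construction only: the function name `build_st_graph`, the tuple layout of nodes and edges, and the choice to connect every pair of instance nodes (including obstacle-obstacle pairs, which a real implementation might drop) are assumptions, not details given in the paper.

```python
from itertools import combinations


def build_st_graph(pedestrian_ids, obstacle_ids, num_steps):
    """Return (nodes, spatial_edges, temporal_edges) for one scene."""
    # Instance nodes: pedestrians are tagged "P", fixed obstacles "O".
    nodes = [("P", p) for p in pedestrian_ids] + [("O", o) for o in obstacle_ids]

    # Spatial edges: every pair of instance nodes, at every time step.
    spatial_edges = [
        (t, a, b)
        for t in range(num_steps)
        for a, b in combinations(nodes, 2)
    ]

    # Time edges: the same pedestrian node at adjacent time steps.
    # Obstacles get none because their positions never change.
    temporal_edges = [
        (("P", p), t, t + 1)
        for p in pedestrian_ids
        for t in range(num_steps - 1)
    ]
    return nodes, spatial_edges, temporal_edges


nodes, spatial, temporal = build_st_graph(["p1", "p2"], ["lamp"], num_steps=3)
```

Because the node list is rebuilt per scene, the number of nodes can vary freely between scenes, which is the scalability property the architecture relies on.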
In this paper, we introduce the time attention module to capture the temporal edge information of pedestrian trajectories and use the improved relative scaled dot product attention to capture the spatial edge information. The two are introduced in turn below.
FIGURE 3. Structure of spatial-temporal interaction information for pedestrians at adjacent time steps. The spatial relationships between pedestrians and obstacles are represented by black solid arrows, and $\varepsilon_S$ denotes the spatial edges; the black dotted lines represent the time edges that connect the same pedestrian node at adjacent time steps, and $\varepsilon_T$ denotes the time edges.

1) TIME ATTENTION MODULE
In the pedestrian trajectory prediction task, the position of a pedestrian changes dynamically with time, so it is necessary to capture trajectory information in the time domain. The time attention module extracts trajectory information in the time domain and assigns different weights to it. Taking pedestrian i as an example, we first use a multi-layer perceptron (MLP) to embed the coordinate position of pedestrian i into a fixed-length vector $e^t_{pi}$; an LSTM unit then takes this embedded vector as input to obtain the pedestrian node feature code $h^t_i$.
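Written out, these two steps take roughly the following form. This is a sketch rather than the authors' exact notation: the position symbol $(x^t_i, y^t_i)$ and the recurrence on the previous hidden state $h^{t-1}_i$ are assumptions based on common practice in LSTM trajectory encoders.

```latex
\begin{align}
e^{t}_{pi} &= \phi\left(x^{t}_{i},\, y^{t}_{i};\; W_{p}\right) \\
h^{t}_{i} &= \mathrm{LSTM}\left(h^{t-1}_{i},\, e^{t}_{pi};\; W^{temporal}_{p}\right)
\end{align}
```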
where $\phi(\cdot)$ is a nonlinear embedding function, $W_p$ is the embedding weight, and $W^{temporal}_p$ is the weight of the temporal-edge LSTM cell. We take the pedestrian node feature code $h^t_i$ obtained above as the input of the time attention module. Fig. 4 shows the network architecture of our time attention module, where N represents the number of people in the scene. The node feature $h_i$ (i = 1, 2, ..., N) is input to the FC layer to obtain the score $S_i$.
where FC is the fully connected network, $S_i$ is the score computed for $h_i$, $w_a$ and $b_a$ are the network parameters, and tanh(·) is the activation function. Next, $S_i$ is taken as the input of the S-BN layer, and the attention weight $a_i$ of $h_i$ is obtained.
where BN is the batch normalization function and S is the softmax(·) function. Finally, the time feature coding vector $\hat{h}_i$, which carries the time information, is obtained by multiplying each node feature coding $h_i$ by its corresponding attention weight $a_i$ and summing.
We capture the time edge information of pedestrian trajectory through the time attention module, which improves the accuracy and robustness of the model.
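The pipeline described above (FC layer with tanh, batch normalization, softmax, weighted sum) can be sketched in plain Python. Several details are assumptions: the score is taken to be a scalar `tanh(w_a · h + b_a)`, batch normalization is reduced to standardizing the scores without learned scale and shift, and the function simply pools whatever list of feature vectors it is given, since the text does not fully settle whether the sum runs over time steps or over nodes. The name `time_attention` is hypothetical.

```python
import math


def time_attention(features, w_a, b_a, eps=1e-5):
    """Weight a list of feature vectors and return (weights, pooled vector)."""
    # FC layer with tanh activation: one scalar score per feature vector.
    scores = [
        math.tanh(sum(w * x for w, x in zip(w_a, h)) + b_a)
        for h in features
    ]

    # Simplified batch normalization: standardize the scores.
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    normed = [(s - mean) / math.sqrt(var + eps) for s in scores]

    # Softmax (shifted by the maximum for numerical stability).
    top = max(normed)
    exps = [math.exp(s - top) for s in normed]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the feature vectors.
    dim = len(features[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, features)) for d in range(dim)]
    return weights, pooled


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = time_attention(features, w_a=[0.5, 0.5], b_a=0.0)
```

In this toy input the third vector receives the largest score, so it also receives the largest attention weight, while the first two vectors are weighted equally.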

2) RELATIVE SCALED DOT PRODUCT ATTENTION MODULE
In the pedestrian trajectory prediction task, the trajectory of the target pedestrian is affected not only by the surrounding pedestrians but also by the fixed obstacles in the scene, so we introduce the relative scaled dot product attention module to capture the spatial information of all instance nodes. It considers not only the relative position of the target pedestrian and its neighbors at the current and historical moments, but also the relative position with respect to the fixed obstacles, and assigns different weights. First, we calculate the relative distance $O^t_{ij}$ between the pedestrian and the fixed obstacle node.
$$O^t_{ij} = \begin{cases} \ldots, & \text{obstacles exist} \\ \ldots, & \text{obstacles do not exist} \end{cases} \tag{6}$$
Next, the fixed-length vector $r^t_{ij}$ is obtained by embedding the relative distances from pedestrian i to the adjacent pedestrian j and to the obstacle through a multi-layer perceptron, and the vector $r^t_{ij}$ is then used as the input of an LSTM unit to obtain the relative feature code $h^t_r$. When obstacle nodes exist, the relative feature code $h^t_r$ contains the context information of the scene. When obstacle nodes do not exist, the relative feature coding is reduced to contain only social interaction information.
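In equation form, the embedding and recurrence just described would read roughly as follows. The symbol $d^t_{ij}$ for the relative distance from pedestrian i to pedestrian j and the recurrence on $h^{t-1}_r$ are assumptions introduced for this sketch, not notation confirmed by the source.

```latex
\begin{align}
r^{t}_{ij} &= \phi\left(d^{t}_{ij},\, O^{t}_{ij};\; W_{r}\right) \\
h^{t}_{r} &= \mathrm{LSTM}\left(h^{t-1}_{r},\, r^{t}_{ij};\; W^{spatial}_{r}\right)
\end{align}
```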
where $W_r$ is the embedding weight and $W^{spatial}_r$ is the weight of the spatial-edge LSTM cell, which is shared among all instance nodes. Then, we use the scaled dot product attention mechanism proposed in [45] to assign an influence weight to every instance node in the scene. Finally, the influence weight is multiplied by the time feature coding vector $\hat{h}_i$ to obtain the relative interaction feature coding $I^t_i$ (the yellow grid square in Fig. 2).
where $W_1$ and $W_2$ are the weights used to linearly scale and project the hidden states onto $d_e$-dimensional vectors, Dot(·) is the dot product, softmax(·) is the activation function, and $1/\sqrt{d_e}$ is the scaling factor. This completes the encoding of relative interactive features based on the spatial-temporal graph.
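A minimal sketch of scaled dot product attention in the sense used above is given below. Which vectors play the roles of query, key, and value is an assumption: the text states only that $W_1$ and $W_2$ project hidden states to $d_e$ dimensions and that the resulting weights are applied to the time feature coding. The function and helper names are hypothetical, and the paper's "relative" modification is represented here only by the caller's choice of inputs.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_attention(query, keys, values, w1, w2):
    """Return (influence weights, weighted sum of values)."""
    d_e = len(w1)                      # projection dimension
    q = matvec(w1, query)              # project the query with W1
    scores = [
        sum(a * b for a, b in zip(q, matvec(w2, k))) / math.sqrt(d_e)
        for k in keys                  # project each key with W2, then Dot(.)
    ]
    weights = softmax(scores)          # one influence weight per instance node
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, pooled


identity = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = scaled_dot_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    w1=identity,
    w2=identity,
)
```

Dividing by $\sqrt{d_e}$ keeps the scores in a range where the softmax does not saturate as the projection dimension grows, which is the usual motivation for the scaling factor.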

D. GENERATOR
As mentioned in the introduction, pedestrian trajectories in crowded scenes are stochastic and uncertain, so it is reasonable to output multi-modal trajectory predictions, and a generative adversarial network is used for training. For the generator module (G), as shown in Fig. 2, we use a decoder based on LSTM units for feature vector decoding and trajectory generation. First, we introduce noise z drawn from the standard normal distribution (the gray square in Fig. 2). Next, we concatenate the time feature coding vector $\hat{h}_i$, the relative interactive feature coding $I^t_i$, and the noise vector z as the input of the decoder LSTM unit to obtain the mixed feature coding vector $h^t_{gi}$. Then, $h^t_{gi}$ is converted to spatial coordinates through a multi-layer perceptron.
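These decoding steps could be written as below. The assignment of $W_g$ to the decoder LSTM and $W_{ge}$ to the output MLP, the recurrence on $h^{t-1}_{gi}$, and the output symbol $(\hat{x}^t_i, \hat{y}^t_i)$ are assumptions made for this sketch; the source names the two weights without making their roles explicit.

```latex
\begin{align}
h^{t}_{gi} &= \mathrm{LSTM}\left(h^{t-1}_{gi},\, \left[\hat{h}_{i};\, I^{t}_{i};\, z\right];\; W_{g}\right) \\
\left(\hat{x}^{t}_{i},\, \hat{y}^{t}_{i}\right) &= \mathrm{MLP}\left(h^{t}_{gi};\; W_{ge}\right)
\end{align}
```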
where z is the noise vector drawn from the standard normal distribution, MLP(·) is the multi-layer perceptron, and $W_g$ and $W_{ge}$ are the embedding weights.

E. DISCRIMINATOR
The discriminator module (D) is shown in Fig. 2. Based on the observed historical trajectories of all pedestrians, the discriminator evaluates the real future trajectory $Y_i$ and the predicted future trajectory $\hat{Y}_i$ of each pedestrian. We apply an MLP to the last hidden state of the encoder to obtain the classification score $L_{disi}$.
$L_{disi}$ is the result of the classification (real/fake): $L_{disi} = 0$ means that the trajectory is judged fake, and $L_{disi} = 1$ means that it is judged real. $W_{e1}$, $W_{e2}$, and $W_{e3}$ are the embedding weights.

F. LOSS
We define the training objective of RISTG-GAN as $G^* = \arg\min_G \max_D L_{GAN}(G, D) + \lambda L_{L2}(G)$, where $\lambda$ is the weighting coefficient, and the adversarial loss $L_{GAN}(G, D)$ and the L2 loss $L_{L2}(G)$ are defined as follows:
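For reference, the standard forms of these two terms are given below. This is a sketch under the assumption that RISTG-GAN follows the usual GAN objective [25] and a plain L2 distance between the real and predicted future trajectories; the exact expectations, and any sampling-based variant of the L2 term used by the authors, may differ.

```latex
\begin{align}
L_{GAN}(G, D) &= \mathbb{E}_{Y_i}\left[\log D(Y_i)\right]
               + \mathbb{E}_{z}\left[\log\left(1 - D(\hat{Y}_i)\right)\right] \\
L_{L2}(G) &= \left\lVert Y_i - \hat{Y}_i \right\rVert_2
\end{align}
```

Here $\hat{Y}_i$ denotes the generator output for noise sample z, and $\lambda$ trades off realism of the generated distribution against closeness to the observed future.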

G. IMPLEMENTATION DETAILS
In our proposed model, the encoder and decoder are built from LSTM units. The hidden state sizes of the encoder and decoder are 16 and 32, respectively, and the input coordinates are embedded into 16-dimensional vectors. The Adam optimizer [46] is used to train the generator and discriminator, the initial learning rate is set to 0.001, and the number of training rounds is set to 200.

IV. EXPERIMENTS
In this section, we introduce the two datasets and the evaluation metrics used in the experiments, present the experimental results of our method, compare its performance with state-of-the-art methods, and provide quantitative analysis and qualitative results.

A. DATASETS AND METRICS
We evaluate the model performance on two common pedestrian trajectory datasets: ETH [47] and UCY [48]. The ETH dataset contains two subsets named ETH-univ and ETH-hotel, while the UCY dataset contains three subsets named UCY-zara1, UCY-zara2, and UCY-univ. These five real scenes cover most of the interactions between people and the environment that occur in the real world, such as turning at intersections, following the crowd, avoiding obstacles on the road, and crossing paths with one another. Following [26], [36], and [49], we evaluate the model on 8-second trajectory segments sampled at a time step of 0.4 seconds. The first 3.2 seconds (8 time steps) are observed as input, and the last 4.8 seconds (12 time steps) are to be predicted. In the evaluation of the model, we use two benchmark metrics: the average displacement error (ADE) and the final displacement error (FDE). The average displacement error (ADE) is defined as the average L2 distance between the ground truth and our prediction over all predicted time steps.
The final displacement error (FDE) is defined as the mean distance between the predicted final destination and the true final destination.
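In symbols, the two metrics take the following standard form. The summation bounds assume that prediction covers time steps $t_{obs}+1$ through $t_{pred}$; the symbol $t_{obs}$ for the last observed step is an assumption of this sketch.

```latex
\begin{align}
\mathrm{ADE} &= \frac{\sum_{i=1}^{N} \sum_{t=t_{obs}+1}^{t_{pred}}
                \left\lVert \hat{Y}^{t}_{i} - Y^{t}_{i} \right\rVert_2}
               {N \left(t_{pred} - t_{obs}\right)} \\
\mathrm{FDE} &= \frac{\sum_{i=1}^{N}
                \left\lVert \hat{Y}^{t_{pred}}_{i} - Y^{t_{pred}}_{i} \right\rVert_2}{N}
\end{align}
```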
where $\hat{Y}^t_i$ and $Y^t_i$ are the predicted position and the real position of pedestrian i at time t, respectively, and N is the number of pedestrians in the scene. The smaller the ADE and FDE values are, the more accurate the trajectory prediction is.
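Both metrics are straightforward to compute from paired trajectories. The sketch below assumes every pedestrian has the same number of predicted steps and that positions are given as (x, y) tuples; the function name `ade_fde` is a hypothetical choice, not part of any published evaluation code.

```python
import math


def ade_fde(predicted, ground_truth):
    """Compute (ADE, FDE) over a set of pedestrians.

    predicted, ground_truth: one list per pedestrian, each a list of
    (x, y) positions over the prediction horizon.
    """
    total_error = 0.0   # sum of L2 errors over all pedestrians and steps
    final_error = 0.0   # sum of L2 errors at the last predicted step
    num_steps = 0

    for pred, true in zip(predicted, ground_truth):
        dists = [math.dist(p, q) for p, q in zip(pred, true)]
        total_error += sum(dists)
        final_error += dists[-1]
        num_steps += len(dists)

    return total_error / num_steps, final_error / len(predicted)


# One pedestrian, two predicted steps: errors of 0 and 5 units.
ade, fde = ade_fde(
    predicted=[[(0.0, 0.0), (3.0, 4.0)]],
    ground_truth=[[(0.0, 0.0), (0.0, 0.0)]],
)
```

For a multi-modal model, these values are typically computed per sampled trajectory; how the samples are aggregated is a separate protocol choice that this sketch does not cover.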
To test the validity of the model, we choose five models for comparison: LSTM [32], S-LSTM [18], SS-LSTM [36], S-GAN [26], and SoPhie [43]. In addition, we perform an ablation study on the proposed RISTG-GAN. In the RISTG-GAN framework, we model the complex interactions between people and the environment through the spatial-temporal graph, and the instance nodes in the scene are divided into pedestrian nodes and obstacle nodes. The variant that considers only pedestrian nodes is called RISTG-GAN-1, and the variant that considers all agents is called RISTG-GAN-2. Table 1 describes what each of the seven models takes into account.

B. QUANTITATIVE EVALUATION
In Table 2, our proposed model is compared with five existing representative models on five publicly available datasets. We evaluate on 8-second trajectory segments, taking the first 3.2 seconds of each trajectory as the observed input and predicting the next 4.8 seconds. The comparison shows that the LSTM model performs worst, because it considers only the historical trajectory of each pedestrian. The S-LSTM model performs better than the plain LSTM model, because it uses a social pooling layer to capture the interaction information between nearby pedestrians. Compared with the S-LSTM model, the SS-LSTM model not only considers the interactions of all pedestrians in the scene but also uses the context information of the scene to predict pedestrian trajectories. Its average ADE and FDE over the five datasets decrease by 18% and 17%, respectively, which further demonstrates the importance of scene context information for prediction.
Compared with the above LSTM-based prediction models, the GAN-based prediction models have smaller prediction errors. The S-GAN model is the first to introduce the generative adversarial network into the pedestrian trajectory prediction task. It considers the multi-modality of pedestrian trajectories in crowded scenes and proposes a new pooling layer, so it performs better than SS-LSTM. Building on S-GAN, the SoPhie model takes scene information into account and improves prediction performance by introducing a physical attention mechanism and a social attention mechanism, especially on the UCY-zara2 dataset. Compared with SoPhie, the RISTG-GAN-1 model represents the complex scene with a spatial-temporal graph and captures the relative importance of crowd interaction to the pedestrian trajectory with relative scaled dot product attention. The results show that the trajectory prediction errors are further reduced. Building on RISTG-GAN-1, the RISTG-GAN-2 model also considers the positions of fixed obstacles, because a real scene contains not only moving pedestrians but also stationary obstacles (such as lamp posts and stationary vehicles), so the prediction performance improves further. Table 2 shows that, although the RISTG-GAN-2 model considers fixed obstacle nodes, there is no significant difference between RISTG-GAN-1 and RISTG-GAN-2 on the three datasets UCY-zara1, UCY-zara2, and UCY-univ, possibly because the obstacles in these scenes have little influence on how pedestrians walk. On ETH-univ and ETH-hotel, the RISTG-GAN-2 model performs better, because the scenes in these two datasets contain many obstacles that pedestrians need to avoid.
Compared with the SoPhie model, the mean values of ADE and FDE of the RISTG-GAN-2 model on five data sets are reduced by 24% and 21.7%, respectively.
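The percentage reductions quoted in this section can be read as relative error reductions with respect to a baseline. The short sketch below assumes that the per-dataset errors are averaged first and the reduction is then computed on the two means; the section does not state the order of these operations, and the numbers in the example are placeholders rather than values from Table 2.

```python
def mean_error(per_dataset_errors):
    """Arithmetic mean of one metric (ADE or FDE) across datasets."""
    return sum(per_dataset_errors) / len(per_dataset_errors)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Placeholder errors for five datasets (NOT the values from Table 2).
baseline_ade = [1.0, 1.0, 1.0, 1.0, 1.0]
model_ade = [0.5, 0.75, 0.75, 0.75, 1.0]
reduction = relative_reduction(mean_error(baseline_ade), mean_error(model_ade))
```

With these placeholder values the mean error drops from 1.0 to 0.75, which corresponds to a 25% reduction.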
For autonomous vehicles and social robots, it is crucial to infer the next trajectory of a pedestrian both accurately and quickly in order to avoid collisions in crowded scenes. The faster the reasoning speed, the better pedestrian safety can be guaranteed. Therefore, we also compare the reasoning speed of the models, which is recorded in Table 3. Because the LSTM model considers only the historical trajectory of each pedestrian, its computational load is small and its reasoning speed is the fastest, but its accuracy is too low. The S-LSTM and SS-LSTM models are both based on the LSTM model. They divide the space in which pedestrians are located into grid cells and compute the interactions over these cells, so their computational efficiency is low and their reasoning speed is the slowest. The S-GAN model introduces the generative adversarial network (GAN) and considers the multimodal property of pedestrian trajectories. It achieves both high accuracy and fast reasoning, because adversarial training can significantly improve the memory utilization rate. Following this, our model also adopts a GAN and models the scene through a flexible spatial-temporal graph. Compared with the SS-LSTM model, the reasoning speed of RISTG-GAN-1 and RISTG-GAN-2 is increased by 18.83 times and 16.83 times, respectively.

C. QUALITATIVE EVALUATION
On the basis of the quantitative evaluation, we qualitatively evaluate the predictions of the S-GAN, SoPhie, and RISTG-GAN-2 models in four different real scenes from the ETH and UCY datasets. The visualization results are shown in Fig. 5. Fig. 5(a) is a crossing scene. The predicted trajectories of all three models successfully pass the oncoming pedestrians, because all three models model the interaction of pedestrians in the scene. However, S-GAN and SoPhie only pay attention to short-term social information in the pooling process, so their predictions differ greatly from the real trajectory. Our model, RISTG-GAN-2, uses a spatial-temporal graph to capture long-term social information, so its predictions are closer to the real future trajectory. Fig. 5(b) is a scene that combines pedestrian interaction and obstacle avoidance. The trajectories predicted by all three models avoid the other pedestrians, but the trajectory predicted by S-GAN fails to avoid the obstacle (a seat) in the scene, because S-GAN only considers the interaction between pedestrians and does not consider scene information. The trajectory predicted by SoPhie avoids the obstacle, because the model considers scene information, but it does not consider the relative importance of the obstacle's influence on the pedestrian trajectory, so the predicted trajectory is still quite different from the real one. The RISTG-GAN-2 model addresses both problems, so it successfully avoids the obstacle and its predicted trajectory is closer to the real future trajectory. Fig. 5(c) is a following scene. Since the S-GAN model adopts a max-pooling mechanism and only pays attention to the most important information affecting the pedestrian trajectory, the error between its predicted trajectory and the real trajectory is the largest.
The social attention component proposed by the SoPhie model can aggregate the information of different participants, but it is still insensitive to the unstructured characteristics of pedestrian interaction, so it also has large errors. The RISTG-GAN-2 model captures the spatial-temporal interaction information between humans and the environment by using the spatial-temporal graph, and allocates different influence weights according to the interaction information. Therefore, the predicted trajectory of RISTG-GAN-2 is closer to the real trajectory. Fig. 5(d) is a scene in which the prediction fails. Two pedestrians walking in a straight line temporarily change their walking direction because a vehicle suddenly passes nearby. In this situation, the predictions of all three models are not ideal.
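The contrast drawn above, between max pooling that keeps only the dominant neighbour information and an aggregation that assigns a different weight to each interaction, can be illustrated with a small sketch. The code below uses the standard scaled dot-product attention, not the relative variant proposed in this paper, and all function names and feature values are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(features):
    """Element-wise maximum over neighbour feature vectors."""
    return [max(column) for column in zip(*features)]


def attention_pool(query, keys, values):
    """Standard scaled dot-product attention for a single query.

    query: feature vector of the target pedestrian.
    keys, values: one feature vector per neighbouring agent.
    Returns the weighted sum of values and the attention weights.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(len(values[0]))]
    return pooled, weights


# Two hypothetical neighbours: the first aligns with the query, the second does not.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(query, keys, values)
pooled_max = max_pool(values)
```

Max pooling returns the same vector regardless of how relevant each neighbour is to the target pedestrian, whereas the attention weights sum to one and give the neighbour whose key aligns with the query a larger share of the aggregated feature.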
In Fig. 6, we visualize how much attention pedestrians pay to their surroundings. The experimental results show that people pay more attention to pedestrians and fixed obstacles in front of them than to those behind them or in the distance, which is consistent with common sense. Context changes behind the pedestrian or in a distant scene may affect the pedestrian's future navigation decisions.

V. CONCLUSION
In this paper, we propose the RISTG-GAN model for pedestrian trajectory prediction. The model uses a spatial-temporal graph to model the various interactions between humans and the environment, and a time attention module to capture the temporal information of the pedestrian trajectory and assign different weights to it. The relative importance of the various interactions to the pedestrian trajectory is captured by the relative interaction scaled dot product attention module. In addition, considering the randomness of pedestrian movement in complex scenes, we introduce a generative adversarial network to generate a distribution of diverse trajectories that comply with social rules. Experimental results show that our model performs better than the latest benchmark methods on multiple available datasets. Our proposed method better captures the interactions of all agents in complex scenes and improves pedestrian trajectory prediction. Because all agent interactions are considered, the complexity of our approach is slightly higher than that of the baseline approaches, although this does not outweigh its advantages. In the future, we will continue to optimize the model, reducing its complexity while improving its accuracy, so as to improve the navigation accuracy and real-time performance of autonomous vehicles and social robots.