Generating Reliable and Efficient Predictions of Human Motion: A Promising Encounter between Physics and Neural Networks

Generating accurate and efficient predictions of the motion of the humans present in a scene is key to the development of effective motion planning algorithms for robots moving in areas shared with humans, where wrong planning decisions could generate safety hazards or simply make the presence of the robot "socially" unacceptable. Our approach to predicting human motion is based on a neural network of a peculiar kind. Contrary to conventional deep neural networks, our network embeds in its structure the popular Social Force Model (SFM), a dynamic equation describing the motion in physical terms. This choice allows us to concentrate the learning phase on the aspects that are really unknown (i.e., the model's parameters) and to keep the structure of the network simple and manageable. As a result, we obtain good prediction accuracy with a small, synthetically generated training set, and the accuracy remains acceptable even when the network is applied in scenarios quite different from those for which it was trained. Finally, the choices of the network are "explainable", as they can be interpreted in physical terms. Comparative and experimental results demonstrate the effectiveness of the proposed approach.

Neural Networks (NN) hold the promise to solve this type of problem in a simpler way. In principle, a deep neural network (DNN) trained with a sufficient number of samples could learn the human motion patterns by discovering the underlying dynamic model on its own. However, the number of layers and of neurons required to manage the complexity of human behaviours can be very large and is in any case hard to predict. It is equally difficult to determine the number of samples needed to train a network of this complexity. Finally, a DNN lacks a property of remarkable importance for many applications: the so-called "explainability". When an autonomous system takes a decision, it is important to understand why that specific choice has been made, in order to fix bugs or attribute legal responsibility [6]. The total absence of a prior model in a DNN makes explainability hard or even impossible to achieve.
In this paper, we seek to bridge the gap between model-based and learning-based approaches in order to retain the advantages of both. Our goal is to predict the motion of a human for several seconds ahead using a short segment of past observations. To this end, we use a NN, but the network's structure is chosen so that its connections reflect the dynamics of the SFM, i.e. we embed our prior knowledge into the NN in the form of a model, assuming that the latter acceptably represents the dynamics of human motion. This way, the learning phase is concentrated on the aspects for which we actually lack any real knowledge: the parameters and the forces acting in the SFM. The advantages of this approach are manifold: 1. wiring a model inside the NN reduces the number of neurons by a significant amount (we estimate one or two orders of magnitude); 2. as shown in our experiments, a relatively small number of synthetically generated samples is sufficient to generate accurate predictions, even for scenarios that are quite different from the ones considered in the training set; 3. because our NN retains the model inside, its decisions can be explained in physical terms, which simplifies the interpretation of the results of the NN and the explanation of its possible mistakes.
The paper is organised as follows. In Section II, we review the related work in the area and summarise some background knowledge on the SFM, which will prove useful in the development of the paper.
In Section III, we present the key contribution of the paper: how to embed the SFM into the structure of a NN.
In Section IV, we report a full set of experiments supporting the validity of the approach. Finally, in Section V we state our conclusions and outline future work directions.

A. Related Work
Physics-based methods for human motion prediction are based on an explicit dynamical model derived from Newton's laws of motion. Their implementation is quite easy and they usually run faster than alternative approaches. A major limitation is that they work well in describing local dynamics, but fail to follow the medium/long-term intentions of human beings. Estimating on-the-fly the hypothetical target of a walking pedestrian from the past motion is still an open issue [7]. For example, in [8], a virtual goal is chosen as the position that a person would reach if s/he moved with constant velocity, while in [9] a set of trajectory sub-goals is estimated from the recorded data in a structured environment. Furthermore, [10] proposed a modified formulation of the SFM to calibrate the parameters with observable features from empirical data.
These limitations persist even if NNs are employed. As in this paper, other works have explored the feasibility of combining the SFM with machine learning techniques. For instance, a gradient-descent-based method was proposed by [11] to learn the parameters of the interaction force of the SFM; [12] combined the SFM with three different direction-decision predictors, namely a Linear Regression, a Neural Network and a Decision Tree model, and also investigated the environment features that affect the direction choice. Conversely, [13] used an evolutionary learning algorithm to fit the SFM parameters to video-recorded data of a crowd.
However, neural networks usually suffer from overfitting, or they depend strictly on the type of information with which they are trained. For example, [14] showed that state-of-the-art neural models can be outperformed by a simple constant velocity model in the case of linear trajectories.
The idea of pre-wiring the network structure with the physical law for human motion prediction is rather new. A similar approach has been followed for vehicle dynamics modelling in [15], where the flexibility of data-driven approaches is combined with a NN that embeds the vehicle physics model.
In its standard form, the SFM describes the motion of a pedestrian of mass $m$ located at position $p$ as
$$m\,\dot{v} = f^{o} + \sum_{w} f^{w} + \sum_{j} f^{p}_{j}, \tag{1}$$
where $v = \dot{p}$. Furthermore, the attractive force $f^{o}$ is defined as
$$f^{o} = m\,\frac{v_d\, e_d - v}{\tau}, \tag{2}$$
where the characteristic time parameter $\tau > 0$ determines the rate of change of the velocity vector, $v_d$ is the desired speed, and $e_d$ is the unit vector pointing towards the goal. The force exerted by the static obstacle $w$ on the $i$-th pedestrian is
$$f^{w} = \left[ A\, e^{(r - d_w)/B} + k_1\, g(r - d_w) \right] n_w - k_2\, g(r - d_w)\,(v \cdot t_w)\, t_w, \tag{3}$$
i.e. the sum of a repulsive component, a compression force and a sliding friction force. We denote with $r$ the radius of the pedestrian and with $d_w = \|p - p_w\|$ the distance between the pedestrian's centre of mass and the closest point of the obstacle, so that $n_w = (p - p_w)/d_w$ and $t_w = [-n_w(2),\, n_w(1)]^T$ are the unit vector along the distance direction and its tangential direction, respectively. The function $g(x) = \max\{0, x\}$ models the fact that both the compression and the sliding friction forces exist only if the pedestrian touches the obstacle ($d_w < r$). $A$, $B$, $k_1$ and $k_2$ are the model parameters. Notice that in this paper we neglect the interaction forces $f^{p}_{j}$ with the $j$-th pedestrian in (1), which will be the subject of future work.
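As a numerical illustration of the attractive and obstacle forces described above, the following minimal sketch evaluates them with NumPy, assuming the standard SFM expressions. The function names and the default values of `m`, `tau`, `r`, `k1` and `k2` are illustrative assumptions and are not taken from this paper; only `A = 1000` and `B = 0.08` match the values reported later in the text.

```python
import numpy as np

def attractive_force(v, e_d, v_d, m=80.0, tau=0.5):
    """Goal-directed force m * (v_d * e_d - v) / tau (m, tau are assumed values)."""
    v = np.asarray(v, dtype=float)
    e_d = np.asarray(e_d, dtype=float)
    return m * (v_d * e_d - v) / tau

def wall_force(p, v, p_w, r=0.3, A=1000.0, B=0.08, k1=1.2e5, k2=2.4e5):
    """Force of a static obstacle whose closest point is p_w.

    r, k1 and k2 are assumed values chosen only for illustration.
    """
    p, v, p_w = (np.asarray(a, dtype=float) for a in (p, v, p_w))
    d_w = np.linalg.norm(p - p_w)          # distance to the closest obstacle point
    n_w = (p - p_w) / d_w                  # unit vector pointing away from the obstacle
    t_w = np.array([-n_w[1], n_w[0]])      # tangential direction
    overlap = max(0.0, r - d_w)            # g(r - d_w): non-zero only on contact
    repulsion = A * np.exp((r - d_w) / B)  # exponential repulsive term
    normal_part = (repulsion + k1 * overlap) * n_w
    friction_part = k2 * overlap * np.dot(v, t_w) * t_w
    return normal_part - friction_part
```

Without contact (`d_w > r`) only the exponential repulsion along `n_w` survives, which is the simplified form used when the compression and friction terms are neglected.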

III. NEURAL NETWORK MODEL
In the proposed structured network, the neurons are organised in order to process the input signals according to (1). Two separate branches, according to the considered scenario, are designed to estimate the SFM forces.
In the case of no obstacles (first scenario), the agent moves freely towards its goal, so it is subject only to (2).
In the second scenario, the pedestrian is affected by the repulsive force (3) of the surrounding static objects (if any). Consequently, each network branch models effects of different nature, i.e. attractive or repulsive forces, which, given the linear nature of the SFM, are summed up at the end to give the resulting force.

A. Open environment
While freely moving towards the desired goal, the pedestrian is only affected by the force $f^{o}$ in (2). Hence, the first neural network Net1 has to predict the two force components $f^{o}_x$, $f^{o}_y$. The network inputs are $n$ samples of the past coordinates $p$ of the pedestrian (up to the current time $t$). In order to avoid spatial biases, the coordinates are normalised with respect to the first sample of the window. First, two hidden layers with no biases, each with only one fully connected output neuron, learn the instantaneous velocity components $v_x$, $v_y$ along the $X_f$ and $Y_f$ axes, respectively. Since these two layers are followed by a $\tanh(\cdot)$ activation function, another single neuron with no bias is used in each layer to rescale the estimates. Moreover, the most recent relative motion measurement $\Delta p_1(t) = p_t - p_{t-1}$ is used to estimate the components of the normalised goal-directed unit vector $e_d$. Finally, the velocity magnitudes are used to estimate the desired speed $v_d$. All the estimates pass through a Lambda layer, where they are combined and weighted according to the $m$ and $\tau$ parameters in (2). The Net1 output is then the estimate of (2), translated into the form of a structured NN.
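The data flow described for Net1 can be illustrated with a plain NumPy forward pass. This is only a sketch under stated assumptions, not the trained network: the weights are set by hand so that the bias-free neurons reproduce a finite-difference velocity, the desired speed is taken as a weighted average of the speeds in the window, and `net1_forward`, `finite_difference_weights`, `m`, `tau` and `eps` are hypothetical names and values. The sampling time `dt = 0.1` s follows from a window of 10 samples spanning 1 s.

```python
import numpy as np

def net1_forward(window, w_v, w_scale, w_d, m=80.0, tau=0.5, dt=0.1):
    """Structured forward pass for the open-environment case.

    window : (n, 2) past positions, oldest first.
    w_v    : (n, 2) bias-free weights of the two velocity neurons (x and y).
    w_scale: (2,) bias-free rescaling weights applied after tanh.
    w_d    : (n - 1,) weights combining the speeds into the desired speed.
    """
    window = np.asarray(window, dtype=float)
    rel = window - window[0]                           # normalise w.r.t. first sample
    v = w_scale * np.tanh(np.sum(rel * w_v, axis=0))   # estimated [v_x, v_y]
    dp1 = window[-1] - window[-2]                      # most recent displacement
    e_d = dp1 / np.linalg.norm(dp1)                    # goal-directed unit vector
    speeds = np.linalg.norm(np.diff(window, axis=0), axis=1) / dt
    v_d = float(w_d @ speeds)                          # desired speed estimate
    return m * (v_d * e_d - v) / tau                   # combination as in the SFM

def finite_difference_weights(n, dt=0.1, eps=1e-3):
    """Hand-set weights (an assumption) standing in for trained ones."""
    w_v = np.zeros((n, 2))
    w_v[-1], w_v[-2] = eps / dt, -eps / dt   # small gain keeps tanh in its linear range
    w_scale = np.full(2, 1.0 / eps)          # rescaling undoes the small gain
    w_d = np.full(n - 1, 1.0 / (n - 1))      # average speed over the window
    return w_v, w_scale, w_d
```

With these weights, a pedestrian walking at constant velocity yields an almost zero force, while a pedestrian that has just slowed down is pushed back towards its average speed.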

B. Structured environment
For the environment with obstacles, the second network Net2 has two parallel branches that predict the $f^{o}_x$, $f^{o}_y$ and $f^{W}_x$, $f^{W}_y$ components of the force, respectively. The presence of obstacles makes the prediction of the agent's target indeterminate, i.e. a motion observation alone cannot reveal whether the travelled path depends on the obstacle or on the attraction of the goal. For this reason, we directly provide $e_d$ as input to the branch that estimates the attractive force $f^{o}$. Our strategy for choosing the goal position (and, hence, $e_d$) is described later in Section III-C. The force due to static obstacles is described by (3). In order to reduce the learning complexity, we neglected the compression and the sliding friction forces, both in the SFM simulations and in the neural network. The second branch of the network then comprises a Lambda layer (followed by two single-neuron layers with no bias) that takes as inputs the distance $d_w$ and the components of the unit vector $n_w$ at the current time $t$. The inputs are combined in an exponential form as in (3), where the only two learnable weights reflect the $A$, $B$ parameters of the SFM. In the resulting formulation of the total force $f$ in the structured NN, $w_A, w_B \in \mathbb{R}^{1}$ and $w_{f_s} \in \mathbb{R}^{1 \times 2}$ are again learnable weights.
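The obstacle branch can be sketched as follows, assuming that the exponential form is the repulsive term of the standard SFM with the two weights `w_A` and `w_B` playing the roles of $A$ and $B$. The function names and the radius `r` are hypothetical, and the additional rescaling weights are omitted. The helper `fit_A_B` uses a closed-form log-linear fit, which is not the gradient-based training of the network; it is included only to show that two weights are sufficient to identify the repulsion from noise-free data.

```python
import numpy as np

def obstacle_branch(d_w, n_w, w_A, w_B, r=0.3):
    """Repulsive force w_A * exp((r - d_w) / w_B) * n_w (r is an assumed radius)."""
    d_w = np.asarray(d_w, dtype=float)
    n_w = np.asarray(n_w, dtype=float)
    magnitude = np.asarray(w_A * np.exp((r - d_w) / w_B))
    return magnitude[..., None] * n_w

def total_force(f_o, d_w, n_w, w_A, w_B, r=0.3):
    """Sum of the two branch outputs, exploiting the additivity of the SFM forces."""
    return np.asarray(f_o, dtype=float) + obstacle_branch(d_w, n_w, w_A, w_B, r)

def fit_A_B(d_w, magnitude, r=0.3):
    """Recover (A, B) from force magnitudes: log|f| = log A + (r - d_w) / B."""
    x = r - np.asarray(d_w, dtype=float)
    slope, intercept = np.polyfit(x, np.log(magnitude), 1)
    return float(np.exp(intercept)), float(1.0 / slope)
```

For example, magnitudes generated with `A = 1000` and `B = 0.08` over a range of distances are mapped back to the same two values by `fit_A_B`.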

C. Multi-goal prediction
Since the Net2 network cannot estimate the agent's goal on its own, we implement a multi-goal approach to estimate the most likely navigation direction. The proposed strategy is based on the combination of pseudo-Kalman filtering and likelihood analysis. More precisely, we assume that the agent never stops while moving from one of the four possible waypoint areas (see Fig. 1-a) to another one.
We set the parameters in (3) to $A = 1000$, $B = 0.08$, according to [2]. For each synthetic dataset, 70% of the samples were used for training, while the remaining samples were used for validation. Notice that, in order to avoid possible correlations between training and validation, the samples are randomised only after the two sets have been separated. The window of motion observations was empirically set to $n = 10$ samples, which provides a good trade-off between learning speed and prediction accuracy. This result is consistent with the fact that the networks mostly depend on the most recent data, and that a longer motion observation does not significantly improve the prediction accuracy [14]. Both networks are trained without over-fitting on their respective datasets, with training mean squared errors of $8.98 \cdot 10^{-5}$ N and $1.20 \cdot 10^{-3}$ N for Net1 and Net2, respectively.
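The windowing and the split-before-shuffle rule described above can be sketched as follows. This is a minimal illustration, assuming that the samples are overlapping windows of $n = 10$ consecutive positions and that the first 70% of them form the training set; `make_windows`, `split_then_shuffle` and the seed are hypothetical names and values.

```python
import numpy as np

def make_windows(traj, n=10):
    """Stack all overlapping windows of n consecutive positions, shape (N, n, 2)."""
    traj = np.asarray(traj, dtype=float)
    return np.stack([traj[i:i + n] for i in range(len(traj) - n + 1)])

def split_then_shuffle(samples, train_frac=0.7, seed=0):
    """Separate training and validation first, then shuffle each set on its own.

    Shuffling before the split would scatter nearly identical neighbouring
    windows across both sets and correlate training with validation.
    """
    rng = np.random.default_rng(seed)
    cut = int(round(train_frac * len(samples)))
    train, val = samples[:cut].copy(), samples[cut:].copy()
    rng.shuffle(train)   # in-place shuffle along the first axis
    rng.shuffle(val)
    return train, val
```

Because the cut is made on the ordered samples, every validation window starts later in the trajectory than any training window.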
Due to space limits, we only report the results shown in Fig. 2 and Fig. 3. In particular, Fig. 3 shows the results in a completely novel environment, where the agent wants to reach the exit of the right corridor. Despite a slight underestimation of the forces when horizontal collisions with the walls occur (see Fig. 3-b), the prediction remains consistent also in this novel scenario. This demonstrates that Net2 has not been negatively affected by environmental biases during the training. We then compared the open-loop predictions made by the NN with the trajectories generated with the SFM. After inferring the forces from the first window of ground-truth motion, we recursively propagate the predictions through (1). The force inputs used by the network depend solely on the positions predicted in the previous time window of $n = 10$ steps, thus generating the results in Fig. 4. In both scenarios, the predicted trajectory closely replicates the one generated by the SFM.
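The recursive open-loop prediction can be sketched as a loop in which each predicted position is appended to the window used for the next force estimate. The sketch below assumes a semi-implicit Euler discretisation of the SFM dynamics with a sampling time of 0.1 s; `open_loop_rollout` and `goal_force` are hypothetical names, and `goal_force` is an analytic stand-in for the trained network with assumed values for the goal, `v_d`, `m` and `tau`.

```python
import numpy as np

def open_loop_rollout(window, force_fn, steps, dt=0.1, m=80.0):
    """Predict `steps` future positions from an initial window of observations."""
    window = np.asarray(window, dtype=float).copy()
    v = (window[-1] - window[-2]) / dt          # initial velocity from the last two samples
    preds = []
    for _ in range(steps):
        f = force_fn(window)                    # force inferred from the current window
        v = v + dt * f / m                      # integrate m * dv/dt = f
        p_next = window[-1] + dt * v            # integrate dp/dt = v
        window = np.vstack([window[1:], p_next])  # slide the window over the prediction
        preds.append(p_next)
    return np.array(preds)

def goal_force(window, goal=(5.0, 0.0), v_d=1.3, m=80.0, tau=0.5, dt=0.1):
    """Analytic attractive force used here in place of the network (assumption)."""
    v = (window[-1] - window[-2]) / dt
    to_goal = np.asarray(goal, dtype=float) - window[-1]
    e_d = to_goal / np.linalg.norm(to_goal)
    return m * (v_d * e_d - v) / tau
```

After the first step the window contains only predictions, so errors are never corrected by new measurements; this is what makes the comparison an open-loop one.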

A. Net1 experimental validation
To experimentally validate the performance of the Net1 network and make a comparison with other methods in the literature, we used two widely known human motion datasets: the ETH [17] and UCY [18] datasets.
The former contains the scenes Hotel and ETH, while the latter contains three scenes, namely UCY, Zara1 and Zara2. In these scenes, the effect of static obstacles is mostly negligible.
In line with other related works [19], [7], [14], we compute the errors using the same metrics. Unlike the cross-validation training used, for example, in [19], [7], we used the synthetically trained Net1 to validate the prediction accuracy over real-world data. In other works, the observation window was usually set to 8 timesteps (that is, 3.2 s according to the data acquisition frame rate of the datasets), while the predictions spanned the successive 4.8 s. In our model, we observe only 1 s of the real-world trajectories, according to the length of the motion observation window of $n = 10$ samples, and similarly predict for 4.8 s for comparability. In Table I we show the prediction errors for all the datasets and the comparison with the Constant Velocity (CV) and Constant Acceleration (CA) models and the Feed Forward (FF) neural network implemented by [14]. Moreover, we report the comparison with the LSTM network by [19].
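As an illustration of how such prediction errors are typically computed, the sketch below assumes that the metrics are the average and final displacement errors commonly used in this literature; this is an assumption, since the metric definitions are not stated in this section. The helper `horizon_steps` converts the horizons quoted above into numbers of samples, using the 0.4 s dataset timestep implied by 8 timesteps spanning 3.2 s. Both function names are hypothetical.

```python
import numpy as np

def displacement_errors(pred, truth):
    """Return (average, final) Euclidean displacement error of one trajectory."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    dist = np.linalg.norm(pred - truth, axis=-1)   # per-step position error
    return float(dist.mean()), float(dist[-1])

def horizon_steps(seconds, dt):
    """Number of samples covering a time horizon at sampling time dt."""
    return int(round(seconds / dt))
```

With these conventions, a 4.8 s prediction corresponds to 12 samples at the dataset timestep of 0.4 s, or 48 samples at the 0.1 s step of a 10-sample, 1 s observation window.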

B. Multi-goal prediction on synthetic trajectories
Let us consider again the intersecting corridors scenario. Following the method described in Section III-C, we identify four waypoint areas, which correspond to the areas chosen for the network training. The pedestrian starts from one of the areas (the uppermost one in the experiment in Fig. 1-a) and reaches one of the three below. Therefore, we generate three different hypotheses for the trajectory predictions with the Net2 network (one per area), choosing the area centroids as waypoints. As shown in Fig. 1-b, after about 2 seconds the classifier is able to find the correct goal, since our model moves the agent in accordance with the SFM. In the following seconds, the confidence in the simulated trajectory increases.
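The way the confidence in each candidate goal can evolve over time may be sketched with a recursive likelihood update. The exact filter is not detailed in this section, so the sketch assumes an isotropic Gaussian likelihood of the observed position around the position predicted under each goal hypothesis; `update_goal_confidence` and the standard deviation `sigma` are hypothetical.

```python
import numpy as np

def update_goal_confidence(prior, observed, predicted, sigma=0.3):
    """One recursive update of the goal probabilities.

    prior     : (n_goals,) current confidence in each goal.
    observed  : (2,) measured pedestrian position.
    predicted : (n_goals, 2) position predicted under each goal hypothesis.
    """
    prior = np.asarray(prior, dtype=float)
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sq_err = np.sum((predicted - observed) ** 2, axis=1)
    # Work in log space to avoid underflow; floor the prior to keep log finite.
    log_post = np.log(np.maximum(prior, 1e-300)) - sq_err / (2.0 * sigma ** 2)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()
```

Starting from equal confidence in the three goals and repeating the update at every new measurement, the hypothesis whose predicted trajectory stays closest to the observations accumulates the confidence.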

C. Net2 validation
The final evaluation of Net2 with the multi-goal strategy is carried out through actual experiments in our department at the University of Trento. In particular, we recorded the data in a portion of a hallway with multiple exits. Data were collected using a LIDAR with a 360° field of view and a maximum measuring distance of 6 m, running at 20 frames per second. The laser scanner was placed about 80 cm from the ground at the centre of the scene, in order to optimally see the two sides of the hallway. The measurement points provided by the sensor were used to extract both the wall information (that is, the points that remain static between subsequent frames) and the pedestrian positions. Our acquisition algorithm extracts the points belonging to the person's waist and clusters them into a single planar position. In the recorded set, depicted in Fig. 5-a, the pedestrian could go to three different targets, i.e. one directly to the left, one to the right and one at the end of the hallway (see Fig. 5-b). In Fig. 6-a we report the classification result for the trajectory towards the blue waypoint. After observing 1 second of the real trajectory and knowing the three possible waypoints, we predict the trajectories in open loop with the network and the multi-goal strategy. The waypoint on the left exit is discarded as a candidate goal after about 4 seconds, while the confidence in the two remaining waypoints remains almost the same until the correct goal is found after about 6 seconds, before the pedestrian passes the next exit (see Fig. 6-b). It is worth noting that just one second of observed trajectory is needed and that useful predictions can be derived by using a simple dynamic model such as the SFM as the motion strategy. Even though the SFM is not suitable for motion planning applications (mainly due to its inability to manage nonlinearities), the synergy with a NN leads to accurate forecasts of the local motion without the need to choose the parameters manually.

V. CONCLUSION
In this paper, we have presented a novel technique for predicting human motion. Our idea is based on the combination of a neural network with a well-known physics-inspired dynamic model, the SFM. In the combination, each of the two approaches emphasises its own strengths and compensates for the weaknesses of the other.
Specifically, the SFM brings a structure to the NN, reducing its complexity and the number of samples needed for the training. Furthermore, the NN predictions become explainable and physically interpretable. On the other hand, the NN expresses its full power in terms of flexibility and of its ability to learn the complex parameter set of the SFM, which would be very difficult to estimate in real time by conventional means because of the strong nonlinearities of the model. Our simulations and experiments reveal the full potential of the marriage between the two worlds of physics-inspired models and neural networks. Many important points remain open and will attract our efforts in the near future. First, we aim to establish a full comparison between the performance of our structured NN and a standard DNN over a number of realistic use cases. Second, we plan to develop NNs embedding different models that are potentially more realistic than the SFM, first and foremost the HSFM [2] or the PHSFM [20]. Third, we plan to develop motion planning algorithms designed to make the best use of learning in predicting human motion.