Deep Learning for 3D Human Motion Prediction: State-of-the-Art and Future Trends

Due to the success of deep learning in a wide range of computer vision and computer graphics tasks, an increasing number of methods leverage deep neural networks to solve human motion prediction. Recent motion prediction methods focus on solving many issues to predict accurate and natural human motion in the temporal domain. In this study, we present a comprehensive survey of deep-learning-based human motion prediction methods. First, we define the human motion prediction problem and the scope of this study. We then provide related background knowledge and a comprehensive list of motion prediction methods based on our proposed classification. Next, we provide a complete survey of the characteristics widely used in the literature and explain the evaluation processes. Finally, we present a quantitative comparison of recent studies and address the remaining unsolved issues while exploring possible directions for future research.


I. INTRODUCTION
The ability of machines to anticipate future human motion is beneficial because it opens up many possible applications. Human character animation in the virtual world and computer games can be further improved using a flawless human motion prediction model. More importantly, developing a prediction model that is comparable to human behavior necessitates learning how humans behave while performing certain actions, which is crucial for many computer vision tasks. This knowledge is helpful in applications such as sports, surveillance, autonomous driving cars, robotics, and smart user interfaces.
Humans can predict a set of actions that are about to transpire based on learned experience. However, prediction is a challenging task for machines, and predicting future human actions is obviously more complicated than in other applications. Even with deep learning models, every new context or clue that might hint at a future action necessitates a proper modification of the existing model to produce plausible and substantial results.
During the last decade, attempts have been made to advance machine learning capabilities to predict the future. Early researchers attempted to use a recurrent neural network (RNN) to extrapolate inputs to the future [1]. It was later proven that a simple RNN is inadequate, necessitating additional modules such as long short-term memory (LSTM) [2] and gated recurrent units (GRU) [3] to improve the results. Although many diverse studies exist in the human motion prediction domain, we limit the scope to motion prediction methods from a sequence of data using deep learning. Moreover, we further limit the scope of the survey to methods that generate human motion prediction on top of a human body representation model, such as a 3D human skeleton, parametric models, and 3D meshes.
Exhaustive surveys on a few related topics of computer vision exist, but none of them specifically discuss human motion prediction. Oprea et al. [4] provided a thorough review of deep-learning methods for video prediction. Video prediction and human motion prediction have a lot in common, but video prediction deals with images instead of human poses in each time step. Rudenko et al. [5] conducted a survey on the 2D human motion trajectory prediction problem.
This study provides an exhaustive survey of recent studies on deep learning-based human motion prediction. Based on the survey, researchers engaged in this area can easily and properly design their own advanced algorithms.
The remainder of this study is organized as follows. Section II defines the problem of human motion prediction and addresses the challenges in this field. Section III introduces the common types of deep neural network backbones used in human motion prediction. In Section IV, we explain different representations of the human body and provide a list of human motion datasets that have been widely used in previous studies. Section V presents a summary of human motion prediction methods and explains their evolution over the years. In Section VI, we describe the evaluation results of the human motion prediction methods with an intensive discussion and present the limitations of the current evaluation. Finally, Section VII and Section VIII provide future research trends and concluding remarks, respectively.

II. HUMAN MOTION PREDICTION

A. PROBLEM DEFINITION
We define the problem of predicting human motion, given an input motion sequence, as follows. Let X_{0:m} ∈ R^{j×k} be the input/conditioning motion sequence X_{0:m} = [x_0, ..., x_{m-1}, x_m] over m time steps, with j joints, each containing k dimensions. The input can optionally be modified by concatenating a one-hot vector, thereby explicitly informing the model of the action class of the input motion sequence. A prediction task involves a network G_θ : X_{0:m} → Ŷ_{m+1:n}, parameterized by θ, that receives X_{0:m} and predicts Ŷ_{m+1:n}, where Ŷ_{m+1:n} ∈ R^{j×k}, Ŷ_{m+1:n} = [ŷ_{m+1}, ..., ŷ_{n-1}, ŷ_n], and ŷ_{m+1} is the next pose after x_m, unless mentioned otherwise. G_θ comprises an encoder network E_θ : X_{0:m} → Z_{0:m} and a decoder network D_θ : Z_{0:m} → Ŷ_{m+1:n}, which transform the 3D body joints x_m into latent features z_m, and vice versa. Furthermore, G_θ often employs a recurrent network R_θ, such as a GRU [3] or an LSTM [2], to deal with the temporal aspect of human motion prediction [6]-[13]. During training, a loss function L(Y_{m+1:n}, Ŷ_{m+1:n}) evaluates the distance between Ŷ_{m+1:n} and Y_{m+1:n}, where Y_{m+1:n} is the ground-truth motion sequence.
In other words, the problem of human motion prediction is to predict a plausible continuation of the input motion sequence. The training process involves feeding the network with numerous motion sequences of a consistent length (e.g., four-second motion sequences) and optional action class information (e.g., a one-hot vector). A typical preprocessing step is to divide a motion sequence S containing multiple poses s_t, where the first part acts as the input X_{0:m} := {s_t}_{t=0s}^{1s} and the remainder becomes the target Y_{m+1:n} := {s_t}_{t=1s}^{4s}. The training process for a human motion prediction method follows either a supervised setting (i.e., using an action label as the input) or a self-supervised setting (i.e., without an action label as the input).
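As a concrete illustration, this split can be sketched in a few lines of Python. The 25-frame conditioning window and 100-frame clip length below are hypothetical values (a 4-second clip at 25 Hz), not settings prescribed by any particular method:

```python
# Sketch of the standard preprocessing split: the first part of a clip
# conditions the network, the remainder is the prediction target.
def split_sequence(sequence, input_len=25):
    """Split a pose sequence S into input X_{0:m} and target Y_{m+1:n}."""
    if len(sequence) <= input_len:
        raise ValueError("sequence is shorter than the conditioning window")
    x = sequence[:input_len]   # conditioning input X_{0:m}
    y = sequence[input_len:]   # prediction target Y_{m+1:n}
    return x, y

# A 4-second clip at 25 Hz has 100 poses; each pose is a j x k structure
# (here a dummy 1-joint, 3-dimensional pose stands in).
clip = [[float(t)] * 3 for t in range(100)]
x, y = split_sequence(clip)
```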

B. CHALLENGES
Recent studies have successfully leveraged deep neural networks for human motion prediction; however, they are still far from being put to real-world use. Furthermore, accurately modeling human body dynamics is challenging because there are many intricacies within each human motion. Actual human motion has many variables that determine the incoming action, which come from both inside (e.g., intention and tendencies) and outside (e.g., navigating obstacles and evading dangers). Humans face no challenge in this prediction task and perform it effortlessly most of the time. However, unlike humans, machines are not innately aware of the physical constraints of humans and have difficulties understanding the context of human motion. We discuss some challenges in designing a deep neural network for human motion prediction.

1) Multimodal Nature of Human Motion
To predict future motion, a conditioning frame or sequence is necessary. Theoretically, it is possible to perform predictions with a few frames or even a single frame as a condition. However, the neural network model would have difficulty understanding the context under this setting, causing contextually unrelated human motion predictions. Hence, most methods use a conditioning sequence (its length varies between methods, but is at least ten frames). The multimodal nature of human motion becomes an even bigger problem in the evaluation process because it is impossible to precisely determine the accuracy of a prediction: for a given human pose, there is a plethora of possible subsequent poses to which a human can conform. Nevertheless, the most common method for evaluating predicted human motion is simply to calculate the distance between the predicted motion and the ground-truth data, which is the continuation of the conditioning sequence.

2) Unnatural Artifacts
Within a neural network, the human body is regarded as a group of joints, in which each joint has a set of values that determine its location. The neural network predicts the future value of each joint at each time step, belonging to a human pose that continues the last input pose. However, realistic human motion prediction is more than just predicting a contextually valid sequence of poses. Some defects exist in computer-generated human motion, such as foot skating and invalid body poses, resulting in artifacts that degrade the prediction quality. These artifacts are particularly challenging for neural networks because they are not aware of the physical constraints of humans and regard the human body as a set of values.

TABLE 1. A list of human motion prediction methods. Abbreviations: negative log likelihood (NLL), mean angle error (MAnE), mean squared error (MSE), mean per joint position error (MPJPE), mean average error (MAE), KL divergence (KL), exponential map (EM), joint location (3D-L), joint displacement (JD), Lie algebra (LA), 4D quaternion (4D-Q), marker location (ML). Two check-mark styles distinguish networks consisting of more than one component from networks that use one component only.

III. NETWORK BACKBONES
Understanding details from a spatiotemporal representation of data requires specialized networks such as recurrent and generative models. Over the years, these methods have become increasingly complex. Here, we provide a brief explanation of each network type that has been utilized for human motion prediction.

A. RECURRENT NETWORK
A recurrent network is a class of networks with connections between nodes, representing temporal relations within the data. Most recurrent network designs apply this principle by reusing a specific value from the previous time-step calculation (e.g., Figure 1(a) and (b)). An example is the vanilla RNN hidden state [42]. LSTM [2] was then proposed to alleviate the inability of the RNN to accurately remember the past and the vanishing/exploding gradient issue with RNNs. LSTM introduces three distinct gate types that control the hidden-state value and computation flow. A similar method, called GRU [3], was proposed, which uses two gate types.
Despite having one fewer gate type, a further experiment showed that the performance of the GRU is comparable to that of the LSTM [43]. Machine translation [44], speech recognition [45], and video captioning [46] have been accomplished using recurrent-based networks. The ability of a recurrent network to exploit past information is beneficial for these tasks because they deal with sequential data. Hence, it is natural that human motion prediction networks often use recurrent-based networks. Table 1 shows that most motion prediction methods use some variant of the recurrent network. For human motion prediction, recurrent networks are often used to process a feature in the current time step for the future or the next time step [1], [6]-[13], [15], [18], [20], [23], [37], [39], [41].
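To make the gate mechanics concrete, the following toy sketch implements a scalar GRU step with a single input and hidden dimension. The weights are arbitrary illustrative constants, not trained values, and a real implementation would use matrix-valued gates (e.g., a deep learning framework's GRU layer):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h, w=0.5, u=0.5, b=0.0):
    """One scalar GRU step: the update gate z blends the old hidden state
    with a candidate state computed from the reset-gated history."""
    z = sigmoid(w * x + u * h + b)                # update gate
    r = sigmoid(w * x + u * h + b)                # reset gate (toy: same weights)
    h_cand = math.tanh(w * x + u * (r * h) + b)   # candidate hidden state
    return z * h + (1.0 - z) * h_cand             # gated blend of old and new

# Run the cell over a short "motion" signal, carrying the hidden state forward.
h = 0.0
for x in [0.1, 0.2, 0.3]:
    h = gru_step(x, h)
```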

B. CONVOLUTION NETWORK
Convolution-based networks were originally designed to handle image-like data with large spatial dimensions [47]. A convolution-based network consists of a convolutional layer that works by thoroughly applying a filter throughout the spatial dimension in a feed-forward manner. This characteristic makes the convolutional layer and its byproduct, the convolution-based network, a suitable building block of most neural networks designed for image-related tasks. However, it has limitations when dealing with temporal data. An extension of the convolution network is available for dealing with video-based data [48]. However, a different approach is required to model the motion sequence, which has a higher temporal dependency and less spatial information than the videos. For example, Bütepage et al. [16] experimented with different encoder types to encode a human pose and found that the best encoder considers different time scales. Furthermore, Li et al. [21] replicated the ability of the recurrent network to model the temporal context by using a convolution network with different input scales. Mao et al. [34] employed an attention mechanism to aggregate motion history into future motion prediction (i.e., Figure 1(c)).

C. GENERATIVE NETWORK
A generative network generates new plausible data from the distribution of training data. Examples of the networks built for this purpose include variational autoencoders (VAE) [49] and generative adversarial networks (GAN) [50]. Proposed as an extension of autoencoders, the VAE is capable of both reconstruction and generation. The training process forces the VAE to have a latent space in the form of a probabilistic distribution that encapsulates the training-data distribution. The GAN design consists of a generator and discriminator networks. The generator generates accurate yet fake data, whereas the discriminator spots the fake from actual data. Consequently, GAN networks can generate impressive results in a broad range of tasks, while also being difficult to train.
Within the field of human motion prediction, GANs and VAEs are used differently (see the subsection on stochastic prediction). VAEs are often used in conjunction with a recurrent network for diverse predictions [11], [12], [41]. This network combination excels by keeping track of temporal information while being able to generate diverse motions by leveraging the VAE latent space. Conversely, GANs are more versatile because the discriminator can be used to measure the prediction accuracy [7], [8], [28], [33] for both deterministic and stochastic networks, or to measure the rate of convergence [8] (Figure 1(d)).

IV. HUMAN BODY REPRESENTATIONS AND DATASETS

A. HUMAN BODY REPRESENTATIONS
The human motion prediction literature includes some terminologies regarding the 3D body, which are shared between related topics. For example, a human body/pose describes the entirety of a joint conforming to a single human body or pose in each time step. Each pose consists of several connected joints, forming a model that resembles a human skeleton. The word joint describes the location or angle of a specific joint within the human skeleton. There is no definite number of joints in a pose because it varies depending on the motion sequence dataset used.
There are a few variations in the joint representation used in human motion prediction networks. Typically, a joint representation within a pose is either angle- or location-based. The angle-based representation/joint angle refers to an axis-angle representation of the joints, or more specifically, the exponential map (Figure 2(a)). Meanwhile, the location-based joint representation defines the joint location in 3D Cartesian coordinates (Figure 2(d)). Information regarding the joint representations used in the literature is presented in Table 1, from which we can infer that the angle-based representation is the most used, followed by the location-based representation.
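The exponential map mentioned above can be converted into a rotation matrix with Rodrigues' formula. The following pure-Python sketch shows this standard conversion; production code would typically rely on a library routine (e.g., SciPy's Rotation utilities):

```python
import math

def exp_map_to_matrix(v):
    """Rodrigues' formula: convert an exponential-map joint angle
    (axis-angle vector v) into a 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in v))      # rotation angle = |v|
    if theta < 1e-8:                              # near-zero rotation -> identity
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in v)           # unit rotation axis
    K = [[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]]  # cross-product matrix
    s, c = math.sin(theta), math.cos(theta)
    R = [[0.0] * 3 for _ in range(3)]
    for i in range(3):                            # R = I + sin(t) K + (1-cos(t)) K^2
        for j in range(3):
            KK = sum(K[i][k] * K[k][j] for k in range(3))
            R[i][j] = (1.0 if i == j else 0.0) + s * K[i][j] + (1.0 - c) * KK
    return R

# A rotation of pi/2 about the z-axis maps the x-axis onto the y-axis.
R = exp_map_to_matrix([0.0, 0.0, math.pi / 2])
```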

B. DATASET
The self-supervised nature of deep human motion prediction methods enables them to use any action sequence dataset that contains some form of 3D body annotation. Most mo-cap datasets satisfy the requirements necessary for training a deep human motion prediction network; therefore, they are widely used. This section describes widely used datasets in the literature, along with their details and the standard preprocessing configurations for each dataset. Human3.6M [14]: is an indoor dataset comprising RGB videos and the 2D and 3D body annotations of seven actors performing a range of actions. It was captured using 15 sensors (four digital video cameras, one time-of-flight sensor, and ten motion cameras), which were combined using hardware and software synchronization. This setup allows the dataset to provide multiple data types, such as pose data, time-of-flight data, scanner data, and RGB video (sampled at 50 Hz).
At the time of writing this paper, the Human3.6M dataset was the most widely used benchmark dataset for motion prediction. Most researchers divided the actions within the dataset into eight easy and seven difficult actions. The standard experimental setup for this dataset is to downsample the motion sequence two times (to 25 Hz), train on six subjects, and test on five subjects.
CMU Mocap [17]: is a motion capture database that is captured using a system consisting of 12 infrared MX-40 cameras. It contains 2235 recordings from 144 different subjects performing various complex movements. With their recording setup, they could capture 120 Hz data on some actions. Each mo-cap actor wears a black suit with markers on the body (with 41 mo-cap markers). The CMU dataset can be used as an action classification dataset for multiple tasks [16]. For human-motion prediction tasks, some researchers included tasks performed by only one actor [11], [13], [21], [25], [33], [35]- [37].
NTU RGB+D [22]: is an action-recognition dataset captured using three Microsoft Kinect V2 cameras. The dataset contains multiple annotations such as RGB videos, depth map sequences, 3D skeletal data, and infrared videos. It contains 60 action classes (40 daily actions, nine health-related actions, and 11 mutual actions) performed by 40 distinct subjects.
Holden et al. [19]: combined multiple datasets [17], [51], [52] for their training process and added data from their internal captures. The combined dataset underwent multiple postprocessing steps [19]. The resulting dataset contains approximately twice as much data as the CMU dataset [17], sampled at 120 Hz. Methods using this dataset mostly downsample the sequences four times (to 30 Hz) to unify the sampling rate across all sequences.
Penn Action [24]: is an action recognition dataset comprising 2326 video sequences of people performing 15 different actions. It also provides annotations in the form of a coarse viewpoint, human body joints, and 2D bounding boxes. The study annotated 13 joints for each human subject in the form of 2D locations and visibility.
HumanEva-I [38]: is a dataset built for human motion and pose estimation in videos. The 3D body pose annotations in this dataset are acquired using a mo-cap system. The dataset also contains videos of four subjects performing six action types, captured from multiple angles (three colored videos and four grayscale videos) and sampled at 60 Hz.
Action3D [40]: is another dataset that was captured using a single Kinect camera. This dataset provides annotations attainable with a Kinect camera, such as RGB videos (sampled at 30 Hz), depth maps, and 3D skeleton joint locations. The recording consisted of ten subjects who performed actions in ten action categories.
AMASS [30]: is a collection of standardized mo-cap databases. The study proposed a new method, MoSh++ [53], for converting different mo-cap data into realistic 3D human meshes. These realistic 3D human meshes are SMPL+H [54] and SMPL-X [55] parameters. SMPL+H is a parametric 3D body model that consists of a hand pose and an articulated body pose. SMPL-X is an extension of this work, which enables not only the hand and body but also the facial expression to be modified. The current version of this dataset provides both SMPL+H and SMPL-X annotations for all of the different mo-cap data. As a dataset that unifies other mo-cap databases, the combined dataset contains 2992.34 minutes (a little less than 50 h) of motion sequences associated with 484 subjects, with more than 11,000 motions (not unique) from all databases.
3DPW [26]: is an in-the-wild dataset with accurate 3D poses for evaluation. Additionally, this is the first dataset containing video footage captured from a mobile phone camera. Furthermore, this dataset addresses the lack of a large-recording-volume outdoor dataset. Finally, this dataset provides multiple annotations, such as 2D poses, 3D poses, camera poses (for every frame), 3D body scans, and 3D people models (18 variations with different clothing).

V. PREDICTION METHODS
Researchers have tackled motion prediction problems for many years. Early motion prediction methods were straightforward. Most of them proposed a simple architecture with common distance-based loss. However, further experiments revealed that a simple recurrent or convolutional network is insufficient for motion prediction. Furthermore, existing issues such as the multimodal nature of the future, error accumulation, and evaluation difficulty raise the need to address the problem in multiple approaches. Therefore, many motion prediction methods have been proposed.
We propose a taxonomy for human motion prediction methods. First, we observed that researchers often share terminologies to describe certain types of work, although no concrete terminology has been suggested beforehand; therefore, we use similar, if not identical, words to describe motion prediction methods. Second, we split motion prediction methods into three main categories: deterministic, structure-aware, and stochastic prediction. Deterministic prediction includes studies that treat human joints impartially. Structure-aware prediction includes methods that use structural information of the human joints. Methods belonging to this group are designed with each joint and its neighbors in mind, usually represented in the form of an adjacency matrix. This design renders structure-aware networks more robust, enabling them to perform without a recurrent-based network module. Hence, we further split the structure-aware section into two parts, covering networks that do not use a recurrent-based network and networks that do. The stochastic prediction category includes methods that generate multiple samples from a single motion sequence input. A final group contains a few methods that have unique characteristics, such as using 2D images as input [31] or dealing with incomplete input sequences [13], [28]. The following passages summarize human motion prediction methods based on this classification.

A. DETERMINISTIC PREDICTION
Fragkiadaki et al. [1] proposed the encoder-recurrent-decoder (ERD) network as the first deep learning-based human motion prediction method. ERD can be defined as G_θ = D_θ(R_θ(E_θ(·))), where E_θ is a multi-convolutional-layer encoder inspired by [56], R_θ is a stack of LSTM layers, and D_θ consists of multiple dense layers. The design paradigm of ERD involves evaluating Ŷ_t = G_θ(X_{t-1}) for every frame t and feeding the ground-truth pose X_t, instead of the prediction Ŷ_t, as input at the next time step t+1 during training. In addition to ERD, they proposed a three-layered LSTM (LSTM-3LR) with dense layers for E_θ and D_θ.
The quantitative results indicate that LSTM-3LR is better than ERD for human motion prediction. However, when predicting times longer than 560 ms, LSTM-3LR suffers from a mean pose convergence problem (ERD does not have this issue) [1]. Methods plagued with this issue will predict a simple pose where the spatial variations of human motion disappear. This contradicts how humans behave because normal human motion contains unique perturbations, which we consider natural motion. Additionally, early human motion prediction works [1], [15] had motion discontinuity issues between the last conditioning pose and the first predicted pose.
However, Ghosh et al. [18] argued that modeling spatiotemporal information is possible without manually designed and action-specific graphs [15]. They proposed a DAE-LSTM method to implicitly model the structural and temporal information of human motion. They employed a dropout auto-encoder (DAE), which was inspired by denoising auto-encoders, to remove the error accumulated from the recurrent network. Regarding motion prediction, they used the LSTM-3LR, as in previous studies [1], [15]. The quantitative results showed that their dropout autoencoder module successfully reduced errors for long-term prediction.
Martinez et al. [6] proposed architecture and design changes to address the issues raised in previous studies [1], [15]. The first design change addresses the frame discontinuity issue using a residual architecture that models the displacement of the joints instead of their actual values. Second, in previous studies [1], [15], the ground-truth data Y_t was used as the input for the network during the training process. This design leads to deteriorating prediction quality after a few seconds of prediction [20]. It is also known that feeding ground-truth data as input during training causes issues in reinforcement learning [57] and recurrent networks [58], as the network is denied the opportunity to learn from its mistakes. Martinez et al. [6] proposed a paradigm change by using Ŷ_t during training, mimicking test-time behavior, as this prevents the deteriorating-quality issue. Their architecture is defined as G_θ = D_θ(R_θ(·)), where R_θ is a GRU network with 1024 units [3] and D_θ is a dense layer. Departing from previous network designs, their design excludes E_θ, as they discovered that the network produced better results without it.
Martinez et al. [6] also introduced a simple zero-velocity baseline, which works by constantly replicating the last input frame X_m; it outperformed prior works by a significant margin [1], [15], establishing a strong baseline for future methods. Gui et al. [7] proposed the adversarial geometry-aware encoder-decoder (AGED) network, which outperformed the zero-velocity baseline. AGED uses two distinct discriminators comprising GRU cells [59], based on the adversarial training principle [50]: a continuity discriminator and a fidelity discriminator, which evaluate long and short sequences, respectively. Furthermore, a novel geodesic loss provides the network with a more meaningful training signal by evaluating the geodesic distance between two rotations in the Riemannian manifold structure [60], in contrast to the Euclidean distance.
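The zero-velocity baseline is simple enough to state in a few lines. This sketch (with hypothetical toy poses) just repeats the last observed pose for the whole prediction horizon:

```python
# The zero-velocity baseline of Martinez et al. [6]: predict that the body
# simply stays in the last observed pose. Despite its simplicity, it is a
# surprisingly strong baseline for short-term prediction.
def zero_velocity_predict(conditioning, horizon):
    last_pose = conditioning[-1]                 # x_m, the final input frame
    return [list(last_pose) for _ in range(horizon)]

history = [[0.0, 0.1], [0.0, 0.2], [0.0, 0.3]]   # toy 1-joint, 2-dim poses
pred = zero_velocity_predict(history, horizon=4)
```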
However, previous works faced difficulties in long-term prediction. This issue is exacerbated by the lack of a useful evaluation function, as the existing MSE-based evaluation function is only useful for evaluating short-term predictions, leaving long-term prediction results to be evaluated qualitatively. Motivated by this issue, Gopalakrishnan et al. [9] proposed a new architecture, the verso-time label noise RNN (VTLN-RNN), to tackle both short- and long-term prediction problems simultaneously, alongside a more competent long-term evaluation metric, the normalized power spectrum similarity (NPSS). The VTLN-RNN performs a two-level computation. The top level receives the conditioning sequence and action label and runs backward in time to generate "guide vectors." The lower level uses the "guide vectors" for motion prediction at each time step. Their novel metric, the NPSS, compares motion sequences based on their power spectra obtained using a discrete Fourier transform. This leads to a more meaningful error value, as tolerable deviations (e.g., misaligned joint angles, or adding or removing a frame during a motion compared to the ground truth) are not punished as severely as with MSE.
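To illustrate why a spectrum-based metric tolerates small temporal deviations better than MSE, the following simplified sketch compares two sequences by their normalized DFT power spectra. It conveys the principle behind NPSS but is not the exact NPSS definition:

```python
import math

def power_spectrum(seq):
    """Normalized DFT power spectrum of a 1D sequence (naive O(n^2) DFT)."""
    n = len(seq)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(-2 * math.pi * k * t / n) for t, x in enumerate(seq))
        im = sum(x * math.sin(-2 * math.pi * k * t / n) for t, x in enumerate(seq))
        spec.append(re * re + im * im)
    total = sum(spec) or 1.0
    return [p / total for p in spec]             # normalize to a distribution

def spectral_distance(a, b):
    """1D earth mover's distance between normalized spectra (cumulative sums)."""
    ca = cb = d = 0.0
    for pa, pb in zip(power_spectrum(a), power_spectrum(b)):
        ca += pa
        cb += pb
        d += abs(ca - cb)
    return d

# A phase-shifted sine has the same power spectrum, so its spectral distance
# to the original is essentially zero, even though its MSE would be large.
sine = [math.sin(2 * math.pi * t / 16) for t in range(32)]
shifted = [math.sin(2 * math.pi * (t + 2) / 16) for t in range(32)]
```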
Previous studies were limited by their reliance on action class information during the prediction process [1], [7], [15], [16], [20]. Methods with this characteristic are unrealistic because action labels are not available in real-world scenarios. Inspired by the ability of the hierarchical multiscale recurrent neural network (HM-RNN) [61] to model high- and low-level abstractions, Chiu et al. [23] proposed a similar architecture, called triangular-prism RNN (TP-RNN), which does not require an action label. It contains a multilevel LSTM, consisting of an increasing number of LSTM sequences as the level grows, where each sequence takes turns to process the information at lower levels. This design allows the network to encode different hierarchies of human dynamics at different time scales, thereby learning a robust motion context that spans different time scales and allowing the network to predict without action-class information.

B. STRUCTURE-AWARE PREDICTION
It is possible to leverage spatial information of the human body for human motion prediction. Methods in this category are encouraged to learn the implicit relationship between each joint in the human body during motion (Figure 3), usually in the form of a parameterized graph.

1) Feed-Forward Based
Despite their success when dealing with periodic data, early deterministic prediction methods using LSTM networks [2] did not perform as well on aperiodic data. These characteristics, along with the need to train a separate model for each human motion, limit their generalization to novel data. Therefore, Bütepage et al. [16] proposed an autoencoder consisting of three different encoder types. One of the encoder types, the hierarchy encoder (connecting each human joint in a tree-like setup), outperformed the other encoder types in most aspects, hinting that a structural prior is beneficial for motion prediction. Quantitative results show that their network performed much better on complex actions compared to previous recurrent-based networks, without incorporating a recurrent unit into the architecture.
Supporting similar arguments, Li et al. [21] proposed a convolution seq2seq network to model both temporal and spatial information from past motions. The network is divided into two parts to obtain long-and short-term hidden variables, containing long-and short-term encoders respectively. Following that, the long-and short-term hidden variables are concatenated to produce future motion. In contrast to LSTM [2] networks, their concept can preserve long-term information indefinitely because it is unaffected throughout the prediction process. Their method also allows for indefinite future predictions with the sliding window attribute of the short-term encoder.
Both Bütepage et al.'s [16] and Li et al.'s [21] proposed networks apply convolutions across time on input poses. This design has a drawback in that the temporal dependencies of the network strongly depend on its convolutional filters. Therefore, Mao et al. [25] proposed a feed-forward deep network architecture using a GCN [62], [63] with automatically learned graph connectivity. Their proposed method adopts a trajectory representation based on the discrete cosine transform (DCT), inspired by [64], instead of the pose representation used in previous studies [16], [21]. Using the DCT, they obtained a compact representation of human motion. Their experimental results also supported another of their claims, that predicting in 3D space yields better results than predicting in angle space, because the angle representation is ambiguous: two completely different sets of angles can share the same 3D pose.
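The DCT trajectory encoding can be illustrated as follows: transform one joint's trajectory, zero out the high-frequency coefficients, and invert to obtain a compact, smoothed approximation. This is a generic orthonormal DCT-II/DCT-III sketch, not Mao et al.'s actual implementation:

```python
import math

def dct(x):
    """Orthonormal DCT-II of a trajectory x (naive O(n^2) implementation)."""
    n = len(x)
    out = []
    for k in range(n):
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(c * sum(v * math.cos(math.pi * (t + 0.5) * k / n)
                           for t, v in enumerate(x)))
    return out

def idct(coeffs):
    """Inverse (orthonormal DCT-III): reconstruct the trajectory."""
    n = len(coeffs)
    x = []
    for t in range(n):
        v = 0.0
        for k, ck in enumerate(coeffs):
            c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            v += c * ck * math.cos(math.pi * (t + 0.5) * k / n)
        x.append(v)
    return x

traj = [math.sin(0.3 * t) for t in range(20)]   # one joint's toy trajectory
coeffs = dct(traj)
compact = coeffs[:10] + [0.0] * 10              # keep low frequencies only
approx = idct(compact)                          # smooth, compact approximation
```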
Similar to Mao et al., Lebailly et al. [36] proposed a novel human motion prediction network using GCN. The temporal inception module (TIM), which was inspired by the inception module [65], can encode the input pose sequence at different temporal scales simultaneously. The TIM module contains multiple kernels of different sizes to obtain information from different scales. Based on their experiment, a trade-off was observed because a TIM with a longer subsequence will perform better at long-term prediction and vice versa.
Cui et al. [33] suggested that a single learnable unrestricted graph is equivalent to implicitly modeling pose information. Therefore, they proposed a dynamic relationship GCN (drGCN) that uses two distinct learnable adjacency graphs. One is a connective graph that learns the weights between neighboring joints. The other is a global graph that learns the implicit connections between all the joints in the human body.
Aksan et al. [29] proposed the structured prediction layer (SPL) to decompose the human joint prediction layer into a hierarchy following the kinematic chain of the human skeleton. SPL creates many small networks that enable the model to learn dedicated representations per joint. As a result, the gradient on the hidden layers is affected only by the gradient from each joint hierarchy connected to them instead of the entire dense layer. Furthermore, they analyzed state-of-the-art methods and found that the benchmark using the H3.6M dataset has become saturated: methods that performed well on H3.6M did not scale well to a larger and more diverse dataset. Accordingly, they evaluated their method on AMASS in addition to H3.6M and found that their method was more impactful on the large dataset.
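The hierarchical per-joint decomposition can be sketched as follows; the toy kinematic chain, layer sizes, and helper names are illustrative assumptions, not Aksan et al.'s actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy kinematic chain: joint -> parent (None = root)
PARENTS = {"hip": None, "spine": "hip", "head": "spine",
           "l_knee": "hip", "r_knee": "hip"}
HIDDEN, PER_JOINT, OUT = 16, 8, 3  # shared context, per-joint feature, 3D output

# One small dense layer per joint; input = shared context + parent's feature
weights = {}
for joint, parent in PARENTS.items():
    in_dim = HIDDEN + (PER_JOINT if parent is not None else 0)
    weights[joint] = (rng.standard_normal((in_dim, PER_JOINT)) * 0.1,
                      rng.standard_normal((PER_JOINT, OUT)) * 0.1)

def spl_forward(context):
    """Predict per-joint 3D outputs hierarchically along the kinematic chain."""
    feats, preds = {}, {}
    # Process joints parent-first (dict order already respects the hierarchy here)
    for joint, parent in PARENTS.items():
        inp = context if parent is None else np.concatenate([context, feats[parent]])
        w1, w2 = weights[joint]
        feats[joint] = np.tanh(inp @ w1)   # dedicated per-joint representation
        preds[joint] = feats[joint] @ w2   # per-joint 3D prediction
    return preds

preds = spl_forward(rng.standard_normal(HIDDEN))
print({j: p.shape for j, p in preds.items()})
```

Each joint's small network sees only the shared context and its parent's feature, so gradients flow along the kinematic chain rather than through one dense output layer.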

2) Recurrent Based
Jain et al. [15] proposed the structural-RNN (S-RNN), which combines an LSTM network [2] with spatiotemporal graphs. Spatiotemporal graphs have been used to model high-level spatiotemporal structures [66]-[72]. The combined network (S-RNN) is capable of both modeling long sequences and capturing the spatiotemporal structure. Jain et al. also provide a version of the H3.6M dataset converted into the exponential map representation, which they used in their prediction.
Liu et al. [10] argued that LSTM and GRU face difficulties in long-term motion prediction because they often fail to capture the motion context, and the long-term predicted pose degrades into a motionless state. They added that prior algorithms do not respect the physical constraints that govern human motion. Therefore, they proposed the hierarchical motion recurrent (HMR) network, a hierarchical autoencoder structure with recurrent networks. The encoder is designed to encode the input sequence hierarchically using an update process similar to that of a recurrent network cell state, which updates its value with the neighboring cell states. This encoding process runs n times, enriching the information within a cell state because it repeatedly shares information with other cell states. The decoder uses the final encoded motion context to recursively output the future motion context. During training, the network is guided by a loss function that penalizes it according to the bone length (a longer bone means a larger loss). HMR outperforms previous works in both long- and short-term predictions, demonstrating its capability in modeling motion context. Inspired by [19], Wang et al. [39] developed STRNN using the same principle. The network encodes the entire input one frame at a time before splitting into two branches (input decoding and motion prediction). To better capture spatial features, they employed a spatial encoder that encodes the DoFs of each body part.
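A bone-length penalty in the spirit of the loss described above can be sketched as follows; the toy skeleton and the exact formulation are assumptions for illustration, not HMR's precise loss:

```python
import numpy as np

# Hypothetical bones as (parent, child) joint-index pairs of a toy 5-joint skeleton
BONES = [(0, 1), (1, 2), (0, 3), (0, 4)]

def bone_length_loss(pred, ref):
    """Mean absolute deviation of predicted bone lengths from reference lengths.

    pred, ref: (J, 3) arrays of 3D joint positions. Stretched or shrunken
    predicted bones incur a larger penalty.
    """
    def lengths(pose):
        return np.array([np.linalg.norm(pose[c] - pose[p]) for p, c in BONES])
    return np.mean(np.abs(lengths(pred) - lengths(ref)))

ref = np.array([[0, 0, 0], [0, 1, 0], [0, 2, 0], [1, 0, 0], [-1, 0, 0]], float)
good = ref.copy()                  # identical skeleton: zero loss
bad = ref.copy(); bad[2, 1] = 3.0  # one stretched bone
print(bone_length_loss(good, ref), bone_length_loss(bad, ref))
```

Added to the data term during training, such a penalty discourages predictions whose skeleton violates the subject's fixed bone lengths.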

C. STOCHASTIC PREDICTION
One solution for handling the multimodal nature of human motion is to predict multiple plausible motions from a single input. Barsoum et al. proposed a WGAN-GP [73]-based network called HP-GAN [8] as one of the first probabilistic prediction methods. They claimed that using the MSE in a nondeterministic setup causes the model to average between two possible futures. Aware of the difficulty in training GANs, they used a loss based on skeleton physics alongside the GAN loss to stabilize and improve the training process. A critic network was used to quantify the prediction quality, although it was not sufficient to determine whether the training had converged.
Maintaining a balance between accuracy and diversity is important for stochastic human motion prediction. Aliakbarian et al. [11] addressed this problem by proposing mix-and-match perturbations for imposing diversity on a stochastic network. They claimed that rather than combining a noise vector with the conditioning variables in a deterministic manner, a better result can be achieved by randomly selecting and perturbing a subset of these variables.
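The core idea of perturbing a randomly chosen subset of variables can be sketched as follows (a simplified illustration under assumed names and shapes, not the authors' exact scheme):

```python
import numpy as np

rng = np.random.default_rng(42)

def mix_and_match(condition, noise, alpha=0.5):
    """Perturb a random subset of the conditioning vector with noise.

    Instead of deterministically concatenating noise with the condition,
    a fraction alpha of the condition's dimensions is randomly chosen and
    overwritten by the corresponding entries of the noise vector.
    """
    d = condition.shape[0]
    idx = rng.choice(d, size=int(alpha * d), replace=False)
    perturbed = condition.copy()
    perturbed[idx] = noise[idx]
    return perturbed, idx

cond = np.zeros(10)
z = np.ones(10)
out, idx = mix_and_match(cond, z, alpha=0.5)
print(out.sum(), sorted(idx))  # 5 dimensions replaced by noise
```

Because the replaced indices differ per sample, the decoder cannot ignore the noise, which pushes the network toward genuinely diverse outputs.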
Although much care is taken when designing a network to model multimodal data, the problem of producing diverse samples from such networks is underexplored. To address this issue, Yuan et al. proposed DLow [12], a novel sampling method that produces diverse samples from a pretrained network. They proposed a mapping function that subdivides the latent space into chunks using a set of variables. These variables were obtained from a network trained using likelihood and diversity losses. Their method produced better quantitative results than previous methods.
Zhang et al. [41] used DLow [12] as part of their network to explore the informative low-frequency bands created using the DCT [74], [75]. A key novelty of their method is predicting mo-cap markers instead of human joints, based on the argument that human motion prediction is inherently similar to point-cloud prediction. The use of mo-cap markers allows them to use fitting algorithms (MoSh [76] and MoSh++ [30]) to acquire regularized 3D keypoints that can be used to guide the network or eliminate errors.

1) Using RGB Image Sequence as Input
Building upon their previous study, Zhang et al. proposed the predicting human dynamics (PHD) architecture to predict future 3D motion from past 2D inputs. The PHD architecture learns a latent representation of 3D human dynamics, named movie strips, using the temporal context of image frames. They then trained an autoregressive network that takes past movie strips and predicts future movie strips. To produce a 3D output, they used a 3D regressor to read the 3D mesh from the movie strips. Compared to previous networks, the PHD architecture is more robust, as it can be trained on multiple datasets in an action- and dataset-agnostic manner.

2) Dealing with Incomplete Input Sequence
Ruiz et al. [28] attempted a paradigm change by considering human motion prediction as an inpainting problem. They proposed the spatio-temporal motion inpainting GAN (STMI-GAN), which consists of three separate discriminator modules that evaluate three different aspects (bone length, feature space, and the human motion itself). To simulate an inpainting problem, they applied a mask to the input motion to occlude human joints. To guide the network, they used limb distance, bone length, reconstruction, and regularized adversarial losses to enforce an accurate prediction. The L2 metric was only used to evaluate the reconstruction error. Furthermore, they proposed two other metrics to evaluate the prediction results: PSEnt and PSKL. PSEnt compares the entropy of the generated and input sequences, whereas PSKL measures the distribution distance between the generated data and ground truths.
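The occlusion masking used to pose prediction as inpainting can be sketched as follows (the joint count, drop probability, and function name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def occlude(motion, drop_prob=0.3):
    """Mask random joints across frames to simulate occlusion.

    motion: (T, J, 3) sequence. Returns the masked sequence and the binary
    mask (1 = observed, 0 = occluded) used to hide joints; the network is
    then trained to fill in (and extend) the hidden entries.
    """
    T, J, _ = motion.shape
    mask = (rng.random((T, J, 1)) > drop_prob).astype(motion.dtype)
    return motion * mask, mask

motion = rng.standard_normal((25, 17, 3))  # 25 frames, 17 joints
masked, mask = occlude(motion)
print(masked.shape, float(mask.mean()))
```

Appending empty future frames to the mask turns prediction itself into one more region to inpaint, which is the essence of this paradigm shift.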
Gathered mo-cap data often contain incomplete observations owing to occlusion caused by other joints or objects. Such data do not work well with most methods because they are not built with these imperfections in mind. Cui et al. [13] proposed the multitask graph convolutional network (MT-GCN), a novel network capable of repairing incomplete observations while also predicting future human motion. Theirs is the first study that addresses both problems simultaneously, and it produces competitive quantitative results that outperform state-of-the-art motion prediction methods.

VI. HUMAN MOTION PREDICTION EVALUATION
Evaluating deep human motion prediction methods remains an ongoing challenge. In this section, we delve into the process of comparing a human motion prediction method with other methods and review the evaluation methods widely used across the literature. We also describe the current issues in evaluating a human motion prediction method.

A. PERFORMANCE METRICS
Evaluating human motion prediction methods is not a straightforward process because of the lack of a definite method for ascertaining the quality of the produced motion. The quality of the predicted motion itself comprises many aspects such as plausibility, accuracy, and diversity, for stochastic methods. Currently, no evaluation method can precisely measure the extent to which each method excels in certain respects. Hence, performance metrics remain a tool to assess the performance of deep human motion prediction with respect to other studies.
Most deep motion prediction methods are compared using a distance-based metric, such as the Euclidean distance (L2). The common setting for this evaluation is to calculate the distance between the predicted and ground-truth 3D joint locations using the mean per joint position error (MPJPE) [14], or between Euler angles using the mean angle error (MAnE). Because most methods use the exponential map representation within their network, conversion to Euler angles is necessary for evaluation.
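MPJPE itself is straightforward to compute; a minimal implementation might look like the following (array shapes are illustrative):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance over joints.

    pred, gt: (..., J, 3) arrays of 3D joint positions (e.g., in millimetres).
    """
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

gt = np.zeros((2, 17, 3))   # 2 frames, 17 joints at the origin
pred = gt.copy()
pred[..., 0] = 30.0         # every joint displaced 30 mm along x
print(mpjpe(pred, gt))      # → 30.0
```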
The methods report their quantitative evaluation results in the form of tables. A typical comparison procedure on the most frequently used dataset, Human3.6M [14], is to report the MAnE result for each or some motion categories over several timestamps. Tables 2 and 3 show the evaluation scores compiled from multiple methods. Each column shows the MAnE result for a specific timestamp (given in milliseconds in the top row), and adjacent columns belong to the same action category (denoted in the topmost row). Tables 2 and 3 both show MAnE results from many methods; however, they are organized according to the comparisons made by the respective authors of each method. The key difference between Tables 2 and 3 lies in the three baseline evaluation results, that is, ERD, LSTM-3LR [1], and S-RNN [15]. Table 2 reports the results from the original studies of these three methods, whereas Table 3 uses the scores obtained by Martinez et al. [6], who reevaluated these three methods and produced different results. Subsequent works compare their methods against one of these two sets of baseline scores, which creates this diverging comparison. However, some methods compare against both, which leads them to appear in both Tables 2 and 3. In the tables, ⋆ denotes results reported by Jain et al. [15], ⋄ results reported by Chiu et al. [23], and † results reported by Aliakbarian et al. [11]; the best results are highlighted in bold.

However, simply evaluating the distance between angles or joint locations is unreliable and often insufficient. For example, angle-distance evaluation is inconsistent because of the ambiguity of angle-based representations, where two quite different sets of angles can correspond to the same pose. This leads to a discrepancy in the error values between angle- and location-based methods [25]. Although it does not share this ambiguity, calculating the joint location distance faces another issue: the model is forced to predict the ground truth, ignoring the multimodal nature of human motion.
A method with diverse predictions faces a significant evaluation challenge. A diverse prediction method is likely to yield a worse L2 score than a deterministic method; however, this does not mean that the produced motion is inaccurate or implausible [28]. Another study used a classifier network, trained as a discriminator to differentiate between ground-truth and generated data, to ascertain the predicted motion quality [11]. Finally, several modified L2 metrics are used for evaluation [12], [41], such as the average pairwise distance (APD) to measure diversity [11], the average displacement error (ADE) and final displacement error (FDE) to measure accuracy [78]-[80], and the multimodal versions of ADE and FDE, i.e., MMADE and MMFDE [81].
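Minimal sketches of these metrics (shapes and helper names are illustrative assumptions; the multimodal variants are omitted) might look like:

```python
import numpy as np

def ade(pred, gt):
    """Average displacement error over all frames and joints."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def fde(pred, gt):
    """Final displacement error: error on the last predicted frame only."""
    return np.mean(np.linalg.norm(pred[-1] - gt[-1], axis=-1))

def apd(samples):
    """Average pairwise distance between K sampled futures (diversity).

    samples: (K, T, J, 3). Each future is flattened and compared to the others.
    """
    flat = samples.reshape(samples.shape[0], -1)
    K = flat.shape[0]
    dists = [np.linalg.norm(flat[i] - flat[j])
             for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(dists))

rng = np.random.default_rng(0)
gt = np.zeros((10, 17, 3))                     # 10-frame ground-truth future
samples = rng.standard_normal((5, 10, 17, 3))  # 5 sampled futures
best = min(range(5), key=lambda k: ade(samples[k], gt))
print(ade(samples[best], gt), fde(samples[best], gt), apd(samples))
```

For stochastic methods, ADE/FDE are typically reported for the best of K samples (as with `best` above), while APD summarizes how far apart the K samples spread.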

B. LACK OF INFORMATIVE METRIC
As mentioned above, the evaluation of human motion prediction methods is challenging. Most researchers have used the L2 norm (Euclidean distance) to evaluate their prediction results. However, a good L2 score does not always translate into an accurate prediction or a good model. Hence, a new evaluation metric is required for long-term predictions, where deviations from the ground truth are inevitable.
Some researchers have attempted to create new metrics for evaluating long-term predictions. Ghosh et al. [18] proposed the use of an action classifier network to evaluate their predictions. It can highlight large deviations, as a strongly deviating motion will no longer be classified correctly. A disadvantage of this approach is that it cannot evaluate the finer details of a motion.

VII. FUTURE TRENDS
Based on our survey of the state of the art in human motion prediction, we present a few possible future trends.

A. EXTENDED MOTION DATASET
A human motion prediction method gains its ability to predict by learning from a large number of motion sequences. Currently, the most used datasets consist of common daily human motions such as walking, eating, and discussing, as in the Human3.6M dataset [14]. However, a few action types cannot represent the diversity of human motion in real life. For example, predicting a particular sport or dancing action requires a large-scale motion sequence dataset containing such action types, which is infeasible in many cases.
In this context, it would be a promising approach to automatically generate large-scale datasets with diverse action types using existing computer vision techniques and public-domain videos from platforms such as YouTube, Dailymotion, and Vimeo. In addition, developing data augmentation techniques for human motion datasets might be useful.

B. SEMANTIC EVALUATION METRICS
We observe that most methods are evaluated using distance-based metrics. Controlled deterministic prediction methods are evaluated using MAnE or MPJPE, while diverse prediction methods are evaluated using a set of modified distance metrics. However, simply calculating the distance between predicted and ground-truth joints is not an ideal way to measure the suitability of the predicted human motion. Note that distance-based metrics ignore the semantic nature of human motion, which becomes more important unless we predict very short sequences. Therefore, further research should be conducted to account for more informative aspects or the semantic correctness of the predicted motion.

C. HANDLING APERIODIC AND COMPLEX MOTIONS
It is observed in Table 2 that the discussion and smoking actions are more difficult to predict than others. This shows that existing methods still struggle to perform well on these actions because they are more aperiodic than, for example, the walking action. In addition, most methods are developed to perform well across all action types, and no method aims to specifically handle various aperiodic motions.
In future research, general and complex motions should be the target of prediction. To this end, a more rigorous analysis of human motion is necessary, and predictors should be customized to particular motion domains (e.g., soccer, dancing).

D. FULL 3D HUMAN MESH PREDICTION
Existing methods commonly utilize skeleton-type joints to represent the human body. However, as mentioned by Zhang et al. [41], the skeleton-type joint structure leaves out many important details because the skeleton itself is invisible to the naked eye. For example, the bone lengths may vary over time, and the predicted joints may not conform to a valid human body. This can be alleviated by predicting the full 3D human mesh while enforcing the physical limitations of the dynamic human body.

E. CONDITIONING ON A DYNAMIC VARIABLE
Throughout this study, we have surveyed human motion prediction methods that generate predictions from a given input motion sequence. We believe that adding a dynamic input variable to control human motion prediction would be beneficial for many applications. For example, in video games, it would be challenging but promising to add a dynamic directional input or a dynamic action-type variable to simulate a variety of real-life scenarios by changing the variable dynamically.

F. APPLICATIONS OF HUMAN MOTION PREDICTION
We observe that most human motion prediction methods are designed to predict the continuation of common daily actions. However, accurately predicting human motion can be beneficial for many applications, including sports broadcasting, video games, surveillance, and safety.
Accurately predicting the future motion of a sports player could reveal the potential outcome of a match, and variations of such predictions might have a huge commercial impact on the sports broadcasting industry. Human motion prediction methods can also be extended for surveillance and security purposes by detecting and predicting abnormal actions with potential hazards. In an autonomous driving system, a camera in the vehicle can predict the future motion of multiple pedestrians, dramatically reducing the possibility of an accident.

VIII. CONCLUSION
In this report, we summarized the recent advances in human motion prediction. We observed that this topic has garnered increasing attention every year and has seen significant improvements over earlier works. However, human motion prediction remains a challenging problem, and most studies have struggled to produce accurate results in long-term prediction despite excelling in short-term prediction. Dataset availability is also a limitation, as new motion sequence datasets cannot be constructed with ease, and motion prediction methods can only learn the distribution of the actions available in the dataset. These issues, coupled with error accumulation and the tendency to converge to the mean pose in long-term prediction, mean that we are far from applying motion prediction methods in real-world scenarios. However, we have observed an increase in large motion sequence datasets, which could facilitate future research on human motion prediction. Furthermore, the deep learning field has experienced fast and steady growth; therefore, it is reasonable to expect many developments in the future.