ADHERENT: Learning Human-like Trajectory Generators for Whole-body Control of Humanoid Robots

Human-like trajectory generation and footstep planning have been a longstanding open problem in humanoid robotics. Meanwhile, research in computer graphics has kept developing machine-learning methods for character animation based on training human-like models directly on motion capture data. Such methods proved effective in virtual environments, mainly focusing on trajectory visualization. This paper presents ADHERENT, a system architecture integrating machine-learning methods used in computer graphics with whole-body control methods employed in robotics to generate and stabilize human-like trajectories for humanoid robots. Leveraging human motion capture locomotion data, ADHERENT yields a general footstep planner, including forward, sideways, and backward walking trajectories that blend smoothly from one to another. At the joint configuration level, ADHERENT computes data-driven whole-body postural references coherent with the generated footsteps, thus increasing the human likeness of the resulting robot motion. Extensive validations of the proposed architecture are presented with both simulations and real experiments on the iCub humanoid robot. Supplementary video: https://sites.google.com/view/adherent-trajectory-learning.


Introduction
The general problem of generating trajectories for humanoid robots still remains a challenge for the robotics community. The complexity of the problem increases considerably when targeting real-time trajectory generation for different environmental conditions and robot locomotion modes. For instance, whole-body trajectory generation methods for robot walking soon become numerically intractable due to the high dimensionality of the problem, especially when the overall generated motion is required to fulfill a certain degree of human likeness. We propose a system architecture for efficiently addressing whole-body human-like trajectory generation for humanoid robots. The architecture leverages recent research in computer graphics (CG) targeting character animation via learning-based methods [1][2][3][4][5] (see Appendix, Sec. A.2). We focus on Mixture of Experts (MoE) methods such as Phase-Functioned Neural Networks (PFNN) [1] and Mode-Adaptive Neural Networks (MANN) [2]. We integrate the latter with state-of-the-art hierarchical whole-body humanoid robot control methods, which proved effective on a diverse set of real-world humanoids [6][7][8][9][10][11][12][13] (see Appendix, Sec. A.1). Learning-based methods notably improve generality and human likeness of trajectory generation, while whole-body hierarchical controllers provide the reliability and robustness required in the real world.

ADHERENT
The proposed ADHERENT architecture consists of four main components: Dataset Collection, Retargeting, Trajectory Generation, and Trajectory Control (see Fig. 1). In the following, we present the methods implementing each component. Thanks to ADHERENT's modularity, specific methods can be easily replaced by more efficient and effective ones in future instances of the architecture.

Dataset Collection
The Dataset Collection component is in charge of acquiring human locomotion data. For this, we use our human wearable data processing framework [14,15], which fuses data from a sensorized suit by Xsens Technologies [16] carrying 17 wireless inertial sensors distributed over the entire body. Data span a wide range of walking motions (forward, backward, lateral, diagonal) performed on flat terrain with continuously changing steering direction. Stops and restarts are included in the sequences, which are characterized by steps of variable length. Please refer to the Appendix (Sec. C.1) for further details.

Retargeting
The Retargeting component adjusts the human trajectories so that, when the modified trajectories are applied to the robot, its motion turns out to be similar to the human one. We retarget the collected motion capture (MoCap) data by enhancing the Whole-Body Geometric Retargeting (WBGR) technique (see Appendix, Sec. B.2) with a kinematically-feasible base motion retargeting that renders the robot base motion compatible with the retargeted joint trajectories (see Appendix, Sec. C.2).

Trajectory Generation
We interactively generate trajectories for the robot by exploiting the MANN architecture detailed in the Appendix (Sec. B.3). The training dataset preparation and network structure are inspired by [2].
Features extraction. The input and output vectors for MANN are extracted by preprocessing the retargeted MoCap dataset. The input $x_i$ at time step $t_i$ includes the robot state at $t_{i-1}$ and the ground base trajectory data at $t_i$, i.e., past and future base trajectory data projected on the ground, subsampled to obtain 12 data points equally spaced on a 2 s window centered at $t_i$. The output $y_i$ at time step $t_i$ includes the robot state at $t_i$, the ground angular base transformation from $t_{i-1}$ to $t_i$, and the future ground base trajectory data at $t_{i+1}$ (consisting of 6 data points equally spaced on a 1 s window starting from $t_i$). Further details on the input and output vectors are included in the Appendix (Sec. C.3).
Network structure. The MANN architecture used in this work is composed of a Motion Prediction Network and a Gating Network with 3 hidden layers of 512 and 32 units each, respectively. ELU activation [17] is used. The Gating Network receives the full input $x_i$. We use $K = 4$ experts.
User input processing. At inference time, the user provides two continuous signals via joypad to interactively generate trajectories: i) the motion direction, i.e., the direction in which the user wants the robot to move; and ii) the facing direction, i.e., the direction with which the user wants the robot to align the mean of its base and torso horizontal pointing directions. At fixed facing direction, varying the motion direction allows switching between frontal, sideways, and backward walking. At fixed motion direction, varying the facing direction allows steering. Moreover, releasing the analog stick for the motion direction leads the robot to a stop. A step-by-step description of the user input processing, which is critical for the predictive performance of MANN, is provided in the Appendix (Sec. C.4).
Network output postprocessing. The network output $y_i$ is used to update the robot configuration. In particular, the joint position $s_i$ included in $y_i$ becomes the new joint configuration, and the ground angular base velocity $\dot{b}^a_i$ is exploited to update the base orientation. The base position is instead updated by applying the same kinematic feasibility procedure used at the retargeting stage (Sec. 2.2).
Footstep and postural extractor. The desired feet positions and orientations composing the footstep plan are retrieved from the generated trajectory by the Footstep Extractor. In particular, a new foot position is added to the plan once the support foot changes. Concerning orientations, in the case of flat terrain the plan only requires the predicted yaw angle of the support foot, while roll and pitch angles are set to zero. The joint positions $s_i$ included in the network prediction $y_i$ constitute a whole-body human-like postural. However, having been trained on a MoCap dataset collected at 60 fps, the network generates references compatible only with that frequency, while the whole-body Quadratic Programming (QP) control layer may require posturals at a different frequency. The Postural Extractor therefore interpolates the network's predictions to obtain posturals at the required frequency.
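The two extractors can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the string labels for the support foot, the choice of linear interpolation, and the 100 Hz target rate (taken from the control rate reported in the Results) are assumptions made for illustration.

```python
import numpy as np

def extract_footsteps(support_foot, foot_xy, foot_yaw):
    """Add a (foot, x, y, yaw) entry each time the support foot changes."""
    plan = []
    for k, foot in enumerate(support_foot):
        if k == 0 or foot != support_foot[k - 1]:
            x, y = foot_xy[k]
            # Flat terrain: only yaw is stored; roll and pitch are taken as zero.
            plan.append((foot, float(x), float(y), float(foot_yaw[k])))
    return plan

def resample_posturals(postures, src_rate=60.0, dst_rate=100.0):
    """Linearly interpolate (T x n_joints) postures from src_rate to dst_rate."""
    postures = np.asarray(postures, dtype=float)
    t_src = np.arange(postures.shape[0]) / src_rate
    # Target time stamps covering the same time span as the source samples.
    n_dst = int(np.floor(t_src[-1] * dst_rate + 1e-9)) + 1
    t_dst = np.arange(n_dst) / dst_rate
    # np.interp handles 1-D data, so each joint is interpolated independently.
    return np.stack(
        [np.interp(t_dst, t_src, postures[:, j]) for j in range(postures.shape[1])],
        axis=1,
    )

# Hypothetical support-foot sequence with the corresponding foot poses.
support = ["left", "left", "right", "right", "left"]
xy = np.array([[0.0, 0.1], [0.0, 0.1], [0.2, -0.1], [0.2, -0.1], [0.4, 0.1]])
yaw = np.zeros(5)
plan = extract_footsteps(support, xy, yaw)

# One second of 60 fps predictions for 32 joints (random placeholders).
rng = np.random.default_rng(0)
network_postures = rng.standard_normal((61, 32))
control_postures = resample_posturals(network_postures)
```

Here `plan` contains three footsteps (left, right, left), and `control_postures` holds 101 samples spanning the same one-second window as the 61 network predictions.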

Trajectory Control
We execute the walking trajectories on the robot by leveraging the three-layer control architecture described in the Appendix (Sec. B.4). Given the footstep plan provided by the Footstep Extractor, a reference Divergent Component of Motion (DCM) [18] trajectory is obtained by the planner included in the Trajectory Optimization layer. In particular, we adopt the implementation from [19], providing a feasible DCM trajectory even for variable Center of Mass (CoM) height. Given a desired DCM trajectory, at each control cycle the Simplified Model Control layer computes the desired CoM velocity $\dot{x}^*$ by concatenating Eq. (1) and Eq. (2) from the Appendix (Sec. B.4). Then, the desired CoM position $x^*$ is retrieved by Euler integration. A QP problem of the form defined in Eq. (3) from the Appendix (Sec. B.4) is formulated from the postural returned by the Postural Extractor. The desired feet poses and CoM trajectory $\{\dot{x}^*, x^*\}$ are set as hard constraints. An additional soft constraint aims to zero the chest roll and pitch angles. Finally, the desired joint velocity $\dot{s}^*$ included in the solution of the QP problem is integrated, and the resulting desired joint position $s^*$ is sent to the robot.

Results
Here, we report the results obtained after training our architecture on the processed MoCap dataset. MANN is trained in a classical regression setting, minimizing the mean squared error between the ground truth and the network prediction. The Appendix also includes training details (Sec. D.1) and additional analyses on the robustness (Sec. D.2) and blending coefficients activation (Sec. D.3).
Trajectory generation. The Trajectory Generation component interactively generates walking trajectories. Each prediction requires around 3 ms on a 9th-generation Intel Core i7 CPU @ 2.60 GHz. By varying motion and facing directions, the user can move the robot forward, backward, and sideways in a human-like fashion. Changes in the input signals promptly translate into smooth transitions between different walking patterns. By releasing the motion direction stick, the user can stop the robot and then restart the motion at will. Fig. 2 (top) shows a complex trajectory, including several walking patterns and smooth transitions between them. The footstep positions extracted from the entire trajectory are visualized in red and blue for the right and left foot, respectively. A larger variety of trajectory generations is reported in the supplementary video.
Figure 2: [...] right-oriented forward (4-6), right-side (7-11), and backward (12-15) walking, as well as smooth transitions between them, and a final stop (16). Below each frame, the user inputs interactively shaping the trajectory are plotted from the local viewpoint of the simulated robot (red: desired motion direction; blue: desired facing direction). Bottom: The very same trajectory on iCub.
Trajectory control. The generated trajectories are executed on the real-world 32-Degree of Freedom (DoF) iCub humanoid robot [20], which is 104 cm tall and weighs approximately 33 kg. The control architecture composed of the Simplified Model Control and Whole-Body QP Control layers runs at 100 Hz on a 4th-generation Intel Core i7 @ 1.7 GHz. Fig. 2 (bottom) illustrates the successful execution of the complex trajectory whose generation is shown in the upper part of the same figure. The footstep sequence performed by the robot is added to the visualization for the sake of clarity. The execution of this trajectory, along with others, is also presented in the supplementary video.
Human likeness. We evaluate the human likeness of the trajectories executed by ADHERENT on the real-world iCub robot. We compare them with walking trajectories adopting a fixed postural for the upper body, as is often the case in classical humanoid robot locomotion. A side-by-side comparison for forward walking is shown in the supplementary video. As can be observed, the overall motion with the ADHERENT postural shows an improved human likeness. Additional considerations on the degree of human likeness achieved with ADHERENT are included in the Appendix (Sec. D.4).

Discussion
ADHERENT is able to generate reference trajectories in real time (3 ms per prediction step), thanks to efficient neural-network-based feedforward prediction. Therefore, it achieves comparable efficiency with respect to simplified-model trajectory generators, while being able to produce more general high-dimensional whole-body trajectories that state-of-the-art methods can only generate offline due to excessive computational complexity. We also demonstrate that ADHERENT-generated trajectories can be successfully executed on an advanced humanoid robot whose physical properties significantly differ from those of the human body. Such successfully-executed trajectories robustly cover a broad range of motion types and steering capabilities. Improving human likeness of the robot motion is an additional feature enabled by ADHERENT. We show that learning-driven human-like motions generated by the network can be successfully transferred to the real robot.

A.1 Humanoid Robot Locomotion
State-of-the-art architectures for humanoid locomotion simplify the whole-body trajectory generation problem by hierarchically decomposing it into several layers [1]. Layers' functionalities can be categorized [2] into: i) Trajectory optimization, providing a high-level footstep plan given user input; ii) Simplified model control, computing feasible CoM trajectories given the footsteps; and iii) Whole-body QP control, producing dynamically-feasible joint trajectories. Instead of directly optimizing over large configuration spaces, the first two layers tend to use simplified models to compute solutions. For instance, the unicycle planner [3] employs a unicycle model to produce footstep plans at the trajectory optimization layer, constraining the plan to simple directed walking on a plane [4], [5].
In recent years, this class of hierarchical architectures has been successfully applied to produce robust walking on a diverse range of complex humanoids [1, 6-8, 3, 2, 9, 10], also allowing for the integration of reactive strategies [11,12]. However, to reduce computational cost and allow for online operation, simplified models do not fully represent the complex mechanical structure of the humanoid. As a result, they restrict the set of attainable solutions and the resulting behaviors with respect to those achieved by humans. In particular, they cannot efficiently compute walking patterns with unconstrained footstep placement. Moreover, whole-body human likeness is hard to explicitly encode and optimize for, compared to other attributes such as feasibility, stability, and robustness, and is therefore usually neglected in such schemes.
Data-driven models of human trajectories have recently been explored to enable human-like behavior in robotics [13,14]. Applications include anticipatory trajectory generation for human-robot collaboration [15]. Still, such methods are focused on overall path planning (i.e., CoM trajectory), and do not target human likeness at the joint or footstep level.

A.2 Character Animation in Computer Graphics
It is important to note that the problem of human-like trajectory generation is not limited to robotics research. Indeed, it is a prominent topic in CG research too, especially due to applications to realistic character animation, and has witnessed several recent breakthroughs based on the introduction of machine-learning methods. The core of the problem can be framed as the kinematic prediction of the whole-body joint configuration at the next time step, given the current configuration and the high-level target trajectory to be followed (e.g., obtained from human input).
Many works approach this problem by modeling it as a nonlinear autoregressive model with exogenous inputs. They employ powerful learning-based predictive models able to capture the motion's complexity in high dimensions. In PFNN [16], the predictive model is a phase-weighted mixture of neural networks trained on human motion capture data. At prediction time, the network weights are blended according to a cyclic phase function encoding the periodicity of the walking motion. This resulted in a significant breakthrough for character control, enabling remarkably natural motion and smooth transitions. However, training data need to be annotated with phase function values, which can be costly or unfeasible for complex and non-periodic motions. In MANN [17], the latter problem is solved by substituting the fixed phase function with a gating network, which learns end-to-end how to effectively blend the network weights. Note that both PFNN and MANN are limited to trajectory generation for kinematic rendering only. In fact, their target applications are settings in which natural visual appearance, rather than dynamic control of a real-world system, is the primary requirement (e.g., video games).
In [18], Motion Matching is employed alongside reinforcement learning (RL) to retrieve motions from a MoCap dataset and provide them as references to train a policy in simulation. Improvements in terms of memory usage were proposed in [19]. DeepMimic [20] employs MoCap data to guide policy training via an imitation reward component. However, although having demonstrated remarkable capabilities in simulation and on real quadrupeds [21], RL approaches are severely limited by substantial inefficiencies and are yet to be successfully applied to real-world humanoids.

B.1 Notation
Please refer to the following notation for the quantities introduced in the remainder of the Appendix:
• $I$ and $B$ denote the inertial frame and the base frame of the robot. In the specific case of iCub [22], $B$ is positioned at the level of the waist, in between the two legs, with the X axis pointing backward and the Z axis upwards.
• Given two frames $A$ and $C$, ${}^A R_C \in SO(3)$ represents the rotation matrix between the frames, i.e., given two vectors ${}^A p, {}^C p \in \mathbb{R}^3$ respectively expressed in $A$ and $C$, the rotation matrix ${}^A R_C$ is such that ${}^A p = {}^A R_C \, {}^C p$.
• Superscripts $\cdot^H$ and $\cdot^R$ indicate quantities referring to the human and the robot, respectively.
• The $m \times m$ identity and zero matrices are denoted by $I_m$ and $0_m$, respectively.
• When referring to network inputs $x$, outputs $y$, weights $\hat{\alpha}$, and blending coefficients $\theta$, or to their elements, the subscript $\cdot_i$ indicates quantities of the $i$-th time step $t_i$.
• The $\mathrm{vec}(\cdot)$ operator vectorizes matrices by rows.
• Given $a, b \in \mathbb{R}^3$, we define $a^\wedge = A \in \mathbb{R}^{3 \times 3}$ as the skew-symmetric matrix such that $A b = a \times b$.
• $\nu = ({}^I \dot{p}_B, {}^I \omega_B, \dot{s})$ is the generalized velocity of the complete floating-base system, where ${}^I \dot{p}_B$ is the linear velocity of the base origin, $\dot{s}$ collects the joint velocities, and ${}^I \omega_B$ is the angular velocity of the base frame w.r.t. the inertial frame, whose coordinates are expressed in the inertial frame, i.e., ${}^I \dot{R}_B = {}^I \omega_B^\wedge \, {}^I R_B$.
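As a quick numerical illustration of the skew-symmetric operator used in the notation, the following NumPy sketch builds $a^\wedge$ and checks that it reproduces the cross product. The function name `skew` and the example vectors are arbitrary choices made for illustration.

```python
import numpy as np

def skew(a):
    """Return the 3x3 skew-symmetric matrix A such that A @ b == cross(a, b)."""
    a = np.asarray(a, dtype=float)
    return np.array([
        [0.0, -a[2], a[1]],
        [a[2], 0.0, -a[0]],
        [-a[1], a[0], 0.0],
    ])

a = np.array([0.1, -0.2, 0.3])
b = np.array([1.0, 2.0, 3.0])
lhs = skew(a) @ b      # matrix-vector product
rhs = np.cross(a, b)   # cross product, for comparison
```

The same operator appears in the rotation kinematics, where the time derivative of a rotation matrix is the skew-symmetric matrix of the angular velocity multiplied by the rotation itself.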

B.2 Whole-Body Geometric Retargeting
Among the various approaches to human motion retargeting (see, e.g., [23][24][25][26]), WBGR is a recent method that is easily adaptable to different robot models and human subjects [27]. Assuming a degree of topological similarity between the human's and the robot's mechanical structures, WBGR makes use of $m$ correspondences between the frames associated with $m$ human and robot links at a reference configuration. Then, given the human link orientations ${}^I R_i^H$, $i \in \{1, \dots, m\}$, to be retargeted onto the robot, WBGR retrieves the robot joint angles by solving the inverse kinematics problem with the robot orientations ${}^I R_i^R = {}^I R_i^H \, {}^H R_i^R$ as targets, where each ${}^H R_i^R$ is a proper constant rotation matrix accounting for possible human-robot frame misalignment.
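The composition of rotations that defines the inverse kinematics targets can be sketched in a few lines of NumPy. The angles and the elementary rotation helpers below are illustrative assumptions; the inverse kinematics solver itself is not shown.

```python
import numpy as np

def rot_z(angle):
    """Elementary rotation about the Z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(angle):
    """Elementary rotation about the X axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Measured orientation of one human link in the inertial frame (made-up value).
I_R_H = rot_z(0.3)
# Constant human-to-robot frame offset for the same link (made-up value).
H_R_R = rot_x(np.pi / 2)
# Target orientation handed to the inverse kinematics for the robot link.
I_R_R = I_R_H @ H_R_R
```

Because both factors are proper rotations, the target `I_R_R` is itself a proper rotation, which is what an orientation task in inverse kinematics expects.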

B.3 Mode-Adaptive Neural Networks
MANN is a recently proposed neural network architecture for responsive character motion generation, specifically designed for multi-modal and unlabeled data [17]. In particular, assume that $x_i$ encodes the previous configuration of the controlled character as well as the desired future motion specified by the user. Then, MANN predicts a new configuration $y_i$ for the character that achieves the user-specified motion. The next user input is combined with $y_i$, forming the next autoregressive network input $x_{i+1}$. This enables MANN to iteratively generate trajectories following the MoCap data distribution while being responsive to the user. The main characteristic of this architecture, which builds upon the Mixture of Experts paradigm [28], is that it is composed of two subnetworks:
• The Motion Prediction Network: given $x_i$, it predicts $y_i$;
• The Gating Network: given $x_i$ or a subsampled input $\hat{x}_i$, it predicts the blending coefficients vector $\theta_i = [\theta_{i1}, \dots, \theta_{iK}]^\top$ used to dynamically compute the weights vector $\hat{\alpha}_i$ of the Motion Prediction Network from the $K$ expert network weights vectors $\{\alpha_1, \dots, \alpha_K\}$.
In an end-to-end training procedure from unstructured MoCap data, both the weights $\mu$ of the Gating Network and the $K$ expert weights $\{\alpha_1, \dots, \alpha_K\}$ are learned. At runtime, the weights $\hat{\alpha}_i$ of the Motion Prediction Network at time step $i$ are dynamically computed by linearly combining the $K$ experts $\{\alpha_1, \dots, \alpha_K\}$ with the blending coefficients $\theta_i$ predicted by the Gating Network, that is,
$$ \hat{\alpha}_i = \sum_{k=1}^{K} \theta_{ik} \, \alpha_k . $$
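The runtime blending of expert weights can be sketched as a NumPy forward pass. This is an untrained illustration rather than the authors' network: the random initialization, the helper names, the softmax on the gating output, and the literal reading of "3 hidden layers" (four weight matrices per network) are assumptions. The input and output sizes of 137 and 101 follow the dimensions reported in Sec. C.3.

```python
import numpy as np

def elu(z):
    """ELU activation; the exponential is evaluated only on non-positive values."""
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_mlp(rng, sizes):
    """Random (W, b) pairs for a fully connected network (untrained placeholder)."""
    return [(0.1 * rng.standard_normal((o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for W, b in params[:-1]:
        x = elu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

rng = np.random.default_rng(0)
n_in, n_out, K = 137, 101, 4
gating = init_mlp(rng, [n_in, 32, 32, 32, K])
experts = [init_mlp(rng, [n_in, 512, 512, 512, n_out]) for _ in range(K)]

def mann_forward(x):
    # Blending coefficients theta_i (softmax is an assumption of this sketch).
    theta = softmax(mlp(gating, x))
    # Linear combination of the K expert weights, layer by layer.
    blended = [
        (sum(t * e[l][0] for t, e in zip(theta, experts)),
         sum(t * e[l][1] for t, e in zip(theta, experts)))
        for l in range(len(experts[0]))
    ]
    return mlp(blended, x), theta

x = rng.standard_normal(n_in)
y, theta = mann_forward(x)
```

Since the blended weights are recomputed at every step, the Motion Prediction Network effectively changes with the gait phase and walking mode, without any phase label being supplied.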

B.4 A Three-layer Control Architecture for Humanoid Robot Locomotion
A state-of-the-art control architecture for humanoid robot locomotion relevant for this work is composed of three nested layers that exploit both simplified and complete robot models [2]. Given the footsteps, in the outer trajectory optimization loop, an exponential interpolation technique is used to plan a desired DCM trajectory, smoothed via a third-order polynomial during the double support phases. Then, the central simplified model control loop is in charge of stabilizing the DCM dynamics by using the Zero Moment Point (ZMP) position $r_{zmp} \in \mathbb{R}^2$ as control input. The tracking of the desired DCM position and velocity $\xi_{ref}, \dot{\xi}_{ref} \in \mathbb{R}^2$ is guaranteed by the instantaneous control law given by:
$$ r^{ref}_{zmp} = \xi_{ref} - \frac{1}{\omega} \dot{\xi}_{ref} + K^{\xi}_p (\xi - \xi_{ref}) + K^{\xi}_i \int (\xi - \xi_{ref}) \, dt \tag{1} $$
where $\xi$ is the current DCM, $K^{\xi}_p > I_2$, $K^{\xi}_i > 0_2$, $\omega = \sqrt{g / z_0}$, $g$ is the gravitational acceleration constant, and $z_0$ denotes the constant CoM height assumed for the Linear Inverted Pendulum (LIP) model [29]. The desired ZMP position $r^{ref}_{zmp}$ is then stabilized along with the reference ground CoM position and velocity $x_{ref}, \dot{x}_{ref} \in \mathbb{R}^2$ by means of the control law given by:
$$ \dot{x}^* = \dot{x}_{ref} - K_{zmp} (r^{ref}_{zmp} - r_{zmp}) + K_{com} (x_{ref} - x) \tag{2} $$
where $x$ is the current ground CoM position, $K_{com} > \omega I_2$, and $0_2 < K_{zmp} < \omega I_2$. Finally, the inner whole-body QP control loop computes the robot velocity $\nu$ as the solution to a stack-of-tasks formulation with hard and soft constraints, cast as a QP problem of the form:
$$ \min_{\nu} \; \frac{1}{2} \nu^\top H \nu + g^\top \nu \quad \text{s.t.} \quad A_c \nu = b_c , \quad \dot{s}_{\min} \le \dot{s} \le \dot{s}_{\max} \tag{3} $$
where $H$ and $g$ are evaluated from a low-priority postural task (soft constraint), while $A_c$ and $b_c$ are retrieved from the chosen high-priority tasks (hard constraints), and the final inequalities encode constraints on the maximum joint velocity. The joint velocity from the solution of the above problem, obtained via standard QP solvers, is directly integrated to get joint positions for position control.
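The simplified model control loop can be illustrated with a small closed-loop simulation. The sketch below assumes the standard instantaneous DCM and ZMP-CoM control laws from the DCM literature (proportional-integral DCM feedback on the ZMP, followed by a CoM velocity law) and an idealized plant in which the measured ZMP equals its reference. The CoM height, gains, and references are illustrative values chosen only to satisfy the gain conditions stated above; they are not iCub's parameters.

```python
import numpy as np

g, z0 = 9.81, 0.5                 # z0: illustrative constant CoM height [m]
omega = np.sqrt(g / z0)           # LIP natural frequency
dt = 0.01                         # 100 Hz control cycle
Kp, Ki = 2.0 * np.eye(2), 0.1 * np.eye(2)          # Kp > I_2, Ki > 0_2
K_com, K_zmp = 5.0 * np.eye(2), 1.0 * np.eye(2)    # K_com > omega*I_2, 0 < K_zmp < omega*I_2

def zmp_reference(xi, xi_ref, xi_dot_ref, err_int):
    """Instantaneous DCM controller: desired ZMP from the DCM tracking error."""
    return xi_ref - xi_dot_ref / omega + Kp @ (xi - xi_ref) + Ki @ err_int

def com_velocity(x, x_ref, x_dot_ref, r_zmp, r_zmp_ref):
    """ZMP-CoM controller: desired ground CoM velocity."""
    return x_dot_ref - K_zmp @ (r_zmp_ref - r_zmp) + K_com @ (x_ref - x)

# Static references: the DCM and CoM should settle on the same ground point.
xi_ref, xi_dot_ref = np.array([0.10, 0.05]), np.zeros(2)
x_ref, x_dot_ref = xi_ref.copy(), np.zeros(2)

xi, x, err_int = np.zeros(2), np.zeros(2), np.zeros(2)
initial_error = np.linalg.norm(xi - xi_ref)
for _ in range(500):                                 # 5 s of simulated time
    r_ref = zmp_reference(xi, xi_ref, xi_dot_ref, err_int)
    r_meas = r_ref                                   # idealized ZMP tracking (assumption)
    x_dot_des = com_velocity(x, x_ref, x_dot_ref, r_meas, r_ref)
    x = x + dt * x_dot_des                           # Euler integration of the desired CoM
    xi = xi + dt * omega * (xi - r_meas)             # LIP DCM dynamics
    err_int = err_int + dt * (xi - xi_ref)
final_error = np.linalg.norm(xi - xi_ref)
```

With the proportional gain above the identity, the unstable DCM dynamics become contracting, so the DCM error shrinks over the simulated window while the integrated CoM converges to its reference.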

C ADHERENT

C.1 Dataset Collection Details
As detailed in Table 1, each kind of motion in the collected dataset is performed for several minutes in a row. Then, plenty of transitions between different motions are collected in a long mixed sequence. Our final dataset comprises around 1 h of unlabeled MoCap data at 60 fps. We then double it by mirroring, i.e., for each data point the base orientation is mirrored with respect to the world X-Z plane, while the left and right link orientations for the limbs are switched and mirrored with respect to the human model's mid-sagittal plane, resulting in a total of 441570 data points.
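The reported dataset size can be cross-checked with a few lines of arithmetic, assuming that the 441570 data points count both the original and the mirrored frames.

```python
total_points = 441570              # data points after mirroring
fps = 60                           # MoCap acquisition rate
original_points = total_points // 2
duration_min = original_points / fps / 60
```

This gives 220785 original frames, i.e., roughly 61 minutes of recording, consistent with the "around 1 h" of MoCap data mentioned above.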

C.2 Kinematically-feasible Base Motion Retargeting
WBGR in [27] does not address the base motion retargeting, i.e., the retargeted base position and orientation may not be compatible with the robot kinematics, thus possibly leading to a robot moving forward faster than what its walking pace entails. In other words, a swaying effect arises when dynamic motions are retargeted to robot models structurally different from the human subject [27].
While in CG generating models which fit the collected data is a viable workaround [17] [18], base motion retargeting for actual robots requires special attention.
First, assume that: i) the robot makes at least one known contact with the environment; and ii) each foot is modeled as a rectangular patch. Then, we propose the following approach for kinematically-feasible base motion retargeting:
1. The contact point ${}^I p_c$ is identified as the lowest among the 8 vertices of the feet's rectangular approximations;
2. The retargeted base orientation ${}^I R_B$ is directly retrieved from the MoCap data;
3. The retargeted base position ${}^I p_b$ is computed by forward kinematics from ${}^I p_c$, constrained to remain fixed between two consecutive retargeting steps, via ${}^I p_b = {}^I p_c + {}^I R_C \, {}^C p_b$, where $C$ is the frame attached to the contact point (i.e., the frame placed in the lowest vertex and oriented as the support foot) and ${}^C p_b$ is the base position, expressed in $C$, computed by forward kinematics in the updated joint configuration returned by the latest WBGR iteration.
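The three steps above can be sketched with NumPy as follows. The foot geometry, the contact frame orientation, and the base position in the contact frame are made-up numbers; in a real pipeline the latter two would come from the robot's forward kinematics, which is not modeled here.

```python
import numpy as np

def contact_vertex(feet_vertices):
    """Step 1: pick the lowest of the 8 foot vertices (rows are x, y, z in I)."""
    feet_vertices = np.asarray(feet_vertices, dtype=float)
    return feet_vertices[np.argmin(feet_vertices[:, 2])]

def retarget_base_position(I_p_c, I_R_C, C_p_b):
    """Step 3: base position from the fixed contact point, I_p_b = I_p_c + I_R_C @ C_p_b."""
    return np.asarray(I_p_c) + np.asarray(I_R_C) @ np.asarray(C_p_b)

# Hypothetical rectangular foot patches: the left foot touches the ground,
# the right foot is lifted by 3 cm.
left = np.array([[0.1, 0.13, 0.002], [0.1, 0.07, 0.0],
                 [-0.1, 0.13, 0.004], [-0.1, 0.07, 0.003]])
right = np.array([[0.1, -0.13, 0.03], [0.1, -0.07, 0.03],
                  [-0.1, -0.13, 0.03], [-0.1, -0.07, 0.03]])
I_p_c = contact_vertex(np.vstack([left, right]))

I_R_C = np.eye(3)                      # contact frame aligned with I (assumption)
C_p_b = np.array([-0.05, 0.0, 0.5])    # would come from forward kinematics
I_p_b = retarget_base_position(I_p_c, I_R_C, C_p_b)
```

Keeping `I_p_c` fixed between consecutive retargeting steps is what prevents the base from advancing faster than the feet allow.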
As a result, we obtain retargeted motions that resemble human ones at the links level and are kinematically feasible at the base level, as shown in the supplementary video.

C.3 Features Extraction Details
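In this work, the input vector $x_i$ for MANN is defined as follows:
$$ x_i = \left[ \mathrm{vec}(P_i)^\top, \; \mathrm{vec}(D_i)^\top, \; \mathrm{vec}(V_i)^\top, \; \ell_i, \; s_{i-1}^\top, \; \dot{s}_{i-1}^\top \right]^\top \in \mathbb{R}^{137} \tag{4} $$
where $P_i = \{p^1_i, \dots, p^{12}_i\} \in \mathbb{R}^{2 \times 12}$ are ground base positions, $D_i = \{d^1_i, \dots, d^{12}_i\} \in \mathbb{R}^{2 \times 12}$ are ground facing directions (i.e., mean of base and chest pointing directions, projected on the ground), $V_i = \{v^1_i, \dots, v^{12}_i\} \in \mathbb{R}^{2 \times 12}$ are ground base velocities, $\ell_i \in \mathbb{R}$ is the length of the future ground trajectory, and $s_{i-1}, \dot{s}_{i-1} \in \mathbb{R}^{32}$ are joint positions and velocities at $t_{i-1}$.
The output vector $y_i$ is instead defined as follows:
$$ y_i = \left[ \mathrm{vec}(P_{i+1})^\top, \; \mathrm{vec}(D_{i+1})^\top, \; \mathrm{vec}(V_{i+1})^\top, \; s_i^\top, \; \dot{s}_i^\top, \; \dot{b}^a_i \right]^\top \in \mathbb{R}^{101} \tag{5} $$
where $P_{i+1} = \{p^1_{i+1}, \dots, p^6_{i+1}\} \in \mathbb{R}^{2 \times 6}$ are future ground base positions, $D_{i+1} = \{d^1_{i+1}, \dots, d^6_{i+1}\} \in \mathbb{R}^{2 \times 6}$ are future ground facing directions, $V_{i+1} = \{v^1_{i+1}, \dots, v^6_{i+1}\} \in \mathbb{R}^{2 \times 6}$ are future ground base velocities, $s_i, \dot{s}_i \in \mathbb{R}^{32}$ are joint positions and velocities at $t_i$, while $\dot{b}^a_i = \beta_i / \Delta t_i \in \mathbb{R}$, with $\Delta t_i = t_i - t_{i-1}$, and $\beta_i$ denoting the angle between the ground facing directions at $t_i$ and $t_{i-1}$ (i.e., $d^6_i$ and $d^6_{i-1}$, respectively). All the ground base trajectory data in $x_i$ and $y_i$ are expressed in the bidimensional local reference frame defined by the ground base position at $t_i$ (i.e., $p^6_i$) and the ground facing direction at $t_i$ (i.e., $d^6_i$) along with its orthogonal vector. By stacking all the input and output vectors resulting from the processing above, we obtain the input and output matrices $X \in \mathbb{R}^{441570 \times 137}$ and $Y \in \mathbb{R}^{441570 \times 101}$ which, normalized to have zero mean and unit variance, constitute our training set.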

C.4 User Input Processing Details
The desired motion and facing directions specified by the user are visualized in Fig. 3 (left), from the local viewpoint of a robot proceeding forward while steering left. Details on how these inputs are actually transferred to the network follow, since the user inputs represent an external factor, never seen by the network during training but directly shaping its input x i at runtime.
The user-specified motion and facing directions are smoothly interpolated to generate a desired future ground base trajectory $\{P^*_{i+1}, D^*_{i+1}, V^*_{i+1}\}$ whose components are defined as in Eq. (5). In particular, the user-specified motion direction is used to define the last point of a quadratic Bézier curve starting from $p^6_i$ and constrained to end on the asymmetric shape shown in black in Fig. 3. We obtain $P^*_{i+1}$ by subsampling 6 data points from this Bézier curve. As a result of the asymmetric constraint, $P^*_{i+1}$ is longer for forward than for sideways or backward walking. The user-specified facing direction is instead mapped into a series of facing directions $D^*_{i+1}$ progressively driving the current value to the desired one. $V^*_{i+1}$ is obtained by differentiating $P^*_{i+1}$. Fig. 3 (center) provides a visualization, from the local robot's viewpoint, of $P^*_{i+1}$ (red dots) and $D^*_{i+1}$ (blue vectors) generated from the user-specified inputs (left).
Figure 3: The desired motion and facing directions from the joypad (left) define the desired future ground base trajectory (center). On the right, the user-specified future trajectory (grey) is blended with the future trajectory from the previous network prediction (magenta), leading to the desired future ground base trajectory (green) actually included in the next input to the network.
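The construction of the desired future positions and velocities can be illustrated with a short NumPy sketch. The control point, the end point, the uniform parameter sampling, and the 0.2 s spacing (6 points over a 1 s window) are assumptions made for illustration; the text does not specify how the intermediate control point is chosen.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, n_points=6):
    """Sample n_points on the quadratic Bezier curve with control points p0, p1, p2."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

p0 = np.zeros(2)                 # current ground base position in the local frame
p2 = np.array([0.6, 0.3])        # hypothetical end point set by the motion direction
p1 = np.array([0.3, 0.0])        # hypothetical intermediate control point
P_star = quadratic_bezier(p0, p1, p2)

# Desired velocities by finite differences, with 0.2 s between consecutive points.
V_star = np.gradient(P_star, 0.2, axis=0)
```

Moving the end point farther from the origin for forward motion than for sideways or backward motion would reproduce the asymmetry described above.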

As a last step, $\{P^*_{i+1}, D^*_{i+1}, V^*_{i+1}\}$ is blended with $\{P_{i+1}, D_{i+1}, V_{i+1}\}$ retrieved from the previous output $y_i$ in order to obtain the ground base trajectory data for the next input $x_{i+1}$. For the blending, an example of which is shown in Fig. 3 (right), we follow the method proposed in [16]. For instance, the desired future ground base positions $P^*_{i+1}$ are blended with the future ground positions $P_{i+1}$ output by the network via:
$$ P^{blend}_{i+1} = (1 - t^{\tau_p}) \, P_{i+1} + t^{\tau_p} \, P^*_{i+1} \tag{6} $$
where $t$ goes from 0 to 1 as ground base positions go towards the future (i.e., towards the limit of the considered 1-second future window) and $\tau_p = 1.5$ controls the responsiveness of the character to the user-specified inputs. The very same procedure is employed to blend the future facing directions and base velocities, with $\tau_p$ in Eq. (6) respectively replaced by $\tau_d = 1.3$ and $\tau_v = 1.3$.
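A minimal sketch of the blending step is given below, assuming the PFNN-style rule in which the weight on the user-specified trajectory grows as $t^\tau$ along the future window, so that near-future points follow the network prediction and far-future points follow the user. The uniform spacing of $t$ and the placeholder trajectories are assumptions.

```python
import numpy as np

def blend_trajectory(predicted, desired, tau):
    """Blend (n x 2) predicted and desired trajectories with weights t**tau."""
    predicted = np.asarray(predicted, dtype=float)
    desired = np.asarray(desired, dtype=float)
    # t runs from 0 (nearest future point) to 1 (end of the future window).
    t = np.linspace(0.0, 1.0, predicted.shape[0])[:, None]
    w = t ** tau
    return (1.0 - w) * predicted + w * desired

predicted = np.column_stack([np.linspace(0.0, 0.5, 6), np.zeros(6)])              # network output
desired = np.column_stack([np.linspace(0.0, 0.5, 6), np.linspace(0.0, 0.3, 6)])   # user request
P_blend = blend_trajectory(predicted, desired, tau=1.5)
```

A larger exponent keeps the blended trajectory close to the network prediction for longer, which makes the response to the joypad smoother but less immediate.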
Note that user-specified inputs may result in a desired motion which is absent or rare in the training dataset. In such cases (e.g., when the user requests too abrupt steering), the network may generate unexpected motions. We solve this issue by limiting the local variation of facing direction that the user can require to 45° and 20° for forward and backward/sideways walking, respectively.

C.5 Network Output Postprocessing Details
When the user tries to stop the robot, a small in-place rotation persists. Indeed, given an $x_i$ corresponding to a desired stop, the network predicts a $y_i$ whose $\dot{b}^a_i$ component is slightly different from zero. We solve this issue by imposing $\dot{b}^a_i = 0$ once a stop by the robot is detected. Here, we are referring to stops at the network level, which can occur several time instants after the user releases the joypad. We detect such stops by searching for almost-identical consecutive network outputs. In our case, a stop is detected if $\| y_i - y_{i-1} \| < \tau_{stop}$, with $\tau_{stop} = 0.05$.
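The stop detection rule can be sketched as follows. The position of the angular base velocity inside the output vector (last element here) and the function name are assumptions of this sketch.

```python
import numpy as np

def zero_yaw_rate_on_stop(y, y_prev, yaw_rate_index=-1, tau_stop=0.05):
    """Zero the angular base velocity when consecutive outputs are almost identical."""
    y = np.array(y, dtype=float)   # copy, so the caller's array is left untouched
    stopped = bool(np.linalg.norm(y - np.asarray(y_prev, dtype=float)) < tau_stop)
    if stopped:
        y[yaw_rate_index] = 0.0
    return y, stopped

y_prev = np.zeros(101)
y_prev[-1] = 0.02                      # small residual in-place rotation
y_still = y_prev + 1e-3                # nearly identical output: a stop
y_moving = y_prev + 0.1                # clearly different output: still walking

y_fixed, stop_detected = zero_yaw_rate_on_stop(y_still, y_prev)
y_kept, moving_detected = zero_yaw_rate_on_stop(y_moving, y_prev)
```

In the first case the residual rotation is removed, while in the second the network output is passed through unchanged.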

D.2 Robustness Analysis
We evaluate the robustness of ADHERENT to a challenging range of step sizes and walking speeds compatible with the locomotion capabilities of iCub. We carry out our analysis both in simulation and on the real robot. We apply a slow-down factor to the generated trajectories in the range {1, 2, 3, 4}.
A footstep scaling factor in the range {0.2, 0.4, 0.6, 0.8, 1} is instead applied to the footstep positions. The analysis is repeated for forward, backward, and left-side walking. Concerning simulations, Fig. 4 illustrates the results for all parameter combinations, with the successful trials and failures represented by the green and the red areas, respectively. For each considered motion, most combinations are successful. No matter the slow-down factor, the maximum admissible footstep scaling for successful backward and left-side walking is 0.8 and 0.6, respectively. As regards real experiments, the solid green line in Fig. 4 connects the most challenging parameter combinations resulting in successful outcomes. We observe that higher speed can be traded off with larger step sizes.

D.3 Blending Coefficients Activation
We analyze how the experts specialize in different motions by plotting the profiles of the corresponding blending coefficients $\theta$ in Fig. 5. Note that $\theta$ activations show distinctive periodic patterns characterizing each motion type. For instance, in both the straight and steered forward walking phases, only $\theta_1$ (green) and $\theta_2$ (yellow) are active, and specialize in left and right swing motions, respectively. Moreover, $\theta_3$ (red) and $\theta_4$ (blue) become active during right- and left-side walking, respectively. The real-time evolution of expert activations for different motions is shown in the supplementary video.

D.4 Human-likeness Analysis Details
In order to evaluate the human likeness of the trajectories executed using ADHERENT, we compare them with trajectories exploiting a fixed postural for the upper body. For both cases, Fig. 6 shows reference and measured joint trajectories for a representative set of upper-body joints during forward walking. When using a fixed postural, the measured joint positions oscillate in proximity of the constant reference. With the ADHERENT postural, the measured joint positions closely track the references produced by the network, despite the lower-level action of the Whole Body QP Controller. This demonstrates that reference motions produced by generators trained on human data can actually be realized on the real robot. Note that the learning-driven motion of the representative joints discussed above considerably contributes to the improvement in terms of human likeness that can be seen in the supplementary video.

E Future Work
In this work, we presented an implementation of the ADHERENT architecture employing instantaneous controllers for trajectory control. Possible future work includes investigating, comparing, and integrating ADHERENT with more advanced control architectures (e.g., MPC- or RL-based).
From a machine-learning perspective, ADHERENT must be retrained from scratch whenever new motion skills need to be added. This could be tackled by integrating recent continual/lifelong learning methods in the architecture.
Finally, an extension of our work for the navigation of uneven ground could be pursued by including perceptual terrain features in the network input. Enhancing the ADHERENT architecture with perception could indeed represent another significant step towards the development of general and human-like trajectory generation.