On the Emergence of Whole-body Strategies from Humanoid Robot Push-recovery Learning


Diego Ferigo*,1,3, Raffaello Camoriano*,2, Paolo Maria Viceconte1,4, Daniele Calandriello2, Silvio Traversaro1, Lorenzo Rosasco2,5,6 and Daniele Pucci1

Abstract-Balancing and push-recovery are essential capabilities enabling humanoid robots to solve complex locomotion tasks. In this context, classical control systems tend to be based on simplified physical models and hard-coded strategies. Although successful in specific scenarios, this approach requires demanding tuning of parameters and switching logic between specifically-designed controllers for handling more general perturbations. We apply model-free Deep Reinforcement Learning for training a general and robust humanoid push-recovery policy in a simulation environment. Our method targets high-dimensional whole-body humanoid control and is validated on the iCub humanoid. Reward components incorporating expert knowledge on humanoid control enable fast learning of several robust behaviors by the same policy, spanning the entire body. We validate our method with extensive quantitative analyses in simulation, including out-of-sample tasks which demonstrate policy robustness and generalization, both key requirements towards real-world robot deployment.
Index Terms-Robotics, Humanoids, Reinforcement Learning, Whole-body Control

I. INTRODUCTION
Bipeds are creatures that move on two legs while maintaining static or dynamic equilibrium. Balancing is a key prerequisite for any kind of locomotion bipeds may achieve. Human evolution produced highly robust bipedal locomotion, providing enhanced environmental adaptability and fitness with respect to other species. Humanoid robots are actuated mechanisms sharing many structural similarities with the human body. In a world largely crafted by and for humans, they also need to balance for effective operation. The challenges posed by bipedal dynamics are manifold. Bipeds, compared to other morphologies, are inherently unstable: control actions need to account for a narrow support surface and a sparse mass distribution. Nonetheless, bipedal balancing and locomotion successfully established themselves in nature. Therefore, it is reasonable to expect comparable proprioceptive signals to be sufficient for the emergence of similar motor capabilities in robots.
A great variety of methods aiming to solve similar sequential decision-making problems has recently been proposed. Deep Reinforcement Learning (DRL) is among the most promising [1]. Complex locomotion behaviors can be synthesized by policies trained on sequential interactions with the environment [2]. However, this approach poses fundamental challenges when applied to robotics [3]. In particular, collecting the amount of example trajectories required by most state-of-the-art model-free DRL algorithms is unfeasible for current robots [4]. A common solution is to resort to synthetic data based on rigid-body dynamics, addressing the mismatch introduced by the sim-to-real gap in a subsequent stage [5], [6]. Nonetheless, learned behaviors often display unnatural characteristics, such as asymmetric gaits, abrupt motions of the body and limbs, or even unrealistic motions exploiting imperfections and glitches in the physical simulator of choice. These issues significantly limit generalization and transferability to real-world robots.
State-of-the-art methods for bipedal robot control [7] are rooted in control theory and optimal control. Control architectures are often organized as hierarchies composed of trajectory optimization [8], simplified model control, and whole-body quadratic programming [9], [10]. While such approaches have achieved considerable results both on simulated and real humanoid robots, they: i) rely on an accurate description of the robot dynamics; ii) require hand-crafted features for online execution [11]; iii) present challenges when simultaneously facing different tasks.
As concerns push recovery, switching between different strategies (e.g., ankle, hip, stepping, and momentum) is not trivial. Compared to previous results [12], this work offers the following main contributions:
• Demonstration of the emergence of robust momentum-based whole-body push-recovery strategies in addition to ankle, hip, and stepping ones;
• Design of reward components to guide learning towards steady-state balancing, with transient push-recovery strategies;
• Definition of a state space, inspired by floating-base dynamics, encoding sufficient information for solving the task with no prior knowledge about the desired trajectories.

II. RELATED WORK
A. Control-theoretic approaches

Humanoid locomotion control has traditionally been tackled by resorting to simplified models. In particular, the 3D Linear Inverted Pendulum (LIP) model is among the most widely employed ones [13]. Its simplified dynamics proved effective and efficient for trajectory generation in walking, balancing, and push-recovery methods. In the presence of limited perturbations, in-place recovery strategies regulating the Center of Pressure (CoP) [14] or the centroidal angular momentum [15] can be sufficient for recovery. These include ankle, hip, and foot-tilting strategies [16], [17]. An alternative method, modulating the Center of Mass (CoM) height, was recently proposed [18]. Stronger perturbations require the support surface to be enlarged or shifted to ensure that the CoP is kept enclosed in it [16]. A natural way to achieve this is by means of stepping strategies. To this end, push-recovery stepping controllers based on Zero-Moment Point (ZMP) [19] trajectory generation have been proposed [20], along with Model Predictive Control (MPC) methods controlling the ZMP while rejecting strong external disturbances [21]. Alternatively, footstep planning strategies based on the Capture Point (CP) [22], [23] have been employed for position-controlled [24], [25] and torque-controlled [9] humanoids. Control-theoretic methods significantly improved the state-of-the-art push-recovery performance of humanoids. Still, they present several limitations:
1) Controllers usually encode a single behavior. Being robust to a wide range of perturbations requires complex controller switching;
2) Robot- and task-specific tuning of the controllers and switching system is a costly trial-and-error procedure;
3) Simplified models and hard-coded strategies often constrain the attainable behaviors;
4) MPC-based methods are computationally expensive, hindering real-time deployment.

B. Deep Reinforcement Learning approaches
In recent years, DRL has been successfully applied to synthesize computationally efficient controllers for complex robotic tasks in a data-driven way, both in simulation and in the real world. Quadrupeds have drawn considerable attention in DRL locomotion research, also due to their relatively lower dimensionality and greater stability with respect to bipeds. Policies trained in simulation have been transferred to real robots via accurate system identification and domain randomization [26], [27], while the data-efficient Soft Actor-Critic algorithm has been shown to learn robust gait policies from a few trials on a real quadruped [28]. Remarkably, DRL can also train walking policies for non-humanoid bipedal robots [29], including real-world deployment without dynamics randomization [30].
Other works focus on learning locomotion policies for humanoids. This setting is more challenging, due to the complex and redundant body structure. The potential of DRL in this domain was first demonstrated on walking tasks in simulation [31]. Other methods improve the human-likeness of the behaviors by introducing motion imitation [32], [33]. Still, these methods target the benchmarking of model-free DRL for continuous control and the realistic animation of simplified characters, rather than applicability to real humanoid robots.
More recent work has been devoted to training push-recovery [12] and walking [34] controllers for accurate humanoid robot models using principles from robot control and transferable observation and reward designs. The latter approaches, although demonstrating diverse effective behaviors emerging from a single policy, control only the lower-body joints. Whole-body humanoid control with DRL remains an open problem, with the potential of learning high-dimensional locomotion policies that further improve humanoid capabilities to recover from external perturbations.

III. BACKGROUND

A. Notation
• W and B denote the world (inertial) frame and the base frame of the robot; R and L denote the frames of the right and left feet.
• Given two frames A and B, A[B] denotes a new frame with the origin of A and the orientation of B.
• G := G[W ] denotes the frame with origin on the robot's CoM and orientation of the world frame.
• n denotes the robot's Degrees of Freedom (DoFs).
• ${}^A p_B \in \mathbb{R}^3$ denotes the coordinates of point $B$ in frame $A$. Superscripts, e.g. ${}^A p^{xy}_B$, extract specific coordinates.
• Given two frames $A$ and $B$ and a point $C$, the rotation matrix ${}^A R_B \in SO(3)$ maps coordinates between frames as ${}^A p_C = {}^A R_B \, {}^B p_C$.
• Given ${}^A R_B$, the triplet ${}^A(\psi, \rho, \phi)_B$ denotes the Euler angles of the z-x-y sequence of intrinsic rotations.
• Given $w, u \in \mathbb{R}^3$, we define $w^\wedge = W \in \mathbb{R}^{3 \times 3}$ as the skew-symmetric matrix such that $w^\wedge u = w \times u$, and $W^\vee = w$ its inverse.
• Given ${}^A p_B$ and three frames $A$, $B$, and $C$, the velocity of point $B$ w.r.t. the origin of frame $A$, expressed in frame $C$, is denoted ${}^C \dot p_{A,B}$.
• $s, \dot s \in \mathbb{R}^n$ denote the joint positions and velocities.
• $q := ({}^W p_B, {}^W R_B, s)$ and $\nu := ({}^B v_{W,B}, \dot s)$ denote the floating-base configuration and its velocity, where the base velocity is represented as a body-fixed velocity [15].
• ${}^A f_F \in \mathbb{R}^6$ denotes the 6D force acting on frame $F$ expressed in frame $A$.

In the above definitions, the world frame $W$ is implicitly assumed when $A$ is omitted.

B. Reinforcement Learning (RL)
We formulate balancing and push recovery as a discrete-time RL problem, modelled as an infinite-horizon Markov Decision Process (MDP) with discounted expected return [35], [36]. In this setting, an agent interacts with an environment following a control policy. At each time step $t$, the agent collects data from the environment in the form of a state $x_t$. The control policy $\pi(a_t|x_t)$ selects an action $a_t$ whose application results in a new state $x_{t+1}$ and a scalar reward $r_t = r(x_t, a_t, x_{t+1})$ encoding the immediate value of the experienced transition towards solving the target task. The interaction generates trajectories $\tau = \{(x_0, a_0, r_0), (x_1, a_1, r_1), \dots\}$. The agent's goal is to learn a policy $\pi$ maximizing its expected return

$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right],$

where $\gamma \in (0, 1)$ is the discount factor.

C. Policy Gradient (PG) methods
A popular class of algorithms addressing expected return maximization for continuous-control tasks is provided by model-free PG methods [37]. Given a parameterized policy $\pi_\theta(a_t|x_t)$, PG methods perform direct gradient-based optimization of $\theta$ over the scalar performance measure

$L^{PG}(\theta) = \hat{\mathbb{E}}_t \left[ \log \pi_\theta(a_t|x_t) \, \hat{A}_t \right],$

where $\hat{\mathbb{E}}_t$ denotes the empirical mean over a finite batch of trajectories. The advantage function $\hat{A}_t = R_t - V(x_t)$ evaluates the advantage of taking action $a_t$ at state $x_t$, defined as the difference between the actual return $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$ collected from $x_t$ in the sampled trajectory and the current estimate of the value function $V(x_t)$. Using on-policy samples only, at each iteration of the optimization the gradient of the expected return is estimated by differentiating $L^{PG}(\theta)$ and used to update $\theta$. Among the available PG algorithms, we employ Proximal Policy Optimization (PPO) [31], which tackles the instability characterizing the training process in the presence of large policy updates by maximizing the clipped objective

$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( \rho_t(\theta) \hat{A}_t,\ \mathrm{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t|x_t)}{\pi_{\theta_{old}}(a_t|x_t)},$

where $\theta_{old}$ are the pre-update policy parameters and $\epsilon$ is the hyperparameter used to clip the policy update. Maximizing $L^{CLIP}(\theta)$ keeps new policies close to old ones while optimizing the objective.
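For concreteness, the clipped surrogate objective can be written compactly in code. The following PyTorch sketch is ours and only illustrative; the actual experiments rely on an existing PPO implementation (see Sec. V):

```python
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Negated PPO clipped surrogate objective L^CLIP (to be minimized).

    log_probs_new: log pi_theta(a_t | x_t) under the current policy
    log_probs_old: log pi_theta_old(a_t | x_t), detached from the graph
    advantages:    advantage estimates A_t of the sampled transitions
    epsilon:       clipping hyperparameter
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # rho_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Maximizing the objective is equivalent to minimizing its negation
    return -torch.min(unclipped, clipped).mean()
```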

IV. ENVIRONMENT
The environment is structured as a continuous control task with early termination conditions. Its dynamics runs in the Ignition Gazebo simulator embedded into the gym-ignition framework [38], compatible with OpenAI Gym [39]. The enabled physics engine is DART [40]. We selected iDynTree [14] for calculating rigid-body dynamics quantities, using an accurate model of the robot's kinematics and dynamics represented in the following form [15]:

$M(q)\dot{\nu} + h(q, \nu) = B\tau + \sum_{k=1}^{n_c} J_k^\top f_k,$

where $M(q)$ is the mass matrix, $h(q, \nu)$ the Coriolis and gravity term, $B$ a selector matrix, $\tau$ the joint torques, $n_c$ the number of contacts, and $J_k$ and $f_k$ respectively the Jacobian and the 6D force of the $k$-th contact.
The environment receives actions and provides observations and rewards at 25 Hz. The physics and the low-level PIDs run at 1000 Hz. During training, some properties of the environment are randomized (see Sec. IV-D).

A. Action
The separation between agent and environment is defined by the action selection. In our nested structure, the policy generates an action $a \in \mathbb{R}^{23}$ composed of the reference velocities for a large subset of the robot joints (controlled joints), which are then integrated and fed to the corresponding PID position controllers. The controlled joints belong to the legs, torso, and arms. Hands, wrists, and neck, which arguably play a minor role in balancing, are locked in their natural positions. The policy computes target joint velocities bounded in [−180, 180] deg/s at 25 Hz. Commanding joint velocities rather than joint positions prevents consecutive target joint positions from being too distant from each other. Especially at training onset, such jumpy references could not be tracked by the PID controllers, affecting the discovery of the relation between $x_t$ and $x_{t+1}$. The integration process, instead, enables the use of a policy that generates discontinuous actions while maintaining continuous PID inputs, with no need for additional filters.
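As an illustration, a minimal sketch of this integration step follows; the class name and structure are ours, not the actual implementation:

```python
import numpy as np

class VelocityActionIntegrator:
    """Integrates the policy's joint velocity actions into continuous
    position references for the low-level PID controllers."""

    def __init__(self, s0: np.ndarray, dt: float = 1.0 / 25.0,
                 max_velocity: float = np.deg2rad(180.0)):
        self.references = s0.astype(float).copy()  # current position references
        self.dt = dt                               # environment step (25 Hz)
        self.max_velocity = max_velocity           # action bound [rad/s]

    def step(self, action: np.ndarray) -> np.ndarray:
        # Clip the (possibly discontinuous) policy output, then integrate it
        velocity = np.clip(action, -self.max_velocity, self.max_velocity)
        self.references = self.references + velocity * self.dt
        return self.references  # tracked by the 1000 Hz PID controllers
```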

B. State
The state of the MDP contains information about the robot's kinematics and dynamics, since no perception is involved. It is defined as the tuple $x := (q, \nu, f_L, f_R) \in \mathcal{X}$. The observation, computed from the state $x$, is defined as a tuple whose components are listed in Table I.
Although the agent is trained in simulation, we design it for real-time execution on actual robots. We carefully select state components that can be either measured or estimated onboard [14]. To promote policy transfer, we avoid measurements from noisy sensors and values that cannot be estimated with sufficient accuracy. In fact, any significant mismatch between simulated and real data would hinder transfer, increasing the reliance on policy robustness. We select minimal state components encoding the environment dynamics without affecting learning performance.

C. Reward
The reward is a weighted sum of terms that can be categorized as regularizers, steady-state, and transient. Regularizers are terms often used in optimal control for the minimization of the control action and joint torques. Steady-state components help to obtain the balancing behavior in the absence of external perturbations, and are active only in Double Support (DS). Finally, the transient components favor the emergence of whole-body push-recovery strategies.
The total reward is a weighted sum of scalar components, $r = \sum_i w_i r_i$, where $r_i$ is the $i$-th reward term and $w_i$ its weight. In order to provide a similar scale for each of them, thereby improving the interpretability of the total reward, we process the scalar and vector components with a Radial Basis Function (RBF) kernel [41], whose bandwidth is determined by a cutoff parameter calculated from the desired sensitivity. Appendix A provides a more detailed description of the kernel. Table II includes the weights of each reward component and the kernel parameters, when active.
Regularizers: Joint torques $r_\tau$. Torques applied by the PID controllers are penalized. The environment runs at 25 Hz and the low-level controllers at 1000 Hz; therefore, for each of the 23 joints, 40 torques are actuated between two consecutive environment steps. We collect all these torques in a single vector $\tau_{step} \in \mathbb{R}^{23 \cdot 40}$ and average its elements. Joint velocities $r_{\dot s}$. Our control scheme ensures that joint position references are continuous. However, PPO explores the action space of joint velocities by sampling from its stochastic policy. To promote smoother trajectories, we penalize the norm of the latest action, which can be seen as a minimization of the control effort.
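A minimal sketch of the torque regularizer follows, assuming the RBF kernel of Appendix A; averaging the absolute torques and the function names are our assumptions:

```python
import numpy as np

def rbf_kernel(distance: float, x_c: float, eps: float = 0.01) -> float:
    """RBF reward kernel (Appendix A): 1 at the target, eps at distance x_c."""
    gamma = -np.log(eps) / x_c**2
    return float(np.exp(-gamma * distance**2))

def torque_reward(tau_step: np.ndarray, x_c: float) -> float:
    """Joint torques regularizer r_tau.

    tau_step stacks the 40 torque vectors (one per 1000 Hz PID step) applied
    to the 23 joints between two consecutive 25 Hz environment steps.
    """
    mean_torque = np.abs(tau_step).mean()  # taking |tau| is our assumption
    return rbf_kernel(mean_torque, x_c)    # maximal when torques vanish
```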
Steady-state: Postural $r_s$. Whole-body humanoid control schemes apply different weights to various control objectives. The postural is notably one of the most used [42], although it is usually assigned a low priority. A postural reward term helps to reach a target posture during balancing instead of relying on local minima found by the learning process. This component penalizes the mismatch between the sampled joint configuration and the reference configuration shown in Figure 4a. CoM projection $r_G$. Statically balanced robots, in order to maintain stability, keep the CoM within the Support Polygon (SP), defined as the Convex Hull (CH) of their contact points with the ground. With the same aim, we introduce a Boolean component rewarding the agent if its CoM ground projection is within the SP induced by the feet. For additional safety, we shrink the SP by a 2.5 cm margin all along its perimeter. Horizontal CoM velocity $r^{xy}_v$. We define a target horizontal velocity for the CoM as a vector pointing from the CoM projection to the center of the SP, $p^{xy}_{hull}$. In order to promote faster motions when the CoM is relatively close to the ground, the magnitude of the target is amplified by the factor $\omega_0 = \sqrt{g / p^z_G}$ derived from the LIP model [13], where $g$ is the standard gravity. This component encourages the motion of the CoM projection towards the center of the SP.
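The following sketch illustrates both the shrunk support-polygon check and the LIP-scaled velocity target; the use of shapely for the geometry and all names are our choices:

```python
import numpy as np
from shapely.geometry import Point, Polygon

def com_in_shrunk_sp(com_xy, contact_points_xy, margin: float = 0.025) -> bool:
    """Boolean CoM-projection reward r_G: true if the CoM ground projection
    lies inside the support polygon shrunk by a 2.5 cm safety margin."""
    shrunk_sp = Polygon(contact_points_xy).convex_hull.buffer(-margin)
    return shrunk_sp.contains(Point(com_xy))

def com_velocity_target(com_xy, com_z, sp_center_xy, g: float = 9.81):
    """Target horizontal CoM velocity pointing to the SP center, amplified by
    the LIP natural frequency omega_0 = sqrt(g / z_CoM): the closer the CoM
    is to the ground, the faster the commanded corrective motion."""
    omega_0 = np.sqrt(g / com_z)
    return omega_0 * (np.asarray(sp_center_xy) - np.asarray(com_xy))
```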
Transient: Feet in contact $r_c$. The feet are encouraged to stay on the ground. In order to promote steps and increase movement freedom, we add a Boolean term marking whether any foot is in contact with the ground. Links in contact $r_l$. If any link excluding the feet is in contact with the ground, the episode terminates with a negative reward of −10 for the terminal state. Whole-body momentum $r_h$. Our policy also controls joints belonging to the torso and the arms. The momentum generated by the upper body can, therefore, be exploited for balancing and push recovery. This term minimizes the sum of the norms of the linear and angular components of the robot's total centroidal momentum ${}_G h$ [15]. Feet orientation. In early experiments, the policy converged towards feet-tipping behaviors, i.e. the feet were not in full contact with the ground. Since the terrain is flat by assumption, we discourage tipping by promoting a feet orientation with the soles parallel to the ground. If ${}^W R_{foot} = [r^{(x)}, r^{(y)}, r^{(z)}]$ is the rotation between the foot frame and the world, this term promotes the alignment of its third column $r^{(z)}$ with the vertical axis of the world frame.
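A sketch of the feet-orientation term under these definitions follows; the precise distance fed to the RBF kernel is our assumption:

```python
import numpy as np

def foot_orientation_reward(W_R_foot: np.ndarray, x_c: float = 0.5,
                            eps: float = 0.01) -> float:
    """Promote soles parallel to the flat ground by aligning the third column
    of the foot rotation matrix with the world vertical axis."""
    r_z = W_R_foot[:, 2]                                   # sole normal (world)
    misalignment = 1.0 - r_z @ np.array([0.0, 0.0, 1.0])   # 0 when aligned
    gamma = -np.log(eps) / x_c**2                          # RBF (Appendix A)
    return float(np.exp(-gamma * misalignment**2))
```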

D. Other specifications
Initial State Distribution: The initial state distribution $\rho(x_0)$ defines the state in which the agent begins each episode. Sampling the initial state from a distribution with small variance, particularly regarding joint positions and velocities, positively affects exploration without degrading the learning performance. At the beginning of each episode, for each joint $j$ we sample its position $s_{j,0}$ from $\mathcal{N}(\mu = s_0, \sigma = 10\,\mathrm{deg})$, where $s_0$ represents the fixed initial reference, and its velocity $\dot s_{j,0}$ from $\mathcal{N}(\mu = 0, \sigma = 90\,\mathrm{deg/s})$. As a result, the robot may or may not start with its feet in contact with the ground, which encourages the agent to learn how to land and deal with impacts.
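A minimal sketch of this sampling step, with illustrative names:

```python
import numpy as np

def sample_initial_joint_state(s_0_deg: np.ndarray, rng: np.random.Generator):
    """Sample initial joint positions and velocities around the fixed
    reference configuration s_0 (given here in degrees)."""
    positions = rng.normal(loc=s_0_deg, scale=10.0)                   # 10 deg
    velocities = rng.normal(loc=0.0, scale=90.0, size=s_0_deg.shape)  # 90 deg/s
    return np.deg2rad(positions), np.deg2rad(velocities)
```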
External Perturbations: In order to promote exploration beyond the initial state distribution and favor the emergence of push-recovery strategies, we apply external perturbations in the form of a 3D force to the base frame of the robot. The applied force has a fixed magnitude of 200 N and is applied for 200 ms. Considering the weight of the iCub, approximately 33 kg, the normalized impulse amounts to 1.21 N·s/kg. We sample the direction of the applied force from a uniform spherical distribution. The application instants are randomized as well, so that a force is applied on average every 5 simulated seconds.
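Sampling a direction uniformly on the unit sphere can be done, for instance, by normalizing a Gaussian sample; the sketch below is illustrative:

```python
import numpy as np

def sample_push(rng: np.random.Generator, magnitude: float = 200.0):
    """A 3D force of fixed magnitude with uniformly distributed direction,
    applied to the base frame for 200 ms during training."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)  # uniform on the unit sphere
    return magnitude * direction
```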
Early Termination: The balancing and push-recovery objectives for a continuous-control task are characterized by an infinite-horizon discounted MDP. During training, however, episodes should stop as soon as the state reaches a subspace from which either it is not possible to recover or it is uninteresting to explore, following an early-termination criterion. The portion of the state space of interest for our work is where the robot is standing, or almost standing, on its feet; therefore, we terminate episodes as soon as the robot falls to the ground. We detect the falling condition when any link but the feet touches the ground plane.
Domain Randomization: During the training process, at the beginning of each new episode, the environment performs a domain randomization step. The masses of the robot's links are sampled from a normal distribution N (µ = m 0 , σ = 0.2m 0 ), where m 0 is the nominal mass of the link defined in the model description. To avoid making assumptions on the material properties of the feet and the ground, we randomize the Coulomb friction µ c of the feet by sampling it from U(0.5, 3). Finally, since the simulation does not include the real dynamics of the actuators, to increase robustness we apply a delay to the position references that are fed to the PID controllers, sampled from U(0, 20) ms.
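A compact sketch of the per-episode randomization step, with illustrative names:

```python
import numpy as np

def randomize_domain(nominal_masses: np.ndarray, rng: np.random.Generator):
    """Per-episode domain randomization: link masses, feet friction, PID delay."""
    masses = rng.normal(loc=nominal_masses, scale=0.2 * nominal_masses)
    friction = rng.uniform(0.5, 3.0)       # Coulomb friction of the feet
    delay_ms = rng.uniform(0.0, 20.0)      # delay on the PID position references
    return masses, friction, delay_ms
```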

V. AGENT
The agent receives the observation $o$ from the environment and returns the action $a$ defining the reference velocities of the controlled joints. The parameters of the agent are reported in Table III and further explained below. Learning Algorithm: We select PPO as the learning algorithm, in the variant featuring both the classic gradient clipping and the minimization of the KL divergence.
Policy and Value Function: The stochastic policy $\pi(a|o)$ selects which action to take given an observation. The value function $V(o_t)$, instead, estimates the average return obtained by starting from the observation $o_t$ and then following the policy thereafter. We represent the policy and the value function with two separate neural networks, each composed of two fully connected layers with 512 and 128 units, followed by a linear output layer. The hidden units use a ReLU activation function. The networks do not share any layer.
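A sketch of the described architecture in PyTorch (illustrative; the actual networks are instantiated by the RL framework):

```python
import torch.nn as nn

def make_network(obs_dim: int, out_dim: int) -> nn.Sequential:
    """Two fully connected ReLU layers (512 and 128 units) and a linear head.
    Separate, non-shared instances serve as policy and value networks."""
    return nn.Sequential(
        nn.Linear(obs_dim, 512), nn.ReLU(),
        nn.Linear(512, 128), nn.ReLU(),
        nn.Linear(128, out_dim),
    )

# policy_net = make_network(obs_dim, 2 * act_dim)  # e.g. Gaussian mean and std
# value_net = make_network(obs_dim, 1)             # scalar value estimate
```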
Distributed Setup: The chosen PPO algorithm scales gracefully to a setup where batch samples are collected from multiple workers in parallel. Our training setup consists of 32 workers, each with an independent copy of the environment, and a trainer. After collecting a batch of 10000 on-policy transitions, we train the neural networks with stochastic gradient descent. The optimizer uses minibatches containing 512 samples and performs 32 epochs per batch. The learning rate is λ = 0.0001. Each trial is stopped once it reaches 20 M agent steps, roughly equivalent to 7 days of experience on a real robot. Worker nodes run only on CPU resources, while the trainer has access to the GPU for accelerating the optimization process. We use the RLlib [43] framework together with OpenAI Gym for distributed training.

VI. RESULTS

A. Training performance

Fig. 2 reports the learning curves of the average reward and episode duration over 11 independent agent training runs. The average reward across trials exhibits consistent growth and low variance (Fig. 2, left). We have also observed increasing values for all individual reward components during training. Episode duration improves as well across trials and displays low variance (see Fig. 2, right), approaching the maximum episode length more frequently as training progresses.

B. Emerging behaviors
Controlling the upper body enables rich recovery behaviors that involve the control of the total momentum of the kinematic structure. We succeed in triggering such behaviors by applying external forces during policy training. To make force profiles more realistic, instead of applying constant forces for a fixed interval as during training, we throw high-speed objects towards the balanced robot. Figure 4b shows two characteristic sequences. A larger variety of push-recovery strategies is displayed in the supplementary video: https://dic-iit.github.io/emergence-push-recovery-icub/.

C. Deterministic planar forces
We evaluate the push-recovery performance under horizontal forces. Forces are applied for 0.2 s, 3 s after the simulation start, when the robot is stably standing still and front-facing. An episode is successful if the robot is still standing after 7 s. In Fig. 3a, success rates for forces pointing in 12 directions are reported. Magnitudes increase from 50 N to 700 N at 25 N intervals. 5 repetitions are performed for each magnitude and direction, randomizing the initial joint configuration by adding zero-mean Gaussian noise (σ = 2 deg). Magnitudes within the training range (0-200 N) are counteracted successfully. Remarkably, the policy is also robust to out-of-sample forces in all directions in the 200-300 N range, and up to 400 N in some directions. Moreover, it successfully recovers from pushes in the training range (0-200 N) even with an out-of-sample test friction coefficient µc = 0.2 (Fig. 3b).
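For reference, the evaluation grid can be summarized by the following sketch, with the simulation and success-check logic only outlined:

```python
import numpy as np

directions = [2 * np.pi * k / 12 for k in range(12)]  # 12 planar directions
magnitudes = np.arange(50.0, 701.0, 25.0)             # 50 N to 700 N, 25 N steps

rng = np.random.default_rng(seed=0)
for theta in directions:
    for magnitude in magnitudes:
        force = magnitude * np.array([np.cos(theta), np.sin(theta), 0.0])
        for _ in range(5):  # 5 repetitions per (direction, magnitude) pair
            joint_noise_deg = rng.normal(scale=2.0, size=23)  # sigma = 2 deg
            # apply `force` for 0.2 s starting at t = 3 s;
            # success if the robot is still standing at t = 7 s
```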

D. Random spherical forces on the base links
We evaluate policy robustness in challenging scenarios involving sequences of random forces with different combinations of magnitude and duration. Forces are applied to the base in a random direction more frequently than during training, on average every 3 s. For each combination, 50 reproducible episodes with different seed initializations and no domain randomization are executed. Episodes terminate if the robot falls or after 60 s, averaging 20 force applications per full episode. Our evaluation metric is the number of consecutive forces endured by the robot. Fig. 5 reports aggregate results for each combination of magnitude and duration. Regardless of their magnitude, forces lasting 0.1 s are successfully counteracted. As expected, performance decreases with growing magnitude and duration. Nevertheless, the agent is able to withstand repeated applications of out-of-sample forces. For instance, on average it withstands 9 consecutive applications of 300 N forces lasting 0.2 s each.

E. Random spherical forces on the chest and elbow links
We also evaluate the robustness of the learned policy to previously unseen forces applied to other links. Fig. 5 shows the results obtained on the chest and elbow links. As expected, forces applied to links that are far from the CoM turn out to be more challenging. Nevertheless, the policy withstands a good number of them and generalizes well. For instance, it is on average able to recover from 10 consecutive 200 N, 0.2 s forces on the elbow link, as opposed to an average of 17 for the base link. The average number of consecutive counterbalanced forces with the same magnitude and duration decreases to 5 for the chest link. Notice that the randomness of the interval between two subsequent force applications sometimes leads to very challenging scenarios in which multiple forces are applied within a very short time span.

VII. DISCUSSION
Learning efficiency: The overall experience for a single policy training lasts approximately 7 simulated days. As for other continuous control tasks, model-free PG methods lack sample efficiency. There is plenty of room for robot learning research to bridge this efficiency gap. Indeed, floating-base robots such as iCub can be modeled quite accurately with rigid-body dynamics. Most robots used in research are provided with a dynamic model accurate enough to be exploited as a powerful prior. The community has recently proposed interesting model-based algorithms [44] with the potential to improve efficiency and leverage decades of robotics research.
Low-level control: Low-level position control is widely adopted in other similar works. PID controllers have the advantage of being independent of each other and requiring single-joint signals. However, besides being difficult to tune, they trade off tracking accuracy against compliance. A stiff robot, in the presence of strong perturbations, is less robust: even if the planner is whole-body, the low-level control is not. Whole-body and intrinsically more compliant low-level controllers could be beneficial, although they often operate on the entire underactuated floating-base system, and properly handling the base references from the policy's point of view is still an uncharted domain.

Natural behavior and sim-to-real: The emerged push-recovery strategies are not as natural as human ones. The policy tends to prefer small jumps over full steps, probably due to two factors: the stiffness introduced by the low-level PIDs, and the difficulty of accurate contact modeling. As concerns low-level control, actuator dynamics plays a vital role. Our simulations introduce variable delays but do not saturate joint torques, and their minimization in the reward does not prevent occasional high torque spikes synthesized by the PIDs. The integration of more realistic actuator models will be explored in future work. Regarding contacts, modeling differences between physics engines notably make policies hard to transfer to different engines or to the real world. The simulator we adopt, Ignition Gazebo, will soon provide a transparent physics engine switch, enabling randomization of the entire engine beyond the common physics parameters.

VIII. CONCLUSIONS
We present a DRL-based control architecture capable of learning whole-body balancing and push recovery for simulated humanoids. We promote exploration by applying random forces to the kinematic structure, leading to the emergence of a variety of push-recovery behaviors. Compared to previous works, our policy controls most of the robot's joints, and we show that this contributes to extending the space of recovery motions to whole-body strategies. We have shown our architecture controlling 23 DoFs of the iCub robot, demonstrating that the learned policy can withstand repeated applications of strong external pushes.
Our approach presents several limitations. The PID controllers, while providing simple low-level control, introduce a stiffness that can prevent natural motions and a joint dynamics that differs from the real platform. The learning efficiency of model-free algorithms is low, requiring days of simulation for a complex behavior to emerge. Finally, relying only on state-space exploration for finding the expected behaviors requires a carefully designed reward function, which might demand significant effort. These limitations could be mitigated by introducing prior knowledge in the training scheme, such as model-based whole-body controllers at the low level, more accurate actuator models in simulation, and model-based reinforcement learning. We plan to explore some of these directions in future work, with the aim of bringing our policies to the real robot.

APPENDIX RBF REWARD KERNEL
Radial Basis Function (RBF) kernels are widely employed functions in machine learning, defined as

$K(x, x^*) = e^{-\tilde{\gamma} \|x - x^*\|^2} \in (0, 1],$

where $\tilde{\gamma}$ is the kernel bandwidth hyperparameter. The RBF kernel measures the similarity between input vectors. This can be useful for defining scaled reward components. In particular, if $x$ is the current measurement and $x^*$ is the target, the kernel provides a normalized estimate of their similarity. $\tilde{\gamma}$ can be used to tune the bandwidth of the kernel, i.e. its sensitivity.
In particular, we use $\tilde{\gamma}$ to select the threshold from which the kernel tails begin to grow. Introducing the pair $(x_c, \epsilon)$, with $x_c, \epsilon \in \mathbb{R}^+$ and $\epsilon \ll 1$, we can parameterize $\tilde{\gamma} = -\ln(\epsilon) / x_c^2$. This formulation results in the following properties: 1) $K(x^*, x^*) = 1$, i.e. when the measurement reaches the target, the kernel outputs 1; 2) Given a measurement $x_m$ such that $\|x_m - x^*\| = x_c$, the kernel outputs $K(x_m, x^*) = \epsilon$. In practice, $\epsilon$ can be kept constant for all reward components, and the sensitivity of each individual component is tuned by adjusting $x_c$. We refer to $x_c$ as the cutoff value of the kernel, since any distance in the input space with norm larger than $x_c$ yields output values smaller than $\epsilon$. This formulation eases the composition of the total reward $r_t$ when reward components are calculated from measurements of different dimensionalities and scales. In fact, once the sensitivities have been properly tuned for each component, the components can simply be weighted as $r_t = \sum_i w_i K(x^{(i)}_t, x^{(i)*})$, where $x^{(i)}_t$ is the $i$-th measurement sampled at time $t$, and $w_i \in \mathbb{R}$ the weight corresponding to the $i$-th reward component.
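A minimal implementation of this kernel parameterization, with the two properties above checked numerically:

```python
import numpy as np

def rbf_kernel(x: np.ndarray, x_star: np.ndarray,
               x_c: float, eps: float = 0.01) -> float:
    """RBF reward kernel with cutoff parameterization.

    K(x*, x*) = 1, and K(x_m, x*) = eps whenever ||x_m - x*|| = x_c.
    """
    gamma = -np.log(eps) / x_c**2
    return float(np.exp(-gamma * np.linalg.norm(x - x_star)**2))

# Property 1: the kernel outputs 1 at the target
assert np.isclose(rbf_kernel(np.zeros(3), np.zeros(3), x_c=1.0), 1.0)
# Property 2: the kernel outputs eps at distance x_c from the target
assert np.isclose(rbf_kernel(np.array([1.0, 0.0, 0.0]), np.zeros(3), x_c=1.0), 0.01)
```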