Reaching Through Latent Space: From Joint Statistics to Path Planning in Manipulation

We present a novel approach to path planning for robotic manipulators, in which paths are produced via iterative optimisation in the latent space of a generative model of robot poses. Constraints are incorporated through the use of constraint satisfaction classifiers operating on the same space. Optimisation leverages gradients through our learned models that provide a simple way to combine goal reaching objectives with constraint satisfaction, even in the presence of otherwise non-differentiable constraints. Our models are trained in a task-agnostic manner on randomly sampled robot poses. In baseline comparisons against a number of widely used planners, we achieve commensurate performance in terms of task success, planning time and path length, performing successful path planning with obstacle avoidance on a real 7-DoF robot arm.

Abstract-We present a novel approach to path planning for robotic manipulators, in which paths are produced via iterative optimisation in the latent space of a generative model of robot poses. Constraints are incorporated through the use of constraint satisfaction classifiers operating on the same space. Optimisation leverages gradients through our learned models that provide a simple way to combine goal reaching objectives with constraint satisfaction, even in the presence of otherwise non-differentiable constraints. Our models are trained in a task-agnostic manner on randomly sampled robot poses. In baseline comparisons against a number of widely used planners, we achieve commensurate performance in terms of task success, planning time and path length, performing successful path planning with obstacle avoidance on a real 7-DoF robot arm.
Index Terms-Constrained Motion Planning, Representation Learning, Deep Learning in Grasping and Manipulation, Optimization and Optimal Control

I. INTRODUCTION
P ATH planning is a cornerstone of robotics. For a robotic manipulator, this generally consists of producing a sequence of joint states the robot needs to follow in order to move from a start to a goal configuration. This requires that the poses along the sequence are kinematically feasible while at the same time avoiding unwanted contact either by the manipulator with itself or with potential objects in the robot's workspace. Due to its importance, path planning is a richly explored area in robotics (e.g. [1]- [6]). However, traditional approaches are often marred by a number of issues. As the state-space dimensionality increases and constraints become more constrictive, the decreasing efficiency of traditional planning methods makes reactive behaviour computationally challenging. While existing sampling and optimisation-based approaches to the planning problem can find solutions, they scale super-linearly with a robot's degrees of freedom, and those that have optimality guarantees on resulting paths are guaranteed to achieve this only asymptotically, after infinite time [3]. Increasing system and task complexity also requires consideration of multiple objectives (e.g. performing a certain task while adhering to pose constraints). Yet, enforcing constraints on the planned motion can be difficult. Traditional optimisation-based planners can struggle to incorporate constraints that cannot be expressed directly in joint space. Sampling-based planners, on the other hand, struggle to find solutions in scenarios where constraints render only a small volume of configuration space feasible or where narrow passages exist [7].
The advent of deep learning has shown that learningbased approaches can offer some relief in overcoming robotic planning and control challenges. While a considerable body of work examines the direct learning of control policies, attempts have been made to apply deep learning to robotic path planning (e.g. [8]). Learnt heuristics and neural network collision detectors have been used as drop-in replacements to stages of traditional methods (e.g. [9]- [11]). A number of Once trained, gradients through the VAE decoder and collision predictor enable optimisation in the latent space to bring the decoded end-effector position closer to the target position. Performing this optimisation iteratively with a learning rate produces a series of latent values {z i } T i=1 that describe a jointspace path to the target that satisfies the collision constraint.
works explore the use of structured latent spaces to effect planning and control (e.g. [12]- [15]). However, existing works typically require training for a particular task on carefully curated data. In contrast, applications of variational autoencoders (VAEs) in the space of affordance-learning [16] and quadruped locomotion [17] have highlighted the potential of viewing planning as run-time optimisation in pre-trained statistical models of state-space to achieve feasible spatial paths under environmental constraints.
Inspired by [16] and [17], in this work we explore an alternative, entirely data-driven approach to both joint-space planning and constraint satisfaction in a robot manipulation setting. In particular, our approach leverages iterative, gradient-based optimisation to produce a sequence of joint configurations by traversing the latent space of a VAE. Training data for this model is trivially obtained as it need not be in any way taskoriented but can come from random motor-babbling on a real platform, or simply sampling valid states in simulation. In addition, performance predictors operating on the latent space and potentially other observational data (for example, the positions of obstacles) are trained in a supervised fashion to output probabilities of certain performance-related metrics, such as whether the manipulator is in collision with an obstacle. These networks are frozen after training and are subsequently used in this gradient-based optimisation approach to planning through activation maximisation [18], which is the process of using backpropagation through network weights to find a permutation to the network inputs that would act to bring about a desired change in the network's outputs.
Taking this view of path planning overcomes many of the obstacles that make robotic path planning a non-trivial task: (a) as our plans consist of states drawn from a deep generative model fit to a large dataset of feasible robot poses, and are thus approximately drawn from this data distribution, there is a very high likelihood that every state in the planned path is valid in terms of self-collisions and kinematic feasibility; (b) by modelling joint states and end-effector positions jointly, we avoid the need to explicitly calculate inverse or forward kinematics at any stage during planning, even when the goal configuration is given in R 3 Cartesian space; (c) by leveraging activation maximisation (AM) via gradients through performance predictors, we can enforce arbitrarily complex, potentially non-differentiable constraints that would be hard to express in direct optimisation-based planners, and might be intractably restrictive for sampling-based planners; (d) by taking a pre-trained, data-driven approach to collision avoidance, we do not need any geometric analysis or accurate 3D models at planning time, nor indeed do we need to perform any kind of explicit collision checking, which is generally the main computational bottleneck in sampling-based planners [19].
In addition to the advantages in path planning that this method offers and above and beyond related works, we introduce an additional loss on the run-time AM optimisation process which encourages the planning process to remain in areas of high likelihood according to our prior belief under the generative model. In our experiments we find that this contribution is critical in enabling successful planning that stays in feasible state-space.

II. RELATED WORK
Successful path planning for a robotic manipulator generally consists of producing a kinematic sequence of joint states through which the robot can actuate in order to move from a start to a goal configuration, while moving only through viable configurations. While goal positions may be specified in the same joint space as the plan, in a manipulator context it is more common for the goal position to be specified in R 3 Cartesian end-effector space, or R 6 if the SO(3) rotation group is included as well. Viable configurations are the intersection of feasible states for the robot -i.e. those that are within joint limits and do not result in self-collision -and collisionfree states with respect to obstacles in the environment. The intersection of these defines the configuration space for the robot in a given environment.
As analytically describing the valid configuration space is generally intractable, sampling-based methods for planning provide the ability to quickly find connected paths through valid space, by checking for the validity of individual sampled nodes. Variants of the Probabilistic Roadmap (PRM) and Rapidly-exploring Random Tree (RRT) sampling-based algorithms are widely used [1], [2], and provably asymptotically optimal variants exists in PRM*, RRT* [3]. These methods suffer from a trade-off between runtime and optimality: while often relatively quick to find a feasible collision-free path, they tend to employ a second, slower, stage of path optimisation to shorten the path through the application of heuristics. In the presence of restrictive constraints, both sampling-and optimisation-based planners can be very slow to find an initial feasible path [7].
Optimisation-based planners start from an initial path or trajectory guess and then refine it until certain costs, such as path length, are minimised, and differentiable constraints satisfied. Techniques such as CHOMP [4] bridge the gap between planning and optimal control by enabling planning over path and dynamics. TrajOpt [20] differs from CHOMP in the numerical optimisation method used and the method of collision checking. The Gaussian Process Motion Planner [21] leverages Gaussian process models to represent trajectories and updates them through interpolation. Stochastic Trajectory Optimization for Motion Planning (STOMP) [5] is notable in this context as it is able to produce plans while optimising for non-differentiable constraints , which our work enables with gradients through trained performance predictors.
Planning with Deep Neural Networks. A recent line of work has explored using deep networks to augment some or all components of conventional planning methods. Qureshi et al. [10], [11] train a pair of neural networks to embed point-cloud environment representations and perform single timestep planning. Iterative application of the planning network produces a path plan. Ichter and Pavone [9] learn an embedding space of observations, and use RRT in this space with a learnt collision checker to produce path plans, but need data to learn a forward dynamics model in order to roll out the plan.
Another family of learning-based approaches to planning learn embedding spaces from high dimensional data, and learn forward dynamics models that operate on this learnt latent space. Universal Planning Networks [12] learn deterministic representations of high-dimensional data such that update steps by gradient descent correspond to the unrolling of a learned forward model. The Embed-to-Control works [13], [22] employ variational inference in deep generative models in which latent-space dynamics is locally linear, a property that enables locally optimal control in these spaces. DVBFs [23] improve on these models by relaxing the assumption that the observation space is Markovian. PlaNet [24] uses a latent-space dynamics model for planning in model-based RL. However, planning in all these models tends to consist of rolling out trajectories in time, finding a promising trajectory, and executing the given actions. As such, these techniques tend to become intractable for longer time horizons, and cannot be thought of as path planning frameworks.
A different approach is that of encoding movement primitives under the learning from demonstrations framework. Conditional Neural Movement Primitives [25] extracts prior knowledge from demonstrations and infers distributions over trajectories conditioned on the current observation. Our approach differs in that we do not encode trajectories directly, but rather learn a probabilistic model of robot states and generate trajectories as we optimise in latent space.
Learning Inverse Kinematics. In this work, by learning a joint embedding of joint angles q and end-effector positions e, we are able to optimise for achieving an end-effector target e target , while planning state sequences in joint space. Note that we do not care about the orientation in which the goal is reached, therefore end-effector orientation is omitted in the formulation of e. Learning the statistical model of kinematics means we do not need to solve inverse kinematics (IK) at any point. Prior work has sought to learn solutions to IK that can cope with its ill-posed one-to-many nature for redundant manipulators [26], [27], and to overcome the problems with analytic and numerical approaches [28]- [30]. Ren et al. [27] train a generative adversarial network to generate joint angles from end-effector positions, with the discriminator acting on the concatenation of both the input position and generated joints. This method implicitly maximises p(q|e), but does not address the multimodality of the true p(q|e) IK solutions. Bocsi et al. [26] employed structured output learning to learn a generative model for the joint distribution of joint angles and end-effector positions. By modelling the joint instead of conditional distributions, i.e. p(q, e) rather than p(q|e), their model can capture the multimodal nature of IK, as one set of IK solutions (e 1 , q 1 ) can be learnt without compromising the learning of another set (e 1 , q 2 ). These works are relevant in the way in which they use a learnt statistical model to capture the relationship between q and e. However, we differ from prior work in that, although we follow the generative approach, we do not try to find the best q that maximises p(q, e) as is done in [26], but instead plan in the latent space that decodes to valid joint configurations, producing smooth trajectories.
While our work is partly inspired by [16] and [17] in its approach, it significantly extends this prior work both in terms of method and application domain. In particular, in exploring this approach in a manipulation context we rely solely on training poses to structure the latent space. This is in contrast to [17], where, in a quadruped locomotion context, structure is induced via especially designed stance labels. Like [16], who first proposed the use of AM for constrained optimisation in a structured latent space in the context of affordance learning, we consider environmental constraints. However, our agent operates in a significantly more complex configuration space to achieve real-world reaching and obstacle avoidance. In addition, we introduce an additional loss term that encourages the model to traverse regions of high likelihood under the learned prior over the latent variables (i.e. to stay close to the training distribution) during planning. We demonstrate that this novel loss term increases efficacy by a large margin, effectively encouraging kinematic feasibility of the plans produced.

III. PATH PLANNING AS OPTIMISATION IN LATENT-SPACE
Our approach to path planning first learns a latent representation of the robot state by observing random (feasible) arm configurations. We then learn high-level performance predictors acting on this latent space as well as environment information to guide optimisation in latent space.

A. Problem Formulation
Suppose we have a dataset of joint angles and end-effector We use a VAE to learn a generative latent-variable model of x. When sampled, we expect the generative model to produce data that conform to the forward kinematics (FK) relationship. While we do not leverage the FK information at runtime, we use it during training to evaluate the sample consistency of the generative model, i.e. how well the samples of joint angles and corresponding Cartesian endeffector positions match the actual system. We opt to encode (q i , e i ) jointly as the information is readily available from routine robot operation and it avoids the ambiguity usually associated with mapping from the manipulator's Cartesian workspace to a valid joint configuration, thereby simplifying the inference task. To solve path planning in this approach, we use AM to iteratively backpropagate position error relative to a reaching goal into the latent space [18]. In addition, we exploit the probabilistic nature of our model by encouraging solutions to traverse regions of latent space of high likelihood under prior belief via a prior loss. Via the decoder, each location in latent space can be decoded into a robot configuration such that trajectories in latent space, when decoded, result in sequences of robot poses.
We posit, first, that this approach will produce valid paths from an initial end-effector position to the given target position. Our second hypothesis is that the accuracy of reaching operation will be correlated with the sample consistency of the model. That is, if the model demonstrates a closer coupling of joint angles and Cartesian end-effector position, as defined by the analytic FK relationship, then it will produce more accurate reaching solutions via AM. We will demonstrate how the approach can be extended to deal with reaching tasks while avoiding obstacles. One strength of this approach is the conceptual ease with which additional constraints can be added.

B. Learning a Latent Representation of Robot State
Our aim is to learn a generative model of x. This can be accomplished with a variational autoencoder (VAE) [31], [32], which defines an encoder q φ (z | x) and decoder p θ (x | z), where z is a learned latent representation. To train the VAE, we would like to maximise the evidence, A common alternative therefore is to maximise the evidence lower bound (ELBO), where L ELBO ≤ log p(x): (1) To trade off between reconstruction accuracy and the KL term, a β hyperparameter is often added to the ELBO formulation [33]. Rather than setting this hyperparameter manually [33], we adopt an alternative, dynamic GECO approach [34]. The GECO objective formulates the ELBO loss as a constrained optimisation problem, using a Lagrange multiplier λ, such that wherex is the reconstruction of x through the VAE. The reader is referred to [34] for implementation details on how λ is updated. The Lagrangian optimises the KL divergence subject to E z [C (x,x)] ≤ 0, for a given constraint function C. The constraint typically models an upper bound on a predefined reconstruction error (e.g. an L 2 loss): Although the GECO formulation still contains a hyperparameter, τ ≥ 0, this represents an interpretable quantity: an upper bound on the reconstruction error. In practice, this is easier to work with than tuning the β hyperparameter in the latent space, which is difficult to interpret. VAEs in our experiments are trained by optimising the GECO objective with the L 2 reconstruction loss.

C. Activation Maximisation for Path Planning Under a Prior Loss
Given a target position e target , the aim is to produce a sequence of joint configurations (q 0 , . . . , q T ) that drive the robot's end-effector from its initial position e 0 to an end position e T within a distance tolerance e T , e target 2 < γ. This can be achieved in the probabilistic model through the iterative use of AM [18].
Let the initial x 0 be encoded such that the corresponding latent configuration z 0 is drawn from the posterior. Decoding z 0 then gives rise tox 0 = {q 0 ,ê 0 }. More generally,x = {q,ê}. Let ê 0 , e target 2 denote the Euclidean distance betweenê 0 and e target , then we can compute an L 2 loss that we backpropagate through the decoder p θ (e, q | z). However, rather than update the network weights, we use AM to update the latent vector. In particular, given the AM objective, latent representations are updated iteratively in the following way, where α AM is the learning rate and ∇L AM is the gradient of the AM loss with respect to the input z: Target Loss (4) This produces a progression of latent representations (z 1 , . . . , z T ), which continues for a set number of T steps. Through the decoder, these latent representations can be mapped to joint configurations (q 1 , . . . , q T ). If the kinematics relationships represented by the decoder network are valid, and a sufficient number of steps T are taken, then we expect the final joint angle configuration q T to correspond to a new end-effector position e T such that e T , e target 2 < γ. Starting with the initial position, the sequence of decoded end-effector positions represents a spatial path (e 0 , . . . , e T ). Without modification, AM may often drive the values z into parts of the latent space that have not been seen during training. Decoding these latent representations can lead to poor (q, e) pairs that are inconsistent with the desired kinematics. To encourage the optimisation to traverse regions in which the model is well defined (i.e. to stay as close to the training distribution as possible) we introduce an additional loss term to the AM objective consisting of the likelihood of the current latent representation under its prior p(z), such that Target Loss +λ prior (− log p(z)) Prior Loss (5) This encourages the reconstructed joint configurations to remain valid. Again, λ prior is tuned automatically during training using a GECO formulation, by selecting an upper bound on the prior loss.

D. Obstacle Avoidance via Performance Predictors
A key requirement for path planning is obstacle avoidance. In our framework this is effected by a binary classifier predicting whether the current arm configuration, as represented in latent space, is in collision with an obstacle. By backpropagating gradients forcing the collision response of this classifier to zero we effectively drive the robot away from obstacles. The classifier is trained using a binary cross-entropy (BCE) loss while the VAE weights remain frozen.
When performing AM in the case of obstacle avoidance, we add an obstacle loss term from BCE to the AM loss in Eq. 5.
Target Loss +λ prior (− log p(z)) Prior loss where λ prior and λ obs are tuned jointly using GECO with mutliple constraints. Avoidance of multiple obstacles can be achieved by repeatedly deploying the same classifier and adding the resulting gradients into the optimisation. The ease with which multiple constraints can be expressed and enforced is an explicit strength of this approach.

E. Model Selection Through Sample Consistency
While the downstream performance we seek from our models is better path planning, this is not continuously measurable during training. For VAE model selection and hyperparameter tuning, we consider three metrics as predictors of path planning success: (a) the data reconstruction loss ê − e 2 + q − q 2 , (b) ELBO (Eq. 1) and (c) kinematic sample consistency, which we define as This sample consistency error δ is the Euclidean distance between the reconstructed end-effector positionê and the true forward kinematics (FK) solution for the reconstructed joint anglesq. We find that high sample consistency is a better predictor of a model's downstream planning performance than the more traditional ELBO loss alone (Fig. 2 right).
IV. IMPLEMENTATION DETAILS This section provides details on model architecture, model training and planning, using a 7-Dof Emika Franka Panda arm.

A. Architecture Details
The VAE architecture comprises of an encoder and a decoder. The encoder takes as input x = {q, e} and outputs the mean µ and variance σ of the posterior distribution q φ (z | x). The latent encoding z is then obtained using the reparameterisation trick [31]. A multivariate isotropic Gaussian prior is imposed on the latent space. The decoder takes as input the latent sample z and outputs the reconstructionx = {q,ê}. The obstacle collision classifier takes {z, o = {x, y, h, r}} (xy coordinates, height, radius of the cylinder) as input and has a single output logit, which when passed through a sigmoid function gives the predicted probability of collision. The encoder, decoder and obstacle collision classifier each contains four fully connected hidden layers of 2048 units, but differ in input and output layers, as shown in Table I

B. Training Data Generation
In this evaluation, we consider cylindrical objects 1 , which are easily represented in state space as tuples {q, e, o, c}, where q = (θ 1 , ..., θ 7 ) represents the robot joint configurations; e = (e 1 , e 2 , e 3 ) the end-effector coordinates; o = (x, y, h, r) the obstacle coordinates, height and radius; and c ∈ {0, 1} the binary collision label. Joint configurations q are sampled uniformly within the joint limits. We take the modified Denavit-Hartenberg parameters in the Panda arm documentation to characterise the forward kinematics relationship, e = FK(q). To generate the position of the obstacles, for each obstacle, we sample a distance to origin L, an angle θ obs in [0, 2π) uniformly and set x = L cos(θ obs ), y = L sin(θ obs ). MoveIt's planning scene interface is used to check whether the arm is in self-collision or in collision with the table; joint configurations that are in such collisions are discarded. We also use MoveIt's planning scene interface to label collision with obstacles in the training data. The dataset contains an equal number of samples in and not in collision with the obstacles. In total, the dataset contains 100k data points, of which 80k are used for training and 20k for validation.

C. Obstacle Scenario Generation
In our experiments, scenarios are generated by sampling a given number of obstacles and two sets of joint angles -one for the initial robot configuration and another for the target position. The joint angle samples for the target are only used to compute the target position through the FK model of the Panda arm that we characterised and are not known to the planner. The obstacles and the target are generated while ensuring that there is at least a feasible set of joint angles reaching the target without collision with the obstacles. The first obstacle is sampled between the initial end-effector position and the target position. Subsequent obstacles are either sampled randomly or sampled between the initial end-effector position and the target position, with a probability of 50% each. The model is evaluated on scenarios in which it would collide with the obstacles if the obstacle loss term was not added to the total loss in the AM objective function.

D. Training Details
The input values to the VAE and the collision classifier are standardised, and the output values de-standardised, according to the mean and the standard deviation of the training data. The model is trained using a batch size of 256 for 16,000 epochs using the Adam optimiser [35]. To select hyperparameters, a grid search is run on the following values: number of hidden layers, units per layer, latent dimension, GECO reconstruction target τ (Eq. 3), VAE learning rate, and GECO learning rate.

E. Planning Details
Planning is achieved by applying activation maximisation in the latent space, as outlined in Algorithm 1.

V. RESULTS
We evaluate our approach in the context of a set of robot reaching tasks, described below, using a simulated Panda arm. We further demonstrate that the approach can be deployed on a physical Panda arm.

A. Path Planning for Target Reaching
Before extending to path planning with additional constraints, we explore the ability of iterative AM as described in Section III-C to produce a path plan for goal reaching in free space. We sample 1,000 start and goal configurations for the robot, with an initial joint position q 1 and a goal e target in R 3 . Results of our method are shown in Fig. 2 (left), where we quantify planning success rates at different distance thresholds. We find that the addition of the prior loss (Eq. 5) to the AM objective is instrumental in improving success rates, while when we optimise AM for reducing distance to goal with no Algorithm 1: Activation Maximisation for Obstacle Avoidance initialise path buffer X; initialise λ prior and λ obs ; infer latent representation of initial configuration z 0 ∼ q φ (z | x = x 0 ); for each time step t = 0,1,...,T do decode latent encoding to state spacê x t ∼ p θ (x | z = z t ); save states to path X t =x t ; if d(e t , e target ) < γ then break; end compute loss terms ê, e target 2 , − log p(z), and additional constraint, we observe more frequent infeasible state reconstructionsq in the path plans (Fig. 3 top). With the prior loss, over 90% of the planning scenes are solved to within a 5mm threshold of the goal.  [36]. Adding prior loss in AM objective function improves reaching success rate. Right: Minimum distance between the end-effector and the target vs sample consistency error. Lower sample consistency error leads to better target reaching.

B. Obstacle Avoidance
To demonstrate the effect of adding obstacle loss to accomplish obstacle avoidance, we show qualitative results both in simulation and in the real world in Fig. 3 (bottom) and Fig. 4.
To evaluate the efficacy of our approach in finding feasible plans in the presence of obstacles we generate 1,000 scenarios for each of one to five cylindrical obstacles. We compare our approach of latent-space path planning (LSPP) to eight planners in widespread use: Potential Field [37], [38], RRT-Connect [39], LBKPIECE [40], RRT* [3], LazyPRM* [41], FMT* [42], BIT* [43] and CHOMP [4]. The artificial potential field baseline is a classical local collision avoidance method using Jacobian pseudo-inverse to reduce the error in endeffector position while avoiding obstacles through the use of virtual repulsive forces. It is adapted from [38], but instead of using the depth-space concept to estimate the distances between the robot and the obstacles, it has direct access to In the case of no prior loss, the encoding of the robot initial configuration lies in the trusted region, but drifts to its boundary as we perform gradient descent, which decodes to a meaningless output. In the case of no obstacle loss, the robot collides with an obstacle. The link in collision is shown in red.
the obstacle configurations to generate the repulsive vectors.
A grid search is conducted on the hyperparameters (Eq. 7 in [38]) to optimise for the overall success rate. For all other baselines we use the default parameters from their MoveIt OMPL and CHOMP library implementations [44], [45]. For RRT, LazyPRM* and BIT*, we keep the default planning time of 5 seconds. CHOMP uses a linear initialisation from start to goal position and optimises locally, such that it provides a fair comparison for local planners. Quantitative results are shown in Table II. For LSPP, a grid search is conducted on the GECO target (Eq. 6), GECO smoothing factor and GECO learning rate for the obstacle loss term to optimise for the overall success rate. Across all methods, a run is considered a success if the robot reaches the target within a distance threshold of 1cm and without colliding with obstacles. In terms of planning success rate, LSPP performs commensurate to the baselines in the case of one and two obstacles, but suffers a performance drop when more obstacles are present. This performance drop is expected and can also be observed in CHOMP, another optimisation-based planner. Our scenario generation process does not ensure there exists a feasible solution to a particular scenario. The success rates are therefore only indicative of relative performance. However, RRTConnect, RRT*, FMT*and BIT* serve as useful calibration as they are probabilistically complete, ensuring a solution will be found if one exists, given sufficient runtime. There are a number of factors which influence LSPP performance. There exists an inherent tension due to the AM objective between reaching a goal and avoiding obstacles. This is, in effect, regulated by the GECO parameters. As LSPP is inherently a gradient-based optimisation method it is subject to local minima. Empirically, this happens more often as the number of obstacles increases, but could potentially be handled by adding a stochastic recovery strategy or a post processor. In addition, the optimisation can be misguided either by a failure in the obstacle classifier or due to low sample consistency.
Overall LSPP's average planning time is commensurate Fig. 4: Real-world experiments: a rollout of a trajectory using LSPP. Our latent-space approach operates in state space and therefore trivially transfers to the real world.
with that of RRTConnect whereas it significantly outperforms Potential Field, LBKPIECE, CHOMP and FMT*. We note also that LSPP exhibits consistently lower variances in planning time than the baselines. In LSPP, each additional obstacle requires an extra forward and backward pass of the collision predictor, and thus planning time increases linearly with obstacles. However, in these experiments this remains a negligible effect on the overall LSPP time.
The path length is normalised by dividing the actual length of the planned path by the Euclidean distance between the initial end-effector position and the target position to ensure a fairer comparison among different scenarios. It should be noted that the cost functions in OMPL minimise joint space path length, and a shortest path in joint space does not necessarily translate into a shortest path in Cartesian space. RRT* is an asymptotically optimal algorithm, thus it is not surprising that it finds near optimal paths. Nevertheless, LSPP outperforms most of the other baselines.
The artificial potential field baseline is widely used due to its simplicity and serves as a useful comparison for local collision avoidance methods. It is in spirit most similar to LSPP, subject to local minima, and neither of them has theoretical guarantees. However, it is not directly comparable as it assumes access to the FK relationship to compute the Jacobian while ours only relies on it for data collection and model selection, which could be avoided if we have a separate sensor for corresponding end-effector positions and if we choose a different model selection criterion. In terms of performance, it only achieves around 72% success rate even without any obstacles as it struggles at joint limits. Contrary to global planners, each action can be executed after each update is computed. Thus, it may appear to be surprisingly slow, while in reality it achieves real time performance.
Overall, it is encouraging to see that LSPP, an intuitive and data-driven formulation, is approaching the performance of established path planning algorithms.

C. Dynamic Feasibility
To show that the plans are dynamically feasible, we present an analysis in Fig. 5. The generated motion plans, when executed with a constant control frequency of 50Hz, demand a relatively small angular velocity, angular acceleration and angular jerk. These are all well below the maximum joint limits for the Panda arm, shown as red segments in the figure. This demonstrates that we can generate feasible state space motion plans by decoding from the latent trajectory we obtain from gradient-based optimisation. Additionally, we can potentially further improve the smoothness by adjusting the learning rate of the Adam optimiser during the optimisation.

VI. CONCLUSION
We present a novel approach to path planning for robot manipulation that learns a structured latent representation of the robot's state space and uses constrained optimisation to produce joint space paths to reach end-effector goals. Our approach differs significantly from related work in that it performs path planning based on a generative model of robot state, which is trained in a largely task-agnostic manner. In addition to the goal and obstacle losses, we introduce a novel constraint which maximises the likelihood of the latent variable being explored under its learned prior, thereby encouraging the model to stay near the training distribution of robot configurations. In doing so, we bypass the traditional computational challenges encountered by established planning methods while achieving commensurate performance in terms of reaching success, planning time and path length. Despite the lack of theoretical guarantees, it is a practical mechanism for path planning. Future directions include algorithmic improvement to handle local minima, generalisation to scenarios with more complex obstacles and dynamic objects, and tasks that involve interaction.  2) Choice of Hyperparameters: We discuss the effects of some of the hyperparameters on model training and performance.

ACKNOWLEDGMENT
Number of hidden layers and units per layer. The number of fully connected hidden layers and the number of units per layer in the neural network affect its capacity, which is important for the VAE in modelling the kinematics relationship and for the obstacle classifier in predicting collision. However, having a network that is too large (e.g. eight hidden layers) is found to lead to instabilities in training and to having diminishing returns in terms of performance. Thus, a grid search is conducted on the size of the neural network. For memory efficiency, we choose to perform the grid search on the number of hidden units in powers of two starting from two hidden layers and 64 units per layer.
Latent dimension. The number of latent dimensions determines the capacity of the latent space to capture the correlation between the joint angles and the end-effector position. As the expressive power of the decoder is finite, having a small number of latent dimensions is found to create an information bottleneck that prevents the VAE from generating accurate reconstructions. The information preference problem [46] may also be created if we employ a large number of latent dimensions, which means that many of the latent dimensions may not capture any useful information, which in turn encourages the decoder to ignore the latent encoding. This is also not desirable given our motion planning pipeline is based on latent traversal. Thus, we perform a grid search for the number of latent dimensions, and find that a dimension of seven (i.e. the same as the number of DoFs of the robot) achieves the best performance in terms of our metrics.
GECO reconstruction target. The GECO reconstruction target τ imposed as a constraint via a Lagrange multiplier mechanism in GECO [34] is found to be important for the performance of the VAE model. If the goals are strict (i.e. perfect reconstruction), as the size of the neural network is limited and thus limiting its inference and generation capacity, the MSE term in ELBO overwhelms the KL regulariser due to the Lagrange multiplier, leading to overfitting and a mismatch between the posterior and the prior. On the other hand, if the goals are loose, the reconstruction accuracy becomes poor, leading to worse sample consistency. Thus, we use a grid search on the parameter τ .
Learning rate (GECO). The learning rate for the GECO Lagrange multiplier determines the responsiveness of the λ parameter to the violation of the GECO reconstruction target. The higher the learning rate, the more responsive the λ parameter becomes in adjusting the relative weights of the two terms in ELBO in training the VAE. However, a high GECO learning rate leads to instabilities in training. The optimal value is then found through a grid search.

B. Planning Details 1) Modification of GECO:
The following algorithm is applied to update the individual λ parameters using a modification of the GECO algorithm [34]. The algorithm is applied to different pairs of loss terms (− log p(z), ê, e target 2 ) and ( i (− log(1 − p ϑ (z, o i ))), ê, e target 2 ) to compute λ prior and λ obs at each update step.

C. Choice of Baselines
MoveIt [44] is the most widely used framework for robot manipulation. The Open Motion Planning Library (OMPL) [45] is a collection of sampling-based motion planning algorithms and is the default planner in MoveIt. In our experiments, seven MoveIt path planning algorithms are chosen for comparison. In the following, we provide a summary (mostly condensed from the OMPL documentation) and the rationale behind our choice.
RRTConnect [39] is one of the default planners in MoveIt. It grows two RRTs [1], one from the start and one from the goal, and attempts to connect them. It is an improved version of RRT and is probabilistic complete, ensuring a solution will be found if one exists, given sufficient runtime. It is commonly used and best known for its fast convergence, even in highdimensional spaces.
LBKPIECE [40] is the other default planner in MoveIt. KPIECE, a sampling-based path planning algorithm designed specifically for planning in high-dimensional spaces, uses a discretisation to guide the exploration of the continuous space. It offers computational advantages by employing projections from the searched space to lower-dimensional Euclidean spaces for estimating exploration coverage. LBKPIECE is a bi-directional variant of KPIECE with lazy collision checking and one level of discretisation. It is also commonly used and known for its planning efficiency.
RRT* [3] is an asymptotically optimal incremental sampling-based path planning algorithm. It is an optimal variant of RRT and converges to an optimal solution in terms of path length after infinite time. This baseline is insightful in comparing path length.
LazyPRM* [41] is another asymptotically optimal sampling-based planner. The Probabilistic Roadmap Method (PRM) constructs a roadmap and checks whether a path exists in the roadmap between a start and goal state. PRM* gradually increases the number of connection attempts as the roadmap grows in a way that provides convergence to the optimal path. LazyPRM* is a variant of PRM* with lazy state validity checking. This is another useful baseline for path length.
FMT* [42] stands for Fast Marching Tree. It is another asymptotically optimal sampling-based planner. The algorithm is specifically aimed at solving complex motion planning problems in high-dimensional configuration spaces, by performing a lazy dynamic programming recursion on a set of probabilistically-drawn samples to grow a tree of paths.
BIT* [43] stands for Batch Informed Trees. It is an anytime asymptotically optimal sampling-based planner that uses heuristics to prioritise expansion towards the goal and high-quality paths. It has been shown to outperform existing sampling-based planning algorithms, e.g. RRT* and FMT*, in terms of computational cost to find equivalent results.
CHOMP [4] stands for covariant Hamiltonian optimisation for motion planning. It is a gradient-based trajectory optimisation procedure, and uses two objective functions: an obstacle term that captures obstacle avoidance and a smoothness term that captures the dynamics of the trajectory. It is able to avoid obstacles in most cases, but it can fail if it gets stuck in a local minimum due to a bad initial guess for the trajectory. OMPL can be used to generate collision-free seed trajectories for CHOMP to mitigate this issue. Thus, in our experiments, we use OMPL with the default RRTConnect planner for an initial guess and use CHOMP as a post-processor. CHOMP is efficient, produces smooth paths and is the most commonly used optimisation-based approach.

D. Analysis on Latent Space Representation
The latent space of our VAE encodes pairs of joint position q and end-effector position e. How accurate is the q -to- e mapping in the latent space? Figure 6 (left) presents a histogram of sample consistency errors from 10,000 prior samples. The peak is centred at around 2.5mm and over 95% of the samples fall below 1cm. What is the completeness of the latent space, i.e. does the latent space cover all the valid joint space or does it miss parts of it? To answer this question, 10,000 samples are drawn from the prior and decoded to {q,ê}. In fig. 6 (right), we plot the sample consistency errors at different end-effector positionsê. We observe that all the valid joint space is well covered and there is no obvious correlation between the sample consistency error and the endeffector position.