Disruption-Resistant Deformable Object Manipulation Based on Online Shape Estimation and Prediction-Driven Trajectory Correction

We consider the problem of deformable object manipulation with variable goal states and mid-manipulation disruptions. We propose an approach that integrates online shape estimation, prediction of shape transitions, and mid-manipulation trajectory correction. All functionalities are implemented using two neural network architectures. We apply this approach to the problem of cloth folding, and perform evaluation experiments in simulation and on robot hardware. We demonstrate that the system can achieve good approximation of given goal states, even when the manipulation process is disrupted by cloth slipping or external interference.


I. INTRODUCTION
Robotic manipulation of cloth objects is complicated by the fact that these objects take on many different shapes. Humans, however, manipulate cloth with great dexterity. This raises a number of fundamental questions for the pursuit of robotic manipulation of deformable objects. How do humans mentally picture the object's deformations? How do we decide our manipulation strategy? How do we conceptualize the manipulations we perform, and how do we mentally grasp the state of the object? These mechanisms remain ill-understood and therefore hard to replicate.
Our overarching goal is to establish a framework for functionally approximating human cloth manipulation. The present work focuses on continuous shape monitoring and self-correction. We place particular emphasis on the following criteria: (1) An up-to-date representation of the object's shape must be maintained throughout the manipulation process. This requires estimation of a topological representation of the object on the basis of discrete sensor data, which must be robust against ambiguity resulting from occlusion. (2) An up-to-date prediction of the manipulation's outcome must be maintained, and manipulation trajectories must be corrected on the spot when the necessity arises. In short, we must perform shape estimation, shape change prediction, and motion correction in a continuous online cycle. Existing work has realized these functionalities in isolation [1], [2]. If we loosen the demand for online operation, work combining multiple of these elements can also be found [3]. However, to the best of our knowledge, realizing the estimation-prediction-correction cycle in online fashion has remained a challenge. The current work proposes methods for realizing the constituent functions and integrating them in an online cycle. Below we list the core features.
a) Shape estimation: We represent cloth shapes as mesh structures, and estimate the configuration of the mesh from point clouds obtained from an RGBD sensor. By assigning uncertainties to the individual vertices of the mesh, we make it possible to express ambiguity in our estimations. Processing time is <150 ms on average.
b) Shape prediction: We predict cloth shape evolution as the robot performs a given manipulation. Processing time is <11 ms for 100 frames on average.
c) Trajectory correction: When the predicted outcome of the present manipulation diverges from the goal, we revise the remainder of the manipulation trajectory in order to realign the expected outcome with the goal.
We believe this combination of functionalities captures more of the flexibility seen in human deformable object manipulation than has previously been achieved.
The paper is structured as follows. The next section discusses related work. Section III explains the global structure of our approach. Section IV describes shape estimation, Section V shape prediction, and Section VI manipulation generation. Section VII reports and discusses the results of our simulation and real-world experiments. Section VIII concludes the paper.

II. RELATED WORK

A. Cloth Shape Representations
Automated cloth manipulation is an active field of research. Some approaches explicitly estimate cloth shapes. A common approach is to model the cloth as a polygon model. Miller et al. [4] match polygon models to comparatively complex topologies such as long-sleeved shirts. Stria et al. [5] similarly estimate clothing shapes, and perform folding. Twardon et al. [6] demonstrated tracking of garment openings (e.g., sleeve ends) by means of ACBM. These methods can be applied online, but are insufficiently expressive to capture fine detail and complex shape configurations.
For finer shape representation, mesh models are an option. Kita et al. [7] and Li et al. [8] demonstrated accurate shape estimation for suspended garments using an active recognition strategy wherein cloth objects are lifted and manipulated for the purpose of shape estimation. Active strategies are effective for acquiring initial models of cloth objects, but unsuitable for estimating shape continuously during goal-directed manipulations. Willimon et al. [9] and Han et al. [1] also propose estimation routines for deformable objects, but the amount of deformation considered falls short of what many common cloth manipulation scenarios, including ours, require.

B. Motion Generation
Various strategies have been proposed for generating manipulation motions, with significant variation in the scope of the problem setting. Van den Berg et al. [10] target folding with given manipulation procedures. They solve the problem of determining whether the given procedure is possible with a given number of grippers, and translate it into gripper trajectories under a set of idealized assumptions about cloth dynamics. Execution is open-loop. Maitin-Shepard et al. [11] and Doumanoglou et al. [3] propose folding pipelines from unordered states to folded states with intermittent recognition, operating in fixed, flowchart-style manipulation procedures. Li et al. [2] used simulation to generate folding trajectories, obtaining high quality trajectories and online performance. Sun et al. [12] perform flattening using a geometric approach, generating high quality 2.5D representations of wrinkled cloth. Seita et al. [13] and Wu et al. [14] generate manipulations for smoothing a square cloth using reinforcement learning.
The above approaches share two limitations: the lack of online trajectory correction during manipulation, and the assumption of fixed goal states. Petrík and Kyrki [15] realize fine feedback control with robustness to material variation for the constrained case of folding a strip of cloth in two, using a reinforcement learning approach with a low-dimensional state representation. Hu et al. [16] tightly interlink recognition and motion generation using machine learning techniques operating on raw sensor data. Yang et al. [17] achieved online cloth folding by generating motions directly from sensor images. However, the dynamics model and manipulations learned do not necessarily transfer well to different goals.
Work accommodating variable goals remains scarce. We have proposed a system for generating multi-step manipulation plans with variable goals [18], [19] using forward models of the cloth dynamics. Subsequently, Kawaharazuka et al. [20], Hoque et al. [21] and Yan et al. [22] proposed forward model-based approaches capable of accommodating variable goal states to varying extents.
Here, we focus on the execution of individual manipulations with various goal states, combining intra-manipulation mesh estimation with forward model-based shape prediction of the course of the manipulation in order to realize online trajectory correction. Eventually we aim for integration with the mesh-based version of our multi-step planning system [23].

III. APPROACH

A. Function Overview and Integration
We consider the task of manipulating a cloth object into a given goal configuration. Fig. 1 shows an overview of our approach. First, we estimate the initial shape of the object from sensor data. Then we initialize the manipulation for transforming the object from its current shape to the target shape (manipulation planning is treated in [23]). During manipulation, we continuously perform shape estimation (a) and shape prediction (b) on the basis of the sensor data acquired over the course of the manipulation. By comparing the predictions with the goal shape, we monitor the progress of the manipulation process. When the difference between prediction and goal shape exceeds the admissibility threshold, function (d) revises the manipulation motion on the spot, and manipulation resumes.
The advantage of this mechanism is its ability to respond to unexpected situations during manipulation. For example, if the cloth slips over the work surface, or is pulled by an external force, the system can adjust the manipulation on the spot through estimation of the new cloth shape and prediction of the shape evolution for revised trajectories.

B. Approach
Among the functions in Fig. 1, (a) shape estimation and (b) shape prediction in particular require online performance, which is challenging. We approach this challenge as follows. We train a neural network to generate initial probabilistic shape estimates in milliseconds. We combine this estimate with prior knowledge of the object topology through a short energy minimization process ("refinement") to find a shape that is both realistic and consistent with the estimate. For shape prediction we use a second network, which learns the relation between hand motions and cloth shape change at fine temporal granularity. This lets us predict the evolution of the cloth shape, again with processing times on the order of milliseconds.
The next sections explain our methods for shape estimation (Section IV), and shape transition prediction and trajectory correction (Sections V and VI). Both assume a dataset of manipulation examples, containing manipulation trajectories and the corresponding cloth shape evolution at fine temporal granularity. In the present paper we generate this dataset in simulation. Dataset generation is detailed in Section VII.

IV. SHAPE ESTIMATION

Fig. 2 shows an overview of the shape estimation pipeline. We explain the constituent parts and processes below.

A. Voxel-To-Mesh Net
Our shape estimation approach employs a neural network (NN) that generates a probabilistic mesh estimate on the basis of a voxel representation of the current cloth shape. We refer to this NN as the Voxel-to-Mesh net (VtM for short) below. We then apply an energy minimization procedure to derive a deterministic mesh representation from the probabilistic estimate and prior knowledge about the cloth object. We refer to this process as "refinement" below. The choice of NN-based shape estimation is motivated by three factors. (1) Speed: initial shape estimates are generated in less than 4 ms on average. (2) Simplicity and generalizability: whereas geometric methods often require task-specific categorization of shape elements (e.g., wrinkle categories [12]), our NN-based estimation strategy is low on assumptions and in principle applicable to a broad variety of tasks and cloth topologies. (3) Occlusion handling: occlusion handling is a complex problem for geometric methods. Notable geometric approaches are limited to occlusion-free shapes (e.g., the 2.5D descriptions in [12]) or manipulate items into low-occlusion configurations for shape estimation (e.g., [24]), which can be inefficient. The NN approach can learn the rules governing positional uncertainty implicitly from data, quantifying uncertainty at fine granularity with minimal computational cost.
1) Architecture: The VtM net is a Multi-Layer Perceptron (MLP) architecture (we have experimented with 3D convolutional architectures as well, but achieved better results with the fully connected architecture). Input is a 32 × 32 × 32 binary voxelization of a cloth shape. Output is a 32 × 32 × 6 probabilistic mesh representation. Input and output volumes are flattened for network I/O. In between are four hidden layers of 4096 neurons each. Hidden layers use the hyperbolic tangent (tanh) activation function. Each 1 × 1 × 6 subvolume of the output defines a 3D multivariate normal distribution with a diagonal covariance matrix. We denote the means as μ_x, μ_y, μ_z and the non-zero elements of the covariance matrix as σ_x, σ_y, σ_z. The activation function on the output layer differs for μ and σ values, as the latter should take positive values only. For μ-outputs we use the linear activation function, and for σ-outputs we use a_out = ELU(a_in) + 1.05, where ELU is the Exponential Linear Unit activation function. This ensures that σ-values are positive and larger than 0.05 (this helps to stabilize training, as σ-values growing too small can lead to incidental extreme loss values).
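As an illustration, this architecture can be sketched in PyTorch as follows (a minimal sketch; the framework choice and details such as weight initialization are incidental):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VtMNet(nn.Module):
    """Sketch of the Voxel-to-Mesh net: 32x32x32 voxels in, 32x32x6 probabilistic mesh out."""
    def __init__(self):
        super().__init__()
        dims = [32 * 32 * 32, 4096, 4096, 4096, 4096]
        self.hidden = nn.Sequential(
            *[m for i in range(4) for m in (nn.Linear(dims[i], dims[i + 1]), nn.Tanh())]
        )
        self.out = nn.Linear(4096, 32 * 32 * 6)   # linear pre-activations for mu and sigma

    def forward(self, voxels):                    # voxels: (batch, 32*32*32), flattened grid
        a = self.out(self.hidden(voxels)).view(-1, 32, 32, 6)
        mu = a[..., :3]                           # linear activation for the means
        sigma = F.elu(a[..., 3:]) + 1.05          # ELU + 1.05 keeps each sigma above 0.05
        return mu, sigma
```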
2) Input/Output Processing: We define the net's "workspace" as a volume of space running from (−1, −1, 0) to (1, 1, 1/3). For all network I/O, cloth shapes are scaled and translated to fit this space. Shapes are translated so that their centres are at (0, 0) in the XY plane, by projecting them onto the XY plane, finding the centre of the projection, and subtracting the centre coordinates from the coordinates of the points comprising the shape. Shapes are scaled such that a fully spread out, axis-aligned cloth runs from (−0.7, −0.7) to (0.7, 0.7). Quantities in the remainder of this section apply in this normalized format.
For input, we convert the input shape representation to a voxel representation, with the voxel volume spanning the workspace defined above. Before voxelization, we non-linearly boost z-coordinates. The non-linearity in this transformation has the effect of emphasising depth differences close to the work surface (z = 0) and de-emphasising depth differences further above the work surface. Regions of the cloth that are being lifted up generally present simple shapes, as they hang down under the effect of gravity. These parts can be interpreted well enough at crude z-axis resolution. Regions resting on the work surface or on underlying layers of cloth present more detail due to wrinkling and layering. After applying this transformation, we convert point data to voxel representation by setting all voxels containing at least one point to 1 and all other voxels to 0.
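A minimal numpy sketch of the normalization and voxelization steps follows. The concave z-boost shown is a stand-in for our actual boost function, chosen only to illustrate the principle, and the mean of the projected points stands in for the projection centre:

```python
import numpy as np

def normalize(points, cloth_side):
    """Center a point cloud on (0, 0) in XY and scale so a flat cloth spans (-0.7, 0.7)."""
    pts = points * (1.4 / cloth_side)             # flat, axis-aligned cloth -> 1.4 units wide
    pts[:, :2] -= pts[:, :2].mean(axis=0)         # center of the XY projection to the origin
    return pts

def voxelize(points, res=32):
    """Binary voxel grid over the workspace (-1,-1,0)..(1,1,1/3), with boosted z."""
    pts = points.copy()
    # Illustrative concave boost (assumption): maps [0, 1/3] onto itself while expanding
    # z-resolution near the work surface and compressing it higher up.
    pts[:, 2] = np.sqrt(np.clip(3 * pts[:, 2], 0, None)) / 3
    lo, hi = np.array([-1.0, -1.0, 0.0]), np.array([1.0, 1.0, 1.0 / 3.0])
    idx = ((pts - lo) / (hi - lo) * res).astype(int)
    idx = np.clip(idx, 0, res - 1)
    grid = np.zeros((res, res, res), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0   # any voxel containing a point is set to 1
    return grid
```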
3) Occlusion: Measurement of the real cloth is subject to two types of occlusion: self-occlusion and occlusion by the hands and arms of the robot. We handle self-occlusion by artificially applying occlusion consistent with our hardware setup to simulation data during training. This occlusion applies to the voxel input, while target output (ground truth mesh) remains unoccluded. While we centre states for estimation, real-world occlusion occurs before this centring. To ensure that the net can handle the range of occlusions that occur in the real-world manipulation setup, we apply random offsets to the relative camera position used for calculating artificial occlusion during training.
4) VtM Net Training: Each possible shape has eight equivalent mesh representations (i.e., there are eight equivalent assignments of Cartesian coordinates to the mesh's geodesic coordinates). Fig. 3 illustrates mesh equivalence with some examples. Consequently, there are eight correct answers for each input. We account for this by defining the training loss for the VtM net as

loss(s, ŝ) = MIN_{i=1,…,8} NLL(s_i, ŝ),

where s_i is the i-th mesh representation in an arbitrary ordering of the set of equivalent mesh representations of the ground truth s, and ŝ is the probabilistic mesh estimate output by the net. NLL is shorthand for Negative Log-Likelihood, and MIN selects the smallest value from a set of values.
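A sketch of this loss in PyTorch, assuming ground-truth meshes stored as 32 × 32 × 3 vertex grids and assuming the eight equivalent representations correspond to grid rotations and a flip (batch dimension omitted for brevity):

```python
import torch

def equivalent_meshes(s):
    """All eight index assignments of a 32x32x3 vertex grid (4 rotations x optional flip)."""
    variants = []
    for flipped in (s, torch.flip(s, dims=[0])):
        for k in range(4):
            variants.append(torch.rot90(flipped, k, dims=(0, 1)))
    return variants

def vtm_loss(mu, sigma, s_true):
    """MIN over the equivalent representations of the Gaussian negative log-likelihood."""
    dist = torch.distributions.Normal(mu, sigma)        # diagonal Gaussian per coordinate
    nlls = [-dist.log_prob(s).mean() for s in equivalent_meshes(s_true)]
    return torch.stack(nlls).min()
```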
Training employs various types of data augmentation. The simulation data we use for training is in mesh format, so during training batch generation we convert meshes to noisy point clouds by taking the set of vertices as points, duplicating each point to increase the point count, and adding Gaussian noise γ ∼ N(0, 0.01) to each point independently. Augmentation with noise improves robustness and promotes generalization to real-world data. We apply random mirroring over the X-axis and random rotation around the Z-axis. To improve robustness to slight variations in the relative position and angle of the camera in real-world experiments, we add tilt and Z-shift augmentations. The tilt augmentation tilts the state over the X and Y axes by angles (in degrees) drawn from U(−5, 5), and Z-shift translates the state along the Z-axis by a distance drawn from U(0, 0.02), where U(a, b) denotes the continuous uniform distribution over the interval [a, b]. Tilt and Z-shift are applied to the input but not to the ground truth, so the net learns to remove them. We train the net using the SignSGD update rule [25].
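The input-side augmentations can be sketched as follows (mirroring and rotation, which apply to input and ground truth alike, are omitted; the order of operations is our assumption):

```python
import numpy as np

def augment_input(points, rng):
    """Input-side augmentations: point duplication, Gaussian noise, tilt over X/Y, Z-shift.
    Tilt and Z-shift are NOT applied to the ground truth, so the net learns to undo them."""
    pts = np.repeat(points, 2, axis=0)                # duplicate points to raise the count
    pts = pts + rng.normal(0.0, 0.01, size=pts.shape) # independent per-point Gaussian noise
    ax, ay = np.radians(rng.uniform(-5, 5, size=2))   # tilt angles over the X and Y axes
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    pts = pts @ (ry @ rx).T                           # apply both tilts
    pts[:, 2] += rng.uniform(0.0, 0.02)               # Z-shift
    return pts
```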
The training and augmentation logic facilitates transfer to real cloth. Augmentation and noise on the input improve the net's robustness to noisy real input, while it is trained to output clean shapes from the distribution that generated the training set. Hence the net "interprets" real states into similar states from the training distribution, to some extent. This bias reduces the need to harden processes further down the pipeline.

B. Mesh Refinement
The VtM net produces a probabilistic mesh estimate ŝ_p. We convert this estimate into a deterministic estimate ŝ_d that is both plausible w.r.t. ŝ_p and consistent with prior knowledge about the cloth, using a refinement procedure that incorporates the cloth's topology in the form of a spring model.
Refinement employs the following losses: the negative log-likelihood of ŝ_d w.r.t. ŝ_p (loss_nll), spring energy (loss_spring), and an upward bias (loss_up). For computing loss_spring, we define a set of springs between the vertices, following a spring pattern common in cloth simulation (see e.g., [26]). Let k be the distance between orthogonally neighbouring vertices. A vertex at indices (u, v) connects to neighbour vertices (u, v ± k) and (u ± k, v) ("stretch" springs), (u ± k, v ± k) ("shear" springs), and (u, v ± 2k) and (u ± 2k, v) ("bend" springs), insofar as these vertices exist. Spring energy loss is then calculated as

loss_spring = (1/N_springs) Σ_{i=1}^{N_springs} (l_i − r_i)²,

where l_i is the current length of the i-th spring, r_i is its resting length (i.e., its length in the cloth's fully spread-out default state), and N_springs is the total number of springs. Consideration of self-collision would be desirable, but we omit it because of its high computational cost.
The third loss biases refinement against downward adjustment of vertex positions, because such adjustment can push vertices into the work surface. The upward bias ensures that wrinkles produced by optimization form in the upward direction. Upward bias loss is computed as the mean over max(0, ŝ_p.μ_z − ŝ_d.z), where ŝ_p.μ_z denotes the μ_z components of ŝ_p, ŝ_d.z denotes the z components of ŝ_d, and subtraction and max operate element-wise. Spring loss is multiplied by 5000 to be on the same order of magnitude as loss_nll. Upward bias loss is multiplied by 1000.
The refinement process starts by initialising the vertex positions. Recall that the purpose of the VtM net in our system is to continually track the cloth shape as the cloth is being manipulated. Consequently, consecutive inputs usually correspond to consecutive moments in time (frames), and represent similar shapes. We exploit this fact by initializing refinement with the refinement result obtained for the preceding frame, if a preceding frame exists. This tends to reduce refinement time cost, and can help to disambiguate the current frame in some cases. When no preceding frame exists, or its difference with the μ component of the present ŝ_p exceeds a given threshold, we initialize with the μ component itself. We update vertex positions using gradient descent on the sign of the gradients of the compound loss, with an update rate of 0.001, running until the loss stabilizes or 300 iterations have passed.
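The refinement loop can be condensed into the following PyTorch sketch (spring-set construction and the loss-stabilization check are omitted; `springs` holds precomputed vertex index pairs and `rest` their resting lengths):

```python
import torch

def refine(mu, sigma, springs, rest, init, steps=300, rate=1e-3):
    """Sign-gradient descent on the compound refinement loss.
    mu, sigma: (32,32,3) probabilistic estimate; springs: (N,2,2) vertex index pairs;
    rest: (N,) resting lengths; init: (32,32,3) initial vertex positions."""
    s = init.detach().clone().requires_grad_(True)
    normal = torch.distributions.Normal(mu, sigma)
    for _ in range(steps):
        a = s[springs[:, 0, 0], springs[:, 0, 1]]        # spring endpoint positions
        b = s[springs[:, 1, 0], springs[:, 1, 1]]
        loss_spring = (((a - b).norm(dim=-1) - rest) ** 2).mean()
        loss_nll = -normal.log_prob(s).mean()
        loss_up = torch.clamp(mu[..., 2] - s[..., 2], min=0).mean()  # penalize downward moves
        loss = loss_nll + 5000 * loss_spring + 1000 * loss_up
        grad, = torch.autograd.grad(loss, s)
        with torch.no_grad():
            s -= rate * grad.sign()                      # step on the sign of the gradient
    return s.detach()
```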

V. SHAPE PREDICTION

A. Mesh Representation
For prediction, we represent each vertex as a tuple of five values: its coordinates in 3D space, and two values g_r, g_l specifying the vertex's grasp state, taking value 1 when the vertex is grasped by the right and left hand, respectively, and 0 otherwise. Shape s_i denotes the list of vertices at frame i. Our prediction network consists of an encoder, an LSTM-based transition module (Section V-B), and a decoder. The encoder compresses a given mesh representation s_i into a low-dimensional latent encoding h_i. The decoder does the opposite, recovering the full state representation ŝ_i from latent encoding h_i. Encoding into latent representation serves two purposes: reduction of computational cost, and acquisition of a representation format that facilitates prediction. By training the modules end-to-end, we obtain a latent encoding format optimized for prediction, while also allowing recovery of the full shape representation. The encoder consists of 3 convolutional layers with channel depths of 64, 128, 256, kernel sizes 3 × 3, 3 × 3, 5 × 5, and strides 2, 3, 1, followed by a dense layer with an output dimensionality of 256. All layers use the tanh activation function. Input is presented as a 32 × 32 × 5 volume, with each 1 × 1 × 5 subvolume representing one vertex. The decoder largely mirrors this architecture, using transposed convolution instead of convolution layers, and using linear activation on its output neurons. Its output is a 32 × 32 × 3 volume specifying the predicted coordinates for each vertex.
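A PyTorch sketch of the encoder and decoder follows; the padding and output-padding values are chosen here so that the spatial dimensions work out, and are our assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Mesh (5 values per vertex, channels-first) -> 256-d latent code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, stride=2, padding=1), nn.Tanh(),    # 32x32 -> 16x16
            nn.Conv2d(64, 128, kernel_size=3, stride=3, padding=1), nn.Tanh(),  # 16x16 -> 6x6
            nn.Conv2d(128, 256, kernel_size=5, stride=1, padding=2), nn.Tanh(), # 6x6 -> 6x6
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 256), nn.Tanh(),
        )

    def forward(self, s):        # s: (batch, 5, 32, 32)
        return self.net(s)       # h: (batch, 256)

class Decoder(nn.Module):
    """256-d latent code -> predicted vertex coordinates (3 values per vertex)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(256, 256 * 6 * 6), nn.Tanh())
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, kernel_size=5, stride=1, padding=2), nn.Tanh(),
            nn.ConvTranspose2d(128, 64, kernel_size=3, stride=3, padding=1), nn.Tanh(),
            nn.ConvTranspose2d(64, 3, kernel_size=3, stride=2, padding=1, output_padding=1),
        )                        # final layer is linear (no activation)

    def forward(self, h):
        return self.net(self.fc(h).view(-1, 256, 6, 6))  # (batch, 3, 32, 32)
```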

B. DNN-Based Shape Transition Prediction
The LSTM module predicts the shape evolution for a given trajectory. The net consists of 10 layers containing 256 LSTM units each, and has six input neurons. We initialize the internal state (i.e., activation values) of all layers with latent encoding h_i from the encoder. We then iterate through Δm_{i:n}, feeding its elements in order to the input neurons. The sequence of internal states of the last layer observed over the course of this process gives the latent encoding sequence h_{i+1:n} describing the shape evolution in latent form. By passing this sequence through the decoder, we obtain the shape sequence ŝ_{i+1:n}.
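A sketch of this rollout follows. How the 256-dimensional encoding seeds the LSTM's internal state (e.g., whether cell states are also initialized) is an implementation detail; the version below seeds the hidden states and zeroes the cell states:

```python
import torch
import torch.nn as nn

class TransitionLSTM(nn.Module):
    """Rolls a latent cloth state forward under a per-frame hand-motion sequence."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=256, num_layers=10, batch_first=True)

    def forward(self, h_i, dm):
        # h_i: (batch, 256) latent encoding; dm: (batch, T, 6) motion deltas, 3 per hand
        h0 = h_i.unsqueeze(0).repeat(10, 1, 1)    # seed every layer with the encoding
        c0 = torch.zeros_like(h0)                 # cell-state initialization: assumption
        out, _ = self.lstm(dm, (h0, c0))          # out: (batch, T, 256), i.e. h_{i+1:n}
        return out
```

Passing each of the T latent vectors in `out` through the decoder yields ŝ_{i+1:n}.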

C. Shape Transition Learning
The network is trained on our dataset of manipulation examples. For each example presented during training, we select an integer i, 0 ≤ i < n, where n is the length of the example in frames. We let the net process s_i and Δm_{i:n−1} to obtain the prediction sequence ŝ_{i+1:n}, and compute the training loss (MSE) over s_{i+1:n} and ŝ_{i+1:n}. Using random starting points ensures that the shape transition can be predicted from any point in the manipulation. We apply rotational augmentation. We train the net using the Adam update rule [28] until the loss converges.

VI. TRAJECTORY OPTIMIZATION
We perform trajectory optimization by minimising the difference between the predicted manipulation outcome and the goal shape. We denote the sequence of predicted shapes as ŝ_{i+1}, …, ŝ_n and define the following cost function:

L(s*, ŝ_n, m_{i:n}) = MSE(s*, ŝ_n) + w · len(m_{i:n}),

where MSE denotes the mean squared error, len(m) calculates the physical length of trajectory m (averaged over the hands), and w is a weight parameter. Shapes are assignments of coordinate values to the vertices of the cloth mesh, so the MSE over two shapes can be calculated by computing the squared error over corresponding coordinate values and averaging them. We penalize trajectory length to avoid generating unnecessarily long trajectories. We can now formalize trajectory optimization as the minimization of L(s*, ŝ_n, m_{i:n}) over the remaining trajectory points. We optimize trajectories by obtaining the gradients of L(s*, ŝ_n, m_{i:n}) w.r.t. the inputs Δm_{i:n−1} through back-propagation [29], and adjusting the inputs using the Adam update rule [28]. The procedure for trajectory optimization is as follows:
1) If i = 0, initialize the trajectory points m_{i+1:n}.
2) Predict the outcome ŝ_n for the current trajectory using the shape prediction network.
3) Calculate the loss over predicted outcome ŝ_n and goal shape s*, and obtain gradients by back-propagation.
4) Update m_{i+1:n} along the gradients.
Steps 2 through 4 are repeated until the loss falls below a given threshold or a set number of loops has passed. Trajectory length n is derived from the trajectory used to initialize the optimization process by adding 40, and remains fixed during optimization. The additional frames allow the cloth shape to stabilize after being released by the grippers. During stabilisation, manipulation input is blank (all-zero), but prediction of shape development continues.

Fig. 5. Example calculation of trajectories (blue lines) and release points (r_1, r_2) from grasp points (g^C_1, g^C_2) and displacement vector d. The gray shape in the background is the cloth silhouette (2D projection).
Given goal shape s*, current shape s_i, and trajectory points m_{i:n}, optimization can be performed at any time in the manipulation process. We trigger the optimization process when the divergence between the predicted outcome ŝ_n and the goal state exceeds a given admissibility threshold.
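Combining prediction and optimization, the correction step can be sketched as follows, reusing the TransitionLSTM and Decoder sketches above (the weight, iteration cap, and threshold values shown are placeholders, not the values used in our experiments):

```python
import torch
import torch.nn.functional as F

def correct_trajectory(lstm, decoder, h_i, dm, s_goal, w=0.01, max_iters=100, tol=1e-3):
    """Adjust the remaining motion deltas dm (T, 6) so the predicted outcome matches s_goal.
    h_i: (1, 256) latent encoding of the current shape; s_goal: (1, 3, 32, 32) goal shape."""
    dm = dm.detach().clone().unsqueeze(0).requires_grad_(True)    # (1, T, 6)
    opt = torch.optim.Adam([dm])
    for _ in range(max_iters):
        opt.zero_grad()
        h_seq = lstm(h_i, dm)                     # predicted latent shape evolution
        s_pred = decoder(h_seq[:, -1])            # predicted final shape
        # Physical path length, averaged over the two hands (3 delta values per hand).
        length = (dm[..., :3].norm(dim=-1).sum() + dm[..., 3:].norm(dim=-1).sum()) / 2
        loss = F.mse_loss(s_pred, s_goal) + w * length
        if loss.item() < tol:
            break
        loss.backward()
        opt.step()
    return dm.detach().squeeze(0)
```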

VII. EXPERIMENTS

A. Manipulation Format
We consider single- and dual-handed manipulations of a square cloth. For data generation and trajectory initialization, we represent manipulations as real-valued vectors of length six, defining clean arc trajectories. The format is as follows. The first four values define two grasp points, g^G_1 and g^G_2, by their geodesic coordinates (u, v) on the cloth, with the cloth surface running from (−1, −1) to (1, 1). The second grasp point can take a null value, indicating a single-handed manipulation. The last two values define a displacement vector d given in 2D Cartesian coordinates (x, y). Given a cloth state (mesh), this representation determines grasp point trajectories as follows. We map geodesic grasp points g^G_i to Cartesian grasp points g^C_i using the cloth mesh. For two-handed grasps, we then compute the point p = (g^C_1 + g^C_2)/2 + d/2. Let m be the line through p perpendicular to d. The x and y coordinates of the Cartesian release points r_i are found by mirroring g^C_i over line m in the XY-plane; a sketch of this construction follows below. Fig. 5 shows an example. For single-handed grasps, the x and y coordinates of the single release point r_1 are given by g^C_1 + d. The z coordinate of r_i is given by g^C_i.z + min(0.2, k/2), where k is the distance between g^C_i and r_i in the XY-plane. We find the circle c centred at height g^C_i.z, perpendicular to the XY plane, and passing through g^C_i and r_i. The shortest segment of c connecting g^C_i and r_i defines the trajectory for point i.
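The mirroring construction for two-handed release points can be sketched as follows (XY components only; the choice p = midpoint + d/2 reflects our reading of the construction):

```python
import numpy as np

def release_points(g1, g2, d):
    """XY coordinates of release points for a two-handed grasp.
    g1, g2: Cartesian grasp points (x, y); d: displacement vector (x, y)."""
    p = (g1 + g2) / 2 + d / 2                 # point defining the mirror line
    d_hat = d / np.linalg.norm(d)
    # Mirroring over the line through p perpendicular to d flips only the component
    # of each grasp point along d, displacing the pair by d on average.
    return [g + 2 * np.dot(p - g, d_hat) * d_hat for g in (g1, g2)]
```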

B. Dataset Generation
We generate a dataset of 3691 manipulation sequences consisting of 3 manipulations each, using the ARCSim cloth simulator [30], [31]. Each sequence starts with the cloth laid out flat. Single- and dual-handed manipulations are generated randomly in a proportion of 1:2. Grasp points are selected randomly from the convex corners of the projection of the cloth shape onto the XY plane. Candidate points are found by means of corner detection [32]. Displacement vectors are generated randomly with a maximum length of 2.0 (measured in the normalized workspace of Section IV). The cloth mesh is stored at each frame of the simulation. The total number of frames in the dataset is

C. Environment for Simulation Experiments
For evaluating trajectory correction in simulation, we set the simulation up to accept frame-by-frame input in the form of 3D movement vectors for the grasped points. We implement three evaluation scenarios.
1) Undisturbed manipulation. In this scenario, the cloth behaves exactly as in the dataset. This should allow accurate prediction throughout, so manipulation should succeed with minimal correction.
2) Manipulation with cloth slipping. This scenario reduces the friction between cloth and work surface, causing the cloth to slide somewhat over the course of the manipulation and necessitating correction.
3) Manipulation with external interference. This scenario mimics a situation where an external force moves the work surface during manipulation (or, equivalently, pulls the cloth over the work surface). Movement is in the direction of the displacement vector and runs from the 10th frame to halfway through the manipulation.

D. Mesh Estimation Results
We evaluate shape estimation on the shape data of 100 manipulation sequences (300 manipulations, 61660 frames) from the test set and 100 sequences (300 manipulations, 62226 frames) from the training set. For consecutive frames, we initialize refinement with the refinement result of the preceding frame, unless the average vertex distance from the μ component of the estimate exceeds 0.1 times the cloth length.
Estimation accuracy is reported in Table I, time cost in Table II, and example results are shown in Fig. 6. Errors are measured as the Euclidean distance between estimated and actual vertex positions, with the length of the side of the cloth as the unit. Errors vary with the complexity of the cloth shape. Since we focus here on manipulations starting from the spread-out state, we report errors for the first manipulation in each sequence (marked as "first step" in Table I) in addition to the errors over the full sets. The distance between the μ component of the estimation and the target averages to about 1/20th of the cloth length for the full set, and slightly less than 1/60th for first manipulations. Refinement does not notably improve accuracy. This is expected: the net is trained to minimize error exclusively, whereas refinement also considers shape realism. For example, estimates often omit wrinkles, indicating them as regions of increased uncertainty instead. This reduces the surface of the mesh described by the μ component. Refinement restores cloth surface by producing wrinkles in areas of increased uncertainty, moving vertices away from their μ values. Unless the wrinkles happen to align closely with the actual wrinkles, this increases the error.
All training data is generated with identical material properties. To assess how these properties affect accuracy, we generate three single-step test sets with different material properties. From the material definitions included with ARCSim, we selected the t-shirt, sweater, and swimsuit materials (for details about the material definitions we refer to [33]). Cloth topology and other settings were unchanged. Estimation accuracy is shown in Table I. We observe some deterioration, but errors remain within 1/40 of the cloth length. Domain randomization could likely further improve robustness to material variation.

E. Shape Prediction Results
We evaluate shape prediction on the full test set and 300 manipulation examples from the training set. For each example we perform prediction over horizons of 50, 100, and 200 frames, which are representative horizons for real manipulation scenarios. We select the starting frame randomly from the range [0, n_example − n_horizon], where n_example is the full number of frames in the example and n_horizon indicates how far into the future we are predicting (i.e., from starting frame i, we predict frame i + n_horizon). When n_example < n_horizon, we predict the full example starting at frame 0. Errors measure the average distance between corresponding vertex pairs in ground truth and prediction, with the cloth length as the unit.

F. Trajectory Correction Results
We performed folding in simulation to evaluate trajectory correction. We set the initial shape s_0 and goal shape s*, and initialize the trajectory points m_{0:n} to evenly divide a round arc trajectory defined in the format described in Section VII-A. To isolate trajectory correction, we omit shape estimation in this experiment, using meshes from the simulation directly. We evaluate the three scenarios described in Section VII-C. Each scenario is run for 50 manipulations from the test set (first-in-sequence manipulations). For comparison, we also run each manipulation without trajectory correction. Table IV shows accuracy averages. We observe that in each scenario, trajectory correction reduces the error between outcome and goal shape. Time cost for correction ranges from 1 to 20 seconds, depending on the frame length of the remaining trajectory. Fig. 7 shows example results. In Fig. 7a we see that correction has little influence on the result in scenario 1: close approximation of the goal is achieved with and without correction. The predicted error w.r.t. the goal shape remains near-constant over the course of the manipulation. This indicates that under conditions consistent with the training data, we can accurately predict the course of the manipulation. Scenarios 2 and 3 diverge from the training conditions, resulting in large errors if no trajectory correction is performed, as seen in the results for scenario 3 shown in Fig. 7b. The error graph shows that the widening gap with the goal shape is evident from the network's predictions during manipulation (dotted line). When trajectory correction is enabled, the predicted error is effectively suppressed (solid line). This results in a much better approximation of the goal state.

G. Hardware Experiments
We performed cloth folding on robot hardware, using HIRO (Kawada Robotics), a dual-armed robot with six degrees of freedom per arm. The cloth is observed through an RGBD sensor (Azure Kinect) placed opposite the robot. We use square cloths of 24 cm by 24 cm. Fig. 8 shows our setup. RGBD input is processed as follows. We isolate the cloth area using GrabCut [34] and contour detection, as shown in Fig. 8b. We then obtain a 3D point cloud of the cloth by retrieving the depth values from the region of the depth image corresponding to the cloth area. We convert the point cloud to a voxel representation, and estimate the mesh using the VtM net and the refinement procedure, resulting in a shape estimate as shown in Fig. 8d. At the start of a manipulation, the experimenter instructs the robot to grasp the points indicated by the planned manipulation. Shape estimation and prediction are executed every 10 trajectory frames, and trajectory correction triggers when the MSE between the predicted outcome and the goal exceeds 0.003.
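The preprocessing can be sketched as follows, assuming the depth image is registered to the color image; a rough bounding box `rect` initializes GrabCut, and contour-based cleanup is omitted in favour of a plain pinhole back-projection:

```python
import cv2
import numpy as np

def cloth_point_cloud(color, depth, rect, fx, fy, cx, cy):
    """Isolate the cloth with GrabCut and back-project its depth pixels to 3D.
    rect: rough (x, y, w, h) box around the cloth; fx, fy, cx, cy: camera intrinsics."""
    mask = np.zeros(color.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(color, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    cloth = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)) & (depth > 0)
    v, u = np.nonzero(cloth)
    z = depth[v, u] / 1000.0                  # Azure Kinect depth is in millimeters
    x = (u - cx) * z / fx                     # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```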
We perform two experiments. The first evaluates performance on six goal shapes for three cloths: a typical hand towel, a square of stretchy t-shirt fabric, and a square of thin but fairly stiff woven fabric. We evaluate each combination of goal and fabric once, for a total of 18 cases. Results and scores are shown in Fig. 9. Scores represent the IoU (intersection over union) taken over the XY projection of the goal shape and a mask image of the result shape taken from top-down view. We obtain an average IoU of 0.904 overall. The t-shirt fabric is very supple and buckles easily. In preliminary experimentation, this sometimes led to shape collapse that is hard to recover from.
The second experiment evaluates manipulation with external disruptions. We manually disrupt the manipulation by pulling on the tablecloth covering the work desk halfway into the manipulation, thereby displacing the cloth by a few centimeters. We use the hand towel for these trials. Results and IoU scores are shown in Fig. 10 for experiments with and without trajectory correction. We observe that trajectory correction substantially improves outcomes, producing satisfactory approximations of the goal shapes.
Time cost for shape estimation and trajectory search did not differ significantly from the simulation experiments. Footage of hardware experiments can be found in the video accompanying this paper.

VIII. CONCLUSION & FUTURE WORK
We presented an approach for deformable object manipulation based on shape estimation, shape prediction, and trajectory correction. We quantitatively evaluated the individual components in simulation, and tested the integrated system on a selection of test cases in simulation and on robot hardware. We find that the system can successfully produce a variety of goal shapes, even when disruptions occur during the manipulation.
While our experiments are presently limited to square cloth, the system should in principle be applicable to alternative topologies with limited modification. However, the assumption that the topology is known can be limiting. Handling items for which we do not have sufficient prior topological knowledge will require integration with garment recognition and shape parametrization routines, as seen in e.g., [10].
Future directions include more extensive evaluation of the integrated system. We also pursue integration with our previously proposed multi-step manipulation planning system [23]. In this integration the planning system would provide initial trajectories and intermediate goal shapes for execution by the system proposed here. Lastly, we aim to extend the system to more topologically complex objects, such as clothes, and explore domain randomization strategies to further improve generalization over materials.