Robot Learning-Based Pipeline for Autonomous Reshaping of a Deformable Linear Object in Cluttered Backgrounds

In this work, the robotic manipulation of a highly Deformable Linear Object (DLO) is addressed by means of a sequence of pick-and-drop primitives driven by visual data. A decision-making process learns the optimal grasping location by exploiting deep Q-learning and finds the best releasing point from a path representation of the DLO shape. The system effectively combines a state-of-the-art algorithm for semantic segmentation, specifically designed for DLOs, with deep reinforcement learning. Experimental results show that our system is capable of manipulating a DLO into a variety of different shapes in a few steps. The intermediate steps of deformation that lead the object from its initial configuration to the target one are also provided and analyzed.


I. INTRODUCTION
Deformable and non-rigid objects are extensively manipulated in our everyday life. Paper, cloth, wires and food are only a few examples. Thus, deformable object manipulation is an essential skill for robots to enter human living and working environments. For instance, robots could become more involved in forestry operations [1] or healthcare activities for the elderly and disabled [2]. Many industrial applications also require robots able to manipulate non-rigid objects. The food industry, for example, could boost production [3], farming industries could use robots to manipulate plants and lessen the physical burden on workers [4], and the manufacturing industry could minimize labor costs [5]-[7]. Despite the numerous applications and the effort made by the robotics community [8], effective and reliable methods for deformable object manipulation remain exceptionally difficult to construct.
Earlier works on deformable object manipulation have sought open-loop strategies, which are ineffective since the material can shift in unpredictable ways [9]. Subsequent works attempted to develop various model-based strategies for controlling the object shape through robot manipulation [10], [11]. This is a common and effective approach with rigid objects, but it proves weak with non-rigid objects. Indeed, there is no obvious mapping from an observation of the object to a compact representation in which planning can be performed.
The associate editor coordinating the review of this manuscript and approving it for publication was Kin Kee Chow.
Deep Reinforcement Learning (DRL) is becoming more and more popular in robotic manipulation [12]-[17]. We are witnessing a race for the best DRL algorithm (in terms of flexibility and efficiency), one that would enable the robot to perform any kind of manipulation without engineering, only through its own experience of interacting with the environment [18]. However, even the state-of-the-art solutions based on DRL algorithms produce results [15]-[17] quite far from those achievable with classical engineering methods.
The challenge in these works is the development of an algorithm that can learn the joint torque trajectories for a generic task directly from the raw input images by means of a reward system. This process requires the agent to implicitly learn operations like inverse kinematics, trajectory planning, visual feature extraction, object detection and semantic segmentation, all of which are problems that have been extensively studied and efficiently solved in the literature.
In any case, to remove the need for a model, which is one of the major challenges when interacting with deformable objects, reinforcement learning seems a very reasonable and attractive approach [14], [19]. In fact, the optimization skills and the flexibility of DRL are essential to overcome the complex behaviour of deformable objects. However, given the current state of the art of DRL, a worthwhile solution is to lighten the learning load by integrating the DRL algorithms with other non-learning-based tools and engineering considerations, in order to make the most of their capabilities.

FIGURE 1. Pick-and-drop trajectories performed by the robot during every iteration of the proposed algorithm. It starts by acquiring an image of the table with the hand camera (a). The decision process selects a grasping point based on this image and computes the corresponding releasing point. In the second step the right arm moves toward the grasping point and stops 0.05 m above the table (b). Then it grasps the DLO at the selected point (c) and returns to the approach point (d). In the last two steps it approaches the releasing point (e) and opens the gripper (f). Finally it returns to the initial configuration, ready to start over by taking a new image (a).

FIGURE 2. Image segmentation algorithm for DLOs. The first step consists in segmenting the input image into adjacent sub-regions (superpixels) and creating an adjacency graph. From the extremity of the DLO, an arbitrary number of walks are started by moving into adjacent superpixels. Each walk moves forward along the adjacency graph by choosing the best next superpixel until it reaches the other extremity (the walk is then marked as 'closed'). Since a set of random walks is started, Ariadne keeps only the most likely one among those marked as 'closed'.
In this work, we build an integration between efficient engineered solutions and DRL algorithms. In particular, we restrict the use of DRL to the few tasks in which the process needs to predict the optimal interaction with the DLO, while we employ a stable inverse kinematics (IK) solver and a trajectory planner to perform the robot motion. Moreover, we lighten the information extraction from visual data with a state-of-the-art vision technique specifically designed for DLOs [20]. The presented work is motivated by the lack of effective solutions for DLO manipulation in tasks like untangling, spreading and routing a wire in assembly processes [7], [21]. Thus, our study aims to move a step forward in these challenging tasks by proposing a solution able to control the shape of a DLO in a cluttered environment using vision feedback.
Other authors have adopted approaches in line with our work. Boularias et al. [22] explore the use of DRL combined with well-known techniques for image segmentation for manipulating unknown objects. They propose a pipeline that first segments images into separate objects, predicts pushing and grasping actions, extracts hand-tuned features for each action, and then executes the action with the highest expected reward. In [23] and [24], to make training tractable on a real robot, the authors simplified the action space to a set of end-effector-driven motion primitives. They formulate the task as a pixel-wise labeling problem, where each image pixel (and image orientation) corresponds to a specific robot motion primitive executed on the 3D location of that pixel in the scene. Similarly to these works, we turn action prediction into a classification problem by discretizing the action space, and we define specific robot motion primitives (grasping and releasing).
The main contributions of our work are: (1) a novel robot learning-based system for autonomous deformation of a rope from/to a general shape using visual feedback, capable of working with any cluttered background; (2) a study on DLO deformation through a re-positioning sequence, in which we investigate different strategies to decide the grasp/release locations and their relations.
The remainder of this paper is structured as follows: section II reports an overview of previous works in this field; section III presents the experimental setup; section IV provides relevant background on reinforcement learning and Deep Q-Networks; section V describes the proposed method in detail; finally, in section VI, we examine the experiments and make some practical considerations.

II. RELATED WORKS
The problem of DLO manipulation has been studied before, with particular attention to tying knots. For instance, Yamakawa et al. [25] proposed a trajectory planning approach where a knot can be tied with a single robot arm at high speed. Mayer et al. [26] examined the use of recurrent neural networks to learn the knot-tying trajectories. Learning from Demonstration (LfD) was proposed by Lee et al. [27] to learn a function that maps pairs of corresponding points while minimizing a bending cost.
The insertion of a DLO in a hole is another widely investigated task, due to its many useful applications in assembly operations [6], [7]. Inaba and Inoue [28] developed a hand-eye system to insert a rope into a hole, using stereo vision to compute the relative position between the rope tip and the hole. The authors of [29] presented a method to insert a string through tight workspace openings online, using an approximate Jacobian to estimate the motion of the string. In [30] the insertion of a DLO into a hole is performed by analyzing the feedback coming from a tactile sensor by means of a recurrent neural network, which estimates the force acting on the wire itself.
Few works attempt to address the shape control of a DLO using a robot. Rambow et al. [31] used a two-arm robot to mount a deformable tube in a desired configuration based on a single teleoperated demonstration. Nair et al. [23] developed a learning-based system where a robot takes as input a sequence of images showing small deformations of a rope from an initial to the goal configuration, performed by a human demonstrator, and outputs a sequence of actions that leads the rope to the target shape, imitating the demonstrator's sequence of deformations. In [23] a Baxter robot was configured to collect interaction data with the rope for 500 hours; the data were later used to learn an inverse dynamics model, which is finally employed to imitate the human demonstration. Similarly, Sundaresan et al. [32] proposed an approach using imitation learning to arrange the configuration of a rope. They also show that the proposed solution can be used for a knotting task from human demonstration, assuming that the rope always starts from the same configuration containing a single loop. To break symmetry and enable a consistent correspondence mapping with the target shape, the authors of [32] and [33] added, respectively, a ball and a blue tape. Moreover, in [33] one end of the rope was also tied to a clamp attached to the table. In this work, instead, we use a perfectly symmetric rope, with both extremities free and identical. Another recent work on the same topic is [34], where the authors estimate a state-space representation of the rope, learn a dynamics model with an LSTM network and solve the rope manipulation with MPC. The weakest point of this solution is the assumption of a strong color contrast between the rope and the table for a correct state estimation.
Differently from the above-mentioned works, we address the problem of autonomous deformation of a rope from/to a general shape by training a reinforcement learning agent from scratch on a real robot, without: (1) the need to demonstrate the intermediate deformation steps at test time; (2) adding easily distinguishable objects to break the rope symmetry; (3) fixing any extremity to the table; (4) making any restrictions on the background color. In the sequence of Figure 4 we used a white background to make the images clearer and to make the rope easier for readers to see. However, as explained in subsection V-B, the system is designed to work on heterogeneous and confusing backgrounds.

III. EXPERIMENTAL SETUP
For the experiments described in the paper, we employ a Baxter robot by Rethink Robotics, which has a wrist-mounted gripper with two degrees of freedom (one rotational and one for closing/opening the two fingers). An RGB camera integrated in the robot hand provides visual data with a resolution of 960 × 600 px.
The setup is illustrated in Figure 1. Also in this case, a white background is used to make the images clearer and to make the rope easier for readers to see. However, it is worth remarking that, as explained in subsection V-B, the system is designed to work on heterogeneous and confusing backgrounds (see, e.g., Figure 2).
A perfectly symmetric DLO (i.e. a rope) lies free on a table, at a known height $z^*$, in front of the robot. We define a fixed camera pose over the table to acquire the input RGB image. The interaction of the robot with the rope is limited to two simple motion primitives consisting of grasping the rope at location $(u_1, v_1)$ and releasing it at location $(u_2, v_2)$, where $u_1, v_1, u_2, v_2$ are pixel coordinates in the input RGB image. Since both the table height and the hand-camera pose are known with respect to the robot base frame, we can estimate the grasping $(x_1, y_1, z^*)$ and releasing $(x_2, y_2, z^*)$ coordinates in the base frame.
As shown in Figure 1, during the grasping the robot first approaches the point $(x_1, y_1, z^*)$ from the top, with an offset of $\Delta z = 0.05$ m along the vertical z-axis and the gripper open. It moves down with a linear trajectory in the Cartesian space along z to $z^*$, then it closes the gripper's fingers before rising back to $z^* + \Delta z$. The motion sequence for dropping the rope is the same, with the intuitive difference that it starts with the gripper closed and opens it after the descent to $z^*$. In both motion primitives, the motion planning is automatically executed with Baxter's native IK solver.
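The two motion primitives can be summarized as a short list of Cartesian waypoints. The following is a minimal sketch under stated assumptions: the function name, the waypoint tuple layout and the gripper labels are hypothetical and introduced only for illustration; it does not call the Baxter IK solver or any robot API.

```python
from typing import List, Tuple

# (x, y, z, gripper state reached at this waypoint); layout is illustrative.
Waypoint = Tuple[float, float, float, str]


def pick_and_drop_waypoints(
    grasp_xy: Tuple[float, float],
    release_xy: Tuple[float, float],
    z_table: float,
    dz: float = 0.05,
) -> List[Waypoint]:
    """Return the waypoints of one pick-and-drop iteration in the base frame."""
    gx, gy = grasp_xy
    rx, ry = release_xy
    return [
        (gx, gy, z_table + dz, "open"),    # approach the grasping point from above
        (gx, gy, z_table, "closed"),       # descend along z, then close the fingers
        (gx, gy, z_table + dz, "closed"),  # rise back to the approach height
        (rx, ry, z_table + dz, "closed"),  # approach the releasing point from above
        (rx, ry, z_table, "open"),         # descend along z, then open the gripper
        (rx, ry, z_table + dz, "open"),    # rise back before returning home
    ]
```

Each waypoint would then be handed to the IK solver and trajectory planner; only the vertical offset of 0.05 m is taken from the text.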

IV. PRELIMINARIES ON DRL
We formulate the grasping task as a Markov decision process defined by $(S, A, p, r)$. The state space $S$ and the action space $A$, which represent respectively all possible combinations of current and target shape and all possible grasping points in the scene, are assumed to be discrete. In subsection V-B and subsection V-D we illustrate the discretization strategy and define the environment's state, while in subsection V-E we define the agent's actions. The unknown state transition probability $p(s_{t+1} \mid s_t, a_t)$ represents the probability density of the next state $s_{t+1}$ given the current state $s_t$ and current action $a_t$. For each state $s_t$ of the environment (i.e. the DLO) at time $t$, the agent (i.e. the robot) chooses and executes an action $a_t$ according to the policy $\pi(a_t \mid s_t)$, which implies the transition of the environment to a new state $s_{t+1}$ and the formulation of a reward $r_t$ as defined in subsection V-D. Under this formulation, the goal is to find an optimal policy $\pi^*$ that maximizes the expected sum of future rewards $\sum_{t=i}^{+\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r_t]$, where we use $\rho_\pi$ to denote the state or state-action marginals of the trajectory distribution induced by a policy $\pi(a_t \mid s_t)$.
In this work, we investigate the use of deep Q-learning, that is, Q-learning where a deep neural network is used to approximate the action-value function $Q^\pi(s_t, a_t)$, which measures the expected reward of taking action $a_t$ in state $s_t$ at time $t$. The network that approximates the Q-value function is called a Deep Q-Network (DQN) [35], and the training data are processed by using stochastic gradient updates. In Q-learning, a greedy policy $\pi(a_t \mid s_t)$ is trained to choose optimal actions by maximizing the action-value function $Q^\pi(s_t, a_t)$. Formally, our learning objective is to iteratively minimize the temporal difference error $\delta_t$ of $Q^\pi(s_t, a_t)$ to a fixed target value $y_t$, which combines the reward with the value of the next state weighted by $\gamma \in \mathbb{R}^+$, called the discount rate.
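For reference, the objective can be written in the standard deep Q-learning form. This is the textbook formulation and is given here as an assumption about the exact expressions intended above; $a'$ ranges over the actions available in the next state.

```latex
\delta_t = \left| Q^{\pi}(s_t, a_t) - y_t \right|,
\qquad
y_t = r_t + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a')
```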

V. METHOD

A. OVERVIEW
In this section, we describe our method to reshape a DLO using a single-arm robot. The proposed method relies on a DQN-based decision process that leverages an effective visual representation of the DLO shape. Current and target shapes are modeled using both a Key Points Path and a Spatial Grid Matrix, detailed in subsection V-B. The interaction with the DLO and its reshaping process take place through a sequence of grasping and releasing operations. The decision process, detailed in subsection V-C, learns to predict the best grasping point from the input image, while the corresponding releasing point is computed by projection. A sample sequence of steps that leads the DLO to the target shape is shown in Figure 4. Since the proposed method relies on a reinforcement learning algorithm, in subsection V-D and subsection V-E we formally define states, actions and rewards, while in subsection V-F we discuss the training and how we speed it up when starting from scratch.

FIGURE 3. The input raw image is processed by Ariadne (a). Since Ariadne needs to be initialized with the DLO extremities, the YOLO object detector is employed for this purpose. Ariadne produces a binary mask and a list of image points that describes a walk along the DLO. From the binary mask we create the spatial grid (b) and define the matrix $M^g_t$, while from the points path (c) we obtain the list of points $P_t$.

B. SHAPE REPRESENTATION
In order to effectively exploit its decision-making skills, the DRL agent has been integrated into a framework that lightens the learning load, as will be detailed in subsection V-C. This process is based on two representations of the DLO, both shown in Figure 3 and computed from the visual input. The first representation consists of a sorted sequence of key points belonging to the DLO. This representation allows us to effectively identify the releasing point on the target shape as a projection of the grasping point (taken from the current shape). In this way the agent needs to learn only the grasping point. In the second representation, a dimensionality reduction of the visual data is performed by mapping the segmentation mask into a spatial grid matrix. This matrix will later compose the state of the environment that the agent uses to predict the best action to perform.
Both representations rely on an algorithm called Ariadne [20], which is able to perform instance segmentation and b-spline modeling of DLOs simultaneously. The basic idea of Ariadne is to detect the DLOs as suitable walks over the Region Adjacency Graph built on a superpixel oversegmentation of the source image. Figure 2 shows an example of segmentation on a cluttered background.

1) KEY POINTS PATH
Ariadne segments the image into adjacent sub-regions (superpixels), then finds a walk that connects the two extremities of the DLO. This walk is essentially a sorted list of superpixels, which can be represented by their centroids; hence it can be converted into a sorted list of image points $P = [p_1, \ldots, p_n]$. Each walk needs to be initialized with seed superpixels located at the DLO's extremities. For this purpose, we deployed YOLO v2 [36], an object detection tool based on convolutional neural networks. We fine-tuned the YOLO v2 model, pretrained on ImageNet, on a dataset that we created with the black rope used in the experiments. To create this dataset we developed an automated labeling tool based on video sequences that allows us to easily gather massive amounts of training images in the field with minimal human intervention [37]. The tool is based on the idea that restricted camera movements (i.e. lift and rotate) lead to a controlled rigid transformation $A$ between two consecutive images $I_i$, $I_{i+1}$ such that $I_{i+1} = A I_i$. The same rigid transformation $A$ can be applied to each bounding box (BB) $b_i$ present in the image $I_i$, so as to obtain a new set of BBs such that $b_{i+1} = A b_i$. This procedure can be repeated for each consecutive pair of images in the video sequence, so the only human intervention required is the creation of the BB labels in the first frame $I_0$.
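The label propagation idea can be illustrated with a few lines of code. This is a sketch under assumptions, not the labeling tool of [37]: the rigid transformation is taken to be a planar rotation plus translation in homogeneous pixel coordinates, boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Matrix = List[List[float]]


def rigid_transform(theta: float, tx: float, ty: float) -> Matrix:
    """3x3 homogeneous matrix for a rotation by theta followed by a translation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]


def apply(A: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    """Apply A to a single pixel coordinate."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + A[0][2],
            A[1][0] * x + A[1][1] * y + A[1][2])


def propagate_box(A: Matrix, box: Box) -> Box:
    """Transform the four corners and re-fit an axis-aligned box around them."""
    x0, y0, x1, y1 = box
    moved = [apply(A, c) for c in [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return (min(xs), min(ys), max(xs), max(ys))


def propagate_labels(first_box: Box, transforms: Sequence[Matrix]) -> List[Box]:
    """Label every frame from the hand-made label of the first frame."""
    boxes = [first_box]
    for A in transforms:
        boxes.append(propagate_box(A, boxes[-1]))
    return boxes
```

Re-fitting an axis-aligned box after a rotation slightly enlarges it, so in this sketch the boxes grow over long rotating sequences; a real tool would need to handle that.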

2) SPATIAL GRID MODEL
A uniform space partitioning is performed on a binary image mask $I^{mask}_t \subseteq [0, 1]^{h \times w}$ obtained as the segmentation of the DLO from the input RGB image $I_t \subseteq [0, 255]^{3 \times h \times w}$. This partitioning consists of a set of $n_{rows} \times n_{cols}$ rectangular regions of pixels (image windows), each of constant size $\psi_h \times \psi_w = \frac{h}{n_{rows}} \times \frac{w}{n_{cols}}$. Each region $(i, j)$ is mapped into a scalar value $g^{i,j}_t$: the average of all the pixels of the region is binarized against a threshold $g_{Th}$, which gives 1 only when the average is greater than or equal to $g_{Th}$ and 0 otherwise. From these values we define the spatial grid matrix at time $t$ as $M^g_t = [g^{i,j}_t]_{i \in n_{rows}, j \in n_{cols}} \in [0, 1]^{n_{rows} \times n_{cols}}$, where every cell $(i, j)$ corresponds one-to-one to a region of the partition. To simplify position calculations, each region is represented by its center point.
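The mapping from mask to grid can be sketched as follows. This is a minimal illustration, assuming the mask is a list of rows of 0/1 values whose height and width are exact multiples of the grid size; the function name and the default threshold of 0.5 are assumptions, since the text does not state the threshold value.

```python
from typing import List


def spatial_grid(mask: List[List[int]], n_rows: int, n_cols: int,
                 g_th: float = 0.5) -> List[List[int]]:
    """Reduce a binary mask to an n_rows x n_cols binary grid.

    Each cell is 1 when the mean of the mask over the corresponding image
    window is at least g_th, and 0 otherwise.
    """
    h, w = len(mask), len(mask[0])
    psi_h, psi_w = h // n_rows, w // n_cols  # window size in pixels
    grid = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            total = sum(
                mask[u][v]
                for u in range(i * psi_h, (i + 1) * psi_h)
                for v in range(j * psi_w, (j + 1) * psi_w)
            )
            mean = total / (psi_h * psi_w)
            row.append(1 if mean >= g_th else 0)
        grid.append(row)
    return grid
```

With the 960 × 600 px camera image and the 16 × 10 grid used in the evaluation, each window would measure 60 × 60 px.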

C. DECISION PROCESS
The goal is to reshape a DLO by means of a sequence of grasp and release operations. To achieve this we employ the decision-making process schematically outlined in Figure 5. This process aims to determine the optimal grasping and releasing points, respectively $p_{grasp} \in \mathbb{R}^2$ and $p_{release} \in \mathbb{R}^2$, in order to maximize the visual overlap between the current and the target shapes, using the image of the current scene as input data.
A straightforward approach that we initially explored is to train an agent to jointly learn the two optimal locations $p_{grasp}$ and $p_{release}$ from the observation of the current scene $s_t$. However, the releasing location is strongly dependent on the grasping point, and the above-mentioned approach does not take into account this conditional nature of the two operations.
To address this problem, we could combine two agents in a cascade, where the first predicts the grasping point and the second the releasing point. In other words, instead of jointly learning the two locations with a unique policy $\pi([p_{grasp}, p_{release}] \mid s_t)$, we define two policies: one that learns the grasping point from the current state, $\pi_{grasp}(p_{grasp} \mid s_t)$, and another that learns the releasing point from both the state and the predicted grasping location, $\pi_{release}(p_{release} \mid s_t, p_{grasp})$. Nevertheless, training these policies is inefficient. In fact, the two operations would require two dedicated rewards, but we can only generate one reward after the release, which is proportional to the visual overlap between the current and the target shapes. Clearly, in this setup, the decision process cannot tell whether a high (or low) reward is due to $\pi_{grasp}$ or $\pi_{release}$.
Ultimately, to overcome this reward assignment issue as well, we propose to learn only the grasping point, while the releasing point is derived from the key points path representation of current and target shapes presented in subsection V-B. In fact, given the target shape path $P^* = [p^*_1, \ldots, p^*_m]$ and the current shape path $P_t = [p^t_1, \ldots, p^t_n]$, we can easily project a point from one path to the other. In particular we can project the grasping point $p^t_k$, taken from $P_t$, into a releasing point $p^*_s$ belonging to $P^*$, where $s = \lfloor k \frac{m}{n} + \frac{1}{2} \rfloor$. Having established that we can find the placing location with this projection strategy, one might wonder whether we can also choose the picking point simply from the representations, without learning it. The most trivial solution would be to grasp every time a random point among those that do not overlap the target shape. This is clearly very ineffective, since it neglects the interconnection among the key points of the DLO. Moreover, if we grasp only the free points, i.e. those that are not overlapped, we cannot ensure that the rope is really reshaped, since the algorithm would simply aim to clear all the free points by moving them to the target location. Hence, a trivial solution such as winding the rope in a small region that completely overlaps just a portion of the target would erroneously conclude the task once no free point is left. On the other hand, if we also grasp the points that are already overlapped, we risk making many pointless re-positioning actions. Another trivial approach would be to follow the order of the path, but in this case too we would not be taking into account the interlinked nature of the object. In fact, every time we place a point we might erroneously move those that we placed earlier.
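The projection from the current path to the target path amounts to rescaling the index and rounding it to the nearest integer. The sketch below assumes 1-based indices $k$ and $s$, as in the notation above; the function names and the final clamp to the valid range are additions for illustration and are not stated in the text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def project_index(k: int, n: int, m: int) -> int:
    """Map the 1-based index k on a path of n points to a path of m points."""
    s = math.floor(k * m / n + 0.5)
    # Safeguard (assumption): keep the result inside 1..m when m is much
    # smaller than n and the rounding would give 0.
    return min(max(s, 1), m)


def releasing_point(current_path: List[Point], target_path: List[Point],
                    k: int) -> Point:
    """Return the point of the target path associated with grasp index k."""
    s = project_index(k, len(current_path), len(target_path))
    return target_path[s - 1]  # convert the 1-based index to a list index
```

For example, grasping the last point of the current path always maps to the last point of the target path, whatever the two path lengths are.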
As already stated, in the proposed solution we develop a decision process based on a DQN agent that learns the optimal grasping cell (action) in a grid that combines the spatial information of both the target and the current shapes (state). As shown in Figure 5, the agent is wrapped into a structure that defines the agent's state by extracting the useful features from the input image and derives the grasping and releasing points from the agent's action. The task starts by providing a goal, which can be either a key points path and a spatial grid or a raw image of the rope in a target shape. In each iteration the system acquires a new RGB image of the scene. Then, the visual segmentation algorithm creates the binary mask and the key points path for the current shape. The mask is reduced to the corresponding spatial grid matrix, which is combined with that of the target, as defined in subsection V-D, to obtain the state. The agent predicts the best action for the current state, i.e. it provides the optimal grasping cell of the spatial grid, as detailed in subsection V-E. This action needs to be mapped into a grasping point with respect to the robot frame {B}, so first we find the point in the input image as the center of the region of pixels corresponding to the grasping cell, and then, knowing the camera pose, we transform it with respect to {B}. The releasing point, as explained previously in this section, is obtained from the key points path and the grasping point. The grasping angle is simply estimated with a line fit algorithm from the image window contained in the corresponding cell. Note that this estimation is affected by an ambiguity of $\pi$ between current and target shapes, which would imply an undesired twist when releasing the rope. To have a consistent angle between the two shapes we can use the sorting information of the key points in the path. By consistently defining the two extremities on the target and current shapes, the ambiguity is automatically solved. Obviously, a new problem arises in defining the extremities, since the DLO is perfectly symmetric. Let $A$ and $B$ be the end points of the current shape and $A^*$ and $B^*$ those of the target one. We arbitrarily assign $A$, define $A^*$ as the end point of the target closer to $A$, and take $B^*$ as the other one.

FIGURE 4. In each step $t$, we obtain the action $a_t$ as the coordinates of the highest value of $\phi_Q(s_t)$ (red star). On the input images we draw the grasping (red circle) and the releasing (green circle) points corresponding to the predicted action $a_t$. For each transition we also compute the reward $r(a_t, s_t, s_{t+1})$ as a function of the overlap score of $s_t$ (see Equation 3).

FIGURE 5. Scheme representing the proposed method. We highlight in green the robot side, which includes the image acquired by the hand camera and the deformation (grasping and releasing operations) executed on the DLO. The decision-making process is highlighted in yellow and the agent in red. The scheme also shows the agent's memory update, with dashed lines and grey boxes. In particular, the bottom part of the scheme reports the new state and new overlap that are obtained from the same scheme in the successive time step, from the new image acquired after the deformation.
Once the robot has performed the deformation as explained in section III, a new iteration starts. In the successive iteration, the reward, which is a function of the overlap score, and the new state are computed and sent to the agent, which records the transition (state, action, new state and reward) for learning. The task ends when the overlap score reaches a given threshold.

D. ENVIRONMENT
We model each state $s_t$ as a linear combination (Equation 1) of the spatial grid matrix of the scene at time $t$, $M^g_t$, and that of the target shape, $M^{g*}$. In this way the state is a matrix $s_t \in [0, 3]^{n_{rows} \times n_{cols}}$ where each element $s^{i,j}_t$ corresponds to the cell $(i, j)$ of the spatial grid built on the scene. Note that it can be rewritten element by element as a case distinction (Equation 2), where the overlapped regions are sets of image pixels belonging to both the target and the current shape.
VOLUME 9, 2021
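The construction of the state can be sketched as below. The weights are an assumption: the text only fixes that empty cells have value 0 and overlapped cells have value 3, so the choice of 1 for the current shape and 2 for the target shape (rather than the opposite) is hypothetical, as is the function name.

```python
from typing import List

Grid = List[List[int]]


def build_state(current_grid: Grid, target_grid: Grid,
                w_current: int = 1, w_target: int = 2) -> Grid:
    """Combine two binary grids into a state with values in {0, 1, 2, 3}.

    With the assumed weights: 0 = empty, 1 = current shape only,
    2 = target shape only, 3 = overlap of current and target shapes.
    """
    return [
        [w_current * c + w_target * t for c, t in zip(c_row, t_row)]
        for c_row, t_row in zip(current_grid, target_grid)
    ]
```

Any pair of distinct weights summing to 3 keeps the four cases distinguishable, which is what the agent needs from this encoding.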

E. AGENT
This work uses an implementation of deep Q-learning, where the DQN $\phi_Q(s_t)$ that approximates the Q-function $Q^\pi(s_t, a_t)$ is a convolutional neural network (CNN) schematically represented in Figure 6. Since both the state and the action space are quite simple by construction, simple network architectures can be used as well. The default architecture consists of five convolutional layers interleaved with nonlinear activation functions (ReLU) [33] and spatial batch normalization [38]. As already said, the input and the output of the DQN have the same size, that is, the size of the spatial grid, $n_{rows} \times n_{cols}$.

1) ACTIONS
The agent predicts a vector action $a_t = [i \; j]^\top$, where $i \in \mathbb{N}_{n_{rows}}$ and $j \in \mathbb{N}_{n_{cols}}$ are the coordinates of a target region in the spatial grid where the grasping is to be performed. These coordinates are easily inferred from the DQN's output $\phi_Q(\cdot) \in \mathbb{R}^{n_{rows} \times n_{cols}}$. In fact, the matrix $\phi_Q$ has the same size as the spatial grid matrix $M^g_t$, thus we have a one-to-one correspondence between the elements. This implies that we can take $\phi^{i,j}_Q(s_t)$, the value at coordinates $(i, j)$ of $\phi_Q(\cdot)$, as the approximated Q-value $Q^\pi(s_t, a_t)$ of the action $a_t = [i \; j]^\top$; in other words, $\phi^{i,j}_Q(s_t)$ can be considered as the expected future reward of grasping the DLO in the region $(i, j)$. Hence, the action that maximizes the Q-function is the pair of indices corresponding to the region with the highest Q-value across the spatial grid matrix: $\arg\max_{a'} Q^\pi(s_t, a') = \arg\max_{(i,j)} \phi^{i,j}_Q(s_t)$.
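Selecting the greedy action is an argmax over the Q-value map, followed by a lookup of the pixel at the center of the chosen region. The sketch below uses 0-based indices and plain nested lists instead of a network output; the function names and the `(u, v)` convention (horizontal coordinate first) are assumptions made for illustration.

```python
from typing import List, Tuple


def greedy_action(q_map: List[List[float]]) -> Tuple[int, int]:
    """Return the (i, j) cell with the highest predicted Q-value."""
    best, best_q = (0, 0), float("-inf")
    for i, row in enumerate(q_map):
        for j, q in enumerate(row):
            if q > best_q:
                best, best_q = (i, j), q
    return best


def cell_center_pixel(action: Tuple[int, int], psi_h: int,
                      psi_w: int) -> Tuple[float, float]:
    """Pixel (u, v) at the center of the image window of a grid cell."""
    i, j = action
    return (j * psi_w + psi_w / 2, i * psi_h + psi_h / 2)
```

The resulting pixel would then be transformed into the robot base frame using the known camera pose and table height.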

2) REWARD SHAPING
In our decision process we use a shaped reward. Compared to sparse reward functions, shaped reward functions require more design effort, as they incorporate knowledge of the problem into the reward structure, but in general they require less time to train, or at least they should speed up the training in a complex setup.
The reward scheme we designed is very simple. First of all, let us consider the state as written in Equation 2. We can easily assert that only the regions belonging to the current DLO shape are worth considering for grasping, which means that we can assign a reward $r(a_t, s_t, s_{t+1}) = 0$ to all the actions $a_t = [i \; j]^\top$ that lead the robot to regions whose value $s^{i,j}_t$ indicates that the current shape is absent. We define the overlap score at time $t$ as the ratio $n_{s_t = 3} / n_{s_t \neq 0}$, that is, the number of overlapped regions $n_{s_t = 3}$ over the number of all regions that are part of either the current or the target shape $n_{s_t \neq 0}$. Hence, when the overlap score increases from $s_t$ to $s_{t+1}$, the reward that we assign to a valid action is directly proportional to the increment in the overlap score (Equation 3), where $k \in \mathbb{R}$ is a gain that we set to $k = 10$. Moreover, to penalize the actions that cause an overlap loss, i.e. when the overlap score does not increase, we assign a constant reward $r(a_t, s_t, s_{t+1})$, greater than zero (since the action is still valid) but always smaller than the reward of Equation 3.
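The overlap score that drives the reward and the termination test can be computed directly from the state matrix. This is a minimal sketch assuming the state is a nested list with values in {0, 1, 2, 3}, where 3 marks overlapped cells and 0 empty cells; the function names and the convention of returning 0 for an empty state are assumptions.

```python
from typing import List


def overlap_score(state: List[List[int]]) -> float:
    """Ratio of overlapped cells to all non-empty cells of the state."""
    n_overlap = sum(1 for row in state for s in row if s == 3)
    n_nonempty = sum(1 for row in state for s in row if s != 0)
    return n_overlap / n_nonempty if n_nonempty else 0.0


def task_done(state: List[List[int]], threshold: float) -> bool:
    """The task ends when the overlap score reaches the given threshold."""
    return overlap_score(state) >= threshold
```

The score is 1 only when every cell occupied by the rope is also occupied by the target and vice versa.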

F. TRAINING AND TEST
We train the DQN using Adam optimization with a fixed learning rate of $10^{-4}$. Our models are implemented in PyTorch and trained with an NVIDIA GeForce GTX 1080 Ti on an Intel Core i7-7700K CPU clocked at 4.20 GHz. We train with prioritized experience replay [39] using stochastic rank-based prioritization, approximated with a power-law distribution. Our exploration strategy is $\epsilon$-greedy, with $\epsilon$ initialized at 0.7 and then annealed over training to 0.1. Our future discount $\gamma$ is constant at 0.5. The experience replay uses batches of size 132.
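The exploration strategy can be illustrated as follows. The start and end values of $\epsilon$ (0.7 and 0.1) come from the text, while the linear shape of the annealing schedule, the function names and the `total_steps` parameter are assumptions, since the schedule itself is not specified.

```python
import random
from typing import List, Tuple


def epsilon_at(step: int, total_steps: int,
               eps_start: float = 0.7, eps_end: float = 0.1) -> float:
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_map: List[List[float]], epsilon: float,
                   rng=random) -> Tuple[int, int]:
    """Pick a random grid cell with probability epsilon, else the best cell."""
    n_rows, n_cols = len(q_map), len(q_map[0])
    if rng.random() < epsilon:
        return (rng.randrange(n_rows), rng.randrange(n_cols))
    cells = ((i, j) for i in range(n_rows) for j in range(n_cols))
    return max(cells, key=lambda ij: q_map[ij[0]][ij[1]])
```

Passing a seeded `random.Random` instance as `rng` makes the exploration reproducible across runs.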
At the beginning of the training the DQN has random values and the agent can only take random actions in order to explore the environment. To speed this process up, human expertise can be used as the agent's prior knowledge or heuristic. Hence, in the first phase of the training a human demonstrator provides a sequence of pick points on the rope toward the target shape, while the agent only collects data (i.e. state, action, reward and new state). Ideally, once the process is over, the agent has learnt a raw but satisfactory policy. Thus, in the second phase of the training, the agent can act autonomously on the system and collect more self-generated data. Differently from other works like [23] or [32], the human demonstrations are used only to initialize the agent's experience and are no longer needed at test time.

FIGURE 7. First set of 5 experiments showing the DLO deformation steps performed by the robot using the proposed method. The images are binarized for visual clarity. The final shape corresponds to an overlap score greater than 90% (overlap score ≥ 0.9). The state cells are: black if s ...
The demonstration phase is useful for gathering a large amount of meaningful data, ideally covering a wide range of scenarios. Hence, the demonstrator should prevent the system from falling into an irrecoverable state (highly [...]) [...] can be done in the second phase of autonomous exploration, when the agent already has some rough experience of the task. Following this principle, we gradually increase the overlap score threshold Th up to 0.8, in steps of 0.1 every 50 transitions. We observed that the agent first learns to find the non-empty regions, taking into account that all the regions are linked because they are part of the same DLO and that some of them are already correctly aligned with the target. To avoid over-fitting the agent to a particular shape, we collected 30 target shapes and switch among them every n = 15 transitions or every time the overlap score reaches the given threshold.

FIGURE 8. Second set of 5 experiments showing the DLO deformation steps performed by the robot using the proposed method. The images are binarized for visual clarity. The final shape corresponds to an overlap score greater than 90% ($\Omega(s_t) \geq 0.9$).
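The threshold schedule and the target-switching rule described above can be summarized in a short sketch. The final threshold, the increment, and the two transition counts come from the text; the initial threshold `TH_START` is not stated there and is a purely hypothetical value, as are the function names.

```python
TH_MAX = 0.8       # final overlap score threshold, from the text
TH_STEP = 0.1      # threshold increment, from the text
TH_EVERY = 50      # transitions between two increments, from the text
SWITCH_EVERY = 15  # transitions before changing target shape, from the text
TH_START = 0.5     # ASSUMPTION: the initial threshold is not given in the text


def threshold_at(transition: int) -> float:
    """Overlap score threshold in force after a given number of transitions."""
    raised = TH_START + TH_STEP * (transition // TH_EVERY)
    # Rounding removes floating-point noise from the repeated increments.
    return round(min(TH_MAX, raised), 2)


def should_switch_target(transitions_on_target: int, score: float,
                         threshold: float) -> bool:
    """Change target after 15 transitions or once the threshold is reached."""
    return transitions_on_target >= SWITCH_EVERY or score >= threshold
```

Under these assumptions the threshold saturates at 0.8 after 150 transitions; a different starting value would only shift that point.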

VI. EVALUATION
In this section we evaluate the proposed method on our experimental setup. The spatial grid considered for the DLO shape representation has size $n_{cols} \times n_{rows} = 16 \times 10$. We collected 200 transitions by demonstration and another 300 during the autonomous exploration phase. We evaluate the performance by counting the number of steps required to reach an overlap score greater than 90% ($\Omega(s_t) > 0.9$). By running the experiment on 30 different scenarios, we estimate a success rate of 76.7% (23/30 tests) in achieving the goal in fewer than 12 steps and 86.7% (26/30 tests) in fewer than 18 steps. In 4/30 tests we declared a failure due to undesired tangling. Figures 7 and 8 report 10 of these experiments, showing the intermediate deformation steps performed by the robot and the agent's state. In these figures, the images have been binarized to improve readability. It is worth noting that the system learns to stretch the DLO in only 2 steps by simply adjusting the two extremities.
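The success rates above follow from counting, for each step budget, the tests that reached the goal in fewer steps than the budget. A minimal sketch of that computation is given below. The per-test step counts were not published, so the list used here contains placeholder values chosen only to reproduce the reported totals (23, 3 and 4 tests); it is not measured data, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def success_rate(steps_per_test: Sequence[Optional[int]], max_steps: int) -> float:
    """Fraction of tests that reached the goal in fewer than max_steps steps.

    A value of None marks a test declared as failed.
    """
    wins = sum(1 for s in steps_per_test if s is not None and s < max_steps)
    return wins / len(steps_per_test)


# PLACEHOLDER step counts (not measured data): 23 tests under 12 steps,
# 3 more under 18 steps, and 4 failures, matching the reported totals.
placeholder = [5] * 23 + [15] * 3 + [None] * 4

print(f"{success_rate(placeholder, 12):.1%}")  # 76.7%
print(f"{success_rate(placeholder, 18):.1%}")  # 86.7%
```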
The experimental data reported in Figure 4 show an example of correct learning, where the agent predicts as optimal grasping locations those that are not aligned with the reference shape. This behaviour is clearly visible in the first two steps and in the last one. Note also that the estimated Q-values are zero in the cells that are empty or occupied by the target shape only (not suitable for grasping). Moreover, while cells not aligned with the reference are frequently preferred to those already aligned, the latter are not excluded, as happens in the 4th step of Figure 4.

VII. CONCLUSION
In this work we studied the robotic manipulation of a deformable linear object lying on a table, i.e. a rope, using visual data. The proposed method relies on a decision making process that learns the optimal grasping location from the input visual data, by means of a DQN agent, and finds the best releasing point from a path representation of the rope shape. Other solutions were also examined and discarded for inefficiency or inadequacy. Unlike other studies in this field, the proposed technique needs only very limited human intervention during the initial training phase, while the system is able to learn autonomously how to deal with generic scenarios thereafter.
Experimental results of reshaping tests are provided, showing the intermediate steps of deformation that lead the rope from its initial configuration to the target, and the output of the DQN is examined at each step of a sample experiment. These results show that our system is capable of manipulating ropes into a variety of different shapes in a few steps.
Since our technique only assumes a Q-learning algorithm with CNNs, we believe it can be easily improved by applying state-of-the-art algorithms, e.g. HER [40], or by including some awareness of the sequential deformation through the integration of recurrent neural networks.