Grasping Objects Mixed With Towels

In logistics warehouse sorting, rubbish classification, and household services, scenarios exist in which rigid and soft objects are randomly piled together. In such situations, two major challenges arise in robotic picking tasks: the first is to distinguish rigid objects from soft objects, and the second is to grasp one object of each type at a time. In this study, we propose a novel robotic picking methodology for the grasping of objects mixed with towels. The proposed approach is based on a novel object detection method that can identify a rigid object placed in different directions using a rotational bounding box. Rigid objects can be separated from the mixed scene without object segmentation. Moreover, the grasping pose of a rigid object can be generated directly along its principal axis, without using a CAD model or specific pose detection method. The gripper opening width is determined according to the object size. Therefore, our method can detect whether other objects, particularly soft ones, exist around a rigid object. If no suitable grasping pose is available for the rigid objects, the grasping pose on a wrinkle of the towel is selected. The experiments demonstrate that our method can accomplish the picking task in scenes with mixed rigid and soft objects, thereby indicating its significance in robotic object detection and sorting.


I. INTRODUCTION
In terms of logistics warehouse sorting, rubbish classification, and household services, humans constantly need to deal with object sorting in mixed and disordered scenes. People, including children, can easily pick up a rigid object without pulling up its surrounding objects. Similarly, they can successfully grasp a soft object without catching the objects covered thereby, and can place the objects in the correct locations according to their categories. In contrast, it is rather difficult for a robot to accomplish such tasks. Robots demonstrate a high probability of picking both rigid and soft objects together, as illustrated in Fig. 1. In this study, we focus on object grasping when rigid objects are mixed with towels.
The associate editor coordinating the review of this manuscript and approving it for publication was Okyay Kaynak.
The critical problem to be solved is to distinguish the rigid objects from the towels. Because a towel is highly deformable, it is adaptive to the shape of the object that is placed therein. Thus, in a scene with mixed objects, it is difficult to separate the object from the towels by segmentation, particularly when the colors of the towel and object are similar. Therefore, an object detection method is used to locate the region of the rigid object, which also achieves the aim of separating the rigid object from the towels. In recent years, numerous object detection methods have been proposed, including Faster R-CNN [1], YOLO [2]-[4], and SSD [5]. The bounding box of the network output can be used to locate an object. However, object orientation is not considered in bounding boxes, which leads to detection results with multiple objects included in one bounding box.
Moreover, RGB images are used in these methods. If the color of the object is similar to the background, the detection results are unsatisfactory. In this study, inspired by the rotational bounding box introduced in [6], we extend YOLOv3 [4] and use RGB-D images as the input. The network then outputs rotational bounding boxes for the detected objects, which avoids the problem of multiple objects being contained in one bounding box and simplifies the subsequent object grasping.
It has been established that planning a robotic grasp for a soft object requires a pose detection method that differs from that for a rigid object, and vice versa. Although various impressive techniques have been developed in the field of grasping, they have mainly focused on scenes with only one type of target [7]-[10]. When rigid and soft objects are mixed together, new challenges inevitably arise in grasping the objects. In this study, we select towels, which are in common use in daily life, as an example of soft objects for the problem of grasping mixed rigid and soft objects. One challenge is that the objects may be partly or totally covered by the towels. Another challenge is the selection of an appropriate grasping pose such that there is only one type of object within the gripper.
Upon comprehensive consideration of the above challenges, a novel grasping method based on object detection with a rotational bounding box is introduced in our work. Objects placed with different orientations can be detected by extending the YOLOv3 [4] network. Rigid objects can be separated from complex backgrounds without object segmentation. Different grasping methods are automatically selected according to the object type. When grasping rigid objects, a series of grasping poses can be generated from the detected information directly along the principal axis, without using a CAD model or a specific pose detection method. When grasping towels, we use an improved version of the method from our previous work [11], which selects a grasping pose on the wrinkles of towels. This strategy ensures that only one type of object is within the gripper. The experimental results demonstrate that the objects can be grasped successfully in scenes with mixed objects and placed in the designated box. The main contributions of this study are as follows:
• Based on the object detection results, we introduce an algorithm to accomplish the efficient grasping of objects mixed with towels.
• We propose the use of RIoU instead of the traditional IoU to measure the accuracy of the prediction bounding box, which can display the angle change between two rotational bounding boxes in an improved manner.
• We consider GRIoU as the loss of the rotational bounding box regression, which can improve the object detection performance.

II. RELATED WORK
Robots may grasp multiple objects simultaneously in logistics warehouse sorting, rubbish classification, and household services. However, existing works have mainly focused on the scenario of grasping an object from a pile of rigid or soft objects. A detailed review of the previous studies on object detection and grasping is presented in the following sections.

A. OBJECT DETECTION
In recent years, numerous methods for object detection have been proposed. However, the bounding boxes used in these studies are horizontally placed rectangles, consisting of four coordinate parameters, such as those in YOLOv3 [4] and SSD [5]. Moreover, the bounding boxes obtained by the above are generally substantially larger than the object. In particular, the detection results are significantly degraded when the object is not placed horizontally or vertically. These shortcomings of the horizontally placed rectangles make them unsuitable for the robotic object sorting strategy, namely, detection first followed by grasping. In the field of remote sensing image detection, to overcome this limitation, Liu et al. [6] proposed a new detection technique that can output an additional angle parameter in comparison with the conventional method. Thus, the object orientation angle can be determined. On the basis of the Faster R-CNN, Xu et al. [12] proposed that, by learning the offset of four points on the non-rotation rectangle, a quadrilateral can be formed to detect multi-oriented objects.

B. GRASPING POSE DETERMINATION FOR RIGID OBJECTS
Gualtieri et al. [13] and ten Pas et al. [14] presented an impressive method for determining the candidate grasping pose by means of point clouds. With reference to the study of [14] and the classic work of PointNet [15], Liang et al. [16] introduced a new deep learning method known as PointNetGPD. The advantage of this method is that the spatial geometry of the point cloud in the graspable region can be understood better. Based on depth images, Mahler et al. [17] proposed an outstanding network known as GQ-CNN for grasping pose determination. More importantly, a relatively effective grasping pose for most daily objects can also be obtained using their method. Moreover, substantial research has been conducted on grasping pose determination, specifically for the Cornell grasping dataset [18]. Furthermore, the methods in [10], [19], and [20] can generate multiple rectangular grasp boxes on the image.

C. GRASPING POSE DETERMINATION FOR SOFT OBJECTS
At present, works on grasping soft objects, such as cloths and towels, focus on manipulation, including folding, hanging, and classification. In these studies, successful object grasping is the premise. Because cloths and towels may be randomly deformed, the grasping pose determination technique is quite different from that for rigid objects. An appropriate candidate grasping pose can be determined according to the aim of picking up the clothes and their present poses. As demonstrated in [21], any point can be selected as the grasping position. However, this approach often results in air grasping or grasping the target object together with other objects. In the work of [9], the center of the segmented region was selected as the candidate grasping point.
VOLUME 8, 2020
In [22] and [11], the point at the highest position and a point on a wrinkle of the clothes, respectively, were selected as the grasping position. In the procedure of folding, corner points are often considered as the grasping positions [9]. For further information on grasping clothes, please refer to the survey in [23].
In summary, when grasping rigid objects, existing works have mainly focused on the generation of grasping poses on the objects. Moreover, in manipulation on soft objects, the grasping pose is generated in different positions of the soft objects according to the different tasks. The two grasping methods are very different. Therefore, in scenes with mixed objects, it is very important to distinguish the two object types and to select different grasping poses according to the various object types. Furthermore, the special properties in the grasping of mixed rigid objects and towels must also be considered. For example, a grasping pose may be suitable for a rigid object; however, when part of a towel is near the object or under it, when the gripper is closed, the towel will also be grasped. Therefore, it is almost impossible to complete the task by using the above methods alone or by means of a simple combination of A, B, and C. In this study, based on the results of our new object detection method, we introduce a novel methodology to attempt to overcome the above challenges.

A. PROBLEM STATEMENT
In logistics warehouse sorting, rubbish classification, and household services, robots not only need to grasp an object of each type successfully, but also need to recognize the category of the objects. There are two main solutions to this problem: first grasping and then detecting the object, or conversely, first detecting and then grasping the object. The common problems are to distinguish a rigid object from a soft one and to achieve the grasping of mixed rigid and soft objects. In the first strategy, occlusion occurs when recognizing the object grasped by the hand. The second strategy also exhibits certain shortcomings. The traditional detection method always leads to the detection of more than one object in a bounding box, particularly, when items are mixed together. In this study, we propose a novel object detection method that can identify objects placed in different orientations with rotational boxes. The method can effectively solve the problem that multiple objects are contained in one bounding box. On this basis, not only can the object category be recognized, but a rigid object can also be separated from the other objects. Moreover, we propose a robotic picking methodology that can select different grasping methods according to various object types.

B. RIoU AND LOSS OF ROTATIONAL BOUNDING BOX REGRESSION
The extended YOLOv3 [4] is adopted to detect objects placed in different orientations. In comparison with the original YOLOv3 [4], the extended version outputs an additional angle that represents the orientation of an object. Because the previous anchor box is no longer applicable in the extended YOLOv3 [4], a rotational anchor box is introduced into the extended network, as illustrated in the accompanying figure. Apart from the additional angle information, the predicted parameters are exactly as in YOLOv3 [4]. Moreover, the network input is replaced with RGB-D, and RIoU is used to measure the overlap of two rotational bounding boxes.
After training, the parameters $(t_x, t_y, t_w, t_h, \theta)$ can be predicted. As described in YOLOv3 [4], the offset of the cell from the upper-left corner of the image is $(c_x, c_y)$, and the size and angle of the bounding box prior are $(p_w, p_h)$ and $p_\theta$, respectively. Thus, following YOLOv3 [4], the center and size of the predicted box can be expressed as

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h},$$

where $\sigma(x)$ represents a sigmoid function and $(b_x, b_y, b_w, b_h, \theta)$ represent the center, size, and angle of the predicted box, respectively.
The IoU is used to measure the accuracy of the detection results; a higher IoU indicates a more accurate prediction. However, for a rotational bounding box, the conventional IoU cannot effectively represent the angle change of the prediction box, as indicated in Fig. 4. Liu et al. [6] proposed the angle-related IoU (ArIoU) to define the IoU relating to the angle. It is represented as follows:

$$\mathrm{ArIoU}(A, B) = \frac{\mathrm{area}(\hat{A} \cap B)}{\mathrm{area}(\hat{A} \cup B)} \cos(\theta_A - \theta_B),$$

where $\theta_A$ and $\theta_B$ represent the rotational angles of bounding boxes $A$ and $B$, respectively. Here, $\hat{A}$ has the same parameters as $A$, except that its angle is equal to $\theta_B$. However, this definition has its own defect: when the difference between the two rotational angles is small, the value of ArIoU does not change significantly, as illustrated in Fig. 4. Therefore, in this study, we present another definition of IoU for the rotational bounding box: RIoU. It is more sensitive to a slight change in the angle and can demonstrate the change trend of the angle difference between the two intersecting rectangles. RIoU is computed from the rotational angles $\theta_A$ and $\theta_B$ of bounding boxes $A$ and $B$ and from the rectangles $A'$ and $B'$, which are obtained from $A$ and $B$ through the R1 and R2 transformations shown in Fig. 3. It can be observed from Fig. 4 that RIoU is more sensitive to small changes in the angle. If the predicted box and the ground truth overlap completely, the value of RIoU is 1. If the angle between the predicted box and the ground truth is $\pi/2$, or no overlap exists between them, the value of RIoU is 0.
When two rotational rectangles intersect, computing the intersection area is not as simple as for horizontal rectangles. To address this problem, we transform the two rotational rectangles ($A$ and $B$) at their initial positions into two horizontal rectangles ($A'$ and $B'$) by means of two rotational transformations, as illustrated in Fig. 3. Following the R1 transformation, we obtain the horizontal rectangle $A_1$ and the rotational rectangle $B_1$. The rotational angle of $B_1$ is the angle difference between $A$ and $B$. The intersection area of the two new rectangles is the same as the original intersection area. Moreover, the relative positions of the two new rectangles are the same as before. Thereafter, the horizontal rectangle $B'$ is obtained by the R2 transformation. In this case, $B'$ and $B_1$ have the same parameters, except that the rotational angle of $B'$ is 0, whereas $A'$ and $A_1$ are exactly the same. As $A'$ and $B'$ are now horizontal rectangles, their intersection area can easily be computed.
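The two transformations can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: boxes are assumed to be tuples `(cx, cy, w, h, theta)` with counter-clockwise angles in radians, R1 is assumed to rotate both boxes about the center of A, and the function names `r1_transform`, `r2_transform`, and `horizontal_iou` are hypothetical. The angle-dependent part of RIoU is not reproduced here; only the horizontal overlap of the transformed rectangles is computed.

```python
import math

def r1_transform(box_a, box_b):
    """R1 (assumed): rotate both boxes by -theta_a about A's center, so A becomes horizontal."""
    ax, ay, aw, ah, theta_a = box_a
    bx, by, bw, bh, theta_b = box_b
    c, s = math.cos(-theta_a), math.sin(-theta_a)
    dx, dy = bx - ax, by - ay
    a1 = (ax, ay, aw, ah, 0.0)
    # B keeps its size; its center moves with the rotation and its angle
    # becomes the angle difference between the two boxes.
    b1 = (ax + c * dx - s * dy, ay + s * dx + c * dy, bw, bh, theta_b - theta_a)
    return a1, b1

def r2_transform(b1):
    """R2: keep B1's center and size, but set its rotational angle to 0."""
    return (b1[0], b1[1], b1[2], b1[3], 0.0)

def horizontal_iou(box_a, box_b):
    """IoU of two horizontal boxes given as (cx, cy, w, h, 0)."""
    ax, ay, aw, ah, _ = box_a
    bx, by, bw, bh, _ = box_b
    ix = min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2)
    iy = min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2)
    inter = max(ix, 0.0) * max(iy, 0.0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two boxes with the same orientation (90 degrees), offset along the y axis.
box_a = (0.0, 0.0, 4.0, 2.0, math.pi / 2)
box_b = (0.0, 1.0, 4.0, 2.0, math.pi / 2)
a1, b1 = r1_transform(box_a, box_b)
a_prime, b_prime = a1, r2_transform(b1)
print(horizontal_iou(a_prime, b_prime))
```

In this example the two boxes are parallel, so R2 changes nothing and the horizontal IoU equals the true IoU of the original boxes; for non-parallel boxes the horizontal overlap is only one ingredient of the measure.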
The GIoU [24] considers both the overlapping and non-overlapping areas of two rectangles, and thus provides a better reflection of their coincidence. Although this concept is very useful for our work, the GIoU [24] does not effectively represent the change in the angle between two rotational rectangles, as indicated in Fig. 4. Based on GIoU, we propose GRIoU, which can represent not only the change in the angle between two rotational rectangles, but also the coincidence between them, as illustrated in Fig. 4. Therefore, we use $L_{GRIoU}$ as the loss in the training process, as presented in Algorithm 1.
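For reference, the standard GIoU of [24] for two horizontal rectangles can be sketched as follows. This is the plain GIoU, not the proposed GRIoU, whose angle-dependent form is not reproduced here; the corner format `(x1, y1, x2, y2)` and the function names `giou` and `giou_loss` are assumptions made for illustration, and both boxes are assumed to have positive area.

```python
def giou(box_a, box_b):
    """Standard GIoU of two horizontal boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest horizontal box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # IoU minus the fraction of the enclosing box not covered by the union.
    return inter / union - (enclose - union) / enclose

def giou_loss(box_a, box_b):
    """Loss form commonly used with GIoU: 1 - GIoU, in the range [0, 2]."""
    return 1.0 - giou(box_a, box_b)

print(giou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
print(giou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))
```

Unlike the plain IoU, this value keeps decreasing as two non-overlapping boxes move apart, which is what makes it usable as a regression loss.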

C. GRASPING POSE DETERMINATION
1) GRASPING POSE OF TOWELS
A towel is deformable and can take thousands of states in space. Therefore, it is very difficult to separate towels by segmentation. In this study, we still select the grasping pose on wrinkles as in previous work [11]. Moreover, the point cloud used still does not contain color information. The advantage is that this method is applicable to towels of any color. Unlike in the previous study, however, the most convex point is not used as the grasping point.
Instead, the center of the wrinkle is generally used as the grasping point, which can be obtained by PCA (Principal Component Analysis) of the candidate wrinkle point cloud. This can prevent the grasping points from being located in an unfeasible grasping area of the wrinkle, while effectively reducing air grasping. Moreover, the grasping direction of the two-fingered gripper is along the principal axis of the wrinkle, and the principal axis can be computed by PCA. The grasping direction generated in this manner is generally perpendicular to the wrinkle. Consequently, grasping failure caused by an incorrectly estimated grasping direction can effectively be avoided.

2) GRASPING POSE OF RIGID OBJECTS
In a scene with mixed rigid objects and towels, because towels adapt to the shape of rigid objects, the rigid objects may be surrounded or covered by towels. The grasping pose must be generated with this situation taken into account as far as possible. Therefore, the grasping poses are generated along the principal axis of the object, and collision detection is used to determine whether towels or other objects are near the object. Firstly, according to the detection results, an object is selected as the grasping candidate based on its score. Secondly, we map the 2D rotational bounding box to a 3D rotational bounding box, the benefit of which is that the rigid object can easily be separated from the mixed scene. Furthermore, this aids in generating the grasping pose along the principal axis of the object. Finally, through the PCA of the point cloud extracted from the 3D rotational bounding box, a series of grasping poses along the principal axis can be determined. The opening width of the two-fingered gripper is set according to the dimension of the object. According to the PCA and the size of the 3D rotational bounding box, we can obtain the center point $P_c = (p_{cx}, p_{cy}, p_{cz})$ of the extracted point cloud, the principal direction $(n_x, n_y, n_z)$, and the size $(w_r, h_r)$ of the object. Using this information, we can determine the grasping pose along the principal axis of the object. In this study, we sample three grasping points along the principal axis: $P_1 = (p_{cx}, p_{cy}, p_{cz})$, $P_2 = (p_{cx} - 0.3\,h_r n_x,\ p_{cy} - 0.3\,h_r n_y,\ p_{cz} - 0.3\,h_r n_z)$, and $P_3 = (p_{cx} + 0.3\,h_r n_x,\ p_{cy} + 0.3\,h_r n_y,\ p_{cz} + 0.3\,h_r n_z)$. Subsequently, collision detection identifies which of these grasping poses is suitable.
If the collision detection shows that other objects exist around the object, the object with the next-highest detection score is regarded as the subsequent candidate object.
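The sampling of the three grasping points can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the principal axis is estimated here by power iteration on the covariance matrix (an assumption standing in for whichever PCA routine was actually used), and the names `principal_axis` and `sample_grasp_points` are hypothetical.

```python
import math

def principal_axis(points, iters=100):
    """Return the centroid and dominant PCA direction of a 3D point cloud.

    Uses power iteration on the covariance matrix; this assumes the start
    vector is not orthogonal to the dominant direction.
    """
    n = len(points)
    center = tuple(sum(p[i] for p in points) / n for i in range(3))
    cov = [[sum((p[i] - center[i]) * (p[j] - center[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    v = (1.0 / math.sqrt(3.0),) * 3
    for _ in range(iters):
        w = tuple(sum(cov[i][j] * v[j] for j in range(3)) for i in range(3))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate cloud (all points identical)
            break
        v = tuple(x / norm for x in w)
    return center, v

def sample_grasp_points(center, axis, h_r, ratio=0.3):
    """P1 at the center, P2 and P3 shifted by -/+ ratio * h_r along the axis."""
    p1 = center
    p2 = tuple(c - a * ratio * h_r for c, a in zip(center, axis))
    p3 = tuple(c + a * ratio * h_r for c, a in zip(center, axis))
    return p1, p2, p3

# Synthetic elongated object lying along the x axis (hypothetical data).
cloud = [(float(x), float(y), 0.0) for x in range(-5, 6) for y in (-1, 0, 1)]
center, axis = principal_axis(cloud)
p1, p2, p3 = sample_grasp_points(center, axis, h_r=10.0)
print(p1, p2, p3)
```

Each of the three candidates would then be passed to the collision check, and the first collision-free one used for grasping.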

3) COLLISION DETECTION OF GRASPING POSE
A towel is adaptive to the two-fingered gripper; it is therefore unnecessary to conduct collision detection on towels, and collision detection is required only for the rigid objects in our work. Let $V(L) \subset \mathbb{R}^3$ and $V(R) \subset \mathbb{R}^3$ represent the volumes of the left and right fingers of a parallel-jaw gripper, respectively. Both $V(L)$ and $V(R)$ are represented as 3D rotational bounding boxes. Let $N(\cdot)$ denote the number of points inside such a 3D rotational bounding box. If $N(V(L)) \le C_T$ and $N(V(R)) \le C_T$, where $C_T$ is a constant set to 60, the grasping pose is considered collision-free. The constant $C_T$ is used to tolerate noise from the camera in the collision detection.
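A minimal sketch of this point-counting test is given below. The representation of a finger volume as a `(center, axes, half_extents)` triple with orthonormal axes, and the names `count_points_in_box` and `is_collision_free`, are assumptions made for illustration; only the threshold of 60 points comes from the text.

```python
def count_points_in_box(points, center, axes, half_extents):
    """Count the points inside an oriented 3D box.

    `axes` are three orthonormal direction vectors of the box and
    `half_extents` the half side lengths along them (assumed layout).
    """
    count = 0
    for p in points:
        d = [p[i] - center[i] for i in range(3)]
        inside = all(
            abs(sum(d[i] * axis[i] for i in range(3))) <= half
            for axis, half in zip(axes, half_extents)
        )
        if inside:
            count += 1
    return count

def is_collision_free(points, left_finger, right_finger, c_t=60):
    """A pose is collision-free if each finger volume holds at most c_t points."""
    return (count_points_in_box(points, *left_finger) <= c_t
            and count_points_in_box(points, *right_finger) <= c_t)

# Hypothetical finger volumes, 10 cm apart, axis-aligned for simplicity.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
left = ((0.0, 0.0, 0.0), identity, (0.01, 0.01, 0.05))
right = ((0.1, 0.0, 0.0), identity, (0.01, 0.01, 0.05))

clutter = [(0.0, 0.0, 0.0005 * k) for k in range(100)]  # 100 points in the left finger
print(is_collision_free(clutter, left, right))
print(is_collision_free(clutter[:10], left, right))
```

The threshold makes the test tolerant to a small number of stray depth points, at the cost of ignoring very thin obstacles.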

A. DETECTION RESULTS
Because the color of the objects and the background may be similar, it is difficult to distinguish between them if only RGB is used as the input data of the network. Therefore, we added an extra depth channel to the network. Several detection results for the different IoU and loss values are presented in Table 1 and Fig. 5. Compared to ArIoU, RIoU improved the detection performance with the same MSE loss as that used in YOLOv3 [4]. Moreover, the loss $L_{GRIoU}$ yielded a further improvement. Fig. 5 shows that our method predicted the angles of the objects more accurately.

B. RIGID OBJECT GRASPING
The grasp pose of the rigid object and the opening width of the two-fingered gripper ($1.5\,w_r$, where $w_r$ is the width of the object) could be determined based on the object detection information. The grasping pose lay along the principal axis of the object, as illustrated in Figs. 6(c) and (e). Following collision detection, if one of the fingers was shown in red, this indicated that there were other objects near it. Otherwise, the fingers were shown in blue, indicating that this was a proper grasping pose, as illustrated in Figs. 6(e) and (f).
If the grasping poses of the candidate object were all in collision, the object with the second-highest score became the candidate object, as indicated in Fig. 6(e). Following collision detection, the grasping pose was feasible for the third candidate object; thus, this grasping pose was selected.
Moreover, for small-sized objects, collisions could occur in all of the grasp poses along the principal axis direction, as illustrated in Fig. 8(c). In this case, the grasping direction could be rotated by 90° and grasping could be attempted again, as indicated in Fig. 8(c). Furthermore, it had to be ensured that the resulting opening width of the two-fingered gripper ($1.2\,h_r$, where $h_r$ is the length of the object) was no greater than the maximum opening limit. When the objects were close together, it was difficult to grasp them successfully without other actions, as illustrated in Fig. 7(a). In this case, we adopted the strategy of pushing and grasping to address the problem, as shown in Fig. 7. The pushed object was selected based on its score, and the pushing direction was along the principal axis of the pushed object. The pushing end pose $P_e$ was set to $P_c$. The pushing initial pose was expressed as $P_i = (p_{cx} + 0.5\,h_r n_x,\ p_{cy} + 0.5\,h_r n_y,\ p_{cz})$.
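The arithmetic of the push and of the rotated grasp can be written out as a short sketch. The function names `push_poses` and `rotated_grasp_width` and the example values are hypothetical; only the factors 0.5 and 1.2 and the rule that the width must not exceed the maximum opening come from the text.

```python
def push_poses(p_c, n, h_r, ratio=0.5):
    """Return (initial, end) push positions for an object centered at p_c.

    The push starts ratio * h_r away from the center along the principal
    direction n in the x-y plane, keeps the center height, and ends at p_c.
    """
    p_i = (p_c[0] + n[0] * ratio * h_r,
           p_c[1] + n[1] * ratio * h_r,
           p_c[2])
    p_e = p_c
    return p_i, p_e

def rotated_grasp_width(h_r, max_opening, factor=1.2):
    """Opening width for the 90-degree rotated grasp, or None if it exceeds the limit."""
    width = factor * h_r
    return width if width <= max_opening else None

# Hypothetical object: center in meters, principal axis along x, length 0.2 m.
start, end = push_poses((0.1, 0.2, 0.05), (1.0, 0.0, 0.0), h_r=0.2)
print(start, end)
print(rotated_grasp_width(h_r=0.05, max_opening=0.085))
print(rotated_grasp_width(h_r=0.2, max_opening=0.085))
```

Returning `None` for an infeasible width lets the caller fall back to pushing or to the next candidate object.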

C. TOWEL GRASPING
Following the above grasping pose determination, if an appropriate grasping pose was still not detected, a towel would be grasped first (see Fig. 8(c)). An overview of grasping a towel is presented in Fig. 8. The opening distance of the two-fingered gripper was set to be slightly larger than the width of the wrinkle (in our experiments, the opening distance of the gripper was $g_w = 30$ mm). It should also be noted that the opening width of the two-fingered gripper jaw was very small and its end joints were passive joints. Therefore, even if there were objects under the towel, they would not be grasped.
In certain very special and rare cases, when the geometry of the covered objects was extremely clear, we could also detect common geometries, such as rectangles and cylinders, as illustrated in Fig. 9. If the detected objects contained some regular shapes, as shown in Fig. 9(a), this indicated that certain objects were covered by towels. It was only necessary to exclude the region marked as unfeasible for grasping. Let $V(sob) \subset \mathbb{R}^3$ represent the rotational bounding boxes for the specific objects (such as a cylinder). Let $V(rob) \subset \mathbb{R}^3$ represent the rotational bounding boxes for the rigid objects (such as toothpaste and coke). Let $V'(rob)$ denote $V(rob)$ with its size $(w_r, h_r)$ multiplied by 1.5. Let $C(sob) \subset \mathbb{R}^3$ represent the point cloud in $V(sob)$, and let $C(rob) \subset \mathbb{R}^3$ represent the point cloud in $V(rob)$. Because deviations could exist between the predicted and actual angles, the point cloud $C'(rob) \subset \mathbb{R}^3$ in $V'(rob)$ replaced $C(rob)$ as the point cloud of the rigid objects. Therefore, the practicable grasping region in the point cloud $C$ could be expressed as $C(g) = C \cap \overline{C(sob)} \cap \overline{C'(rob)}$. An overview of grasping a towel in this case is presented in Fig. 9.
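The exclusion of unfeasible regions can be sketched as a simple filter over the point cloud. This sketch makes several assumptions that are not in the source: boxes are tested by their top-view footprint `(cx, cy, w, h, theta)` rather than as full 3D boxes, the excluded regions are treated as set complements, and the names `in_rotated_rect` and `feasible_points` are hypothetical.

```python
import math

def in_rotated_rect(point, rect, scale=1.0):
    """True if the (x, y) of `point` lies in the rotated rectangle, enlarged by `scale`."""
    cx, cy, w, h, theta = rect
    dx, dy = point[0] - cx, point[1] - cy
    c, s = math.cos(theta), math.sin(theta)
    u = c * dx + s * dy    # coordinate along the rectangle's width axis
    v = -s * dx + c * dy   # coordinate along the rectangle's height axis
    return abs(u) <= scale * w / 2 and abs(v) <= scale * h / 2

def feasible_points(cloud, special_boxes, rigid_boxes, rigid_scale=1.5):
    """Keep points outside every special-object box and every enlarged rigid-object box."""
    keep = []
    for p in cloud:
        if any(in_rotated_rect(p, r) for r in special_boxes):
            continue
        if any(in_rotated_rect(p, r, rigid_scale) for r in rigid_boxes):
            continue
        keep.append(p)
    return keep

# Hypothetical scene: one detected rigid object and one covered regular shape.
cloud = [(0.0, 0.0, 0.1), (1.4, 0.0, 0.1), (5.0, 5.0, 0.1), (10.0, 0.0, 0.0)]
rigid = [(0.0, 0.0, 2.0, 2.0, 0.0)]
special = [(5.0, 5.0, 1.0, 1.0, 0.0)]
print(feasible_points(cloud, special, rigid))
```

The point at x = 1.4 lies outside the original rigid box but inside the box enlarged by 1.5, which is the margin intended to absorb errors in the predicted angle.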

V. CONCLUSIONS
In this study, based on the results of object detection, we introduced a picking strategy that automatically selects a feasible algorithm for grasping pose determination in a mixed scene, and that can efficiently grasp objects mixed with towels. Considering the defects of the traditional IoU and ArIoU, we proposed RIoU to measure the accuracy of the detection results. Our method uses $L_{GRIoU}$ as the loss, which can improve the object detection performance. The proposed method can detect objects with different orientations. Using this new detection technique, rigid objects can easily be separated from a mixed scene without segmentation. Furthermore, the grasping pose of the rigid object can be generated directly along its principal axis without using a CAD model or other pose detection methods. The effectiveness of the proposed method was demonstrated in scenes with mixed objects. However, the current work has some limitations. One limitation is that the current method does not consider the effect of grasping force. For some soft objects, such as bread, the object will be damaged if the grasping force is not appropriate. The other limitation is that the grasping direction is perpendicular to the table and aligned with a principal axis of the object, whereas in practice there are situations in which it may be more reasonable to grasp objects in other directions. In future work, we will therefore study more complex scenes, including other kinds of soft objects, such as bread and bananas. In this case, a grasping force optimization approach will be needed to achieve the balanced grasping of rigid and soft objects. In addition, we will consider how to realize grasping in more feasible directions to handle more complex scenes.