Fruit Localization and Environment Perception for Strawberry Harvesting Robots

This work presents a machine vision system for strawberry localization and environment perception in a strawberry-harvesting robot designed for table-top strawberry production. A deep convolutional neural network for segmentation is used to detect the strawberries. Segmented strawberries are localized through coordinate transformation, density-based point clustering and the proposed location approximation method. To avoid collisions between the gripper and fixed obstacles, the safe manipulation region is limited to the space in front of the table and underneath the strap. Therefore, a safe region classification algorithm, based on the Hough Transform algorithm, is proposed to segment the strap masks into a belt region in order to identify the pickable strawberries located underneath the strap. Similarly, a safe region classification algorithm is proposed for the table, which calculates the table points in 3D and fits a 3D plane to them based on the 3D point cloud, so that pickable strawberries in front of the table can be identified. Experimental tests showed that the algorithm could accurately classify ripe and unripe strawberries and could identify whether the strawberries are within the safe region for harvesting. Furthermore, the harvesting robot's optimized localization method could accurately locate the strawberry targets, with a picking success rate of 74.1% in modified situations.


I. INTRODUCTION
Machine vision is an essential element in agricultural robots. Before the development of deep learning techniques, traditional image processing methods, such as those based on color thresholding, were used; however, these were not able to adapt to changing agricultural environments [1]-[3].
Deep Convolutional Neural Networks (CNNs) have greatly improved the performance of image processing, particularly since the emergence of AlexNet, proposed by Krizhevsky et al. [4], and the numerous detection CNNs subsequently developed, some of which have been used for the detection of crops and fruits. Examples of such networks include You Only Look Once (YOLO), proposed by Redmon et al. [5], Single Shot Detector (SSD), proposed by Liu et al. [6], and the Faster Region-based Convolutional Neural Network (Faster R-CNN), proposed by Girshick et al. [7]. Sa et al. [8] used Faster R-CNN to detect sweet peppers, mangoes, strawberries and other fruit, while Bargoti et al. [9] adopted the same network to detect apples and mangoes, further improving its detection performance through data augmentation.
Besides object detection, segmentation CNNs have also been adopted for other applications in agriculture. Popular semantic segmentation networks include Fully Convolutional Network (FCN) [10], SegNet [11], DeepLab [12] and Unet [10]. Popular instance segmentation networks include Sharp Mask [13] and Mask R-CNN [14]. Bargoti et al. [15] utilized a semantic segmentation network to detect apples and estimate the yield. In addition, Yu et al. [16] utilized Mask R-CNN [14] for strawberry detection and similarly, Gonzalez et al. [17] used the same network for blueberry detection. While detection and segmentation networks have been widely used for the detection and counting of fruit, their applications in fruit harvesting have been rarely reported. Most of these methods focused on image analysis, thus were not applied to a specific agricultural machine system.
In order to pick the detected objects efficiently and reliably, they need to be localized after detection. Different methods based on different cameras have been used for the localization of fruits and other agricultural crops. These include the use of stereo cameras, depth cameras or a single camera with extra assumptions.
Mehta et al. [18] localized citrus fruits using a fixed monocular camera. Xiong et al. [1] used a single RGB (Red, Green, Blue) camera for weed localization, based on the assumption that the distance between the camera and the weed plane was fixed.
Single camera techniques are simple but limited in their depth determination and, therefore, much work has been done on the development of multiple camera systems. Font et al. [19] presented a stereo camera system for apple and pear localization. Mehta et al. [20] investigated the fruit localization problems using multiple cameras based on the assumption that the target had been matched successfully. Similarly, Ji et al. [21] used stereo matching for the localization of apple branches.
Many agricultural robots use an RGB-D (RGB-Depth) camera for detection and localization because of its simplicity. Wang et al. [22] used an RGB-D camera for the detection and fruit size estimation of mangoes. Vitzrabin et al. [23] proposed a detection method for sweet peppers using an RGB-D camera, and Xiong et al. [3] developed a strawberry harvester using an RGB-D camera for the detection and localization of the fruits. In this paper, we used an RGB-D camera for object detection and localization.
Environment perception or ambient awareness is crucial for agricultural robots, to ensure safe interaction between the robot and humans, the surrounding environment and other objects. Reina et al. [24] integrated Light Detection And Ranging (LiDAR) and imaging for the environment awareness of outdoor vehicles. Similarly, the same researchers [25] developed a multi-sensor system that integrates stereo-vision, LiDAR, radar and thermography, for the ambient awareness of agricultural vehicles in crop fields. They also [26] used RGB-D images to sense obstacles in outdoor environments in the navigation of rough terrain mobile robots. Indeed, the environment perception system is most commonly used for vehicle navigation, the conditions of which are markedly different to those for a strawberry picking robot on a strawberry farm. In order to ensure safe picking operations, it is necessary for the robot to detect the environment directly surrounding the target strawberries.
In the development of various strawberry harvesters, some have adopted machine vision systems based on color thresholding methods [2], [3], [27], using color differences to distinguish between ripe strawberries and other strawberries and plants. Some machine vision systems have been designed to detect the strawberry peduncle, as they work with a scissor-like cutter to cut the peduncle [28]-[30]. These systems apply color thresholding to first detect the strawberry and then detect its peduncle by identifying a certain region above the strawberry. However, as mentioned above, this color-based image processing is not able to adapt to changing environments [3].
Traditional feature learning methods have most typically been used for learning the different shapes of strawberries [31], and deep learning techniques for object detection and segmentation have shown results in the detection of strawberries [8], [16], [32]. However, these works have focused on image processing. As previously mentioned, when a vision system is integrated with a real strawberry harvester, the accurate localization of the strawberries and the maintenance of safe picking operations are essential and are, therefore, the main focus of this paper.
Specifically, we aim to solve the localization and collision problems frequently encountered during table-top picking for the strawberry harvester. The following highlights are presented in this paper:
• We use a deep learning network for instance segmentation to detect the target strawberries. Based on the detection results, we propose a localization method based on point clustering and location approximation algorithms.
• We raise the potential collision problems for manipulators in table-top strawberry farming. We solve this problem by proposing environment perception algorithms that can identify a safe manipulation region and the strawberries within this region. We propose a safe region classification method for the strap in a 2D image and for the table in a 3D point cloud, to identify the pickable strawberries that are located underneath the straps as well as the pickable strawberries in front of the table.
• The methods for localization and environment perception were implemented and evaluated on our strawberry harvesting robot in farm conditions, thus providing a reference for machine vision systems for localization and environment perception in similar harvesting robots.

II. OVERALL SYSTEM DESIGN
Our strawberry picking robot conducts static picking, in which it stops and processes the input image before issuing a command to the robot control system. Therefore, while the robot is static, the RGB and depth images acquired from the camera module are used for localization and environment perception in the machine vision system. The overall architecture of the proposed machine vision system is shown in Fig. 1. The instance segmentation network Mask R-CNN was used to detect our targets, including strawberries, strap and table. Thereafter, the detected strawberries undergo safe operation checking in the 2D image, coordinate transformation, a 3D location approximation algorithm and safe operation checking in 3D space, to obtain the final 3D locations of the strawberries within the safe manipulation region, thus achieving safe and efficient picking.
The proposed environment perception algorithms include defining the safe manipulation region in the 2D image according to the locations of the strawberries and strap, and defining the safe manipulation region in 3D according to the locations of the strawberries and table. In Fig. 1, the procedures related to strawberry localization are highlighted in red, while those related to environment perception are highlighted in blue. These two objectives coordinate with each other to finalize the positions of strawberries within the safe region; therefore, the procedures relating to both objectives are highlighted in green. The detailed localization and perception algorithms are described in the following sections.

A. FRUITS DETECTION AND SEGMENTATION
Mask R-CNN [14] was used for the detection and segmentation of fruits, tables and straps. Mask R-CNN is a deep neural network that can generate both the bounding box and the masks for each instance, as can be seen in Fig. 2. ResNet101 was used as the base convolutional neural network for feature extraction.
As described above, there are several networks available for object detection that are fast, accurate and well suited for fruit counting and yield estimation [5]-[7]. However, our goal is to estimate the fruit location in 3D space as accurately as possible. In this case, segmentation can provide more detailed information and is thus more appropriate for localization, since the segmented masks only contain the pixels of the targets whereas bounding boxes additionally include pixels of other objects. In short, the instance segmentation method was used because it can generate a pixel-level segmentation for each object.
Four target groups were classified, namely ripe strawberries, raw strawberries, straps and tables. The ripe strawberries are, of course, the harvester's target, while the tables and straps present potential collision problems with the gripper while in manipulation and are, therefore, also objects that should be detected. Detailed discussion about strap and table detection will be presented in the next section.
Three examples of the detection and segmentation results are provided in Fig. 3. Fig. 3 (a) shows the input images and

B. COORDINATE TRANSFORMATION FOR SEGMENTED STRAWBERRIES
Through image processing, several masks were created for the strawberries, in which one mask represented a detected target. The masks were de-projected into 3D points, representing the 3D positions of the targets in the camera frame C. The workflow of the coordinate transformation is shown in Fig. 4. The masks were extracted from the detected results and the depth image was aligned to the RGB coordinate system. The depth value was then obtained by matching the aligned depth image with the corresponding mask results. The coordinates were transformed from the image frame I to the RGB camera optical frame C using the intrinsic parameters of the RGB-D camera.
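The de-projection step described above can be sketched as follows. This is a minimal illustration that assumes an ideal pinhole model without lens distortion; the function name `deproject_mask` and the intrinsic parameter names `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def deproject_mask(mask, depth_m, fx, fy, cx, cy):
    """Turn the masked pixels of an aligned depth image into 3D camera-frame points.

    mask    : (H, W) boolean array, True where the instance was segmented
    depth_m : (H, W) float array of depth in metres, aligned to the RGB image
    fx, fy, cx, cy : assumed pinhole intrinsics of the RGB camera, in pixels
    """
    v, u = np.nonzero(mask)            # pixel rows (v) and columns (u) inside the mask
    z = depth_m[v, u]
    valid = z > 0                      # assumption: zero depth marks a missing reading
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx              # pinhole model, no distortion correction
    y = (v - cy) * z / fy
    return np.column_stack((x, y, z))  # (N, 3) points in the camera optical frame C
```

Each returned row is one point of the target in the camera optical frame C; pixels without a depth reading are dropped rather than guessed.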
Examples of the coordinate transformation process and its results can be seen in Fig. 5. The first and second columns are the colorized detected masks and the corresponding depth images, respectively. The third column is the visualization of transformed points marked by 3D bounding boxes in the point cloud. The detected masks contain the unripe strawberries but only the positions of the ripe strawberries were selected and sent to the harvester. Therefore, the third column shows the 3D bounding boxes of the ripe strawberries.

1) Points clustering
In this harvesting system, once the 3D positions of the targets are obtained, the machine vision system needs to send the positions of all strawberries to the manipulation system. However, it was found that the raw points transformed from the masks were not sufficiently accurate. Therefore, post-processing procedures were applied to the raw points to obtain a point set that could better represent the target's real position.
The inaccuracy of the transformed points was caused by several factors. For example, the target points could be projected to the background scene due to inaccurate sensing from the depth camera, such as the example shown in Fig. 6 (a). Another factor was noise from the adjacent objects and, in addition, there may have been inaccurate segmentation of the masks from the Mask R-CNN.
Therefore, a clustering algorithm was used to screen out irrelevant or noisy points. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [33] is a method that groups points that are closely packed together. By setting a threshold distance to identify core samples and a minimum number of points required to form a cluster, the less dense points and noise can be removed. Fig. 6 shows three examples of points before and after clustering, enclosed in the bounding boxes. The noise marked in the figure can be filtered out through this clustering method. Fig. 6 (a) shows an example of a strawberry edge sticking to the background, while Fig. 6 (b) and (c) show examples of noise caused by adjacent objects.
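To make the roles of the two parameters concrete, the following is a minimal sketch of density-based clustering written directly in NumPy. The values of `eps` and `min_samples` and the choice to keep only the largest cluster are assumptions made for illustration; the original system may use a library implementation and different settings. The pairwise distance matrix also limits this sketch to small point sets.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Pairwise distances: quadratic in memory, so only suitable for small sets.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(row <= eps) for row in dists]
    # A core point has at least min_samples points (itself included) within eps.
    is_core = np.array([len(nb) >= min_samples for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for seed in range(n):
        if labels[seed] != -1 or not is_core[seed]:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            idx = stack.pop()
            if not is_core[idx]:
                continue               # border points join a cluster but do not grow it
            for nb in neighbours[idx]:
                if labels[nb] == -1:
                    labels[nb] = cluster
                    stack.append(nb)
        cluster += 1
    return labels

def keep_largest_cluster(points, eps=0.01, min_samples=10):
    """Assumed post-processing: drop noise and keep the most populated cluster."""
    labels = dbscan(points, eps, min_samples)
    if labels.max() < 0:
        return points[:0]              # everything was classified as noise
    largest = np.bincount(labels[labels >= 0]).argmax()
    return points[labels == largest]
```

Points that were projected onto the background or that belong to a neighbouring object end up either as noise or in a smaller cluster, and are therefore removed.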

2) Target position optimization
The 3D bounding boxes of the target strawberries in the RGB camera optical frame were sent to the manipulator. The raw points obtained after clustering and the bounding box that encloses them are shown in Fig. 7 (a), in which it is evident that the bounding box can only represent a portion of a strawberry. The surface of the target that faces the camera is sensed better than other surfaces, as the RGB-D camera uses a projection method to obtain 3D points. In the table-top scenario, if the camera looks at the fruit from the front, the lengths in the x and z dimensions of a strawberry are almost the same. Therefore, in order to localize the targets more accurately, we used the dimension detected along the x axis (representing the surface facing the camera) to represent that along the z axis. Fig. 7 (b) shows the strawberry points and the refined bounding box.
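The refinement can be illustrated with a short sketch. It assumes that z points away from the camera and that the nearest sensed surface is kept while the box is extended backwards; this anchoring choice and the function name `refined_bounding_box` are assumptions, since the text only states that the x extent is used to represent the z extent.

```python
import numpy as np

def refined_bounding_box(points):
    """Axis-aligned box whose depth (z) extent is replaced by its width (x) extent.

    points : (N, 3) clustered camera-frame points, z assumed to point away from the camera
    Returns (min_corner, max_corner).
    """
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    width = hi[0] - lo[0]
    # Assumption: keep the front surface (lo[2]) and extend the box backwards
    # until its depth equals the measured width.
    hi[2] = lo[2] + width
    return lo, hi
```

For a fruit measured as 3 cm wide but only 1 cm deep, the refined box becomes 3 cm deep, so its centre moves towards the true centre of the strawberry.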

D. WORLD COORDINATE TRANSFORMATION
The camera module provides the 3D coordinates of the fruit in the camera optical frame C, so it was necessary to convert the locations from the camera frame C into the arm frame W. The relationship between the different frames is shown in Fig. 8, in which S represents the strawberry, C the camera frame, W the arm frame and B the chessboard frame. Let ${}^{W}S$ be the location of the strawberry S with respect to the arm frame W, and ${}^{C}S$ be the location of the strawberry S in the camera frame. The coordinate transformation of strawberries from the camera frame to the arm frame can be expressed as follows:

$${}^{W}S = {}^{W}_{C}R \, {}^{C}S + {}^{W}_{C}t$$

where ${}^{W}_{C}R$ and ${}^{W}_{C}t$ are the rotation matrix and translation vector from the camera frame C to the arm frame W. The ${}^{B}_{C}R$ and ${}^{B}_{C}t$ shown in Fig. 8 can be obtained through camera calibration, while ${}^{W}_{B}R$ and ${}^{W}_{B}t$ are known parameters. Based on these two sets of parameters, ${}^{W}_{C}R$ and ${}^{W}_{C}t$ can be obtained.
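One way to obtain the camera-to-arm transform from the two known transforms is to chain them, as sketched below. The sketch assumes that the calibration result maps camera coordinates into the chessboard frame B and that the known parameters map chessboard coordinates into the arm frame W; the function names are hypothetical.

```python
import numpy as np

def compose(R_ab, t_ab, R_bc, t_bc):
    """Chain two rigid transforms: (frame c -> frame b) followed by (frame b -> frame a)."""
    return R_ab @ R_bc, R_ab @ t_bc + t_ab

def camera_to_arm(points_c, R_wc, t_wc):
    """Apply R_wc @ p + t_wc to every row p of an (N, 3) array of camera-frame points."""
    return points_c @ R_wc.T + t_wc

# Illustrative values only: the board is shifted 1 m along x in the arm frame,
# and the camera frame is rotated 90 degrees about z relative to the board.
R_wb, t_wb = np.eye(3), np.array([1.0, 0.0, 0.0])
R_bc = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_bc = np.array([0.0, 0.0, 2.0])

R_wc, t_wc = compose(R_wb, t_wb, R_bc, t_bc)
strawberry_w = camera_to_arm(np.array([[1.0, 0.0, 0.0]]), R_wc, t_wc)
```

If the calibration instead returns the pose of the board in the camera frame, that transform has to be inverted before it is chained.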

IV. ENVIRONMENT PERCEPTION

A. PROBLEM DEFINITION
It is necessary for the strawberry harvester to sense its environment in order to make predictions and plan for the manipulation. Therefore, the scene must be segmented and objects that could cause potential damage must be localized.
During the experiments, the manipulator collided with the table or strap when the strawberries were either too close to the table or above the strap. Therefore, we used the segmentation network to detect the strap and table and to estimate whether or not a target strawberry was located within the safe manipulation region. The regions marked by white dashed lines in Fig. 9 represent the safe region for manipulation. Fig. 9 (a) is a front view of the scene, in which the safe region is below the strap, while Fig. 9 (b) is a side view showing the safe region below the strap and at a safety distance from the table. Strawberries should, therefore, be picked in the safe region.

B. SAFETY SOLUTIONS FOR THE STRAPS
An important output obtained from the Mask R-CNN model was the strap masks. The strap above the strawberry table is used to support the strawberry plants during growth, making the fruit easier to harvest and also preventing the stems from breaking. Most ripe strawberries hang underneath the straps; however, some can be found above the straps, which may be dangerous for the gripper during harvesting. In this section, we introduce two methods by which strawberry positions can be identified in relation to the strap.

1) Method 1: Original Masks
In order to classify the strawberries that are on or above the straps, the top positions ($y^{i}_{top}$) and the horizontal centroids ($x^{i}_{c}$) of the strawberry bounding boxes are first calculated, as shown in Fig. 10. Thereafter, for each strap mask region of non-zero pixels, $x^{i}_{c}$ is used to obtain all the vertical coordinates $y^{i}$ of the mask in that column. Next, $y^{i}_{top}$ is compared to the minimum value of $y^{i}$, which is used to represent the strap position, and the strawberry is labeled dangerous if it is above the strap and safe if it is below the strap.
We observed, however, that this method was not always sufficiently precise, as there were some situations in which corrupted segmented straps were obtained, such as case 3 shown in Fig. 10. In this case, the calculation method was not applicable to the strawberries that did not have strap masks below and, therefore, case 3 may be considered a failure using this method.
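Method 1, including the failure case, can be sketched as follows. The sketch assumes image rows that grow downwards, so that a smaller y value means a higher position in the picture; returning `None` for a column without strap pixels is an illustrative choice and not part of the original method.

```python
import numpy as np

def classify_with_mask(strap_mask, x_c, y_top):
    """Compare a strawberry's top edge with the strap mask in the same image column.

    strap_mask : (H, W) boolean strap segmentation
    x_c, y_top : horizontal centre and top row of the strawberry bounding box
    Returns "safe", "dangerous", or None when the column holds no strap pixels.
    """
    strap_rows = np.flatnonzero(strap_mask[:, x_c])
    if strap_rows.size == 0:
        return None                    # corrupted mask: no decision is possible
    strap_y = strap_rows.min()         # minimum y, i.e. the upper edge of the strap
    return "dangerous" if y_top < strap_y else "safe"
```

A gap in the segmented strap directly above or below a fruit leaves the method without a reference position, which corresponds to case 3 in Fig. 10.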

2) Method 2: Rectified Masks
To solve the above-mentioned problems arising in method 1, the Canny Edge Detection algorithm proposed by Canny et al. [34] was first applied to find all of the edge points of a segmented strap. Thereafter, we applied the Probabilistic Hough Transform algorithm proposed by Kiryati et al. [35], which uses a random subset of the edge points to obtain multiple line segments in the image, including their starting and ending coordinates. All these coordinates were then used to calculate the line equation ($y = m \cdot x + b$) that best fits all the points in the least-squares sense. The bounding box that enclosed all the strap masks, marked by the dashed line in Fig. 10, was determined by the width of the strap and the fitted line. As shown in Fig. 10, to verify whether strawberries are above or below the straps and assign a warning sign (dangerous or safe) to each fruit, $x^{i}_{c}$ is substituted into the line equation to obtain $y$, which is compared to $y^{i}_{top}$ + threshold. This threshold is a value obtained from the original segmented mask to determine the safe manipulation region between the line and the position of the top of the fruit. As shown in Fig. 10, all cases were classified correctly using this method.
Comparative visual results for the two methods described above, the safety solution using the original strap segmentation and that using the rectified strap segmentation, are shown in Fig. 11. Fig. 11 (a) presents the original images, while Fig. 11 (b) shows the results of the first method and Fig. 11 (c) shows the results of the second method. The green and yellow bounding boxes indicate the safe (S) and the dangerous (D) warning signs, respectively. It is evident from these images that the first method could not correctly classify as dangerous the strawberries above the corrupted regions of the strap masks. However, with the second method, all the fruits were classified successfully.
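The line fitting and comparison can be sketched as follows, taking the line segments as given rather than running the edge detector and Hough transform. The direction of the comparison (safe when the shifted top edge lies below the line, with image rows growing downwards) and the function names are assumptions made for illustration.

```python
import numpy as np

def fit_strap_line(segments):
    """Least-squares line y = m*x + b through all segment end points.

    segments : iterable of (x1, y1, x2, y2) tuples, for example the output of a
               probabilistic Hough transform run on the strap's edge map
    """
    seg = np.asarray(segments, dtype=float).reshape(-1, 4)
    xs = np.concatenate((seg[:, 0], seg[:, 2]))
    ys = np.concatenate((seg[:, 1], seg[:, 3]))
    m, b = np.polyfit(xs, ys, deg=1)   # suitable for a roughly horizontal strap
    return m, b

def classify_with_line(m, b, x_c, y_top, threshold):
    """Assumed rule: safe if the fruit's top edge plus the threshold is below the line."""
    y_line = m * x_c + b
    return "safe" if y_top + threshold > y_line else "dangerous"
```

Because the fitted line spans the whole image width, a strawberry still receives a label when the strap mask has a gap in its column.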

C. SAFETY SOLUTION FOR THE TABLE
The picking robot needs to know the specific 3D location of the table in order to determine how close a strawberry is to it. The same clustering method was used for the 3D points of the table. The detected table masks and the corresponding 3D points of the table can be seen in Fig. 5.
In order to represent a table's complete position, we fitted a 3D plane to the detected 3D points of the table. A plane in 3D space can be determined by defining a point $p_0 = (x_0, y_0, z_0)$ on the plane and a normal vector $n = (a, b, c)$ that is perpendicular to the surface. Any point $p = (x_p, y_p, z_p)$ on the surface then satisfies $n \cdot (p - p_0) = 0$.
We used the centroid of the points as p 0 . Then we created a moment of inertia tensor and used singular value decomposition to obtain the normal vector n of the plane.
The distance between the detected strawberry center $p_s$ and the table surface plane could then be calculated. A line $l = (x_l, y_l, z_l)$ passing through point $p_s$ and perpendicular to the table plane can be represented by $l = k \cdot n + p_s$. The intersection point $p_i$ between the line and the plane satisfies both equations as follows:

$$n \cdot (k \cdot n + p_s - p_0) = 0, \qquad p_i = k \cdot n + p_s$$

Thus the value of $k$ and the exact position of $p_i$ were obtained. The distance between $p_i$ and $p_s$ was calculated and used to ascertain whether or not a strawberry is within the dangerous distance to the table of strawberry trays.
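The plane fit and the point-to-plane distance can be sketched as follows. As a simplification, the sketch applies the singular value decomposition directly to the centred points instead of building a moment of inertia tensor first; both lead to the same least-squares normal direction. The function names are hypothetical.

```python
import numpy as np

def fit_plane(points):
    """Return (p0, n): the centroid and a unit normal of the least-squares plane."""
    p0 = points.mean(axis=0)
    # Rows of vt are the principal directions of the centred points. The last
    # row belongs to the smallest singular value and is the plane normal.
    _, _, vt = np.linalg.svd(points - p0, full_matrices=False)
    return p0, vt[-1]

def project_to_plane(p_s, p0, n):
    """Intersection p_i of the perpendicular through p_s and the distance |p_s - p_i|."""
    k = np.dot(n, p0 - p_s) / np.dot(n, n)   # from n . (k*n + p_s - p0) = 0
    p_i = p_s + k * n
    return p_i, np.linalg.norm(p_s - p_i)
```

The sign of the returned normal is arbitrary, but the intersection point and the distance do not depend on it.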
The detection and segmentation results for the table are presented in Fig. 12 (a). The detected coordinates in the image can be obtained from the masks and transformed to the camera optical frame with the aligned depth image. The fitted plane is marked in green in Fig. 12 (b) and Fig. 12 (c). Fig. 12 (c) also shows the point cloud and the detected strawberries, as well as the distance between the target and the table.

D. STRAWBERRIES IN THE SAFE MANIPULATION REGION
The coordinates of detected strawberries were compared with the positions of the strap and table, to ascertain whether a strawberry was within the safe region. The algorithm for the position checking sequence can be seen in Algorithm 1.
The entire process can be summarized in the following three main steps. First, the positions of the strawberry and strap are compared within the 2D image, and any strawberries above the strap are disregarded. Second, the positions of the remaining strawberries and the table are compared in 3D space in the RGB camera's optical frame, and those strawberries closer to the table than the pre-defined safety distance are screened out. In the third and final step, only the strawberries below the strap and outside the safety distance to the table are selected.
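The three steps can be sketched as a single filtering function. The sketch assumes that the strap line and table plane have already been fitted, that the plane normal has unit length, and that each strawberry is described by a small dictionary; the field names, the function name and the direction of the strap comparison are hypothetical.

```python
def select_pickable(strawberries, strap_line, table_plane, threshold_px, safety_dist_m):
    """Return the 3D centres of the strawberries inside the safe manipulation region.

    strawberries : list of dicts with keys "x_c", "y_top" (pixels) and "center" (x, y, z)
    strap_line   : (m, b) of the fitted strap line y = m*x + b in the image
    table_plane  : (p0, n) with n assumed to be a unit normal vector
    """
    m, b = strap_line
    p0, n = table_plane
    pickable = []
    for berry in strawberries:
        # Step 1 (2D): discard fruit whose top edge is above the strap line.
        if berry["y_top"] + threshold_px <= m * berry["x_c"] + b:
            continue
        # Step 2 (3D): discard fruit closer to the table plane than allowed.
        offset = [c - q for c, q in zip(berry["center"], p0)]
        distance = abs(sum(o * k for o, k in zip(offset, n)))
        if distance < safety_dist_m:
            continue
        # Step 3: whatever survives both checks lies in the safe region.
        pickable.append(berry["center"])
    return pickable
```

Running the cheap 2D check first means that the 3D distance is only computed for strawberries that are already known to hang below the strap.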
Algorithm 1: Ascertain whether strawberries are within the safe region
Result: coordinates of strawberries in the safe manipulation region
Pre-processing: 2D line fitting for the strap and 3D plane fitting for the table.

The metrics used to evaluate the detection results include precision, recall, F1 score and Average Precision (AP), as defined in Eq. 3. A total of 120 images were used to evaluate the detection method, and the numbers of True Positives (TP) and False Positives (FP) were recorded. Three confidence values, ranging from 0.7 to 0.9, were set to compute the precision, recall, F1 score and AP. The results are shown in Table 1, in which it can be seen that ripe strawberries had a higher detection accuracy. It was evident from the annotation process that ripe strawberries are easy to define, while unripe strawberries are more difficult, as they undergo a long growth stage from young, small strawberries to partially ripe strawberries. This could be confusing to the detection network.
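For reference, the standard definitions of precision, recall and F1 score can be computed as sketched below. The count of False Negatives (`fn`) is needed for recall and is an input assumed here; AP is omitted because its exact computation depends on the evaluation protocol.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Raising the confidence threshold typically removes false positives at the cost of more missed detections, which is why the metrics are reported at several confidence values.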

B. EXPERIMENTS OF SAFETY SOLUTION FOR THE STRAPS
The performance of the two safety solution methods for the straps was evaluated using test images containing a total of 418 strawberries. It should be noted that the strawberries were most commonly situated below the strap, so the warning sign classification was highly unbalanced. Confusion matrices for both methods are presented in Table 2, in which it is evident that the method using the original masks shows high classification errors for the dangerous warning sign class. Some of the Dangerous cases were classified as Safe, mainly due to the corrupted regions of the strap masks. However, after rectifying the masks, this error was mitigated and the overall accuracy improved from 83.7% to 96.9%. In both methods, the inaccurate classifications (Safe classified as Dangerous) were due to poor segmentation as well as inaccurate line equations.

C. EXPERIMENTS OF SAFETY SOLUTIONS FOR THE TABLE
The safety solutions for the table were evaluated using the RGB images, aligned depth images and point cloud. The RGB and depth images were used to obtain the detection and localization results, while the ground truth was obtained by manually measuring the distance between the target and the table in the point cloud. The safety distance was set to 10 cm based on practical experience. Twenty sets of the collected data with 112 strawberries were tested, and the classification results are shown in the confusion matrix in Table 3. Similar to the strap results, significantly fewer strawberries were found in the dangerous region than in the safe region. The overall accuracy was 97.3%. The accuracy of the plane fitting depends on accurate detection and localization of the table. Therefore, the evaluations were primarily based on the assumption that the table had been correctly detected. Should the points not be sufficiently accurate, the resulting fitted plane may not be well aligned to the real table. Because the aim of the algorithm is to accurately identify the strawberries within the safe manipulation region, a confusion matrix was used so that such failures would be reflected.

D. EVALUATION OF LOCALIZATION ON THE HARVESTING ROBOT
We tested the strawberry detection and localization method on our strawberry harvester (developed by Noronn AS). This harvester comprises a vehicle platform, a camera, a robotic arm and a gripper for picking strawberries [3], [36], as shown in Fig. 13. A GPU (GTX 1060, NVIDIA, USA) was used for running the machine vision and manipulation control systems. The average processing time for one image frame, including running the detection network, coordinate transformation and other computations, was 0.82 s, as can be seen in Table 4. This time is an average over 119 image frames with a resolution of 640 × 480. The average times and their standard deviations for the detection, coordinate transformation (including strawberry and table points) and other computations are listed separately in Table 4.
The successful picking rates of the localization method based on raw points (method 1) and the bounding box optimization (method 2) were compared using the same scenarios, in which the cutting action was disabled so that the gripper swallowed the strawberry, moved down and went to the next strawberry. Each successful swallowing was considered a successful picking.
The tests were conducted in modified situations, including those in which the strawberries were isolated and those in which ripe and raw strawberries were hanging adjacent to each other. In this test, the Rumba variety of strawberry was used, and the numbers of successfully detected and successfully swallowed strawberries over 12 trials are recorded in Table 5. Tests of different growing situations can also be found in [36], in which the various harvesting failure cases were introduced. The picking rate in this paper is lower than that in [36], because the strawberry variety in this test is more challenging to pick and the tests were conducted with a single picking attempt. The picking rates for the two localization methods were obtained by dividing the number of swallowed strawberries by the number of detected strawberries. Method 1 in Table 5 indicates localization based on raw points, while method 2 indicates the optimized localization method. It can be seen that the optimized localization method achieved a success rate of 74.1% in the modified environment, while the localization based on raw points achieved a successful picking rate of 51.8%.

VI. CONCLUSIONS
This work proposed a localization method and environment perception algorithms for strawberry harvesting robots. The localization method was based on the segmented masks of a deep convolutional neural network and depth images from an RGB-D camera. To increase localization accuracy, density-based point clustering was used to segment and remove noise points in the 3D point cloud. The table and strap were detected and located using the same network, and their locations were compared with the positions of strawberries in order to identify whether the strawberries were within the safe manipulation region. The position comparison between the target strawberries and the strap was based on line fitting using the Hough Transform algorithm, while the position comparison between strawberries and the table was based on 3D plane fitting. The test results showed that the optimized localization method can accurately localize targets, with a successful picking rate of 74.1% in modified situations. The overall accuracy rates for the strap and table safety identifications were 96.9% and 97.3%, respectively.
This work investigated the challenges of localization based on deep learning segmentation networks. It also raised the problem of environment perception in harvesting and provided methods for detecting objects that are dangerous for the harvester and for classifying the safe manipulation region.
In future work, the localization algorithm could be further optimized and adapted to suit more complex situations, such as occluded strawberries and unusual hanging positions.