Visual Perception and Modelling in Unstructured Orchard for Apple Harvesting Robots

Visual perception and modelling are essential tasks for robotic harvesting in unstructured orchards. This paper develops a framework of visual perception and modelling for robotic harvesting of fruits in orchard environments. The framework includes visual perception, scenario mapping, and fruit modelling. The visual perception module utilises a deep-learning model to perform multi-purpose visual perception tasks within the working scenarios; the scenario mapping module applies OctoMap to represent the multiple classes of objects or elements within the environment; the fruit modelling module estimates the geometric properties of objects and the proper access pose of each fruit. The framework is implemented and evaluated in apple orchards. The experimental results show that the visual perception and modelling algorithm can accurately detect and localise the fruits and model the working scenarios in real orchard environments. The $F_{1}$ score and mean intersection over union of the visual perception module on fruit detection and segmentation are 0.833 and 0.852, respectively. The accuracies of the fruit modelling in terms of centre localisation and pose estimation are 0.955 and 0.923, respectively. Overall, an accurate visual perception and modelling algorithm is presented in this paper.


Introduction
Robotic harvesting is a promising direction for the future development of the agriculture industry, and machine vision is one of the essential elements of a harvesting robot. The vision system of a harvesting robot is required to perceive environment information, such as the location of target fruits and the surrounding environment, to guide the manipulator in performing the harvesting task. Many factors can heavily affect the performance of the vision system, including illumination changes, object occlusion, and noisy backgrounds. Compared to robotic harvesting in a structured orchard, which has been designed and optimised for automation, robotic harvesting in unstructured orchard environments is even more challenging, since the complex arrangement of branches, leaves, and other objects can easily obstruct the detachment path of the manipulator, leading to failure of the operation or even damage to the manipulator. Therefore, a robust and efficient perception and modelling algorithm is a crucial element in developing universal harvesting robots.
In this work, a machine vision system for apple harvesting robots is presented. The developed vision system includes a multi-purpose deep network, the Detection And Segmentation Network (DASNet), for visual perception in 2D images and an environment modelling algorithm for information processing on 3D point clouds. The following highlights are presented in this paper:
• Development of a multi-purpose deep network, DASNet, which combines detection and instance/semantic segmentation in a one-stage detection network architecture to perform real-time environment perception in unstructured orchard environments.
• Development of a universal framework to perform visual perception and modelling in unstructured orchard environments to guide robotic harvesting of fruits.
• Implementation and evaluation of the developed perception and modelling algorithm in a real unstructured orchard environment, providing a guideline for the design of robotic systems in similar working environments.
The rest of the paper is organised as follows. Section 2 reviews the related works. Sections 3 and 4 introduce the methodology and the experiments, respectively. Section 5 presents the conclusions.

Related Works

A: Visual Perception
Robotic vision in agriculture applications can use different visual sensors, including RGB/stereo cameras, RGB-D cameras, Light Detection And Ranging (LiDAR), thermal imaging sensors, and spectral cameras [1]. This work focuses on reviewing vision processing methods for 2D RGB images, which have been extensively studied. Traditional image processing algorithms apply feature descriptors to describe objects of interest within images, and machine-learning based algorithms are then applied to perform classification or detection accordingly. Many feature descriptors, such as the Colour Coherence Vector (CCV), Histogram of Oriented Gradients (HoG), and Scale-Invariant Feature Transform (SIFT), were applied in previous works. Correspondingly, traditional machine-learning algorithms such as K-means, Support Vector Machine (SVM), and Neural Network (NN) were applied to perform classification on such descriptors. Nguyen et al. [2] applied colour information and a 3D geometry descriptor to detect apples with an RGB-D camera. Lin et al. [3] applied HSV colour features and 3D geometry features on point clouds to describe the appearance of multiple classes of fruits, and an SVM classifier was then trained to perform detection based on these features. Wang and Xu [4] applied multiple image feature descriptors and a Latent Dirichlet Allocation (LDA) model to perform unsupervised segmentation of plants and fruits in greenhouse environments. Similar works on traditional machine vision algorithms in agriculture applications are covered in the reviews [5,6].
More recently, deep-learning based algorithms have shown advanced and robust performance in many machine vision tasks. Deep-learning based algorithms apply deep Convolutional Neural Network (CNN) architectures to automatically select and learn proper features from images. Many classic deep-learning architectures for different purposes have been proposed. The Region-based Convolutional Neural Network (RCNN) [7] and YOLO [8] are state of the art in object detection. The former utilises a two-stage detection strategy, applying a Region Proposal Network (RPN) to search for Regions of Interest (ROI) and a classification network to classify the objects and refine the bounding box within each ROI. YOLO applies a fully convolutional network architecture that combines ROI searching and classification into a single step, reducing the computational complexity of object detection compared to RCNN. Other deep CNN architectures, such as the Fully Convolutional Network (FCN) for semantic segmentation and Mask-RCNN for instance segmentation, are described in [9] and [10], respectively. In agricultural machine vision, the works of [11] and [12] applied Faster-RCNN to detect multiple classes of fruits in farm conditions, including apples, mangoes, and peppers. The works of [13] and [14] utilised the Mask-RCNN model to perform detection and instance segmentation for robotic harvesting of strawberries. The works of [15] and [16] applied YOLO to real-time in-field detection of apples and mangoes for yield estimation and monitoring, respectively. The works of [17] and [18] applied the FCN model to detect guava fruits and cotton, respectively. In our previous works [19,20,21], a multi-purpose deep CNN model, DASNet, based on the YOLO and SPRNet [22] architectures, was developed to perform real-time detection and instance segmentation of fruits and semantic segmentation of branches/trunks in orchard environments.
Further deep-learning based works on machine vision in agriculture applications can be found in the recent survey [23].

B: 3D Visual Mapping
3D mapping of the working scenarios is an essential element in many robotic applications, and several 3D mapping methods have been applied in previous works. Tabak et al. [24] discretised the mapping area of the 3D environment with equal-size voxels. Voxelisation of 3D space consumes a large amount of memory when large outdoor scenarios are mapped at fine resolution. With the development of range sensors, the point cloud has become another popular mapping approach. Cole and Newman [25] used 3D point clouds to represent 3D space in an outdoor 3D SLAM system. However, the computation and memory consumption of this representation becomes enormous with increasing measurement resolution. For example, a depth camera with a resolution of 640 × 480 can output up to about 300 thousand points per frame. Other methods, such as elevation maps [26,27] or surface representations [28,29], can only meet the requirements when certain assumptions are made. Hornung et al. [30] presented OctoMap, an octree-based 3D mapping approach that represents 3D space as a memory-efficient volumetric occupancy map. This method also provides other advantages, such as adjustable resolution and a high update frequency [31]. Therefore, OctoMap provides an efficient and robust approach to modelling unstructured scenarios [32].
Several previous studies have explored environment modelling in orchard scenarios. Adhikari and Manoj [33] applied a Time-of-Flight (ToF) depth camera to perform visual sensing, and a 3D skeletonization algorithm was used to model the 3D profile of the branch/trunk for mechanical pruning. Wang and Zhang [34] applied two RGB-D cameras to reconstruct a dormant tree in the form of a 3D point cloud in a laboratory environment. Amatya et al. [35] applied a Bayesian classifier to classify branch pixels within orchard environments, and the branch was then represented as a straight line in the 2D images. Li et al. reconstructed the tree from the front view and back view, and the branch was modelled as a cylinder through the random sample consensus (RANSAC) algorithm. Lin et al. [17] applied FCN to segment the branch and modelled it as a 3D line in the scenario to perform pose estimation of guava fruits. From our review, previous works on orchard modelling have several shortcomings in terms of robustness in real orchard environments, modelling accuracy, and computational efficiency. Therefore, a universal framework, which includes an OctoMap-based modelling algorithm and a corresponding fruit modelling algorithm, is developed in this paper to perform visual perception and modelling in unstructured orchard scenarios.
The developed apple harvesting robot contains a robot arm and gripping subsystem and a visual perception system. RGB-D cameras (for example, Kinect-v2 or RealSense D435) are used to sense the colour and depth information of the working scenarios. The RGB-D camera can be installed either on the gripper or on the back frame of the robot. In each iteration of the robotic control computation, the visual perception and modelling algorithm senses and processes the visual information to guide the next action of the manipulator or gripper.

System Architecture
The workflow of the visual perception and modelling algorithm is shown in Figure 1. Firstly, DASNet [20,21] performs segmentation and detection on the input RGB images to extract objects of interest from the working scenarios. Then, combined with the depth map collected by the RGB-D camera, the processed information is used to model the working scenarios of the orchard. The branch/trunk and other elements (other possible obstacles) are represented in the form of an OctoMap. The fruits are represented as spheres, and a 3D Sphere Hough Transform (3D-SHT) algorithm is used to estimate the geometric properties of each fruit. Based on the point distribution, the accessible pose is estimated for each fruit. Furthermore, to ensure the robustness of the pose estimation, a 3DVFH+ [36] based pose verification algorithm is developed to check the confidence level of the estimated pose. Lastly, the list of fruits acceptable for picking and the 3D scenario map are sent to the central control, which decides and generates the next motion and action of the manipulator or gripper.
DASNet is utilised to perform the visual perception task in unstructured orchard environments. DASNet improves on the one-stage network architecture by combining detection, instance segmentation, and semantic segmentation in a single network model. The detection branch of the network utilises a three-level Feature Pyramid Network (FPN) to fuse the information of feature maps from the deeper levels to the shallower levels of the network. At each FPN level in the DASNet model, Atrous Spatial Pyramid Pooling (ASPP) is applied to encode the multi-scale features of objects into the output feature maps. Two fixed-size anchor boxes are assigned to the output branch of each FPN level (six anchor boxes in total for the three-level FPN). Each output branch predicts the class, bounding box, and object mask of each object.
The overall design of the detection branch of DASNet is shown in Figure 2, and more details of the DASNet model can be found in [21]. DASNet utilises ResNet-50 as the backbone, since our previous works suggest that ResNet-50 achieves a balance between performance and computational efficiency.

Multi-purpose Network
The semantic segmentation branch is grafted onto the detection branch of DASNet and receives the feature maps from the C3, C4, and C5 levels. To keep the feature maps from different levels of the network consistent, the C5 and C4 feature maps are upsampled by 4× and 2×, respectively, to match the size of the C3 feature maps. The different levels are fused by a concatenation operation. The concatenated feature tensor is then upsampled by 8× to form the segmentation results. The detection branch of DASNet outputs the detection and instance segmentation results of the fruits, while the semantic segmentation branch outputs the semantic segmentation results of the branch/trunk. Together, the detection branch and the semantic segmentation branch form the final output of DASNet.
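The shape bookkeeping of this fusion can be checked with a small sketch. It tracks tensor shapes only and is not the DASNet implementation: the `FeatureMap` type, the helper names, the 416 × 416 input size, the strides of 8/16/32 for C3/C4/C5, and the channel counts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FeatureMap:
    """Shape of a feature map (hypothetical helper type, shapes only)."""
    channels: int
    height: int
    width: int


def upsample(fm: FeatureMap, factor: int) -> FeatureMap:
    """Spatial upsampling scales height and width, not the channel count."""
    return FeatureMap(fm.channels, fm.height * factor, fm.width * factor)


def concat(maps: Sequence[FeatureMap]) -> FeatureMap:
    """Channel-wise concatenation requires identical spatial sizes."""
    sizes = {(m.height, m.width) for m in maps}
    if len(sizes) != 1:
        raise ValueError(f"spatial sizes differ: {sorted(sizes)}")
    height, width = sizes.pop()
    return FeatureMap(sum(m.channels for m in maps), height, width)


def segmentation_branch_shape(c3: FeatureMap, c4: FeatureMap, c5: FeatureMap) -> FeatureMap:
    """C4 x2 and C5 x4 are aligned to C3, concatenated, then upsampled x8."""
    fused = concat([c3, upsample(c4, 2), upsample(c5, 4)])
    return upsample(fused, 8)


# Assumed 416 x 416 input with strides 8/16/32 and made-up channel counts.
c3 = FeatureMap(128, 52, 52)
c4 = FeatureMap(256, 26, 26)
c5 = FeatureMap(512, 13, 13)
print(segmentation_branch_shape(c3, c4, c5))
```

Under these assumptions the fused tensor returns to the input resolution, which is consistent with the final 8× upsampling forming a full-size segmentation map.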

Post-processing
Due to the noisy background in unstructured orchard environments, accurate segmentation of the branch/trunk is still a challenging task. Therefore, DASNet is expected to segment the major branches/trunks of the trees while ignoring the smaller branches. A connected-region analysis using an OpenCV API function is applied after DASNet to remove small regions from the branch segmentation results. Apart from the branches, other elements within the orchard environment can also affect the operation of the manipulator. Thus, objects that are identified as not belonging to the branch/trunk are also included in the subsequent environment modelling.
The environment perception module returns the semantic labels of fruits, branches, and other elements in the form of binary image masks. Combined with the depth map from the RGB-D camera, a point cloud is assigned to each class of objects. However, 3D mapping of the environment based on point clouds is not efficient [32]. Point clouds contain a large amount of noise and detail that cannot be used in path planning and obstacle avoidance. Meanwhile, the large number of points in the RGB-D camera output can also lead to heavy computation.
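The two steps described above, removing small regions from the branch mask and turning a masked depth map into a point cloud, can be sketched as follows. This is a plain-Python stand-in rather than the OpenCV call used in the paper; the function names, the 4-connectivity, the `min_area` threshold, and the pinhole intrinsics `fx`, `fy`, `cx`, `cy` are assumptions, and the points are expressed in a camera frame with Z pointing forward.

```python
from collections import deque


def remove_small_regions(mask, min_area):
    """Keep only 4-connected regions of a binary mask with >= min_area pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [[0] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected region.
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(region) >= min_area:
                for r, c in region:
                    cleaned[r][c] = 1
    return cleaned


def mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth through a pinhole model."""
    points = []
    for v, row in enumerate(mask):
        for u, flag in enumerate(row):
            z = depth[v][u]
            if flag and z > 0:  # a depth of 0 is treated as an invalid reading
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 3 x 4 branch mask: one 3-pixel region and one isolated pixel.
branch_mask = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0]]
depth_map = [[0.5, 0.0, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
clean_mask = remove_small_regions(branch_mask, min_area=2)
branch_points = mask_to_points(clean_mask, depth_map, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(clean_mask, branch_points)
```

Running the same back-projection once per class mask gives the per-class point clouds that the modelling stage consumes.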

Modelling of Scenarios
OctoMap maps the 3D environment based on octrees, dividing the space into a number of small voxels. The original OctoMap adopts a probabilistic 3D mapping framework to minimise the error in range measurements caused by the motion of objects or robots. For apple harvesting robots, the current scenario collected by the RGB-D camera is considered static for the purpose of guiding the next action of the manipulator. Therefore, a binary representation of occupied voxels within the octrees is utilised. By setting the minimum voxel size, the resolution of the environment mapping can be adjusted to the situation. When modelling the trunk/branch, the point cloud of each class of objects is first de-noised to minimise possible measurement errors during sensing. Then, the OctoMap for each class of objects is constructed at the given resolution. The workflow of the fruit and scenario modelling algorithm is shown in Figure 3.
A 3D-SHT algorithm is developed in this work to estimate the optimal centre position of the fruits from the point cloud obtained for each fruit mask. We assume the geometry of an apple is a sphere, whose equation can be expressed as:
$$(x - c_x)^2 + (y - c_y)^2 + (z - c_z)^2 = r^2$$
where $c_x$, $c_y$, $c_z$, and $r$ are the centre coordinates and radius of the sphere, respectively. 3D-SHT extends the Circle Hough Transform (CHT) to the 3D case, applying a voting framework to estimate the most probable parameters of a sphere, which indicate the actual position and size of the detected fruits within the scenario. We first apply spatially uniform sampling to the point cloud of each target to reduce the computational complexity and noise. Then, based on the distribution of the point cloud, a search range for each parameter is calculated and discretised at a given resolution.
For each point $(x_i, y_i, z_i)$ within the point cloud, we calculate the corresponding radius $r_{est}$ for each possible combination of $c_x^p$, $c_y^q$, and $c_z^n$ within the search range:
$$r_{est} = \sqrt{(x_i - c_x^p)^2 + (y_i - c_y^q)^2 + (z_i - c_z^n)^2}$$
If the calculated radius falls within the interval $r_{accept}$ of acceptable radii, we add one vote to the corresponding parameter combination $(c_x^p, c_y^q, c_z^n, r_k)$. Finally, the parameter combination with the highest number of votes gives the estimated sphere of the detected fruit.
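The voting scheme can be illustrated with a brute-force sketch. It is a minimal version under stated assumptions, not the authors' implementation: the search range is taken as the bounding box of the points grown by the largest acceptable radius, the same resolution `step` is used for the centre and the radius, and the function and variable names are hypothetical. Units are millimetres.

```python
import math
from collections import Counter
from itertools import product


def frange(start, stop, step):
    """Inclusive list of evenly spaced values from start to stop."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def sphere_hough(points, r_min, r_max, step):
    """Vote over (cx, cy, cz, r) cells and return (best_cell, votes)."""
    # Assumed search range: bounding box of the points grown by r_max.
    axes = []
    for axis in range(3):
        values = [p[axis] for p in points]
        axes.append(frange(min(values) - r_max, max(values) + r_max, step))
    votes = Counter()
    for centre in product(*axes):
        for point in points:
            r_est = math.dist(point, centre)
            if r_min <= r_est <= r_max:
                # Snap the radius to the nearest discretised bin r_k.
                k = int(round((r_est - r_min) / step))
                votes[centre + (r_min + k * step,)] += 1
    return votes.most_common(1)[0]


# Synthetic fruit: 17 points lying exactly on a 40 mm sphere centred at
# (0, 0, 500) mm, covering only the half that faces the camera.
demo_points = [
    (40, 0, 500), (-40, 0, 500), (0, 40, 500), (0, -40, 500), (0, 0, 460),
    (24, 0, 468), (-24, 0, 468), (0, 24, 468), (0, -24, 468),
    (32, 0, 476), (-32, 0, 476), (0, 32, 476), (0, -32, 476),
    (24, 32, 500), (-24, 32, 500), (24, -32, 500), (-24, -32, 500),
]
best_cell, vote_count = sphere_hough(demo_points, r_min=30, r_max=50, step=10)
print(best_cell, vote_count)
```

The exhaustive loop grows with the number of grid cells times the number of points, which is one reason the point cloud is downsampled before voting.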

b. Pose Estimation
Based on the estimated optimal centre of each fruit and the distribution of the point cloud of its mask, a pose estimation algorithm is utilised to calculate the Euler angles of each fruit within the scenario. We assume that the point cloud (with $n$ points) identified as belonging to a fruit is the visible, unblocked part of the fruit from the current view angle of the RGB-D camera. From the orientation of each point of the point cloud relative to the optimal centre of the detected fruit, the optimal access pose of the manipulator to this fruit can be estimated. The fruit surface is described by the parametric equation of a sphere (Eq 4), in which $R = \sqrt{x^2 + y^2 + z^2}$, and $\theta$ and $\phi$ are in the range $[0, 2\pi]$. We calculate the angles $\theta_i$ and $\phi_i$ of each point $p_i$ of the point cloud to estimate the unblocked pose of the fruit under the current view angle of the RGB-D camera. For a chosen point $p_i(x_i, y_i, z_i)$ and the centre $C(c_x, c_y, c_z)$ of the estimated sphere of a fruit mask, the angles $\theta_i$ and $\phi_i$ are calculated through Eqs 6 and 7, respectively.
The pose of a fruit can be modelled as a ZYX Euler-angle rotation of the coordinate frame around the centre of the sphere, where $\theta$ and $\phi$ are the rotation angles about the Z-axis and Y-axis, respectively. The rotation matrix is expressed in terms of these two angles, and the fruit-level angles $\theta$ and $\phi$ are obtained from the per-point angles $\theta_i$ and $\phi_i$. To reduce false predictions of the fruit pose due to measurement errors in the point cloud, we limit the estimated angles $\theta$ and $\phi$ to the range $[-\frac{1}{3}\pi, \frac{1}{3}\pi]$.
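As a hedged illustration of the last step, the sketch below clamps the two angles to the stated range and builds the rotation matrix for a ZYX Euler rotation with a zero rotation about X, that is $R = R_z(\theta) R_y(\phi)$. Taking the rotated X-axis as the approach direction is an assumption of this sketch, as are the function names; how $\theta$ and $\phi$ are obtained from the per-point angles is not reproduced here.

```python
import math

ANGLE_LIMIT = math.pi / 3  # the [-pi/3, pi/3] bound stated in the text


def clamp_angle(angle, limit=ANGLE_LIMIT):
    """Restrict an estimated angle to [-limit, limit]."""
    return max(-limit, min(limit, angle))


def zyx_rotation(theta, phi):
    """Rotation matrix Rz(theta) * Ry(phi); the rotation about X is zero."""
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(phi), math.sin(phi)
    return [
        [ct * cp, -st, ct * sp],
        [st * cp, ct, st * sp],
        [-sp, 0.0, cp],
    ]


def approach_axis(theta, phi):
    """Rotated X-axis (first matrix column), assumed to be the approach direction."""
    rotation = zyx_rotation(theta, phi)
    return (rotation[0][0], rotation[1][0], rotation[2][0])


# An estimate of 80 degrees about Z is clamped to 60 degrees before use.
theta = clamp_angle(math.radians(80.0))
phi = clamp_angle(math.radians(-10.0))
print(math.degrees(theta), math.degrees(phi), approach_axis(theta, phi))
```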

c. Pose Verification
Due to the complex arrangement of the unstructured orchard environment, the visual perception and modelling algorithm may be affected by unexpected factors. Therefore, a pose verification algorithm based on 3DVFH+ is utilised to verify the correctness of the estimated pose and protect the manipulator during operation. The method calculates the orientation of barriers in a given neighbourhood of a chosen fruit, to ensure that no barrier is present in the direction of the estimated pose. We first initialise a 2-dimensional histogram $H$, with a given resolution, to record the orientation angles $\theta$ and $\phi$ of the barriers (branches or other elements) relative to the centre of the fruit. Then we search for obstacles within a neighbourhood range $r$ (in mm) of the target fruit. That is, for a barrier $B_i$ within the range, the orientation angles $\theta_i$ and $\phi_i$ relative to the centre $C$ of the fruit are calculated using Eqs 6 and 7. Then a penalty value $PV_i$ is added to the corresponding location $H[\theta_i^{barrier}, \phi_i^{barrier}]$. The penalty is composed of a class term $\alpha$ and a distance term $K$, and $\beta$ is a constant that adjusts the penalty value through the distance term. We set $\beta$ and the neighbourhood range $r$ to 50 and 200, respectively.
Then, based on the estimated pose $\theta_{pose}$ and $\phi_{pose}$ of the chosen fruit, the summed penalty value at $H[\theta_{pose}, \phi_{pose}]$ is used to calculate the confidence rate $L$. Eq 14 maps the confidence rate of the estimated pose of a target into the range $[0, 1]$, so the confidence rate of the estimated pose of each fruit can be represented as a probability value. A threshold $\tau$ is then applied to filter out low-confidence targets:
$$\text{can pick} = \begin{cases} \text{True}, & L \ge \tau \\ \text{False}, & L < \tau \end{cases}$$
We set $\tau$ to 0.6, and fruits with a higher confidence rate are given priority in the picking sequence.
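The final filtering and ordering step can be sketched in a few lines. The confidence values are taken as given, since the penalty and confidence computations are not reproduced here; the function name and the fruit identifiers are hypothetical.

```python
TAU = 0.6  # threshold stated in the text


def picking_sequence(confidences, tau=TAU):
    """Keep fruits with L >= tau, ordered from highest to lowest confidence."""
    accepted = [(fruit, conf) for fruit, conf in confidences.items() if conf >= tau]
    accepted.sort(key=lambda item: item[1], reverse=True)
    return [fruit for fruit, _ in accepted]


# Hypothetical confidence rates L for four detected fruits.
pose_confidence = {"fruit_a": 0.90, "fruit_b": 0.55, "fruit_c": 0.60, "fruit_d": 0.75}
print(picking_sequence(pose_confidence))
```

A fruit exactly at the threshold is still accepted, matching the $L \ge \tau$ condition.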
Furthermore, note that Eqs 11 to 15 only consider the geometric constraints of the environment; other constraints, such as the working space of the robot, can be further added to this framework.

Experiment and Discussion

Implementation Details
The experiment was conducted in the apple orchard, which is located at Qingdao (China), from 10th to 25th of November in the year 2019. There

Evaluation Methods
The environment perception module is evaluated on object detection, instance segmentation, and semantic segmentation using the $F_1$ score and the Mean Intersection over Union (MIoU). The $F_1$ score measures the detection performance of the network from its Precision and Recall, where Precision is expressed as follows.
$$\text{Precision} = \frac{TP}{TP + FP} \quad (17)$$
Here $TP$ and $FP$ are the numbers of true positives and false positives, respectively. The MIoU evaluates the quality of segmentation by measuring the ratio between the intersection and the union of two sets, which are the network prediction of segmentation and the ground truth in this case. The expression of MIoU follows [37], in which $p_{ij}$ and $p_{ji}$ stand for the false positives and false negatives of the network prediction of segmentation, respectively.
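For reference, both metrics can be computed with a short sketch. It uses the standard definitions, with $F_1$ as the harmonic mean of precision and recall and MIoU as the per-class IoU averaged over all classes of a confusion matrix; the function names and the convention that `confusion[i][j]` counts pixels of true class `i` predicted as class `j` are assumptions of this sketch.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_iou(confusion):
    """Average per-class IoU; confusion[i][j] = true class i predicted as j."""
    num_classes = len(confusion)
    ious = []
    for i in range(num_classes):
        true_pos = confusion[i][i]
        truth_total = sum(confusion[i])  # all pixels of true class i
        predicted_total = sum(confusion[j][i] for j in range(num_classes))
        ious.append(true_pos / (truth_total + predicted_total - true_pos))
    return sum(ious) / num_classes


# Toy counts: 8 correct detections, 2 false alarms, 2 missed fruits.
print(f1_score(tp=8, fp=2, fn=2))
# Toy two-class pixel confusion matrix (for example, branch vs background).
print(mean_iou([[3, 1], [1, 3]]))
```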

Comparison with Other Methods
The performance of the DASNet model in terms of detection, instance segmentation, and semantic segmentation is compared with YOLO (and YOLO-tiny), Faster RCNN, Mask RCNN, and FCN-8s, as shown in Table 1. $MIoU_F$ and $MIoU_B$ in Table 1 denote the accuracy of instance segmentation on fruits and semantic segmentation on branches, respectively. From the experimental results, DASNet improves the detection performance of the one-stage network from 0.811 (YOLO) to 0.833, which is comparable to the two-stage detection networks Faster RCNN (0.834) and Mask RCNN (0.838). In terms of instance segmentation of the detected objects, DASNet also shows an accuracy similar to that of Mask RCNN (0.857 and 0.871, respectively). The likely reason for the higher instance segmentation accuracy of Mask RCNN is that a two-stage detection network can apply the RPN and ROI re-alignment to segment the corresponding area of each object accurately. On the other hand, although DASNet applies ASPP to encode multi-scale features, these features may also introduce noise and lead to a lower segmentation accuracy. The experimental results that follow show that DASNet is nevertheless capable of providing accurate segmentation of fruits. In terms of semantic segmentation, DASNet achieves performance comparable to the FCN-8s model (0.802 and 0.767, respectively). However, due to the complex conditions of robotic harvesting in the unstructured orchard, accurately segmenting the branch/trunk from the noisy background is still a challenging task. Therefore, the environment modelling algorithm considers inputs from both the branch/trunk and the other objects within the working scenarios. The working scenarios processed by the visual perception algorithm are shown in Figure 6.
In addition, the comparison of the computation efficiency of different network models is given in Table 2.

Evaluation on Fruits Modelling
The fruit modelling algorithm is evaluated by measuring its accuracy and robustness over distance within the working range of the harvesting robot. The accuracy evaluation is conducted by manually marking the correctly estimated and wrongly estimated objects in each scenario. The robustness evaluation measures the fluctuation range (Standard Deviation (SD)) of the estimated geometric properties and object poses over multiple runs (repeated 5-10 times) of the same scenario. Both tests are evaluated as a function of the distance from the depth camera to the objects.
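The two measures can be written down directly. In this sketch the accuracy is the fraction of manually marked correct estimates, and the fluctuation is the standard deviation over repeated runs; whether the population or sample form is used is not stated, so the population form is an assumption, as are the function names and the toy numbers.

```python
from statistics import pstdev


def modelling_accuracy(correct, wrong):
    """Fraction of objects manually marked as correctly estimated."""
    return correct / (correct + wrong)


def fluctuation(estimates):
    """Population standard deviation of one parameter over repeated runs."""
    return pstdev(estimates)


# Hypothetical example: 19 of 20 fruits marked correct in one distance band,
# and one estimated centre coordinate (in mm) from eight repeated runs.
print(modelling_accuracy(correct=19, wrong=1))
print(fluctuation([2, 4, 4, 4, 5, 5, 7, 9]))
```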

Localisation and Pose Estimation
The operating range of the vision system is from 0.3 m to 0.7 m along the X-axis of the robot coordinate frame (the minimum steady operating distance of the Intel RealSense D435 is 0.2 m). To evaluate the performance of the developed system, we extend the maximum test range of the vision system up to 1.2 m (covering the range > 0.9 m). Table 3 shows the experimental results of the 3D-SHT algorithm. Table 5 shows the Average Size of Boundary Box (ASoBBx), Average Number of Pixels (ANoP), and Average Number of Computation Candidates (ANoCC) of each object within the images as a function of distance.
Rather than utilising all point candidates of an object to compute its geometric properties, which is computationally inefficient and time-consuming, a voxel downsampling algorithm is utilised. This step generates the computation candidates of each object, which leads to more efficient computation. ANoCC therefore records the average number of points used in the calculation for each object in the different ranges of operating distance.
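A minimal voxel downsampling sketch is given below. Replacing the points of each voxel by their centroid is one common choice and is an assumption here, as are the function name and the voxel size; the paper does not state which representative point is kept.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same voxel by their centroid."""
    buckets = defaultdict(list)
    for point in points:
        # Integer voxel index along each axis.
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        buckets[key].append(point)
    return [
        tuple(sum(axis) / len(group) for axis in zip(*group))
        for group in buckets.values()
    ]


# Three raw points; the first two share a voxel of size 1.0.
raw_points = [(0.25, 0.25, 0.25), (0.75, 0.75, 0.75), (1.5, 0.0, 0.0)]
candidates = voxel_downsample(raw_points, voxel_size=1.0)
print(len(raw_points), "->", len(candidates), candidates)
```

The number of returned candidates per object is what ANoCC averages. The same integer voxel keys, kept as a set, would give a flat binary occupancy representation, although OctoMap itself stores occupancy in an octree.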
The experimental results in Table 3 show that the accuracy of the 3D-SHT algorithm decreases as the distance from the depth camera to the objects increases. The fluctuation range of the estimated object parameters also becomes larger with increasing distance. This is caused by the change of object scale within the images at different distances. As shown in Table 5, with increasing distance from the depth camera to the objects, the ASoBBx, ANoP, and ANoCC decrease dramatically. A limited number of point candidates cannot describe the objects with enough information and detail, which may lead to false predictions. The experimental results (Table 3) show that, within a distance of 0.7 m, the centre estimation algorithm performs the task robustly with an acceptable fluctuation of the estimate (around 5.5 mm). The accuracy of the centre estimation algorithm in the ranges of 0.3-0.5 m and 0.5-0.7 m is 0.955 and 0.925, respectively. The experimental results of the fruit pose estimation algorithm are shown in Table 4. As with centre estimation, the pose estimation algorithm works robustly within a range of 0.7 m. In the range of 0.3-0.5 m, the accuracy is 0.923 and the SDs of $\theta$ and $\phi$ are 5.2° and 4.9°; in the range of 0.5-0.7 m, the accuracy is 0.885 and the SDs are 8.6° and 7.6°, respectively. With increasing distance from the depth camera (decreasing ANoCC), the accuracy and robustness of pose estimation decrease significantly. Therefore, a 3DVFH+ based pose verification algorithm is utilised to check the correctness of the estimated pose, which helps ensure the safety of the manipulator.

Modelling of partially Blocked Objects
Modelling of fruits which are partially blocked by other objects is an important task in the robotic harvesting. With the advance of deep-learning based instance segmentation method, DASNet can robustly perform instance seg-

Evaluation on Overall System
a. Point Clouds Processing
In the unstructured orchard environment, the depth camera can be affected by the complex arrangement of elements within the scenarios, for example through inaccurate depth sensing of points due to a mismatch between the RGB image and the depth image. In addition, the depth sensing of an object can be affected by adjacent objects. Such defects can severely affect the centre localisation and pose estimation algorithms (as shown in Figure 9). Therefore, the point cloud of each object is first de-noised before fruit modelling. That is, the points of an object are classified as inliers or outliers based on Euclidean distance, and the points classified as outliers are deleted from the point list. Moreover, based on the size of the object bounding boxes, objects without a sufficient number of points or with severely imbalanced lengths along the different axes (X, Y, Z) are deleted from the object list (as shown in (b) and (c) in Figure 9). In the recent work on robotic harvesting of strawberries [14], similar depth-sensing issues are reported; that work applied a Density-Based Spatial Clustering (DBSC) method to process the point clouds. Our implementation of point cloud de-noising is more concise and computationally efficient. Meanwhile, the experimental results also indicate that our method is effective for robotic harvesting in unstructured orchards.
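The two filtering rules can be sketched as follows. The text specifies only that outliers are found by Euclidean distance and that objects with too few points or strongly imbalanced axis lengths are rejected; measuring the distance to the coordinate-wise median, and the thresholds `max_dist`, `min_points`, and `max_axis_ratio`, are assumptions of this sketch, as are the function names.

```python
import math
from statistics import median


def remove_outliers(points, max_dist):
    """Keep points within max_dist of the coordinate-wise median (assumed rule)."""
    centre = tuple(median(axis) for axis in zip(*points))
    return [p for p in points if math.dist(p, centre) <= max_dist]


def is_valid_object(points, min_points, max_axis_ratio):
    """Reject objects with too few points or a strongly unbalanced extent."""
    if len(points) < min_points:
        return False
    extents = [max(axis) - min(axis) for axis in zip(*points)]
    shortest = min(extents)
    if shortest <= 0:
        return False  # degenerate extent along at least one axis
    return max(extents) / shortest <= max_axis_ratio


# Five points near the origin plus one stray depth reading far away.
fruit_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (50, 50, 50)]
inliers = remove_outliers(fruit_points, max_dist=2.0)
print(len(inliers), is_valid_object(inliers, min_points=4, max_axis_ratio=3.0))
```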

b. Experiment Results in Orchard
The experiments were conducted in the apple orchard located at Qingdao, China. The planting layout of the orchard is shown in Figure 5. The experiments were run between 10:00 and 16:00 on days in the apple harvest season. As shown by the results of processing scenarios within the apple orchard in Figure 10, the developed visual perception and modelling algorithm can efficiently sense the fruits and branches from the RGB image and model the 3D working scenario from the depth map.

Conclusion
In this work, a visual perception and modelling algorithm for robotic harvesting of apples in the unstructured orchard is developed. The visual perception comprises a multi-purpose network, DASNet, which performs detection and instance segmentation of fruits and semantic segmentation of the working scenarios. The visual modelling algorithm performs centre localisation and pose estimation of the fruits and models the elements within the scenarios to guide the motion of the manipulator. The experimental results show that the visual perception and modelling algorithm can accurately detect and localise the fruits and model the working scenarios in real orchard environments. The $F_1$ score and MIoU of DASNet on fruit detection and segmentation are 0.833 and 0.852, respectively. The accuracies of centre localisation and pose estimation are 0.955 and 0.923, respectively. Furthermore, a 3DVFH+ based pose verification algorithm is developed to ensure the safety of the manipulator in operation.