Identification and Depth Localization of Clustered Pod Pepper Based on Improved Faster R-CNN

Traditionally, the height of the end effector of a pod pepper harvester is fixed, so it can hardly adapt to the growth height of clustered peppers. Firstly, aiming at the problems of the small size and clustered growth of pepper fruits during the identification task, an improved Faster R-CNN algorithm is proposed. On the one hand, strategies such as increasing the types and number of high-resolution anchors and using RoI Align instead of RoI Pooling are employed to improve the detection accuracy for tiny targets. On the other hand, ResNet+FPN is adopted as the low-level feature extractor instead of the VGG16 and plain ResNet backbones, which effectively enhances the network's ability to extract small features. Furthermore, to precisely locate the position of clustered peppers, a height calculation model combining the 2D image recognition results with the corresponding depth information is proposed. Comparative experiments show that the overall accuracy AP and AP50 of our method reach 75.79% and 87.30%, respectively. Compared with the VGG16 feature extraction model, the two indicators are improved by 8.7% and 1.3%, respectively. The small-target detection accuracy APsmall is increased by about 11.4%, and the recall rate ARsmall by up to 10.2%. The overall loss rate Loss is reduced by 4.7%, a clear improvement over the YOLOv3 model. The detection time for a single frame is 42ms, slightly longer than that of the YOLOv3 network, but it still meets the real-time detection requirements of a pepper harvester. In the 3D localization experiment, the average absolute height error of clustered peppers from the ground is 4.4mm, corresponding to an average relative error of 1.1%, which satisfies the adjustment error requirement of the end effector.

2) Improvement of recognition accuracy for small-sized clustered or individual fruits by optimizing the hyperparameters and structure of the Faster R-CNN network. A ResNet50+FPN feature extraction layer, adjusted anchor scales and quantities, and RoI Align sampling are used together to improve the sampling accuracy and the network's ability to extract small features (see the configuration sketch after this list).
3) A spatial height localization model is constructed by combining RGB-D depth image information. The calculated height is the crucial input parameter for automatic height adjustment of the end effector of the pepper harvester.
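As an illustrative sketch (not the authors' released code), the improvements listed in 2) map naturally onto torchvision's Faster R-CNN implementation; the anchor sizes below are placeholders, and torchvision's detection head already uses RoI Align (MultiScaleRoIAlign) rather than RoI Pooling:

```python
import torchvision
from torchvision.models.detection.anchor_utils import AnchorGenerator

# Hypothetical configuration: ResNet50+FPN backbone, extra small anchor
# scales for tiny pepper clusters (one scale per FPN level), and the
# default MultiScaleRoIAlign box pooling.
anchor_gen = AnchorGenerator(
    sizes=((16,), (32,), (64,), (128,), (256,)),  # placeholder scales
    aspect_ratios=((0.5, 1.0, 2.0),) * 5,         # three ratios per level
)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None,
    weights_backbone="DEFAULT",      # ImageNet-pretrained ResNet50
    num_classes=2,                   # background + clustered pepper
    rpn_anchor_generator=anchor_gen,
)
```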

Our method follows the pipeline of Fig.1. The acquired RGB images are manually labeled using the LabelImg labeling software. During labeling, the smallest circumscribed rectangle method is adopted for clustered pepper fruits. Annotations are saved in XML format, and all images are organized in VOC format. To improve sample diversity for model training, the collected data need to be augmented. In our experiment, the original data are processed by rotating, flipping, embossing, adding noise, color enhancement, and changing the grayscale and contrast of the pictures, which brings the volume of RGB images to 2406. The original data are then additionally rotated by 15° and 30°, giving a final total of 3062 RGB images. Depth images are processed synchronously. The experimental data preprocessing is shown in Fig.3. For model training, the training and validation data are randomly allocated in a ratio of 9:1, that is, 2756 images in the training set and 306 in the validation set.

The network is trained with a multi-task loss; the loss function [12] is:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{1}$$

where $N_{cls}$ represents the number of all samples in a mini-batch, $N_{reg}$ represents the number of anchor positions (the total number of anchors generated on the feature map), and $\lambda$ is the weight factor balancing the two losses.
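A minimal PyTorch sketch of this multi-task loss (our own variable names, not the authors' code; $L_{cls}$ and the Smooth L1 term $R$ are the functions defined next) could look like:

```python
import torch
import torch.nn.functional as F

def rpn_multitask_loss(cls_logits, labels, box_pred, box_target, n_reg, lam=10.0):
    """Sketch of Eq. (1). cls_logits: (A, 2) scores for A sampled anchors;
    labels: (A,) with 1 for positive anchors (p_i* = 1) and 0 for negative;
    box_pred/box_target: (A, 4) offsets t_i / t_i*; n_reg: anchor positions."""
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / labels.numel()
    pos = labels == 1  # the regression term only counts positive anchors
    l_reg = F.smooth_l1_loss(box_pred[pos], box_target[pos],
                             reduction="sum") / n_reg
    return l_cls + lam * l_reg
```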

$L_{cls}$ denotes the multi-class cross-entropy loss (SoftMax Cross Entropy) for the classification task, which is defined as:

$$L_{cls}(p_i, p_i^*) = -\left[ p_i^* \log p_i + (1 - p_i^*) \log(1 - p_i) \right] \tag{2}$$

where $p_i$ represents the probability that the $i$-th anchor is predicted to be a true label, and $p_i^*$ is 1 when the sample is a positive sample and 0 otherwise.

For the bounding box regression task, the loss function is the same as in Fast R-CNN:

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*) \tag{3}$$

where $t_i$ represents the predicted bounding box parameters corresponding to the $i$-th anchor and $t_i^*$ represents the related ground-truth (GT) value. $R$ represents the Smooth L1 loss function, which is defined as:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{4}$$

The initial feature extraction network of the Faster R-CNN model is the VGG16 convolutional neural network. A large number of experiments have shown that, as the number of network layers increases, a convolutional neural network not only suffers degraded training results but also runs into gradient explosion or gradient vanishing. In this regard, we introduce the ResNet50+FPN network as the backbone feature extraction network (Fig.6). ResNet50 is a residual network that provides an effective solution to gradient vanishing.

The traditional Faster R-CNN network contains an RoI Pooling layer, which rounds the floating-point coordinates of candidate proposals and of the divided sub-regions, introducing misalignment. RoI Align instead utilizes bilinear interpolation in each sub-cell to calculate the output value of each sampling point, and outputs the maximum value of each sub-region for the fixed-size RoI via Max pooling, so it does not round the floating-point coordinates of candidate proposals and divided units. The bilinear interpolation is given by formulas (5)-(7):

$$f(x, y_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \tag{5}$$

$$f(x, y_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{6}$$

$$f(x, y) \approx \frac{y_2 - y}{y_2 - y_1} f(x, y_1) + \frac{y - y_1}{y_2 - y_1} f(x, y_2) \tag{7}$$

In the formulas above, $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinate values of $Q_{11}$ and $Q_{22}$, two of the four known neighboring grid points $Q_{11}=(x_1,y_1)$, $Q_{12}=(x_1,y_2)$, $Q_{21}=(x_2,y_1)$, $Q_{22}=(x_2,y_2)$ around the sampling point $(x, y)$.
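The practical effect can be illustrated with torchvision's operators (assumed tensor sizes, not the authors' code): roi_align samples with bilinear interpolation at floating-point coordinates, whereas roi_pool quantizes them first.

```python
import torch
from torchvision.ops import roi_align, roi_pool

feat = torch.randn(1, 256, 50, 50)                   # a single feature map
rois = torch.tensor([[0., 10.3, 12.7, 25.9, 30.1]])  # (batch_idx, x1, y1, x2, y2)
aligned = roi_align(feat, rois, output_size=(7, 7),
                    sampling_ratio=2)                # bilinear sampling, no rounding
pooled = roi_pool(feat, rois, output_size=(7, 7))    # coordinates are rounded
```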

We select AP, AP50, APsmall, and ARsmall as the evaluation indicators, where the subscript small denotes an indicator for tiny targets. Among them, AP is the mAP averaged over 10 IoU thresholds from 0.50 to 0.95 with a step of 0.05, AP50 is the mAP value at an IoU threshold of 0.5, and APsmall and ARsmall are the average precision and recall rate on tiny targets. Since the augmented 3062 images cannot meet the needs of training the weights from scratch, we adopt the idea of transfer learning: pre-trained weights published for the PASCAL VOC2012 dataset are utilized to enable faster convergence during model training.
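These indicators correspond to the standard COCO evaluation protocol; as a sketch (file names are placeholders, not the authors' data), they can be read off pycocotools as follows:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("pepper_val.json")              # placeholder ground-truth file
coco_dt = coco_gt.loadRes("detections.json")   # placeholder detection file
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate(); ev.accumulate(); ev.summarize()
ap       = ev.stats[0]   # AP, averaged over IoU 0.50:0.05:0.95
ap50     = ev.stats[1]   # AP at IoU threshold 0.50
ap_small = ev.stats[3]   # AP on small objects
ar_small = ev.stats[9]   # AR on small objects
```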

After an image passes through the deep network, a prediction box for each target object is generated. The midpoint of the prediction box is marked as its image position, which denotes the exact locale of the target object in the 2D RGB image. However, in the world coordinate system, the 3D coordinate of the target still needs to be determined through physical measurements such as distance or depth, which the Depth image captured by the depth camera can provide directly. RGB and Depth images have pixel-wise correspondence, so we can easily convert between pixel coordinates {u, v} and world coordinates {X, Y, Z} based on the mapped RGB and Depth images. The conversion is expressed by formula (12):

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \;\; t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{12}$$

where $K$ is the internal parameter matrix and $[R, t]$ the external parameter matrix of the selected depth camera.

In [12], [15], and [16], the optimization strategy of RoI Align instead of RoI Pooling is adopted to increase the detection accuracy for small target objects. In [17], this method is claimed to improve the detection ability for small targets on industrial aluminum profiles by 17%. In our experiment, the anchor size is fixed to Anchor-3, and Faster R-CNN networks with VGG16 and ResNet50 backbones are compared. The experimental results are shown in Table 3.
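For intuition, a minimal back-projection sketch of formula (12) is given below, assuming an aligned RGB-D pair and taking the camera frame as the world frame, i.e. $[R, t] = [I, 0]$; fx, fy, cx, cy are the entries of $K$:

```python
import numpy as np

def pixel_to_world(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth Z (meters) to 3D coordinates."""
    Z = depth_m               # depth read from the aligned Depth image
    X = (u - cx) * Z / fx     # inverse of the pinhole projection u = fx*X/Z + cx
    Y = (v - cy) * Z / fy
    return np.array([X, Y, Z])
```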

It can be seen from Table 3 that RoI Align performs better than RoI Pooling. For small objects, the average accuracy APsmall is improved by 11.17% and 9.98% over RoI Pooling for the VGG16 and ResNet50 backbones, respectively. The comparison of anchor settings is given in Table 4: the performance of each network with Anchor-5 is better than with Anchor-3 when the VGG16 and ResNet50 backbone networks are trained to the 50th epoch. The comprehensive comparison is summarized in Table 5.

In the horizontal comparison (Table 5), some evaluation indexes with Anchor-3 are lower than the results of the other networks, but with Anchor-5 the overall indexes show significant improvement. Among the results with Anchor-5, AP and AP50 are increased by 8.7% and 1.3% respectively, compared with the original VGG16 backbone network. The indexes for small targets, APsmall and ARsmall, are increased by about 11.4% and 10.2% separately. The overall loss rate Loss is reduced by 4.7%.

Fig.10 shows the longitudinal comparison results of Faster R-CNN and YOLOv3 when the Faster R-CNN networks take Anchor-5. Fig.10(a-d) compares the evaluation indicators AP, AP50, APsmall, and ARsmall. Within the shown epoch scope (200 epochs), YOLOv3 is still climbing toward stabilization; by increasing the number of training epochs, the YOLOv3 network parameters would still have room for improvement. It clearly suggests that all Faster R-CNN based network models achieve their best fit after 75 epochs, and that the ResNet50+FPN backboned Faster R-CNN converges faster and is more robust than the other networks. Fig.10(e) presents the corresponding loss comparison.

We also encounter some failure cases. Fig.11 presents two typical ones: a missed target on the left (the red dotted box), and multiple frames for a single target on the right (the two green solid boxes). The reasons for these two failures differ. In the first case, the targeted pepper-fruit cluster is occluded by a pepper stem and merged into the background; the second case is mainly caused by differential exposure of the different branches of one fruit cluster, due to its scattered growth characteristic.

The shooting range of the D435i depth camera is about 0.3m-3m. We randomly select 5 groups of RGB-D images at different depth detection points for comparative experiments, as shown in Table 6. According to our measurements, when the camera is positioned within 0.5m of the object, the depth value in some areas cannot be recorded, resulting in invalid depth values. When shooting in the range of 2.5m-3.0m, the height estimation error of predicted objects exceeds 2%. Therefore, pixels are shot relatively accurately by the D435i within the depth range of 0.5m to 2.5m.

Thus, we set the depth range of 0.5m-2.5m as the filtering condition for predicted proposals. Fig.12 shows the flowchart of our target filtering process. At the end of the prediction network, multiple prediction boxes are generated in the RGB domain; the central-point coordinates of the prediction boxes are then mapped into the corresponding Depth image, and finally the qualified prediction boxes are selected according to the depth-value filtering condition.
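A compact sketch of this filtering step (variable names are ours, not the authors' code) is:

```python
import numpy as np

def filter_by_depth(boxes, depth, z_min=0.5, z_max=2.5):
    """boxes: (N, 4) predicted (x1, y1, x2, y2) in RGB pixel coordinates;
    depth: HxW depth image in meters, pixel-aligned with the RGB frame."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        u, v = int((x1 + x2) / 2), int((y1 + y2) / 2)  # box center point
        z = depth[v, u]                                # mapped depth value
        if z_min <= z <= z_max:                        # valid D435i range
            kept.append([x1, y1, x2, y2])
    return np.asarray(kept)
```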

Among the four experimental groups that fall within the depth filtering range (Table 6), the average absolute error is about 4.4mm, with a relative error of 1.1%. Reference [19] states that it is necessary to account for the scale variation and changes of objects [20]. Appropriately expanding the image set has always been a challenging problem [21], because a detection model trained on high-resolution images usually cannot recognize or locate objects in low-resolution images [22]. Therefore, during network training, mitigating the target occlusion problem [23] and properly integrating higher-quality images is conducive to further improving recognition accuracy. Finally, our method is three times slower than the YOLOv3 network (Table 6), which is one of the main differences between one-stage and two-stage networks. Although our method meets the real-time detection needs of a pepper harvester, a lightweight network [24] has the characteristics of fewer training parameters, fast detection speed, high precision, and low demand on portable GPUs, which can reduce the cost of deploying the algorithm and the complexity of the equipment.

For the height positioning model, there are systematic and computational model errors when estimating the height of clustered pod-peppers from depth information. In the case of our depth camera, the D435i, these include calibration error, a 0-2% recording error for the depth image, image distortion due to camera jitter, and the estimation error incurred by formula (9). Scholars have attempted multiple high-precision-camera assisted shooting [26] and high-precision calibration algorithms [27] to minimize three-dimensional positioning errors.

Additionally, planting standardization at the data acquisition site is incomplete. A standardized planting method can not only reduce the damage rate of harvested fruits [28], but also effectively improve the accuracy of the target detection network and the height positioning model. It is also possible to integrate the detection and positioning of clustered pod-pepper fruits: [29] introduced an end-to-end RGB-D fusion deep learning network in which the tasks of target recognition and localization are realized at the same time.