Real-Time Apple Detection System Using Embedded Systems With Hardware Accelerators: An Edge AI Application

Real-time apple detection in orchards is one of the most effective ways of estimating apple yields, which helps in managing apple supplies more effectively. Traditional detection methods rely on computationally intensive machine learning algorithms and demanding hardware setups, which are not suitable for in-field real-time apple detection because of weight and power constraints. In this study, a real-time embedded solution inspired by “Edge AI” is proposed for apple detection, implementing the YOLOv3-tiny algorithm on various embedded platforms: a Raspberry Pi 3 B+ in combination with the Intel Movidius Neural Computing Stick (NCS), Nvidia’s Jetson Nano, and Jetson AGX Xavier. The training data set was compiled from images acquired during a field survey of an apple orchard situated in the north of Italy, and the images used for testing were taken from a widely used Google data set by selecting the images containing apples in different scenes, to ensure the robustness of the algorithm. The proposed study adapts the YOLOv3-tiny architecture to detect small objects. It shows the feasibility of deploying the customized model on cheap and power-efficient embedded hardware without compromising the mean average detection accuracy (83.64%), achieving a frame rate of up to 30 fps even in difficult scenarios such as overlapping apples, complex backgrounds, and apples partly hidden by leaves and branches. Furthermore, the proposed embedded solution can be deployed on unmanned ground vehicles to detect, count, and measure the size of apples in real-time, helping farmers and agronomists in their decision making and management.


I. INTRODUCTION
Monitoring of agricultural farms and orchards mainly relies on skilled farmers and workers, who are responsible for assessing several growth stages before performing farming-related actions in order to maximize quality and yield. This manual work consumes time and increases production costs, and workers with less knowledge and experience make unnecessary mistakes. With the advancements in precision agriculture and information technology, crop imaging has become an important source of information that can be used to assess the vegetation status of crops, fruit growth, yield, and quality.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhanyu Ma.
Two important features that enable the farmers to estimate crop-load and yield mapping in tree fruit crops are fruit counting and size estimation. Several studies have proposed fruit detection in orchards using machine vision systems for automatic growth assessment, robotic harvesting, and yield estimation [1], [2]. Apple crop-load management has gained much importance due to its impact on yield production. It has been the primary problem to develop algorithms that enable the apple harvesting robot to directly, quickly, and accurately recognize fruits in real-time [3]. In the natural environment, for the visual systems, apple fruit detection is typically more difficult because of the influence of lights and shadows, branches, and leaf coverings. Apple's visual appearances in the natural environment may be categorized as non-occluded fruits and occluded fruits.
Occlusion of fruits by leaves, branches, and other fruits, together with variable lighting conditions, is among the main reasons that make it challenging to achieve good accuracy and robustness in fruit detection [1]. In a few studies, experiments have been performed at night, with tunnel structures formed around the tree canopies to deal with variable lighting conditions. Images of the tree canopies are taken from both the front and the back using multiple sensors to avoid fruit occlusion [4], [5], [7], which leads to high fruit detection accuracy. Fruit size estimation is also needed for automatic robotic harvesting of mature and good-sized fruits; however, because of the difficulty of real-time robotic harvesting, only a few studies [7], [8] have exploited fruit size estimation using a machine vision system. Wang et al. [7] performed experiments using an RGB-D camera and thin lens theory to estimate the size of mango fruits in trees. Regunathan et al. [9] tested ultrasonic sensors, which provide range information, together with color images for size estimation of citrus fruit.
With the upsurge of machine learning, deep learning algorithms have been extensively used in agriculture-related applications [10]. Deep learning can be used for crop mapping [11], [12], crop image segmentation [13], and crop target detection [14], [15]. Convolutional Neural Networks (CNNs) are used in [16] to extract target regions in the image, segment objects, and count the number of fruits on a tree using a successive CNN counting algorithm. Dias et al. [13] used a CNN in combination with a support vector machine (SVM) to automatically extract the features of apple blossoms and counter complex backgrounds, which leads to more accurate apple blossom area segmentation than in previous studies. Faster R-CNN [17] was employed with the region proposal network (RPN) method to detect the region of interest (ROI) in images with a complex background scene, followed by a classifier that classifies the bounding boxes. Faster R-CNN with VGG16 net [18] is the state-of-the-art method in fruit detection [10]. However, although the region proposal and classification networks of Faster R-CNN produce excellent results in terms of accuracy, the detection speed is slow, so the method cannot achieve good results in real-time with high image resolution. The You Only Look Once (YOLO) method [19], [20] treats classification and localization as a regression problem. A YOLO network directly performs regression to detect targets in the image without an RPN; hence it is fast and can be implemented in real-time applications. The state-of-the-art version (YOLOv3) [20] not only has high detection accuracy and speed but also performs well in detecting small targets. However, the YOLOv3 model is not suitable for real-time applications such as harvesting robots because of its complex architecture, which requires more processing power. Optimization of the model parameters reduces the computational complexity and is thus needed for deployment on edge devices such as the Jetson and the Raspberry Pi.
Training and validation on large data sets require high-performance computing machines such as clusters or servers, which are widely used for the deployment of power-intensive deep learning algorithms [21], [22]. For low-power end devices, however, researchers have raised concerns about the efficiency of CNNs on real-time embedded platforms [23], [24]. Network optimization (i.e., network pruning or quantization) is a technique to reduce the model size by compressing the dense model into a sparse or low-bit architecture with minimal or even no accuracy drop.
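To make the two optimization techniques named above concrete, the following is a minimal NumPy sketch of magnitude-based pruning and symmetric 8-bit quantization applied to a single weight tensor. It is an illustration under stated assumptions, not the procedure used in this work: the function names, the 50% sparsity level, and the per-tensor scaling scheme are all hypothetical choices.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude; everything at or below it is removed.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask


def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns int8 values and the scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to floating point."""
    return q.astype(np.float32) * scale


# Hypothetical convolutional kernel: 16 filters of size 3x3 over 3 channels.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 3, 3, 3)).astype(np.float32)

pruned = magnitude_prune(w, sparsity=0.5)   # dense -> sparse
q, scale = quantize_int8(pruned)            # float32 -> low-bit
recon = dequantize(q, scale)

print("fraction of zero weights:", np.mean(pruned == 0))
print("max reconstruction error:", np.max(np.abs(recon - pruned)))
```

The rounding error of this scheme is bounded by half a quantization step, which is why a small accuracy drop is plausible; in practice the pruned or quantized network is usually fine-tuned or calibrated afterwards.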

A. AI ON THE EDGE AND RELATED WORK
Real-time smart solutions based on deep learning must be energy efficient, affordable, and small in form factor, with a fine balance between accuracy and power consumption. Deep learning based architectures are conventionally deployed within a centralized cloud computing environment. However, constraints such as considerable network latency and energy and financial overheads affect the overall performance of the system. To deal with these limitations, edge computing, often called ''edge AI'', has been introduced, where computations are performed locally on the data acquired from various devices or sensors.
The challenge in meeting the implementation requirements for edge AI is to ensure high output accuracy of the algorithms while consuming low power. Nevertheless, innovation in hardware options, involving central processing units (CPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), and system-on-a-chip (SoC) accelerators, has made edge AI possible. NVIDIA, Intel, and Qualcomm are the leading market brands contributing enormously to the development of AI at the edge. Among these, Intel's Movidius Neural Computing Stick (NCS) is the cheapest device on which to implement computationally intensive algorithms with multiple CNN layers. In [25], a CNN model was deployed on the NCS to perform classification of 3D voxel based point clouds.
NVIDIA's Jetson is another emerging family of embedded hardware and a broadly used accelerator for machine learning algorithms. A promising feature of Jetson is its CPU-GPU heterogeneous architecture [26], [27], where the CPU boots up the firmware and the CUDA-capable GPU has the potential to accelerate complex machine-learning tasks. Key features include a small form factor, light weight, and low power consumption. However, exploiting the full potential of Jetson and attaining real-time performance requires an optimization phase for both the Jetson hardware and the NN algorithms. The Jetson variants TK1, TX1, and TX2 have been widely used in the past few years. For example, in [28], the low-cost TK1 was used for drowsiness detection using model compression of deep neural networks. The Nvidia TX1 was used in a tennis ball collection robot based on deep learning [29]. In [30], a qualitative comparison was made among various hardware platforms, and the TX2 ranked the highest in terms of throughput. The authors used Tiny-YOLO for object detection and claimed a better product of accuracy and frame rate than YOLO and SSD. In [31], Tiny-YOLO was deployed on the TX2 to perform detection and localization of a robot using a Kinect-V2 visual sensor. A cascaded CNN model was deployed on the TX2 for semantic weed classification using multispectral images for smart farming [32].
In our work, we deployed a modified version of the YOLOv3-tiny algorithm on embedded platforms such as Raspberry Pi 3 B+ in combination with Intel Movidius Neural Computing Stick (NCS), Nvidia's Jetson Nano and Jetson AGX Xavier for real-time apple detection.
The rest of the paper is organized as follows. Section II covers the data set description and hardware details. In Section III, the architecture framework is described, with further explanation of the customization performed for small object detection. Section IV describes the experimental results and discussion, followed by the conclusion.

II. MATERIALS AND DATA
An orchard has been considered in order to acquire a custom data set for the training process. Subsequently, a technique dubbed transfer learning [47] has been applied, re-training and fine-tuning a custom version of the YOLOv3-tiny network specifically optimized for accurate detection of small objects on embedded devices. After training, the resulting network has been benchmarked on all images of the OIDv4 [46] data set (training, validation, and testing) with the Apple class, producing a reproducible metric that can be easily compared with future works. Finally, the trained model has been tested on several edge AI devices, assessing their performance in terms of speed and power consumption.

A. DATA SET DESCRIPTION
Two popular types of apple (Braeburn and Fuji), which are the most common types found in the north of Italy, were considered in this study. An image acquisition campaign was conducted on randomly selected healthy apple trees in orchards, using an 18-megapixel digital reflex camera at different times and on different days of September. Images were acquired of separate (non-overlapping) fruits and of overlapping or occluded fruits under variable lighting conditions: fruits fully exposed to the sun from the front, fruits lit by full sun from the back, and fruits shaded by leaves, branches, or other apples.

B. HARDWARE DESCRIPTION
The concept of Edge AI consists of performing computations locally on an embedded system in real-time. Since the training process requires much more computational power than the inference process, it is not performed on the embedded system but on a dedicated workstation. The model with the obtained weights is then deployed on the target hardware for execution.
The workstation used for training was equipped with an NVIDIA RTX 2080Ti GPU with CUDA 10 and 64GB of DDR4 SDRAM. This GPU model features 544 Tensor Cores, an NVIDIA technology specifically designed to boost matrix multiplication performance and thus able to speed up the training process of deep learning models. The computational power of this GPU allows reaching peak performances of about 26.9 TFLOPs (FP16) [33].
For the embedded implementation of the model, different hardware platforms have been considered, as shown in Fig. 2: a Raspberry Pi 3 B+ with both generations of Intel Movidius Neural Compute Stick accelerators, an NVIDIA Jetson Nano, and an NVIDIA Jetson AGX Xavier. Table 1 shows a comparison between the main specifications of the selected embedded hardware.
Neural Compute Sticks (NCS) are USB hardware accelerators specifically designed to perform AI computations. Two generations of NCS have been tested: the first is powered by a Myriad 2 VPU (Visual Processing Unit) processor [34], while the second features a Myriad X VPU [35]. These two chips are designed by Intel Movidius to accelerate deep neural network inference. Since the Neural Sticks provide a USB 3.0 interface, they are suitable for use with embedded, lightweight, and cheap computers such as a Raspberry Pi.
NVIDIA, on the other hand, provides a family of boards that feature an embedded computer with a dedicated GPU for hardware acceleration. The boards examined in this work, the AGX Xavier and the Nano, are the two most recent Jetson platforms presented by NVIDIA.
The AGX Xavier was released in autumn 2018 and is currently the most powerful Jetson board available. It features a Volta GPU micro-architecture with 64 Tensor Cores, able to reach up to 11 TFLOPS (FP16), and two NVDLA (NVIDIA Deep Learning Accelerator) engines. These engines are specifically designed to efficiently perform standard neural network operations such as convolutions. A single NVDLA is able to compute up to 2.5 TFLOPS. Thus, the overall peak performance of the AGX Xavier is about 16 TFLOPS. The board can work in different power modes, and it gives the user the possibility to select the number of working CPU cores. The available power modes are 10W (2 cores), 15W (4 cores), and 30W (2, 4, 6 or 8 cores) [36].
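As a quick check of the quoted overall figure, assuming it is simply the sum of the GPU peak and the peaks of the two NVDLA engines:

```latex
\[
\underbrace{11}_{\text{GPU (FP16)}} \;+\; \underbrace{2 \times 2.5}_{\text{two NVDLA engines}} \;=\; 16 \ \text{TFLOPS}
\]
```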
The Jetson Nano was presented in June 2019, targeting applications where reducing the board size, power consumption, and price is important. For hardware acceleration, it features an NVIDIA Maxwell GPU with a peak performance of 472 GFLOPs. The Nano board does not include any deep-learning-specific accelerator and can work in two power modes, 5W or 10W [36]. It therefore cannot take advantage of Tensor Cores or NVDLA engines for inference acceleration.

III. METHODOLOGY AND ARCHITECTURE FRAMEWORK
YOLO is a network specifically designed for fast and accurate real-time object detection. Its accuracy is comparable with that of other popular object detection algorithms such as RetinaNet [37] and Faster-RCNN [38], but it is much faster and more compact, which makes it an optimal choice for real-time embedded applications. It is a single fully convolutional network (FCN) that takes a raw image as input and outputs the bounding boxes and related classes of the objects recognized in the presented scene.
Since 2016 different versions have been released [19], [20], [39] that gradually have increased the accuracy of the general framework without giving away too much of its inference speed. At the same time, all different versions have been released with a lighter counterpart dubbed ''tiny'' that has a simplified and optimized structure without loss of too much accuracy. The intrinsic characteristics of the ''tiny'' version make it suitable for AI applications at the edge with the use of embedded systems, enhanced with hardware accelerators. For this reason, this research has taken the last available ''tiny'' version of YOLO, YOLOv3-tiny, as a starting point for the realization of an embedded apple detector system.
In the rest of this section, the fundamental working principles of the network are presented, together with the modifications applied to the original ''tiny'' architecture to make it suitable for the detection of smaller objects in the scene, such as apples.

A. ARCHITECTURE OF THE ORIGINAL FRAMEWORK
YOLOv3-tiny, as already introduced, makes use of only convolutional layers, making it a fully convolutional network that can accept inputs of different sizes during and after training. It can be divided into two main blocks: the first one is the feature extractor, or backbone, dubbed darknet-19. Its principal and fundamental role is to extract features in a hierarchical fashion, starting from the raw pixels coming from the input layer. Indeed, the extracted representations are later used as a starting point by the other modules of the network. Darknet-19 is a light and efficient feature extractor, but it can be easily swapped with any other backbone such as ResNet [40], DenseNet [41], etc. It features a standard architecture greatly inspired by VGGNet [18], making use of only 3x3 filters throughout the entire structure and of max-pooling layers to reduce the dimensionality of the input volume and obtain local invariance. Finally, darknet-19 exploits Batch Normalization layers [43] to accelerate the network training, reducing the internal covariate shift. All backbone blocks use [...] rare features. In all our experimental evaluations it proved to give better results than classical L2 regularization [54], which directly modifies the cost function. We set $w_d = 0.0001$.
All training has been carried out on a workstation with an NVIDIA RTX 2080 Ti and 64GB of DDR4 SDRAM.
Training took one hour on average using the TensorFlow framework and CUDA 10.

B. QUANTITATIVE RESULTS: MODEL PERFORMANCE
To understand the model performance, the mean average precision (mAP) is computed on the test dataset. Mean average precision is a popular object detection scoring method that assesses the network performance in detecting the target objects for different values of the target intersection over union $IOU_{target}$. This methodology has been presented for the PASCAL Visual Object Classes (VOC) challenge 2012 [55]. Each predicted bounding box $i$ is compared with the ground truth and marked as correct (true positive, TP) if the apple is present and $IOU_i > IOU_{target}$. If $IOU_i$ is lower than the target or the apple is not present, the prediction is marked as incorrect (false positive, FP). Finally, all the apples not detected are marked as missing predictions (false negative, FN). Since the predicted bounding boxes are given as output only if they have a level of confidence above a certain threshold $c$, it is possible to compute the precision ($p$) and the recall ($r$) of the network over the test dataset as a function of $c$:
$$p(c) = \frac{TP(c)}{TP(c) + FP(c)}, \qquad r(c) = \frac{TP(c)}{TP(c) + FN(c)}$$
[Table: Comparison between the power consumption and performance of the different devices, achieved with our customized version of YOLOv3-tiny. Jetson series boards can be run in different power modes, reducing current absorption at the expense of computational capability. The mode column shows the theoretical maximum absorbed power in each working mode, which differs from the actual power dissipated during the execution of the algorithm. The best performance, in terms of frames per second, is highlighted with a red rectangle.]
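The TP/FP/FN marking rule and the resulting precision and recall at a given confidence threshold can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported results: the box format `(x_min, y_min, x_max, y_max)`, the function names, and the greedy one-to-one matching in order of decreasing confidence are assumptions, and the boxes in the example are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, c=0.25, iou_target=0.5):
    """Precision and recall at confidence threshold c.

    predictions: list of (confidence, box); ground_truth: list of boxes.
    Assumption: each ground-truth box can be matched by at most one prediction.
    """
    kept = sorted((p for p in predictions if p[0] > c),
                  key=lambda p: p[0], reverse=True)
    matched = set()
    tp = fp = 0
    for _, box in kept:
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            value = iou(box, gt_box)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou > iou_target:
            matched.add(best_idx)   # true positive
            tp += 1
        else:
            fp += 1                 # IoU too low, or no apple there
    fn = len(ground_truth) - len(matched)  # apples never detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: three apples, four predictions (one below the threshold).
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
predictions = [
    (0.9, (0, 0, 10, 10)),        # exact match
    (0.8, (21, 21, 31, 31)),      # shifted by one pixel, IoU = 81/119
    (0.6, (100, 100, 110, 110)),  # background
    (0.1, (50, 50, 60, 60)),      # discarded because confidence <= c
]
p, r = precision_recall(predictions, ground_truth)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

In this made-up example two predictions are true positives, one is a false positive, and one apple is missed, so both precision and recall equal 2/3.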
Computing all the possible values of $p(c)$ and $r(c)$, it is possible to get the precision/recall curve. The graph is then usually smoothed in order to get a monotonically decreasing precision curve by setting $p(r) = \max_{r' \geq r} p(r')$. The average precision of the network is computed as the area under the obtained curve and is always a number between 0 and 1:
$$AP = \int_0^1 p(r)\, dr$$
An average precision equal to 1 means that the detector is able to reach a perfect precision (100%) for all the values of recall. Thus it is possible to find a value of $c$ such that we are able to detect all the objects with correct bounding boxes. On the other hand, an average precision of 0 means that we cannot detect any object correctly whatever value of $c$ we choose, thus both $p(c)$ and $r(c)$ are always equal to 0. For a multi-class object detection algorithm, the mean average precision is the mean of the AP over all the classes. In our specific context, we are dealing with apples only, thus AP = mAP. The mean average precision thus gives information on the quality of the network detection that is independent of the chosen $c$, which can then be selected according to whether precision or recall is more important for the specific application.
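The smoothing step and the area computation described above can be sketched in a few lines of NumPy. The sketch assumes that each detection has already been marked as a true or false positive for the chosen IoU target; the function name and the example values are hypothetical and are not taken from the experiments.

```python
import numpy as np


def average_precision(confidences, is_true_positive, num_ground_truth):
    """Area under the smoothed precision/recall curve.

    confidences: confidence score of each detection.
    is_true_positive: 1 if the detection was marked TP, 0 if FP.
    num_ground_truth: total number of annotated objects (TP + FN).
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp_flags = np.asarray(is_true_positive, dtype=float)[order]

    # Lowering c one detection at a time gives every (r(c), p(c)) point.
    tp_cum = np.cumsum(tp_flags)
    fp_cum = np.cumsum(1.0 - tp_flags)
    recall = tp_cum / num_ground_truth
    precision = tp_cum / (tp_cum + fp_cum)

    # Pad so the curve spans the whole recall range [0, 1].
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))

    # Smoothing: p(r) = max over r' >= r of p(r').
    precision = np.maximum.accumulate(precision[::-1])[::-1]

    # Sum the rectangles between consecutive distinct recall values.
    idx = np.nonzero(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[idx + 1] - recall[idx]) * precision[idx + 1]))


# Hypothetical detections on an image set containing four apples.
ap = average_precision(
    confidences=[0.9, 0.8, 0.7, 0.6],
    is_true_positive=[1, 0, 1, 1],
    num_ground_truth=4,
)
print(f"AP = {ap:.4f}")
```

With these values the smoothed curve has precision 1 up to recall 0.25 and 0.75 up to recall 0.75, so the area is 0.25 + 0.1875 + 0.1875 = 0.625. One apple is never detected, so the curve contributes nothing beyond recall 0.75.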
Different values of mAP can be computed depending on the selected $IOU_{target}$. Usual values are 0.5 and 0.75, in order to evaluate the model performance with different requirements on the accuracy of the object locations. Table 4 presents the recall and precision for the default confidence threshold $c = 0.25$ and the mean average precision for the two values of $IOU_{target}$. The same computation has been performed with the original YOLOv3-tiny architecture, retrained for apple detection only with the same methodology described in section IV-A. The results presented in Table 4 show how the change in the architecture can boost the mAP on the test dataset by up to 6.6%.

C. QUANTITATIVE RESULTS: EMBEDDED IMPLEMENTATION
After the training, the model has been deployed on the different hardware platforms presented in section II-B. We tested the performance in terms of absorbed power and frame rate.
First, we measured the power consumption of the different boards (Jetson AGX Xavier, Jetson Nano, Raspberry Pi 3B+) at idle, and then we executed the algorithm for nearly 5 minutes to make sure a steady state was reached. We directly measured the current absorbed from the power source, thus obtaining the power consumption of the entire system. Since the Jetson boards allow the user to select different working conditions, we tested all of them. The results are presented in Table 5.
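The conversion from a current measurement to system power, and from power to energy spent per processed frame, can be sketched as below. All numbers in the example (supply voltage, current samples, frame rate) are hypothetical placeholders chosen only to exercise the functions; they are not measurements from this study, and the helper names are illustrative.

```python
def mean_power_watts(supply_voltage_v, current_samples_a):
    """Mean electrical power P = V * mean(I) drawn from the power source."""
    if not current_samples_a:
        raise ValueError("at least one current sample is required")
    mean_current_a = sum(current_samples_a) / len(current_samples_a)
    return supply_voltage_v * mean_current_a


def energy_per_frame_joules(power_w, frames_per_second):
    """Energy spent on each processed frame at a steady frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return power_w / frames_per_second


# HYPOTHETICAL readings, for illustration only (not data from the paper).
SUPPLY_VOLTAGE_V = 5.0
idle_current_a = [0.40, 0.42, 0.41]   # board idle
load_current_a = [1.00, 1.10, 0.90]   # detector running at steady state
HYPOTHETICAL_FPS = 10.0

idle_power = mean_power_watts(SUPPLY_VOLTAGE_V, idle_current_a)
load_power = mean_power_watts(SUPPLY_VOLTAGE_V, load_current_a)
frame_energy = energy_per_frame_joules(load_power, HYPOTHETICAL_FPS)

print(f"idle: {idle_power:.2f} W, under load: {load_power:.2f} W")
print(f"power attributable to inference: {load_power - idle_power:.2f} W")
print(f"energy per frame: {frame_energy:.2f} J")
```

Energy per frame is one possible way to compare boards that run at different frame rates and power modes, since it combines the two quantities measured here into a single figure.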
The NVIDIA Jetson AGX Xavier is the best-performing platform, being able to reach 30 fps in the 30W operational mode. In the other modes, too, it is able to reach frame rates suitable for strong real-time applications. With the Jetson Nano, the frame rate drops to 8 fps in 10W mode, which can still be an acceptable value for soft real-time contexts. With the Raspberry Pi and Intel's NCS, the performance is lower still. Under the same running conditions, the more advanced NCS2 is able to outperform its predecessor in terms of both frame rate and power consumption. However, despite being more flexible, these USB accelerators cannot go beyond 5 fps in the best case.
[Figure: Qualitative results of some additional test images acquired from the same study site as the training data set. It is possible to notice how our custom version of YOLOv3-tiny is robust to different factors of variation [56] in how apples appear. Simultaneously handling variations in illumination, viewpoint, scale, occlusion, and background clutter is a challenging task that our system has to tackle in real-time with limited computational capabilities.]
An interesting comparison between the different platforms is the price/fps ratio, shown in Table 6. The Jetson Nano appears to be the best choice if we are looking for a balance between performance and cost. On the other hand, the AGX Xavier has the highest ratio, since it is the board with the highest quality, but it is certainly not suitable for low-cost solutions. The Intel Neural Sticks rank second and third for price/fps ratio, but it must be underlined that, since they are USB accelerators only, an additional embedded computer must be purchased, increasing the final cost. Fig. 7 presents some test images excluded from the training phase. It is possible to see how our network is able to recognize a great number of apples in several image conditions, such as different illumination and contrast. The network is able to detect the fruits at different scales, and in particular it can recognize very small apples, even in bad lighting conditions.

D. QUALITATIVE RESULTS AND COMPARISON
A comparison with the original architecture is presented in Fig. 6. We processed two images excluded from the training dataset with the default confidence threshold $c = 0.25$ and we computed precision and recall with $IOU_{target} = 0.5$. It is interesting to notice that precision and recall for image (a) are both 0, since no bounding box has sufficient intersection over union to be considered a true positive. Image (c) has precision equal to 1 but very poor recall, since the network is able to detect only 8% of the apples. Our architecture, on the other hand, is able to strongly increase the recall by detecting very small fruits, boosting the quality of the predictions. In this scenario, the mAP gain is much higher than on the test dataset taken from OIDv4. This is because the images of the apple class in the OIDv4 dataset present, on average, bigger apples than the training dataset taken in a real orchard, so the difference between the two architectures is less visible. On the other hand, on the test images taken from the same dataset used for training, the ability to detect small apples becomes fundamental to reach a high recall value. However, we presented our results for OIDv4 in order to make the experiments repeatable and allow direct comparison with our work.

V. CONCLUSION
A real-time apple detection system has been developed and tested on several edge AI devices. The classical YOLOv3-tiny architecture has been modified and adapted in order to increase its accuracy in the presence of small and largely occluded objects. It has been trained with a custom data set, acquired in a real orchard, and tested with all available images of OIDv4. Accuracy results have demonstrated a boost in terms of recall and precision in the presence of targets with disparate sizes. Experimental evaluations have been carried out in order to highlight the performance achieved in terms of inference speed and power consumption by the different embedded solutions selected. The experimental results have shown promising prospects for exploiting the tested system to produce real-time positions and numbers of detections with minimal power consumption. A complete framework could integrate the presented research for diverse purposes, from apple counting and harvesting to health assessment and smart packaging. Further work will focus on yield estimation, using the proposed methodology to count the number of apples reliably. Image registration has not been directly addressed in this study, but it is a strong requirement in order to reduce double counting and improve the precision of the system. Moreover, in view of an extension of the presented analysis, FPGA/ASIC implementation will be considered for future developments of this research study. Finally, the adopted methodology is not limited to the apple detection task, but could also be applied to other applications where the detection of small and tiny objects in real-time at the edge is needed.
VITTORIO MAZZIA received the master's degree in mechatronics engineering from the Politecnico di Torino, presenting a thesis ''Use of deep learning for automatic low-cost detection of cracks in tunnels,'' developed in collaboration with California State University. He is currently pursuing the Ph.D. degree in electrical, electronics, and communications engineering with the two interdepartmental centres PIC4SeR and SmartData. His current research interests involve deep learning applied to different tasks of computer vision, autonomous navigation for service robotics, and reinforcement learning. Moreover, making use of neural compute devices (like Jetson Xavier, Jetson Nano, Movidius Neural Stick) for hardware acceleration, he is currently working on machine learning algorithms and their embedded implementation for AI at the edge.