Identification and Classification of Mechanical Damage During Continuous Harvesting of Root Crops Using Computer Vision Methods

Detecting sugar beetroot crops with mechanical damage using machine learning methods is necessary for fine-tuning beet harvester units. The Agrifac HEXX TRAXX harvester with an installed computer vision system was investigated. A video camera (24 fps), connected to a single-board computer, was installed above the turbine that receives the dug-out beets after the digger. At the preprocessing stage, static and insignificant image details were revealed using the Canny edge detector and the excess green minus excess red (ExGR) method. The identified areas were excluded from the image, and the remaining areas were glued with similar areas of another image. As a result, the number of images entering the second stage of preprocessing was reduced by half. Then Otsu's binarization was used. The main stage of image processing is divided into two sub-stages: detection and classification. The improved YOLOv4-tiny method was chosen for root crop detection using a single-board computer (SBC). This method allows processing up to 14 images of 416 × 416 pixels per second with 86% precision and 91% recall. To classify root crop damage, we considered two algorithms as candidates: 1. bag of visual words (BoVW) with a support vector machine (SVM) classifier using histogram of oriented gradients (HOG) and scale-invariant feature transform (SIFT) descriptors; 2. convolutional neural networks (CNN). Under normal lighting conditions, CNN showed the best accuracy, which was 99%. The implemented methods were used to detect and classify blurred images of sugar beetroots, which were previously rejected. For the improved YOLOv4-tiny, precision was 74% and recall was 70%; CNN classification accuracy was 92.6%.


I. INTRODUCTION
Sugar beet is one of the main crops in the world, and it is usually harvested mechanically with multi-row self-propelled harvesters. In this case, the leaves are removed from the beets along with the petioles and the crown, and the root is lifted from the soil during the harvesting process. The soil is removed from the roots in the cleaning zone, and the roots are then transported to the harvester tank and stored. The quality of sugar beet is strongly affected by root damage, bruising, and tearing during harvesting [1]. Mechanical damage is the main cause of root crop diseases and loss of root mass. Root vegetables use energy to heal wounds, which reduces the amount of sugar obtained from them [2]. The main factors affecting the magnitude and character of damage to root crops are soil moisture, varietal characteristics, and plant size, as well as the design and operation of beet harvesters [3]. Sugar beet can suffer mechanical damage in all processes and components of the harvester [4]. Modern beet harvesters are high-tech complexes that include a unit for topping and grubbing, a separating complex for cleaning root crops, and a bunker for beets with an unloading conveyor. Each of the mechanisms has many settings depending on the harvester model. Failure in any of them leads to crop loss. In particular, this is due to improper adjustment of the beet head cutter, incorrect adjustment of the depth and distance between the openers of the root crop grubber, incorrect displacement of the harvesting unit relative to the beet rows, an incorrectly selected rotation speed of the separating complex, conveyor, elevator, etc. [5], [6].

(Volume 10, 2022. The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino. This work is licensed under a Creative Commons Attribution 4.0 License; see https://creativecommons.org/licenses/by/4.0/.)
A significant part of the losses can be avoided by installing adaptive devices based on computer vision systems. The authors of the article [7] used a digital twodimensional imaging system coupled with convolutional neural network (CNN) to detect defects in harvested sugar beet. They used high fps video cameras and artificial lighting to produce high quality images. Various detector models based on the CNN, including You Only Look Once (YOLO) v4, region-based fully convolutional network (R-FCN) and faster regions with convolutional neural network features (Faster R-CNN) were developed. The authors obtained good results in detecting damage, however, their system does not allow real-time operation.
The work [8] is devoted to real-time tuber image processing under normal weather conditions using an inexpensive 4-megapixel RGB camera. The computer vision system was installed on a Grimme SE-170/60 harvester. The authors implemented an algorithm based on Mask R-CNN. The method's quality is highly dependent on the tuber density, which, in our opinion, does not allow us to recommend it for use without modification. The work [9] is devoted to the problem of developing and adjusting the harvester's adaptive systems. A profiling mechanism and a hydraulic driving system were developed, and the interaction between the soil and the harvester's profiling mechanism was explored. Profiling control accuracy ranged from 86% to 92%. The harvester's adaptive system did not use computer vision. In our opinion, such adaptive systems have a serious drawback: they cannot be customized to the individual characteristics of each root crop.
In digital agriculture [10], [11], computer vision systems are used to quickly detect and count plants [12]-[15], to determine their ripeness and diseases [16]-[20], as part of systems to protect against weeds and pests [21], [22], and to determine the position of cattle [23]. In recent years, publications have shown that the problem of identifying diseased or mechanically damaged fruits on transportation systems such as conveyor belts, drums, and turbines can be successfully solved using image processing algorithms [24]-[31].
When working with such images, it should be borne in mind that not all of their area is suitable for processing. There are static parts that get in the way of tracking root crops. Such parts can be removed using special algorithms based on the interframe difference method, the Gaussian background difference method, the ViBe background extraction algorithm [32], or its advanced version ViBe+ [33]. Other methods can work with incomplete images: a spatial graph convolutional network (SGCN) [34], a method based on the U-Net architecture [35], and a local descriptor based on several grids with a dimensionality reduction approach [36], [37].
Further, the search for objects in the image can be performed using the following methods: the visual words method, neural network architectures such as R-CNN, Fast R-CNN, Faster R-CNN, and the Viola-Jones method, which can be considered one of the best in terms of the ratio of recognition efficiency/speed of work [38]. This method usually searches for faces and facial features by the general principle of the scanning window, however, there are successful attempts to use it as a detector when detecting potato tubers on a conveyor belt [39].
Most often, convolutional neural networks are used for object classification tasks. However, as the authors of [40] showed, convolutional neural networks are not designed to work with high-resolution images on devices with weak processors. To obtain an acceptable receptive field with convolutional layers, it is necessary to use large kernels (for example, 7 × 7 or 9 × 9) or a large number of layers [41]. Both of these schemes lead to a very significant slowdown of the system. Therefore, most low-end systems are limited to image sizes of less than 41 × 41 pixels to achieve acceptable image processing time. Moreover, the processing of each such frame can take several seconds. In the conditions of a continuously moving harvester, this is unacceptable. Under these conditions, algorithms with a favorable ratio of speed to resources and power deserve special attention [42].
The currently popular modifications of the convolutional neural network, such as R-CNN, Fast R-CNN, and Faster R-CNN, outperform the visual words method [43] in these characteristics; however, the visual words method should not be underestimated. For example, the histogram of oriented gradients (HOG) descriptor shows good results in low-visibility conditions, where neural architectures practically stop working. This is important for fieldwork without good lighting.
One-pass detectors SSD (Single Shot MultiBox Detector), RetinaNet, and YOLO are considered popular high-speed methods for object recognition [44]-[47]. At the same time, tests show that the modern architectures YOLOv3 and YOLOv4 surpass their counterparts in performance and accuracy [48]-[52]. Although the YOLO architecture requires relatively large processing resources of the image analysis system, its high speed makes it possible to use it in real-time systems [53]. Also, due to the use of the grid approach, these networks can be used to crop the background [48]. The modular structure of these algorithms allows them to be modified for specific tasks, further improving performance and quality metrics. For example, the version with densely connected convolutional networks is intended to improve feature reuse during propagation through the model [54], [55], and the YOLOv3-tiny version takes up a small amount of memory and gives a higher FPS [56]. The YOLOv4-LITE version uses MobileNetv2 as the model's backbone and uses the DO-Conv convolution [57]. The Fusion-YOLO version uses CSPDenseNet and feature fusion modules to maximize gradient flow differences, and the CSPResNeXt network to reduce excess gradient flow [58].
Computer vision systems used in agriculture should be as adaptive as possible and easily integrated into production processes [59]. They should not introduce additional restrictions on processing speed or on the placement of agricultural equipment units, and they should be as informative as possible.
The main limitation of optical methods is the blurring of images in working conditions, resulting from vibrations of the working elements of agricultural machines transferred to the camera's optical system. The problem of blurry images was addressed in [60] using coffee beans as an example. The authors developed an algorithm based on deep neural networks, which showed good results with slight blurring of grain boundaries. However, with significant blurring of images, this method does not give satisfactory results.
Other researchers often combine YOLO with GAN-type architectures. Thus, in [61], a YOLOv4 model combining an attention mechanism and DeblurGANv2 is used for real-time gesture recognition. Reference [62] used deblurring methods for remote sensing tasks. Reference [63] introduces a two-phase framework of deblurring and object detection, adopting a slimmed version of a deblurring generative adversarial network model and a YOLOv2 detector. The work [64] is devoted to Deblur-YOLO, a real-time object detector with efficient motion deblurring.
Therefore, according to the authors, methods based on the YOLO architecture are currently best suited for detecting damage to sugar beets using digital cameras installed on the harvester and will be analyzed in this work. There are relatively few publications comparing fast algorithms used in vision systems designed to assess the condition of root crops. The purpose of this study is to find the optimal detector for sugar beetroots and to determine the optimal method for classifying defects in the mechanical processing of the root crop.

II. MATERIALS AND METHODS

A. DATA COLLECTION AND PREPROCESSING
For the experiment, we used an Agrifac HEXX TRAXX beet harvester with a hitch designed to process 12 rows of beets at the same time. Digging up the beets begins with foliage removal. This process is carried out in two stages: first, the main part of the tops is removed by the topper so that it does not interfere with further processing of the root crops. The operator regulates the scalping system, located after the topper, from the cabin (see Figure 1). It removes the remaining parts of the leaves as accurately as possible. After the tops are cut, the roots are dug up by a lifting system. Its working parts are the digging shares, which are adjusted using the digger wheels. The dug-out beets are fed to turbines, which provide a continuous flow of root crops for further cleaning and transportation. Damage to beets can be caused by a wrong haulm cut or by a broken tail.
After investigating the possible attachment points for video cameras, the accessibility for installation and maintenance of computer vision systems, the information content of the resulting image, and the possibility of minimizing the time for reconfiguring the cutting, rotating, and other mechanisms, several promising observation points were selected. One of them is located directly in front of the driver's cab.
A video camera with a frame rate of 24 fps was installed at this point. The image from the video camera is shown in Figure 2. This figure shows a rotating turbine that uses centrifugal force to separate the roots from the ground. During the initial run, the static parts of the scene are identified, and a mask is formed that replaces these parts with black. When this mask is applied, only the dynamic parts remain in the image.
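The masking step can be sketched as follows. The exact procedure is not given in the text, so this is a minimal illustration under assumed details: the ExGR index (ExG = 2g − r − b, ExR = 1.4r − g) highlights plant matter, and pixels whose values barely change across the initial frames are treated as static and masked out. The variance threshold and the number of calibration frames are hypothetical.

```python
import numpy as np

def exgr_index(img):
    """Excess green minus excess red (ExGR) on a float RGB image in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    exg = 2.0 * g - r - b          # excess green
    exr = 1.4 * r - g              # excess red
    return exg - exr

def static_mask(frames, thresh=1e-3):
    """Pixels whose variance across the initial frames is below `thresh`
    are treated as static; the returned mask is True where the scene moves."""
    var = np.var(np.stack(frames), axis=0).max(axis=-1)
    return var > thresh

# Toy usage: a green "root crop" moving diagonally over a static background.
frames = [np.zeros((8, 8, 3)) for _ in range(3)]
for i, f in enumerate(frames):
    f[i, i] = [0.1, 0.9, 0.1]
mask = static_mask(frames)         # True only along the moving patch's path
```

Applying `mask` to each incoming frame keeps only the dynamic regions, matching the "replace static parts with black" behavior described above.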
It was found that, with the increase in the speed of root crop movement caused by the turbine rotation, the optical system we used can no longer produce clear images (see Figure 4).
The contours of root crops in the lower part of the image are not clear-cut and are open. However, some areas are quite suitable for processing. A slight displacement of successive images relative to each other is due to the harvester's movement. While the root crop is still in the ground, its upper part has already been cut off, and from the appearance of the cut it is possible to determine how effectively this was done by the haulm removal device. An incorrect setting of this device will result in significant yield loss and must be carefully monitored by our system. As the root crop rotates, the system manages to capture its image from several sides and determine possible damage. This makes it possible to react instantly to any deviation of the digging share from movement along the beet row.

B. EQUIPMENT SELECTION
When choosing equipment, we settled on the use of single-board computers. This choice greatly simplifies the task of positioning and powering the computer vision system. An NVIDIA Jetson Nano was used. Unlike its closest competitor, the Raspberry Pi 4, the NVIDIA Jetson Nano has a 128-core Maxwell GPU clocked at 921 MHz and a large number of interfaces. It supports the NVIDIA CUDA, cuDNN, and TensorRT software libraries and several popular AI frameworks such as PyTorch, Caffe, and MXNet. It supports CSI and USB cameras.

C. METHOD SELECTION
Object detection algorithms are constantly being improved. Only a few years ago, reliable real-time operation of such algorithms was out of the question; now such algorithms exist. In this article, we select the most suitable ones. To detect root crops, we choose one of the YOLO methods. To classify root crop damage, we considered two algorithms as candidates: 1. bag of visual words (BoVW) with a support vector machine (SVM) classifier and various descriptors; 2. convolutional neural networks.

1) DESCRIPTORS USED
The HOG method assumes that the distribution of image intensity gradients makes it possible to accurately determine the presence and shape of objects in the image. The image is divided into cells. In each cell, a histogram h_i of the gradient directions of its interior points is calculated. The cell histograms are combined into one histogram (h = f(h_1, . . . , h_k)), which is then normalized by brightness. The normalization factor can be obtained in several ways, but they all show approximately the same results, so we take one of the options:

h' = h / sqrt(||h||_2^2 + ε^2),    (1)

where ||h||_2 is the L2 norm and ε is a small constant. When calculating gradients, the image is convolved with the kernels [−1, 0, 1] and [−1, 0, 1]^T.

The SIFT method builds a scale space

L(x, y, σ) = G(x, y, σ) * I(x, y),    (2)

where L(x, y, σ) is the value of the Gaussian-blurred image at the point with coordinates (x, y) and blur radius σ; G(x, y, σ) is the Gaussian kernel; I(x, y) is the value of the original image; and * is the convolution operation. The difference of Gaussians is the image obtained by pixel-by-pixel subtraction of the Gaussian of the original image from the Gaussian with a different blur radius (kσ):

D(x, y, σ) = L(x, y, kσ) − L(x, y, σ).    (3)

When moving from one level of the Gaussian and difference-of-Gaussians pyramid to the next, the image dimensions are halved. After building the pyramid, key points are determined as the local extrema of the difference of Gaussians.
False key points are discarded, and for the remaining ones, their orientation is calculated. The gradient magnitude m and direction θ are determined from formulae (4) and (5):

m(x, y) = sqrt((L(x + 1, y) − L(x − 1, y))^2 + (L(x, y + 1) − L(x, y − 1))^2),    (4)

θ(x, y) = tan^−1((L(x, y + 1) − L(x, y − 1)) / (L(x + 1, y) − L(x − 1, y))).    (5)
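The magnitude and direction computation can be checked with a small NumPy sketch (here np.arctan2 stands in for the bare tan^−1 so that a zero denominator is handled gracefully; the ramp image is a made-up example):

```python
import numpy as np

def grad_mag_dir(L, x, y):
    """Gradient magnitude m and direction theta at pixel (x, y),
    computed by central differences as in formulae (4) and (5)."""
    dx = L[x + 1, y] - L[x - 1, y]
    dy = L[x, y + 1] - L[x, y - 1]
    m = np.hypot(dx, dy)            # sqrt(dx^2 + dy^2)
    theta = np.arctan2(dy, dx)      # quadrant-aware tan^-1(dy / dx)
    return m, theta

# A linear ramp: intensity grows by 2 per step along the first axis only,
# so the central difference gives dx = 4 and dy = 0 everywhere inside.
L = np.fromfunction(lambda i, j: 2.0 * i, (5, 5))
m, theta = grad_mag_dir(L, 2, 2)    # m = 4.0, theta = 0.0
```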
The SIFT descriptor is a vector. The method takes a 4 × 4 square area centered at the key point and rotates it according to the key point's orientation. Each area element stores the gradient values in eight directions.

2) BOVW METHOD
BoVW is used to improve the performance of descriptors. Since the principle of operation of BoVW with all descriptors is the same, we will give an example with the HOG (Figure 5). The descriptor itself and examples of its use can be found in [3].
This approach considers the blocks as key parts of the plant, and each block's HOG represents the local information of the corresponding part. Next, we cluster the HOGs of all the blocks in the training set into homogeneous groups using K-means; the cluster centers are the mean values of the blocks' HOGs within each cluster. These centers play the role of visual words in BoVW.
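As a sketch of how the visual words are used after clustering (the K-means training itself, e.g. with scikit-learn's KMeans, is omitted and the centers are assumed to be precomputed), each block descriptor is assigned to its nearest center, and the image is represented by a normalized word-frequency histogram, which is then fed to the SVM:

```python
import numpy as np

def bovw_histogram(descriptors, centers):
    """Assign each local descriptor (e.g., a block HOG) to its nearest
    visual word and return the normalized word-frequency histogram."""
    # Pairwise squared distances, shape (n_descriptors, n_words).
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Two visual words; three descriptors, two near word 0 and one near word 1.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[0.1, 0.2], [0.3, -0.1], [9.8, 10.2]])
h = bovw_histogram(desc, centers)   # -> [2/3, 1/3]
```

The resulting histogram is the fixed-length feature vector that makes variable numbers of blocks per image compatible with an SVM classifier.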

3) YOLO
At the moment, three YOLO modifications are most actively used: YOLOv3, YOLOv4, and YOLOv5. Compared to most of their competitors, these algorithms show impressive results. However, although they are single-pass, their large number of convolutional layers does not allow them to process a large number of object images at the same time, so we were interested in their lightweight versions. YOLOv3-tiny and YOLOv4-tiny are simplified versions of YOLOv3 and YOLOv4, respectively. They have a faster detection rate while maintaining high accuracy, which allows these networks to be installed on devices with low computing power. In addition, we studied one modification of YOLOv4-tiny that seemed the most promising to us: the improved YOLOv4-tiny [65] (see Figure 6).
It consists of a backbone, a neck, and a detection head. The backbone is used to extract features from video frames; the improved YOLOv4-tiny uses the CSPDarknet53_tiny network. Compared to CSPDarknet53 [66], CSPDarknet53_tiny uses only three CSP modules and replaces the Mish activation function with the Leaky ReLU activation function, which significantly reduces the complexity and number of parameters of the feature extraction network [67]. The neck fuses features into a feature pyramid network to extract feature maps at various scales [68]. The advantage is that the target acquisition speed is effectively increased while providing higher accuracy. The detection head is used for classification and regression. The improved YOLOv4-tiny uses feature maps at three different scales to predict detection results: 13 × 13, 26 × 26, and 52 × 52.

4) CNN IMAGE CLASSIFICATION
The VGGNet network was chosen for this study. VGG-16 consists of five different blocks that are set in series so that the output of each block is defined as the input of the next block (see Figure 7). With this architecture, the network extracts properties such as texture, shape, and color from input images.
VGG-16 contains 13 convolutional layers with 3 × 3 kernels and five 2 × 2 maximum pooling layers. The activation function for each convolutional layer is the ReLU (Rectified Linear Unit) function, which performs the operation f(x) = max(0, x) on each input. To diagnose different types of damaged root crops and separate them from undamaged ones, a modified model was developed to classify four types of conditions, namely undertopped, well topped, overtopped, and broken tail. The modified model based on the VGG-16 configuration is followed by a classifier block. The classifier block contained two maximum pooling layers with a 2 × 2 window. Dropout layers were added after each maximum pooling layer for regularization. A flatten layer was then applied to the output of the second dropout layer. The output of this layer was connected to a batch normalization layer, and the output of the batch normalization was passed to a third dropout layer. Finally, the output of the last dropout layer was passed through a fully connected layer containing four neurons, each corresponding to the probability of one of the four classes.

D. PERFORMANCE CRITERIA
The performance of the model was assessed by comparing the samples of root crops classified by the algorithm with their visual classification by experts (the traditional method). The criteria used to evaluate performance on the training and test sets were precision, recall, and F1, calculated as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1 = 2 · Precision · Recall / (Precision + Recall),

where TP (true positive) indicates the number of correctly detected root crops, FP (false positive) indicates the number of falsely detected root crops, and FN (false negative) indicates the number of missed root crops. For each category in target detection, a PR curve can be plotted based on precision and recall. The AP value is the area between the PR curve and the coordinate axis, and mAP is the average AP value over all categories.
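The three criteria can be computed directly from the detection counts. The counts below are hypothetical, chosen only so that precision lands near the clear-image figures reported later:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 86 correct detections, 14 false alarms, 9 missed roots.
p, r, f1 = detection_metrics(tp=86, fp=14, fn=9)   # p = 0.86, r = 86/95
```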

E. STATISTICAL ANALYSIS
The tests were repeated 5 times for each pass of the harvester during the experiment. The mean and standard deviation of TP, FN, and FP values were calculated from the measurements.
The obtained values were used to calculate Precision and Recall values.

III. RESULTS

A. DATA PREPARATION AND EVALUATION METRICS
To train models for root crops after harvesting, samples of different classes were selected (see Table 1).
A 5-fold cross-validation method was applied to them. The data was divided into two parts: the training and testing samples comprised 80% and 20% of all data, respectively. The distribution structure of training and testing data is shown in Figure 8. For each fold, 80% of the data was used for training, while the rest was evaluated during the testing phase. The data used in the test phase were shifted in each fold. The parameters used in training are shown in Tables 2 and 3.
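The fold rotation can be sketched in plain Python (in practice, scikit-learn's KFold from the authors' stack does the same job; shuffling and class stratification are omitted here):

```python
def kfold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation;
    the test block is shifted in each fold."""
    fold = n_samples // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold))
        held_out = set(test)
        train = [j for j in range(n_samples) if j not in held_out]
        yield train, test

# With 5 folds, each fold trains on 80% of the data and tests on 20%.
folds = list(kfold_indices(10, k=5))
```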
Image processing was carried out using algorithms developed in Python with Scikit-learn, TensorFlow, and OpenCV.

B. OBJECT DETECTION PERFORMANCE
The characteristics achieved by the YOLO models with a clear video recording of sugar beet in the images of rotating turbines from the separating complex are shown in Figure 9.
The optimal balance between the Precision and Recall parameters was obtained with IOU >= 0.5. The IOU operation computes the ratio of the area of intersection to the area of union of the predicted bounding box and the ground truth bounding box [69]. Table 4 shows data on the methods for IOU >= 0.5.
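The IOU computation can be illustrated for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping in a 5x10 strip: IOU = 50 / 150 = 1/3,
# below the 0.5 threshold, so this pair would not count as a detection.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```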
The processing time of one 416 × 416-pixel image was analyzed for the four methods presented above. The slowest among them turned out to be YOLOv4: when installed on our SBC, it was able to process no more than 4 frames per second. The fastest was YOLOv4-tiny; its result surpassed YOLOv4 by 2.5 times and amounted to 10 frames per second. However, this method showed the worst result in detecting core infertility, so we chose the improved YOLOv4-tiny, which allows processing about 9 frames per second.
To speed up the method, we tested gluing two consecutive photographs together, replacing blocks that do not contain images of root crops. This allowed us to process 14 frames per second. Taking into account the fact that each root crop appears in the image at least 3 times, this speed is sufficient to ensure the operation of the computer vision system in real-time.
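The text does not detail the gluing procedure, but one plausible reading is the following sketch: each frame is divided into a grid of blocks, and blocks of the first frame that contain no root crops (e.g., according to the detector's grid output) are overwritten with the corresponding blocks of the second frame, so two frames are processed as one. The block size and occupancy map here are hypothetical.

```python
import numpy as np

def glue_frames(frame_a, frame_b, occupied_a, block=4):
    """Compose a single frame: blocks of `frame_a` that contain no root
    crops (occupied_a[i, j] is False) are replaced with the corresponding
    blocks of `frame_b`, halving the number of frames to process."""
    out = frame_a.copy()
    for i in range(occupied_a.shape[0]):
        for j in range(occupied_a.shape[1]):
            if not occupied_a[i, j]:
                ys = slice(i * block, (i + 1) * block)
                xs = slice(j * block, (j + 1) * block)
                out[ys, xs] = frame_b[ys, xs]
    return out

a = np.zeros((8, 8), dtype=int)                  # first frame (all zeros)
b = np.ones((8, 8), dtype=int)                   # second frame (all ones)
occ = np.array([[True, False], [False, False]])  # only top-left block of `a` kept
glued = glue_frames(a, b, occ)
```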
To classify sugar beet crops according to different types of damage, we used BoVW+SVM with two descriptor options: HOG and SIFT. The classification accuracy results of the HOG(SIFT)-BoVW-SVM and CNN models are presented in the normalized confusion matrix shown in Figure 10.
True positive rates (diagonal values) and false positive rates (off-diagonal column entries) of each class can be observed in this matrix.

C. IMAGE PROCESSING WHEN VIDEO RECORDING IS BLURRY
At the working speed of the turbine rotation (see Figure 1), part of the image obtained from the 24 fps camera turns out to be blurred. By tracking the same root crop in successive frames from the video camera, one can compare its clear image with a blurry one. A clear image is observed in the part of the rotating turbine where the root crops have only just landed on it, and a blurry image is observed when the roots begin to rotate at the speed of the turbine of the separating complex. Using the already trained YOLO network, we collected a new dataset in which each identified and classified root crop is paired with its blurred image.
Taking into account the uneven distribution of root crops by class and the limited computing resources, an average of 400 objects per class were randomly selected for the training sample and 200 for the test sample.
The characteristics achieved by the YOLO models with a blurry video recording of sugar beet on the images of turbines from the separating complex are shown in Figure 11. The results of the four YOLO methods with a balance between the Precision and Recall parameters at IOU >= 0.5 are presented in Table 5.
The classification accuracy results of the HOG (SIFT)-BoVW-SVM and CNN models when the video is recorded  blurry are presented in the normalized confusion matrix, shown in Figure 12.

IV. DISCUSSION
At this stage in the development of computer vision technologies, object classification is no longer a problem. Convolutional neural networks, decision trees, and other methods began to outperform humans several years ago. However, we drew attention to some limitations of these methods associated with their application in real conditions of harvesting. A beet harvester is a very complex machine, consisting of several units, each of which individually ensures the overall quality of the product. The computer vision system can provide fine-tuning of these units in real-time.
At the preprocessing stage, static and insignificant image details were excluded by the Canny edge detector and excess green minus excess red method. As a result, the number of images before Otsu's binarization was reduced by half. The main stage of image processing is divided into two sub-stages: detection and classification. The improved YOLOv4-tiny method was chosen for detection. HOG (SIFT)-BoVW-SVM and CNN were used for classification.
The improved YOLOv4-tiny allows processing up to 14 images of 416 × 416 pixels per second with 86% precision and 91% recall. Under normal lighting conditions, CNN showed the best results, with accuracy up to 99% depending on the damage class.
For blurred images, the improved YOLOv4-tiny achieved 74% precision and 70% recall, and CNN classification accuracy was 92.6%. These methods allow detecting and classifying blurred images of sugar beetroots, which were previously rejected.
The developed method made it possible to classify images of root crops that were previously rejected. The use of such an algorithm in combination with the harvester's onboard computer makes it possible to significantly reduce the amount of crop damage.

ADAM EKIELSKI received the B.Eng., Ph.D., and Prof.Tit. degrees from the Warsaw University of Technology with a specialization in industrial automation. Currently, he works with the Faculty of Production Engineering, where he gives lectures on food technology, process engineering, and the construction of machines for the food industry. He is the Founder and CEO of the HISMART startup, which produces biodegradable smart packaging. He is a Professor with the Warsaw University of Life Sciences. He is the author of the first book on the Polish market about agrotronics, where readers can find information about novel mechatronic solutions in modern agricultural vehicles and machines.
TIMUR GATAULLIN graduated from Lomonosov Moscow State University. He is an employee of the State University of Management, a candidate of physical and mathematical sciences, and a doctor of economic sciences. He is a specialist in the application of mathematical apparatus to problems in economics and finance, production optimization, and management processes.

SERGEY GATAULLIN graduated from the State University of Management. He is an employee of the Financial University under the Government of Russia and a candidate of economic sciences. He is a specialist in the field of mathematical methods of decision-making and economic and mathematical modeling. Since 2016, he has been engaged in research and administrative work at the Faculty of Information Technology and Big Data Analysis of the Financial University.