Detection of Norway Spruce Trees (Picea abies) Infested by Bark Beetle in UAV Images Using YOLO Architectures

In recent years, massive outbreaks of the European spruce bark beetle (Ips typographus, (L.)) have caused colossal damage to coniferous forests. The main solution to this problem is the timely prevention of bark beetle spread, for which it is necessary to identify damaged trees at the early stages of infestation. Fortunately, high-resolution unmanned aerial vehicle (UAV) imagery together with modern detection models offers great potential for addressing such issues. In this work, we evaluate and compare three You Only Look Once (YOLO) deep neural network architectures, namely YOLOv2, YOLOv3, and YOLOv4, in the task of detecting infested trees in UAV images. We built a new dataset for training and testing these models and used a balance contrast enhancement technique (BCET) as a pre-processing step that improves the generalization capacity of the models. Our experiments show that YOLOv4 achieves particularly good results when the BCET pre-processing is applied: the best test result among the compared YOLO models was obtained for YOLOv4, with a mean average precision of up to 95%. As a result of applying the BCET pre-processing, the improvement for YOLOv2, YOLOv3, and YOLOv4 was 65%, 7.22%, and 3.19%, respectively.


I. INTRODUCTION
Preserving natural forests is essential for the environment, as they play a very important role in the global ecosystem. Unfortunately, several factors threaten the wellbeing of forests. One of them is pests that can attack trees, leading to their weakening or even death. In particular, the European spruce bark beetle (Ips typographus, (L.)) [1]-[3] is widespread in the coniferous, mainly spruce, forests of Eurasia (Sweden, Finland, Denmark, Germany, Bulgaria, and others). This beetle belongs to the class of especially dangerous forest pests. Usually, a small population of bark beetles attacks predominantly weakened trees. The flight of the beetles begins in spring, when the average temperature is about 18 °C. Infestation of Norway spruce trees is accompanied by mass reproduction of bark beetles that can last 4-5 years, in rare cases up to 9-12 years [4], [5]. Mortality in European forests from beetle attacks has increased by more than 8% from 1850 to the present, mainly due to increased temperatures and the increased frequency of droughts. In addition, during outbreaks bark beetles attack even healthy trees, leading to mass extinction of the forest [6], [7]. Since bark beetles cause significant damage to forestry and forest parks, it is important to develop a program for reducing their numbers. The first step of such a program should be forest monitoring, and one of the most effective ways of forest conservation monitoring is the use of remote sensing (RS) data.
The choice of methods for timely analysis of forest monitoring data is also important. Usually, monitoring of an infested forest is carried out during field research. However, due to the high level of outbreaks, it is impossible to carry out timely monitoring of vast forest areas in this way. The large size of forests and the hard-to-reach places under study rule out purely manual analysis, such as manual counting and individual detection of trees. Thus, to solve this problem, it is advisable to use Earth remote sensing tools for timely monitoring of the forest at different stages of infestation. Since unmanned aerial vehicles (UAVs) entered mass production, it has become possible for scientists to collect a large amount of local-area imagery with very high spatial resolution, but the problem of processing large arrays of information has remained. In particular, the resource-intensive processing of a large amount of unsorted RS data obtained from drones can require an expert opinion on the type of tree and manual counting of specimens with a particular state of crown integrity. However, it has recently been shown that such problems can be effectively solved with deep neural networks (DNNs) [8]-[10] (see Section II).
The goal of this work is the detection of infested trees in images obtained from UAVs using YOLO architectures, which employ convolutional neural networks (CNNs) [11]-[15]. First, we prepared a dataset for training and testing the YOLO architectures from the orthophoto images obtained with a UAV. Next, we applied a pre-processing procedure to the dataset; this procedure increases the contrast of the input images, which makes it possible to increase the accuracy of detecting individual tree crowns. We then trained and tested the YOLO architectures, using versions 2 to 4, compared these architectures, and determined the best YOLO architecture for the task of detecting infested trees.
The main contributions of this paper can be listed as follows:
• We have built a new dataset for the detection of bark beetle infestation covering four infestation categories of Norway spruce trees (Picea abies, (L.) Karst.). The dataset consists of 400 Red-Green-Blue (RGB) images with a spatial resolution of 3.75 cm/pixel.
• We have created a new annotator, called Visual Object Labeller 1.3, to quickly annotate our dataset.
• We have applied a balance contrast enhancement technique (BCET) as a pre-processing step to improve the quality of the images and hence increase the generalization capacity of the detection models.
• We have evaluated the YOLO object detection architectures of versions 2, 3, and 4 for the task of infested tree detection in UAV images.
The paper is organized as follows: related works are reviewed in Section II. The materials and methods are presented in Section III, where the study area is described in Subsection III.A, the pre-processing methodology is provided in Subsection III.B, the image annotator is described in Subsection III.C, the YOLO models are evaluated and compared in Subsection III.D, the experimental setup is described in Subsection III.E, and the evaluation metrics are given in Subsection III.F. The experimental results are presented in Section IV. The paper concludes with Section V. This article is based on Chapter 3 of Anastasiia Safonova's Ph.D. thesis [16].

II. RELATED WORKS
In this section, we consider related works devoted to the problem of tree detection in Earth-surface data using deep learning (DL). Object detection is the task of localizing all specified objects of a class and building a bounding box for each of them. Intelligent systems based on neural networks (NNs) can successfully solve plant recognition problems [17]. There are other, more traditional approaches to these problems, but they do not have the required flexibility outside of limited conditions [18]. NNs provide promising alternative solutions, and many applications benefit from their use.
Below are the works on object detection in the forestry industry on RS data using DNNs that are the closest to our problem. In [19], using a pre-trained YOLO architecture, the authors were able to obtain an average accuracy of up to 91.82% in the detection of affected pine trees in very high-resolution UAV images. Thanks to this solution, it became possible to localize an affected tree in various aerial images. In another work [20], the authors considered the similar problem of detecting dead pine trees in UAV data using the AlexNet and GoogLeNet CNNs, with a maximum accuracy of up to 97.38%. Tao et al. [21] performed insect-damaged tree detection (dead fir, sick fir, healthy fir, deciduous trees, grass, and uncovered ground) on DJI Mavic 2 Pro quadcopter data with a DL technique based on CNNs.
In our previous work [10], we detected four categories of Siberian fir trees (Abies sibirica, (L.)) damaged by the bark beetle Polygraphus proximus Blandford in UAV images using DL. We solved the problem of detecting individual trees in UAV images using our own algorithm. Then we classified the detected patches as belonging to one of four tree categories following Krivets et al. [22]: healthy tree or tree recently attacked by beetles, tree colonized by beetles, recently died tree, and deadwood. In the last step, we proposed a new CNN architecture specially designed to solve that problem. We also presented a comparison of the developed architecture with models such as VGG, ResNet, Inception-V3, InceptionResNet-V2, Xception, and DenseNet. It is important to note that our model, trained with data augmentation, showed up to 98.77% accuracy for the following categories of fir trees: healthy tree, recently dead tree, and deadwood. Our next work concerned individual tree crown delineation for species classification and assessment of the vital status of forest stands from UAV images. In that work, we proposed an approach to enhance algorithms for species classification and assessment of the vital status of forest stands by using automated individual tree crown delineation (ITCD). The performance of the ITCD algorithm was demonstrated on different test plots containing homogeneous and complex structured forest stands. The pixel-by-pixel classification was based on the ensemble supervised classification method of error-correcting output codes, with the Gaussian kernel support vector machine chosen as the binary learner.
We demonstrated that pixel-by-pixel species classification of multi-spectral images can be performed with a total error of about 1%, which is significantly less than that obtained by the processing of RGB images. For typical scenes, the crown contouring accuracy is about 95%. The advantage of the proposed approach lies in the combined processing of multispectral and RGB photo images from DJI quadcopters [23].
This experiment is complementary to our latest research [23]. In this paper, we analyze the detection of four infestation classes of Norway spruce trees attacked by the European bark beetle (green-, yellow-, red-, and gray-attack (see Section III.A)) on UAV images using YOLO architectures. Then we compare the quality of the trained architectures on the test dataset.

III. MATERIALS AND METHODS

A. STUDY AREA
The study area is located in the West Balkan Mountains (at up to 1300-1500 m altitude), south of the administrative center of Chuprene, Vidin Province, Bulgaria (Figure 1). The area lies within the protected Chuprene Biosphere Reserve, included in the UNESCO list [23]. The reserve was created with the aim of preserving the environment of this unique natural complex. Regular and detailed remote monitoring is required to take timely management actions aimed at solving problems associated with damage to forest stands by stem pests and with changes in stand structure parameters.
We used RS data obtained from a DJI Phantom 4 Pro UAV with an RGB (red, green, and blue channels) camera at a resolution of 3.75 cm/pixel. The drone flew several times on August 16, 2017, and September 25, 2017, at the maximum permitted altitude of 120 meters above the ground. The object of the study is natural forests damaged as a result of attacks by European spruce bark beetles (Ips typographus, (L.)) [24]. The forests mainly consist of Norway spruce (Picea abies, (L.) Karst.), European beech (Fagus sylvatica, (L.)), Scots pine (Pinus sylvestris, (L.)), and Black pine (Pinus nigra, (L.)). However, in our experiments the object of research is predominantly damaged Norway spruce trees (Picea abies, (L.) Karst.). Trees infested by bark beetles can be classified into four attack stages based on visible crown symptoms. The initial attack is not visible to the human eye (green-attack), and a tree of this class cannot be visually differentiated from a healthy one by its crown alone. With thousands of beetles attacking one Norway spruce tree, the needles first turn yellow (yellow-attack), then reddish brown (red-attack), and finally grey (dead tree) [25]-[27]. Verification of infested trees in the images was done by Bulgarian experts in the field of forest entomology and phytopathology, Prof. DSc Georgi Georgiev and his colleagues, in accordance with the methodical manual for the assessment of crown conditions of ICP "Forests" [28].
The image dataset was prepared using QGIS software (Quantum GIS v. 2.14.21). The RGB images are represented by unsigned integers from 0 to 255 in natural colors. We built a dataset of 400 images, of which 80% were used for training and 20% for validation, while two plots, A and B, were reserved for external testing (Table 1). We pre-processed the original dataset using pixel contrast enhancement. During the training process, we applied dynamic data augmentation using standard functions: rotation, horizontal flip, vertical flip, and resizing. It should also be noted that the number of damaged trees varies from class to class, since the experiment used data from a natural forest, so the resulting dataset is unbalanced. The total number of trees in the training and validation datasets per class is as follows: a-green-attack (594 trees), b-yellow-attack (1206 trees), c-red-attack (104 trees), and d-grey-attack (277 trees). The total number of trees in test plot A is 38, where 12, 23, 1, and 2 correspond to classes a, b, c, and d, respectively. The total number of trees in test plot B is 51, where 12, 34, 3, and 2 correspond to classes a, b, c, and d, respectively.
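As an illustration, the 80/20 random split described above can be sketched as follows (a minimal sketch; `split_dataset` is a hypothetical helper name, and the actual split used in this work may have been performed differently, e.g. spatially or per plot):

```python
import random

def split_dataset(image_ids, train_frac=0.8, seed=42):
    """Shuffle image ids reproducibly and split them into
    training and validation subsets (80/20 as in the paper)."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)
    return ids[:n_train], ids[n_train:]

# 400 images -> 320 for training, 80 for validation
train_ids, val_ids = split_dataset(range(400))
```

The two external test plots A and B stay outside this split entirely, so they never influence training or model selection.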

B. PRE-PROCESSING METHODOLOGY
UAVs are an affordable and widely used resource for local monitoring of plants. However, in practice, the quality of the images obtained from a UAV can deteriorate due to suspended particles in the atmosphere, weather conditions, and the quality of the equipment itself, resulting in low-contrast images [29]. In this work, to improve the quality of the images, we used a balance contrast enhancement technique (BCET) [30], which we successfully applied in our previous work [23].
The contrast of the image can be stretched or compressed without changing the histogram pattern of the input image. For this, a parabolic function in the general form can be used, whose coefficients depend on the input image:

$I_{New} = m\,(I_{Old} - n)^2 + k$, (1)

where $I_{Old}$ is the intensity of the input image (of one of the three color channels) and $I_{New}$ is the intensity of the output image. The coefficients m, n, and k are calculated using the minimum and maximum intensity values of the input (l and h) and output (L and H) images and the mean intensity values of the input (e) and output (E) images:

$n = \dfrac{h^2 (E - L) - s (H - L) + l^2 (H - E)}{2\,[h (E - L) - e (H - L) + l (H - E)]}$, (2)

$m = \dfrac{H - L}{(h - l)(h + l - 2n)}$, (3)

$k = L - m\,(l - n)^2$, (4)

where s denotes the intensity mean square sum of the input image (5):

$s = \dfrac{1}{N} \sum_{i=1}^{N} I_{Old}^2(i)$, (5)

where the summation is taken over all N image pixels. Note that the target values of the parameters L, H, and E are set manually. An example of the quality improvement for one image is shown in Figure 2: a clearer visual separation of the individual tree crowns can be observed in the processed image. The histogram of the processed image in Figure 2 indicates that the values of all RGB channels are spread across the full range from 0 to 255, while the values of the channels of the source image span only from ~90 to ~200, i.e., after applying BCET the data occupy much more of the available dynamic range. Thus, as a result of the pre-processing, the crowns of damaged trees are more clearly distinguished in the image.
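For reference, the BCET mapping described above can be implemented channel-wise in a few lines of NumPy (a sketch only; the function name `bcet` and the target values L = 0, H = 255, E = 110 are illustrative assumptions, since the paper sets L, H, and E manually):

```python
import numpy as np

def bcet(channel, L=0, H=255, E=110):
    """Balance contrast enhancement technique for one image channel.
    Maps the input range [l, h] with mean e onto [L, H] with target
    mean E via the parabola y = m*(x - n)**2 + k."""
    x = channel.astype(np.float64)
    l, h, e = x.min(), x.max(), x.mean()
    s = np.mean(x ** 2)  # mean square sum of input intensities
    # coefficient n (vertex abscissa of the parabola)
    n = (h**2 * (E - L) - s * (H - L) + l**2 * (H - E)) / \
        (2.0 * (h * (E - L) - e * (H - L) + l * (H - E)))
    # coefficient m (scale) and k (offset), so that l -> L and h -> H
    m = (H - L) / ((h - l) * (h + l - 2.0 * n))
    k = L - m * (l - n) ** 2
    y = m * (x - n) ** 2 + k
    return np.rint(np.clip(y, L, H)).astype(np.uint8)
```

Applying this mapping to each of the three RGB channels reproduces the stretch of the dynamic range described above: the channel minimum maps to L, the maximum to H, and the mean to approximately E.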

C. AN ANNOTATOR OF DATASET
For the convenience and acceleration of data annotation for this work, a new lightweight version of software for annotation according to the YOLO standard was developed to label objects in images [31]. The proposed annotator does not require additional graphics libraries and has a convenient interface allowing fast annotation of a large number of images in a few steps. The graphical interface of the software, called Visual Object Labeller 1.3, is shown in Figure 3.

FIGURE 3. Graphical user interface of the Visual Object Labeller version 1.3 annotator with main menu-1, work window-2, file location-3, name and file type-4, selected file-5, total number of files-6, navigation between files-7, assigned class-8, list of labelled points-9.
Our version does not require the OpenCV library to be installed and uses only the standard graphics functions of the Java programming language. The basic steps for using Visual Object Labeller 1.3 to annotate images are presented in Figure 4. The text label of an area on the image has the form

<class> <X_YOLO> <Y_YOLO> <W_YOLO> <H_YOLO>, (6)

where <class> is an integer class number, which in the present work corresponds to the infested tree classes a, b, c, and d; <X_YOLO>, <Y_YOLO>, <W_YOLO>, and <H_YOLO> are the normalized center coordinates, width, and height of a labelled area, calculated as:

$X_{YOLO} = (X_{box} + W_{box}/2)/W_{image}$, (7)

$Y_{YOLO} = (Y_{box} + H_{box}/2)/H_{image}$, (8)

$W_{YOLO} = W_{box}/W_{image}$, (9)

$H_{YOLO} = H_{box}/H_{image}$, (10)

where $X_{box}$ and $Y_{box}$ are the coordinates of the upper-left corner of the rectangle, $W_{box}$ and $H_{box}$ are the width and height of the rectangle, and $W_{image}$ and $H_{image}$ are the width and height of the image.
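The conversion described above amounts to a few arithmetic operations; a minimal sketch (`to_yolo` is a hypothetical helper name, not part of the annotator):

```python
def to_yolo(x_box, y_box, w_box, h_box, w_image, h_image):
    """Convert a rectangle given by its upper-left corner and size
    in pixels to the normalized center-based YOLO label format."""
    x_yolo = (x_box + w_box / 2) / w_image  # normalized center x
    y_yolo = (y_box + h_box / 2) / h_image  # normalized center y
    return x_yolo, y_yolo, w_box / w_image, h_box / h_image

# a 200x100 px box at (100, 50) in a 400x200 px image
# is centered in the image, so all four values are 0.5
print(to_yolo(100, 50, 200, 100, 400, 200))  # (0.5, 0.5, 0.5, 0.5)
```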

D. YOLO ARCHITECTURES
Object detection is a fundamental computer vision task in which the algorithm analyzes the input image and outputs a label together with a bounding box that delimits where the object class is in the image [9]. Currently, YOLO models are considered among the most accurate deep CNN-based methods, since they offer an optimal speed-accuracy trade-off and work well even on small datasets in real time. In this work we used three YOLO architectures, namely YOLOv2 [12], YOLOv3 [13], and YOLOv4 [14], to compare them and to analyze the impact of image pre-processing on the detection. It should be noted that the first version of the YOLO architecture was not included in the comparison, since it contains a number of shortcomings and cannot reliably detect the small objects present in the images used in our experiment [11].
Most object detection algorithms process the image multiple times to be able to detect all the objects present. YOLO, in contrast, looks at the image only once: it applies a single forward pass to the whole image and predicts the bounding boxes and their class probabilities. The architecture consists of two major components: a feature extractor and a feature detector (multi-scale detector). The image is first given to the feature extractor, which extracts feature embeddings, and is then passed on to the feature detector part of the network, which produces the processed image with bounding boxes around the detected classes.
Comparing the selected architectures: YOLOv1, with its custom Darknet backbone, makes more localization errors but is much less likely to predict false positives when the searched objects are not present in the data; it outperforms other detection methods, including deformable part models (DPM) and Region-based CNN (R-CNN). YOLOv2, built on the Darknet-19 network, improves the detection results, but still has problems detecting small objects because down-sampling of the input image loses fine-grained features. YOLOv3, built on the deeper Darknet-53 network with ResNet-style residual connections, is somewhat slower than YOLOv2 due to its complexity, but gives more accurate results. The improved YOLOv4 architecture, in contrast to the previous version, works much faster without loss of detection quality.

E. EXPERIMENTAL SETUP
All models were trained and tested on an Ubuntu 16.04.6 LTS operating system with an NVIDIA GeForce GTX 1060 graphics processing unit (GPU) and the CUDA 10.1 parallel computing platform. The input images were cropped to 416×416 pixels to fit the input layer of the training model. We used a learning rate of 0.001 and set the maximum number of training iterations (max_batches) to 8000 for the four classes. The weight decay was 0.0005, and the standard augmentation settings of saturation 1.5 and exposure 1.5 were used during training.
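Under the Darknet framework, the settings above would correspond to a .cfg fragment of roughly the following form (a hedged sketch, not the exact configuration file used in the experiments; note that max_batches = classes × 2000 is the usual Darknet convention, which matches the 8000 iterations reported for four classes):

```ini
[net]
batch=64
subdivisions=16
width=416
height=416
learning_rate=0.001
decay=0.0005
saturation=1.5
exposure=1.5
max_batches=8000
```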
The loss function [11] for each of the YOLO models was calculated as follows:

$\mathcal{L} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2$, (11)

where $S^2$ is the number of grid cells of the output feature map, B is the number of bounding boxes for each grid cell, i indexes the i-th grid cell and j the j-th predicted box of that cell, $x_i$ and $y_i$ are the coordinates of the centroid of the anchor box, $w_i$ and $h_i$ are the width and height of the anchor box, $C_i$ is the objectness (a measure of the probability that an object exists in a proposed region of interest), $p_i(c)$ is the probability of the real box class, and $\hat{p}_i(c)$ is the probability of the predicted box class; hatted symbols denote the corresponding predicted values, $\mathbb{1}_{ij}^{obj}$ indicates that the j-th box predictor in cell i is responsible for an object, and $\lambda_{coord}$ and $\lambda_{noobj}$ are weighting coefficients. The training time was 3 hours for YOLOv2, 12 hours for YOLOv3, and 6 hours for YOLOv4 on the original images, and 4, 15, and 5 hours, respectively, on the pre-processed images.
For detection of infested spruce trees, we trained models on two datasets: without and with data augmentation. The trained models were independently tested on two test plots A and B.

F. EVALUATION METRICS
To evaluate the performance of the trained YOLO architectures in the task of detection of infested spruce trees, we used the mean average precision (mAP) and intersection over union (IoU) metrics. These are popular metrics for measuring the accuracy of object detectors. IoU is a simple scoring metric that requires the following data:
• the ground-truth labelled bounding boxes, which are created manually;
• the predicted bounding boxes from the trained model.
It is calculated as

$IoU = \dfrac{\text{area of overlap}}{\text{area of union}}$, (12)

where the overlap and union operations are applied to the ground-truth and corresponding predicted areas.
IoU measures the overlap between the boundaries of ground-truth labels and labels predicted by the trained model. If the detection is absolutely correct, the indicator is equal to 1; the lower the value of the IoU metric, the worse the prediction result. Usually, the threshold of this indicator is set to 0.5, meaning that if IoU > 0.5, the prediction is considered correct (true), and false otherwise. Based on this, indicators such as Precision (13) and Recall (14) are calculated:

$Precision = \dfrac{TP}{TP + FP}$, (13)

$Recall = \dfrac{TP}{TP + FN}$, (14)

where TP is a true positive prediction, FP is a false positive prediction, and FN is a false negative prediction.
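The IoU computation described above is straightforward for axis-aligned boxes; a minimal sketch for boxes in (x1, y1, x2, y2) form (`iou` is a hypothetical helper name):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # clamp to zero when the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1]) +
             (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union else 0.0
```

A detection whose IoU with a ground-truth box exceeds the 0.5 threshold counts as a true positive; otherwise it counts as a false positive.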
Precision determines the percentage of correctly recognized labels, and Recall shows how well true positives were predicted. From these criteria, the F1-score and mAP are calculated to evaluate the performance of a model. The F1-score is the harmonic mean of the precision and the recall (15), and mAP is the average precision (the area under the Precision-Recall curve) averaged over all classes (16) [14]:

$F1 = \dfrac{2 \cdot Precision \cdot Recall}{Precision + Recall}$, (15)

$mAP = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$, (16)

where N is the number of classes and $AP_i$ is the average precision of class i.
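Given the true positive, false positive, and false negative counts produced by matching detections to ground truth at the chosen IoU threshold, Precision, Recall, and the F1-score reduce to simple ratios (a minimal sketch with a hypothetical helper name):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1-score from detection match counts."""
    precision = tp / (tp + fp)            # fraction of detections that are correct
    recall = tp / (tp + fn)               # fraction of ground-truth objects found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 correct detections, 2 spurious ones, 2 missed trees
print(precision_recall_f1(8, 2, 2))  # (0.8, 0.8, 0.8)
```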

IV. EXPERIMENTAL RESULTS
In this section, we present the results of our experiments. The purpose of the experiments was to test and compare the YOLOv2 [12], YOLOv3 [13], and YOLOv4 [14] architectures on images captured from a UAV. The main task is detecting infested trees belonging to one of four classes: a-green-attack, b-yellow-attack, c-red-attack, and d-grey-attack. The models were trained on the original images and on the pre-processed images, with data augmentation, and tested on two test plots, A and B.
The results of training all YOLO architectures are presented in Table 2.
We used two pre-processed test plots, A and B, to test the trained YOLO architectures. The total number of trees in test plot A was 38, where 12, 23, 1, and 2 correspond to classes a, b, c, and d. The total number of trees in test plot B was 51, where 12, 34, 3, and 2 correspond to classes a, b, c, and d. Table 3 shows the results of the predictions of the YOLO models for each tree infestation class, presented next to the samples labelled by an expert (ground truth). It can be noted that classes a and b are best detected by the YOLOv4 architecture on both datasets. The best detection of class c, on data both with and without pre-processing, is provided by the YOLOv2 architecture. Class d is better detected on the original data by the YOLOv3 architecture, and on pre-processed data by the YOLOv3 and YOLOv4 architectures. In general, YOLOv4 showed the best result in detecting infested trees, with fewer false positives and more correct answers. A graphical presentation of the results of Table 3 is shown in Figure 5. Table 4 presents the performance results of testing the YOLO architectures.
The results presented in Table 4 show that the use of pre-processed images for training can increase the accuracy of each of the architectures, with the degree of improvement depending on the tree class. Contrast enhancement improves the detection of the red-attack (c) and grey-attack (d) classes for YOLOv2 and YOLOv3, and of the yellow- (b), red- (c), and grey-attack (d) classes for YOLOv4. In general, after applying contrast enhancement, the classification accuracy increased significantly for all considered architectures, and there were fewer false positives and false negatives.
It should be noted that processing the test areas took a different amount of time for each version of YOLO. For example, processing an image of 1280×720 pixels using CUDA 10.1 and the NVIDIA GeForce GTX 1060 GPU took about 22.22 milliseconds with the trained YOLOv2 architecture, while for the trained YOLOv3 and YOLOv4 architectures the same image took 31.25 milliseconds and 28.57 milliseconds, respectively.

V. CONCLUSIONS
According to the results of the experiments, for the task of detecting Norway spruce trees (Picea abies, (L.) Karst.) infested by the bark beetle and counting the detected specimens in images obtained from UAV cameras, it is preferable to use the YOLOv4 architecture trained on a dataset pre-processed by increasing the pixel contrast. It should be noted that the dataset composed of the pre-processed images yielded a very good mAP metric for all trained models: the test results on the pre-processed images improved, in comparison to the original images, by 65% for YOLOv2, by 7.22% for YOLOv3, and by 3.19% for YOLOv4. The image processing speed also shows that the YOLOv4 architecture can be used as an expert agent for analyzing local areas from UAVs in real time on medium-grade hardware (an NVIDIA GeForce GTX 1060 GPU).
We also note that a limitation of this study is the relatively small dataset; this is one reason why several versions of the YOLO architecture were evaluated. In addition, the considered models are intended for real-time use, and the implementation of the trained models on a UAV station is the subject of our future research.

ABBREVIATIONS
The following list of abbreviations was used in the manuscript: