
Flower Detection Using Object Analysis: New Ways to Quantify Plant Phenology in a Warming Tundra Biome



Abstract:

Rising temperatures caused by global warming are affecting the distributions of many plant and animal species across the world. This can lead to structural changes in entire ecosystems, and serious, persistent environmental consequences. However, many of these changes occur in vast and poorly accessible biomes and involve myriad species. As a consequence, conventional methods of measurement and data analysis are resource-intensive, restricted in scope, and in some cases, intractable for measuring species changes in remote areas. In this article, we introduce a method for detecting flowers of tundra plant species in large data sets obtained by aerial drones, making it possible to understand ecological change at scale, in remote areas. We focus on the sedge species E. vaginatum that is dominant at the investigated tundra field site in the Canadian Arctic. Our system is a modified version of the Faster R-CNN architecture capable of real-world plant phenology analysis. Our model outperforms experienced human annotators in both detection and counting, recording much higher recall and a comparable level of precision, regardless of variations in image quality caused by changing weather conditions during data collection. (K. Stanski, GitHub - karoleks4/flower-detection: Flower detection using object analysis: New ways to quantify plant phenology in a warming tundra biome. GitHub. Accessed: Sep. 17, 2021. [Online]. Available: https://github.com/karoleks4/flower-detection.)
Page(s): 9287 - 9296
Date of Publication: 08 September 2021

SECTION I.

Introduction

The Arctic is warming more rapidly than any other biome on the planet, experiencing an average temperature increase of more than 2°C since 1950 [1]. Its average temperature is predicted to rise by a further 6°C–10°C within the next 100 years [1]. This warming leads to a longer growing season [2], which recent studies estimate will lengthen by approximately 4.7 days per decade [3]. However, vegetation change research is constrained by the logistics of in situ field observations [4]. These standard techniques are extremely costly and cannot be scaled to cover large areas [5]. Therefore, the exact influence of warming on tundra plant communities remains uncertain.

Phenology is the study of plant and animal life cycle events, including flower and leaf emergence and decay [6]. The timing of plant phenology can be influenced by changes in a variety of factors, including temperature [5], [7]. Thus, phenological records are valuable for studying the influence of climate change across the world's biomes [6]. The typical way that ecologists have gathered phenological data is to observe phenological changes on-site in localized plots (e.g., 1 m² patches or along short transects) [8]. Unfortunately, on-site observations are extremely time-consuming and highly difficult in less accessible areas, including areas of particular importance for understanding ecological and climate change [9]. However, rapidly developing technology allows for new data collection approaches, including proximal remote sensing using drones [10]. The use of unmanned aerial vehicles (UAVs) is a cost-effective method of conducting detailed analysis with high spatial and temporal resolution while avoiding destructive sampling of sensitive ecosystems [11].

Plant phenology captured using high spatial resolution drone imagery comes with methodological challenges such as variable light conditions and complex backgrounds [12]. Initial attempts at robust and accurate data analysis included template matching [13], geographic object-based image analysis (GEOBIA) [14], regression analysis [15], and Markov point processes [16], none of which yielded accurate enough results to draw meaningful ecological conclusions [17]. A more advanced method, utilizing maximally stable extremal regions (MSER) from drone imagery [18], has been used for turtle and seabird counting with limited success [18], [19].

Recent advances in deep-learning-based models, including convolutional neural networks (CNNs), present new opportunities for efficient and accurate image analysis. Models based on these architectures learn features that tend to be more informative than handcrafted features [20] such as scale-invariant feature transform (SIFT) or histogram of oriented gradients (HOG) [21], and achieve higher image classification accuracies than their predecessors. Moreover, accompanied by hardware advances (i.e., GPUs), some models are now capable of real-time detection [22].

In this article, we propose a fully automated and efficient method for plant flower detection and counting from high-resolution drone imagery utilizing recently developed deep-learning techniques. Such a tool allows us to quantify the effects of climate change on a tundra biome and other flowering ecosystems, complementing and potentially eliminating the need for on-site measurements. We focus on the sedge species Eriophorum vaginatum (E. vaginatum), which was the most abundant flowering plant within the investigated area (Qikiqtaruk - Herschel Island; 69°N, 139°W) [23].

The main contributions of our article are as follows.

  1. Detection model

    Our model yields better than human-level performance in detecting E. vaginatum flowers in Arctic tundra, and can easily be extended to detect other objects. The following modifications make it possible to detect smaller objects than the original faster R-CNN [24]:

    1. parametric ReLU (PReLU) activation unit to alleviate the issue of vanishing gradients [25];

    2. shallow feature extractor to boost small object detection performance;

    3. contextual path to reduce false positive detections by considering the information surrounding an object.

  2. Dataset

    We have created a dataset of 2592 manually annotated images containing nearly 50 000 E. vaginatum flower objects. As a result, our database is a valuable resource for future studies regarding the phenology of E. vaginatum species as well as tundra biomes in general.

  3. Novel evaluation process

    We introduce a comprehensive evaluation process for our method assessing its performance against human annotators in both object detection and counting.

SECTION II.

Related Work

Object detection is a fundamental problem in analyzing remote sensing imagery. Recent advances in detection methods, based on CNNs, have led to dramatic improvements in detection accuracy relative to earlier methods that rely on handcrafted features. State-of-the-art architectures include two-stage region proposal based CNNs (R-CNNs), such as faster R-CNN [24] or feature pyramid network (FPN) [26], which achieve very high accuracy at the cost of real-time performance, and more direct single-step approaches like you-only-look-once (YOLO) [22] which are often capable of real-time detection but have slightly lower accuracy [27].

Most previous work has focused on improving detection accuracy for objects that occupy a sizeable area of an image, based on standard datasets such as Pascal VOC, where instances take up 14% of the image on average. However, some increasingly popular applications, including analysis of remote sensing imagery, have led to demand for detectors that can identify distant and small objects, requiring architectural improvements. In addition to frequently involving small or low-resolution objects, remote sensing imagery often includes noisy backgrounds and variable lighting and weather conditions, compounding the challenges in creating high-performance detection systems [17].

One approach to improving the accuracy of small object detection involves using multiple-scale features of increasing complexity. For instance, single shot multibox detector (SSD) [28] predicts objects at each feature level whereas some fully convolutional networks (FCN) [29] combine multiple predictions by averaging segmentation probabilities. Furthermore, higher level fine-grained features provide vital contextual information surrounding an object by increasing the network's effective receptive field to disambiguate an instance from a noisy background (see Fig. 1). Recent studies provide many examples of incorporating context through feature fusion using simple concatenation [12], [30], or element-wise addition operations on the extracted feature maps which greatly improve network performance [31].

Fig. 1.

Examples of false positives generated by the model before the addition of the contextual path. Green and red circles denote effective receptive fields with and without the addition of the contextual path.

Despite the introduction of various techniques to improve small object detection, most studies concerning object detection from remote sensing imagery have focused on analyzing urban scenes, and vehicles in particular. Examples include ships [32] or aircraft [12], [33] as well as buildings such as airports [34]. Few efforts have been made to apply recent advancements to quantify ecological events such as phenological stages. Some attempts involving flower objects and UAV imagery focused mainly on segmentation rather than counting [35]. Other examples include automatic counting of rice seedlings [36] or oil palm tree detection and counting [37]. However, these methods are unable to detect overlapping objects and are neither efficient nor robust due to the fixed-size sliding window approach they employ. Another example of applying object analysis in ecology is camera trap detection of wildlife, where studies often utilize state-of-the-art model architectures and achieve high levels of accuracy [38]–[40]. Most recently, deep learning models have been used to detect and count insects, although from a much lower height [41]. Therefore, to analyze the extensive ecological imagery collected throughout the years, including observations of various plant and animal species at long ranges, it is vital to have efficient, reliable, and robust systems [17].

SECTION III.

Dataset

The original remote sensing imagery was collected from four different sites across Qikiqtaruk - Herschel Island in the Canadian Arctic. The images were gathered between June 2017 and August 2017 using Phantom 4 Advanced Pro drone platforms. The flights were conducted in variable weather (e.g., wind, cloud cover, mist, etc.) and at different times of the day under variable lighting conditions, giving a wide range of image qualities and appearances of the E. vaginatum flowers. Each site was surveyed from four different altitudes, ranging from 12 to 100 m, yielding high-resolution imagery of size 5320 × 4200 pixels. Table I summarizes the details of the original dataset.

TABLE I Specifications of the Original Drone Imagery and Our Final Dataset; PS1-PS4 Denote Phenology Sites at Different Parts of Herschel Island

We divided the images into 440 × 440 pixel tiles to simplify the annotation process for human experts and reduce the memory consumption when training our network. At this stage, we considered only data collected from the 12 m altitude due to its high resolution and the visibility of the objects for the human annotators. From this subset, we extracted a uniform random set of 2592 tiles, each including a 20 pixel overlap with adjacent tiles to avoid object truncation and to allow lossless reconstruction of the original, larger images. The overlap was also crucial for providing context about the surroundings of instances that could otherwise be missed or incorrectly classified as flowers.
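To make this preprocessing step concrete, the following minimal Python sketch splits a large drone image into 440 × 440 tiles with a 20 pixel overlap. The function name, stride arithmetic, and edge-clamping behavior are illustrative assumptions rather than the exact code used to build the dataset.

    import numpy as np

    def tile_image(image, tile=440, overlap=20):
        """Yield (row, col, tile_array) for overlapping tiles of an H x W x C image."""
        stride = tile - overlap  # a 420 pixel step preserves a 20 pixel overlap
        h, w = image.shape[:2]
        for y in range(0, max(h - overlap, 1), stride):
            for x in range(0, max(w - overlap, 1), stride):
                # Clamp the window so that edge tiles stay inside the image.
                y0, x0 = min(y, h - tile), min(x, w - tile)
                yield y0, x0, image[y0:y0 + tile, x0:x0 + tile]

    # Example: a 5320 x 4200 pixel image yields roughly 13 x 10 = 130 tiles.
    tiles = list(tile_image(np.zeros((4200, 5320, 3), dtype=np.uint8)))

Reconstructing an original image is then a matter of placing each tile back at its recorded offset, which is what makes the overlap-based reconstruction lossless.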

The ground-truth was generated through an annotation procedure involving nine human experts, mainly Ph.D. and Masters students, who were present on the sites or who were carefully instructed on specific E. vaginatum flower characteristics. Data annotation was split into two parts, each lasting from September to December of 2017 and 2018, respectively. To achieve the best possible quality of the ground-truth, each tile was annotated by multiple experts, with differences resolved by majority vote and average bounding box generation. Fig. 2 shows an example of an annotated tile. The total number of annotated objects in the dataset reached 50 521 flowers, indicating the scale of the task. Furthermore, the dataset itself represents a valuable resource for studies of Arctic tundra phenology by providing accurate locations and population estimates of the E. vaginatum species.

Fig. 2.

Example tile from our dataset. The green bounding boxes denote the ground-truth annotated by human experts.

Finally, we split the data into three randomly selected, nonoverlapping datasets. The training set was by far the largest, accounting for 66.7% (1728 tiles) of the annotated images. The remaining tiles were evenly divided into the validation and test sets, each representing 16.7% (432 tiles) of the original dataset. These sets were used to determine the best-performing hyperparameter setup and to evaluate the network, respectively. We also evaluated our model by comparing its performance directly with the detection and counting accuracy of the human experts (see Sections V-D and V-E).

SECTION IV.

Method

The original faster R-CNN architecture is capable of accurate detection of objects that occupy a sizeable area of an image, such as animals or vehicles in the foreground of a photograph. However, remote sensing imagery poses the additional challenges of much smaller object sizes and lower object resolution. In particular, a single E. vaginatum flower occupies only about 0.1% of the whole image area, compared with an average of 14% for instances in the Pascal VOC dataset [42].

Previous attempts to adjust faster R-CNN to various small object detection tasks included anchor box size adjustments [42] and multiscale feature fusion using concatenation [12] or element-wise addition [31]. Here, we modify the faster R-CNN architecture for small object detection, specifically for plant phenology analysis from remote sensing imagery. The detection pipeline, shown in Fig. 3, consists of a backbone feature extractor, a region proposal network (RPN), and the final fast R-CNN detector.

Fig. 3.

Architecture diagram of our modified faster R-CNN pipeline for E. vaginatum flower detection from remote sensing imagery.

The first stage of the pipeline involves feature extraction from the entire input image performed by the backbone network which we describe in more detail in Sections IV-A and IV-C. This fully convolutional network produces a set of features that is shared by the remaining two components making the model a unified framework.

The second stage is region proposal generation by the RPN. This small, class-agnostic network was the most prominent improvement of faster R-CNN over its predecessors, which relied heavily on less efficient methods such as selective search to generate a predefined number of proposals most likely to contain objects [43]. Because the RPN shares the feature extractor with the rest of the detection network, this nearly cost-free solution significantly improved the model's efficiency.

The original implementation of this component involved three anchor sizes (128^2, 256^2, and 512^2 pixels) at three different ratios (1:2, 1:1, 2:1). However, as suggested by [42], such large sizes are unsuitable for detecting smaller objects, which can be enclosed by a smaller box. Thus, we decreased the anchor sizes to 12^2, 14^2, 16^2, and 21^2 pixels and reduced the ratios to a single 1:1 ratio, reflecting the fact that E. vaginatum flowers can be reliably enclosed by a square bounding box. The number of proposals generated for each training and testing image was set to 2000 and 300, respectively, following previous work [24].
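To illustrate the modified region proposal configuration, the short sketch below generates the four square anchors at a single feature-map location; the function name and the (x1, y1, x2, y2) box format are assumptions made for clarity, not the authors' implementation.

    import numpy as np

    ANCHOR_SIDES = (12, 14, 16, 21)  # pixels; single 1:1 aspect ratio

    def anchors_at(cx, cy):
        """Return (x1, y1, x2, y2) anchor boxes centred on one feature-map location."""
        boxes = []
        for side in ANCHOR_SIDES:
            half = side / 2.0
            boxes.append((cx - half, cy - half, cx + half, cy + half))
        return np.array(boxes)

    # For example, the four anchors centred at image coordinate (220, 220).
    print(anchors_at(220.0, 220.0))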

The final step in our pipeline is the fast R-CNN detector, which utilizes the image features extracted by the base network along with the proposals generated by the RPN. The proposals are processed by the detector's region of interest (RoI) pooling layer to produce a fixed-size feature vector, followed by a set of fully connected layers. The primary purpose of the detector is further classification and bounding box refinement to produce the final detections. For this step, our methods follow those described in [43]. We used the same loss function as the original faster R-CNN architecture, consisting of classification and regression components (i.e., a multitask loss), with the latter utilizing the smooth-L1 loss [24].
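For reference, the regression term of this multitask loss uses the standard smooth-L1 formulation from fast R-CNN; the sketch below restates that textbook definition in Python and is not taken from the article's implementation.

    import numpy as np

    def smooth_l1(pred, target):
        """Smooth-L1: 0.5*x^2 for |x| < 1 and |x| - 0.5 otherwise, where x = pred - target."""
        diff = np.abs(pred - target)
        per_coordinate = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
        return float(per_coordinate.sum())

    # Regression loss for one predicted box against its ground-truth offsets.
    print(smooth_l1(np.array([0.2, -0.1, 1.5, 0.0]), np.zeros(4)))  # 1.025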

A. Shallow Feature Extractor

We reduced the overall depth of the feature extractor compared with the original VGG-16 backbone network, making it much more appropriate for small object detection [44]. Using fewer blocks results in a more suitable receptive field and reduced risk of object characteristics being lost during pooling operations [12]. Shallower convolutional layers extract coarser low-level features which are more appropriate for detecting simpler shapes of E. vaginatum flowers. Moreover, feature extractors based on the VGG architecture and its variations yield promising results over other alternatives in tasks involving small object detection [45]. Therefore, our baseline network contains three blocks, each consisting of two to three convolutional layers complemented by activation units and followed by a max-pooling layer (see Table II).

TABLE II Architecture of the Feature Extractor Blocks

Due to the specificity of our task and dataset characteristics compared with standard datasets, we did not utilize pretrained VGG-16 layers. Instead, we trained our network from scratch. That is because the objects of the target domain ought to be of comparable shape and size to the objects on which the network was pretrained [46]. Hence, a network designed for a new domain is unlikely to benefit from a set of parameters obtained by training on a completely unrelated dataset [47].
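A minimal PyTorch sketch of the three-block extractor described above is shown below. The per-block channel counts are assumptions (Table II is not reproduced here); only the use of two to three 3 × 3 convolutions per block, PReLU activations (see Section IV-B), and a max-pooling layer per block follows the text.

    import torch.nn as nn

    def conv_block(in_ch, out_ch, n_convs):
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                       nn.PReLU(out_ch)]  # one learnable slope per channel
        layers.append(nn.MaxPool2d(2))    # halve the spatial resolution
        return nn.Sequential(*layers)

    backbone = nn.Sequential(
        conv_block(3, 32, 2),    # block-1: coarse, low-level features
        conv_block(32, 64, 2),   # block-2
        conv_block(64, 128, 3),  # block-3: deepest block, feeds the context path
    )

The feature fusion path described in Section IV-C then combines outputs of the shallower and deeper blocks to produce the final 128-channel feature maps.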

B. PReLU

Despite a wide variety of available activation functions, ReLU has been the most widely used among state-of-the-art architectures, including the original VGG-16 feature extractor within faster R-CNN. ReLU activation functions are computationally efficient due to simple thresholding at zero [see (1)], which greatly accelerates the network's convergence; in some cases six times faster than sigmoid or tanh functions [48]. However, since negative inputs and their gradients are all set to zero, those units eventually stop responding to variations in error/input and die, making that segment of the network passive [49]. This phenomenon can significantly limit the ability of the network to properly learn from the data [25]
\begin{equation*} relu(y_i) = {\begin{cases}y_i, & \text{if } y_i > 0 \\ 0, & \text{if } y_i \leq 0. \end{cases}} \tag{1} \end{equation*}

To alleviate this issue, many alternatives, including leaky ReLU, introduce a leakage parameter \alpha_i to the horizontal part of the ReLU graph. However, a constant leakage parameter has only a marginal impact on network performance when compared with an equivalent architecture using ReLU [50]. Thus, we followed the idea of a leakage parameter and incorporated PReLU, which progressively learns such a parameter \alpha_i for each input channel [see (2)], yielding higher accuracy at a marginal extra computational cost [25]
\begin{equation*} prelu(y_i) = {\begin{cases}y_i, & \text{if } y_i > 0 \\ \alpha_i y_i, & \text{if } y_i \leq 0. \end{cases}} \tag{2} \end{equation*}

PReLU introduces a small number of learnable parameters, equal to the total number of channels, which does not slow the training process significantly [25]. Due to the per-channel parameter adaptation, PReLU eliminates the dying ReLU problem and, owing to its randomness, reduces the risk of overfitting, especially in deeper architectures [51]. Given these advantages over the standard ReLU activation function, we experimented with PReLU and found that it improved average precision and F_1 score by 4% and 1%, respectively, despite our network being relatively shallow. To our knowledge, this is the first work demonstrating PReLU's performance advantages when used within the faster R-CNN framework.
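As a small numerical illustration of (2), the snippet below applies PReLU with one learnable slope per channel; the array shapes and slope values are arbitrary.

    import numpy as np

    def prelu(y, alpha):
        """y has shape (channels, H, W); alpha holds one learnable slope per channel."""
        return np.where(y > 0, y, alpha[:, None, None] * y)

    y = np.array([[[1.0, -2.0]], [[-0.5, 3.0]]])  # two channels, 1 x 2 each
    alpha = np.array([0.1, 0.25])                 # values learned during training
    print(prelu(y, alpha))                        # [[[1.0, -0.2]], [[-0.125, 3.0]]]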

C. Feature Fusion (Context)

Reducing the depth of the feature extractor by decreasing the number of convolutional blocks and pooling layers can bring significant performance improvements when detecting smaller objects [12]. Our first version of the model consisted of only two such blocks with an effective receptive field of 14 pixels, delivering a satisfactory F_1 score of 0.74. However, this architecture was highly susceptible to background noise, including light reflections and field markers, yielding a high number of false-positive detections.

To tackle this issue, we incorporated context information from the pixels surrounding each object instance by adding an extra convolutional block with a bottom-up path for feature fusion. This way, we extended the effective receptive field to as much as 40 pixels with no information loss, because the coarser features from the shallower block are included in our final set of features. The effective receptive field of 40 pixels is a result of the operations applied in each convolutional block, with output fields of 6, 14, and 40 pixels for block-1 to block-3, respectively, when applied in sequence. Similar solutions have been utilized by other faster R-CNN implementations regardless of the feature extraction network type (e.g., VGG-16, ResNet), boosting the detection accuracy of smaller objects [12], [30], [31].

The purpose of the additional feature block is to allow the model to extract more complex features. Those features need to be rescaled in order to perform the fusion with the feature set from the shallower block. We used a 1 × 1 convolution and a 2 × 2 transposed convolution to compress the channel dimension and adjust the height/width dimensions, respectively. We performed feature fusion using an element-wise addition layer, which simply adds the inputs channel by channel.

After merging, the fused feature maps were passed through another convolutional layer with a 3 × 3 kernel followed by a PReLU activation to mitigate the spatial aliasing effect of downsampling [31], producing the final output of the feature extractor. Hence, the feature extractor component produced 128 feature maps of size 220 × 220, which were then passed to the remaining two components of our faster R-CNN pipeline.
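This fusion path can be sketched as the following PyTorch module. The channel counts of the shallower and deeper inputs are assumptions; only the 1 × 1 convolution, 2 × 2 transposed convolution, element-wise addition, and final 3 × 3 convolution with PReLU follow the description above.

    import torch
    import torch.nn as nn

    class ContextFusion(nn.Module):
        def __init__(self, shallow_ch=64, deep_ch=128, out_ch=128):
            super().__init__()
            self.compress = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)            # match channels
            self.upsample = nn.ConvTranspose2d(shallow_ch, shallow_ch, 2, stride=2)  # match H x W
            self.smooth = nn.Sequential(nn.Conv2d(shallow_ch, out_ch, 3, padding=1),
                                        nn.PReLU(out_ch))                            # reduce aliasing

        def forward(self, shallow, deep):
            fused = shallow + self.upsample(self.compress(deep))  # element-wise addition
            return self.smooth(fused)

    # For example, fusing assumed 64 x 220 x 220 and 128 x 110 x 110 maps gives a 128 x 220 x 220 output.
    out = ContextFusion()(torch.zeros(1, 64, 220, 220), torch.zeros(1, 128, 110, 110))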

SECTION V.

Results and Evaluation

Throughout this research, we have tested the original faster R-CNN with VGG-16, VGG-19, and ResNet feature extractors, along with the previous iterations of the faster R-CNN meta-architecture, namely R-CNN [52] and fast R-CNN [43]. However, none of these models proved to be suitable for the task of E. vaginatum flower detection and model-to-model comparison, as each achieved a final F_1 score below 0.5. Nevertheless, our evaluation procedure includes other points of reference, such as comparison against human experts and counts collected on the ground, to deliver a thorough assessment of the model performance. All the experiments and network training were conducted on a machine with an NVIDIA Titan V GPU and 32 GB of memory.

A. Parameter Setting

Our setup included the four-stage alternate training procedure described in [24]. The specific parameters used for each training stage are presented in Table III. The learning rate decay parameter was set to 0.1. The chosen optimizer was mini-batch gradient descent with a momentum parameter of 0.9 and weight decay set to 0.0005. Each mini-batch included one image and 256 proposals per image in detector training. The weights were randomly initialized using a zero-mean Gaussian distribution with a standard deviation of 0.01. Furthermore, we normalized each input by subtracting from each color channel the mean channel value determined from the training set.

TABLE III Parameter Setup for Each Training Stage
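A hedged PyTorch sketch of this optimization setup is given below; the base learning rate, decay step, placeholder model, and channel means are illustrative, while the momentum, weight decay, decay factor, and Gaussian initialization follow the values stated above.

    import torch
    import torch.nn as nn

    def init_weights(module):
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(module.weight, mean=0.0, std=0.01)  # zero-mean Gaussian, std 0.01
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    model = nn.Conv2d(3, 8, 3)  # placeholder for the detection network
    model.apply(init_weights)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,  # stage-specific rate (Table III)
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # decay by 0.1

    channel_means = torch.tensor([0.45, 0.47, 0.40]).view(1, 3, 1, 1)  # computed from the training set in practice
    batch = torch.rand(1, 3, 440, 440) - channel_means                 # per-channel mean subtraction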

B. Evaluation Metrics

Our evaluation procedure was based on metrics widely used within the object detection community: precision, recall, F-measure, and average precision (AP) [20]. Furthermore, we used the percentage ratio between the number of model detections and the number of ground-truth objects. This metric was vital for establishing our network's ability to track patterns within a species population.

The correctness of each detection is determined by the intersection over union (IoU) with the ground-truth bounding box being at least 0.5. The number of correct detections is denoted as true positives (TP) whereas the incorrect ones as false positives (FP). The instances which are not detected by the network are defined as false negatives (FN). Thus, precision and recall are defined as follows:
\begin{align*} \text{precision} &= \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{3}\\ \text{recall} &= \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}. \tag{4} \end{align*}
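The following sketch shows how such IoU-based matching can produce the TP, FP, and FN counts behind (3) and (4); the greedy matching order and the (x1, y1, x2, y2) box format are assumptions rather than the exact evaluation code.

    def iou(a, b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def precision_recall(detections, ground_truth, thr=0.5):
        matched, tp = set(), 0
        for det in detections:  # ideally sorted by descending confidence
            best = max(range(len(ground_truth)),
                       key=lambda i: iou(det, ground_truth[i]), default=None)
            if best is not None and best not in matched and iou(det, ground_truth[best]) >= thr:
                matched.add(best)
                tp += 1
        fp, fn = len(detections) - tp, len(ground_truth) - len(matched)
        return tp / max(tp + fp, 1), tp / max(tp + fn, 1)

    print(precision_recall([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]))  # (0.5, 1.0)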

Average precision is a single metric capable of expressing the complex relationship between precision and recall. Mean average precision (mAP) denotes the mean of the APs over all considered classes. Since our task involves only one class (E. vaginatum), AP is equivalent to mAP. AP denotes the area under the precision-recall curve, where the integral representing the average precision is approximated by a finite sum over every position k in the ranked sequence of detected objects
\begin{equation*} \text{AP} = \int_{0}^{1} p(r)\, dr \approx \sum_{k=1}^{n} P(k)\, \Delta r(k). \tag{5} \end{equation*}

The F-measure is another metric capable of expressing the relationship between the precision and recall scored by a model. We used the F_1 score, which weighs precision and recall equally. The F_1 score is defined as
\begin{equation*} F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}. \tag{6} \end{equation*}
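A short sketch of the AP approximation in (5) and the F_1 score in (6) is given below, computed from a confidence-ranked list of detections already flagged as true or false positives; the toy ranked list and scores are purely illustrative.

    def average_precision(ranked_tp_flags, n_ground_truth):
        tp = fp = 0
        ap, prev_recall = 0.0, 0.0
        for is_tp in ranked_tp_flags:  # detections in descending confidence order
            tp, fp = tp + is_tp, fp + (not is_tp)
            precision = tp / (tp + fp)
            recall = tp / n_ground_truth
            ap += precision * (recall - prev_recall)  # P(k) * delta r(k)
            prev_recall = recall
        return ap

    def f1_score(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(average_precision([True, True, False, True], n_ground_truth=4))  # 0.6875
    print(f1_score(0.9, 0.8))                                              # ~0.847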

C. Standard Metrics Evaluation

To ensure that our results were representative of the actual model performance, we followed standard validation and testing procedures. The validation set was used to determine the optimal hyperparameter setup, whereas the model's final performance was established on the test set. The results are presented in Table IV with respect to the different evaluation metrics.

TABLE IV Final Model Performance on the Test Set

The original version of faster R-CNN struggled with detecting small objects, recording values below 0.3 across all the considered metrics. In comparison, our version of the network recorded precision and recall values of 0.81 and 0.97, respectively. The significantly higher recall can be attributed to the model's tendency to detect more flowers than indicated by the ground-truth (i.e., over-counting by 20% on average), which also implies a noticeable number of false positives. Nevertheless, the recorded values of precision and recall demonstrate a balanced performance of the network, summarized by the F_1 score of 0.87. Our network yielded an average precision of 0.91, which we regard as high when compared with other results for similar remote sensing tasks [31], [53].

Furthermore, in the inference phase, our faster R-CNN can process each 440 × 440 tile in 0.05 s on average while running on a single GPU. This corresponds to the processing speed of just over 7 s per single remote sensing image (5320 × 4200 pixels) and introduces a negligible total cost to the overall data collection and processing pipeline.

D. Network-Human Evaluation

Due to the uniqueness of the task and the goal to achieve a human-like performance of our method, we compared our faster R-CNN with the performance of human experts in E. vaginatum detection and counting. We formed a testing set using 15 randomly selected tiles from the validation set and asked six independent human experts (A-F) who were involved in the dataset annotation to repeat the process. The same set was processed by our model. The results are presented in Table V.

TABLE V Comparison of Human and Network Performance Best scores indicated in bold

Despite the small size of our corpus relative to the very large corpora in standard image detection tasks, our results present consistent patterns. Humans recorded significantly smaller numbers of detected objects compared with the model, as indicated by their low recall values. This suggests that human annotators struggled to notice certain objects due to their very small area and background noise (see Fig. 4). An alternative explanation could be that humans are generally more reluctant to annotate objects, but we did not further investigate the specific causes of human recall failures. Humans also tended to record fewer false positives (10% of their outputted detections on average compared with nearly 15% for the model), explaining their higher precision scores. Human annotators were also inconsistent in the number of flowers they detected. This finding underscores the complexity of the task and the need for a reliable automatic method, setting aside the costs of obtaining human annotations.

Fig. 4.

Example of flower detections made by different subjects (b), (c) when compared to the ground-truth annotations (a). Red bounding boxes denote the detections.

Fig. 5.

Example tiles of varying brightness due to changing weather conditions. The red and green boxes denote model detections and ground-truth, respectively.

Despite the network's precision of 0.87 being far from the best human annotator's score of 0.98, the network's score was mid-range, with two annotators recording much lower values of 0.82. Nevertheless, the human annotators’ extremely low recall scores prevented them from achieving accurate counts, with nearly half of the ground-truth flowers being missed on average. Thus, our model was decidedly closer to the expected number of flower objects present within each tile, recording an impressive 0.93 of the expected number of objects with the highest human's score being only 0.74, with most human annotators not exceeding 0.60.

Based on the results, our method outperformed all human annotators, with much higher F_1 and AP scores. The poorer performance of human annotators may be attributed to their inability to notice very small flower objects or possibly frustration caused by the tedious nature of the task. These factors did not limit our modified faster R-CNN, making it more suitable and reliable for the analysis of phenology data. Furthermore, faster R-CNN is highly scalable and capable of covering vast areas, which would otherwise require hundreds or thousands of trained annotators.

E. Ground Counts

During the data acquisition, researchers collected flower counts on the ground within 2 m × 2 m areas around each investigated site. These counts were gathered to determine the number of flowers present in each area, unaffected by image quality, noise, or distortions, unlike the annotation process. Since these data were obtained in person on the ground, the counts are the closest to the true ground-truth, making them an appropriate means of assessing our method's performance. We only considered the number of detections (counts), as specific flower locations were not recorded.

The sample consisted of 12 images of the investigated sections from all sites. This set was presented to three human experts (H-I) who performed manual counting. The same imagery was processed by our system. The results are presented in Table VI.

TABLE VI Comparison of Human and Network Performance Against the Ground Counts Best scores indicated in bold

The results follow our previous observations regarding inconsistent human performance compared with the model, despite the humans having prior experience with the task. Their counts varied greatly between images, which is summarized by the higher value of standard deviation compared to 0.11 for our method. These results are highly encouraging given that the individuals who took part in the evaluation process were experts in the biodiversity field. The annotators had extensive experience in studying, counting, and analyzing habitats of various plant species. They had analyzed similar data before and knew exactly what E. vaginatum flowers look like. They were also familiar with their habitat as well as their growing patterns (i.e., often in clusters). Thus, we expected human annotators to be an appropriate benchmark of the model performance. The humans' underperformance can be attributed to severe image distortions such as light reflections or blur, as well as the very small object size relative to the tile dimensions.

Furthermore, our network detected a marginally higher number of objects on average (1.06) than was counted on the ground. This surplus of detections matched our previous testing results, although the overcount was not as pronounced (1.20 on the test set), possibly due to a much smaller image sample size. Despite yielding too many objects, the model's result was consistently closer to the true value than the best human score of only 0.83. Once again, humans detected less than 0.85 of the total number of flower objects, most likely due to the small visible flower area and partial obstruction by grass and other obstacles. On the other hand, small object sizes and background noise did not prevent the network from delivering more accurate counts, due to its ability to extract relevant multilevel features and consult contextual information around each instance. This conclusion is supported by the fact that an early iteration of the network, which did not include the contextual path, performed noticeably worse, particularly in the case of the most ambiguous-looking flowers. That version reported a high number of false-positive detections, especially in the presence of light reflections or white field markers, as shown in Fig. 1.

F. Population Tracking

With our faster R-CNN capable of reliable object detection and counting, we tested it in a potential real-world setting. To do this, we selected imagery of the four sites from two different days with variable weather conditions to evaluate the robustness of our method to changing lighting and image quality. The results are presented in Table VII.

TABLE VII E. Vaginatum Flower Population Estimation

Our faster R-CNN detected a very similar number of flowers on both days, regardless of the site. Real-world flower counts were expected to be similar due to the 48-hour difference in data collection. Such consistency indicates the network's potential reliability, even at a much larger scale than previously considered and under different lighting and wind conditions (see Fig. 5). Furthermore, we did not observe any extreme variation between the estimates among different sites. According to the results in Table VII, the variation was estimated at ±5%, partially due to noise (i.e., lighting) and occlusion (i.e., branches or grass covering flowers). Nevertheless, the true population size of E. vaginatum flowers is likely to be marginally lower than presented in Table VII, due to our method's tendency to detect a slightly higher number of objects, as shown by our previous results (a ratio of 1.06).

SECTION VI.

Conclusion

In this article, we introduced a modified version of the faster R-CNN architecture capable of detecting and counting flowers of the species E. vaginatum. Our major modifications of the feature extractor component included reduced depth and the utilization of context information through feature fusion, along with the incorporation of a PReLU activation unit. These adaptations yielded promising results in the testing phase, which were further confirmed by the network consistently outperforming human experts at both detection and counting tasks, despite varying image quality due to differing weather conditions during data collection. Furthermore, our method did not suffer from random light reflections or noisy backgrounds as much as human subjects did, as indicated by its significantly higher recall scores.

Although we were unable to assess the accuracy of the flower count estimates for the E. vaginatum species within the investigated area, our results, which consistently exceed 20 000 objects per site, demonstrate how time-consuming the task of manual counting would be. Other conventional methods, which track the numbers within a much smaller region and assume a similar distribution for the rest of the area, might lead to only rough and most likely inaccurate estimates, because they do not account for abnormalities caused by varying terrain characteristics or distribution patterns. Our faster R-CNN can cover a much broader area with a high density of flowers in just over 5 min per site (44 images on average). Such scale could only be matched by dedicating a vast number of human annotators to the task, at great expense, and would lead to less accurate counts.

As a future improvement, we find the idea of utilizing multispectral datasets in addition to the conventional RGB optical bands particularly interesting. Adding bands of different wavelengths, such as near-infrared (NIR), has proved beneficial in low visibility conditions in other object detection tasks [54]. Additional bands would help the model avoid, or significantly reduce, the number of nonvegetation false positives by applying the normalized difference vegetation index (NDVI) [55]. Ultimately, all spectral bands could be utilized, letting the model decide which bands are the most significant for detecting the specific plant species [56].

Global change impacts necessitate new tools to capture ecological responses across the world's biomes [9]. Our work indicates the great potential of faster R-CNN models for image analysis as reliable tools in plant phenology research. Our method is likely to generalize well to different flower species as well as other kinds of plant phenology or ecological data, given a thoroughly annotated and sufficiently large image set. Thus, future phenology research can extend localized on-site measurements to landscape scales by combining drone-based data collection with automated flower detection systems.

ACKNOWLEDGMENT

The authors would like to thank NVIDIA Corporation for the donation of the Titan V GPU used for this research through the NVIDIA GPU Programme. Data collection was funded by the NERC ShrubTundra project (NE/M016323/1). Special thanks are also due to all the Team Shrub members including J. Kerby, A. Cunliffe, G. Daskalova, H. Thomas, S. Angers-Blonin, M. García Criado, J. Boyle, A. Bjorkman, E. Walker, S. Kellerhals, and I. Rich for dedicating their own time and efforts to collect the data and annotate the imagery for this project. The authors thank Y. Parks for supporting the field data collection and the Inuvialuit People for the opportunity to conduct research on their land.
