A Performance-Optimized Deep Learning-Based Plant Disease Detection Approach for Horticultural Crops of New Zealand

Deep learning-based plant disease detection has gained significant attention from the scientific community. However, various aspects of real horticultural conditions have not yet been explored. For example, diseases should be considered not only on leaves, but also on other parts of plants, including stems, canes, and fruits. Furthermore, the detection of multiple diseases in a single plant organ at a time has not been performed. Similarly, plant diseases have not been identified across various crops in the complex horticultural environment with a single optimized/modified model. To address these research gaps, this research presents a dataset named NZDLPlantDisease-v1, consisting of diseases in five of the most important horticultural crops in New Zealand: kiwifruit, apple, pear, avocado, and grapevine. An optimized version of the best-performing deep learning (DL) model, the region-based fully convolutional network (RFCN), has been proposed to detect plant disease using the newly generated dataset. After finding the most suitable DL model, the data augmentation techniques were successively evaluated. Subsequently, the effects of image resizers with interpolators, weight initializers, batch normalization, and DL optimizers were studied. Finally, performance was enhanced by empirical observation of position-sensitive score maps and anchor box specifications. Furthermore, the robustness/practicality of the proposed approach was demonstrated using a stratified k-fold cross-validation technique and testing on an external dataset. The final mean average precision of the RFCN model was found to be 93.80%, which was 19.33% better than the default settings. Therefore, this research could be a benchmark step for any follow-up research on automatic control of disease in several plant species.


I. INTRODUCTION
According to the latest Fresh Facts report by Plant and Food Research, a New Zealand (NZ) research and development organization, the horticultural industry achieved a record export of over NZ$6.6 billion by June 2020 [1]. The most prominent fresh fruits were kiwifruit, apples, and avocados, with export values of NZ$2.5, NZ$0.9, and NZ$0.1 billion, respectively, followed by New Zealand wine at NZ$1.9 billion [1]. Furthermore, the largest crop area, estimated at 39,935 ha, belongs to grapevines, whereas 12,905 ha of kiwifruit and 10,750 ha of apples, pears, and nashi have been reported. Based on these statistics, horticultural crops have a great impact on New Zealand's economy. Hence, addressing the problems associated with horticultural crops could further strengthen the export value of the horticultural sector.
(The associate editor coordinating the review of this manuscript and approving it for publication was Turgay Celik.)
Among several real field problems, plant diseases affect crop yield and quality [2] and cause economic losses [3]. The precise detection of disease is an important step in reducing its spread to neighboring plants, applying appropriate disease control treatments, and improving crop productivity. In this regard, this research is dedicated to the accurate identification of plant diseases by deep learning (DL) in the most valuable crops of NZ, including kiwifruit, apple, pear, avocado, and grapevine. Furthermore, several research gaps related to the dataset, real horticultural conditions, and deep learning-based plant disease recognition have been addressed in this study.
Deep learning, a subset of machine learning, has been reported in the literature as a successful technique for recognizing plant diseases. A recent review summarized and compared various pre-processing steps (image resizing, data augmentation, normalization and standardization, data annotation, and others), datasets, convolutional neural networks, training techniques, deep learning frameworks, and optimization algorithms [4]. Another review article presented various modified DL models, plant disease detection tasks, and the problems and challenges of DL-based plant disease classification [5]. For instance, the significance of recent solutions for small datasets was presented, such as transfer learning, few-shot learning, and one-shot learning. Furthermore, the early plant disease detection problem was explained with the use of hyperspectral imaging (HSI).
In the early stages of research in DL-based plant disease identification, the major focus remained on classification tasks. For example, [6] presented early work in the domain of DL-based recognition and classification of plant disease using two well-known DL models, namely AlexNet and GoogLeNet. Similarly, the classification of plant diseases was presented in various articles through transfer learning and fine-tuning methods [7], [8]. These articles showed the importance of using the latest training techniques.
In the next stage, the research community focused on the dataset size, particularly small datasets, as it played a significant role in the performance of the DL models. A novel data augmentation technique to classify disease in cassava leaves, tested on a modified MobileNet model, was presented in [9]. Another study [10] presented a generative adversarial network (GAN) for classifying disease in the PlantVillage dataset [11], which contains 38 classes of healthy/diseased leaves for 14 plant species. Yet another article discussed a GAN-based model to classify tomato leaf disease [12]. These articles formed a basis for data augmentation in plant disease detection. However, they only studied the performance of the DL models on single datasets, leaving questions about their performance on other datasets containing diseases in different crops.
Modification of well-known DL models is another area of research that has seen continuous focus for a long time. A modified CenterNet model with DenseNet-77 was proposed by [13] to identify plant disease, while a MobileNet was modified for the classification of plant disease by [14].
A study showing the effectiveness of deep learning optimizers was presented in [15]. Then, a study focused on the plant disease identification task, which contains both classification and localization in a single framework, using the same PlantVillage dataset [16]. Although this research presented an improvement in the accuracy of plant disease detection and classification tasks, its major limitation was that the deep learning technology was analyzed on a controlled-environment dataset.
Some of the studies also focused on datasets collected in a real agricultural environment. For example, an article presented tomato disease detection in real agricultural conditions using three DL object detectors: Faster Region-based Convolutional Neural Network (RCNN), Single-shot MultiBox Detector (SSD), and Region-based Fully Convolutional Network (RFCN) [17]. Various real-world scenarios were considered. However, an external dataset could also have been tested to validate the research. A study presented a Convolutional Neural Network (CNN) named SoyNet to classify disease in soybean leaves after segmenting the images of leaves [18]. This study presented variations in the parameters of the DL model, such as dropout, pooling operations, and inclusion of activation functions. These adjustments were found to be successful in improving the performance of the model. Moreover, the usefulness of the proposed method was compared with other techniques. An article presented the classification of cardamom plant diseases using the EfficientNet-V2 model [19]. This study did not provide training profiles/plots. A multilayer convolutional neural network was presented for the classification of disease on mango leaves [20]. Although this article presented the significance of the DL model compared with other machine learning-based techniques, better effectiveness should have been shown by comparing it with DL models as well. In recent research, an improved version of the Xception model was proposed for the identification of peach diseases [21]. The novelty of this work was shown by comparing the proposed method with well-known models. However, other modified versions of the state-of-the-art DL models could be used for the analysis. Tomato diseases were detected using a modified version of the you only look once (YOLO) model [22]. It was observed that the training performance of the models was presented with limited information.
Another study presented a DL-based method for the detection of tomato diseases divided into target and control classes [23]. This research proposed a new way of performing the plant disease detection task that can open various opportunities for future research. An improved region proposal network was proposed for the detection of northern maize leaf blight [24]. A few studies have also proposed real-time detection of plant disease. A DL model was presented for the identification of tomato disease [25]. Similarly, disease detection on grape leaves was performed using a DL architecture based on Faster region-based convolutional neural network (R-CNN) [26].
After rapid advancement and research on deep learning-based plant disease identification, there are still important research gaps and considerable room for further developments to investigate the practical aspects of horticultural fields. The current literature has mainly restricted the plant disease detection task to plant leaves. Moreover, the available datasets emphasize the presence of a single disease at a time in a plant leaf. Furthermore, recent studies have shown high accuracy on the PlantVillage dataset (which contains defective leaves of 14 plant species) [16], but none of the articles have demonstrated the significance of a single deep learning model for different crops in real agricultural conditions. This is important to consider, as each crop could have different background elements. Therefore, the robustness of the DL model should be analyzed for that case.
This research addresses several research questions related to the capability of DL to address various complex agricultural problems. The first question is whether deep learning can perform plant disease detection with the same trained/optimized/modified model for three problems at a time: (a) identification of diseases in several organs of plants, (b) presence of multiple diseases in a plant organ, and (c) recognition of diseases in various crops considering variations in their environments/background elements. Connected to the first question, can a DL architecture correctly distinguish symptomatically identical diseases in different crops? The final question is how well the attained accuracy of the DL-based method can be validated for the problems highlighted earlier.
To answer these questions, this article presents a deep learning-based performance optimization approach. First, the dataset images were collected from various New Zealand farms and horticultural fields. The dataset contains 20 classes of healthy and defective leaves, fruits, and stems/canes of five different crops. Then, various image resizing techniques, batch normalization, and weight initialization/optimization techniques were applied. These techniques have not yet been explored for the detection of plant disease. Furthermore, the main novelty of the RFCN model (the most suitable DL architecture obtained after comparing several models) was analyzed. In this regard, the position-sensitive score maps were empirically evaluated, and the anchor boxes were modified, to obtain high accuracy for the identification of all healthy and disease classes.
This research also addresses some of the research gaps outlined in recent articles, such as the validity of deep-learning-based plant disease identification. Moreover, data augmentation has been applied after dividing the data into training, validation, and testing sub-datasets, to avoid biased results; otherwise, similar images could appear across the sub-datasets. Furthermore, this study provides new insights into DL-based plant disease detection, rather than giving redundant discussions using an excessively explored dataset like PlantVillage [27].
The key contributions of this research are:
1) A new dataset of plant diseases, named NZDLPlantDisease-v1, has been proposed for the most important horticultural crops in New Zealand.
2) Detection of disease has been performed in multiple plant organs for five different crops.
3) The presence and detection of multiple diseases on a single plant organ have been addressed.
4) The effects of data augmentation techniques have been studied by dividing them into various categories rather than considering all conventional methods together.
5) A comprehensive deep learning-based plant disease detection pipeline has been presented. In this regard, various steps have been explored prior to suggesting any modification to the state-of-the-art DL models.
6) The confusion/false-positive results in symptomatically similar diseases (occurring in different crops) have been addressed. An in-depth analysis of the best-obtained DL model, the region-based fully convolutional network (RFCN), has been performed through position-sensitive score maps and anchor boxes.
7) The proposed approach has been validated using a stratified k-fold cross-validation technique and an external testing dataset.

A. PROPOSED APPROACH
The proposed methodology consists of various practical considerations related to the presence of plant diseases in a real horticultural environment. A comprehensive deep learning-based optimization approach has been proposed. The presented methodology has successfully solved three identified agricultural problems: the detection of disease in multiple plant organs, the identification of disease in different crops, and the presence of multiple diseases in a plant organ at a time. These problems have been solved by the different techniques presented in the sub-sections. The idea was to improve the average precision of each class: the results from each step were evaluated, and the respective problems were highlighted and addressed in the next step. First, the research questions were outlined to begin collecting images of the dataset. Next, well-known DL architectures were compared, and the two best deep learning (DL) models, which attained the highest mean average precision, were obtained. Then, the data augmentation techniques were applied category-wise, including color change (brightness, contrast, and sharpness), the inclusion of noise with variation in color, rotational and translational changes, and finally, the combination of all categories, including the original images. The next step was the performance optimization of the DL model using various techniques. In this regard, the effects of image resizers and interpolators were analyzed; this step investigated different input images for DL-based plant disease detection. Then, different DL weight initializers were tested. Subsequently, batch normalization was applied to cope with the internal covariate shift. Then, DL optimizers were leveraged to optimize the weights of the deep learning model. This led to a further improvement in the performance of classes that had achieved low average precision (AP), which was accomplished by empirically analyzing the novelty of the best-obtained model.
The final step was the modification of the DL model by empirically tuning its anchor box scales and aspect ratios. In case of unsatisfactory results, the feature extractor/classification model had to be modified. The final results were validated using a stratified k-fold cross-validation technique and a test dataset generated from various online/open-source images. The overall methodology of this study is presented in Fig. 1.

B. NZDLPLANTDISEASE-V1 DATASET
1) OVERVIEW AND GENERAL INFORMATION OF THE PROPOSED DATASET
The proposed dataset has several properties of real agricultural fields that have not been presented in previous open-source datasets. A comprehensive overview of several datasets along with the new/proposed dataset for this research is presented in Table 1. Further details of the important features of the presented dataset are explained in the following subsections.
The proposed dataset, named NZDLPlantDisease-v1, contains plant diseases in five different crops grown in New Zealand: kiwifruit, apple, pear, avocado, and grapevine. The images were acquired using a Samsung Galaxy S10+ smartphone: 12 MP f/1.5-2.4 (wide), 12 MP f/2.4 (telephoto), and 16 MP f/2.2 (ultrawide). Several local horticultural fields were visited in Auckland and Palmerston North, New Zealand. The images were taken at a working distance of 200-300 mm.

2) PRACTICAL CONSIDERATIONS
The dataset was collected between December 2020 and May 2021. The abrupt change in New Zealand's weather was considered a positive aspect of the dataset generation because it helped obtain diversity in the dataset via variations in illumination and environmental conditions. Furthermore, dataset images were captured in the presence and absence of shadows to include real horticultural conditions. Several examples of these practical considerations are shown in Fig. 2.

3) MULTI-DISEASE AND MULTI-ORGAN DATASET IMAGES
One of the research gaps addressed in this article is the detection of disease in various organs/parts of plants. Therefore, healthy and disease classes are considered for the leaves, stems, and fruits of apple and pear. However, the dataset classes for avocado and kiwifruit consist only of leaves. The images for the grapevine were taken only for healthy and diseased canes, because the grapevine season had ended at the time of dataset collection. In this research, the presence of multiple classes of disease in plant organs has also been addressed. For example, black spot, mosaic virus, and glomerella leaf spot (or two of them) were present on some apple leaves at one time. Similarly, algal leaf spot and branch canker were present on avocado leaves at the same time. Samples of multiple-disease problems are presented in Fig. 3.

4) ANNOTATIONS OF HEALTHY AND DISEASE CLASSES
The number of images from each class ranged from 60 to 318, as presented in Table 2. The NZDLPlantDisease-v1 dataset was divided into three sub-datasets: training (70%), validation (20%), and testing (10%). The dataset images were annotated by using an open-source tool called LabelImg. The bounding box coordinates were stored in XML format, converted into CSV, and finally, to the TF records [16]. The common/scientific names of each class along with the number of images (without augmentation) are shown in Table 2. An example of each annotated healthy and disease class is presented in Fig. 4.
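The annotation pipeline above (LabelImg XML to CSV rows, prior to TFRecord conversion) can be sketched as follows. This is a minimal illustration, not the authors' actual script: the field layout follows the standard LabelImg/Pascal-VOC output, and `voc_xml_to_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def voc_xml_to_rows(xml_string):
    """Parse one LabelImg (Pascal VOC) XML annotation into CSV-ready rows:
    (filename, width, height, class, xmin, ymin, xmax, ymax)."""
    root = ET.fromstring(xml_string)
    filename = root.findtext("filename")
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append((filename, width, height, obj.findtext("name"),
                     int(box.findtext("xmin")), int(box.findtext("ymin")),
                     int(box.findtext("xmax")), int(box.findtext("ymax"))))
    return rows
```

Each returned row corresponds to one bounding box; an image with several annotated boxes therefore yields several CSV rows, which matches the per-box format the TFRecord converter expects.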

5) DATA AUGMENTATION TECHNIQUES
When collecting the images for the dataset, some of them were taken in groups; cropping those images increased the size of the dataset. Furthermore, several data augmentation techniques were applied, such as a 30% increase and decrease in brightness, contrast, and sharpness [25]. Moreover, two types of noise were injected into the training images to study their effects and increase the variability in the dataset. In this regard, Gaussian and Laplacian noise were added using the freely available software XnViewMP. The random intensity was set to 2.0 and 10.0 at a maximum scale of 10.0 and 50.0, respectively. In addition, rotational/translational changes were also considered, including 90°, −90°, and 180° rotations, as well as horizontal and vertical flips. An example of augmented images for kiwifruit bacterial canker is shown in Fig. 5.
The data augmentation techniques are grouped into five categories to thoroughly understand their effects on the performance of the DL model. These categories are only original (OO), original and change in translation/rotation (OT), original and color change (OC) (brightness, contrast, and sharpness), original with an injection of noise (Gaussian and Laplacian) and color change simultaneously (OCN), and finally a combination of all (OTCN).
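The color and rotation augmentations described above can be sketched in a few lines. This is a minimal, dependency-free illustration on a grayscale image stored as a list of rows; the function names are hypothetical, and the paper applied the actual transformations with image-processing tools (e.g. XnViewMP) rather than code like this.

```python
def adjust_brightness(img, factor):
    """Scale pixel intensities by `factor` (1.3 = +30%, 0.7 = -30%),
    clipping to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def rotate_90(img):
    """Rotate an image (list of rows) 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment_category_OT(img):
    """OT category sketch: original plus 90, 180, and 270 (= -90) rotations."""
    r90 = rotate_90(img)
    r180 = rotate_90(r90)
    r270 = rotate_90(r180)
    return [img, r90, r180, r270]
```

The OC/OCN/OTCN categories would compose the corresponding transforms in the same way; crucially, as noted in Section A, augmentation is applied only after the train/validation/test split.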

C. DEEP LEARNING FRAMEWORK, HARDWARE SPECIFICATIONS, AND PERFORMANCE METRICS
All experiments are performed using the TensorFlow object-detection API. The DL models are trained using the transfer learning technique with pre-trained weights from the COCO dataset. An NVIDIA GeForce GTX 1080 Ti graphics processing unit (GPU) is used with the following specifications: 11 GB memory, 1582 MHz boost clock, 3584 CUDA cores, and 484 GB/s memory bandwidth. The cuDNN library is imported to accelerate training.
The performance of the DL models is evaluated through the training and validation profiles in terms of various classification and localization losses. This helped to gain insight into the models through the box classifier loss, region proposal network (RPN) loss, and total loss. Furthermore, the testing performance is presented using the mean average precision (mAP), which is a commonly used performance metric for object detection tasks [16].
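The mAP metric mentioned above builds on two ingredients: intersection-over-union (IoU) matching between predicted and ground-truth boxes, and per-class average precision (AP) over confidence-ranked detections. A minimal sketch, not the exact evaluator used by the TensorFlow API, follows; `average_precision` uses the simple "precision at each true positive" approximation rather than the interpolated PASCAL variant.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def average_precision(is_tp, num_gt):
    """AP for one class. `is_tp` lists detections sorted by descending
    confidence (True = matched a ground truth at IoU >= threshold);
    `num_gt` is the number of ground-truth boxes for the class."""
    tp, ap = 0, 0.0
    for rank, hit in enumerate(is_tp, start=1):
        if hit:
            tp += 1
            ap += tp / rank      # precision at this recall step
    return ap / num_gt if num_gt else 0.0
```

The mAP reported in the paper is then the mean of the per-class AP values over all 20 healthy/disease classes.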
The complexity of the DL models is presented by training and detection time, architectural differences, and the number of parameters, as shown in Table 3.

E. ARCHITECTURAL OPTIMIZATION OF THE RFCN MODEL
1) FUNDAMENTALS OF RFCN
Following the proposed methodology, the RFCN is selected as the best DL model for the detection of plant diseases. The main idea of this DL model is to address the tension between translational invariance (identifying a particular object at different pixel positions) and translational variance (identifying the exact location of the object) using position-sensitive score maps. The RFCN is a two-stage DL architecture. First, the input image is passed through a feature extraction layer using a convolutional neural network (CNN) to generate feature maps. These maps are applied to a convolutional layer to generate region-of-interest (ROI) proposals. In the earlier Faster R-CNN model, the ROI proposals were used to crop the corresponding region of the feature map and extract features to differentiate a particular class. In the RFCN, however, another convolutional layer is used to generate position-sensitive score maps. These maps split each ROI into k × k bins, where each bin votes for the class to which the object belongs. Therefore, the main idea is to consider the characteristics of an object divided into a k × k grid of regions instead of as a whole. Both the RFCN and Faster R-CNN models extract ROI proposals in the same way, but technical and computational differences arise because Faster R-CNN applies a fully connected (FC) layer to each ROI proposal. In contrast, the RFCN generates only the shared score maps, and each ROI is used only to vote over regions of the score maps. Hence, the overall training and testing times of the RFCN network are much shorter than those of the Faster R-CNN model [35]. This difference can also be observed in Fig. 6.
For a w × h rectangular ROI, each bin is of size (w/k) × (h/k). For one slice of an anchor, the (i, j)-th bin spans $\lfloor i\,w/k \rfloor \le x < \lceil (i+1)\,w/k \rceil$ and $\lfloor j\,h/k \rfloor \le y < \lceil (j+1)\,h/k \rceil$. The pooled response $r_c(i,j)$ on the (i, j)-th bin for class c is the sum of all pixels within that bin coming from the corresponding position-sensitive score map, divided by the number of pixels n, as the layer before the softmax function is an average-pooling layer:

$$r_c(i,j \mid \theta) = \frac{1}{n} \sum_{(x,y)\,\in\,\mathrm{bin}(i,j)} z_{i,j,c}(x + x_0,\, y + y_0 \mid \theta),$$

where $\theta$ represents all learnable parameters, $z_{i,j,c}$ is one score map out of the $k^2(C+1)$ score maps, and $(x_0, y_0)$ is the top-left corner of the ROI. Finally, a vote is taken by averaging the $k^2$ bin responses (or taking their maximum) to obtain the position-sensitive score for class c,

$$r_c(\theta) = \frac{1}{k^2} \sum_{i,j} r_c(i,j \mid \theta),$$

which is passed to the softmax function to predict the class.
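The bin-wise pooling and voting described above can be expressed compactly. The following is a minimal single-class sketch (function name and data layout are assumptions): `score_maps` holds the k² position-sensitive maps for one class, each covering the whole image, and each bin averages only its own map before the bins vote by averaging.

```python
def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive ROI pooling for one class.
    `score_maps`: list of k*k 2-D maps (row-major: index j*k + i for bin (i, j)).
    `roi`: (x0, y0, w, h) in map coordinates.
    Returns the voted score: mean over bins of the mean activation per bin."""
    x0, y0, w, h = roi
    votes = []
    for i in range(k):              # bin column index
        for j in range(k):          # bin row index
            xs = range(x0 + (i * w) // k, x0 + ((i + 1) * w) // k)
            ys = range(y0 + (j * h) // k, y0 + ((j + 1) * h) // k)
            m = score_maps[j * k + i]   # the (i, j)-th map votes for its own bin
            vals = [m[y][x] for y in ys for x in xs]
            votes.append(sum(vals) / len(vals))
    return sum(votes) / len(votes)
```

In the real RFCN there are k²(C+1) maps (one set per class plus background), and the voted scores for all classes are fed to the softmax; this sketch shows one class only.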

2) EMPIRICAL OBSERVATIONS ON THE POSITION-SENSITIVE SCORE MAPS
The default architectural settings of the RFCN attained satisfactory outcomes for most of the plant disease classes. However, pear scab could not be detected and showed false-positive identifications: it was confused with apple black spot, owing to the similarity in the symptoms of the two diseases. These results motivated us to empirically investigate the main novelty of the RFCN model. In this regard, the spatial bin width and height of the position-sensitive score maps are tuned/analyzed in this research. The main purpose of this step is to improve the average precision (AP) of pear scab while maintaining the high AP of the other classes.

3) PERFORMANCE ENHANCEMENT THROUGH MODIFIED ANCHOR BOXES
The final step is the improvement of classes that achieved low AP (less than 80%). These classes include apple black rot, apple black spot, apple European canker, and pear healthy (leaves) classes. In this regard, this study explored the enhancement of anchor boxes in two steps: adjustment of scale sizes and aspect ratios. Here, the scale size is gradually modified, whereas the aspect ratios are reduced/enhanced in both a step-by-step (1:2, 1:3, 1:4, and so on) and reciprocal fashion (1:2, 2:1; 1:3, 3:1; etc.). The final output attained a high AP for all healthy and diseased classes. The following steps are taken to obtain a modified or enhanced version of the anchor boxes.
• First, the default anchor box scale sizes are evaluated; afterward, smaller/larger scale sizes are added to the defaults to understand their effects on model performance.
• After obtaining the best combination of anchor box scales, the aspect ratios are added and enhanced to obtain further refinement in the detection of plant diseases. From the default aspect ratios, the reciprocal ratios such as 1:3, 3:1, and 1:4, 4:1 are applied.
• Subsequently, an empirical adjustment of the aspect ratio is performed, and a gradual reduction/enhancement of the aspect ratio is proposed to improve the AP of several classes. The combined effect of reciprocal and gradual changes in the aspect ratio is also studied.
• Finally, the training/validation profiles and testing outcomes are compared between the proposed modifications and the default settings.
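The scale/aspect-ratio enumeration underlying the steps above can be sketched as follows. This is a minimal illustration of how anchor (width, height) pairs are generated from scales and aspect ratios; the function name and base size are assumptions, and the TensorFlow API's grid anchor generator additionally tiles these shapes over feature-map locations.

```python
def generate_anchors(base_size, scales, aspect_ratios):
    """Enumerate anchor (width, height) pairs. An aspect ratio r means
    width / height = r, and the anchor area (base_size * scale)^2 is
    preserved for every ratio at a given scale."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in aspect_ratios:
            h = (area / r) ** 0.5
            w = h * r
            anchors.append((round(w), round(h)))
    return anchors
```

Adding a reciprocal pair such as 1:3 and 3:1 therefore yields tall and wide anchors of equal area, which is what helps boxes fit elongated organs such as canes and stems.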

F. IMAGE RESIZERS AND INTERPOLATORS
After obtaining the best combination of the data augmentation technique and DL architecture, the effects of image resizers on the model performance are studied. Aspect ratio and fixed shape resizers are used along with four types of image interpolators: bilinear, bicubic, area, and nearest neighbor.
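How an aspect-ratio resizer chooses its output size under the minimum/maximum dimension rule (600/1000 for the two-stage models in this study) can be sketched as follows; the helper name is hypothetical, and interpolation (bilinear, bicubic, area, nearest neighbor) then determines how the pixels are resampled to that size.

```python
def keep_aspect_ratio_dims(width, height, min_dim=600, max_dim=1000):
    """Compute the output size of an aspect-ratio-preserving resizer:
    scale so the shorter side reaches `min_dim`, but cap the scale so
    the longer side does not exceed `max_dim`."""
    scale = min_dim / min(width, height)
    if max(width, height) * scale > max_dim:
        scale = max_dim / max(width, height)
    return round(width * scale), round(height * scale)
```

A fixed-shape resizer, by contrast, simply forces every image to a preset size (e.g. 300 × 300 for the SSD models), distorting the aspect ratio.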

G. WEIGHTS INITIALIZERS
Three weight initialization techniques are compared to optimize the performance of the best-suited DL architecture. By default, a truncated normal initializer is used, which helps avoid dead neurons caused by the ReLU. Then, variance scaling is applied, which is beneficial for balancing the variance of the output with that of the input layers [40]. The last initializer is a random normal initializer, which creates tensors from a normal distribution.

H. BATCH NORMALIZATION
To accelerate training, batch normalization (BN) is used in this research. This technique addresses the internal covariate shift caused by the variation in the input distribution of each layer of the neural network as the parameters of the previous layers change [41]. The mini-batch mean ($\mu_\phi$) for a mini-batch ($\phi$) of N instances, the mini-batch variance ($\sigma^2_\phi$), and the normalized value are evaluated for each row of the input matrix ($x_i$) by:

$$\mu_\phi = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \sigma^2_\phi = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_\phi)^2, \qquad \hat{x}_i = \frac{x_i - \mu_\phi}{\sqrt{\sigma^2_\phi + \varepsilon}},$$

where $\varepsilon$ is added for numerical stability. Each component of $\hat{x}_i$ now has zero mean and unit variance, although the hidden units should have different distributions. Therefore, the normalization scheme learns the distribution through scaling ($\gamma$) and shifting ($\beta$) parameters and evaluates the output of batch normalization as:

$$y_i = \gamma \hat{x}_i + \beta.$$
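The forward pass of batch normalization for a single feature can be sketched directly from the equations above; the function name is a placeholder, and training-time running averages and the backward pass are omitted.

```python
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature over a mini-batch: subtract the
    mini-batch mean, divide by the mini-batch standard deviation
    (with eps for stability), then scale by gamma and shift by beta."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in xs]
```

With the default gamma = 1 and beta = 0, the outputs have (approximately) zero mean and unit variance; the learned gamma and beta then restore whatever distribution suits each hidden unit.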

I. DEEP LEARNING OPTIMIZERS AND SELECTION OF HYPERPARAMETERS
In this study, three DL optimizers are used. Stochastic gradient descent (SGD) with momentum [42] is applied as the default optimization algorithm; later, root mean square propagation (RMSProp) [43] and adaptive moment estimation (Adam) [44] are used to optimize the weights. A brief overview of the DL optimizers is given as follows:

1) SGD WITH MOMENTUM
Stochastic gradient descent (SGD) is the most commonly used optimization algorithm for neural networks. The momentum version of SGD converges much faster than the original SGD optimizer. Exponentially weighted averages ($V_{dW}$ and $V_{db}$) of the gradients are evaluated and used to update the weights ($W$) and biases ($b$). The algorithm uses the following equations:

$$V_{dW} = \beta V_{dW} + (1-\beta)\,dW, \qquad V_{db} = \beta V_{db} + (1-\beta)\,db,$$
$$W = W - lr \cdot V_{dW}, \qquad b = b - lr \cdot V_{db},$$

where $\beta$, $lr$, $dW$, and $db$ represent the momentum, learning rate, and gradients of the weights and biases, respectively.
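One scalar update of SGD with momentum, following the equations above, can be sketched as follows (the function name is a placeholder):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update for a single parameter:
    v is the exponentially weighted average of past gradients."""
    v = beta * v + (1 - beta) * grad
    return w - lr * v, v
```

Iterating this step on a simple quadratic such as f(w) = (w − 3)² drives w toward the minimum at 3, illustrating the faster, smoothed convergence the momentum term provides.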

2) RMSProp
This DL optimizer allows a large learning rate to be selected. It works on the idea of keeping a moving average of the squared gradient and dividing the gradient by the square root of this mean square, using the following equations:

$$S_{dW} = \beta S_{dW} + (1-\beta)\,dW^2, \qquad S_{db} = \beta S_{db} + (1-\beta)\,db^2,$$
$$W = W - lr \cdot \frac{dW}{\sqrt{S_{dW}} + \varepsilon}, \qquad b = b - lr \cdot \frac{db}{\sqrt{S_{db}} + \varepsilon},$$

where $\varepsilon$ is used for numerical stability in the denominator.
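A scalar RMSProp update following the equations above can be sketched as (the function name is a placeholder):

```python
def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update for a single parameter: s is the moving
    average of the squared gradient; the step divides the gradient
    by its root mean square."""
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (s ** 0.5 + eps), s
```

Because the effective step is roughly lr regardless of the raw gradient magnitude, a comparatively large learning rate can be used safely, which is the property noted above.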

3) ADAM
The Adam optimizer combines RMSProp and SGD with momentum. Like RMSProp, Adam uses squared gradients to scale the learning rate; like SGD with momentum, it uses a moving average of the gradients. As it has an adaptive learning rate, it calculates a separate learning rate for each parameter. Adam maintains estimates of the first moment (mean) and second moment (uncentered variance) of the gradient, which are used to adapt the learning rate for each weight of the DL model/neural network, where the n-th moment is the expected value of a variable raised to the power of n. The first-moment ($m_{dW}$, $m_{db}$) and second-moment ($v_{dW}$, $v_{db}$) estimates are evaluated by:

$$m_{dW} = \beta_1 m_{dW} + (1-\beta_1)\,dW, \qquad m_{db} = \beta_1 m_{db} + (1-\beta_1)\,db,$$
$$v_{dW} = \beta_2 v_{dW} + (1-\beta_2)\,dW^2, \qquad v_{db} = \beta_2 v_{db} + (1-\beta_2)\,db^2.$$

After bias correction ($\hat{m} = m/(1-\beta_1^t)$, $\hat{v} = v/(1-\beta_2^t)$ at step t), the weights and biases are updated by:

$$W = W - lr \cdot \frac{\hat{m}_{dW}}{\sqrt{\hat{v}_{dW}} + \varepsilon}, \qquad b = b - lr \cdot \frac{\hat{m}_{db}}{\sqrt{\hat{v}_{db}} + \varepsilon},$$

where $\varepsilon$ is equal to $10^{-8}$.
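A scalar Adam update, including the bias correction of both moment estimates, can be sketched as follows (the function name is a placeholder; t is the 1-based step counter):

```python
def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter: m and v are the first-
    and second-moment estimates, bias-corrected by 1 - beta^t."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)           # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)           # bias-corrected uncentered variance
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v
```

On the very first step the bias correction makes the update magnitude approximately lr, regardless of the gradient scale, which is why Adam behaves predictably from the start of training.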

4) SELECTION OF HYPERPARAMETERS
The hyperparameter values are selected using the random search method [45] and are presented in Table 4. For example, the learning rate (lr) of the SGD optimizer used to train the RFCN model was tuned exponentially from 10^−5 to 10^−1, while the momentum (mom) was tuned in steps of 0.1. Hyperparameter tuning started with an lr of 10^−1 and zero mom, for which the RFCN did not achieve training convergence. The lr was then gradually reduced and the mom increased.
The training of the RFCN model then started to settle down. For example, at an lr of 10^−3 and a mom of 0.8, the mAP was 61.60% with a total training loss of around 0.41%. A further reduction in the lr positively influenced the performance of the RFCN: at a learning rate of 10^−4 and a mom of 0.8, the training loss was reduced to 0.23% with an mAP of 73.256%. However, a further reduction in the lr (to 10^−5) significantly increased the training time. Therefore, small random changes were made to the lr and mom, and the performance of the RFCN model was checked at various values. It was found that at an lr of 3 × 10^−4 and a mom of 0.9, the loss was reduced to around 0.09% and the mAP improved significantly to 74.47%.
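The random search over (lr, mom) described above can be sketched as follows. This is a minimal illustration under stated assumptions: lr is sampled log-uniformly over 10^−5 to 10^−1, momentum on a 0.1 grid, and `evaluate` stands in for the expensive train-and-score step (the actual study trained the RFCN for each trial).

```python
import random

def sample_hyperparameters(rng=random):
    """Sample one (lr, momentum) configuration: lr log-uniform over
    1e-5..1e-1, momentum on a 0.1 grid from 0.0 to 0.9."""
    lr = 10 ** rng.uniform(-5, -1)
    momentum = rng.choice([round(0.1 * i, 1) for i in range(10)])
    return lr, momentum

def random_search(evaluate, trials=20, rng=random):
    """Return the sampled configuration with the highest score."""
    return max((sample_hyperparameters(rng) for _ in range(trials)),
               key=lambda cfg: evaluate(*cfg))
```

Sampling the learning rate on a log scale is what makes "tuned exponentially" work: each decade from 10^−5 to 10^−1 receives equal probability mass.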

1) STRATIFIED FIVE-FOLD CROSS-VALIDATION
The proposed DL-based approach has been validated using two techniques. First, owing to the class imbalance caused by the different number of images per class (see Table 2), a stratified cross-validation method is used. This method retains the proportion of data points/samples of each class in each fold [46], ensuring an unbiased distribution of the dataset among all folds. Otherwise, random sampling could introduce bias into the folds when all dataset images are randomly shuffled and split into a certain number of folds.
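The stratified assignment can be sketched as follows; a minimal illustration (the function name is a placeholder) in which samples of each class are dealt round-robin across the k folds, so every fold receives an approximately equal share of every class.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so that each class is spread
    as evenly as possible across the folds (stratified k-fold)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)   # round-robin within each class
    return folds
```

With plain (non-stratified) shuffling, a rare class (some classes here have as few as 60 images) could end up concentrated in one fold; the round-robin per class prevents exactly that.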

2) TESTING ON AN EXTERNAL DATASET
Another contribution of this study is the validation of the final results using an external test dataset (obtained through a random search on various websites). This was done to show that the presented DL-based approach would also be effective and robust under environmental conditions different from those used for dataset generation.

III. RESULTS AND DISCUSSIONS
After the dataset generation, the proposed approach is divided into several steps to obtain the optimized DL model for plant disease detection. The results presented in this section follow the methodology of the research, as shown in Fig. 1. First, a comparison of the DL architectures was performed to obtain the top two models. The training and validation plots are presented (to understand the performance of several DL models), along with the detection results (to evaluate the mAP). The best two DL models were trained with all data augmentation methods to understand their effects. Later, the effects of image resizing techniques and interpolators are provided in terms of training and validation losses and mAP. These methods evaluated the impact of the input image on the DL model. Afterward, performance optimization is explained through weight initializers, batch normalization, and DL optimizers. The effects of various parameters of the weight initializers are also provided. Similarly, the performance of the best-obtained DL model was evaluated in the presence of batch normalization, to show the improved convergence ability of the DL model. DL optimizers are also compared to optimize the weights of the best-obtained DL architecture. After the optimization of the DL model, a further in-depth class-wise analysis has been performed. In this regard, the performance of the individual classes is evaluated, with an explicit focus on the classes that attained the lowest AP, while aiming to maintain the high AP of the other classes obtained in the previous step. The position-sensitive score maps are analyzed, as they were one of the major novelties of the RFCN model (the best-obtained model). The detection results are shown to understand the impact of the spatial bin width and height of the score maps. Furthermore, the anchor boxes were enhanced to show the influence of various anchor box scales and aspect ratios.
The results are shown by the training and validation plots and average precision of each class, along with the mAP of each enhanced version. Finally, the stratified k-fold cross-validation method was used due to the class imbalance problem in the proposed dataset and to validate the final mAP of the optimized DL model.

A. COMPARISON BETWEEN DL ARCHITECTURES
First, the DL models are trained on the original (non-augmented) images. Subsequently, the two best models are retrained on the augmented images. It is empirically found that the DL architectures should be trained for 200K steps to achieve training convergence. The input images are resized to 300 × 300 pixels with fixed-shape resizers for SSD MobileNet-v2 and SSD Inception-v2, and to 640 × 640 pixels for SSD ResNet-50 (RetinaNet). An aspect ratio resizer with minimum and maximum pixel dimensions of 600 and 1000, respectively, is used for the remaining models, including all versions of Faster R-CNN and RFCN. The EfficientDet model is trained with an aspect ratio resizer with 512 minimum and maximum pixels, according to the GPU requirement. Furthermore, SGD with momentum is used as the optimizer to train the models at this stage of the research. Different batch sizes are tested, and the most reasonable is found to be 4, balancing the trade-off between accuracy and training time. Four models required the lowest number of iteration steps, 170K, to achieve training convergence: Faster R-CNN ResNet-50, RFCN ResNet-101, EfficientDet, and RetinaNet. The lowest training time, around 5.5 h, is obtained for SSD MobileNet-v2, whereas Faster R-CNN Inception ResNet-v2 required the longest time to complete training. The following observations are made on the training and testing performance of the DL models.
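The training setup described above can be illustrated with a hypothetical configuration fragment in the style of the TensorFlow Object Detection API (the framework these model names and resizer options suggest, although it is not named here). The numeric values come from the text; the field names follow that API's conventions rather than the paper, and the momentum value is a typical placeholder:

```
# Hypothetical pipeline-config fragment (TF Object Detection API style)
model {
  faster_rcnn {                      # the RFCN meta-architecture shares this block
    image_resizer {
      keep_aspect_ratio_resizer {    # the "aspect ratio resizer" in the text
        min_dimension: 600
        max_dimension: 1000
      }
    }
  }
}
train_config {
  batch_size: 4                      # best accuracy/training-time trade-off found
  num_steps: 200000                  # trained to 200K steps for convergence
  optimizer {
    momentum_optimizer {             # SGD with momentum, used at this stage
      momentum_optimizer_value: 0.9  # typical value; not stated in the text
    }
  }
}
```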

1) TRAINING PERFORMANCE
• Plots of the total training and validation losses for each model are shown in Fig. 7. It can be observed that the Faster R-CNN ResNet-101 and RFCN ResNet-101 models attained the lowest losses: approximately 0.05-0.08% (training) and 0.06-0.09% (validation) for the former, and 0.04-0.2% (training) and 0.06-0.18% (validation) for the latter. Both models took around 10.5 hours to achieve convergence.
• However, the versions of the SSD models with Inception-v2 and MobileNet-v2 have approximately 1.5% total loss. This is comparatively higher than that of the other DL models, which is reflected in their detection results as a low mAP, as shown in Table 5.
• Later, the two best models in terms of the lowest training and validation losses after training on non-augmented images, RFCN ResNet-101 and Faster R-CNN ResNet-101, are retrained with augmented images. Their loss plots are shown in Fig. 8. It can be concluded from the plots that RFCN ResNet-101 has slightly lower training and validation losses of approximately 0.7% and 1.0%, respectively. It also attained the highest mAP, as shown in Table 5, because RFCN achieved a high AP for 10 of the healthy/disease classes. A sample of each class is shown in Fig. 9 (a).
• The Faster R-CNN models trained with Inception ResNet-v2 and ResNet-101 attained a higher mAP than the rest of the models, including Faster R-CNN ResNet-50, Faster R-CNN Inception-v2, the SSD models, RetinaNet, and EfficientDet. Faster R-CNN ResNet-101 is found to be the most useful model for the healthy class of avocado (Av_healthy_l) and attained the highest AP among all DL models. It is also noticed that some of the testing images of apple glomerella leaf spot and pear fire blight produced false positive and false negative detections, respectively, with Faster R-CNN ResNet-101, as shown in Fig. 9 (b). Similarly, classes such as apple healthy leaves and pear stony pit are well detected using Faster R-CNN Inception ResNet-v2, as shown in Fig. 9 (c-d).
• Although the RFCN model attained the highest mAP, it misclassified some of the testing images of the classes, such as the stony pit on the pear, as shown in Fig. 9 (d).
• The testing performance of the models including EfficientDet and RetinaNet is unsatisfactory, as several classes remained undetected or produced false positive results, as presented in Fig. 9 (e).
• It can also be seen from Table 5 that the black spot on apple leaves failed to be detected and localized by all DL models; an example from each DL model is presented in Fig. 9 (f). Some of the classes attained 0% average precision when trained by models such as EfficientDet, SSD Inception-v2, SSD MobileNet-v2, and SSD ResNet-50. These models failed to detect a few of the classes, which was observed in two ways: first, the testing images of those classes were undetected; second, false positive results were obtained due to confusion with other plant disease/healthy classes. In Fig. 9 (f), an example of the undetected/false negative outcome for apple black spot is presented for EfficientDet and RetinaNet.
• Further analysis was required to validate the selection of the best DL model for the next phase of research.
• In this regard, the top two DL models (RFCN ResNet-101 and Faster R-CNN ResNet-101) in terms of the lowest training and validation losses and the highest mAP are retrained using augmented images (considering all 13 augmentation categories).
• Classes such as apple black rot, apple European canker, and healthy apple leaves improved their AP after training the RFCN on augmented images. On the contrary, the AP of nearly 13 classes is significantly reduced by training on the augmented images, as shown in Table 5.
• The main finding of this step is that the RFCN model achieved the highest mAP both with and without augmented images. Another important observation is that the augmented images helped improve the AP of only a few classes. Moreover, the RFCN has shown its ability to address problems such as the detection of plant diseases in different organs and the identification of disease in different fruits, as shown in Figs. 9-10. However, a low AP was attained for several classes in the presence of the augmentation techniques. This motivated us to individually evaluate the effects of the augmentation techniques, which helped to understand the reason for the performance degradation after the application of all 13 types of augmentation methods in this step of the study.
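Since AP and mAP drive every comparison in Table 5, it may help to recall how they are computed. The sketch below shows the standard all-point interpolated AP over a precision-recall curve and mAP as the mean over classes; it is a generic illustration, not the exact evaluation code used in this study:

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve
    after making precision monotonically non-increasing in recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # sweep right-to-left so each precision is the max of what follows
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum rectangle areas where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))
```

A class detected perfectly (precision 1 at recall 1) yields AP = 1.0, and a class the model never finds yields AP = 0.0, matching the 0% entries discussed above.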

B. EFFECTS OF DATA AUGMENTATION TECHNIQUES
The effects of the data augmentation methods are studied by dividing them into five categories, as described in Section II-B (5). The top two DL models, RFCN ResNet-101 and Faster R-CNN ResNet-101, are trained on all five groups of augmented images. The important results from this phase of the study are discussed in the following.
• The RFCN model achieved the highest mAP after training with the OT data augmentation category, followed by the results obtained through the OO images, as shown in Table 5. However, comparatively lower mAP values are observed for OC and OCN, and RFCN achieved the lowest mAP with OTCN.
• To perform a more in-depth analysis of the data augmentation techniques, a class-wise analysis is performed. OT is found to be the best method because the RFCN attained a higher AP in eight of the healthy/disease classes with it. Therefore, the OT category attained superior results.
• The effectiveness of the OT category is also validated using Faster RCNN ResNet-101, as shown in Table 5. The mAP is higher in the OT group than in all other categories. However, with the OTCN, the model has achieved the lowest mAP value.
• There could be several reasons for the performance degradation when training with OC and OCN. The nature of the real agricultural environment could contribute to the confusion in discriminating between plant diseases. Because the real field contains several background elements, a change in color or the addition of noise to the original images prevents the model from extracting and learning the specific and distinct features of the disease symptoms. There may be similarities between the disease symptoms and background elements [47] after adding noise and changing brightness, contrast, and sharpness. This resulted in a low AP for the individual classes and a low mAP over the 20 classes.
• The above statement can be further understood by taking examples of some classes that achieved comparatively lower AP. For example, avocado branch canker is confused with apple glomerella leaf spot, healthy avocado leaves with healthy pear leaves, and apple mosaic virus with healthy apple leaves. Similarly, the black spot on grapevine cane could not be detected and/or is misclassified as apple European canker. The stony pit virus on the pear is also confused with fire blight. Furthermore, on avocado leaves, several algal leaf spots are not detected. Some examples of false positives and undetected outcomes are shown in Fig. 11.
• In contrast, the improved performance with OT shows the practical value of this work, because the location of disease spots in a real agricultural environment varies from one plant to another. Furthermore, if a DL model such as RFCN can detect the disease in translated and rotated images, this demonstrates the importance of the OT-based data augmentation technique for the identification of plant diseases.
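As a rough illustration of the OT category, the sketch below produces the original image plus translated and rotated variants. It is a simplification: rotations are restricted to multiples of 90° so bounding boxes stay axis-aligned, whereas the exact rotation angles and translation offsets used in the study are not specified here:

```python
import numpy as np

def translate(img, dx, dy):
    """Shift the image by (dx, dy) pixels; vacated areas are zero-filled
    (unlike np.roll, content does not wrap around)."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    src_y = slice(max(-dy, 0), h - max(dy, 0))
    src_x = slice(max(-dx, 0), w - max(dx, 0))
    dst_y = slice(max(dy, 0), h - max(-dy, 0))
    dst_x = slice(max(dx, 0), w - max(-dx, 0))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def rotate90(img, k=1):
    """Rotate by k multiples of 90 degrees (lossless, box-friendly)."""
    return np.rot90(img, k)

def augment_ot(img, rng):
    """Original + one translated + one rotated variant (the OT idea).
    Offsets up to a quarter of the image width are an assumption."""
    dx, dy = rng.integers(-img.shape[1] // 4, img.shape[1] // 4, size=2)
    return [img, translate(img, int(dx), int(dy)),
            rotate90(img, int(rng.integers(1, 4)))]
```

In a real detection pipeline the bounding-box annotations would have to be transformed together with the pixels, which is why box-preserving transforms are convenient.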

C. EFFECTS OF IMAGE RESIZERS AND INTERPOLATORS
The next step is to study the effects of image resizers with different interpolators on RFCN. In this regard, the RFCN model is initially trained using an aspect ratio resizer and a bilinear interpolator. Eight possible combinations of image resizers and interpolation methods are used to obtain the best results, as shown in Table 6. The performance of each combination was not only evaluated by mAP, training, and validation losses, but also by the training time as considered in [48]. The main observations and discussions of this step of the analysis are presented below.
• The aspect ratio is considered the default resizing technique with minimum and maximum dimensions of 600 and 1000 pixels, respectively [35]. This image resizing technique is used in conjunction with bilinear interpolation.
• As presented in Table 6, the aspect ratio resizer with the other three interpolators, such as bicubic, area, and nearest neighbor, has degraded the performance of RFCN in terms of lower mAP.
• Later, a fixed-shape resizer with a default value of 300 × 300 pixels is applied with bilinear interpolation. Subsequently, three other interpolators are tested. The RFCN model trained with the bicubic interpolator and the fixed-shape resizer required the lowest training time, training loss (0.52%), and validation loss (0.85%), as shown in Table 6. Both losses were lower than those acquired with the default resizer/interpolator, as shown in Fig. 12.
• There is no sign of overfitting of the RFCN model, as the validation loss settled to its final value and no increase in the loss was observed. This is validated by a higher mAP of 80.59%, which is 3.83% better than that obtained using the default settings. Furthermore, the AP of avocado algal leaf spot, pear fire blight, and healthy pear leaves is significantly improved to 88.55%, 98.16%, and 49.9%, respectively.
• The bilinear interpolator with a fixed-shape resizer has also performed slightly better than the nearest neighbor and area interpolation.
• Bicubic interpolation considers a 4 × 4 (16-pixel) neighborhood when evaluating each interpolated pixel, compared to the 2 × 2 neighborhood used by bilinear interpolation. Therefore, better-quality images are obtained for the healthy and disease classes fed into the RFCN model, producing a better mAP.
• In conclusion, both the training and testing performances of the RFCN model are improved with an enhancement in the AP of the three classes (after training with a fixed-shape resizer along with a bicubic interpolator). In the future, other relevant datasets can be tested using the combinations of resizers and interpolators presented in this phase of the research.
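The difference between the interpolators can be made concrete with a minimal bilinear resize: each output pixel blends its 2 × 2 input neighborhood, whereas bicubic extends the same idea to a 4 × 4 neighborhood (hence the better-preserved lesion detail). This toy implementation is for illustration only, not the resizer used in the experiments:

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of a 2-D image: each output pixel is a weighted
    blend of the 2x2 input pixels surrounding its back-projected
    position (half-pixel-centered sampling)."""
    h, w = img.shape[:2]
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # vertical blend weights
    wx = (xs - x0)[None, :]   # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

Swapping in a cubic kernel over a 4 × 4 window gives bicubic interpolation; in practice, libraries such as OpenCV provide both as ready-made interpolation flags.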

D. EFFECTS OF WEIGHT INITIALIZERS, BATCH NORMALIZATION, AND DL OPTIMIZERS
The next phase of the proposed approach is the optimization of the RFCN model. The appropriate selection of weight initializers, along with their parameters, mitigates the problems of vanishing and exploding gradients. Three weight initialization methods are used: truncated normal (the default initializer), scaling variance, and random normal. After determining the most suitable initialization technique, the effects of batch normalization are studied. Then, the best DL optimizer is selected, and the hyperparameters are tuned using a random search method [45]. These steps are performed before proposing any modifications to the RFCN model. The training and testing evaluations of this phase of the study are summarized below.
• First, the RFCN model is trained using a truncated normal initializer. Subsequently, it is trained using the scaling variance and random normal initializers.
• The random normal initializer has achieved the lowest training loss and the highest mAP with a standard deviation of 0.01 and a mean value of zero. It has been observed that the selection of an optimum value for standard deviation and mean plays an important role in the performance of the model.
• When searching for the appropriate values of the standard deviation and mean, the random normal initializer was initially used with the default standard deviation of 1 and a mean of zero. However, these values were unable to achieve convergence. Therefore, the standard deviation was tuned in exponential steps. It was empirically found that 0.01 was the most suitable standard deviation, while the mean value of zero remained the same. This setting produced the lowest training and validation losses, which resulted in a slight improvement of 0.916% in the mAP. Furthermore, the extraction of distinct features was also improved, as the AP of some of the classes was enhanced with the described settings. These classes include apple black rot, apple European canker, apple glomerella leaf spot, apple mosaic virus, and healthy pear (leaves), with improvements of 8% to 28%. Similar standard deviation and mean values were also suitable for the truncated normal initializer.
• The best settings for scaling variance were single scaling factors with a uniform distribution and considering the average of the input and output units in the weight tensor (Fan_avg). However, these parameters did not contribute to improving the model performance and attained an mAP of only 70.20%.
• The random normal initializer produced the best result for this particular plant disease detection application of the RFCN model. Therefore, it can be concluded from its basic functionality that the weights initialized by generating tensors with a normal distribution performed well for the selected problem. Moreover, theoretically, the random normal initializer is supposed to work with weights initialized very close to zero, so that each neuron of the network does not perform the same calculation.
• It was experimentally found that a small standard deviation value was not suitable. For example, at a standard deviation of 0.001, the performance declined in terms of mAP to 78.33%, compared to mAP at 81.50%, obtained with a 0.01 standard deviation.
• The next step was the use of batch normalization. The RFCN is trained with the default epsilon and decay values of 0.001 and 0.99, respectively. It is also noted that training convergence is achieved earlier, at around 160K steps instead of around 170K iterations. Therefore, it can be concluded that batch normalization reduced the overall training time and showed a fast convergence ability [49].
• The decay and epsilon were then tuned, and it was experimentally found that a lower decay value and a higher epsilon value improved the performance of RFCN. Therefore, the decay was set to 0.5 and the epsilon to 0.01. The training loss was slightly improved to around 0.515%.
• The testing performance of the RFCN model was also significantly improved. The model attained an mAP of 85.94% with an improvement of 4.345% compared to the one obtained in the previous step.
• The last step before proposing any modification to the RFCN model is the utilization of different DL optimizers. SGD (with momentum) is used to train the model as the default DL optimizer. Subsequently, its performance is compared with that of Adam and RMSProp.
• After training the RFCN model using all three DL optimizers, it is found that SGD with momentum is the best DL optimization algorithm. The Adam optimizer was unable to achieve a high mAP and therefore did not effectively optimize the weights of the RFCN model. RMSProp also achieved a lower mAP of 82.819%.
• Individual APs of several classes, including apple black rot, apple black spot, and pear canker are degraded by RMSProp.
• The best performance of the SGD optimizer demonstrates its generalizability in extracting the features of the healthy and disease classes and optimizing the weights of the RFCN. Therefore, it can be summarized that, for the NZDLPlantDisease-v1 dataset of healthy and diseased plant organs, the non-adaptive optimizer (SGD with momentum) was quite successful compared to the adaptive optimization techniques RMSProp and Adam.
• To address one of the research gaps presented in the previous section, the RFCN model trained with the random normal initializer, batch normalization, and the SGD with momentum optimizer is also successful in identifying multiple disease problems and detecting plant disease under different weather conditions (sunny, cloudy), as shown in Fig. 13 and Fig. 14, respectively. Another important observation is that the pear scab class still attained a low AP of 5.06%. Although all steps up to the application of the various DL optimizers significantly improved the performance of RFCN in terms of lower training and validation losses compared to the default configuration of the model, the pear scab remained almost undetected or falsely identified. This result provided a strong basis for the next steps of the research to focus on the architectural evaluation/modification of the RFCN model.
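The three winning choices of this section (random normal initialization with a standard deviation of 0.01, batch normalization with the tuned epsilon, and SGD with momentum) can be sketched in a few lines. The learning rate and momentum values below are illustrative placeholders, as they are not listed here, and the batch-norm sketch omits the running-statistics decay of 0.5 for brevity:

```python
import numpy as np

def random_normal_init(shape, mean=0.0, stddev=0.01, seed=0):
    """Random-normal weight initializer with the tuned stddev of 0.01
    (the default stddev of 1.0 failed to converge; 0.001 hurt mAP)."""
    return np.random.default_rng(seed).normal(mean, stddev, size=shape)

def batch_norm(x, eps=0.01):
    """Per-feature batch normalization with the tuned epsilon of 0.01;
    learnable scale/shift and running statistics are omitted here."""
    mu, var = x.mean(axis=0), x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update, the best optimizer found here;
    lr and momentum are assumed values, not the paper's settings."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

The stddev of 0.01 keeps initial weights close to zero without collapsing them, which is exactly the symmetry-breaking argument given above for the random normal initializer.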

E. PERFORMANCE ENHANCEMENT OF PEAR SCAB
There are two major goals for this step of the research. First, an improvement in the AP of pear scab, which remained undetected after the application of the techniques explained earlier. Second, the high AP of the other 19 healthy/disease classes should be maintained. In this regard, the RFCN model has been investigated in two stages: position-sensitive score maps and enhanced/modified anchor box scales and aspect ratios.
One of the primary novelties of the RFCN model is the generation of position-sensitive score maps. The spatial bin configuration was set to 3 × 3 by default. It was empirically observed that the spots of pear scab were so small that the model could not extract their features, and the disease therefore could not be detected. This might be because none of the sub-regions of the position-sensitive score maps could match the pear scab in most of the testing images; consequently, the position-sensitive region of interest (RoI) pooling could not vote for pear scab disease. In this regard, the first attempt was to increase the score maps using multiples of 3. A 9 × 9 spatial grid yielded the most satisfactory results and attained an mAP of 84.68%, whereas other spatial bins, such as 6 × 6, 12 × 12, and 15 × 15, produced lower mAPs of 82.819%, 82.041%, and 82.59%, respectively, with pear scab APs of 5.03%, 5.05%, and 5.28%. With the 9 × 9 grid, there was only a slight difference in the total training and validation losses, and the model detected pear scab with an AP of 20.1%. Still, the DL model could not detect the pear scab with a high AP. Furthermore, the RFCN trained with 9 × 9 spatial bins disrupted the detection of apple black spot, achieving a lower AP of 49.69%. An example of the detection of apple black spot is presented in Fig. 15.
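The role of the spatial bins can be illustrated with a stripped-down position-sensitive RoI pooling step. The real RFCN uses k²(C+1) learned score-map channels over all classes; this single-class sketch keeps only the core idea that each of the k × k bins votes from its own dedicated channel, which is why a lesion smaller than a bin can fail to register:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling sketch for a single class.
    score_maps: array of shape (k*k, H, W), one channel per spatial bin.
    roi: (y0, x0, y1, x1) in pixel coordinates.
    Each bin is average-pooled from its dedicated channel, and the
    k*k bin scores are averaged to produce the class vote."""
    y0, x0, y1, x1 = roi
    bin_h = (y1 - y0) / k
    bin_w = (x1 - x0) / k
    votes = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            ya, yb = int(y0 + i * bin_h), int(y0 + (i + 1) * bin_h)
            xa, xb = int(x0 + j * bin_w), int(x0 + (j + 1) * bin_w)
            # guard against empty bins when the RoI is tiny (small lesions)
            votes[i, j] = score_maps[i * k + j,
                                     ya:max(yb, ya + 1),
                                     xa:max(xb, xa + 1)].mean()
    return float(votes.mean())
```

With a larger k (e.g., the 9 × 9 grid tried above), each bin covers a smaller sub-region of the RoI, which helps tiny, localized symptoms such as pear scab contribute to the vote.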
Another attempt has been made to solve this problem. The training images of the pear scab are magnified, and the RFCN model is trained again; this is one way to make the pear scab spots overlap with the 3 × 3 score maps. With this approach, the RFCN has successfully detected and localized both the apple black spot and the pear scab.

F. ENHANCEMENT OF ANCHOR BOXES
After a significant improvement in the average precision of the pear scab was achieved in the previous step, the anchor boxes of the RFCN are enhanced. In this regard, the scale sizes and aspect ratios are modified to obtain an optimum anchor box configuration that can provide an AP of more than 80% for each class. The summary of the results is as follows.
• Although the previous step considerably improved the AP of the pear scab, a few classes also required attention for further performance enhancement. For example, classes such as apple black rot, apple black spot, apple European canker, and pear healthy (leaves) achieved an AP of less than 80%. During the annotation of the training images, it was empirically observed that the bounding-box dimensions of several classes varied considerably. Therefore, different scales are tested to generate the anchor boxes. However, none of the other combinations of reciprocal aspect ratios showed noticeable results.
• Subsequently, the effects of the step-by-step/gradual enhancement of the aspect ratios are studied. In this regard, small aspect ratios, starting from 1:4, were added to the default ratios of 1:2, 1:1, and 2:1, together with the enhanced scale sizes. After several experiments, it is found that the aspect ratio set of 1:2, 1:1, 2:1, 3:1, and 4:1 improved the training and testing performance of the model. The total training and validation losses were reduced from 0.4-0.515% and 0.4-0.8% to almost 0.3-0.37% and 0.4-0.71%, respectively. Furthermore, the individual box classifier localization loss was reduced from almost 0.3% to 0.2%, as shown in Fig. 17. There is no sign of overfitting, as the losses converged, no abrupt rise in the validation loss was observed up to the final iteration step, and there was only a small difference between the training and validation losses.
• The feature extraction of the healthy and diseased plant classes is presented by t-distributed stochastic neighbor embedding (t-SNE) plots in Fig. 18. It can be seen that, for each of the healthy/defective classes trained by the final RFCN model, there is high interclass separability and a small intraclass distance, and well-grouped clusters have been created that are concentrated on their respective features. Furthermore, the effectiveness of the proposed modifications is presented by comparing this t-SNE plot with the plots for the previous step and for the default settings of the RFCN and Faster R-CNN models after the application of the OT data augmentation method.
• It can be observed in Fig. 18 (b) that, after the application of the techniques presented in Sections III C-E, some of the features of classes such as apple black rot, apple black spot, apple European canker, and pear healthy leaves were not well extracted and were confused with the features of other apple and pear classes. Similarly, the t-SNE plot of the RFCN model after the OT data augmentation technique (Fig. 18 (c)) attained comparatively small interclass distances: several features of apple black spot, pear fire blight, and pear healthy leaves were not extracted and were confused with apple European canker, apple healthy leaves, pear fire blight, pear healthy leaves, and pear scab. Likewise, the Faster R-CNN ResNet-101 model (Fig. 18 (d)) provided a degraded clustering performance compared to the final RFCN model: the distinct features of apple black spot, apple European canker, apple mosaic virus, grapevine black spot, pear fire blight, pear healthy leaves, and pear scab did not form proper clusters and were confused with the features of other healthy/disease classes. This shows that the proposed modifications of the anchor boxes generate a significant difference in the feature extraction of the healthy and diseased plant classes.
• Moreover, the mAP is improved with a margin of 6.406% (Fig. 19). A significant improvement is also observed in the individual AP of classes such as apple black rot, apple black spot, apple European canker, pear healthy (leaves), and pear stony pit, as shown in Fig. 19.
• Other combinations of the gradual addition of aspect ratios are also examined, as shown in Fig. 19. The class-wise performance of the four prominent combinations of enhanced anchor boxes is presented in Fig. 20. A few examples of false-negative results obtained with the default anchor boxes, and solved by the enhanced anchor boxes, are presented in Fig. 21. A pictorial representation of the proposed modification of the anchor boxes is presented in Fig. 22.
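The enhanced anchor configuration can be sketched as a simple generator of (height, width) pairs from scales and aspect ratios. The 3:1 and 4:1 ratios come from the text; the scale values below (yielding 16-, 32-, and 64-pixel base anchors) follow the scales mentioned in the conclusion, and the exact parameterization of the underlying detection framework may differ:

```python
import numpy as np

def make_anchors(base_size=256, scales=(0.0625, 0.125, 0.25),
                 aspect_ratios=(0.5, 1.0, 2.0, 3.0, 4.0)):
    """Generate (height, width) anchor shapes. Each anchor keeps the
    area (base_size * scale)^2 while ratio = width / height, so a 4:1
    anchor is wide and shallow; default scales give 16/32/64-px anchors."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in aspect_ratios:
            h = np.sqrt(area / r)   # solve h*w = area with w = r*h
            anchors.append((h, h * r))
    return anchors
```

Adding elongated ratios such as 3:1 and 4:1 lets the region proposal stage cover long, thin symptom regions (e.g., cankers along stems and canes) that square-ish anchors match poorly.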

G. OVERALL REMARKS ON THE PREVIOUS STEPS
A summary of the results presented from Section III-A to Section III-F is provided below:
• The proposed deep learning (DL)-based method gradually enhanced the accuracy of plant disease detection from the comparison between DL architectures (Section III-A) to the enhancement of anchor boxes (Section III-F). The average precision of each class, along with the mean average precision of the DL models, is evaluated for each stage of the research. Each step has significance in terms of better training and testing results. The main reason for obtaining better results was that each step involved an in-depth analysis that identified a strong motivation for the subsequent steps to further improve the mAP.
• For example, a comprehensive analysis of several DL models was performed to select the best-suited model, and this selection was validated using augmentation techniques. However, the mAP obtained when applying all augmentation methods together was significantly reduced. To cope with this problem, a category-wise comparison of the augmentation techniques was performed, which yielded the best-suited technique for the selected application.
• Similarly, after the application of various techniques such as image resizers, interpolators, weight initializers, batch normalization, and DL optimizers, pear scab still achieved unsatisfactory results. To attain a high AP for pear scab, the major novelty of the original RFCN model was analyzed, and the enhancement of the anchor boxes was attempted. In this way, the strong analysis of each step provided solid grounds for applying the subsequent steps. A summary of all steps, including the best-selected method/model along with the mAP, is presented in Table 7.
• From Table 7, it can be concluded that each succeeding step achieved a higher mAP compared to its previous step. Furthermore, the most effective step in terms of mAP improvement was found to be the enhancement of the anchor boxes, with an improvement of 6.406% compared to the earlier step.
• The effectiveness of the proposed approach is also presented through the confusion matrix. For example, classes such as apple black spot and pear stony pit were confused with healthy apple (leaves) and fire blight, respectively, during the initial step of the methodology (Fig. 23 (a)). This can be verified by their detection results, already presented in Figs. 9 (d) and (f). Consequently, a low recall of 40.62% and 36.84%, respectively, was attained for these classes. However, none of the classes suffered from a high number of wrong/missed classifications after the anchor boxes were enhanced (Fig. 23 (b)), which led to a significantly higher mAP.

H. VALIDATION OF THE FINAL RESULTS
This study has validated the results and claims described in this article in two ways. The first technique adopted is stratified five-fold cross-validation, through which the dataset images of each class are folded five times so that the testing images in each fold differ from one fold to another. This technique was applied because of the class imbalance problem in the generated dataset, as it avoids a biased distribution of the dataset in each fold. To further evaluate the final mAP of the proposed method, the variance was calculated by formula (31) and evaluated as 0.70157:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \tag{31}$$

where $n$ is the number of folds, $x_i$ is the mAP of the $i$-th fold (fold 1, fold 2, etc.), and $\bar{x}$ is the mean. The next way to validate the claimed results is to test the optimized model on an external curated dataset. The original and translation/rotation (OT) augmentation method is also applied to that testing dataset. The mAP is 87.95%, which is only 5.85% lower than the final mAP obtained from the testing sub-dataset of the proposed dataset. Eleven classes, including avocado algal leaf spot, avocado branch canker, avocado healthy, apple black rot, apple glomerella leaf spot, apple healthy (fruit), grapevine healthy, kiwifruit healthy, pear healthy (fruit), pear fire blight, and pear scab, are detected with a high AP of more than 90%. However, a few classes, such as pear canker, grapevine black spot, and pear stony pit, attained an AP of less than 80%, which accounts for the difference from the mAP obtained on the testing images of the NZDLPlantDisease-v1 dataset. Examples of a few classes that achieved high and low AP on the external dataset are presented in Fig. 24.
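The validation procedure can be sketched as follows: a minimal stand-in for stratified five-fold assignment (each class spread evenly across folds, addressing the class imbalance) plus the fold-variance computation of formula (31). Whether (31) divides by n or n − 1 is not shown here; the population form (divide by n) is used below:

```python
import numpy as np

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample to one of k folds so that every class is
    spread as evenly as possible across folds (a minimal stand-in
    for stratified k-fold splitting)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)                      # randomize within the class
        folds[idx] = np.arange(len(idx)) % k  # round-robin over folds
    return folds

def fold_variance(fold_maps):
    """Population variance of the per-fold mAP values, as in (31)."""
    x = np.asarray(fold_maps, dtype=float)
    return float(((x - x.mean()) ** 2).sum() / len(x))
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would be used; the sketch only makes the class-balancing idea explicit.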

I. LIMITATIONS OF THE STUDY
Although the presented methodology has successfully detected plant disease using the proposed dataset, there are a few limitations of this study that can be taken into account in future research. The presence of disease in multiple organs has been considered for apple and pear, whereas for grapevine, avocado, and kiwifruit, disease in only one plant organ has been considered. Moreover, only one disease class is presented for each of grapevine and kiwifruit. Therefore, the proposed dataset should be further extended to gain more insight into deep learning-based plant disease detection. Moreover, the validation of the modified/optimized model on an external dataset revealed that a few of the classes did not achieve a high AP. One of the reasons could be the absence of diversity in the samples of those classes in the presented dataset. Furthermore, dataset images from both sides of the plant organs should be considered; for example, the symptoms of the disease on the front and back sides of the plant leaf could be included. This would generate more variety in the symptoms of plant disease. Also, all dataset images were collected from New Zealand horticultural fields. However, the dataset could be extended by capturing images of similar diseases in the same crops from horticultural fields in different countries. Moreover, the annotation was a tedious process due to the addition of the augmented images. Furthermore, as this article has addressed various practical problems, the most difficult among them was the detection of multiple diseases in a plant organ at a time; this task required even more time to annotate correctly. Therefore, it can be said that human intervention is still required when using deep learning to perform complex tasks like plant disease detection.

IV. CONCLUSION AND FUTURE DIRECTIONS
This study addresses various research gaps in the identification of plant diseases based on deep learning. In this regard, a new dataset called NZDLPlantDisease-v1 is generated, and a DL-based approach is presented to detect and localize disease in five of the most important New Zealand horticultural crops in terms of export value. After training and testing various DL architectures, the region-based fully convolutional network (RFCN) achieved the highest mean average precision both with and without the application of augmentation techniques. The proposed methodology consists of a comprehensive evaluation of various techniques that impact the deep learning model and that had not yet been explored for plant disease identification tasks. Furthermore, a modified/optimized version of the RFCN model is proposed by performing an in-depth analysis of position-sensitive score maps and anchor-box scales with aspect ratios. An improved mAP of 93.80% is achieved, which is 19.33% better than the default settings. The optimized RFCN involves training the model with a fixed-shape resizer and a bicubic interpolator, a random normal initializer, batch normalization, and the SGD with momentum optimizer. It is also observed that the translation/rotation augmentation method is the most suitable for obtaining satisfactory results. Furthermore, the addition of 16 × 16, 32 × 32, and 64 × 64 scales with aspect ratios of 3:1 and 4:1 significantly improved the performance of the RFCN. The optimized/modified RFCN model has successfully answered the research questions, including the detection of diseases in several plant organs, the presence of multiple diseases in one organ at a time, and the identification of diseases in different crops using the same trained DL model. Finally, the statements and results are validated by two different methods: stratified five-fold cross-validation and testing on an external dataset.
These validation approaches demonstrate the significance and novelty of this study.
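The stratified five-fold cross-validation used for this check can be sketched as follows. This is a minimal, framework-free illustration (not the authors' exact implementation): indices of each class are dealt round-robin into the folds so that every fold receives a near-equal share of every disease class.

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Split sample indices into k folds, preserving per-class proportions.

    Indices of each class are dealt round-robin into the folds, so every
    fold receives a near-equal share of every class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# Toy example: 10 samples of one class, 5 of another
labels = ['a'] * 10 + ['b'] * 5
folds = stratified_kfold(labels, k=5)
print([len(f) for f in folds])  # each fold: 2 of 'a' + 1 of 'b' = 3 samples
```

Each of the k folds then serves once as the test set while the remaining folds are used for training, which guards against a single lucky train/test split inflating the reported mAP.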
Furthermore, one of the advantages of this study is that the crops selected for this research exhibit considerable variation in their environments/backgrounds. Therefore, the high average precision achieved in each class demonstrates the broader potential of deep learning technology for the detection of plant diseases under the various challenges of the horticultural environment.
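For reference, the per-class average precision (AP) and the reported mAP are related as sketched below, assuming the PASCAL-VOC-style 11-point interpolation commonly used in object-detection evaluation (the paper does not specify the exact interpolation variant, so this is illustrative).

```python
def average_precision_11pt(recalls, precisions):
    """PASCAL-VOC-style 11-point interpolated average precision.

    At each recall threshold t in {0.0, 0.1, ..., 1.0}, take the maximum
    precision achieved at any recall >= t, then average the 11 values.
    """
    total = 0.0
    for i in range(11):
        t = i / 10.0
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        total += max(candidates) if candidates else 0.0
    return total / 11.0

def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)

# A detector with perfect precision at all measured recall levels scores AP = 1.0
print(average_precision_11pt([0.5, 1.0], [1.0, 1.0]))
```

Because mAP weights every class equally, a single low-AP class (such as those noted on the external dataset) pulls the mean down regardless of how many samples that class contains.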
The idea/methodology proposed in this study can be utilized in several ways in future work. A deep learning-based method can be embedded in automated/robotic systems to apply disease control techniques; for example, a fungicide spray could be applied to the defective parts of plants using a robotic manipulator. Furthermore, diseases are treated differently depending on the pathogen affecting the plants. Therefore, addressing the detection of multiple classes of plant disease in a single organ (a plant suffering from different diseases at a time) [50] could be useful for implementing a cost-effective protection system. For instance, black spot on apples is normally treated with a fungicide spray, whereas no such treatment is available for apple viruses [51]. Hence, this research will help growers take appropriate treatment measures after detecting multiple plant diseases in an organ.
In addition, various tasks can be performed to further enhance research on DL-based plant disease detection. For instance, advanced data augmentation techniques, including super-resolution convolutional neural networks (SRCNNs) and super-resolution generative adversarial networks (SRGANs), can be explored. Moreover, segmentation-based DL models can be leveraged and modified using the generated dataset. Furthermore, the performance metrics presented in [52] can be explored to perform a more in-depth analysis of the multi-label plant disease detection problem. Other research ideas could further strengthen DL-based solutions for agricultural problems. For example, a sensitivity analysis (like the one performed for a teleoperation system to examine the effects of important parameters on system performance [53]) can be performed for the DL models used in various agricultural operations. Moreover, the layer-wise output of well-known DL models could be visualized to guide modification of the hidden layers. A comparison of CNNs with CapsuleNet models can also be emphasized, as this class of machine learning is being explored for various object detection problems [54].

AVAILABILITY OF DATA
The dataset generated and analyzed during the current study is available in the GitHub repository https://github.com/kmarif/NZDLPlantDisease-v1.