Hybrid CNN Model for Classification of Rumex Obtusifolius in Grassland

Rumex obtusifolius Linnaeus (R. obtu. L.) is one of the most significant broad-leaved weeds in grassland that needs removal. It affects dairy products and reduces their quality. Hand-removal methods are costly and time-consuming. Chemical treatment using herbicides has a negative impact on crops and causes environmental pollution. In smart farming, weeding is performed by using computer vision to recognize the weeds efficiently and effectively. Conventional machine learning (ML)-based algorithms face challenges, especially in identifying the weeds in real-world data, due to a lack of features. Deep learning (DL) approaches use self-learning to extract all potential features that assist in classifying malignant weed species accurately. Recently, single deep learning methods achieved high performance in identifying well-separated, well-illuminated leaves but suffered from misclassification in more sophisticated cases such as overlapping and partially occluded leaves. This paper presents a hybrid Convolutional Neural Network (CNN) model of three state-of-the-art CNNs to classify Rumex obtusifolius. The proposed model utilizes convolutional neural networks to extract features and classify images. The framework of the proposed method comprises three main stages: the data preparation phase, the pre-processing phase, and the classification phase. A hybrid model of three CNN extractor networks is used as the backbone in the classification stage. Our test data is real-world data acquired from nature that covers multiple conditions (overlap, occlusion, various illuminations, etc.). The first extractor is the Visual Geometry Group-16 (VGG-16) network, for well-separated leaves and uncomplicated cases. The second extractor is the Residual Network-50 (ResNet-50), to overcome complex real-world issues. The third extractor is Inception-v3, to solve the illumination problem.
Therefore, combining three networks into one model improves the discriminatory ability to extract additional useful features. The proposed model has been tested using two benchmark datasets for Rumex weed plants. Both of these datasets were captured in real-world environments. The first dataset consists of 900 samples, while the second dataset consists of 677 samples. Each dataset is individually tested in the proposed model to evaluate the classification accuracy using a set of standard evaluation metrics including accuracy, precision, recall, True-Positive Rate (TPR), False-Positive Rate (FPR), and F1-score. The overall averages of the proposed model on both datasets are 97.51%, 97.4%, 94.45%, and 95.9% for accuracy, recall, precision, and F1-score, respectively.


I. INTRODUCTION
In recent years, weed control within planted crops has seen continuous development. Rumex obtusifolius (R. obtu.), or broad-leaved dock, is considered an undesirable weed plant in agriculture that necessitates removal. The harmful effects of this weed have spread around the world, particularly in Europe. In Germany, 85% of organic farms encounter broad-leaved dock issues. It diminishes the grass's productivity by 10%-40% [1]. Due to the widespread nature of Rumex, livestock consume it readily and intensively. Therefore, it has a substantial impact on dairy and productivity [2]. It causes animal health issues due to its high oxalic acid content and lowers the quality of products due to its low nutritional value. In addition, it significantly affects the economic growth of countries. Progress in the accurate classification of Rumex is, however, restricted by the reliance on physical removal or chemical treatments. These issues motivated this study to design a robust weed classification model that can be utilized in an automatic weed control system to classify this harmful species of weed. Due to the vast range of weed species in nature and of working conditions, this research issue is fraught with challenges.
Manual or hand-weeding is one of the well-known techniques to eliminate weeds. The farmer scans the entire farm for undesirable or unusual plants, pulling them out by hand or with simple tools. This technique faces numerous challenges, such as long completion time, difficulty of detection, and labor cost. Another method for malignant weed removal is chemical treatment, which targets numerous weeds using herbicides sprayed on large-scale farms [3]. Farmers or machines perform this process. The problems with such a technique are the treatment cost and environmental pollution. In addition, it adversely affects animal and human health [4]. Thus, both hand-removal and chemical treatment techniques are time-consuming, costly, and can result in environmental issues [5].
Nowadays, precision farming or smart farming approaches utilize computer vision as an alternative technique to determine the Region of Interest (RoI) in the pasture [6]. These approaches are more robust in terms of efficiency and effectiveness. Over the last three decades, some studies have focused on detecting various weed species by discriminating these weeds from crop plants, as in Binch and Fox [7]. In their work, Jia et al. [8] examined the identification of plants on the farm using thresholding. They located the root position by computing the cross points of the major veins in corn leaf images captured from the top view. In the same year, Franz et al. [9] used the curvature technique to detect partially occluded leaves on different seedling plants at a late growth stage. By aligning the resampled curvatures for each genus, the authors revealed the significance of identifying a leaf that was not entirely occluded. The shortcoming of their approach was related to the accuracy of curvature in identifying various shapes of serration. Tian [10] utilized spatial features to determine certain locations of cotyledon crop plants. They obtained the location information by calculating the center point of the stem during the early growth stage. Woebbecke et al. [11] proposed a method for differentiating monocot and dicot weed plants, representing two weed species that exist in the United States. They claimed that the optimum period to address these harmful plants would be from the 14th to the 23rd day of their growth. In the last three decades, real-world data has remained a challenge for the computer vision community. Occlusion, overlapping, different illumination, and various growth stages are common issues in real-world data [12]. Hand-engineered features are the basis of traditional Machine Learning (ML) methods, which extract features manually. These methods achieved satisfying results with artificial data under controlled conditions. Deep Learning (DL)-based approaches are an extension of ML approaches that achieve encouraging results with real-world data using self-extracted features [13].
The key limitation of the previous methods is how to accurately classify Rumex under real-world conditions when training images are insufficient. Because annotating a huge number of images is tedious, a model that works with a reasonable number of images covering various real-world conditions is desirable. Therefore, the main drive of this work is to investigate the issues of Rumex classification, including heavy occurrence, various growth stages, overlapping with plants, and adverse environmental-agricultural impact. In addition, using mechanical or chemical actuation methods to control Rumex species impacts human and animal life and reduces the quantity and quality of the plants. This is the first time that a hybrid Convolutional Neural Network (CNN) has been used to improve the accuracy of classifying Rumex weed plants in a complicated scenario. The purpose of using CNN networks in classification is to provide a robust method for self-extracting features [14].
Our main contribution focuses on designing a new ensemble model with three CNN networks as its backbone. The framework of this model is adaptable to numerous weed control applications to address various weed species. This ensemble uses a majority-voting rule to decide whether the model considers the plant a weed or not. To the best of our knowledge, this combination has not previously been used for Rumex classification. The Ensemble-Region Convolutional Neural Network (E-RCNN) is proposed for its novelty in using ensemble classifiers as its backbone. The second contribution is using the newly proposed model to address data challenges under real-world conditions such as occlusion, overlapping, various image resolutions, various growth stages, and different illumination conditions. Combining three extractors into one model provides the following expected benefits: i) enhancing classification accuracy, ii) reducing the illumination effect, iii) controlling the occlusion and overlapping issues, and iv) enhancing the capability of feature extraction and representation. This study uses benchmark datasets from Kounalakis et al. [15] and Van Evert et al. [1]. Both sets consist of real-world images from actual farms, captured under challenging conditions such as various illuminations, occlusions, and overlapping. The remainder of this paper is organized as follows: The related work is presented in Section II; the proposed model is outlined in Section III; the materials and methods are explained in Section IV; Section V is the conclusion; and finally, limitations and future trends are elucidated in Section VI.

II. RELATED WORK
This section discusses the most closely related works that explored ML and DL techniques to address broad-leaved weed plants. Dürr et al. [16] utilized Local Binary Patterns (LBP) with C-histograms to extract the size and spectral features of the Rumex weeds. Then, they eliminated the detected regions using a heating oven at 1200 KW. The problem with their method is the high error rate of misclassified regions, which reached 35%. Van Evert et al. [17] found that texture is a significant feature for identifying broad-leaved weeds like Urtica and Rumex. Binch and Fox [7] compared different ML algorithms using real data. Their comparison demonstrated that the best results for Rumex classification were obtained by combining LBP with a Support Vector Machine (SVM). Unfortunately, the LBP method relies heavily on texture features and ignores beneficial information such as shape and color. This dependence restricts the method's performance and prevents it from lowering the error rate.
Gao et al. [18] used the Normalized Difference Vegetation Index (NDVI) color index with Random Forest (RF) to classify Rumex and two additional species of weeds, Convolvulus arvensis and Cirsium arvense, from the maize crop. The mean classification rate of Rumex is 69.1%, which is better than that of K-Nearest Neighbor (KNN). However, they depend on a specific number of extracted features from 8 different bands. In addition, their method is costly due to the multispectral camera. Kounalakis et al. [19] proposed Speeded-Up Robust Features (SURF) with an L2-regularized, L2-loss logistic regression model using primal computation (L2regL2lossSVCp) to recognize the Rumex. They captured 100 images of Rumex in a real field using multiple high-resolution cameras. Then, each image was segmented into 9 patches to yield 900 images. Dividing an image into multiple regions adversely affected the quality of the features represented in the image. Moreover, their method is based on hand-crafted features that represent image content. As a result, this method records 89.09% accuracy with a 4.38% False-Positive Rate (FPR).
Zhang et al. [20] utilized a single CNN approach to recognize Rumex obtusifolius in various illumination conditions. They achieved 96.88%. The problem with such a method is that resizing the input image to 64 × 64 pixels causes a loss of useful information that can assist in solving challenging cases. Valente et al. [21] used AlexNet transfer learning to classify Rumex obtusifolius in grassland. They generated high-resolution data using a small Unmanned Aerial Vehicle (UAV). They scored 91.9% accuracy for cases in which the Rumex was moved or cut off. The drawback of such a method is that the images are not tested under various illumination cases. In addition, they captured all their images from the same height of 10 meters. Such images restrict the method from obtaining sufficient information about the leaves and the entire object. This limits the method's ability to identify Rumex in various real-world conditions. Lam et al. [22] used the Visual Geometry Group (VGG) method to classify the early growth of Rumex weeds using a UAV. One of the limitations of their method is that it focuses on limited cases of Rumex found on one field site and ignores other cases such as different growth stages and other real conditions. The results of their proposed method are 92.1% and 78.7% for accuracy and F1-score, respectively. In this work, we utilized the data collected by Kounalakis et al. [23] and Van Evert et al. [1] to estimate the performance of the proposed model. Furthermore, four extracted features comprise visual texture features, spatial context features, spectral features, and biological morphology features. Besides, their study supported the idea that a sophisticated system is a trade-off between accuracy and efficiency. Finally, Kounalakis et al. [23] applied the transfer learning technique to recognize Rumex in grassland.
The significant contribution of this research is the design of a new ensemble model of three CNN architectures to enhance the classification accuracy of Rumex. To the best of our knowledge, the three DL networks were not previously combined into one model for use in the agricultural field, especially with Rumex. In addition, this work improves the classification accuracy for Rumex and reduces the error rate. This improvement allows researchers to generalize the model to various weed plants.

III. PROPOSED METHOD
In this section, the proposed method is discussed thoroughly. Three main stages are identified and proposed to accomplish the classification objectives: the data preparation stage, the pre-processing stage, and the classification stage. These stages are discussed in the following subsections. This section highlights the generic framework of this research and introduces the steps required to implement the research systematically. Ensemble learning is the aggregation of numerous models, such as extractors and classifiers, to tackle a specific computational intelligence issue. The key idea of ensemble learning is to enhance classification accuracy and prediction. The structure of the E-RCNN network consists of two parts. The first part is the feature extractor, and the second part is the classifier network. Each extractor consists of convolutional layers with dropout and max-pooling layers in between. Three CNN models are adapted in their structure to fit the data requirements. This data suffers from illumination variation, overlapping, and occlusion. These three extractors are merged to form a hybrid model. The ensemble model requires an odd number of methods for voting purposes, such as three, five, or seven and upwards. Therefore, determining the number of elements (e.g., methods) in an ensemble is critical [24]. In this case, three selected methods are combined to design our proposed model. Using more than three architectures in one model increases memory usage and reduces efficiency. With more than five networks, the model complexity increases further, which negatively affects the system's performance. Fig. 1 illustrates the mechanism of the ensemble. First, each input is passed through the three extractors to be processed. Each extractor then outputs a predicted label. Hence, the predicted outputs attained from the three extractor backbones are passed to the ensemble model as inputs to vote for one classified label in each process. Each classified object is selected to have a low error rate with a high probability. The formula of the ensemble is given in the equation below.
Given some training data: where D represents the classifier model and n is the number of classes. Inductive learning: Ensemble learning:
The ensemble model yields optimum performance when there is critical diversity in the output results of the composing methods [25]. The first layers of the feature extractor network extract useful features such as color, edges, and curves of the objects in the image. Then, the annotated images were divided in a ratio of 80:20 into a training set and a testing set, respectively. All of this data consists of RGB color images of various sizes. After that, the prepared data is ready to be fed to the proposed model. Generally, DL models require small square images to reduce time and memory constraints. In addition, DL networks require training images of a fixed resolution [26]. Furthermore, data augmentation is utilized to improve generalization [27], [28], [29]. In this regard, image resizing is applied to reduce the input images to the standard size of 224 × 224 × 3 pixels [30], [31], [32]. Furthermore, DL methods necessitate a vast dataset to increase accuracy and limit overfitting.
The outputs of these methods are grouped to produce the final predictions. Each extractor uses the mean subtraction algorithm located in the data loader. This technique centers the data around the mean, where the helpful features exist. The benefit of such a technique is that it reduces the effects of outliers and illumination issues in some cases. Due to the high performance of the three networks in the ImageNet competition [14], they are selected to compose the proposed model.
1. The first extractor model is VGG-16 [33], which is the basis of our hybrid model. It is efficient and accurate [34] in handling well-separated and some partially occluded leaves.
VOLUME 10, 2022
2. The Residual Network-50 (ResNet-50) [35] is the second extractor model, which has a denser convolutional architecture than VGG-16 [33] due to its additional layers. It handles the overfitting problem of VGG-16 and deals with more sophisticated issues such as overlapping and occlusion.
3. The final extractor is the Inception-V3 model [36], which is more robust than VGG-16 and ResNet-50 in overcoming the illumination issue not solved in the pre-processing step.
The E-RCNN utilizes a hybrid model composed of three backbone networks, including VGG-16, ResNet-50, and Inception-V3 for feature extraction and classification. Further explanation can be found in [33] and [35].
A. VGG-16 EXTRACTOR
Fig. 2 illustrates the sixteen layers of the VGG-16 network architecture. Some of these layers include trainable parameters, while others, like the max-pooling layers, do not. The key idea of the VGG group was to investigate how the depth of convolutional networks influences the accuracy of models for large-scale image recognition and classification. All VGG architectures combine several Fully Connected Layers (FCLs) with a varying number of convolutional layers: the deeper the network, the more convolutional layers it has. Fig. 2 shows thirteen blue rectangles representing the hidden layers with their non-linear activation function, the Rectified Linear Unit (ReLU). The five red rectangles represent the max-pooling layers. In addition, two green rectangles represent two FCLs. Therefore, the total number of layers with adjustable parameters is 15, including 13 convolution layers and 2 FCLs. The proposed method fine-tunes the last two layers, the SoftMax layers, to fit our dataset. In this work, the SoftMax function is re-initialized to carry the appropriate number of classes for the samples to decide whether the plant is Rumex or non-Rumex. In this design, VGG-16 starts with a relatively small channel width of 64, which increases by a scale factor after each max-pooling layer until it reaches 512. Fig. 3 shows the flattened architecture of VGG-16.
The structure consists of five blocks. The first two adjacent blocks are composed of pair-convolution layers and then max-pooling. The last three contiguous blocks have three convolution layers followed by max-pooling. Finally, the last three dense layers represent the FCL, or as they are known, the classification layers [22]. The first two FCLs are flattened, consisting of 512 depths, while the last FCL includes 128 depths. The size is reduced by half after every max-pooling. Table 1 displays the VGG-16's overall network configurations.
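The block structure described above can be traced with a few lines of Python. This is a minimal sketch, assuming the standard VGG-16 layout in which the channel width doubles after each block until it is capped at 512; the helper name `vgg16_block_shapes` is hypothetical and the code does not reproduce the modified classification layers of the proposed model.

```python
# Sketch: feature-map size and channel width at the output of each VGG-16 block.
# Assumes the standard layout (2, 2, 3, 3, 3 convolution layers per block) and
# channel doubling capped at 512; names are illustrative only.
def vgg16_block_shapes(input_size=224, base_channels=64, max_channels=512,
                       convs_per_block=(2, 2, 3, 3, 3)):
    shapes = []
    size, channels = input_size, base_channels
    for n_convs in convs_per_block:
        # Padded 3x3 convolutions keep the spatial size; the 2x2 max-pooling
        # with stride 2 that closes the block halves height and width.
        size //= 2
        shapes.append((n_convs, size, channels))
        channels = min(channels * 2, max_channels)
    return shapes


if __name__ == "__main__":
    for n_convs, size, channels in vgg16_block_shapes():
        print(f"{n_convs} conv layers -> {size} x {size} x {channels}")
```

The five blocks together contain the thirteen convolution layers mentioned in the text, and the spatial size shrinks from 224 to 7 across the five pooling steps.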
These are the characteristics of the VGG-16 network:
1. Input Layer: It accepts 224 × 224 color images with three channels as input.
2. Convolution Layers: They are a sequence of layers that the input images are passed through. Every convolution layer uses a small 3 × 3 filter with a stride of 1. Each window (also known as a kernel) uses row and column padding so that the output size matches the input size.
3. Max pooling: It is implemented across a 2 × 2 window with a stride of 2, indicating that the max-pooling windows are non-overlapping.
4. A max-pooling layer does not always follow a convolution layer. In some positions, a convolution layer is followed by another convolution layer instead.
5. The proposed model modifies the original technique by replacing the last three connected layers of the original method with two FCLs to fit the number of our classes. The first FCL has 1 × 1 × 28 neurons. Increasing the number of neurons increases the complexity and processing time of the model without improving accuracy and causes overfitting, while decreasing this number causes underfitting. The second FCL consists of two outputs, 1 × 1 × 2, as there are two classes in our dataset, Rumex and non-Rumex.
6. ReLU is the activation function that is used in the hidden layers.
The window size of 3 × 3 is selected because it is the smallest size that captures all directions of the image: top to bottom and left to right, passing through the center. Furthermore, stacking a pair of 3 × 3 convolutional layers without max-pooling between them yields an effective receptive field of 5 × 5. Similarly, stacking three 3 × 3 convolution layers yields an effective receptive field of 7 × 7.
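The receptive-field figures quoted above follow from a simple recurrence. The sketch below is an illustration under the assumption of stride-1 convolutions with no pooling in between; the function name `stacked_receptive_field` is hypothetical.

```python
# Sketch: effective receptive field of a stack of identical convolution layers.
# Each layer widens the field by (kernel_size - 1) times the accumulated stride.
def stacked_receptive_field(num_layers, kernel_size=3, stride=1):
    field = 1   # a single input pixel
    jump = 1    # distance in input pixels between adjacent outputs
    for _ in range(num_layers):
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    for layers in (1, 2, 3):
        side = stacked_receptive_field(layers)
        print(f"{layers} stacked 3 x 3 layers cover {side} x {side} input pixels")
```

With stride 1, each additional 3 × 3 layer adds two pixels to the side of the field, which gives the 5 × 5 and 7 × 7 values stated in the text.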

B. RESNET-50 EXTRACTOR
There are four stages in the ResNet-50 architecture, as illustrated in Fig. 4. The dimensions of the input image for this network are 224 × 224 × 3. ResNet uses kernel sizes of 7 × 7 and 3 × 3 for the initial convolution and max-pooling, respectively. After that, the first stage begins. The first stage consists of three residual blocks. Each residual block includes three layers. The numbers of filters in the layers of the residual block are 64, 64, and 256, respectively. There are two types of curved arrows. The first type is the solid curved arrow, which denotes an identity connection. The second type is the dashed curved arrow, denoting that the convolution operation in the residual block uses a stride of 2. At this point, the height and width of the input are halved, while the channel width is doubled. In general, the channel width doubles and the spatial size decreases as the input proceeds through the stages. Most deeper networks, such as ResNet-50 and ResNet-101, use a bottleneck design in their architectures.
The benefit of using bottlenecks in such an architecture is that it decreases the number of network parameters while preserving the network's depth. Another advantage is that it helps the network generalize to new data. The bottleneck begins with a small 1 × 1 convolution layer whose number of output channels is smaller than its number of input channels. Each residual function comprises three layers stacked on top of one another. The dimensions of these convolutional layers are 1 × 1, 3 × 3, and 1 × 1. The first and third convolutional layers of 1 × 1 are employed to reduce and then restore the channel dimensions. The second convolution layer, which is 3 × 3, thus acts as the bottleneck with smaller input and output dimensions [37]. In addition, our research fine-tunes the FCL to fit the number of classes in our data, which includes two classes, Rumex and non-Rumex.
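The parameter saving of the bottleneck design can be illustrated with a short count of convolution weights. This is a sketch under stated assumptions: it uses the 64, 64, and 256 filter widths from the first stage with a 256-channel input, ignores biases and batch normalization, and compares against a hypothetical plain block of two 3 × 3 layers at full width. The function names are illustrative.

```python
# Sketch: weight count of a ResNet bottleneck block versus a plain block.
# Biases and batch-normalization parameters are ignored for simplicity.
def conv_weights(kernel, in_channels, out_channels):
    return kernel * kernel * in_channels * out_channels


def bottleneck_weights(in_channels=256, mid_channels=64, out_channels=256):
    return (conv_weights(1, in_channels, mid_channels)      # 1x1 reduces the channels
            + conv_weights(3, mid_channels, mid_channels)   # 3x3 works on the narrow tensor
            + conv_weights(1, mid_channels, out_channels))  # 1x1 restores the channels


def plain_block_weights(channels=256):
    # Hypothetical comparison block: two 3x3 layers at full channel width.
    return 2 * conv_weights(3, channels, channels)


if __name__ == "__main__":
    print("bottleneck block weights:", bottleneck_weights())
    print("plain block weights:     ", plain_block_weights())
```

Under these assumptions the bottleneck block needs roughly one seventeenth of the weights of the plain block while keeping three layers of depth.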

C. INCEPTION-V3 EXTRACTOR
Unlike ResNet-50, the Inception family is a wider-style network rather than a deeper one. In Inception-V3, multiple transformations of the same input map are calculated simultaneously. Then, the results are concatenated into a single output. The previous version of Inception used three layers: a 5 × 5 convolution, a 3 × 3 convolution, and one max-pool. In the later versions, the 5 × 5 convolution layer was replaced with two 3 × 3 convolution layers instead of one large filter. This replacement is called factorization. The benefit of factorizing is that it reduces the number of parameters by 28%, which helps to reduce the computational cost. Generally, the purpose of increasing the depth of any network is to enhance accuracy. However, greater depth causes issues such as vanishing gradients and additional computational cost. To overcome this issue, Inception-V3 introduced an auxiliary unit of a 1 × 1 convolution layer. Using these units is helpful because they address the problem of vanishing gradients and make a more comprehensive network [38]. Fig. 5 shows the effects of adding 1 × 1 convolution on the computational cost of Inception-V3. Szegedy et al. [36] claimed that using a bottleneck in the initial layers causes the loss of useful information from the input layer. In addition, they adopted one of the principles common to all the Inception families: enhancing classification accuracy at a reasonable computational cost by increasing the width and depth in parallel. Inception-V3 differs from the other Inception families in using additional techniques such as factorized 7 × 7 convolutions, label smoothing, and auxiliary units or auxiliary classifiers [36]. Fig. 6 illustrates the Inception-V3 architecture. Our research modified the last two layers to fit our data. The experimental configurations and the parameter details of our Inception-V3 are depicted in Table 2.
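The 28% figure quoted for factorization can be checked with a short calculation. The sketch below assumes that the input and output channel counts are equal, so that only the kernel footprint matters; the function name `factorization_saving` is hypothetical.

```python
# Sketch: relative parameter saving when one large kernel is replaced by a
# stack of smaller kernels, assuming equal input and output channel counts.
def factorization_saving(large_kernel=5, small_kernel=3, num_small=2):
    large_weights = large_kernel ** 2              # 25 weights for one 5x5 kernel
    small_weights = num_small * small_kernel ** 2  # 18 weights for two 3x3 kernels
    return 1 - small_weights / large_weights


if __name__ == "__main__":
    print(f"saving: {factorization_saving():.0%}")
```

Two 3 × 3 kernels use 18 weights where one 5 × 5 kernel uses 25, which is the 28% reduction mentioned above.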

IV. MATERIALS AND METHODS
A. DATASET DESCRIPTION
This subsection describes the data used in this study. Two standard benchmark datasets have been used. The first dataset [dataset 1] is obtained from Kounalakis et al. [23]. It contains 900 images of Rumex weed plants in grassland. The second dataset [dataset 2] is acquired from Van Evert et al. [17]. It contains 677 images of Rumex weed plants in grassland. Both datasets consist of two-dimensional RGB color images stored in the Joint Photographic Experts Group (JPG) format. The images in the datasets have various resolutions. The first dataset was captured using a robotic system on an organic dairy farm in France. In contrast, the second dataset was taken using a Cybershot DSC-60 by Sony, Tokyo, Japan, on two different dairy farms in the Netherlands. The images in dataset 2 are 2304 by 1728 pixels and were taken hand-held at a distance of 1.7 m from the ground. These datasets are considered real-world images. Real-world data is captured under various circumstances such as illumination, overlapping, and occlusion. Due to a lack of information, these cases are classified as sophisticated issues for ML techniques. However, this information could contain significant features that are crucial in identifying the leaf type.
Through the leaf type, it becomes feasible to identify the plant type. The images in this dataset contain either the entire Rumex weed surrounded by grassland or only the leaves of this weed. The RoI in this dataset is the Rumex plant, which must be identified in the grass or among the leaves of a scene. Fig. 7 shows that both Rumex and grass are likely to share the green color, which makes it more difficult for ML to recognize them. However, the grass has a more intense color than Rumex. In the pasture, the grass frequently covers the weeds. Therefore, shape and size are considered apparent features. The Rumex leaf differs from the grass leaf in its shape. Rumex leaves are short and broad, whereas grass leaves are long and narrow, with a width of several millimeters. In terms of texture, Rumex is coarser than grass, which carries valuable information for classification. Van Evert et al. [1] claimed that the detection performance of Rumex improved when the grass was short and the plant was in rosette form.

B. IMAGE RESIZING
The first step in image preparation is image resizing. Several image sizes were used to train our proposed model, starting from 128 × 128, which achieved acceptable performance. Then, we raised the scale to 196 × 196, which improved the performance by 2%. Continuing the rescaling process with 224 × 224, 299 × 299, and 336 × 336 up to 350 × 350, we observed that the model yields its best network performance when the input image is 224 × 224. Finally, we observed that increasing the scale beyond 224 × 224 yields the same performance but at a higher computational cost.

C. DATA AUGMENTATION
After image resizing, data augmentation is implemented to boost the number of training samples [39] and mitigate overfitting [40]. Since CNN methods require vast amounts of annotated data [41], several transformations, such as flip, mirror, and rotate, are implemented to enlarge the training data and introduce various shapes of Rumex. These transformations are randomly applied in each epoch of training and validation. To implement all these transformations in Python, an existing class, namely ImageDataGenerator, has been used. For rotation, each image is rotated 20 degrees clockwise to extend the dataset by 18 times and cover all potential changes in the input image's position. Then, both the horizontal and vertical scales with a range of 0.5 are utilized to enlarge the image. In addition, a cropping adjustment with a range of 15% is applied. Both vertical and horizontal transformations are performed using the flipping operation. The images generated from these transformations are used only during batch training [42], [43]. These transformations are executed temporarily in memory at runtime and are not saved to disk. The three extractor models are incorporated to establish a hybrid backbone for the weed classification model. The hybrid model can handle the overlapping, occlusion, and illumination conditions in real-world images. The details of the hybrid method are discussed in the subsequent sections.
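The augmentation settings described in this subsection could be expressed as keyword arguments for the Keras ImageDataGenerator class. The following is only a sketch: the mapping of the 0.5 scale range to `zoom_range` and of the 15% cropping range to the shift parameters is an assumption, as the text does not state which parameters were set. The names `AUGMENTATION_KWARGS` and `build_train_generator` are hypothetical. Note also that `rotation_range` draws random angles up to the given bound rather than applying fixed 20-degree steps.

```python
# Sketch of an augmentation configuration for Keras' ImageDataGenerator.
# The parameter names are real ImageDataGenerator arguments; the values are an
# assumed reading of the prose, not the authors' published configuration.
AUGMENTATION_KWARGS = {
    "rotation_range": 20,        # random rotation of up to 20 degrees
    "zoom_range": 0.5,           # assumed reading of the 0.5 scale range
    "width_shift_range": 0.15,   # assumed reading of the 15% cropping range
    "height_shift_range": 0.15,
    "horizontal_flip": True,
    "vertical_flip": True,
}

# Rotating in fixed 20-degree steps covers a full turn in 18 orientations,
# which matches the 18-fold extension mentioned in the text.
ROTATION_STEP_DEGREES = 20
ROTATIONS_PER_IMAGE = 360 // ROTATION_STEP_DEGREES


def build_train_generator(kwargs=None):
    """Create the generator lazily so TensorFlow is only needed when called."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    return ImageDataGenerator(**(kwargs or AUGMENTATION_KWARGS))
```

The generator applies the transformations on the fly to each batch, which is consistent with the statement that augmented images are held in memory and never written to disk.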

D. PROPOSED METHOD IMPLEMENTATION
The experiments were implemented on a machine running Windows 10 64-bit as the operating system. The hardware of this machine comprises a 10th-generation Intel Core i7 processor and 32 GB of primary memory. The GPU was an RTX 2070 with 16 GB of memory. Python 3.7 with CUDA 10.1 was used to develop the DL model. PyCharm was employed as the development environment. The required packages were obtained through both Pip and Conda. The proportion of training data to testing data is 80:20. The initial values for the batch size, number of epochs, and learning rate are 32, 10, and 10⁻⁴, respectively. The framework of this approach is shown in Fig. 8.
Based on this figure, the process involves three main stages, namely data preparation, image pre-processing, and image classification, distributed into seven steps. The data preparation stage includes two steps, and the image pre-processing stage involves two steps. The feature extraction and image classification stage consists of two steps. An additional step is introduced to evaluate the classification results of the proposed model using quantitative measures. The function of each step is summarized as follows:
Step 1: Collecting samples from the source to be prepared for the training process.
Step 2: Dividing the dataset into 80% for the training set and 20% for the testing set.
Step 3: In the pre-processing stage, image resizing reduces memory usage and execution time. The proposed model converts all images from their various sizes to a standard size of 224 × 224 pixels.
Step 4: Enlarging the dataset using data augmentation, which also reduces overfitting. Several augmentation operations are used in this step, such as mirroring, flipping, zooming, and rotation.
Step 5: Training the three extractor models with all the annotated training data. Each extractor model extracts features individually, so that each model yields its own classification result.
Step 6: Finally, the outputs of the three extractor models are passed to the ensemble model for majority voting. The ensemble model yields one classified label as the predicted result.
Step 7: Evaluating the results of the proposed method using performance metrics including accuracy, precision, recall, and F1-score.
This research concentrates on the classification stage as the basis for weed detection. In the following sections, these steps are described in more detail. Fig. 9 presents the algorithmic steps of our proposed model.
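The majority-voting step (Step 6) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names (`majority_vote`, `ensemble_predict`), the model keys, and the per-image labels are all made up for the example.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by most of the extractor models."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

def ensemble_predict(per_model_predictions):
    """Combine per-model label lists into one voted label per image.

    per_model_predictions maps a model name to its list of predicted
    labels, one label per image, in the same image order for every model.
    """
    votes_per_image = zip(*per_model_predictions.values())
    return [majority_vote(votes) for votes in votes_per_image]

# Hypothetical predictions of the three extractors on four images.
predictions = {
    "vgg16":        ["rumex", "grass", "rumex", "grass"],
    "resnet50":     ["rumex", "rumex", "grass", "grass"],
    "inception_v3": ["grass", "rumex", "rumex", "grass"],
}

final_labels = ensemble_predict(predictions)
print(final_labels)
```

With three voters and two classes a tie cannot occur, so each image always receives exactly one label, in line with the single predicted label described in Step 6.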

E. RESULTS AND DISCUSSIONS
To test the efficacy of our proposed approach, we compared it with previous competing studies. The effectiveness of the ML and DL methods is tested to measure each method's validity on a designated test problem. Similarly, the proposed method aims to improve classification accuracy. Fig. 10 and Fig. 11 show the training accuracy and loss of the three backbone networks, VGG-16, ResNet-50, and Inception-V3, on two different Rumex datasets, dataset 1 [23] and dataset 2 [17]. VGG-16 is more stable than Inception-V3 and ResNet-50 during training on both datasets. However, the performance of Inception-V3 decreased on both datasets at the final stage of training. The performance of ResNet-50 increases slowly compared with the other backbone networks, and it requires ample time to train due to its dense layers. In terms of error rate (loss), ResNet-50 has a lower error rate than Inception-V3 and VGG-16, whereas VGG-16 has the highest error rate.
For evaluation, well-known metrics such as accuracy, F1-score [44], precision, and recall [45] are employed to assess the effectiveness of the proposed method. In this research, we use quantitative measurements to quantify the robustness of our proposed model. This set of metrics compares the predicted label with the ground-truth label in terms of accuracy [46], precision or positive predictive value (PPV) [21], recall [32], and F1-score [47]. The standard formulas for accuracy, precision, recall, and F1-score are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision (PPV)} = \frac{TP}{TP + FP}$$

$$\text{Recall (Sensitivity, TPR)} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where:
TP: the number of Rumex weeds that are labeled as Rumex in the ground truth and classified as Rumex by the proposed model.
FP: the number of non-Rumex weeds (e.g., grass) that are not labeled as Rumex in the ground truth but are recognized as Rumex by the proposed model.
FN: the number of Rumex weeds that are labeled as Rumex in the ground truth but are not recognized by the proposed model.
TN: the number of non-Rumex samples that are identified as Rumex neither in the ground truth nor by the proposed model.
We then compute the True-Positive Rate (TPR) and True-Negative Rate (TNR) from the confusion matrix to make a fair comparison.
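The metrics defined above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `classification_metrics` and the example counts are assumptions, not values from the experiments. FPR and FNR are included using their standard definitions, FP / (FP + TN) and FN / (FN + TP).

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)   # positive predictive value (PPV)
    recall = safe_div(tp, tp + fn)      # sensitivity, true-positive rate
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "tnr": safe_div(tn, tn + fp),   # true-negative rate
        "fpr": safe_div(fp, fp + tn),   # false-positive rate = 1 - TNR
        "fnr": safe_div(fn, fn + tp),   # false-negative rate = 1 - TPR
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting FPR and FNR next to accuracy is useful when the classes are imbalanced, because accuracy alone can stay high while one of the two error rates is large.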

F. COMPARISON TO PREVIOUS RUMEX CLASSIFICATION APPROACHES
Several hand-crafted and self-learning classification methods are compared with the proposed model. These methods were applied to classify Rumex weed plants from grass using real-world data. Table 3 presents the classification results of these methods using the standard evaluation metrics. These metrics are applied to verify that each true positive pixel in the Domain of Interest (DoI) has been precisely classified. For a fair comparison, all representative methods were applied to the same test data. The results in this table show that the Scale Invariant Feature Transform (SIFT) feature-based system [15] has the lowest accuracy, precision, and F1-score of all the tested techniques due to its high False-Positive Rate (FPR), which means that it fails to identify non-Rumex weeds correctly. Later, the SURF feature-based system [48] was proposed to overcome the shortcomings of the previous method by lowering the FPR and False-Negative Rate (FNR), but it remains sensitive to false positives. The problem with those methods is that they use vectors to extract features, which are inefficient in identifying negative objects due to the occlusion issue with Rumex weeds. Sünderhauf et al. [49] proposed Overfeat CNN for feature extraction with Extreme RF for classification. This method improves the system's recognition capability by reducing the FPR to a low rate. However, it registered the highest FNR of all competing methods when classifying true positive (TP) plants. Reyes et al. [50] used fine-tuned AlexNet [51] for weed recognition.
From the results in Table 3, our proposed model achieved accurate classification results compared with the competing methods. As shown in Table 3 and Table 4, our method achieved 97.02% accuracy on dataset 1 and 98% accuracy on dataset 2.
Although AlexNet fine-tuning [50] and Overfeat with ExtremeTrees [49] achieved somewhat higher accuracy than our proposed model (by 1.36% and 1.90%, respectively), their results are relatively poor in terms of precision, recall, and F1-score. On these metrics, the proposed model records higher rates, with differences of 81.35%, 27.41%, and 71.56% on precision, recall, and F1-score, respectively. The shortcomings of the compared methods stem from focusing on well-separated leaves or plants in the scene without addressing the challenging scenarios. Specifically, these methods are frequently limited by insufficient learning to classify occluded and overlapped cases [52]. Furthermore, some methods depend on specific features such as shape or texture, which are not adequate to recognize the type of object [53], [54]. Overall, the empirical results demonstrate that, across the evaluation metrics, our proposed model provides a higher baseline accuracy than existing methods. Due to the diverse architectural designs of the networks in our model, different features are extracted. These features play a crucial role in handling our complex scenario. The finding of this work is that a hybrid model produces a higher baseline accuracy against occlusion than a single method. In some occlusion cases, however, our model showed low performance, especially when the same scene contains multiple occluded Rumex plants or when the imagery is of low resolution. Another challenge arises when the scene contains only part of the Rumex leaves at the image boundary, where most features are absent.
Both Kounalakis et al. [23], in their Inception-V1 with L2-regularized, L2-loss logistic regression (L2regLogReg), and Reyes et al. [50], in their AlexNet method, used the same training parameters, setting the learning rate of their classifiers to 10. Reyes et al. [50] reduced the FPR to raise the recall ratio. At the same time, the FNR of this method is not low enough to recognize the true positive pixels reliably, due to overfitting. The evaluation metrics of the same method are therefore unbalanced: the accuracy of most compared methods is high, while their F1-scores are low. In contrast, the proposed method achieves stable performance across the same standard metrics. The first evaluation was applied to the data of [23]. In their work, Kounalakis et al. [23] demonstrated that Inception-V1 with L2regLogReg achieved the highest accuracy compared with other representative methods. We computed the F1-measure of the Inception-V1+L2regLogReg approach and the other compared methods so that they could be evaluated against our proposed model. Based on the analysis, we found that some metrics, such as recall, are highly sensitive to true negative pixels due to the large disparity between the numbers of positive and negative pixels. Therefore, a second experiment was applied to the second set of Rumex data [17], as presented in Table 4.
Van Evert et al. [17] used 2-D Fourier analysis on their generated data, as listed in the above table. This method achieved acceptable accuracy scores of 82%-89%, but it was not tested on other evaluation metrics. In addition, their method was not compared with other methods. In this work, we implemented several methods to measure the performance of both our work and the 2-D Fourier analysis method. The comparative results of the competing approaches in terms of FPR and FNR on dataset 1 are shown in Fig. 12. According to this figure, the SIFT feature-based system has the highest FPR, while the Overfeat-ExtremeTrees technique has the highest FNR. The proposed model, on the other hand, attains the lowest FPR and FNR, recording 0.02% on both measures. This implies that the application does not waste time processing non-existent Rumex or misclassifying actual Rumex. The comparative results of the competing approaches in terms of FPR and FNR on dataset 2 are shown in Fig. 13. This figure illustrates that the SVM using a polynomial kernel function scored the highest FPR and FNR of all the competing methods, while its classification accuracy is similar to that of 2-D Fourier analysis. We also applied an SVM with a Gaussian Radial Basis Function (RBF) to the same test data, which achieved the highest accuracy, precision, recall, and F1-score of all the competing methods. However, the FPR and FNR of the SVM_RBF are high due to poor images and occlusion, making it inappropriate for use in a robust detection model. Nevertheless, the results show that the proposed model outperforms the best competing methods by 4%, 1%, 46%, and 28% on accuracy, precision, recall, and F1-measure, respectively. Furthermore, its FPR and FNR are small enough to identify the RoI effectively and to avoid misclassifying unwanted regions.
This performance suggests that the approach is promising as the basis for a new detection model. Table 5 details the comparison results of our proposed method on the two datasets. The accuracy is lower on dataset 1, with dataset 2 scoring 0.98% higher. Fig. 14 and Fig. 15 depict the confusion matrices of dataset 1 and dataset 2, respectively. The preliminary results of the proposed model on precision and F1-score are higher on dataset 1 than on dataset 2. In contrast, the proposed model is higher in accuracy and recall on dataset 2. Thus, performance is balanced across these two datasets, which could be merged into one dataset. In addition, our method achieved high accuracy (97%-98%), which is higher than that of the individual methods for classifying Rumex in different conditions. However, it does not concentrate on a single issue such as illumination, as in the work of Zhang et al. [20].
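The accuracy figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the two accuracies reported earlier in this section (97.02% on dataset 1 and 98% on dataset 2); the variable names are illustrative.

```python
# Accuracies of the proposed model as reported in the text (percent).
accuracy = {"dataset_1": 97.02, "dataset_2": 98.00}

# Gap between the two datasets, rounded to two decimals.
difference = round(accuracy["dataset_2"] - accuracy["dataset_1"], 2)

# Mean accuracy over both datasets, rounded to two decimals.
mean_accuracy = round(sum(accuracy.values()) / len(accuracy), 2)

print(f"difference: {difference:.2f}%")
print(f"mean accuracy: {mean_accuracy:.2f}%")
```

The difference evaluates to 0.98 and the mean to 97.51, which agrees with the 0.98% gap stated above and with the average accuracy reported in the Conclusion.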

V. CONCLUSION
Rumex is a vital weed plant that has a substantial effect on dairy yield and production. Real-world conditions such as illumination, overlapping, and occlusion reduce the accuracy of the classification model. These issues are considered challenging tasks in computer vision. Most previous works focused on weed classification under controlled conditions, whereas weeds actually grow in grassland under the real-world conditions of farms. In this study, a new hybrid CNN model with three different extractors at the backbone is proposed to improve classification accuracy on real-world data. Unlike a single approach, combining three different networks into one ensemble model increases the ability to extract deeper (i.e., additional beneficial) features due to the variety of architectural designs of the networks. In addition, each extractor addresses one or more challenging real-world issues, so that the shortcomings of each network are compensated by the two remaining networks. Experimental results show that using different extractor networks reduced the FPR and FNR to a low level. Compared with other recent models, this reduction helps the model generalize to unseen fields when classifying Rumex in real-world conditions. This work uses two standard benchmark datasets of images captured under real-world conditions. Images in these two datasets are captured in the challenging conditions of a real farm, such as various illumination, occlusion, and overlapping conditions. In addition, each image includes single or multiple leaves or entire Rumex weed plants. The proposed approach has been compared with different methods and evaluated on the same datasets. The results show that the proposed approach produces better results than the competing methods. The overall averages of this approach on both datasets are 97.51%, 94.41%, 97.44%, and 95.93% for accuracy, precision, recall, and F1-score, respectively.
This work contributes pivotal knowledge to the computer-vision community by improving the classification of Rumex in real-world conditions through a combination of three different classifiers. For the agricultural community, this research can be implemented in a weed management system or an automated weed spraying system. It assists the farmer in alleviating labor-intensive costs, reducing time-consuming tasks, preventing herbicide pollution in the environment, and controlling weed separation.

VI. LIMITATION AND FUTURE DIRECTIONS
Real-world data is a challenging issue for computer vision approaches. The limitation of this data is the lack of useful information in the occluded and overlapped regions. An increase in these regions adversely affects the classification accuracy. However, using sufficient samples to train the DL model raises its potential for feature extraction and classification. For future work, the two datasets can be combined to increase the number of samples, especially those showing the entire Rumex plant in the grass, which are limited in number. In addition, this work can be expanded to produce a new detection model focusing on Rumex weed plants using You Only Look Once (YOLO) detection and Mask Region-Convolutional Neural Network (R-CNN). Moreover, we will investigate why other networks achieve high performance on their related tasks but not on our data. Finally, we plan to apply our proposed model to classify the diseases and lesions of Rumex or other weed species.