A Deep Learning-Based Recognition Technique for Plant Leaf Classification

In the practice of plant classification, the design of hand-crafted features depends heavily on the ability of computer vision experts to encode morphological characters predefined by botanists. However, the distinct features that each plant exhibits through its leaves can be learned automatically thanks to the end-to-end nature of Deep Learning algorithms. Deep Learning-based plant leaf recognition is therefore an important approach today. In this article, we combine three technologies to achieve a high-accuracy model for plant classification: a conditional Generative Adversarial Network is used to generate synthetic data, a Convolutional Neural Network is used for feature extraction, and the rich extracted features are fed into a Logistic Regression classifier for efficient classification of the plant species. The effectiveness of this method is demonstrated on a wealth of plant datasets; the paper reports results on seven datasets with different modalities. Combining Deep Learning and Logistic Regression, we classify plants from their leaf images with accuracies averaging 96.1% across the datasets used, and reaching 99.0% to 100% on some individual datasets. Extensive experiments on each of the datasets demonstrate the superiority of our method compared with others and are highlighted in our results.


I. INTRODUCTION
ACCORDING to the State of the World's Plants report [1], there are about 390,000 plant species known to modern science. This makes it difficult for a botanist or other expert to identify and classify all these species, and all but impossible for non-experts. Moreover, some plant species are highly similar, and their differences are so subtle that a fine-grained classification approach is needed. The leaves of a plant contain a considerable amount of information about the species, and much can be learned about a plant from its leaves alone. Plants have served a variety of human needs since antiquity, ranging from food to medicine to shelter, and will continue to do so.
In addition, owing to global warming and other factors, many plants today face the threat of extinction. Non-endangered as well as endangered plant species need to be preserved and conserved adequately to counter this risk; hence, there is a need to develop an automated system to identify and classify plants efficiently, and the need for adequate knowledge and management of plant species cannot be overemphasized. Traditional manual identification methods are based on plant morphology; though they have had some success, they are subjective, costly in terms of manpower, and inefficient in the long run. Fig. 1 shows the general leaf features used in the classification of plant species. Plant leaves are two-dimensional in nature, which makes it possible to identify them automatically using image-processing techniques. In recent times, Deep Learning, and in particular Convolutional Neural Networks (CNNs), has proven well suited to computer vision problems, of which plant classification can be considered one. Deep Learning eliminates the need for the domain expertise and hand-crafted feature extraction that only expert botanists can provide; instead, a series of consecutive convolution operations between the input image and convolutional filters forms the feature maps and extracts discriminative patterns from individual plant leaves. Several image recognition methods and applications based on Deep Learning are readily accessible, and they continue to attract a lot of attention. Some of these methods can be applied to botany and agriculture to tackle problems computationally. Table 1 below shows the general leaf features that are widely studied in classification.
Features such as shape, color, and texture are the generic distinguishing features that separate plant leaves, and other detailed sub-features can be grouped within these three, as shown in Table 1. This paper proposes a method based on an ensemble of Deep Learning techniques for efficient plant classification applied to plant leaf images.
The main contributions of this paper are summarized as follows:
• Utilizing a blend of three technologies to improve the accuracy of plant species classification via plant leaf images. This is novel in the area of botany and plant classification and reliably produces state-of-the-art results, which can be compared with previous results in this field; Tables 13-15 display such comparisons.
• The utilization of a conditional Generative Adversarial Network to tackle the lack of sufficient training data and the uneven class balance found within datasets used for Deep Learning tasks. This serves to augment leaf image datasets, which have not been large enough, as the field still lacks large datasets for adequately training Deep Neural Networks for better generalization.
• The augmented leaf datasets produced more than a 3.0% increase in accuracy, a notable improvement.
The rest of the paper is organized as follows: Section II reviews previous literature related to plant classification; Section III details the datasets we used and describes our approach and experiment parameters; Section IV presents the experiments carried out and the parameters used; Section V discusses the results and provides a comparison with conventional solutions. Finally, conclusions are drawn in Section VI.

II. RELATED WORK
Many methods are used for classification in Machine Learning. Training a computer to classify objects adequately and correctly is a huge task that artificial intelligence practitioners have worked on for many years, and plant classification is no exception.
Variations in leaf characteristics are preferably employed in automated plant identification systems using computer vision methods, because leaves are easily observable, accessible, and describable compared to other plant organs. Kadir et al. [2], Cope et al. [3], and Ahmed et al. [4] give comprehensive surveys of methods for automated plant identification. However, plant identification is still considered a challenging and unsolved problem, since classical computer vision employs hand-crafted methods that depend on chosen natural features, which struggle with the extreme diversity of botanical data. For example, Jin et al. [5] employed a classical image-processing chain: image binarization to separate the leaf from the background, detection of contours and contour corners, and geometrical derivation of special leaf tooth features. This approach was evaluated on eight different species of plants, and Jin et al. [5] reported species-specific identification accuracies ranging between 72.8% and 79.3%. But this approach alone obviously cannot deal with species that do not show significant leaf teeth, and the accuracy is still far from optimal.
Many hand-crafted features can be extracted from plant leaf images, most of which describe shape, texture, or venation. Most previous studies depended on shape recognition techniques to model and represent the contour of the leaf, owing to the diversity of leaf shapes. Neto et al. [6] used Elliptic Fourier and discriminant analyses to show the distinction between plant species based on their unique leaf shapes. The work in [7] proposed two shape modeling approaches based on the invariant-moments and centroid-radii models. Du et al. [8] went on to propose a technique that combines geometrical and invariant-moment features to extract the morphological structures of leaves.
Shape Context (SC) and Histogram of Oriented Gradients (HOG) have also been useful in the attempt to create an efficient leaf shape descriptor [9], [10]. Recently, Aakif and Khan [11] proposed using different shape-based features such as morphological characters, Fourier descriptors, and a newly designed Shape-Defining Feature (SDF). Although the algorithm showed its effectiveness on baseline leaf datasets like Flavia [12], the SDF is highly dependent on the segmentation of the leaf images and does not produce optimal performance. Hall et al. [13] proposed using Hand-Crafted Shape (HCS) features and the Histogram of Curvature over Scale (HoCS) [14] to analyze leaves. Larese et al. [15] recognized legume varieties based on leaf venation: they first segmented the vein pattern using the unconstrained hit-or-miss transform (UHMT) and then used LEAF GUI measures to extract a set of features for veins and areoles, but the approach was computationally expensive. Wilf et al. [16] applied their approach to the vein features of leaves and report a classification accuracy of 72.14% for 19 leaf families. However, their images are taken from cleared specimens that are laboriously prepared, and the method is still limited to a restricted class of features. Kumar et al. [14] proposed a mobile app, called LeafSnap, to enable users to identify trees from leaf photographs. It achieves a top-1 recognition rate of about 73% for 184 tree species; a higher accuracy could still be achieved.
The final step towards model-free approaches to plant identification is to get rid of hand-crafted features altogether. In recent years, deep convolutional neural networks (CNNs) have achieved a significant breakthrough in computer vision, especially in visual object categorization (Krizhevsky et al. [17]), due to the rise of efficient general-purpose computing on graphics processing units (GPGPU, which provides high degrees of parallelization) and the availability of large-scale image data (in publicly available datasets, on the internet, in social media, in specialized social networks, etc.) that provide the data volume necessary for training deep CNNs with millions of parameters.
An essential advantage of deep CNNs is the automatic learning of task-specific representations of the input data which replace traditional feature-based representations using hand-crafted features but there's still the challenge of obtaining optimum accuracy in classification using CNNs.
Lee et al. [18] presented a CNN approach to taxon identification based on leaf images and reported an average accuracy of 99.7% on a dataset covering 44 species. Zhang et al. [19] used a CNN to classify the Flavia dataset and obtained an accuracy of 94.69%. Goëau et al. [20] report on the plant identification task of PlantCLEF 2016, organized within the ImageCLEF initiative and dedicated to the system-oriented evaluation of visual-based plant identification; there is still a need for higher classification accuracy.
Another study [21] applied deep learning to plant identification using vein morphological patterns. The authors first extracted the vein patterns using the UHMT and then trained a CNN to recognize them using a central patch of the leaf images; yet accuracy and complexity remain problems to combat, and making the model generalize well is still a great need. Follow-up studies should therefore focus on improving identification accuracy. Deep learning methods also have some limitations. If the number of available images is smaller than what state-of-the-art models such as GoogLeNet [22] and AlexNet [17] require, training the network on only a handful of leaf images becomes challenging, because CNNs trained on a small dataset are prone to overfitting. To overcome this problem, the amount of training data should be increased, but providing a large amount of training data is very difficult and costly. Hence, in this work we employed conditional Generative Adversarial Networks to generate synthetic data and augment the datasets. Additionally, we used a state-of-the-art CNN model to perform feature extraction, since it can extract rich features from images, and then applied a reliable and efficient classifier to classify the plants given the extracted features. Results show strong performance on the various evaluation metrics that we used.

III. PROPOSED METHOD
Deep learning, a special class of machine learning algorithms, is of high importance in the field of Intelligent Computing. The Generative Adversarial Network (GAN) was first proposed by Goodfellow et al. [23] and has had tremendous success in the field of Deep Learning. Similarly, a deep Convolutional Neural Network (CNN) can progressively extract higher-level features from input images through its multiple layers. The Residual Network [24] architecture has proved reliable since its inception, winning first place in the ILSVRC 2015 classification task, and its roughly 30,000 citations attest to its contribution and effectiveness in the field of Deep Learning. However, applying it alone does not give the optimal classification accuracy we desire for plant classification from leaf images, hence our ensemble with other Deep Learning methods. For our research, we utilized a 50-layer Residual Network pre-trained on the ImageNet dataset and applied it to extract features from the leaf images in our datasets. Figure 7 shows the flowchart of the overall framework of our model. It consists of four main steps, as shown in the figure: image preprocessing, data augmentation, feature extraction, and image classification. First, the leaf images were preprocessed and fed into the conditional Generative Adversarial Network, which serves as a means of extra data augmentation so that the Deep Residual Network has more data. Afterward, the Deep Residual Network, pre-trained on a large dataset, was applied to learn the features of the leaf images, serving as the feature extractor that retrieves important information from the leaves. Lastly, the extracted features were used to train a Logistic Regression classifier.
The preprocessing was done by resizing the images to a uniform size of 256 × 256. Furthermore, each leaf image was paired with its corresponding patch to form a single image, which served as a ready input for the cGAN model.
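As an illustration, the resizing and pairing steps can be sketched in plain Python. This is only a sketch: the interpolation method (nearest-neighbour here) and the side-by-side pairing layout are assumptions, since the paper does not specify either, and the function names are hypothetical.

```python
# Sketch of the preprocessing stage: resize each single-channel image to a
# uniform grid, then pair the leaf with its patch as one conditioned input.
# Nearest-neighbour interpolation and side-by-side pairing are assumptions.

def resize_nearest(img, out_h=256, out_w=256):
    """img: 2-D list of pixel values for one channel."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def pair_images(leaf, patch):
    """Concatenate leaf and patch side by side into a single image."""
    return [row_a + row_b for row_a, row_b in zip(leaf, patch)]

small = [[0, 1],
         [2, 3]]
big = resize_nearest(small, 4, 4)   # each pixel repeated into a 2x2 block
paired = pair_images(small, small)  # [[0, 1, 0, 1], [2, 3, 2, 3]]
```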

A. cGAN-BASED METHOD OF BALANCING AND AUGMENTATION OF THE TRAINING DATA
Balancing data in deep learning applications such as plant species classification is a critical area that has been studied and surveyed over the years, driven by the need for a robust classification algorithm that generalizes well. Essential considerations include cost-effectiveness, user-friendliness, accuracy, and sensitivity. Data augmentation is the process of generating additional training data from the available existing data [25]. Typically, this is done by applying annotation-preserving transformations to the input data, such as random rotation, deformation, translation, scaling, vertical and horizontal flipping, and random zooming. Through its random nature, data augmentation can potentially generate an 'infinite' amount of training data from the existing data. Model performance on tasks such as classification, detection, and recognition of plant species can be improved by data augmentation, which mitigates inadequate data and imbalanced class distributions. In this research, we employed conditional Generative Adversarial Networks (cGANs) to perform data augmentation. The Generative Adversarial Network (GAN) proposed by Goodfellow et al. [23] is a deep learning architecture for generating images, composed of a generator and a discriminator. The generator generates fake images from input noise. The fake images created by the generator are given to the discriminator, which performs a binary classification to discern the genuineness of the generated images. The discriminator is trained to maximize the probability of correctly discerning real images from fake ones. At the same time, the generator is trained to minimize log(1 − D(G(z))) in Equation (1) [23].
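The classical annotation-preserving transformations listed above can be illustrated on a toy 2-D image with stdlib Python only (random zooming and deformation are omitted for brevity; this is an illustrative sketch, not the augmentation pipeline used in the paper):

```python
# Simple annotation-preserving augmentations on a 2-D list "image":
# horizontal flip, vertical flip, and a 90-degree clockwise rotation.

def flip_horizontal(img):
    return [row[::-1] for row in img]

def flip_vertical(img):
    return img[::-1]

def rotate_90(img):
    # Clockwise rotation: reverse the row order, then transpose.
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
```

Each transform preserves the class label of the leaf, which is what makes these operations safe to apply blindly during training.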
Through this competitive and repetitive training process, the generator learns to generate fake images that resemble the real originals. However, because it generates images from the input noise z alone, it is difficult to control the images created by the generator, and it struggles to generate high-resolution images. To address this problem, Mirza and Osindero [26] introduced conditional Generative Adversarial Networks (cGANs), which extend GANs into a conditional model. This conditional form of GAN enables controllable image synthesis, allowing a user to synthesize images given various conditional inputs such as class labels, user sketches, or textual descriptions. In cGANs, the generator G and the discriminator D are conditioned on some extra information c, which is given as an additional input to both G and D. cGANs thus provide control over which kind of data is generated, unlike vanilla GANs, which makes them popular for image synthesis and image editing applications. The structure of cGANs is illustrated in Fig. 2. The objective function, which is the same as that of Pix2Pix [27], is represented by L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))]. The cGAN model consists of a generator network that synthesizes leaf images and a discriminator network that distinguishes whether an image is an actual leaf image or one produced by the generator network. Their architectures are described below:

1) The Generator
The generator network is based on U-Net [28] and consists of 15 layers with skip connections. Skip connections are placed between layer i and layer n − i, where n is the total number of layers in the encoder-decoder structure, to avoid losing the low-level information that progressive down-sampling would otherwise discard, and to ease the training of very deep convolutional networks, which can suffer from the vanishing gradient problem. This method is effective in preserving the low-level information of the input in the output of the generator.
Further, the generator can synthesize more realistic leaf images from the learned features. When the generator is trained, the L1 distance between the ground truth image and the generated image, L_L1(G) = E_{x,y,z}[||y − G(x, z)||_1] (Equation (3)), is reflected in the objective function, G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G) (Equation (4)) [29]. Thereby, low-level information is strengthened so that clearer leaf images can be generated than with the conventional cGAN [27]. Moreover, as a means of preventing blurring in the images created by the generator, the PatchGAN concept was applied to the discriminator; this method classifies the discriminator's input images in N × N patches [27]. The generator module receives conditioned leaf images of size 256 × 256 × 3 (height × width × channels) as input, and feature maps are calculated with a 5 × 5 filter in the 1st to 8th convolutional layers of the encoder. This design avoids the need for a separate pooling layer such as max pooling: padding and a stride of 2 × 2 are used to reduce the size of the feature maps. The feature map, reduced to 1 × 1 × 512 by the 8th convolutional layer, is up-sampled through the 1st to 8th transposed convolutional layers of the decoder. To avoid losing low-level information, skip connections were added between layer i and layer n − i, where n is the total number of layers in the generator model. Figure 19 details the generator architecture of the cGAN model.
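The down-sampling schedule and the skip-connection pairing described above can be checked with a few lines of Python. This is a sketch: the layer indexing is illustrative, and the helper names are hypothetical.

```python
# Spatial sizes through the encoder: eight stride-2 convolutions reduce the
# 256x256 input to 1x1, and encoder layer i is skip-connected to decoder
# layer n - i, n being the total number of layers in the generator.

def encoder_sizes(input_size=256, n_layers=8, stride=2):
    sizes = [input_size]
    for _ in range(n_layers):
        sizes.append(sizes[-1] // stride)
    return sizes

def skip_partner(i, n):
    """Decoder layer paired with encoder layer i in an n-layer generator."""
    return n - i

sizes = encoder_sizes()   # [256, 128, 64, 32, 16, 8, 4, 2, 1]
```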

2) The Discriminator
The discriminator network is a Convolutional Neural Network consisting of four layers. It divides the input image into small patches to determine whether the input image is a real leaf image or a generated one, which pushes the generator to represent the details of the leaf images better. Figure 20 shows the architecture of the discriminator module of our cGAN model, used to discern whether the leaf images created by our generator module are genuine or fake. The input to the discriminator is either a pair consisting of an image created by the generator and an image carrying the extra information (a fake pair), or a pair consisting of a geometric center image and an image carrying the extra information (a real pair), as shown in Equation (2). The input is reduced to a 32 × 32 × 512 feature map by the four convolutional layers. This reduced feature map is mapped to the final real-or-fake value via the linear transformation and sigmoid function of the fully connected layer.

3) The Combined cGAN Model
As training progressed, the two networks operated adversarially: the discriminator distinguishes the images more and more accurately, which drives the generator to synthesize more and more realistic leaf-like images. The loss function is also important for synthesizing realistic images. We used cross-entropy for the adversarial loss of the generator and discriminator networks, and L1 distance for the content loss of the generator. The generator learns a mapping from the source image x and a random noise vector z to the target image y, i.e., {x, z} → y. The discriminator predicts whether y given x is real or fake. Figure 3 shows the block diagram of the cGAN [23] model we utilized for data augmentation. The results are shown more clearly in Section V of this paper.
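To make the adversarial objective of Equation (1) concrete, the toy snippet below evaluates the two losses on hypothetical discriminator outputs. The probability values are made up for illustration; real training would use the networks' actual outputs over batches.

```python
import math

# Toy evaluation of the adversarial objective: the discriminator maximises
# log D(real) + log(1 - D(fake)); the generator minimises log(1 - D(fake)).
# Both are written here as quantities to be minimised.

def discriminator_loss(d_real, d_fake):
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    return math.log(1.0 - d_fake)

def l1_content_loss(y_true, y_gen):
    # Mean L1 distance, used as the generator's content loss.
    return sum(abs(a - b) for a, b in zip(y_true, y_gen)) / len(y_true)

# A discriminator that is right with high confidence has a lower loss:
assert discriminator_loss(0.9, 0.1) < discriminator_loss(0.6, 0.4)
# A generator that fools the discriminator (high D(fake)) has a lower loss:
assert generator_loss(0.9) < generator_loss(0.1)
```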

B. FEATURE EXTRACTION USING A DEEP LEARNING MODEL AND CLASSIFICATION
Convolutional Neural Networks are now the state-of-the-art method for many tasks including classification, detection, localization, and segmentation [30]. Their architectures are most commonly applied to image analysis and other problems where shift-invariance or covariance is needed. Inspired by the fact that an object can be shifted within an image and still be the same object, CNNs adopt convolutional kernels for the layer-wise affine transformation to capture this translational invariance. A 2D convolutional kernel w applied to 2D image data x can be defined as (w ∗ x)(i, j) = Σ_m Σ_n w(m, n) x(i + m, j + n), where the summation runs over all positions of the kernel. An important variant of the CNN is the residual network (ResNet), which incorporates skip connections between layers. These modifications have shown great advantages in practice, aiding the optimization of these typically huge models.
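The skip connection in a residual block can be illustrated in a few lines of Python: an elementwise sketch on flat vectors, not the actual layer implementation.

```python
# The identity shortcut of a residual block: the block's output is the
# transformed features F(x) plus the unchanged input x, added elementwise.

def residual_add(fx, x):
    return [a + b for a, b in zip(fx, x)]

# If the residual branch outputs zeros, the whole block reduces to the
# identity, which is part of what makes very deep networks trainable:
identity_case = residual_add([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
```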

1) Convolutional Layer:
This layer performs the convolution operations that learn features from the images. A convolutional kernel slides along the image with a certain stride and outputs the convolution plus a bias. The input to this layer can be either an RGB image or the output feature map of another layer. With a convolutional kernel, each pixel of the output image is a weighted average of the pixels in a small area of the input image, with the weights defined by the kernel; the weights are shared across positions to reduce the number of parameters in the model, and all of them can be learned. This process can be expressed mathematically in the l-th layer as x_j^l = f(Σ_i x_i^{l−1} ∗ w_{ij}^l + b_j^l), where f(·) is the activation function. Higher-level, more distinctive features can be identified through a series of stacked convolution layers, hence the need to go deeper.
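The sliding-kernel computation just described can be sketched in stdlib-only Python as a "valid" cross-correlation, the convention most deep learning frameworks call convolution:

```python
# A "valid" 2-D convolution (cross-correlation) on plain nested lists:
# each output pixel is the kernel-weighted sum over a window plus a bias.

def conv2d(image, kernel, bias=0.0):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(kernel[m][n] * image[i + m][j + n]
                    for m in range(kh) for n in range(kw))
            row.append(s + bias)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]   # sums each pixel with its lower-right neighbour
```

The same kernel (the same weights) is applied at every position, which is the weight sharing mentioned above.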

2) Pooling Layer:
After convolution, we reduce the spatial dimensionality of the feature maps for further processing. This process is known as down-sampling, or a pooling operation, and can be described as x_j^l = f(β_j^l · down(x_j^{l−1}) + d_j^l), where x_j^l represents the j-th feature map in the l-th layer, β_j^l and d_j^l are the multiplicative factor and bias respectively, and down(·) represents a down-sampling function. Down-sampling can take many forms, including average pooling, maximal pooling (max pooling), and minimal pooling. In our work, we employed max pooling.

3) Fully Connected Layer:
Each neuron in the fully connected layer is connected to all neurons in the feature map of the previous layer, and the output can be expressed as h_{W,b}(x) = f(Wx + b), where h_{W,b}(x) is the output, W represents the corresponding weights, and b the bias. The inputs to the fully connected layer are the many features extracted by the previous layers; each feature represents different semantic information that is unique and important for further processing.

4) Loss Function:
The loss function measures the disparity between the predicted output and the desired output. The network uses categorical cross-entropy to calculate the loss, given as L = −Σ_i y_i log(ŷ_i), where y and ŷ represent the actual and the predicted outputs respectively.
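The cross-entropy formula above can be evaluated directly in Python on a one-hot target (the probability vectors below are made up for illustration):

```python
import math

# Categorical cross-entropy L = -sum_i y_i * log(yhat_i) for a one-hot
# target y and a predicted probability vector yhat.

def categorical_cross_entropy(y_true, y_pred):
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0, 1, 0]                 # true class is index 1
confident = [0.05, 0.90, 0.05]     # loss is low for a confident correct guess
unsure = [0.30, 0.40, 0.30]        # loss is higher for an unsure prediction
```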

5) Activation Layer:
Non-linearity has proven to be an integral part of CNNs and makes them more powerful. A CNN must be able to take any input from −∞ to +∞ and map it to an output that ranges within {0, 1}, or {−1, 1} in some instances; hence the need for an activation function. Non-linearity in activation functions is needed to produce a non-linear decision boundary via non-linear combinations of the weights and inputs in Deep Learning architectures. Classification can be done either by a binary classifier or a multi-class classifier: for binary classification, the sigmoid activation is used, whereas for multi-class classification, the softmax function is widely used. The softmax activation is applied in the last dense layer to calculate the probability of each predicted class, and the class with the highest probability is chosen as the output. The softmax function is given by softmax(z)_i = e^{z_i} / Σ_j e^{z_j}.
The Deep Residual Network we used is 50 layers deep, with a small receptive field of 7 × 7 in the input layer followed by a max-pooling layer with a 3 × 3 kernel. Figure 4 shows the additional identity mapping, Figure 5 the Deep Residual Network we used for feature extraction, and Table 3 the parameters of the network architecture. Figure 6 depicts a block diagram of the model used for feature extraction and classification.
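The softmax function can be implemented in a few lines; subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not part of the formula itself:

```python
import math

# Softmax over a vector of logits z: each output is e^{z_i} divided by the
# sum of exponentials, yielding a probability distribution over classes.

def softmax(z):
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```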

C. LOGISTIC REGRESSION MODEL FOR CLASSIFICATION
Like all regression analyses, Logistic Regression (LR) is a predictive analysis method; unlike quantitative regression techniques, however, it predicts qualitative (categorical) outcomes. There are two types of logistic regression: binary and multinomial. Binary logistic regression is applied to variables that are divided into two subgroups, resulting in only 0 or 1; multinomial logistic regression predicts outcomes in more than two groups and can hence solve more complex problems. A multinomial LR system needs to find the relationship between the input data and the output in order to create a model and generate weights for classification.
The logistic regression model is given by h_θ(x) = 1 / (1 + e^{−θ^T x}), where h_θ(x) = P(y = 1 | x; θ) represents the probability that the output variable y = 1.
Logistic regression is also known as logit regression or the logistic model. It takes in independent features and returns a categorical output; the probability of a categorical outcome is obtained by fitting the features to the logistic curve. In our work, the features extracted by the Deep Residual Network were the input to our LR model, which classified them into the respective plant categories with very high accuracy. This is shown in Table 12, and a comparison with other models is displayed in Tables 13-15.
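The binary form of the model can be sketched directly from the formula above. The weight and feature values below are hypothetical, standing in for coefficients learned from the ResNet features; the actual classifier was a trained (multinomial) LR model.

```python
import math

# h_theta(x) = 1 / (1 + exp(-theta . x)): probability of the positive class.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict_proba(theta, x):
    return sigmoid(sum(w * v for w, v in zip(theta, x)))

theta = [0.8, -0.5, 0.3]      # hypothetical learned weights
features = [1.2, 0.4, 2.0]    # hypothetical extracted features
p = predict_proba(theta, features)   # probability that y = 1
```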

IV. EXPERIMENTS

A. DATASET SETUP
The experiments in this study were carried out using seven original datasets of plant leaves that have been made publicly available for research purposes. A brief description of each is given in the following subsections.

1) Flavia Dataset
The first is the Flavia dataset [12], consisting of 1703 plant leaf images of 32 different species, most of them common plants of the Yangtze Delta in China. The images are of size 1600 × 1200 and contain only leaf blades, without petioles. Being pre-processed, the Flavia dataset is clean and has little noise. It is regarded as a fine-grained classification challenge in which the task is to recognize 32 distinct species of plants from their leaf images. Table 2 gives a detailed description of the classes in this dataset.
This image dataset is quite small, with an average of only 65 images per class for a total of 1,360 images used. A general rule of thumb when applying deep learning to computer vision tasks is to have 1,000-5,000 examples per class, so we are certainly at a huge deficit here. We call Flavia classification a fine-grained classification task because all categories are very similar (i.e., species of plants); each category can be thought of as a subcategory. The categories are certainly different, but share a significant amount of common structure (e.g., shape, color, venation). The result of our model on this dataset is presented in Table 6.

2) MalayaKew Dataset
The second dataset used is the MalayaKew (MK) Leaf dataset [18], collected at the Royal Botanic Gardens, Kew, England. It consists of scan-like images of leaves from 44 species classes and is very challenging, as leaves from different species classes have very similar appearances. Specifically, the D1 images of the MK dataset were used: segmented leaf images of size 256 × 256 pixels, with 2,288 training and 528 testing images. The result of our model on this dataset without added synthetic data is presented in Table 10, while the result with the two combined datasets is presented in Table 11.

3) Swedish Dataset
This dataset, introduced by Söderkvist [31], was captured as part of a joint leaf classification project between Linköping University and the Swedish Museum of Natural History. It contains images of isolated leaf scans on a plain background, with 75 samples from each of 15 species of Swedish trees. This dataset is considered very challenging due to its high inter-species similarity. The result of our model on this dataset is presented in Table 8.

4) Middle European Woody Plants
This dataset, popularly known as MEW_2012 [32] is a collection of plant leaves from Europe. It contains native or frequently cultivated trees and shrubs of the Central Europe Region.

5) Folio Dataset
The Folio dataset [33] is a standard leaf dataset used in plant recognition. The leaves were placed on a white background and photographed in broad daylight to ensure optimum light intensity. It contains 32 different species taken from plants on the farm of the University of Mauritius and nearby locations. The result of our model on this dataset is presented in Table 9.

6) Peruvian Amazon Forestry Dataset
This is the Peruvian Amazon Forestry Dataset [34], [35], collected in the Allpahuayo-Mishana National Reserve in Peru; it contains 59,441 leaf images from ten timber tree species, gathered in different excursions and conditions. Because one of the classes was unavailable online at the time of this research, only the nine (9) available classes were used, i.e., not the entire dataset. The result of our model on this dataset is presented in Table 7.

7) LeafSnap Dataset
The final dataset used to make this research robust is the LeafSnap dataset. Published by Kumar et al. [14], it is a leaf dataset that includes images of leaves taken from two different sources, together with automatically produced segmentations. The images are divided into 23,147 laboratory photos, consisting of high-quality images of pressed leaves from the Smithsonian collection, and 7719 field images, consisting of ordinary outdoor images taken with a mobile phone. These images contain varying amounts of blur, noise, lighting patterns, shadows, and so on. Each image is of size 800 × 600 pixels. Figures 1 and 2 show samples of some of the datasets.

B. EXPERIMENTAL SETUP
To implement and test the proposed method, the following computer system was employed for the analyses: Windows 10 (64-bit), Intel Core i7-4720 CPU @ 2.60 GHz, 32 GB RAM, an Nvidia GeForce GTX 1050 GPU with 4 GB of dedicated memory, and Python 3.7 on Anaconda. In addition, a Linux-based DELL PowerEdge T640 tower server was used, equipped with four CUDA-capable GTX 1080 Ti video cards (11 GB of video memory each), a 10 TB hard drive, and a 3320 GB SSD. Tables 4 and 5 detail the setup of the system used and the parameters used in running the model.

C. EVALUATION METRIC
We analyzed the proposed method using accuracy, precision, recall, and F1-score. The accuracy of the proposed plant recognition system was computed via the following expression, which uses True Positives (TP, the number of leaf images correctly identified as belonging to a class), False Positives (FP, the number of leaf images incorrectly assigned to a class), True Negatives (TN, the number of leaf images correctly identified as not belonging to a class), and False Negatives (FN, the number of leaf images incorrectly rejected from their class):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Also, precision and recall are defined as measures for system evaluation. These are calculated by Eqns. (16) and (17), respectively:

Precision = TP / (TP + FP)   (16)
Recall = TP / (TP + FN)   (17)
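As a concrete illustration (not the authors' own code), the reported metrics can be computed with scikit-learn, which the paper already uses for classification; the label vectors below are hypothetical stand-ins for species predictions:

```python
# Sketch: computing accuracy, precision (Eq. 16), recall (Eq. 17) and
# F1-score with scikit-learn. The labels are illustrative only.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [0, 1, 2, 2, 1, 0]   # hypothetical ground-truth species labels
y_pred = [0, 1, 2, 1, 1, 0]   # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)                     # (TP+TN)/(TP+TN+FP+FN)
prec = precision_score(y_true, y_pred, average="macro")  # Eq. (16), macro-averaged
rec = recall_score(y_true, y_pred, average="macro")      # Eq. (17), macro-averaged
f1 = f1_score(y_true, y_pred, average="macro")           # harmonic mean of the two

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

For multi-class plant recognition, `average="macro"` averages the per-class scores, which weights rare and common species equally.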

D. TRAINING DETAILS FOR THE cGAN
First, we prepared our dataset for training by generating the training data in the form of pairs of images and combining each pair into a single image file, making it ready for training by the cGAN model. The data were split randomly into train, validation, and test sets. This means creating a folder of images, each of which contains an X/Y pair of an original leaf image and its corresponding leaf patch. The X and Y images each occupy half of the full image in the set and are therefore the same size. By default, the images are assumed to be square (if they are not, they are squashed into square input and output pairs). In pix2pix [27], the testing mode is also set up to take image pairs just like the training mode, with an X and a Y. We held out a test set of images from training, and afterward compared the generated leaf images to the known leaf images Y, as a way to visually evaluate the generated leaves. We created an HTML page with a row for each sample containing the input, the output (generated Y), and the target (original Y) leaf images.

We mostly followed the settings of [36] in training our cGAN model. More specifically, we used the Adam optimizer with β1 = 0.5 and β2 = 0.999. For the MalayaKew D1 dataset, 2816 images constituted the training set, and we trained for 3100 epochs with a learning rate of 0.0002, with random jitter and mirroring. We used a U-Net-based [28] generator, described in subsection B of Section III, with a PatchGAN discriminator. We used a batch size of 1 and instance normalization [40] in place of the batch normalization used in the original U-Net [28] architecture. All training images were loaded at their original sizes and then cropped to 256 × 256 patches. Table 5 details the parameters used in extracting the features and classifying the plant species. The training/test split ratio was set to 75/25%.
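The X/Y pairing step described above can be sketched as follows. This is a minimal illustration assuming Pillow, with solid-colour placeholder images standing in for real leaf photographs; the actual pix2pix preprocessing scripts may differ:

```python
# Sketch of the X/Y pair preparation step for pix2pix-style training.
# Placeholder images are used here instead of real leaf files.
from PIL import Image

def make_pair(x_img: Image.Image, y_img: Image.Image) -> Image.Image:
    """Paste the input leaf (X) and its target (Y), each resized to
    256x256, side by side into one 512x256 training image."""
    x_img = x_img.resize((256, 256))
    y_img = y_img.resize((256, 256))
    pair = Image.new("RGB", (512, 256))
    pair.paste(x_img, (0, 0))     # left half: X
    pair.paste(y_img, (256, 0))   # right half: Y
    return pair

# Demo with two solid-colour placeholders standing in for leaf images.
x = Image.new("RGB", (300, 200), "green")
y = Image.new("RGB", (300, 200), "darkgreen")
combined = make_pair(x, y)
print(combined.size)  # (512, 256)
```

A real script would walk the dataset folders, apply `make_pair` to each X/Y file pair, and save the combined images into train/validation/test subfolders.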
Although there is no fixed value for the train/test split ratio, previous experiments have shown success around particular values depending on the size of the dataset. Too few training samples, however, can cause the model to underfit regardless of the dataset's size. Given the large number of datasets to be tested and the prior success of the above ratio (and ratios close to it) in previous experiments [37][38][39], this ratio alone was selected. In future work, multiple experiments with different train/test split values will be conducted to determine the impact of the split ratio on our method, with the results benchmarked against state-of-the-art results on such classification.

E. TRAINING DETAILS FOR FEATURE EXTRACTION AND PLANT CLASSIFICATION
The training data were used to extract the features, which were then input to the Logistic Regression module for classification. The augmented MalayaKew dataset, whose generated samples are shown in Figures 8 & 9, was updated to contain a mixture of the original and augmented images; this newly created dataset was also processed by the feature extractor and classified. The results are tabulated in Tables 6-12.
The GridSearchCV class of scikit-learn [40] was used to tune the parameters of the Logistic Regression classifier, in particular to find the best value of C, the inverse regularization strength of the classifier, for each individual task. The optimal value of C for each task is stated in the tables that report the classification results (Tables 6-12).
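A minimal sketch of this tuning step, assuming scikit-learn and using random stand-ins for the CNN feature vectors (the C grid shown is illustrative, not the one used in the paper):

```python
# Sketch: tuning the Logistic Regression C with GridSearchCV.
# The feature matrix below is a random stand-in for CNN features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))      # stand-in CNN feature vectors
y = rng.integers(0, 4, size=200)    # stand-in species labels

# 75/25 train/test split, as used in the experiments above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# C is the inverse regularization strength: smaller C = stronger penalty.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=5)
grid.fit(X_tr, y_tr)
print("best C:", grid.best_params_["C"])
print("test accuracy:", grid.score(X_te, y_te))
```

With random features the reported accuracy is meaningless; the point is only the mechanics of fitting the grid on the training split and scoring the held-out set.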

A. DATA AUGMENTATION RESULTS
We trained the cGAN for 3100 epochs using an NVIDIA GeForce GTX 1080 Ti GPU. Training took an average of 4.0 seconds per epoch. During training, the discriminator and generator operate adversarially. At first, the generator had difficulty producing realistic images, and the discriminator distinguished real images from the generator's outputs with high accuracy; therefore, the discriminator's loss decreased. Then, to deceive the discriminator, the generator constructed more realistic images, so the discriminator's loss increased while the generator's loss gradually decreased. As a result, the generator network synthesized increasingly realistic versions of the given leaf images as training progressed. Figures 8 and 9 show the synthetic leaf images that were generated.

B. PLANT RECOGNITION RESULTS
The results of our work are presented in figures (Figures 10-18), tables (Tables 6-12), and a chart below, and the comparison with other models is highlighted in Tables 13-15. Figures 10-17 show the confusion matrices of the tested model on all of the datasets used in this work, revealing the actual versus predicted classes and showing which classes the model confused and to what degree.

C. DISCUSSION
It is pertinent to note that the lower performance of the model on the MalayaKew dataset can be traced to three very similar classes. Classes 2 and 9 are highly similar, posing more of a fine-grained classification challenge, and class 42 was also problematic to differentiate. This can be observed in Tables 10 & 11, which show the results on the original MK-D1 dataset and on the original + synthetic MK-D1 dataset; on the augmented dataset the results were considerably improved. The model achieved 100% accuracy on the Swedish dataset, while the lowest accuracy, 89.27%, was obtained on the LeafSnap dataset. This is unsurprising, as we used the LeafSnap dataset as-is, without adequate preprocessing, due to time constraints.
Evaluating the quality of synthesized images is an open and difficult problem. In this study, we investigated the effectiveness of applying a conditional GAN's image-to-image translation model as a data augmentation tool to improve the performance of an automated plant classification system using plant leaf images. Figures 8 & 9 show how closely the conditional Generative Adversarial Network's synthetic images approximate the original ones. From these figures, we noted that within the first 300 epochs of training, the network had difficulty generating images whose shape was close to that of the originals, but produced better leaf shapes afterward. Similarly, leaf veins became more distinct from the 800th epoch and improved further from the 2400th epoch onwards.
From the results displayed, we observed that the model achieved an average accuracy of 96.1% across all the datasets combined and outperforms other research on the individual datasets. We also observed that combining the original and synthetic MK-D1 datasets produced a 3.0% increase in accuracy over the original dataset alone, indicating that more data enables our Deep Learning models to perform better and generalize properly.