Convolutional Neural Networks for Texture Feature Extraction. Applications to Leaf Disease Classification in Precision Agriculture

This paper studies the use of deep-learning models (AlexNet, VggNet, ResNet) pre-trained on object categories (ImageNet) in applied texture classification problems such as plant disease detection tasks. Research related to precision agriculture is of high relevance due to its potential economic impact on agricultural productivity and quality. Within this context, we propose a deep learning-based feature extraction method for the identification of plant species and the classification of plant leaf diseases. We focus on results relevant to real-time processing scenarios that can be easily transferred to manned/unmanned agricultural smart machinery (e.g. tractors, drones, robots, IoT smart sensor networks, etc.) by reconsidering the common processing pipeline. In our approach, texture features are extracted from different layers of pre-trained Convolutional Neural Network models and are later applied to a machine-learning classifier. For the experimental evaluation, we used publicly available datasets consisting of RGB textured images and datasets containing images of healthy and non-healthy plant leaves of different species. We compared our method to feature vectors derived from traditional handcrafted feature extraction descriptors computed for the same images and end-to-end deep-learning approaches. The proposed method proves to be significantly more efficient in terms of processing times and discriminative power, being able to surpass traditional and end-to-end CNN-based methods and provide a solution also to the problem of the reduced datasets available for precision agriculture.


I. INTRODUCTION
Image feature extraction and classification is a computer vision field that has been studied intensively by researchers due to its practical relevance for various scenarios, including that of precision agriculture [1]. Plant diseases have a huge effect on agricultural productivity [2]. They can easily degrade the quality of the products, so they must be detected as soon as possible. The current methodology for detection is the visual inspection of plant leaves by humans [3]. However, this method is not efficient in terms of available resources, especially for large crops, and automatic image classification systems can be beneficial in this situation. In the literature, several plant disease classification problems were addressed, such as the classification of cucumber and citrus leaves [4], [5], which is performed by using the Gray-Level Co-occurrence Matrix (GLCM) for the extraction of relevant features. In [6], colour information is used along with GLCM-derived features and Gabor characteristics for the classification of mango leaves. Deep-learning methods have also been applied to the classification of plant diseases in [7]-[9].
The associate editor coordinating the review of this manuscript and approving it for publication was Gangyi Jiang.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Until recently [7], [9], the problem of image classification was addressed as a two-stage approach: the extraction of handcrafted features followed by machine-learning classification. The feature extraction step is regarded as the most important stage because the subsequent classification task is based on the derived image descriptors. Even the most powerful machine-learning classifier will provide a poor classification performance if the image features are not chosen appropriately. The extraction of relevant and discriminative features is a challenging task for real-world applications. Moreover, images are captured under various conditions and, to obtain good classification results, the extracted features should provide invariance to several transformations (such as scale, rotation, and illumination conditions) and robustness to noise. One of the most popular and efficient feature extraction methods is the Local Binary Patterns operator (LBP) [10] and its improved version, proposed in [11]. The LBP descriptor is based on the signs of the differences between neighbouring pixels, and it is used to describe locally the texture of the analysed image. Later, several LBP-derived operators which provide improved invariance to different transformations and a greater discrimination power were proposed, such as the Median Robust Extended Local Binary Patterns (MRELBP) [12]. Also, in order to improve the robustness to Gaussian noise, we introduced the Block Matching and 3D Filtering Extended Local Binary Patterns (BM3DELBP) in [13]. Another popular texture feature descriptor is the Gray-Level Co-occurrence Matrix (GLCM) [14], which achieved significant performance for texture classification tasks as reported in the literature.
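To make the idea of an LBP code concrete, the following is a minimal sketch of the basic 8-neighbour variant on a 3 × 3 neighbourhood, assuming a 2-D grayscale array as input. The function names, the neighbour ordering, and the 256-bin histogram are illustrative choices and are not taken from [10]-[13]; the improved operators mentioned above are not reproduced here.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 8-neighbour LBP codes for the interior pixels of a 2-D grayscale array."""
    img = np.asarray(image, dtype=np.int64)
    rows, cols = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left pixel (ordering is an assumption)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:rows - 1 + dy, 1 + dx:cols - 1 + dx]
        # The sign of the difference (neighbour - center) sets one bit of the code
        codes += (neighbour >= center).astype(np.int64) << bit
    return codes

def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes, usable as a texture feature vector."""
    codes = lbp_3x3(image)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

# A flat patch: every neighbour equals the center, so all 8 bits are set
flat_patch = np.full((3, 3), 5)
print(lbp_3x3(flat_patch))  # [[255]]
```

The histogram of codes, rather than the code image itself, is what would typically be passed to a classifier in the two-stage approach described above.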
In the case of traditional machine-learning methods, expert-driven feature selection and extraction are needed. A specialist must design a feature extraction method capable of outputting the most relevant features and feed them into a conventional machine-learning classifier. The classifier is then trained to learn from data and apply the learnt information to new data in order to make a classification decision.
However, lately [7], [9], impressive results were obtained with the use of deep-learning methods, revolutionising the image and object classification field. Rather than relying on handcrafted features, these methods can be used as end-to-end approaches because they automatically learn the relevant features from the raw data provided as input, without the need for expert knowledge. Deep-learning methods are constructed to learn hierarchically, their architecture being composed of several hidden layers, and are generally trained on large datasets to obtain a good classification performance. One such dataset is ImageNet [15]. The main disadvantage of these algorithms is the long training time, which in most situations is much longer than in the case of traditional classification methods. This is due to the large number of parameters that have to be learnt from the data.
The Convolutional Neural Network (CNN) is a deep-learning technique that has been widely used in the past years with great success for many computer vision tasks [1], [7], [9], [16]. The architecture of a CNN is composed of several types of layers: convolutional, nonlinear, pooling, fully connected, normalization and others. The stack of convolutional, nonlinear, and pooling layers acts as a feature extractor. The second part of the CNN is composed of several fully connected layers that are used to make a classification decision based on the generated features. We show in Fig. 1 the general block scheme of an end-to-end CNN architecture for classification tasks.
One of the main disadvantages of CNN-based methods is the fact that, as with any deep-learning technique, very large datasets are required in order to achieve significant results [17]. However, there are applications in which the number of available training samples is limited [18], especially because of the large resources (time, expertise, etc.) needed to acquire and consistently label a vast number of images (e.g. precision agriculture). This is largely addressed either by performing some sort of "data augmentation," where "new" data is generated from the existing data, or by deploying what is termed "transfer learning." Data augmentation is a challenging approach, as it tries to create relevant variability in the data and, when generative adversarial networks are used, it further increases the overall complexity of the classification system. The work in [17] provides a relevant overview of the field, and [19] is an example of an applied case of vine leaf classification.
The "transfer learning" concept, developed by [20], [21], resembles the approach we, as humans, take in our everyday life: we do not learn everything from scratch, but rather use the knowledge gained in a previous task in other related new tasks. Practically, we transfer the knowledge acquired in the past to solve future problems. Isolated training models are designed specifically for a particular task and dataset, whereas in transfer learning models, the gained knowledge can be transferred and used in another related new task, which can lead to a better performance on a smaller dataset and to less training time. CNN-based methods that explored the transfer learning approach by using features derived from CNNs pre-trained on large image datasets can be found in the work of [22]-[26] and others.
Typically, a new object classification problem is addressed by using a pre-trained model without its classification layers to extract the relevant features for the new problem. Practically, the weights of the network are not updated for the new task, but they are used in the new problem exactly as they were trained for the previous task and only the classification part is replaced. Popular CNN models that were developed for the object classification problem and are widely used for feature extraction in the context of transfer learning include AlexNet [27], VggNet [28], GoogleNet [29], and ResNet [30]. Features can be extracted either from the convolutional layers or from the fully connected layers of the network. In general, it was shown [23], [31], [32] that the features extracted from convolutional layers have a better generalization capability. The features extracted from fully connected layers have a poorer transferability because they are more specific to a particular task or dataset.
In the context of plant disease, the classification problem translates to a texture classification problem from the point of view of the image content, where the disease manifests itself more as a variation in the leaf texture than as a type of object that is present in the image [3], [7], [9]. As such, different state-of-the-art CNN architectures for texture classification and recognition have been proposed [33]-[36]. However, this strategy is highly impacted by the lack of large texture datasets compared to the object classification problem datasets. Fine-tuning (retraining only some layers) on small texture datasets does not bring enough improvements in the classification accuracy [33]. In [33], the authors proposed the Texture CNN architecture, which is based on AlexNet but uses an energy measure derived from the last convolutional layer. They arrived at the conclusion that the size of the dataset strongly influences the performance. The authors also observed that fine-tuning a network pre-trained on textured images achieves better results than fine-tuning a network pre-trained on a dataset that contains mostly objects. This probably happens because an image from an object-oriented dataset can contain multiple textures. In [36], the authors propose Bilinear CNN Models in which the fully connected layers are replaced with bilinear pooling models.
Our paper is structured as follows. Section II describes the proposed technique, which involves using CNN models pre-trained on object-oriented datasets to extract textural features. Section III details the experimental setups. For evaluation, we use publicly available datasets of textures and of real-world images of different plant species affected by disease. Section IV details the obtained experimental results together with a comparison between the proposed technique and other handcrafted and deep-learning methods in terms of classification performance and time efficiency. Section V is dedicated to final conclusions and remarks.

II. THE PROPOSED METHOD
We are interested in studying the performance of pre-trained deep-learning models in the classification of textured images even if the models were pre-trained on object categories. We show, therefore, how the chosen networks behave in a real task in which the textural characteristics are essential, namely in the classification of diseases that affect plant leaves. The underlying approach of the proposed method is to analyse which pre-trained CNN models, and which of their layers, are the most relevant for feature characterisation. We take advantage of the fact that there are large object datasets that allow for the pre-training of CNNs: we keep the learnt weights of the model and use this model in a new classification task.
The use of pre-trained models has several advantages. One of them is the fact that the feature extraction process is time-efficient because the images pass only once through the network. Secondly, relevant results can be obtained for small datasets for the classification task and no architecture handcrafting is required. This can be achieved because such models were trained on very large datasets, so there are many patterns and features already learnt that can be used to solve a different problem. For the significance of the results, the initial and the new task should be similar. Since we are interested in the classification of textures, even if the datasets on which popular CNN models were built are object-oriented, they are well-suited also for texture classification problems. This happens because of the hierarchical architecture of CNNs: while the early and mid-convolutional layers detect low-level features and texture structures, only the features computed from the last layers are more specific to the initial object classification task. We show in Fig. 2 the block scheme used to describe the considered texture classification system.
We use a pre-trained CNN model from which a feature vector is obtained for each image of the dataset. The chosen supervised classifier is the Support Vector Machine with RBF kernel which is trained on 75% of the images from each class of the considered dataset and is evaluated on the rest (25%). In order to benefit from the advantages of the transfer learning concept and thus to keep the already learnt weights of the considered network, the classification layers at the end of the CNN network are removed because they are adapted to the number of classes on which the training of the CNN was performed, which is different to that of the current problem. Thus, pre-trained CNNs are used only for feature extraction in this work and the SVM is responsible for the classification. Although an artificial neural network consisting of a fully connected layer, a softmax layer, and an output layer could have been used for the classification part, it would not have surpassed the efficiency of SVM. According to [37], CNNs are very powerful as feature extractors due to their convolutional base, but less efficient for the classification operation since the classifier is in this case a linear one. On the other hand, SVM is better for the classification of more complex data [37] since by using the RBF kernel the initial feature space where data cannot be linearly separated is transformed into another higher dimensional space where the separation of data classes is possible. Using an SVM classifier on top of features extracted from CNNs instead of CNN classification layers provides better results in [38], [39].
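The per-class 75%/25% partition described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `stratified_split`, the fixed seed, and the rounding rule are assumptions made for the example, not details given in the text.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=0):
    """Randomly assign train_fraction of the sample indices of each class to training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# Label layout mimicking a dataset with 68 classes and 20 samples per class
labels = [c for c in range(68) for _ in range(20)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 1020 340
```

With 20 samples per class, this rule yields 15 training and 5 test images for every class.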
The training and test sets are chosen randomly. For feature extraction, we considered several pre-trained models that are widely used in practice. They are presented in Table 1. Their default architecture is given in the Appendix, in Figs. 20-23. For these already existing models, the final classification layers are eliminated, and the features are extracted from several different layers in order to observe which are the most relevant for a texture classification task.
All these models were pre-trained on the ImageNet object-oriented dataset which contains more than a million images of objects classified into 1000 classes. Each pre-trained model requires input images to be of a fixed size, as given in Table 1. So, if the analysed textured images have a different size, before using the pre-trained CNN models to extract the features, the input images are resized. The convolutional base performs convolutional operations by means of several filters. The weights are the filter values and they are determined by the number and size of filters. This means that the weights corresponding to the convolutional base network do not depend on the size of the input image. So, the convolution operation is not influenced by the input image size. Filter sizes remain the same if the input image size is changed. However, the size of the feature maps will be different and that is why the number of neurons for the fully connected layers is changed depending on the input image size. This means that retraining is necessary for this situation. Changing the architecture of the model would require changing the weights which is done by training and, in this case, the purpose behind the transfer learning concept would be lost. So, to be able to rely on this concept, the images are resized to match the size required by the considered CNN models. If the difference between the size of the initial images and that imposed by CNN models is not very large and the aspect ratio is kept the same (1:1), resizing the images does not bring artifacts that could negatively influence the performance.
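The argument above, that convolutional weights are independent of the input size while the input width of a fully connected layer is not, can be checked with a few lines of arithmetic. The sketch below uses a hypothetical layer (3 × 3 filters, 3 input channels, 64 filters, padding 1) chosen only for illustration; it does not describe any particular layer from Table 1.

```python
def conv_weight_count(kernel, in_channels, out_channels):
    """Learnable parameters of one convolutional layer (weights plus biases)."""
    return kernel * kernel * in_channels * out_channels + out_channels

def conv_output_side(input_side, kernel, stride=1, padding=0):
    """Spatial side length of the feature map produced by a square convolution."""
    return (input_side + 2 * padding - kernel) // stride + 1

# Hypothetical layer for illustration only
KERNEL, IN_CH, OUT_CH, PAD = 3, 3, 64, 1

for side in (128, 224):
    out_side = conv_output_side(side, KERNEL, padding=PAD)
    # Number of inputs a fully connected layer placed right after this layer would need
    flattened = out_side * out_side * OUT_CH
    print(side, conv_weight_count(KERNEL, IN_CH, OUT_CH), out_side, flattened)
# prints:
# 128 1792 128 1048576
# 224 1792 224 3211264
```

The weight count stays at 1792 for both input sizes, whereas the flattened size changes, which is why a fully connected layer trained for one input size cannot be reused unchanged for another.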
We experiment with the extraction of features from several layers in the network. After feature extraction, the obtained feature vectors are fed into an SVM classifier whose parameters [40] are chosen through a grid search in order to obtain the best classification accuracy for each particular experiment and method.
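A grid search over the parameters of an RBF-kernel SVM can be sketched with scikit-learn as below. This is only an illustration under stated assumptions: the feature vectors are synthetic stand-ins for CNN features, the parameter grid is arbitrary, and the use of 3-fold cross-validation on the training set to score each parameter pair is an assumption, since the text does not state how candidate parameters were scored.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for CNN feature vectors: 4 classes, 20 samples each, 16 features
n_classes, per_class, n_features = 4, 20, 16
centres = rng.normal(scale=5.0, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(size=(per_class, n_features)) for c in centres])
y = np.repeat(np.arange(n_classes), per_class)

# 75% / 25% split that preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid over the SVM penalty C and the RBF kernel width gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best model on the held-out 25%
```

Scoring candidates on held-out folds of the training set, rather than on the test set, keeps the reported test accuracy from being optimistically biased by the parameter selection.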

III. EXPERIMENTAL SETUP
We validate the approach by two different experimental setups. Firstly, we investigate how transfer learning can be used in general for the classification of textures when the CNN models were pre-trained on large object datasets and what are, in practice, the relevant layers that can be considered from the hierarchical CNN to extract features from. Then, we use the results to provide an applied example of texture classification for the plant disease detection problem in precision agriculture.

A. TEXTURE DATABASE: OUTEX_TC_00013
For evaluating the proposed method, we used the Outex_TC_00013 dataset [41] which contains 68 categories of RGB textured images. There are 20 samples of size 128 × 128 pixels for each class, giving a total number of 1360 images. We show in Fig. 3 a sample for each image category. This dataset is challenging because the variability between different classes is rather small in some cases, such as the granite categories, the sandpaper ones, or the barleyrice classes. Therefore, the classification task can be difficult in such cases especially because the number of samples per class is limited.

B. PLANTVILLAGE DATASET
For validation of the method, we used the PlantVillage dataset [42] containing several plant species, some of them healthy and some affected by different diseases. In [43], the authors use three versions of this dataset: the original RGB images, the grayscale version, and the segmented RGB variant. In our experiments, we considered only the segmented RGB images from [43], since the colour information is relevant to this classification problem (a change in leaf colour can be a sign of a certain disease) and because the use of the segmented variant excludes any potential bias that might be caused by the presence of the background information. Images from this dataset were captured under different conditions; the plant leaves appear at different rotations and have different shapes. Moreover, there are some segmentation problems because the leaves are not always perfectly segmented from the background. We discarded from the initial dataset the images that were poorly segmented and could no longer be recognized. We show in Fig. 4 some examples of images that have segmentation problems, some of them being kept and some being discarded.
We performed two experiments: plant species identification and disease detection. For the plant species identification, we considered only the categories with healthy plant leaves. Fig. 5 shows three samples for each considered category for this experiment and Table 2 shows the 12 classes used in the plant species identification scenario. For the disease detection experiment, we considered several setups described in detail in Table 3. We also show in Fig. 6 some sample images for each class considered in each setup.

IV. EXPERIMENTAL RESULTS
A. OUTEX_TC_00013 RESULTS
In the first experiment, we considered extracting the features from the last layer located before the classification layers of the four pre-trained CNN models from Table 1: ResNet18, AlexNet, Vgg16, and ResNet50. For AlexNet and Vgg16, the last layer situated before the classification layers is fully connected (fc7 for both), whereas, for ResNet18 and ResNet50, the last layer is an average pooling layer (pool5 for ResNet18 and avg_pool for ResNet50). The pre-training of all the models was performed on the ImageNet dataset [15]. We show in Table 4 the results obtained by transfer learning from the pre-trained CNNs to the texture classification problem of the Outex_TC_00013 dataset. All feature vectors are normalised using a Z-score approach that is applied on columns (features are normalised independently from each other). We feed these feature vectors into the SVM classifier stage, with the parameters described in the second column of Table 4. All performance metrics are macro-averaged. From the results in Table 4, we can observe that the best performance is obtained by using the features extracted using the pre-trained ResNet50 model. Since the ResNet18 network achieved the worst classification results, we discarded it in the next experiments. In terms of time/resource efficiency, the fastest feature extraction for all images from the dataset is performed by the AlexNet model.
We were interested to see if concatenating the features obtained using two different models can increase the classification performance. We concatenated, in pairs, the feature vectors obtained by using AlexNet, Vgg16, and ResNet50. Table 5 details the obtained classification scores. The results achieved in the three cases are similar, and the concatenation of two feature vectors generated using different models slightly increases the performance. However, the size of the corresponding feature vectors is larger, which implies longer processing times.
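The column-wise Z-score normalisation and the concatenation of two feature vectors can be sketched as follows. The random arrays stand in for real features; the widths 4096 and 2048 correspond to the usual output sizes of an fc7 layer and of the ResNet50 average pooling layer. Estimating the mean and standard deviation on the training set only is an assumption of this sketch, as the text does not specify which samples the statistics are computed from.

```python
import numpy as np

def zscore_columns(train, test, eps=1e-12):
    """Normalise each feature (column) independently using training-set statistics."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std < eps, 1.0, std)  # avoid dividing by zero for constant features
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
n_train, n_test = 15, 5

# Random stand-ins for features extracted by two different pre-trained models
fc7_train, fc7_test = rng.normal(size=(n_train, 4096)), rng.normal(size=(n_test, 4096))
pool_train, pool_test = rng.normal(size=(n_train, 2048)), rng.normal(size=(n_test, 2048))

# Concatenate the two descriptors sample by sample, then normalise every column
train = np.hstack([fc7_train, pool_train])
test = np.hstack([fc7_test, pool_test])
train_z, test_z = zscore_columns(train, test)

print(train_z.shape, test_z.shape)  # (15, 6144) (5, 6144)
```

Because each column is normalised independently, normalising before or after the concatenation gives the same result; the only cost of concatenation is the longer feature vector.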
For the two models, AlexNet and Vgg16, we extracted the features from the fully connected layers in the first experiment. However, features extracted from the fully connected layers are more specific to the initial task, on which the model was pre-trained, than features extracted from convolutional layers [23], [31], [32], [44]. In the third experiment, we also considered extracting features from the last convolutional layer (actually from the ReLU layer following the last convolutional layer) for the two models. For AlexNet, that is the relu5 layer, which has 256 feature maps of size 13 × 13. In order to obtain the feature vector in this case, we averaged each feature map over all spatial locations. The layer relu5_3 of the Vgg16 model has 512 feature maps of size 14 × 14. The final feature vector is of size 512, obtained after averaging the activations over all locations. We also concatenated the obtained features. The results are shown in Table 6. We can see from the obtained results that the performance is marginally decreased by extracting the features from the last convolutional layers. This can happen because the features learnt by the CNN up to that layer in the architecture are very complex and are more related to the initial task of object classification on which the network was pre-trained. This is consistent with the view that features from earlier layers in the network are more general and can be better at describing the texture in the context of transfer learning.
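The averaging of each feature map over all spatial locations can be written in a few lines. The sketch below assumes activations stored as arrays of shape (channels, height, width) and uses random non-negative values in place of real ReLU outputs; the function name is illustrative.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each (height x width) feature map to one value: (C, H, W) -> (C,)."""
    maps = np.asarray(feature_maps, dtype=float)
    return maps.mean(axis=(1, 2))

rng = np.random.default_rng(0)

# Random non-negative stand-ins with the layer shapes quoted in the text
relu5_like = np.maximum(rng.normal(size=(256, 13, 13)), 0.0)    # AlexNet relu5
relu5_3_like = np.maximum(rng.normal(size=(512, 14, 14)), 0.0)  # Vgg16 relu5_3

vec_alexnet = global_average_pool(relu5_like)
vec_vgg16 = global_average_pool(relu5_3_like)
combined = np.concatenate([vec_alexnet, vec_vgg16])

print(vec_alexnet.shape, vec_vgg16.shape, combined.shape)  # (256,) (512,) (768,)
```

The length of the resulting vector depends only on the number of feature maps, so the descriptor stays compact regardless of the spatial size of the maps.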
Consequently, we extracted features from several convolutional layers from AlexNet, ResNet, and Vgg16, the results being given in Tables 7-9. In all situations, we averaged each feature map over all spatial locations (the average of all values contained in the matrix corresponding to that feature map). We can observe from the obtained results that extracting features from earlier layers improves the texture classification performance. This practical observation is in accordance with the theoretical understanding of CNNs. Earlier layers depict more general features, like texture structures, which are not specifically related to the initial classification problem on which the model was trained. Also, the different models trade off feature extraction time against accuracy and precision, as can be seen in the summary from Table 10.
Considering the model that provided the best classification performance (see Table 10), Vgg16, we discuss hereafter the selection of the relevant layers for the transfer learning problem. For the architecture of the pre-trained Vgg16 model depicted in Fig. 7, we are interested to observe the features learnt by this model in earlier layers, such as relu2_1 (which achieves the best performance in the proposed texture classification experiment), in middle layers such as relu3_3 and relu5_1, and the features "seen" by the CNN model in deep layers such as relu5_3 (where the classification scores are worse; see Table 9). We consider a random choice of the training and test images, for which we used the model to extract features that are then classified in the SVM stage, and we show in Fig. 8 a portion of the confusion matrices obtained for each selection of features corresponding to these layers.
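Since the reported metrics are macro-averaged, it may help to see how a confusion matrix reduces to such scores. The following is a minimal sketch with a made-up 3-class matrix (5 test samples per class); the convention that rows hold the true classes and columns the predicted ones is an assumption of the example.

```python
import numpy as np

def macro_metrics(confusion):
    """Return (accuracy, macro precision, macro recall) from a square confusion matrix.

    Rows are assumed to be true classes and columns predicted classes.
    """
    cm = np.asarray(confusion, dtype=float)
    true_pos = np.diag(cm)
    predicted = cm.sum(axis=0)  # samples assigned to each class
    actual = cm.sum(axis=1)     # samples that belong to each class
    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros_like(true_pos), where=actual > 0)
    accuracy = true_pos.sum() / cm.sum()
    # Macro-averaging: every class contributes equally, whatever its size
    return accuracy, precision.mean(), recall.mean()

# Made-up example: 3 classes, 5 test samples each
cm = [[5, 0, 0],
      [1, 4, 0],
      [0, 2, 3]]
accuracy, macro_precision, macro_recall = macro_metrics(cm)
print(round(accuracy, 4), round(macro_precision, 4), round(macro_recall, 4))  # 0.8 0.8333 0.8
```

A class for which no test image is correctly classified contributes a recall of zero to the macro average, so a single failing class is clearly visible in the macro-averaged scores.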
For class 6 (barleyrice006), we can see from Fig. 8 that by considering the features extracted from the relu5_3 layer of the pre-trained Vgg16 model, none of the test images is correctly classified, whereas all test samples (5) are predicted correctly by using the features extracted from the relu2_1 layer. Therefore, we are interested to observe the features learnt by the pre-trained Vgg16 model from the two layers and other intermediary layers by considering as input a test image from this class (barleyrice006). Fig. 9 presents the first 64 obtained feature maps. For better visualization, a logarithmic transform is applied to all feature maps, mapping each initial feature map I to a transformed feature map I_l.
All images from Fig. 9 present only the positive activations since they are extracted from ReLU layers. As we can see, the relu2_1 layer extracts different textural features and most of the channels show large activations. As we move deeper into the network, fewer and fewer activations occur. In the case of relu5_3, most of the feature maps do not contain activations at all. This happens because the last convolutional layers explore abstract and complex structures related to the objects present in the images from the ImageNet dataset that was used in the pre-training step. So, the feature maps are not activated in most cases and they are not useful for texture classification. Fig. 10 shows the considered test image along with three feature maps with large activations for the relu2_1 layer.
We can observe from Fig. 10 the fact that a lower layer such as relu2_1 is able to extract relevant features for the considered test image. There is no need to go very deep into the network for texture classification. Also, in some cases, pooling features from deeper layers can actually degrade the performance.
We also concatenated the feature vectors extracted from the considered models that achieved the best performance and obtained the results given in Table 11. By concatenating the feature vectors generated from the pre-trained AlexNet and ResNet50 models, only a slight increase in performance is observed compared to the individual scores. The results of Vgg16 are decreased when performing the concatenation to other feature vectors.
In order to compare the results obtained using the proposed architecture for texture feature extraction from pre-trained CNNs to the performance obtained using handcrafted feature vectors on the same datasets [12], [45], we show in Table 12 a synthesis of these results. We also consider the AlexNet and Vgg16 architectures as end-to-end approaches, where both the features and classification are made by the network. We directly train the AlexNet and Vgg16 CNNs on the Outex_TC_00013 dataset in order to observe if the obtained results surpass a model pre-trained on a large dataset consisting of object categories. The considered training parameters are given in Appendix. Fig. 11 shows the training progress on a random partition of the training and test sets for both networks. We can observe from Table 12 that the pre-trained AlexNet deep-learning model surpasses the MRELBP operator [12] which works on grayscale images.
However, when incorporating the colour information provided by OCCBM3DELBP [45], the pre-trained AlexNet is outperformed. This comes with the trade-off of much longer processing times for extracting features using OCCBM3DELBP, 2223.2 seconds being the average feature extraction time for all images in the dataset. By training AlexNet end-to-end on the Outex_TC_00013 dataset, we can observe from Fig. 11 that the classification accuracy on the training set reached up to 100%. However, the classification scores obtained for the validation set are lower, as the results for the two sets start to diverge around the 25th epoch. The difference is about 25%, meaning that the network has learnt the data from the training set, but it is not able to generalize well to new data. This happens because there is a small number of training images per class. The same applies to the Vgg16 model trained end-to-end, where the difference between the training and validation accuracy is even higher, approximately 36% (probably because there are more parameters in this case). This is consistent with the observation that CNNs require large datasets for satisfactory classification results. However, the handcrafted operators and the pre-trained models work very well even in this situation.
We can see from Table 12 that the best performance is achieved by considering the pre-trained Vgg16 model and by extracting features from the relu2_1 layer. A good compromise between classification accuracy and time efficiency can be obtained by extracting features from the relu3 layer of the AlexNet model since it is by far the fastest strategy: the average feature extraction time for all images is 38.55 seconds for the pre-trained AlexNet compared to more than 2 hours for the pre-trained Vgg16 relu2_1 model.
From the obtained results we can conclude that, even if the considered models have been (pre)trained on object classes, they are also efficient for texture classification. Features extracted from early convolutional layers in the network represent less complex patterns, being mostly related to characteristics such as the textural content. Moreover, they are more general and are not related specifically to the initial classification problem on which the models were trained and, in combination with classifiers that support small datasets (like SVMs), they provide relevant classification scores.

B. PLANTVILLAGE RESULTS
We are interested to evaluate the performance of pre-trained CNNs on real-world images of plant leaves. For feature extraction, we consider the pre-trained AlexNet model since it achieves a promising performance for the texture dataset compared to the rest of the models analysed in Section IV-A, while also being the fastest in terms of processing time. In such practical applications, the feature extraction time should be as small as possible for real-time processing and classification. The relu2, relu3, and relu4 layers were chosen for extracting features using the pre-trained AlexNet model based on their performance on the Outex_TC_00013 dataset, which is the more general of the two in terms of texture classification tasks. Even if the pre-trained ResNet50 model offered a performance similar to that of AlexNet and the pre-trained Vgg16 model supported better classification results for the Outex dataset, their feature extraction times make them unsuitable for real-time processing applications. Thus, we did not consider them for the PlantVillage dataset.
The first experiment consists of identifying 12 plant species using only healthy leaves. We compare our results with those obtained using handcrafted feature vectors on the same dataset. The classification scores are shown in Table 13. The pre-trained AlexNet model achieves the best classification scores with the relu3 layer, with an average feature extraction time of only 321 seconds (compared to OCCBM3DELBP, which requires more than 30 hours). Fig. 12 shows the confusion matrix computed for one run (a particular random choice of the training and test sets) using the pre-trained AlexNet model with the relu3 layer for feature extraction and an SVM classifier. Fig. 13 shows the four incorrectly classified images. In a) and b), the plant leaves are rotated at angles that do not allow a complete exposure of the leaf, which leads to the incorrect classification. The image in Fig. 13 c) presents some segmentation errors, which may be the reason for the erroneous prediction. The image in Fig. 13 d) is visually very similar to the training images from the predicted class and, probably due to the high intra-class variability of category 3, is misclassified.
The second experiment comprises nine setups used for disease detection in several plant species. Table 14 shows the results obtained for all setups in comparison to the handcrafted methods. The pre-trained AlexNet model is much faster than the other methods and also achieves the best classification results for all setups. For cherries and strawberries, a perfect classification is obtained. The highest improvement is achieved for the disease detection of tomato leaves, which is one of the most difficult classification problems due to its high number of classes (10) compared to the other setups.
Since the lowest scores are obtained for the classification of corn and tomato leaves, we analyse some of the incorrectly classified samples for these two setups.
In the third setup corresponding to corn leaves, there are three classes associated with different leaf diseases and one class of healthy leaves. We show the confusion matrix obtained for one random partition using the AlexNet relu2 method in Fig. 14. As we can see, 23 samples are misclassified in this case. We show in Fig. 15 some of these images.
Of the 23 misclassified samples, 22 were due to confusion between two diseases, classes 1 and 3 (Cercospora leaf spot and Northern leaf blight). In some cases, the two categories are difficult to distinguish even visually.
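The kind of error analysis described above can be sketched as follows: count the (true, predicted) label pairs, sum the off-diagonal entries, and find the pair of classes that are most often mistaken for each other. The helper names and the toy labels are hypothetical and do not reproduce the corn setup results.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def most_confused_pair(counts):
    """Return the unordered class pair with the most mutual errors and that error count."""
    pair_errors = Counter()
    for (true, pred), n in counts.items():
        if true != pred:
            pair_errors[frozenset((true, pred))] += n
    if not pair_errors:
        return None, 0
    pair, n_errors = pair_errors.most_common(1)[0]
    return tuple(sorted(pair)), n_errors


# Toy labels for a four-class problem (made up for illustration only).
y_true = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4]
y_pred = [1, 3, 1, 2, 2, 1, 3, 1, 4, 2]

counts = confusion_counts(y_true, y_pred)
total_errors = sum(n for (t, p), n in counts.items() if t != p)
pair, pair_errors = most_confused_pair(counts)
print(total_errors, pair, pair_errors)  # 4 (1, 3) 3
```

Grouping errors by unordered pair treats "1 predicted as 3" and "3 predicted as 1" as the same confusion, which matches how the two diseases are discussed here.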
The last setup, corresponding to tomato leaves, comprises images from 10 categories: leaves affected by nine different diseases and healthy ones. Fig. 16 shows the confusion matrix computed for one run on this setup using the pre-trained AlexNet model with the relu3 layer for feature extraction and the SVM classifier. For this run, there are 82 misclassifications. The true class with the smallest percentage of correctly classified samples is class 2. We show some misclassified observations in Fig. 17. Class 2 exhibits a high intra-class variability, which explains the poorer classification results for the images in this category. However, taking into account that nine different diseases were considered for this setup and that the images were acquired under various conditions, the achieved performance is promising.
For each of the two setups corresponding to cherry and strawberry leaves, there are two categories: healthy and non-healthy. Fig. 18 presents some image samples. We can see that, even visually, the classification task is not challenging in these cases, and therefore perfect classification scores are obtained. Table 15 shows the results obtained by other state-of-the-art methods for different experiments performed on images from the PlantVillage dataset. We can observe that no other work obtains better performance for the considered experiments than the method proposed in this paper.

C. APPLICABILITY OF THE METHOD
The greatest impact of the proposed method lies in its applicability to texture classification in cases where large datasets are not available and time performance approaching real-time is required, as in precision agriculture. In such cases, constructing and training the model can be challenging: it either requires expertise in tailoring the feature extraction step to the task (and a good understanding of the variability in the data) or, if end-to-end deep-learning-based models are used, depends on the size of the training dataset. To address this, we proposed using popular existing models (like AlexNet) pre-trained on large object-based datasets (ImageNet) and using the descriptions provided by some of their hidden layers as features for an SVM classifier. If the SVM parameters are also selected through a grid search that maximizes the classification accuracy, the level of task-specific expertise required is greatly reduced, with no need to increase the size or complexity of the available dataset. Also, by reconsidering the common processing pipeline for real-time processing scenarios that can be easily transferred to manned/unmanned agricultural smart machinery (e.g. tractors, drones, robots, IoT smart sensor networks, etc.), the classification system becomes a single-image prediction approach in which the model is trained once and then stored locally on the machinery.
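A grid search over SVM settings could look like the sketch below. It assumes scikit-learn (the toolkit used in the experiments is not stated), a hypothetical parameter grid, and synthetic 384-dimensional vectors standing in for CNN descriptors; none of these values come from the reported setups.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for CNN descriptors: two well-separated classes, 384 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 384)), rng.normal(1.0, 1.0, (40, 384))])
y = np.array([0] * 40 + [1] * 40)

# Hypothetical search grid; the kernels and C values are illustrative only.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1.0, 10.0],
}

# Each candidate is scored by 5-fold cross-validated accuracy.
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
classifier = search.best_estimator_  # refitted on all of X, y with the best settings
```

Once fitted, `classifier.predict` can be applied to the descriptor of a single new image, which corresponds to the train-once, predict-locally usage described above.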
We have shown that feature descriptors like MRELBP are not sufficiently discriminative. While OCCBM3DELBP and the pre-trained AlexNet model are close in terms of classification accuracy, they differ greatly in time efficiency. Fig. 19 shows the estimated time required to classify an image from the PlantVillage database on the spot. The total decision time for the proposed approach based on the relu3 layer of the pre-trained AlexNet model is approximately 30 ms, which is compatible with real-time processing. For the OCCBM3DELBP operator, not only does the feature extraction step take longer, but the classification does too. This happens because its feature vector has 4800 values, whereas AlexNet generates only 384 features for the same input image.
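A per-image decision time of this kind can be estimated by timing the two stages separately, as in the sketch below. The function `time_decision` and the stand-in `fake_extract` and `fake_classify` stages are hypothetical placeholders for a real feature extractor and a trained classifier.

```python
import time


def time_decision(extract, classify, image, repeats=10):
    """Return the predicted label and the mean per-stage latency in milliseconds."""
    extract_s = classify_s = 0.0
    label = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        features = extract(image)
        t1 = time.perf_counter()
        label = classify(features)
        t2 = time.perf_counter()
        extract_s += t1 - t0
        classify_s += t2 - t1
    extract_ms = 1000.0 * extract_s / repeats
    classify_ms = 1000.0 * classify_s / repeats
    timings = {
        "extract_ms": extract_ms,
        "classify_ms": classify_ms,
        "total_ms": extract_ms + classify_ms,
    }
    return label, timings


def fake_extract(image):
    """Stand-in extractor: one mean value per image row."""
    return [sum(row) / len(row) for row in image]


def fake_classify(features):
    """Stand-in classifier: thresholds the sum of the features."""
    return int(sum(features) > 0)


image = [[0.1] * 8 for _ in range(8)]
label, timings = time_decision(fake_extract, fake_classify, image)
print(label, timings)
```

Reporting the extraction and classification stages separately makes it visible when a longer feature vector slows down the classifier as well as the extractor.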
When the pre-trained AlexNet model with the relu3 layer is used for feature extraction, 216 MB of memory are required to store the pre-trained model. For the trained SVM model file, the required memory depends on the number of generated features and the number of training images. For example, in the tomato setup with 13212 training images, 10.2 MB are required to store the trained SVM model file for the AlexNet relu3 layer (384 features).
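The dependence on the number of features and training images can be illustrated with a rough upper bound. The sketch assumes a kernel SVM that stores its support vectors as dense 32-bit floats (the storage format is not stated here) and considers the worst case in which every training vector is kept as a support vector; `svm_storage_upper_bound` is a hypothetical helper.

```python
def svm_storage_upper_bound(n_train: int, n_features: int, bytes_per_value: int = 4) -> int:
    """Bytes needed if every training vector were stored as a support vector."""
    return n_train * n_features * bytes_per_value


# Tomato setup figures from the text: 13212 training images, 384 features.
bound_384 = svm_storage_upper_bound(13212, 384)
# Same training set with a 4800-value descriptor.
bound_4800 = svm_storage_upper_bound(13212, 4800)

print(bound_384, round(bound_384 / 1e6, 1))  # 20293632 20.3
print(bound_4800 / bound_384)                # 12.5
```

Under these assumptions the reported 10.2 MB lies below the bound, and a 4800-value descriptor would raise the bound by a factor of 12.5 for the same training set.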

V. CONCLUSION
We proposed a deep-learning-based method for texture classification with performance compatible with real-time processing scenarios. We considered using CNNs as feature descriptors rather than end-to-end classifiers and combined them with SVMs. To obtain a relevant classification performance even for small datasets, we based our work on the transfer learning concept and adapted to the task popular CNN models (AlexNet, Vgg16, ResNet) pre-trained on the very large ImageNet object-based dataset. In the experimental section, we considered two datasets: a public one with generic RGB textures (for the initial validation of the proposed approach) and a dataset from the applied field of precision agriculture, consisting of images of leaves from several plant species affected by several diseases (for illustrating the applicability of our work).
We analysed the classification results obtained by extracting features from several layers of the different pre-trained CNN models and using them to describe the textures in the considered datasets. We showed experimentally that extracting features from early convolutional layers is relevant for texture classification, as the generated characteristics are more general and not necessarily specific to the original task, a result consistent with the theoretical understanding of CNNs presented in the literature. We compared the results with handcrafted features derived for the same dataset and concluded that the proposed CNN-based system achieved the best overall performance (time and classification score). For the PlantVillage dataset, we performed plant species identification and proposed nine setups for disease detection. We compared the obtained results with the performance achieved using classical machine-learning texture extractors and end-to-end deep-learning techniques. Only the pre-trained AlexNet model was employed for feature extraction on the PlantVillage dataset, since it provided a promising performance in the general texture dataset evaluation and exhibited the smallest processing time, whereas the other considered models did not meet the criteria for real-time processing applications. The proposed architecture (based on the AlexNet model pre-trained on the ImageNet dataset, the selection of the relu3 layer as a descriptor, and an SVM classifier whose parameters were obtained through a grid search) surpasses the other operators in the considered cases in terms of both classification performance and processing time, making it a relevant candidate for real-time processing tasks.