Introduction
Early and accurate detection and diagnosis of plant diseases are key factors in wheat production and in reducing both qualitative and quantitative losses in crop yield [1]. Therefore, developing technologies to accurately classify the categories of wheat leaf diseases is crucial for disease prevention. State-of-the-art advancements in artificial intelligence and computer vision have motivated researchers to employ this effective technology in agriculture for automatic categorization of crop diseases caused by biotic and abiotic stresses [2]–[5]. Although remarkable performance has been achieved in conventional disease classification, it remains hard to distinguish diseases separated only by subtle differences.
Fine-grained image classification aims at discriminating sub-categories sharing one common basic-level category through digital images [6], e.g., classifying different vehicle makes and models [7]–[10], tree categories [11], bird classes [12]–[14], dog classes [15]–[17], flower species [18], aircraft [19], [20], body parts [21], [22], etc. Due to its great challenge and research value in both theory and practice, fine-grained image classification has been extensively studied recently: Liang et al. [23] proposed a Gaussian mixture model, which fused local features through a Gaussian mixture layer and achieved high classification accuracy; Iscen et al. [24] adopted approaches based on superpixels, edges, and a bank of Zernike filters used as detectors, and found that better accuracy was achieved when patches were extracted along the edges rather than around the detected regions; Xuan et al. [25] proposed a novel evolving convolutional neural network (ECNN), which could use limited clearly labeled images and weakly labeled images for better fine-grained classification of CIFAR-10, Oxford pets, etc.; Seo and Shin [26] proposed pre-training GoogLeNet on the ImageNet dataset and fine-tuning it on a fine-grained fashion dataset for design-attribute-based apparel classification, and their strategy achieved promising performance; Zhang et al. [27] proposed a fine-grained vehicle recognition framework based on a lightweight convolutional neural network (CNN) with a combined learning strategy, achieving competitive recognition performance whilst decreasing computational complexity; Zhang et al. [28] developed a novel fine-grained image categorization system based on an active learning algorithm and a support vector machine (SVM), which achieved better spatial pyramid matching performance and categorization accuracy.
Through the above investigation, we learned that most previous works aimed to boost the classification rate from three main aspects:
more precise localization of objects and domains, also known as global/domain-level attention.
more robust feature representations for subordinate-category discrimination.
human in the loop [29] and reinforcement learning.
Since our goal is automatic fine-grained image classification and our design is based on a simple intuition, i.e. directly boosting accuracy through more robust discriminative features extracted by an effective algorithm, we focus on research related to the first two aspects. Due to small discrepancies, different sub-categories are often distinguished by domain-specific areas, such as the texture of a feather [30]–[34] or a petal [35], [36], the color of a coat [37], [38] or a beak [39]–[43], and the shape of a trademark [35], [45] or a vehicle [46], [47]. Consequently, detecting these subtle discriminative domains among similar areas is crucial for fine-grained image classification [48]–[52].
Another point is that fine-grained classification tasks are common and more challenging in uncontrolled, realistic crop disease classification. Different subordinate categories exhibit highly similar disease appearances, and their discriminative cues mainly exist in subtler areas. Perhaps counter-intuitively, intra-category variation can be much larger than inter-category variation among different sub-categories in some cases, as depicted in Fig. 1. More seriously, reflection, dispersion, and blur caused by dew, equipment jitter, and extreme illumination can lead to large visual interferences in leaf appearance, which can severely degrade the accuracy and robustness of disease classification [53], [54]. Our previous work [55] also suffered severely from these interferences.
Illustration of the difficulty of fine-grained image classification of crop diseases: Large intra-category discriminations and small inter-category discriminations. (a) The sub-categories and basic-level categories of wheat leaf diseases images. (b) The sub-categories and basic-level categories of tea leaf diseases images.
Owing to the adequacy of collected samples and its research value in both theory and practice, our work focuses on the fine-grained image classification of winter wheat leaf diseases. Traditionally, classification of wheat leaf diseases has been dominated by delicate hand-crafted features [56]–[61], e.g., HOG, SIFT, SURF, and LBP. However, designing these descriptors is typically time-consuming, and their performance is unsatisfactory [23]. Thus, researchers have attempted to tackle these problems with artificial intelligence methods. For instance, Zhao et al. [62] proposed an optimized MSF-AdaBoost model to classify and monitor powdery mildew on winter wheat at a regional scale, achieving high classification accuracy and promising monitoring performance; Tian et al. [63] presented an SVM-based Multiple Classifier System (MCS) for pattern recognition of wheat leaf diseases, which achieved a better recognition rate than previous classifiers; Niu et al. [64] proposed a modified K-means clustering for efficient identification of wheat leaf diseases, with better performance on three common diseases (powdery mildew, leaf rust, and stripe rust); Yang et al. [65] presented a diagnosis model of stripe rust at field scales based on a Bayesian network, which provided technical support for accurate identification and short-term prediction of stripe rust on a small scale.
The above methods all concern shallow learning. Although some progress has been made, there is still room and potential for improvement. The extraction of hand-crafted features, such as inertia moment, roundness, and entropy, largely relies on prior knowledge; thus, the extracted features are often inadequate and lacking in detail [55]. Furthermore, while shallow-level features can be extracted effortlessly, abstract representations hidden in the deeper levels are difficult to obtain without learning procedures [55].
Moreover, the main challenge for fine-grained image classification of wheat leaf diseases indubitably lies in the very small discrepancies among different categories. Specifically, the difficulties mainly come from three aspects:
the strong similarity among different disease spots.
the large visual interferences of the cluttered environments.
the large search space of possible disease spot positions.
Consequently, an effective classification model is required to accurately extract subtle features from domain regions of high objectness, i.e., regions containing the vital discriminative cues of certain objects. Driven by this requirement, an improved convolutional neural network codenamed matrix-based convolutional neural network (M-bCNN) is proposed in this work.
CNN is a variant of the multilayer perceptron (MLP) [66] inspired by the Hubel-Wiesel biological vision system. It can adaptively construct implicit feature descriptions through multi-layer non-linear mapping driven by training data [67]. During the last seven years, largely due to the state-of-the-art performance of CNN, image classification and related fields have progressed at a dramatic pace. In 2012, Krizhevsky et al. [68] won two first prizes (in two separate tasks) with the AlexNet model in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [69], where its accuracy exceeded that of the second-placed competitor by 10%. In 2014, two milestones in face classification were achieved when Taigman et al. [70] and Ouyang et al. [71] proposed the CNN-based DeepFace and DeepID, which proved to be among the most accurate authentication models for face classification. Two other CNN frameworks built on the design concept of "going deeper" became the champion (GoogLeNet [72]) and runner-up (Visual Geometry Group (VGG) [73]) of ILSVRC-2014. The following year, ResNet [74], designed by Microsoft Research Asia (MSRA), won the championship of ILSVRC-2015. Its Top-5 error rate (3.57%) was lower than the human classification error rate (5.1%), illustrating that its object recognition ability surpassed that of the human eye. In 2016, DeepMind, a subsidiary of Google, applied CNN in the intelligent agent AlphaGo [75], which defeated Go champion Lee Sedol. Sharing the congenital advantages of CNN (though studies of CNN for the fine-grained image classification of wheat leaf diseases are scarce), we utilize it as the theoretical basis of this work.
However, we needed to enhance the representational ability of CNN to better tackle fine-grained image classification tasks. Most representative CNN models gain accuracy through stacked layers [76], [77]. From 2012 to 2015, all the leading works [68], [72], [74] in the challenging ILSVRC [69] exploited "extremely deep" models, with depths of 8 [68] to 152 [74] layers. He et al. [74] even utilized a ResNet with 1202 layers to analyze CIFAR-10. Recent evidence reveals that increasing the number of hidden layers is essential to the success of current state-of-the-art convolutional networks [73], [74], [78], [79]. Previous studies show that most deep neural networks follow a standard structure originating from LeNet-5: linearly stacked convolutional layers optionally followed by one or more subsampling layers and fully-connected layers. These "plain" nets that simply stack layers [74] prevail in the image classification literature and have produced impressive results on CIFAR-10, MNIST, and other classification tasks. Nevertheless, this design is less efficient at improving the representational ability of networks than parallel structures, as concretely embodied in the growth rates of parallelism level, data streams, scheduling efficiency, neurons, link channels, etc. [72], [74]. Moreover, linearly chaining several convolutional layers results in a quadratic increase in parameters and computational budget. More severe over-fitting and the curse of dimensionality then follow, leading to serious accuracy issues [72], [74], [79].
The depth of model representations is also imperative for distinguishing fine-grained visual categories. In order to find an effective method to increase representational ability without obvious side-effects, an improved hierarchical CNN denoted M-bCNN was proposed, and its gratifying performance was evaluated in the fine-grained image classification of wheat leaf diseases.
In order to exploit the discriminative feature representational ability of CNN, our model originates from a fundamental plain network (i.e. AlexNet [68]). First, we add two convolutional layers to the first two low layers (i.e. conv_1 and conv_2) for extracting the global features of images. Then the next three linearly stacked higher convolutional layers (i.e. conv_3, conv_4, and conv_5) are replaced by three convolutional kernel matrixes.
Due to the model's large number of parameters, training from scratch directly on the wheat leaf disease image set would result in over-fitting. Therefore, M-bCNN is first pre-trained on the ImageNet database [69]. Starting from the pre-trained weights, we fine-tuned the model to adapt to the fine-grained images. It has been proven, both theoretically and practically, that the convolutional kernel matrix effectively increases the number of data streams, neurons, and link channels while inhibiting parameter growth. In addition, tiny features can be extracted by minitype convolution filters arranged in a matrix. Moreover, these filters can combine freely with each other, owing to the fully-connected mode, generating different feature maps and thus improving the flexibility and characterization ability of the model. The convolutional kernel matrix therefore caters to the strict requirements of enhancing representational ability while suppressing time complexity, which is meaningful for the fine-grained image classification of crop diseases. It allows M-bCNN to acquire accuracy gains from increased hidden layers without an obvious accuracy loss penalty, producing results that compete strongly against AlexNet and VGG-16.
The main novelties and contributions of this research are summarized in three points:
To the best of our knowledge, this is the first work to use an improved CNN for the fine-grained image classification of wheat leaf diseases. A novel hybrid CNN structure codenamed M-bCNN is proposed, which significantly increases the data streams, neurons, and link channels. The effectiveness of its hallmark, the convolutional kernel matrix, is proven both theoretically and practically.
As there is no large-scale publicly available image set of wheat leaf diseases at present, a total of 16,652 high-fidelity winter wheat leaf images covering eight categories were collected from several locations in Shandong Province, China. Moreover, a total of 83,260 augmented images were produced by five augmentation methods. To our knowledge, this is the first large-scale high-resolution image set of winter wheat leaf diseases. We intend to open-source this set once it grows richer in both quantity and species.
We have undertaken a significant amount of work on the image set. Our proposed model achieves higher validation accuracy, individual classification rate, precision, recall, and F1-score improvement with a tolerable parameter addition.
The remainder of this paper is organized as follows. Section II illustrates the methodology of the proposed M-bCNN and the effectiveness of the convolutional kernel matrix. Section III presents the construction of the original and augmented image sets of wheat leaf diseases. The details of the experimental process and results are covered in Section IV. A detailed discussion and analysis is elaborated in Section V. Finally, Section VI provides concluding remarks and suggestions for future work.
Matrix-Based Convolutional Neural Network
A. Overview of M-bCNN
In an attempt to leverage the success of CNN in object classification, the M-bCNN for fine-grained image classification of wheat leaf diseases is proposed. This section details the novelty of our method. It describes the new hierarchical M-bCNN architecture that integrates the proposed convolutional kernel matrix with other tricks, such as ELU, LRN, and DropConnect. The convolutional kernel matrix aims at increasing the model's representational ability, so as to learn domain-specific discrimination for fine-grained classification, whilst suppressing the parameter growth rate. The model depicted in Fig. 2 is called M-bCNN-CKM-3 for its three convolutional kernel matrixes.
The overview of the proposed M-bCNN-CKM-3. Layers of original CNN are shown in green and blue. Convolutional kernel matrixes that we propose are in purple.
As shown in Fig. 2, M-bCNN-CKM-3 mainly contains four convolutional layers and three convolutional kernel matrixes, together with the subsampling and fully-connected layers described below.
1) DropConnect
Fig. 3 reveals that the input pixel vector $x_{i}$ is element-wise masked by a binary matrix $M_{i}$ drawn from a Bernoulli distribution with keep probability $p$, as shown in Eq. (1):\begin{equation*} \begin{cases} x_{i}' = x_{i} \ast M_{i} \\ M_{i} \sim Bernoulli(p) \end{cases} \tag{1}\end{equation*}
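As a minimal sketch of Eq. (1), the Bernoulli masking can be written in NumPy as follows; the function and variable names are ours for illustration, not from the paper's implementation:

```python
import numpy as np

def dropconnect_mask(x, p=0.5, rng=None):
    """Apply a DropConnect-style Bernoulli mask to input x, per Eq. (1)."""
    rng = rng or np.random.default_rng()
    m = rng.binomial(n=1, p=p, size=x.shape)  # M_i ~ Bernoulli(p)
    return x * m                              # x_i * M_i

# Example: mask a 4-element input pixel vector with keep probability 0.8
x = np.array([0.2, 0.5, 0.9, 0.1])
print(dropconnect_mask(x, p=0.8))
```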
2) Exponential Linear Unit (ELU)
Next, the processed feature maps are calculated by convolution filters. In order to inhibit vanishing gradient and increase the model's convergence rate, ELU is utilized as the activation function in the convolutional layers, convolutional kernel matrixes, and subsampling layers. Suppose the input signals are denoted as $x_{i}$, the weights as $w_{i}$, and the bias as $b$; the activated output $y_{i}'$ is then given by Eq. (2):\begin{equation*} \begin{cases} y_{i}' = f(y_{i}) = f\left({\displaystyle \sum \limits _{i=1}^{n} {x_{i} w_{i}^{T} +b} }\right) \\ f=ELU(y_{i})=\begin{cases} y_{i} &\quad \text {if } y_{i} \ge 0 \\ \alpha (e^{y_{i}}-1)&\quad \text {if } y_{i} < 0 \\ \end{cases} \\ \end{cases} \tag{2}\end{equation*}
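A small NumPy sketch of the ELU mapping in Eq. (2), with a toy pre-activation; the helper name and the values are illustrative only:

```python
import numpy as np

def elu(y, alpha=1.0):
    # Eq. (2): identity for y >= 0, alpha*(exp(y)-1) for y < 0
    return np.where(y >= 0, y, alpha * np.expm1(np.minimum(y, 0.0)))

# Pre-activation y = sum_i x_i * w_i + b, then the ELU non-linearity
x, w, b = np.array([0.3, -0.7]), np.array([0.5, 0.4]), 0.1
print(elu(np.array([x @ w + b, -2.0, 2.0])))
```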
3) Local Response Normalization (LRN)
After the non-linear mapping of ELU, we employ the within-channel normalization of LRN for better generalization ability, with the local region extended within each independent channel. The received signal is normalized as shown in Eq. (3):\begin{equation*} y_{i}'' = y_{i}' \Big/ \left({k+\left({\frac {\alpha }{n}}\right)\sum \limits _{j=\max (0,i-n/2)}^{\min (N-1,i+n/2)} {(y_{j}')^{2}} }\right)^{\beta }\tag{3}\end{equation*} where $n$ is the size of the local region, $N$ is the total number of positions, and $k$, $\alpha$, and $\beta$ are hyper-parameters.
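The normalization of Eq. (3) can be sketched directly as below; the hyper-parameter values shown are AlexNet's common defaults, not necessarily those used in M-bCNN:

```python
import numpy as np

def lrn(y, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Eq. (3): divide each activation by a power of the local
    sum of squares over its n neighbouring positions."""
    N = y.shape[0]
    out = np.empty_like(y)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + (alpha / n) * np.sum(y[lo:hi + 1] ** 2)) ** beta
        out[i] = y[i] / denom
    return out

print(lrn(np.linspace(0.0, 1.0, 8)))
```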
Finally, M-bCNN ends with an eight-way fully-connected layer with the Softmax [81], [82] function:\begin{equation*} S_{i} =\frac {e^{V_{i}}}{\sum \limits _{j=1}^{K} {e^{V_{j}}}}\tag{4}\end{equation*} where $V_{i}$ is the $i$-th input to Softmax and $K$ is the number of categories (here, $K = 8$).
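A numerically stable sketch of Eq. (4) for the eight-way output; the logits below are fabricated for illustration:

```python
import numpy as np

def softmax(v):
    # Eq. (4); shifting by max(v) avoids overflow without changing S_i
    e = np.exp(v - np.max(v))
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1, -1.2, 0.0, 1.1, -0.3, 0.7])
probs = softmax(logits)
print(probs.argmax(), round(probs.sum(), 6))  # predicted class, sums to 1
```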
B. Convolutional Kernel Matrix
In this section, we demonstrate the positive effect of the convolutional kernel matrix on enhancing representational ability and inhibiting parameter growth. The commonly-used plain nets and the proposed convolutional kernel matrix are compared in terms of their structures, data streams, neurons, link channels, and training parameters, since these largely reflect the performance of a neural network from a mathematical point of view.
1) Schema PlainNet-2 and Schema CKM-2
In Fig. 4, we hypothesize that the size of input image is
The structure of Schema PlainNet-2 and Schema CKM-2. Left: a 2-layer plain network as a reference. Right: a 2×2 convolutional kernel matrix (Schema CKM-2).
The number of neurons, link channels, and training parameters of the two schemas are shown in Table 2.
Table 1 reveals that the number of data streams in Schema CKM-2 is four times that of Schema PlainNet-2, which provides more pipelines for feature integration. Accordingly, Table 2 shows that the number of link channels in Schema CKM-2 is four times that of Schema PlainNet-2, which brings more non-linear mappings for feature extraction. In addition, the number of neurons in Schema CKM-2 is two times that of Schema PlainNet-2, implying a stronger feature extraction ability. The increases in neurons and link channels are both meaningful for boosting the model's representational ability. Meanwhile, the number of training parameters also increases with the added layers, but not enough to cause a serious computational burden.
2) Schema PlainNet-3 and Schema CKM-3
In Fig. 5, linearly stacked convolutional layers,
The structure of Schema PlainNet-3 and Schema CKM-3. Left: a 3-layer plain network as a reference. Right: a 3×3 convolutional kernel matrix (Schema CKM-3).
The number of neurons, link channels, and training parameters of two schemas are shown in Table 4.
In Table 3 and Table 4, the numbers of data streams and link channels in Schema CKM-3 are both 27 times those of Schema PlainNet-3, a distinct improvement over Schema CKM-2. The number of neurons in Schema CKM-3 is three times that of Schema PlainNet-3. Moreover, the number of training parameters is also tripled. It can be seen that the improvement in neurons and link channels is more significant than in Schema CKM-2, while the number of training parameters remains within an acceptable range.
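For quick reference, the multipliers reported above follow a simple pattern; the toy calculator below assumes this pattern generalizes as the text suggests (data streams and link channels grow as n^n, neurons and parameters roughly n-fold), which should be checked against Tables 1-4:

```python
# Assumed pattern: an n-by-n convolutional kernel matrix multiplies
# data streams and link channels by n**n versus an n-layer plain net,
# while neurons and training parameters grow only about n-fold.
def ckm_multipliers(n):
    return {"data_streams": n ** n, "link_channels": n ** n,
            "neurons": n, "parameters": n}

for n in (2, 3):
    print(f"CKM-{n} vs PlainNet-{n}:", ckm_multipliers(n))
# CKM-2: 4x streams, 2x neurons; CKM-3: 27x streams, 3x neurons
```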
3) Schema PlainNet-n and Schema CKM-n
With the improvement of hardware, the implementation of convolutional kernel matrixes of larger size may become feasible, with a structure like Schema CKM-n in Fig. 6. It is composed of
The structure of Schema PlainNet-n and Schema CKM-n. Left: an n-layer plain network as a reference. Right: an n×n convolutional kernel matrix (Schema CKM-n).
The number of neurons, link channels, and training parameters in Schema PlainNet-n and Schema CKM-n are calculated by Eq. (5) to Eq. (10), where
The number of neurons, link channels, and training parameters in Schema PlainNet-n. (a) The number of neurons in Schema PlainNet-n. (b) The number of link channels in Schema PlainNet-n. (c) The number of training parameters in Schema PlainNet-n.
The number of neurons, link channels, and training parameters in Schema CKM-n. (a) The number of neurons in Schema CKM-n. (b) The number of link channels in Schema CKM-n. (c) The number of training parameters in Schema CKM-n.
It can be seen from Eqs. (5), (6), (8), and (9) that the numbers of neurons and link channels in Schema CKM-n are
The time complexity of one convolutional kernel matrix can be calculated by Eq. (11):\begin{equation*} \begin{cases} \text {Time}\sim O\left({\displaystyle \sum \limits _{i=1}^{N} {NM_{i}^{2} \cdot NK_{i}^{2} \cdot N^{N}C_{i-1} \cdot N^{N}C_{i}} }\right) \\ M=(X-K+2\ast Padding)/Stride+1 \\ C_{i} =(K^{2}+1)F(X-iK+1)^{2} \\ N\in \{x\mid x\ge 2,x\in \mathbb {Z}\} \\ \end{cases} \tag{11}\end{equation*}
In conclusion, the above three comparisons (Sections 1-3) demonstrate that the convolutional kernel matrix provides significantly better performance than plain networks. The matrix structure achieves a substantial increase of data streams, neurons, and link channels at a tolerable increase of computational requirements and an affordable addition of parameters. This way, the curse of dimensionality does not appear within a proper matrix size. Sharing the above advantages, M-bCNN can readily reap accuracy gains from considerably increased depth, producing results substantially better than plain networks.
Data Description
In this work, winter wheat leaf disease images were utilized as the experimental samples for fine-grained classification because of the strong similarity among subordinate categories in some cases. At present, no large-scale image set of wheat leaf diseases is publicly available, so 16,652 high-fidelity images were collected from several wheat planting areas of Shandong Province and assigned as the original image set. Then an augmented database containing 83,260 images was constructed by five augmentation methods. The augmented and original image sets were utilized as the training and testing samples, respectively. To the best of our knowledge, this is the first available large-scale high-resolution image set of winter wheat leaf diseases.
A. Image Acquisition
From the wheat planting bases of Shandong Province, China, 16,652 winter wheat leaf images covering eight categories were collected in the field using a Canon EOS 80D camera. They were acquired between 8:00 a.m. and 5:00 p.m., with the camera three to seven cm from the leaf. The image format was JPEG, and each image was a 24-bit color bitmap. Each image included only one disease and was assigned to its ground-truth category by plant protection experts. The original image set was utilized as the testing sample. See Table 6 for randomly selected samples.
From Table 6 we can observe that images of some sub-categories within one common basic-level category are strongly similar. For example, the images of Powdery Mildew and Cochliobolus Heterostrophus resemble each other, and they belong to the same basic category of Spore Parasitism. It is therefore generally much harder to classify each image into its true category, which makes this a meaningful dataset for evaluating fine-grained classification models. The number and proportion of each category are shown in Table 7.
B. Image Augmentation
Adequately labeled samples can reduce under-fitting in the model training process [83]. In order to increase the number and diversity of the original images and endow the extracted features with robustness to rotation, translation, scaling, etc., an augmented image set was constructed through noise addition [84], color jittering [68], PCA jittering [68], rotation blur [55], and scaling blur [85], chosen for their implementation simplicity and the satisfactory performance demonstrated in previous research [23], [52], [72], [73]. These methods simulate the interferences of noise, illumination fluctuation, and object jitter that are frequently encountered during acquisition and in practical application scenarios. See Table 8 for details and Table 9 for the processed images corresponding to each method.
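As an illustration, two of the five methods (noise addition, and a plain rotation standing in for rotation blur) might be sketched as below with NumPy and OpenCV; the file name and parameter values are hypothetical, not the paper's:

```python
import numpy as np
import cv2

def add_gaussian_noise(img, sigma=10.0, rng=None):
    # Noise addition: zero-mean Gaussian noise simulates sensor noise
    rng = rng or np.random.default_rng()
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def rotate(img, angle_deg=15.0):
    # Rotation about the image centre simulates leaf/equipment jitter
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(img, m, (w, h))

img = cv2.imread("wheat_leaf.jpg")  # hypothetical sample image
augmented = [add_gaussian_noise(img), rotate(img, 15.0), rotate(img, -15.0)]
```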
Finally, an augmented dataset of 83,260 images was obtained, which was sufficient for the model's convergence. This image set was utilized as the training sample; the number and proportion of each category are shown in Table 10.
Experiments
A. Experimental Environments
Training a deep CNN through a large number of iterations relies heavily on high-performance graphics processing units (GPUs). We ran the experiments using multiple NVIDIA GeForce GTX 1080 graphics cards, whose basic configuration is shown in Table 11.
The computer was an HP EliteDesk 880 G2 TWR with an Intel(R) Core i7 6700K (3.40 GHz) processor and 16 GB of memory. The operating system was Ubuntu 16.04.1 (64-bit). The M-bCNN was implemented in TensorFlow, an open-source machine learning framework for numerical computation developed by the Google Brain Team, with Python as the programming language interfacing with the TensorFlow core.
B. Evaluation Metric
The following metrics are considered to evaluate the models. First and foremost, accuracy is widely used for target classification and recognition. However, the performance of a model cannot be sufficiently illustrated by accuracy alone. We therefore also use the confusion matrix, known as the error matrix in supervised learning, which depicts the actual and predicted categories in its rows and columns, respectively. Moreover, the precision, recall, and F1-scores across individual categories are utilized to evaluate the classifier models directly.
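These metrics can be computed directly, for example with scikit-learn; the labels below are fabricated solely to show the calls:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical ground-truth and predicted labels for the 8 categories
y_true = np.array([0, 1, 2, 2, 3, 4, 5, 6, 7, 7])
y_pred = np.array([0, 1, 2, 3, 3, 4, 5, 6, 7, 6])

print("accuracy:", accuracy_score(y_true, y_pred))
print("confusion matrix (rows: true, cols: predicted):")
print(confusion_matrix(y_true, y_pred, labels=list(range(8))))
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(8)), zero_division=0)
print("precision:", p, "recall:", r, "F1:", f1)
```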
C. Model Training
Before the training starts, 70% of the images in each category are randomly selected as training samples and the remaining 30% are utilized as validation samples. Since the augmented dataset is already balanced, this sampling method can ensure the inter-class balance.
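A minimal sketch of this per-category 70/30 split (the helper is ours, not from the paper's code):

```python
import numpy as np

def stratified_split(labels, train_frac=0.7, seed=0):
    """Randomly take 70% of each category for training and 30% for
    validation, preserving the inter-class balance."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        cut = int(round(train_frac * idx.size))
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(val_idx)

labels = np.repeat(np.arange(8), 10)   # toy: 10 images per category
tr, va = stratified_split(labels)
print(tr.size, va.size)                # 56 training, 24 validation
```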
1) Optimization Objective
Suppose there are $N$ training samples belonging to $c$ categories; denoting the actual output of the $n$-th sample for the $m$-th category as $y_{m}^{n}$ and the corresponding expected output as $EO_{m}^{n}$, the squared-error objective is defined in Eq. (12):\begin{equation*} E^{N}=\frac {1}{2}\sum \limits _{n=1}^{N} {\sum \limits _{m=1}^{c} {(y_{m}^{n} -EO_{m}^{n})^{2}}}\tag{12}\end{equation*}
2) Loss Function
The standard cross-entropy [86] is utilized as the loss function during the model training stage and is defined in Eq. (13):\begin{equation*} E=-\frac {1}{n}\sum \limits _{x} {[y\ln y'+(1-y)\ln (1-y')] }\tag{13}\end{equation*} where $n$ is the number of samples, $y$ is the target output, and $y'$ is the predicted output.
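A direct NumPy transcription of Eq. (13), with a clipping guard that we add to keep the logarithms finite:

```python
import numpy as np

def cross_entropy(y, y_pred, eps=1e-12):
    # Eq. (13); eps keeps the logarithms finite at y_pred = 0 or 1
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

print(cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
```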
3) Regularization Term
In order to better resist over-fitting and vanishing gradient, L2 regularization is exploited, as shown in Eq. (14):\begin{equation*} L_{2} =\frac {1}{2n}\lambda \sum \limits _{w_{i}} {w_{i}^{2}}\tag{14}\end{equation*} where $\lambda$ is the regularization coefficient and $n$ is the number of samples.
4) Optimization Strategy
In pursuit of faster training speed, the strategy of "SGD + momentum" is utilized as the optimization algorithm; the objective and update rules are given in Eq. (15):\begin{equation*} \begin{cases} J(\theta)=\displaystyle \frac {1}{N}\sum \limits _{i=1}^{N} {\frac {1}{2}(y^{i}-h_{\theta } (x^{i}))^{2}} \\ v=\alpha v-\varepsilon \nabla _{\theta } J(\theta) \\ \theta =\theta +v \\ \end{cases} \tag{15}\end{equation*} where $\alpha$ is the momentum coefficient, $\varepsilon$ is the learning rate, and $v$ is the velocity term.
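A sketch of one update step of Eq. (15), folding the derivative of the L2 term of Eq. (14) into the gradient as weight decay; the names and values are illustrative:

```python
import numpy as np

def sgd_momentum_step(theta, grad, v, lr=0.01, momentum=0.9, l2=0.0):
    # Eq. (15): v <- alpha*v - epsilon*grad(J); theta <- theta + v.
    # l2 corresponds to lambda/n from Eq. (14), whose gradient is l2*theta.
    grad = grad + l2 * theta
    v = momentum * v - lr * grad
    return theta + v, v

theta, v = np.array([0.5, -0.3]), np.zeros(2)
grad = np.array([0.1, -0.2])              # hypothetical gradient of J
theta, v = sgd_momentum_step(theta, grad, v, lr=0.1, l2=1e-4)
print(theta, v)
```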
Finally, batch normalization (BN) [88] is adopted right after each convolutional layer.
5) Training Implementation
The structures of Schemas CKM-2 and CKM-3 are realized in the models M-bCNN-CKM-2 and M-bCNN-CKM-3, respectively, which are contrasted with two representative plain networks, AlexNet [68] and VGG-16 [73], for comparison studies. In the same experimental environment, M-bCNN-CKM-2 and M-bCNN-CKM-3 are first pre-trained on the ImageNet dataset [69] owing to their large number of parameters; then all four models are fine-tuned end-to-end for up to 100 epochs by SGD + momentum with back-propagation on the augmented image set, with a mini-batch size of 50. Fig. 9 (a), (b), and (c) depict the accuracy on the training and validation image sets throughout the whole procedure.
Fig. 9 (a)-(c) compare the training and validation accuracies of the four models. Fig. 9 (a) shows that M-bCNN-CKM-2 and M-bCNN-CKM-3 converged after about 50 training epochs. The results indicate that the two models have equivalent accuracy on the training image set, whereas the validation accuracy of M-bCNN-CKM-3 is better than that of M-bCNN-CKM-2. Based on these results, M-bCNN-CKM-2 was then compared with AlexNet and VGG-16, as shown in Fig. 9 (b) and (c). The results demonstrate that the training and validation accuracies of M-bCNN-CKM-2 are both higher than those of AlexNet and VGG-16, with only about ten more training epochs required to achieve convergence.
Accuracy on the training and validation image sets. (a) Training and validation accuracy curves of M-bCNN-CKM-2 and M-bCNN-CKM-3. (b) Training and validation accuracy curves of M-bCNN-CKM-2 and AlexNet. (c) Training and validation accuracy curves of M-bCNN-CKM-2 and VGG-16.
According to the results in Fig. 9 (a)-(c), the model that maximizes the accuracy on the validation image set is considered the best. Table 12 shows the training accuracy, validation accuracy, training epochs, and training time for each model. M-bCNN-CKM-3, which achieved the highest validation accuracy, is the best-performing model. At convergence, the highest validation accuracies of M-bCNN-CKM-2 and M-bCNN-CKM-3 are about 91.32% and 96.5%, respectively, which are clearly higher than those of AlexNet and VGG-16 (83.12% and 88.54%, respectively). M-bCNN-CKM-2 and M-bCNN-CKM-3 achieved higher validation accuracies of fine-grained classification for wheat leaf disease images while requiring only about four more hours to converge. This suggests that the convolutional kernel matrix is effective both in boosting representational ability and in suppressing parameter growth, and the training and validation accuracies do not suffer the penalty of the curse of dimensionality.
D. Feature Visualization
In order to gain a clearer understanding of how and why the models work, guided backpropagation and deconvolution [89] are both utilized to visualize the continually updating filters throughout the training stage. Over the experimental iterations, the visualization of some randomly selected filters in M-bCNN-CKM-3 is shown in Fig. 10.
We can view the above filters as learned feature descriptors encoding distinctive fusion structures. It is noticeable that, although each filter is independent in Conv1 and S1, the filters in CKM-35 change smoothly during the training stage. In this manner, they provide much richer and more meaningful domain-specific representations. In a sense, this further demonstrates that the convolutional kernel matrix is crucial for a model to learn better representations.
E. Model Testing
This section illustrates the performance verification of the models in fine-grained image classification experiments. We utilized the 16,652 original images as testing samples, then compared the trained M-bCNN-CKM-2 and M-bCNN-CKM-3 against AlexNet and VGG-16, evaluating them in terms of individual accuracy, precision, recall, F1-score, and overall accuracy.
In Fig. 11, four confusion matrixes, which compare the true category (ordinate) against the predicted category (abscissa), were calculated to describe the individual classification rate of each model. Note from the color distribution of the confusion matrixes that Normal Leaf and Bacterial Leaf Blight have high average classification rates (91.0% and 92.3%, respectively) across all models. The large green area of healthy leaves and the golden appearance of leaves infected with Bacterial Leaf Blight make them easier to distinguish from other sub-categories sharing one common basic-level category. Furthermore, M-bCNN-CKM-3 and AlexNet achieved the highest (89.6%) and the lowest (80.5%) average classification rates over the eight categories, as evident from Fig. 11 (b) and (c).
Confusion matrix of the testing results. (a) Illustration of individual classification rate by M-bCNN-CKM-2. (b) Illustration of individual classification rate by M-bCNN-CKM-3. (c) Illustration of individual classification rate by AlexNet. (d) Illustration of individual classification rate by VGG-16.
Fig. 12 and Table 13 depict the precision, recall, F1-score, and accuracy of the eight categories for the four models on the testing image set. In Fig. 12, we observe that the precision of Bacterial Leaf Blight (93.32%) is the highest, while the recall of Leaf Rust (91.68%) is the highest. Other sub-categories sharing one basic-level category are harder to distinguish because of their strong similarity. The average precision (90.15%) and recall (88.62%) of M-bCNN-CKM-3 are the highest among these models, and those of AlexNet are the lowest (69.83% precision and 64.71% recall). In Table 13, the average F1-score (85.5%) and accuracy (90.1%) of M-bCNN-CKM-3 are also the highest of the four models, and those of AlexNet are the lowest (51.75% F1-score and 72% accuracy). The F1-scores and accuracies indicate that M-bCNN-CKM-2 and M-bCNN-CKM-3 outperform the two plain networks in fine-grained image classification, with M-bCNN-CKM-3 being the top performer. Based on this testing assessment, the convolutional kernel matrix is meaningful for boosting representational ability compared with linearly stacked layers, and the accuracy penalty caused by the curse of dimensionality has not appeared. Additionally, from the comparison of M-bCNN-CKM-2 and M-bCNN-CKM-3 (see also Fig. 9 (a) and Table 12), we observe that this advantage becomes more significant as the matrix size increases. Sharing the advantages of the convolutional kernel matrix, M-bCNN can easily acquire accuracy gains from layers added in the form of a matrix.
Discussion
Recently, a number of studies have been conducted on fine-grained classification methods, most providing promising performance in certain fields. Inspired by the design conceptions of parallel networks (e.g., Part-based CNN [8], Two-level Attention CNN [16], MCNN [55], GoogLeNet [72], ResNet [74], and Hypercolumn CNN [90]), we proposed a novel hybrid CNN structure codenamed M-bCNN, which leverages convolutional kernel matrixes to effectively increase the data streams, neurons, and link channels. The matrix-based architecture played an important role, and the expected accuracy gains were delivered in the fine-grained image classification of wheat leaf diseases. The model's satisfying performance surpassed that of two representative plain networks, i.e., AlexNet [68] and VGG-16 [73]. The experimental results and conclusions are basically consistent with other studies on hierarchical models [8], [16], [55], [72], [74]. Parallelization and grading of neural networks is one of the development trends of deep learning, and our strategy might prove meaningful for other fine-grained tasks such as action or attribute categorization.
One potential issue with the proposed model concerns the training phase: the convolutional kernel matrix may introduce heavy computations when the iterations, training samples, and matrix size are very large. A computational burden penalty is a common phenomenon in parallel networks (e.g., GoogLeNet [72], Two-level Attention CNN [16], and Hypercolumn CNN [90]), which deliberately sacrifice some algorithmic efficiency for accuracy benefits. Consequently, it is important to research an effective pruning mechanism for model compression, so that limited computational resources can be allocated scientifically and rationally. There is still room and potential for performance to match or even surpass the M-bCNN. One direction of future work is to delve deeper into architecture optimization and integrate a modified pruning mechanism into the representational framework.
Conclusion
In this work, we have proposed a unified CNN model, denoted M-bCNN, based on the convolutional kernel matrix, for fine-grained image classification. The main advantage of the convolutional kernel matrix is significant gains in data streams, neurons, and link channels at a modest increase of computational requirements compared to plain networks. We described the methodology of our architecture and its positive effect on both enhancing representational ability and inhibiting parameter growth.
The experiments demonstrated that our model's promising performance competes against AlexNet and VGG-16 in the fine-grained image classification of wheat leaf diseases. Our approach yields solid evidence that the convolutional kernel matrix is a feasible and useful idea in general, providing a new path for the identification of crop diseases.
Future work will proceed along two directions. First, we will focus on optimizing the architecture and hyper-parameters of M-bCNN for other challenging fine-grained classification tasks. Second, we will try other models, such as generative adversarial networks (GANs) and regions with CNN (R-CNN), to deal with semantic segmentation, object detection, and open-set recognition.