Introduction
Early and accurate detection and diagnosis of plant diseases are key factors in wheat production and in reducing both qualitative and quantitative losses in crop yield [1]. Therefore, developing technologies to accurately classify the categories of wheat leaf diseases is crucial for disease prevention. State-of-the-art advancements in artificial intelligence and computer vision have motivated researchers to employ this effective technology in agriculture for automatic categorization of crop diseases caused by biotic and abiotic stresses [2]–[5]. Although remarkable performance has been achieved in conventional disease classification, it remains hard to distinguish diseases separated only by subtle differences.
Fine-grained image classification aims at discriminating sub-categories sharing one common basic-level category through digital images [6], e.g., classifying different vehicle makes and models [7]–[10], tree categories [11], bird classes [12]–[14], dog classes [15]–[17], flower species [18], aircraft [19], [20], body parts [21], [22], etc. Due to its great challenge and research value in both theory and practice, fine-grained image classification has been extensively studied recently: Liang et al. [23] proposed a Gaussian mixture model, which fused local features through a Gaussian mixture layer and achieved high classification accuracy; Iscen et al. [24] adopted approaches based on superpixels, edges, and a bank of Zernike filters used as detectors, and found that better accuracy was achieved when patches were extracted along the edges rather than around the detected regions; Xuan et al. [25] proposed a novel evolving convolutional neural network (ECNN), which could use limited clearly labeled images and weakly labeled images for better fine-grained classification of CIFAR-10, Oxford pets, etc.; Seo and Shin [26] proposed pre-training GoogLeNet on the ImageNet dataset and fine-tuning it on a fine-grained fashion dataset for design-attribute-based apparel classification, and their strategy achieved promising performance; Zhang et al. [27] proposed a fine-grained vehicle recognition framework based on a lightweight convolutional neural network (CNN) with a combined learning strategy, achieving competitive recognition performance whilst decreasing computational complexity; Zhang et al. [28] developed a novel fine-grained image categorization system based on an active learning algorithm and a support vector machine (SVM), which achieved better spatial pyramid matching performance and categorization accuracy.
Through the above investigation, we learned that most previous works aimed to boost the classification rate from three main aspects:
more precise localization of objects and domains, also known as global/domain-level attention.
more robust feature representations for subordinate-category discrimination.
human in the loop [29] and reinforcement learning.
Since our goal is automatic fine-grained image classification and our design is based on a simple intuition, i.e. directly boosting accuracy through more robust discriminative features extracted by an effective algorithm, we focus on research related to the first two aspects. Due to small discrepancies, different sub-categories are often distinguished by domain-specific areas, such as the texture of a feather [30]–[34] or a petal [35], [36], the color of a coat [37], [38] or a beak [39]–[43], and the shape of a trademark [35], [45] or a vehicle [46], [47]. Consequently, detecting these subtle discriminative domains among similar areas is crucial for fine-grained image classification [48]–[52].
Another point is that fine-grained classification tasks are common and more challenging in uncontrolled, realistic crop disease classification. Different subordinate categories exhibit highly similar disease appearances, and their discriminative cues mainly exist in subtler areas. Perhaps counter-intuitively, intra-category variation can be much larger than inter-category variation among different sub-categories in some cases, as depicted in Fig. 1. More seriously, reflection, dispersion, and blur caused by dew, equipment jitter, and extreme illumination can lead to large visual interferences in leaf appearance, which can severely degrade the accuracy and robustness of disease classification [53], [54]. Our previous work [55] also suffered severely from these interferences.
Illustration of the difficulty of fine-grained image classification of crop diseases: Large intra-category discriminations and small inter-category discriminations. (a) The sub-categories and basic-level categories of wheat leaf diseases images. (b) The sub-categories and basic-level categories of tea leaf diseases images.
Owing to the adequacy of collected samples and its research value in both theory and practice, our work focuses on the fine-grained image classification of winter wheat leaf diseases. Traditionally, classification of wheat leaf diseases has been dominated by delicate hand-crafted features [56]–[61], e.g., HOG, SIFT, SURF, and LBP. However, designing these descriptors is typically time-consuming, and their performance is unsatisfactory [23]. Thus, researchers have attempted to tackle these problems with artificial intelligence methods. For instance, Zhao et al. [62] proposed an optimized MSF-AdaBoost model to classify and monitor powdery mildew on winter wheat at a regional scale, achieving high classification accuracy and promising monitoring performance; Tian et al. [63] presented an SVM-based Multiple Classifier System (MCS) for pattern recognition of wheat leaf diseases, which achieved a better recognition rate than previous classifiers; Niu et al. [64] proposed a modified K-means clustering for efficient identification of wheat leaf diseases, with better performance on three common diseases (powdery mildew, leaf rust, and stripe rust); Yang et al. [65] presented a diagnosis model of stripe rust at field scales based on a Bayesian network, which provided technical support for accurate identification and short-term prediction of stripe rust on a small scale.
The above methods all concern shallow learning. Although some progress has been made, there is still room and potential for improvement. The extraction of hand-crafted features, such as inertia moment, roundness, and entropy, largely relies on prior knowledge; thus, the extracted features are often inadequate and lacking in detail [55]. Furthermore, while shallow-level features can be extracted effortlessly, abstract representations hidden in the deeper levels are difficult to obtain without learning procedures [55].
Moreover, the main challenge for fine-grained image classification of wheat leaf diseases indubitably lies in the very small discrepancies among different categories. Specifically, the difficulties mainly come from three aspects:
the strong similarity among different disease spots.
the large visual interferences of the cluttered environments.
the large search space of possible disease spot positions.
Consequently, an effective classification model is required to accurately extract subtle features from domain regions of high objectness, i.e., regions containing the vital discriminative cues of certain objects. Driven by this requirement, an improved convolutional neural network codenamed matrix-based convolutional neural network (M-bCNN) is proposed in this work.
CNN is a variant of the multilayer perceptron (MLP) [66] inspired by the Hubel-Wiesel biological vision system. It can adaptively construct implicit feature descriptions through multi-layer non-linear mapping driven by training data [67]. During the last seven years, largely due to the state-of-the-art performance of CNN, image classification and related fields have progressed at a dramatic pace. In 2012, Krizhevsky et al. [68] won two first prizes (in two separate tasks) with the AlexNet model in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [69], where its accuracy exceeded that of the second-placed competitor by 10%. In 2014, two milestones in face classification were achieved when Taigman et al. [70] and Ouyang et al. [71] proposed the CNN-based DeepFace and DeepID, which proved to be among the most accurate authentication models for face classification. Two other CNN frameworks built on the design concept of "going deeper" became the champion (GoogLeNet [72]) and runner-up (Visual Geometry Group (VGG) [73]) of ILSVRC-2014. The following year, ResNet [74], designed by Microsoft Research Asia (MSRA), won the championship of ILSVRC-2015. Its Top-5 error rate (3.57%) was lower than the human classification error rate (5.1%), illustrating that its object recognition ability surpassed that of the human eye. In 2016, DeepMind, a subsidiary of Google, applied CNN in the intelligent agent AlphaGo [75], which defeated Go champion Lee Sedol. Sharing the congenital advantages of CNN (though studies of CNN for the fine-grained image classification of wheat leaf diseases are scarce), we utilize it as the theoretical basis of this work.
However, we needed to enhance the representational ability of CNN to better tackle fine-grained image classification tasks. Most representative CNN models gain accuracy through stacked layers [76], [77]. From 2012 to 2015, all the leading works [68], [72], [74] in the challenging ILSVRC [69] exploited "extremely deep" models, with depths of 8 [68] to 152 [74] layers. He et al. [74] even utilized a ResNet with 1202 layers to analyze CIFAR-10. Recent evidence reveals that increasing the number of hidden layers is essential to the success of current state-of-the-art convolutional networks [73], [74], [78], [79]. Previous studies show that most deep neural networks follow a standard structure originating from LeNet-5: linearly stacked convolutional layers optionally followed by one or more subsampling layers and fully-connected layers. These "plain" nets that simply stack layers [74] prevail in the image classification literature and have produced impressive results on CIFAR-10, MNIST, and other classification tasks. Nevertheless, this design is less efficient at improving the representational ability of networks than parallel structures, as concretely embodied in the growth rates of parallelism level, data streams, scheduling efficiency, neurons, link channels, etc. [72], [74]. Moreover, linearly chaining several convolutional layers results in a quadratic increase in parameters and computational budget. More severe over-fitting and the curse of dimensionality then follow, leading to serious accuracy issues [72], [74], [79].
The depth of model representations is also imperative for distinguishing fine-grained visual categories. In order to find an effective method to increase representational ability without obvious side-effects, an improved hierarchical CNN denoted M-bCNN was proposed, and its gratifying performance was evaluated in the fine-grained image classification of wheat leaf diseases.
In order to exploit the discriminative feature representational ability of CNN, our model originates from a fundamental plain network (i.e. AlexNet [68]). First, we add two convolutional layers to the first two low layers (i.e. conv_1 and conv_2) for extracting the global features of images. Then the next three linearly stacked higher convolutional layers (i.e. conv_3, conv_4, and conv_5) are replaced by three convolutional kernel matrixes.
Due to the model's large number of parameters, training from scratch directly on the wheat leaf disease image set would result in over-fitting. Therefore, M-bCNN is first pre-trained on the ImageNet database [69]. Starting from the pre-trained weights, we fine-tuned the model to adapt to the fine-grained images. It has been proven, both theoretically and practically, that the convolutional kernel matrix effectively increases the number of data streams, neurons, and link channels while inhibiting parameter growth. In addition, tiny features can be extracted by minitype convolution filters arranged in a matrix. Moreover, these filters can combine freely with each other, owing to the fully-connected mode, generating different feature maps and thus improving the flexibility and characterization ability of the model. The convolutional kernel matrix therefore caters to the strict requirements of enhancing representational ability while suppressing time complexity, which is meaningful for the fine-grained image classification of crop diseases. It allows M-bCNN to acquire accuracy gains from increased hidden layers without an obvious accuracy loss penalty, producing results that compete strongly against AlexNet and VGG-16.
The main novelties and contributions of this research are summarized in three points:
To the best of our knowledge, this is the first work to use an improved CNN for the fine-grained image classification of wheat leaf diseases. A novel hybrid CNN structure codenamed M-bCNN is proposed, which significantly increases the data streams, neurons, and link channels. The effectiveness of its hallmark, the convolutional kernel matrix, is proven both theoretically and practically.
As there is no large-scale publicly available image set of wheat leaf diseases at present, a total of 16,652 high-fidelity winter wheat leaf images covering eight categories were collected from several locations in Shandong Province, China. Moreover, a total of 83,260 augmented images were produced by five augmentation methods. To our knowledge, this is the first large-scale high-resolution image set of winter wheat leaf diseases. We intend to open-source this set once it grows richer in both quantity and species.
We have undertaken a significant amount of work on the image set. Our proposed model achieves higher validation accuracy, individual classification rate, precision, recall, and F1-score improvement with a tolerable parameter addition.
The remainder of this paper is organized as follows. Section II illustrates the methodology of the proposed M-bCNN and the effectiveness of the convolutional kernel matrix. Section III presents the construction of the original and augmented image sets of wheat leaf diseases. The details of the experimental process and results are covered in Section IV. A detailed discussion and analysis is elaborated in Section V. Finally, Section VI provides concluding remarks and suggestions for future work.
Matrix-Based Convolutional Neural Network
A. Overview of M-bCNN
In an attempt to leverage the success of CNN in object classification, the M-bCNN for fine-grained image classification of wheat leaf diseases is proposed. This section details the novelty of our method. It describes the new hierarchical M-bCNN architecture that integrates the proposed convolutional kernel matrix with other tricks, such as ELU, LRN, and DropConnect. The convolutional kernel matrix aims at increasing the model's representational ability, so as to learn domain-specific discrimination for fine-grained classification, whilst suppressing the parameter growth rate. The model depicted in Fig. 2 is called M-bCNN-CKM-3 for its three convolutional kernel matrixes.
The overview of the proposed M-bCNN-CKM-3. Layers of original CNN are shown in green and blue. Convolutional kernel matrixes that we propose are in purple.
As shown in Fig. 2, M-bCNN-CKM-3 mainly contains four convolutional layers and three convolutional kernel matrixes, together with the subsampling and fully-connected layers described below.
1) DropConnect
Fig. 3 reveals that the input pixel vector $x_{i}$ is element-wise masked by a binary matrix $M_{i}$ drawn from a Bernoulli distribution with keep probability $p$, as shown in Eq. (1):\begin{equation*} \begin{cases} x_{i}' = x_{i} \ast M_{i} \\ M_{i} \sim Bernoulli(p) \end{cases} \tag{1}\end{equation*}
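As a minimal sketch of Eq. (1), the Bernoulli masking can be written in NumPy as follows; the function and variable names are ours for illustration, not from the paper's implementation:

```python
import numpy as np

def dropconnect_mask(x, p=0.5, rng=None):
    """Apply a DropConnect-style Bernoulli mask to input x, per Eq. (1)."""
    rng = rng or np.random.default_rng()
    m = rng.binomial(n=1, p=p, size=x.shape)  # M_i ~ Bernoulli(p)
    return x * m                              # x_i * M_i

# Example: mask a 4-element input pixel vector with keep probability 0.8
x = np.array([0.2, 0.5, 0.9, 0.1])
print(dropconnect_mask(x, p=0.8))
```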
2) Exponential Linear Unit (ELU)
Next, the processed feature maps are calculated by convolution filters. In order to inhibit vanishing gradient and increase the model's convergence rate, ELU is utilized as the activation function in the convolutional layers, convolutional kernel matrixes, and subsampling layers. Suppose the input signals are denoted as $x_{i}$, the weights as $w_{i}$, and the bias as $b$; the activated output $y_{i}'$ is then given by Eq. (2):\begin{equation*} \begin{cases} y_{i}' = f(y_{i}) = f\left({\displaystyle \sum \limits _{i=1}^{n} {x_{i} w_{i}^{T} +b} }\right) \\ f=ELU(y_{i})=\begin{cases} y_{i} &\quad \text {if } y_{i} \ge 0 \\ \alpha (e^{y_{i}}-1)&\quad \text {if } y_{i} < 0 \\ \end{cases} \\ \end{cases} \tag{2}\end{equation*}
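A small NumPy sketch of the ELU mapping in Eq. (2), with a toy pre-activation; the helper name and the values are illustrative only:

```python
import numpy as np

def elu(y, alpha=1.0):
    # Eq. (2): identity for y >= 0, alpha*(exp(y)-1) for y < 0
    return np.where(y >= 0, y, alpha * np.expm1(np.minimum(y, 0.0)))

# Pre-activation y = sum_i x_i * w_i + b, then the ELU non-linearity
x, w, b = np.array([0.3, -0.7]), np.array([0.5, 0.4]), 0.1
print(elu(np.array([x @ w + b, -2.0, 2.0])))
```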
3) Local Response Normalization (LRN)
After the non-linear mapping of ELU, we employ the within-channel normalization of LRN for better generalization ability, with the local region extended within each independent channel. The received signal is normalized as shown in Eq. (3):\begin{equation*} y_{i}'' = y_{i}' \Big/ \left({k+\left({\frac {\alpha }{n}}\right)\sum \limits _{j=\max (0,i-n/2)}^{\min (N-1,i+n/2)} {(y_{j}')^{2}} }\right)^{\beta }\tag{3}\end{equation*} where $n$ is the size of the local region, $N$ is the total number of positions, and $k$, $\alpha$, and $\beta$ are hyper-parameters.
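The normalization of Eq. (3) can be sketched directly as below; the hyper-parameter values shown are AlexNet's common defaults, not necessarily those used in M-bCNN:

```python
import numpy as np

def lrn(y, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Eq. (3): divide each activation by a power of the local
    sum of squares over its n neighbouring positions."""
    N = y.shape[0]
    out = np.empty_like(y)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + (alpha / n) * np.sum(y[lo:hi + 1] ** 2)) ** beta
        out[i] = y[i] / denom
    return out

print(lrn(np.linspace(0.0, 1.0, 8)))
```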
Finally, M-bCNN ends with an eight-way fully-connected layer with the Softmax [81], [82] function:\begin{equation*} S_{i} =\frac {e^{V_{i}}}{\sum \limits _{j=1}^{K} {e^{V_{j}}}}\tag{4}\end{equation*} where $V_{i}$ is the $i$-th input to Softmax and $K$ is the number of categories (here, $K = 8$).
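A numerically stable sketch of Eq. (4) for the eight-way output; the logits below are fabricated for illustration:

```python
import numpy as np

def softmax(v):
    # Eq. (4); shifting by max(v) avoids overflow without changing S_i
    e = np.exp(v - np.max(v))
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1, -1.2, 0.0, 1.1, -0.3, 0.7])
probs = softmax(logits)
print(probs.argmax(), round(probs.sum(), 6))  # predicted class, sums to 1
```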
B. Convolutional Kernel Matrix
In this section, we demonstrate the positive effect of the convolutional kernel matrix on enhancing representational ability and inhibiting parameter growth. The commonly-used plain nets and the proposed convolutional kernel matrix are compared in terms of their structures, data streams, neurons, link channels, and training parameters, since these largely reflect the performance of a neural network from a mathematical point of view.
1) Schema PlainNet-2 and Schema CKM-2
In Fig. 4, we hypothesize that the size of input image is
The structure of Schema PlainNet-2 and Schema CKM-2. Left: a 2-layer plain network as a reference. Right: a 2×2 convolutional kernel matrix (Schema CKM-2).
The number of neurons, link channels, and training parameters of the two schemas are shown in Table 2.
Table 1 reveals that the number of data streams in Schema CKM-2 is four times that of Schema PlainNet-2, which provides more pipelines for feature integration. Accordingly, Table 2 shows that the number of link channels in Schema CKM-2 is four times that of Schema PlainNet-2, which brings more non-linear mappings for feature extraction. In addition, the number of neurons in Schema CKM-2 is two times that of Schema PlainNet-2, implying a stronger feature extraction ability. The increases in neurons and link channels are both meaningful for boosting the model's representational ability. Meanwhile, the number of training parameters also increases with the added layers, but not enough to cause a serious computational burden.
2) Schema PlainNet-3 and Schema CKM-3
In Fig. 5, linearly stacked convolutional layers,
The structure of Schema PlainNet-3 and Schema CKM-3. Left: a 3-layer plain network as a reference. Right: a 3×3 convolutional kernel matrix (Schema CKM-3).
The number of neurons, link channels, and training parameters of two schemas are shown in Table 4.
In Table 3 and Table 4, the numbers of data streams and link channels in Schema CKM-3 are both 27 times those of Schema PlainNet-3, a distinct improvement over Schema CKM-2. The number of neurons in Schema CKM-3 is three times that of Schema PlainNet-3. Moreover, the number of training parameters is also tripled. It can be seen that the improvement in neurons and link channels is more significant than in Schema CKM-2, while the number of training parameters remains within an acceptable range.
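For quick reference, the multipliers reported above follow a simple pattern; the toy calculator below assumes this pattern generalizes as the text suggests (data streams and link channels grow as n^n, neurons and parameters roughly n-fold), which should be checked against Tables 1-4:

```python
# Assumed pattern: an n-by-n convolutional kernel matrix multiplies
# data streams and link channels by n**n versus an n-layer plain net,
# while neurons and training parameters grow only about n-fold.
def ckm_multipliers(n):
    return {"data_streams": n ** n, "link_channels": n ** n,
            "neurons": n, "parameters": n}

for n in (2, 3):
    print(f"CKM-{n} vs PlainNet-{n}:", ckm_multipliers(n))
# CKM-2: 4x streams, 2x neurons; CKM-3: 27x streams, 3x neurons
```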
3) Schema PlainNet-n and Schema CKM-n
With the improvement of hardware, the implementation of convolutional kernel matrixes of larger size may become feasible, with a structure like Schema CKM-n in Fig. 6. It is composed of
The structure of Schema PlainNet-n and Schema CKM-n. Left: an n-layer plain network as a reference. Right: an n×n convolutional kernel matrix (Schema CKM-n).
The number of neurons, link channels, and training parameters in Schema PlainNet-n and Schema CKM-n are calculated by Eq. (5) to Eq. (10), where
The number of neurons, link channels, and training parameters in Schema PlainNet-n. (a) The number of neurons in Schema PlainNet-n. (b) The number of link channels in Schema PlainNet-n. (c) The number of training parameters in Schema PlainNet-n.
The number of neurons, link channels, and training parameters in Schema CKM-n. (a) The number of neurons in Schema CKM-n. (b) The number of link channels in Schema CKM-n. (c) The number of training parameters in Schema CKM-n.
It can be seen from Eqs. (5), (6), (8), and (9) that the numbers of neurons and link channels in Schema CKM-n are
The time complexity of one convolutional kernel matrix can be calculated by Eq. (11):\begin{equation*} \begin{cases} \text {Time}\sim O\left({\displaystyle \sum \limits _{i=1}^{N} {NM_{i}^{2} \cdot NK_{i}^{2} \cdot N^{N}C_{i-1} \cdot N^{N}C_{i}} }\right) \\ M=(X-K+2\ast Padding)/Stride+1 \\ C_{i} =(K^{2}+1)F(X-iK+1)^{2} \\ N\in \{x\mid x\ge 2,x\in \mathbb {Z}\} \\ \end{cases} \tag{11}\end{equation*}
In conclusion, the above three comparisons (Sections 1-3) demonstrate that the convolutional kernel matrix provides significantly better performance than plain networks. The matrix structure achieves a substantial increase of data streams, neurons, and link channels at a tolerable increase of computational requirements and an affordable addition of parameters. This way, the curse of dimensionality does not appear within a proper matrix size. Sharing the above advantages, M-bCNN can readily reap accuracy gains from considerably increased depth, producing results substantially better than plain networks.
Data Description
In this work, winter wheat leaf disease images were utilized as the experimental samples for fine-grained classification because of the strong similarity among subordinate categories in some cases. At present, no large-scale image set of wheat leaf diseases is publicly available, so 16,652 high-fidelity images were collected from several wheat planting areas of Shandong Province and assigned as the original image set. Then an augmented database containing 83,260 images was constructed by five augmentation methods. The augmented and original image sets were utilized as the training and testing samples, respectively. To the best of our knowledge, this is the first available large-scale high-resolution image set of winter wheat leaf diseases.
A. Image Acquisition
From the wheat planting bases of Shandong Province, China, 16,652 winter wheat leaf images covering eight categories were collected in the field using a Canon EOS 80D camera. They were acquired between 8:00 a.m. and 5:00 p.m., with the camera three to seven cm from the leaf. The image format was JPEG, and each image was a 24-bit color bitmap. Each image included only one disease and was assigned to its ground-truth category by plant protection experts. The original image set was utilized as the testing sample. See Table 6 for randomly selected samples.
From Table 6 we can observe that images of some sub-categories within one common basic-level category are strongly similar. For example, the images of Powdery Mildew and Cochliobolus Heterostrophus resemble each other, and they belong to the same basic category of Spore Parasitism. It is therefore generally much harder to classify each image into its true category, which makes this a meaningful dataset for evaluating fine-grained classification models. The number and proportion of each category are shown in Table 7.
B. Image Augmentation
Adequately labeled samples can reduce under-fitting in the model training process [83]. In order to increase the number and diversity of the original images and endow the extracted features with robustness to rotation, translation, scaling, etc., an augmented image set was constructed through noise addition [84], color jittering [68], PCA jittering [68], rotation blur [55], and scaling blur [85], chosen for their implementation simplicity and the satisfactory performance demonstrated in previous research [23], [52], [72], [73]. These methods simulate the interferences of noise, illumination fluctuation, and object jitter that are frequently encountered during acquisition and in practical application scenarios. See Table 8 for details and Table 9 for the processed images corresponding to each method.
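As an illustration, two of the five methods (noise addition, and a plain rotation standing in for rotation blur) might be sketched as below with NumPy and OpenCV; the file name and parameter values are hypothetical, not the paper's:

```python
import numpy as np
import cv2

def add_gaussian_noise(img, sigma=10.0, rng=None):
    # Noise addition: zero-mean Gaussian noise simulates sensor noise
    rng = rng or np.random.default_rng()
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def rotate(img, angle_deg=15.0):
    # Rotation about the image centre simulates leaf/equipment jitter
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(img, m, (w, h))

img = cv2.imread("wheat_leaf.jpg")  # hypothetical sample image
augmented = [add_gaussian_noise(img), rotate(img, 15.0), rotate(img, -15.0)]
```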
Finally, an augmented dataset of 83,260 images was obtained, which was sufficient for the model's convergence. This image set was utilized as the training sample; the number and proportion of each category are shown in Table 10.
Experiments
A. Experimental Environments
Training a deep CNN through a large number of iterations relies heavily on high-performance graphics processing units (GPUs). We ran the experiments using multiple NVIDIA GeForce GTX 1080 graphics cards, whose basic configuration is shown in Table 11.
The computer was an HP EliteDesk 880 G2 TWR with an Intel(R) Core i7 6700K (3.40 GHz) processor and 16 GB of memory. The operating system was Ubuntu 16.04.1 (64-bit). The M-bCNN was implemented in TensorFlow, an open-source machine learning framework for numerical computation developed by the Google Brain Team, with Python as the programming language interfacing with the TensorFlow core.
B. Evaluation Metric
The following metrics are considered to evaluate the models. First and foremost, accuracy is widely used for target classification and recognition. However, the performance of a model cannot be sufficiently illustrated by accuracy alone. We therefore also use the confusion matrix, known as the error matrix in supervised learning, which depicts the actual and predicted categories in its rows and columns, respectively. Moreover, the precision, recall, and F1-scores across individual categories are utilized to evaluate the classifier models directly.
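These metrics can be computed directly, for example with scikit-learn; the labels below are fabricated solely to show the calls:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical ground-truth and predicted labels for the 8 categories
y_true = np.array([0, 1, 2, 2, 3, 4, 5, 6, 7, 7])
y_pred = np.array([0, 1, 2, 3, 3, 4, 5, 6, 7, 6])

print("accuracy:", accuracy_score(y_true, y_pred))
print("confusion matrix (rows: true, cols: predicted):")
print(confusion_matrix(y_true, y_pred, labels=list(range(8))))
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=list(range(8)), zero_division=0)
print("precision:", p, "recall:", r, "F1:", f1)
```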
C. Model Training
Before the training starts, 70% of the images in each category are randomly selected as training samples and the remaining 30% are utilized as validation samples. Since the augmented dataset is already balanced, this sampling method can ensure the inter-class balance.
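A minimal sketch of this per-category 70/30 split (the helper is ours, not from the paper's code):

```python
import numpy as np

def stratified_split(labels, train_frac=0.7, seed=0):
    """Randomly take 70% of each category for training and 30% for
    validation, preserving the inter-class balance."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        cut = int(round(train_frac * idx.size))
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(val_idx)

labels = np.repeat(np.arange(8), 10)   # toy: 10 images per category
tr, va = stratified_split(labels)
print(tr.size, va.size)                # 56 training, 24 validation
```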
1) Optimization Objective
Suppose there are $N$ training samples belonging to $c$ categories; denoting the actual output of the $n$-th sample for the $m$-th category as $y_{m}^{n}$ and the corresponding expected output as $EO_{m}^{n}$, the squared-error objective is defined in Eq. (12):\begin{equation*} E^{N}=\frac {1}{2}\sum \limits _{n=1}^{N} {\sum \limits _{m=1}^{c} {(y_{m}^{n} -EO_{m}^{n})^{2}}}\tag{12}\end{equation*}
2) Loss Function
The standard cross-entropy [86] is utilized as the loss function during the model training stage and is defined in Eq. (13):\begin{equation*} E=-\frac {1}{n}\sum \limits _{x} {[y\ln y'+(1-y)\ln (1-y')] }\tag{13}\end{equation*} where $n$ is the number of samples, $y$ is the target output, and $y'$ is the predicted output.
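A direct NumPy transcription of Eq. (13), with a clipping guard that we add to keep the logarithms finite:

```python
import numpy as np

def cross_entropy(y, y_pred, eps=1e-12):
    # Eq. (13); eps keeps the logarithms finite at y_pred = 0 or 1
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

print(cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))
```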
3) Regularization Term
In order to better resist over-fitting and vanishing gradient, L2 regularization is exploited, as shown in Eq. (14):\begin{equation*} L_{2} =\frac {1}{2n}\lambda \sum \limits _{w_{i}} {w_{i}^{2}}\tag{14}\end{equation*} where $\lambda$ is the regularization coefficient and $n$ is the number of samples.
4) Optimization Strategy
In pursuit of faster training speed, the strategy of "SGD + momentum" is utilized as the optimization algorithm; the objective and update rules are given in Eq. (15):\begin{equation*} \begin{cases} J(\theta)=\displaystyle \frac {1}{N}\sum \limits _{i=1}^{N} {\frac {1}{2}(y^{i}-h_{\theta } (x^{i}))^{2}} \\ v=\alpha v-\varepsilon \nabla _{\theta } J(\theta) \\ \theta =\theta +v \\ \end{cases} \tag{15}\end{equation*} where $\alpha$ is the momentum coefficient, $\varepsilon$ is the learning rate, and $v$ is the velocity term.
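A sketch of one update step of Eq. (15), folding the derivative of the L2 term of Eq. (14) into the gradient as weight decay; the names and values are illustrative:

```python
import numpy as np

def sgd_momentum_step(theta, grad, v, lr=0.01, momentum=0.9, l2=0.0):
    # Eq. (15): v <- alpha*v - epsilon*grad(J); theta <- theta + v.
    # l2 corresponds to lambda/n from Eq. (14), whose gradient is l2*theta.
    grad = grad + l2 * theta
    v = momentum * v - lr * grad
    return theta + v, v

theta, v = np.array([0.5, -0.3]), np.zeros(2)
grad = np.array([0.1, -0.2])              # hypothetical gradient of J
theta, v = sgd_momentum_step(theta, grad, v, lr=0.1, l2=1e-4)
print(theta, v)
```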
Finally, batch normalization (BN) [88] is adopted right after each convolutional layer.
5) Training Implementation
The structures of Schemas CKM-2 and CKM-3 are realized in the models M-bCNN-CKM-2 and M-bCNN-CKM-3, respectively, which are contrasted with two representative plain networks, AlexNet [68] and VGG-16 [73], for comparison studies. In the same experimental environment, M-bCNN-CKM-2 and M-bCNN-CKM-3 are first pre-trained on the ImageNet dataset [69] owing to their large number of parameters; then all four models are fine-tuned end-to-end for up to 100 epochs by SGD + momentum with back-propagation on the augmented image set, with a mini-batch size of 50. Fig. 9 (a), (b), and (c) depict the accuracy on the training and validation image sets throughout the whole procedure.
Fig. 9 (a)-(c) compare the training and validation accuracies of the four models. Fig. 9 (a) shows that M-bCNN-CKM-2 and M-bCNN-CKM-3 converged after about 50 training epochs. The results indicate that the two models have equivalent accuracy on the training image set, whereas the validation accuracy of M-bCNN-CKM-3 is better than that of M-bCNN-CKM-2. Based on these results, M-bCNN-CKM-2 was then compared with AlexNet and VGG-16, as shown in Fig. 9 (b) and (c). The results demonstrate that the training and validation accuracies of M-bCNN-CKM-2 are both higher than those of AlexNet and VGG-16, with only about ten more training epochs required to achieve convergence.
Accuracy on the training and validation image sets. (a) Training and validation accuracy curves of M-bCNN-CKM-2 and M-bCNN-CKM-3. (b) Training and validation accuracy curves of M-bCNN-CKM-2 and AlexNet. (c) Training and validation accuracy curves of M-bCNN-CKM-2 and VGG-16.
According to the results in Fig. 9 (a)-(c), the model that maximizes the accuracy on the validation image set is considered the best. Table 12 shows the training accuracy, validation accuracy, training epochs, and training time for each model. M-bCNN-CKM-3, which achieved the highest validation accuracy, is the best-performing model. At convergence, the highest validation accuracies of M-bCNN-CKM-2 and M-bCNN-CKM-3 are about 91.32% and 96.5%, respectively, which are clearly higher than those of AlexNet and VGG-16 (83.12% and 88.54%, respectively). M-bCNN-CKM-2 and M-bCNN-CKM-3 achieved higher validation accuracies of fine-grained classification for wheat leaf disease images while requiring only about four more hours to converge. This suggests that the convolutional kernel matrix is effective both in boosting representational ability and in suppressing parameter growth, and the training and validation accuracies do not suffer the penalty of the curse of dimensionality.
D. Feature Visualization
In order to gain a clearer understanding of how and why the models work, guided backpropagation and deconvolution [89] are both utilized to visualize the continually updating filters throughout the training stage. Over the experimental iterations, the visualization of some randomly selected filters in M-bCNN-CKM-3 is shown in Fig. 10.
We can view the above filters as learned feature descriptors encoding distinctive fusion structures. It is noticeable that, although each filter is independent in Conv1 and S1, the filters in CKM-35 change smoothly during the training stage. In this manner, they provide much richer and more meaningful domain-specific representations. In a sense, this further demonstrates that the convolutional kernel matrix is crucial for a model to learn better representations.
E. Model Testing
This section illustrates the performance verification of the models in fine-grained image classification experiments. We utilized the 16,652 original images as testing samples, then compared the trained M-bCNN-CKM-2 and M-bCNN-CKM-3 against AlexNet and VGG-16, evaluating them in terms of individual accuracy, precision, recall, F1-score, and overall accuracy.
In Fig. 11, four confusion matrixes, which compare the true category (ordinate) against the predicted category (abscissa), were calculated to describe the individual classification rate of each model. Note from the color distribution of the confusion matrixes that Normal Leaf and Bacterial Leaf Blight have high average classification rates (91.0% and 92.3%, respectively) across all models. The large green area of healthy leaves and the golden appearance of leaves infected with Bacterial Leaf Blight make them easier to distinguish from other sub-categories sharing one common basic-level category. Furthermore, M-bCNN-CKM-3 and AlexNet achieved the highest (89.6%) and the lowest (80.5%) average classification rates over the eight categories, as evident from Fig. 11 (b) and (c).
Confusion matrix of the testing results. (a) Illustration of individual classification rate by M-bCNN-CKM-2. (b) Illustration of individual classification rate by M-bCNN-CKM-3. (c) Illustration of individual classification rate by AlexNet. (d) Illustration of individual classification rate by VGG-16.
Fig. 12 and Table 13 depict the precision, recall, F1-score, and accuracy of the eight categories for the four models on the testing image set. In Fig. 12, we observe that the precision of Bacterial Leaf Blight (93.32%) is the highest, while the recall of Leaf Rust (91.68%) is the highest. Other sub-categories sharing one basic-level category are harder to distinguish because of their strong similarity. The average precision (90.15%) and recall (88.62%) of M-bCNN-CKM-3 are the highest among these models, and those of AlexNet are the lowest (69.83% precision and 64.71% recall). In Table 13, the average F1-score (85.5%) and accuracy (90.1%) of M-bCNN-CKM-3 are also the highest of the four models, and those of AlexNet are the lowest (51.75% F1-score and 72% accuracy). The F1-scores and accuracies indicate that M-bCNN-CKM-2 and M-bCNN-CKM-3 outperform the two plain networks in fine-grained image classification, with M-bCNN-CKM-3 being the top performer. Based on this testing assessment, the convolutional kernel matrix is meaningful for boosting representational ability compared with linearly stacked layers, and the accuracy penalty caused by the curse of dimensionality has not appeared. Additionally, from the comparison of M-bCNN-CKM-2 and M-bCNN-CKM-3 (see also Fig. 9 (a) and Table 12), we observe that this advantage becomes more significant as the matrix size increases. Sharing the advantages of the convolutional kernel matrix, M-bCNN can easily acquire accuracy gains from layers added in the form of a matrix.
Discussion
Recently, a number of studies have been conducted on fine-grained classification methods, most providing promising performance in certain fields. Inspired by the design conceptions of parallel networks (e.g., Part-based CNN [8], Two-level Attention CNN [16], MCNN [55], GoogLeNet [72], ResNet [74], and Hypercolumn CNN [90]), we proposed a novel hybrid CNN structure codenamed M-bCNN, which leverages convolutional kernel matrixes to effectively increase the data streams, neurons, and link channels. The matrix-based architecture played an important role, and the expected accuracy gains were delivered in the fine-grained image classification of wheat leaf diseases. The model's satisfying performance surpassed that of two representative plain networks, i.e., AlexNet [68] and VGG-16 [73]. The experimental results and conclusions are basically consistent with other studies on hierarchical models [8], [16], [55], [72], [74]. Parallelization and grading of neural networks is one of the development trends of deep learning, and our strategy might prove meaningful for other fine-grained tasks such as action or attribute categorization.
One potential issue with the proposed model concerns the training phase: the convolutional kernel matrix may introduce heavy computations when the iterations, training samples, and matrix size are very large. A computational burden penalty is a common phenomenon in parallel networks (e.g., GoogLeNet [72], Two-level Attention CNN [16], and Hypercolumn CNN [90]), which deliberately sacrifice some algorithmic efficiency for accuracy benefits. Consequently, it is important to research an effective pruning mechanism for model compression, so that limited computational resources can be allocated scientifically and rationally. There is still room and potential for performance to match or even surpass the M-bCNN. One direction of future work is to delve deeper into architecture optimization and integrate a modified pruning mechanism into the representational framework.
Conclusion
In this work, we have proposed a unified CNN model, denoted M-bCNN, based on the convolutional kernel matrix, for fine-grained image classification. The main advantage of the convolutional kernel matrix is significant gains in data streams, neurons, and link channels at a modest increase of computational requirements compared to plain networks. We described the methodology of our architecture and its positive effect on both enhancing representational ability and inhibiting parameter growth.
The experiments demonstrated that our model's promising performance competes against AlexNet and VGG-16 in the fine-grained image classification of wheat leaf diseases. Our approach yields solid evidence that the convolutional kernel matrix is a feasible and useful idea in general, providing a new path for the identification of crop diseases.
Future work will proceed along two directions. First, we will focus on optimizing the architecture and hyper-parameters of M-bCNN for other challenging fine-grained classification tasks. Second, we will try other models, such as generative adversarial networks (GANs) and regions with CNN (R-CNN), to deal with semantic segmentation, object detection, and open-set recognition.