Introduction
Tea diseases have a large influence on tea yield and quality [1]. In modern agricultural production, manual inspection remains the primary method of detecting tea infections, which places a substantial burden on tea garden managers: manual observation is time-consuming and labor-intensive, making it difficult to identify disease states quickly across large volumes of data, which in turn undermines the accuracy and reliability of identification. The rapid advancement of artificial intelligence has therefore spurred growing interest in more precise and efficient ways to diagnose tea diseases.
Deep learning-based image recognition algorithms are now widely used in agriculture [2], [3], [4], [5], and deep learning is progressively replacing classical machine learning for disease identification in this field. Whereas traditional machine learning typically relies on manually extracted disease features, which complicates feature extraction, deep learning learns high-level features from data incrementally and significantly improves the accuracy of crop disease diagnosis.
Although crop disease identification and detection techniques have achieved good results [6], [7], the timing and presentation of disease occurrence vary widely across geographic regions. Many existing models are not robust to such variation and do not transfer well, so we need to construct our own dataset and experimentally verify how effectively and realistically deep learning methods identify specific diseases. In addition, many recognition algorithms improve performance by increasing network depth, which not only invites gradient explosion or gradient vanishing but also degrades recognition speed and inflates model size. It is therefore necessary to design a high-performance disease recognition algorithm with a minimalist structure.
In this study, a tea disease recognition model with a minimalist architecture, VCRUNet, based on the channel reconstruction unit (CRU), is proposed for accurate recognition of tea diseases in complex backgrounds. The proposed network uses convolutional modules for image feature extraction, adds CRUs to improve disease feature localization in complex backgrounds, and combines global features to significantly improve the accuracy of tea disease recognition. The main contributions of this paper can be summarized in three aspects:
A tea disease dataset was constructed to address the shortage of tea disease samples with complex backgrounds. Meanwhile, data augmentation strategies were used to enhance the diversity of image features, simulating the lighting, angle, and occlusion conditions encountered in real scenes;
A tea disease recognition model, VCRUNet, is developed to address the challenge of recognizing tea diseases in complex backgrounds. It is built around a minimalist framework consisting of VanillaNet-13 and the CRU. In parallel, transfer learning is used to fine-tune the parameters of a pre-trained model on the self-constructed tea disease dataset, mitigating the overfitting caused by the limited sample size;
The proposed method achieves the best recognition accuracy, 92.48%, on the tea disease dataset, with a recognition speed of 4.5 seconds per 100 images, giving it a clear edge in recognition accuracy over other advanced algorithms.
Related Works
Many scholars have recently addressed agricultural disease problems with deep learning-based recognition systems. To limit the loss of feature information and expand the receptive field, Pan et al. [8] developed a dilated convolution-based feature fusion (DCFF) module that aggregates feature information from different convolutional layers in two phases, thereby extending the receptive field and providing a powerful feature representation capability; an InceptionC module was added to further improve performance. Jadhav et al. [9] presented an effective transfer learning-based strategy for identifying soybean diseases that used pre-trained AlexNet and GoogleNet convolutional neural networks (CNNs) to reach 98.75% and 96.25% accuracy, respectively. Simhadri et al. [10] applied transfer learning to rice leaf disease recognition, and experiments showed that the InceptionV3 model achieved the best performance. Zhao et al. [11] proposed a self-supervised Contrastive learning method for Leaf disease identification with domain Adaptation (CLA), which uses large-scale yet messy unlabeled data to train the encoder and obtain visual representations in the pre-training stage. Zhou et al. [12] developed a residual-distilled transformer architecture to address identification accuracy and model interpretability in rice leaf disease diagnosis: a distillation technique extracts weights and parameters from pre-trained vision transformer models, and the residual concatenation of the vision transformer and the distilled transformer, serving as residual blocks for feature extraction, is fed into a multi-layer perceptron (MLP) for prediction. Sanida et al. [13] used a VGGNet network based on transfer learning to increase the accuracy of tomato disease identification, outperforming other state-of-the-art approaches on the test set. Zhang et al. [14] proposed a Multi-channel Automatic Orientation Recurrent Attention Network (M-AORANet) to extract abundant disease features, addressing the noise easily introduced during acquisition and transmission of tomato disease images as well as the intra-class variability and inter-class similarity of leaf diseases; noise interference in the photos was also minimized using an asymptotic non-local means approach. For the detection of apple leaf diseases, Zheng et al. [15] developed an effective lightweight model (RepDI) based on structural reparameterization, which embeds a parallel expansion attention mechanism module and applies depth-separable convolution and structural reparameterization to achieve better detection performance. To improve the accuracy of strawberry disease identification, Li et al. [16] proposed the spatial convolutional self-attention-based transformer (SCSA-Transformer), which uses Multi-Head Self-Attention (MSA) to capture long-range feature dependencies in strawberry disease images. Chen et al. [17] proposed MS-DNet, a novel lightweight network architecture that achieves an average accuracy of 98.32% in identifying various crop diseases while remaining small and computationally fast. Gao et al. [18] suggested a backbone network, BAM-Net, based on an aggregate coordinate attention mechanism (ACAM) and a multi-scale feature refinement module (MFRM) to diagnose apple leaf diseases.
With continuous advances in deep learning for crop disease diagnosis and detection, research has gradually extended to many disease types across many crops, and researchers have applied these methods to a range of problems. Hu et al. [19] proposed a tea disease identification technique based on an enhanced deep convolutional neural network (CNN): a multi-scale feature extraction module was added to an improved CIFAR10-based deep CNN to enhance automatic feature extraction from tea disease photos, reaching an average recognition accuracy of 92.5%. In 2022, Hu et al. [20] proposed combining weight initialization and diseased-leaf segmentation techniques with the multi-convolutional neural network model MergeModel to automatically diagnose tea diseases from small samples; the results demonstrated that MergeModel could successfully discriminate between healthy and diseased tea leaves and recognize common tea diseases such as sooty mold, tea white scab, tea leaf blight, and tea red scab. Datta et al. [21] proposed using a deep CNN with several hidden layers to classify diseased tea leaves into distinct groups. To distinguish between three types of plant stress, Zhao et al. [22] presented a multistep method based on hyperspectral imaging and continuous wavelet analysis (CWA), and the experimental results demonstrated the effectiveness of hyperspectral imaging for plant phenotyping under pests and diseases.
Materials and Methods
A. Data Acquisition
1) Tea Disease Dataset
The experimental data were obtained from the Liu Bao Tea Trial Base of Wuzhou College, Wanxiu District, Wuzhou City, Guangxi Zhuang Autonomous Region, China, as shown in Figure 1. The data were collected in early to mid 2023, when tea tree diseases were at high incidence. Images of tea leaves were taken at different times and under different climatic conditions, which supports the study of how multiple factors affect the identification of tea diseases in complex backgrounds and, at the same time, enhances the robustness of the data.
The images were captured with a Nikon Z5 camera and a smartphone (iPhone 14 Pro). Images of tea leaf blight, tea brown leaf spot, tea red scab, and healthy tea leaves were collected, and 658 images were randomly selected for the recognition study. The images were saved in JPG format; some examples are shown in Figure 2. Because disease occurrence is uncertain, the collected image data are not evenly distributed across classes, as shown in Table 1.
2) Public Dataset
The public dataset Plant Village [23] was used for model pre-training, which contains a total of 54,305 images of 26 diseases on 14 plant species. An example of the Plant Village dataset images is shown in Figure 3.
3) Data Augmentation
To improve the accuracy and robustness of the model's classification, this study applies data augmentation to the collected samples. The main augmentation operations are as follows, with a minimal implementation sketch after the list:
Brightness adjustment: three different brightness levels were applied to simulate different sunlight conditions;
Rotation: images were rotated 90°, 180°, and 270° clockwise to simulate different shooting angles;
Horizontal flipping;
Noise injection: salt-and-pepper noise and Gaussian noise were added in turn to simulate different shot qualities;
Chroma adjustment;
Contrast adjustment;
Sharpness adjustment;
Random erasing [24]: random regions of the images were erased to simulate occlusions that occur during shooting.
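The following is a minimal sketch of such a pipeline using torchvision transforms. The specific parameter values (brightness/contrast/chroma ranges, noise strength, erasing scale) and the 224×224 input size are illustrative assumptions, not the settings used in this study.

```python
# Sketch of the augmentation strategy above; parameter values are assumptions.
import torch
from torchvision import transforms

def add_gaussian_noise(img_tensor, std=0.05):
    """Add zero-mean Gaussian noise to a [0, 1] image tensor.
    Salt-and-pepper noise would be a similar custom transform."""
    return (img_tensor + torch.randn_like(img_tensor) * std).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                       # assumed input size
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4),              # light/contrast/chroma
    transforms.RandomChoice([                            # 90/180/270 degree turns
        transforms.RandomRotation((90, 90)),
        transforms.RandomRotation((180, 180)),
        transforms.RandomRotation((270, 270)),
    ]),
    transforms.RandomHorizontalFlip(p=0.5),              # horizontal flip
    transforms.RandomAdjustSharpness(sharpness_factor=2, p=0.5),  # sharpness
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),               # Gaussian noise
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2)),  # random erasing [24]
])
```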
We divided the original data into training, validation, and test sets in a 7:1:2 ratio. The training set was then augmented using the strategy above; an example of the augmentation is shown in Figure 4. The same split strategy was used for the public dataset.
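A possible implementation of the 7:1:2 split is sketched below; the dataset path and random seed are hypothetical, and the authors' actual splitting code is not given.

```python
# Sketch of the 7:1:2 train/validation/test split described above.
import torch
from torch.utils.data import random_split
from torchvision import datasets

dataset = datasets.ImageFolder("tea_disease_dataset")  # hypothetical path
n = len(dataset)
n_train, n_val = int(0.7 * n), int(0.1 * n)
n_test = n - n_train - n_val
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42))
# The train_transform above would then be attached to train_set only,
# e.g. via a small wrapper dataset.
```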
B. Experimental Methodology
1) Vanilla Neural Architecture
With the development of artificial intelligence (AI) chip technology, factors such as the complexity and depth of a neural network's design have become major determinants of its inference speed. Huawei Noah's Ark Lab proposed a network architecture called VanillaNet [25], detailed in Figure 5. The architecture comprises a stem, a main body, and a fully connected layer. VanillaNet minimizes the number of layers by using only one convolutional layer per stage, producing a very simple network topology whose goals are to reduce computational complexity and speed up inference.
Figure 5 shows the VanillaNet architecture with 6 layers as an example. For the stem, a 4×4 convolution with stride 4 maps the 3-channel input image into the feature space; the main body then extracts features through successive single-layer stages separated by pooling, and the fully connected layer produces the final classification output.
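To make the design concrete, the following PyTorch sketch shows a VanillaNet-style network with a stride-4 stem and single-convolution stages. It omits the deep-training strategy and series-informed activations of the real VanillaNet [25], and the width of 128 channels is an illustrative assumption.

```python
# Simplified VanillaNet-style sketch: stride-4 stem + single-conv stages.
import torch.nn as nn

class VanillaStage(nn.Module):
    """One stage: a single 1x1 conv, BN, activation, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.act(self.bn(self.conv(x))))

class VanillaNetSketch(nn.Module):
    def __init__(self, num_classes=4, width=128):
        super().__init__()
        # Stem: 4x4 convolution with stride 4 maps the 3-channel image
        # into the feature space.
        self.stem = nn.Sequential(
            nn.Conv2d(3, width, kernel_size=4, stride=4),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.body = nn.Sequential(
            VanillaStage(width, width * 2),
            VanillaStage(width * 2, width * 4),
            VanillaStage(width * 4, width * 4))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width * 4, num_classes))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))
```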
2) Attention Mechanisms Module
The channel reconstruction unit (CRU) [26] is designed to make full use of the redundant information in feature channels. Its structure is shown in Figure 6, and its main idea is a three-step split-transform-fuse strategy.
In the split section, given an intermediate feature map X ∈ ℝ^{C×H×W}, the channels are split by a ratio α into an upper part X_up with αC channels and a lower part X_low with (1−α)C channels, after which a 1×1 convolution compresses the channel dimension of each part.
In the transform section, X_up is fed into the upper transform stage, which acts as a “rich feature extractor”: group-wise convolution (GWC, with group size g = 2 in the experiments) and point-wise convolution (PWC) extract high-level representative information. GWC reduces the number of parameters and computations but cuts off the information flow between channel groups.
PWC compensates for the loss of information and helps information flow between feature channels. The two outputs are then summed to form a merged representative feature map Y1. This stage is expressed as:\begin{equation*} Y_{1}=M^{G}X_{up}+M^{P_{1}}X_{up} \tag{1}\end{equation*}
In the lower transform stage, X_low is processed by a cheap point-wise convolution as a complement to the rich feature extractor, and the result is concatenated (∪) with the reused input X_low to form Y2:\begin{equation*} Y_{2}=M^{P_{2}}X_{low}\cup X_{low} \tag{2}\end{equation*}
In the fuse section, the output features Y1 and Y2 from the upper and lower transform stages are adaptively merged using the simplified SKNet method [27]. Global average pooling is first applied to collect the global spatial information S_m with channel statistics:\begin{align*} S_{m}=Pooling\left ({Y_{m} }\right)=\frac {1}{H\times W}\sum \nolimits _{i=1}^{H} \sum \nolimits _{j=1}^{W} {Y_{m}(i,j)},\quad m=1,2 \tag{3}\end{align*}
The channel-wise soft attention weights β1 and β2 are then obtained from the statistics S1 and S2 via a softmax operation:\begin{equation*}\beta _{1}=\frac {e^{S_{1}}}{e^{S_{1}}+e^{S_{2}}},\quad \beta _{2}=\frac {e^{S_{2}}}{e^{S_{1}}+e^{S_{2}}},\quad \beta _{1}+\beta _{2}=1 \tag{4}\end{equation*}
Finally, the upper feature Y1 and the lower feature Y2 are merged under these weights, Y = β1Y1 + β2Y2, yielding the channel-refined feature Y.
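The split-transform-fuse pipeline above can be sketched in PyTorch as follows. The split ratio α, squeeze ratio, and group size follow defaults suggested in [26] and our reading of Eqs. (1)-(4), so they should be treated as assumptions rather than the authors' exact configuration.

```python
# Sketch of the CRU split-transform-fuse pipeline (Eqs. (1)-(4) above).
import torch
import torch.nn as nn

class CRU(nn.Module):
    def __init__(self, channels, alpha=0.5, squeeze_ratio=2, groups=2):
        super().__init__()
        self.up_ch = int(alpha * channels)
        self.low_ch = channels - self.up_ch
        sq_up = self.up_ch // squeeze_ratio
        sq_low = self.low_ch // squeeze_ratio
        # Split: 1x1 convs squeeze the channel dimension of each part.
        self.squeeze_up = nn.Conv2d(self.up_ch, sq_up, 1, bias=False)
        self.squeeze_low = nn.Conv2d(self.low_ch, sq_low, 1, bias=False)
        # Upper branch: group-wise conv (GWC) + point-wise conv (PWC), Eq. (1).
        self.gwc = nn.Conv2d(sq_up, channels, 3, padding=1,
                             groups=groups, bias=False)
        self.pwc1 = nn.Conv2d(sq_up, channels, 1, bias=False)
        # Lower branch: cheap PWC, concatenated with the reused input, Eq. (2).
        self.pwc2 = nn.Conv2d(sq_low, channels - sq_low, 1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x_up, x_low = torch.split(x, [self.up_ch, self.low_ch], dim=1)
        x_up, x_low = self.squeeze_up(x_up), self.squeeze_low(x_low)
        y1 = self.gwc(x_up) + self.pwc1(x_up)              # Eq. (1)
        y2 = torch.cat([self.pwc2(x_low), x_low], dim=1)   # Eq. (2)
        # Fuse: channel statistics via global pooling (Eq. (3)), softmax over
        # the two branches (Eq. (4)), then the weighted sum Y = b1*Y1 + b2*Y2.
        s = torch.softmax(torch.stack(
            [self.pool(y1), self.pool(y2)], dim=0), dim=0)
        return s[0] * y1 + s[1] * y2
```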
3) Proposed Model Structure
In this study, a novel neural network model for tea disease recognition, VCRUNet, is proposed; Figure 7 illustrates its structure. To improve recognition accuracy, the model introduces the CRU attention module on top of the VanillaNet-13 backbone, strengthening the model's ability to extract tea disease features.
First, a batch of tea leaf disease images is fed into the network at a fixed input resolution. The stem convolution maps the images into the feature space, the VanillaNet-13 body extracts disease features stage by stage, the introduced CRU refines the channel features to sharpen the localization of disease regions, and the fully connected layer finally outputs the predicted disease class.
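Under the assumption that the CRU is inserted between the VanillaNet-13 backbone and the classifier (Figure 7 defines the authoritative layout), the assembly might look as follows, reusing the VanillaNetSketch and CRU classes from the sketches above:

```python
# Hedged sketch of VCRUNet: VanillaNet-style backbone + CRU + classifier.
import torch.nn as nn

class VCRUNetSketch(nn.Module):
    def __init__(self, num_classes=4, width=128):
        super().__init__()
        backbone = VanillaNetSketch(num_classes=num_classes, width=width)
        self.stem, self.body = backbone.stem, backbone.body
        self.cru = CRU(width * 4)          # channel refinement of backbone features
        self.head = backbone.head

    def forward(self, x):
        feats = self.body(self.stem(x))    # convolutional feature extraction
        feats = self.cru(feats)            # CRU strengthens disease-feature focus
        return self.head(feats)            # fully connected classification layer
```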
4) Tea Disease Recognition Model Based on Transfer Learning
Deep convolutional neural network models are larger and more accurate than traditional machine learning methods, but they usually require large amounts of training data. Solving tea disease identification with this approach presupposes a large tea disease dataset, and in practice this data requirement limits its application. Transfer learning can effectively improve model performance by reusing existing knowledge, reducing both the data requirement and the computational overhead, and it is widely used in agriculture [28], [29], [30], [31]. We therefore use transfer learning to apply knowledge learned in the source domain to the target domain: the network trained on the Plant Village dataset is transferred to the tea disease dataset, and its parameters and weights are reused by fine-tuning during training. The model transfer process is shown in Figure 8.
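A minimal sketch of this transfer process is shown below. The checkpoint file name, the assumed 38 Plant Village classes, and the optimizer choice are illustrative assumptions, not the authors' configuration.

```python
# Sketch: load a Plant Village pre-trained model, swap the head, fine-tune.
import torch
import torch.nn as nn

model = VCRUNetSketch(num_classes=38)   # 38 classes in the common Plant
                                        # Village release (an assumption here)
model.load_state_dict(torch.load("plant_village_pretrained.pth"))  # hypothetical file

# Replace the final fully connected layer for the 4 tea classes.
in_features = model.head[-1].in_features
model.head[-1] = nn.Linear(in_features, 4)

# Fine-tune all weights on the tea disease data; SGD is an assumed choice.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```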
Results
A. Experimental Design
In our experiments, all image processing and model training were implemented with Anaconda3 (Python 3.7.16) and PyTorch. The hardware environment consists of an Intel(R) Core(TM) i9-13900K with 128 GB of RAM and an NVIDIA GeForce RTX 4090 graphics card for model training and testing, as detailed in Table 2. The experiments mainly include comparisons across different network models, comparisons with our proposed model, model transfer, and ablation experiments between the public dataset and our own tea disease dataset.
Throughout model training, the number of training epochs was set to 100, and the batch size and learning rate were set to 64 and 0.01, respectively. Other detailed hyperparameters are shown in Table 3.
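A training loop using these stated settings might look like the sketch below; the optimizer and loss function are common choices assumed here (Table 3 holds the authors' full settings), and `model` and `train_set` refer to the earlier sketches.

```python
# Training loop sketch: 100 epochs, batch size 64, learning rate 0.01.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
# train_set is assumed to yield (tensor, label) pairs after augmentation.
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
model = model.to(device)
criterion = nn.CrossEntropyLoss()                      # assumed loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(100):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```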
The use of cutting-edge convolutional neural networks (CNNs) in computer vision applications has recently gained popularity in research. To evaluate the performance of our proposed method, we select several networks for experimental comparison: ResNet [32], MobileNetV2 [33], Swin-Transformer [34], Vision-Transformer [35], and VanillaNet. These models are widely used in disease identification, so their results provide a representative and convincing baseline. Each model was tested after 100 epochs of training, and the results were evaluated and compared with the method proposed in this study.
B. Evaluation Metrics
To assess the network's performance in recognizing tea diseases, accuracy, precision, recall, and F1-score were used. The formulas for these measures are given below:\begin{align*}Accuracy&=\frac {TP+TN}{TP+FP+TN+FN} \tag{5}\\ Precision&=\frac {TP}{TP+FP} \tag{6}\\ Recall&=\frac {TP}{TP+FN} \tag{7}\\ F1\text{-}Score&=2\times \frac {Precision\times Recall}{Precision+Recall} \tag{8}\end{align*}
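These metrics can be computed from the test-set predictions with scikit-learn as sketched below; macro averaging over the four classes is an assumption, since the paper does not state the averaging mode.

```python
# Computing Eqs. (5)-(8) from predicted and true labels with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Return the four evaluation metrics; macro averaging is assumed."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```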
C. Model Classification Results and Performance Analysis
The tea disease test set was used to evaluate the performance of all the models, and Table 4 shows the recognition accuracy of each. The VCRUNet proposed in this study performs best, with a recognition accuracy of 92.48%. VCRUNet adds the channel attention mechanism CRU to the VanillaNet-13 model, which improves the network's feature extraction ability and yields the best recognition results. The experimental results show an improvement in accuracy of 1.5% over ResNet18, ResNet50, and Swin-Transformer-Tiny, 2.25% over MobileNetV2 and Swin-Transformer-Small, and 0.75% and 3.76% over Vision-Transformer and VanillaNet-6, respectively.
Therefore, based on the experimental results, we chose VanillaNet-13, the baseline with the highest accuracy, as the basic framework for improvement, and compared the recognition accuracy of the different models on tea diseases in complex backgrounds after introducing the CRU attention mechanism. The comparison shows that the model can better extract the disease features in the images and enhance the feature expression ability.
Compared with the traditional CNN networks, the proposed network achieves the best results on all measures. The recall of the VCRUNet model was 92.37%: up 2.15% over ResNet18 and ResNet50, 2% over MobileNetV2, 2.17% over Swin-Transformer-Small, 1.79% over Swin-Transformer-Tiny, 0.78% over Vision-Transformer, 3.96% over VanillaNet-6, and 0.21% over VanillaNet-13. The precision of the VCRUNet model was 92.27%: up 0.45% over ResNet18, 1.09% over ResNet50, 0.75% over MobileNetV2, 2.6% over Swin-Transformer-Small, 1.78% over Swin-Transformer-Tiny, 0.57% over Vision-Transformer, 4.52% over VanillaNet-6, and 0.3% over VanillaNet-13.
The F1-score of the VCRUNet model was 92.32%, up 0.26% over the best-performing baseline, VanillaNet-13. These data show that our method leads on every model evaluation metric, giving it a clear advantage in identifying tea diseases.
The recognition time of the VCRUNet model is 4.5 s per 100 images, 2.16 s slower than the fastest model but 3.76% more accurate; its recognition time is otherwise close to that of the other models.
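One way to reproduce such a "seconds per 100 images" measurement is sketched below; single-image inference and CUDA synchronization are our assumptions about the measurement protocol.

```python
# Timing sketch: wall-clock inference time over 100 images.
import time
import torch

@torch.no_grad()
def time_per_100_images(model, images, device="cuda"):
    """images: tensor of shape (100, 3, H, W)."""
    model.eval().to(device)
    images = images.to(device)
    if device == "cuda":
        torch.cuda.synchronize()       # ensure accurate GPU timing
    start = time.perf_counter()
    for img in images:                 # one image at a time, as in deployment
        model(img.unsqueeze(0))
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start
```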
Our model was first pre-trained on the public Plant Village dataset; the validation accuracy and loss curves of the different network models on Plant Village are shown in Figures 9(a) and 9(b), and those on the tea disease dataset in Figures 9(c) and 9(d). The experimental results show that the proposed VCRUNet converges fastest, and combined with the other evaluation metrics, the proposed network structure shows a clear advantage.
Diagrams of model recognition accuracy and loss value for different models. (a) Accuracy change curves for the Plant-Village dataset; (b) Loss change curves for the Plant-Village dataset; (c) Accuracy change curves for the tea diseases dataset; (d) Loss change curves for the tea diseases dataset.
The confusion matrix is one way to evaluate the effectiveness of classification models. To verify the performance improvement that transfer learning brings to the models in this study, confusion matrices were produced for the different network models on the tea disease test set; the horizontal axis indicates the true labels and the vertical axis the predicted labels. Experiments were carried out on the test set covering tea leaf blight, tea brown leaf spot, tea red scab, and healthy tea leaves, and the resulting confusion matrices are shown in Figure 10, where (a)-(h) show the comparison models and (i) shows VCRUNet. Our proposed method classifies each tea disease class more accurately.
Confusion matrix of different models in the tea diseases test set. (a) ResNet18; (b) ResNet50; (c) MobileNetV2; (d) Swin-Transformer-Small; (e) Swin-Transformer-Tiny; (f) Vision-Transformer; (g) VanillaNet-6; (h) VanillaNet-13; (i) VCRUNet. D1 is tea leaf blight; D2 is tea brown leaf spot; D3 is tea red scab; D4 is healthy tea leaf.
Receiver operating characteristic (ROC) curves give a clearer assessment than the numerical results alone. The ROC curves for all models are shown in Figure 11. Analysis of the micro-average ROC curves shows that VCRUNet has the largest area under the ROC curve (AUC), indicating that combining the VanillaNet model structure with the CRU attention module helps in classifying tea diseases.
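The micro-average ROC curve can be computed as sketched below, where `probs` denotes the softmax outputs over the four classes; this mirrors the standard scikit-learn recipe rather than the authors' exact plotting code.

```python
# Micro-average ROC/AUC sketch for a 4-class problem.
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def micro_roc(y_true, probs, n_classes=4):
    """y_true: (N,) integer labels; probs: (N, n_classes) softmax scores."""
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    fpr, tpr, _ = roc_curve(y_bin.ravel(), probs.ravel())  # micro averaging
    return fpr, tpr, auc(fpr, tpr)
```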
ROC graphs for different models. (a) ResNet18; (b) ResNet50; (c) MobileNetV2; (d) Swin-Transformer-Small; (e) Swin-Transformer-Tiny; (f) Vision-Transformer; (g) VanillaNet-6; (h) VanillaNet-13; (i) VCRUNet.
D. Model Feature Extraction Visualization Results
To better investigate the interpretability of VCRUNet, we used Grad-CAM [36] to visualize its class activation maps (CAMs), as shown in Figure 12. In the heat maps, the larger the red area, the more attention the model pays to that part of the image; blue regions indicate likely redundancy. In the CAMs, all activation regions of the VCRUNet model fall on the diseased region of the leaf, with more accurate region coverage than the other models. The figure shows that the proposed VCRUNet recognizes tea diseases well and that the added CRU structure clearly improves model performance.
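For reference, a compact Grad-CAM implementation using forward and backward hooks is sketched below; the choice of target layer (here the CRU output) is our assumption, and [36] defines the method itself.

```python
# Compact Grad-CAM sketch (cf. [36]) using module hooks.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """image: (3, H, W) tensor; returns a normalized (H, W) heat map."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    logits = model(image.unsqueeze(0))
    idx = class_idx if class_idx is not None else logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, idx].backward()          # gradients of the target class score
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # channel importances
    cam = F.relu((weights * feats["a"].detach()).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()

# Usage sketch: heat = grad_cam(model, image, model.cru)
```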
CAM for different models. (a) Original figure; (b) ResNet18; (c) ResNet50; (d) MobileNetV2; (e) Swin-Transformer-Small; (f) Swin-Transformer-Tiny; (g) Vision-Transformer; (h) VanillaNet-6; (i) VanillaNet-13; (j) VCRUNet.
Discussion
The complex backgrounds of tea disease images collected in natural environments, together with the lack of large-scale datasets, lead to low recognition accuracy for many models. In this study, a minimalist neural network model, VCRUNet, which combines the minimalist network VanillaNet with a CRU, is proposed as an effective solution for recognizing tea diseases in complex backgrounds. To cope with the small amount of data, data augmentation was first used to expand the disease dataset, and transfer learning from a public dataset was then used to obtain the best recognition results. However, scarce data remains a common problem in disease identification and detection: disease occurrence is spread over time, and data collection is difficult, so a publicly available large-scale tea disease dataset still needs to be constructed. In future work, data on common tea diseases will be collected and expanded.
Secondly, to further improve the practicality of VCRUNet, we will reduce the number of model parameters and computations in future work. In addition, during periods of high disease incidence in tea plantations, grading the degree of disease is an urgent issue, so after collecting enough data we will also carry out related work such as disease severity assessment, enabling timely management and control of tea diseases.
Conclusion
The aim of this study is to solve the problem of identifying tea diseases in complex backgrounds and to provide a key method for increasing tea farmers' income and supporting the intelligent management of tea gardens. We have proposed VCRUNet, a model for recognizing tea diseases in complex backgrounds based on the minimalist neural network VanillaNet. The study constructed a dataset containing 658 images of three common tea diseases and healthy leaves and introduced various data augmentation strategies. Channel redundancy between features in the convolutional neural network is then reduced by introducing the CRU, which improves the recognition accuracy of the model. Finally, the public Plant Village dataset was used for pre-training, and the self-built tea disease dataset was then used to fine-tune the pre-trained model's parameters via transfer learning, mitigating the overfitting and other effects caused by the small amount of data. The experimental results show that the proposed method outperforms the classical methods based on ResNet18, ResNet50, MobileNetV2, Swin-Transformer-Small, Swin-Transformer-Tiny, Vision-Transformer, VanillaNet-6, and VanillaNet-13. The recognition accuracy reaches 92.48%, and the recognition time is 4.5 s per 100 images, so the method can be effectively applied to tea disease control.