Introduction
Tomato is one of the most popular vegetables and is consumed by millions of people every day. However, as the agricultural workforce ages, labor costs are rising and have become a limiting factor in many agricultural industries. On the one hand, a large number of agricultural enterprises face low profits. On the other hand, tomato production must still satisfy the demand of a growing world population. A tomato harvesting robot is a plausible way to maintain quality control while reducing labor cost. For these reasons, many researchers have worked on robots for fruit and vegetable harvesting over the last few decades [1], [2].
The color of a tomato is a major index of its maturity. Tomato fruit passes through five different stages of maturity, which can be recognized through color changes from green to light pink, pink, light red, and then red, and which classify the fruit into five distinct categories. An appropriate appearance of the produce brings a higher price for the company, so the length of the transport route and the storage time must be taken into account to choose the optimal harvest time. In general, the relevant time span is 21 to 28 days for the breaker stage, 15 to 20 days for the turning stage, 7 to 14 days for the pink stage, 5 to 6 days for the light red stage, and 2 to 4 days for the red stage [3]. Therefore, improving the tomato classification system is an important task in the design of a harvesting robot.
In recent years, methods based on machine vision and pattern recognition have been well studied and widely applied, especially in the intelligent processing and sorting of agricultural products [4]–[6]. Computer vision is also one of the most important parts of a harvesting robot. However, the features used by such machine-vision methods are typically selected by experienced personnel. Such methods therefore lack flexibility and timeliness, which makes them hard to apply in farm enterprises. Furthermore, a system that performs well in terms of accuracy, timeliness, and scalability must cope with many challenging conditions, such as illumination variation and occlusion. Although many researchers have applied machine vision technologies, there is still a long way to go before they can support an automatic harvesting robot, because sufficient accuracy and efficiency have not yet both been achieved.
Recently, classification systems based on convolutional neural networks (CNNs) have made groundbreaking advances in many tasks. Although deep neural networks (DNNs) have shown outstanding performance in many respects, they often suffer from over-fitting. Over-fitting is mainly caused by three factors: complex models, noisy data, and limited training data [7]. Creating a dataset with enough samples is often difficult and time-consuming. In particular, some images are hard to obtain, such as those of a specific disease of one kind of plant. Therefore, when the number of images in a dataset is insufficient, data augmentation is an effective way to artificially increase the training data.
Our motivation for this study was to find an efficient procedure for assessing tomato ripeness. We needed a method that performs well in both prediction time and accuracy. Furthermore, we aimed to design an extensible classification system that can readily be applied to a harvesting robot. Based on these considerations, we designed and implemented the classification system shown in Fig. 1.
In this study, we analyzed the relationship between tomato storage duration and changes in appearance, and then divided the tomato images into five categories according to ripeness. We propose a novel network architecture with low complexity for fast classification. To construct the dataset, we collected tomato images at different ripeness levels in several ways. After labeling each image, we verified the accuracy of the dataset. Because acquiring data is time-consuming, we used several different methods for dataset augmentation. By comparing the training performance and prediction results of the model under the different augmentation methods, we identified the most suitable augmentation method for this study. This method offers guidance for the design of tomato harvesting robots.
The structure of this paper is as follows. In Section I, we introduce the motivation of this research, and also some relevant background. In Section II, we show the establishment of our dataset, mainly including the collection of images, annotation, filtering of inappropriate data, and data augmentation. We build a novel framework for ripeness classification based on deep learning in Section III. Then, we show some results of the classification system on different datasets in Section IV. In Section V, we discuss the conclusions of this research.
Related Works and Background
Estimating tomato maturity is an important problem for automatic picking. To estimate the maturity of tomatoes, Goel and Sehgal [8] proposed a color-based method whose ripeness classification system achieved high accuracy. Pavithra et al. [9] used machine learning for automatic detection and sorting of cherry tomatoes, and designed a classifier system to improve accuracy and reduce time consumption. Lu et al. [10] used machine vision and visible/near-infrared spectroscopy to achieve rapid assessment of tomato ripeness.
Mohapatra et al. [11] adopted an image processing approach to determine the ripening grade of red bananas. Although these methods performed well in experiments under certain environments, they remain difficult to apply in practice.
At present, CNN-based classification systems have achieved good results in many areas, and researchers have proposed a variety of methods. For example, Zhou et al. [12] proposed cross-label suppression dictionary learning for signal representation in face recognition to preserve the label property effectively. Chen et al. [13] proposed a novel approach that applies a cascade of three deep convolutional neural networks (DCNNs) to detect fastener defects. Li et al. [14] proposed FingerNet, which consists of one common convolution part and two different de-convolution parts, to enhance fingerprints. To effectively suppress outliers and accurately reconstruct an image from compressive measurements, Li et al. [15] presented a novel multiplier-network-based algorithm that achieves better performance in image reconstruction. Ma et al. [16] presented a new method based on variational Bayesian learning that achieves flexible performance in modeling vectors with positive elements using a Dirichlet process mixture of inverted Dirichlet distributions. There are also CNN-based studies on face recognition [17], wireless communications [18]–[20], automatic speaker verification [21] and the internet of things [22]. Similarly, driven by the remarkable success of deep learning, CNN-based classification and identification systems have recently made groundbreaking advances in the agricultural industry. For example, classification and identification systems have been investigated for crop diseases [23], [24], and many researchers have developed systems for plant identification and detection [25], [26].
Over-fitting is one of the most serious problems of CNN-based methods. Because data augmentation is an effective way to prevent over-fitting, many deep learning-based studies exploit it. For example, a CNN-based approach to alcoholism detection that relies on data augmentation uses only one hundred training images [27]. For road detection, Muñoz-Bulnes et al. [28] augmented their dataset in two ways. The first is geometric transformation, which includes random affine transformations, perspective transformations, mirroring and so on. The second is pixel value change, which includes noise, blur and color changes. Their experimental results showed that training with data augmentation improved performance by 1 to 2% [28]. Hussein et al. [29] augmented CT images for lung nodule characterization by random rotation or by adding several kinds of noise separately. They compared the identification accuracy on different datasets to demonstrate the superiority of the proposed method. Ma et al. [30] proposed another novel method for bounded support data that can be used in many important applications.
Dataset
A. Image Annotation and Verification
During this study, we took nearly 200 color images covering five maturity levels on a farm, during daytime and under natural light. Each maturity level included more than 30 tomato images. This is a relatively small amount of data compared with what is generally needed for network training.
According to market demand, we took both quality and storage life into consideration and classified the acquired images into five categories. Table 1 shows both the expiry date and quality of each stage. By these standards, we divided the 200 images into their corresponding categories.
t-Distributed Stochastic Neighborhood Embedding (t-SNE) is an algorithm derived from Stochastic Neighborhood Embedding (SNE) [31]. The idea of t-SNE is to view whether high-dimensional data
In the high-dimensional space, the pairwise distance between two points is converted into a joint probability \begin{align*} p_{ij} = \frac{\exp\left(-\left\|x_{i}-x_{j}\right\|^{2} / 2\sigma^{2}\right)}{\sum\nolimits_{k}\sum\nolimits_{l\ne k}\exp\left(-\left\|x_{k}-x_{l}\right\|^{2} / 2\sigma^{2}\right)},\quad \textrm{for }\forall i,~j: i\ne j \tag{1}\end{align*}
In the low-dimensional space, the pairwise distance between two points is converted into a joint probability \begin{align*} q_{ij} = \frac{\left(1+\left\|y_{i}-y_{j}\right\|^{2}\right)^{-1}}{\sum\nolimits_{k}\sum\nolimits_{l\ne k}\left(1+\left\|y_{k}-y_{l}\right\|^{2}\right)^{-1}},\quad \textrm{for }\forall i,~j: i\ne j \tag{2}\end{align*}
t-SNE obtains the low-dimensional representation by minimizing the Kullback-Leibler divergence between the two distributions. The cost function is given in Eq. (3).\begin{equation*} C = KL\left(P\,\|\,Q\right) = \sum\nolimits_{i}\sum\nolimits_{j\ne i} p_{ij}\log\frac{p_{ij}}{q_{ij}}\tag{3}\end{equation*}
Therefore, after manually annotating the images, we used t-SNE to check the distribution of the dataset. Fig. 2 shows the result for a part of the dataset. The result shows that the images are largely clustered according to their maturity level. We also deleted undesirable data based on this result.
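The quantities in Eqs. (1) to (3) can be illustrated with a minimal sketch. It assumes the standard t-SNE definitions (squared Euclidean distances, normalization over all pairs of distinct points) and a single fixed bandwidth `sigma`. The function names and the toy points are illustrative only, and the sketch evaluates the cost for a given embedding rather than performing the optimization.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def normalize(weights):
    """Divide every entry by the sum over all entries (diagonal entries are 0)."""
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def joint_p(points, sigma=1.0):
    """Joint probabilities p_ij in the high-dimensional space, as in Eq. (1)."""
    n = len(points)
    weights = [
        [0.0 if i == j
         else math.exp(-sq_dist(points[i], points[j]) / (2.0 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    return normalize(weights)


def joint_q(embedding):
    """Joint probabilities q_ij in the low-dimensional space, as in Eq. (2)."""
    n = len(embedding)
    weights = [
        [0.0 if i == j
         else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
         for j in range(n)]
        for i in range(n)
    ]
    return normalize(weights)


def kl_cost(p, q, eps=1e-12):
    """Kullback-Leibler cost C = sum_i sum_{j != i} p_ij log(p_ij / q_ij), Eq. (3)."""
    cost = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if i != j and pij > 0.0:
                cost += pij * math.log(pij / max(q[i][j], eps))
    return cost


# Toy data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
p = joint_p(points)
good = [(0.0,), (1.0,), (10.0,), (11.0,)]   # keeps each group together
bad = [(0.0,), (10.0,), (1.0,), (11.0,)]    # splits both groups
cost_good = kl_cost(p, joint_q(good))
cost_bad = kl_cost(p, joint_q(bad))
```

An embedding that keeps similar points together yields a lower cost than one that separates them, which is why a t-SNE plot can reveal mislabeled or atypical images.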
B. Augmentation Methods
Creating a dataset for learning often requires a lot of effort: collection, data cleaning, labeling and so on take a lot of time. To cover more usage scenarios, such as random noise and changes in scale, this paper proposes two types of data augmentation operations to alleviate these problems.
1) Geometric Transformations
Scaling and rotation are the two geometric transformations used. In order to find the best augmentation methods, we generated three datasets in this section.
\begin{equation*} \begin{cases} x_{j} = s_{x}u_{p} \\ y_{k} = s_{y}v_{q} \\ \end{cases}\tag{4}\end{equation*}
\begin{equation*} \begin{cases} x_{j}= u_{p}\cos \theta - v_{q}\sin \theta \\ y_{k}={u}_{p}\sin \theta + v_{q}\cos \theta \\ \end{cases}\tag{5}\end{equation*}
Then, we generated the dataset R & S from datasets S and R. The number of images in each category of each dataset is shown in Fig. 3.
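As an illustration of Eqs. (4) and (5), the following sketch applies the scaling and rotation mappings to a single coordinate pair. The function names are hypothetical, and how the resulting coordinates are resampled into an image (interpolation, cropping) is not covered here.

```python
import math


def scale_point(u, v, sx, sy):
    """Scaling of Eq. (4): x = sx * u, y = sy * v."""
    return sx * u, sy * v


def rotate_point(u, v, theta):
    """Rotation of Eq. (5) by angle theta in radians about the origin."""
    x = u * math.cos(theta) - v * math.sin(theta)
    y = u * math.sin(theta) + v * math.cos(theta)
    return x, y


scaled = scale_point(2.0, 3.0, 0.5, 2.0)
rotated = rotate_point(1.0, 0.0, math.pi / 2)
```

Applying `rotate_point` to the output of `scale_point` gives the combined transformation that corresponds to a dataset built from both operations.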
2) Random Noise
We adopted three types of noise for data augmentation: Pepper, Salt, and Gaussian. The probability density function (PDF) of Gaussian noise is given in Eq. (6).\begin{equation*} p_{g}\left(z\right) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\left(z-\overline{z}\right)^{2} / 2\sigma^{2}}\tag{6}\end{equation*}
\begin{equation*} p_{P}\left ({z }\right)=\begin{cases} p_{P}& z=p \\ 0 &else \\ \end{cases}\tag{7}\end{equation*}
\begin{equation*} p_{s}\left ({z }\right)=\begin{cases} p_{s}& z=s \\ 0 &else \\ \end{cases}\tag{8}\end{equation*}
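The three noise models of Eqs. (6) to (8) can be sketched as follows. This is a minimal illustration under stated assumptions: an image is a list of rows of `(r, g, b)` tuples with 8-bit values, impulse noise replaces a whole pixel (white for Salt, black for Pepper), and the noise probability, mean, and standard deviation are free parameters. None of these details are given in the text above.

```python
import random

WHITE = (255, 255, 255)  # assumed Salt value s
BLACK = (0, 0, 0)        # assumed Pepper value p


def add_impulse_noise(image, prob, value, rng):
    """Replace each pixel by `value` with probability `prob`."""
    return [[value if rng.random() < prob else px for px in row] for row in image]


def add_salt(image, prob, rng):
    """Salt noise: random pixels become white."""
    return add_impulse_noise(image, prob, WHITE, rng)


def add_pepper(image, prob, rng):
    """Pepper noise: random pixels become black."""
    return add_impulse_noise(image, prob, BLACK, rng)


def add_gaussian(image, mean, sigma, rng):
    """Add Gaussian noise to every channel and clip to the range 0..255."""
    def clip(value):
        return max(0, min(255, int(round(value))))
    return [
        [tuple(clip(c + rng.gauss(mean, sigma)) for c in px) for px in row]
        for row in image
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[(100, 150, 200)] * 4 for _ in range(4)]
salted = add_salt(image, 0.05, rng)
peppered = add_pepper(image, 0.05, rng)
noisy = add_gaussian(image, 0.0, 20.0, rng)
```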
3) Combination
To study the relationship between the different data augmentation methods and the prediction results for this task, we combined geometric transformations with random noise by adding Pepper, Salt and Gaussian noise to the datasets R, S and R & S separately, which gave nine datasets for training: R & PN, S & PN, S & R & PN, R & SN, S & SN, S & R & SN, R & GN, S & GN, and S & R & GN.
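The nine combined datasets are simply the product of the three geometric datasets and the three noise types. A short sketch that enumerates the names in the order listed above (the variable and function names are illustrative):

```python
GEOMETRIC = ["R", "S", "S & R"]   # rotation, scaling, and both
NOISE = ["PN", "SN", "GN"]        # Pepper, Salt, and Gaussian noise


def combined_dataset_names():
    """Return one name per (geometric dataset, noise type) pair."""
    return [f"{geo} & {noise}" for noise in NOISE for geo in GEOMETRIC]


names = combined_dataset_names()
```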
Classification Architecture
In this paper, we designed the classification architecture shown in Fig. 4. The design goal is to maintain overall information while preserving local details, with a short prediction response time. Based on these considerations, the architecture consists of three parts. The first part takes three-channel color images as input, each 200 pixels in both height and width. In the second part, we exploited five CNN layers to extract features. The convolution kernel sizes are
A. Feature Extractor
There are many ways to extract features, such as SVM [32], HOG [33], SIFT [34] and so on. In general, feature extraction based on image processing or machine learning requires appropriate features to be selected according to experience, which limits the extensibility of the system. In the present study, we used CNN-based methods to extract image features without any extra pre-processing. This part mainly includes the following three sub-parts.
1) Convolutional Layers
There are five convolutional layers in the design of feature extraction architecture. The kernels of these five convolutional layers are
2) Activation Function
Several activation functions have been proposed, of which ReLU [35] is one of the most popular in classification tasks. The function is defined in Eq. (9).\begin{equation*} f\left ({x }\right) =\max (0,x)\tag{9}\end{equation*}
3) Pooling
As a deep learning architecture gets deeper, it has more parameters, requires more computation, and over-fits more easily.
The most important function of a pooling layer is to keep the main feature information invariant while reducing parameters and preventing over-fitting. Mean-pooling and max-pooling are the two common forms of pooling layer. Mean-pooling takes the average value of an image area as the pooled value of that area.
Similarly, max-pooling takes the maximum value of an image area as the pooled value of that area. In this research, we used max-pooling with a size of 4 × 4 after each convolution layer.
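The difference between the two pooling forms can be seen in a small sketch. It assumes non-overlapping square windows (stride equal to the window size) on a single-channel feature map stored as a list of rows. The function name and the 2 × 2 window in the example are illustrative and are not the configuration of the model described here.

```python
def pool2d(feature_map, size, mode="max"):
    """Pool a 2-D feature map over non-overlapping size x size windows."""
    if mode not in ("max", "mean"):
        raise ValueError("mode must be 'max' or 'mean'")
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            if mode == "max":
                pooled_row.append(max(window))            # max-pooling
            else:
                pooled_row.append(sum(window) / len(window))  # mean-pooling
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
max_pooled = pool2d(feature_map, 2, "max")
mean_pooled = pool2d(feature_map, 2, "mean")
```

Both forms reduce a 4 × 4 map to 2 × 2 here, so every later layer processes a quarter of the values.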
B. Classifier
Each node of the fully connected layer is connected to all nodes of the previous layer in order to combine the features extracted by the preceding layers. Because it is fully connected, this layer generally has more parameters than other layers. In this task, we used only one fully connected layer with 32 neurons, connected to the last convolution layer. The softmax [36], [37] model can effectively solve classification problems. The function is given in Eq. (10).\begin{equation*} p_{j}=\frac {e^{x_{j}}}{\sum \nolimits _{i=1}^{K} e^{x_{i}}}\tag{10}\end{equation*}
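A minimal sketch of Eq. (10) for K = 5 ripeness classes follows. Subtracting the maximum input before exponentiation is a common numerical-stability step that leaves the result unchanged. The example scores are made up for illustration.

```python
import math


def softmax(scores):
    """Softmax of Eq. (10): p_j = exp(x_j) / sum_i exp(x_i)."""
    shift = max(scores)  # stabilizes exp() without changing the ratios
    exps = [math.exp(x - shift) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.2, 1.5, -0.3, 3.0, 0.0]   # hypothetical outputs for five classes
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The predicted class is the index with the largest probability, which is also the index of the largest input score.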
C. Training Strategy
We used cross-entropy [38] as the loss function during model training. The loss function is defined in Eq. (11).\begin{equation*} C = -\frac{1}{n}\sum\limits_{x}\left[ y\ln a + \left(1-y\right)\ln\left(1-a\right) \right]\tag{11}\end{equation*}
\begin{equation*} \theta ^{i}=\theta ^{i-1}-\alpha \frac {\partial }{\partial \theta ^{i-1}}J\left ({\theta ^{i-1} }\right)\tag{12}\end{equation*}
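Eqs. (11) and (12) can be illustrated with a short sketch. It assumes that `y` is the target, `a` is the model output in (0, 1), and `alpha` is the learning rate. The quadratic objective used to demonstrate the update rule is a stand-in chosen for illustration, not the network loss.

```python
import math


def cross_entropy(targets, outputs):
    """Cross-entropy of Eq. (11), averaged over n samples."""
    n = len(targets)
    total = sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for y, a in zip(targets, outputs))
    return -total / n


def gradient_step(theta, gradient, alpha):
    """One update of Eq. (12): theta_i = theta_{i-1} - alpha * dJ/dtheta."""
    return theta - alpha * gradient


loss = cross_entropy([1, 0], [0.9, 0.1])

# Demonstrate the update rule on J(theta) = (theta - 3)^2, whose gradient
# is 2 * (theta - 3) and whose minimum is at theta = 3.
theta = 0.0
first = gradient_step(theta, 2 * (theta - 3), 0.1)
for _ in range(100):
    theta = gradient_step(theta, 2 * (theta - 3), 0.1)
```

The loss shrinks as the outputs approach the targets, and repeated updates move the parameter toward the minimizer of the objective.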
Experiments
The experiments were performed on a 64-bit Windows 10 PC equipped with an Intel(R) Core(TM) i5-7500 CPU @ 3.20 GHz and 8 GB of RAM. Parallel computation is important for deep learning, so we used an NVIDIA GTX 1060 GPU with 3 GB of memory to reduce training time. We used the high-level neural network application programming interface of TensorFlow to implement the proposed deep learning model.
A. Classification Based on the Dataset S, R, S & R
In this section, we train the model on datasets augmented with geometric transformations.
Fig. 5 shows how the training accuracy of the model changes with the number of iterations on the S, R, and S & R datasets. The curves show that the model converges fastest on S & R, followed by S. The model trained on R converges slowest, and its validation accuracy is also poor (as shown in Fig. 6).
Fig. 5. Training accuracy versus iteration on the datasets augmented with geometric transformations.
Fig. 6. Validation accuracy versus iteration on the datasets augmented with geometric transformations.
Then, we used the trained models to predict 100 images that were not used in training. The results are presented in Table 2. The best result is obtained by the model trained on dataset S & R. It is worth mentioning that, although training on R did not perform very well, its prediction result is better than that of S.
B. Classification Based on the Dataset S With Noise
In this section, we train the model on dataset S with three types of random noise. Fig. 7 shows how the training accuracy changes with the number of iterations on S, S & PN, S & GN, and S & SN.
Fig. 7. Training accuracy versus iteration on S with Pepper, Salt, and Gaussian noise added separately.
The curves show that the models trained on all of these datasets converge quickly; after thirty epochs, all the models have nearly converged. Moreover, the validation accuracy on these datasets, shown in Fig. 8, behaves similarly to the training accuracy. In addition, both the training and validation accuracies reach 96%.
Fig. 8. Validation accuracy versus iteration on S with Pepper, Salt, and Gaussian noise added separately.
Then, we tested the predictive results of these models (see Table 3). The table shows that the dataset of S with Salt noise has the best effect on final results, followed by that with Gaussian noise and finally with Pepper noise.
C. Classification Based on the Dataset R With Noise
In this section, we train the model on dataset R with three types of random noise. Fig. 9 shows how the training accuracy changes with the number of iterations on R, R & PN, R & GN, and R & SN. Unlike the S series, the models converge much more slowly on all of these datasets. In addition, the validation accuracy on all of these datasets is hard to improve once it reaches nearly 80%, as shown in Fig. 10.
Fig. 9. Training accuracy versus iteration on R with Pepper, Salt, and Gaussian noise added separately.
Fig. 10. Validation accuracy versus iteration on R with Pepper, Salt, and Gaussian noise added separately.
Although the training performance on these datasets is inferior to that on the S series, their prediction results are close. The results are presented in Table 4.
D. Classification Based on the Dataset Combined R & S
In this section, we train the model on dataset R & S with three types of random noise. Fig. 11 shows how the training accuracy changes with the number of iterations on R & S, R & S & PN, R & S & GN, and R & S & SN. Fig. 12 shows how the validation accuracy changes during training on these datasets. These curves show that the models trained on these datasets perform best in both convergence and accuracy, with few oscillations during training. Although training on the R series alone did not perform well, its performance improved when combined with S.
Fig. 11. Training accuracy versus iteration on R & S with Pepper, Salt, and Gaussian noise added separately.
Fig. 12. Validation accuracy versus iteration on R & S with Pepper, Salt, and Gaussian noise added separately.
This group also gave better predictions than the previous groups, as shown in Table 5. As before, the model trained with Salt noise provides the best prediction results.
E. Performance Comparison With Augmented Methods
Brightness changes and flips are two important augmentation methods that effectively mitigate over-fitting, in part because brightness changes are prevalent in natural environments. Therefore, we compared our method with these two augmentation methods in this study.
First, we generated two datasets, one augmented by random brightness changes (BRIG), and the other augmented by flipping both horizontally and vertically (FLIP). Fig. 13 shows how the training accuracy changes with the number of iterations on BRIG, FLIP and R & S & SN.
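The two comparison augmentations can be sketched in the same style. The image representation (rows of `(r, g, b)` tuples with 8-bit values), the function names, and the brightness range `max_delta` are assumptions made for illustration; the actual settings used for BRIG and FLIP are not stated above.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def adjust_brightness(image, delta):
    """Add `delta` to every channel and clip to the range 0..255."""
    return [
        [tuple(max(0, min(255, c + delta)) for c in px) for px in row]
        for row in image
    ]


def random_brightness(image, max_delta, rng):
    """Shift brightness by a random integer in [-max_delta, max_delta]."""
    return adjust_brightness(image, rng.randint(-max_delta, max_delta))


image = [
    [(1, 1, 1), (2, 2, 2)],
    [(3, 3, 3), (4, 4, 4)],
]
mirrored = flip_horizontal(image)
upside_down = flip_vertical(image)
brighter = random_brightness(image, 40, random.Random(0))
```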
Fig. 13. Training accuracy versus iteration on the datasets built with the three augmentation methods.
Fig. 14 shows how the validation accuracy changes during training on these datasets. These curves show that the model trained on R & S & SN still performs best in both convergence and accuracy compared with the other two datasets.
Fig. 14. Validation accuracy versus iteration on the datasets built with the three augmentation methods.
Again, we used these three trained models to predict 100 images that were not used in training. The results are given in Table 6. The best results are obtained by the model trained on dataset R & S & SN.
F. Runtime Analysis
Additionally, time cost plays an essential role in deploying the system. Therefore, we considered the response time of the system under two experimental conditions: one using only the Intel(R) Core(TM) i5-7500 CPU, and the other adding the NVIDIA GTX 1060 GPU. Under both conditions, we measured the response time of the system when predicting 100, 300, 500 and 700 images. The results are shown in Fig. 15.
Fig. 15. Time cost of the classification model on CPU and GPU for different numbers of images.
The results demonstrate that the response time of our classification system is less than 1 millisecond (ms) per one hundred images, whether or not the device has parallel computing power.
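A measurement of this kind can be sketched with a wall-clock timer around a batch prediction call. The `predict` argument, the `dummy_predict` stand-in, and the placeholder inputs are hypothetical. In practice `predict` would wrap the trained model, and a warm-up call before timing is usually advisable.

```python
import time


def time_predictions(predict, batch_sizes=(100, 300, 500, 700)):
    """Return the elapsed seconds of one `predict` call per batch size."""
    timings = {}
    for n in batch_sizes:
        batch = [[0.0] * 8 for _ in range(n)]  # placeholder inputs
        start = time.perf_counter()
        predict(batch)
        timings[n] = time.perf_counter() - start
    return timings


def dummy_predict(batch):
    """Stand-in classifier that assigns class 0 to every input."""
    return [0 for _ in batch]


timings = time_predictions(dummy_predict)
per_hundred = {n: seconds * 100 / n for n, seconds in timings.items()}
```

Normalizing to seconds per one hundred images makes batches of different sizes directly comparable.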
Discussion and Analysis
Recently, a number of studies have addressed data augmentation, and most of the proposed methods perform well in certain fields. In our research, we found that a characteristic of the tomato classification task is that object distance and viewing angle change concurrently. Therefore, we augmented our datasets with three methods: rotation, scale change, and rotation combined with scale change.
Furthermore, noise exists widely in natural environments. To bring our dataset closer to real environmental conditions, we added three types of noise, i.e., Pepper, Salt and Gaussian, to the dataset. Adding Gaussian and Pepper noise introduces colored pixels into the images. Because color information plays an important role in this classification task, such noise can cause confusion. In comparison, adding Salt noise introduces no color other than white, so the results are much better than with the other two methods.
Conclusions
In this paper, we divided tomatoes into five categories according to ripeness, based on the relationship between storage time and appearance. We designed and implemented a novel deep learning architecture for the classification of tomato maturity levels. Compared with other classical architectures, it has fewer parameters to compute and higher accuracy.
To improve the performance of the classification model, we used t-SNE to verify the distribution of the dataset and to delete bad images. To avoid over-fitting during training, we used three methods of dataset augmentation.
Through experiments on different groups of datasets, we obtained the best prediction results by training on the R & S & SN dataset. With this dataset, the final accuracy reaches 91.9% and the prediction time is less than 0.01 second per one hundred images.