A Two-Stage Feature Selection Approach for Fruit Recognition Using Camera Images With Various Machine Learning Classifiers

Fruit and vegetable identification and classification systems are valuable to the agriculture business, the food processing sector, and the convenience stores and hypermarkets where these products are sold. An effective automated tool is therefore needed to meet market demand and improve economic efficiency. In this paper, a two-stage model is proposed to recognize fruits in camera images. In the first stage, a DenseNet121 is employed to extract features from the fruits dataset. In the second stage, a feature subset selection method chooses the most significant features for recognizing fruits in the images. In this study, Adaptive Particle Grey Wolf Optimization (APGWO) is applied to select the most pertinent features. The final feature subset is used to recognize fruits with several machine learning classifiers, namely K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP). The experimental results are highly effective: the training time of the proposed models is reduced by over 50%, and the classification accuracy reaches 99%.


I. INTRODUCTION
Fruit is very important to the human body: it contains various vitamins and minerals and is recommended to be eaten after meals [1]. For these reasons, the food industry continues to grow. Technology, in parallel, is developing to meet the required demand, so that automation increases and the operating costs of factories are significantly reduced. With the development of CNNs, it is becoming easier to classify types of pests, diagnose diseases, and determine the maturity of food. During the COVID-19 pandemic in particular, human resources decreased while the demand for food remained high, accelerating automation in factories.
The associate editor coordinating the review of this manuscript and approving it for publication was Pasquale De Meo.
With advances in the fields of image processing, computer vision, and machine learning, automation in the food industry is no longer a problem: a robot with cameras can replace humans at certain processing steps. Deep learning, and more specifically the deep convolutional neural network, is one of the leading approaches to vision-based detection, recognition, and classification for the food industry.
There are several approaches that apply deep learning methods to classify fruits. The authors of the Fruit-360 dataset developed a Convolutional Neural Network (CNN) with four convolution layers interleaved with four max-pooling layers. The output of these layers feeds two fully connected layers that generate 256 outputs with the ReLU activation function. The last layer is a softmax loss layer with 256 inputs, whose output size equals the number of classes [2]. Their work achieved a very high result, 96.3% accuracy on 60 classes. Because of the diversity of the data and its large number of samples, this is one of the pioneering works in fruit recognition. In addition, Kausar et al. proposed Pure-CNN, tested it on a Fruit-360 dataset with 81 types of fruits, and achieved 98.88% accuracy [3]. Since then, many researchers have modified and developed CNN models to improve accuracy and reduce training time. Since the late 20th century, many successful pre-trained CNN models have appeared, such as LeNet, AlexNet, and VGG. Tomar applied seven transfer learning models to the Fruit-360 dataset, namely InceptionV3, ResNet50, VGG16, MobileNet, Xception, DenseNet, and InceptionResNet [4]. After comparing the experimental results of these models, they concluded that, with 120 classes of the Fruit-360 dataset, the best model is VGG16, with 99% training and 95% testing accuracy. In this research, however, all 131 classes of fruits are taken into account. Our approach develops a new algorithm that addresses these issues by combining multiple methods from artificial intelligence and computer vision to achieve high accuracy at reduced cost. First, we extract features of the Fruit-360 dataset from DenseNet121. Then we apply the Adaptive Particle Grey Wolf Optimizer to choose the most suitable features.
These selected features are then trained with well-known models applied in fruit recognition.

A. SUPPORT VECTOR MACHINE (SVM)
A linear model for regression and classification problems. SVM can be used for linear or nonlinear problems, making it useful for many practical applications. SVM creates a line or a hyperplane to divide the data into classes [5]. It has been shown to be an effective approach for high-dimensional problems. This method is typically utilized in sentiment analysis and classification because of how effectively it handles computation on enormous datasets.
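As an illustrative sketch (synthetic data and default settings, not the paper's exact configuration), a linear SVM in scikit-learn fits a separating hyperplane in the feature space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for image features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(kernel="linear")   # fits a separating hyperplane in the 20-D space
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # held-out accuracy
```

Nonlinear problems are handled by swapping the kernel (e.g. `kernel="rbf"`), which implicitly maps the data to a higher-dimensional space.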

B. K-NEAREST NEIGHBORS (K-NN)
A well-known statistical technique for pattern recognition [6]. The KNN algorithm is simple and effective. The classifier requires no training phase, so there is no training time complexity; the computational complexity of KNN classification is directly related to the number of documents in the training set. This method is more effective than other approaches at dividing sample sets whose class domains have many intersections or overlaps.
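A minimal sketch of the idea on toy one-dimensional data (not the fruit features): the classifier simply stores the training set and labels a query point by majority vote among its k nearest neighbours.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters on a line.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)  # no real "training" step: the data is simply stored
pred = knn.predict([[0.15], [1.05]])  # majority vote of the 3 nearest points
```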

C. MULTILAYER PERCEPTRON (MLP)
Multiple nodes are grouped into layers in the MLP classifier, forming a directed graph with input, hidden, and output layers that are fully connected to each other. This classifier features a straightforward architecture and training method, making it one of the most well-known neural network models [7], [8], [9].
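A small hedged example (synthetic data, arbitrary layer sizes) showing the fully connected layered structure in scikit-learn; `mlp.coefs_` holds one weight matrix per layer-to-layer connection:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Two hidden layers; every layer is fully connected to the next.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=1)
mlp.fit(X, y)
# mlp.coefs_ contains three weight matrices: input->32, 32->16, 16->output
```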

D. DECISION TREE (DT)
The process of categorizing instances according to features is represented by the decision tree model. Its main advantages are that this model is readable and the classification procedure is fast. When learning, it uses the training data to build a decision tree model according to the principle of minimizing the loss function [10].
E. RANDOM FOREST (RF)
RF is a commonly used ensemble learning algorithm whose base classifier is a decision tree. Each tree randomly selects variables and observations for classifier construction, and voting determines the outcome. Generally, log N features are selected for each tree (N is the number of features). If all features are selected for each tree, the random forest can be regarded as a bagging algorithm [11]. This approach has frequently been used to choose model parameters in landslide susceptibility mapping [12], [13].
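The log N feature-sampling rule above can be sketched with scikit-learn's `RandomForestClassifier`, whose `max_features` option controls how many features each split considers (synthetic data and illustrative settings, not the paper's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=64, random_state=0)

# Each tree trains on a bootstrap sample and considers only a random subset
# of features at each split; "log2" mirrors the log N rule mentioned above.
rf = RandomForestClassifier(n_estimators=100, max_features="log2", random_state=0)
rf.fit(X, y)
# With max_features=None (i.e. all features) the forest reduces to plain
# bagging of decision trees.
```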
The rest of this paper is organized as follows. The literature review is presented in Section II. Section III provides a detailed explanation of the proposed method. The experimental results are demonstrated in Section IV, and Section V concludes the paper.

II. LITERATURE REVIEW

A. FEATURE EXTRACTION (FE)
The feature extraction approach extracts features from the dataset to increase the performance of the learned model. It also speeds up classifier training by removing inappropriate features.
The input of the classifier for image classification and segmentation is usually the significant features of each image. There are three main types of features that can be extracted from an image: shape-based features (area, perimeter, shape index), intensity-based features (statistical parameters such as mean, variance, and median), and texture-based features (contrast, correlation, entropy, and inertia [14]). In this study, we use the mean intensity from the intensity-based features.

B. FEATURE SELECTION (FS)

1) AN OVERVIEW OF FEATURE SELECTION
In the feature space, a feature extraction technique may create a few redundant and irrelevant features. It is critical to remove redundant or irrelevant feature properties; doing so not only improves overall classification accuracy but also reduces computing overhead [15]. The feature selection method is usually used for this purpose.
The feature selection method is a machine learning procedure for selecting features to solve a high-dimensionality problem. To make subsequent analysis easier, we select a number of important features and discard noisy and irrelevant data in terms of redundancy and relevancy. Yu et al. [16] categorized feature subsets into four categories: noisy and irrelevant, redundant and weakly relevant, strongly relevant, and non-redundant. Furthermore, filter and wrapper methods can be used to implement a variety of approaches, including search strategies, models, feature evaluation, and feature quality measures. The size of the hypothesis space is proportional to the number of features: as the number of features increases, so does the size of the search space. If an original dataset contains M features with a binary class label, the search space contains 2^M combinations. The operating mechanism of the feature selection process is depicted in Figure 1. The process has five essential components: the dataset, the selected feature subset, evaluation, selection criteria, and validation. There are three types of feature selection strategies: wrapper, filter, and embedded methods.
A filter method is a selection approach based on the relationship between the objective variable and the explanatory variables, unrelated to the machine learning method; the correlation coefficient score and the chi-squared test are examples of filtering processes [17], [18]. The performance of the wrapper approach is determined by the classifier: a subset of selected features is passed to an algorithm, and its performance is compared with the prior model before deciding whether or not to include more features [19], [20], [21]. Finally, the embedded technique selects features using hybrid learning methods and ensemble learning. It works similarly to the wrapper method in that variables are chosen within the machine learning model; however, feature quality is calculated concurrently with model learning, without a separate search loop, making it simple to compute.
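As a small example of a filter method, the chi-squared test ranks features independently of any classifier; the dataset and k below are illustrative, not the paper's setup:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)      # 4 non-negative features
selector = SelectKBest(chi2, k=2)      # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)   # scores computed without any classifier
```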
The following figure shows a flow chart of technique categorization for feature subset selection and explains how we arrived at metaheuristic algorithms.

2) METAHEURISTIC ALGORITHMS
Metaheuristic algorithms are optimization methods that aim to find the best solution to a problem. These are derivative-free techniques that are flexible, simple, and avoid local optima. Because they produce random solutions during optimization, these methods behave stochastically [22]. A major feature of metaheuristic algorithms is their extraordinary ability to prevent premature convergence. They have been used to solve a range of engineering and science problems with great success, including problems in electrical engineering, industrial fields, civil engineering, communication, data mining, and so on.
The two primary groups of metaheuristic algorithms are as follows.

a: SINGLE SOLUTION-BASED METAHEURISTIC ALGORITHMS
During each iteration, the optimization process modifies a single solution. This may lead to local optima trapping and a lack of exploration of the search space.

c: EVOLUTION-BASED ALGORITHMS
This type of algorithm starts with a population of randomly generated solutions and is based on the evolution of nature. The best results are combined to produce new individuals. Popular algorithms include the genetic algorithm (GA), differential evolution, tabu search, genetic programming, and evolution strategies [22], [23].

d: SWARM INTELLIGENCE-BASED ALGORITHMS
Based on the behaviours of animals and birds. Particle Swarm Optimization (PSO) is based on the behaviour of birds, which fly over the search space to find the ideal position [24]. Swarm intelligence techniques include the honeybee swarm optimization algorithm, ant colony optimization, monkey optimization, and others.

e: PHYSICS-BASED ALGORITHMS
Based on the laws of physics that govern the universe. Physics-based algorithms include harmony search, simulated annealing, and others [25].

f: HUMAN BEHAVIOUR-RELATED ALGORITHMS
Solely based on the behaviour of humans. Every human being has a unique method of carrying out tasks that affects their overall performance, which encourages researchers to design such algorithms [26].

3) HYBRID METAHEURISTIC ALGORITHMS
In order to develop new and better algorithms, hybrid metaheuristic algorithms combine several different metaheuristics. Several methodologies have been developed to choose the most relevant features for solving feature selection problems. These improved algorithms help avoid premature convergence and local optima trapping, explore the search space efficiently, and exploit it well. They also achieve a better balance between exploitation and exploration, further enabling them to reach the optimal solution.
A competitive binary Grey Wolf Optimization (CBGWO), based on Mirjalili's Grey Wolf Optimizer, was proposed to enhance BGWO's performance and further classify EMG signals [27], [28], [29]. Other wrapper feature selection methods have been proposed to discover the best features, including binary particle swarm optimization (BPSO) [30] and binary differential evolution (BDE) [31]. In [32], the authors used a Genetic Algorithm (GA) to initialize the BGWO population. Aside from that, Faris et al. [33] present an overview of GWO research.

III. METHODOLOGY
In this research, we consider a fruit image database for fruit classification. The process of our approach is shown in Figure 4. At the beginning of the proposed model, DenseNet121 is applied to extract features. Due to the high dimensionality of the feature set, a two-stage feature selection technique is used to pick the most important features. In this stage, the resulting feature set is passed through the Adaptive Particle Grey Wolf Optimization method for selection, which is based on previous works [34] and [35]. This optimized feature subset is then trained with various machine learning models to classify the types of fruits, and the results are compared with the same algorithms without the FS method in order to determine which models achieve the best performance.
In this paper, several machine learning models, namely SVM, KNN, MLP, RF, and DT, are implemented to evaluate the performance of the system with the feature subset selection method. The detailed working approach for feature extraction using DenseNet121 and the two-stage feature selection technique for determining the most effective feature set is shown in the next subsections.

A. DATASET
The fruits dataset is one of the largest such datasets to date. It contains 131 types of fruits with a total of 90,380 pictures [2]. Each fruit was placed on a low-speed rotating system (3 rpm) and recorded for 20 seconds. The white background was the colour of a sheet of paper so that the dataset's authors could extract the fruit from the background. In this paper, we divided the dataset into train and test sets. The entire dataset was used as input for feature extraction, in which two pre-trained models, DenseNet121 and VGG16, were applied. For DenseNet121, we extract the features from the 4th dense block. This output contains 1024 features, each with a dimension of 7×7, as shown in Figure 6. All of these features then become the input of the feature selection method, APGWO. The selected features are subsequently fed to the machine learning models for fruit classification.

2) GREY WOLF OPTIMIZATION (GWO)
Individual-to-group communication is referred to as swarm intelligence. Swarm intelligence has a wide range of applications in research and industry, and its study can aid in the management of complicated systems. Numerous metaheuristic methods have been developed to solve such problems [22], [23]. In this section, GWO is used to reduce the size of the feature subset. GWO simulates how wolves search for food and avoid enemies in order to survive. Mirjalili et al. [29] were the first to introduce GWO. The alpha is the leader, deciding where to sleep, when to hunt, and when to wake up. The beta is the second level in the pack; beta wolves are led by the alphas but command the other wolves. The omega is the lowest rank in the pack; these wolves are weak and rely on the others. Although delta wolves rely on the alpha and beta wolves, they are more responsible than the omegas: they monitor territorial boundaries, issue warnings in case of danger, protect and ensure the safety of the herd, and care for the weak and ill wolves. The following diagram depicts how an instance in GWO updates its position based on the alpha, beta, and delta. For the model, the best solutions are taken as alpha, beta, and delta, respectively. GWO's encircling step is:

D = |C · X_p(t) − X(t)|
X(t + 1) = X_p(t) − A · D

where t denotes the current iteration, A and C denote coefficient vectors, X denotes the position vector of a grey wolf, and X_p is the prey's position vector. The coefficients are given in equations (8) and (9):

A = 2a · r_1 − a
C = 2 · r_2

where a is decreased over the range [2, 0], and r_1, r_2 are random vectors in the range [0, 1]. During the hunt, the alpha, beta, and delta positions guide the search, and the final position of the wolf X(t + 1) is:

D_α = |C_1 · X_α − X|,   D_β = |C_2 · X_β − X|,   D_δ = |C_3 · X_δ − X|
X_1 = X_α − A_1 · D_α,   X_2 = X_β − A_2 · D_β,   X_3 = X_δ − A_3 · D_δ
X(t + 1) = (X_1 + X_2 + X_3) / 3

As previously stated, when the prey stops moving, the wolves attack to finish the hunt.
Note that as a decreases, A also decreases: A takes a random value between −2a and 2a, with a decreasing from 2 to 0 over the iterations. The wolf attacks the prey when |A| < 1. To control the exploitation/exploration trade-off, the parameter a is adjusted between 2 and 0 in equation (12):

a = 2 − 2t / Maxiter

where t denotes the current iteration and Maxiter denotes the total number of iterations. The first step in applying GWO to feature subset selection is the solution representation: each position of the solution is a ''1'' or a ''0''. If the value is 0, the feature is not selected; if the value is 1, the feature is chosen. As described by Emary et al. [28], the GWO technique must be adapted for binary feature selection. In the BGWO2 model, the averaged position is squashed by a sigmoid function and thresholded against a random number to produce a binary position:

x_d(t + 1) = 1 if sigmoid((x_1^d + x_2^d + x_3^d) / 3) ≥ rand, otherwise 0

where rand is a random number drawn uniformly from [0, 1].
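The encircling and hunting updates above can be sketched in NumPy. This is a simplified illustration of one continuous-valued GWO position update, not the paper's implementation; the wolves are assumed to be sorted so that the first three are the alpha, beta, and delta.

```python
import numpy as np

rng = np.random.default_rng(0)

def gwo_step(wolves, a):
    """One continuous GWO update: the three best wolves (alpha, beta, delta)
    pull every wolf toward the average of their dictated positions."""
    alpha, beta, delta = wolves[0], wolves[1], wolves[2]
    new = np.empty_like(wolves)
    for i, x in enumerate(wolves):
        candidates = []
        for leader in (alpha, beta, delta):
            A = 2 * a * rng.random(x.shape) - a   # A = 2a*r1 - a, in [-a, a]
            C = 2 * rng.random(x.shape)           # C = 2*r2, in [0, 2]
            D = np.abs(C * leader - x)            # distance to the leader
            candidates.append(leader - A * D)     # X_k = X_leader - A_k * D_k
        new[i] = np.mean(candidates, axis=0)      # X(t+1) = (X1 + X2 + X3) / 3
    return new

wolves = rng.random((5, 3))        # 5 wolves in a 3-dimensional search space
updated = gwo_step(wolves, a=2.0)  # one iteration with a at its initial value
```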

3) ADAPTIVE PARTICLE GREY WOLF OPTIMIZATION (APGWO)
Similarly, Eberhart and Kennedy presented the theory of Particle Swarm Optimization [24], based on a herd's prey-hunting strategy in an environment where each animal in the pack is aware of its location relative to the food and knows which member is closest to it. To find the shortest way to the food, each member of the herd follows the leaders closest to the prey. The authors present the PSO method, which adapts this behaviour to solve optimization problems. Each element of the PSO is characterized by two factors: its current position x and its velocity v. A fitness function assigns a fitness value to each particle. Initially, the position of each element is set at random. Each particle is influenced by two reference positions: its personal best, pBest, and the best position found by the whole flock, gBest. The PSO's elements navigate the problem space by following the currently best particles. After each step, the velocity and position of each particle are updated by the equations below:

v(t + 1) = w · v(t) + c_1 · r_1 · (pBest − x(t)) + c_2 · r_2 · (gBest − x(t))
x(t + 1) = x(t) + v(t + 1)

The values of c_1 and c_2 are usually specified as constants in PSO to balance the exploration phases, typically c_1 = c_2 = 1 or c_1 = c_2 = 2. In the adaptive variant, the acceleration coefficients are changed by a formula at each iteration; the new coefficients are given in equations (16) and (17), where f(gBest) is the global best fitness of the swarm, k relates the coefficients and the fitness, and t is the iteration. The constants 1.2 and 0.5 in these formulas were found through empirical studies. The inertia weight follows:

w(t) = w_max − (w_max − w_min) · t / Maxiter
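The velocity and position updates can be sketched as follows; w, c_1, and c_2 are fixed placeholder constants here, not the adaptive coefficients of APGWO:

```python
import numpy as np

rng = np.random.default_rng(1)

def pso_step(x, v, pbest, gbest, w=0.9, c1=2.0, c2=2.0):
    """One standard PSO iteration: pull each particle toward its personal
    best (pbest) and the swarm's global best (gbest)."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v_new, v_new

x = rng.random((10, 4))   # 10 particles in 4 dimensions
v = np.zeros_like(x)
pbest = x.copy()          # personal bests start at the initial positions
gbest = x[0]              # pretend particle 0 is the current swarm best
x_next, v_next = pso_step(x, v, pbest, gbest)
```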

Algorithm 1
The pseudocode of APGWO [36]:

  Initialize the particle population
  Initialize parameters
  t = 0
  while (t < Max iteration)
      for each particle with position x_p
          calculate the fitness value f(x_p)
          if f(x_p) is better than f(pBest_p) then
              pBest_p ← x_p
          end if
          if the best f(pBest_p) is better than f(gBest) then
              gBest ← pBest_p
          end if
      end for
      update w using (26)
      for each particle with position x_p
          update c_1, c_2 by (24), (25)
          calculate the velocity of each particle by (27)
          update the position of each particle by (28)
      end for
      if rand(0, 1) < prob
          run GWO
          x_p = position of the best wolf
      end if
      t = t + 1
  end while
  return gBest

The sigmoid function is shown in equation (19):

S(v) = 1 / (1 + e^(−v))

Equation (20) shows the update of the particle's binary position:

x_ij = 1 if r_ij < S(v_ij), otherwise 0

where r_ij is a random number in the range of 0 to 1.
In the PSO main loop, a few GWO iterations play the role of a mutation, producing a hybrid variant. The probability of this mutation in our case study is set to 0.1. Because this value is small, the inner loop is triggered only a few times, which ensures that the quality of the solution is unaffected.
The pseudocode of APGWO is shown in Algorithm 1. The APGWO solution is an array of size 1×n (n is the total number of features) containing binary numbers: 1 for a selected feature and 0 for a rejected one. The following settings were used for the wrapper algorithms: 50 iterations for the PSO loop, 20 search agents for the nested GWO loop, and 20 search agents for the PSO main loop. Following [37], the initial and final inertia weights are denoted by w_max and w_min, set to 0.9 and 0.2, respectively. The method aims to minimize the following fitness function:

F = α · E_t + (1 − α) · S / L

where E_t is the validation set's error rate, α = 0.9, and S and L are the number of chosen features and the total number of features, respectively. The method simultaneously minimizes this fitness function and reduces the number of selected features while improving validation accuracy. The higher the α value, the more the optimizer concentrates on increasing validation accuracy. The following figure represents the flow chart of the APGWO algorithm:
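Assuming the standard wrapper form described above (a weighted sum of the validation error rate and the fraction of features kept), the fitness function can be written as a one-liner:

```python
def apgwo_fitness(error_rate, n_selected, n_total, alpha=0.9):
    # F = alpha * E_t + (1 - alpha) * S / L; lower is better.
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)

# e.g. a 2% validation error using 508 of 1024 features
f = apgwo_fitness(0.02, 508, 1024)  # -> 0.067609375
```

With α = 0.9, a one-point gain in validation accuracy outweighs dropping roughly 9% of the features, which matches the statement that the optimizer concentrates on accuracy.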

IV. EXPERIMENTAL RESULTS
This section discusses and compares the results of using the feature selection method, Adaptive Particle Grey Wolf Optimization, to optimize several machine learning models. The system's outcomes and performance evaluation are presented in the following subsections. The dataset is split 80:20, where 80% is used to train and 20% to test the accuracy of the algorithms (Figure 11 shows the ratio between the training and testing sets).

A. FEATURE EXTRACTION
Each extracted image feature is treated as a numerical matrix of dimension m × n. The average value of each matrix S(m × n) is calculated by:

mean(S) = (1 / (m · n)) · Σ_{i=1..m} Σ_{j=1..n} S_ij

The mean value of each feature map represents one feature. Therefore, there are 90,380 × 1024 and 90,380 × 512 mean feature values for DenseNet121 and VGG16, respectively.
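The per-map averaging can be sketched in NumPy with random stand-in feature maps (real values would come from DenseNet121's 4th dense block):

```python
import numpy as np

# Stand-in for the DenseNet121 output: 1024 feature maps of size 7x7.
feature_maps = np.random.default_rng(2).random((1024, 7, 7))

# Averaging each 7x7 map yields one scalar per map, i.e. a
# 1024-dimensional feature vector per image.
feature_vector = feature_maps.mean(axis=(1, 2))
```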

B. RESULTS OF USING MODELS WITH ALL FEATURES
Five machine learning models, namely KNN, SVM, DT, RF, and MLP, were trained on all available features. The following figure shows the accuracy comparison between the algorithms: KNN, SVM, and MLP reach the highest performance with 99.99%.

C. RESULTS OF USING MODELS WITH SELECTED FEATURES
After applying the APGWO method, 508 of the 1024 DenseNet121 features are selected; thus the feature space is reduced by 50.4%. Figure 13 shows the number of available features and the number of selected features after applying the feature subset selection method. These selected features become the input of the machine learning models, and 50 epochs were used in this step. The following figure shows that the proposed approach again yields the highest accuracy, 99.99%, for KNN, SVM, and MLP, while RF and DT also achieve high performance with 97.46% and 91.09%, respectively.

D. PERFORMANCE
After using a wrapper feature selection method to reduce the feature space of the fruits dataset, we applied various models to classify the fruits. The following table presents the accuracy of the models with the full feature set and with the selected features.
With full features, three machine learning models, namely KNN, SVM, and MLP, share the highest accuracy of 99.99%, while the Decision Tree is the lowest at around 98%. In this system, APGWO is applied to enhance the performance of the machine learning models. With 508 of the original 1024 features selected, the accuracy of KNN and MLP is higher than that of the other models, which indicates that the fruits dataset contains unhelpful features. RF and MLP remain high at 97.46% and 99.97%, respectively, while DT only achieves 91.04%.
The experimental results in the table below show that machine learning algorithms combined with the APGWO method outperform the same models without the feature subset selection method. These results also show that our model maintains high accuracy.
Moreover, we ran our method several times to verify the experimental results and refine the proposed model. KNN and SVM perform most accurately, taking the first and second positions, respectively. Overall, we observe that KNN and RF are two models that perform well; to save time and get the best results, we recommend that future studies employ both models.
The figure below shows five trials of our proposed models, performed to validate the stability of the machine learning models; the initial results shown previously correspond to Trial 1. The accuracy score remains the same across all five trials.
Table 3 describes the training time of all machine learning models with all features and with the features selected by the APGWO feature selection method. In addition, the accuracies of our approach and current models are listed in Table 4. According to both tables, the proposed method reduces the training time: it yielded higher scores in 51.3% less time than the MLP model with full features, and the other models with APGWO also train faster than the models with full features. Moreover, our approach outperforms other current methods. Specifically, with 131 classes, our model achieves 99.9%, compared with Pure-CNN (81 classes), CNN (60 classes), and VGG16 (120 classes) at 98.8%, 96.3%, and 95% accuracy, respectively. Along with these achievements, our model also has limitations. The Grey Wolf Optimization (GWO) algorithm's disadvantages include a slow rate of convergence and a propensity for local optima on specific problems. In future work, we will utilize Improved Grey Wolf Optimization (IGWO). This method adds a disturbance to the wolves to activate their search capabilities, and dynamic weights are applied to the wolves' positions to prevent the pack from losing diversity and settling into a local optimum. This strategy could enhance the accuracy of our model.

V. CONCLUSION
This paper has proposed a method for classifying various types of fruits in order to support manufacturing. The scientific community has been working to classify various fruits effectively by leveraging the capabilities of ML and DL. In this study, we proposed a model using DenseNet121 as a feature extractor and a feature selection mechanism to choose the most relevant features for classification from the fruit images. Adaptive Particle Grey Wolf Optimization (APGWO) is used as the feature subset selection method. With 508 features selected from the original set, we applied various models for training, such as K-Nearest Neighbour (KNN), Random Forest (RF), Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Decision Tree (DT). The proposed methods achieve high accuracy, and the training and execution time is significantly reduced when irrelevant input attributes are removed. In future work, we may apply dimensionality reduction to multiple models before the FS step. In addition, the MLP may be tuned automatically by adjusting the number of hidden layers and hidden nodes, the activation functions, or the parameters of the feature selection method to improve performance. By combining the proposed method with volume/mass estimation methods [39], [40], [41], a complete system for fruit/vegetable recognition and sorting will be developed in upcoming research.