Cross Validation Voting for Improving CNN Classification in Grocery Products

The development of deep neural networks in recent years has made it possible to solve highly complex computer vision classification problems. Although the results obtained with these classifiers are often high, certain sectors demand even greater accuracy from such systems. The accuracy of neural networks can be increased through ensemble learning, which combines different classifiers and selects a winner among them based on different criteria. These techniques have traditionally shown good results, although they involve training models of a different nature and can even produce overfitting with respect to the training data, so datasets must be chosen carefully to evaluate the result correctly. In this paper, a Cross-Validation-Voting (CVV) technique for grocery product classification is presented. This technique improves several single state-of-the-art classifiers without combining different ones and avoids overfitting with respect to the training set. The single classifiers are trained multiple times on distributed sets, improving the best results obtained to date on a well-known dataset. In this dataset, an extensive test set was previously selected by its authors, allowing results to be compared with other papers in the literature. The technique is not limited to vision networks and can be used to solve numerous problems with different kinds of neural networks and classifiers.


I. INTRODUCTION
In recent years, there has been a significant development in the field of deep learning. Advances in hardware have made it possible to train complex models, such as neural networks with hundreds of hidden layers. These models are capable of learning to solve numerous classification problems but may have overfitting problems when the training data set is not sufficiently well structured and sized. The models are so sophisticated that they can achieve very high accuracy against training data but may not respond well to new test data.
In many classical methods, such as Support Vector Machines (SVMs) [1], models are trained on a dataset without using validation data to perform early stopping. These methods are usually evaluated against a test set or by cross-validation. A grid search is usually used to find the optimal model parameters, selecting those that give the best results in the cross-validation.
Neural networks require the division of data into three sets. A training set allows the model weights to be configured over different epochs. A validation set allows the model to be evaluated every few epochs, detecting the best performer on the validation data and stopping the training. In this way, although the model is trained with training data, we can keep the one that performs best with data not used directly in the adjustment of weights. This allows the model to generalize better to new cases. Finally, the model is evaluated with a test set. This step is necessary because when we stop the training depending on the validation evaluation, we can make the model tend to behave favorably to validation. For an overall evaluation of the model, some authors often use cross-validation to test how it behaves on different choices of dataset elements.
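The epoch-level selection described above can be sketched as follows. This is a schematic, not the authors' training code: it operates on a precomputed list of validation scores rather than a real training loop, and the `patience` parameter is a hypothetical stopping criterion added for illustration.

```python
def train_with_early_stopping(epoch_val_scores, patience=3):
    """Sketch of early stopping: keep the model (here, just the epoch
    index) with the best validation score, and stop once the score has
    not improved for `patience` consecutive evaluations."""
    best_epoch, best_score, waited = -1, float("-inf"), 0
    for epoch, score in enumerate(epoch_val_scores):
        if score > best_score:
            # New best on validation: keep this checkpoint.
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation metric stalled: stop training
    return best_epoch, best_score
```

Note that the kept checkpoint is the one that performed best on validation, which is precisely why validation data indirectly shapes the final model.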
The early stopping approach followed to train neural networks makes the model favorable to the validation data, something known as validation overfitting. Although the evaluation of the model with respect to the test set is realistic, being totally independent unknown data, the selected model may not be the one that generalizes best. In this article, we propose a technique called Cross-Validation-Voting (CVV) that allows better generalization to unknown cases. To do so, we combine the models trained by CVV in an ensemble voting scheme, where each participating model is trained with a different training and validation set. Although each selected model is favorable to its own validation set, these sets are all different, so the result is a model that generalizes better to the test data.
To test the CVV method, we have used a well-known dataset of supermarket products. In the Grocery Store Dataset [2], the authors established a prior separation of training and test data, which allows realistic comparisons between different authors. Our technique has shown that, with a single classifier, it is possible to outperform the best results obtained to date. When we ensemble different models using CVV, the results improve even more. In addition to the tests using only neural networks, we also present results using other techniques, such as boosting with trees or SVM classifiers. Another important aspect to note about our technique is that individual classifiers in the CVV model can be trained for fewer epochs than a single classifier, so the training time is not substantially altered.
Although cross-validation and ensemble model techniques are well known and have been widely used, to our knowledge, no other paper has combined them to improve model generalization. More specifically, no similar work has been carried out in the agrifood sector [9]. Some authors have used both techniques, such as in [3] or [4]. However, in both cases, they created an ensemble model that was evaluated using cross-validation, and they used classifiers of different nature. The main difference from our approach is that we integrate the two techniques to train the model, and the evaluation of the model is carried out with a fully independent test set. In addition, our method works with a single type of classifier. These previously mentioned methods, together with other related works, have been evaluated and compared with our method, showing how it achieves a remarkable improvement in the results.
The present paper is structured as follows: Section II explores the state of the art of the technologies considered in this paper. Section III describes the procedure that has been carried out. In Section IV, the different experiments and results obtained with the proposed method are reported, together with an overall discussion of the obtained results. Finally, Section V notes the advantages and limitations of the presented system and suggests future developments.

II. OVERVIEW OF RELATED WORK
One of the common problems in neural network training is the lack of generalization capability [5]. This may be due to overfitting with respect to the training data but also to overfitting with respect to the validation data. Neural networks are usually trained in epochs on a set of training data, periodically evaluating the behavior of the network on a validation set and detecting a worsening of certain metrics on the validation. This mechanism, called early stopping, makes it possible to choose the model that best fits the validation data. The model is then evaluated against a test set. However, as the model chosen is the one that performed best with the validation data, these data indirectly influence the model, so we may be faced with overfitting with respect to the validation set. Although the model may obtain good results on the test set, it is probably not optimal due to this tendency toward a specific validation set.
The cross-validation mechanism [6] adds an additional step to the training mechanism. Traditionally used with SVM-type classifiers [1], it divides the training set into k validation slots and trains k models using the training data that do not belong to their respective slots. Then, the models are evaluated with respect to their particular validation slot and the average accuracy of the k models is calculated. In a grid search of the model hyperparameters, e.g. C, gamma or the kernel in SVM, a cross-validation is performed for each combination of the parameters in the grid. Finally, the parameters that best respond to the cross-validation are selected and a new model is trained with all data from the training set.
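As an illustration of this grid-search-over-folds procedure, the following sketch implements the selection loop in plain Python. The classifier is a deliberately trivial stand-in (a one-parameter threshold rule on 1-D data), not an SVM; only the fold construction, the per-parameter averaging, and the final retraining on all the training data mirror the mechanism described above.

```python
def k_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous validation folds,
    each paired with the remaining indices as its training set."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples
        folds.append((indices[:start] + indices[end:], indices[start:end]))
    return folds

def fit_threshold(xs, ys, margin):
    """Toy 'model': classify x as 1 if x >= mean(positive xs) - margin."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(pos) / len(pos)) - margin

def accuracy(threshold, xs, ys):
    preds = [1 if x >= threshold else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def grid_search_cv(xs, ys, grid, k=5):
    """For each hyperparameter value, average validation accuracy over k
    folds; pick the best value and retrain on ALL training data."""
    best_param, best_score = None, -1.0
    for margin in grid:
        scores = []
        for train_idx, val_idx in k_folds(len(xs), k):
            t = fit_threshold([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx], margin)
            scores.append(accuracy(t, [xs[i] for i in val_idx],
                                   [ys[i] for i in val_idx]))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_param, best_score = margin, mean_score
    # Final step of classic cross-validation: train on the whole
    # training set with the winning hyperparameter.
    return fit_threshold(xs, ys, best_param), best_param
```

With an SVM, `margin` would be replaced by the grid over C, gamma and the kernel, but the selection logic is the same.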
In the case of neural networks, cross-validation is mainly used as a method of model evaluation. It has been used for the evaluation of very different problems, such as medical applications [7] or the classification of grocery products [8]. It is important to mention that most of the methods used in the grocery product classification problem only use one training and one validation set, rather than three sets (training, validation and test) [9]. The cross-validation evaluation allows selecting certain hyperparameters to improve the model. Most of the literature applying cross-validation is limited to using this method to provide an evaluation of the proposed architectures. However, in a real scenario, the selected model must be usable. Once the hyperparameters have been chosen, there are different ways to obtain a usable model: we can select one of the k models, usually the one that offers the best results; we can re-train the model with validation data randomly selected from the original data; or we can directly train the model with all data and stop at a certain number of epochs estimated during the training of the k models. This last option is valid when no early-stopping mechanism has been used; early stopping is not common in cross-validation except for the training of the final model. In our article, the cross-validation mechanism allows the model to be trained using all data from the training set while maintaining generalization, since different validation sets are used to choose the optimal parameters. This avoids choosing parameters that overfit the model to a particular validation set and ensures they respond well to different types of data.
On the other hand, the techniques known as ensemble learning [10]–[12] allow training different classifiers with the same dataset and combining them. Thus, for example, we could combine SVMs, decision trees or neural networks. There are four types of ensemble methods: voting, bagging, boosting and stacking. In most situations, these techniques have been shown to improve the performance of individual classifiers.
Voting [13] can be carried out in two ways in classification problems. In hard voting, each model produces a vote for a class. As a final prediction, the class voted by the majority of the models is chosen. On the other hand, in soft voting, probabilities are used. If a model is not totally sure of a class but that class is a winner, for example with a probability of 0.6, instead of a vote for that class, its probability is taken into account. This allows the model to value more highly those outcomes of which it is truly certain. This is recommended for an ensemble of well-calibrated classifiers. The voting technique has been widely used in conjunction with Convolutional Neural Networks (CNN) for a variety of problems: signal modulation [14], coronavirus diagnosis [15], human action classification [16], multimodal emotion recognition [17] and even detection of the camera used in forensic imaging [18]. Regarding groceries, a systematic investigation on end-to-end deep recognition of grocery products using voting was presented in [19]. The authors presented the integration of different CNN classifiers using voting, showing the effectiveness of the method.
In bagging [20], several models, similar or different, are trained with a subset of the original data usually chosen randomly. This data may be repeated among the different trained models. Once all the models have been trained, they are all combined using soft or hard voting techniques. Random forest is a classifier based on this method.
Boosting techniques [21] aim at repeatedly training a model by correcting the errors of the previously trained models. To do this, a model is trained with some samples. The model is then re-trained with the same samples, but each sample is assigned a weight depending on whether it was classified correctly in the previous step or not. At the end of training, the models are combined by weighting them in a certain way. One of the most widely used classical methods has been AdaBoost [22], a method also used in grocery product recognition in conjunction with CNN networks [23]. However, over the last few years, new boosting techniques have been developed and applied in this line, such as XGBoost [24], CatBoost [25] or LightGBM [26].
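The reweighting idea can be made concrete with a minimal sketch of discrete AdaBoost on 1-D data, using decision stumps (threshold rules) as weak learners. This is an illustrative toy, not the configuration used in [23]: after each round, misclassified samples gain weight, so subsequent stumps focus on them.

```python
import math

def stump_predict(stump, x):
    threshold, polarity = stump
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best, best_err = None, float("inf")
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((threshold, polarity), x) != y)
            if err < best_err:
                best, best_err = (threshold, polarity), err
    return best, best_err

def adaboost(xs, ys, rounds=5):
    """Labels ys must be +1/-1. Returns a list of (alpha, stump) pairs."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = max(err, 1e-10)  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Reweight: misclassified samples gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

The final strong classifier is the sign of the alpha-weighted sum of the weak learners, which is the "merging" step mentioned above.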
Finally, the stacking problem [27] consists of stacking the output of one or more models on others, which in turn can be stacked on other models. An example of this technique was used in [28] for the grocery classification problem. In [29], the authors conducted a multistage training procedure, in which they first trained with a large class-level dataset with a single view image per category, followed by an auxiliary dataset of multiple views, which allowed the model to be robust to viewpoint changes. Finally, they trained with the objects they wanted to recognize from a single image.
Grocery product recognition offers many applications, such as the control of eating habits [30]. Over the past few years, several datasets have been created for grocery store products, such as the Grocery Store Dataset [2], the MVTec D2S dataset [31], the Retail Product Checkout dataset (RPC) [32] or the Freiburg groceries dataset [33]. Among them, the MVTec D2S, RPC and Freiburg datasets focus on the problem of object detection rather than classification. The Grocery Store Dataset [2] contains image data of grocery items categorized into fine and coarse classes. It consists of 5,125 images of 81 different types of fruits, vegetables, and carton items (e.g., juice, milk or yogurt). All images were taken with a smartphone in different grocery stores. There are 81 fine classes, grouped into 43 coarse categories. As an example, the fine classes Royal Gala and Granny Smith belong to the same coarse class Apple. The authors separated the test set so that it is possible to correctly compare different models and architectures. A classification baseline was also provided, in which the authors connected the CNN feature vector before the classification layer to an SVM classifier. Testing different CNNs (AlexNet [34], VGG16 [35] and DenseNet [36]), the authors obtained 72.5% test accuracy directly using a model pre-trained on ImageNet [37], and 85% test accuracy after fine-tuning. In both of these best-performing cases, the authors used a DenseNet-169. They also evaluated a DenseNet-169 without SVM, obtaining 84% test accuracy. In addition to the data, this dataset provides iconic images, which show each product under controlled lighting conditions and without the supermarket background. The authors of [19] have obtained the best classification results for this dataset to date, using voting techniques on different CNNs. In the following, we will show how the CVV technique can improve their results with a single classifier.

III. ANALYSIS OF THE SYSTEM
In our training method, we wanted to improve the current training results of a known dataset of grocery products. Traditional bagging methods select random groups of data to train different models. Each selected set of items includes different samples. Bagging methods usually do not take into account validation sets, something that has gained special interest in neural networks, where early-stopping mechanisms allow stopping training before overfitting occurs. In the CVV method, the training dataset is distributed in different training and validation slots so that we use all the data for training the ensemble model. Each individual model uses a part of the general training dataset as training and another part as validation. In this way, we do not lose samples during the training process. Although the method is valid for different types of classifiers, it has been evaluated with different current neural networks, SVM and boosting methods. Figure 1 shows the approach taken to solve this problem. In the CVV approach, the data are previously randomized and k different validation slots are selected. The rest of the data from each slot are used for training.
Let τ be the set that includes all the samples of the complete training dataset, and let T_i and V_i be the training and validation sets corresponding to slot i. These sets must satisfy equations (1) to (6).
Therefore, the elements of a training slot i are those used by all the other slots for validation, as shown in (7): T_i = ∪_{j≠i} V_j.
In the same way, the elements of a validation slot i are the intersection of all the other training sets, as shown in (8): V_i = ∩_{j≠i} T_j.
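A minimal sketch of this slot construction: the training indices are shuffled once and dealt into k disjoint validation slots, and training slot i is simply everything outside validation slot i, which realizes the set relations described above. This is an illustrative helper, not the authors' implementation.

```python
import random

def cvv_slots(n_samples, k, seed=0):
    """Distribute sample indices into k (training, validation) slot pairs.
    Every sample appears in exactly one validation slot, and training
    slot i is the union of all the other validation slots."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # k disjoint validation slots covering the whole training set.
    validation = [indices[i::k] for i in range(k)]
    # Training slot i = all samples outside validation slot i.
    training = [[idx for j, slot in enumerate(validation) if j != i
                 for idx in slot] for i in range(k)]
    return training, validation
```

Unlike bagging's bootstrap sampling, no sample is duplicated or lost: the k validation slots form a partition of the training data.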
Then, k models of the same nature are trained, one per pair of training and validation slots. Finally, prediction is carried out using voting techniques on the selected models. Our idea was to distribute the data into different validation slots, as cross-validation does. This allowed us to train the model with different training sets but, more importantly, to allow the model to generalize to different situations. Once the models were trained with their respective validation and training slots, all of them were combined using classical voting techniques. Let k be the number of classifiers, each one associated with its respective validation slot. For an input sample x, p_i(x) is the vector of output probabilities given by classifier i. This vector is composed of the probabilities p_ij(x), each representing that sample x belongs to class j according to classifier i. c = {c_1, c_2, ..., c_N} is the vector of labels, one for each possible class. In (9), the soft-voting prediction is obtained by accumulating the output probabilities of each class j: ŷ(x) = argmax_j Σ_{i=1}^{k} w_i p_ij(x), where w_i is a weight associated with each classifier i, in our case 1/k, and the argmax function returns the position of the class with the highest cumulative probability.
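In code, the soft-voting rule of (9) reduces to a weighted sum of the classifiers' probability vectors followed by an argmax. A minimal sketch with uniform weights w_i = 1/k:

```python
def soft_vote(prob_vectors, weights=None):
    """Soft voting: accumulate each classifier's class probabilities,
    weighted (uniformly by 1/k here), and return the winning class index."""
    k = len(prob_vectors)
    if weights is None:
        weights = [1.0 / k] * k
    n_classes = len(prob_vectors[0])
    scores = [sum(w * probs[j] for w, probs in zip(weights, prob_vectors))
              for j in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)
```

For example, with probability vectors [0.9, 0.1], [0.45, 0.55] and [0.45, 0.55], soft voting elects class 0, because the first classifier's high confidence outweighs the two narrow majorities for class 1.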
Hard voting requires prior binarization of the probabilities, as shown in (10): only the class with the highest probability is set to 1, and the rest to 0.
In (11), the output is obtained by accumulating the binary values of each class j: ŷ(x) = argmax_j Σ_{i=1}^{k} w_i b_ij(x). As in soft voting, w_i is 1/k in our case.
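Hard voting, combining the binarization of (10) with the accumulation of (11), can be sketched as:

```python
def binarize(probs):
    """Binarization as in (10): 1 for the top class, 0 elsewhere."""
    winner = max(range(len(probs)), key=probs.__getitem__)
    return [1 if j == winner else 0 for j in range(len(probs))]

def hard_vote(prob_vectors, weights=None):
    """Hard voting: each classifier casts one vote for its top class;
    votes are accumulated (uniformly weighted by 1/k here)."""
    k = len(prob_vectors)
    if weights is None:
        weights = [1.0 / k] * k
    votes = [binarize(p) for p in prob_vectors]
    n_classes = len(prob_vectors[0])
    tallies = [sum(w * v[j] for w, v in zip(weights, votes))
               for j in range(n_classes)]
    return max(range(n_classes), key=tallies.__getitem__)
```

With the probability vectors [0.9, 0.1], [0.45, 0.55] and [0.45, 0.55], hard voting elects class 1 (two votes to one), whereas soft voting elects class 0; this is exactly the calibration difference between the two schemes.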
The CVV approach can also be used with classifiers of different nature. Figure 2 shows how three different classifiers can be integrated: ResNeXt-101, EfficientNet B7 and Wide ResNet-101. In this case, the data partitioning into k slots is carried out in the same way as in the case of a single classifier. Each classifier of a different nature is trained with the same slots used by the others, taking advantage of the strengths each model offers on the same data partitioning. Finally, all classifiers are combined using voting techniques. In terms of modern neural network-based classification techniques, deep learning has brought considerable progress to image classification problems using computer vision. Deep neural networks use many successive convolutional layers that attempt to capture the salient elements of images, from the most general to the most specific. Until a few years ago, most CNN networks used for food classification, such as AlexNet [34] or VGG [35], had problems with vanishing gradients. The accuracy started to saturate at a certain point and eventually decreased, and the model failed to converge because the gradients vanished. These problems were partially solved by residual blocks, which connect the input of a block to the output of that block through aggregation. Networks such as Inception [38] or ResNet [39] began to be used, models that are still widely used today. Another problem with deep convolutional networks was the rapidly increasing number of parameters as the number of layers increased. The ResNet architecture included residual bottleneck blocks. This variant of the residual block uses 1x1 convolutions to create a bottleneck. These bottleneck blocks reduce the number of parameters and matrix multiplications without noticeably changing the result. The idea was to make the residual blocks as thin as possible to increase depth while keeping fewer parameters.
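The parameter saving of bottleneck blocks is easy to check by counting convolution weights. A back-of-the-envelope sketch, with biases and normalization layers ignored and the 4x channel reduction assumed as in standard ResNet bottlenecks:

```python
def conv_params(in_ch, out_ch, kernel):
    """Weight count of a 2-D convolution (bias terms ignored)."""
    return in_ch * out_ch * kernel * kernel

def plain_block_params(channels):
    """Basic residual block: two 3x3 convolutions at full width."""
    return 2 * conv_params(channels, channels, 3)

def bottleneck_block_params(channels, reduction=4):
    """Bottleneck block: 1x1 reduce, 3x3 at reduced width, 1x1 expand."""
    mid = channels // reduction
    return (conv_params(channels, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, channels, 1))
```

For 256 channels, the plain block uses 2 · 256 · 256 · 9 = 1,179,648 weights, while the bottleneck uses only 69,632, roughly a 17x reduction, which is what makes much deeper networks affordable.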
Among the networks used in our experiments we can find ResNeXt-101 [40], EfficientNet-B7 [41] and Wide Residual Networks (WRNs) [42]. The ResNeXt-101 [40] model is based on a ResNet model, but replaces the 3x3 convolutions within the ResNet model with grouped 3x3 convolutions. The ResNeXt bottleneck block divides a single convolution into multiple smaller parallel convolutions. A notable difference is that ResNeXt uses aggregation instead of the concatenation used in the original Inception-ResNet block. This state-of-the-art network, as well as the two presented below, has been considered in our experiments as it generally gives better results than previous CNNs.
EfficientNet-B0 [41] is a novel network that seeks a balance between the number of parameters and accuracy. For this purpose, they defined this model by leveraging a multi-objective neural architecture search that optimized both accuracy and FLOPS, similarly to MNAS-Net [43]. EfficientNet-B7 [41], scaled from EfficientNet-B0, is based on a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient.
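Compound scaling can be illustrated numerically. The base coefficients below (α = 1.2, β = 1.1, γ = 1.15) are the values reported in the EfficientNet paper [41], chosen so that FLOPS grow roughly by α · β² · γ² (about 2x) per unit of the compound coefficient φ; the sketch only computes the multipliers, not an actual network.

```python
# EfficientNet compound-scaling coefficients for depth, width, resolution
# (values reported in the EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """FLOPS scale with depth, width^2 and resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2
```

At φ = 1, the FLOPS multiplier is 1.2 · 1.1² · 1.15² ≈ 1.92, i.e., close to a doubling per step, as intended by the search in [41].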
Wide Residual Networks (WRNs) [42] consider the problem that each fraction of a percent of improved accuracy costs nearly doubling the number of layers. In addition, the training of very deep residual networks is very slow because they have a problem of diminishing feature reuse. To solve this problem, the authors proposed a novel architecture where they decreased depth and increased width of residual networks.
Another approach we have analyzed consists of using the CVV model together with the features obtained by means of a CNN and different types of classifiers, such as SVM or boosting-based classifiers. For this method, we start from the same previously split dataset and train each of the CNN estimators, such as ResNeXt-101. Then, the features of the model are extracted for each estimator and for each dataset, and another SVM or boosting-type classifier is trained. Figure 3 shows the process. The blue and gray arrows show training and inference, respectively. During inference, the test data is passed through the CNN to obtain its features and subsequently passed through the SVM or boosting model to obtain the prediction of the k-th estimator. Finally, voting is applied as in the previous cases to take advantage of all classifiers. This technique seeks to determine whether methods such as SVM or boosting can further refine the classification results.
Deep neural networks often have generalization problems if they have not been properly trained. Although ResNeXt-101 is one of the models with the best generalization capacity, we wanted to use a technique that could improve the classification results on a known dataset of grocery products [2]. In addition, we wanted a technique that used a single type of classifier. The authors of [19] had managed to significantly outperform the baseline results established by [2], using different classifiers independently trained on the same training data. We set out to find some technique that would allow us to improve their results without using different classifiers. Based on ResNeXt-101, we analyzed cascade classification in a previous paper [28]. That paper showed that the stacking technique worked well to improve the classifier results. However, we were not able to surpass the results of [19] with that technique. The novel CVV technique generalizes better, achieving an improvement in the results obtained by all the previous methods. It requires a single type of classifier and fewer epochs than other methods.

IV. EXPERIMENTS AND RESULTS DISCUSSION
In a first experiment, we have evaluated how the CVV training technique improves three different current models: ResNeXt-101 [40], EfficientNet B7 [41] and Wide ResNet-101 [42]. In all the defined models, apart from the convolution and pooling layers, a 0.2 dropout layer has been added between the features and the output classes. These models work with a single fully connected layer between the feature vector, with the dropout, and the output classification vector. The network has been adapted to work with 600x600 input images. After several tests, we found that this size offered slightly better results than other sizes, both larger and smaller. The interior architecture of the models has not been modified from the original, with the exception of the last connection layer. The cross-entropy error function and the Adam optimizer, with an initial learning rate of 5 × 10^-5, have been used during training. As we have used transfer learning from models previously trained on ImageNet [37], we have used image normalization with the mean and standard deviation calculated per channel on it. This normalization method is usually applied with transfer learning and images with a histogram distribution similar to ImageNet. Beyond the normalization and use of transfer learning on all trained models, we did not use any additional image preprocessing technique.
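The per-channel normalization mentioned above uses the standard ImageNet statistics. A minimal sketch for a single RGB pixel, assuming channel values are already scaled to [0, 1]:

```python
# Standard per-channel ImageNet statistics (RGB), commonly used to
# normalize inputs when fine-tuning models pre-trained on ImageNet.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Normalize one RGB pixel (channel values in [0, 1]) to zero mean
    and unit variance per channel under the ImageNet statistics."""
    return tuple((value - mean) / std
                 for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))
```

In practice, this operation is applied to every pixel of the 600x600 input image; the sketch isolates the arithmetic.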
Tables 1, 2 and 3 show how the models improve by applying CVV training with 5 classifiers to ResNeXt-101, EfficientNet B7 and Wide ResNet-101, respectively. Table 4 shows the result of soft voting for different numbers of estimators applied to ResNeXt-101. Above 6 estimators, the results of the joint model start to decrease. Table 5 shows the comparison of the models trained with our technique against other current models working on the same dataset. The authors of the Grocery Store Dataset [2] trained a DenseNet-169 that combined the feature vector with an SVM classifier. They achieved 85% test accuracy with a fine-tuned model. Similarly, they evaluated a DenseNet-169 classifier network without SVM, achieving 84% accuracy.
Regarding the ensemble models, in [3], the authors presented an application of transfer and ensemble learning techniques (ETL) for cervical histopathology image classification. They used ResNet-50, Inception v3 [44], VGG16 and XCeption [45]. We have evaluated this technique and have seen that, although it improves the base classification results (87.8% accuracy), it obtains worse results than other existing methods and the one presented in this article. In [4], the authors presented a fuzzy rank-based ensemble of CNN models for classification of cervical cytology. They used DenseNet-169, Inception v3 and XCeption. This method achieves an accuracy of 88.45%, although still below ours and other existing methods.
In [28], a ResNeXt-101 [40], a ResNet-152 [39] and a stacking model of two ResNeXt-101 were evaluated. For the ResNeXt-101, the test accuracy was 90.80%, with a precision of 92.50%, recall of 92.10%, balanced accuracy of 92.09%, and F1-Score of 92.30%. The same experiment was carried out with a ResNet-152. The test accuracy was 89.90%, with a precision of 91.60%, recall of 91.10%, balanced accuracy of 91.09%, and F1-Score of 91.30%. During the experiments, oranges and satsumas were frequently confused. To overcome this problem, a cascade classifier was proposed. The models were trained with early stopping and 140 epochs. The validation data were split using 30% of the data, with a balanced split. In the stacking model, the first classifier had 81 output classes while the second classifier had only 2 outputs (oranges and satsumas). That stacking model improved the test accuracy up to 92.0%.
It deserves special attention that, in the experiments we have developed with CVV, we train the models for only 10 epochs, also using early stopping. This is important, since training 5 estimators takes even less time than a single 140-epoch training of one model, given that early stopping halts the latter after around 50 epochs.
In [19], the authors used a hard voting ensemble approach. They evaluated different cases, with the "C" and "D" ensemble models producing the best results. The training sessions have been carried out on an i9-10900K server with 128GB RAM and 2 RTX-3090 GPUs with 24GB GDDR6X each. As a guideline, this server required 60 minutes to train each of the ResNeXt-101 classifiers.
Another set of experiments has been performed to see how the model behaves using a different classification algorithm. Instead of connecting the output features of the convolutional part of the neural network to a classification layer, the features have been extracted to evaluate other models. Using the features provided by ResNeXt-101, the CVV technique has been applied for 5 estimators. Then, the ResNeXt-101 models have been trained and the image features have been extracted for each estimator. Next, several SVM classifiers have been trained, one per estimator. In the case of SVM, hard voting has been performed. Each of the SVM estimators has been trained by means of a grid search, and each estimator may have a different kernel, or different C and gamma values (e.g., for the first estimator: [C: 0.1, gamma: 1, kernel: linear]; for the second: [C: 1, gamma: 0.001, kernel: rbf]). The aim of this experiment has been to analyze whether this algorithm could improve the results obtained by classifying directly with the neural network, as was done in [2]. However, this experiment has shown that, in our case, classification has worked better using the whole neural model with 5 estimators.
Following the same procedure, different boosting algorithms have been evaluated on the basis of the features. The training of these algorithms is very computationally expensive if the whole convolutional part is trained in each estimator, so the approach using only the features is more convenient. Among the boosting algorithms, both individually and using CVV, we have evaluated:
• AdaBoost [22], the classic boosting method, which creates several simple predictors in sequence, so that the second one corrects the errors of the first classifier, the third one the errors of the second, and so on. Finally, all classifiers are merged into a strong classifier. The base classifier used has been a decision tree, and the parameter search has been carried out using a grid search, obtaining the best results with 1,000 estimators, a depth per tree of 40 and a learning rate of 1.5.
• XGBoost [24], an extreme gradient boosting method inspired by [46], which built an additive model in a forward stage-wise fashion, optimizing arbitrary differentiable loss functions. XGBoost adds features such as clever penalization of trees, Newton boosting, proportional shrinking of leaf nodes, an extra randomization parameter, distributed computing and automatic feature selection. After a grid search, the best results have been obtained using trees based on the faster histogram-optimized approximate greedy algorithm, with a learning rate of 0.01, 4,000 estimators and a maximum of 2 discrete bins to bucket continuous features. For this method, early stopping has been used during training.
• LightGBM [26], a method that grows the tree leaf-wise, choosing the leaf it believes will yield the largest decrease in loss. LightGBM implements a highly optimized histogram-based decision tree learning algorithm, improving efficiency and reducing memory consumption. In addition, it uses Gradient-Based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which allow the algorithm to keep its accuracy while running faster. After a grid search, the best results have been obtained using DART trees [47], 200 estimators, a learning rate of 0.09 and a maximum of 3 discrete bins.
• CatBoost [25], a gradient boosting method that handles categorical features, uses ordered boosting to mitigate overfitting, and uses oblivious (symmetric) trees for faster execution. After a grid search, the best results have been obtained with 2,000 iterations, a maximum depth per tree of 5 and a learning rate of 0.1. For this method, early stopping has also been used during training.
These methods are some of the most up-to-date boosting techniques. Again, this experiment has shown that, in our case, classification works better using the whole neural model with 5 estimators. Table 6 shows the results of this experiment. Both CatBoost and XGBoost give the best results when combined with the CVV technique on the ResNeXt-101 features. SVM obtains close but slightly inferior results. However, none of these techniques is able to outperform the final classification layer of the neural network itself. This allows us to affirm that each individual model is trained optimally directly within the neural network. Although other authors, such as [2], obtained good results by connecting an SVM to the features, it is clear from our experimentation that it is not necessary to extract the features: we can simply let the model be trained end to end using the classification layer itself.
As seen previously, the best model is obtained by Soft CVV with 5 estimators of each type: ResNeXt-101, EfficientNet B7 and Wide ResNet-101. A Precision-Recall curve with the average precision score, micro-averaged over all classes, is shown in Figure 4. In addition, a Multi-Class Precision-Recall curve, with all classes, is shown in Figure 5. The evaluation of the model shows good performance, reaching 98% AUC (Area Under the Curve). The confusion matrix of this model is shown in Table 7. To present the results clearly, instead of showing the 81 fine classes, we present the 43 coarse categories into which the results have been grouped. Among the limitations of the presented method are the need to train multiple models and the need for parallel inference in order to operate in real time. However, we have shown that the individual models can be trained in fewer epochs than in a normal training. Furthermore, for real-time execution it would be sufficient to size the GPU memory so that all models can be loaded simultaneously. Since models are usually considerably smaller during inference than during training (approximately 2.2 GB for a ResNeXt-101 during inference), it is possible to load several models and infer in parallel without significantly increasing the total runtime with respect to a single classifier. Therefore this method, despite requiring the training of several models, can be used for real-time applications of various kinds, such as industrial ones. Beyond grocery product classification, high accuracy is sought in industry and other sectors, and this technique makes it possible to surpass the results that individual classifiers offer.
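The combination step itself adds almost no runtime: once each model's softmax outputs are available (computed in parallel, as discussed above), soft voting only averages them and takes the most probable class. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def soft_vote(prob_batches):
    # prob_batches: one (batch, n_classes) array of softmax
    # probabilities per trained estimator. Soft voting averages the
    # per-class probabilities and predicts the most probable class.
    return np.stack(prob_batches).mean(axis=0).argmax(axis=1)

# Three hypothetical estimators disagree on a 2-class sample; the
# averaged probabilities decide.
p1 = np.array([[0.60, 0.40]])
p2 = np.array([[0.45, 0.55]])
p3 = np.array([[0.20, 0.80]])
winner = soft_vote([p1, p2, p3])  # mean is [0.4167, 0.5833] -> class 1
```

Note that the first estimator alone would have predicted class 0; the ensemble overrules it because the other two are more confident in class 1.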

V. CONCLUSIONS
In recent years, there has been significant progress in deep learning methods to solve classification problems of considerable complexity, such as the recognition of grocery products by means of computer vision. The training of these models is usually performed by dividing the data into training, validation and test sets. With this technique, the model is trained with the training data and periodically evaluated against validation until its validation accuracy starts to decrease. Using early stopping, we can halt training when this happens, ensuring that the model is optimal on data not used directly in training. Subsequently, the model is evaluated against the test data. The problem with this approach is that by using validation to stop training we are making the model optimal against that particular validation set, but it may not be optimal against other validation samples. There are cross-validation techniques that allow the model to be evaluated with different validation sets, but they do not allow us to obtain a model that takes all of these sets into account. All these problems lead to the lack of generalization that we find in many models used in various fields, such as industry.
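The early-stopping scheme described above can be sketched generically. The `StubModel` below and its accuracy curve are fabricated purely for illustration, standing in for any network whose validation accuracy rises and then falls as it starts to overfit:

```python
class StubModel:
    # Hypothetical stand-in for a network; its validation accuracy
    # follows a fixed curve that improves and then degrades.
    def __init__(self, val_curve):
        self.val_curve, self.epoch = val_curve, -1
    def fit_one_epoch(self, train_data):
        self.epoch += 1
    def evaluate(self, val_data):
        return self.val_curve[self.epoch]
    def snapshot(self):
        return self.epoch  # a real model would return its weights

def train_early_stopping(model, train_data, val_data,
                         patience=3, max_epochs=100):
    # Stop when validation accuracy has not improved for `patience`
    # consecutive epochs; keep the snapshot of the best epoch.
    best_acc, best_state, bad = -1.0, None, 0
    for _ in range(max_epochs):
        model.fit_one_epoch(train_data)
        acc = model.evaluate(val_data)
        if acc > best_acc:
            best_acc, best_state, bad = acc, model.snapshot(), 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_state, best_acc
```

The key observation, which motivates CVV, is that `best_state` depends entirely on the single validation set passed in: a different split would stop at a different point and keep different weights.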
Bagging, stacking, boosting and voting techniques, traditionally used with classical algorithms, make it possible to integrate multiple models into a more powerful classifier. However, these techniques are mainly designed to benefit from classifiers of different nature to obtain an overall result that improves on all of them. Among these techniques, bagging selects subsets of elements to train the same or different classifiers. Originally, when validation sets were practically not in use, this method made it possible to build a model that generalized better in the face of unknown situations.
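For contrast with CVV's disjoint validation folds, bagging's bootstrap sampling can be sketched as follows (function name and toy usage are ours):

```python
import random

def bootstrap_samples(data, n_models, seed=0):
    # Bagging trains each of the n_models on a bootstrap sample:
    # drawn with replacement, same size as the original set. Unlike
    # k-fold partitions, the samples overlap and repeat elements.
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]
```

Because the draws are with replacement, each model sees roughly 63% of the distinct training elements on average, and no element is reserved for validation.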
Our paper applies a method, named Cross Validation Voting (CVV), to the grocery product classification problem. It integrates the advantages of cross validation into the training scheme itself. The data are partitioned as in the k-fold validation technique, but each train/validation split is used to train a classifier of the same type. In this way, each classifier stops at the point where it performs best against its corresponding validation fold. As all validation folds are different, each model is optimal for its respective validation. Subsequently, by grouping all the models using voting techniques, we manage to improve the results obtained by the model at the individual level for all the neural networks evaluated. The resulting model significantly improves the results and the generalization capacity of the individual models.
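Under our reading of the scheme above, CVV can be sketched as follows. The nearest-centroid toy model merely stands in for a CNN trained with early stopping on its fold; all names are illustrative:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle and split indices into k disjoint validation folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def train_with_early_stopping(train_X, train_y, val_X, val_y):
    # Placeholder for "train one network, early-stopping on this
    # fold's validation set". Here: a nearest-centroid toy model
    # that ignores the validation arguments.
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0)
                          for c in classes])
    return classes, centroids

def predict_proba(model, X):
    # Softmax over negative squared distances to each centroid.
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

def cvv_fit(X, y, k=5):
    # Cross-Validation Voting: model i trains on the k-1 folds other
    # than fold i and early-stops against fold i, so every model sees
    # a different validation set.
    folds = kfold_indices(len(y), k)
    models = []
    for i in range(k):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(k) if j != i])
        models.append(train_with_early_stopping(X[trn], y[trn],
                                                X[val], y[val]))
    return models

def cvv_predict(models, X):
    # Soft voting: average the k probability outputs, then argmax.
    return np.mean([predict_proba(m, X) for m in models],
                   axis=0).argmax(axis=1)
```

Replacing `train_with_early_stopping` with a real network trainer recovers the scheme used in the paper; the surrounding fold logic and the soft-voting combination are unchanged.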
Numerous experiments have been carried out and we have shown how the best results obtained to date on a well-known dataset of grocery products have been improved. In addition, we have shown that, when classifiers of different nature are integrated into the model, the results improve even more.
We are currently applying this method to improve classification systems used in other fields, such as industry, activity recognition and robotics. In industry, for example, applications require very high precision. Beyond classification models based on computer vision, this technique can also be used with any other type of neural network or algorithm whose training can be stopped on the basis of validation data. It is a method that offers very promising results and makes it possible to advance the classification methods for products in the agrifood and supermarket sector.

FUNDING
This publication has been partially funded by the project "5R-Cervera Network in robotic technologies for smart manufacturing", contract number CER-20211007, under "Centros Tecnológicos de Excelencia Cervera" programme funded by "The Centre for the Development of Industrial Technology (CDTI)" and by "Instituto para la Competitividad Empresarial de Castilla y León" (Project CCTT3/20/VA/0003) through the line "2020 Proyectos I+D orientados a la excelencia y mejora competitiva de los CCTT", co-financed with FEDER funds.
ROBERTO MEDINA APARICIO is an Industrial Engineer and holds a doctorate from the University of Valladolid. He has worked as a researcher at CARTIF since 2005, where he has combined research work and industrial development. He began his professional career in the Robotics and Machine Vision Division and is currently part of the Industrial and Digital Systems Division. He has extensive experience in research projects involving machine vision, sensorization, 3D reconstruction and distributed computing. He has published 8 articles in impact scientific journals and at various conferences, and has participated in 1 international patent. He has worked on many research projects, both national and European. He is currently working on research projects related to computer vision based on deep neural networks, such as I-visart ("New artificial vision methodologies for the visual inspection of highly reflective and textured surfaces") and Agrovis ("Artificial Vision for products/processes in the agrifood sector"). He is also the project manager at CARTIF of the CERVERA network in robotic technologies for smart manufacturing (5R).