State Recognition of Food Images Using Deep Features

State recognition of food images is a recent topic that is gaining considerable interest in the Computer Vision community. Researchers recently presented a dataset of food images at different states where, unfortunately, no information regarding the food category was included. In practical food monitoring applications it is important to be able to recognize a peeled tomato rather than a generic peeled item. To this end, in this paper we introduce a new dataset containing 20 different food categories, taken from fruits and vegetables, at 11 different states ranging from solid and sliced to creamy paste. We experiment with the most common Convolutional Neural Network (CNN) architectures on three different recognition tasks: food categories, food states, and both food categories and states. Since lack of labeled data is a common situation in practical applications, here we exploit deep features extracted from CNNs combined with Support Vector Machines (SVMs) as an alternative to end-to-end classification. We also compare deep features with several hand-crafted features. These experiments confirm that deep features outperform hand-crafted features on all three classification tasks, regardless of the food category or food state considered. Finally, we test the generalization capability of the best-performing deep features on another, publicly available, dataset of food states. This last experiment shows that the features extracted from a CNN trained on our proposed dataset achieve performance quite close to that of the state-of-the-art method, confirming that our deep features are robust with respect to data never seen by the CNN.


I. INTRODUCTION
In the last few years, one of the most active topics in the Computer Vision community has been image understanding for object recognition [1], [2]. Within this context, automatic food analysis [3]-[7] is one application scenario that has received great attention recently.
Accurate tracking of daily nutrition intake is not only conducive to maintaining a healthy weight, but also important to treat and control food-related health problems like obesity and diabetes. Conventionally, this has been accomplished through manually recorded daily logs. Nowadays, technology can support users in keeping track of their food consumption in a more user-friendly way, allowing for a more comprehensive daily dietary monitoring. Computer vision techniques can help to build systems that automatically locate and recognize diverse foods as well as estimate the food quantity. For example, one may simply take a picture of a plate of food using a smartphone, and the whole process of measuring the total calories in the plate can be carried out by a visual understanding framework [8]-[11]. Cooking videos can be processed in order to extract food items, utensils, and cooking procedures to construct an interactive, computer-aided system for learning how to cook healthy recipes [12]-[14]. (The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.)
Automatic food recognition is thus an important task for different applications. However, food recognition is challenging due to the intrinsic properties of food items. For instance, food is a non-rigid object. It is characterized by high intra-class variability: the same food can have a very diverse visual appearance in different images due to different preparations, placements in the plate, or acquisition points of view. This can be seen in Figure 1, which shows different images of ''Panette e crocchette''.
Moreover, if we consider a video recipe, during the preparation of the dish a food item (e.g. a zucchini) assumes different shapes and appearances. For example, if we look at the video recipe of the ''Zucchini cream'' in Figure 2, we can see that the zucchini and the leek are initially whole and raw, then they are chopped in different styles (i.e. oblong and round for the leek and zucchini respectively), mixed, stir-fried, and finally whisked. During all these processing steps the appearance of the initial food varies significantly. The food is transformed into different states induced by the preparation steps themselves, and these states must be dealt with if we want to correctly identify the food throughout the whole video recipe.
Food state recognition is a topic that has not been extensively studied. The only previous works that tackled this problem are those by Jelodar et al. [15], who first introduced a new food-state challenge dataset, and by Salekin et al. [16].
According to [15], object states are characteristics into which an object can be transformed by some activity, and they can be described as changes in form, color, or texture. As can be seen in Figure 2, the texture of the zucchini greatly changes as it undergoes the processing transformations required by the recipe. Its visual texture changes as it is sliced and whisked, and its color and shape are heavily affected by the cooking process. The object is always a zucchini, but the states are very different and each of them should be dealt with.
An ideal food recognition system needs to recognize food independently of its state, but it also needs to identify a food state within the recipe. This is very important for automatic video recipe transcription as well as for fine-grained human activity understanding. The recognition of the different states of food is also essential if we want to determine the nutritional values of the food. As the food transitions from one state to another, its nutrients may change due to seasoning, cooking, or other preparation procedures. Being able to fully describe a food and its states will enable the implementation of intelligent dietary monitoring systems supporting users in controlling their food intake.
In this paper, inspired by the work of Jelodar et al. [15], we want to investigate the following issues:
• Can we recognize foods across different states?
• Can we recognize a food state independently of the food identity?
• How robust are CNN-based features with respect to hand-crafted features?
To answer these questions we created our own dataset of food states, which differs from the one in [15] in that it allows us to perform three classification tasks: recognizing a food item across different states, identifying a food state across different foods, and recognizing a food item at a given state. This is not possible with the existing dataset, which only allows performing the first task. Our dataset has been carefully curated and is composed of 11,943 images. It contains 20 categories of fruits and vegetables acquired in 11 states: batons, creamy paste, diced, floured, grated, juiced, julienne, peeled, sliced, wedges/quarters, and whole. We used the dataset for classification experiments evaluating several state-of-the-art hand-crafted features and features learned by recent Convolutional Neural Network architectures. We evaluated the generalization capability of our best CNN-based features on the food state recognition task using the dataset in [15].
The rest of the paper is organized as follows. In Section II we present the most commonly used visual descriptors for food classification. In Section III we present the dataset, the hand-crafted features, and the CNNs used in the experiments. Section IV presents the results achieved with the end-to-end classification strategy, deep features, and hand-crafted features, and finally the results achieved on the dataset presented in [15]. In Section V we comment on the results achieved and present future works.

II. VISUAL DESCRIPTIONS FOR FOOD CLASSIFICATION
A huge variety of features have been proposed in the literature for describing the visual content. They are often divided into Hand-Crafted (HC) features and Learned Features (LF). Hand-crafted descriptors are features extracted using a manually predefined algorithm based on expert knowledge. Learned descriptors are features extracted using Convolutional Neural Networks (CNNs). In the following subsections, we will provide an overview of works related to food classification approaches exploiting hand-crafted features and learned features [17], [18].

A. HAND-CRAFTED FEATURES
Many works in the literature exploit hand-crafted visual features for food recognition and quantity estimation, both for desktop and for mobile applications. Since a single feature is usually not enough to describe image content, most approaches in the literature exploit several image descriptors at once in an early-fusion or late-fusion framework.
For example, [19] proposes a fusion of color and texture features for fruit recognition. Color features are extracted as statistical measures from the H and S color channels of the HSV color space, while texture features are derived from co-occurrence matrices. A voting-based, late decision fusion classifier is considered in [20]. Color statistics, entropy statistics, predominant color statistics, and energy responses of Gabor filter banks [21] are used as global descriptors, while local color, local entropy color, Tamura perceptual features [22], Gabor filters, SIFT descriptors [23], SURF [24], Steerable filters [25], and the DAISY descriptor [26] are considered as local descriptors. The late fusion approach enabled a 7% improvement in recognition with respect to the single classifiers.
In [27], different features are integrated into a Multiple Kernel Learning (MKL) classification approach for single food recognition. The features comprise color histograms, Gabor texture features, Histogram of Oriented Gradients (HOG) [28], and SIFT bag-of-features [29]. MKL is also used in [3] for multiple food recognition. The images are processed with a candidate region detector aimed at locating food regions. Each region is described in terms of SIFT and CSIFT bag-of-features [30], HOG, Gabor texture, and color histograms [31]. [32] uses a k-NN classifier on local and global features. In [33] a vocabulary is constructed on textons and the food images are classified using SVM. The same classifier is used in [34], where local binary patterns and relationships between SIFT interest points are used to code the local and spatial information. SVM, Artificial Neural Network, and Random Forest classification methods are used in [35], where 14 different color and texture descriptors are evaluated. The descriptor that provided the best result was HSV-SIFT, which describes local textures in the color channels.
Ten different features are considered in [36]: color histograms in different color spaces; shape with Pyramid of HOG and GIST [37]; texture with Local Binary Patterns [38], Local Phase Quantization [39], Local Configuration Pattern [40], Binary Gabor Pattern [41], and MSR4-Gabor filter bank [42]; and data-driven features (CNN features). All these features are fed to a committee of classifiers built on Extreme Learning Machines, whose outputs are combined into the final result.
Food recognition can also leverage contextual information derived from the place where the food is consumed, i.e. the restaurant. In [43] the food images are first geo-localized, then several features are extracted and fed to an MKL classifier for recognition. The image descriptors are based on colors, such as Color Moment Invariants and Hue Histograms, and on variants of the SIFT descriptor. Local and global features are tested in [44]. The features used are: CEDD [45], Gabor Features, Opponent Gabor Features, LBP, Local Color Contrast, Chromaticity Moments, and Complex Wavelet Transform [46]. Among these features, CEDD achieved the best recognition results.
The arrangement of food ingredients is also a possible cue for food recognition. Given soft labeling of food pixels, in [47] spatial relationships between pixels of different food ingredients are described using pairwise local features. Results showed that, on the evaluated dataset, the approach outperforms other bag-of-features models.
Notwithstanding the large literature on hand-crafted features, these descriptors need to be carefully chosen for the task at hand, or a suitable feature selection procedure must be applied to limit information redundancy and the curse of dimensionality.

B. LEARNED FEATURES
CNNs are a class of learnable architectures adopted in many domains such as image recognition, image annotation, and image retrieval [48]. CNNs are usually composed of several layers, each involving linear as well as nonlinear operators. The layers' parameters are learned jointly in an end-to-end manner to solve a particular task. A CNN that has been trained for solving a given task can also be adapted to solve a different task. It is common to use a CNN that is pre-trained on a very large dataset and adapt it for new tasks [49].
Several studies have investigated deep neural networks for food recognition as end-to-end classifiers or as feature extractors. One of the first works that used features extracted from CNNs within the context of food recognition was [50]. The food images are described with the features extracted from the FC7 layer of an AlexNet-style architecture pre-trained on ImageNet and classified with SVM. Reference [51] evaluated different CNN-based techniques for food recognition. These techniques include a network pre-trained on the large-scale ImageNet data, a network fine-tuned for food classification, and the use of the activation features extracted from the CNN. In [5] the AlexNet network is used as a feature extraction module for the classification of food images acquired in a canteen environment. Experiments with traditional features using k-NN and SVM classifiers showed the superiority of the CNN-based features. Reference [52] used Google's image recognition architecture Inception V3. The network, composed of 54 layers, was designed to tackle ImageNet's ILSVRC15 and was fine-tuned for classifying food images. The network greatly surpasses the performance of previous approaches. Another approach based on the Inception architecture is DeepFood [53]. In this case, 1×1 convolutional layers are introduced to reduce the network complexity with some loss in performance. [54] devised the Wide-Slice Residual Network (WISeR), designed to specifically handle the structures that can be found in food images. The network outperformed the Inception V3 architecture. CNNs can also be used to tackle different tasks simultaneously. Reference [55] used this ability to build a deep convolutional neural network architecture for simultaneous food ingredient recognition and food categorization. Reference [56] proposed NutriNet, a modified version of the AlexNet architecture which uses fewer parameters compared with the original design.
The network was trained on a very large food database of more than 130,000 images.
The Residual Network ResNet-50 [57] is one of the most powerful and best-performing CNN architectures. The network is exploited in [58] for extracting features to be used for image retrieval in a dataset of 1,200 distinct dishes. The CNN-based features greatly outperform traditional bag-of-SIFT and texton features [33]. In [59] an extensive evaluation of different techniques for food recognition and retrieval is conducted on a dataset of more than 240,000 images of 475 different food dishes (Food-475). Seven different CNN architectures for end-to-end classification are evaluated: AlexNet, Caffe-Reference, GoogleNet, VGGNet-16, VGGNet-19, InceptionV3, and ResNet-50. Among these architectures, the ResNet-50 showed the best recognition accuracy. In the same work, CNN-based features are also evaluated. The features are extracted from a ResNet-50 trained with different food datasets and recognition is performed using a k-NN classifier. The same features are also tested in a retrieval task. Experiments showed that robust features can be obtained from very large and heterogeneous food datasets.
Recognizing a food identity during a dish preparation is quite challenging. Reference [15] was the first work to explicitly introduce and address the food state classification problem. By analyzing cooking procedures, eleven states of the most frequent foods are identified and a new dataset of food states is introduced. They proposed a ResNet-based deep model solution to the state identification problem. Since state identification has a strong correlation with the type of food, individual models are fine-tuned for each food in the dataset. This strategy showed significant improvement with respect to a food-independent model. The Inception V3 architecture is used instead in [16]. CNNs are also exploited for other food-related tasks such as food localization, segmentation, ingredient recognition, and quantity and calorie estimation. Readers interested in these tasks can refer to [60] and [61] for a comprehensive survey of recent techniques.

III. MATERIALS AND METHODS
The aim of this paper is twofold: the collection of a new dataset containing foods in different states, and the evaluation of features and classification methods. Specifically, we are interested in food recognition across different states, food state classification across different foods, and joint food and state recognition. In this Section, we first illustrate the procedure we adopted for the collection of the dataset and then we illustrate the classification pipeline as well as the procedure we adopted for the evaluation of the classification methods.

A. DATASET
The construction of our dataset is inspired by [15]. We identified 11 food states representative of the states that can be found in food recipes. The states are: batons, creamy paste, diced, floured, grated, juiced, julienne, peeled, sliced, wedges/quarters, and whole. Most of these states apply to fruits and vegetables, so we focused our attention on these food classes. Among the possible foods, we selected those with at least two states. The final list of fruits and vegetables is: apple, apricot, aubergine, banana, beet, carrot, garlic, lemon, melon, onion, orange, peach, pear, pepper, potato, pumpkin, strawberry, tomato, watermelon, and zucchini.
We searched for and downloaded the images using the Google search engine through a Python script. Textual queries combining food and state words (such as ''apple'' and ''diced'') were submitted in several languages (i.e. English, Chinese, French, German, and Italian). The downloaded images were manually reviewed to ensure that they were pertinent to our food/state categorization. We discarded images depicting food in cans, since most of the time the food is covered by a large label. We also discarded images containing different foods or different states. Furthermore, we edited images having a very large background area in comparison to the food area, to limit the influence of non-relevant regions during classification. At the end of this analysis process, each image is filed under a two-level categorization: food identity and state. In this way, we can perform food classification across states, state classification across different foods, or paired food and state classification. Starting from an initial set of 180,000 downloaded images, we obtained 11,943 manually inspected and categorized images. Table 1 shows the organization of the dataset according to the state categorization. We can see that the state ''floured'' is the one containing the fewest images (i.e. 79), while the ''whole'' state is the class containing the most images. We also report the number of foods present in each state; it can be seen that not all states contain all foods. Table 2 details the content of the dataset according to the food categorization. In this case, the number of images in each class has less variability than in the previous categorization: the number of images ranges from 300 to about 1,000. None of the 20 foods has all 11 states. Beet, potato, and zucchini have 10 states, while garlic has only three. Figure 3 shows some example images from our dataset, representative of the food/state categorization.
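The query-generation step described above can be sketched as follows. The query template, the language tags, and the reduced food/state lists are illustrative assumptions, not the authors' actual script (which would also translate the food and state words themselves):

```python
from itertools import product

# Illustrative subsets of the food, state, and language lists used for the search.
foods = ["apple", "tomato", "zucchini"]
states = ["diced", "sliced", "whole"]
languages = ["en", "fr", "it"]

def build_queries(foods, states, languages):
    """Return one search-query string per (food, state, language) triple."""
    return [f"{food} {state} [{lang}]"
            for food, state, lang in product(foods, states, languages)]

queries = build_queries(foods, states, languages)
print(len(queries))  # 3 foods x 3 states x 3 languages = 27 queries
```

Each query would then be submitted to the image search engine, and the returned images collected for manual review.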
We can notice that the food states have a very large visual diversity. Also within a state, the visual appearance of the foods is influenced by the distance of acquisitions, lights, colors, and textures.
The images in our original dataset have been split into three sets by allocating 70% (8,233 images) for training, 15% (1,855 images) for validation, and the remaining 15% (1,855 images) for testing.
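A 70/15/15 split like the one above can be sketched with the standard library. The random seed and the absence of per-class stratification are our assumptions; the paper does not state how the split was drawn, and the resulting counts differ slightly from the reported 8,233/1,855/1,855 because of rounding:

```python
import random

def split_dataset(image_ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle image ids and cut them into train/validation/test partitions."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # remaining ~15%
    return train, val, test

train, val, test = split_dataset(range(11943))
```

A stratified variant would apply the same cut independently within each food/state class to preserve class proportions.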

B. METHODS
We evaluated several hand-crafted and deep-learning-based feature extraction methods. The evaluation pipeline includes a feature extraction module and a classification module (one for each task) based on an SVM classifier with a radial basis function (RBF) kernel. The validation set of the dataset is used for the choice of the RBF parameters. Learned features are extracted from several CNNs trained on our dataset. For the sake of comparison, we also evaluate the trained CNN architectures for end-to-end classification. In the following subsections we describe the chosen hand-crafted features and the CNN models.
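The classification module can be sketched with scikit-learn (a library choice of ours; the paper does not name its implementation). The RBF parameters are selected on the validation set, as described above; the grid values and the toy data are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def select_rbf_svm(X_train, y_train, X_val, y_val,
                   Cs=(0.1, 1.0, 10.0), gammas=("scale", 0.01, 0.1)):
    """Grid-search the RBF-SVM parameters (C, gamma) on a held-out validation set."""
    best = (None, -1.0)
    for C in Cs:
        for gamma in gammas:
            clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
            acc = clf.score(X_val, y_val)  # validation accuracy
            if acc > best[1]:
                best = (clf, acc)
    return best  # (fitted classifier, validation accuracy)

# Toy two-class problem standing in for the extracted feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 8)), rng.normal(3, 1, (60, 8))])
y = np.array([0] * 60 + [1] * 60)
idx = rng.permutation(120)
X, y = X[idx], y[idx]
clf, val_acc = select_rbf_svm(X[:80], y[:80], X[80:], y[80:])
```

In the full pipeline the winning (C, gamma) pair would then be used to retrain the SVM and evaluate it on the test set.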

1) HAND-CRAFTED FEATURES
We considered both color and grey-scale hand-crafted features. The grey-scale image L is defined as L = 0.299R + 0.587G + 0.114B, where R, G, and B are the red, green, and blue channels. All feature vectors have been l2-normalized (i.e., divided by their l2-norm). The ten descriptors include the 256-dimensional grey-scale histogram (Hist L), the RGB histogram (HIST-RGB), GIST, and the rotation-invariant LBP (LBP-RI), among others.

2) CNN MODELS
We considered four CNN architectures: GoogLeNet, Inception-v3, MobileNet-v2, and ResNet-50 [57], which is part of the ensemble of CNNs that won the contest in 2015. The models were pre-trained on ImageNet, so we fine-tuned them for food state recognition by modifying the last layers to match our classification tasks. Fine-tuning has been performed using the SGDM optimizer (Stochastic Gradient Descent with Momentum), a mini-batch size of 10, and a learning rate of 0.0003 for 12 epochs.
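The grey-scale conversion, the 256-bin Hist L descriptor, and the l2 normalization described in the previous subsection can be sketched with NumPy (a minimal sketch assuming 8-bit H×W×3 input; not the authors' code):

```python
import numpy as np

def hist_L(rgb):
    """256-bin grey-scale histogram, l2-normalized.

    rgb: H x W x 3 array with 8-bit channel values (0-255).
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    L = 0.299 * R + 0.587 * G + 0.114 * B            # grey-scale image
    hist, _ = np.histogram(L, bins=256, range=(0, 256))
    hist = hist.astype(float)
    return hist / np.linalg.norm(hist)               # divide by its l2-norm

img = np.random.randint(0, 256, (32, 32, 3))
f = hist_L(img)
```

The other descriptors are normalized the same way, so all feature vectors lie on the unit sphere before being fed to the SVM.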
The CNN architectures are used both as end-to-end classifiers and as feature extractors. We extract the features from the fully connected layer before the actual classification layer in each model. Specifically, the extracted features are 1024-dimensional for GoogLeNet, 2048-dimensional for both Inception-v3 and ResNet-50, and 1280-dimensional for MobileNet-v2. The features are used to train an SVM classifier (with RBF kernel) in the same way as for the hand-crafted features.
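The idea of truncating a network just before its classification layer to obtain a feature vector can be illustrated on a toy fully connected network with made-up weights (a conceptual sketch, not one of the actual CNNs, whose penultimate layers are 1024- to 2048-dimensional):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)   # hidden ("feature") layer
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)    # classification layer

def forward(x, return_features=False):
    """Full forward pass, or stop at the layer before the classifier."""
    h = np.maximum(x @ W1 + b1, 0)                # ReLU activations = features
    if return_features:
        return h                                   # what the SVM receives
    return h @ W2 + b2                            # class scores (end-to-end use)

x = rng.normal(size=8)
features = forward(x, return_features=True)       # fed to the SVM instead of W2
```

Swapping the classification layer for an SVM in this way is what allows the same learned representation to serve all three recognition tasks.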

IV. RESULTS
We performed different classification experiments using the hand-crafted features and the CNN-based features. In detail, we investigated the performance of the features in recognizing a food state regardless of the food identity, in recognizing a food across the different states, and in jointly recognizing the food and its state. All the experiments have been performed considering the three splits of our dataset as described in Section III-A and averaging the obtained results. We also evaluated the robustness of the best-performing CNN-based features for the recognition of the states of the dataset in [15]. For brevity, in the following tables we indicate the networks as G.Net, Inc-v3, M.Net, and R.Net for GoogLeNet, Inception-v3, MobileNet-v2, and ResNet-50, respectively.
Detailed results are reported in terms of per-class Accuracy, while the overall results are reported in terms of Average Accuracy and F1-Score.
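These metrics can be computed from a confusion matrix as sketched below. Treating per-class accuracy as per-class recall and using macro averaging for the F1-Score are our assumptions about the paper's evaluation protocol:

```python
import numpy as np

def average_accuracy_and_f1(conf):
    """conf[i, j] = number of samples of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                                  # correct predictions per class
    recall = tp / conf.sum(axis=1)                      # per-class accuracy
    precision = tp / np.maximum(conf.sum(axis=0), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return recall.mean(), f1.mean()                     # Average Accuracy, macro F1

# Toy 3-class confusion matrix, 10 samples per class.
conf = [[9, 1, 0],
        [0, 8, 2],
        [1, 0, 9]]
avg_acc, macro_f1 = average_accuracy_and_f1(conf)
```

On this toy matrix the average accuracy is (0.9 + 0.8 + 0.9) / 3 ≈ 0.867, with the macro F1 slightly lower because precision is penalized by off-diagonal errors.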

A. END-TO-END CNN CLASSIFICATION
In the first experiment, we evaluated the performance of the fine-tuned CNN models in an end-to-end classification scenario. Results for the food classification task are shown in Table 4, while the results for the state classification task are shown in Table 5. As can be seen, the four networks achieve very good classification results on most of the classes. For the food classification task, the recognition of the ''pumpkin'' seems to be the most difficult, with the best result obtained by GoogLeNet at 80%. The ''pear'' also exhibits a recognition accuracy generally lower than the other classes. With respect to the state classification task, ''batons'' and ''floured'' have the lowest accuracy.
If we examine the average accuracy of each network over all the classes in Table 6, we can see that all the models are able to achieve a food classification accuracy above 87%. The Inception-v3 and the ResNet-50 are the best models, with an average accuracy of about 92%. It is worth noting that the MobileNet-v2 shows only a 2% drop in accuracy while being by far the most lightweight of the four networks. With respect to the networks fine-tuned for state recognition, all the networks achieve an accuracy above 92%. In this scenario, the best model is the Inception-v3 with 95.39%, followed by GoogLeNet (94.49%), ResNet-50 (92.90%), and MobileNet-v2 (92.88%). Again, the very good performance of the MobileNet-v2 is worth noting. We also evaluated the joint classification of food and state. Table 6 shows that the Inception-v3 network achieves the best result, with an accuracy of 90.48%. All the networks are able to recognize both pieces of information with an accuracy of at least 85%. Table 7 shows the overall results on the three tasks computed in terms of F1-Score. The values do not exhibit a behaviour different from those in Table 6.

B. CNN-BASED FEATURE CLASSIFICATION
The previous experiments proved that the trained networks can effectively classify food and states with high accuracy. However, we were more interested in the features that have been learned by the networks. For this reason, we used the networks to extract the features embedded in the last layers of the networks. Specifically, we extracted the features from the following layers: ''pool5-drop_7x7_s1'', ''avg_pool'', ''global_average_pooling2d_1'', and ''avg_pool'' for the GoogLeNet, Inception-v3, MobileNet-v2, and ResNet-50 respectively. Classification has been performed by training an SVM classifier with a RBF kernel and using the same splits as before.
The detailed results for the food classification task are reported in Table 8, while the results for the state classification task are reported in Table 9. Compared with the end-to-end classification results, we cannot see consistent increments or decrements in accuracy across the classes. In some cases the CNN-based features exhibit worse performance than their end-to-end counterparts (e.g. for G.Net the ''apple'' drops from 84.44% to 80.00%), while in other cases the accuracy increases (e.g. for M.Net, from 80.00% to 85.28%). If we consider the average results over all the classes reported in Table 10, we can see that the use of the embedded features coupled with the SVM classifier does not exhibit significant drops in classification accuracy. The drop is in the order of 1 percentage point on average. This means that the features are indeed robust enough to solve both classification problems. The use of a non-linear classifier even allows the models to achieve, in some cases, slightly better results than the end-to-end counterpart. This can be seen in the joint food-and-state classification task, where all the features obtain better results than their end-to-end counterparts, with MobileNet-v2 exhibiting an increase in accuracy of 2.2 percentage points. Table 11 shows the overall results in terms of F1-Score. As before, there are no significant differences with respect to the results in Table 10.

C. HAND-CRAFTED FEATURE CLASSIFICATION
Table 12 and Table 13 show the per-class accuracy of the ten hand-crafted features described in Section III-B.1. The hand-crafted features are not able to capture enough information about the image contents to discriminate between the different classes. Concerning the food classification task, we can see that some foods are more easily recognizable with the hand-crafted features than others. For example, beet, carrot, tomato, and zucchini are the classes with the highest recognition accuracy. This could be due to the characteristic color and shape of these foods. On the other hand, peach and pear are the fruits that are more difficult to recognize with the hand-crafted features. If we look at the CNN-based features, the most difficult food to recognize is the pumpkin, followed by the pear.

Concerning the state classification task, the best results are obtained for the ''whole'' state for both the CNN-based and hand-crafted features. This is not surprising, since it corresponds to a traditional image recognition task. The most difficult state to recognize is the ''floured'' one. This could be related to the fact that this state is not atomic but must be considered in conjunction with another state (e.g. sliced floured zucchini). Surprisingly, the HIST-RGB feature achieves an accuracy of about 31% on this state, while other hand-crafted features do not reach 15% and some are completely unable to recognize it. Table 14 compares the results of the CNN-based features against the hand-crafted ones in terms of average accuracy. As expected, the CNN-based features achieve the best results among all the features. The best CNN-based features are those extracted from the Inc-v3 network. Among the hand-crafted features, the best overall result is obtained by the GIST feature (i.e. 41.24%). This could be due to the fact that this descriptor summarizes texture information at different scales and orientations. Similar conclusions can be derived from the performances computed in terms of F1-Score, as shown in Table 15.

TABLE 16. Comparison, in terms of Average Accuracy, between the best deep-based feature (Inc-v3) and its concatenation with each hand-crafted feature on the Food, State, and Food-and-State classification tasks. The best results are in bold.

TABLE 17. Comparison, in terms of F1-Score, between the best deep-based feature (Inc-v3) and its concatenation with each hand-crafted feature on the Food, State, and Food-and-State classification tasks. The best results are in bold.

D. COMBINATION OF DEEP-BASED AND HAND-CRAFTED FEATURES
We concatenated the best-performing features, extracted from the Inception-v3 network (Inc-v3), with each of the hand-crafted features. The aim was to investigate whether the combination of different features can further improve the overall classification performance. Table 16 and Table 17 report the overall performance in terms of average accuracy and F1-Score, respectively. Concerning the food classification task, the accuracy is above 91% and the F1-Score is above 90% for all the combinations. The differences with respect to Inc-v3 alone are very small: the best combination gains an extra 0.18 points of average accuracy and 0.24 points of F1-Score. Concerning the state classification task, the results are more diverse. The best result is achieved by Inc-v3+LBP-RI with 95.29% average accuracy against 94.96% for Inc-v3 alone. This also holds for the F1-Score (95.29% against 94.76%). However, the overall gain is less than 1 percentage point. In the case of the food-and-state classification task, we notice very small differences. Inc-v3+LBP-RI is again the best combination, with 90.71% against 90.53% for the average accuracy, and 90.71% against 90.89% for the F1-Score. These results show that there is no significant advantage in combining CNN-based and hand-crafted features for our problem.
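Such an early fusion amounts to concatenating the two descriptors before the SVM. In the sketch below, normalizing each descriptor separately before concatenation and the 36-bin size for the LBP-RI histogram are our assumptions; the 2048-dimensional Inc-v3 feature matches the dimensionality reported in the Methods section:

```python
import numpy as np

def concat_features(deep, handcrafted):
    """Early fusion: l2-normalize each descriptor, then concatenate."""
    deep = deep / np.linalg.norm(deep)
    handcrafted = handcrafted / np.linalg.norm(handcrafted)
    return np.concatenate([deep, handcrafted])

inc_v3 = np.random.rand(2048)   # Inception-v3 penultimate-layer features
lbp_ri = np.random.rand(36)     # illustrative size for a rotation-invariant LBP histogram
fused = concat_features(inc_v3, lbp_ri)
```

The per-descriptor normalization keeps the much longer deep vector from dominating the distance computations inside the RBF kernel.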

E. COMPARISON WITH THE STATE-OF-THE-ART
We also tested how well the CNN-based features can recognize food states on a state-of-the-art dataset. Specifically, we evaluated the features extracted by the four networks on the Jelodar dataset [15]. This dataset is the only public dataset comparable to ours. The dataset contains some states common to our dataset but also new, unseen states. For example, the ''mixed'' state is present in the dataset; this state corresponds to different finely chopped ingredients that are blended. The ''other'' state is a heterogeneous class comprising all the states that are not already considered. Results of our CNN-based features are reported in Table 18. As expected, the ''other'' state exhibits the worst results among the eleven states. Overall, the accuracy of our CNN-based features on the Jelodar dataset is lower than on our dataset. This is to be expected in a transfer learning problem. If we compare the results obtained with the CNN-based features with those obtained in [15] (see Table 19), we can see that the best classification accuracy is obtained with the ResNet-50 (82.16%), followed by the Inception-v3 (81.25%). In this case, the MobileNet-v2 exhibits the lowest performance, with an accuracy of 78.37%. The best combination of CNN-based and hand-crafted features achieves an accuracy of 81.15%, which is one percentage point lower than Inc-v3 alone. Again there is no clear gain in combining the features.
For comparison, we also report the results from [15]. In this case, a direct comparison is not possible because the dataset used in the original paper differs from the one used here: the results of the five methods in Table 19 have been obtained on a revised version of the original dataset that has been provided to us by the authors. Nevertheless, we can see that our results are similar to or better than those reported in [15] in the case of a single network, while, if multiple networks are trained on each state class, the results are about 5 percentage points lower.
From the analysis of the results, we can deduce that the CNN-based features extracted from the Inception-v3 network are those able to achieve the overall best results on all three tasks (see Table 14). The second best features are those extracted from the ResNet-50 network. Notwithstanding the good results of the Inception-v3 features, for some classes we still observe errors. Figure 4 shows the confusion matrix of these features on the food classification task, while Figure 6 shows some examples of incorrectly classified food. We can see that apples and potatoes are often confused. Apricots are often confused with peaches in some states. Peppers are confused with tomatoes in many states, especially the whole state, but not vice versa. Garlic and onions, if diced, cannot be easily distinguished. Pumpkins are mistaken for carrots in many states, and the reverse also holds, albeit with a lower error rate. Strawberries are often confused with tomatoes, especially when they are juiced or finely diced. Figure 5 shows the confusion matrix of the Inception-v3 features on the state classification task, while Figure 7 shows some examples of state classification errors. ''Batons'', ''floured'', ''julienne'' and ''wedges'' are the states where most mistakes are made. ''Batons'' and ''julienne'' are mistaken for each other, mainly because they differ only in the caliber of the cut: 6 mm or more for batons against 2-3 mm for julienne. ''Floured'' and ''wedges'' foods are often mistaken for ''sliced'', but not vice versa. These problems could also be attributed to the fact that these classes have fewer images than others.
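The kind of error analysis described above can be reproduced with a few lines of code: build the confusion matrix from the predictions and rank the off-diagonal entries by per-class error rate to surface the most confused (true, predicted) pairs, such as batons/julienne. This is a generic sketch, not the code used in our experiments.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def most_confused_pairs(cm, k=3):
    """Return the k off-diagonal (true, predicted, error-rate) triples
    with the highest per-class error rate."""
    # Normalize each row by the class support to get error rates.
    rates = cm / cm.sum(axis=1, keepdims=True).clip(min=1)
    np.fill_diagonal(rates, 0.0)  # ignore correct classifications
    flat = np.argsort(rates, axis=None)[::-1][:k]
    pairs = [np.unravel_index(i, rates.shape) for i in flat]
    return [(int(t), int(p), float(rates[t, p])) for t, p in pairs]
```

For example, with three classes and predictions `[0, 1, 1, 1, 2, 0]` against ground truth `[0, 0, 1, 1, 2, 2]`, the pairs (0, 1) and (2, 0) each come out with a 0.5 error rate.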

V. CONCLUSION
In this paper we presented a new dataset of images for food and state recognition. A similar dataset exists in the literature, but it tackles only the problem of state recognition. We started our investigation with several questions: Can we recognize foods across different states? Can we recognize a food state independently of the food identity? How robust are end-to-end Convolutional Neural Networks (CNNs)? How robust are CNN-based features with respect to hand-crafted features? Our experiments show that, with the proper network, we can obtain robust features to be used for different food-related classification tasks. These features outperform hand-crafted features by a large margin. Moreover, there is no significant advantage in combining hand-crafted features with learned ones. Overall, it seems that the state recognition problem is more approachable with CNN-based features than the food classification one. This could be associated with the visual appearance of the state classes, where texture is more important in discriminating the different states. Although the best features are those extracted from the Inception-v3 network, we must acknowledge the very good results of the features extracted from the MobileNet-v2 network. For applications where the computational cost is important, MobileNet-v2 is a perfect candidate, offering very good results and an efficient implementation. While food classification and state classification are important tasks in a general food recognition application, it is also important to classify a food at a particular state. In our experiments, we have shown that, although with slightly lower success than for the two base tasks, this can also be achieved with the CNN-based features. Moreover, when applied to unseen food states, our features are able to achieve comparable or even better results than an ad-hoc network trained end-to-end on those specific states.
This demonstrates the generalization capability of the features on new domains. This also demonstrates that CNN-based features are robust with respect to the intra-class visual variability of food images. For a general food recognition system, this is a very important feature.
Good results notwithstanding, we need to further investigate the robustness of machine learning methods to the variability of real-world foods in images and videos in terms of illumination, scale, point of view, and cluttered scenes. For example, discerning some types of food across some states (e.g. creamy carrot vs. creamy pumpkin, or juiced strawberry vs. juiced tomato) is very difficult if we rely only on visual features. An idea could be to consider other types of related features such as nutrients, ingredients, or recipe procedures. Also, some foods and states can be confused if the images are acquired at different scales. From the acquisition point of view, illumination plays an important role: different lighting conditions can make it problematic to distinguish different foods [70]. Integration with a carefully designed pre-processing procedure could alleviate this problem, as demonstrated in [71]. For all these reasons, as future work, we intend to perform a more systematic investigation of the effect of these issues on the recognition of foods and states, and to design possible solutions. Finally, some of our classes are under-represented, and this could be a problem for proper recognition. We are planning to increase the number of images for those under-represented classes. To let other research groups contribute to the food and state recognition problem, we intend to make our dataset publicly available.

ACKNOWLEDGMENT
(Gianluigi Ciocca, Giovanni Micali, and Paolo Napoletano contributed equally to this work.)