Food State Recognition Using Deep Learning

Automated food detection and recognition methods have been studied to enhance end-user life. However, most existing research has focused on food ingredient type recognition, and little work has been done on food ingredient state recognition. Successful recognition of food ingredient state plays a significant role in the handling of food ingredients by an intelligent system. In this work, we propose a novel cascaded multi-head approach based on deep learning to simultaneously recognize the state and type of food ingredients. We trained and evaluated the proposed approach on a benchmark dataset of food ingredient images with nine different food states and 18 food types. We compared the proposed approach with a non-cascaded deep learning approach. The cascaded approach improves food ingredient state recognition, reaching 87% accuracy compared to 81% for the non-cascaded deep learning method. Our proposed method applies broadly to tasks where food ingredient state recognition is essential, such as feeding elderly and disabled people and automating food recognition and preparation.


I. INTRODUCTION
According to the U.S. Chamber of Commerce, jobs that require in-person attendance and pay lower wages, including food service and hospitality, have suffered from labor shortages and had difficulty retaining workers [1]. Developing automated methods for food recognition and classification can help address worker shortages by replacing human workers in many of the repetitive food preparation tasks in the industry. Computerized techniques for food preparation can be used in various applications, including supporting people with disabilities [2], especially people with vision impairment who need help recognizing the food type and ingredients. According to the World Health Organization (WHO), at least 2.2 billion people worldwide are classified as having vision impairment [3]. Therefore, it is essential to help this group of people in their daily life and with their food recognition and classification needs. The current advances in the automation of services are driven by the effective learning approaches of artificial intelligence (AI), deep learning, and the availability of large data [4], [5], [6]. However, there are still challenges to fully automating services in interactive environments. Automation of food preparation requires the intelligent system to operate in an interactive kitchen environment while recognizing the food ingredient type and state. For example, the food ingredient type may be orange and its state sliced or whole. The food ingredient state is defined as the character of food that can be transformed by the intervention of a human or robot chef [7]. The same type of food can appear in different shapes and states depending on that intervention. For instance, the state of an orange can be observed from the texture and the shape of the fruit, which can be whole, peeled, sliced, or liquid (juiced).
The associate editor coordinating the review of this manuscript and approving it for publication was Charalambos Poullis.
Researchers have introduced different approaches for the automation of food ingredient recognition [8], [9], [10]. These approaches use deep learning to learn representations from food images. However, they learn to predict the food type or state independently, even though the two tasks are related and knowing the food type helps in recognizing the state. In this paper, we propose a novel approach that simultaneously learns to predict the food type and state in a cascading manner, where the prediction of the food type is fed into the prediction of the food state because the two are correlated. In our approach, food ingredient state recognition is achieved using the learned deep representations of food type together with the deep representations of the input image. To the best of our knowledge, our work is the first to address learning food ingredient type and state together in a cascading approach.
The main contributions of this work are:
• Proposing an approach, based on deep convolutional neural networks, to predict the food type and state in a multitasking and cascading manner.
• Manually labeling part of the dataset (the test set) with food types for use in the evaluation.
• Providing a detailed quantitative and qualitative evaluation of the learned models for various settings.

II. LITERATURE REVIEW
Different approaches have been proposed in the literature on food recognition. In [8], the authors proposed a deep learning-based approach for a food recognition system that allows the user to monitor dietary intake during the day. This system uses a smartphone camera to take a picture of food as input to the trained deep-learning model and then provides the food classification and dietary information. This computer-aided food recognition system automates food recognition and dietary assessment to better monitor user dietary intake. A method for regression of food nutrition using deep learning is proposed in [11]. This approach uses Inception-v3 [12], ResNet [13], wide ResNet [14], and VGG16 [15] to regress nutritional values such as calories from food images, using the ChinaMartFood-109 dataset [11]. Another method for food recognition using an ensemble of deep neural networks was proposed by Pandey et al. [9]. This work assessed traditional machine learning-based approaches and an ensemble of deep neural network-based techniques for food recognition. The ensemble of neural networks produced the best result on the well-known ETH Food-101 dataset [16]. In [17], the authors proposed a deep-learning approach to recognize traditional dishes with high calories. The proposed model was trained using EfficientNet pre-trained on ImageNet and fine-tuned using a dataset of images of traditional dishes collected from the web. Then, the learned model was deployed on a smartphone for real-time inference. Food recognition plays a significant role in helping impaired people. An approach for Middle Eastern food recognition was proposed in [18]. This approach fine-tuned a pre-trained MobileNet-v2 on a dataset of 23 classes [19], then deployed the learned model on phones for real-time inference.
In [20], the authors proposed a fusion of different pre-trained deep neural network-based classifiers for food recognition. This approach is based on fine-tuning several pre-trained deep learning models, then fusing the output predictions using a decision template. The authors assessed this ensemble approach using two datasets, Food-11 and Food-101 [16], [21]. Salim et al. [22] studied different approaches for food recognition, including traditional machine learning and deep neural network-based approaches, and found deep learning-based food recognition methods to be the most effective. A Deep-Food transfer learning approach was proposed in [23] for multi-class food type classification. The proposed approach extracts deep features from a pre-trained ResNet, followed by feature selection and classification, and the results showed improved multi-class food type classification on the Mealcome (MLC) dataset [24]. A summary of food datasets and benchmark results and an evaluation of existing methods for food recognition were presented in [25]. The authors trained the state-of-the-art method for five trials and achieved state-of-the-art results on the UEC Food-100 dataset [26] by averaging the predictions of ResNeXt [27] and DenseNet models [28].
VOLUME 10, 2022
An improved VGG16-based approach was proposed in [29]. This approach used asymmetric convolution blocks instead of the original convolution kernel. Moreover, batch normalization was added to the VGG16, and a spatial attention mechanism was applied to improve the results of food type classification. To improve food recognition for vertical trait foods, a method was proposed by Martinel et al. [30] which used deep residual blocks and sliced convolution to learn recognition of vertical traits of food, such as a stack of pancakes. This approach improved the classification results using the Food-101 dataset.
Food recognition is important for automating the visual inspection of food quality and defects. A deep learning approach was proposed to detect defective apples and bananas in [31]. This approach uses multiple state-of-the-art deep learning architectures to recognize defective apples and bananas from food images. The deep learning architectures used in that work include ResNet-50 [13], DenseNet [28], MobileNet-v2 [32], NASNet [33], and EfficientNet [34]. The best performance was obtained using EfficientNet. Detecting the freshness of perishable fruits, including bananas, oranges, and apples, was studied in [35]. This approach applied transfer learning using AlexNet [36], VGG16 [15], and ResNet [13] architectures pre-trained on ImageNet [37]. The dataset comprises six types of images: fresh banana, fresh orange, fresh apple, rotten banana, rotten orange, and rotten apple from an online dataset [38]. The best-performing model was obtained using the ResNet architecture.
The previous approaches used static datasets, which is a challenge because food appearance and shape vary. To address this problem, a method based on online continual learning was proposed for visual food classification [39] using the Food-1K dataset [40]. The approach first selects exemplars for knowledge replay using similarity-based clustering, and then trains the online continual learner with batch-based class balancing in a contrastive learning manner.
In [10], an approach to recognize the food state and type was proposed. This approach takes features extracted from an ImageNet-based pre-trained convolutional neural network (CNN), followed by a support vector machine for classifying food images into 20 food types and 11 food states. The authors experimented with multiple pre-trained CNNs, including GoogleNet [12], Inception-v3 [41], MobileNet-v2 [32], and ResNet-50 [13]. This approach learned food ingredient type and food ingredient state recognition independently, using separate neural networks. The authors also proposed a new dataset of food ingredient images with state and type labels. However, this data was not made publicly available. Another approach for identifying the food ingredient state was proposed by Jelodar et al. [7]. It fine-tuned an ImageNet pre-trained ResNet on a dataset created by the authors and focused on learning food ingredient states only [7]. Although these approaches can be applied to food ingredient state recognition, they share a shortcoming in the learning process: food ingredient states and types are learned independently.

III. DATASET
The dataset we used to train and evaluate the proposed approach contains annotated images of various ingredients in a kitchen. It covers 17 of the most common cooking ingredients, collected from over 250 online cooking videos from two popular datasets [42], [43]. The cooking ingredients include: chicken/turkey, beef/pork, tomato, onion, bread, pepper, cheese, strawberry, milk, potato, garlic, egg, carrot, butter, mushroom, orange, and cheese [7].
Each food ingredient was labeled with a state, where the state describes the status of the ingredient during the cooking process. In the original dataset, there are eleven different food ingredient state classes. However, two food ingredient state classes, mixed and other, do not have object labels associated with each image. Therefore, we have eliminated these two food ingredient state classes (mixed and other) from the dataset. Thus, in the revised dataset, there are nine different food ingredient states, and each food ingredient image has a state label and a type label. An example of the food ingredient type is Orange, and the food ingredient state is sliced. The food state labels include whole, peeled, sliced, floured, grated, julienne, diced, juiced, and creamy paste. Our training set has 5251 images, the validation set has 1132 images, and the test set has 1180 images. The test set had only the state label. Therefore, we have manually labeled the test set for the food type, e.g., Orange or Potato. During training, we applied data augmentation, including rotation by 90°, 180°, and 270°, horizontal and vertical flipping, and zooming. Figure 1 shows the total images for each state and type label in the dataset. Table 1 illustrates the total number of images per food state and type class on the test set. More information on the dataset collection and the labeling process is provided in [7]. Figure 2 presents examples from the dataset of different food ingredient types and states.
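The rotation and flipping augmentations listed above can be sketched on a toy image as follows. This is a minimal illustration, not the training pipeline used in this work: the image is a hypothetical single-channel grid of pixel values, the function names are ours, and zooming is omitted because its parameters are not specified here.

```python
from typing import Dict, List

Image = List[List[int]]  # toy single-channel image: rows of pixel values


def rotate90(img: Image) -> Image:
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]


def augment(img: Image) -> Dict[str, Image]:
    """Return the five geometric variants named in the text (zoom omitted)."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return {
        "rot90": r90,
        "rot180": r180,
        "rot270": r270,
        "hflip": hflip(img),
        "vflip": vflip(img),
    }


variants = augment([[1, 2], [3, 4]])
```

Each variant keeps the original state and type labels, since rotating or flipping an ingredient does not change what it is or how it was prepared.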

IV. RESEARCH METHODOLOGY
We propose a CNN architecture, shown in Figure 3, to estimate the two probabilities $P(f_t \mid img)$ and $P(f_s \mid img, f_t)$ for a food ingredient image $img$, where $f_t$ represents the food ingredient type and $f_s$ the food ingredient state. For a given food ingredient image $img$, this approach learns two functions. The first estimates the probabilities of the food ingredient type using features learned from the input image. The second estimates the probabilities of the food ingredient state and takes as input the learned representation of the food ingredient type concatenated with the food ingredient image feature vector. The proposed cascaded multi-head neural network thus integrates features learned for food ingredient type with the image-based deep features in order to recognize the food ingredient state effectively (cascaded approach). In other words, the proposed approach learns food state recognition with the guidance of food-type learned representations. Furthermore, the proposed cascaded multi-head neural network can simultaneously predict food state and type.
The input to our model is an RGB image $img$ of a food ingredient. For this approach, we use the DenseNet121 model pre-trained on the ImageNet dataset and fine-tune it on the food dataset described in Section III. The earlier layers of DenseNet121 are set as non-trainable, whereas layers from the convolution layer (conv5) onward are set to trainable.
Two heads were added on top of DenseNet121, where the first head is used for predicting the food ingredient type, and the second head is used for predicting the food ingredient state. Each head comprises three fully connected layers. The output of the second dense layer on the food ingredient type head is concatenated with the flattened feature vector of the image representation, and the concatenated feature vector is the input for the second neural network head consisting of fully connected layers for food ingredient state recognition, as shown in Figure 3. This neural network is trained in an end-to-end approach for learning food ingredient state and type simultaneously.
To study the impact of learning the food ingredient state and food ingredient type jointly in a cascaded manner, we trained two models. The first model uses DenseNet121 with a single head of three fully connected layers for learning to recognize the food ingredient state only. We call this the non-cascaded single-head model. The second model has two heads on top of DenseNet121 for the food ingredient type and food ingredient state outputs, as shown in Figure 3. This model concatenates the learned features from the last convolution layer and the second fully connected layer in the food type head. We call this the cascaded multi-head model; it learns to predict the food ingredient state and type, whereas the former model only learns to predict food ingredient states.
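The two models described in this section can be sketched in Keras as follows. This is a minimal sketch under stated assumptions rather than the exact implementation: the hidden layer widths (256 and 128), the ReLU activations, the 224 × 224 input size, and the function name `build_model` are hypothetical, since they are not specified in the text. `weights=None` keeps the sketch runnable offline; pass `weights="imagenet"` to start from the pre-trained network.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers


def build_model(num_states=9, num_types=18, cascaded=True,
                input_shape=(224, 224, 3), weights=None, hidden=(256, 128)):
    """Build the cascaded multi-head model, or the single-head baseline."""
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights=weights, input_shape=input_shape)

    # Freeze the early layers; everything from the first conv5 layer onward
    # stays trainable.
    trainable = False
    for layer in backbone.layers:
        if layer.name.startswith("conv5"):
            trainable = True
        layer.trainable = trainable

    # Flattened image representation from the last convolutional block.
    features = layers.Flatten(name="image_features")(backbone.output)

    if not cascaded:
        # Baseline: one head of three fully connected layers, state only.
        x = layers.Dense(hidden[0], activation="relu")(features)
        x = layers.Dense(hidden[1], activation="relu")(x)
        state = layers.Dense(num_states, activation="softmax", name="state")(x)
        return Model(backbone.input, state)

    # Type head: three fully connected layers.
    t = layers.Dense(hidden[0], activation="relu", name="type_fc1")(features)
    t = layers.Dense(hidden[1], activation="relu", name="type_fc2")(t)
    type_out = layers.Dense(num_types, activation="softmax", name="type")(t)

    # Cascade: fuse the second type-head layer with the image features,
    # then feed the fused vector to the state head.
    fused = layers.Concatenate(name="fused")([features, t])
    s = layers.Dense(hidden[0], activation="relu", name="state_fc1")(fused)
    s = layers.Dense(hidden[1], activation="relu", name="state_fc2")(s)
    state_out = layers.Dense(num_states, activation="softmax", name="state")(s)

    return Model(backbone.input, [type_out, state_out])
```

Calling `build_model(cascaded=False)` gives the non-cascaded single-head model, so both variants share the same backbone and freezing policy and differ only in the heads.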

V. IMPLEMENTATION
The model architecture was designed and coded using the TensorFlow and Keras deep learning libraries [44], [45]. Fine-tuning was performed on an NVIDIA GeForce GTX 1080 Ti GPU for 20 epochs. To optimize the models during fine-tuning, we used the Adam algorithm with a learning rate of 0.001, and the loss function was categorical cross-entropy. To ensure the repeatability of deep neural network training, we seeded all the libraries and set deterministic configurations in TensorFlow as described in [46].
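A training setup along these lines can be sketched as follows. This is an assumption-laden sketch, not the procedure of [46]: the seed value, the helper names `seed_everything` and `compile_for_fine_tuning`, and the use of a single cross-entropy loss applied to every output are ours. The optional imports let the seeding helper run even where NumPy or TensorFlow is not installed.

```python
import random

EPOCHS = 20            # number of fine-tuning epochs stated in the text
LEARNING_RATE = 0.001  # Adam learning rate stated in the text


def seed_everything(seed: int = 42) -> None:
    """Seed the random number generators available in this environment.

    The default seed of 42 is a hypothetical value.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
    except ImportError:
        return
    tf.random.set_seed(seed)
    # Request deterministic kernels if this TensorFlow version offers the switch.
    enable = getattr(tf.config.experimental, "enable_op_determinism", None)
    if enable is not None:
        enable()


def compile_for_fine_tuning(model):
    """Attach Adam and categorical cross-entropy to a Keras model."""
    import tensorflow as tf
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss="categorical_crossentropy",
    )
    return model
```

Seeding has to happen before the model is built, because layer weights are drawn from the random generators at construction time.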

VI. RESULTS
We experimented with two classification approaches. First, we fine-tuned a deep learning model (DenseNet121) end to end with only one output head, for the food state (i.e., the non-cascaded single-head approach). Second, we fine-tuned the same deep learning model (DenseNet121) end to end in a cascaded classification scenario with two neural network prediction heads: the first for the food ingredient type and the second for the food ingredient state (i.e., the cascaded multi-head approach). The latter approach fuses the learned food ingredient type representation with the image deep representation vector in a cascaded manner for learning to recognize the food ingredient state. Therefore, learning the food ingredient state is guided by both the food ingredient type and the image deep representations. In the following two subsections, we present the results for our trained deep learning models.

A. NON-CASCADED SINGLE HEAD MODEL
This single-head model (i.e., the model trained to learn the food ingredient state only) is a non-cascaded approach where DenseNet121 is fine-tuned for food ingredient state recognition. The results of this approach showed an accuracy of about 81%, a precision of 0.82, a recall of 0.81, and an F1-score of 0.81 for the food ingredient state.

B. CASCADED MULTI-HEAD MODEL
The food ingredient state and type recognition model is a cascaded multi-head model for classifying food ingredient type and food ingredient state simultaneously. This approach uses the food ingredient type feature vector and the input image feature vector to learn to recognize the food ingredient state. The two heads are built on top of DenseNet121, as shown in Figure 3. This method showed superior results for food ingredient state classification compared to the former approach (i.e., the non-cascaded single-head model), with accuracy, precision, recall, and F1-score of 87%, 0.87, 0.87, and 0.87, respectively. The food ingredient type accuracy is 71.35%, precision is 0.72, recall is 0.71, and F1-score is 0.70. The improvement in state accuracy over the non-cascaded single-head model, six percentage points (87% versus 81%), shows that learning the food ingredient state from the representations learned for the food ingredient type in combination with the image feature vector clearly improves the results. The results are shown in Table 2.
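The four reported metrics can be computed from true and predicted labels as sketched below. The averaging scheme behind the reported precision, recall, and F1-score is not stated above, so this sketch assumes macro averaging (an unweighted mean over classes); the function name and the toy labels are hypothetical and are not results from this work.

```python
from collections import Counter
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[str],
                           y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recall = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precision), n),
        "recall": safe_div(sum(recall), n),
        "f1": safe_div(sum(f1), n),
    }


# Toy labels for illustration only.
toy_true = ["whole", "whole", "sliced", "sliced", "diced", "diced"]
toy_pred = ["whole", "sliced", "sliced", "sliced", "diced", "sliced"]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

With imbalanced classes, macro and weighted averages can differ noticeably, so the averaging choice matters when comparing against the figures in Table 2.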

VII. DISCUSSION
The proposed approach aims to simultaneously learn food ingredient state and type in a cascaded manner. The cascaded multi-head deep neural network uses representations learned for food ingredient type recognition to learn the food ingredient state. In other words, the learned deep features for food ingredient type are fused with the image deep representation for learning the food ingredient state. Learning the food state with fused representations (cascaded approach) shows superior results over learning the food ingredient state without feature fusion, i.e., the non-cascaded single-head model.
The non-cascaded single-head model, which learns to recognize food ingredient states from the input image deep representations only, confuses some food ingredient states with others. For instance, some images labeled diced are predicted as sliced, some labeled creamy paste are predicted as grated, some labeled whole are predicted as sliced, and some labeled julienne are predicted as sliced. We attribute this to learning the food ingredient state directly from the image representation without knowing the food ingredient type. The confusion matrix for the single-head neural network predicting the food state is shown in Figure 4a.
Fusing the food ingredient type feature vector with the image deep representations using the proposed cascaded multi-head deep neural network shown in Figure 3 improves the classification of food ingredient state images. For instance, only three images of the test set were predicted as sliced when the true label is whole. In contrast, with the non-cascaded single-head neural network, 22 images of the test set were predicted as sliced when the true label is whole. Furthermore, the number of true positives for diced, creamy paste, floured, juiced, julienne, peeled, and whole increased using the cascaded multi-head deep neural network compared to the non-cascaded single-head deep neural network. The confusion matrix for food ingredient state recognition using the cascaded multi-head deep neural network is shown in Figure 4b, and the one for food ingredient type prediction in Figure 4c. The food ingredient type head of the cascaded multi-head approach shown in Figure 3 made prediction errors for some food images that appear similar. For instance, images labeled milk and lemon_orange juice are similar; therefore, 15 images of liquid milk were predicted as lemon_orange juice, as shown in Figure 4c. Furthermore, the number of milk images in the dataset is low. Thus, the lower number of images labeled as milk could be a cause of the misclassification of milk images.
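Counts such as "true label whole, predicted sliced" are read off a confusion matrix, which can be built as sketched below. The helper names and the toy labels are hypothetical; the counts in the example are not the ones reported in Figure 4.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Rows index the true label, columns index the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def confusions(matrix: List[List[int]], labels: Sequence[str],
               true_label: str, predicted_label: str) -> int:
    """Number of samples of `true_label` that were predicted as `predicted_label`."""
    return matrix[labels.index(true_label)][labels.index(predicted_label)]


# Toy example for illustration only.
states = ["whole", "sliced", "diced"]
toy_true = ["whole", "whole", "whole", "sliced", "diced"]
toy_pred = ["whole", "sliced", "sliced", "sliced", "sliced"]
toy_matrix = confusion_matrix(toy_true, toy_pred, states)
whole_as_sliced = confusions(toy_matrix, states, "whole", "sliced")
```

The diagonal entries are the per-class true positives, so comparing the diagonals of two matrices gives the per-class gains discussed above.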
In Figure 5, two images are provided along with the probability distributions for food ingredient state and type produced by the cascaded multi-head deep neural network. The ground truth label for each image is shown in orange. Figure 6 shows food state and type prediction results using the cascaded multi-head model for the same food type (orange fruit) but with different food states. In Figure 7, we visualize the deep representations learned by the proposed cascaded multi-head deep neural network for both food state and type. This visualization was done by extracting deep features from the last convolutional layer of the fine-tuned DenseNet121, then applying the Grad-CAM internal representation visualization approach [47]. As observed from the deep representation visualization, our deep learning-based approach focuses on the target objects presented in the images. Moreover, when there is more than one object of the same type in the image, the deep learning model focuses more on the object closest to the camera, as shown in the top left image of Figure 7.
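The core Grad-CAM step, combining the feature maps of the last convolutional layer into a heatmap, can be sketched as follows. This sketch assumes the gradients of the class score with respect to those feature maps have already been computed by the deep learning framework; the function name, the nested-list layout (height × width × channels), and the final scaling to [0, 1] for display are our assumptions.

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # [height][width][channel]


def grad_cam_map(activations: FeatureMaps,
                 gradients: FeatureMaps) -> List[List[float]]:
    """Weight each channel by its mean gradient, sum, apply ReLU, and rescale."""
    height = len(activations)
    width = len(activations[0])
    channels = len(activations[0][0])

    # Channel importance: gradient averaged over all spatial positions.
    weights = [
        sum(gradients[i][j][c] for i in range(height) for j in range(width))
        / (height * width)
        for c in range(channels)
    ]

    # Weighted sum over channels, keeping only positive evidence (ReLU).
    cam = [
        [max(0.0, sum(weights[c] * activations[i][j][c] for c in range(channels)))
         for j in range(width)]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy 2 x 2 feature maps with two channels, for illustration only.
toy_activations = [[[1, 0], [0, 1]], [[2, 0], [0, 0]]]
toy_gradients = [[[1, -1], [1, -1]], [[1, -1], [1, -1]]]
toy_cam = grad_cam_map(toy_activations, toy_gradients)
```

In practice the resulting low-resolution map is upsampled to the input image size before being overlaid, which is how highlighted regions such as those in Figure 7 are typically produced.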
Convolutional neural network (CNN)-based food ingredient type and state recognition is an effective approach compared to approaches based on handcrafted features. Previous work showed that food discrimination could be done using only the food ingredient state images [10]. However, we found some challenges in food ingredient state recognition because of the similarity between food ingredients in terms of shape and texture, especially after manipulation. Furthermore, some states of food ingredients appear similar in images. For instance, julienne and grated food ingredients look alike. Because of this apparent similarity, we observed some julienne food ingredients being predicted as grated.
Our work did not include deep feature extraction (transfer learning) or handcrafted feature extraction. Instead, this work focuses on learning the food ingredient state and type by fine-tuning an off-the-shelf neural network (DenseNet121) in an end-to-end manner. The learned representations of the early layers of DenseNet121 were kept unmodified, and fine-tuning was done for the last layers of DenseNet121.
Although this project focuses on recognizing the state of food ingredients from a dataset collected from food preparation videos, the approach can be applied to assist robots in other tasks, such as feeding elderly or disabled people, where an intelligent system needs to recognize the type and state of a food ingredient to complete a task such as food preparation. Moreover, this research contributes to improving human-intelligent system interaction for better automation of services such as feeding elderly persons and food ingredient grasping. The limitations and challenges of the proposed approach are related to the dataset. The dataset was collected by the authors of [7], and some state classes, namely the mixed and other categories, do not have the corresponding object label for food ingredient images. Therefore, we had to remove the food state classes for which dual labels for food state and type are not provided. Furthermore, the dataset we used suffers from data imbalance for some food type classes, which poses a challenge for our proposed approach. Therefore, our future work focuses on solving the data imbalance issue for learning the food state and the type of food ingredient. Although the number of images in each food state and type class was low, we mitigated this issue using data augmentation.
Our future work includes addressing some shortcomings of the proposed approach, including improving the dataset by balancing the number of instances per class to improve the results. Furthermore, we plan to use image segmentation of each food ingredient along with the raw food ingredient images for learning deep representations.

VIII. CONCLUSION
Learning food ingredient states is an important task during food manipulation by an intelligent system. We propose an approach for learning food ingredient type and state jointly using a cascaded multi-head deep neural network. The learned feature vector for food ingredient type using deep learning is fused with image deep representation. The fused feature vector is used as input to the food ingredient state fully connected layers for the classification of food images. This approach showed superior results over the non-cascaded single-head neural network approach that learns to predict only the food ingredient state.

SAEED S. ALAHMARI (Member, IEEE) received the Ph.D. degree in computer science from the University of South Florida, in 2020. He is an Assistant Professor of computer science with Najran University, Najran, Saudi Arabia. He has authored and coauthored many journals and conference papers. His research interests include learning from noisy and limited labeled data, machine learning, deep learning, medical image understanding, computer vision, and deep learning repeatability and explainability.
TAWFIQ SALEM (Member, IEEE) received the Ph.D. degree in computer science from the University of Kentucky, in 2019. He is a Visiting Assistant Professor with the Department of Computer and Information Technology, Purdue University, USA. His research interests include computer vision, remote sensing, medical imaging, and machine learning.