A Review of Image-Based Food Recognition and Volume Estimation Artificial Intelligence Systems

A healthy daily diet and a balanced intake of essential nutrients play an important role in modern lifestyle. The estimation of a meal's nutrient content is an integral component of the management of significant diseases, such as diabetes, obesity and cardiovascular disease. Lately, there has been an increasing interest in the development and utilization of smartphone applications with the aim of promoting healthy behaviours. The semi-automatic or automatic, precise and real-time estimation of the nutrients of daily consumed meals is approached in the relevant literature as a computer vision problem using food images taken via a user's smartphone. Herein, we present the state of the art on automatic food recognition and food volume estimation methods starting from their basis, i.e., the food image databases. First, by methodically organizing the extracted information from the reviewed studies, this review enables a comprehensive and fair assessment of the methods and techniques applied for segmenting food images, classifying their food content and computing the food volume, associating their results with the characteristics of the datasets used. Second, by reporting without bias the strengths and limitations of these methods and proposing pragmatic solutions to the latter, this review can inspire future directions in the field of dietary assessment systems.

Fotios S. Konstantakopoulos, Eleni I. Georga, Member, IEEE, and Dimitrios I. Fotiadis, Fellow, IEEE

(Methodological Review)

Index Terms-Dietary assessment system, food databases, food segmentation, food recognition, food classification, food volume estimation, nutrient information, computer vision, machine learning, deep learning, artificial intelligence.

I. INTRODUCTION
THE global incidence of chronic diet-related diseases, such as obesity, diabetes, and cardiovascular diseases, shows an ever-increasing trend, which tends to take on epidemic proportions. The number of obese people has nearly tripled since 1975. In 2016, more than 1.9 billion adults were overweight, out of which over 650 million were obese. Moreover, in 2019, 38 million children under the age of five were overweight or obese [1]. Diabetes is considered a major cause of blindness, kidney failure, heart attacks, stroke, and lower limb amputation. The World Health Organization (WHO) estimated that 1.5 million deaths were directly caused by diabetes and that diabetes was the seventh leading cause of death in 2019 [2]. According to the International Diabetes Federation, 463 million people (adults 20-79 years) currently suffer from diabetes worldwide [3]. Cardiovascular diseases (CVDs) are a group of disorders of the heart and blood vessels that include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. CVDs are the number one cause of death globally; in 2016, 17.9 million people died from CVDs, representing 31% of all global deaths [4]. The above-mentioned diseases are inextricably linked. Diet has been shown to be the common denominator that can either positively or negatively affect these diseases. A healthy lifestyle, which includes a balanced diet, maintaining a healthy weight and regular exercise, can significantly reduce the percentage of individuals suffering from these diseases.
Daily diet monitoring by experts, which includes daily recording of the type and the estimated amount of food consumed, is the most appropriate way to achieve a healthy and balanced diet [5]. However, since daily diet monitoring by specialists is almost impossible, patients are advised to record their daily eating habits themselves. Although these self-recording methods are widely used, their accuracy remains questionable, especially for children and adolescents who lack motivation and the required skills [6], with the average error in estimating the amount of food consumed being more than 20% [7]. Even well-trained individuals with diabetes have difficulty in calculating, with reasonable accuracy, the amount of carbohydrates in their meal [8]. The rapid increase in the use of smartphones and their advanced computing capabilities during the last decade has led to the development of smartphone applications [9] that can detect food, recognize its type and calculate its nutritional value, by estimating its quantity, via the analysis of food images [10]. In a typical scenario, the user is asked to take one or more photos, or even a video, of their meal, and then the application computes the corresponding nutritional information. Nowadays, the advances in the field of computer vision and Artificial Intelligence (AI) provide users with the possibility to monitor their health every day through appropriate applications [11]. Recent studies have shown that AI-based applications are more popular among users, compared to traditional dietary recording methods, for recording the nutritional composition of food [12]. AI-based methods can be divided into semi-automatic methods, which require user participation, and automatic methods, which do not require any human participation. These applications do not aim to replace dieticians; on the contrary, their goal is to provide them with an additional tool for monitoring patients' diet. The performance and accuracy of these applications depend to a large extent on various factors, such as the food image databases used for training of the system and extraction of nutritional composition, the food segmentation techniques, the food recognition methods and the volume estimation techniques.
The quality and the quantity of images of a food database mainly affect the performance of the food recognition step [13]. Food classification, which consists of the food segmentation and food recognition steps, is next. Food segmentation is the process of partitioning a food image into multiple segments (sets of pixels) [14]. Food recognition comprises the identification of the foods which are present in the food image through the application of machine and deep learning techniques [15], [16]. The final step is the volume estimation for each food item which is present in the food image. This step depends directly on the previous steps of segmentation and recognition. Volume calculation of each identified segment, in combination with a food nutritional database, is used for the extraction of the nutritional composition [17]. A typical procedure of an automated vision-based dietary assessment system is shown in Fig. 1.
In this article, we present a review of the literature over the past 10 years (2012-2021) in the field of food image segmentation, food classification, food volume estimation and food macronutrient content estimation based on smartphone-captured food images, assessing, in parallel, the main characteristics of the employed food image databases. The in-depth analysis of the methods used in each of the above components of a dietary assessment system is the main distinguishing characteristic of this review in comparison with existing reviews on this research topic [18], [19], [20], [21], [22]. This analysis led to the categorization of the employed methods as: (i) semi-automatic and automatic food image segmentation methods, (ii) traditional machine learning (ML)-based and deep learning-based methods for food image classification, and (iii) 3D reconstruction, pre-built shape templates, perspective transformation, depth camera and deep learning methods for food volume estimation (Table I). The algorithms and techniques pertaining to each of these categories are identified per investigated study, and their performance, strengths and limitations are presented and contrasted. Importantly, we suggest pragmatic solutions to deal with the identified limitations, starting from the construction of relevant datasets and extending to the computation of the food nutrient value. The remainder of this manuscript is organized in six sections, with Sections II-V presenting the review of the methods and techniques used in each of the components of a dietary assessment system, and Sections VI and VII being devoted to the discussion of the outcomes and the conclusions derived from this review study.

II. FOOD IMAGE DATABASES
The process of collecting food images, which can be used in the food classification model, is crucial and it directly affects the performance of the classification models. A comprehensive collection of food images is the key to a classifier's performance. Large food image databases, such as Food-101 [23], UEC-Food100 [14], VIREO Food-172 [24], and UEC-Food256 [25], are benchmark food databases and are typically used to evaluate machine learning models. Existing databases are distinguished by their different characteristics, such as cuisine type, the number of images, the number of food classes, the food categories, the way of acquisition, the task of use (classification or segmentation task) as well as by how many different food items are included in each photo. For instance, Diabetes [26] has 11 classes with a total of 5420 pictures, out of which 3800 images are downloaded from the web and 1620 are captured in a controlled environment. A few food databases have been created by compiling images of existing food databases. For instance, the database Food524DB [27] was created from existing publicly available food image databases: Food-101, UEC-Food256 and VIREO Food-172. Moreover, there are several food image databases that have collected food images from specific types of cuisines. For example, Chen [28] and ChineseFoodNet [29] represent the Chinese cuisine, FFoCat [30] and MedGR-Food [31] refer to Mediterranean food, the Indian food database [32] contains images of local food dishes, while [33], [34], [35] present databases with images of fruits and vegetables. FLD-469 [36] refers to Japanese food, while FoodX-251 [37], Menu-Match [38], UPMC Food-101 [39], NutriNet [40] and UNICT-FD889 [41] consist of a mix of eastern and western food images. Moreover, a critical feature of a food image database is whether it is used for classification [42], [43], [44], [45] or segmentation tasks [46], [47], [48], [49], [50], [51]. For example, Food201-Segmented [52] contains segmented images from the Food-101 dataset for the USA cuisine. Also, an important element for the classifier is the way the pictures were acquired, namely whether they were taken in a controlled environment (in terms of lighting conditions and the food's image background) or in a free environment. In addition, with the increasing use of deep learning methods for image classification, the food image databases must contain a large number of images per class to support training of a deep learning model. Furthermore, the diversity of the images contained in a class leads to a more advanced model, which can classify food even if it has been cooked in a similar way. Fig. 2 presents sample images from four food image databases.
The techniques used in the later stages of image-based food nutrition analysis systems emphasize the need to create databases that contain a large number of images for each food class. It may be easier nowadays to collect the images for a large food image database, due to the tendency to capture food images using smartphones and to the existence of many images in social networks. Although there is a plethora of food image databases, we note that there are no food image databases related to healthy diet patterns. In addition, there exist only a few annotated databases, mainly referring to the Japanese cuisine, which could be used in the segmentation and classification tasks (Fig. 3). Fig. 4 illustrates the size (number of images) of existing food image databases for different types of cuisine, annotated by the associated method of construction. We observe that the majority of databases belong to generic and Asian cuisine, while a large number of them are either collected from the web or created using other databases. Finally, it is worth mentioning that there is no benchmark food image database for general classification purposes. As food has no borders and we live in multicultural societies, there is a need to create a large food image database that will include different types of cuisines, to allow the development of systems and applications that will be able to detect and calculate the amount of as many foods as possible. Therefore, the creation of an annotated food image dataset that would take into account the type of cuisine could include foods with the same name but from different regions. For example, it is possible for an annotated food image database to contain the same food name and characterize it additionally by its cuisine or its region. Table II summarizes the most representative food image databases and their most significant features.
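As an illustration of the last suggestion, an annotation record could carry the cuisine next to the food name, so that identically named dishes from different regions remain distinguishable. The following is only a minimal sketch: the record layout, the field names (`image_id`, `food_name`, `cuisine`) and the sample entries are hypothetical and are not taken from any of the reviewed databases.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FoodAnnotation:
    """One annotated food item (hypothetical layout, for illustration only)."""
    image_id: str
    food_name: str
    cuisine: str


# Toy records: the same food name appears under two different cuisines.
annotations = [
    FoodAnnotation("img_001", "dumplings", "Chinese"),
    FoodAnnotation("img_002", "dumplings", "Polish"),
    FoodAnnotation("img_003", "moussaka", "Greek"),
]


def cuisines_by_food(records):
    """Map each food name to the set of cuisines it is annotated with."""
    index = defaultdict(set)
    for rec in records:
        index[rec.food_name].add(rec.cuisine)
    return dict(index)


index = cuisines_by_food(annotations)
```

With such a layout, a classifier could be trained on either the food name alone or the (food name, cuisine) pair, depending on the granularity required by the application.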

III. FOOD IMAGE SEGMENTATION
Segmentation is the initial step required to identify food and refers to the process of localizing and extracting regions that have different colour and texture features. The purpose of food image segmentation is to localize a food item or the food items (if there is more than one) present in an image, and to separate them from the background or other food items [24]. When the image contains more than one food, food segmentation is considered a necessary step in dietary assessment systems. It is a challenging task to segment foods that overlap each other, foods that have an indeterminate shape, or foods that do not have strong colour or texture features in contrast with the other food items in a plate. In addition, the lighting conditions under which an image is taken can affect the segmentation step by creating shadows and reflections [17]. Although segmentation is a difficult process, its accuracy directly affects the effectiveness of the subsequent steps, such as the classification and volume estimation. The main metrics for assessing food image segmentation are the Intersection over Union (IoU):

$$\mathrm{IoU} = \frac{|Y_{true} \cap Y_{pred}|}{|Y_{true} \cup Y_{pred}|}$$

where $Y_{true}$ is the ground truth of the food image and $Y_{pred}$ is the prediction mask; the meanIoU for multiclass segmentation:

$$\mathrm{meanIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_i$$

where $N$ is the number of food classes; and the pixel accuracy:

$$\mathrm{Pixel\ accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where True Positive (TP) represents a pixel that is correctly predicted to belong to the given class, True Negative (TN) represents a pixel that is correctly identified as not belonging to the given class, False Positive (FP) represents a pixel that is wrongly predicted to belong to the given class and False Negative (FN) represents a pixel that is wrongly identified as not belonging to the given class.
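The three segmentation metrics can be illustrated with a short sketch. It assumes that masks are stored as 2D grids of integer class labels with 0 denoting the background, and that a class absent from both masks receives an IoU of 1.0; both are conventions chosen here for illustration, and the toy masks are made up.

```python
def iou(y_true, y_pred, cls):
    """Intersection over Union of class `cls` between two label masks."""
    inter = union = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            in_t, in_p = t == cls, p == cls
            inter += in_t and in_p   # pixel belongs to cls in both masks
            union += in_t or in_p    # pixel belongs to cls in at least one mask
    # Assumed convention: a class missing from both masks counts as a perfect match.
    return inter / union if union else 1.0


def mean_iou(y_true, y_pred, classes):
    """Average of the per-class IoU over the given food classes."""
    return sum(iou(y_true, y_pred, c) for c in classes) / len(classes)


def pixel_accuracy(y_true, y_pred, cls):
    """(TP + TN) / (TP + TN + FP + FN) for class `cls`."""
    correct = total = 0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            correct += (t == cls) == (p == cls)  # counts both TP and TN
            total += 1
    return correct / total


# Toy 3x3 masks: 0 = background, 1 and 2 = two food classes.
truth = [[0, 1, 1],
         [0, 1, 1],
         [2, 2, 0]]
pred = [[0, 1, 1],
        [0, 0, 1],
        [2, 2, 2]]

iou_1 = iou(truth, pred, 1)            # 3 shared pixels out of 4 in the union
miou = mean_iou(truth, pred, [1, 2])
acc_1 = pixel_accuracy(truth, pred, 1)
```

In this toy case, class 1 is missed at a single pixel, so its IoU is 3/4, while its pixel accuracy is 8/9 because the many correctly labelled background pixels count as true negatives. This illustrates why pixel accuracy tends to look more favourable than IoU on images dominated by background.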
Several methods have been proposed to address issues in food image segmentation. An initial classification of methods is: (i) semi-automatic food segmentation, (ii) automatic ML with handcrafted feature extraction, and (iii) automatic ML with deep learning feature extraction.
In several studies, the use of semi-automatic techniques for food segmentation is preferred, where the user is asked to select regions of interest in the image, the foreground and the background (Fig. 5). The results of semi-automatic techniques are highly accurate, distinguishing details of each food item in the image, as the user knows the exact boundaries of the food items contained in the image/tray [53], [54], [55], [56]. Hassannejad et al. [57] used a customized interactive graph cut algorithm. Initially, the user imposes a number of hard constraints on the segmentation by marking some pixels. Then, a Gaussian mixture model and K-means are used to generate image clusters and initialize the graph. Finally, an iterative graph cut algorithm is used to segment the food image. The users who were familiar with the application achieved up to 93% accuracy (images with less than 5% of wrongly segmented pixels), while the users who were not familiar achieved 88% accuracy.
In automatic food segmentation methods with handcrafted feature extraction, the user only needs to capture the image. Then, existing image processing techniques are employed to solve the segmentation problem by making assumptions about the shape, colour and number of food items in the plate. These approaches use algorithms and techniques to extract texture, shape and colour features, such as the J measure-based segmentation (JSEG), the Normalized cuts (NCut) [58], or region merging and growing [59]. For example, Anthimopoulos et al. [60] suggested the use of a five-step food segmentation algorithm based on colour information: CIELAB conversion, pyramidal mean-shift filtering, region growing, region merging and plate detection/background subtraction. The proposed method achieves an 88.5% segmentation accuracy.
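To make the region growing idea concrete, the following minimal sketch grows a region from a seed pixel on a toy single-channel grid. It is not the algorithm of [60]: the 4-connectivity, the comparison against the seed value, the tolerance `tol` and the toy image are all assumptions made here for illustration, and a real system would operate on filtered colour images.

```python
from collections import deque


def region_grow(image, seed, tol):
    """Return the set of 4-connected pixels whose value is within `tol` of the seed value."""
    rows, cols = len(image), len(image[0])
    seed_val = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region \
                    and abs(image[nr][nc] - seed_val) <= tol:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy intensity grid with a dark area (top left), a mid-tone strip and a bright area (right).
image = [
    [10, 12, 90, 91],
    [11, 13, 92, 90],
    [50, 52, 93, 95],
]

dark_region = region_grow(image, (0, 0), tol=5)
bright_region = region_grow(image, (0, 2), tol=5)
```

Here the dark seed collects the four top-left pixels and stops at the mid-tone strip, whereas the bright seed collects the six pixels of the two right-hand columns. A subsequent region merging step would then decide whether neighbouring regions are similar enough to be fused.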
In recent years, deep learning approaches [61], [62], [63], [64], and in some cases Convolutional Neural Networks (CNNs) [65], have shown state-of-the-art performance in computer vision tasks, allowing the use of automated food image segmentation methods. In these approaches the segmentation models consist of two main parts: (i) the first part acts as an encoder by extracting a large number of features from the image, while (ii) the second part acts as a decoder and is responsible for image segmentation (Fig. 6). Several popular CNN models, such as ResNet50 [66], [67] and InceptionV3 [68], are used as the backbone network in the encoder, while well-known architectures, such as the Fully Convolutional Network (FCN) [69] and DeepLab [70], are used as a decoder. Shimoda and Yanai [71] presented a method to enforce consistency between a food segmentation model and a plate segmentation model. More specifically, they used Class Activation Mapping (CAM), which is one of the basic visualization techniques of CNNs. A food category classifier can highlight food regions containing no plate regions, while a food/non-food category classifier can highlight food regions including plate regions. They demonstrated that this boosted the accuracy of weakly-supervised food segmentation. In a recent study, Wu et al. [49] proposed a novel fully automatic semantic segmentation method consisting of a recipe learning module and an image segmentation module. They used a Long Short-Term Memory (LSTM) network as the encoder and the vision transformer architecture as the decoder, and they achieved 0.439 mIoU on the FoodSeg103 database. In a new study, Nguyen and Ngo [72] presented an instance segmentation model for multiclass segmentation, using the terrace representation for food items. They employed the panoptic quality metric, a combination of IoU and pixel accuracy metrics, and achieved a score of 0.693.

Although the segmentation step is not necessary in several dietary assessment systems, we observe that the studies using the semi-automated segmentation method result in better performance. However, this leads to a delay in calculating the nutritional composition, as it requires interaction with the user of the system. In automated food segmentation, the use of deep learning techniques has resulted in better performance compared to handcrafted techniques. Instance segmentation is a technique that has been used on a small scale (Fig. 7) in food image segmentation and could further improve the segmentation performance of dietary assessment systems. Moreover, it can be used to segment multiple foods in an image, allowing the development of more realistic applications, as each dish tends to have more than one food item. This presupposes the use of annotated food image databases, as they are a prerequisite for building segmentation models based on deep learning techniques. In some recent studies the food image segmentation step is omitted, and in some others its performance is not reported. In other studies, although the performance of the methods used to segment food images is high and improves the classification accuracy, there are still open issues related to cases where mixed or overlapping foods exist. In these cases, state-of-the-art segmentation techniques, such as semantic and instance segmentation, can be used to improve performance and increase accuracy in the classification step. In Table III, the main segmentation techniques are summarized.

IV. FOOD IMAGE CLASSIFICATION
Food image classification is a complex process that may be affected by many factors, for instance by the way food is cooked, or by the presence of other food items, such as sauce, covering the main food. Given that the results of classification highly affect the effectiveness of the next steps (the food volume estimation step and the food nutritional composition step), researchers have developed various techniques and methods to improve classification accuracy. The training of the classifier is affected by the number and quality of images used in the training phase, so the food database plays a crucial role in this process. Moreover, the techniques used to extract the features of the images, through which the images are recognized, greatly affect the accuracy of the classifier. The most basic metrics used for classification models are top-1 and top-5 accuracy. Top-1 accuracy is the accuracy where the true class matches the most probable class predicted by the model. Top-5 accuracy is the accuracy where the true class matches any one of the 5 most probable classes predicted by the model. Other metrics are also used for the classification task. The task of food image recognition can be divided into two categories: the traditional machine learning approach with handcrafted features and the deep learning approach using convolutional neural networks (Fig. 8).
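The top-1 and top-5 definitions are two instances of top-k accuracy, which can be sketched as follows. The function name, the list-based score layout and the toy scores are assumptions made for this illustration; with only three classes in the toy data, top-2 stands in for top-5.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from the most to the least probable.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy predicted class probabilities for four images and three food classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 1, 0, 0]  # true class index of each image

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Only the first image is classified correctly at rank one, whereas two more images have their true class at rank two, so top-k accuracy can never decrease as k grows. This is why the reported top-5 figures are always at least as high as the top-1 figures for the same model and database.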

A. Traditional Machine Learning Approaches
Approaches that fall into this category are differentiated based on the technique chosen to extract the image features and on the classifier selected for their classification. Feature extraction is the process in which the most representative features of an image are extracted, creating the corresponding feature vector. There are several feature extraction algorithms, such as speeded-up robust features (SURF), scale invariant feature transform (SIFT), local binary patterns (LBP) [73], Gabor filter [74] and histogram of oriented gradients (HOG). In numerous approaches the feature extraction is performed by a combination of the above algorithms, improving the classification accuracy. The extracted features then feed a classifier for training the prediction model, based on machine learning methods such as support vector machine (SVM), bag of features (BoF), random forests (RF), k-nearest neighbours (kNN) [75] and multiple kernel learning (MKL). For example, Bossard et al. [23] introduced a method to mine discriminative parts using RF. To improve the effectiveness of mining and classification, they consider patches that are aligned with image superpixels. For each superpixel, they extracted dense SURF and L*a*b* colour features. Then, they trained a multi-class SVM for final classification, with an average accuracy of 50.8% on the Food-101 image dataset. In another study, Kawano and Yanai [76] proposed a food recognition system that can identify 256 food categories using the food image database UEC-Food256. They applied RootHoG and colour features and coded them into a Fisher Vector to train a one-vs-all linear classifier, with 50.1% top-1 accuracy and 74.4% top-5 accuracy. Pouladzadeh et al. [74] classified 30 food classes using a cloud-based SVM classifier, achieving 94.5% accuracy. They used a combination of features, including colour, texture, size and shape, while most prevailing methods use only colour and shape features. Table IV summarizes traditional food classification approaches and their main characteristics.

TABLE IV TRADITIONAL CLASSIFICATION APPROACHES
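The two-stage structure described in this subsection (handcrafted feature extraction followed by a classifier) can be sketched with deliberately simple components. The example below uses a joint RGB colour histogram as the feature vector and a kNN classifier; this pairing, the bin count, the class names and the toy pixel data are assumptions chosen for brevity and do not reproduce any of the reviewed systems.

```python
import math
from collections import Counter


def colour_histogram(pixels, bins=4):
    """Normalised joint RGB histogram; `pixels` is a list of (r, g, b) values in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    return [h / len(pixels) for h in hist]


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to `query` (Euclidean distance)."""
    dists = sorted(
        (math.dist(f, query), lbl) for f, lbl in zip(train_features, train_labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]


# Toy "images": four pixels each, reddish for one class and greenish for the other.
train_images = [
    [(200, 30, 30)] * 4,
    [(220, 40, 20)] * 4,
    [(30, 200, 30)] * 4,
    [(40, 180, 50)] * 4,
]
train_labels = ["tomato", "tomato", "salad", "salad"]  # hypothetical class names

train_features = [colour_histogram(img) for img in train_images]
query_features = colour_histogram([(210, 20, 40)] * 4)
predicted = knn_predict(train_features, train_labels, query_features, k=3)
```

The reddish query image falls into the same histogram bin as the two "tomato" training images, so the vote is decided by colour alone. The same limitation explains why the reviewed studies combine colour with texture, shape or gradient descriptors, since many foods share similar colours.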

B. Deep Learning Approaches
The CNN is a class of deep neural networks (DNNs) and constitutes the state-of-the-art method in image recognition. CNNs are most commonly used to analyse visual imagery and frequently work behind the scenes in image classification. A CNN convolves learned features with input data and uses 2D convolutional layers, which makes this type of network well suited to processing 2D images. Compared to other image classification algorithms, CNNs use very little pre-processing. A CNN works by extracting features directly from images, which eliminates the need for manual feature extraction. The features are not predefined; they are learned while the network is trained on a set of images. This makes deep learning models extremely accurate for computer vision tasks. CNNs learn feature detection through tens or hundreds of hidden layers, and each layer increases the complexity of the learned features.
Several studies use pre-trained CNN models [77], [78], [79], [80], [81], [82] to classify food images, such as Inception V3 [83], [84] and EfficientNet [85], [86]. Moreover, fine-tuning [87], transfer learning [88] and data augmentation techniques are applied to improve the accuracy of classification models. In recent years, deep learning has clearly become the state of the art for food image classification [89]. Hassannejad et al. [90] evaluated a fine-tuned version of the Inception V3 model, increasing the accuracy and decreasing the computational cost. In particular, they achieved 81.5%, 76.2% and 88.3% top-1 accuracy on the UEC-Food100, UEC-Food256 and Food-101 databases, respectively. In addition, they achieved 97.3%, 92.6% and 96.9% top-5 accuracy on the UEC-Food100, UEC-Food256 and Food-101 databases, respectively. In another study, the authors built a DNN model consisting of two stages: the first stage is a residual network, encoding generic visual representations of food images, while the second stage is a slice network with a slice convolutional layer capturing the vertical food features. The extracted features are concatenated and fed to the fully connected layers that output the classification prediction. Tan and Le [91] proposed a new CNN scaling architecture, the EfficientNet. They scaled up the depth, width and resolution of the network, outperforming the state-of-the-art deep learning studies. EfficientNet-B7 achieves 93% accuracy on the Food-101 dataset. In several deep learning-based studies for food recognition, it is observed that the evaluation of the models is performed on the following food image databases: UEC-Food100 [92], UEC-Food256 [93], Food-101 [94] and VIREO-172 [95].
Fig. 9 shows the box plots of top-1 accuracy achieved by deep learning approaches for existing food image databases. We observe that the top-1 accuracy features a high interquartile range for the UEC-Food256 and Food-101 databases; this is an indication of the complexity characterising multi-class problems. On the other hand, a higher and less spread top-1 accuracy is obtained for databases with a small number of classes or focused on specific tasks. Fig. 10 presents the percentage usage of existing food image databases as development datasets in food recognition, where databases with a large number of classes are used more often. In addition, a considerable number of studies (18%) do not report any information about the databases used, diminishing their replicability potential. We observe that Food-101 is the database with the highest percentage, while newer databases have been used very little. Table V presents the main characteristics of deep learning approaches applied in food image classification. We can observe that the accuracy of conventional classification models can be improved by combining feature extraction algorithms. Moreover, the combination of different classifiers seems to work better than using only one classifier. In addition, we notice that the traditional approaches are used on small food datasets where deep learning techniques cannot be applied, and it is evident that deep learning techniques for food image recognition outperform the traditional ones [19]. Although CNNs were first used to extract features that feed a classifier, in recent years only deep learning models have been used to classify food images. Furthermore, we note that there is a tendency to use deeper networks to train food image classification models (for example, EfficientNet-B7 consists of 813 layers). However, the need for computing power seems to limit the possibilities of such an approach. In the future, with the ever-increasing computing power available to train deep learning models (e.g., deep learning cloud servers) and to build deeper networks, combined with training on larger datasets, their performance can be further improved.

V. FOOD VOLUME ESTIMATION
The last step in food nutritional composition systems comprises the estimation of food quantity and the analysis of its nutritional composition, such as carbohydrates, proteins, fat and total calories. Accurate estimation of the amount of food assumes that the previous stages of the segmentation and recognition of the food have been accomplished correctly. Then, using appropriate approaches, such as 3D reconstruction, pre-built shape templates, perspective transformation, depth camera and deep learning techniques, the volume of food is estimated. This is a demanding process which in most cases requires a specific number of photos and a specific way of taking them, a controlled environment and in many cases dedicated cameras for capturing food images. In fact, calculating the nutritional composition of a food is a challenging task, even for nutritionists. This is why, in many nutritional estimation systems, it is considered appropriate to have a reference object to determine the depth of the image. The metrics which are used to evaluate the volume of food are the mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| V_{real,i} - V_{est,i} \right|$$

the mean absolute percentage error (MAPE):

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{V_{real,i} - V_{est,i}}{V_{real,i}} \right|$$

and the root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( V_{real,i} - V_{est,i} \right)^{2}}$$

where $V_{real}$ is the real volume of food, $V_{est}$ is the estimated volume and $n$ is the total number of foods. Having estimated the amount of food, its nutritional composition can be calculated using local food composition databases.

TABLE V DEEP LEARNING CLASSIFICATION APPROACHES

Several studies require taking two or more images of the food for its 3D reconstruction [96], [97]. The first step in these studies is the extraction of feature points, using appropriate feature extraction algorithms, among others SIFT and SURF.
Then, the relative camera pose between the captured images is estimated. Furthermore, reference objects with known dimensions, for instance a reference card, are used to estimate the scale of the image. Subsequently, dense stereo matching is utilized for 3D food reconstruction, projecting the image coordinate system to the world coordinate system. The next step is to estimate the volume of the food by removing the background from the image and keeping only the food. Finally, the nutritional composition of the food is analysed using a relevant nutrient database, such as the USDA Food and Nutrient Database for Dietary Studies (FNDDS) [98].

Dehais et al. [99] estimated the volume of multi-food meals by capturing two images, with the food placed inside an elliptical plate and a reference card placed next to it. The proposed system comprises three stages. The first stage is extrinsic calibration (computation of the camera rotation and translation matrices), which is performed in three steps: salient point matching, relative pose extraction and scale extraction. The second stage is dense reconstruction, which also consists of three steps: rectification of the images, stereo matching and point cloud generation (Fig. 11). Volume estimation is the final stage, which consists of the following steps: food surface extraction, dish surface extraction and volume calculation. The system was evaluated on 77 food dishes of known volume and achieved a MAPE of 8.2% to 9.8% on two different datasets. It is worth mentioning that, in order to extract the relative pose, the researchers modified the classical random sample consensus (RANSAC) algorithm by including local optimization and an adaptive threshold estimation method. 3D food reconstruction is a methodology that can be applied to food of any shape and to food images captured in a non-controlled environment. However, the need to capture at least two images and to extract features using image processing algorithms, such as SIFT or SURF, makes the methodology sensitive to image acquisition and significantly slows the process, affecting food volume estimation accuracy.
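The scale extraction and volume calculation steps described above can be illustrated with a short Python sketch. It assumes, for illustration only, that the reconstruction has already been resampled into two regular height grids (food surface and dish surface) expressed in arbitrary reconstruction units, and that the width of the reference card has been measured in the same units. The function names, the grid representation and the numbers are hypothetical and do not reproduce the implementation in [99].

```python
def scale_factor(known_length_cm, reconstructed_length):
    """Centimetres per reconstruction unit, from a reference object."""
    if reconstructed_length <= 0:
        raise ValueError("reconstructed length must be positive")
    return known_length_cm / reconstructed_length


def volume_between_surfaces(food_heights, dish_heights, cell_area):
    """Sum (food - dish) * cell_area over a regular grid.

    Cells where the food surface lies below the dish surface contribute zero.
    """
    total = 0.0
    for food_row, dish_row in zip(food_heights, dish_heights):
        for food_h, dish_h in zip(food_row, dish_row):
            total += max(food_h - dish_h, 0.0) * cell_area
    return total


# Hypothetical 2 x 2 height grids in reconstruction units.
food = [[2.0, 2.0], [2.0, 2.0]]
dish = [[0.0, 0.0], [0.0, 0.0]]
volume_units = volume_between_surfaces(food, dish, cell_area=1.0)

# Hypothetical reference card: 8.5 cm wide, measured as 4.25 units.
s = scale_factor(8.5, 4.25)

# Lengths scale with s, so volumes scale with s cubed.
volume_cm3 = volume_units * s ** 3
```

Because the volume scales with the cube of the linear scale factor, an error in scale extraction is amplified: a 5% error in the scale changes the volume estimate by roughly 16%.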
Some studies suggest the use of specific geometrical shapes or templates (for example, spherical and cylindrical objects) to reconstruct the food from 2D space into 3D space using a single image [100], [101], [102]. Moreover, they utilize a fiducial marker (a checkerboard pattern or a reference card) to obtain the camera parameters and to provide a reference for the scale and pose of each food item. The requirement for predefined geometrical shapes or templates for the 3D reconstruction of food renders these methods extremely difficult to use in systems for daily dietary monitoring, because of the different and irregular shapes that food items present. For instance, in [103], the dimensions of the reference object used by the user must be pre-registered in order to calculate the real size of the food region. The authors assume that the food portion height is correlated with the food size, and they estimate the calories of food items directly from the food size. For this purpose, they fit a quadratic curve that estimates food calories from the 2D size of the food. The quadratic curve of each food is calculated from data annotated with real food calories. This approach gives good results for foods that have a regular shape, such as lasagna and cheesecake. Otherwise, the calculated amount of food is inaccurate, and the approach must be used in conjunction with methodologies for estimating the volume of irregularly shaped food.

For food items that have irregular 3D shapes, researchers suggest using area-based volume estimation methods from a single image [104], [105]. The pinhole camera model provides a perspective transformation from 3D space to the 2D image plane [106]. Perspective transformation is a linear projection in which 3D objects are projected onto a picture plane. This causes distant objects to appear smaller than nearer ones and also means that parallel lines appear to intersect in the projected image. In order to accurately determine the food region, the 2D image should be rectified so that the projective distortion is removed. In this case, the existence of a reference object in the 2D image is a prerequisite [107]. The system proposed in [108] requires the user's thumb to be placed beside the dish when capturing the picture. The system, which already knows the dimensions of the user's thumb, can then calculate the area of each food item and multiplies the total area of food (TA) by the depth (d) of the image to estimate its volume. The advantage of the perspective transformation methodology is that it can handle irregular food shapes based on a single image. Its disadvantages are that it requires food images to be captured in a specific way and that the distance cannot be computed accurately.
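The area-based calculation described above (food area multiplied by depth) can be sketched as follows. The sketch assumes that the image has already been rectified, that the reference object lies in the same plane as the food, and that a single depth value d is available for the food item. The function names and the numbers are hypothetical and are not taken from [108].

```python
def pixels_per_cm(ref_width_px, ref_width_cm):
    """Image scale derived from a reference object of known width."""
    if ref_width_px <= 0 or ref_width_cm <= 0:
        raise ValueError("reference dimensions must be positive")
    return ref_width_px / ref_width_cm


def area_based_volume(food_area_px, ref_width_px, ref_width_cm, depth_cm):
    """Volume = real-world food area (TA) multiplied by depth (d)."""
    scale = pixels_per_cm(ref_width_px, ref_width_cm)
    # Areas scale with the square of the linear scale.
    area_cm2 = food_area_px / scale ** 2
    return area_cm2 * depth_cm


# Hypothetical values: a 2 cm wide reference spans 40 px, the segmented
# food region covers 8000 px, and the assumed depth is 3 cm.
volume_cm3 = area_based_volume(
    food_area_px=8000, ref_width_px=40, ref_width_cm=2.0, depth_cm=3.0
)
```

The single depth value is the weak point of this calculation: any error in d carries over proportionally to the volume, which is consistent with the disadvantage noted above.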
In order to obtain the depth of the food image, some studies suggest the use of special devices and sensors. In [109], new-generation smartphone cameras (a Time-of-Flight (ToF) sensor or a depth-sensing camera) were utilized to estimate depth and distance, where a pair of rear cameras can create the depth map in real time. The need for an additional depth camera to calculate the depth makes this approach less popular. However, with the development of technology that captures 3D images using smartphones, the depth camera methodology is expected to dominate in the coming years. At the moment, the high cost of these smartphones hinders the wider use of such technology.
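As an illustration of how a depth map can be turned into a volume, the following Python sketch integrates the height of the food above the table over all pixels. It is a simplified example under strong assumptions that are not taken from [109]: the camera looks straight down at a flat table, the distance to the table is known, the depth map is given in centimetres, and the focal lengths fx and fy are expressed in pixels as in the pinhole camera model. All names and numbers are hypothetical.

```python
def volume_from_depth_map(depth_cm, table_depth_cm, fx, fy):
    """Approximate volume (cm^3) of everything that rises above the table.

    depth_cm:        2D list of per-pixel distances from the camera (cm)
    table_depth_cm:  distance from the camera to the table plane (cm)
    fx, fy:          focal lengths in pixels (pinhole model)
    """
    volume = 0.0
    for row in depth_cm:
        for z in row:
            height = table_depth_cm - z
            if height <= 0:
                continue  # pixel shows the table itself, not food
            # Under the pinhole model a pixel at distance z covers about
            # (z / fx) by (z / fy) centimetres of a fronto-parallel surface.
            pixel_area = (z / fx) * (z / fy)
            volume += height * pixel_area
    return volume


# Hypothetical 2 x 3 depth map: the first column shows the table (32 cm),
# the remaining four pixels show food that is 2 cm high.
depth_map = [
    [32.0, 30.0, 30.0],
    [32.0, 30.0, 30.0],
]
volume_cm3 = volume_from_depth_map(depth_map, table_depth_cm=32.0, fx=300.0, fy=300.0)
```

In a real system the food pixels would first be selected with a segmentation mask, and a tilted camera would require fitting the table plane instead of assuming a constant table depth.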
In recent years, with the ever-increasing use of deep learning networks in computer vision problems [110], such networks have also been applied to food volume estimation. Moreover, the ever-increasing computing power has allowed the use of Generative Adversarial Networks (GANs) to estimate the amount of food [111], providing a new dimension in the solution of this problem. In [112], a CNN is employed to infer depth from RGB food images for use in Bread Units (BU) regression. For this purpose, the authors created a large-scale dataset of around 9 K different RGB-D images of 60 western dishes taken using a Microsoft Kinect v2 sensor. They showed that depth maps inferred from RGB images can replace RGB-D input data in the BU regression task. In another study [113], GANs are utilized to estimate food energy distribution. For GAN training, the authors created a food image dataset of 1875 paired images, based on ground-truth food labels and segmentation masks for each food image, including energy information correlated with the food image. The average energy estimation error is 10.89%. In Fig. 12 we can observe a quasi-even use of the different food volume estimation approaches, except for depth-camera-based ones, with deep learning and perspective transformation each covering 25% of the studies. Table VI summarizes the main food volume estimation approaches, along with the techniques used to estimate the amount of food and their performance.
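The error metrics used to report performance in this section (MAE, MAPE and RMSE) can be computed as in the following Python sketch. It assumes paired lists of real and estimated volumes in the same unit and non-zero real volumes; the function names and example values are illustrative only and do not correspond to any reviewed study.

```python
import math


def _check(real, est):
    if len(real) != len(est) or not real:
        raise ValueError("inputs must be non-empty and equally long")


def mae(real, est):
    """Mean absolute error, in the unit of the inputs."""
    _check(real, est)
    return sum(abs(r - e) for r, e in zip(real, est)) / len(real)


def mape(real, est):
    """Mean absolute percentage error, in percent (real volumes must be non-zero)."""
    _check(real, est)
    return 100.0 * sum(abs((r - e) / r) for r, e in zip(real, est)) / len(real)


def rmse(real, est):
    """Root mean square error, in the unit of the inputs."""
    _check(real, est)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(real, est)) / len(real))


# Hypothetical real and estimated volumes (ml) of two food items.
v_real = [100.0, 200.0]
v_est = [110.0, 180.0]

mae_value = mae(v_real, v_est)    # (10 + 20) / 2
mape_value = mape(v_real, v_est)  # (10% + 10%) / 2
rmse_value = rmse(v_real, v_est)  # sqrt((100 + 400) / 2)
```

MAPE is independent of the unit and of the portion size, which makes it convenient for comparing studies, whereas RMSE penalises large individual errors more heavily than MAE.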

VI. DISCUSSION
The 21st century is characterized as the century of data explosion. With AI and the Internet of Things (IoT) becoming omnipresent technologies, a huge amount of data is now being created. Since the enormous volume of image data we receive is unstructured, we rely on advanced techniques, such as machine learning, for efficient image analysis. Food image databases, food image segmentation, food classification and food volume estimation are parts of image analysis and can be used in dietary assessment systems as part of mobile health (mHealth) applications that capture images through a smartphone. Smartphones are widely used today and are easy for most people of all ages to use for capturing photos, and more specifically food images, which offers the possibility of continuous recording of health data in real time. The use of mobile devices and cloud technology to monitor health data and share it with physicians can lead to faster diagnosis and fewer misdiagnoses of diseases, such as diabetes and CVDs. In vision-based dietary assessment systems, all stages are important for building a reliable integrated system for food nutrition analysis. Although dietary assessment systems have been researched for many years, several challenges remain to be addressed.
The way food images are captured plays an important role in the individual steps of these systems. For both the creation of databases and the input to food analysis systems, the way the images are taken affects the performance of segmentation, classification, and volume estimation. In database creation, similar foods must be captured in a way that emphasizes their distinguishing features. To input food images into a dietary assessment system, many applications require images to be captured from specific shooting angles [99] and with specific objects placed next to the food [108]. These prerequisites make such applications difficult to use and discourage users from adopting them, which makes it imperative to create simpler systems.
In food image databases, the use of deep learning techniques for food recognition drives the creation of databases with the largest possible number of images for each food class. However, the existing databases are limited in the number of food classes, which depends on the dietary habits of the database constructor. Thus, there is a need to create a generic food image database which covers as many food categories as possible and represents the types of food from all cuisines. The collection of food images and the creation of food image databases is an easier task nowadays, due to the habit of capturing and posting images on social media. However, creating a database that additionally includes the ingredients of the food or its weight is still a demanding task. Furthermore, creating a database of food images annotated with their weight in addition to the type of food will help build better and more accurate models for the next steps of nutritional analysis systems. Also, one possible way to increase the number of images per food class is to use GAN models. Finally, it is worth mentioning that the acquisition of databases remains difficult, and the creation of a unified food image database cannot yet be achieved.
In several recent studies, the step of food image segmentation is omitted, and in some others the performance of this step is not reported. In other studies, although the performance of the methods used to segment food images is high and improves the classification accuracy, there are still open issues related to mixed foods. There are also open issues in cases where lighting conditions create shadows or reflections in the image or blur the food items contained in it. In these cases, state-of-the-art segmentation techniques, such as semantic and instance segmentation, can be used to improve the performance of this step and, in turn, the efficiency of the classification step.
Studies have shown that deep learning techniques perform better than traditional food image classification techniques, which is why they are considered the state-of-the-art methods for food image classification. To classify food images, as mentioned above, databases with a large number of food images are required. This requirement is even stronger for deep learning techniques, where the number of images in the database affects the performance of the food image classification system. In addition, blurred images, inadequate lighting conditions during capture and the different ways of cooking the same food can lead to misidentification of the food. The use of deeper classification models and the application of transfer learning, fine-tuning, and data augmentation techniques could improve the accuracy of deep learning classification models. The use of pre-trained DNNs on existing food image databases could lead to the construction of models with better accuracy and even lower loss.
Volume and nutrient estimation are the most challenging tasks in automated vision-based dietary assessment systems. The controlled environment for capturing food images, the need for multiple photos, the inability to estimate the volume of food with weak texture features, for instance yogurt, and the creation of databases tailored to the techniques used in each study render the estimation of the amount of food through images the most demanding stage for nutrient analysis systems. In addition, the need to use a reference object or a depth camera to calculate the scale and quantity of food limits the potential for extensive use. Moreover, food estimation techniques based on geometric patterns allow the volume to be calculated only for the few foods which have a specific shape. Finally, although the recent use of deep learning techniques in food volume estimation is a very promising approach, studies have shown that they do not outperform the existing techniques. In the 3D reconstruction approach, CNNs could be used instead of image processing algorithms to extract the features, significantly increasing the number of matched features and improving the reconstruction of the 3D point cloud of the food. One possible approach that would solve many problems regarding the way images are captured, the number of images required and the depth sensors needed would be to build a machine learning model on a food image database annotated with the weight of the food items.
Considering the continuous technological development and the available techniques for recording data, the use of alternative ways to enter data and information related to the food consumed (for example, via speech or text) could help optimize the performance of nutritional analysis systems. In particular, combining traditional food recognition and quantity estimation techniques with voice and text input and processing techniques could further improve the performance of nutritional assessment systems. In addition, using advanced deep learning techniques and algorithms, such as reinforcement learning, it is possible to build dietary assessment systems based on personalized nutrition, which provide dynamic dietary recommendations by monitoring the user's environment and aiming to optimize a reward function.
Table VII provides a comparative assessment of existing review studies, including our work, with respect to the elements of dietary assessment systems that are reviewed and assessed therein. Considering the level of information (quality, quantity, and granularity) provided by the existing reviews, we aimed herein at improving the completeness of the information by reviewing all the elements of such a system (Sections II-V) and unbiasedly capturing all the different classes of methods, techniques and algorithms that have been proposed over the last 10 years in the specified research topic. In this direction, the above discussion of both the strengths and limitations of the existing approaches, alongside the identification of solutions to their shortcomings, is aimed at strengthening future research works.

VII. CONCLUSION
This review study assessed and contrasted the methods constituting the intelligence logic of a dietary assessment system, with the aim of presenting to the reader the potential of the existing approaches. First, we highlighted the need for annotated food image databases that include meals from multiple cuisines and have an adequate size per class, in view of their use as training/test sets in image segmentation or image classification tasks. Second, we stressed the potential of instance and semantic image segmentation approaches to augment the performance of food classification models orchestrated under the same pipeline. Third, we verified, as expected, the superiority of deep learning architectures over conventional machine learning algorithms in classifying the content of food images, and the tendency to increase the number of hidden layers in order to increase the accuracy of predictions. Finally, further annotation of food images (e.g., with respect to their weight) could complement the current functionality of food volume estimation approaches.

Fig. 3. Type of cuisine distribution according to the number of classes and how they are used.

Fig. 4. Size of existing databases for different types of cuisine annotated by the means of food image collection.

Fig. 5. Example of food image segmentation using the GrabCut algorithm. The blue rectangle represents the region of interest, the white lines represent the foreground and the black lines represent the background.

Fig. 7. The counts of segmentation approaches in dietary assessment systems.

Fig. 8. A deep learning classification model of food images.

Fig. 9. Boxplot distribution of top-1 accuracy of deep learning-based food recognition algorithms for different food image databases.

Fig. 10. Percentage use of food image databases in food recognition-related studies.

TABLE I MAIN TECHNIQUES, METHODS AND PERFORMANCE METRICS FOR EACH STEP IN DIETARY ASSESSMENT SYSTEM