Food Volume Estimation by Integrating 3D Image Projection and Manual Wire Mesh Transformations

2D images can be used to capture food intake data in nutrition studies. Estimates of food volume from these images are required for nutrient analysis. Although 3D image capture is possible, it is not commonplace. Additionally, nutrition studies often require multiple food images taken by non-expert users, typically collected using mobile phones, due to their convenience. Current 2D image to 3D volume approaches are restricted by the need for prescribed camera placement, image metadata analysis and/or significant computational resources. A new method is presented combining 2D image capture and automated 3D scene projection with manual placement and resizing of wire mesh objects. 2D images, with a reference object, are taken on low specification mobile phones. 3D scene projection is calculated by twinning a cuboid in 3D space to the reference object in the 2D image. A manually selected 3D wire mesh object is then positioned over the target food item and manually transformed to improve accuracy. The virtual wire mesh object is then projected into the 3D scene and the volume of the target food item calculated. The whole process is computationally light, and runs in real-time as an app on a standard Apple iPad. Based on a user study with 60 participants, experimental evaluations of volume estimates over regular shape and ground truth food items demonstrate that this approach provides acceptable accuracy. We demonstrate that the accuracy of estimates can be increased by combining multiple independent estimates.


I. INTRODUCTION
PHOTOGRAPHS, typically as 2D digital images, are a common way to collect information on dietary intake [6], [9]. For studies in natural settings, participants often take photographs before, during or after meals. These photographs represent food consumption and are then analysed to classify eating behaviour by, for example, food types and nutritional intake [1].
However, most nutrient analysis of food requires the weight of the food item, in addition to food item identification, in order to determine relevant nutritional metrics [14]. Weight is determined by food density and volume. Food density is often determined by food identification and the application of standard food density tables, which are only available for some foods [15]. Volume calculations from photographs, as 2D images, are non-trivial without 3D depth information [6]. 3D imaging from common image sources, i.e. mobile phones with multiple front facing cameras to obtain stereo images, is not yet commonplace or affordable [17]. Thus, there is ongoing research into developing methods to generate accurate 3D volume estimates from 2D images [10].
To support food volume estimation, two main approaches have been applied: with [14], [17] and without [18] reference objects in images. Reference objects can be explicitly placed before image capture, for example a card [17], a piece of local currency or a standard sized object like a Rubik's cube [14], or extracted from identified image elements, for example plates or serving vessels of known dimensions [2]. Reference objects provide 3D cues to determine image depth, which is needed to generate 3D volume estimates [6]. Alternatively, approaches without reference objects attempt to (i) identify image features, for example parallel elements, that can be used to determine an image vanishing point or (ii) use properties of the camera placement and environment knowledge [18], [22], i.e. the camera becomes its own reference object.
One challenge with approaches using reference objects is that they can be computationally expensive, for example involving the use of neural networks [14], [17] or requiring multiple food images [6]. Another significant issue, for approaches with and without reference objects, is the casual nature of the 2D image capture. In practical use, cameras are not typically on tripods where orientation is known or positions are accurately captured. Recent work [22] proposed a hybrid approach with mobile phones grounded on an image plane, i.e. a table, and camera orientation data presented as machine readable data in the image capture. However, this approach restricts camera placement, and in many environments where food intake data would be collected, e.g. in rural locations in the developing world, there may be no table or stable surface to ground the camera during eating occasions. Also, strict camera placement requirements might be inconvenient for the user, or the user might simply forget to follow them for every image capture. Thus, there is no way of being certain that an image was taken on an appropriate surface.
The method presented here requires only 2D images, with no camera or image meta-data, to generate 3D food volume estimates. Through a combination of automated and manual activities, the volume of food items in 2D images can be generated in a process that is computationally light and runs in real-time. This offers considerable flexibility in its application with real world image collection where precise camera view information cannot be collected, for example when photographs are taken with low specification cameras/mobile phones or by untrained users under various lighting and photograph framing conditions. This paper has three main contributions:
1) A new semi-automated method combining 2D image capture and automated 3D scene projection with manual placement and resizing of wire mesh objects to generate volume estimates of food items.
2) An evaluation of the method on representative food item images, where individual volume estimates are collected via a user study (n=60).
3) A demonstration of how combining pairs and triples of food volume estimates can improve the accuracy of the volume estimates, depending on the regularity of the food item shape.

II. BACKGROUND
Portion size estimation has a large and growing body of research. Here, we overview a selection of particularly relevant related works. Recent reviews in this area can be found in [1], [5], [11], [12], [15], [21]. Approaches for single view image-based food portion estimation typically involve the identification of reference objects in the scene. Common approaches include (i) the inclusion of the reference object [6], [17], for example a fiducial, at image generation [7], (ii) use of prior knowledge of objects in the scene, for example known bowl or plate sizes [2], [7], or (iii) knowledge of the capture device, e.g. the digital camera, and environment context [18], [22].
Fang et al. [7] use a fiducial marker to estimate the scale and pose of objects in a scene and also provide estimates of camera parameters. Their approach is based on the use of pre-determined geometric models, namely cylinder and prism models, to determine height and radius estimates. Prior knowledge of the container shape is used as geometric contextual information. For example, the prism method utilises the area of the plate in the image. They used their previously determined criterion of 15% error or less [13] and found that, out of 19 food types, only 3 (lettuce, French dressing and ketchup) were outside the 15% error range.
Beltran et al. [2] consider the use of configurable wire mesh overlays on 2D images to estimate food portion sizes. Eleven wire frames are provided including cuboid, cylinder, sphere, wedge, ellipse, half spheres (bowl and dome), half ellipse, section of sphere (cap), tunnel and irregular shape. After placement, the user can customize each wire frame to food portions using virtual pressure points to change the wire frame's shape. In order to generate reference points, the diameter and depth of standard plates, bowls and glasses, as present in the 2D images, were measured and provided to the portion calculating algorithm.
Beltran et al. [2] also considered rater reliability of portion estimates across 150 food images with results from two dietitians and three engineers. They found that although dietitians had high inter-rater reliability for volumes served (r=0.771), this decreased with smaller portions, i.e. for portions left after serving/eating (r=0.629). This was also the case for the engineers but overall the engineers had better scores, although they note this is likely due to familiarisation with the approach. Individual food types were not defined but serving containers were limited, noted as small bowl (n=48), large bowl (n=42) and plate (n=56).
Puri et al. [18] present a system that, given a set of three images and a verbal description of food items, performs object recognition and 3D reconstruction to estimate food volumes. 3D models of the food items are constructed via a pairwise classification framework. Camera pose estimation treats the camera as the reference object. However, the image analysis requires the use of a remote processing site with server-based computer vision processing. Thus, it is not practical for real-time use or where there is limited internet connectivity. Table 1 provides a summary of approaches to food volume estimation, comparing key aspects of each approach. Our proposed method is included for comparison.

III. PROPOSED METHOD
The work described here is part of a larger system which aims to provide a platform for the collection of dietary intake data, including images of food, the analysis of dietary intake data and the interpretation of dietary intake data. A simplified view of the Voice-Image-Sensor technologies for Individual Dietary Assessment (VISIDA) system is shown in Fig. 1. The basic process is that users collect images and voice recordings of food intake data for themselves and others in their household (as required), before and after food consumption, via a smartphone app. This data is uploaded into a content management system (CMS), seen as Step 2 of Fig. 1, where individual food items are identified and matched to food composition databases (i.e. the food items are "tagged"). These tagged images are then queued for food volume estimation and supporting this estimation is the approach reported in this paper. The estimated food item volume is returned to the CMS and contributes to the ongoing analysis of dietary intake data. Here, we present a novel approach to the 3D volume estimation component of this process. An overview of our approach can be seen in Fig. 2. In Fig. 2, the corners of a reference card on a captured 2D photograph are automatically identified (the blue dots in image 1). A 3D cuboid (the blue rectangle in image 2) is twinned to the reference object and the projection depth angle into the 3D scene determined (the blue arrow in image 2). A user selected wire mesh (in this case, a sphere in image 3) is then placed on a target food item. When the wire mesh object is projected into the 3D scene (image 4) it needs to be rescaled to cover the target food item (image 5). With the 3D wire mesh at the correct 3D depth and scale, the volume of the food item it covers can be generated based on the 3D wire mesh shape.
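The need to rescale the wire mesh after it is projected into the scene can be illustrated with a simple perspective model. The following is a minimal sketch, assuming an ideal pinhole camera (the camera model used by the app is not specified here); the function names and the focal length parameter `focal_px` are hypothetical and chosen only for illustration.

```python
def apparent_size_px(real_size_cm: float, depth_cm: float, focal_px: float) -> float:
    """On-screen size (pixels) of an object of a given real size at a given depth.

    Assumes an ideal pinhole camera: apparent size is inversely
    proportional to depth.
    """
    if depth_cm <= 0:
        raise ValueError("depth must be positive")
    return focal_px * real_size_cm / depth_cm


def rescale_factor(old_depth_cm: float, new_depth_cm: float) -> float:
    """Scale factor that keeps the on-screen size unchanged after a depth change."""
    if old_depth_cm <= 0 or new_depth_cm <= 0:
        raise ValueError("depths must be positive")
    return new_depth_cm / old_depth_cm


# A mesh fitted at 40 cm that is moved to 50 cm appears smaller on screen,
# so it must be enlarged by 50 / 40 to cover the same food item again.
factor = rescale_factor(40.0, 50.0)
```

Under this assumption, a mesh that covers the food item before projection will generally stop covering it once it is moved to the projected depth, which is why a second manual resize is needed.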
Two examples of the volume estimation tool in use can be seen in Fig. 3 and a breakdown of the manual and automated elements of the approach is presented in Fig. 4. A food image is provided to the tool, running on an Apple iPad, from the CMS where the food item of interest has been tagged. The CMS also provides initial estimates of the fiducial corner tags with the food image. The fiducial card's corners are identified automatically using the EMGU (v3.4.1) C# adaptation for OpenCV (v4.2.0). As image quality and lighting conditions can impact the automatic corner detection, a manual review of the corner matching is needed (Step 2a in Fig. 4). If the corners are a poor match (as indicated by the tool as an error distance from the known size of the fiducial object), the user can adjust the corner placement manually (Step 2b). As the dimensions of the real fiducial object are known, an equivalent 3D virtual version of the fiducial object is placed at the center of the real fiducial object, on the 2D image, and automatically rotated around the X, Y and Z axes to optimise the match of the virtual fiducial corners to the real fiducial corners (see Algorithm 1). The corner optimisation attempts to find a best-fit match in one dimension at a time, i.e. starting by rotating about the X axis. The rotation algorithm looks ahead in unit rotation increments to determine the best solution at the current angle and will roll back its current rotation if the solution is not improved. This is repeated across the other dimensions, Y and Z. This whole process is then repeated, with the X, Y and Z axes, until no better solutions are found and the mapping has stabilised. Thus, the optimised position of the virtual fiducial is a mapping to the real fiducial card position and enables 3D image depth projection in later steps.
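The axis-by-axis search described above can be sketched as a simple coordinate-wise rotation search. This is an illustrative sketch only, not the implementation of Algorithm 1: it assumes an orthographic projection (the tool projects into a 3D scene), an X-then-Y-then-Z Euler convention, and a card centred at the origin. All function names, and the 8.5 cm by 5.4 cm card size in the usage lines, are hypothetical.

```python
import math


def rotate(point, rx, ry, rz):
    """Rotate (x, y, z) by Euler angles in radians about X, then Y, then Z."""
    x, y, z = point
    y, z = y * math.cos(rx) - z * math.sin(rx), y * math.sin(rx) + z * math.cos(rx)
    x, z = x * math.cos(ry) + z * math.sin(ry), -x * math.sin(ry) + z * math.cos(ry)
    x, y = x * math.cos(rz) - y * math.sin(rz), x * math.sin(rz) + y * math.cos(rz)
    return x, y, z


def project_card(angles, width, height):
    """2D corner positions of the rotated card (orthographic: drop z)."""
    hw, hh = width / 2.0, height / 2.0
    corners = [(-hw, -hh, 0.0), (hw, -hh, 0.0), (hw, hh, 0.0), (-hw, hh, 0.0)]
    return [rotate(c, *angles)[:2] for c in corners]


def corner_error(angles, target, width, height):
    """Sum of distances between projected virtual corners and target corners."""
    projected = project_card(angles, width, height)
    return sum(math.dist(p, t) for p, t in zip(projected, target))


def twin_card(target, width, height, step=0.01, max_sweeps=100, max_steps=700):
    """Search one axis at a time, keeping a step only if the error improves."""
    angles = [0.0, 0.0, 0.0]
    best = corner_error(angles, target, width, height)
    for _sweep in range(max_sweeps):
        improved = False
        for axis in range(3):                # X, then Y, then Z
            for direction in (1.0, -1.0):
                for _step in range(max_steps):
                    trial = list(angles)
                    trial[axis] += direction * step   # look ahead one increment
                    err = corner_error(trial, target, width, height)
                    if err >= best:
                        break                # roll back: discard the trial
                    angles, best, improved = trial, err, True
        if not improved:
            break                            # mapping has stabilised
    return angles, best


# Usage: recover an orientation for a card rotated 0.3 rad about Z.
target_corners = project_card((0.0, 0.0, 0.3), 8.5, 5.4)
fitted_angles, fitted_error = twin_card(target_corners, 8.5, 5.4)
```

A search of this kind only ever accepts improvements, so the final corner error is never worse than the starting error, but it can settle in a local minimum; this is one reason a manual review of the corner match remains useful.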
The user then picks a wire mesh object (sphere, dome, bowl or box) that best matches the food item (step 4) and then manually resizes and reshapes it (X, Y, Z transformations) to best match the shape of the food object (step 5). Using the orientation of the virtual fiducial object, the wire mesh object is projected, as a 3D object, into the food image scene (step 6 and see Algorithm 2). This often results in a change of scale of the wire mesh object as it is moved into the 3D scene. Thus, the user must resize it at this new 3D depth so that it matches the food item (repeating steps 5-7). When this is complete, the wire mesh object is at the projected depth of the food item and its volume can be used to estimate the volume of the food item it has covered (step 8 and see Algorithm 3).
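Once a wire mesh has been fitted, its volume follows from its shape and its extents along X, Y and Z. The sketch below is not Algorithm 3; it assumes closed-form primitives, with the sphere treated as an ellipsoid after deformation and the dome and bowl both treated as half ellipsoids. The function name `mesh_volume_ml` and the convention that the Z extent is the height of the half shape are assumptions made for illustration.

```python
import math


def mesh_volume_ml(shape: str, size_x_cm: float, size_y_cm: float, size_z_cm: float) -> float:
    """Volume in mL (1 cm^3 = 1 mL) of a primitive with the given full extents.

    Assumed shape models (illustrative only):
      box          -> cuboid
      sphere       -> ellipsoid with semi-axes x/2, y/2, z/2
      dome / bowl  -> half ellipsoid with semi-axes x/2, y/2 and height z
    """
    if min(size_x_cm, size_y_cm, size_z_cm) <= 0:
        raise ValueError("all extents must be positive")
    if shape == "box":
        return size_x_cm * size_y_cm * size_z_cm
    if shape == "sphere":
        return (4.0 / 3.0) * math.pi * (size_x_cm / 2) * (size_y_cm / 2) * (size_z_cm / 2)
    if shape in ("dome", "bowl"):
        return (2.0 / 3.0) * math.pi * (size_x_cm / 2) * (size_y_cm / 2) * size_z_cm
    raise ValueError(f"unknown shape: {shape}")


# Usage: a sphere mesh stretched to 7 x 7 x 6.5 cm, e.g. fitted to an apple.
apple_volume = mesh_volume_ml("sphere", 7.0, 7.0, 6.5)
```

Because the volume scales with the product of the three extents, a small error in each dimension compounds, which is consistent with the emphasis on resizing the mesh carefully at the projected depth.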
To determine the accuracy of the approach, a user study was conducted. The aim of this study was to evaluate the error rate of the new technique by comparing user trials to ground truth measures of non-food and real food items. Also the estimates provided by the users were used to determine any benefits of having multiple independent estimates, for example by allowing multiple tool users to asynchronously estimate the volumes of the same food items.

A. PARTICIPANTS
Participants were recruited at the University of Newcastle, Australia until 60 volunteers (22 female) completed a series of volume estimation tasks. The participants (n=60) had a mean age of 25.8 years (SD=7.1). All participants either owned or regularly used a touch-based device (for example iPhone, iPad or Samsung Galaxy Phone/Tablet). The study was approved by the University of Newcastle Human Research Ethics Committee (HREC #H-2018-0271).
Algorithm 1 Twinning 3D cuboid to reference object
Input: (1) 2D screen positions of the fiducial card's four corner tags; (2) Euler angles of the 3D cuboid; (3) increment value, set to 0.01.
Output: 3D orientation of the 3D cuboid twinned to the orientation of the 2D fiducial card in the image.
Initialize: (1) Boolean variables (currentOrientation[.x, .y, .z]) for increments across the x, y and z Euler angles, set to false; (2) distance variables (currentDistance, previousDistance) between the fiducial corners and the 2D screen projection of the 3D cuboid.
Our approach has been implemented as an iOS app and deployed on a 9.7" iPad. The iPad touch screen provided the interface for the manual movement of on-screen objects, for example the wire mesh objects. Standard two-finger pinch and zoom gestures were used to size and scale the wire mesh objects. The app also had a number of on-screen buttons to support the different phases of the approach (see Fig. 3). The user can also double tap the screen to hide/show the on-screen buttons as needed.
non-food objects and real food items in 2D images. During each training trial, after a first attempt, the participants were provided with verbal feedback on the correct volume of the target object and allowed a second try. This allowed the participants to practise with the iPad tool and gain experience with using it on 2D images. The training objects were presented as progressively more complex objects, namely styrofoam sphere, styrofoam bowl, styrofoam dome and two plastic food containers, with the latter two trials requiring both re-sizing and deformation of the virtual object (see Fig. 5). Styrofoam objects were used as their ground truth volumes were available.
After the training trials, 16 test trials were conducted with no feedback given to the participants (see Fig. 6). Participants could self-select the wire mesh shape to use from a choice of sphere, dome, bowl and cube, where each default wire mesh was both resizable and deformable in all three dimensions (X, Y and Z). The ground truth volumes of the target objects were calculated by direct measurement, i.e. for the styrofoam and plastic box items, or, for the food items, via water displacement [4], [22] with a measuring jug, measured in mL.

D. PROCEDURE
Sessions, each comprising training and testing, were held individually and had an average duration of 16 minutes 43 seconds (SD 21 minutes 2 seconds). Firstly, the research team collected signed consent forms. Secondly, the participants were given a short tutorial on the use of the new volume estimation technique as implemented on the iPad. The participants then completed the 26 volume estimation trials. Each trial had a target object highlighted with a red or white dot (white dots were used when the target food item was colored red or orange). Only one volume estimate was completed during each trial. Some images had multiple food objects and these were duplicated across the trials, with the red or white dot indicating which object was the focus of the current trial.
Finally, participants completed a demographic questionnaire on gender, age, university degree or discipline area they were studying and use of touch-based devices. All participants received a $5 coffee voucher in appreciation of their participation time. Participants on eligible university courses were also given course credit as part of an assessment on research awareness activities.

E. ANALYSIS
In evaluating the approach, we were interested in two primary measures, namely the deviation of each volume estimate from the target object's ground truth (GT) volume and the reliability of the volume judgements. The deviation was calculated by taking the absolute value (ABS) of the percentage deviation of each volume estimate from GT, giving an absolute percentage deviation (APD):
$$\mathrm{APD} = \mathrm{ABS}\left(\frac{\mathrm{estimate} - \mathrm{GT}}{\mathrm{GT}}\right) \times 100$$
Food volume estimation judgements are typically not done only once. Given the wide variation of human-based estimates [8], [23], it is common to obtain multiple estimates of food portion sizes and, implicitly, the volume of food items. As the current approach is a mixed method with automated and manual components, there is likely to be variation in the manual contributions. The current implementation also fits well into an overall system where crowd sourcing [23] could be used to aid scalability. Thus, food items for volume estimation could be distributed to multiple app users and the results averaged in an attempt to smooth out any errors from manual components.
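As a small worked illustration of the APD measure, the sketch below computes the deviation for one overestimate and one underestimate. The function name and the example volumes are hypothetical and are not data from the study.

```python
def absolute_percentage_deviation(estimate_ml: float, ground_truth_ml: float) -> float:
    """Absolute percentage deviation (APD) of a volume estimate from ground truth."""
    if ground_truth_ml <= 0:
        raise ValueError("ground truth volume must be positive")
    return 100.0 * abs(estimate_ml - ground_truth_ml) / ground_truth_ml


# Hypothetical example: a 100 mL item estimated as 120 mL and as 80 mL.
over = absolute_percentage_deviation(120.0, 100.0)
under = absolute_percentage_deviation(80.0, 100.0)
```

Both estimates receive the same APD, since the absolute value treats over- and underestimation symmetrically. This also means that two estimates with equal APD can cancel when averaged, which is relevant when independent estimates are combined.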
With a large number of food items to obtain estimates of, getting two, three or more estimates impacts further analysis of the food items, i.e. determining density and nutritional value. Thus, there are two sub-questions. Firstly, are there improvements in the volume estimates if more than one independent estimate is obtained (i.e. via crowd sourcing)? Secondly, are there particular food types or shapes for which averaging multiple independent estimates makes these improvements more prominent? Answering both questions will highlight any return on investment from requesting multiple estimates and from the waiting time for multiple estimates to return.
In our study, we collected 60 estimates across 26 food items. Combinations of estimates were calculated by the number of ways to choose a sample of r unordered outcomes from a set of n possibilities:
$$C(n, r) = \frac{n!}{r!\,(n-r)!}$$
We have explored the relations between single estimates (C(60, 1) = 60), all pairings (C(60, 2) = 1770) and all triples (C(60, 3) = 34,220) for all 26 food item estimates. Code to generate the combinations and calculate each averaged volume estimate was written in MATLAB (version 9.7, The MathWorks Inc.).
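The combination-and-average step can be sketched as follows. The study's own code was written in MATLAB; this Python sketch is only an illustration of the same idea, and the function name and the four example estimates are hypothetical.

```python
from itertools import combinations
from math import comb
from statistics import mean


def combined_apd(estimates_ml, ground_truth_ml, r):
    """APD of the averaged estimate for every unordered group of r estimates."""
    return [
        100.0 * abs(mean(group) - ground_truth_ml) / ground_truth_ml
        for group in combinations(estimates_ml, r)
    ]


# Group counts for 60 independent estimates of one item.
n_pairs = comb(60, 2)
n_triples = comb(60, 3)

# Hypothetical estimates (mL) of a 100 mL item from four users.
estimates = [80.0, 120.0, 110.0, 95.0]
single_apd = combined_apd(estimates, 100.0, 1)
paired_apd = combined_apd(estimates, 100.0, 2)
```

In this hypothetical example the mean APD of the paired averages is lower than that of the single estimates, because over- and underestimates partly cancel when averaged.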

A. TRAINING: NON-FOOD ITEMS
The results of the training estimates are shown in Figs 7 and 8. When comparing the single estimates (n=60), it can be seen that the percentage error was reduced in the second training trial for all volume estimates. This is to be expected as the participants were given feedback on the true volume after their first attempt. Nevertheless, the improved second estimates indicate that the participants were competent in the use of the iPad app for generating accurate volume estimates. This is particularly evident in the box objects, which required deformation of the virtual wire mesh box, where there is consistent improvement between the first and second estimates across both examples. This suggests that the training was sufficient before the testing trials.
In terms of combining the estimates, both the single-to-paired and the paired-to-triple comparisons show improved accuracy and smaller error bars. This improvement in accuracy was evident across all the testing trials.
Fig. 9 shows the mean absolute percentage error from ground truth volume for the non-food objects under the testing conditions. Across this object set with single estimates, the worst error was 20.4% (Sphere) and the best was 10.7% (Bowl). For the multiple estimates, the mean improvement was minimal, but with each increase in the number of estimates combined, the standard deviation and error range were reduced.

C. TESTING: SINGLE FOOD ITEMS
Fig. 10 shows the mean absolute percentage error from ground truth volume for food items where there was only one food item per image, under the testing conditions. Across this object set with single estimates, the worst error was 56.1% (Raw chicken on plate) and the best was 34.2% (Soup), i.e. the more regular shaped food item gained more accurate estimates. For the multiple estimates, the means trended smaller, with the best results in the triple estimate averages. Also, the standard deviation and error range were reduced for each food item.
Fig. 11 shows the mean absolute percentage error from ground truth volume for multiple food objects on one plate. Across this object set with single estimates, the worst error was 32.6% (Carrots) and the best was 23.7% (Cabbage). The error for the multiple estimates showed a reducing trend as the number of estimates increased. Also, the standard deviation and error range were reduced for each food item.
Fig. 12 shows the mean absolute percentage error from ground truth volume for multiple food objects served discretely. Across this object set with single estimates, the worst error was 55.7% (Rice on plate) and the best was 20.3% (Apple), which was expected given the more regular shape of the apple. The error for the multiple estimates showed a reducing trend with an increasing number of estimates for the irregular food items, namely the "Rice on plate" and "Chicken stir fry on plate", but less so for the "Apple" food item. The standard deviation and error range were reduced for each food item.
VOLUME 1, 2022
FIGURE 12. Results of testing trials with multiple food objects served discretely. Each food item was considered individually.

VI. DISCUSSION
The results of the current study present a method where the average error rates were in the 10-20% range for regular-shaped objects and the 10-60% range for irregular food items. Although it is difficult to directly compare this with other studies given the different food item images and computational approaches that have been used previously, the results here are promising, particularly when multiple volume estimates from independent users are averaged. Also, our approach requires no prior knowledge of estimation shapes. It provides a number of common but representative proxy 3D objects, and only requires limited preparation in image capture. Further, participants used the tool surprisingly well despite having no formal training with it and coming from a range of discipline backgrounds. A previous systematic review [15] identified that, in many image-based studies, dietitians, who are highly trained in dietary assessment and portion size estimation, are not typically used to capture portion images. Thus, the results of the current study provide promising insight for potentially broader use [10] and potential for realistic real-world usage.
In terms of the potential for crowd sourcing multiple estimates to improve the reliability of the manual component of the approach, the user study provides evidence that both the median and error rates can be reduced with cross averaged pairings and triples. The generation of these combined estimate averages is trivial and the only real overhead is the generation of multiple estimates. However, with an asynchronous approach, these estimates can be collected concurrently and in the context of our larger system, distributed from a CMS to individual users on iPads. One question would be the return on investment, in time, in gaining the multiple estimates. From this study, we have seen that irregular food items gain more benefit from multiple independent estimates.
One contributing factor to this is that irregular food items may have variable gaps, for example between two pieces of carrot, and this adds to estimation errors. The multiple independent estimates can smooth out this gap error as the wire mesh placements under- or overestimate the true volume.
For more regular shaped food items, paired estimates are sufficient. For example, in Fig. 12, the irregular "Rice on plate" and "Chicken stir fry on plate" items show improved accuracy and reduced error bars from paired to triple estimates. However, for the "Apple" item it would be sufficient to only have paired estimates with the sphere wire mesh shape. A similar pattern can be seen in the "Soup" food item in Fig. 10, where there is a good match between the regular shaped food item, i.e. soup in a bowl, and the bowl wire mesh shape.
An important aspect of the present work is that it was designed as a practical and computationally light approach that runs on a standard iPad in real-time. Evaluated with a large user base (n=60), it exhibits high ecological fidelity [16]. The study by Fang et al. [7] did not focus on processing times and the required level of computation, which limits comparability to the present study. In Beltran et al.'s study [2], the use of engineers who helped to develop the procedure influenced the results: Beltran et al. note that the dietitians had not mastered the manipulation of the wire mesh to closely conform to the outer boundary of the food image, whereas the engineers who helped create the wire imaging system had.
The current model would also be appropriate for use with other active capture systems in image-based dietary assessment. Active capture methods involve the user in collecting images, for example taking the image and placing the fiducial marker in the image frame, which allows for more rapid image processing. However, it is acknowledged that this portion size estimation tool is only one aspect of the dietary assessment process, and accurate real world nutrient estimation would also rely on accurate food identification and up to date nutrient database information [15].
The results of the current study show that the system can be used by users with minimal training. The training used in this study comprised only a small number of trials and transitioned well from fixed known volume objects to real foods. Interestingly, some of the best estimates were found for apples, and this likely reflects the training that was conducted using spheres, as these shapes have a high resemblance. Given this finding, adding training objects that resemble real foods might be good practice.
There was not much variation in the estimation error across the vegetables that were tested. This was expected as the vegetables were all chopped and assembled in a similar way. Raw chicken on the plate was associated with larger estimation errors than the other real food items. Mixed dishes or items served on a shared plate present a complexity in dietary assessment as there are additional factors to consider in addition to image depth, such as the size of the overall serving vessel. This issue was highlighted in a recent review where dietary assessment from shared plates was considered [3].

VII. CONCLUSION
In the analysis of dietary intake, images of food items play a critical part in data collection. Determining the volume of these food items is necessary for accurate nutritional analysis. This paper has described a new semi-automated method combining 2D image capture and automated 3D scene projection with manual placement and resizing of wire mesh objects to generate volume estimates of food items. Through a combination of automated and manual activities, the volumes of food items in 2D images can be generated in a process that is computationally light and runs in real-time. We have also considered the use of multiple food volume estimates and shown that accuracy can be increased by combining multiple independent estimates.
However, the approach presented is not without limitations. There can be errors in both the automated and manual elements and, as a pipeline approach, any inaccuracies accumulate and increase the final error [20]. Thus, reducing any error is desirable. Also, with the increasing availability of wearable sensors (see [19]), there are opportunities for hybrid approaches combining direct (based on physical properties of food) and indirect (based on food intake activity) measures to reduce overall error. Future work will focus on improving the accuracy of the automated aspects of the approach, for example the accurate identification of corner markers that is critical to accurate 3D projection, and determining the impact of increased training [12] for improvements in the manual aspects, i.e. the fitting and sizing of the wire mesh objects.
SHAMUS P. SMITH received B.Sc., B.Sc. (Honours) and Ph.D degrees in computer science from Massey University, New Zealand, in 1992, 1993 and 1999. He is an Associate Professor in Computing and IT at The University of Newcastle and has research interests in virtual reality, human-computer interaction and technology enhanced learning. He has published over 100 research articles. He is a Senior Member of the IEEE.
MARC T. P. ADAM received his undergraduate degree in computer science from the University of Applied Sciences Würzburg, Germany, and his Ph.D. degree in information systems from the Karlsruhe Institute of Technology, Germany. He is an Associate Professor in Computing and IT at The University of Newcastle. His research interests focus on the interplay of user cognition and affect in human-computer interaction. He is a founding member of the Society for NeuroIS.
GRACE MANNING is an Accredited Practicing Dietitian, Graphic Designer and Research Assistant for the Priority Research Centre for Physical Activity and Nutrition, School of Health Sciences, University of Newcastle. Her areas of interest are communication and marketing and environmental sustainability. Her work spans a range of nutrition areas including aged care, nutrition in pregnancy, cooking, disability, nutrition education and technology.
TRACY BURROWS is a Professor in nutrition and dietetics at The University of Newcastle, and also a Researcher at the Hunter Medical Research Institute. She is currently a National Health and Medical Research Fellow. Her research areas focus on dietary assessment, eating behaviours, in addition to the management of overweight, obesity, and addictive eating.
CLARE COLLINS is Laureate Professor in nutrition and dietetics and Director of Research in the School of Health Sciences at The University of Newcastle. Her research focus is on personalised food and nutrition eHealth programs, tools and evaluating impact on eating patterns and diet-related health across key life stages and chronic disease conditions.
MEGAN E. ROLLO received the BAppSci, BHlthSci(Nutr&Diet), and Ph.D. degrees from the Queensland University of Technology, Australia. She is currently a Research Fellow in nutrition and dietetics with the School of Health Sciences and Priority Research Centre for Physical Activity and Nutrition, The University of Newcastle, Australia. She has research interests in technology-assisted dietary assessment and personalized behavioral nutrition interventions.