A Systematic Review of Sensor-Based Methodologies for Food Portion Size Estimation

Food portion size estimation (FPSE) is critical in dietary assessment and energy intake estimation. Traditional methods such as visual estimation are increasingly being replaced by faster, more accurate sensor-based methods. This article presents a comprehensive review of sensor-based methodologies for portion size estimation. The review was conducted following the PRISMA guidelines, and the full texts of 67 scientific articles were reviewed. The contributions of this article are three-fold: i) a taxonomy for sensor-based (SB) FPSE methods was identified, classifying the sensors (as wearable, portable, and stationary) and the methodology (as direct and indirect); ii) a novel comprehensive review of the state-of-the-art SB-FPSE methods was conducted and 5 sensor modalities (Acoustic, Strain, Imaging, Weighing, and Motion sensors) were identified; iii) the accuracy of portion size estimation and the applicability of these SB-FPSE methods to free-living conditions were assessed. The article concludes with a discussion of challenges and future trends of SB-FPSE.


I. INTRODUCTION
Dietary intake assessment has been one of the profound areas of scientific research, particularly in assessing the impact of diet on the human body and an individual's state of well-being [1], [2]. The World Health Organization (WHO) underlines the importance of a healthy diet and contends that unhealthy eating habits, along with a lack of physical exercise, manifest as global health risks [3]. The desirable diet varies from one individual to another. Determining dietary guidelines is not only pertinent on an individual basis but also has global implications [4]. It is essential to identify both the dietary restrictions that should be followed to lead a healthy life and the additions to the diet needed for a balanced energy intake.
Energy density and portion size have been identified as properties of foods that can modulate energy intake [5]. Energy density refers to the amount of energy per unit mass of food (kcal/g). Portion sizes can be expressed as mass (g), volume (ml), household measures (e.g. tablespoons), hand measures (e.g. a fist), or as measures relative to the size of a reference object (e.g. "tennis ball" [6]). A large portion size increases energy intake in individuals [5], [7], [8]. This is known as the 'Portion Size Effect' [9]. Dietary interventions such as weight control rely on an individual's ability to choose appropriate portion sizes and that person's awareness about the amount of food being consumed [10]. Therefore, it is critical to develop accurate methods for food portion size estimation (FPSE).
FPSE methods can be broadly classified into traditional FPSE (T-FPSE) and SB-FPSE. In T-FPSE, portion sizes of the food consumed are estimated by directly measuring the quantity of food (mass/volume) using household measures or by visual approximation. Household measures such as calibrated measuring cups/jars or spoons can be used to measure the quantity of food directly. This can be quite cumbersome, and it is not possible to directly measure the food quantity for foods consumed outside of the household. Visual guides are frequently used to estimate portion sizes. Hands are often used as a visual guide by healthcare professionals and the general public to estimate portion size, as illustrated in Fig. 1. For example, a fist, a thumb fingertip, and an index fingertip are used to estimate one cup, one tablespoon, and one teaspoon, respectively. The accuracy of such estimates is very low. Visual estimates are used in two popular reporting methods: 24-hour recall [11]-[14] and food diaries [15].
There are numerous drawbacks to T-FPSE and the related reporting methods. Studies have shown that individuals have difficulty assessing food portion sizes [6], [11], [12] and are confused by inconsistently used measurement units and terminologies [13]. The accuracy of T-FPSE varies in reporting methods that involve conceptualization and memory [14], [15]. Traditional methods suffer from a flat-slope phenomenon, in which large portions tend to be underestimated and small ones overestimated [14], [16], [17]. Amorphous foods (e.g. mashed potatoes) and those eaten in small portions are reported less accurately than other types of foods [10], [16], [18]. T-FPSE-based reporting methods are also cumbersome and prone to errors of memory, particularly in recall methods, and may themselves alter intake behavior. The academic community has since extensively explored the use of technological aids such as sensors for FPSE.
Sensor-based technology is gaining popularity in dietary assessment. Sensors have been used to address a variety of problems, such as automatic food intake detection (utilizing piezoelectric sensors [19], capacitive sensors [20], accelerometers [21], and others), food type recognition, energy intake estimation [22], and food portion size estimation. Several existing reviews have covered various uses of sensors in food intake monitoring [23], [24], but at present there are no reviews focused on sensor-based portion size estimation.
The present review is intended to provide a systematic, comprehensive evaluation of state-of-the-art sensor-based methods for FPSE. To cover the full range of approaches, research publications were thoroughly studied, and a total of 121 papers (without duplicates) were selected for full-text review. Following the application of inclusion and exclusion criteria, 67 papers were included in the review.
The contributions of this review are three-fold: i) a taxonomy for SB-FPSE methods was identified; ii) a novel comprehensive review of the state-of-the-art SB-FPSE methods was conducted; iii) the accuracy of portion size estimation and the applicability of these SB-FPSE methods to free-living conditions were assessed. Fig. 2 presents a flowchart of the inclusion and exclusion process for the systematic review examining the evidence for SB-FPSE.
The paper is organized as follows. First, the methodology of the systematic review is presented in Section II along with the specification of research questions (RQs). The review findings, which include taxonomy, are discussed in Section III. Section III also presents a detailed description of the articles included in the review categorized by the type of sensor modality. Sections IV, V, and VI are discussion, open problems/future directions, and conclusion, respectively.

II. REVIEW METHODOLOGY
Preferred Reporting Items for Systematic Review and Meta-Analyses (PRISMA) [25] was used as a guide for this review. The authors independently screened the titles and abstracts of the publications retrieved through the database search and then carried out a full-text review of all relevant studies. This methodology used the following processes:

A. Identifying Research Questions
Three research questions were chosen to guide this systematic review: RQ1) What are the available state-of-the-art SB-FPSE methodologies?
The answer to this question helps in identifying the sensor modalities that have been used for portion size estimation.
RQ2) What methods are employed for portion size estimation from sensor data and how accurate are these methods?
The answer to this question enables the authors to assess and compare the validity of various sensor-based methods.
RQ3) Which sensor modalities are more suitable for use in the free-living conditions?
The answer to this question helps in determining the applicability of the sensor-based methods to studies outside of the lab and identifying the research gaps in current methodologies.

B. Databases
Exhaustive electronic searches for relevant literature were performed across five repositories: PubMed, ScienceDirect, Scopus, the ACM Digital Library, and IEEE Xplore, from inception through May 21, 2020.

C. Search Strategy
To cover all existing approaches for portion size estimation, the following keywords were considered: 'sensor', 'technology', 'device', 'food portion size', 'food mass estimation', and 'food volume'.
The query string used for the search was: ("Sensor" OR "Device" OR "Technology") AND ("Food Portion Size" OR "Food Volume" OR "Food Mass Estimation"). The search results were restricted to the English language. References from the selected primary full-text articles were further analyzed for relevant publications. The selection was further narrowed by applying the eligibility criteria described in Table I.
The manual bibliographic search identified articles through sources outside the mentioned databases such as reference lists of articles. Cited references or citing references of articles included through the database search were examined, and suitable ones were added to the survey.
Articles that fulfilled the inclusion criteria were considered in this review and those that fulfilled the exclusion criteria were filtered out.

D. Results
Initial electronic database searches resulted in a total of 2368 publications. A manual bibliographic search identified 6 additional publications that qualified for inclusion. After duplicate removal, 2215 articles remained for screening based on titles and abstracts; of these, 121 articles were selected for full-text review. Following the exclusion process, 67 articles were included in the systematic review based on the eligibility criteria. Fig. 2 illustrates the methodology and results of the review process.

III. REVIEW FINDINGS

A. Taxonomy
First, the taxonomy for SB-FPSE was identified to address RQ1. A potential classification can be done based on the type of instrumented devices: Wearable, Portable, and Stationary devices. Wearable devices, as the name implies, are integrated into wearable objects or directly with the body [26]. Portable devices are those that may be carried around, such as a smartphone. Stationary sensors are those sensors that are embedded into the environment and stationary objects. Fig. 3 describes a taxonomy that was established from the findings of this review.
Another classification scheme is based on the measurand: Direct and Indirect methods of FPSE. Direct methods estimate portion size by measuring the properties of food such as food mass or food volume directly. Indirect methods are those that use behavioral and physiological manifestations of eating such as hand gestures, bites, chews, and swallows to approximate the ingested amount.

B. Sensor Modalities
The review identified 5 sensor modalities that were primarily used for FPSE. They were: Acoustic, Strain, Imaging, Weighing, and Motion sensors. One immediate observation is that the subset of sensors used for FPSE is much narrower than the sensors used, for example, for the detection of food intake. Fig. 4 provides categories of the sensors included in this review.
Acoustic, strain, and motion sensors were involved in the indirect measurement of portion size and were based on food intake activities such as chewing, swallowing, hand gestures, and head movement. All the sensors used in indirect FPSE were found to be wearable sensors.
Weighing and imaging sensors were used for more direct methods of FPSE that were based on the physical properties of the food item such as mass and volume. Imaging sensors were either wearable, portable, or stationary while weighing sensors were either stationary or portable. Each modality is described in detail below.

C. Acoustic Sensors
Acoustic wearable sensors have previously been used to monitor food intake [27]. The authors of [27] suggested two possible locations for an acoustic wearable sensor (Fig. 5), each associated with a physiological activity: chewing and swallowing. A few recent studies have employed these sensors in FPSE. All of these studies used methods that can be grouped under indirect measurement of portion size.
One study [28] presented an approach to predict the weight of individual bites taken using an ear-pad chewing acoustic sensor. Chewing sounds were recorded using a miniature microphone (Knowles, TM-24546) embedded in a custom earpad device. A pattern recognition algorithm was developed to identify chewing cycles and to detect the food type. The bite weights were predicted based on this information.
The study included 8 participants and 504 habitual bites. The acoustic sensor system was used to predict bite weight for three types of foods (potato chips, lettuce, and apple) using linear regression models, with prediction errors in bite weight ranging from 19.4% to 31%. The authors claimed that bite weight prediction using acoustic chewing recordings is a feasible approach for solid foods and should be investigated further.
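The regression step described above can be sketched with a simple per-food linear model mapping a chewing feature to bite weight. This is a minimal illustration only; the feature choice (chewing cycles per bite) and all numbers are hypothetical assumptions, not data from [28].

```python
# Minimal sketch: per-food linear regression predicting bite weight (g)
# from a chewing feature. All training values are illustrative.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Hypothetical training data: chewing cycles per bite -> bite weight (g)
chews = [8, 10, 12, 15, 18]
weights = [4.0, 5.1, 6.0, 7.4, 9.0]

a, b = fit_linear(chews, weights)
predicted = a * 14 + b   # predicted weight of a 14-chew bite
```

In practice, such a model would be fitted per food type after the chewing cycles have been segmented from the acoustic signal.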
Researchers have claimed that the information extracted from the temporal sequence of chews and swallows can be used to estimate the mass of food consumed [29]. The counts of chews and swallows, along with other derived metrics, were used to build prediction models for detection of food intake, differentiation between liquids and solids, and estimation of the mass of ingested food. The proposed models were able to predict the mass of ingested food with an accuracy of 91% for a limited number of solids and 83% for liquids. The insight obtained from that preliminary study was used to design a new study [30], which computed individually calibrated parameters for each subject based on the total number of chews and swallows observed in the training meals. The average mass per chew of solid foods, the average mass per swallow of solid foods, and the average mass per swallow of liquids were computed and used to create models predicting the mass of consumed solids and liquids. Counts of chews and swallows were used to estimate the mass of individual food items with a mean absolute percentage error of 32.2% ± 24.8%.
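The individual-calibration idea can be sketched as follows: a subject-specific mass-per-chew parameter is derived from training meals and then applied to a new meal. The numbers below are illustrative assumptions, not values from [30].

```python
# Sketch of an individually-calibrated chew-count (CCS-style) model:
# calibrate grams-per-chew from training meals, then predict a new meal.

def calibrate_mass_per_chew(training_meals):
    """training_meals: list of (chew_count, meal_mass_g) pairs."""
    total_chews = sum(c for c, _ in training_meals)
    total_mass = sum(m for _, m in training_meals)
    return total_mass / total_chews

def predict_mass(chew_count, mass_per_chew):
    return chew_count * mass_per_chew

# Hypothetical training meals for one subject: (chews, mass in g)
training = [(250, 320.0), (300, 410.0), (220, 290.0)]
mpc = calibrate_mass_per_chew(training)   # subject-specific g per chew
estimate = predict_mass(280, mpc)         # predicted mass of a new meal
```

Analogous per-subject parameters (mass per swallow for solids and liquids) would be calibrated and applied in the same way.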
The authors in [31] proposed the use of multi-modal wearable sensors, such as an in-ear acoustic sensor plus head and wrist motion sensors. The authors stated that knowing the food type in combination with sensor data was critical for estimating the food amount consumed. A customized earbud and a pocket audio recorder were used as internal and external microphones. The external microphone was used to remove external noise while the subtle eating signal was captured mainly on the internal microphone. After noise cancellation, each intake window was split into frames. The frame length was chosen to capture a full chew without including multiple chews in a single frame. For each frame, the following features were computed: energy, spectral flux, zero-crossing rate, and 11 Mel-frequency cepstral coefficients (MFCCs). Then, the mean and standard deviation of the frame features were stored as a feature vector for the whole intake window. A random forest regression with 40 trees was used that included all features (audio, motion, and annotation). Data was collected with people wearing acoustic and motion sensors, with ground truth annotated from video and continuous scale data. The error in weight estimation was 35.4%.
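Two of the per-frame audio features named above (short-time energy and zero-crossing rate) and the mean/standard-deviation pooling over an intake window can be sketched as below; MFCCs and spectral flux are omitted for brevity, and the toy signal is an assumption.

```python
# Sketch: per-frame short-time energy and zero-crossing rate, pooled
# (mean and standard deviation) over an intake window.

import math

def frame_features(frame):
    energy = sum(s * s for s in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)
    return energy, zcr

def window_feature_vector(frames):
    feats = [frame_features(f) for f in frames]
    vec = []
    for i in range(2):   # pool each feature over all frames in the window
        vals = [f[i] for f in feats]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        vec += [mean, std]
    return vec

# Toy signal: two short frames standing in for chewing audio
frames = [[0.1, -0.2, 0.3, -0.1], [0.05, 0.04, -0.06, 0.02]]
vec = window_feature_vector(frames)   # [mean_E, std_E, mean_ZCR, std_ZCR]
```

The resulting window-level vector is the kind of input a random forest regressor would consume alongside motion and food-type features.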

D. Strain Sensors
Wearable strain sensors have been used for indirect FPSE by sensing chewing. The authors in [30] used a strain sensor along with an acoustic sensor to demonstrate the suitability of estimating portion size using individually-calibrated models based on Counts of Chews and Swallows (CCS models). The study was conducted under laboratory settings. Chew counts and swallow counts were estimated using sensor signals and video recordings. CCS models were compared to diet diaries and it was found that CCS models presented the lowest reporting bias and a lower error as compared to diet diaries. They achieved a mean absolute percentage error of 32.2% ± 24.8% while estimating portion sizes.
The same research group conducted another study using only the strain sensor [32]. In both studies, a piezoelectric strain sensor (LDT0-028K, Meas-Spec Inc.) was attached to the skin immediately below the earlobe using a medical adhesive (Hollister 7730). The strain sensor (Fig. 7) allowed the monitoring of jaw motion during chewing. Subject-independent models were derived from bite, chew, and swallow features obtained from either video observation or information extracted from the sensor. 30 participants each consumed 4 different meals in a laboratory setting while being videotaped. Research staff kept a record of the mass of each bite by recording the food weight before and after each bite using a commercially available scale with 1 g precision. With multiple regression analysis, a forward selection procedure was used to choose the best feature set. The best estimate of meal mass had an absolute percentage error of 25.2% ± 18.9% (mean ± standard deviation) and estimation errors of −17.7 ± 226.9 g (mean ± standard deviation).
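The forward selection idea can be illustrated with a greatly simplified sketch: candidate features are greedily ranked by how well a single-variable linear fit explains meal mass. The study used full multiple regression; scoring each feature independently, and all data values, are simplifying assumptions made here for illustration.

```python
# Simplified forward-selection sketch: greedily pick the feature whose
# one-variable linear fit leaves the smallest sum of squared errors.

def sse_of_fit(xs, ys):
    """Residual sum of squares of the best line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) or 1e-12
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def forward_select(features, ys, k):
    """features: dict name -> value list; returns k names, best first."""
    chosen, remaining = [], dict(features)
    for _ in range(k):
        best = min(remaining, key=lambda name: sse_of_fit(remaining[name], ys))
        chosen.append(best)
        del remaining[best]
    return chosen

# Hypothetical per-meal features vs. meal mass (g)
features = {"chews": [250, 340, 290, 430], "noise": [5, 1, 9, 2]}
mass = [300.0, 400.0, 350.0, 500.0]
order = forward_select(features, mass, 2)   # chew count should rank first
```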

E. Imaging Sensors
Imaging sensor-based methods for portion size can be grouped under direct FPSE and may fall under wearable, portable or stationary categories. Also varying are types of imaging devices, number of viewpoints, use of fiducial markers, passive/active capture of images (whether the image is taken by the user or automatically), and others. We have summarized the studies included under imaging sensors in Tables II, III, and IV (see Appendix A) which include portable, stationary, and wearable imaging sensors, respectively.
The types of devices used for image capture include video cameras, handicams, smartphones, DSLRs, and custom-made cameras. The type of camera (imaging device) and the image resolution are very important in food photography [33], [34]. A clear food image provides a better input source to both humans and computers alike.
The number of viewpoints may be classified as either single-view or multi-view. The review identified that there were essentially two categories in imaging namely single-view color/color-depth imaging and multi-view color imaging. Color imaging involves recording color information of the scene of interest using imaging photosensors. Color filters are used along with the photosensors to filter the light by wavelength range, such that the separate filtered intensities include information about the color of light. The recorded information is then used to reproduce the original colors by mixing various proportions of the captured color channels (example: Red-Green-Blue (RGB)). Depth sensing captures the depth information as an additional stream of information.
The fiducial markers may be used as a dimensional and color reference. Fiducial markers are used as a point of reference or measure and are placed in the field of view of the imaging system. Volume estimation using imaging needs a dimensional reference whose volume or dimensions are known. This dimensional reference may include: custom made backgrounds [35], circular references (plates, bowls, or food containers) [36]; colored checkerboard [37]; business cards [38], [39]; the international food unit (IFU) [40], [41]; laser mesh or pattern [42], [43]. The type of visual cue used determines the complexity of the setup. Some methods require the user to carry around the reference (checkerboards, blocks, or cards).
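The dimensional-reference idea common to these methods reduces to a pixel-to-centimeter scale factor derived from an object of known size. The sketch below assumes a plate of known diameter as the reference; the values are illustrative.

```python
# Sketch: convert pixel measurements to real-world size using a fiducial
# reference of known physical size (here, a plate rim). Values illustrative.

def pixels_per_cm(ref_pixels, ref_cm):
    """Scale factor from a reference object of known size."""
    return ref_pixels / ref_cm

def food_width_cm(food_pixels, scale):
    return food_pixels / scale

scale = pixels_per_cm(520.0, 26.0)    # a 26 cm plate spans 520 px
width = food_width_cm(180.0, scale)   # a food item spanning 180 px
```

Real systems must additionally correct for perspective and lens distortion, which is why the choice and placement of the reference affects accuracy.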
The passive or active mode of capture defines the need for a user to take action during each eating episode, however small. "Passive" methods use wearable sensors to trigger the camera [44]-[47], requiring no action by the user. "Active" methods require the user to capture the food images during every eating episode, thus making the result dependent on the quality of self-report.
The number of viewpoints was determined as the most important criterion, and the review results were grouped either as single-view or multi-view methods.
1) Single-View Methods: The single-view color/color+depth approach is defined as any method where images of food are captured before, during, or after the eating episode by a single image sensor at a single time instance. For example, a user may use a smartphone to take an image of the serving size before a meal and an image of the leftovers after the meal.
Since there is only a single viewpoint, there is a need for a dimensional reference such as a fiducial marker or other visual cues in the image scene. The review identified that the use of such references was more prevalent in these methods. Several studies [48]-[50] have used fiducial markers that need to be carried around by users, which often induces user burden. Researchers have also evaluated other dimensional references that are naturally found in the image scene and do not require additional fiducial markers. One example is the use of cutlery or chopsticks as a dimensional reference [51], [52]. One study that involved portable imaging sensors used such a reference object, particularly participants' thumbs, as a reference for FPS estimates [53]. The authors used the known size of the thumb and estimated the portion size by comparing the food and thumb in the images. Other popular reference objects are food containers such as plates or bowls. The method reported in [54] used food shape models with plates and bowls as references, achieving error rates of 3.7%, 19.65%, and 15.36% for a wafer box, chips, and cheese, respectively, while measuring volume. This method assumes the presence of a plate that is easy to identify; however, many foods are eaten without plates. The authors in [55] used templates of food shapes. In this approach, food images are segmented and assigned a food name/code. A pre-defined system chooses a food template shape for each segmented food item. The authors reported an error of 11% for beverages and an 8% error for solids (bread slices) in volume estimation. A similar study [56] calculated the food location in the 3-dimensional coordinate system using the plate as a scale reference. After segmentation and further image processing, a shape was picked from a pre-constructed shape library. The authors reported an average error rate of 3.9% in food volume.
As opposed to the use of physical dimensional references, some researchers explored the use of virtual 3D wireframes/meshes. A virtual 3D wireframe is projected into the food image in a proportional relationship. Using this relationship, the known volume of the wireframe is utilized to compute the food volume. This method was robust in estimating the FPS under several different experimental conditions and gave identical results for different cameras [57]. The study involving a portable sensor [57] reported a relative error in volume estimation around 7% for regularly shaped food and around 19% for irregularly shaped food. The study involved 2 regularly shaped objects and 19 food replicas. Another research group that used a portable sensor reported an average error rate in volume of 10% in 224 trials, consisting of a cuboid and 7 NASCO food replicas [58]. Virtual meshes/wireframes have also been used in studies involving wearable sensors. One research group used a wearable sensor and reported an average error rate of −2.8 ± 20 % in the volume estimation of 50 samples of Western/Asian foods [59]. The authors in [60], [61] discussed the validity of the wireframe procedure using a wearable camera. However, wireframes can be erroneous for irregular shaped food and might be tedious due to necessary manual inputs. These studies involving wearable cameras performed well during portion size estimation which hints at the possibility of accurate FPS estimates using wireframes for the data captured in free-living. A different type of virtual input was proposed and evaluated by researchers using a portable imaging sensor [62]. The authors introduce an International Food Unit (IFU). The approach relies on a virtual cube to relate the food size with the cube size. They achieved error rates of less than 20% for large volume foods and around 40% for small volume foods while measuring volume.
Another popular approach in single-view methods is the use of geometric models. The authors in [38] reported that FPSE using geometric models such as cylinders, prisms, and spheres produced better results in terms of accuracy of estimation when compared to depth images captured by structured light scanners. Irregular shaped food items may be a challenge to the approach. In another study involving geometric models [51], the authors reported an average error rate of 6% in energy estimation. The accuracy of FPSE was further refined in a following study that made use of food co-occurrence patterns. Food portion co-occurrence pattern is a type of contextual information where patterns from food images are used to study the user's eating behavior [39]. Both of these studies [38], [51] employed portable imaging sensors.
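Once a food item's dimensions are recovered from the image, the geometric-model step amounts to evaluating the volume of a fitted primitive. The sketch below shows the cylinder, sphere, and rectangular-prism cases; the pancake example and its dimensions are illustrative assumptions.

```python
# Sketch: portion volume from geometric primitives fitted to estimated
# food dimensions (cm). 1 cm^3 corresponds to 1 ml.

import math

def cylinder_volume(diameter_cm, height_cm):
    return math.pi * (diameter_cm / 2) ** 2 * height_cm

def sphere_volume(diameter_cm):
    return (4 / 3) * math.pi * (diameter_cm / 2) ** 3

def prism_volume(length_cm, width_cm, height_cm):
    return length_cm * width_cm * height_cm

# A pancake approximated as a cylinder: 12 cm diameter, 1.5 cm tall
v = cylinder_volume(12.0, 1.5)
```

The accuracy of this step is bounded by how well the chosen primitive matches the actual food shape, which is why irregular foods remain a challenge.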
Recent developments in sensor technology and methodology have enabled researchers to evaluate the latest sensors, such as off-the-shelf 3D scanners [63]-[66], and the latest methodologies, such as deep learning [67]-[71], in FPSE. Another study used short-range depth cameras to determine FPS [72]. The authors achieved a low error rate of 7.5% in volume estimation and suggested that off-the-shelf 3D scanners could be a possible method for accurate FPSE. However, these scanners can be expensive and may be difficult to use routinely. A relatively recent method in single-view FPSE is the use of neural networks. Im2Calories, proposed in [73], makes use of a Convolutional Neural Network (CNN) [74]. This approach, using a portable sensor, was accurate in EI estimates and worked well for combinations of food items.
Stationary imaging sensors are fixed to a system that is not intended to move. The system is usually located in a laboratory. Single-view methods using stationary imaging sensors have been employed previously and the accuracy in portion size of these systems were found to be higher compared to other sensor types. One study estimated the volume of food using the intrinsic properties of the camera such as resolution and focal length [75]. The authors achieved an error rate of 13.3% in volume estimation. Another group used cubic spline interpolation [76] and achieved an error rate of 0.625% in volume estimation of six synthetic objects. A stationary depth sensor was used in another study [77], to achieve an error rate of 3% in volume estimation.
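Estimating size from camera intrinsics, as in [75], ultimately rests on the pinhole relationship: real size equals pixel extent times distance divided by focal length (with focal length expressed in pixels). The sketch below illustrates this; the numbers are assumptions, not values from the study.

```python
# Sketch of the pinhole-camera relationship used when recovering object
# size from camera intrinsics: W = w_px * Z / f_px.

def real_width_cm(width_px, distance_cm, focal_px):
    """width_px: object extent in pixels; focal_px: focal length in pixels."""
    return width_px * distance_cm / focal_px

# 300 px wide at 40 cm from a camera with f = 1200 px
w = real_width_cm(300.0, 40.0, 1200.0)
```

In a fixed laboratory setup the camera-to-surface distance is known, which is one reason stationary single-view systems achieve higher accuracy.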
2) Multi-View Methods: A multi-view method is any method where images of food are captured before, during or after the eating episode by either a single image sensor at multiple view angles or by multiple imaging sensors.
The review identified that multi-view methods have been less explored compared to single-view methods but are gaining popularity. The advantage of having more than one viewpoint enables the sensor system to mimic the human eye (binocular vision-based depth perception). The need for a dimensional reference is frequently eliminated as there are at least two different perspectives of the same image scene.
One of the popular multi-view approaches is stereo-imaging-based 3D reconstruction [78]-[81]. A study involving a portable sensor [78] explored the stereo reconstruction approach by considering real food datasets with more than one food item. The authors claimed that the stereo approach works well for FPSE but might have issues with textureless food items and non-uniform lighting conditions. The average error in volume was 5.75 (± 3.75)%.
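The depth recovery underlying these stereo methods can be summarized by the standard triangulation relation Z = f·B/d, for focal length f (in pixels), stereo baseline B, and disparity d (in pixels). The sketch below illustrates it with assumed values.

```python
# Sketch: stereo depth from disparity, Z = f * B / d. This is the core
# relation behind binocular 3D reconstruction; values are illustrative.

def depth_from_disparity(focal_px, baseline_cm, disparity_px):
    return focal_px * baseline_cm / disparity_px

# f = 1000 px, 6 cm baseline, 150 px disparity at the food surface
z = depth_from_disparity(1000.0, 6.0, 150.0)
```

Textureless foods break this relation in practice because disparity cannot be estimated reliably without matchable surface features, which explains the limitation noted above.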
Binocular vision-based methods have also been explored using wearable cameras. The authors in [82] used a wearable dual-lens camera system on 16 types of food objects with error rates of 3.01% in food dimensions and 4.48% in food volume. The study of [83] used a wearable stereo camera and achieved an error rate of 2.4% in volume estimation. Both approaches [82], [83] worked well for irregular shaped food items. However, two-view stereo wearable devices were not tested in free-living.
Some studies have explored the idea of using dimensional references, like in single-view methods. The reference object along with the food is photographed in two different perspectives, for example a top-view and a side-view. The dimensions (length, width, height) of the food item are calculated first, using the reference object, and then the volume is calculated. The references may include thumbs, hands or random objects [84]- [88].
Some multi-view methodologies use more than two viewpoints. These methods have been used less frequently in FPSE, as increasing the number of views increases the complexity of the sensor system. Shape from silhouettes is a popular 3D reconstruction approach [89], [90] that reconstructs the 3D shape of an object from its silhouette images. This concept has been applied in FPSE to generate 3D volumes of food items [91], [92] using a portable sensor. The camera intrinsic and extrinsic parameters are determined for every image, and each estimation process needs several images taken from different points of view. However, compared to other methods, it appears to be robust to segmentation noise. The average error in volume estimation was 9.5-10%. Furthermore, this approach does not require any prior shape information from food identification and may work for arbitrarily shaped food objects. This method might be cumbersome and more suited to laboratory settings; its applicability to free-living conditions is questionable.
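The shape-from-silhouettes idea can be caricatured as voxel carving: a voxel survives only if its projection falls inside every silhouette. The sketch below reduces each "silhouette" to an interval along one coordinate axis purely for illustration; real systems project voxels through calibrated camera models.

```python
# Toy voxel-carving sketch of shape from silhouettes: keep a voxel only
# if it lies inside every silhouette constraint. Silhouettes are reduced
# to 1D axis intervals here, purely for illustration.

def carve(voxels, silhouettes):
    """voxels: list of (x, y, z); silhouettes: list of (axis, lo, hi)."""
    return [v for v in voxels
            if all(lo <= v[axis] <= hi for axis, lo, hi in silhouettes)]

# A 4x4x4 voxel grid and two orthogonal "silhouettes"
voxels = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
hull = carve(voxels, [(0, 1, 2), (1, 0, 1)])   # x in [1,2], y in [0,1]
volume = len(hull)   # surviving voxel count approximates object volume
```

Adding views tightens the visual hull around the true shape, which is why multi-view silhouette methods trade setup complexity for robustness.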
Another study [93] discussed a technique that estimates the volume directly from three orthogonal images without explicit 3D reconstruction of the object's shape. The proposed approach makes no assumptions about the shape of the object, except that it is convex, axially symmetric, and axially aligned. They achieved an average FPSE error rate of 11.9% for 8 real foods consisting of vegetables, fruits, and 8 artificial geometric 3D objects.
A recent study involving a portable and a wearable imaging sensor [46] explored the idea of simultaneous localization and mapping (SLAM). Sparse maps from SLAM were generated and further processed using a multiple convex hull algorithm to acquire a mesh object. The authors achieved an overall error of 17% in volume estimation. This approach does not require any prior information on the food shape. Another approach, similar to SLAM, is the structure from motion (SFM) algorithm [94]. This approach produced accurate results for convex food items using a portable sensor. The algorithm first segmented the images into food items and then extracted the volume by applying SFM to 6 separate frames. The accuracy of volume estimation (ml) was 92% for real single food items. Another study [67] proposed a multi-view structured-light 3D reconstruction technique, in which a food model was synthesized using multiple depth maps obtained from multiple image reconstructions.
F. Motion Sensors

This review identified only one study that employed wearable motion sensors for indirect estimation of portion size. The study of [31] used a multi-sensor system (also discussed under acoustic sensors). The authors utilized features derived from acoustic sensors (zero-crossing rate, energy, spectral flux, etc.), motion sensors (zero-crossing rates, temporal shape features, etc.), and food type to estimate the mass of food intake. The mean absolute percentage error in food weight estimation was 35.4%.

G. Weighing Sensors
Weighing is a direct method of measuring the food quantity, in which the mass of the food item is measured. Weighing sensors consist of load cells that are embedded in devices such as a weigh scale. Recently, these sensors have been used in FPSE, motivated by the observation that people generally eat on a solid surface where such sensors can be positioned. There are two types of weighing sensors: stationary and portable. We have summarized the studies included under weighing sensors in Table V (see Appendix A).
Stationary weighing sensors are generally part of systems embedded in the eating surface, such as a table. One such system is a smart tablecloth (Fig. 9 (c)) equipped with a fine-grained pressure textile matrix and a weight-sensitive tablet that allowed the spotting and recognition of food intake related actions (such as cutting, scooping, stirring, etc.), the identification of the plate/container on which the action is executed, and the tracking of weight changes in the containers. In that study [97], the authors were able to determine how much food was consumed overall, reporting an error ratio of 16.62% (calculated as error root mean square to signal span) while measuring the food amount consumed.
Another research group employed a stationary weighing sensor [98] with a new algorithm that detects and measures the weight of individual bites consumed during unrestricted eating. They used an instrumented table (Fig. 9 (b)) with 4 trays placed over food scales. The algorithm works by identifying time periods when the scale weight is stable and then analyzing the surrounding weight changes. The series of preceding and succeeding weight changes was compared against patterns of single food bites, food mass bites, and drink bites to determine whether a scale interaction was due to a bite or some other activity. The study involved 271 participants and included a total of ∼25,000 bites with ground-truth labels annotated from video. The weight changes in the scale were used to determine bite portions.
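The stable-period idea can be sketched as follows: find plateaus where the scale signal varies less than a tolerance for a minimum duration, then treat weight drops between consecutive plateaus as candidate bite masses. The tolerance, minimum length, and synthetic trace are assumptions for illustration, not the published algorithm of [98].

```python
import numpy as np

def stable_plateaus(weights, tol_g=1.0, min_len=5):
    """Return (start, end, mean_weight) for runs where the signal varies < tol_g."""
    plateaus, start = [], 0
    for i in range(1, len(weights) + 1):
        if i == len(weights) or abs(weights[i] - weights[start]) > tol_g:
            if i - start >= min_len:
                plateaus.append((start, i, float(np.mean(weights[start:i]))))
            start = i
    return plateaus

def candidate_bites(weights, **kw):
    """Weight drops between consecutive stable plateaus -> candidate bite masses (g)."""
    p = stable_plateaus(weights, **kw)
    return [p[k][2] - p[k + 1][2] for k in range(len(p) - 1) if p[k][2] > p[k + 1][2]]

# Synthetic scale trace: 400 g plate, two bites of 15 g and 20 g with brief transients.
trace = [400.0] * 10 + [390, 385] + [385.0] * 10 + [370, 365] + [365.0] * 10
print(candidate_bites(np.array(trace)))  # → [15.0, 20.0]
```

A real implementation would additionally match the surrounding weight-change patterns against bite and non-bite templates, as the study describes, to reject utensil contacts and food additions.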
Some studies employed weighing sensors in portable systems. One such study [99] involved a portable smart plate with 3 compartments, each with an embedded load cell (Fig. 9 (d)). The authors observed that a bite was typically characterized by a sharp increase in measured force originating from food scooping and developed a bite detection algorithm using a random forest classifier. They reported an average error (mean ± standard deviation) of 8 ± 8% in portion size weight estimation under lab conditions. The work was extended in [100] by evaluating a newer bite detection algorithm on multiple measurements with varying food types, more realistic eating conditions, and a larger dataset. On average, the algorithm estimated the total portion size with an error of 29 g for an average meal size of 318 g.
Another study [101] presented a similar portable device (the Mandometer, Fig. 9 (a)) for automatically processing continuous in-meal weight measurements to detect in-meal eating indicators, such as total food intake. The algorithm automated the extraction of meal-related indicators and was evaluated on a dataset of 113 meals. It calculated the Cumulative Food Intake (CFI) curve of a meal based only on continuous weight measurements from the Mandometer. The authors report an error of 24 g for total meal weight.
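In essence, a CFI curve is the start weight minus the current plate weight, constrained to be non-decreasing so that noise and brief utensil-contact spikes do not register as negative intake. A minimal sketch of that idea, using a hypothetical weight trace rather than data from [101]:

```python
import numpy as np

def cumulative_food_intake(plate_weight_g):
    """Cumulative food intake (g) from a continuous plate-weight signal.

    The raw weight rises briefly when the plate is touched, so the intake
    curve is kept non-decreasing by tracking the running minimum weight.
    """
    w = np.asarray(plate_weight_g, dtype=float)
    return w[0] - np.minimum.accumulate(w)

# Hypothetical trace: 300 g meal; utensil contact causes a spike at sample 3.
trace = [300, 290, 288, 350, 275, 270, 268]
cfi = cumulative_food_intake(trace)
print(cfi[-1])  # total intake: 32.0 g
```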

IV. DISCUSSION
This review was intended to provide a systematic evaluation of existing SB-FPSE methods. The review surveyed 67 full-text research articles, identifying 3 specific sensor types (wearable, portable, and stationary) and 2 types of sensor-based methodologies (direct and indirect FPSE). In addition, 5 sensor modalities were primarily used for FPSE (Acoustic, Strain, Imaging, Motion, and Weighing sensors).
Among the three sensor types, most studies in this review made use of portable sensors. These studies employed either portable-imaging or portable-weighing sensors. The ubiquity of cameras in portable devices such as mobile phones has been the main motivation for researchers to use imaging extensively. Mandometer [101] and the smart plate [99], [100] are two portable-weighing sensors.
Stationary sensors are part of a fixed system that is not intended to move. There were two types of stationary sensors, namely, stationary-imaging or stationary-weighing sensors. Stationary-imaging sensors were mostly employed in systems that were fixed in a laboratory. Weighing sensors that are embedded in a fixed eating surface such as a table or tablecloth fall under this category.
The third category of sensors is wearable sensors. Wearable-imaging sensors provide an egocentric point of view and are mostly passive devices. Other wearable sensors, namely wearable acoustic, motion, or strain sensors, mainly sense physiological activities such as chewing, swallowing, and hand/wrist movements. Studies that sense physiological activities offer a different perspective, considering metrics related to the act of food intake, unlike those that compute the FPS by measuring properties of the food itself (volume, mass).
The review identified that imaging sensors were the most popular of the 5 sensor modalities considered, mostly due to the proliferation of portable imaging on the smartphone platform. The other sensor modalities are often overlooked by the researchers, although they may present compelling alternatives either in stand-alone or hybrid configurations.
For example, a key difference in sensor operation is whether some form of self-report is required. The self-report may include the capture of images by the user, the placement of a dimensional reference in the images, the staging of the foods in the image, and so on. Reliance on self-report frequently leads to underreporting of food intake [115]. Wearable sensors, on the other hand, may be completely passive, requiring only compliance with wear and no active self-report by the user. Thus, this review attempts to cover as many sensor modalities as possible, aiming to bring the attention of the research community to less frequently used sensors.
Imaging sensors were broadly classified as single-view and multi-view. Portion size estimation from single-view images relies on dimensional references. Two major concerns pertaining to these methods are assessing irregularly shaped foods and estimating accurate portion sizes when multiple food items are involved. Another limitation is the use of fiducial markers placed in the image scene. Although such references aid FPSE, it may not be feasible for the user to place markers in every single eating episode, for example, a snack. If the user fails to place the reference in the scene or to take the image, the whole process becomes void. Some researchers have explored the use of other references such as thumbs, fingers, and circular objects (plates/cups/bowls) to eliminate these fiducial markers.
Other popular methods involve geometric models, shape templates, and virtual wireframes/meshes. These make use of pre-defined shape templates or prior knowledge of the shape of the food item and fail in the case of irregularly shaped food items. Multi-view approaches are the more recent developments in image-based FPSE and provide more insight into the three-dimensional structure of the food items than single-view approaches. There is no need to fit shape models or to use prior information such as pre-defined food shapes. Dense point clouds of the food items can be obtained from multiple perspectives, and these methods work well with mixed meals. However, some of the methods still require fiducial markers for calibration and may have issues with non-rigid, irregularly shaped foods. As the number of views increases, the complexity of the system grows and tends to become cumbersome for the participants.
This review also investigated the viability of using sensor-based approaches for FPSE in free-living conditions, where participants are not limited by the constraints of the lab and can consume any foods at any time. RQ3 was formulated to determine whether the methods were tested and validated in the field and represent real-world performance, which may be quite different from that observed in a lab [116]. The overall findings suggest a few important aspects for consideration: how the studies reported the measure of accuracy, under what conditions the methods were evaluated, what food items were considered, and what reference method was used. The included studies estimated mass/volume and compared the results with popular ground-truth reference measures such as water displacement, seed displacement, food weights, and 3D scans.
The first consideration is the accuracy of the SB-FPSE methods. Fig. 10 summarizes the methods reviewed in this article. The review identified that direct methods for FPSE are more accurate than indirect methods; there is thus greater scope for improvement in the indirect methods. However, indirect methods are mostly used with wearable sensors and are therefore more amenable to studies conducted in free-living conditions. It should also be noted that in many studies the accuracy was reported on only a few food items, which does not reflect the real-life variability of foods. This creates a paradox: imaging sensors report the highest accuracy rates, followed by weighing sensors, even though weighing sensors represent the most direct and standard way of measuring portion size. The difference is easily explained by the conditions (lab vs. free-living) and the size of the studies, both in terms of the number of participants and the number of food items.
The second consideration is the applicability of the reviewed sensors to free-living conditions. The review identified that wearable sensors that indirectly measure portion size (Acoustic, Strain, and Motion) were tested only in the laboratory. Further studies are needed to investigate the use of wearable sensors in free-living conditions. The majority of free-living studies to date were based on portable sensors, mostly due to the prevalence of smartphones, although some studies used portable-weighing sensors (e.g., [101]).
The review identified several articles that employed hybrid solutions involving multiple sensors. The authors in [31] emphasized the need for devices with multiple sensing modalities, claiming that a combination of modalities led to significant improvements in accuracy. The study proposed a hybrid solution for FPSE using motion and acoustic sensors: an acoustic sensor was used to classify the type of food, and a motion sensor was used to track the number of bites. The authors also hypothesized that acoustic sensors could distinguish between different food textures, for example crisp and tacky, and that motion sensors could help discriminate between soft foods, such as an ice cream and a milkshake, based on head or wrist position. A similar hybrid solution using features from an acoustic sensor and a strain sensor was discussed in [29]. However, in this case the estimation was prone to errors due to differences in the physical properties of different foods.

V. OPEN PROBLEMS AND FUTURE DIRECTIONS
Overall, the sensor-based methods of FPSE made significant progress over the past few years. However, despite promising accuracy, they are not widely adopted in clinical and research practice. There are still open problems and questions to be addressed.
The stationary sensors are limited to the place of installation and thus, are not suitable as a measurement tool for free-living individuals, where the food may be consumed at various places throughout the day. Portable sensors better lend themselves to the idea of free-living monitoring, but these sensors predominantly rely on self-report (active participation of the user) in the form of taking images or placing items on the scale. The limitations of self-report and resulting underreporting of intake are well documented in research literature (although this was not a part of this review). Wearable sensors offer the best alternative in terms of user acceptance and provision of truly "passive" monitoring that does not rely on self-report, only compliance with sensor wear. However, these sensors do not yet provide sufficient accuracy and more work is needed to develop methods of portion size estimation from wearable sensor data.
Another significant finding of the review is the very limited scope of sensors that have been tested for portion size estimation. The majority of the reported methods rely on various forms of imaging, with very few other sensor modalities ever tested. The field is ready for innovation in sensor technology, utilizing novel technologies and sensors. For example, time-of-flight cameras may present a compelling alternative to traditional color cameras that rely on dimensional references.
The review revealed that the field of FPSE is dominated by imaging sensors with other sensor modalities being underrepresented. A potential reason for such an imbalance is the proliferation of portable imaging devices in smartphones and other consumer electronics. However, the field of FPSE in general is still immature, with a significant number of open problems and directions to grow.
As an indirect outcome of this review, we noticed that a significantly higher number of research publications target the detection of food intake and recognition of food being eaten. A possible explanation is that in the chain of processing to determine energy and nutrient intake, the portion size measurement is important in the later stages, thus FPSE is not considered as the first research priority. We would like to emphasize the extreme importance of the accuracy of the portion size estimation for accurate measurement of energy and nutrient intake and hope that this review will generate a renewed interest in the field.
Although imaging solutions are prevalent, they still have many open problems, such as the frequent use of self-report for image capture, the need for manual placement of the dimensional reference in the image, the need for staging of the foods in the image, and other issues that may affect the process of image capture. Image processing is even more challenging, with errors in FPSE originating from complex food shapes that often need to be reconstructed from a 2D projection to a 3D representation, food occlusions on plates where one food may completely cover another, and the inability to separate ingredients in mixed dishes, such as soups.
The most significant open problem is the applicability of any given sensor-based solution to everyday use. We would argue that no existing solution lends itself to everyday use. Weighing and imaging sensors impose a significant user burden and may lead to underreporting of intake. Wearable sensors may require fewer user actions, just cooperation with the wear regimen, but these sensors need to address the issues of accuracy, social acceptance, and data privacy before being widely adopted.
Hybrid solutions that fuse information from several sensor sources may be a promising future direction. However, sensor fusion should not come at the cost of users having to wear multiple sensors, resulting in increased sensor burden. Rather, a multi-sensor system may be packaged in socially acceptable accessories such as wristwatches [95], eyeglasses [118], and other wear items. Overall, the future goal should be to achieve an accurate, objective portion size measurement applicable to the wide variety of foods and beverages consumed by people of different demographic and sociocultural characteristics under the great variety of conditions, environments, and ways of consuming food.

VI. CONCLUSION
A comprehensive review of the state-of-the-art SB-FPSE methods was conducted, and 5 sensor modalities (Acoustic, Strain, Imaging, Weighing, and Motion sensors) were identified. This article contributed a taxonomy for SB-FPSE methods. The review found that present-day research is focusing on improving accuracy, testing outside of restricted laboratory conditions, and handling more challenging cases such as mixed meals, irregularly shaped foods, and non-rigid food items. If these existing challenges can be addressed, SB-FPSE can be used in free-living conditions with minimal human intervention in the estimation process.
Indirect methods using wearable technologies can be robust to food shape and size, since their estimates are derived from physiological indicators such as chewing, swallowing, hand gestures, or head movements. The accuracy of FPSE in these methods is currently lower than that of direct methods. If indirect methods are more extensively explored and their accuracy is improved, they may well be the future of SB-FPSE.