Estimation of Plate and Bowl Dimensions for Food Portion Size Assessment in a Wearable Sensor System

Automatic food portion size estimation (FPSE) with minimal user burden is a challenging task. Most of the existing FPSE methods use fiducial markers and/or virtual models as dimensional references. An alternative approach is to estimate the dimensions of the eating containers prior to estimating the portion size. In this article, we propose a wearable sensor system (the automatic ingestion monitor integrated with a ranging sensor) and a related method for the estimation of the dimensions of plates and bowls. The contributions of this study are: 1) the model eliminates the need for fiducial markers; 2) the camera system [automatic ingestion monitor version 2 (AIM-2)] is not restricted in terms of positioning relative to the food item; 3) our model accounts for radial lens distortion caused by lens aberrations; 4) a ranging sensor directly gives the distance between the sensor and the eating surface; 5) the model is not restricted to circular plates; and 6) the proposed system implements a passive method that can be used for assessment of container dimensions with minimum user interaction. The error rates (mean ± std. dev) for dimension estimation were 2.01% ± 4.10% for plate widths/diameters, 2.75% ± 38.11% for bowl heights, and 4.58% ± 6.78% for bowl diameters.


I. Introduction
Reliable and accurate portion size estimation is challenging but essential for dietary assessment. Image-based dietary assessment has been one of the fastest growing areas of research in this milieu. Image-based assessment can be split into manual assessment and automatic assessment. Manual assessment can be done using digital food records [2] or by image-based 24-h recall/self-reporting that involves food atlases [3], [4], [5], [6], [7]. Image-based food records involve capturing meal images that are reviewed later by the participant or by a professional (nutritionist/clinician/researcher) to estimate the portion size. Digital food images are a useful tool for the quantification of food items and in portion size estimation [8], [9], [10]. Images of food leftovers are also captured in some studies, which improved the portion size accuracy [11]. Recall or self-reporting methods use food atlases, which are reference guides presenting images of foods across the range of portion sizes usually consumed. Either during or after data collection, participants are asked to report the food quantity consumed by selecting a particular image, a fraction/multiple of an image, or a combination of images [12].
The abovementioned manual methods are cumbersome, rely on memory (and are therefore prone to error), and are less accurate than the more recent automatic assessment methods. A previous review [13] identified some of the existing image-based food portion size estimation (FPSE) methods that are automatic. Food portion size can be estimated automatically using food images captured during the meal [14], [15], [16], [17], [18]. However, automatic FPSE from food images is a challenging task since the two-dimensional (2-D) image lacks the three-dimensional (3-D) real-world information; there is no reference by which to gauge the size/volume of the food items. To tackle this problem, the dimensional reference is obtained using a visual cue that must be present in a food picture. A few methods used virtual objects or objects that already exist in a typical food image to aid in FPSE. Some of the popular approaches included geometric models [19], VR-based referencing [20], circular object referencing [17], [21], and thumb-based referencing [22]. Shang et al. [23] used a structured light-based 3-D reconstruction approach to estimate food volume. Jia et al. [17] used the "plate method" for FPSE, where the circular plates present in the image are used to determine the location, orientation, and volume of the food items. That study, however, only considers circular plates.
A fiducial marker of known dimensions placed in the images can also be used as a point of reference [17], [22], [24], [25]. The type of reference used determines the complexity of the setup. Some methods require the users to carry around the reference (checkerboards, blocks, and cards) and some require special dining setups, which increases user burden.
Image-based FPSE can also be classified by the mode of image capture: food image capture can be either active or passive. Active methods rely on the participant to capture the food image with a camera (such as a smartphone camera), typically before and after an eating episode. The images are then processed using computer vision models to segment foods, recognize foods, estimate portion size/volume, and compute energy content [26], [27], [28]. Active methods provide detailed information such as meal timing, location, and duration of the eating episodes. However, these methods require the active participation of the users, which can be a burden. Some of the active methods that predict portion sizes require fiducial markers in the food image to assist manual review/computer algorithms [26], [29]. The placement of these markers combined with the active nature of image capture increases the user burden considerably.
One study presented a new active method for food volume estimation without using a fiducial marker. The method utilizes a special picture-taking strategy on a smartphone [1]. A mathematical model that uses the height and orientation of the smartphone was used to determine the real-world coordinates of the plane of the eating surface in the captured image. Bucher et al. [30] presented and tested a new virtual reality method for food volume estimation using the International Food Unit. This method, however, requires the user to place the smartphone on the eating surface during image capture and also needs additional user interaction in using the virtual International Food Unit.
Food images can also be acquired by a "passive" method using wearable devices that capture images continuously (both food and nonfood) without the active participation of the user [31], [32]. The passive methods minimize the burden of active capture by using a wearable camera. However, FPSE methods that require fiducial markers cannot be easily integrated with passive image capture, since the user is not actively taking images and does not know when to place the markers.
The automatic ingestion monitor [33] is a wearable sensor system [automatic ingestion monitor version 2 (AIM-2)] that is mounted on a user's eyeglasses. It consists of a combination of sensors for accurate detection of food intake and triggering of the wearable camera.

A. Equipment
In this study, a novel wearable sensor system [AIM-2 with a time-of-flight (ToF) ranging sensor] was used [33]. AIM-2 consists of a sensor module, which houses a miniature 5-Mpixel camera with a 120° wide-angle gaze-aligned lens, a low-power 3-D accelerometer (ADXL362 from Analog Devices, Norwood, MA, USA), and a ToF ranging sensor (VL53L0X from STMicroelectronics). The sensor system is enclosed in a custom-designed 3-D printed enclosure. The ToF sensor is aligned with the camera axis.
The camera captured one image every 10 s throughout the day. Data from the accelerometer and ToF sensor were sampled at 128 Hz. All collected sensor signals and captured images were stored on an SD card and processed off-line in MATLAB (MathWorks Inc., Natick, MA, USA) for algorithm development and validation. The AIM enclosure is designed such that the camera and the ToF sensor are at an angle of 21° with respect to the accelerometer axis, as shown in Fig. 1. This offset (+21°) is used when calculating the pitch of the camera. The raw sensor data from the accelerometer were preprocessed before extracting the pitch angle: a high-pass filter with a cutoff frequency of 0.1 Hz was applied to remove the dc component from the accelerometer signal.
The sensor pitch was calculated as in [33], and the device pitch was obtained by adding the offset (21°) to the sensor pitch. The distance readings are more straightforward: the raw values directly represent the measured distances. Next, using the timestamp of an image, the corresponding pitch and distance readings were extracted. Fig. 2 shows the ToF distance readings and pitch plotted as a function of time for a sample meal.
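As an illustration, the pitch computation described above can be sketched as follows. The accelerometer axis convention and the tilt formula are assumptions for illustration only; the actual sensor-pitch computation follows [33]:

```python
import math

CAMERA_OFFSET_DEG = 21.0  # fixed camera/ToF mounting offset w.r.t. the accelerometer axis


def sensor_pitch_deg(ax, ay, az):
    """Pitch from a static (or low-pass filtered) accelerometer sample.

    Axis convention is an assumption: x forward, y lateral, z down when
    the device is level, so gravity then reads (0, 0, 1) g."""
    return math.degrees(math.atan2(-ax, math.hypot(ay, az)))


def device_pitch_deg(ax, ay, az):
    """Camera pitch = sensor pitch plus the +21 deg mounting offset."""
    return sensor_pitch_deg(ax, ay, az) + CAMERA_OFFSET_DEG


# A level device reads gravity only on z, so the camera pitch equals the offset.
print(round(device_pitch_deg(0.0, 0.0, 1.0), 1))  # -> 21.0
```

With the device level, the sensor pitch is 0° and the camera pitch equals the mounting offset alone.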

B. Geometric Model
The objective is to project the points in an image into real-world coordinates. In this study, our primary concern is to measure the dimensions of plates and bowls.
Refer to Fig. 3. Let P be a point in the world, C_w be a world coordinate system, and (X, Y, Z)^T be the coordinates of P in C_w. Define the camera coordinate system, C_c, to have its W-axis parallel to the optical axis of the camera lens, its U-axis parallel to the u-axis of C_i (the image coordinate plane), and its origin located at the perspective center. Let (U, V, W)^T be the coordinates of P in C_c. The coordinates (U, V, W)^T are related to (X, Y, Z)^T by a rigid-body coordinate transformation

(U, V, W)^T = R (X, Y, Z)^T + T    (1)

where R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector. R depends on three angles of rotation, namely, pitch (ω), roll (Φ), and yaw (ψ). The three angles for the AIM device are shown in Fig. 1.
The principal point is the intersection of the imaging plane with the optical axis. Let f_c be the focal length of the lens of the imaging system (Table I). Define the 2-D image coordinate system, C_i, to be in the image plane with its origin located at the principal point, u-axis in the fast-scan direction (horizontal rows of pixels on the sensor), and v-axis in the slow-scan direction (vertical rows of pixels on the sensor). Fast scan indicates the pixel direction in which the sensor scans at a higher rate. Let p be the projection of P onto the image plane and let (u, v)^T be the coordinates of p in C_i. Under perspective projection, (u, v)^T is given by

(u, v)^T = (f_c U / W, f_c V / W)^T.    (2)

Next, radial lens distortion is incorporated into the model. Let (ũ, ṽ)^T be the actual observed image point after being subject to lens distortion. Then, (ũ, ṽ)^T is related to (u, v)^T by the single-coefficient radial model

(ũ, ṽ)^T = (1 + K_c (u² + v²)) (u, v)^T    (3)

where K_c is a coefficient that controls the amount of radial distortion.
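A single-coefficient radial distortion model of this kind can be sketched as below. Since the model has no closed-form inverse, the correction is done by fixed-point iteration; the coefficient value used in the round trip is illustrative, not the calibrated K_c of the AIM camera:

```python
def distort(u, v, k_c):
    """Apply single-coefficient radial distortion:
    (u_d, v_d) = (u, v) * (1 + k_c * r^2), with r^2 = u^2 + v^2."""
    s = 1.0 + k_c * (u * u + v * v)
    return u * s, v * s


def undistort(u_d, v_d, k_c, iters=20):
    """Invert the distortion by fixed-point iteration, starting from the
    distorted point; converges quickly when |k_c| * r^2 is small."""
    u, v = u_d, v_d
    for _ in range(iters):
        s = 1.0 + k_c * (u * u + v * v)
        u, v = u_d / s, v_d / s
    return u, v
```

A round trip through `distort` followed by `undistort` recovers the original image point to numerical precision, which is how the correction step can be sanity-checked.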
Finally, it is necessary to model the image sampling performed by the camera sensor [charge-coupled device (CCD)]. A camera sensor consists of a 2-D array of photosensors. Each photosensor converts incoming light into a digital signal by means of an analog-to-digital converter. To obtain color information, one "sensor pixel" is divided into a grid of photosensors, and different color filters are placed in front of these photosensors. Each photosensor receives light through only one of three filters: red, green, or blue. Combining these measurements gives one color triple (red intensity, green intensity, blue intensity). This arrangement is known as the Bayer filter. Therefore, the digital image coordinates are not the same as the pixel coordinates.
Let C_p be the pixel coordinate system associated with the digital image. The pixel coordinates (u_p, v_p)^T are related to the image coordinates by

(u_p, v_p)^T = (s_cx ũ + c_cx, s_cy ṽ + c_cy)^T    (4)

where s_cx and s_cy are scale factors (pixel/mm) and (c_cx, c_cy) is the principal point in pixel coordinates (Table I). Cascading (1)-(4) gives the forward model that maps world coordinates to pixel coordinates

(u_p, v_p)^T = f(X, Y, Z).    (5)

For the purposes of dimension estimation, we are interested in the inverse function of (5), i.e.,

(X, Y, Z)^T = f^{-1}(u_p, v_p).    (6)

This inverse does not exist in general, since depth is lost in the projection. However, with the known AIM sensor orientation, namely, the pitch angle of the sensor provided by the inertial measurement unit, a right angle between the surface of the lens and the optical axis, and the projection relationship in (5), it can be shown that the inverse of the function f in (6) exists for the tabletop, i.e.,

(X, Y, 0)^T = f^{-1}(u_p, v_p).    (7)

Note that Z = 0 in (7) represents the plane equation of the tabletop.
Also, we assume that the roll (Φ) and yaw (ψ) of the sensor are zero. Therefore, the rotation matrix R reduces to a rotation by the pitch angle (ω) about the U-axis

R = [1 0 0; 0 cos ω −sin ω; 0 sin ω cos ω].    (8)

From (4), the world coordinates of the tabletop are related to the pixel coordinates by inverting (1)

(X, Y, Z)^T = R^{-1}[(U, V, W)^T − T].    (9)

To calculate the translation and rotation matrices, we make use of the sensor pitch (ω) and the distance readings from the AIM. The distance is obtained by the ToF sensor, and the sensor pitch is obtained by the accelerometer on the AIM device, as shown in Fig. 4 (the camera on the AIM has an offset of +21°, which is added to the sensor pitch). The height of the camera above the eating surface is

h = d_tof sin ω    (10)

where d_tof is the distance from the ToF sensor, i.e., the distance between the AIM and the eating surface along the optical axis. As in [1], each image point defines a ray through the perspective center

(U, V, W)^T = λ (u, v, f_c)^T.    (11)

From (9), we obtain

(X, Y, Z)^T = λ R^{-1}(u, v, f_c)^T − R^{-1}T.    (12)

Finally, imposing Z = 0 in (12) determines the scale λ of each pixel ray, which yields

(X, Y, 0)^T = λ R^{-1}(u, v, f_c)^T − R^{-1}T    (13)

where T = [0; −h; 0]. Equation (13) gives us the plane of the eating surface (Z = 0).
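The back-projection of an image pixel onto the tabletop plane can be sketched as follows, under the same assumptions (zero roll and yaw, distortion already corrected, ToF reading taken along the optical axis). The axis convention (v increasing downward in the image) and the intrinsic values in the test below are assumptions for illustration:

```python
import math


def pixel_to_tabletop(u_px, v_px, pitch_deg, d_tof, f_px, cx, cy):
    """Back-project a pixel onto the tabletop plane Z = 0.

    Assumes zero roll/yaw, lens distortion already corrected, and a camera
    pitched down by `pitch_deg` from the horizontal. `d_tof` is the range
    reading along the optical axis; (f_px, cx, cy) are illustrative
    intrinsics in pixels. Returns (X, Y) in the units of `d_tof`."""
    w = math.radians(pitch_deg)
    h = d_tof * math.sin(w)            # camera height above the table
    u = u_px - cx                      # image coords relative to principal point
    v = v_px - cy                      # v grows downward in the image
    denom = v * math.cos(w) + f_px * math.sin(w)
    if denom <= 0:
        raise ValueError("ray does not intersect the table in front of the camera")
    t = h / denom                      # scale at which the pixel ray meets Z = 0
    x = t * u
    y = t * (f_px * math.cos(w) - v * math.sin(w))
    return x, y


def plate_width(p_left, p_right):
    """Euclidean distance between two back-projected rim points."""
    return math.hypot(p_right[0] - p_left[0], p_right[1] - p_left[1])
```

With the two rim pixels of a plate back-projected this way, the plate diameter is simply the Euclidean distance between the resulting tabletop points.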
The study mainly focuses on obtaining the dimensions of two types of vessels, namely, plates and bowls. We assume plates to be flat and part of the plane Z = 0; the heights/depths of the plates are assumed to be negligible and approximated as zero. We measure the dimensions of the plate on this plane.
However, in the case of bowls, the height of the bowl is first measured along the y-axis, as shown in Fig. 5. The measured quantity H′ is only the projection of the height on the y-axis; the true height is calculated as

H = H′ tan ω.    (14)

Here, the assumption is that the bowl sides are flat and not curved. Once the height of the bowl (H) is calculated, the equation of the plane Z = H is used instead of Z = 0. Fig. 5 shows the changes in the parameters for obtaining the adjusted plane equation

h′ = h − H    (15)

where h is calculated as in (10).
We then measure the dimensions of the mouths of the bowls similar to the dimensions of the plates. We assume the mouth of the bowl to be part of the plane Z = H. The radius of the mouth of the bowl is then measured on this plane along the x- and y-axes.
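The bowl computations in (14) and the plane adjustment can be sketched as below; reading the adjustment as a simple subtraction of the bowl height H from the camera height h is our interpretation of Fig. 5 and is flagged as an assumption:

```python
import math


def bowl_height(h_proj, pitch_deg):
    """True bowl height H from its projection H' on the tabletop plane,
    H = H' * tan(pitch) as in (14), assuming flat (non-curved) bowl walls."""
    return h_proj * math.tan(math.radians(pitch_deg))


def adjusted_height(h_cam, bowl_h):
    """Camera height above the bowl-mouth plane Z = H (assumption:
    the adjusted height h' is simply h - H)."""
    return h_cam - bowl_h


# Example: a 5 cm on-table projection of the bowl wall seen at a 55 deg pitch
# corresponds to a bowl roughly 7.14 cm tall.
H = bowl_height(5.0, 55.0)
```

The mouth diameter is then measured on the plane Z = H exactly as the plate diameter is measured on Z = 0, with h replaced by h′.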

C. Data
The AIM sensor system was mounted on a test bench to collect data. The test bench consisted of a tripod and a protractor for angle measurement (see Fig. 6). The AIM device was placed on a tripod in front of a table at three pitch angles (40°, 55°, and 70°) with respect to the horizontal, at three different heights from the table surface (20, 35, and 50 cm). The angles were measured using the protractor fixed to the side of the sensor aligned with the camera (as shown in Fig. 6). The protractor was also calibrated to test for errors in the pitch angle measurement. The calibration was done in increments of 10° from 0° to 70°; the error in measurement was (mean ± std. dev) −2.43° ± 1.36°. The roll and yaw of the camera were set to approximately 0° for experimentation. The roll and yaw of the AIM are also assumed to be 0 when a person is eating.
Nine sets of data collected at combinations of the three heights and three pitch angles were used for testing (see Fig. 7). A set of eight objects was used as objects of interest: three circular plates (diameter: small 18 cm, medium 22 cm, and large 26 cm), two square plates (side: small 18 cm and medium 23 cm), and three circular bowls.
As a final step, four research assistants used the proposed methodology to estimate the sizes of three containers (two circular bowls and one hollow rectangular box) shown in Fig. 8. The AIM device was worn by a user, and a minimum of three images were taken for each case without any restrictions on the position/tilt of the head.
The images and the sensor signals captured by the AIM at each setup were extracted and used as input to the model. The ground-truth dimensions were measured using a tape measure.

III. Results
Fig. 9 shows a sample result of the lens distortion correction applied per (3).
Using the world coordinates of the plane Z = 0 and the projected image on the plane (see Fig. 10), the dimensions of the plates were measured (see Fig. 11). Any object belonging to this plane can be measured using this projection.
Table II presents the results for the dimension estimation of plates using the proposed model. The error percentage in the dimension estimation of plates was (mean ± std. dev) 2.01% ± 4.10%.
In the case of bowls, the heights of the bowls are estimated first, as shown in Fig. 12. Once the height is estimated, (14) and (15) are used to estimate the bowl width measured at the top of the bowl (at Z = H).
Table III presents the results for the estimation of the heights of bowls. The error percentage in the height estimation of bowls was (mean ± std. dev) 2.75% ± 38.11%. Table IV presents the results for the estimation of the diameters of bowls. The error percentage in the diameter estimation of bowls was (mean ± std. dev) 4.58% ± 6.78%. Tables V and VI present the results from the real scenarios that were used for validation. The error percentages in the diameter/length and height estimation were (mean ± std. dev) −7.89% ± 4.71% and 4.70% ± 11.56%, respectively.
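For reference, error statistics of this form can be reproduced with a computation like the following; the paper does not state whether the population or sample standard deviation was used, so the population form below is an assumption:

```python
def percent_errors(estimates, ground_truth):
    """Signed percent error of each estimate relative to its ground truth."""
    return [100.0 * (e - g) / g for e, g in zip(estimates, ground_truth)]


def mean_std(values):
    """Mean and (population) standard deviation, as reported in the tables."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return m, var ** 0.5
```

For example, an estimated diameter of 110 mm against a 100-mm ground truth gives a +10% signed error; averaging signed errors (rather than absolute errors) is also an assumption, consistent with the negative mean reported for Table V.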

IV. Discussion
This study proposes a passive and automatic method for the estimation of plate and bowl dimensions using the AIM-2 device integrated with a ToF sensor. The motivation is to use these dimensions for FPSE as in the "plate method" suggested in [17]. A geometric camera model is used to obtain the real-world coordinates of the surface on which the objects of interest are present. A similar model is proposed in [1]; however, that method requires the use of a smartphone with the active participation of the user, and the smartphone needs to be placed on the eating surface at a specific position. We propose a method that does not have this requirement. We make use of a ToF ranging sensor, which can directly measure the distance between the camera and the table. The method also accounts for lens aberrations that can cause distortions, such as barrel distortion, in the captured images.
A major contribution of this work is the elimination of the fiducial markers that have been extensively used in previous methods for FPSE. The direct measurement from the range sensor provides the necessary dimensional reference for the 2-D-to-3-D model conversion.
The method makes several assumptions prior to estimation: the camera axis and the range sensor axis are parallel to each other, the roll and yaw angles of the sensor are 0, the eating surface lies in the plane Z = 0, and the walls of the bowls are flat.
The proposed method was evaluated on a test bench using a calibrated protractor for positioning. Three heights and three angles were considered for testing the proposed model, based on the natural behavior of participants in previous AIM-based studies: the tilt angles and camera-to-surface distances observed in those studies fall within the selected ranges of pitch angles and heights.
For the measurement of bowl heights, the inner walls of the bowls were used. The rationale is that the AIM is a passive device that captures images continuously, including at the start and end of a meal, so the images will include an empty bowl at the end of the meal. Even if the bowl is not empty, we can compare the start and end of the meal and calculate the difference in the food level. This is a major advantage of a passive camera, since there are enough images covering the entire meal. It was noticed that the error rates were higher for steep angles.
The results of the estimation of dimensions for plates were acceptable, with low error rates. It was noticed that the dimensions were overestimated for steep angles (70°). The estimations were most accurate at 55° compared with the other two orientations. This is a promising trend, since the corresponding AIM pitch angles normally occur when a person is bending forward to grab a bite of the food in front of them. In addition, since the AIM captures images continuously every 10 s, multiple images will be captured at several angles due to the forward bending of the user. The angle that typically had the lowest error rates could then be picked from the range of angles available to estimate the dimensions of the objects in the scene. This reference can then also be used for images from different orientations.
We also noticed that, for plate diameter estimation, the error rates were lower for heights of 20 and 35 cm than for 50 cm at the same pitch angles. This could be because the plates are more central in the images as the camera is closer, reducing the field of view (the area covered by the camera). However, for the height and diameter estimations of bowls, a height of 35 cm was more accurate than the 20- and 50-cm cases for narrow pitch angles. The 35-cm height might be ideal for the methodology used here, since the walls of the bowls are easier for the user to mark. The best results were obtained at a 55° pitch angle for heights of 20 and 35 cm for plates and bowls, respectively.
One limitation of the study is that the bowl walls are assumed to be flat and not curved. This could be a source of error in dimension measurement and portion size estimation. The method also assumes that the plates are part of the plane Z = 0 and does not account for the thickness or curvature of the plates. However, unlike other studies that use plates as a reference, this method is not restricted to circular plates or bowls; any shape of plate or bowl can be included.
Finally, the proposed method was validated by wearing the AIM and collecting data for three cases; four research assistants estimated the corresponding diameters/lengths and heights. The results suggest that, except for a couple of outliers (RA3: diameter and RA2: height for the white box), the estimates were reasonably accurate. It should also be noted that one of the cases was a hollow rectangular box, indicating that the method could be employed for similarly shaped bowls and possibly for a larger variety of bowl shapes. However, in situations where the walls of the bowls are not flat, our assumption of flat walls might induce errors in estimating the height of the container accurately.
Future work could include estimating food portion sizes from the dimensions of the bowls and plates. Another possible direction is to use this method to estimate the dimensions of regular-shaped foods and, from those, food volume. Also, the proposed method was only tested on a stationary test bench. Since the AIM device is primarily designed to be mounted on eyeglasses, it is necessary to test the proposed method by mounting the sensor system on a human.

V. Conclusion
In this article, we propose a wearable sensor system-based method (the automatic ingestion monitor integrated with a ToF ranging sensor) for the estimation of the dimensions of plates and bowls. The contributions of this study are: 1) the model eliminates the need for fiducial markers; 2) the camera system (AIM-2) is not restricted in terms of positioning, unlike in [29], where the smartphone must be placed on the eating surface; 3) our model accounts for radial lens distortion caused by lens aberrations; 4) a distance (ToF) sensor directly gives the distance between the sensor and the eating surface; 5) the model is not restricted to circular plates; and 6) the method is passive and can be used for either automatic or manual assessment of container dimensions with minimum user interaction. The error rates (mean ± std. dev) for dimension estimation were 2.01% ± 4.10% for plate widths/diameters, 2.75% ± 38.11% for bowl heights, and 4.58% ± 6.78% for bowl diameters.

Fig. 1. Rotational axes of the AIM sensor. Also depicted is the camera offset of 21° with respect to the axis of the eyeglass/AIM device.
Fig. 2. ToF distance and pitch angle as a function of time (hh:mm:ss) for a sample meal where the AIM-2 was used.
Fig. 3. Model of the AIM imaging system.
Fig. 4. Projection of camera coordinates onto the world coordinates. Also depicted is the calculation of the pitch angle of the camera.
Fig. 6. Left: test bench with the AIM attached to the tripod. Right: protractor used to measure the pitch angle of the sensor system.
Fig. 7. Test dataset: plates and bowls of varying sizes.
Fig. 8. Validation images in a real-case scenario where the AIM-2 was worn by a user on the eyeglasses.
Fig. 11. Sample measurements of plate dimensions (all units are in mm).
Fig. 12. Sample measurements of bowl dimensions.
s_cx and s_cy are scale factors (pixel/mm), c_cx and c_cy are the pixel coordinates of the principal point, and K_c is the distortion coefficient (pixel/mm)

Fig. 5. Projection H′ of the height H on the y-axis. Also shown is the calculation of h′ for obtaining the plane equation Z = H.

Fig. 10. Projection of the image onto the plane Z = 0 (all units are in mm).

TABLE I
Parameters of the AIM Camera

TABLE VI
Error in Height Estimation in Containers

IEEE Sens J. Author manuscript; available in PMC 2023 October 05.