The CORSMAL benchmark for the prediction of the properties of containers

The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, the opaqueness or transparency of the container and the content, and the variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks, and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and of audio-only or vision-only baselines designed from related works. Based on this analysis, we conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content, using different types of convolutional neural networks combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable for determining the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score of up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement in the design of new methods, which can be ranked and compared on the individual leaderboards provided by our open framework.


I. INTRODUCTION
People interact daily with household containers, such as cups, drinking glasses, mugs, bottles, and food boxes. Methods to estimate the physical properties (e.g., weight and shape) of these containers could support human-robot cooperation [1]-[5], video annotation and captioning. Methods should generalize to unknown container instances and operate with only limited prior knowledge, such as generic categories of containers and contents [1], [6], [7]. However, the material, texture, transparency, and shape vary considerably across containers and may change with the content. Furthermore, the content may not be visible due to the opaqueness of the container or because of hand occlusions. For these reasons, predicting the physical properties of containers is a challenging task. The combination of sensing modalities, namely RGB images, depth, and audio, may help to overcome challenges such as noisy scenarios, already filled containers with absence of sound, occlusions, or transparent objects whose depth data may be highly inaccurate [8].
The contributions of this paper include:
• A novel framework for the comparison of methods that estimate the physical properties of containers and their content, when a person manipulates the container (see Fig. 1);
• The definition of three tasks, namely the classification of the content amount, the classification of the content type, and the estimation of the container capacity, and of related performance measures, including the indirect filling mass estimation based on the three tasks;
• The design of 12 audio-only baselines and one vision-only baseline for the tasks of classifying the content level and the content type, based on related approaches from the literature;
• A formal review, a comparative analysis, and an in-depth discussion of methods that used the framework to address this problem;
• The results of an international benchmarking challenge.
(Fig. 1: The multi-modal, multi-sensor system used to record a person manipulating a container and its content. The system includes two third-person view cameras at the two sides of the robot, a first-person view camera mounted on the robot, a first-person view from the body-worn camera on the person, and an 8-microphone circular array placed next to the robot arm.)
The paper is organized as follows. Section II discusses related works. Section III presents the benchmarking framework, including a multi-modal dataset, tasks for the estimation of the container and content properties, and corresponding performance measures. Section IV reviews the methods that used the framework for the tasks of filling type and level classification. Section V reviews the methods that used the framework for the task of container capacity estimation. Section VI discusses and compares the results of the methods under analysis. Section VII concludes the paper and discusses future research directions.

II. RELATED WORK
In this section, we discuss the object properties that are commonly estimated in the literature. We then review methods that recognize the content type, estimate the amount of content in a container, or estimate the container capacity, based on their approaches and input modalities.
Most of the works in the literature focus on object recognition, object shape and size reconstruction in 3D, as well as pose estimation of a variety of objects, using visual data and objects standing on a surface [9]-[16]. Object properties, such as transparency, are often tackled independently with ad-hoc designed approaches for 3D shape reconstruction, object localization in 3D, or 6D pose estimation [8], [17]-[19]. Recognizing different high-level properties, such as the type and amount of multiple filling materials, the capacity of the container, and the overall weight of the object (i.e., the container with its content), is not yet well investigated.
Recognizing the content type within a container is addressed only for general food recognition using visual information [20]-[22]. The audio modality is commonly used for the recognition of general environmental sounds using the combination of traditional features and machine learning classifiers, e.g., k-Nearest Neighbour (kNN) [23], Support Vector Machine (SVM) [24], and Random Forest (RF) [25], or deep learning approaches, e.g., convolutional neural networks (CNNs) [26]. Examples of traditional acoustic features are spectrograms, zero-crossing rate (ZCR), Mel-frequency Cepstrum Coefficients (MFCCs), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid features (tonnetz) [27]-[30]. However, there are no uni-modal or audio-visual approaches that recognize the content type during the manipulation of different containers held by a person, together with other physical properties.
For content level estimation, some methods regress or classify the property using CNNs and a single image [7], [37], or use temporal information from sequences of RGB or RGB-D data to track the change in the amount during a mechanical action [38]-[40]. Other methods use the sound signals generated by the contact of the content with a container during a manipulation [41]-[44]. For example, the level of unknown liquids within containers standing on a surface is regressed or classified by using approaches such as a Kalman filter and recurrent neural networks with edge features or spectrograms [39], [40], [44]. For the estimation of the capacity of a container, one work trained a CNN using an RGB image of one or more containers standing on a surface [37]. However, all of these approaches are often designed and evaluated on scenarios with only standing containers, and with limited variability in the data.
Unlike previous works, in the next sections we present an open framework for the estimation of multiple physical properties of containers and contents as they are manipulated by a person. We also discuss methods that used this framework based on the modalities used as input, the features extracted, and the type of approach (regression, classification, or geometry-based) [31]-[35] (see Table 1).

III. BENCHMARKING FRAMEWORK

A. CONTAINERS, FILLINGS, SCENARIOS
The dataset includes audio-visual-inertial recordings of people manipulating a range of containers that vary in shape, size, material, transparency, and deformability, and a set of contents under different scenarios with an increasing level of difficulty due to the type of occlusions. CORSMAL Containers Manipulation [45] is a dataset consisting of 1,140 audio-visual recordings with 12 human subjects manipulating 15 containers, split into 5 cups, 5 drinking glasses, and 5 food boxes. These containers are made of different materials, such as plastic, glass, and cardboard. Each container can be empty or filled with water, rice or pasta at two different levels of fullness: 50% and 90% with respect to the capacity of the container. The combination of containers and contents results in a total of 95 configurations, acquired for three scenarios with an increasing level of difficulty caused by occlusions or subject motions.
In the first scenario, the subject sits in front of the robot, while a container is on a table. The subject either pours the content into the empty container, while avoiding touching the container, or shakes an already filled food box. Afterwards, the subject initiates the handover of the container to the robot. In the second scenario, the subject sits in front of the robot, while holding a container, before starting the manipulation. In the third scenario, a container is held by the subject while standing to the side of the robot, potentially visible only in the third-person camera view. After the manipulation, the subject takes a few steps and initiates the handover of the container in front of the robot. Each scenario is recorded with two different backgrounds and under two different lighting conditions. The first background condition involves a plain tabletop with the subject wearing a texture-less t-shirt, while the second background condition involves the table covered with a graphics-printed tablecloth and the subject wearing a patterned shirt. The first lighting condition is based on artificial illumination as provided by lights mounted on the ceiling of the room. The second lighting condition uses two controlled artificial lights placed at the sides of the robot and illuminating the area where the manipulation is happening. Each subject executed the 95 configurations for each scenario and for each background/illumination condition.

B. SENSOR DATA AND ANNOTATION
The dataset was acquired with four Intel RealSense D435i multi-sensor devices and an 8-element circular microphone array. Each D435i device has 3 cameras and provides spatially aligned RGB, narrow-baseline stereo infrared, and depth images at 30 Hz with 1280x720 pixels resolution. One D435i is mounted on a robot arm that does not move during the acquisition and provides a more realistic view of the operating area from the robot perspective. Another D435i is chest-mounted on the person to provide a first-person view, while the remaining two devices are placed at the sides of the robot arm as third-person views that look at the operating area. The microphone array is placed on a table and consists of 8 Boya microphones. The annotation of the data includes the capacity of the container, the content type, the content level, the mass of the container, the mass of the content, and the maximum width and height (and depth for boxes) of each object. Fig. 2 shows the total object mass across containers and their contents.
The dataset is split into a training set (684 recordings of 9 containers), a public test set (228 recordings of 3 containers), and a private test set (228 recordings of 3 containers). The containers for each set are evenly distributed among the three categories. The annotations of the container capacity, content type and level, and the masses of the container and content are provided publicly only for the training set.

C. TASKS AND PERFORMANCE SCORES
We define three tasks for the framework, namely the classification of the amount of content (Task 1), the classification of the content type (Task 2), and the estimation of the capacity of the container (Task 3). We refer to the amount of content as filling level and to the type of content as filling type.
In Task 1, a container is either empty or filled with an unknown content at 50% or 90% of its capacity. There are three classes: empty, half-full, full. For each configuration j, the goal is to classify the filling level (λ_j). In Task 2, containers are either empty or filled with an unknown content. There are four filling type classes: none, pasta, rice, water. For each configuration j, the goal is to classify the type of filling, if any (τ_j). For these two tasks, we compute precision, recall, and F1-score for each class k across all the configurations belonging to class k, J_k. Precision, P_k, is the number of true positives over the total number of true positives and false positives for class k. Recall, R_k, is the number of true positives over the total number of true positives and false negatives for class k. The F1-score is the harmonic mean of precision and recall for each class k, defined as F1_k = 2 P_k R_k / (P_k + R_k). We then compute the weighted average F1-score across the K classes, F̄1 = (1/J) Σ_{k=1}^{K} J_k F1_k, where J = Σ_{k=1}^{K} J_k is the total number of configurations. Note that K = 3 for filling level classification, whereas K = 4 for filling type classification.
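The weighted average F1-score above can be computed directly from the per-class counts of true positives, false positives, and false negatives; the following is a minimal sketch of the metric (function and variable names are ours, not part of the framework's reference implementation):

```python
from collections import Counter

def weighted_f1(y_true, y_pred, classes):
    """Weighted average F1: per-class F1_k weighted by class support J_k."""
    support = Counter(y_true)
    total = 0.0
    for k in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += support[k] * f1          # J_k * F1_k
    return total / len(y_true)           # divide by J
```

This matches the `weighted` averaging mode of common metric libraries (e.g., scikit-learn's `f1_score`).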
In Task 3, containers vary in shape and size. For each configuration j, the goal is to estimate the capacity of the container (γ_j ∈ R_{>0}, in milliliters). For capacity estimation, we compute the relative absolute error between the estimated capacity, γ̂_j, and the annotated capacity, γ_j, for each configuration j: ε_j = |γ̂_j − γ_j| / γ_j. We then compute the average capacity score as C = (1/J) Σ_{j=1}^{J} 1_j e^{−ε_j}, where the value of the indicator function 1_j ∈ {0, 1} is 0 only when the capacity (mass) of the container in configuration j is not estimated. The weight of the object, ω ∈ R_{>0} (in Newtons), is the sum of the mass of the (empty) container, m_c ∈ R_{>0} (in grams), and the mass of the (unknown) filling, m_f ∈ R_{>0} (in grams), multiplied by the gravitational acceleration, g = 9.81 m/s²: ω = (m_c + m_f) g. While we do not require the mass of the empty container to be estimated, we expect methods to estimate the capacity of the container and to determine the type and amount of filling in order to estimate the mass of the filling. For each configuration j, we then compute the filling mass as m̂_f^j = λ̂_j γ̂_j D(τ̂_j), where D(·) selects a pre-computed density based on the classified filling type. The density of pasta and rice is computed from the annotation of the filling mass, capacity of the container, and filling level for each container. The density of water is 1 g/mL. For filling mass estimation, we compute the relative absolute error between the estimated filling mass, m̂_f^j, and the annotated filling mass, m_f^j, for each configuration j, unless the annotated mass is zero (empty filling level). Similarly to the average capacity score, we compute the average filling mass score, M.
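As a sketch of the indirect mass computation and scoring, assuming the average score takes the form of an exponential of the negative relative absolute error (consistent with the indicator-function description above), and using placeholder densities for pasta and rice (the framework derives them from the training annotations):

```python
import math

# Densities in g/mL: water is exact; pasta and rice values are illustrative
# placeholders, not the framework's annotation-derived values.
DENSITY = {'none': 0.0, 'pasta': 0.41, 'rice': 0.85, 'water': 1.0}
# Filling level classes mapped to the fraction of the container capacity.
LEVEL = {'empty': 0.0, 'half-full': 0.5, 'full': 0.9}

def filling_mass(level, ftype, capacity_ml):
    """m_f = lambda * gamma * D(tau): level fraction x capacity x density."""
    return LEVEL[level] * capacity_ml * DENSITY[ftype]

def capacity_score(estimates, annotations):
    """Average of exp(-relative absolute error); None (not estimated) scores 0."""
    scores = []
    for est, ann in zip(estimates, annotations):
        if est is None:
            scores.append(0.0)  # indicator function 1_j = 0
        else:
            scores.append(math.exp(-abs(est - ann) / ann))
    return sum(scores) / len(scores)
```

The filling mass score M would reuse `capacity_score` with the estimated and annotated masses, skipping configurations whose annotated mass is zero.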
Note that we will present the scores as percentages when discussing the results in the comparative analysis.

D. BASELINES
Along with the framework, CORSMAL provides 12 audio-only baselines and one vision-only baseline for the tasks of filling level and filling type classification.
The audio-only baselines jointly classify filling type and level using traditional acoustic features, such as ZCR, MFCCs, tonnetz, or spectrograms, combined with one of three machine learning classifiers (kNN, SVM, RF). Note that for MFCCs, the 1st to 13th coefficients are used, whereas the 0th coefficient is discarded. Three baselines use as input the mean and standard deviation of the MFCCs and ZCR features across multiple audio frames [46]. Three other baselines extract a feature vector consisting of 193 coefficients from the mean and standard deviation of the MFCCs, chromagram, Mel-scaled spectrogram, spectral contrast, and tonnetz across multiple audio frames [27]-[30]. For simplicity, we refer to this set of acoustic features as AF193 in the rest of the paper. Three other baselines use spectrograms, which are cropped, resized and reshaped into a vector of dimension 9,216, as input to the classifiers [35]. To remove redundant information, three additional baselines perform dimensionality reduction with Principal Component Analysis (PCA) on the reshaped spectrograms, retaining only the first 128 components.
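A minimal numpy sketch of the ZCR part of this feature extraction; the frame length and hop size are illustrative choices rather than the baselines' exact settings, and the resulting mean/std vector would be concatenated with MFCC statistics and fed to a kNN, SVM, or RF classifier (e.g., from scikit-learn):

```python
import numpy as np

def zcr(frame):
    """Zero-crossing rate of one audio frame (fraction of sign changes)."""
    frame = np.asarray(frame, dtype=float)
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

def zcr_features(signal, frame_len=1024, hop=512):
    """Mean and std of the per-frame ZCR across the recording."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    vals = np.array([zcr(f) for f in frames])
    return np.array([vals.mean(), vals.std()])
```

A pure tone crossing zero twice per period yields a small, stable ZCR, while noisy pouring or shaking sounds raise it, which is what makes the statistic discriminative.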
The vision-only baseline uses two CNNs to perform an independent classification of filling level and filling type from a single RGB image. We re-trained ResNet-18 architectures [47] using a subset of frames selected within the video recordings of the training set of CORSMAL Containers Manipulation and cropped to a rectangular area around the container [7]. On the test sets, the baseline is applied to each camera view independently: an image crop is extracted from the last frame using Mask R-CNN [9], and the segmentation mask with the most confident class between cup and wine glass is selected. The output classes of the two CNNs include an additional class, opaque, to handle cases where containers are not transparent and vision alone fails to determine the content type and level [7], [37].

IV. FILLING LEVEL AND TYPE CLASSIFICATION
Six methods used the framework to address the tasks of filling level classification (Task 1) and filling type classification (Task 2) either independently, e.g., when only one of the two properties is necessary for the target application, or jointly, e.g., when both properties are necessary for accurately estimating the total object weight. For simplicity, we refer to the 6 methods as M1, M2 [31], M3 [32], M4 [33], M5 [34] and M6 [35] for the rest of the paper.
For filling type classification, audio is the preferred input modality, and methods used either only CNNs, a CNN with an RNN, or a CNN followed by majority voting as classification approaches [31], [33], [34]. For filling level classification, some methods used visual data in combination with audio data [32], [34]. Hand-crafted and/or learned acoustic features are used by the methods. Traditional acoustic features, such as MFCCs, spectral characteristics, ZCR, chroma vector and deviation, are computed from short-term windows. Long-term features can be obtained by summarizing the short-term features from longer windows of the input audio signal and by including additional statistics, such as mean and standard deviation. Learned features are extracted by CNNs from multi-channel or mono-channel audio signals that are post-processed into spectrograms or log-Mel spectrograms [33], [34]. To handle audio signals of different duration, long audio signals can be truncated to a pre-defined duration and zero-padding is added to shorter signals [31], [33].
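The truncation and zero-padding step can be sketched as follows; the 30 s maximum duration matches M2's setting described below, while other values would be method-specific:

```python
import numpy as np

def fix_duration(signal, sr, max_seconds=30.0):
    """Truncate long signals and zero-pad short ones to a fixed sample count."""
    target = int(sr * max_seconds)
    if len(signal) >= target:
        return signal[:target]
    return np.pad(signal, (0, target - len(signal)))  # trailing zeros
```

This guarantees a fixed-size input tensor regardless of how long the recorded manipulation lasted.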
The fully connected neural network of M1 has 5 layers and uses STFT features as input.The network is trained with the Adam optimizer [48] and dropout [49] on the last hidden layer to reduce overfitting.
The filling type classifier of M2 uses 40 normalized and concatenated MFCC features that are extracted with 20 ms windows at 22 kHz, with a maximum duration of 30 s [31]. The CNN has 2 convolutional layers and 1 fully connected layer (86,876 trainable parameters).
M4 [33] used all 8 audio signals from the microphone array to compute log Mel-scaled spectrograms with the STFT and 64 filter banks for filling type and filling level classification. A sliding window over the cropped spectrogram with 75% overlap forms overlapping audio frames consisting of 3D tensors, where the third dimension is given by the 8 audio channels. Each window is provided as input to a CNN consisting of 5 blocks, each with 2 convolutional and 1 batch normalization layers followed by a max-pooling layer. The CNN is complemented by 3 fully connected layers for the filling type classification of each audio frame, followed by majority voting. The CNN has a total of 13 layers with 4,472,580 trainable parameters. The same extracted features are also used as input to three stacked Long Short-Term Memory (LSTM) [50] units for the filling level classification. The three stacked LSTMs are trained with a set of 100 audio frames and contain 256 hidden states, resulting in 2,366,211 trainable parameters.
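M4's windowing and per-frame voting can be sketched as follows, assuming the spectrogram is laid out as a (frequency, time, channels) array; the window length of 64 time steps is an illustrative choice, and the class labels of the vote are whatever the per-frame CNN predicts:

```python
import numpy as np

def sliding_windows(spec, win=64, overlap=0.75):
    """Split a (freq, time, channels) spectrogram into overlapping 3D tensors."""
    hop = max(1, int(win * (1 - overlap)))  # 75% overlap -> hop = win / 4
    out = [spec[:, s:s + win, :]
           for s in range(0, spec.shape[1] - win + 1, hop)]
    return np.stack(out)

def majority_vote(frame_preds):
    """Final class is the most frequent per-frame prediction."""
    vals, counts = np.unique(frame_preds, return_counts=True)
    return vals[np.argmax(counts)]
```

Each stacked window would be classified independently by the CNN, and `majority_vote` aggregates the per-frame decisions into one label per recording.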
The multi-layer perceptrons (MLPs) of M3 [32] are trained for either filling level or filling type classification, and specifically one for each object category (cup, drinking glass, food box). Each MLP has 3 layers, with 3,096 nodes in the first hidden layer and 512 in the last hidden layer. The total number of trainable parameters is 20,762,288. Each MLP takes as input a spectrogram computed from a multi-channel sound signal re-sampled at 16,600 Hz and converted into mono-channel by averaging the samples across channels. Only the last 32,000 samples are retained and converted into a spectrogram via the Discrete Fourier Transform. To select which MLP to use at inference time, regions of interest (ROIs) are detected in all frames of the image sequences of all four views in the CORSMAL Containers Manipulation dataset by using YOLOv4 [51] pre-trained on MS COCO [52]. The category (cup, drinking glass, food box) is determined by a majority voting of randomly sampled frames (65% of all frames).
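A sketch of M3's audio pre-processing: the channel averaging, 32,000-sample tail, and per-frame DFT follow the description above, while the frame and hop sizes are our assumptions (the paper does not specify them), and the 16,600 Hz re-sampling step is assumed to have already happened:

```python
import numpy as np

def m3_spectrogram(multi_channel, n_keep=32000, frame=512, hop=256):
    """Average channels to mono, keep the last n_keep samples, DFT per frame.

    multi_channel: (channels, samples) array, already re-sampled.
    Returns a magnitude spectrogram of shape (frame // 2 + 1, n_frames).
    """
    mono = multi_channel.mean(axis=0)      # average across channels
    mono = mono[-n_keep:]                  # retain only the last samples
    frames = [mono[i:i + frame] for i in range(0, len(mono) - frame + 1, hop)]
    return np.abs(np.array([np.fft.rfft(f) for f in frames])).T
```

Restricting the input to the signal's tail implicitly assumes the informative part of the manipulation happens near the end of the recording.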
Both traditional and learned acoustic features are used by M5 [34] for filling type classification, whereas visual features are extracted in addition to the acoustic features for filling level classification. Multiple classifiers, each associated with one feature, are used to output the class probabilities. Then, the probabilities are averaged across the classifiers to determine the final class. For the acoustic features, the multi-channel input audio signal is converted into mono-channel by averaging the samples across channels. MFCCs, energy, spectral characteristics, and their statistics (mean and standard deviation) are computed from 50 ms windows of the input signal as short-term traditional features. The features are concatenated into a 136-dimensional vector used as input to an RF classifier. The number of trees of the RF classifier is automatically set during training by selecting the value among {10, 25, 50, 100, 200, 500} that achieves the highest accuracy in validation. For the learned features, the mono-channel signal is re-sampled at 16 kHz and converted into log-Mel spectrograms from 960 ms windows of the re-sampled signal. Each spectrogram is provided as input to a VGG-based model [53] that is pre-trained on a large dataset (e.g., AudioSet [54]) and computes a 128-dimensional feature vector. The learned features are then provided as input to a GRU model [55] that has 5 layers and a hidden layer of size 512 to handle the intrinsic temporal relations of the signals.
The model has a total of 7,291,395 trainable parameters. Visual features are extracted from the image sequences of all camera views by using R(2+1)D [56], a spatio-temporal CNN that is based on residual connections [47] and 18 (2+1)D convolutional layers that approximate a 3D convolution by a 2D convolution (spatial) followed by a 1D convolution (temporal). R(2+1)D is pre-trained for action recognition on Kinetics 400 [57], takes as input a fixed window of 16 RGB frames of 112×112 pixel resolution, and outputs a 512-dimensional feature vector. Long temporal relations between the features of each window are estimated by using an RNN with a GRU model that has 3 layers and a hidden dimension of size 512 (4,729,347 trainable parameters). The GRU models from each camera view are jointly trained and their logits are summed together before applying the final softmax to obtain the class probabilities from the visual input. For filling type classification, the probabilities resulting from the last hidden state of the GRU network and those resulting from the RF are averaged. For filling level classification, the probabilities resulting from the RF classifier and the GRU models for both the audio and visual features are averaged together to compute the final class. The RF classifier and all the GRU models are trained independently for filling type classification and filling level classification by using a 3-fold validation strategy.
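M5's late-fusion rule reduces to averaging the class-probability vectors produced by the individual classifiers (RF and GRU models) and taking the arg-max, which can be sketched as:

```python
import numpy as np

def fuse_probabilities(prob_list):
    """Average class-probability vectors from several classifiers.

    prob_list: list of 1D arrays, one probability vector per classifier.
    Returns the fused probabilities and the index of the winning class.
    """
    probs = np.mean(np.stack(prob_list), axis=0)
    return probs, int(np.argmax(probs))
```

Averaging probabilities (rather than hard votes) lets a confident classifier outweigh an uncertain one, which matters when the audio and visual cues disagree.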
Jointly estimating the filling type and level can avoid infeasible cases, such as water at an empty level or none at a half-full level. Different traditional classifiers and existing CNNs that use spectrograms as input have been analyzed and compared in Donaher et al.'s work [35], especially when different containers are manipulated by a person with different content types, such as both liquids and granular materials.
Because of the different container types and corresponding manipulations, the authors of M6 [35] decomposed the problem into two steps, namely action recognition and content classification, and devised three independent CNNs. The first CNN (action classifier) identifies the manipulation performed by the human, i.e., shaking or pouring, and the other two CNNs are task-specific and determine the filling type and level. The CNN for action recognition (pouring, shaking, unknown) has 4 convolutional, 2 max-pooling, and 3 fully connected layers; the CNN for the specific action of pouring has 6 convolutional, 3 max-pooling, and 3 fully connected layers; and the CNN for the specific action of shaking has 4 convolutional, 2 max-pooling, and 2 fully connected layers. The choice of which task-specific network should be used is conditioned by the decision of the first CNN. When the action classifier does not distinguish between pouring or shaking, the approach associates the unknown case to the class empty.
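M6's two-step decision logic can be sketched with stub classifiers standing in for the three CNNs; the (type, level) tuple return format and the function names are our assumptions, not the authors' interface:

```python
def classify_content(audio, action_net, pour_net, shake_net):
    """Two-step pipeline: the action classifier gates the content classifier."""
    action = action_net(audio)          # 'pouring', 'shaking', or 'unknown'
    if action == 'pouring':
        return pour_net(audio)          # pouring-specific filling type/level CNN
    if action == 'shaking':
        return shake_net(audio)         # shaking-specific filling type/level CNN
    return ('none', 'empty')            # unknown action -> container is empty
```

The gating exploits the fact that shaking only happens with already filled boxes and pouring only with initially empty containers, so each task-specific CNN sees a narrower, easier distribution.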
V. CONTAINER CAPACITY ESTIMATION

Regression approaches use CNNs [31] or distribution fitting via Gaussian processes [32]. The CNN architecture of M2 has 4 convolutional layers, each followed by batch normalization [58], and 3 fully connected layers (532,175 trainable parameters) [31]. The CNN takes as input a ROI and its normalized relative size, and then regresses the capacity of the container, limited to 4,000 mL according to the range of capacities in the dataset. The ROI is computed from the contour features of a depth image, selected from the frame with the most visible pixels of the frontal, fixed view and assuming a maximum depth of 700 mm. M3 [32] used Gaussian processes to regress the container capacity, depending on the container category. To model multiple multi-variate Gaussian functions for each container category, the container type is recognized by detecting multiple ROIs in all frames of all image sequences, as done for filling type and level classification.
Geometric approaches approximate the container with a primitive shape in 3D, such as a cuboid or a cylinder [8], [33], [34]. The shape is represented as a point cloud obtained directly from RGB-D data or computed via energy-based minimization to fit the points to the real shape of the object, as observed in the RGB images of a wide-baseline stereo camera and constrained by the object masks [8], [34]. The capacity is then computed as a by-product, e.g., by finding the minimum and maximum values for each coordinate in 3D [33] or by using volume formulas specific to the primitive shape [34]. The approximated primitives can lead to inaccurate capacities: a cuboid representation could result in an overestimated capacity and hence re-scaling would be necessary [33]; a cylinder representation may not generalize to shapes other than rotationally symmetric objects. To handle occlusions caused by the human hand manipulating a container, M4 [33] selects the RGB-D frame with a single silhouette having the largest number of pixels and post-processes the point cloud to deal with inaccuracies in the segmentation. Capacity estimations computed at different frames of the image sequences in the stereo views are then averaged, assuming that the container is fully visible.
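The capacity-from-primitive computation can be sketched for both shapes; these are simplified axis-aligned fits over a point cloud (assuming coordinates in centimeters, so volumes come out in milliliters), not the energy-based minimization of [8], [34]:

```python
import numpy as np

def cuboid_capacity(points):
    """Axis-aligned bounding-box volume of an (N, 3) point cloud."""
    extent = points.max(axis=0) - points.min(axis=0)  # width, depth, height
    return float(np.prod(extent))

def cylinder_capacity(points):
    """Fit a vertical cylinder: radius from the x-y extent, height from z."""
    extent = points.max(axis=0) - points.min(axis=0)
    radius = 0.25 * (extent[0] + extent[1])  # average half-width in x and y
    return float(np.pi * radius ** 2 * extent[2])
```

On the same point cloud the cuboid volume upper-bounds the cylinder volume, which illustrates why the cuboid-based estimate tends to overestimate and needs re-scaling.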

VI. EXPERIMENT RESULTS AND DISCUSSION
We compare and analyze the performance of the 6 methods and the 13 baselines on the public test set, the private test set, and their combination on the CORSMAL Containers Manipulation dataset [45].

A. IMPLEMENTATION DETAILS
The CNN of M2 for filling type classification is trained with the SGD optimizer, a fixed learning rate of 0.00025, a momentum of 0.9, and a batch size of 16. M4 sets the frame length to 25 ms, the hop length to 10 ms, and the number of samples for the Fast Fourier Transform to 512 for computing the STFT. During training, M4 crops the audio signals based on manual annotations of the start and end of the manipulation. The network for filling level classification of M4 is trained by using the cross-entropy loss and the Adam optimizer [48] with a learning rate of 0.00001 and a mini-batch size of 32 for 200 epochs.

B. FILLING LEVEL CLASSIFICATION
Table 2 compares the performance of all baselines and methods except M2. M4, M5 and M6 achieve the highest accuracy with 80.84, 79.65, and 78.65 F̄1 on the combined test set, respectively. This performance is almost twice as high as that of M1 and M3 and shows that using only audio as input modality is sufficient to achieve an accuracy higher than 75 F̄1. M5 uses both audio and visual data, but its similar performance to M4 and M6 suggests that audio features are dominant in determining the classification decision. M6 is the best performing on the private test set (81.46 F̄1), whereas M4 is the best performing on the public test set (82.63 F̄1). Interestingly, both methods selected a fixed portion of the audio signal, transformed into a spectrogram, where the manipulation of the container by the human subject was more likely to occur (see Fig. 3). However, the three CNNs of M6 use the full trimmed spectrograms as input, whereas the CNN+LSTM of M4 uses portions of the log-Mel spectrogram, which are obtained with a temporal sliding window. Both are shown to be valid methods, assuming that the whole audio signal is available and the manipulation is completed.
The confusion matrices in Fig. 4 show that M4 and M6 do not confuse the class empty, whereas M5 mis-classifies some empty configurations as half-full. Not surprisingly, most of the confusions occur between the classes half-full and full for all methods. M4 and M5 are more accurate than M6 in recognizing the class half-full, but M6 is more accurate in recognizing the class full. M3 mis-classifies the true class empty as half-full 40% of the time and as full 33% of the time, and the class full is confused with half-full 75% of the time. M3 recognizes the container categories cup, drinking glass and food box with 92%, 73%, and 88% accuracy, respectively, in the training set. Errors in the category recognition may lead to wrong classifications by the selected category-specific MLP-based classifier, which is also trained with limited and selected data. The CNN of M1 made erroneous predictions across all classes, except for empty, which was never predicted as half-full but only confused with full.
The vision-only baseline (using the first camera view, on the left side of the robot arm) confused the class empty with half-full 81% of the time, in addition to mis-classifications between half-full and full, making the performance of the baseline only 10 F̄1 points higher than that of a random classifier (37.62 F̄1).

C. FILLING TYPE CLASSIFICATION
(Fig. 5: Confusion matrices of filling type classification for all methods across all the containers of the public and private test splits of the CORSMAL Containers Manipulation dataset [45]. Each cell is normalized by the total number of true labels for each class. KEY - E: empty; P: pasta; R: rice; W: water.)
Providing the reshaped spectrograms (with or without PCA to select the first 128 components) to any of the three classifiers, namely kNN, SVM, or RF, is the worst choice. On the combined test set, the lowest performance is obtained by Spectrogram + PCA + SVM with 24.20 F̄1, whereas the highest performance is obtained by Spectrogram + kNN with 64.55 F̄1. Classic audio features, such as MFCCs and ZCR, are more discriminative and sufficient to achieve a performance higher than 78 F̄1 for the three classifiers. Simply using ZCR and MFCCs with RF can achieve 91.31 F̄1, which is close to the performance of the three top methods (M5, M6, M4) that use CNNs and LSTMs. On the contrary, the performance decreases when using a larger set of features, such as tonal centroid, spectral contrast, chromagram, Mel-scaled spectrogram, and MFCCs. Fig. 5 shows the confusion matrices of the methods. M4 made a few mis-classifications for the class rice with none and pasta, and for the class water with none. M6 confused 4% of pasta with rice, 4% of rice with pasta, 7% of pasta with water, and 2% of water with none. The confusion between water and none could be expected due to the low volume of the sound produced by the water, whereas the confusion of water with rice might be caused by the glass material of the container and background noise. The largest confusion for M5 is given by the erroneous prediction of rice as pasta (13%). As for filling level classification, M1 and M3 have large mis-classifications across different classes, with M3 unable to predict water for any audio input.

D. CAPACITY ESTIMATION
We compare the results of M2, M3, M4, and M5 in terms of the average capacity score. We also report the results of a pseudo-random generator (Random), based on the Mersenne Twister algorithm [59], that samples predictions from a uniform distribution in the interval [50, 4000]. We then analyze and discuss the statistics of the absolute error in predicting the container capacity for each testing container, as well as for each filling type and level.
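The random baseline can be sketched as follows. Python's `random` module is itself a Mersenne Twister (MT19937) implementation, matching the generator cited above; the framework's capacity-score formula is not reproduced here, so the sketch only shows the sampling and the relative absolute error used in the later per-container analysis:

```python
import random

def random_capacity_baseline(n_predictions, seed=0):
    """Sample capacity predictions uniformly in [50, 4000] mL
    with a Mersenne Twister generator (Python's default)."""
    rng = random.Random(seed)
    return [rng.uniform(50.0, 4000.0) for _ in range(n_predictions)]

def relative_absolute_error(predicted, true_capacity):
    """Relative absolute error, as analyzed per container in Fig. 6."""
    return abs(predicted - true_capacity) / true_capacity

preds = random_capacity_baseline(3, seed=42)
errors = [relative_absolute_error(p, 1000.0) for p in preds]
```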
Table 4 shows that M2 achieves the best score with 66.92 C, 67.67 C, and 67.30 C for the public test set, private test set, and combined test set, respectively, when using only depth data from the fixed frontal view. All methods achieve a performance score at least twice that of the random solution (24.58 C for the combined test set): M4 has the lowest score (54.79 C), whereas M5 and M3 obtain 60.57 C and 62.57 C, respectively. Fig. 6 shows the statistics (median, 25th and 75th quartiles, and the lower and upper whiskers) of the relative absolute errors for each container in the test sets of the dataset. M2 has the lowest median error for all containers, except for the private containers C14 and C15. The variation of its error across configurations is either smaller than that of the other methods or lower than their median values. M5 is more consistent in estimating the same container shape and capacity for most of the configurations related to containers C12 and C15. M5 also has the largest variations for C10 and C14; M3 for C12 and C15; and M4 for C11. Interestingly, M3 has a median error lower than M4 and M5 for C13, and achieves the lowest median error with a small variation across configurations for C14. However, we can observe that, in general, the relative absolute error across containers is around or higher than 0.5.
In addition to the comparison across containers, Fig. 7 shows the relative absolute errors grouped by filling type and level for each method. Most of the errors lie in the interval [0.3, 0.8], and the methods have a similar amount of variation between the 25th and 75th quartiles, but differ in the median error and the upper whisker (excluding outliers). M2 achieves the lowest median error (always lower than half of the real container capacity) and smaller variations (25th-75th quartiles), whereas M3 has similar results for rice full. M4 has the largest errors for empty, pasta half-full, pasta full, rice half-full, and rice full. M5 has the largest errors for water half-full and water full.
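The box-plot statistics reported in Figs. 6 and 7 can be computed as below. The whisker rule (1.5 × IQR, a common matplotlib default) is an assumption; the paper does not state which convention it used:

```python
import numpy as np

def box_plot_stats(errors):
    """Median, quartiles, and whiskers of a set of relative errors."""
    errors = np.asarray(errors, dtype=float)
    q1, median, q3 = np.percentile(errors, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers: most extreme data points within 1.5 x IQR of the box
    # (values beyond the fences are treated as outliers).
    lower = errors[errors >= q1 - 1.5 * iqr].min()
    upper = errors[errors <= q3 + 1.5 * iqr].max()
    return {"median": median, "q1": q1, "q3": q3,
            "lower_whisker": lower, "upper_whisker": upper}

# Toy errors for one container; 2.9 falls outside the upper fence.
stats = box_plot_stats([0.31, 0.42, 0.48, 0.55, 0.61, 0.74, 2.9])
```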

E. ANALYSIS PER SCENARIO AND CONTAINER
Table 5 compares the performance scores of the methods grouped by scenario and container for all three tasks. For filling level classification on the testing containers, the F1 of M4, M5, and M6 increases from scenario 1 to scenario 3, showing that audio information is robust despite the increasing difficulty due to the in-hand manipulation (scenarios 2 and 3) and the larger distance (scenario 3). However, the performance of M6 decreases by almost 2 pp from scenario 1 (78.52 F1) to scenario 2 (76.92 F1). The performance of M1 is affected by the in-hand manipulation and distance, decreasing from 52.90 F1 in scenario 1 to 45.46 F1 in scenario 3. M3 achieves its highest accuracy in scenario 2 (51.34 F1), an increase of 11.51 pp compared to scenario 1 (39.83 F1), but decreases to 35.92 F1 in scenario 3 (likely caused by errors in recognizing the container category). For filling type classification, the performance of M4, M5, and M6 is higher than 90 F1 across the scenarios, but the trend is the opposite of filling level classification: M5 and M6 decrease in F1 from scenario 1 to scenario 3, whereas M4 achieves its highest accuracy in scenario 2 (98.07 F1). M3 and M1 show the same behavior for filling level and type classification, with large decreases in scenario 3 of 15.31 pp and 22.16 pp compared to scenario 1, respectively. For capacity estimation, M3 and M4 are less affected by the variations across the scenarios, whereas M2 is the best performing in scenario 1 (68.81 C) and scenario 2 (73.70 C) but decreases by 9.42 pp in scenario 3 compared to scenario 1.
M2 is based only on the frontal depth view, where the subject is not visible most of the time. This challenges the method to detect the object in the pre-defined depth range. M5 is affected by the increasing challenges across scenarios, decreasing from 66.51 C in scenario 1 to 55.68 C in scenario 3. This shows the limitations of the underlying approach [8], which was designed for objects free of occlusions and standing upright on a surface.
The performance across containers varies between the methods. Testing containers 12 and 15 are the most challenging for M3, M4, M5, and M6 when classifying the filling level, whereas M1 achieves its best performance on both containers. M4 and M5 have the largest decrease, with scores in the interval [40, 50]. M4 performs worse on the private testing containers than on the public testing containers, with the lowest scores on the boxes (containers 12 and 15). M5 also performs worse on the drinking glass and cups in the private test set than in the public test set. Surprisingly, the best score of M5 is on the box container 15 (78.82 C), even though the modeled shape is a 3D cylinder.

F. FILLING MASS ESTIMATION
We discuss the overall performance of the methods based on their results in estimating the filling mass. Methods that estimated only some of the physical properties in our framework (e.g., M1, M2, and M6) are complemented by a random estimation of the missing physical properties to compute the filling mass. Table 6 shows that methods addressing only filling type and level classification achieve a lower score than a random guess for each task. Given the multiplicative formula of the filling mass estimation (see Eq. 6), even a few errors in these classification tasks can lead to a low filling mass score, especially when combined with a random estimate of the container capacity. However, improving the capacity estimation is an important aspect in achieving more accurate results (and a higher score) for the filling mass estimation (see M2). M3, M4, and M5 addressed all three tasks and achieved 53.47 M, 62.16 M, and 65.06 M, respectively. Overall, the methods perform better on the public test set than on the private test set, except for M2 and M5, which achieve similar performance on the two test sets. We can observe that the more accurate predictions of the container capacity help M3 to obtain 53.47 M despite its classification errors for filling level and type. The high classification accuracy on filling level and type, combined with capacity estimation scores similar to those of M3, makes M4 and M5 the best performing methods in filling mass estimation. The similarity between the scores for container capacity and filling mass estimation shows how important accurate capacity prediction is for correctly estimating the filling mass.
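The multiplicative structure of the filling mass estimate can be sketched as follows. The density and filling-ratio constants below are illustrative assumptions, not the paper's values (Eq. 6 is not reproduced here), although the half-full and full levels of the dataset correspond to 50% and 90% of the container capacity:

```python
# Illustrative constants (assumptions, not the paper's calibration).
DENSITY_G_PER_ML = {"pasta": 0.41, "rice": 0.85, "water": 1.0, "none": 0.0}
FILLING_RATIO = {"empty": 0.0, "half-full": 0.5, "full": 0.9}

def filling_mass(capacity_ml, filling_type, filling_level):
    """Content mass in grams from the three predicted quantities.

    An error in any one factor propagates multiplicatively, which is
    why a few misclassifications can dominate the final mass score.
    """
    return capacity_ml * FILLING_RATIO[filling_level] * DENSITY_G_PER_ML[filling_type]

mass = filling_mass(1000.0, "water", "half-full")  # 1000 * 0.5 * 1.0 = 500 g
```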

VII. CONCLUSION
We presented the open CORSMAL framework to benchmark methods for estimating the physical properties of different containers while they are manipulated by a person with different content types. The framework includes a dataset, a set of tasks and performance measures, and several baselines that use either audio or visual input. The framework supports the contactless estimation of the weight of the container, including its content (if any), despite variations in the physical properties across containers and occlusions caused by hand manipulation.
We performed an in-depth comparative analysis of the baselines and of the state-of-the-art methods that used the framework. The analysis showed that using only audio as input is sufficient to achieve a weighted average F1-score above 80% for filling type and level classification, although this high performance could be limited to the sensor types and setup of the CORSMAL Containers Manipulation dataset. Methods that use audio alone are robust to changes in container type, size, and shape, as well as pose during the manipulation. Moreover, filling type and level estimation can benefit from each other to avoid unfeasible solutions [35]. Container capacity is the most challenging physical property to estimate, with all methods affected by large errors and a maximum score of 65%. Performance on this task also affects the subsequent estimation of the filling mass. The design of a method that generalizes across the different containers and scenarios, especially for container capacity estimation and partially for filling level classification, remains challenging.
Future directions involve the exploration of fusion and learning methods that combine the acoustic and visual modalities to support the contactless estimation of the physical properties of containers and their content. The CORSMAL framework is open for further submissions and supports research in this upcoming area.
FIGURE 1. The multi-modal, multi-sensor system used to record a person manipulating a container and its content. The system includes two third-person-view cameras (at the two sides of the robot), a first-person-view camera mounted on the robot, a first-person view from the body-worn camera on the person, and an 8-microphone circular array (placed next to the robot arm).

FIGURE 2. The mass of objects (container and content) in the training set of the CORSMAL Containers Manipulation dataset. The class empty corresponds to the mass of the container, which is known. Legend: Empty, P5, P9, R5, W5, R9, W9,

FIGURE 3. Illustrative comparison of M6 [35] (left) and M4 [33] (right) for filling type (τj) and level classification (λj). The two methods take as input only an audio signal that is converted into a spectrogram representation. During training, the initial and final parts of the audio signal (gray areas) are removed based on the manual annotations to focus only on the action. Note that M4 [33] (right) computes features from overlapping audio frames (shaded gray areas on the spectrogram). KEY - CNN: convolutional neural network; FC: fully connected layer; LSTM: Long Short-Term Memory; MFCC: Mel Frequency Cepstral Coefficients.
FIGURE 4.
FIGURE 6. Comparison of statistics of the absolute error in estimating the container capacity for each testing container between M2 [31], M3 [32], M4 [33], and M5 [34]. Statistics of the box plot include the median (red line), the 25th and 75th quartiles, and the lower and upper whiskers. Note that outliers in the data are not shown. Note also the different scale of the y-axis. KEY - CX: container (C) index (X), where X is in the range [10, 15].
FIGURE 7. Comparison of the absolute error in estimating the container capacity between M2 [31], M3 [32], M4 [33], and M5 [34] for the different combinations of filling type and level in the combined public and private test set of the CORSMAL Containers Manipulation dataset. Statistics of the box plot include the median (red line), the 25th and 75th percentiles, and the minimum and maximum error.

TABLE 1. Methods that used the CORSMAL framework for filling level, filling type, and container capacity estimation. Methods are evaluated on the CORSMAL Containers Manipulation dataset.

TABLE 2. Filling level classification results (Task 1). Baselines and state-of-the-art methods (MX, with X ranging from 1 to 6) are ranked by their score on the combined test set.

Table 3
89 F1, respectively, but about 20 and 10 percentage points (pp) lower than M4, respectively. The table also shows that the performance of the baselines varies from random results to almost the same performance as the best performing M4. Using the spectrogram as an input feature (either after reshaping it into a vector or after applying PCA) yields the worst performance.

TABLE 3. Filling type classification results (Task 2). Baselines and state-of-the-art methods (MX, with X ranging from 1 to 6) are ranked by their score on the combined test set.

TABLE 4. Container capacity estimation results (Task 3). Methods are ranked by the average capacity score on the combined test set.

Method    Public    Private    Combined
M5 [34]   60.56     60.58      60.57
M3 [32]   63.00     62.14      62.57
M2 [31]   66.92     67.67      67.30

TABLE 6. Comparison of the filling mass estimation results. Methods are ranked by their score on the combined test sets of the CORSMAL Containers Manipulation dataset. Note that scores are weighted by the number of tasks addressed by the methods.