Single-stage intake gesture detection using CTC loss and extended prefix beam search

Accurate detection of individual intake gestures is a key step towards automatic dietary monitoring. Both inertial sensor data of wrist movements and video data depicting the upper body have been used for this purpose. The most advanced approaches to date use a two-stage approach, in which (i) frame-level intake probabilities are learned from the sensor data using a deep neural network, and then (ii) sparse intake events are detected by finding the maxima of the frame-level probabilities. In this study, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake detections. This is achieved by weakly supervised training using Connectionist Temporal Classification (CTC) loss, and decoding using a novel extended prefix beam search decoding algorithm. Benefits of this approach include (i) end-to-end training for detections, (ii) consistency with the fuzzy nature of intake gestures, and (iii) avoidance of hard-coded rules. Across two separate datasets, we quantify these benefits by showing relative $F_1$ score improvements between 2.0% and 6.2% over the two-stage approach for intake detection and eating vs. drinking recognition tasks, for both video and inertial sensors.


I. INTRODUCTION
Accurate information on dietary intake forms the basis of assessing a person's diet and delivering dietary interventions. To date, such information is typically sourced through memory recall or manual input, for example via dietitians [1] or smartphone apps used to log meals. Such methods are known to require substantial time and manual effort, and are subject to human error [2]. Hence, recent research has investigated how dietary monitoring can be partially automated using sensor data and machine learning [3].
Detection of individual intake gestures in particular is a key step towards automatic dietary monitoring. Wrist-worn inertial sensors provide an unobtrusive way to recognize these gestures. Early work on the Clemson dataset, established in 2012, used threshold values for detection from inertial data [4]. More recent developments include the use of machine learning to learn features automatically [5] and learning from video, which has become more practical with emerging spherical camera technology [6] [7]. Research on the OREBA dataset showed that frontal video data can exhibit even higher accuracies in detecting eating gestures than inertial data [8].
The two-stage approach introduced by Kyritsis et al. [9] is currently the most advanced approach benchmarked on publicly available datasets for both inertial [9] [10] and video data [6]. It first estimates frame-level intake probabilities using deep learning, which are then searched for maxima to detect intake events. Drawbacks of this approach include the explicit nature of the constraint imposed in the second stage, and the loss function not being directly aligned with the detection task.

Fig. 1. F1 scores for our two-stage and single-stage models in comparison with the current state of the art (SOTA). Our single-stage models see relative improvements of 10.2% and 2.6% over the SOTA for inertial [10] and video-based intake detection [6] on the OREBA dataset, and relative improvements between 2.0% and 6.2% over comparable two-stage models for intake detection and eating vs. drinking detection tasks across the OREBA and Clemson datasets.
In this paper, we propose a single-stage approach which directly decodes the probabilities learned from sensor data into sparse intake event detections. This approach is compatible with data from any sensor, including inertial and video. We achieve this by weakly supervised training [11] of the underlying deep neural network with Connectionist Temporal Classification (CTC) loss, and by decoding the probabilities using a novel extended prefix beam search algorithm. Compared to the approaches currently established in the literature, our study makes four key contributions:

1) Single-stage approach. This is the first study that applies a single-stage approach allowing for end-to-end training with a loss function that directly addresses the intake gesture detection task. We avoid the constraint associated with the second stage of two-stage models [9] [6] (i.e., the two-second gap between intake events).
2) Simplified labels. The proposed approach requires information about the occurrence and order of intake gestures, but not their exact timing. Hence, it is particularly suitable for intake gestures, whose start and end times are fuzzy in nature and highly time-consuming to determine.
3) Improved performance. Our single-stage models outperform two-stage models on the OREBA and Clemson datasets, including the current state of the art (SOTA) [6] [10] and two-stage versions of our models, see Fig. 1.
4) Intake gesture recognition. This is the first study simultaneously detecting and recognizing intake gestures as either eating or drinking from inertial and video data. Distinguishing between eating and drinking is an important step toward more fine-grained analysis of dietary intake.

Fig. 2. The thresholding approach [4] (left) searches the angular velocity for values that breach the thresholds T1 and T2. The two-stage approach [9] (center) independently estimates frame-level probabilities, which are then searched for maxima on the video level (generalized to two gesture classes here). The proposed single-stage approach (right) directly decodes the estimated probability distribution p(c|xt) using extended prefix beam search, after which token sequences in the most probable alignment Â are collapsed to yield the result.

The remainder of the paper is organized as follows: In Section II, we discuss the related literature on CTC and intake gesture detection. Our proposed method is introduced in Section III, including a complete pseudo-code listing of our proposed decoding algorithm. We present and analyse the evaluation of our proposed model and the SOTA on two datasets in Section IV. Finally, we conclude in Section V.

II. RELATED RESEARCH
A. Intake gesture detection

Intake gesture detection involves detecting the timestamps at which a person moved their hands to ingest food or drink during an eating occasion. It is one of the three elements of automatic dietary monitoring, which also encompasses recognition of the type of food consumed and estimation of the quantity consumed. Sensors that carry a signal appropriate for the detection of intake gestures include inertial sensors mounted to the wrist [12] and video recordings [6]. Note that information on eating events can also be derived from chewing and swallowing monitored using audio [13] [14], electromyography [15] [16], and piezoelectric sensors [17]. There are also other recent video-based approaches based on skeletal and mouth [18] as well as food, hand and face [7] features extracted using deep learning. For inertial data, there is recent work on in-the-wild monitoring [19]. In the following, we focus on two main approaches for inertial and video data that have been benchmarked on publicly available datasets:

1) Thresholding approach: In 2012, Dong et al. [4] noticed that intake gestures are strongly correlated with the angular velocity around the axis parallel to the wrist (wrist roll). They devised an easily interpretable thresholding approach which requires the angular velocity to first surpass a positive threshold (e.g., rolling the wrist one way to pick up food), and then a negative threshold (e.g., rolling the wrist the other way to pass food to the mouth). Refer to Fig. 2 (left) for an illustration. The approach selects these thresholds and two further parameters for minimum durations during and after a detection based on an exhaustive search of the parameter space. Note that this approach is not generalizable to multiple gesture classes.
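The thresholding rule can be sketched as follows; this is a minimal, hedged interpretation of the published approach (the function name, the simplified refractory-period handling, and the reduction to three parameters are our own illustration):

```python
def detect_by_threshold(wrist_roll, t1, t2, min_after):
    """Detect intake gestures in an angular velocity signal (wrist roll):
    an event is registered when the signal first surpasses the positive
    threshold t1 and subsequently drops below the negative threshold t2.
    After each detection, `min_after` frames are skipped."""
    detections = []
    i, n = 0, len(wrist_roll)
    while i < n:
        if wrist_roll[i] > t1:                 # roll one way: pick up food
            j = i + 1
            while j < n and wrist_roll[j] > t2:
                j += 1
            if j < n:                          # roll back: food to mouth
                detections.append(j)
                i = j + min_after              # enforce a refractory period
                continue
            i = j
        else:
            i += 1
    return detections
```

Selecting t1, t2 and the duration parameters then amounts to an exhaustive grid search on the training set, as described above.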
2) Two-stage approach: Kyritsis et al. [9] proposed a two-stage approach for detecting intake gestures from accelerometer and gyroscope data. Rouast and Adam [6] later adopted this approach for video data. In this approach, the first stage produces frame-level estimates for the probability of intake versus non-intake. These estimates are provided iteratively by a neural network trained on a sliding two-second context. The second stage identifies the sparse video-level intake gesture timings by running a thresholded maximum search on the frame-level estimates, constrained by a minimum distance of two seconds between detections. Fig. 2 (center) illustrates this approach generalized to two intake gesture classes.
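The second stage can be sketched as a thresholded maximum search with a minimum distance between detections; this is a hedged interpretation of the rule described above, not the original authors' code:

```python
def max_search(frame_probs, threshold, min_dist):
    """Stage 2 sketch: repeatedly pick the highest remaining frame-level
    probability above `threshold`, suppressing all frames within
    `min_dist` frames of an accepted detection."""
    p = list(frame_probs)
    detections = []
    while True:
        i = max(range(len(p)), key=lambda j: p[j])
        if p[i] <= threshold:
            break
        detections.append(i)
        for j in range(max(0, i - min_dist), min(len(p), i + min_dist + 1)):
            p[j] = -1.0  # suppress neighbouring frames
    return sorted(detections)
```

With frame rates as used later in the paper, `min_dist` would correspond to the two-second gap between detections.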
While this approach is also relatively easy to interpret and works well in practice [19], it has a few restrictions. Firstly, the second stage introduces the explicit constraint of a predefined gap between subsequent intake gestures. This constraint implies that any consecutive events occurring within two seconds of each other lead to false negatives. Secondly, the loss function during neural network training is geared towards optimizing the frame-level predictions, not the video-level detections. In the present work, we address these restrictions by introducing a new single-stage training and decoding approach using CTC; see Fig. 2 (right).

B. Connectionist temporal classification
In 2006, Graves et al. [20] proposed connectionist temporal classification (CTC) to allow direct use of unsegmented input data in sequence learning tasks with recurrent neural networks (RNNs). By interpreting network output as a probability distribution over all possible token sequences, they derived CTC loss, which can be used to train the network via backpropagation [21]. Hence, what sets CTC apart from previous approaches is the ability to label entire sequences, as opposed to producing labels independently in a frame-by-frame fashion.
While the original application of CTC was phoneme recognition [20], researchers have applied it in various sequence learning tasks such as end-to-end speech recognition [22], handwriting recognition [23], and lipreading [24]. In the most closely related prior research to the present work, Huang et al. [11] extended the CTC framework to enable weakly supervised learning of actions from video, simplifying the required labelling process. To the best of our knowledge, CTC has neither been applied for temporal localization of actions from sensor data nor intake gesture detection.
III. PROPOSED METHOD

Our proposed approach interprets the problem of intake gesture detection as a sequence labelling problem using CTC. This allows us to operate within a single-stage approach, meaning that both probability estimation and intake gesture detection are operationalized for a single time window of data, as exemplified in Fig. 3:
• The probability distribution p(c|xt) over all possible token sequences is estimated using a neural network trained with CTC loss.
• We decode p(c|xt) to determine an alignment A using extended prefix beam search. We then derive the gesture timings by collapsing event token sequences within A.

The proposed extended prefix beam search is a complex algorithm. To lay the necessary groundwork, we start by introducing the concept of alignments and derive the CTC loss function. We then describe greedy decoding and prefix beam search as alternative decoding algorithms which provide the motivation for our extension. Finally, we introduce the proposed extended prefix beam search.

A. Alignment between sensor data and labels
In many pattern recognition tasks involving the mapping of input sequences X to corresponding output sequences Y , we encounter problems relating to the alignment between the elements of X and Y . Often, real-world sensor data cannot naturally be aligned with fixed-size tokens: In handwriting recognition, for example, some written letters in X are spatially wider than others, unlike the fixed-size tokens in Y [23]. We face the same problem in intake gesture recognition, where gesture events can have various durations.
To account for the dynamic duration of events in the input, we create an alignment A by using the token in question multiple times [25], as in the example in Fig. 3. The blank token _ is additionally introduced to allow separation of multiple instances of the same event class, A = [E, E, _, E, E, D, D, D] in the example. We derive the token sequence Y from an alignment A by first collapsing repeated tokens and then removing the blank token. Hence, the token sequence for the example is Y = [E, E, D], which correctly reflects the ground truth label. Note that a collapsed output token sequence Y can have many possible corresponding alignments A.

Fig. 3. An example with (1) a dataset sample represented by data and label with corresponding alignment A_L and collapsed token sequence Y_L, and (2) the single-stage approach for intake gesture detection with estimated probabilities p(c|xt), along with the alignments and collapsed token sequences produced by greedy decoding, prefix beam search, and extended prefix beam search.
Note that the alignment A_E produced by extended prefix beam search is the key element missing from simple prefix beam search.
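The collapse operation described above can be sketched in a few lines (a minimal illustration; the token names and blank symbol are placeholders):

```python
def collapse(alignment, blank="_"):
    """Collapse an alignment to its token sequence: first merge repeated
    tokens, then remove blank tokens."""
    out = []
    prev = None
    for token in alignment:
        if token != prev:          # merge repeats
            out.append(token)
        prev = token
    return [t for t in out if t != blank]   # drop blanks

collapse(["E", "E", "_", "E", "E", "D", "D", "D"])  # → ["E", "E", "D"]
```

Note that without the blank token, the two consecutive E gestures in the example would be merged into a single detection.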

B. CTC loss for probability distribution estimation
Suppose we have an input sequence X of length T, the corresponding output token sequence Y, and possible tokens Σ. Our network is designed to express a probability estimate p(c|xt) for each token c in Σ given the sensor input xt at time t. Fig. 3 continues the previous example to show what the network output p(c|xt) might look like. The objective of CTC loss is to minimize the negative log-likelihood of p(Y|X), which is the probability that the network predicts Y when presented with X [20]. This probability can be efficiently computed using dynamic programming, adding up the probabilities of all alignments A_{X,Y} that produce Y [21]:

p(Y|X) = Σ_{A ∈ A_{X,Y}} Π_{t=1}^{T} p(a_t|x_t)    (1)
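For illustration, the dynamic-programming computation of p(Y|X) can be sketched with the standard CTC forward algorithm; this is a minimal reference sketch under our own naming, not the paper's implementation:

```python
def ctc_probability(probs, labels, blank=0):
    """Compute p(Y|X) by summing over all alignments that collapse to
    `labels`, via the CTC forward algorithm. `probs` is a T x C list of
    per-frame token probabilities."""
    # Interleave blanks around the labels: [_, y1, _, y2, _, ...]
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1][s - 1]              # advance by one
            # Skipping a blank is allowed unless it would merge repeats
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid alignments end on the last label or the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

The CTC loss for a training example is then simply -log p(Y|X), which can be minimized via backpropagation as described above.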
Using CTC loss for intake gesture detection allows our networks to be trained in a weakly supervised fashion with the less restrictive collapsed labels. This implies that our networks will learn to make predictions differently than when trained with cross-entropy loss, as we explore further in Section IV-E. It also implies that examples are required to regularly contain multiple intake gestures for the network to learn properly (e.g., two eating and one drinking gesture in Fig. 3).

C. Greedy decoding
During inference, we decode the probabilities p(c|xt) into a sequence of tokens Y. This can be interpreted as choosing an alignment A, which is then collapsed to Y. A fast and simple solution is greedy decoding, which chooses the alignment by selecting the maximum probability token at each time step t [25], as in Equation 2:

a_t = argmax_{c ∈ Σ} p(c|x_t)    (2)
However, this method is not guaranteed to produce the most probable Y, since it does not take into account that each Y can have many possible alignments [25]. In the example of Fig. 3, this is what leads greedy decoding to a suboptimal token sequence.
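Greedy decoding amounts to a per-frame argmax followed by the collapse operation; a minimal sketch (names are illustrative):

```python
def greedy_decode(probs, blank=0):
    """Pick the most probable token per frame, then collapse repeats and
    remove blanks. `probs` is a T x C list of per-frame probabilities.
    Returns the chosen alignment and the collapsed token sequence."""
    alignment = [max(range(len(frame)), key=lambda c: frame[c])
                 for frame in probs]
    out, prev = [], None
    for token in alignment:
        if token != prev and token != blank:   # collapse, drop blanks
            out.append(token)
        prev = token
    return alignment, out
```

Because the argmax is taken independently per frame, probability mass spread over many alignments of the same token sequence is ignored, which is exactly the weakness discussed above.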

D. Prefix beam search
Traversing all possible alignments turns out to be infeasible due to their large number [25]. The prefix beam search algorithm [20] uses dynamic programming to search for a token sequence Ŷ that maximises p(Ŷ|X). It presents a trade-off between computation and solution quality, which can be adjusted through the beam width k, determining how many possible solutions are remembered. Prefix beam search with a beam width of 1 is equivalent to greedy decoding. However, it is important to note that prefix beam search does not remember specific alignments. Hence, it is not possible to temporally localize intake events (see the missing A_B in Fig. 3).
The algorithm maintains beams in terms of prefixes (candidates for the output token sequence Ŷ up to time t), which are stored in a list Y. Each prefix ℓ is associated with two probabilities, the first of ending in a blank, p_b(ℓ|x_{1:t}), and the second of not ending in a blank, p_nb(ℓ|x_{1:t}). For each time step t, the algorithm updates the probabilities for every prefix in Y for the different cases of (i) adding a repeated token and (ii) adding a blank, and adds possible new prefixes. Due to the algorithm design, branches with equal prefixes are dynamically merged. The algorithm then keeps the k best updated prefixes.
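A hedged sketch of standard prefix beam search (without our alignment extension) in pure Python; the bookkeeping follows the common formulation of the algorithm rather than Algorithm 1, and some refinements of published variants are omitted for brevity:

```python
import collections

def prefix_beam_search(probs, beam_width=10, blank=0):
    """Decode a T x C matrix of per-frame token probabilities into the most
    probable collapsed token sequence. Each beam is a prefix (tuple of
    tokens) with probabilities of ending in blank (pb) / non-blank (pnb)."""
    beams = {(): (1.0, 0.0)}   # empty prefix: pb = 1, pnb = 0
    for frame in probs:
        nxt = collections.defaultdict(lambda: (0.0, 0.0))
        for prefix, (pb, pnb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b + (pb + pnb) * p, nb)
                elif prefix and c == prefix[-1]:
                    # Repeated token: stays in the same prefix unless the
                    # previous frame ended in a blank
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, nb + pnb * p)
                    b, nb = nxt[prefix + (c,)]
                    nxt[prefix + (c,)] = (b, nb + pb * p)
                else:
                    b, nb = nxt[prefix + (c,)]
                    nxt[prefix + (c,)] = (b, nb + (pb + pnb) * p)
        # Beams with equal prefixes merge via the dict; keep the k best
        beams = dict(sorted(nxt.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_width])
    best_prefix, (pb, pnb) = max(beams.items(),
                                 key=lambda kv: kv[1][0] + kv[1][1])
    return list(best_prefix), pb + pnb
```

Note that this returns only the token sequence Ŷ and its probability; no alignment is retained, which is the limitation our extension addresses.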

E. Extended prefix beam search
Standard prefix beam search finds a token sequence Ŷ without retaining information about the alignments A_{X,Ŷ}. In order to infer the timing of the decoded events in a way consistent with CTC loss, we would like to find Â, the most probable alignment that could have produced Ŷ, as expressed by Equation 3:

Â = argmax_{A ∈ A_{X,Ŷ}} p(A|X)    (3)
Instead of running a separate algorithm based on Ŷ, we search for Â simultaneously as part of prefix beam search, which already includes most of the necessary computation. We add two additional lists for each beam ℓ, A_b(ℓ) and A_nb(ℓ), which store alignment candidates that resolve to ℓ as well as their corresponding probabilities. Every time a probability is updated in prefix beam search, we add new alignment candidates and associated probabilities to the appropriate lists. This includes (i) adding a repeated token, (ii) adding a blank token, and (iii) adding a token that extends the prefix. The algorithm design implies that if two beams with identical prefixes are merged, their alignment candidates are also merged dynamically. At the end of each time step t, we resolve the alignment candidates for each ℓ in Y by choosing the highest probability in each of A_b(ℓ) and A_nb(ℓ). Finally, for each of the k best token sequences in Y, the best alignment candidate Â is chosen as the more probable one out of A_b(ℓ) and A_nb(ℓ).
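For intuition, the objective in Equation 3 could also be met by the separate route our extension avoids: a Viterbi-style forced alignment over the CTC lattice for a fixed Ŷ. The following is a minimal sketch of that alternative under our own naming, not the paper's extended search:

```python
import math

def ctc_forced_alignment(probs, labels, blank=0):
    """Most probable alignment (one token per frame) that collapses to
    `labels`, found by Viterbi search over the lattice of labels
    interleaved with blanks. `probs` is a T x C list of probabilities."""
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    T, S = len(probs), len(ext)
    NEG = float("-inf")
    logp = lambda t, s: math.log(max(probs[t][ext[s]], 1e-12))
    dp = [[NEG] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]
    dp[0][0] = logp(0, 0)
    if S > 1:
        dp[0][1] = logp(0, 1)
    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1][s]]                     # stay
            if s >= 1:
                cands.append(dp[t - 1][s - 1])         # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1][s - 2])         # skip a blank
            k = max(range(len(cands)), key=lambda i: cands[i])
            dp[t][s] = cands[k] + logp(t, s)
            bp[t][s] = s - k
    # Valid alignments end on the last label or the trailing blank
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = bp[t][s]
    return path[::-1]
```

The extended prefix beam search obtains the same kind of alignment without this second pass, by tracking alignment candidates during decoding itself.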
We created a Python implementation of the version listed in Algorithm 1. Note that this version is not created with efficiency in mind. For our experiments, we implemented a more efficient version as a C++ TensorFlow kernel.

F. Network architectures
Although they are trained with different loss functions, the single-stage and two-stage approaches each rely on an underlying deep neural network which estimates probabilities; here, for an 8-second window of sensor data from the OREBA and Clemson datasets. We choose adapted versions of the ResNet architecture [26]. Our video network is a CNN-LSTM with a ResNet-50 backbone adjusted for our video resolution. For inertial data, we use a CNN-LSTM with a ResNet-10 backbone using 1D convolutions. Table I reports the parameters and output sizes for all layers.
Algorithm 1: Extended prefix beam search algorithm (loosely based on [27]): The algorithm stores current prefixes in Y. Probabilities are stored and updated in terms of prefixes ending in blank, p_b(ℓ|x_t), and non-blank, p_nb(ℓ|x_t), facilitating dynamic merging of beams with identical prefixes. Y is initialized with the empty prefix, which is associated with probability 1 for blank and 0 for non-blank. A_b(ℓ) and A_nb(ℓ) store the current candidates for alignments (ending in blank and non-blank) pertaining to prefix ℓ, along with their probabilities. They are likewise initialized for the empty prefix. The algorithm then loops over the time steps, updating the prefixes and associated alignments. Each current candidate is re-entered into the new prefixes Y, adjusting the probabilities for repeated tokens and added blanks. The corresponding alignment candidates and their probabilities are added to the new alignment candidates A_nb(ℓ) and A_b(ℓ). Furthermore, for each non-blank token in Σ, a new prefix is created by concatenation, the probability is updated, and corresponding alignment candidates are added. At the end of each time step, we set Y to the k most probable prefixes in Y and resolve the alignment candidates for each of those prefixes as the most probable ones. Finally, for each of the k best token sequences in Y, the best alignment candidate is chosen as the more probable one out of A_b(ℓ) and A_nb(ℓ).
Data: Probability distributions p(c|x t ) for tokens c ∈ Σ in sensor data x t from t = 1, . . . , T .
Result: k best decoded sequences of tokens Y and best corresponding alignments A.

IV. EXPERIMENTS AND ANALYSIS
In the experiments, we compare the proposed single-stage approach to the thresholding approach [4] and the two-stage approach [9] [10]. We consider two datasets of annotated intake gestures: The OREBA dataset [6] and the Clemson Cafeteria dataset [28]. To the best of our knowledge, these are the largest publicly available datasets for intake gesture detection. For both datasets, we attempt detection of generic intake events, as well as simultaneous detection and recognition of eating and drinking gestures. For OREBA, we run separate experiments for inertial and video data. Across our experiments, we use time windows of 8 seconds, which ensures that examples regularly contain multiple intake events. All code used for the experiments is available at https://github.com/prouast/ctc-intake-detection.
A. Approaches

1) Thresholding approach: We implemented the thresholding approach with four parameters as described by Dong et al. [4] and Shen et al. [28], which relies only on angular velocity (wrist roll). For each dataset, we used the training set to estimate the parameters T1, T2, T3, and T4.
2) Two-stage approach: SOTA results on OREBA [6] [10] are based on 2-second time windows. However, a 2-second time window is not sufficient for the single-stage approach. Hence, to still facilitate a fair comparison between single-stage and two-stage, we train our own two-stage models based on 8-second time windows and the same architecture as our single-stage models. These models are trained with cross-entropy loss. Video-level detections are derived using the Stage 2 maximum search algorithm outlined in [9]. To facilitate multi-class comparison, we also extend the Stage 2 search by applying the same threshold to both intake gesture classes.
3) Single-stage approach: Our single-stage models are trained using CTC loss [20]. One caveat of the single-stage approach is that it requires a longer time window than Stage 1 of the two-stage approach. This is to ensure that multiple gestures regularly appear in the training examples, providing a signal for learning temporal relations. We found that a time window of 8 seconds is just sufficient for this purpose. For inference, the probabilities estimated for each temporal segment are decoded into an alignment using extended prefix beam search with beam width 10, and then collapsed to yield event detections. On the video level, we first aggregate detections from the individual alignments of sliding windows using frame-wise majority voting before collapse.
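The aggregation step can be sketched as frame-wise majority voting over the alignments of overlapping windows; the stride handling and tie-breaking (lowest class index wins) are our own illustrative assumptions:

```python
def aggregate_alignments(window_alignments, stride, total_frames, num_classes):
    """Frame-wise majority vote across overlapping sliding-window
    alignments. Window w covers frames [w * stride, w * stride + len)."""
    votes = [[0] * num_classes for _ in range(total_frames)]
    for w, alignment in enumerate(window_alignments):
        for i, token in enumerate(alignment):
            votes[w * stride + i][token] += 1
    # Per frame, keep the most voted token (ties: lowest class index)
    return [max(range(num_classes), key=lambda c: votes[t][c])
            for t in range(total_frames)]
```

The resulting video-level alignment is then collapsed as in Section III-A to yield the final sparse detections.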
B. Training and evaluation metrics

1) Training: All networks are trained using the Adam optimizer on the respective training set with batch size 128 for inertial and 16 for video, and an exponentially decreasing learning rate starting at 1e-3. We also use minibatch loss scaling analogous to [6]. Hyperparameter and model selection is based on the validation set unless stated otherwise.

2) Evaluation: For comparison we use the F1 measure, applying an extended version of the evaluation scheme proposed by Kyritsis et al. [9] (see Fig. 4). The scheme uses the ground truth to translate sparse detections into measurable metrics for a given label category. As Rouast and Adam [6] report, one correct detection per ground truth event counts as a true positive (TP), while further detections within the same ground truth event are false positives of type 1 (FP1). Detections outside ground truth events are false positives of type 2 (FP2), and non-detected ground truth events count as false negatives (FN). We extend the original scheme to support the multi-class case, where detections for a wrong class are false positives of type 3 (FP3). Based on the aggregate counts, we calculate precision = TP / (TP + FP1 + FP2 + FP3), recall = TP / (TP + FN), and the F1 score = 2 * precision * recall / (precision + recall).
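The counting scheme can be sketched as follows; the data structures (ground truth events as (start, end, label) tuples, detections as (time, label) tuples) and the guards against division by zero are our own illustration:

```python
def evaluate(gt_events, detections):
    """Count TP/FP1/FP2/FP3/FN per the multi-class scheme and compute F1.
    `gt_events`: list of (start, end, label); `detections`: (time, label)."""
    counts = {"TP": 0, "FP1": 0, "FP2": 0, "FP3": 0, "FN": 0}
    matched = set()
    for time, label in detections:
        idx = next((i for i, (s, e, _) in enumerate(gt_events)
                    if s <= time <= e), None)
        if idx is None:
            counts["FP2"] += 1          # detection outside any ground truth
        elif gt_events[idx][2] != label:
            counts["FP3"] += 1          # detection with the wrong class
        elif idx in matched:
            counts["FP1"] += 1          # further detection in same event
        else:
            matched.add(idx)            # first correct detection
            counts["TP"] += 1
    counts["FN"] = len(gt_events) - len(matched)
    precision = counts["TP"] / max(1, counts["TP"] + counts["FP1"]
                                   + counts["FP2"] + counts["FP3"])
    recall = counts["TP"] / max(1, counts["TP"] + counts["FN"])
    f1 = 2 * precision * recall / max(1e-12, precision + recall)
    return counts, f1
```

In the single-class setting, FP3 simply remains zero and the scheme reduces to the original one of [9].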
C. Datasets

1) OREBA: The OREBA dataset [8] includes both inertial and video data. Specifically, we use the scenario OREBA-DIS with data for 100 participants (69 male, 31 female) and 4790 annotated intake gestures. Data are split into training, validation, and test sets of 61, 20, and 19 participants according to the split suggested by the dataset authors [8]. For our inertial models, we use the processed data from accelerometer and gyroscope readings for both wrists at 64 Hz. The video data comes at a frame rate of 24 fps and a spatial resolution of 140x140 pixels. We downsample the video to 2 fps and use data augmentation analogous to [6], which includes spatial cropping to 128x128 pixels. The choice of 2 fps presents a trade-off, as limited GPU memory does not allow us to run experiments based on more than 16 frames at a time. For this dataset, 8 seconds correspond to 16 frames of video at 2 fps and 512 frames of inertial data at 64 Hz.
2) Clemson: The Clemson dataset [28] consists of 488 annotated eating sessions across 264 participants (127 male, 137 female), with a combined number of 20644 intake gestures (referred to as bites in the original paper). Sensor data for accelerometer and gyroscope is available for the dominant hand at 15 Hz. We split the sessions into training, validation, and test sets (302, 93, and 93 sessions respectively) such that each participant appears in only one of the three; details are available in Section S2 of the Supplementary Material. For this dataset, 8 seconds correspond to 120 samples. Before feeding the sensor data into our models, we apply the same preprocessing as for OREBA.

D. Results

Results are listed in Table II, and extended results with detailed metric counts are available in Section S1 of the Supplementary Material. Note that the SOTA models in Table II use time windows of 2 seconds (SOTA results as reported in [6]; test set results as reported in [8]), while our models require time windows of 8 seconds due to the nature of the single-stage approach.

1) Detecting intake gestures: Here, the goal is to detect only one generic intake event class. The results displayed in the center column of Table II reveal that the single-stage approach generally yields higher performance than the thresholding and two-stage approaches (average improvement of 6.4% over SOTA and 2.9% over two-stage versions of our own models).
For OREBA, the relative improvement over the SOTA equals 10.2% and 2.6% for the inertial and video modalities, respectively. In the same vein, we measure improvements of 3.2% and 2.0% over the two-stage versions of our own models. We can make an observation regarding the difference between inertial and video results on OREBA: For inertial, our two-stage model with an 8-second time window leads to a significant improvement over the two-stage SOTA, accounting for ca. 66% of the improvement recorded for our single-stage model over the two-stage SOTA. For video, on the other hand, the same figure is only ca. 23%. A plausible explanation for this observation is that a larger time window does not make up for missing detail due to the reduced frame rate.
Besides the thresholding approach [4] [28], we are not aware of any SOTA deep learning models on the Clemson dataset. The results demonstrate that both the two-stage and single-stage approaches outperform thresholding by a large margin. This is not surprising, since thresholding relies exclusively on one channel of gyroscope data, while the deep learning models have many more parameters. Comparing our own models, we find that the single-stage approach leads to a relative improvement of 3.5%, which is in a similar ballpark to the results for OREBA. It is worth noting that the F1 scores are generally lower for Clemson than for OREBA, indicating that it is the more challenging dataset for intake gesture detection. However, this may be related to the lower sampling rate in Clemson and the fact that data for both wrists is available in OREBA, while only the dominant wrist is included in Clemson.
2) Simultaneous detection of intake events and recognition of eating vs. drinking: This task consists of detecting intake events and simultaneously recognizing them as either eating or drinking. As there is no current SOTA for this more fine-grained classification, we solely compare results for the separately trained two-stage and single-stage versions of our own models. In the right-hand columns of Table II, we report separate F1 scores for eating and drinking individually, as well as both together.
We can make three main observations: Firstly, the single-stage approach again outperforms the two-stage approach for both datasets and modalities, although the effect is more pronounced for inertial data, with an average relative improvement of 5.9%. Secondly, the increased difficulty of this task compared to the generic detection task is noticeable in the difference between the F1 and F1(E∧D) scores, a decrease of 3.0% for OREBA and 4.1% for Clemson. Thirdly, there is no clear indication whether eating or drinking is easier to detect. While the average across both datasets and modalities hints at eating being easier, this does not hold true for all combinations.
Additionally, it is interesting to note that there are generally very few misclassifications between eating and drinking. As indicated by Table III, the frequency of false positives of type 2 is higher than the frequency of false positives of type 3 by almost two orders of magnitude.

E. Effect of training with CTC loss or cross-entropy loss
During our introduction of CTC loss in Section III-B, we mentioned that weakly supervised training with CTC causes our networks to learn a different approach to detecting events than cross-entropy loss does. We can think of cross-entropy loss as causing the network to predict whether a frame occurs anytime during the gesture being detected. The analogous way of thinking about CTC loss is that it causes the network to predict which frames are most distinctive of the gesture being detected. As a result, the predictions of our single-stage models look more like probability spikes, while the two-stage models produce sequences of high probability values.
We illustrate this characteristic difference between the single-stage and two-stage approaches in Fig. 5 using an example. We observe that the predictions by the two-stage models indeed mimic the ground truth, while the single-stage models produce probability spikes where events are detected. Furthermore, these probability spikes line up temporally with the patterns that are most distinct about the gestures to the human eye. That is, the single-stage video model spikes at exactly the frames where the participant begins ingesting the food and drink. The inertial data is more difficult to interpret, but the times where the spikes occur are also associated with the most pronounced changes in the inertial signal.
When averaging the results across all datasets and tasks as reported in Table III, it becomes clear that training with CTC loss accounts for the majority of the improvement of single-stage models over two-stage models. The effect of training with CTC loss manifests itself in a higher true positive rate and an associated lower false negative rate. Furthermore, there is a significant drop in false positives of type 1, which were previously conjectured to be a restriction of the two-stage approach [6]. In particular, the single-stage approach avoids the hard-coded 2-second gap in Stage 2 of the two-stage approach and is thus less likely to produce false positives of type 1 for gestures with a long duration.

F. Difference between greedy decoding and extended prefix beam search decoding
In theory, the results produced by the proposed extended prefix beam search decoding better reflect the network's intended output than greedy decoding, since they are computed in the same way as CTC loss is computed internally. However, our scenario is characterized by few classes and relatively low uncertainty, as also indicated by the low rate of false positives of type 3 in Table III and the high prediction confidences in Fig. 5. Hence, the effect of extended prefix beam search decoding turns out not to be very noticeable: a relative improvement of only 0.20% over greedy decoding, as indicated by Table III. This improvement is characterized by a higher true positive rate and an associated lower false negative rate, but also a higher rate of false positives of type 2.
Further to this point, recall that extended prefix beam search decoding with a beam width of 1 is equivalent to greedy decoding. While the previously reported results are based on a beam width of 10, experiments with other beam widths show that values over 2 do not lead to further improvements. As illustrated in Fig. 6, extended prefix beam search decoding with beam widths greater than 1 mainly had benefits for the OREBA models.
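The greedy baseline can be sketched as follows: take the argmax class per frame, collapse consecutive repeats, and drop blanks. This is the standard CTC best-path decoding and, as noted above, coincides with extended prefix beam search at beam width 1; the class layout (blank = 0, eat = 1, drink = 2) and function name are illustrative assumptions. Unlike the extended algorithm, this sketch returns only the label sequence, not the event timestamps.

```python
import numpy as np

def greedy_ctc_decode(frame_probs, blank=0):
    """Greedy (best-path) CTC decoding: argmax per frame, collapse
    consecutive repeats, then remove blanks. Equivalent to extended
    prefix beam search decoding with beam width 1."""
    best = np.argmax(frame_probs, axis=1)
    decoded, prev = [], blank
    for c in best:
        if c != prev and c != blank:
            decoded.append(int(c))
        prev = c
    return decoded

# Three classes: 0 = blank, 1 = eat, 2 = drink; each row is one frame.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8,  0.1 ],
                  [0.1, 0.8,  0.1 ],
                  [0.9, 0.05, 0.05],
                  [0.1, 0.1,  0.8 ]])
print(greedy_ctc_decode(probs))  # [1, 2]
```

With spiky, high-confidence outputs like those in Fig. 5, the best path dominates the probability mass, which is consistent with the small observed gap between greedy and beam search decoding.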

V. CONCLUSION
In this paper, we introduced a single-stage approach to detect and simultaneously recognize intake gestures. This is achieved by weakly supervised training of a deep neural network with CTC loss and decoding using a novel extended prefix beam search decoding algorithm. Using CTC loss instead of cross-entropy loss allows us to interpret intake gesture detection as a sequence labelling problem, where the network labels an entire sequence rather than labelling each frame independently. Additionally, to the best of our knowledge, we are the first to attempt simultaneous detection of intake gestures and distinction between eating and drinking using deep learning. We demonstrate improvements over the established two-stage approach [9], [6] using two datasets. These improvements apply to both generic intake gesture detection and eating vs. drinking recognition tasks, and to both video and inertial sensor data.
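The sequence labelling formulation can be illustrated with a minimal training step. This sketch uses PyTorch's `nn.CTCLoss` purely as an illustration (the framework, class count, and blank index are assumptions): note that the target is just the ordered list of gesture classes in the window, with no start or end timestamps.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 64 frames, batch of 1, 3 classes
# (0 = CTC blank, 1 = eat, 2 = drink).
T, N, C = 64, 1, 3

# Stand-in for the network output; in practice these logits come from
# the video or inertial model.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Weak supervision: only the order of gestures, e.g. eat, eat, drink.
targets = torch.tensor([[1, 1, 2]])
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([3])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow end-to-end towards sparse detections
```

CTC marginalizes over all frame-level alignments consistent with the target sequence, which is what frees the labels from exact timestamps.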
The proposed extended prefix beam search decoding algorithm is the second novel element in this context besides CTC loss. This algorithm allows us to decode the probability estimates provided by the deep neural network in a way that is consistent with the computation of CTC loss. However, despite the theoretical benefits of this algorithm, our results show that training with CTC loss accounts for the lion's share of the improvements we see over the two-stage approach. This could be explained by the low number of classes for the datasets and tasks considered here. Greedy decoding can hence be seen as a fast baseline alternative. It remains to be seen in future work whether extended prefix beam search decoding is more useful when working with a larger number of classes and higher associated uncertainty.
Limitations of the single-stage approach include a requirement for a larger time window during training than the two-stage approach. This is necessary to ensure that multiple intake gestures are regularly presented during training, as a basis for learning the temporal interplay between intake gestures. It follows that the single-stage approach also requires more GPU memory, since more activations and gradients have to be stored during training. In our work, this mainly had an impact on the video model, which has a large memory footprint to begin with.

This work has several implications for future research. We have shown a feasible way of detecting intake gestures while simultaneously classifying them into eating and drinking. Given larger video datasets with more food types and associated labels, it should be possible to perform more fine-grained classification of different foods. The necessity of large datasets has been pointed out [30], and detailed food classes are in fact available for the Clemson dataset, but tentative experiments indicated that inertial sensor data may not be sufficiently expressive to yield satisfactory results for food classification. Another implication directly concerns the practical task of labelling future datasets. When working with CTC loss, events do not need to be painstakingly labelled with a start and end timestamp. Instead, it is sufficient to mark the apex of the gesture, similar to how the single-stage approach makes detections. This has the potential to significantly reduce the labelling workload and the ambiguity around determining the exact start and end times of intake gestures.