EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio and egocentric video, with 240,000 ground-truth, time-stamped, word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines that predict turn-taking within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to improve performance on embodied AI tasks.


INTRODUCTION: THE NEED FOR EGOCENTRICITY
Consider a conversational turn-taking system that predicts who will be speaking in five seconds, or whether it's a good time to start, stop, or continue talking. Such a system is useful as an aid for persons with autism [1]; for teaching a personal assistant when to help [2]; for autonomous multi-agent collaboration [3]; or for attention measurement in affective computing [4]. The complexity of conversation makes predicting turn-taking a challenging task. Here we use a simple baseline to show that multi-perspective egocentric data has compelling benefits.
Alternatively, consider a global transcription system for multi-person transcription and speaker identification. With a single audio source, one must solve the challenging cocktail party problem [5], and with multiple audio input sources, one must solve an often misspecified matrix factorization [6]. The inclusion of visual features has helped simplify object classification and source separation [7], [8], [9], but global transcription requires simultaneous speaker disambiguation and multi-channel speech recognition [6]. Here we show how synchronous multi-perspective egocentric data enables a simple solution that improves a state-of-the-art speech-to-text system by 79 percent compared to asynchronous, single-perspective transcription.
We introduce the Egocentric Communications (EgoCom) dataset, the first multi-perspective egocentric dataset comprised of natural human conversations, and establish baseline performances for both turn-taking and global transcription. The primary contribution of EgoCom is the unique nature of the data. EgoCom is a multi-modal, synchronous multi-perspective, egocentric communications dataset comprising 38 unique 20-30 minute natural conversations. Each conversation has three participants, with at least two wearing video recording glasses. Egocentric video is captured from the perspective of the eyes and embodied stereo audio is captured from the perspective of the ears. Transcriptions are provided via human annotators. Within each conversation, the start and end times of each participant's data are synchronized.
Beyond turn-taking and transcription, EgoCom is timely and relevant as an egocentric communication benchmark dataset. The ubiquity of hand-held smart devices and head-worn recording devices [10] has led to a proliferation of egocentric video, yet the usefulness of egocentric capture remains largely unrealized by the artificial intelligence community. While recent datasets like EPIC-KITCHENS [11] and GTEA Gaze+ [12] have significantly advanced this goal, there is no public egocentric dataset addressing two key elements of embodied intelligence: natural language and multi-perspective interaction. EgoCom serves this purpose. We choose the term communications, as opposed to conversations, because the multi-modal nature of the dataset includes both verbal language and non-verbal cues [13].
EgoCom captures language across three modalities: verbal, vocal, and visual [13]. EgoCom captures verbal cues through human-transcribed annotations, vocal cues through egocentric audio data, and visual cues through gestures, body language, and gaze. EgoCom amplifies egocentric audio relative to quieter surrounding audio, thereby simplifying tasks like speaker identification [14]: a simple solution is to take the maximum magnitude of the aligned audio. Similarly, EgoCom enhances visual cues through the egocentric perspective by enabling spatial AI techniques like head-pose estimation, combined with traditional computer vision techniques like body pose estimation [15].
Our goal is not to discover state-of-the-art algorithms for turn-taking prediction or global transcription, but instead to demonstrate how the synchronized multi-perspective, multi-modal nature of the EgoCom dataset simplifies solutions to otherwise challenging tasks, while establishing baseline scores for these tasks in the embodied AI context. Our contributions can be summarized as follows: 1) we created the first multi-modal, synchronized multi-perspective egocentric communications dataset; 2) we established baseline accuracy for embodied turn-taking prediction in human conversations; and 3) we established baseline global transcription accuracy with multi-perspective egocentric data.

RELATED WORK
Common benchmark datasets in computer vision [16], [17], [18] have been essential in catapulting advances in machine learning, but contain data from a third-party perspective, losing contextual egocentric information such as head pose, or imperceptible sounds like the quiet breath one takes before speaking. Instead, EgoCom is multi-disciplinary, combining synchronized multi-perspective, multi-modal communications data and egocentricity [19] with elements of conversational AI [20], natural language, audio, computer vision, and spatial AI [21]. There are a number of related video-based datasets. Action classification datasets include Kinetics, a video dataset for human action classification [22]; ActivityNet, a video dataset for action classification and temporal localization [23]; and AVA, a dataset of spatio-temporally localized atomic visual actions (AVA) [24]. Multi-modal AI datasets include AVA-ActiveSpeaker, an audio-visual dataset for speaker detection [25]; the VGG lip reading dataset, an audio-visual dataset for speech recognition and separation [26]; MOSI, a multi-modal corpus of sentiment intensity [27], [28]; and OpenFace, for multi-modal face recognition [29]. The two major advantages of EgoCom are egocentricity and the inclusion of multiple participants' synchronized audio and video, which, as we show, simplifies multi-speaker applications.
There are several related prior works that study social interactions in egocentric vision. Fathi et al. [30] present a first-person visual dataset with detection and recognition tasks. Rehg et al. [31] analyze children's social and communicative behaviors based on video and audio data. Yonetani et al. [32] collect a human interaction dataset and present action and reaction recognition tasks. Joo et al. [33] present a task and a 3D motion dataset to understand human social interactions. Li et al. [34] introduce a dual relation modeling framework for egocentric human interactions using vision signals. EgoCom differs from [30] and [32] in that it captures multi-perspective multi-modal signals from interpersonal communications, which can be exploited beyond vision tasks. EgoCom provides a new dataset to benchmark existing models, e.g., [34], as well as future extensions that leverage multi-perspective multi-modal content captured in natural social settings.
EgoCom combines multi-modal AI [35], [36] with egocentricity. Multi-modal data can be useful for tasks like multi-party speech recognition and predicting turn-taking by combining verbal and non-verbal cues at multiple granularities [13], [37]. For example, [38] is related to predicting turn-taking, but does not take advantage of the multiple egocentric perspectives afforded by EgoCom. Numerous egocentric datasets exist [19], [39], [40], [41], but the main advantage of EgoCom over these datasets is the conversational content. Whereas these previous datasets were action-oriented, EgoCom is communications-oriented in an effort to link conversational AI, audio, and natural language tasks with egocentric computer vision. This makes EgoCom natural for "looking-to-listen" tasks [7], [8], [9].
Like Owens and Efros [42], where multi-modal data is shown to be useful as a source of self-supervision, we demonstrate how multi-observer data can also be employed to generate training labels for the prediction of turn-taking without human supervision. Turn-taking is also related to keyword-spotting [2], [43]: words like "Okay" or "Well" may indicate finishing or starting speech. Unlike these efforts, we do not solve this task directly, but indirectly while predicting turn-taking.

EGOCOM DATASET
Egocentric Communications (EgoCom) is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously and synchronized across participants' egocentric perspectives. For each conversation, the dataset provides embodied stereo audio, egocentric video, time-stamped word-level transcriptions, and speaker labels. EgoCom is comprised of 28 unique English conversations across 34 diverse speakers. Every conversation has three participants, with at least two participants wearing a recording device. Low-cost head-worn Gogloo glasses were used to record stereo audio near the ears and 1080p video between the eyes (see Fig. 8 in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2020.3025105, for an example of the device). Three synchronized egocentric video frames from each participant's perspective in a conversation are shown in Fig. 1. The color of each image matches the perspective arrow in the other images.
Topics of Conversation. Every conversation includes a host who directs topics. To enable future research, conversations adhere to topics: playing and teaching how to play card games; playing word-guessing games; pontificating on thought experiments; discussing interests (e.g., favorite food); describing objects in the environment; question-answering, teaching, and learning about how things work; and interacting with mirrored reflections with egocentric video. Although topics are constrained, conversations are reasonably natural. Throughout the dataset, an estimated 7,200 unique words are spoken, the most common word being "I", and an estimated 3,000 unique words are spoken only once.
The EgoCom dataset is split into a train set (78 percent), test set (16 percent), and validation set (6 percent) by total duration (see Fig. 2). These sets were generated randomly while enforcing similar distribution across gender and dialect. The term non-native is used to qualitatively express a non-American, non-British English accent.
Dataset Content Coverage. EgoCom encompasses a breadth of typical conversational elements, including variation in (1) spatial geometry, such as position and movement while speaking; (2) relative speaker geometries, including variation in all six degrees of freedom (x, y, z, yaw, pitch, tilt); (3) environment, including background fan noise and music varying in genre (classical, latin, country) and loudness; and (4) speaker demographics. Varying accents, dialects, and cultures are represented. All conversations are recorded in the same high-ceiling studio apartment. Fig. 3 quantifies distributional statistics across demographics and background noise. Observe that the largest bar depicts the host, who participated in every conversation. The train, test, and validation sets comprise 160,000 spoken words and 14 hours of unique conversational data, or 38.5 hours of video, audio, and text data from all perspectives in each conversation.

Research Areas Enabled
The EgoCom dataset enables new research opportunities through the combination of embodied visual, audio, and text modalities from multiple simultaneous aligned perspectives in natural conversation. EgoCom is intended to enable new research directions in the following areas.
Question Answering. About 20 percent of the conversational content encompasses a question-based word-guessing game where participants must guess a word placed on their forehead based on answers to binary yes/no questions that they ask. This is relevant for AI systems built on knowledge graphs of objects and properties. Throughout the dataset we ask questions about objects and their relationships like:
1. "What's that called?" (the answer names the object)
2. "What color is the <object>?" (the object has previously been named)
3. "What shape is the <object>?" (the question names the object)
4. "Name the <object> <relative to (e.g., above)> <object>."
Conversational AI. EgoCom is a natural dataset for predicting turn-taking, lip-reading to predict speech from video, semantic analysis and linguistic tasks, automatic speech recognition, natural language understanding, and enhancing predictions through visual cues.
Audio. EgoCom contains multiple aligned perspectives, making the dataset helpful for multi-modal multi-source separation tasks as well as audio-only source separation. The multi-channel, multi-perspective audio enables beamforming audio analysis, speaker localization, and pose estimation applications. Additionally, EgoCom works well for self-supervised learning with audio because each person's audio captures the same conversation, the difference being that egocentric audio is louder, providing strong cues for speaker identification and source separation.
Human Learning. EgoCom contains instances of human learning, such as participants teaching card games to one another or learning about properties of objects in the room. It is useful for meta-understanding (recognizing when a learner understands), providing source material for AI agents to understand and simulate human learning processes.

Contribution of EgoCom to Egocentric Datasets
EgoCom is the only multi-modal, synchronized multi-perspective egocentric conversational dataset as of the time of dataset publication. EPIC-KITCHENS [11] (50 hours) and EGTEA Gaze+ [12] (28 hours) are two of the largest egocentric datasets, but are single-perspective. EgoCom comprises 38.5 hours of video and stereo audio, 240,000 ground-truth time-stamped word-level transcriptions, and speaker labels. EgoCom is unique and larger than any other egocentric dataset published with these properties [44].

Multi-Capture Synchronization
Our solutions for turn-taking prediction and global transcription in the next sections hinge on the fact that videos are synchronized: within a conversation, all video/audio starts and stops at the same moments in time. Ideally, a synchronized global clock would be logged with the sensor data on each device to support this. Unfortunately, no commercial and unobtrusive wearable capture device supporting the required sensors exists with support for such a global clock, so we use a simple method to infer alignment from the captured data streams.
To synchronize all perspectives in each conversation, videos are aligned based on their audio (see Algorithm 1). Beforehand, audio is truncated to the minimum length of any perspective in each conversation. Volume is equalized within each signal using Gaussian smoothing, implemented by dividing each signal by itself convolved with a Gaussian kernel of width 0.1 seconds. After these preprocessing steps, alignment is performed using cross-correlation in Fourier space on all combinations of left and right channels of audio, detailed in Algorithm 1.
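To make the alignment idea concrete, below is a minimal Python sketch, not the authors' Algorithm 1 itself: volume equalization by dividing each signal by a Gaussian-smoothed envelope of itself, followed by cross-correlation computed via FFT. The function names, the absolute-value envelope, and the epsilon guard are our own assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import correlate, correlation_lags

def equalize_volume(x, sr, sigma_seconds=0.1):
    """Divide the signal by a Gaussian-smoothed envelope of itself."""
    envelope = gaussian_filter1d(np.abs(x), sigma=sigma_seconds * sr)
    return x / (envelope + 1e-8)  # epsilon avoids division by zero

def align_offset(a, b, sr):
    """Estimate the lag between two mono signals via FFT cross-correlation."""
    n = min(len(a), len(b))  # truncate to the shorter perspective
    a = equalize_volume(a[:n], sr)
    b = equalize_volume(b[:n], sr)
    corr = correlate(a, b, mode="full", method="fft")
    lags = correlation_lags(len(a), len(b), mode="full")
    # Per scipy's convention, a negative lag here means the content of
    # `b` occurs later in time than the same content in `a`.
    return lags[np.argmax(corr)]
```

In practice, one would estimate an offset for each pair of perspectives (over left and right channels) and shift each stream accordingly before trimming to a common duration.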
Speaker Labels. Speaker labels are obtained by aligning the raw audio for each participant in a given conversation (see Algorithm 1). Audio magnitudes are computed by summing together the absolute values of both channels. One-dimensional max-pooling with Gaussian smoothing is then used to find the speaker with maximum magnitude for every one second of audio; e.g., the label used in our experiments at z seconds in the future is the max amplitude signal averaged from z seconds to z + 1 seconds. If no speaker exceeds a threshold (10th percentile of all magnitudes), a zero label is used to represent that no one is speaking. Our labeling procedure assumes, sometimes incorrectly, that only one person is speaking in any given one-second window.
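The labeling heuristic admits a short sketch as well; the array layout, smoothing width, and the percentile being taken over pooled magnitudes are assumptions consistent with the description above, not the exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def speaker_labels(audio, sr, window_s=1.0, percentile=10):
    """audio: aligned stereo audio, shape (n_speakers, n_samples, 2).
    Returns one label per window_s seconds; 0 means no one is speaking."""
    magnitudes = np.abs(audio).sum(axis=2)  # sum both stereo channels
    magnitudes = gaussian_filter1d(magnitudes, sigma=0.1 * sr, axis=1)
    win = int(window_s * sr)
    n_windows = magnitudes.shape[1] // win
    # Max-pool each speaker's smoothed magnitude over one-second windows.
    pooled = magnitudes[:, :n_windows * win]
    pooled = pooled.reshape(len(audio), n_windows, win).max(axis=2)
    labels = pooled.argmax(axis=0) + 1  # speakers are 1-indexed
    quiet = pooled.max(axis=0) < np.percentile(pooled, percentile)
    labels[quiet] = 0                   # below threshold: silence
    return labels
```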

TURN-TAKING: PREDICTING INTO THE FUTURE OF CONVERSATIONS
In this section, we study the application of predicting turn-taking in conversation and demonstrate the advantages of synchronous multi-perspective data afforded by EgoCom. We first study priors (e.g., the transition probabilities between speakers), then likelihood and posterior models to predict future speaker labels from past data and labels. Posterior inference is formulated by including the current speaker label, at time t = 0, as an additional feature while training to predict a future speaker label. In this formulation, the priors are useful baselines; e.g., if the prior probability that someone's speaking state will not change in t seconds is 0.64, then a trivial model that predicts the current label will tend towards 64 percent accuracy. The posterior has more information and should perform at least as well as likelihood estimation, so why study both? As discussed in Section 3.3, in contrast to other datasets, EgoCom's multi-perspective data provides reliable current speaker labels for training, and these would also potentially be available to a distributed run-time inference system. This makes posterior estimation at inference time possible. For this reason, we are interested in studying the value of this extra multi-perspective information in the prediction of turn-taking. We approach turn-taking prediction with four tasks:
Binary Prediction:
1) Task 1: given any one person's features, will that person be speaking in t seconds?
2) Task 2: given only the conversation host's features, will the host be speaking in t seconds?
3) Task 3: given a concatenation of all participants' features, will the host be speaking in t seconds?
Multiclass Prediction:
4) Task 4: given a concatenation of all participants' features, who will be speaking in t seconds?
Our goal is to answer these questions as a real-time prediction task using multi-modal multi-perspective embodied communication data. These tasks are constructed to disambiguate latent factors, such as the influence of the host in conversations (Task 1 versus Task 2), the value of EgoCom's unique synchronous multi-perspective data (Task 2 versus Task 3), and how predicting a change in speaker label (binary Task 1) compares to the more difficult task of predicting who will be speaking (Task 4).
Toward this goal, we favor an approach with fast inference/prediction time by first pre-computing feature representations for visual, audio, and text data using models that are already pre-trained on related datasets. These features are computed for histories of 4, 5, 10, and 30 seconds at 1 second increments. In a second step, we train a simple MLP classifier using the pre-computed features. We chose this approach over an end-to-end recurrent model for inference-time speed and training stability [45].
We conclude this section with an ablation study and a comparative human performance evaluation.

Prior Probabilities on Speaker Labels
The current speaker label is an indicator of who will be speaking in the future. Before training models to predict speaker labels from data, we estimate the prior probabilities of speaker labels directly by counting the labels of the training set. Define s_t to be a binary random variable such that s_t = True if any embodied person is speaking t seconds in the future and False otherwise. Define h_t similarly for the conversation host, and define m_t ∈ {0, 1, 2, 3} as the multiclass label of the person speaking at time t (m_t = 0 if no one is speaking at time t). The prior baselines p(s_t = s_0), p(h_t = h_0), and p(m_t = m_0) measure the probability that the speaker state is the same now (t = 0) as it is t seconds in the future.
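As a concrete illustration (a toy sketch, not dataset code), the same-label prior p(m_t = m_0) is estimated by counting label agreement at offset t over the per-second training labels:

```python
import numpy as np

def same_label_prior(labels, t):
    """Fraction of seconds whose speaker label is unchanged t seconds later."""
    labels = np.asarray(labels)
    return np.mean(labels[:-t] == labels[t:])

labels = np.array([1, 1, 1, 0, 2, 2, 1, 1, 0, 0])  # toy sequence, 0 = silence
print(same_label_prior(labels, t=1))  # 0.56: 5 of 9 transitions unchanged
# A trivial "predict the current speaker" model tends toward this accuracy.
```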
These priors (see Table 1) provide relevant information for Tasks 1-4. We observed during the human evaluation experiment (see Section 4.4) that when labeling egocentric video, humans often predict who will be speaking in the future not based on visual and audio cues, but by who is currently speaking. This qualitative feedback motivated further inspection of the predictive signal of these priors.
Beyond same-label priors, conditional priors form transition diagrams that illuminate social dynamics in EgoCom (see Figs. 4 and 5). For example, Fig. 4a suggests that participants are more likely to speak in moments of silence than the host, and that this dynamic changes further in the future. Importantly, the train set prior distributions (Fig. 4) closely match the test set (Fig. 5), evidence that including prior information (the current speaking label) at training time should improve posterior inference on the test set.

Feature Representations for Downstream Learning
Here, we explain how video (visual), audio, and text embedding representations are created for every video in EgoCom and used as input for training. We extract visual, audio, and text features from an overlapping sliding window every 1 second. For each modality, separate embeddings that represent the past 4, 5, 10, and 30 seconds of data are created, yielding 12 feature embeddings for every second of video in EgoCom, resulting in 555,000 features in total (138,750 features for each of the four histories).
Video Embeddings. Prior to computing video embeddings, videos are compressed to 480p MP4 (see Appendix A, available in the online supplemental material). Video frames are sampled at 32 frames per clip for each past window. Video input frames are re-scaled to 171 × 128 and cropped to 112 × 112 patches. We extract the 2048-dimensional visual features from the last average pooling layer of the R(2+1)D-101 model [46], pre-trained on Kinetics-400 [22] for human action classification. Visual feature extraction was performed on 8 NVIDIA V100 GPUs, requiring 3 days to compute the 555,000 features.
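As an illustration of the visual feature extraction step, the sketch below uses the 18-layer R(2+1)D model that ships with torchvision as a stand-in: the paper's R(2+1)D-101 produces 2048-dimensional features, whereas this smaller public variant produces 512-dimensional ones.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

# Stand-in for the paper's R(2+1)D-101; both are Kinetics-400 pre-trained.
model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
model.fc = nn.Identity()  # expose the final average-pooling features
model.eval()

# 32 frames sampled from one past window, cropped to 112 x 112.
clip = torch.randn(1, 3, 32, 112, 112)  # (batch, channels, time, H, W)
with torch.no_grad():
    features = model(clip)  # shape (1, 512); the paper's model yields 2048-d
```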
[Table 1: the probability that the speaker label does not change in t seconds.]
[Fig. 4: Probability of turn-taking between host and any participant in the EgoCom train dataset. *Participants includes all (usually two) participants; e.g., 72 percent is the probability any participant will be speaking in 1 s given any participant is currently speaking.]
Audio Embeddings. The audio features used for this task are generated from a speaker identification model trained on VoxCeleb [47], trained such that embeddings of audio from the same speaker have similarity greater than a threshold with the anchor. The input to this model is segments of audio represented as 64-dimensional log Mel-filterbank energies. We compute the energies for each 25 ms frame with a 10 ms frame shift and concatenate the resulting 64-dimensional vectors together as input to the model. On a 24-core CPU machine, it takes roughly 8 hours to compute the 555,000 features.
Text Embeddings. We generate text embeddings on the human-annotated transcripts using FastText's Crawl 300-dimensional sub-word embeddings [48], pre-processed with tokenization and removal of white space and punctuation. Sentence vectors are created by normalizing and summing each word vector. As with the other features, text embeddings are created for every 4, 5, 10, and 30 second history. On a typical CPU, computation time is negligible.
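For reference, the 64-dimensional log Mel-filterbank energies described above (25 ms frames, 10 ms shift) can be computed as in this sketch; the 16 kHz sample rate and the log floor are our assumptions.

```python
import numpy as np
import librosa

def log_mel_energies(y, sr=16000):
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),       # 25 ms frame
        win_length=int(0.025 * sr),
        hop_length=int(0.010 * sr),  # 10 ms frame shift
        n_mels=64,
    )
    return np.log(mel + 1e-6).T      # (n_frames, 64), one row per frame

y = np.random.randn(16000).astype(np.float32)  # 1 s of toy audio
print(log_mel_energies(y).shape)               # roughly (101, 64)
```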

Predicting Turn-Taking in EgoCom
Using pre-computed multi-modal feature representations, we predict the speaking label t seconds in the future for each of Tasks 1-4 using a simple MLP classifier, both without (i.e., likelihood) and with (i.e., posterior) inclusion of the current speaking label (i.e., the prior) as input during training. Details about the MLP architecture, training procedure, and settings are described in Appendix B, available in the online supplemental material. The EgoCom validation set is used for early stopping and hyper-parameter tuning; the EgoCom test set is never accessed during training. We consider variations across: past window of input, future horizon to predict, feature modality, and prediction task, along with an ablation study on the effect of model and test set choice. Top-1 accuracy on the EgoCom test set for each variation is reported in Tables 2, 3, 4, and 5 for each task. All results are seeded for reproducibility. Table 2 reports the results for Task 1. The top-left entry in Table 2 is read as: the test accuracy predicting whether a given person will be speaking in 1 second, given the past 4 seconds of that speaker's text features, is 68.2 percent. "A speaker's features" means a subset of the video captured between their eyes, the audio captured near their ears, and transcripts. The input modality with the max value for each (past, future, prior) triad is in bold. Table 4 (Task 3) and Table 5 (Task 4) report test accuracy for models trained using all three participants' features concatenated together; for both tasks, we only consider conversations where all three participants are wearing a recording device, to ensure a fixed input size. This filtering explains the deviation in baseline accuracies at the end of the captions in Tables 4 and 5.
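A minimal sketch of the likelihood versus posterior setup follows, using scikit-learn's MLPClassifier as a stand-in for the MLP described in Appendix B and random toy data in place of EgoCom features; the only methodological point is that the posterior model appends the current speaker label as one extra input feature.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)            # seeded, as in the paper
X = rng.normal(size=(1000, 64))           # toy stand-in for feature embeddings
y_now = rng.integers(0, 4, size=1000)     # current speaker label (0 = silence)
y_future = rng.integers(0, 4, size=1000)  # speaker label t seconds ahead

# Likelihood model: features only.
likelihood = MLPClassifier(max_iter=500, random_state=0).fit(X, y_future)

# Posterior model: current speaker label appended as an extra feature.
X_post = np.hstack([X, y_now[:, None]])
posterior = MLPClassifier(max_iter=500, random_state=0).fit(X_post, y_future)
```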
Observed Trends. Referencing Tables 2, 3, 4, and 5, we observe a number of trends consistent across all tasks. First, test accuracy tends to decrease significantly when the MLP is trained with features averaged over a larger past/history, indicating that in the case of three-person human conversation, the dynamics of turn-taking rely mostly on the last few seconds of interaction. Second, the inclusion of visual features during training decreases accuracy, likely because turn-taking depends more on speech content than on the egocentric view of the speaker, and the high-dimensional visual features increase the complexity of the learning manifold during training. Using only video features to predict turn-taking (a speech-oriented task) results in poor performance that breaks these trends in some settings (see Table 3, video, past of 4 s). Finally, accuracy drops off significantly the further into the future we predict.
[Table 2: Columns comprise how much past data is included in the feature input and how far in the future we predict. Rows comprise the modality of input used and whether the prior (current speaker) label is included as a feature. The max score for each (past, future, prior) triad is in bold. Random performance is 50 percent; always predicting 0 (not speaking) yields 65 percent accuracy.]
Comparison of Prior, Likelihood, and Posterior. The top likelihood and posterior test accuracies from Tables 2, 3, 4, and 5 are shown in Table 7 along with their corresponding priors from Table 1. The results demonstrate the strength of the prior, indicating the value of the aligned multi-perspective data used to compute the prior (current speaker label) at inference time. The prior baseline outperforms the posterior in some cases; however, for longer future horizons, the posterior outperforms the prior. This indicates that while the likelihood may under-perform the prior, the MLP model learns turn-taking from the data, not just the prior. As expected, the posterior outperforms the likelihood in most cases.
Unlike Tables 2, 3, and 4, in some settings, accuracies in Table 5 dip below those of a naive model that always predicts label 1 (the host). For fair comparison, Tasks 1-4 use the same model and training settings (Section B, available in the online supplemental material) across inputs, features, and outputs. In complex settings (e.g., multiclass prediction with a large past window and future horizon), without further hyper-parameter tuning, poor local minima may be found.
Role of Multiple Synchronized Perspectives. Surprisingly, concatenating participant features to predict the host's speaking state decreased overall accuracy (cf. Table 3 versus Table 4). Two likely causes are: (A) the host is less influenced by participants than vice versa, such that the added data actually adds noise, or (B) the 8000-dimensional concatenated feature input is too complex for the simple MLP model (the same model is used to fairly compare results across tasks).
We observe strong evidence to support cause (B). When the MLP is trained only on audio and text features, without the 6144-dimensional video embedding from the concatenated three perspectives, accuracy increases from Task 2 to Task 3 (see Fig. 6). In more challenging settings (larger past window and future horizon), we observe increased accuracy and stability (across future horizon) when all synchronous participants' features are used. These results suggest a need for further exploration of the effects of synchronous multi-perspective multi-modal data in conversational AI.
Ablation Study. We conduct an ablation study (see Table 6) to validate our findings throughout this section, reproducing the results in Tables 2, 3, 4, and 5 with a past window of 4 s. We replicate these experiments with the scikit-learn [49] implementations of Random Forest and Gaussian Naive Bayes classifiers, with default settings, reporting top-1 accuracy for both the EgoCom test set as well as 5-fold cross-validation to study how the choice of EgoCom test set may bias results. The study varies the model used for training and the test set, across input modality and how far in the future to predict who will be speaking. As shown in Table 6, there is no significant difference between cross-validation accuracy and test set accuracy: both exhibit (1) a decrease in performance further in the future and/or with an increased past window of feature representation, (2) for the same classifier, results within 3 percent at least 90 percent of the time, and (3) the highest accuracies when training with audio features. These trends are similarly observed in the MLP benchmarks. The MLP and random forest classifiers perform similarly in Table 6, likely because the feature embedding inputs were pre-computed using neural architectures, such that our classification task is akin to fine-tuning the output layer of a neural network, and a random forest is highly expressive in comparison with an MLP's forward and softmax layers.
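The ablation protocol can be sketched as follows, with toy data standing in for the precomputed feature embeddings; classifier settings are the scikit-learn defaults, as stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)  # toy stand-ins for EgoCom features
X_train, y_train = rng.normal(size=(800, 64)), rng.integers(0, 4, 800)
X_test, y_test = rng.normal(size=(200, 64)), rng.integers(0, 4, 200)

for clf in (RandomForestClassifier(random_state=0), GaussianNB()):
    clf.fit(X_train, y_train)
    test_acc = clf.score(X_test, y_test)  # top-1 accuracy on held-out test set
    cv_acc = cross_val_score(              # 5-fold CV over all the data
        clf, np.vstack([X_train, X_test]),
        np.concatenate([y_train, y_test]), cv=5).mean()
    print(type(clf).__name__, round(test_acc, 3), round(cv_acc, 3))
```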
Towards Live Prediction. Our approach uses a simple MLP classifier trained with audio, video, and text embeddings from pre-trained models to allow for real-time turn-taking prediction. For example, using a pre-trained model with a 5 second past window, inference takes less than 1 second for all Tasks 1-4. The results in Tables 2, 3, 4, and 5 suggest a real-time assistance system is plausible.

Human Performance on Turn-Taking
Human accuracy for Task 1 is reported in Table 8 and compared with machine accuracy (Table 2) in Fig. 7. Three human raters were independently presented with 5 seconds of audio, video, or video+audio and asked to predict whether the embodied speaker will be speaking 1, 5, and 10 seconds in the future. The task was performed by each rater every 10th second for every perspective in each conversation in the test set. Across all configurations, 18,732 human predictions were recorded. To avoid redundancy, only one of the three modalities (audio, video, audio+video) was labeled for each of the three perspectives in each conversation.
Inter-rater reliability is measured using Cohen's kappa for every pair of raters for each video. To control for label quality, in each video we require a Cohen's kappa > 0.3 with at least one other rater's labels (44 percent of labels removed). Cohen's kappa for each (modality, future) setting is reported in Table 8. Fig. 7 compares human and machine performance on Task 1. In all cases, the MLP posterior model is within 5 percent of human performance. Humans perform notably worse when presented video without audio, likely because predicting speaking without being able to hear, using only the gestures of visible peers, is remarkably challenging. For a 10 s future horizon, the MLP always outputs "not speaking", yielding a baseline accuracy of 65 percent (see the dashed line in Fig. 7).
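A sketch of this quality filter, using scikit-learn's implementation of Cohen's kappa: the pairwise-agreement logic follows the description above, while the helper name and toy labels are ours.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def raters_to_keep(rater_labels, threshold=0.3):
    """For one video, keep each rater who agrees (kappa > threshold)
    with at least one other rater's labels."""
    keep = set()
    for (i, a), (j, b) in combinations(enumerate(rater_labels), 2):
        if cohen_kappa_score(a, b) > threshold:
            keep.update([i, j])
    return keep  # indices of raters whose labels pass the filter

raters = [[1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1], [0, 1, 0, 1, 1, 0]]
print(raters_to_keep(raters))  # {0, 1}: rater 2 disagrees with both others
```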

MULTI-SPEAKER SPEECH RECOGNITION
Here we demonstrate how the unique nature of embodied data may simplify the task of global transcription: computing a time-stamped, speaker-identified, multi-speaker transcription. To obtain ground truth transcriptions, a third-party human annotation service transcribed the entire EgoCom dataset.
As an asynchronous baseline, we use Google Cloud's speech-to-text service to transcribe each person's audio in a given conversation and compute mean accuracy against the ground truth. Transcription accuracy is computed as 1 − WER, where WER is the word error rate defined by the Wagner-Fischer edit-distance algorithm [50]. Because we use a pre-trained speech-to-text service, we do not need train and test sets and instead compute accuracy on the entire EgoCom dataset. Accuracy is computed per conversation, and overall accuracy is computed as a weighted mean, weighted by the number of words in each conversation.
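For clarity, a minimal Wagner-Fischer computation of WER (and hence accuracy as 1 − WER) is sketched below; it assumes unit cost for substitutions, insertions, and deletions.

```python
import numpy as np

def wer(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)  # all deletions
    d[0, :] = np.arange(len(h) + 1)  # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / len(r)

print(1 - wer("who will be speaking", "who will we speaking"))  # 0.75
```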
Asynchronous Baseline: 30.7 Percent Accuracy. We use Google's single-source speech-to-text service [51], [52] to transcribe the audio source for every video in EgoCom. This service provides time-stamped word-level transcriptions with a confidence for every transcribed word. For each conversation, we compute 1 − WER for each source and take the average. The weighted average accuracy across all conversations, weighted by the number of words in each conversation, is 30.7 percent. Low accuracy occurs because the speech-to-text system only has access to a single audio source. Qualitatively, all three speakers can be heard in each audio stream; however, the egocentric audio is significantly louder, which may "trick" the system into filtering out non-egocentric audio as noise.
As an alternative baseline, the loudness issue could be avoided by adding the signals prior to transcription; however, such a baseline is not asynchronous because it requires aligned multi-perspective data at inference time. We study how the unique nature of synchronous, multi-perspective EgoCom data can simplify tasks like global transcription.
Synchronous Multi-Perspective Data: 54.8 Percent Accuracy. We use the same Google Cloud transcriptions from the baseline accuracy experiment, but combine the outputs using the maximum confidence for each word, exploiting the fact that EgoCom sources are egocentric and aligned. Our approach has three steps: (1) label all transcriptions for source i as being spoken by speaker i, (2) join all transcriptions in a table indexed by time, sorting by the start time of each transcribed word, and (3) starting from row zero, if two or more rows from different speakers have the same word transcription within 0.1 seconds of each other, remove all rows except the one with max confidence. The output is a time-stamped global transcription with speaker ids. Using this approach, we achieve an overall accuracy of 54.8 percent, a 79 percent improvement over the baseline. The improvement results from egocentric synchronous audio: the source worn by the speaker typically yields the prediction with the highest confidence score, disambiguating the source for each spoken word in order to obtain a speaker-identified global transcription.
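A sketch of the three-step merge follows; the tuple layout and helper name are assumptions, while the 0.1 second window and max-confidence rule come from the description above.

```python
def merge_transcripts(words, window=0.1):
    """words: (start_time, word, confidence, speaker_id) tuples from all
    sources. Keeps one copy of duplicate words heard within `window`
    seconds of each other, preferring the highest-confidence copy."""
    words = sorted(words)  # sort by start time (step 2)
    merged = []
    for w in words:
        dup = [m for m in merged
               if m[1] == w[1] and m[3] != w[3] and abs(m[0] - w[0]) <= window]
        if not dup:
            merged.append(w)
        elif max(d[2] for d in dup) < w[2]:  # this copy is more confident
            for d in dup:
                merged.remove(d)
            merged.append(w)
    return sorted(merged)  # time-stamped global transcription with speaker ids

# Two sources hear "hello" at nearly the same time; keep the confident one.
ws = [(0.00, "hello", 0.95, 1), (0.04, "hello", 0.60, 2), (0.9, "yes", 0.8, 2)]
print(merge_transcripts(ws))  # [(0.0, 'hello', 0.95, 1), (0.9, 'yes', 0.8, 2)]
```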
Synchronous Speaker Identification Accuracy Is 76.8 Percent. We compute speaker identification accuracy by considering every time the ground truth and the global transcription speaker labels (from above) both provide a speaker label for a given 1 second time window. In total there are 534,500 labels. Speaker id accuracy is computed as the number of matching labels divided by the total number of co-occurrences for each conversation, with an overall accuracy of 76.8 percent computed as a weighted mean. Note there is no notion of speaker identification for the baseline approach, and thus no speaker identification comparison.
Tables 9 and 10 report accuracies for each approach across demographics and background noise. The results indicate that the advantages of synchronous multi-perspective data for global transcription are unaffected by demographics (no performance decrease) and by background noise (a decrease in performance similar to the baseline's).

CONCLUSION
These findings demonstrate how synchronized multi-perspective egocentric data can simplify baseline solutions for two example applications, and more generally, the interdisciplinary nature of the EgoCom dataset. The turn-taking application demonstrates unique new applications enabled by EgoCom, and the global transcription application illuminates how aligned egocentric capture may simplify practical problems. Egocentric communications motivates the need for further study of applications in embodied AI and egocentric data across conversational analysis, computer vision, audio, and machine perception.