Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language Acquisition


Abstract-This paper presents a self-supervised method for detecting the active speaker in a multi-person spoken interaction scenario. We argue that this capability is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. Our methods are able to detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their face. Our methods do not rely on external annotations, thus complying with cognitive development. Instead, they use information from the auditory modality to support learning in the visual domain. The methods have been extensively evaluated on a large multi-person face-to-face interaction dataset. The results reach an accuracy of 80% in a multi-speaker setting. We believe this system represents an essential component of any artificial cognitive system or robotic platform engaging in social interaction.
Index Terms-active speaker detection, language acquisition through development, transfer learning, cognitive systems and development

I. INTRODUCTION
The ability to acquire and use language in a similar manner as humans may provide artificial cognitive systems with a unique communication capability and the means for referring to objects, events and relationships. In turn, an artificial cognitive system with this capability will be able to engage in natural and effective interactions with humans. Furthermore, developing such systems can help us further understand the underlying processes of language acquisition during the initial stages of human life. As mentioned in ten Bosch et al. [1], modeling language acquisition is very complex and should integrate different aspects of signal processing, statistical learning, visual processing, pattern discovery, and memory access and organization.
According to many studies (e.g. Steels and Kaplan [2]) there are two alternatives to human language acquisition -individualistic learning and social learning. In the case of individualistic learning, the infant exploits the statistical regularities in the multi-modal sensory inputs to discover linguistic units such as phonemes and words and word-referent mappings. In the case of social learning, the infant can determine the intentions of others by exploiting different social cues. Therefore, in social learning, the participants in the interaction with the infant play a crucial role by constraining the interaction and providing feedback.
From a social learning perspective, the main prerequisite for language acquisition is the ability to engage in social interactions. For an artificial cognitive system to address this challenge, it must at least 1) be aware of the people in the environment, 2) detect their status: speaking/not speaking, and 3) infer possible objects the active speaker is focusing attention on.
In this study we address the problem of detecting the active speaker in a multi-person language learning scenario. We propose and evaluate three self-supervised methods to tackle this task based on the temporal synchronization of two sensory modalities -visual and auditory. We demonstrate these methods with a system that learns to construct visual representations of speaking/not speaking faces by exploiting the information from the temporally aligned auditory modality. As a result, the system can detect and keep track of the active speaker in real-time using only visual information.
In order to impose as few constraints as possible on the social interaction, we have two requirements for the methods. The first is that any particular method must operate in real-time (possibly with a short lag), which in practice means that the method should not require any future information. The second requirement is that the methods should make as few assumptions as possible about the environment in which the artificial cognitive system interacts. Therefore, the methods should not assume a noise-free environment, a known number of participants in the interaction, or a known spatial configuration. The proposed methods address the requirements for engagement in social interactions outlined above, by detecting the people in the environment and detecting their status: speaking/not speaking. This information, in turn, is a prerequisite for hypothesizing the possible objects a speaking person is focusing his/her attention on, which has been shown to play an important role in infants' language acquisition (cf. Section II-A).
The rest of the manuscript is organized as follows. First we examine previous research that forms the context for the current study in Section II, then we describe the proposed methods in Section III. The experiments we conducted are described in Section IV and the results of these experiments are presented in Section V. Discussion on the data and the used evaluation metric, together with the assumptions made can be found in Section VI. We conclude the paper in Section VII.

II. RELATED WORK
This section is divided in two parts. First we introduce research on language acquisition which supports our motivation to build an active speaker detector for a language learning artificial cognitive system. In the second part of the section we turn our focus on research related to the problem of identifying the active speaker through visual and auditory perceptual inputs.

A. Language Acquisition
The literature on language acquisition offers several theories of how infants learn their first words. One of the main problems researchers face in this field is referential ambiguity, as discussed for example in Clerkin et al. [3], Pereira et al. [4] and Yurovsky et al. [5]. Referential ambiguity stems from the idea that infants must acquire language by linking heard words with perceived visual scenes in order to form word-referent mappings. In everyday life, however, these visual scenes are highly cluttered, which results in many possible referents for any heard word within any learning event (Quine [6] and Bloom [7]). Similarly, many computational models of language acquisition are rooted in finding statistical associations between verbal descriptions and the visual scene (e.g. [3], [8]-[10]), or in more interactive robotic manipulation experiments (e.g. Salvi et al. [11]). However, nearly all of them assume a clutter-free visual scene, where objects are observed in isolation on a simplified background (often a white table).
Different theories offer alternative mechanisms through which infants reduce the uncertainty present in the learning environment. One such mechanism is statistical aggregation of word-referent co-occurrences across learning events. The problem of referential ambiguity within a single learning event has been addressed by Smith et al. [12], [13], suggesting that infants can keep track of co-occurring words and potential referents across learning events and use this aggregated information to statistically determine the most likely word-referent mapping. However, the authors argued that this type of statistical learning may be beyond the abilities of infants when considering highly cluttered visual scenes. In order to study the visual scene clutter from the infants' perspective, Pereira et al. [4] and Yurovsky et al. [5] performed experiments in which the infants were equipped with a head-mounted eye-tracker. The conclusion was that some learning events are not ambiguous because there was only one dominant object when considering the infants' point of view. As a consequence, the researchers argued that the input to language learning must be understood from the infants' perspective, and only regularities that make contact with the infants' sensory system can affect their language learning. Although not related to language acquisition, an attempt at modeling the saliency of multi-modal stimuli from the learner (robot) perspective was proposed in Ruesch et al. [14]. This was a bottom-up approach exclusively based on the statistical properties of the sensory inputs.
Another mechanism to cope with the uncertainty in the learning environment might be related to social cues to the caregivers' intent, as mentioned in the above studies. Although a word is heard in the context of many objects, infants may not treat the objects as equally likely referents. Instead, infants can use social cues to rule out contenders to the named object. Yu and Smith [15] used eye-tracking to record gaze data from both caregivers and infants and found that when the caregiver visually attended to the object to which infants' attention was directed, infants extended their duration of visual attention to that object, thus increasing the probability of successful word-referent mapping.
Infants do not learn only from interactions they are directly involved in, but also observe and attend to interactions between their caregivers. Handl et al. [16] and Meng et al. [17] performed studies to examine how body orientation can influence infants' gaze shifts. These studies were inspired by a large body of research on gaze following which suggests that infants use others' gaze to guide their own attention, that infants pay attention to conversations, and that joint attention has an effect on early learning. The main conclusion was that static body orientation alone can function as a cue for infants' observations and guides their attention. Barton and Tomasello [18] also reasoned that the multi-person context is important in language acquisition. In their triadic experiments, joint attention was an important factor facilitating infants' participation in conversations: infants were more likely to take a turn when they shared a joint attentional focus with the speaker. Yu and Ballard [9] also proposed that speakers' eye movements and head movements, among others, can reveal their referential intentions in verbal utterances, which could play a significant role in an automatic language acquisition system.
The above studies do not consider how infants might know which caregiver is actively speaking and therefore requires attention. We believe that this is an important prerequisite to modeling automatic language acquisition. The focus of the study described in this manuscript is, therefore, to investigate different methods for inferring the active speaker. We are interested in methods that are plausible from a developmental cognitive system perspective. One of the main implications is that the methods should not require manual annotations.

Fig. 2. Overview of the approaches to active speaker detection considered in this study. In the top row are the perceptual inputs and the way they are automatically modified before being passed to the memoryless (second row), transfer learning (third row) and dynamic (fourth row) methods.

B. Active Speaker
Identifying the active speaker is important for many applications, and in each application area different constraints are imposed on the methods. Generally, there are three different approaches: audio-only, audio-visual, and approaches that use other forms of input for detection.
Audio-only active speaker detection is the process of finding segments in the input audio signal associated with different speakers. This type of detection is known as speaker diarization. Speaker diarization has been studied extensively. Miro et al. [19] offer a comprehensive review of recent research in this field.
Audio-visual speaker detection combines information from both the audio and the video signals. The application of audio-visual synchronization to speaker detection in broadcast videos was explored by Nock et al. [20]. Unsupervised audio-visual detection of the speaker in meetings was proposed in Friedland et al. [21]. Zhang et al. [22] presented a boosting-based multi-modal speaker detection algorithm applied to distributed meetings, to give three examples. The use of mutual correlations to associate an audio source with regions in the video signal was demonstrated by Fisher et al. [23], and Slaney and Covell [24] showed that audio-visual correlation can be used to find the temporal synchronization between an audio signal and a speaking face. In more recent studies, researchers have employed deep architectures (Convolutional Neural Networks and Recurrent Neural Networks) to build active speaker detectors from audio-visual input. A multi-modal Long Short-Term Memory (LSTM) model that learns shared weights between modalities was proposed in Ren et al. [25]. The model was applied to speaker naming in TV shows. Hu et al. [26] proposed a Convolutional Neural Network (CNN) that learns the fusion function of face and audio information.
Other approaches for speaker detection include a general pattern recognition framework used by Besson and Kunt [27] applied to detection of the speaker in audio-visual sequences. Visual activity (the amount of movement) and focus of visual attention were used as features by Hung and Ba [28] to determine the current speaker on real meetings. Stefanov et al. [29] used action units (AU) as input features to Hidden Markov Models (HMM) to determine the active speaker in multi-party interactions and Vajaria et al. [30] demonstrated that information for body movements can improve recognition performance.
Most of the approaches cited in this section are either evaluated on small amounts of data or have not been shown to be usable in real-time settings. Furthermore, they usually require manual annotations and assume that the spatial configuration of the interaction and the relative positions of the input sensors are known. The goal is usually an offline video/audio analysis task, such as semantic indexing and retrieval of TV broadcasts or meetings, or video/audio summarization. We believe that the challenge of real-time detection of the active speaker in dynamic and cluttered environments remains open. For automatic language acquisition, we want to infer the possible objects the active speaker is focusing attention on. In this context, assumptions such as a known sensor arrangement or known positions and number of participants in the environment are unrealistic and should be avoided. Therefore, in this study we present methods which have several desirable characteristics for such types of scenarios: 1) they work in real-time, 2) they do not assume a specific spatial configuration (of sensors and participants), 3) the number of possible (simultaneously) active participants is free to change during the interaction, and 4) no externally produced labels are required; rather, the acoustic inputs are used as reference for the visually based learning.

III. METHODS
The goal of the methods described in this section is to detect in real-time the status (speaking/not speaking) of all visible faces in a multi-person language learning scenario. The methods proposed in this study use only visual information (the RGB color data). An illustration of the desired output of a method can be seen in Figure 1.
We use a self-supervised learning approach to construct an active speaker detector: the machine learning methods are supervised, but the labels are obtained automatically from the auditory modality to learn models in the visual modality. An overview of the approaches considered in the current study is given in Figure 2. The first row in the figure illustrates the perceptual inputs that are automatically extracted from the raw audio and video streams. The visual input consists of cropped RGB images of each face extracted with Viola and Jones's face detection system [31]. The labels extracted from the acoustic input correspond to the voice activity information [32]. The ability to extract both these pieces of information is given as a starting point in the system and is motivated in Section VI.
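The self-supervised labeling step can be sketched as follows: voice activity intervals detected in the audio are projected onto the video timeline, so that each face crop inherits a speaking/not speaking label. This is a minimal illustration, not the authors' code; the 30 fps frame rate and the (start, end) interval format of the VAD output are assumptions.

```python
FPS = 30.0  # assumed video frame rate

def vad_to_frame_labels(vad_segments, n_frames, fps=FPS):
    """Label frame i as 1 (speaking) if its timestamp falls inside
    any VAD segment (start_s, end_s) in seconds, else 0."""
    labels = [0] * n_frames
    for start_s, end_s in vad_segments:
        first = max(0, int(start_s * fps))
        last = min(n_frames, int(end_s * fps) + 1)
        for i in range(first, last):
            labels[i] = 1
    return labels
```

These per-frame labels then supervise the visual models without any manual annotation.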
All the methods make use of a feature extractor based on Convolutional Neural Networks (CNNs) followed by a classifier. We test two kinds of classifiers: memoryless (Perceptrons) and dynamic (Long Short-Term Memory, LSTM). We also consider two ways of training our models: in transfer learning, we employ a pre-trained feature extractor and train only the classifier for our data and task; in task specific learning, we train the feature extractor and the classifier simultaneously on our data.
Each model outputs a posterior probability distribution over the two possible outcomes (speaking/not speaking). Since the goal is binary classification, the active speaker is detected when the corresponding probability exceeds 0.5. The evaluation of each method is performed by computing the accuracy of the predictions on a frame-by-frame basis (see Section IV).

A. Task Specific Learning
An illustration of the task specific learning method is reported in the second row of Figure 2. We train a Convolutional Neural Network (CNN) feature extractor in combination with a Perceptron classifier with the goal of classifying each input image as either speaking or not speaking. During the training phase both images and labels are used by a gradient-based optimization procedure (Kingma and Ba [33]) to adjust the weights of the CNN/Perceptron model. During the prediction phase, only images are used by the trained detectors to generate labels. The CNN/Perceptron model works on a frame-by-frame basis and has no memory of past frames.

B. Transfer Learning
An illustration of this method can be seen in the third row of Figure 2. Similarly to the previous method, we use a CNN/Perceptron model. This time, however, the CNN model is pre-trained on an object recognition task (VGG16 [34]). To adapt this model to the active speaker detection task, we remove the object classification layer and use the truncated VGG16 model as a feature extractor. We then train a perceptron layer to map those features to the speaker activity information. As for the custom CNN above, this method has no memory of past frames. Because the VGG16 model was originally trained in a supervised way to classify objects, the question arises of how suitable this model is in the context of developmental language acquisition. Support for the use of this model comes from the literature on visual perception, which demonstrates the ability of infants to recognize objects very early in their development.
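The truncation described above can be sketched in Keras: the VGG16 classification layer is removed, the remaining network is frozen, and a single sigmoid unit is trained on top. This is an illustrative sketch, not the authors' code; weights=None is used only so the sketch runs without downloading the pre-trained weights, whereas the actual method relies on the pre-trained model.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Build VGG16 and cut it at its last hidden layer ('fc2', 4096 units),
# i.e. remove only the object classification layer.
vgg = tf.keras.applications.VGG16(weights=None, include_top=True)
features = Model(inputs=vgg.input, outputs=vgg.get_layer('fc2').output)
features.trainable = False  # only the perceptron head is trained

# Perceptron layer mapping VGG16 features to P(speaking).
head = tf.keras.Sequential([features,
                            layers.Dense(1, activation='sigmoid')])
```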

C. Dynamic Models
The dynamic method is illustrated in the fourth row of Figure 2. This method is based on the feature extractors described above, but introduces a model of the time evolution of the perceptual signals. During the training phase the images are fed into a custom (CNN) or pre-trained (VGG16) feature extractor, which constructs an n-dimensional feature vector for each image. Then the features and the labels are used by a gradient-based optimization procedure (Kingma and Ba [33]) to adjust the weights of a Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber [35]). During the prediction phase, images are converted into features with the CNN or the VGG16 model, and these features are then used by the trained detectors (LSTM models) to generate labels.

IV. EXPERIMENTS
This section is divided in two parts. First we describe the data we used to build and evaluate the active speaker detectors. Then we describe the general setup of the conducted experiments.

A. Data
The methods presented in Section III are implemented and evaluated using a multi-modal multi-party interaction dataset described in Stefanov and Beskow [36]. The main purpose of the dataset is to explore patterns in the focus of visual attention of humans under the following three different conditions: two humans involved in task-based interaction with a robot (we will refer to this condition as Game 1); the same two humans involved in task-based interaction where the robot is replaced by a third human (Game 2); and a free three-party human interaction (Game 3). The dataset contains two parts: 6 sessions of approximately 30 minutes each, and 9 sessions of approximately 40 minutes each. The dataset is rich in modalities and recorded data streams. It includes the streams generated from 3 Kinect v2 devices (color, depth, infrared, body and face data), 3 high quality audio streams generated from close-talking microphones, 3 high resolution video streams generated from GoPro cameras, a touch-events stream for the task-based interactions generated from an interactive surface, and the system state stream generated by the robot involved in the first condition. The second part of the dataset also includes the data streams generated from 3 Tobii Pro Glasses 2 eye trackers worn by the participants. All interactions are in English and all data streams are spatially and temporally synchronized and aligned. All interactions occur around a round interactive surface and the participants are seated. Figure 3 illustrates the spatial configuration of the setup in the dataset. There are 24 unique participants.
In this study we only use the audio generated by the close-talking microphones and the video generated by the Kinect devices.

B. Experimental Setup
We report results based on random sampling of the data for each subject, where ∼80% of the data is used for training, ∼15% is used for testing, and ∼5% is used for validation. We evaluated the methods on all the available test data (Game 1-3). Results are also reported for a subset of the data (Game 3) that has different characteristics from the other two conditions. In Game 1 and 2, the participants were often focusing on an interactive table, thus not showing their faces properly to the cameras. By comparing the results on all games to the ones on Game 3 we wish to verify whether the orientation of the faces with respect to the cameras has a strong impact on the performance of our methods.
The video input is generated by the Kinect device directed at the subject under consideration and the audio input is generated by the close-talking microphone. The dynamic models are trained and evaluated with 15, 30, 150, and 300 frame (500ms, 1s, 5s, and 10s) long segments without overlaps.
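The segmentation into non-overlapping fixed-length windows for the dynamic models can be sketched as below. This is an illustrative helper, not the authors' code; at 30 fps the window lengths of 15, 30, 150 and 300 frames correspond to the 500ms, 1s, 5s and 10s segments mentioned above.

```python
def segment(sequence, window):
    """Split a per-frame sequence (features or labels) into consecutive
    non-overlapping windows of the given length; a trailing partial
    window is dropped."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, window)]
```

For example, a 10-frame sequence with a window of 3 yields three full segments and discards the last frame.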
The CNN models comprise three convolutional layers of width 32, 32, and 64 with receptive fields of 3×3 and rectifier activations, interleaved by max pooling layers with window size of 2 × 2. The output of the last max pooling layer is fed to a densely connected layer of size 64 with rectifier activation functions and finally to a perceptron layer with logistic sigmoid activations. The LSTM models include one long short-term memory layer of size 128 with hyperbolic tangent activations, followed by a densely connected and a perceptron layer similarly to the CNN model.
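The CNN/Perceptron architecture just described can be sketched in Keras (the library used in this study). This is an illustrative sketch, not the authors' code; in particular, the 64×64 input resolution of the face crops is an assumption, as it is not restated in this section.

```python
import tensorflow as tf
from tensorflow.keras import layers

cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),          # assumed face-crop size
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),        # densely connected layer
    layers.Dense(1, activation='sigmoid'),      # perceptron layer
])
cnn.compile(optimizer='adam', loss='binary_crossentropy')
```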
For training the models we used the Adam optimizer with default parameters (α = 0.001, β1 = 0.9, β2 = 0.999, and ε = 10^-8) and the binary cross-entropy loss function. Each memoryless model (CNN and Perceptron) is trained for 50 epochs and each dynamic model (LSTM) is trained for 100 epochs. The model corresponding to the best validation performance is selected for the final results. The methods were implemented in Keras (Chollet et al. [37]) with TensorFlow [38] backend. During the prediction phase only the RGB color data generated from the automatic face detection module is used as input. As described above, each of the considered methods outputs a posterior probability distribution over the two possible outcomes (speaking/not speaking). Therefore, when evaluating the models' performance, we use 0.5 as a threshold for assigning a class to each frame-level prediction. All results are reported in terms of frame-by-frame weighted accuracy, calculated as

wacc = (100/2) × (tp/(tp + fn) + tn/(tn + fp))

where tp, fp, tn, and fn are the numbers of true positives, false positives, true negatives and false negatives, respectively. As a consequence, regardless of the actual class distribution in the test set (which is in general different for each subject), the baseline chance performance using this metric is always 50%. Although this metric allows us to more easily compare results between different subjects and methods, it is a very conservative measure of performance. For example, if out of 100 frames, 98 belong to the active class and 2 to the inactive class, a method that classifies all frames as active will have wacc = (100/2) × (98/(98 + 0) + 0/(2 + 0)) = 50%. We come back to this point when we discuss the results.
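The weighted accuracy used in the paper can be written as a short function; the example from the text (98 active frames, 2 inactive, all predicted active) indeed yields chance-level performance.

```python
def weighted_accuracy(tp, fp, tn, fn):
    """Average of per-class recalls, in percent. Chance level is 50%
    regardless of the class distribution."""
    return 100.0 / 2.0 * (tp / (tp + fn) + tn / (tn + fp))

# Degenerate classifier that labels every frame as active:
# tp=98, fp=2, tn=0, fn=0 -> 50.0
```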
We conducted three main experiments: speaker dependent, multi-speaker dependent, and speaker independent.
In the speaker dependent experiment we build a model for each speaker and test it on independent data from the same speaker. The data distribution for this experiment is summarized in Table I.

In the multi-speaker dependent experiment we build a model for a group of subjects and test it on the same group. This experiment tests the scalability of the proposed methods to more than one subject. We divided the subjects into two groups, which we call set A and set B for brevity. The division is based on random assignment of each subject in the dataset to one of the sets, resulting in 12 subjects in set A and the other 12 subjects in set B. The data is then partitioned into train, test and validation sets per subject, following the general approach outlined above. The data distribution for this experiment is summarized in Table II.

In the speaker independent experiment we used the same groups and models as in the multi-speaker dependent experiment. In this experiment, however, the model built for each group of subjects is tested on data from the other group. This experiment tests the transferability of the proposed methods to unseen subjects.

V. RESULTS
This section presents the numerical results obtained from the three experiments.

A. Speaker Dependent
The summarized mean results and standard deviations obtained in the speaker dependent experiment are provided in Table III. The results show that, regardless of the method used in this experiment, the mean weighted accuracy is always higher for the experiments where the data is drawn from Game 3 compared to the experiments where the data is drawn from Game 1-3. This indicates that the orientation of the faces with respect to the cameras has an impact on the performance of the methods. The highest mean result in this experiment for Game 1-3 is 73.2% for the LSTM 15 model, and in Game 3 the highest is 75.94% for the LSTM 150 model. The complete results (weighted accuracy rates per subject and mean weighted accuracies and standard deviations over all subjects and methods) are provided as an appendix at the end of the text in Table VI. We illustrate the complete results in Figure 4. From the figure we can see that the weighted accuracy varies significantly between subjects. This inter-subject variability is further confirmed by the high standard deviations. We should mention that the variability between subjects is higher than the difference we obtain with different methods per subject. We should also note that, although most results improve when considering only Game 3, there are subjects for which the Game 3 results are worse than the Game 1-3 results.

B. Multi-Speaker Dependent
The goal of the multi-speaker dependent experiment is to test the scalability of the proposed methods to more than one speaker. We have summarized the results of this experiment in Table IV. As observed in the previous experiment, the mean performance of all methods is better when the data is drawn from Game 3 compared to Game 1-3. The highest mean result in this experiment for Game 1-3 is 74.5% for the LSTM 300 model, and in Game 3 the highest is 80.23% for the LSTM 30 model.

C. Speaker Independent
The speaker independent experiment tests the transferability of the proposed methods to unseen subjects. We have summarized the results of this experiment in Table V. The highest mean result in this experiment for Game 1-3 is 56.3% for the LSTM 300 model and in Game 3 the highest is 60.25% for the LSTM 300 model.

VI. DISCUSSION
Firstly, we should mention that the proposed methods estimate the probability of speaking independently for each face. This has the advantage of being able to detect several speakers that are active at the same time, but for many applications it might be sufficient to select the active speaker among the detected faces. Doing this would allow us to combine the single predictions into a joint probability thus increasing the performance.
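Selecting a single active speaker from the independent per-face probabilities can be sketched as follows. This is an illustrative sketch of the idea mentioned above (names and the minimum-confidence threshold are our assumptions), not part of the proposed methods.

```python
def select_active_speaker(face_probs, threshold=0.5):
    """face_probs: dict mapping face id -> P(speaking), as produced
    independently per face. Returns the most probable face id, or
    None if no face exceeds the threshold."""
    if not face_probs:
        return None
    best = max(face_probs, key=face_probs.get)
    return best if face_probs[best] > threshold else None
```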
In order to interpret the results presented in Section V we need to make a number of considerations about the evaluation method and the data. We will also consider the advantages and limitations of the metric used and detail the assumptions made in the methods and the main contributions of the study.

A. Challenges in the Data
Section IV-A outlines the main characteristics of the dataset used in this study. Each interaction in the dataset is divided into three conditions, with the first and the second condition being related to a collaborative game scenario in which the subjects play a game on a touch surface. During these two conditions the subjects interact mainly with the touch surface and discuss with their partner (and sometimes with the interaction moderator) how to solve the given task. Therefore, the subjects' overall gaze direction (head orientation) is towards the touch surface. This creates very challenging visual conditions for extracting information from the face. We show three examples in Figure 5. This observation motivated a separate experiment using only the data from the third condition of each interaction (referred to as Game 3 in the text), which yielded better overall results although the amount of data was significantly smaller than for the full dataset (referred to as Game 1-3 in the text). As discussed previously, we should also note that the improvement of the results in the Game 3 condition is not shared by all subjects. This would suggest that the random sampling into train, test and validation sets per subject also has an influence on the performance of the methods. Therefore, future work might include incorporating k-fold cross-validation for the subject dependent evaluation. Another source of difficulty in the estimation of speaker activity from the face is the fact that subjects are wearing recording hardware (close-talking microphones and eye-trackers) that could have interfered with the visual representation of the faces in the proposed methods.

B. Metric
We discuss here several characteristics of the metric used in the study. Evaluating the proposed methods on a frame-by-frame basis gives a detailed measure of performance. However, one might argue that frame-level (33ms) accuracy is not necessary for artificial cognitive systems employing the proposed methods in the context of automatic language acquisition. Evaluating the methods on a fixed-length sliding time window (e.g. 200ms) might be sufficient for this application.
Furthermore, the definition of the weighted accuracy amplifies short mistakes, as exemplified in Section IV-B. Consider a case of continuous talking, where the speaker takes short pauses to recollect a memory or structure an argument. A perfect VAD module will detect silences of a certain length (at least 200ms) in the acoustic signal and label the corresponding video frames as not speaking. However, from the interaction point of view the speaker might still be active, resulting also in visual activity. A method that misses these short pauses would be strongly penalized by our metric, achieving as low as 50% accuracy even when all other frames are classified correctly.
A similar situation occurs when a person is listening and gives short acoustic feedback that is missed by the methods.
The advantage of the weighted accuracy metric, however, is that it enables us to seamlessly compare the performance between subjects and methods. This is because the different underlying class distributions of each particular data set are accounted for by the metric, and the resulting chance baseline is 50% for all considered experimental configurations.
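The behaviour described above can be sketched numerically. The snippet below assumes the weighted accuracy is the mean of the per-class recalls (i.e. balanced accuracy), which yields the 50% chance baseline mentioned in the text; function and variable names are illustrative only:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: chance level is 0.5 for any
    class distribution, matching the 50% baseline in the text.
    (Assumed form of the paper's weighted accuracy.)"""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(recalls))

# 1000 frames of continuous speech in which the VAD labels one short
# pause (6 frames, roughly 200 ms at 30 fps) as "not speaking" (0).
y_true = np.ones(1000, dtype=int)
y_true[500:506] = 0
# A detector that misses the pause and outputs "speaking" everywhere
# classifies 994 of 1000 frames correctly, yet scores only 50%.
y_pred = np.ones(1000, dtype=int)
print(weighted_accuracy(y_true, y_pred))  # 0.5
```

This illustrates why a handful of misclassified pause frames can halve the score even though plain frame-level accuracy would be 99.4%.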

C. Assumptions
We made the following assumptions in the proposed methods:
• the system is able to detect faces,
• the system is able to detect speech (VAD) for a single speaker,
• there are situations in which the system interacts with only one speaker, and can therefore use the VAD to train the visual active speaker detector.
To motivate the plausibility of these assumptions in the context of a computational model of language acquisition, we consider research in developmental psychology. According to studies reported by LaBarbera et al. [39] and Young-Browne et al. [40], infants can discriminate between several facial expressions, which suggests that they are capable of detecting human faces. The assumption that the system can detect speech is supported by research on the recognition of the mother's voice in infants (e.g., Mills and Melhuish [41]). The final assumption is reasonable considering that infants interact with a small number of speakers in their first months, and in many cases only one parent is available as caregiver at any specific time.

D. Contributions
The methods proposed in this study are similar to our previous work on active speaker detection in multi-party human-robot interactions (Stefanov et al. [42]). We summarize the main differences in this section. The first difference is the use of a better-performing pre-trained CNN model for visual feature extraction (VGG16 [34]) compared to the previously used AlexNet [43]. We also significantly extended the set of experiments performed to evaluate and compare the proposed methods. In the current study we evaluated the effect of using dynamical models by comparing the performance of LSTM models, similar to the ones evaluated in [42], to memory-less perceptron classifiers. Furthermore, we compared the performance of the transfer learning methods described above with models that are built specifically for the current application and trained exclusively on our data. Finally, we ran experiments for multi-speaker and speaker independent settings.
One of our findings is that, given that we optimize the classifier for our task (Perceptron or LSTM), it is not necessary to optimize the feature extractor (the custom CNNs perform similarly to the pre-trained VGG16). This suggests that a pre-trained feature extractor such as VGG16 works well independently of the speaker and can be used to extend our results even beyond the subjects in the present data set. Moreover, the results of the multi-speaker dependent experiment show that the proposed methods can scale beyond a single subject without a significant decrease in performance. Combining this observation with the applicability of transfer learning discussed above suggests that a mixture of the proposed methods can indeed be a useful component of a real-life artificial cognitive system.
Finally, the results of the speaker independent experiment are not as good as those of the previous two experiments. We should mention, however, that from a cognitive system's perspective this might be an unnecessarily challenging condition. We can in fact expect infants to be familiar with a number of caregivers, thus justifying a condition more similar to the multi-speaker dependent test.
VII. CONCLUSIONS
In this study we proposed and evaluated three methods for automatic detection of the active speaker based solely on visual input. The proposed methods could assist an artificial cognitive system in engaging in social interactions, which has been shown to be beneficial for language acquisition.
We tried to reduce the assumptions about the language learning environment to a minimum. The proposed methods therefore allow multiple speakers to speak simultaneously or all to be silent; the methods do not assume a specific number of speakers, and the probability of speaking is estimated independently for each speaker, thus allowing the number of speakers to change during the social interaction.
We evaluated the proposed methods on a large dataset that includes some challenging visual conditions. For example, around half of the time the subjects in the dataset interact with a touch surface and look down while talking, making feature extraction after face detection more difficult. The methods perform well in a speaker dependent and multi-speaker dependent fashion, reaching an accuracy of over 80% (baseline 50%) on a weighted frame-based evaluation metric. The combined results obtained from the transfer learning and multi-speaker learning experiments are promising and suggest that the proposed methods can generalize to unseen perceptual inputs by incorporating a model adaptation step for each new face.
We should acknowledge the general difficulty of the problem addressed in this study. Humans produce many facial configurations when not speaking that may highly overlap with the configurations associated with speaking.
The methods proposed in this study are a prerequisite for socially-aware language acquisition and can be seen as mechanisms for constraining the visual input, thus providing higher-quality and more appropriate data for statistical learning of word-referent mappings. The main purpose of the methods is therefore to help bring an artificial cognitive system one step closer to resolving the referential ambiguity in cluttered and dynamic visual scenes.

APPENDIX
COMPLETE SPEAKER DEPENDENT RESULTS
The complete speaker dependent results are in Table VI.