Skip to Main Content
Infants learning about their environment are confronted with many stimuli of different modalities. Therefore, a crucial problem is how to discover which stimuli are related, for instance, in learning words. In making these multimodal ldquobindings,rdquo infants depend on social interaction with a caregiver to guide their attention towards relevant stimuli. The caregiver might, for example, visually highlight an object by shaking it while vocalizing the object's name. These cues are known to help structuring the continuous stream of stimuli. To detect and exploit them, we propose a model of bottom-up attention by multimodal signal-level synchrony. We focus on the guidance of visual attention from audio-visual synchrony informed by recent adult-infant interaction studies. Consequently, we demonstrate that our model is receptive to parental cues during child-directed tutoring. The findings discussed in this paper are consistent with recent results from developmental psychology but for the first time are obtained employing an objective, computational model. The presence of ldquomultimodal mothereserdquo is verified directly on the audio-visual signal. Lastly, we hypothesize how our computational model facilitates tutoring interaction and discuss its application in interactive learning scenarios, enabling social robots to benefit from adult-like tutoring.