Speech is a natural interface for human communication. However, building human-computer interfaces in unconstrained intelligent spaces remains a challenging task. Incorporating video information has been shown to improve the performance of many audio applications. Similarly, information from microphones is useful in computer vision tasks. One of the first steps in enabling natural human-computer interaction is person tracking. In this paper, we present a new approach to person tracking using both audio and visual information. We develop a multilevel framework that combines audio and visual cues to track multiple persons in a meeting room equipped with cameras and microphone arrays. We discuss in detail the multilevel iterative decoding based audio-visual person tracker (MID-AVT). An extensive experimental evaluation of the MID-AVT and a comparison with other audio-visual tracking techniques are also presented. The dataset consists of real meeting recordings with sensor configurations similar to those used in the CLEAR 2006 and CLEAR 2007 evaluation workshops. The overall accuracy of the tracker was 75%. The MID-AVT framework performed slightly better than a particle filter-based tracker when accurate camera and microphone calibration was available. Moreover, the MID-AVT is shown to be robust to sensor calibration errors, whereas the particle filtering framework fails in their presence. In addition to the audio-visual person tracking results, we also track the active speaker at every time instant and present the results.