This paper proposes a new method for joint audio-video talker localization that exploits the reliability of the estimates produced by the individual localizers: audio, motion detection, and skin-color detection. The reliability information is estimated separately from the audio and video data. The proposed method then uses this reliability information, in conjunction with a simple summing voter, to dynamically discriminate erroneous localizer outputs while fusing the localization results. Based on the voter output, a majority rule makes the final decision on the active talker's current location. The results show that incorporating reliability information during fusion improves localization performance compared with audio-only, motion-detection-only, skin-color-detection-only, and joint audio-video straight-summing fusion methods. The computational complexity of the proposed method is comparable to that of existing methods.
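The fusion scheme described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual algorithm: the localizer names, reliability values, score maps, and the reliability threshold are all assumptions, and a weighted argmax stands in for the paper's voter-plus-majority-rule decision.

```python
# Hypothetical sketch of reliability-weighted fusion; localizer names,
# reliabilities, and score values below are illustrative assumptions,
# not taken from the paper.

def fuse_localizers(scores, reliabilities, threshold=0.5):
    """Reliability-weighted summing voter over candidate talker locations.

    scores: localizer name -> list of scores, one per candidate location.
    reliabilities: localizer name -> estimated reliability in [0, 1].
    Localizers whose reliability falls below `threshold` are treated as
    erroneous and excluded from the vote (the discrimination step).
    """
    kept = [n for n in scores if reliabilities[n] >= threshold]
    if not kept:                      # fall back to plain summing fusion
        kept = list(scores)
    n_locs = len(next(iter(scores.values())))
    votes = [sum(reliabilities[n] * scores[n][i] for n in kept)
             for i in range(n_locs)]
    # Final decision: the location with the most weighted support
    # (a simple stand-in for the paper's majority rule on voter output).
    return max(range(n_locs), key=votes.__getitem__)

# Three localizers scoring four candidate talker locations:
scores = {
    "audio":      [0.1, 0.7, 0.1, 0.1],
    "motion":     [0.2, 0.6, 0.1, 0.1],
    "skin_color": [0.8, 0.0, 0.1, 0.1],   # fooled by a non-talking face
}
reliabilities = {"audio": 0.9, "motion": 0.8, "skin_color": 0.3}
print(fuse_localizers(scores, reliabilities))  # prints 1
```

In this example the skin-color localizer is down-weighted below the threshold and excluded, so the erroneous peak at location 0 does not corrupt the fused estimate, whereas straight summing fusion would weigh all three localizers equally.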