This paper proposes a method for estimating the position of a sound source in human-robot interaction by fusing auditory and visual information with a Bayesian network. We first integrate multi-channel audio signals with a depth image of the environment to generate a likelihood map for sound source localization. However, this integration, denoted "MICs", does not always localize the sound source correctly. To correct such localization failures, we combine the likelihood values produced by MICs with the skin-color distribution in the image, conditioned on the result of classifying the audio signal into speech and non-speech categories. The audio classifier is based on a support vector machine (SVM), and the skin-color distribution is modeled with a Gaussian mixture model (GMM). Given the evidence from MICs, the SVM, and the GMM, a trained Bayesian network infers whether each image pixel corresponds to the sound source. Finally, experimental results demonstrate the effectiveness of the proposed method.
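To make the fusion step concrete, below is a minimal sketch of how the three evidence sources might be combined per pixel. It uses a naive-Bayes factorization (conditional independence of the cues given the source/non-source label) as a stand-in for the paper's trained Bayesian network; the function name, the speech-gated weighting of the skin cue, and the prior value are illustrative assumptions, not the authors' actual conditional probability tables.

```python
import numpy as np

def fuse_pixel_evidence(mic_lik, skin_lik, speech_prob, prior=0.01):
    """Posterior probability that each pixel belongs to the sound source.

    A naive-Bayes simplification of the paper's Bayesian network
    (hypothetical factorization, not the authors' trained model).

    mic_lik     -- HxW likelihood map from the audio+depth integration
                   ("MICs"), values in [0, 1]
    skin_lik    -- HxW skin-color likelihood from the GMM, values in [0, 1]
    speech_prob -- scalar P(speech) from the SVM classifier
    prior       -- assumed prior probability that a pixel is the source
    """
    # Gate the visual cue by the speech decision: skin color is treated
    # as informative only when the audio frame is classified as speech;
    # otherwise it contributes an uninformative 0.5 likelihood.
    visual = speech_prob * skin_lik + (1.0 - speech_prob) * 0.5

    # Joint likelihoods under the "source" and "non-source" hypotheses,
    # assuming the cues are conditionally independent given the label.
    p_source = prior * mic_lik * visual
    p_nonsrc = (1.0 - prior) * (1.0 - mic_lik) * (1.0 - visual)

    # Normalize to a posterior map, guarding against division by zero.
    return p_source / np.maximum(p_source + p_nonsrc, 1e-12)


if __name__ == "__main__":
    h, w = 48, 64
    mic = np.random.rand(h, w)    # stand-in MICs likelihood map
    skin = np.random.rand(h, w)   # stand-in GMM skin-color likelihoods
    post = fuse_pixel_evidence(mic, skin, speech_prob=0.9)
    y, x = np.unravel_index(np.argmax(post), post.shape)
    print(f"most likely source pixel: ({x}, {y}), posterior={post[y, x]:.3f}")
```

Taking the argmax of the posterior map, as in the usage example, would yield the estimated source pixel; the paper's network could encode richer dependencies among the cues than this independence assumption allows.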