The detection of acoustic events (AEs) that naturally occur in a meeting room can help describe the human and social activity taking place in it. Even when the number of considered event classes is small, detection becomes difficult in scenarios where AEs are produced spontaneously and often overlap in time. In this work, we aim to improve AE detection in two ways: first, we select the most discriminative spectro-temporal audio features with a hill-climbing wrapper method; second, we add new features derived from video signals and from an acoustic source localization system. A new metric is also proposed to guide the feature selection. Besides confirming the value of video and source localization information, results on audiovisual data collected in our multimodal room show that a detection system based on a selected subset of features achieves higher accuracy than one using the whole feature set.
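The hill-climbing wrapper mentioned above can be sketched as greedy forward selection: starting from an empty set, repeatedly add the feature that most improves a classifier-based score, and stop when no addition helps. The sketch below is illustrative only, not the paper's actual system; the leave-one-out nearest-centroid scorer, the synthetic two-feature data, and all function names are assumptions introduced for the example.

```python
import random

def accuracy(X, y, feats):
    """Wrapper score: leave-one-out nearest-centroid accuracy
    using only the feature indices in `feats` (an assumed, simple
    stand-in for the paper's detection system)."""
    correct = 0
    for i in range(len(X)):
        # class centroids computed without sample i
        cents = {}
        for c in set(y):
            rows = [X[j] for j in range(len(X)) if y[j] == c and j != i]
            cents[c] = [sum(r[f] for r in rows) / len(rows) for f in feats]
        # classify sample i by nearest centroid (squared distance)
        pred = min(cents, key=lambda c: sum(
            (X[i][f] - cents[c][k]) ** 2 for k, f in enumerate(feats)))
        correct += pred == y[i]
    return correct / len(X)

def hill_climb_select(X, y, n_features):
    """Greedy forward hill-climbing: at each step add the feature
    that most improves the wrapper score; stop when nothing helps."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        cand = [(accuracy(X, y, selected + [f]), f)
                for f in range(n_features) if f not in selected]
        score, f = max(cand)
        if score <= best:
            break
        selected.append(f)
        best = score
    return selected, best

# Synthetic data: feature 0 separates the two classes, feature 1 is noise.
random.seed(0)
X, y = [], []
for c in (0, 1):
    for _ in range(20):
        X.append([c * 4 + random.gauss(0, 1), random.gauss(0, 1)])
        y.append(c)

sel, score = hill_climb_select(X, y, n_features=2)
```

On this toy data the wrapper picks the informative feature first and discards the noise feature, mirroring the paper's finding that a selected subset can outperform the full feature set.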