We propose a method for user-driven recognition of events in audio streams, aiming to assist journalists in easily annotating unedited audiovisual content. Nonlocal information provided by the user, such as the knowledge that the sound of applause exists somewhere within the video, is used to adapt the audio event classifiers so that they detect the exact position of these events in the video. Towards this end, each audio class is modeled using a Support Vector Machine (SVM), and the final automatic decision is taken on a mid-term audio basis, using a variant of the One-vs-All architecture. A weighting function is generated from the user input and applied to the soft-output decisions of the respective SVMs, thus adapting the final decision to the user's provided knowledge. To evaluate our method, we have used a large dataset of real news videos, provided by the German international broadcaster (DW - Deutsche Welle) and the Portuguese broadcaster (Lusa - Agência de Notícias de Portugal), in which five audio classes frequently occurring in the particular dataset are defined. Results show that the above process leads to a significant increase in audio tracking performance.
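The core mechanism described above can be sketched in Python with scikit-learn. This is a hypothetical illustration, not the paper's implementation: the class names, the synthetic features, and the multiplicative boost factor are all assumptions, since the abstract does not specify the exact weighting function.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed class names; the abstract only states that five classes are defined.
CLASSES = ["speech", "music", "applause", "silence", "noise"]
rng = np.random.default_rng(0)

# Synthetic mid-term feature vectors standing in for real audio features.
X_train = rng.normal(size=(200, 13))
y_train = rng.integers(0, len(CLASSES), size=200)
X_segments = rng.normal(size=(10, 13))  # mid-term segments of one video

# One binary SVM per class (One-vs-All), with probabilistic soft outputs.
models = []
for c in range(len(CLASSES)):
    svm = SVC(kernel="rbf", probability=True, random_state=0)
    svm.fit(X_train, (y_train == c).astype(int))
    models.append(svm)

# Soft outputs: per-class probability for each mid-term segment.
scores = np.column_stack([m.predict_proba(X_segments)[:, 1] for m in models])

# User input: "applause occurs somewhere in this video" -> boost that class.
user_weights = np.ones(len(CLASSES))
user_weights[CLASSES.index("applause")] = 2.0  # assumed boost factor

# Reweight the soft outputs and take the final per-segment decision.
weighted = scores * user_weights
decisions = [CLASSES[i] for i in np.argmax(weighted, axis=1)]
print(decisions)
```

The key design point is that the user's nonlocal hint never forces a label on any specific segment; it only biases the per-class soft scores, so segments with strong evidence for another class can still win the argmax.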