Provisioning mobile video services is challenging because bandwidth and processing resources are limited in mobile environments. Audiovisual content is present in most multimedia services; however, user expectations of perceived audiovisual quality differ for speech and non-speech content. The majority of recently proposed metrics for audiovisual quality estimation assume only one continuous medium, either audio or video. To predict the audiovisual quality of a multimedia system accurately, it is necessary to apply a metric that takes audio and video quality into account simultaneously. A multi-modal system cannot be modeled as a simple combination of mono-modal models, because the pure combination of audio and video models does not give a robust perceived-quality metric. We show the importance of taking into account the cross-modal interaction between the audio and video modes, as well as the mutual compensation effect. In this contribution we report on measuring the cross-modal interaction and propose a content-adaptive audiovisual metric for video sequences that distinguishes between speech and non-speech audio. Furthermore, the proposed method allows for reference-free audiovisual quality estimation, which reduces computational complexity and extends applicability.
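To make the idea of cross-modal interaction concrete, the following is a minimal sketch of one common way such a metric could be structured. It is an assumption for illustration and not the model proposed here. Mono-modal audio and video scores on a 1 to 5 MOS scale are combined with an additional multiplicative term, and a separate coefficient set is selected per audio content class. The function name, the class labels, and all coefficient values are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coefficients:
    """Weights of a hypothetical audiovisual model (placeholder values only)."""
    offset: float
    audio: float
    video: float
    interaction: float  # weight of the cross-modal product term


# Illustrative coefficient sets, one per audio content class.
# These numbers are made up for the sketch; they are NOT fitted to any data.
COEFFS = {
    "speech": Coefficients(offset=0.5, audio=0.2, video=0.3, interaction=0.1),
    "non_speech": Coefficients(offset=0.5, audio=0.1, video=0.5, interaction=0.08),
}


def audiovisual_mos(audio_mos: float, video_mos: float, content_class: str) -> float:
    """Combine mono-modal scores into one audiovisual score.

    The product term lets the benefit of better audio depend on the video
    quality (and vice versa), which a purely additive model cannot express.
    """
    if content_class not in COEFFS:
        raise ValueError(f"unknown content class: {content_class!r}")
    c = COEFFS[content_class]
    raw = (
        c.offset
        + c.audio * audio_mos
        + c.video * video_mos
        + c.interaction * audio_mos * video_mos
    )
    # Clamp to the 1..5 MOS scale assumed in this sketch.
    return min(5.0, max(1.0, raw))
```

With these placeholder weights, raising the audio score from 3 to 4 improves the combined score more when the video score is 3 than when it is 2, which is the kind of dependence between the modalities that a simple sum of mono-modal models would miss.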