Abstract:
In many applications, particularly in media production and content localization, it is crucial to detect and evaluate varying degrees of audio-visual synchronization, such as selecting high-quality dubbed audio over poorly synchronized tracks. Traditional contrastively pre-trained LipSync models are designed to distinguish perfectly synced audio from unsynced audio. However, these models fall short when it comes to detecting partial synchronization, such as in dubbed audio, because their training objective is focused on pulling synced lip-motion and audio closer together while pushing everything else apart. This approach limits their ability to accurately gauge varying levels of sync, leading to challenges in scenarios that require a more nuanced understanding of synchronization quality. To address this limitation, we propose a novel deep metric learning approach, the Ranking Supervised Multi-Similarity (RSMS) loss formulation, which introduces a ranking prior as a supervision signal. Our method integrates hard-sample mining to enforce this ranking, allowing the model to better differentiate between partial-syncs and completely unsynced audios. Furthermore, we demonstrate the effectiveness of using “Dubbed Audio” as a train-time example of partial-syncs, leading to improved performance in lip-sync models.
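The abstract describes the RSMS loss only at a high level. As a rough illustration, the hedged sketch below shows one way a ranking prior over synced, dubbed (partially synced), and unsynced audio could be layered on top of a simplified Multi-Similarity loss (Wang et al., 2019). The function name, hinge-style ranking term, and hyperparameters are assumptions for illustration only; this is not the paper's published RSMS formulation, and the hard-sample mining step is omitted.

```python
import torch
import torch.nn.functional as F

def ranking_supervised_ms_loss(anchor, pos, partial, neg,
                               alpha=2.0, beta=50.0, lam=1.0,
                               rank_margin=0.1):
    """Illustrative sketch (not the paper's formulation).

    anchor:  lip-motion embeddings, shape (B, D), L2-normalized
    pos:     synced audio embeddings            (B, D)
    partial: dubbed / partially synced audio    (B, D)
    neg:     completely unsynced audio          (B, D)
    """
    s_pos = F.cosine_similarity(anchor, pos)          # (B,)
    s_partial = F.cosine_similarity(anchor, partial)  # (B,)
    s_neg = F.cosine_similarity(anchor, neg)          # (B,)

    # Simplified per-pair Multi-Similarity terms: pull synced pairs
    # together, push unsynced pairs apart (batch-level pair mining omitted).
    pos_term = (1.0 / alpha) * torch.log1p(torch.exp(-alpha * (s_pos - lam)))
    neg_term = (1.0 / beta) * torch.log1p(torch.exp(beta * (s_neg - lam)))

    # Assumed ranking prior: synced should score higher than dubbed,
    # and dubbed higher than unsynced, each by at least rank_margin.
    rank_term = (F.relu(rank_margin - (s_pos - s_partial)) +
                 F.relu(rank_margin - (s_partial - s_neg)))

    return (pos_term + neg_term + rank_term).mean()
```

The key design point suggested by the abstract is the middle tier: dubbed audio is neither pulled fully toward the lip-motion anchor nor pushed away as hard as unsynced audio, so the model learns a graded notion of synchronization quality rather than a binary one.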
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025