Abstract:
Audio-Visual Active Speaker Detection (ASD) is the task of identifying, at any given moment, who is actively speaking in a multi-person scene by using audio and visual cues. Current mainstream ASD methods encode audio and facial features separately and then adopt a post-feature fusion approach, in which the acoustic features are fused with the facial features from the same frame by vector concatenation or simple projection. Such a solution faces challenges when more than one face appears in the frame or when overlapping speech occurs, since there is no explicit alignment between the active speech and the face of the talking person. Based on this observation, in this study we adopt a new solution that explicitly establishes the relationships between audio and face information using a heterogeneous graph. Specifically, we propose AFs-Net, which captures both the relationships between the audio and each candidate's face and the interactions among the candidates' faces within the same frame. As a result, the graph with attention is trained to learn the importance (attention coefficients) between adjacent nodes. Additionally, we impose consistency constraints that bring speech features closer to speaker characteristics while aligning non-speech features with non-speaker characteristics, further enhancing the audio-face alignment. Our frame-level modeling approach supports both streaming applications and real-time operation. Experiments show that our method achieves state-of-the-art (SOTA) performance across multiple datasets.
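To make the idea concrete, below is a minimal PyTorch sketch of a per-frame heterogeneous graph-attention step over one audio node and N candidate-face nodes, together with a simple cosine-similarity form of the audio-face consistency constraint. This is not the authors' implementation; the module and parameter names (AudioFaceGraphLayer, d_model, consistency_loss) and the exact loss form are illustrative assumptions.

# Minimal sketch (assumed form, not the paper's code): each face node attends
# to the frame's audio node and to the other face nodes in the same frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioFaceGraphLayer(nn.Module):
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)   # query projection (faces)
        self.k = nn.Linear(d_model, d_model)   # key projection (audio + faces)
        self.v = nn.Linear(d_model, d_model)   # value projection (audio + faces)
        self.scale = d_model ** -0.5

    def forward(self, audio: torch.Tensor, faces: torch.Tensor):
        # audio: (B, D) one audio embedding per frame
        # faces: (B, N, D) embeddings of the N candidate faces in the frame
        nodes = torch.cat([audio.unsqueeze(1), faces], dim=1)       # (B, 1+N, D)
        q = self.q(faces)                                           # (B, N, D)
        k, v = self.k(nodes), self.v(nodes)                         # (B, 1+N, D)
        # Attention coefficients over the heterogeneous neighborhood
        # (audio-face edge plus face-face edges within the frame).
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        faces_out = faces + attn @ v                                # residual update
        return faces_out, attn[..., 0]                              # weight on the audio node

def consistency_loss(audio: torch.Tensor, faces: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    # Pull the speech feature toward the labeled speaking face and away from
    # non-speaking faces; labels: (B, N) with 1 = speaking (assumed loss form).
    sim = F.cosine_similarity(audio.unsqueeze(1), faces, dim=-1)    # (B, N)
    return (labels * (1 - sim) + (1 - labels) * sim.clamp(min=0)).mean()

if __name__ == "__main__":
    layer = AudioFaceGraphLayer(d_model=128)
    audio = torch.randn(2, 128)                    # 2 frames
    faces = torch.randn(2, 3, 128)                 # 3 candidate faces per frame
    labels = torch.tensor([[1., 0., 0.], [0., 1., 0.]])
    faces_out, audio_attn = layer(audio, faces)
    loss = consistency_loss(audio, faces_out, labels)
    print(faces_out.shape, audio_attn.shape, loss.item())

Because the graph is built per frame, such a layer can run on each incoming frame independently, which is consistent with the streaming and real-time claim in the abstract.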
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025