Abstract:
Target speaker voice activity detection (TS-VAD) has recently gained increasing attention due to its wide range of applications, e.g., speaker diarization and extraction. TS-VAD is usually studied under conversational speech scenarios, wherein the speech of individual speakers is partially or entirely non-overlapping. This potentially restricts the application of TS-VAD systems to less challenging acoustic environments. In this work, we study TS-VAD for fully overlapped speech mixtures. We conduct an ablation study with Personal VAD 2.0 as the baseline to gain a deeper understanding of the choice of TS-VAD components and their effect on the detection performance. Our experiments on the WSJ0-2Mix and Libri2Mix datasets show that existing TS-VAD architectures generalize to multi-talker environments involving full speaker overlap. Furthermore, we find that TS-VAD performance is sensitive to the target conditioning and to the method used to fuse it with the voice activity detection network. We identify multiple configurations of target conditioning and fusion methods that outperform the baseline in both single- and multi-talker settings.
Published in: Speech Communication; 15th ITG Conference
Date of Conference: 20-22 September 2023
Date Added to IEEE Xplore: 18 December 2023
Print ISBN: 978-3-8007-6164-7