Abstract:
Speech separation, a fundamental task in signal processing, can be used in many types of intelligent robots, and audio-visual (AV) speech separation has been proven to be superior to audio-only speech separation. In current AV speech separation methods, visual information plays a pivotal role not only during network training but also during testing. However, due to various factors in real environments, sensors cannot always obtain high-quality visual signals. In this paper, we propose an effective two-stage AV speech separation model that introduces a new approach to visual feature embedding, in which visual information is used to optimize the separation network during training, but no visual input is required during testing. Unlike current methods, which fuse visual and audio features together as the input of the separation network, our model embeds visual features into an AV matching block to calculate a cross-modal consistency loss, which serves as part of the loss function for network optimization. A novel tuple loss function with a learnable dynamic margin is proposed for better AV matching, and two margin change strategies are given. The proposed two-stage AV speech separation method is evaluated on the widely used GRID and VoxCeleb2 datasets. Experimental results show that it outperforms current AV speech separation methods.
Published in: IEEE Journal of Selected Topics in Signal Processing ( Volume: 18, Issue: 3, April 2024)