Abstract:
Automatically synthesizing sounds for diverse visual content remains challenging, and there is a strong need for tools that support the direct creation of realistic sounds. Unlike previous works, this paper proposes a novel deep-learning-based approach that formulates sound simulation as a regression problem. This allows us to circumvent the complexity of acoustic theory with a novel, general-purpose neural sound synthesis (V2RA) network. Moreover, the end-to-end architecture of V2RA enables full training without any extra inputs, which greatly improves scalability and reusability over previous works. In contrast to conventional visual-to-audio generation methods, the V2RA problem is formulated and solved with generative adversarial networks (GANs). Furthermore, our network directly predicts synchronized raw audio signals (unlike most existing approaches, which process audio through spectrograms) and generates sound in real time. To evaluate the performance of the neural-network generator, we introduce two quantitative scores. Extensive experiments demonstrate that our V2RA network produces compelling sound results, providing a viable solution for applications such as sound design and dubbing.
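The abstract does not specify the network architecture, but the sketch below (PyTorch) illustrates the general formulation it describes: a generator regresses synchronized raw audio directly from per-frame visual features, while a waveform discriminator supplies the GAN objective. All class names, layer sizes, and the 22050 Hz / 30 fps synchronization assumption are hypothetical illustrations, not the paper's actual V2RA design.

```python
import torch
import torch.nn as nn

class V2RAGenerator(nn.Module):
    """Hypothetical sketch: video features in, raw waveform out (no spectrograms)."""
    def __init__(self, frame_feat_dim=512, samples_per_frame=735):
        super().__init__()
        # Per-frame visual features are assumed precomputed (e.g. by a CNN encoder).
        self.temporal = nn.GRU(frame_feat_dim, 256, batch_first=True)
        # Each frame's hidden state is decoded to a block of raw audio samples,
        # keeping audio aligned with video (22050 Hz / 30 fps = 735 samples/frame).
        self.to_audio = nn.Sequential(
            nn.Linear(256, 1024),
            nn.ReLU(),
            nn.Linear(1024, samples_per_frame),
            nn.Tanh(),  # raw waveform values in [-1, 1]
        )

    def forward(self, frame_feats):           # (B, T, frame_feat_dim)
        h, _ = self.temporal(frame_feats)     # (B, T, 256)
        audio = self.to_audio(h)              # (B, T, samples_per_frame)
        return audio.flatten(1)               # (B, T * samples_per_frame)

class WaveformDiscriminator(nn.Module):
    """Hypothetical sketch: 1-D convolutions over the raw waveform score realism."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=31, stride=4, padding=15),
        )

    def forward(self, wav):                   # (B, num_samples)
        return self.net(wav.unsqueeze(1))     # patch-wise real/fake scores
```

Under this (assumed) setup, training would combine an adversarial loss from the discriminator with a regression loss between generated and ground-truth waveforms, matching the abstract's framing of sound simulation as regression trained end to end.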
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 32, Issue: 3, March 2022)