Abstract:
Performing sound source separation and visual object segmentation jointly in naturally occurring videos is a notoriously difficult task, especially in the absence of annotated data. In this study, we leverage the concurrency between the audio and visual modalities to solve the joint audio-visual segmentation problem in a self-supervised manner. Humans interact with the physical world through several sensory systems, such as vision, hearing, and movement. The usefulness of the interplay among these systems lies in the concept of degeneracy [1], which tells us that cross-modal signals can educate one another without an external supervisor. In this work, we exploit the fact that learning from one modality inherently helps to find patterns in the other by introducing a novel audio-visual fusion technique. To the best of our knowledge, we are also the first to address the partially occluded sound source segmentation task. Our study shows that the proposed model significantly outperforms existing state-of-the-art methods in both visual and audio source separation tasks.
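The abstract states the core idea only at a high level: visual cues condition the separation of a sound mixture so that the two modalities supervise each other without labels. For concreteness, below is a minimal PyTorch sketch of one common instantiation of audio-visual fusion, visually conditioned spectrogram masking, in which pooled frame features (e.g., from a ResNet backbone) modulate audio feature maps and a soft mask selects the target source from the mixture's magnitude STFT. The module name, shapes, and channel-wise modulation here are illustrative assumptions for exposition, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    """Generic audio-visual fusion sketch: visual features gate a
    spectrogram mask so the separated audio is conditioned on the
    visible sound source. Hypothetical illustration, not the paper's
    exact design."""

    def __init__(self, visual_dim=512, hidden=128):
        super().__init__()
        # Audio branch: encode the mixture's magnitude spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Visual branch: project pooled frame features (assumed to come
        # from a visual backbone) to the audio feature width.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Mask head: predict a per-bin soft mask for the target source.
        self.mask_head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, mix_spec, visual_feat):
        # mix_spec: (B, 1, F, T) magnitude STFT of the audio mixture
        # visual_feat: (B, visual_dim) pooled features of the video frame
        a = self.audio_enc(mix_spec)                 # (B, H, F, T)
        v = self.visual_proj(visual_feat)            # (B, H)
        # Channel-wise modulation: the visual embedding scales the audio
        # feature maps, so "what is seen" selects "what is heard".
        fused = a * v.unsqueeze(-1).unsqueeze(-1)
        mask = torch.sigmoid(self.mask_head(fused))  # (B, 1, F, T)
        return mask * mix_spec                       # separated spectrogram


if __name__ == "__main__":
    # Toy shapes: batch of 2, 256 frequency bins, 64 STFT frames.
    model = AudioVisualFusion()
    mix = torch.rand(2, 1, 256, 64)
    vis = torch.rand(2, 512)
    print(model(mix, vis).shape)  # torch.Size([2, 1, 256, 64])
```

In the self-supervised mix-and-separate setting that such models typically use, two videos' soundtracks are mixed and the network is trained to recover each original track from its own visual stream, so the mixture itself provides the supervision signal.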
Date of Conference: 19-22 September 2021
Date Added to IEEE Xplore: 23 August 2021
Index Terms:
- Physical World
- Segmentation Task
- Sound Source
- Self-supervised Manner
- Convolutional Layers
- Feature Maps
- Visual Cues
- Visual Features
- Semantic Information
- Attention Map
- Short-time Fourier Transform
- Similar Sources
- Visual Content
- Sound Localization
- Sound Analysis
- Curriculum Learning
- Feature Pyramid Network
- Video Features
- Ambient Noise Levels
- Visual Context
- Signal-to-interference Ratio
- Acoustic Source
- Audio Information
- Deformable Convolution
- ResNet Backbone
- Attention Block
- Spatial Block
- Transformer Encoder
- Feature Representation