Audioclip: Extending Clip to Image, Text and Audio | IEEE Conference Publication | IEEE Xplore

Audioclip: Extending Clip to Image, Text and Audio


Abstract:

The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approach...Show More

Abstract:

The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models.We present AudioCLIP – an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio-model into the CLIP framework, thus enabling it to perform multimodal classification and keeping CLIP’s zero-shot capabilities.AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task and out-performs others by reaching accuracies of 97.15 % on ESC-50 and 90.07 % on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC-task on the same datasets (69.40 % and 68.78 %, respectively).We also asses the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
Date of Conference: 23-27 May 2022
Date Added to IEEE Xplore: 27 April 2022
ISBN Information:

ISSN Information:

Conference Location: Singapore, Singapore

Contact IEEE to Subscribe

References

References is not available for this document.