Bootstrapping Language-Audio Pre-training for Music Captioning | IEEE Conference Publication

Abstract:

We introduce BLAP, a model capable of generating high-quality captions for music. BLAP leverages a fine-tuned CLAP audio encoder and a pre-trained Flan-T5 large language model. To achieve effective cross-modal alignment between music and language, BLAP utilizes a Querying Transformer, allowing us to obtain state-of-the-art performance using 6x less data compared to previous models. This is a critical consideration given the scarcity of descriptive music data and the subjective nature of music interpretation. We provide qualitative examples demonstrating BLAP’s ability to produce realistic captions for music, and perform a quantitative evaluation on three datasets. BLAP achieves a relative improvement on FENSE compared to previous models of 3.5%, 6.5%, and 7.5% on the MusicCaps, Song Describer, and YouTube8m-MTC datasets, respectively. The codebase is available at https://github.com/ETH-DISCO/blap.
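The abstract describes cross-modal alignment via a Querying Transformer: a fixed set of learned query tokens cross-attends to audio-encoder features, compressing a variable-length audio sequence into a fixed number of embeddings for the language model. A minimal NumPy sketch of that querying step follows; all shapes, names, and the single-head attention formulation are illustrative assumptions, not BLAP's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_cross_attention(queries, audio_feats):
    """Learned queries attend over audio features (single head, no projections).

    queries:     (num_queries, d) learned query tokens
    audio_feats: (num_frames, d)  frame embeddings from an audio encoder
    returns:     (num_queries, d) fixed-size summary of the audio
    """
    d = queries.shape[-1]
    scores = queries @ audio_feats.T / np.sqrt(d)  # (num_queries, num_frames)
    attn = softmax(scores, axis=-1)                # attention weights per query
    return attn @ audio_feats                      # weighted sum of audio frames

# Hypothetical sizes: 32 query tokens, 200 audio frames, 64-dim embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 64))
audio = rng.normal(size=(200, 64))
summary = query_cross_attention(queries, audio)
print(summary.shape)  # (32, 64): fixed-size output regardless of audio length
```

Because the output size depends only on the number of query tokens, the downstream language model sees the same input shape for any clip duration, which is the property the Querying Transformer provides here.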
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
