Abstract:
Short videos have emerged as a powerful medium for self-expression, and background music (BGM) plays a crucial role in enhancing audience immersion. Existing video-to-audio generation methods struggle to achieve precise alignment with beat-synced, dynamic rhythms tailored to video content. To address this challenge, we introduce BS-BGM500, a curated dataset comprising 500 short videos with meticulous synchronization between shot boundaries and audio rhythm changes. The dataset includes comprehensive annotations such as shot boundaries, captions, and rhythm information. Additionally, we propose BS-BGM, a diffusion-based model designed for generating BGMs. By integrating visual and textual features while leveraging shot boundary information as a Boundary Rhythm Bias (BRB), the model achieves dynamic rhythm transitions and ensures seamless alignment with video content. Extensive evaluations on BS-BGM500 and the widely used BGM909 dataset demonstrate that our method significantly outperforms previous approaches in audio quality, emotional alignment, and beat-synced consistency. This work represents a substantial advancement in automated BGM generation for short videos, bridging the gap between video dynamics and music generation.
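The abstract does not specify how the Boundary Rhythm Bias is constructed; the sketch below is one hypothetical way shot-boundary timestamps could be turned into a soft temporal bias added to a diffusion model's conditioning. All names and parameters here (build_brb, latent_rate, sigma) are illustrative assumptions for exposition, not the paper's actual implementation.

    # Hypothetical sketch (not the paper's code): map shot-boundary times to a
    # soft "rhythm bias" over audio latent frames. Frames near a shot change get
    # a larger bias value, which a conditioning module could use to encourage
    # rhythm transitions at those points.
    import torch

    def build_brb(boundaries_sec, duration_sec, latent_rate=25, sigma=0.5):
        """Return a [0, 1] bias per latent frame from boundary times (seconds).

        latent_rate: assumed latent frames per second of the audio diffusion model.
        sigma: assumed width (seconds) of each boundary's Gaussian bump.
        """
        n = int(duration_sec * latent_rate)                # number of latent frames
        t = torch.arange(n, dtype=torch.float32) / latent_rate
        bias = torch.zeros(n)
        for b in boundaries_sec:                           # one Gaussian bump per cut
            bias += torch.exp(-0.5 * ((t - b) / sigma) ** 2)
        return bias / bias.max().clamp(min=1e-8)           # normalize to [0, 1]

    # Example: a 10 s clip with shot cuts at 3.2 s and 7.5 s.
    brb = build_brb([3.2, 7.5], duration_sec=10.0)
    print(brb.shape)  # torch.Size([250]) -- one bias value per latent frame

In such a design, the resulting vector would typically be projected and added to the model's timestep or cross-attention conditioning; how BS-BGM actually injects the BRB is detailed in the full paper, not here.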
Published in: IEEE Access (Volume: 13)