
A Multimodal Frame Sampling Algorithm for Semantic Hyperlapses with Musical Alignment


Abstract:

Producing visually engaging and semantically meaningful hyperlapses presents unique challenges, particularly when integrating an audio track to enhance the viewing experience. This paper introduces a novel multimodal algorithm to create hyperlapses that optimize semantic content retention, visual stability, and the alignment of playback speed with the liveliness of an accompanying song. We use object detection to estimate the semantic importance of each frame and analyze the song's perceptual loudness to determine its liveliness. Then, we align the most important segments of the video, where the hyperlapse slows down, with the quieter parts of the song, signaling a shift in attention from the music to the video. Our experiments show that our approach outperforms existing methods in semantic retention and loudness-speed correlation, while maintaining comparable performance in camera stability and temporal continuity.
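
As a purely illustrative sketch, not the paper's implementation, the song's liveliness could be proxied by short-term RMS loudness and mapped to a playback-speed target: louder passages call for faster playback, quieter ones for slower playback where the video carries the viewer's attention. The function name, the RMS-based loudness proxy, and the speed bounds below are all our own assumptions.

    # Hypothetical liveliness-to-speed mapping (illustrative, not the authors' code).
    import numpy as np
    import librosa

    def loudness_to_speed(song_path, min_speed=1.0, max_speed=10.0, hop_length=512):
        """Return a per-window playback-speed target derived from short-term loudness."""
        y, sr = librosa.load(song_path, sr=None, mono=True)
        rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]  # short-term energy
        db = librosa.amplitude_to_db(rms, ref=np.max)             # rough perceptual scale
        norm = (db - db.min()) / (db.max() - db.min() + 1e-9)     # 0 = quietest, 1 = loudest
        # Louder music -> faster hyperlapse; quiet passages leave room for the video.
        return min_speed + norm * (max_speed - min_speed)

A speed curve like this could then be compared against the importance-driven speed of the hyperlapse to measure the loudness-speed correlation reported in the experiments.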
Date of Conference: 30 September 2024 - 03 October 2024
Date Added to IEEE Xplore: 18 October 2024

Conference Location: Manaus, Brazil

I. Introduction

Over the past two decades, recording daily activities has become accessible with the advent of smartphones, wearable devices, and personal action cameras such as GoPro™. Sharing photos and videos through social media services has also become commonplace, leading to an ever-growing accumulation of visual data competing for our attention. Hands-free recordings of daily activities often contain repetitive or irrelevant content, because the wearer is focused on the activity itself rather than on managing the camera, which can make the video unpleasant to watch. Egocentric video summarization aims to infer the wearer's intent, reduce irrelevant content, and produce a summary that is pleasant to watch [1]. In particular, dynamic fast-forward methods assign semantic importance scores to the video according to domain-specific criteria, such as route guidance [2] or the presence of people [3]. These scores are used to lower the playback speed during important segments and raise it during unimportant ones, producing a representative summary video with no gaps between scenes.
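
As a rough sketch of the dynamic fast-forward idea described above (a hypothetical illustration, not the method of any cited work), the loop below drops frames at a rate inversely related to a per-frame importance score, so important segments play slowly and unimportant ones are skipped quickly. The skip bounds and function name are illustrative assumptions.

    # Hedged sketch of importance-driven frame sampling (illustrative only).
    import numpy as np

    def dynamic_fast_forward(importance, min_skip=1, max_skip=12):
        """Pick frame indices: high importance -> small skips (slow playback)."""
        scores = np.asarray(importance, dtype=float)
        span = scores.max() - scores.min()
        norm = (scores - scores.min()) / (span + 1e-9)  # 0 = least, 1 = most important
        selected, i = [], 0
        while i < len(scores):
            selected.append(i)
            # Important frames shrink the skip; unimportant ones are fast-forwarded.
            i += max(min_skip, int(round(max_skip - norm[i] * (max_skip - min_skip))))
        return selected

For example, a video whose middle segment scores high would be sampled densely there and sparsely elsewhere, yielding a gap-free summary whose local speed tracks semantic importance.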
