Abstract:
Recent years have witnessed growing research interest in video highlight detection. Existing studies mainly focus on detecting highlights in user-generated videos with simple topics based on visual content. However, relying solely on visual features limits the ability of conventional methods to capture highlights in videos with more complicated semantics, such as movies. Therefore, we propose to mine the emotional information in video sound to enhance highlight detection. Specifically, we design a novel emotion-enhanced framework with multi-stage fusion to detect highlights in complex videos. Along this line, we first extract multi-grained features from the audio waveform. Then, a tailor-designed intra-modal fusion is applied to the audio features to obtain an emotional representation. Furthermore, a cross-modal fusion is developed to generate a comprehensive representation of each clip by merging the audio emotional representation with the visual features. This representation is then used to predict the highlight probability of the clip. Finally, extensive experiments on real-world datasets demonstrate the effectiveness of our method.
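The pipeline outlined in the abstract (multi-grained audio features, intra-modal fusion into an emotional representation, cross-modal fusion with visual features, highlight-probability prediction) can be sketched roughly as below. This is a minimal illustrative sketch only: the module names, feature dimensions, and the attention-based pooling used for intra-modal fusion are our assumptions, not the paper's actual architecture.

```python
# A minimal PyTorch sketch of the multi-stage fusion pipeline from the abstract.
# All module names, dimensions, and the attention/concatenation fusion choices
# are illustrative assumptions, not the authors' published design.
import torch
import torch.nn as nn


class IntraModalFusion(nn.Module):
    """Fuses multi-grained audio features into one emotional representation.
    Uses simple learned attention over the grains (an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, audio_feats):  # (batch, n_grains, dim)
        weights = torch.softmax(self.score(audio_feats), dim=1)
        return (weights * audio_feats).sum(dim=1)  # (batch, dim)


class CrossModalFusion(nn.Module):
    """Merges the audio emotional representation with visual clip features.
    Concatenation + projection is a placeholder for the paper's fusion."""
    def __init__(self, audio_dim, visual_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, hidden_dim)

    def forward(self, audio_emb, visual_emb):
        fused = torch.cat([audio_emb, visual_emb], dim=-1)
        return torch.relu(self.proj(fused))


class HighlightDetector(nn.Module):
    """End-to-end sketch: intra-modal fusion -> cross-modal fusion -> probability."""
    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.intra = IntraModalFusion(audio_dim)
        self.cross = CrossModalFusion(audio_dim, visual_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats, visual_emb):
        emotion = self.intra(audio_feats)            # audio emotional representation
        clip_repr = self.cross(emotion, visual_emb)  # comprehensive clip representation
        return torch.sigmoid(self.head(clip_repr))  # highlight probability per clip


# Usage on random stand-in features for a batch of 4 clips, each with
# 3 audio grains (e.g., frame-, segment-, and clip-level; hypothetical).
model = HighlightDetector()
audio = torch.randn(4, 3, 128)
visual = torch.randn(4, 512)
print(model(audio, visual).shape)  # torch.Size([4, 1])
```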
Date of Conference: 05-09 July 2021
Date Added to IEEE Xplore: 09 June 2021