Abstract:
As an intriguing interaction method accompanying video playback, live video commenting (LVC) shows its value in aggregating user-generated content to deliver real-time feedback, empathetic reactions, and communication among audiences. These comments typically exhibit sentiment polarities reflecting user opinions on different aspects, making sentiment analysis of LVC crucial for understanding what is expressed. However, unlike formal texts, live comments are short yet diverse in style, making them difficult to process with conventional approaches. In particular, because LVC is highly relevant to the video content, sentiment analysis for LVC requires both textual and visual information. Although current methods utilize multimodal data, they do not fully discriminate and leverage the relevant content. In this paper, we propose a diffused fusion network of cross-modal representations for LVC sentiment analysis, featuring a hierarchical diffused design that filters contextual information and fuses cross-modal representations. Moreover, our network follows a multi-task learning scheme that reinforces the representation of the target comments and improves the effectiveness of sentiment analysis. Extensive experiments demonstrate the effectiveness of the proposed approach as well as of each main module.
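The abstract does not specify the network's internals, but the general idea of fusing textual and visual representations and feeding a shared representation to multiple task heads can be illustrated with a minimal sketch. Everything below (the gated fusion, the dimensions, the two heads) is a hypothetical toy construction for intuition, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_fusion(text_feat, video_feat, W_g):
    """Mix text and video features; a learned gate decides, per
    dimension, how much video context to blend into the comment
    representation (a stand-in for the paper's diffused fusion)."""
    gate = 1.0 / (1.0 + np.exp(-np.concatenate([text_feat, video_feat]) @ W_g))
    return gate * text_feat + (1.0 - gate) * video_feat

d = 8  # toy feature dimension
text_feat = rng.standard_normal(d)   # encoded live comment
video_feat = rng.standard_normal(d)  # encoded video frame context
W_g = rng.standard_normal((2 * d, d))

fused = gated_fusion(text_feat, video_feat, W_g)

# Multi-task scheme: two heads share the fused representation.
W_sent = rng.standard_normal((d, 3))  # main head: 3 sentiment polarities
W_aux = rng.standard_normal((d, 1))   # hypothetical auxiliary head

sent_probs = softmax(fused @ W_sent)
aux_score = float(fused @ W_aux)

print(sent_probs, aux_score)
```

The shared `fused` vector receiving gradients from both heads is what lets an auxiliary task reinforce the target-comment representation, which is the spirit of the multi-task design the abstract describes.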
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025