Abstract:
This paper introduces a multi-modal automatic video segmentation strategy that incorporates audio transcripts along with the OCR output from video frames. Initially, the audio is segmented into smaller chunks based on silence duration. Each chunk is subsequently transcribed using Whisper ASR. We also extract the textual content from the video frames using Tesseract OCR. The audio transcript and the OCR output are then embedded using a sentence transformer. The resultant embeddings are clustered using a hierarchical agglomerative clustering approach. To extract the relevant subtopic in each cluster, the KeyBERT model is employed. The proposed architecture was tested on the publicly available LPM dataset, and NMI, IOU, MOF and F1 score were used for evaluation. It was observed that the proposed method fared relatively better for long-duration videos, with average MOF, IOU and F1 scores of 0.78, 0.72 and 0.54 respectively.
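The clustering step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the linkage criterion, distance measure, or stopping rule, so average linkage, cosine distance, and a distance threshold are assumptions here. The hand-written `toy_embeddings` are hypothetical stand-ins for the sentence-transformer embeddings of the transcript and OCR text.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def agglomerative_clusters(vectors, threshold):
    """Bottom-up clustering with average linkage (an assumed choice).

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold`. Returns clusters as sorted lists of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return sorted(clusters)

# Hypothetical 3-dimensional "embeddings" for five consecutive chunks.
toy_embeddings = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]

segments = agglomerative_clusters(toy_embeddings, threshold=0.3)
print(segments)
```

With a threshold of 0.3, the first two chunks and the next two chunks each form a cluster, and the last chunk remains on its own. With real embeddings, a library implementation such as scikit-learn's `AgglomerativeClustering` would likely be a more practical choice than this quadratic-time loop.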
Date of Conference: 12-14 July 2024
Date Added to IEEE Xplore: 04 October 2024