Multi-Modal Automatic Video Segmentation with Sentence Transformer Embeddings and KeyBERT-Based Subtopic Extraction | IEEE Conference Publication | IEEE Xplore


Abstract:

This paper introduces a multi-modal automatic video segmentation strategy that combines the audio transcripts with the OCR output from video frames. Initially, the audio is segmented into smaller chunks based on silence duration. Each chunk is then transcribed using Whisper ASR. We also extract the textual content from the video frames using Tesseract OCR. The audio transcript and the OCR output are then embedded using a sentence transformer. The resulting embeddings are clustered using a hierarchical agglomerative clustering approach. To extract the relevant subtopic in each cluster, the KeyBERT model is employed. The proposed architecture was tested on the publicly available LPM dataset, with NMI, IOU, MOF and F1 score used for evaluation. The proposed method fared relatively better on long-duration videos, with average MOF, IOU and F1 scores of 0.78, 0.72 and 0.54 respectively.
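
The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the linkage criterion, the distance measure, or how the number of clusters is chosen, so average linkage, cosine distance, and a fixed target cluster count are all assumptions here. The function names and the toy vectors, which stand in for sentence-transformer embeddings of transcript and OCR text, are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_cluster(vectors, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage).

    Starts with one cluster per vector and repeatedly merges the two
    clusters with the smallest mean pairwise cosine distance until
    n_clusters remain. Returns one integer label per input vector,
    numbered by the earliest member of each cluster.
    """
    if not 1 <= n_clusters <= len(vectors):
        raise ValueError("n_clusters must be between 1 and len(vectors)")

    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (mean distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [
                    cosine_distance(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                mean_dist = sum(dists) / len(dists)
                if best is None or mean_dist < best[0]:
                    best = (mean_dist, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Number clusters by their earliest chunk so labels follow video order.
    clusters.sort(key=min)
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 3-dimensional vectors standing in for real sentence embeddings
# of five consecutive transcript/OCR chunks (hypothetical data).
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]

labels = agglomerative_cluster(toy_embeddings, n_clusters=3)
print(labels)  # [0, 0, 1, 1, 2]
```

In this toy run the first two chunks and the next two chunks are merged into one segment each, and the last chunk forms its own segment. In a full pipeline the text of each resulting cluster would then be passed to a keyword extractor to name its subtopic.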
Date of Conference: 12-14 July 2024
Date Added to IEEE Xplore: 04 October 2024
Conference Location: Raipur, India
