Learning a Contextual Multi-Thread Model for Movie/TV Scene Segmentation

4 Author(s)
Cailiang Liu (Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China); Dong Wang; Jun Zhu; Bo Zhang

Compared with general videos, movies and TV shows attract a significantly larger audience over time and contain rich and interesting narrative patterns of shots and scenes. In this paper, we aim to recover the inherent structure of scenes and shots in such video narratives. The recovered structure could be useful for subsequent video analysis tasks such as tracking objects across cuts and action retrieval, as well as for enriching user browsing and video editing interfaces. Recent research on this problem has mainly focused on combining multiple cues such as scripts, subtitles, sound, or human faces. However, because visual information is sufficient for humans to identify scene boundaries and some of these cues are not always available, we are motivated to design a purely visual approach. Observing that dialog patterns frequently occur in a movie or TV show to form a scene, we propose a probabilistic framework that imitates the authoring process. The multi-thread shot model and contextual visual dynamics are embedded into a unified framework to capture the video hierarchy. We devise an efficient algorithm to jointly learn the parameters of the unified model. We conduct experiments on two large datasets containing six movies and 24 episodes of Lost, a popular TV show with complex plot structures. Comparative results show that, using only visual cues, our method can successfully recover complicated shot threads and outperforms several approaches. Moreover, our method is fast and well suited to large-scale computation.

Published in:

IEEE Transactions on Multimedia (Volume: 15, Issue: 4)