Compared with general videos, movies and TV shows attract a significantly larger audience over time and exhibit rich, interesting narrative patterns of shots and scenes. In this paper, we aim to recover the inherent structure of scenes and shots in such video narratives. The recovered structure is useful for subsequent video analysis tasks such as tracking objects across cuts and action retrieval, as well as for enriching user browsing and video editing interfaces. Recent research on this problem has mainly focused on combining multiple cues such as scripts, subtitles, sound, or human faces. However, considering that visual information alone is sufficient for humans to identify scene boundaries and that some of these cues are not always available, we are motivated to design a purely visual approach. Observing that dialog patterns occur frequently in movies and TV shows to form scenes, we propose a probabilistic framework that imitates the authoring process. The multi-thread shot model and contextual visual dynamics are embedded into a unified framework to capture the video hierarchy, and we devise an efficient algorithm to jointly learn the parameters of this unified model. Experiments are conducted on two large datasets containing six movies and 24 episodes of Lost, a popular TV show with complex plot structures. Comparative results show that, leveraging only visual cues, our method successfully recovers complicated shot threads and outperforms several existing approaches. Moreover, our method is fast and well suited to large-scale computation.