Unsupervised learning of semantic activities from video collected over time is an important problem for visual surveillance and video scene understanding. Our goal is to cluster tracks into semantically interpretable activity models that are independent of scene location; in contrast, most previous work in video scene understanding has focused on learning location-specific normalcy models. Location-independent models can be used to detect instances of the same activity anywhere in the scene, or even across multiple scenes. Our insight for this unsupervised activity learning problem is to incorporate scene context to characterize the behavior of every track. By scene context, we mean local scene structures, such as building entrances, parking spots, and roads, that moving objects frequently interact with. Each track is attributed with a large number of potentially useful features that capture its relationships and interactions with a set of existing scene context elements. Once the feature vectors are obtained, tracks are grouped in this feature space using state-of-the-art clustering techniques, without considering scene location. Experiments are conducted on webcam video of a complex scene with many interacting objects and very noisy tracks resulting from low frame rates and poor image quality. Our results demonstrate that location-independent and semantically interpretable groupings can be successfully obtained using unsupervised clustering methods, and that the resulting models are superior to standard location-dependent clustering.
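As a rough illustration of the idea, the following minimal sketch describes each track by its distances to scene context elements rather than by its absolute position, and then clusters the resulting feature vectors. Everything in it is an assumption made for illustration: the three distance features per element type, the toy coordinates, the function names `track_features` and `kmeans`, and the use of plain k-means as a stand-in for the unspecified clustering techniques are not taken from the paper.

```python
import math

def nearest_distance(point, elements):
    """Distance from a point to the closest element in a list of (x, y) positions."""
    return min(math.dist(point, e) for e in elements)

def track_features(track, context):
    """Hypothetical context features for one track (a list of (x, y) points).

    context maps an element type (e.g. "parking") to a list of positions.
    For each type we record the distance from the track start, from the
    track end, and the closest approach anywhere along the track.
    """
    features = []
    for kind in sorted(context):
        elems = context[kind]
        features.append(nearest_distance(track[0], elems))
        features.append(nearest_distance(track[-1], elems))
        features.append(min(nearest_distance(p, elems) for p in track))
    return features

def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization (a stand-in)."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        centers.append(list(max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centers))))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(v, centers[j]))
                  for v in vectors]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy scene (assumed): two entrances and two parking spots at different locations.
context = {
    "entrance": [(0, 0), (100, 0)],
    "parking": [(0, 50), (100, 50)],
}
tracks = [
    [(50, 25), (25, 40), (1, 50)],   # ends at the left parking spot
    [(50, 25), (75, 40), (99, 50)],  # ends at the right parking spot
    [(50, 25), (25, 10), (1, 0)],    # ends at the left entrance
    [(50, 25), (75, 10), (99, 0)],   # ends at the right entrance
]
vectors = [track_features(t, context) for t in tracks]
labels = kmeans(vectors, k=2)
print(labels)
```

In this toy scene the two parking tracks fall into one cluster and the two entrance tracks into the other, even though each pair ends on opposite sides of the scene. Clustering the raw end coordinates instead would be driven by where each track ends up.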