Motion characterization plays a critical role in video indexing: an effective way of characterizing camera motion facilitates video representation, indexing, and retrieval. This paper describes a novel nonparametric motion representation for effective and robust recognition of video segments in which the camera is static, panning, tilting, zooming, etc. The representation employs mean shift filtering and vector histograms to produce a compact description of a motion field. The basic idea is to perform spatio-temporal mode seeking in the motion feature space and to use histogram-based spatial distributions of the dominant motion modes to represent a motion field. Unlike most existing approaches, which estimate a parametric motion model from a dense optical flow field (OFF) or a block-matching-based motion vector field (MVF), the proposed method combines this motion representation with machine learning techniques (e.g., support vector machines) to treat camera motion analysis as a classification problem. The main motivation is that no single parametric assumption can be secured uniformly across the wide range of video scenarios. Diverse camera shot sizes and the frequent occurrence of bad OFFs/MVFs necessitate a learning mechanism that can not only capture domain-independent parametric constraints but also acquire domain-dependent knowledge to tolerate the influence of bad OFFs/MVFs. To improve performance, this learning-based method can be used to train enhanced classifiers tailored to a particular context (i.e., shot size, neighboring OFFs/MVFs, and video genre). Other visual cues (e.g., dominant color) can also be incorporated for further motion analysis.
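The mode-seeking step above can be sketched as follows. This is a minimal illustration only, not the paper's implementation: it assumes a flat-kernel mean shift over toy 2-D motion vectors and a simple merge rule for converged points; the function names (`mean_shift_modes`, `mode_histogram`), the bandwidth, and the toy field are all hypothetical choices for the example.

```python
import math

def mean_shift_modes(vectors, bandwidth=1.5, iters=50, tol=1e-3):
    """Flat-kernel mean shift over 2-D motion vectors.

    Each vector is repeatedly shifted to the mean of its neighbours
    (within `bandwidth`) until it converges on a local density mode;
    converged points that land close together are merged into one
    discrete mode.
    """
    shifted = [list(v) for v in vectors]
    for p in shifted:
        for _ in range(iters):
            nbrs = [v for v in vectors if math.dist(p, v) <= bandwidth]
            mean = [sum(c) / len(nbrs) for c in zip(*nbrs)]
            moved = math.dist(p, mean)
            p[0], p[1] = mean
            if moved < tol:
                break
    # Merge converged points into discrete modes and label each vector.
    modes, labels = [], []
    for p in shifted:
        for i, m in enumerate(modes):
            if math.dist(p, m) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            modes.append(tuple(p))
            labels.append(len(modes) - 1)
    return modes, labels

def mode_histogram(labels, n_modes):
    """Normalised histogram of mode memberships: a compact
    descriptor of which motion modes dominate the field."""
    hist = [0.0] * n_modes
    for label in labels:
        hist[label] += 1.0
    return [h / len(labels) for h in hist]

# Toy "pan" field: most vectors point rightward, plus two outliers.
vecs = [(3 + 0.05 * i, 0.05 * (i % 3)) for i in range(18)] + [(0.0, 5.0), (0.1, 5.1)]
modes, labels = mean_shift_modes(vecs)
descriptor = mode_histogram(labels, len(modes))  # dominant mode carries 90% of the mass
```

In the paper's setting the histogram would additionally be computed over spatial regions of the frame, so the descriptor captures *where* each dominant mode occurs, not just how often.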
Our main aim is to use a generic feature-space analysis method to develop a flexible, nonparametric OFF/MVF representation that can be fed into a learning framework to robustly capture the global motion by incorporating context information. Results on videos with various types of content (23,191 MVFs culled from the MPEG-7 dataset, and 20,000 MVFs culled from broadcast tennis, soccer, and basketball videos) are reported to validate the proposed approach.
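To illustrate the classification view described above, the sketch below trains a toy linear classifier on hypothetical mode-histogram descriptors. The abstract names support vector machines as the learner; a perceptron is used here purely as a lightweight stand-in for a linear SVM, and the two-class pan-vs-static setup, the 3-bin descriptors, and all names are invented for the example.

```python
def train_linear_classifier(samples, labels, epochs=100, lr=0.1):
    """Toy perceptron standing in for the SVM stage: learns a linear
    boundary between descriptor classes labelled +1 and -1."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical 3-bin descriptors: bin 0 = near-zero motion mode,
# bin 1 = rightward motion mode, bin 2 = everything else.
pan = [[0.10, 0.80, 0.10], [0.15, 0.70, 0.15], [0.05, 0.90, 0.05]]      # label +1
static = [[0.85, 0.10, 0.05], [0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]   # label -1
X, y = pan + static, [1, 1, 1, -1, -1, -1]
w, b = train_linear_classifier(X, y)
```

In the actual system the classifier would also see context features (shot size, neighboring OFFs/MVFs, video genre), which is what lets it tolerate corrupted motion fields that break a purely parametric fit.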