We present in this paper an integrated solution for rapidly recognizing dynamic objects in surveillance videos by exploiting various types of contextual information. The solution consists of three components. The first is a multi-view object representation: a set of deformable object templates, each comprising an ensemble of active features for an object category in a specific view/pose. Each template can be learned efficiently from a small set of roughly aligned positive samples, without negative samples. The second component is a unified spatio-temporal context model that integrates two types of contextual information in a Bayesian way. One is the spatial context, which includes the main surface property (constraints on object type and density) and camera geometric parameters (constraints on object size at a specific location). The other is the temporal context, which contains pixel-level and instance-level consistency models used to generate the foreground probability map and local object trajectory predictions. We further combine the spatial and temporal contextual information to estimate the object pose in the scene and use it as a strong prior for inference. The third component is a robust sampling-based inference procedure. Taking the spatio-temporal contextual knowledge as the prior model and deformable template matching as the likelihood model, we formulate object category recognition as a maximum-a-posteriori (MAP) problem. The probabilistic inference can be carried out by a simple Markov chain Monte Carlo sampler, since the informative spatio-temporal context model greatly reduces both the computational complexity and the category ambiguities. The system performance and the gains from spatio-temporal contextual information are quantitatively evaluated on several challenging datasets, and the comparison results clearly demonstrate that the proposed algorithm outperforms other state-of-the-art methods.
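The prior-times-likelihood MAP formulation solved by a Markov chain Monte Carlo sampler can be sketched minimally as follows. This is an illustrative toy, not the paper's actual models: the state is reduced to a single scalar (an object's apparent size), and the contextual prior and the template-matching likelihood are both stood in for by assumed Gaussian log-densities; the function names and parameters are hypothetical.

```python
import math
import random

def log_prior(size, expected_size=40.0, sigma=5.0):
    # Stand-in for the spatio-temporal context prior: a Gaussian over the
    # expected object size at a given scene location (assumed, not the paper's).
    return -0.5 * ((size - expected_size) / sigma) ** 2

def log_likelihood(size, observed_size=44.0, sigma=3.0):
    # Stand-in for the deformable-template matching score: a Gaussian
    # around the size of the best template match (assumed).
    return -0.5 * ((size - observed_size) / sigma) ** 2

def map_estimate(n_iters=20000, step=2.0, seed=0):
    """Metropolis sampler tracking the best (highest-posterior) state seen."""
    rng = random.Random(seed)
    state = 30.0
    lp = log_prior(state) + log_likelihood(state)
    best, best_lp = state, lp
    for _ in range(n_iters):
        cand = state + rng.gauss(0.0, step)  # random-walk proposal
        cand_lp = log_prior(cand) + log_likelihood(cand)
        # Metropolis acceptance: always move uphill, sometimes downhill.
        if math.log(rng.random() + 1e-300) < cand_lp - lp:
            state, lp = cand, cand_lp
            if lp > best_lp:
                best, best_lp = state, lp
    return best
```

With the assumed Gaussians above, the posterior mode lies between the prior mean (40) and the observation (44), weighted by their precisions; an informative prior, as in the paper's context model, narrows the region the sampler must explore.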