While wide-area video surveillance is an important application, it is often not practical, from both a technical and a social perspective, to have video cameras that completely cover the entire region of interest. Obtaining good surveillance results in a sparse camera network requires that the cameras be complemented by additional sensors of different modalities, that sensors be intelligently assigned in a dynamic environment, and that the scene be understood from these multimodal inputs. In this paper, we propose a probabilistic scheme for opportunistically deploying cameras to the most interesting parts of a scene dynamically, given data from a set of video and audio sensors. The audio data is continuously processed to identify interesting events, e.g., entry/exit of people, merging or splitting of groups, and so on. This is used to indicate the time instants at which to turn on the cameras. Thereafter, analysis of the video determines how long the cameras stay on and whether their pan/tilt/zoom parameters change. Events are tracked continuously by combining the audio and video data. Correspondences between the audio and video sensor observations are obtained through a learned homography between the image plane and the ground plane. The method leads to efficient usage of camera resources by focusing on the most important parts of the scene, saves power, bandwidth, and cost, and reduces privacy concerns. We show detailed experimental results on real data collected in multimodal networks.
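To make the correspondence step concrete, the following is a minimal sketch of how a learned ground-plane-to-image-plane homography could be applied to map an audio-localized event position into camera pixel coordinates. The matrix `H` below is an illustrative placeholder, not the homography learned in the paper (which would be estimated from point correspondences, e.g., via direct linear transformation); the function names are hypothetical.

```python
import numpy as np

# Placeholder 3x3 homography mapping ground-plane coordinates (meters)
# to image-plane pixels. In practice this would be learned from
# correspondences between audio-localized positions and image detections.
H = np.array([[1.2, 0.1,   320.0],
              [0.0, 1.5,   240.0],
              [0.0, 0.002,   1.0]])

def ground_to_image(H, xy):
    """Map a 2-D ground-plane point to image pixels via homography H."""
    x, y = xy
    u, v, w = H @ np.array([x, y, 1.0])  # lift to homogeneous coordinates
    return u / w, v / w                  # perspective divide

# An audio sensor localizes an event at (10 m, 5 m) on the ground plane;
# the homography predicts where it should appear in the camera image.
u, v = ground_to_image(H, (10.0, 5.0))
```

Such a mapping lets audio detections directly steer a pan/tilt/zoom camera toward the predicted image region, which is the mechanism the abstract relies on for opportunistic camera deployment.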