More and more cameras are installed every day for safety, security and intelligence-gathering purposes, and the volume of stored video grows accordingly. It is therefore important to manage this resource so that a structured (hierarchical) view can be cast into the activities of long video files, cataloguing only interesting or relevant domain events. This paper addresses the issue by proposing a novel and efficient computational approach to the semantic segmentation of scene activities exhibited in monocular or multi-view surveillance videos. The key is to derive a so-called 'pace' descriptor, reflecting changes in the underlying scene activities of a surveillance site, by detecting key scene frames and modelling their temporal distribution. The former is performed by extracting 2D or 3D appearance-based subspace embedding features, followed by time-constrained agglomerative data clustering. The latter models the density of the key-frame distribution in the time domain and then applies a visual curve segmentation algorithm to identify scene segments of different activities. The approach is especially suited to crowd scene segmentation, and it has been evaluated with promising results on real-world surveillance videos of both underground platforms and a busy industrial park entrance during rush hour.
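The time-constrained agglomerative clustering step mentioned above can be illustrated with a minimal sketch. This is an assumed, simplified reading of the idea, not the paper's implementation: clusters of consecutive frames are represented by feature centroids, and only temporally adjacent clusters are allowed to merge, so each resulting cluster covers a contiguous span of frames. The function name, distance measure (Euclidean between centroids) and stopping threshold are all hypothetical.

```python
import numpy as np

def time_constrained_agglomerative(features, threshold):
    """Greedily merge temporally adjacent clusters of frame features.

    features  : (n_frames, d) array of per-frame appearance features
    threshold : stop merging once the closest adjacent pair is farther
                apart than this distance (hypothetical stopping rule)

    Returns a list of frame-index lists, each covering a contiguous
    temporal segment.
    """
    # Start with one singleton cluster per frame: (indices, centroid).
    clusters = [([i], f.astype(float)) for i, f in enumerate(features)]
    while len(clusters) > 1:
        # The time constraint: distances are computed only between
        # temporally adjacent clusters, never arbitrary pairs.
        dists = [np.linalg.norm(clusters[k][1] - clusters[k + 1][1])
                 for k in range(len(clusters) - 1)]
        k = int(np.argmin(dists))
        if dists[k] > threshold:
            break
        idx_a, cen_a = clusters[k]
        idx_b, cen_b = clusters[k + 1]
        merged = idx_a + idx_b
        # Size-weighted mean keeps the centroid consistent after a merge.
        centroid = (cen_a * len(idx_a) + cen_b * len(idx_b)) / len(merged)
        clusters[k:k + 2] = [(merged, centroid)]
    return [idx for idx, _ in clusters]
```

A key scene frame could then be taken, for example, as the frame nearest each cluster's centroid, and the density of those key frames over time would feed the curve segmentation stage described in the abstract.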