In this paper we address the problem of localisation and recognition of human activities in unsegmented image sequences. The main contribution of the proposed method is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localisation of characteristic, sparse 'visual words' and 'visual verbs'. Evidence for the spatiotemporal localisation of the activity is accumulated in a probabilistic spatiotemporal voting scheme. The local nature of our voting framework allows us to recover multiple activities that take place in the same scene, as well as activities in the presence of clutter and occlusions. We construct class-specific codebooks using the descriptors in the training set, taking the spatial co-occurrences of pairs of codewords into account. The positions of the codeword pairs with respect to the object centre, as well as the frames of the training set in which they occur, are subsequently stored in order to create a spatiotemporal model of codeword co-occurrences. During the testing phase, we use mean shift mode estimation to spatially segment the subject performing the activities in every frame, and the Radon transform to extract the most probable hypotheses concerning the temporal segmentation of the activities within the continuous stream.
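The mean shift mode estimation step used for spatial segmentation can be sketched as follows. This is a minimal, generic mean-shift implementation, not the paper's actual code: the Gaussian kernel, the bandwidth value, and the toy 2-D distribution of votes around a subject centre are all illustrative assumptions.

```python
import numpy as np

def mean_shift_mode(points, bandwidth=1.0, tol=1e-5, max_iter=100):
    """Locate a mode of the kernel density estimate over `points`
    (here, spatial votes for the subject centre) via mean-shift
    iterations with a Gaussian kernel."""
    x = points.mean(axis=0)  # start from the centroid of the votes
    for _ in range(max_iter):
        # Gaussian weight of every vote relative to the current estimate
        d2 = np.sum((points - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)
        # Weighted mean of the votes = one mean-shift step
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# Toy example: 50 spatial votes clustered around (5, 5),
# plus a few uniformly scattered clutter votes.
rng = np.random.default_rng(0)
votes = np.vstack([
    rng.normal([5.0, 5.0], 0.3, size=(50, 2)),  # votes for the centre
    rng.uniform(0.0, 10.0, size=(5, 2)),        # clutter
])
centre = mean_shift_mode(votes, bandwidth=1.0)
```

Because the Gaussian weights decay with distance, the scattered clutter votes contribute little, and the estimate converges to the dense cluster near (5, 5); this is the property that lets a local voting scheme tolerate clutter and recover each subject separately.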