This paper presents a novel action matching method based on a hierarchical codebook of local spatio-temporal video volumes (STVs). Given a single example of an activity as a query video, the proposed method finds videos similar to the query in a video dataset. It is based on the bag of video words (BOV) representation and does not require prior knowledge about actions, background subtraction, motion estimation, or tracking. It is also robust to spatial and temporal scale changes, as well as to some deformations. The hierarchical algorithm yields a compact subset of salient code words of STVs for the query video; the likelihood of similarity between the query video and all STVs in the target video is then measured using a probabilistic inference mechanism. At the first level of the hierarchy, a codebook of STVs is constructed while accounting for the uncertainty in codebook construction, which current versions of the BOV approach ignore. At the second level, a large contextual region containing many STVs (an ensemble of STVs) is considered in order to construct a probabilistic model of the STVs and their spatio-temporal compositions. At the third level, a codebook is formed for the ensembles of STVs based on their contextual similarities; its code words are the proposed labels for the actions exhibited in the video. Finally, at the highest level, the salient labels for the actions are selected by analyzing the high-level code words assigned to each image pixel as a function of time. The algorithm was applied to three available video datasets for action recognition with different complexities (KTH, Weizmann, and MSR II), and the results were superior to other approaches, especially in the cases of a single training example and cross-dataset action recognition.
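To make the lowest level of the hierarchy more concrete, the following is a minimal Python sketch of a bag of video words built with soft (uncertainty-aware) codebook assignment. It is not the authors' implementation. The descriptor (absolute temporal gradient), the use of plain k-means, the Gaussian weighting with `sigma`, the histogram-intersection score, and all function names are assumptions chosen for illustration; the abstract does not specify them, and the paper's probabilistic inference over ensembles of STVs is not reproduced here.

```python
import numpy as np

def extract_stvs(video, size=(4, 8, 8), step=(2, 4, 4)):
    """Densely sample volumes from a (T, H, W) array and describe each one
    by its L2-normalised absolute temporal gradient (an assumed descriptor)."""
    grad = np.abs(np.diff(video.astype(np.float64), axis=0))  # (T-1, H, W)
    st, sh, sw = size
    descriptors = []
    for t in range(0, grad.shape[0] - st + 1, step[0]):
        for y in range(0, grad.shape[1] - sh + 1, step[1]):
            for x in range(0, grad.shape[2] - sw + 1, step[2]):
                d = grad[t:t + st, y:y + sh, x:x + sw].ravel()
                descriptors.append(d / (np.linalg.norm(d) + 1e-8))
    return np.asarray(descriptors)

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d = (a * a).sum(1)[:, None] - 2.0 * (a @ b.T) + (b * b).sum(1)[None, :]
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off

def build_codebook(descriptors, k=8, iters=10, seed=0):
    """Plain k-means, used here as a stand-in for codebook construction."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        labels = sq_dists(descriptors, centers).argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def soft_assign(descriptors, centers, sigma=0.5):
    """Give each STV a weight for every code word instead of one hard label,
    so that assignment uncertainty is retained (assumed Gaussian weighting)."""
    d = sq_dists(descriptors, centers)
    d = d - d.min(axis=1, keepdims=True)  # shift for numerical stability
    w = np.exp(-d / (2.0 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def soft_bov_histogram(video, centers):
    """Average the soft assignments of all STVs in a video."""
    return soft_assign(extract_stvs(video), centers).mean(axis=0)

def histogram_intersection(h1, h2):
    """Simple similarity score between two normalised histograms."""
    return float(np.minimum(h1, h2).sum())

# Usage on synthetic data: one query clip, one target clip.
rng = np.random.default_rng(0)
query = rng.random((12, 32, 32))
target = rng.random((12, 32, 32))
centers = build_codebook(extract_stvs(query), k=8)
score = histogram_intersection(soft_bov_histogram(query, centers),
                               soft_bov_histogram(target, centers))
```

Soft assignment is used here because a hard nearest-code-word label discards how close an STV is to neighbouring code words; keeping per-code-word weights is one simple way to carry that uncertainty into later stages.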