Realistic human action recognition has emerged as a challenging research topic because of the difficulty of representing different human actions in diverse realistic scenes. In the bag-of-features model, human actions are generally represented by the distribution of local features extracted at the keypoints of action videos. Various local features have been proposed to characterize these keypoints. However, the important structural information among the keypoints has not yet been well investigated. In this paper, we propose to characterize this structural information with shape context. As a result, each keypoint is characterized by both its local visual attributes and the global structural context contributed by the other keypoints. The bag-of-features model is used to represent each human action, and an SVM is employed to perform human action recognition. Experimental results on the challenging YouTube and HOHA-2 datasets demonstrate that our proposed approach, which accounts for structural information, is more effective in representing realistic human actions. In addition, we investigate the impact of choosing different local features, such as SIFT, HOG, and HOF descriptors, on human action representation. We observe that dense keypoints better exploit the advantages of our proposed approach.
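The abstract does not give the exact shape context formulation, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the classic 2D log-polar shape context (5 radial by 12 angular bins, distances normalized by the mean pairwise distance) computed over keypoint (x, y) positions, concatenation of that histogram with each keypoint's local descriptor (e.g. HOG or HOF), and hard-assignment bag-of-features over a precomputed codebook (e.g. from k-means). The function names `shape_context`, `augmented_descriptors`, and `bag_of_features` are hypothetical, and the SVM classification step is omitted.

```python
import numpy as np


def shape_context(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Log-polar shape context histogram for every keypoint (assumed 2D variant).

    points: (N, 2) array of keypoint (x, y) positions, N >= 2.
    Returns an (N, n_r * n_theta) array; row i holds the normalized counts of
    the other keypoints falling into each log-polar bin centred on keypoint i.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]  # diff[i, j] = p_j - p_i
    dist = np.hypot(diff[..., 0], diff[..., 1])
    off_diag = ~np.eye(n, dtype=bool)
    # Normalize by the mean pairwise distance so the histogram is scale invariant.
    dist = dist / dist[off_diag].mean()
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Log-spaced radial bin edges; r_bin is -1 below r_inner and n_r beyond r_outer.
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r + 1)
    r_bin = np.digitize(dist, r_edges) - 1
    t_bin = np.minimum((angle / (2 * np.pi) * n_theta).astype(int), n_theta - 1)

    hist = np.zeros((n, n_r * n_theta))
    # Ignore the keypoint itself and neighbours outside the radial range.
    valid = off_diag & (r_bin >= 0) & (r_bin < n_r)
    rows, cols = np.nonzero(valid)
    np.add.at(hist, (rows, r_bin[rows, cols] * n_theta + t_bin[rows, cols]), 1)
    sums = hist.sum(axis=1, keepdims=True)
    return hist / np.maximum(sums, 1)  # rows with no neighbours stay all zero


def augmented_descriptors(local_desc, points, **sc_kwargs):
    """Concatenate each keypoint's local descriptor with its shape context (assumed fusion)."""
    return np.hstack([np.asarray(local_desc, dtype=float), shape_context(points, **sc_kwargs)])


def bag_of_features(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments for one video."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

For example, `shape_context` on the four corners of a unit square returns a (4, 60) array whose rows each sum to 1. Concatenation is only the simplest way to combine local and structural cues; a weighted combination or separate codebooks per descriptor type would be equally plausible under the abstract's description. The resulting bag-of-features vectors would then be passed to an SVM.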