Bag-of-Visual-Features (BoVF) representations have achieved great success in object recognition, mainly because of their robustness to several kinds of variation and to occlusion. Recently, a number of BoVF approaches have also been proposed for recognizing human actions in videos. One important issue that arises when using BoVF for videos is how to take dynamic information into account, and most proposals rely on 3D extensions of 2D visual descriptors for this. However, we envision alternative approaches based on 2D descriptors applied to the spatio-temporal video planes, instead of to the purely spatial planes traditionally explored by previous work. Thus, in this paper, we address the following question: what is the cost-effectiveness of a BoVF approach built from such 2D descriptors when compared to one based on the state-of-the-art 3D Spatio-Temporal Interest Points (STIP) descriptor? We evaluate the recognition rate and time complexity of alternative 2D descriptors applied to different sets of spatio-temporal planes, and compare them with those of the state-of-the-art STIP descriptor. Experimental results show that, with proper settings, 2D descriptors can yield the same recognition results as those provided by STIP, but at a significantly higher time complexity.