Action recognition in unconstrained amateur videos


Abstract:

In this paper, we propose a systematic framework for action recognition in unconstrained amateur videos. Inspired by the success of local features in object and pose recognition, we extract local static features from sampled frames to capture local pose shape and appearance. In addition, we extract spatiotemporal features (ST features), which have been used successfully in action recognition, to capture local motions. In the recognition phase, we use the Pyramid Match Kernel, based on weighted similarities of multi-resolution histograms, to match two videos within the same feature type. To handle complementary but heterogeneous features, i.e., static and motion features, we choose a multi-kernel classifier for feature fusion. To reduce the noise introduced by background clutter, our system also attempts to automatically locate a rough region of interest around the action. Preliminary tests on the KTH action dataset, the UCF sports dataset, and a YouTube action dataset have shown promising results.
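The matching step described above can be illustrated with a minimal sketch of a pyramid match kernel in the style of Grauman and Darrell: each video's feature set is binned into histograms at several resolutions, and new matches found at each level (via histogram intersection) are weighted so that matches at finer resolutions count more. The 1-D features, bin counts, and level weights here are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Number of matched items between two histograms.
    return np.minimum(h1, h2).sum()

def pyramid_match_kernel(x, y, num_levels=4, lo=0.0, hi=1.0):
    """Toy pyramid match score between two sets of scalar features.

    Real systems bin multi-dimensional descriptors; scalars in [lo, hi]
    are used here only to keep the sketch short.
    """
    prev_matches = 0.0
    score = 0.0
    for level in range(num_levels):
        # Finest grid first: 2^(L-1) bins down to a single bin.
        bins = 2 ** (num_levels - 1 - level)
        hx, _ = np.histogram(x, bins=bins, range=(lo, hi))
        hy, _ = np.histogram(y, bins=bins, range=(lo, hi))
        matches = histogram_intersection(hx, hy)
        # Only matches newly formed at this (coarser) level contribute,
        # weighted by 1 / 2^level so fine-grained matches count more.
        score += (matches - prev_matches) / (2 ** level)
        prev_matches = matches
    return score
```

For the feature-fusion step, a multi-kernel classifier would combine one such kernel per feature type (static and motion), e.g. as a weighted sum of kernel matrices fed to an SVM with a precomputed kernel.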
Date of Conference: 19-24 April 2009
Date Added to IEEE Xplore: 26 May 2009
Conference Location: Taipei, Taiwan

1. INTRODUCTION

With the explosive proliferation of digital video in people's daily lives and on the Internet, action recognition is receiving increasing attention due to its wide range of applications, such as video indexing and retrieval, activity monitoring in surveillance scenarios, and human-computer interaction. Most earlier research focused on holistic video representations such as spatiotemporal volumes [3] or trajectories of body joints [4], [5]. To obtain reliable features, these approaches often make strong assumptions about the video. For instance, the systems in [4], [5] require reliable tracking of human body joints, and [3] needs to perform background subtraction to create a 3D shape volume. Although holistic methods have achieved high recognition accuracy on simple video sequences taken in carefully controlled environments with largely uncluttered backgrounds and no camera motion, these strong assumptions limit their applicability to datasets more complicated than the commonly used "clean" KTH dataset [8]. In practice, it is simply not feasible to annotate a large video dataset to obtain body joints, or to perform reliable background subtraction on a dataset that often contains significant camera motion.
