1. INTRODUCTION
With the explosive proliferation of digital video in daily life and on the Internet, action recognition is receiving increasing attention due to its wide range of applications, such as video indexing and retrieval, activity monitoring in surveillance scenarios, and human-computer interaction. Most earlier research focused on holistic video representations such as spatiotemporal volumes [3] or trajectories of body joints [4], [5]. To obtain reliable features, these approaches often make strong assumptions about the video. For instance, the systems in [4], [5] require reliable tracking of human body joints, and [3] needs to perform background subtraction to create a 3D shape volume.

Although holistic methods have achieved high recognition accuracy on simple video sequences captured in carefully controlled environments, with largely uncluttered backgrounds and without camera motion, these strong assumptions limit their applicability to datasets more complicated than the commonly used “clean” KTH dataset [8]. In practice, it is simply not feasible to annotate a large video dataset with body joints, or to perform reliable background subtraction on a dataset that often contains significant camera motion.