We propose a general framework for fast and accurate recognition of actions in video using empirical covariance matrices of features. A dense set of spatio-temporal feature vectors is computed from the video to provide a localized description of the action, and is then aggregated into an empirical covariance matrix that compactly represents the action. Two supervised learning methods for action recognition are developed using feature covariance matrices. Common to both methods is the transformation, via the matrix logarithm, of the classification problem in the closed convex cone of covariance matrices into an equivalent problem in the vector space of symmetric matrices. The first method applies nearest-neighbor classification using a suitable Riemannian metric for covariance matrices. The second method approximates the logarithm of a query covariance matrix by a sparse linear combination of the logarithms of training covariance matrices; the action label is then determined from the sparse coefficients. Both methods achieve state-of-the-art classification performance on several datasets, and are robust to action variability, viewpoint changes, and low object resolution. The proposed framework is conceptually simple and has low storage and computational requirements, making it attractive for real-time implementation.
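As an illustration of the first method, the following is a minimal sketch of nearest-neighbor classification on log-mapped covariance matrices. It is not the authors' implementation. The abstract does not name the Riemannian metric, so the sketch assumes the log-Euclidean distance (Frobenius norm of the difference of matrix logarithms). The function names, the `eps` regularizer that keeps the covariance strictly positive definite, and the use of generic feature arrays in place of real spatio-temporal features are all hypothetical choices made for the example.

```python
import numpy as np


def feature_covariance(features, eps=1e-6):
    """Empirical covariance of an (n_samples, d) feature array.

    A small multiple of the identity (eps, an assumed regularizer) is added
    so that the matrix logarithm below is well defined.
    """
    features = np.asarray(features, dtype=float)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    return cov + eps * np.eye(cov.shape[0])


def log_spd(cov):
    """Matrix logarithm of a symmetric positive-definite matrix.

    Uses the eigendecomposition C = V diag(w) V^T, so log C = V diag(log w) V^T.
    """
    w, v = np.linalg.eigh(cov)
    return (v * np.log(w)) @ v.T


def log_euclidean_distance(cov_a, cov_b):
    """Assumed metric: Frobenius norm between the two matrix logarithms."""
    return np.linalg.norm(log_spd(cov_a) - log_spd(cov_b), ord="fro")


def nearest_neighbor_label(query_cov, train_covs, train_labels):
    """Return the label of the training covariance closest to the query."""
    log_q = log_spd(query_cov)
    dists = [np.linalg.norm(log_q - log_spd(c), ord="fro") for c in train_covs]
    return train_labels[int(np.argmin(dists))]
```

In use, each training video would contribute one covariance matrix built from its feature vectors, and a query video would be labeled by calling `nearest_neighbor_label` with its own covariance. Because each video is summarized by a single d-by-d matrix, storage grows with the feature dimension rather than with the number of frames, which is consistent with the low storage requirements claimed above.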