This paper addresses the problem of human action recognition by introducing a sparse representation of image sequences as a collection of spatiotemporal events localized at points that are salient both in space and in time. We detect the spatiotemporal salient points by measuring variations in the information content of pixel neighborhoods not only in space but also in time. We derive a suitable distance measure between the representations, based on the Chamfer distance, and we optimize this measure with respect to a number of temporal and scaling parameters. In this way we achieve invariance against scaling while eliminating the temporal differences between the representations. We use Relevance Vector Machines (RVM) to address the classification problem, and we propose new kernels for the RVM that are specifically tailored to the proposed spatiotemporal salient point representation. The basis of these kernels is the optimized Chamfer distance of the previous step. We present results on real image sequences from a small database depicting people performing 19 aerobic exercises.
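The matching step described above builds on the Chamfer distance between sets of salient points. As an illustrative sketch (not the paper's exact formulation, which also optimizes over temporal and scaling parameters), a symmetric Chamfer distance between two sets of (x, y, t) points could be computed as follows; the function name and NumPy implementation are assumptions for illustration:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    a: (n, d) array of salient points, e.g. d = 3 for (x, y, t).
    b: (m, d) array of salient points.
    Returns the sum of the mean nearest-neighbor distances
    from a to b and from b to a.
    """
    # Pairwise Euclidean distances between all points of a and b.
    diff = a[:, None, :] - b[None, :, :]          # shape (n, m, d)
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # shape (n, m)
    # Directed Chamfer terms: average distance to the nearest
    # point in the other set, in each direction.
    return dist.min(axis=1).mean() + dist.min(axis=0).mean()
```

For identical point sets the distance is zero, and it grows as the salient-point clouds of two action sequences diverge in space and time; a kernel for the RVM can then be built by passing such a (parameter-optimized) distance through, e.g., an exponential map.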