I. Introduction
In recent years, vision-related tasks have become increasingly important for the scientific community, because they are key prerequisites for multimedia content analysis [1]–[3], including event interpretation and behaviour understanding [4], [5]. In [6], for example, a system for the automatic detection of crimes (e.g., chain and purse snatching) is reported. To detect variations of the attributes (e.g., walking and running) composing these complex actions, the authors propose a Gaussian Mixture Model (GMM) with a large number of mixtures, known as a Universal Attribute Model (UAM). In [7], instead, an automated video surveillance system to detect anomalous human behaviours is presented. To obtain reliable results, the authors use a Convolutional AutoEncoder (CAE), which identifies the two-dimensional structures within the image sequences. Another system designed for monitoring purposes is described in [8], where the authors adopt a Bag-of-Visual-Words (BoVW) model to distinguish different classes of actions. Finally, two action recognition methods for evaluating impaired patients who perform rehabilitative exercises, namely serious games, are reported in [9], [10]. In these works, each exercise is treated as a long action, whose evaluation provides the recovery rate of lost skills.