A new approach for Video Action Recognition: CSP-based filtering for video to image transformation

In this paper we report on the design of a pipeline involving Common Spatial Patterns (CSP), a signal processing approach commonly used in the field of electroencephalography (EEG), matrix representation of features and image classification to categorize videos taken by a humanoid robot. The ultimate goal is to endow the robot with action recognition capabilities for a more natural social interaction. In summary, we apply the CSP algorithm to a set of signals obtained for each video by extracting the skeleton joints of the person performing the action. From the transformed signals a summary image is obtained for each video, and these images are then classified using two different approaches: global visual descriptors and convolutional neural networks. The presented approach has been tested on two data sets that represent two scenarios with common characteristics. The first one is a data set with 46 individuals performing 6 different actions. In order to create the group of signals of each video, OpenPose has been used to extract the skeleton joints of the person performing the actions. The second data set is an Argentinian Sign Language data set (LSA64), from which the signs performed using just the right hand have been used. In this case the joint signals have been obtained using MediaPipe. The results obtained with the presented method have been compared with a Long Short-Term Memory (LSTM) method, achieving promising results.


I. INTRODUCTION
Video action recognition is a task which involves recognizing the action that is being performed in a sequence of observations. It is mainly used in computer vision, since the visual features provide basic information about what is happening in the image sequence, and has many real-life applications, such as visual surveillance, rehabilitation, human-computer interaction or entertainment.
Due to the fast growth of the technology, the demand for automatic interpretation of human behavior within videos is also growing, making video action recognition a highly active area. Even though many different approaches have been presented throughout the years trying to solve the problem of the identification of actions in videos, action recognition has not seen the gains in performance that have been achieved in image classification or human face recognition. The main reason is the complexity of combining both spatial and temporal information, which makes this problem harder than image analysis.
In this paper, a pipeline for a video action recognition method is presented, which has been applied to solve two problems with common characteristics. The first application where the presented method has been tested is human-robot interaction. Human-robot interaction (HRI) aims to understand, design and evaluate robotic systems to be used by or with humans. Especially when dealing with social robots, a highly evolved type of interaction is required, since these robots cannot be merely teleoperated, and they are expected to meet high operational standards in order to be accepted by the general public.
The presented method aims to endow a pseudo-humanoid robot with the ability to understand the action that an actor is performing, in order to be able to give an adequate response, thus enhancing the social capabilities of the robot. A data set with six different actions performed by different people has been created to test the method. In Fig. 1 an interaction example between a person and the robot is displayed.
(a) Image captured by the robot.
(b) Expected reaction of the robot.
The second application is sign language recognition. Nowadays, a large number of people, about 466 million, have some degree of hearing impairment, and this number is expected to grow in the coming years. Many of those people use sign languages to communicate with others, but since these languages are not commonly known among the hearing community, people with hearing problems often face communication difficulties in environments where no interpreter is available. In order to try to break the barrier between the deaf community and the rest of society, significant work is being carried out in Sign Language Recognition (SLR), where computer vision is playing a major role.
In order to improve the interaction between the people with hearing impairments and the robot, it is interesting to endow the robot with the ability to recognize certain gestures and react in different ways. Driven by the results obtained in [1], it has been decided to test the method presented in this paper on the recognition of some signs that are included in an Argentinian Sign Language database.
The approach presented herein continues with the work presented in [2], where Common Spatial Patterns (CSP), a method commonly used in Brain Computer Interface (BCI) for ElectroEncephaloGram (EEG) systems [3], [4], is used as feature extraction method for a video action recognition task.
In order to apply CSP, the information about the person performing the action to identify must be extracted. To that end, two different technologies have been used: OpenPose [5] to extract the skeletons of the action recognition videos and MediaPipe [6] to extract hand landmarks of the sign language data set.
The positions of the joints of the skeletons are used as input for the CSP, as presented in the previous work [2].
In this new approach, after computing the CSP algorithm, a matrix multiplication is applied and the transformed signals are represented as images. The features for the classification are extracted from those images using several visual global descriptors, and different classifiers have been tested to perform the classification.
Several experiments have been performed with the proposed approach in both databases, and their results have been compared to a Long Short-Term Memory (LSTM) paradigm in order to validate it.
The rest of the paper is organized as follows. First, in Section II some related works are described in order to introduce the topic. In Section III the proposed approach is introduced, explaining the process that has been carried out. Then, in Section IV the databases are presented and the experiments are explained further. Next, in Section V the obtained results are shown, and finally, in Section VI the conclusions extracted from this work are presented and future work is pointed out.

II. RELATED WORKS
Many approaches for video action recognition have been introduced lately. These techniques make use of the visual features extracted from the video, both static and temporal. The temporal features mix the static image features with time information, so that the temporal information of the video is maintained.
In [7] the authors use a temporal template which is based on a static vector-image where the value of the vector at each point represents a function of the motion properties at the corresponding spatial location in an image sequence. Local spatio-temporal interest points can be used to recognize complex motion patterns as it is demonstrated in [8]. A hybrid hierarchical model is presented in [9] where collections of spatial and spatio-temporal features are used to represent video sequences. Many other methods make use of Histograms of Oriented Gradients (HOG) or Histogram of Oriented Optical Flow (HOOF) [10]- [12]. Motion descriptors based on the direction of optical flow have also been introduced [13], [14]. The use of depth data captured by depth cameras has also grown due to the advances in imaging technology [15], [16].
With these two publications [17], [18] as a starting point, deep learning has continued to be used for activity recognition, mainly with Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) [19]. Very deep two-stream ConvNets are presented in [20] which, according to the authors, get close to image domain deep models. Convolutional Neural Networks (CNN) and deep bidirectional LSTM (DB-LSTM) networks are used in [21]. In [22] the authors combine 3D-CNN and LSTM networks. Motion maps, which integrate temporal information, are iteratively extracted from videos using a kind of deep 3-dimensional CNN (C3D), acquiring a final motion map of the whole video. LSTM is used for the final prediction.
As two-stream CNNs are unable to model long-term temporal structures, Wang et al. [23] developed a temporal segment network (TSN) which is able to model dynamics throughout the whole video. TSN extracts short snippets over a long video sequence with a sparse sampling scheme, this way modeling long-range temporal structures and preserving relevant information. Temporal Relation Network (TRN) [24] is a network module which enables temporal relational reasoning and can be easily plugged into an existing neural network. The module tries to describe the temporal relations between observations in videos. While TSN uses average pooling, ignoring the temporal order, TRN replaces the average pooling with an interpretable relational module. The authors of [25] proposed a Temporal Shift Module (TSM) which shifts the channels forward or backward along the temporal dimension to exchange information between adjacent frames. The Gate Shift Module (GSM) [26] has a learnable spatial gating block which controls spatio-temporal interactions. Other authors [27] present Channel-Separated Convolutional Networks (CSN), which factorize 3D convolutions into either pointwise 1 × 1 × 1 convolutions for channel interaction or depthwise k × k × k (usually k = 3) convolutions for local spatio-temporal interactions. Temporal Pyramid Network (TPN) [28] models the visual tempo at feature level, extracting temporal features by combining features obtained at different tempos.
Skeleton data has also been used to perform activity recognition. The authors of [29] use an LSTM network to focus on the significant joints of the skeleton within each frame and, according to that, the outputs of different frames are weighted. In [30] the authors present a representation where a human pose estimator is used and heatmaps are extracted for the human joints in each frame. In [31] a method for encoding geometric relational features into color texture images is presented, where temporal variations of different features are converted into the color variations of their corresponding images. They use a multi-stream CNN model to classify the images. The authors of [32] propose a two-stream adaptive graph convolutional network (2s-AGCN), where both the coordinates of the joints and the bones between the joints are used as features for classification.
Regarding Sign Language Recognition (SLR), different techniques have been used in recent years [33]- [36]. On the one hand, we can find methods that make use of intrusive sensors which must be placed on the person who is performing the signs. These wearable markers or data gloves are used to detect the body and hand movements [37]- [39]. In the case of non-intrusive systems, there are techniques that make use of sensors such as Leap Motion or Microsoft Kinect [40]- [43] and others that focus on the information obtained by cameras, vision-based methods [44]- [46]. Most of the presented methods use neural networks to perform the classification, like CNNs and LSTMs [47]- [49], although Hidden Markov Models (HMM) have been widely used for SLR too [50]- [52]. As a practical application, it is possible to mention [53], where the authors develop a software system for hearing impaired children with articulation disorders.

III. PROPOSED APPROACH
The method presented in this paper is a continuation of the work presented in [2], which uses CSP applied to skeleton information for video action recognition. In this work, an image is obtained for each video that summarizes the information of the video and that can be then classified using image classifiers. Therefore, we transform the video classification problem into an image classification problem. An overview of the proposed approach can be seen in Fig. 2.
As seen in the overview of the method, the first step is the extraction of the skeletons of the person performing the action or sign to be recognized. The positions of the joints in the skeletons are then used to create signals. The created signals are the input for the Common Spatial Patterns algorithm.
The Common Spatial Patterns (CSP) algorithm [54], a mathematical technique for signal processing, has been widely used in Brain Computer Interface (BCI) applications for electroencephalography (EEG) systems [55], [56]. It has also been applied in the field of electrocardiography (ECG) [57], electromyography (EMG) [58], [59] or even in astronomical images for planet detection [60]. CSP was presented as an extension of Principal Component Analysis (PCA) and it consists of finding an optimum spatial filter which reduces the dimensionality of the original signals. Considering just two different classes, a CSP filter (1) maximizes the difference of the variances between the classes, maximizing the variance of filtered signals of EEG of one of the targets while minimizing the variance for the other.
As the feature vectors of the spatial filter W are sorted by variance, the first and the last q vectors, which produce the smallest variance for one class and the largest variance for the other class, are used to project the original signals (2). Finally, the feature vector is obtained by calculating the variance of the transformed signals Z (3). The feature vector value for the p-th component of the i-th trial is the logarithm of the normalized variance.
The CSP algorithm can only work with pairs of classes, but multiclass classification is possible using pairwise classification approaches, such as One versus One (OVO) as a class binarization technique [61].
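The two-class procedure described above can be sketched in a few lines of NumPy. This is the textbook CSP formulation (joint diagonalization through whitening), not necessarily the implementation used in this work; the function names, the trace normalization of the covariances and the eigenvalue floor are assumptions made for the illustration.

```python
import numpy as np


def class_covariance(trials):
    """Average trace-normalised covariance of a list of (K, L) signal matrices."""
    covs = []
    for x in trials:
        x = x - x.mean(axis=1, keepdims=True)  # centre every signal over time
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def csp_filters(trials_a, trials_b, q):
    """Return a (2q, K) matrix holding the q first and q last CSP filters."""
    cov_a = class_covariance(trials_a)
    cov_b = class_covariance(trials_b)
    # Whiten the composite covariance; tiny eigenvalues are floored (assumption).
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    evals = np.maximum(evals, 1e-10)
    whitening = np.diag(evals ** -0.5) @ evecs.T
    # Diagonalise the whitened covariance of class A and sort by eigenvalue.
    lam, vecs = np.linalg.eigh(whitening @ cov_a @ whitening.T)
    order = np.argsort(lam)[::-1]  # largest class-A variance first
    w = vecs[:, order].T @ whitening  # rows are spatial filters
    return np.vstack([w[:q], w[-q:]])


def log_variance_features(w, trial):
    """Logarithm of the normalised variance of each projected signal."""
    z = w @ (trial - trial.mean(axis=1, keepdims=True))
    var = z.var(axis=1)
    return np.log(var / var.sum())
```

With this ordering the first q filters maximize the variance of the first class relative to the second, and the last q filters do the opposite, so a multiclass problem is handled by calling `csp_filters` once per pair of classes.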
The CSP-filtered signals are further processed by applying two matrix operations. Let M ∈ R^{K×L} be the matrix formed by the extracted video signals, where K is the number of signals and L is the maximum length. On the one hand, M is multiplied by its transpose (Eq. 4) and, on the other hand, the covariance matrix is calculated (Eq. 5). The motivation behind these transformations is that one of the dimensions of the matrix representing the signals is the number of signals, but the other can be arbitrarily long, as it is the number of time steps or frames. Therefore, multiplying the matrix by its transpose reduces the data to a manageable size. In turn, centering the matrix, multiplying it by its transpose and dividing by the number of rows minus one produces the covariance matrix, which provides information about global characteristics of the signals.
In both cases a K×K matrix is obtained, where K is the number of signals, that is, the number of rows of M. These matrices are then treated as images; that is, one image is obtained for each video.
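As an illustration of the two representations, the following NumPy sketch builds both K×K matrices from a K×L signal matrix. The function names are hypothetical. The covariance is computed between signals and is divided by the number of frames minus one, which corresponds to the "number of rows minus one" of the text if the frames are arranged as rows; this reading is an assumption. The min-max scaling to 8-bit gray levels is likewise an assumed convention, since the exact mapping from matrix values to pixel intensities is not specified here.

```python
import numpy as np


def product_image(m):
    """K x K matrix obtained by multiplying the (K, L) signal matrix by its transpose."""
    return m @ m.T


def covariance_image(m):
    """K x K covariance between signals (rows are signals, columns are frames)."""
    centred = m - m.mean(axis=1, keepdims=True)
    return centred @ centred.T / (m.shape[1] - 1)


def to_grayscale(a):
    """Min-max scale a matrix to 8-bit gray levels (assumed convention)."""
    span = a.max() - a.min()
    if span == 0:
        return np.zeros(a.shape, dtype=np.uint8)
    return np.round(255 * (a - a.min()) / span).astype(np.uint8)
```

Whatever the length L of the video, both functions return a matrix whose size depends only on K, which is what allows videos of different durations to be compared as images of the same size.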
The created images are then classified to identify the action that has been performed on the original video.

IV. EXPERIMENTAL SETUP
A. DATA SETS
In the experiments presented in this paper two data sets have been used: one has been collected by us, and the other is a publicly available database.

1) Action Recognition (AR)
This database has been created by recording videos with the camera of the semi-humanoid robot Pepper. It consists of 272 videos with 6 action categories. There are around 45 clips in each category, performed by 46 different people. When recording the actions, the robot adjusts the orientation of its head according to the location of the face of the person appearing in its field of view.
The action categories and the information about the videos can be seen in Table I.
These are the 6 categories that are included in the data set:
1) Come: gesture for telling the robot to come to you.
2) Five: gesture of 'high five'.
3) Handshake: gesture of handshaking with the robot.
4) Hello: gesture for telling hello to the robot.
5) Ignore: ignore the robot, pass by.
6) Look at: stare at the robot in front of it.

2) Sign Language Recognition (SLR)
For the SLR task an Argentinian Sign Language (LSA) data set, the LSA64 data set [62], is used, which is composed of 64 different LSA signs. The videos were recorded by 10 non-expert subjects, who repeated each sign 5 times. Among the performed signs, both one-handed (42 signs performed with the right hand) and two-handed (22 signs) signs can be found. In order to simplify the classification problem, a subset of the data set has been selected: only the videos of the 42 one-handed signs have been used. The name and information of the signs used can be seen in Table II. Thus, the subset used is composed of 2100 videos, where 1150 videos were recorded outdoors with natural lighting (23 signs, 10 signers, 5 repetitions) and 950 videos were recorded indoors with artificial lighting (19 signs, 10 signers, 5 repetitions). The signers wore black clothes and colored gloves (red and green), and they were recorded with a white wall as background. The colored gloves are meant to facilitate the task of hand segmentation, although this is not helpful in the approach presented in this paper, as no hand segmentation is performed. It must be mentioned that the subjects do not make use of facial expressions when performing the signs; they just focus on the movements of the hands.

B. METHOD APPLICATION
The method described in Section III has been applied to both of the presented data sets. Even though both data sets correspond to scenarios where the action or the sign performed by the person in front of the camera needs to be identified, some differences have been introduced in the application of the method.
Different classifiers have also been tested on the classification step of the images that are created from the videos. The different setups that have been tried are described below.

1) Get skeletons and create signals
The selected data sets have different purposes. On the one hand, the AR data set is an action recognition data set where different subjects perform general actions in which the whole body is involved. On the other hand, in the SLR data set the focus is always on the upper body of the signers, especially on their hands. Due to this dissimilarity, different methods have been selected to extract the skeletal information from the videos of each database.
On the AR data set, OpenPose [5] has been used to extract the skeletons of the people in the scene. This tool is a real-time multi-person system to detect the human body in single images. In this case, the actions that have to be recognized are centered on the actor who performs them. Therefore, the skeleton of the actor has been extracted in every frame of each video. The system has been designed with the restriction that only one person ought to be in the field of view of the camera. In any case, as OpenPose allows the detection to be restricted to only one person in order to speed up the processing and tracking, this approach ignores people in the background.
OpenPose returns the (x, y) positions of 25 keypoints (joints). After obtaining the skeleton information for every frame of each video, we can create 50 different signals to represent each video, where each signal is one coordinate of a skeleton keypoint over time. This way, there will be 50 signals (25 for the x positions of the joints and another 25 for the y positions) with the same length as the original video (one skeleton per frame). The appearance of the skeleton and the matrix extracted from the skeletons can be seen in Fig. 3a.
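The construction of the 50 signals can be sketched as a simple array rearrangement. The input layout (one array of shape frames × 25 × 2 per video) and the function name are assumptions made for the example; they do not describe the output format of OpenPose itself.

```python
import numpy as np


def keypoints_to_signals(keypoints):
    """Turn per-frame keypoints of shape (L, 25, 2) into a (50, L) signal matrix.

    Rows 0-24 hold the x positions of the joints and rows 25-49 the y positions.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    n_frames, n_joints, n_coords = keypoints.shape
    # (L, J, C) -> (C, J, L) -> (C * J, L): all x signals first, then all y signals
    return keypoints.transpose(2, 1, 0).reshape(n_coords * n_joints, n_frames)
```

The same function applies unchanged to an array of shape frames × 21 × 3, in which case it yields 63 rows grouped by coordinate rather than by joint.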
For the SLR data set a technology called MediaPipe [6] has been used to track the positions of the hands in each frame of the video. More precisely, the MediaPipe Holistic solution is used, which integrates separate models for pose, face and hand components. This solution offers real-time hand tracking, which includes 21 landmarks for each hand.
It has been noticed that, due to the speed of the movements or the use of colored gloves, MediaPipe is not able to track the hands in 52 videos of the SLR data set. In order to try to solve this issue, the original videos have been converted from the RGB color space to black and white. This way, the performance of MediaPipe has been improved and the number of videos where the hand is not detected in any frame has dropped to 6.
Each landmark returned by MediaPipe is composed of three coordinates (x, y, z), where (x, y) denote its position and the z coordinate represents the depth of each joint with respect to the position of the wrist. Once the landmark values are obtained, a set of signals is created for every video of the database.
In Fig. 3b a graphical explanation of the hand landmarks and the extracted set of signals S for video i are shown, where k is the number of joint features, n is the number of frames and J_{u,c,v} is the landmark value for joint u, coordinate c ∈ {x, y, z} and frame v. For each frame 21 joints (k = 21) are extracted, and as each landmark is composed of (x, y, z) values, the signal matrix has 63 rows: 3 values (x, y, z) for each one of the 21 joints (3 × 21 = 63). In Fig. 4 an example of the sequence obtained from a video is shown, both for the action recognition data set and the LSA64 data set. Fig. 4a presents the skeleton obtained by OpenPose and Fig. 4b the hand landmarks extracted with MediaPipe.
(a) Example of skeleton obtained for an action sequence.
(b) Example of hand landmarks obtained for a sign sequence.

2) Apply the Common Spatial Patterns algorithm
In order to compute the CSP algorithm, the signals have been preprocessed first. It has to be considered that some joints could be missing from the captured skeletons when the actor does not fit entirely in the camera range or when OpenPose and MediaPipe are not able to capture some of the landmarks. In these cases, the missing joint values are estimated by linear interpolation, using the previous and next values of that joint. The interpolation is done to avoid having missing values, under the assumption that consecutive values of the joint positions follow a smooth curve. An example of 25 signals of the x positions of the joints can be seen in Fig. 5, where the signals before and after the interpolation are shown.
Once the landmarks are processed and, hence, the signals are formed, the CSP is computed in order to separate the classes according to their variance. Since in both data sets a multiclass classification needs to be performed, a pairwise approach is used. In Fig. 7 an example of the variances obtained from the signals transformed by applying the CSP algorithm can be seen.
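The linear interpolation of missing joint values described above can be sketched with `numpy.interp`. The sketch assumes that missing detections have been marked as NaN beforehand (the pose estimators may encode them differently), and the function name is hypothetical. Gaps at the very beginning or end of a signal have no neighbour on one side; `numpy.interp` fills them with the nearest known value, which is one possible choice among several.

```python
import numpy as np


def fill_missing(signal):
    """Linearly interpolate the NaN entries of a 1-D joint signal."""
    signal = np.asarray(signal, dtype=float)
    frames = np.arange(signal.size)
    known = ~np.isnan(signal)
    if not known.any():
        return signal.copy()  # nothing to interpolate from
    # Interior gaps are interpolated; leading/trailing gaps take the edge value.
    return np.interp(frames, frames[known], signal[known])
```

Applying the function row by row to the signal matrix of a video removes the missing values before the CSP filters are computed.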
As has been explained, the CSP filter tries to separate the given classes by variance, where the first q vectors produce the smallest variance for one class and the largest for the other, while the last q vectors produce the opposite. In Fig. 7, three pairs of vectors are shown (q1 − q16, q2 − q17 and q3 − q18), and the difference between the variances of the classes (ignore and five) is clearly noticeable in each of the vectors: the first q vectors (q = 15 in this case) minimize the variance of class five and maximize the variance of class ignore, and the last q vectors minimize the variance of class ignore and maximize the variance of class five.

3) Matrix representation
In Fig. 8 some examples of the obtained images are shown for the classes come and five of the presented AR data set (as mentioned before, the whole process is computed in pairs). The images are low-dimensional since they come from 50×50 matrices.

4) Classification
Once the summary images are created, different classification strategies can be used in order to classify them into the origi- [...]
[...] descriptor [63]: it represents an image by its local shape and the spatial information of the shape.
Additionally, a Convolutional Neural Network (CNN) has been applied to classify the summary images obtained for each video. Its performance might vary drastically between hyperparameter configurations and, therefore, in order to provide a fair comparison, we have used a Keras Tuner HyperModel with a RandomSearch tuner to look for good configurations automatically. The input consists of one image per video, with a shape of 50×50×1 for the AR data set and 63×63×1 for the LSA64 database, since the images are gray-scale.
Convolutional layers, dropout layers, max pooling layers and a final dense layer of two units (as the classification is performed by pairs) make up the network. Adam is used as optimizer and categorical cross-entropy as loss function. The learning rate, activation functions, number of filters and dropout rate hyperparameters have been tuned.

5) Comparison
To finish, the proposed approach is compared with a type of neural network widely used for video action recognition tasks, a Long Short-Term Memory (LSTM) network. The LSTM network takes the signals obtained from the joints of the skeleton as input, so its input is two-dimensional (number of frames, number of joints), and its output has 64 units. A dense layer of 2 units follows, since the classification is carried out between two classes. Regarding the rest of the hyperparameters, the Adam optimizer and the categorical cross-entropy loss function have been used, and the network has been trained for 100 epochs with a batch size of 25.

V. EXPERIMENTAL RESULTS
In this section the results obtained in the experiments are presented. First, the results obtained for the action recognition data set are shown, and afterwards the outcomes obtained for LSA64 are explained. Tables III and IV show the results obtained using the characteristics extracted by the global descriptors from the images obtained by matrix multiplication (Eq. 4) and from the images of covariance matrices (Eq. 5), respectively. A mean accuracy value is also presented for each pair of classes, with the best values highlighted in bold. In the tables, the classes have been abbreviated as follows: C (come), F (five), H (handshake), He (hello), I (ignore) and L (look at).

A. ACTION RECOGNITION DATA SET
First, if the two tables (Table III and Table IV) are compared with each other, both the matrix multiplication and the covariance matrix obtain good results. After applying the descriptors, both representations yield a mean over all the entries of ∼ 0.86 and a median of ∼ 0.92. Thus, in general the results are encouraging.
Next, three types of comparisons are made: by classifier type, by image descriptor type, and by class pairs.
Regarding the classifiers, on average they all obtain similar results and none stands out from the rest. Even so, Naive Bayes (NB) has been the worst in both representations, and the best average result is achieved by the Support Vector Machine (SVM).
Concerning the descriptors, the difference is more noticeable. EHD and PHOG get outstanding results with an average accuracy of ∼ 0.95. The CLD descriptor does not get bad results either (∼ 0.87 on average). The worst results, by [...]
Finally, in relation to the pairs of classes, the good results of handshake-hello (H-He), handshake-ignore (H-I) or hello-look at (He-L) can be highlighted. In addition, the pairs five-ignore (F-I) and come-ignore (C-I) achieve very good results with all the descriptors except FCTH, which, as already mentioned, is the descriptor with the worst results overall. Even the worst pair of classes, five-hello (F-He), obtains an average accuracy of 0.71; therefore, very good results have been achieved in the experiment.
Table V shows the average accuracy values obtained for each type of descriptor and image of the presented approach, along with the results obtained using the CNN taking as input the matrix representation images and the results achieved by the LSTM mentioned before, where the best values are highlighted in boldface. The results obtained with a previous approach [2] are also shown.
The obtained accuracy values show that our new approach beats the LSTM method and the previous approach for every class pair. Furthermore, this table shows that the best results are achieved using the CNN and, among the global descriptors, EHD or PHOG. Although the best mean value corresponds to the PHOG descriptor, the CNN performs better in 9 out of 15 pairs of classes. Regarding the type of images, better outcomes are generally obtained using the features extracted from the covariance matrices. However, the best mean value is attained using the images created from the matrix multiplication. It is concluded that our approach beats the LSTM network, since only the configurations which use the FCTH descriptor obtain worse mean accuracy values than the LSTM method.

B. SIGN LANGUAGE RECOGNITION DATA SET, LSA64.
The results obtained for the LSA64 data set after using the global descriptors to extract the features from the images created by matrix multiplication and from the images of the covariance matrix are shown in Table VI and Table VII, respectively. As the experiments have been performed pairwise and there are 42 different classes, a total of 861 (42 × 41/2) tests have been performed for each classifier and descriptor type. In order to summarize them, some statistics are shown in the results: minimum, 1st quartile, mean, median, 3rd quartile and maximum values.
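The number of pairwise tests and the summary statistics listed above can be reproduced with the Python standard library. The function names are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since several definitions of quartiles exist and the one used for the tables is not stated.

```python
from itertools import combinations
import statistics


def class_pairs(classes):
    """All unordered pairs of classes used in the pairwise experiments."""
    return list(combinations(classes, 2))


def summarize(accuracies):
    """Minimum, quartiles, mean, median and maximum of a list of accuracies."""
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "min": min(accuracies),
        "q1": q1,
        "mean": statistics.fmean(accuracies),
        "median": median,
        "q3": q3,
        "max": max(accuracies),
    }


print(len(class_pairs(range(42))))  # 861 pairs for the 42 one-handed signs
print(len(class_pairs(range(6))))   # 15 pairs for the 6 actions
```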
When looking at the results, it can be seen that the obtained mean values are around 0.8 − 0.9 for most descriptors and classifiers. Although the minimum values obtained are quite low, the Q1 values already show that 75% of the pairs of classes get accuracy values higher than 0.9 with the EHD and PHOG descriptors, and greater than 0.8 with the CLD descriptor. There are some pairs of classes which are perfectly classified (accuracy value 1) with every configuration. The best average result is 0.9911 for matrix multiplication, which is obtained with the EHD descriptor and the Bayesian Network (BN) classifier. For the covariance matrix, the best mean value is 0.9878, obtained with the EHD descriptor and the SVM classifier instead.
Comparing both tables (Table VI and Table VII) better results are obtained when using the matrix multiplication. However, there is not a noticeable difference between them.
Regarding the descriptors used, there is a clear difference between the outcomes obtained with each one of them. The EHD descriptor is the one which achieves the best results. In Table [...]
Concerning the classifiers, there is no perceptible contrast between them. As mentioned before, the best mean values have been obtained with BN and SVM. However, the worst average values have also been obtained with the BN classifier, combined with the FCTH descriptor. It can be concluded that their performance depends on the configuration used before the classification.
In order to compare the differences between the tested classes, in Table VIII the mean values obtained for each class of the data set are shown. These values have been calculated with the accuracy values of all the test pairs in which each class has participated. These mean values are achieved with the best configuration, in this case, the features obtained after applying the EHD descriptor to the images obtained by the matrix multiplication and performing the classification with a Bayesian Network. All the classes obtain a mean accuracy value between 0.97 and 1.00. Therefore, not many conclusions can be drawn about the difference of classes, since all of them obtain good results. Bright, Born, Candy and Shut down classes get the highest values and classes like Red2, Where and Bitter get slightly worse values.
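The per-class mean described above, that is, the average of the accuracies of all the pairs in which a class takes part, can be computed as in the following sketch. The function name and the dictionary layout are hypothetical, and the accuracy values in the usage example are made up for illustration; they are not results from the tables.

```python
from collections import defaultdict


def per_class_mean(pair_accuracy):
    """Average, for every class, the accuracies of all the pairs it takes part in."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (class_a, class_b), accuracy in pair_accuracy.items():
        for cls in (class_a, class_b):
            totals[cls] += accuracy
            counts[cls] += 1
    return {cls: totals[cls] / counts[cls] for cls in totals}


# Made-up accuracies for three placeholder classes.
example = {("A", "B"): 1.0, ("A", "C"): 0.5, ("B", "C"): 0.75}
print(per_class_mean(example))
```

With 42 classes every class takes part in 41 pairs, so each per-class value is the mean of 41 pairwise accuracies.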
A comparison with an LSTM network is performed and the results are shown in Table IX. Some statistics of the obtained accuracy values are displayed, which refer to all pairs of classes. For the comparison, it has been decided to show only the results obtained with the EHD and PHOG descriptors, as they are the ones which performed best. The results achieved by applying a CNN after creating the images are also presented. Although the minimum obtained value among all the pairs of classes is lower in the presented approaches, the rest of the statistics show that better accuracy values are obtained than with the LSTM method. While the LSTM method obtains an average of 0.9186, our approach achieves a mean value of 0.9869. Regarding the approaches which use global descriptors to extract characteristics from the images to train the classifiers, both the median and the Q1 and Q3 values indicate that a greater number of pairs of classes obtain better results than with the LSTM.
All 4 configurations presented in the table are able to surpass the LSTM method, with the EHD descriptor being the one that suits best, as mentioned above. When using the CNN, although the median and Q3 values are higher, there are several pairs of classes that achieve lower results than with the LSTM, as indicated by the minimum and Q1 values.
Finally, and in order to understand these results better, Table X shows the mean of the values obtained for each class using the Convolutional Neural Network with both types of images (MM: matrix multiplication, COV: covariance matrix). All the classes obtain high mean accuracy values, which vary between 0.85 and 0.96. For the matrix multiplication images the best value is obtained by the Sweet-milk class (0.9601) and the worst by the Find class (0.8853), whereas for the covariance matrix images the best value is obtained by the Opaque class (0.9504) and the worst by the Red2 class (0.8703). In short, the results obtained for all of the classes are similar and neither matrix representation stands out.

VI. CONCLUSION AND FUTURE WORK
In this paper a new pipeline for action recognition is presented, which has been applied to two different tasks in this domain: activity recognition and sign language recognition. In the presented approach the Common Spatial Patterns method has been applied to signals created from the positions of the skeleton joints of people performing different actions or signs. From the output of the method some images have been created, which have then been classified. In the classification step two approaches have been tested: one based on visual global descriptors and the other on a CNN implementation. The obtained results have been compared to a previous approach by the same authors and also to those obtained by an LSTM, a well-known deep learning method.
As further work, we plan to extend the range of human activities, as well as to implement the presented method in the actual robot. This would allow the robot to react to different actions performed in front of it, or to communicate with people with hearing impairments. Applications in Social Robotics are also to be developed, this being the next envisaged step.
Concerning the sign language recognition, several steps have been identified that would improve the presented method. Facial information of the signers should be added, since it is a crucial feature when interpreting sign language. Signs which use both hands should also be considered, in order to make the recognition system more complete.
In the classification step, other image descriptors could also be used, and in that case a feature subset selection step could be advisable.