Multi-Modal Human Action Recognition With Sub-Action Exploiting and Class-Privacy Preserved Collaborative Representation Learning

Multimodal human action recognition with depth sensors has drawn wide attention due to its potential applications such as healthcare monitoring, smart buildings/homes, intelligent transportation, and security surveillance. Sub-action sharing, especially among similar action categories, is one of the obstacles to robust recognition and makes human action recognition more challenging. This paper proposes a segmental architecture that exploits the relations of sub-actions, jointly with heterogeneous information fusion and Class-privacy Preserved Collaborative Representation (CPPCR), for multi-modal human action recognition. Specifically, a segmental architecture is proposed based on the normalized action motion energy. It models long-range temporal structure over video sequences to better distinguish similar actions that share sub-actions. The sub-action based depth motion and skeleton features are then extracted and fused. Moreover, by introducing within-class local consistency into Collaborative Representation (CR) coding, CPPCR is proposed to address the challenging sub-action sharing phenomenon, learning a high-level discriminative representation. Experiments on four datasets demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
Human action or activity recognition plays a significant role in many potential applications, including security surveillance, human-computer interaction (HCI), health monitoring and intelligent transportation [1]-[6]. For instance, in healthcare environments, by monitoring the behavior of people and recognizing human activities, their activity habits and patterns can be understood. Correct emergency decisions can then be made, so that a healthier and more secure living environment is created for the community. Human action recognition involves specific tasks such as action detection, action localization and action recognition from different data modalities captured with RGB, depth, infrared or inertial sensors. Normally, action recognition is to classify a video or data sequence into one of the pre-defined action categories,
whereas action detection is to determine the presence of an action of interest in continuous untrimmed data streams. Action localization aims to find potential proposals that contain certain human movements, i.e., the time and area in which an action of interest happens.
For an intelligent machine to achieve human-level action recognition, the first requirement is representation capability, i.e., the ability to perceive informative observations (features) from multi-modal data. Based on these observations, a feature space with good capacity is learned. This feature space receives and stores distinct characteristics of the objects of interest, which requires representation learning methods. Multimodal observations can increase both the amount of useful information that is captured and the receptivity to new impressions.
Earlier action or activity recognition research focused more on RGB video captured by conventional RGB cameras [8], [9]. The limitation of RGB cameras is that conventional RGB images lack 3D action data, which is regarded as one critical clue for improving recognition performance. The advancement of sensor technology makes it possible to sense 3D action data with new depth sensors. In addition, 3D depth images provide a way to acquire the 3D skeleton of a person for better action or activity recognition. For human action recognition, RGB-Depth video and the skeleton pose sequence enrich the representation space of human motions. Figure 1 shows the RGB-Depth and skeleton samples. Depth sensor-based action recognition has been studied for years [10]-[13] and is the focus of this paper.

FIGURE 1. Sub-action sharing and inaccurate skeleton data make multimodal action recognition challenging (samples from the SBU-Kinect action dataset [7]). The first row is the depth image sequence. The second row shows its segmented sub-actions generated by the proposed energy-guided segmentation (left) or time-guided segmentation (right). The depth modality has geometry (shape) cues but is redundant and noisy. The last two rows demonstrate that skeleton data is concise and carries part of an action's semantic information, but sometimes contains incorrectly tracked skeleton poses.
Depth sensor-based human action recognition provides new opportunities but also faces some challenges. Firstly, one important observation on action datasets is that different action classes contain similar sub-movements, and therefore certain actions from different classes share similar or even the same characteristics. As shown in Figure 1, in the SBU action dataset [7], sub-actions are shared between different action categories, especially among the similar categories ''Punching'', ''Hugging'' and ''Departing''. This is ubiquitous across different action datasets and makes human action recognition confusing and challenging. In the MSR Action 3D dataset [14], the existing literature [10], [15]-[17] shows that the lowest accuracies consistently occur among three very similar actions, No.4 ''Hand catch'', No.7 ''Draw x'' and No.9 ''Draw circle''. The reason is that they share sub-actions, leading to low recognition performance.
Secondly, as shown in Figure 1 (the last row), in the SBU-Kinect interaction dataset [7], the ground-truth action category ''Hugging'' is similar to and could be confused with ''Approaching'' and ''Departing''. Action category ''Pushing'' is a composition of sub-actions from ''Kicking'', ''Approaching'' and ''Departing''. Consequently, samples from one class may be represented ''collaboratively'' with the assistance of samples from other classes, as demonstrated in [18]. Thus the feature of a testing video can be coded collaboratively on a globally shared dictionary (i.e., the entire dictionary constructed from the samples of all categories).
Thirdly, the feature space constructed from multi-modal sources is high-dimensional and not uniformly distributed. The subjects performing the actions have their own uncontrolled freedom and behavior habits. This results in larger within-class variations, making human action recognition more difficult than image classification tasks. Sparse representation (SR) has been successful for RGB-based action recognition [3], [19], [20]. The key assumptions of SR are that the features of each category of training data are sufficient to span a separable subspace, and that the training data are collected carefully so that the extracted feature space is distributed uniformly. These preconditions limit its generalization to video analytic tasks. For example, the gaming action dataset UTD-MHAD-Kinect V2 [21] is a multi-modal dataset. It is a typical small-training-sample-size dataset, likely causing unacceptable representation errors and unstable classification results when strongly supervised approaches are applied.
In this paper, to increase the capacity of perceiving information, two heterogeneous low-level features are extracted from the depth and skeleton modalities, respectively. Canonical Correlation Analysis (CCA) is then utilized for feature correlation analysis, providing compact and shared mid-level heterogeneous features. The ubiquitous sub-action sharing challenge is regarded as an opportunity and is exploited by the proposed sub-action segmentation method and the Class-privacy Preserved Collaborative Representation (CPPCR) learning method. In the CCA feature space, CPPCR integrates the low-dimensional manifold (local consistency) into the collaborative sub-action learning process (globality), obtaining the final high-level discriminative representation. CPPCR not only leverages the collaborative representation to address the sub-action sharing phenomenon among different classes globally, but also preserves the expected local geometric structures of action classes. The analytical solution of CPPCR makes the computation efficient and avoids being trapped in local optima.
The main contributions of the proposed method are summarized as follows: 1) We propose an energy-guided sub-action segmentation method, based on which the input activity is decomposed into a variable number of temporally segmented sub-activities. Accordingly, the depth features are extracted based on the new energy-guided sub-actions. In addition, guided by the proposed motion energy function generated from the depth modality, the ''Cross-modality Parameters Transferring'' transfers the sub-action segmentation parameters into the synchronous skeleton modality for heterogeneous feature representation.
2) The proposed CPPCR is demonstrated to be an effective scheme for addressing the sub-action sharing problem. CPPCR integrates local consistency into the collaborative sub-action learning process, alleviating the adverse effect caused by sub-action sharing and leading to a final high-level discriminative representation. CPPCR demonstrates not only performance improvement over the CR learning process, but also efficient computation thanks to its closed-form solution. 3) For human action recognition addressing the sub-action sharing phenomenon, the proposed framework demonstrates an effective Statistical Machine Learning (SML) based pipeline. It could easily be extended to a DNN framework or a hybrid framework combining SML and Deep Neural Networks (DNNs), if the hand-crafted feature extraction components were replaced by DNNs followed by either a DNN-based or a combined SML/DNN classification module.
In the following, Section II reviews the related works. Section III presents the proposed adaptive energy guided sub-action segmentation method, heterogeneous feature extraction and fusion. Section IV introduces the CPPCR method for addressing the sub-action sharing challenge. Experiments and analyses are conducted in Section V, and subsequently a conclusion is summarized in Section VI.

II. RELATED WORK
According to the feature extraction strategy, existing action recognition methods can be categorized into hand-crafted feature based and deep learning based methods. The relevant fusion and representation learning methods are also reviewed.
A. HAND-CRAFTED FEATURE BASED METHODS

1) DEPTH MAP FEATURES

From depth sensors, hand-crafted features such as bag of 3D points [14], the depth surface normal feature super normal vector (SNV) [22], histograms of oriented principal components (HOPC) [10], depth motion maps (DMM) [23], spatio-temporal depth cuboids [24], motion history and statistical metrics are extracted for action recognition. Histograms of oriented 4D normals (HON4D) [25] take the 3D depth data as an opportunity to construct the 4D normals of body parts and use the statistical information as the data representation. Similar to the motion history images (MHIs) and motion-energy images (MEIs) [26], which are successful for RGB-based action recognition, depth motion maps (DMM) [23] with a depth sensor [27] aim to model the human body shape and the motion's history information. Generated from depth cloud points, the histograms of oriented principal components (HOPC) [10] feature uses the eigenvectors of its support cuboid for view-invariant action recognition. The super normal vector (SNV) [22] is constructed and aggregated by grouping local hyper-surface normals into polynormals. However, depth maps have background noise, which disturbs the feature extraction process.

2) SKELETON FEATURES
As a distinct modality, skeleton data is heterogeneous with respect to depth and provides 3D human skeletal joints. Therefore skeleton features can be extracted from the 3D skeleton data, such as the skeletal quads feature [28], the eigenjoints feature [29], 3DMTM-PHOG [30], the active skeleton feature [13], the body-pose feature [7], poselet mining [31], joint trajectory maps [11], the covariance of 3D joints [32] and skeleton optical spectra [33]. The skeletal quads feature [28] was proposed for 3D action recognition. By encoding the skeleton limbs into states via a Markov random field, the active skeleton representation [13] is aggregated for characterizing human actions. Based on the differences of skeleton joints between static poses and dynamic poses over time, the eigenjoints feature [29] has been used successfully for skeleton-based action recognition. The skeletal representation on a curved manifold Lie group [34] is a method that models the 3D joint points of human bodies via 3D geometric algebra. In [33], skeleton optical spectra are proposed, in which the skeleton data are rendered into color images and CNNs are then used to learn features for action recognition. By dividing 3D points into 4D grids, Vieira et al. [35] employed occupancy patterns to describe 4D grids spatially and temporally. The actionlet ensemble model [15] extracts local features in the neighborhood of skeleton joints. For view-variant action recognition, Rahmani and Mian [36] proposed a nonlinear model transforming data from distinct views into a canonical view. However, as shown in Figure 1, inaccurate skeleton poses and the sub-action sharing phenomenon make action analysis more challenging.

B. DEEP LEARNING BASED METHODS
With the evolution of neural networks, bidirectional Recurrent Neural Networks (RNNs) [16], structured Convolutional Neural Networks (CNNs) [37], cross-modality feature analysis [12], Graph Convolutional Networks (GCNs) [38], [39], attention-based long short-term memory (LSTM) [40], [41] and spatio-temporal attention networks [42] have been proposed for action or video analysis tasks. By designing a general attention neural cell, a spatio-temporal attention network with heterogeneous data was proposed for traditional RGB action recognition. Wang et al. [43] proposed attention-based deep 3D CNN features with LSTM for action recognition. Amir et al. [44] presented spatio-temporal LSTM networks for 3D action recognition. Zhu et al. utilized LSTM with a co-occurrence scheme [45] and achieved good performance. In order to address the challenge of action representation under view variations, two view-adaptive neural networks [46] were combined for high-performance skeleton-based action recognition. The SkeletonNet method [47] transforms the features of skeleton frames into images and feeds them into the proposed deep learning framework. Similarly, in [17], the authors designed an RNN driven by privileged information (PI) for action recognition. From the depth modality, three kinds of dynamic depth image features using rank pooling are generated [5] and then fed into CNNs for action recognition.
However, deep neural network-based methods [5], [6], [17], [48], [49] for depth-based action recognition are not as popular as those for RGB-based action recognition. One reason is that deep learning based methods are data-driven, requiring big labeled data, whereas the depth datasets are of relatively small or medium size, so that data-driven models are weakened and at risk of over-fitting. Using augmented skeleton data, a multiview LSTM fusion model with an attention scheme [6] was proposed for skeletal action recognition. Liu et al. [48] constructed a 3D CNN (3DCNN) to learn depth features; a hand-crafted skeleton joint based feature is then fused with these 3DCNN-learned depth features. The DMM feature is weighted hierarchically and then fed into a three-channel deep CNN [49] on small training datasets. To overcome the drawback of the lack of color information in depth maps, the methods of [5], [17] were proposed.
Although end-to-end deep learning has many advantages, AI systems built on it often show the following serious weaknesses: inexplicability, vulnerability (poor robustness), susceptibility to deception and attack, and the need for large amounts of data. These weaknesses restrict its use to limited scenarios, such as complete information, deterministic information, static (or deterministically evolving) environments, and limited domains. Therefore, creating explainable, credible and robust theories and methods is necessary for complex applications.

C. INFORMATION FUSION AND REPRESENTATION LEARNING METHODS
Fusing information from multiple modalities is useful for performance improvement. Fusion can be performed at the data level, the feature/representation level or the score/decision level [8]. Each fusion category has its own pros and cons, and the selection of the fusion method generally depends on the types of features and data sources. Score-level fusion requires no post-processing or dimension reduction, and is independent of the types and lengths of the different multi-modal features. However, score-level fusion has the following main drawbacks: (1) independent classification decisions, one per sensing modality, need to be combined via some soft rule for the final decision; (2) for n different modalities, decision-level fusion needs to train n separate classifiers, resulting in more parameters and time consumption, especially when using deep learning classifiers; (3) decision scores are obtained from the data streams separately, so the correlation between different modalities is largely lost when fusion takes place at the decision level.
In contrast, in practice, for a multi-modal human action recognition system, concurrent data from multiple sources is a good clue for collecting a sufficient amount of information to make improved decisions. Therefore, SML-based feature-level fusion projects the features concurrently collected from multiple sensors into a new space by rigorous mathematical transformation to optimize the information representation for high-quality decision making.
Earlier works mainly focused on feature fusion for action recognition [50]-[57]. In [50], a multi-modal learning framework was proposed to fuse depth and skeleton-based features. Feature-level fusion [51] of depth features and skeleton joints based on random forests was proposed using the Winner-Take-All rule. By extending the CCA model [58], heterogeneous domain adaptation by ℓ1-regularized CCA [52] was proposed to exploit the correlation subspace for cross-view action recognition. In recent work [56], a 3D CNN fusion strategy was proposed by combining the softmax scores for action recognition with arbitrary sequence lengths. In [57], RGB and depth features are fused for RGB-D video based action recognition.
On the other hand, representation learning can provide stronger discriminative power [42], [59]-[61]. Collaborative Representation (CR) [62] has been demonstrated to be effective for face recognition and is fast since it has a closed-form solution. Kernel collaborative representation (KCR) [63] and discriminative collaborative representation (DCR) [64] were proposed, and dictionary learning and discriminant projection methods were then designed on top of them to determine appropriate features. Discriminative compact representation [18], KCR with a locality-constrained dictionary (KCRC-LCD) [60] and locality-constrained collaborative representation (LCCR) [59] were proposed to extend collaborative representation for face recognition. However, more complicated fusion methods dramatically increase the computational cost, so simpler but efficient fusion methods are preferred for action recognition. Moreover, CR-based methods have the weakness that they rely heavily on a ''good'' (properly controlled) dictionary derived from the training dataset. This shortcoming limits the application of the CR model in action recognition.

III. PROPOSED HETEROGENEOUS FEATURES AND FUSION
For depth sensor based human action datasets, skeleton data are not always accurate, as shown in the third row of Figure 1. The global depth feature DMMs [23], which are extracted from the entire video, contain more long-term temporal information of human movement but less short-term information. Speed variations directly affect the appearance of the DMM feature since it is extracted from inter-frame motion. Moreover, there are large intra-class variations since depth sequences are generated at different speeds, i.e., the depth sequences have different lengths, as demonstrated in Figure 3 (a). These observations and the ubiquitous sub-action sharing prompt us to propose an adaptive energy-guided sub-action segmentation method, which discovers sub-actions automatically in diverse action video instances.

A. ENERGY-ORIENTED SUB-ACTION SEGMENTATION
As introduced in Section I, the sub-action sharing phenomenon exists across distinct action categories and decreases recognition performance; it can be addressed by exploiting the relations of shared sub-actions. When dividing an action sequence, an intuitive segmentation strategy is to divide the sequence into segments of the same length directly along the time axis. Each segment is then a sub-action of equal length, which is called ''time-oriented'' segmentation. In contrast, in order to express dynamic information such as the speed variations of human motion over time, we propose to segment each video into temporal sub-actions according to a motion energy function, so that sub-actions of different lengths are obtained, which is called ''energy-oriented'' segmentation.
Assuming there is an action sequence with N depth maps, we first project each depth map, which carries three-dimensional information, onto three orthogonal Cartesian planes, each of which corresponds to a perspective of the 3D space. The three planes are denoted as v ∈ {front, side, top}. The difference between two consecutive projected maps on the three views is then thresholded to generate a binary map. The accumulated motion energy up to the i-th frame, E(i), is defined as:

E(i) = \sum_{v \in \{front, side, top\}} \sum_{j=1}^{i-1} \mathrm{sum}\big\{ \big| F_v^{j+1} - F_v^{j} \big| > \theta \big\},    (1)

where F_v^{j+1} is the (j+1)-th depth frame projected onto view v ∈ {front, side, top}, sum{·} returns the number of non-zero elements in a binary map, and θ is the threshold. Since the motion energy function is accumulated, E(i) is computed over the first i video frames. The motion energy of a frame reflects the current frame's relative motion status and location with respect to the entire activity. Based on this function, a video is divided adaptively into sub-segments of globally unequal length, effectively capturing the motion's temporal order.
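For concreteness, a minimal sketch of this accumulated energy computation is given below; it assumes the projected maps are available as a dictionary keyed by view, and the function name and data layout are illustrative rather than the authors' implementation.

```python
import numpy as np

def accumulated_motion_energy(proj, theta=50):
    """Accumulated motion energy E(i) of Eq. (1) for one depth sequence.

    proj: dict mapping each view in {'front', 'side', 'top'} to a list of N
    projected depth maps (hypothetical layout). Returns E of length N with
    E[i] accumulated over the first i+1 frames (0-based indexing)."""
    views = list(proj.keys())
    N = len(proj[views[0]])
    E = np.zeros(N)
    for i in range(1, N):
        step = 0
        for v in views:
            # threshold the inter-frame difference to obtain a binary motion map
            diff = np.abs(proj[v][i].astype(np.int32) - proj[v][i - 1].astype(np.int32))
            step += np.count_nonzero(diff > theta)   # sum{.} of Eq. (1)
        E[i] = E[i - 1] + step
    return E
```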
The video is segmented based on equal division of the normalized motion energy, so that each segment accounts for the same percentage of energy. As shown in Figure 2, the total motion energy E(N) of an action video with N depth frames is normalized to one. We then divide this normalized energy into a set of segments whose corresponding frame indices are used to partition the video. In the example of Figure 2, the video is segmented into P parts based on equal division of the normalized motion energy, each segment accounting for approximately 1/P of the total energy. For easy computation, P is set to be a power of 2, P = 2^{Scale-1}, where Scale is the temporal pyramid scale parameter. Thus the frame indices for sub-action segmentation are obtained once Scale and P are fixed; in Figure 2, Ti corresponds to the different frame index numbers at the segment boundaries. As a result, we have \sum_{s=1}^{Scale=3} 2^{s-1} = 1 + 2 + 4 = 7 sub-actions over three temporal scales on three views. The parameters Scale and P in Figure 2 were determined experimentally, i.e., we evaluated the recognition rates versus different values of P while fixing the others, and chose the best ones.
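A small sketch of the equal-energy splitting, using the accumulated energy E returned by the sketch above, might look as follows; the exact boundary handling of the authors' implementation may differ.

```python
import numpy as np

def energy_guided_segments(E, scale=3):
    """Split a sequence into P = 2**(s-1) sub-actions of roughly equal
    normalized motion energy for every temporal pyramid level s = 1..scale.
    Returns {s: [(start, end), ...]} with inclusive frame-index pairs."""
    E_norm = E / float(E[-1])                      # total energy normalized to one
    segments = {}
    for s in range(1, scale + 1):
        P = 2 ** (s - 1)
        # first frame whose accumulated energy reaches k/P, for k = 0..P
        cuts = [int(np.searchsorted(E_norm, k / P)) for k in range(P + 1)]
        cuts[0], cuts[-1] = 0, len(E) - 1
        segments[s] = [(cuts[k], cuts[k + 1]) for k in range(P)]
    return segments

# For scale = 3 this yields 1 + 2 + 4 = 7 sub-actions per view, as in the text.
```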

B. ENERGY-ORIENTED DEPTH FEATURE EXTRACTION
Depth maps carry additional geometric shape information in the third (depth) dimension, which can be used as an important clue to describe the shape of human motion. However, they also contain structural and background noise, as shown in Figure 3.
Suppose that start_i denotes the beginning frame index of the i-th part, with i ∈ {1 : P}. The DMM feature [23] of each perspective is extracted and the three views are concatenated for each sub-action as follows:

DMM_v^{(i)} = \sum_{f = start_i}^{start_i + N_i - 2} \big\{ \big| F_v^{f+1} - F_v^{f} \big| > \varepsilon \big\}, \quad v \in \{front, side, top\},    (2)

and the sub-action depth descriptor is the concatenation [ {DMM_{front}^{(i)}}^T, {DMM_{side}^{(i)}}^T, {DMM_{top}^{(i)}}^T ]^T of the vectorized maps of the three views, where N_i is the number of video frames of the i-th sub-action, f is the frame index, T denotes matrix transposition, and ε is the background noise threshold (ε = 50 in the experiments). The symbols F_v^f denote the projected depth maps on view v, as in Eq. (1).

The depth sensors capture the skeleton and depth data simultaneously and synchronously. Thus the motion energy function, generated from the depth modality, characterizes the execution speed of actions in the depth and skeleton data simultaneously. This motivates us to transfer the action segmentation parameters, generated from the depth modality, into the skeleton modality to divide the skeleton sequence.
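Before the pipeline moves to the skeleton modality, a minimal sketch of the per-sub-action DMM computation of Eq. (2) might look as follows, reusing the projected-map layout assumed in the energy sketch above (an illustration, not the authors' exact implementation).

```python
import numpy as np

def sub_action_dmm(view_maps, start, n_frames, eps=50):
    """Depth motion map of one sub-action on one projected view (Eq. (2)):
    accumulate thresholded absolute inter-frame differences over the N_i
    frames starting at `start`."""
    dmm = np.zeros(view_maps[start].shape, dtype=np.float64)
    for f in range(start, start + n_frames - 1):
        diff = np.abs(view_maps[f + 1].astype(np.int32) - view_maps[f].astype(np.int32))
        dmm += (diff > eps)                 # eps suppresses background noise
    return dmm

def sub_action_depth_feature(proj, start, n_frames):
    # concatenate the vectorized DMMs of the three views into one descriptor
    return np.concatenate([sub_action_dmm(proj[v], start, n_frames).ravel()
                           for v in ('front', 'side', 'top')])
```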
In the skeleton modality, based on the sub-sequences, dynamic skeleton (DS) features [15] are extracted by computing the relative positions between each pair of joint trajectories. Fourier features and their gradient information (i.e., DSG) are then further extracted and concatenated as introduced in [15]. In our practice, the gradient information depicts the velocity change of the action's motion.
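As a rough illustration only, a pairwise relative-position descriptor with gradient and low-frequency Fourier terms could be sketched as below; the exact joint pairs, normalization and temporal pyramid of the original DS/DSG feature in [15] may differ from this simplified version.

```python
import numpy as np

def dynamic_skeleton_sketch(joints, n_freq=8):
    """Simplified DS/DSG-style descriptor: pairwise relative joint positions
    per frame, their temporal gradients (velocity cue), and low-frequency
    Fourier magnitudes of both. `joints` is a (T, J, 3) array."""
    T, J, _ = joints.shape
    pairs = [(a, b) for a in range(J) for b in range(a + 1, J)]
    rel = np.stack([joints[:, a] - joints[:, b] for a, b in pairs], axis=1)  # (T, P, 3)
    vel = np.gradient(rel, axis=0)               # temporal gradient of relative positions
    feats = []
    for signal in (rel, vel):
        coeff = np.fft.rfft(signal, axis=0)[:n_freq]   # keep low temporal frequencies
        feats.append(np.abs(coeff).ravel())
    return np.concatenate(feats)
```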

D. HETEROGENEOUS INFORMATION FUSION
Here, the goal of heterogeneous information fusion is to analyze and exploit the relations between heterogeneous feature sets. Thus the obtained representation is more discriminative than any of the input ones. The features extracted above always have high dimensionality with a certain redundancy. Therefore, these redundant features should be fused, compressed or refined. With the advantages of analytic capabilities of machine learning, features of high dimensionality can be analyzed efficiently to extract meaningful information, forming compressed and discriminative representations. Feature level fusion is perceived as simpler, more effective and meaningful than the other levels of fusion [65]. In this paper, the Canonical Correlation Analysis (CCA) is adopted to reduce the dimension of high dimensional features. CCA incorporates the vector associations into the correlation analysis of the feature sets, maximizing the correlation across the two feature sets.
Given two feature matrices X ∈ R^{p×n} and Y ∈ R^{q×n}, generated from n training samples, the feature vectors come from the depth (p dimensions) and skeleton (q dimensions) modalities respectively. The within-set covariance matrices of X and Y are S_xx ∈ R^{p×p} and S_yy ∈ R^{q×q} respectively, and the between-set covariance matrix of X and Y is S_xy ∈ R^{p×q} with S_xy = S_yx^T. The aim of CCA is to maximize the pair-wise correlations between X and Y, resulting in the transformation matrices W_x and W_y. With these transformation matrices, the linear combinations X^* = W_x^T X and Y^* = W_y^T Y are obtained by solving the eigenvalue equations:

S_{xx}^{-1} S_{xy} S_{yy}^{-1} S_{yx} \hat{W}_x = R^2 \hat{W}_x, \qquad S_{yy}^{-1} S_{yx} S_{xx}^{-1} S_{xy} \hat{W}_y = R^2 \hat{W}_y,    (3)

where \hat{W}_x and \hat{W}_y are the eigenvectors and R^2 is the diagonal matrix of eigenvalues, i.e., the squares of the canonical correlations. Two typical feature fusion methods using CCA are serial feature fusion (CCA-serial) and parallel feature fusion (CCA-parallel):

Z_1 = \begin{pmatrix} X^* \\ Y^* \end{pmatrix} = \begin{pmatrix} W_x^T X \\ W_y^T Y \end{pmatrix},    (4)

and

Z_2 = X^* + Y^* = W_x^T X + W_y^T Y,    (5)

where Z_1 and Z_2 are called the canonical correlation discriminant features. X^* and Y^* ∈ R^{d×n} are known as canonical variates and have two useful properties: they have non-zero correlation only on their corresponding indices, and they are uncorrelated within each feature set.
For instance, in the SBU interaction dataset, there are n = 196 training samples; the dimension of the depth feature is p = 47376 and the dimension of the skeleton feature is q = 54810. Using the feature-level fusion method, the dimension of the computed CCA feature space is d = 195, so that the final dimension of the fused feature is either d = 195 by CCA-parallel summation or 2d = 390 by CCA-serial concatenation.
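For illustration, a small fusion sketch using scikit-learn's iterative CCA as a stand-in for the eigenvalue solution of Eq. (3) might read as follows; the serial and parallel modes follow Eqs. (4) and (5), and the matrices are in (samples × features) layout rather than the (features × samples) convention above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse(X, Y, d, mode='serial'):
    """Fuse depth features X (n, p) and skeleton features Y (n, q) in a
    d-dimensional CCA subspace (a sketch; the paper solves the CCA
    eigenvalue problem of Eq. (3) directly)."""
    cca = CCA(n_components=d)
    Xs, Ys = cca.fit_transform(X, Y)        # canonical variates X*, Y*
    if mode == 'serial':                     # Eq. (4): concatenation, dimension 2d
        return np.hstack([Xs, Ys])
    return Xs + Ys                           # Eq. (5): summation, dimension d

# e.g. on SBU: Z = cca_fuse(depth_feats, skel_feats, d=195, mode='serial')  # shape (196, 390)
```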

IV. THE PROPOSED CLASS-PRIVACY PRESERVED COLLABORATIVE REPRESENTATION (CPPCR) ACTION RECOGNITION METHOD
Sub-action sharing among action categories is a challenge for action recognition. The global scheme in CR-based learning is designed to leverage the shared feature subspace. However, the locality of the action category is a strong clue for recognizing actions. We propose to preserve the class-privacy property and to select linear combinations of nearby characteristics/sub-actions, favoring class-preserved locality (which preserves some expected local geometric structures), even though the testing sample could also be described by a few far-away characteristics/sub-actions of other classes. We call this Class-privacy Preserved Collaborative Representation.

A. PRELIMINARY: CR-BASED LEARNING AND CLASSIFICATION
In popular low-dimensional manifold models [19], [53], for each feature space, one feature vector is represented by a linear combination of a few representative points. However, in these models the characteristics of the testing data can only be represented by the learned characteristics of one class from the training samples. As introduced in Section I and shown in Figure 4 (b) and (c), sub-action sharing is ubiquitous throughout the different interaction categories. The sub-actions of hugging are shared between ''Hugging'', ''Approaching'' and ''Departing'', which are different interaction categories.
This paper regards the negative sub-action sharing challenge as a positive chance and transforms the challenge into an opportunity in the proposed CPPCR method. As introduced in Section I, following Immanuel Kant's statement, the sub-action sharing challenge is also a chance, since sub-actions from other classes are helpful for representing the testing sample. If the extracted features of all the other sub-actions are used as possible training features (samples) for representing each sub-action, we can not only significantly improve the ability of learning features (knowledge representation), but also mitigate the sub-action sharing challenge.
Let D = [D_1, D_2, \ldots, D_i, \ldots, D_C] be the dictionary, which covers C human action categories. Each sub-dictionary D_i is associated with the i-th action category, and each column of D_i represents the fused feature of a training sample from class i. The fused features are obtained via CCA from the heterogeneous data. Let y be the fused feature of the testing sample.
Firstly, in the CR learning method [62], the columns of D are normalized to have unit ℓ2-norm. Then y is represented on D collaboratively and globally using the ℓ2-minimization Lagrangian formulation:

\hat{\alpha} = \arg\min_{\alpha} \; \| y - D\alpha \|_2^2 + \lambda \| \alpha \|_2^2,    (6)

where λ is a Lagrangian scalar parameter that balances the representation residual and the regularization term. Secondly, by computing the representation residuals e_i = \| y - D_i \hat{\alpha}_i \|_2 / \| \hat{\alpha}_i \|_2, where \hat{\alpha}_i is the sub-vector of \hat{\alpha} associated with class i, the class label is determined via class(y) = \arg\min_i \{ e_i \}.
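A minimal sketch of this CR coding and residual-based classification, assuming the dictionary columns are the fused training features and `labels` holds the per-column class labels, could be:

```python
import numpy as np

def crc_classify(D, labels, y, lam=0.01):
    """Collaborative representation classification (Eq. (6)): ridge-style
    coding of the query over the whole dictionary, then class-wise
    regularized residuals. D is (dim, n_samples), labels is (n_samples,)."""
    D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)    # unit-norm columns
    P = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T)  # (D^T D + lam*I)^-1 D^T
    alpha = P @ y
    best, best_e = None, np.inf
    for c in np.unique(labels):
        idx = (labels == c)
        a_c = alpha[idx]
        e_c = np.linalg.norm(y - D[:, idx] @ a_c) / (np.linalg.norm(a_c) + 1e-12)
        if e_c < best_e:
            best, best_e = c, e_c
    return best
```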

B. CPPCR FOR ACTION RECOGNITION
CR [62] favors the global relationship and encodes the testing sample as a linear combination of the sub-actions of training samples from all categories. However, the locality of the action category is a strong clue for recognizing actions. In other words, we tend to select a linear combination of neighboring characteristics/sub-actions rather than rely on the global relationship alone. Class-preserved local geometric structure is favored in CPPCR, even though the testing sample could be described by a few far-away characteristics/sub-actions of other classes. By integrating this class-preserving locality into CR [62], we improve the discrimination ability of the feature representation.
In this section, by adding a class-privacy preserved locality constraint C_pp = \frac{1}{K} \sum_{y_i \in N_K(y)} \| y_i - D\alpha \|_2^2 to Eq. (6), the objective function becomes:

\hat{\alpha} = \arg\min_{\alpha} \; (1-\gamma) \| y - D\alpha \|_2^2 + \gamma \, C_{pp} + \lambda \| \alpha \|_2^2,    (7)

where N_K(y) is the set of the top-K nearest training samples of the testing sample y, and λ and γ are regularization parameters that balance the reconstruction residual term, the sparsity term and the locality constraint term. The neighborhood local region is determined by a distance metric, such as the Spearman, Cityblock or Seuclidean distance, via KNN searching. Under this locality constraint, the closer the distance is, the greater the contribution will be. By computing the derivative of Eq. (7) with respect to α and setting it to zero, we obtain the closed-form solution:

\hat{\alpha} = (D^T D + \lambda I)^{-1} D^T \Big[ (1-\gamma) y + \frac{\gamma}{K} \sum_{y_i \in N_K(y)} y_i \Big].    (8)

In the training phase, the term (D^T D + \lambda I)^{-1} D^T can be precomputed. This precomputed term is independent of the testing sample, depending only on the training data.
In the recognition phase, given a testing sample y, the two low-level heterogeneous features are extracted and then fused by CCA to obtain the mid-level feature. The neighborhood region N_K(y) is determined by KNN searching over the training set, then the blended target (1-\gamma) y + \frac{\gamma}{K} \sum_{y_N \in N_K(y)} y_N is calculated and multiplied by the pre-calculated projection matrix to obtain \hat{\alpha} via Eq. (8). Finally the regularized representation residuals are computed by

e_i(y) = \| y - D \hat{\alpha}_i \|_2 / \| \hat{\alpha}_i \|_2,    (9)

and the action category label is predicted via

class(y) = \arg\min_i \{ e_i(y) \}.    (10)

The proposed CPPCR algorithm for action recognition is summarized in Algorithm 1.
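Putting these steps together, a hedged sketch of the CPPCR recognition phase could look as follows, with Euclidean KNN standing in for the Spearman metric reported to work best in the experiments:

```python
import numpy as np

def cppcr_classify(D, labels, y, lam=0.01, gamma=0.2, K=3):
    """CPPCR recognition phase (Eqs. (7)-(10)): blend the query with its K
    nearest training columns, code it with the precomputed CR projection,
    then classify by regularized class-wise residuals."""
    D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)
    P = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T)   # precomputable offline

    # neighborhood N_K(y): K nearest training columns of the query
    nn_idx = np.argsort(np.linalg.norm(D - y[:, None], axis=0))[:K]
    target = (1 - gamma) * y + (gamma / K) * D[:, nn_idx].sum(axis=1)   # rhs of Eq. (8)

    alpha = P @ target
    best, best_e = None, np.inf
    for c in np.unique(labels):
        idx = (labels == c)
        a_c = alpha[idx]
        e_c = np.linalg.norm(y - D[:, idx] @ a_c) / (np.linalg.norm(a_c) + 1e-12)  # Eq. (9)
        if e_c < best_e:
            best, best_e = c, e_c                                       # Eq. (10)
    return best
```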

V. EXPERIMENTS AND PERFORMANCE EVALUATION
To evaluate the proposed method for multi-modal action recognition, we first conduct an ablation study. Then extensive experiments on four public datasets are conducted, including one interaction dataset [7] which has two-person interactions, and three action datasets [14], [21], [66] which include single person actions.

A. ABLATION EVALUATION
Firstly, we evaluated the energy-guided sub-action segmentation for feature extraction and the fusion strategies. The ablation evaluations are conducted on the SBU-Kinect interaction dataset [7]. As shown in Table 1, the performance of the proposed energy-guided sub-action segmentation method is superior to that of the time-guided method. Secondly, for the individual heterogeneous features, the depth feature performs better than the skeleton feature, since the skeleton data sometimes contains incorrectly tracked poses. Among single features, the energy-guided depth feature has the highest recognition accuracy of 87.69%. For the feature fusion strategies, the CCA-serial fusion strategy performs better than the CCA-parallel one. As a result, the CCA-serial fusion of the two heterogeneous energy-guided features has the highest recognition accuracy of 95.39%. This demonstrates that an appropriate feature fusion method effectively preserves the complementary information of heterogeneous data.

Algorithm 1 CPPCR Algorithm for Action Recognition
1: Training phase and inputs:
(1) A heterogeneous feature matrix (dictionary) D = [D_1, D_2, ..., D_i, ..., D_C] constructed by Section III, containing the extracted heterogeneous (depth and skeleton) features of all training samples; D_i is the heterogeneous feature set of the training video samples from class i.
(2) The heterogeneous feature vector of the testing sample y.
(3) The regularization parameters λ and γ.
(4) The parameter K, which is the size of the top-K local neighborhood region.
2: Recognition phase:
(1) Use KNN searching, over the training feature set D, to determine the neighborhood region N_K(y).
(2) Compute the representation coefficients via Eq. (8) and the residuals via Eq. (9).
(3) Infer the label via Eq. (10).

B. PARAMETERS EVALUATION
The key parameters of CPPCR are investigated by analyzing the recognition rates iteratively. Table 2 reports the recognition accuracies versus different values of the parameter K. From the evaluations, it is observed that recognition performance is best when K = 3 and the distance metric is the Spearman metric. Based on these evaluations, the parameters K, λ and γ are set to K = 3, λ = 0.01 and γ = 0.2 respectively in all the following experiments. For the DMM feature, the sub-action scale is empirically set to Scale = 3, so that each sample has 7 sub-actions over three views.

C. SBU INTERACTION DATASET AND PERFORMANCE EVALUATION
The SBU Interaction dataset [7] is a multi-modal dataset with 8 interaction activities. It was collected from 7 participants, providing synchronized RGB video, depth map and skeleton pose modalities. It consists of 230 video sequence samples from 8 interaction categories. In most scenarios, interactions are performed when one person is acting and the other person is reacting.
The challenges of this dataset include: (1) the motion categories are interactions between two persons; (2) in most interactions, one person is acting while the other is responding, and most interactions are associated with security surveillance, healthcare monitoring and smart building/home applications; (3) as illustrated in Figure 1, most of these action categories are social behaviors; (4) most categories are non-periodic human-to-human interactions, containing shared sub-actions and comparable physical movements. The same settings as in [7] are followed, where a standard 5-fold cross-validation scheme is employed.
The result of the proposed method is 95.39%, as shown in Table 3. It is observed that the performance of the proposed method is better than body pose feature with libSVM method [7], privileged information-based RNNs method [17], representation learning of temporal dynamics by RNNs [16], deep structured model [37] and co-occurrence LSTM model [45]. This indicates that the proposed method is effective for person-to-person interaction recognition.
The confusion matrix shows the recognition accuracy of each class and the confusion percentages between distinct categories, which is useful for analyzing the detailed recognition results. The confusion matrix on the SBU interaction dataset is illustrated in Figure 5. We can see that most interaction categories are recognized correctly. Careful observation shows that the confusions occur mainly among three similar interactions which share a lot of sub-actions.

D. MSR ACTION 3D DATASET AND PERFORMANCE EVALUATION
The famous MSR Action 3D dataset [14] has 20 categories and is collected by 10 subjects. Cross-subject experiment settings were adopted as in [14], where the data of 5 subjects are used for model training and the remaining data are used for recognition.
The proposed method achieves a recognition accuracy of 93.82%, as shown in Table 4, better than methods [10], [23], [25] and most of the existing methods compared. It is observed that the proposed method is comparable to the statistical method [22], the heterogeneous feature fusion method [54] and the deep learning methods [16], [47], [67]. In addition, the results show that sub-action segmentation by cross-modality parameter transferring is effective. For the feature fusion strategies, CCA-serial fusion contributes more than the CCA-parallel strategy. This indicates that the proposed method effectively preserves the spatio-temporal information of the two heterogeneous features, outputting a high-level discriminative action representation.

E. UTD-MHAD DATASET AND PERFORMANCE EVALUATION
The multi-modal dataset UTD-MHAD [66] is collected by a depth sensor and a wearable inertial sensor. It consists of 27 action categories and 4 modalities, and each modality has 861 samples. These 3444 sequences include RGB video, depth map data, skeleton data and accelerometer data. We followed the experimental settings of [66], where the cross-subject protocol was employed: half of the subjects were used for training and the other half for testing.
In Table 5, performance comparisons are conducted to examine the benefits of fusing the two heterogeneous features with the proposed CPPCR. It is clear that the proposed heterogeneous feature fusion with CPPCR improves the recognition accuracy compared to the existing methods. The proposed method achieves recognition accuracies of 87.0%, 90.7%, 91.2% and 94.2% for the schemes Time-guided + Fusion via CCA-parallel, Time-guided + Fusion via CCA-serial, Energy-guided + Fusion via CCA-parallel and Energy-guided + Fusion via CCA-serial, respectively. It is noted that both the depth and inertial sensor data are adopted in method [66].
In addition, from Table 5 we observe that the proposed method achieves comparable performance with method [68], in which learned features from three modalities are fused at the feature level. Method [70] leads to the state-of-the-art performance since it fuses information collected from four types of sensors, i.e., an RGB camera, a depth sensor, and two wearable inertial sensors (accelerometer and gyroscope), for action recognition. In contrast, the proposed method fuses only two data modalities, depth and skeleton. The utilization of rich information across four data modalities is likely the reason for the superior performance of [70]. Note that method [70] was only evaluated on the multi-modal dataset UTD-MHAD, whereas the proposed method is evaluated on four public domain datasets.

F. UTD-MHAD-KINECT V2 DATASET AND PERFORMANCE EVALUATION
UTD-MHAD-Kinect V2 [21] is a multi-modal dataset, which contains heterogeneous data from a depth sensor and an inertial sensor. It consists of 1200 sequences from three modalities. It has 10 action categories performed by 3 female and 3 male subjects. The same experimental setting as [21] is adopted in this paper.
Firstly, the recognition accuracies of the single heterogeneous features and the fusion methods are investigated. The performances of the four features, time-guided skeleton, time-guided depth, energy-guided skeleton and energy-guided depth, are first derived. From Table 6, it can be seen that the proposed energy-oriented method improves the performance. For single features, the energy-oriented depth feature has a higher recognition accuracy than the time-oriented one, and the performance of the proposed depth feature is better than that of the skeleton feature.
The results demonstrate that the proposed CPPCR achieves 91.5%, which is higher than Multimodal Hybrid Centroid CCA, Multimodal Centroid CCA and MCCA by 1.5%, 3.5% and 9%, respectively. For the feature fusion strategies, CCA-serial fusion contributes more than the CCA-parallel strategy. CCA-serial fusion of the two heterogeneous energy-oriented features has the highest recognition accuracy of 90.0%, further improved to 91.5% by the proposed representation learning method. This indicates that the proposed CPPCR preserves the spatio-temporal information of actions, outputting a high-level discriminative action representation.

G. QUALITATIVE ANALYSIS OF THE PROPOSED METHOD
Here, we conduct a qualitative analysis of the proposed method, including: (1) whether the two heterogeneous features have complementary characteristics; (2) the relationship between feature dimensions and recognition accuracies after the representation learning method CPPCR. The experiments are conducted on the SBU-Kinect interaction dataset.
We evaluated the recognition accuracies versus feature dimensions using the CCA-serial fusion method. The dimensions of the proposed two heterogeneous features are reduced to 195 from 47376 and 54810, respectively. As shown in Figure 6, on the SBU-Kinect interaction dataset, the recognition accuracies increase drastically when the feature dimensions exceed 195, indicating that in the new CPPCR feature space the proposed two heterogeneous features are highly complementary. Furthermore, the accuracies are stable when the dimensions are higher than 230. The highest accuracy of 95.39% is reached at dimensions 298 and 309-312. It should be noted that 195 is precisely the feature dimension contributed by the first data modality, the skeletal sequence data. This shows that the features extracted from the two modalities are highly complementary in the new CPPCR feature space. They are compact and high-level features for action representation, and fusing them contributes to the performance improvement.
There are several main possible reasons for this performance: 1) The scale of the datasets. The proposed method achieves superior performance compared to most state-of-the-art methods on small or medium scale datasets. Deep neural networks are data-driven methods, requiring a large amount of training data, whereas statistical machine learning can be applied to databases of different scales. 2) The ubiquitous sub-action sharing challenge. As demonstrated in Figure 4 (b) and (c), sub-action sharing is ubiquitous throughout the different interaction categories; for example, the sub-actions of hugging are shared between ''Hugging'' and ''Pushing'', which are different interaction categories. This paper regards the negative sub-action sharing challenge as a positive chance and transforms the risk into an opportunity in the proposed CPPCR method. As introduced in Section I, following Immanuel Kant's statement, the sub-action sharing challenge is also a chance, since sub-actions from other classes are helpful for representing the testing sample. If the features extracted from all the other sub-actions are used as possible training features (samples) for each distinct sub-action, we can not only significantly improve the ability to learn features (knowledge representation), but also mitigate the sub-action sharing challenge. 3) Non-accurate skeleton data. The methods [6], [13], [16], [17], [37], [44], [45], [47], [67] use skeleton features. However, the estimated skeleton joints are sometimes inaccurate because of body part occlusions and missing fragments, as illustrated in Figure 1 (a).
The sub-action sharing challenge and the experimental results demonstrate that employing the proposed heterogeneous feature fusion method with CPPCR learning is a novel and effective choice.

VI. CONCLUSION
There are many challenges in human action recognition based on RGB-Depth sensors, among which the ubiquitous sub-action sharing phenomenon (especially among similar categories) is a critical one. To this end, sub-action segmentation based on equal motion energy division and Class-privacy Preserved Collaborative Representation (CPPCR) learning are proposed to jointly explore and address the long-range temporal dynamic structure involved in the actions/interactions. The action motion energy is computed in the depth modality, and the action videos are segmented into sub-actions based on equal motion energy division. The action segmentation parameters are then transferred from the depth modality to the temporally synchronous skeleton modality, and two heterogeneous features are extracted respectively and fused. In addition, the proposed CPPCR takes the negative sub-action sharing challenge as a positive opportunity, addressing the sub-action sharing challenge.
The experimental results on four datasets consistently demonstrate the effectiveness of the proposed method. Qualitative analysis of the two features, as shown in Figures 6 and 7, illustrates that in the learned CPPCR feature space the depth and skeleton features are complementary with each other, and fusing them leads to better performance than using either of the two individually.