Can Skeletal Joint Positional Ordering Influence Action Recognition on Spectrally Graded CNNs: A Perspective on Achieving Joint Order Independent Learning

3D skeleton based action recognition is commonly performed with features extracted from joint positional sequences and modeled on deep learning frameworks. However, the spatial ordering of skeletal joints is kept fixed during the entire action recognition lifecycle across datasets and frameworks. This motivated us to investigate experimentally how features computed from multiple randomly ordered skeletal joints influence the performance of deep learning systems. The question is therefore: is joint order independent learning practicable for skeletal action recognition? If it is, the goal is to discover how many different randomly ordered joint feature representations are sufficient for training deep networks. We further investigate which features and deep networks record the highest performance on jumbled joints. This work proposes learning skeletal joint volumetric features on a spectrally graded CNN to achieve joint order independence. We propose 4-joint features called quad joint volumetric features (QJVF), which are found to offer better spatio temporal relationships between time series joint data than existing features. We also propose a Spectrally graded Convolutional Neural Network (SgCNN) to characterize the spatially divergent features extracted from jumbled skeletal joints. Finally, the proposed hypothesis is evaluated on our 3D skeletal action datasets KLHA3D102 and KLYOGA3D, along with the benchmarks HDM05, CMU and NTU RGB D. The results demonstrate that joint order independent feature learning is achievable on CNNs trained on quantified spatio temporal feature maps extracted from randomly shuffled skeletal joints of action sequences.


I. INTRODUCTION
Skeletal based action recognition is being practiced through deep learning on features extracted from 3D joint sequences. These sequences represent joint positions across a 3D action video. However, the quality of these sequences depends entirely on the capturing technology. The two most widely used systems for recording 3D human action skeletons are Microsoft Kinect and motion capture. Kinect is commercially affordable and moderately reliable in capturing the human skeleton as a set of joints. Motion capture technology, on the other hand, is costly but capable of generating highly accurate representations of skeletal joints in 3D space.
VOLUME 4, 2016
The objective of skeletal human action recognition algorithms is to learn these 3D joint sequences and identify unique patterns for classification. Initially, joint positions were used directly for training the classifiers [1], [2]. One such classifier was the graph matching (GM) algorithm [3], [4]. In GM, a graph is constructed using the joint positions as nodes and the inter joint relationships as edges. Each skeletal action video frame is represented as a graph during training. Testing GM involves computationally intensive frame by frame matching, either through a learning algorithm or a matching measurement model. Similarly, decision trees [5], [6] also produced good action estimates from raw positional joint data for human action recognition on both Kinect and mocap captures.
In the continuing pursuit of better recognition accuracies, researchers saw an opportunity to develop sequence models for characterizing time series 3D joint data. Sequence modeling designs were applied specifically to learn the 3D joint time series variations in actions for recognition. Recurrent Neural Networks (RNNs) [7] and their upgrades such as Gated Recurrent Units (GRU) [8] and Long Short Term Memory (LSTM) [9] have shown strong learning capabilities on sequence data. However, these networks are very deep and often need intensive computing power for execution. Hence, a successful alternative is to describe the time series joint data as a spatio temporal feature illustration [10]. In the last four years, a dozen varieties of spatio temporal features have been reported on skeletal joint 3D data. These are popularly called joint feature maps, and they characterize a human action sequence as an image. Spatial patterns are then learned from these action feature maps using convolutional neural networks (CNNs) for action recognition [11].
Surprisingly, most of the benchmark works in the area of action recognition have selected different joint orderings on the skeleton during classifier development. Interestingly, these joint orders are a key factor in determining the recognition accuracies on various datasets. For example, HDM05 [12] and CMU [13], two of the most prominent action datasets, have different joint orderings. Similarly, NTU RGB D [14] and MSRAction3D [15] differ in joint ordering. Our datasets KLHA3D102 [16], KLYOGA3D [17] and KLSLR3D [18], which are recorded using 3D motion capture (mocap) technology, also show different joint orderings. This observation has profoundly influenced our research in this work. So far, only a few researchers have pointed towards the impact of jumbled joints on the performance of skeletal action recognition methods [19], [20]. Fig. 1 shows the joint ordering across action datasets.
The idea behind training deep learning networks on randomly ordered skeletal joints is to learn different possible random feature representations for a robust action recognition framework. The point that skeletal action recognition depends on the joint order combination is valid. If this order is altered by the system or forgotten by the user during data preparation, the pre-trained models are destined to give ambiguous results. In general, researchers have experienced this problem when capturing 3D data with a motion capture system. During capture, technicians from multiple departments used the system for applications such as sign language, yoga poses, human action and medical biomechanics, and assigned different joint orders according to their needs. Because these large datasets were built over a period of time by different researchers, the skeletal datasets ended up with different joint orderings. When we tested these datasets with our deep learning models, they yielded highly dissimilar features for samples with the same class label. It took a while to understand the problem. Similarly, models trained on 3D motion capture skeletal data failed again when tested on Kinect data with a similar number of joints. There are two ways to convert these data preprocessing anomalies into refined information. One is to rerecord or reconstruct the data from scratch, and the other is to use the anomaly as an opportunity to solve the problem through automation. This paper describes the research and experiments performed to develop a joint order independent framework for action recognition.
A question that naturally arises from the above discussion is: can we design a deep network that will detect patterns in jumbled joint features? To our knowledge, this is the first work to explore the possibility of joint order independent feature learning through deep networks. Besides deep networks, features play a crucial part in overall training and in improving the performance of skeletal action recognition tasks. Over the years, a variety of features [21] were computed from raw skeletal joint positions for 3D action representation. Some of these features are joint distances, angles, lines, planes, angular displacements and quadrilaterals.
The irony is that we actually have to maintain a constant skeletal joint order during the entire experimentation. Fig. 1(a-d) shows the joint ordering used in publicly available benchmark 3D skeletal action datasets. Fig. 1(e-f) shows the ordering in our own yoga and action datasets. An inspection of the action skeletons in Fig. 1(a-f) reveals that the joint orders are indeed different across datasets. This presents a bottleneck when comparing a proposed deep network on these multiple action datasets. The most common form of feature representation uses joint positions, which are intuitively defined with respect to the skeletal joint ordering. Hence, a change in joint ordering or sequencing during the recognition process affects the model accuracies, as shown in Fig. 2.
As an example, we extracted joint distance features from our previous work [22] and color coded them into RGB images using JET color coding. The computed recognition accuracies are reported under the headers same joint order testing and random joint order testing for all skeletal action datasets with a train test ratio of 15:4. In summary, the recognition accuracies were found to be below normal in all the cases when the joint ordering differed between training and testing.
In this paper, we propose a universal joint order independent learning network called Spectrally Graded CNN (SgCNN). Additionally, we extend the present feature maps into a more efficient and reliable skeletal representation. The proposed maps are called quad joint volumetric features (QJVF). The objectives of this work are:
1) To design QJV features along with a novel deep CNN model for joint order independent feature learning.
2) To identify the number of randomly ordered joint feature maps required for training the designed SgCNN so that it results in self-sufficient 3D skeletal action recognition systems.
3) To determine the desirable joint feature maps that can achieve joint order independence on deep learning frameworks for 3D skeletal action recognition tasks.
The results of this study are important for attaining an explicit understanding of joint ordering in 3D skeletal based action recognition on deep learning networks. The following outcomes can be expected from our experimental study on random joint order selection for skeletal action recognition:
1) A 4-joint feature map, QJVF, that has shown the capability to represent joint relationships in 3D skeletal actions better than existing features.
2) A refined learning network (SgCNN) whose multi dimensional filter rotations generalize the input by preventing feature loss in the dense layers.
3) A potentially optimal feature subset that can achieve joint order independent learning on deep networks.
The rest of the manuscript is organized as follows. The following section describes various features and methods that were developed previously for skeletal based action recognition. The third section illustrates the methods developed in this work. The penultimate section presents results, discussion and analysis of the experiments conducted to achieve the formulated objectives. Finally, section V concludes the proposed problem with the obtained results.

II. BACKGROUND
Human skeletal data is more robust than other modalities such as RGB video and depth. The robustness of 3D skeletal action data comes from its independence of video backgrounds and human subject inconsistencies. These characteristics have made the 3D skeletal representation of human actions and activities the preferred input modality for classification problems. This trend is fuelled by the availability of inexpensive hardware sensors such as Microsoft Kinect and the Intel RealSense 3D capture system [23], [24]. A more expensive and accurate capture technology, on the other hand, is a multi camera 3D mocap system [25]. 3D human action data from these systems has revolutionized human action recognition in the last five years. Although a multitude of action recognition algorithms were proposed on these datasets [13], [26], [27], we review only works focused on deep learning frameworks.
The most effective recipe for 3D skeletal based action recognition is a combination of joint action data and deep learning. Compared to other action data modalities, skeletal data is spatially relational, temporally compatible and also forms spatio temporal structures. Machine learning algorithms have been shown to learn one or more of these characteristics for automated skeletal action recognition [28]. The first machine learning models focused on learning temporal patterns by extracting joint variations across frames [29], and further evolved by characterizing them as time series representations [10]. However, these models learned temporal variations specific to a dataset and could not transfer the gained knowledge when tested on a different dataset. The performance of recurrent ML models was found to drop across datasets. Hence, to develop actionable intelligence across datasets, deep learning architectures were applied to skeletal action data. Earlier deep learning models in vision computing applications [30] have shown impressive performance in decoding spatial and spatio temporal patterns.
Deep learning methods such as Recurrent Neural Networks (RNNs) [31], Long Short Term Memory (LSTM) [32], Convolutional Neural Networks (CNN) [33], Recurrent CNN (RCNN) [34] and lately Graph Convolutional Networks (GCN) [35] have shown monumental growth in human action recognition with skeletal datasets [12]-[15]. Primarily, the naturally occurring skeletal joint temporal cues in human actions are exceptionally well characterized by RNNs [31]. The structure of RNNs allows them to identify joint patterns by generating relationships between the previous and present joint variations across action sequences. Despite successful performance on skeletal action datasets, RNNs showed limitations in processing long sequences due to the vanishing gradients problem [31]. This drawback was overcome by introducing memory cells into the RNN architecture to create upgrades such as Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU).
LSTMs were most widely applied to skeletal action recognition tasks in unidirectional [32] and bidirectional modes [36]. The bidirectional LSTM has been shown to record higher recognition accuracies than the other LSTM models [37]. However, LSTMs are computationally intensive, and the gradient decay caused by the tanh function can become too dominant to ignore. The solution to these problems came in the form of an improved architecture called independent recurrent neural networks. These models allowed longer and deeper architectures without the vanishing gradients problem [36]. However, recurrent models were found to be indecisive on the spatial features that define the joint relationships within a skeletal action frame. Hence, spatio temporal combination networks were proposed with CNNs followed by LSTMs [38], [39], [40] for action recognition. The CNNs learn spatial joint features, and the flattened features in the dense layers of the CNN are input to LSTMs to determine temporal patterns in the extracted spatial content. Despite higher recognition accuracies, the CNN LSTM models are not end-to-end trainable in most of the action recognition frameworks proposed in the literature [38].
To overcome these network limitations for action recognition, rich spatio temporal feature representations in the form of RGB color images were proposed. These RGB color maps characterize a particular skeletal action across a set of 3D video frames. Consequently, the proposed spatio temporal images are found to be independent of the length of the video sequences as well as the number of joints. These spatio temporal features represent spatial relationships among joints within a 3D action frame, and temporal changes between frames appear as we move horizontally across the map. The proposed spatio temporal features are joint positional maps (JPM) [41], joint distance maps (JDM) [11], joint angular maps (JAM) [42], joint angular displacement maps (JADM) [17], joint velocity maps (JVM) [43], joint acceleration maps (JaM) [44], joint planar maps (JpM) [45], joint trajectory maps (JTM) [46] and quad joint volume maps (QJVM). The above spatio temporal feature maps are embedded with patterns that can be quantified using a deep CNN of any architecture. It has been shown that deep CNNs enhanced the performance of skeletal action recognition systems on Kinect and mocap datasets.
The above analysis shows that spatio temporal feature maps can be learned exceptionally well by deep networks. But what if the joint order on the skeleton changes during experimentation across datasets and machines? Following this discussion, we propose to investigate whether skeletal joint positional ordering influences action recognition on CNNs, with the aim of achieving joint order independent learning. Fig. 1 and Fig. 2 show why joint order independence is necessary when multiple datasets are used for testing a proposed skeletal based action recognition system. We also found evidence that independent researchers used different joint orderings on the same datasets [12]-[15]. Surprisingly, we found that joint orders play a crucial role in evaluating classifier models for skeletal action recognition. Hence, the outcome of this paper, which answers the question of whether joint order independence can be achieved on deep networks, is threefold. The first is the design of spatio temporal feature maps that represent jumbled skeletal joints and can achieve order independence. The second is the number of randomly ordered training feature maps necessary to develop a reasonably accurate classifier. The third is a deep learning architecture that gives the highest recognition accuracy on jumbled maps of a certain feature type. We demonstrate the entire process through experimentation. The following section illustrates the underlying methodology for the proposed hypothesis and its evaluation.

III. METHODOLOGY
Skeletal action recognition using deep learning models has to attain a remarkable level of flexibility in disregarding skeleton joint orders during feature computation. Although a large number of successful methods have been proposed for skeletal based action recognition, this is the first work to report the effects of joint order variations on their performance. There are three challenges in designing a universal deep learning system for skeletal action recognition: first, to choose how many randomly generated joint ordering feature maps are required for training; second, to determine the optimal CNN architecture for achieving joint order independent learning; and finally, to investigate which type of features gives optimized performance with high accuracies. Here, we set out the procedures for creating various feature maps along with our novel quad joint volumetric features, generating random joint orders from a given joint order, and designing a Spectrally Graded CNN architecture. This section consists of six subsections: the extraction of features from joints and their conversion to maps; our proposed QJVM features; generating random joint orders; the proposed SgCNN; its training; and testing and evaluation.

A. SPATIO TEMPORAL FEATURE MAPS
The spatio temporal features define a human skeleton joint's inter and intra frame relationships across 3D action video frames. The human skeleton is represented digitally with $J$ joints, which convey their spatial location with respect to the camera coordinates. These spatial locations are positional vectors defined as $p_j = (x_j, y_j, z_j)$, $j = 1, \dots, J$. Currently, all the features are extracted from the positional vectors, which provide spatio temporal relationships between the joints during an action. The first methods converted the joint positions in $x$ into a red (R) colour coded plane, $y$ into a green (G) plane and $z$ into a blue (B) plane using a threshold on each of these positional values [41]. We call these coloured positional feature maps Joint Positional Maps (JPM). Similarly, the method in [47] converts the $x - y$, $y - z$ and $z - x$ planes into R, G and B planes to create action feature maps, which are called Joint Paired Positional Maps (JPPM). The above two methods applied three stream CNNs with 8 convolutional layers, with max pooling and ReLU operations in between them. Each stream has a dense layer followed by a SoftMax layer. The output class probabilities are predicted using a decision score fusion model to recognize actions. The results are better than those of the previous non deep machine learning models due to the multi feature learning that is automated in deep learning models. However, the recognition score was further improved in [49] through a small modification to the model in [48], which adds a 4th stream with the xyz combined feature map from [11].
The above methods propose spatio temporal maps that do not explore relationships among the joints in the spatial and temporal domains. This was achieved using joint distance maps (JDMs) in [11] through joint pairs. For a $J$ joint 3D skeleton, there are $\binom{J}{2} = \frac{J(J-1)}{2}$ unique joint pairs. In 3D video frame $t$, the paired $i$th and $j$th joints, represented by positional pointers $p_i = (x_i, y_i, z_i)$ and $p_j = (x_j, y_j, z_j)$ respectively, develop an $\ell_2$ norm based relational feature expressed as
$d_{ij} = \| p_i - p_j \|_2 = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$,
where $d_{ij}$ is the Euclidean distance between two joints in a frame. For the entire frame with $J$ joints, $d_{ij}$ becomes $d_J^t$, a vector describing all the joint relationships within frame $t$. Extending this to the entire skeletal 3D video action sequence with $T$ frames, we represent $d_J^t$ as a matrix $d_J^T$ of size $\frac{J(J-1)}{2} \times T$. Hence, for all three axes, we have one such plane per axis. In [17], these three planes are colour coded into RGB planes to form a feature map that represents the spatio temporal variations in the 3D skeletal action sequence. The maps created were called joint distance maps (JDMs). The performance of JDMs was found to be better on a single stream CNN, which is less complex and computationally more efficient than the networks used in [40], [50] and [51].
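The distance feature described above can be sketched in a few lines. This is a minimal illustration, not the implementation from [11] or [17]: the function name `joint_distance_matrix`, the frame layout (a list of frames, each a list of (x, y, z) tuples) and the toy coordinates are assumptions made for the example.

```python
import math
from itertools import combinations


def joint_distance_matrix(sequence):
    """Build a J(J-1)/2 x T matrix of pairwise Euclidean joint distances.

    sequence: list of T frames; each frame is a list of J (x, y, z) tuples.
    Returns a list of rows, one per unordered joint pair (i, j) with i < j,
    each row holding that pair's distance in every frame.
    """
    num_joints = len(sequence[0])
    pairs = list(combinations(range(num_joints), 2))
    return [
        [math.dist(frame[i], frame[j]) for frame in sequence]
        for i, j in pairs
    ]


# Toy input: 3 joints over 2 frames (hypothetical coordinates).
frames = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0), (1.0, 0.0, 0.0)],
]
jdm = joint_distance_matrix(frames)
```

Note that the row order follows the joint pair enumeration, so permuting the joints permutes the rows of the map, which is exactly the sensitivity to joint ordering that this work studies.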
The JDMs were further enhanced with joint angular information into a more robust feature representation through joint angular displacement maps (JADMs) [17]. The JADMs are created by combining joint distance features with their orientation angular information. The angular displacement feature between the $i$th and $j$th joints in the $t$th frame uses the orientation angle $\theta_{ij}^t$, which is computed in each frame $t$ with respect to a common adjacent joint $k$. The result is a $\frac{J(J-1)}{2} \times T$ feature matrix which represents a 3D action video sequence. The obtained feature matrix in 3D is colour coded to form a $\frac{J(J-1)}{2} \times T \times 3$ RGB image. The JADMs characterized very subtle joint variations across 3D actions, thus transforming robust patterns into image pixel representations that provided good discrimination across actions.
Alternatively, enhanced feature maps such as joint velocity maps (JVMs) [43], joint angular maps (JAMs) [42], joint angular velocity maps (JAVMs) [52], joint trajectory maps (JTMs) [46] and joint acceleration maps (JaMs) [44] with deep learning networks have been shown to improve recognition accuracies over traditional features. All these maps model 2-joint relationships within a 3D video action sequence. Substituting 2-joint with 3-joint relationships has further enriched the patterns on the maps for the automated feature extraction process in CNNs. The 3-joint relational feature maps were called joint planar maps (JpMs) [45] and joint surface maps (JSMs) [53]. Inspired by these, we propose a 4-joint relational map called the quad joint volume feature map (QJVM), which is elaborated in the following section.
However, we discovered that all of these feature maps rely on one simple rule: never change the skeletal joint ordering. Changing the joint ordering during feature computation across experiments greatly affects the performance of the deep learning algorithms. Extensive experimentation and analysis have been performed in this work to meet the objective of discovering a universal action recognition framework that is independent of the joint ordering.

B. QUAD JOINT VOLUME FEATURES
The human skeletal model is represented on a machine with $J$ joints, which form $\binom{J}{4}$ unique 4-joint combinations. In 3D space, each joint is represented as a position vector $p_j = (x_j, y_j, z_j)$. To construct a geometric quadrilateral with 4 sides, we use the $\binom{J}{4}$ four-joint combinations. Hence, on a $J$ joint skeleton, we construct $\binom{J}{4}$ quadrilaterals in 3D space. For our 39-joint action skeleton, we construct 82251 four-sided 3D quadrilaterals of arbitrary shapes. These 82251 polygons characterize all possible relations among joints in a 3D frame $t$. The joint volume features of the 82251 quadrilaterals describe the spatial joint relationships within a 3D video action frame. However, we eliminated the quadrilaterals that vary slowly across action frames using an averaging threshold. As a result, only about 10% of the 82251 quadrilaterals are impactful in each action sequence and useful for feature computation.
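As a quick check of the count quoted above, the number of 4-joint combinations for a 39-joint skeleton works out as follows. The 10% retention figure is taken from the text; the rounded count is only indicative, since the exact number kept depends on the thresholding.

```latex
\binom{39}{4} = \frac{39 \cdot 38 \cdot 37 \cdot 36}{4!} = \frac{1974024}{24} = 82251,
\qquad 0.1 \times 82251 \approx 8225
```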
During skeletal motion in the 3D video sequence, the constructed 3D quadrilaterals change shape and orientation in proportion to the joint relationships. These changes are transformed into quad joint volume feature (QJVF) matrices. When the coordinates of the vertices of a 3D quadrilateral are known, we apply the following process to calculate the quad joint volume. Fig. 3 shows the process of designing relative quad joint volume features. To find the volume of an irregular quadrilateral, we split it into two truncated triangular prisms and compute the volume from the vertex coordinates, where $(x_i, y_i, z_i)$ is the starting coordinate of the $s$th quadrilateral in the $t$th frame. Hence, the QJVF is a vector of size $0.1 \times \binom{J}{4} \times 1$ representing the volumes of the 3D quadrilaterals. For the entire 3D video sequence with $T$ frames, the spatio temporal QJVF is a matrix of size $0.1 \times \binom{J}{4} \times T$.
Finally, for a dataset with $N$ labelled 3D videos, the QJVF is a multidimensional matrix of size $0.1 \times \binom{J}{4} \times T \times N$. The chronological arrangement of these features is shown in Fig. 3.
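To make the shape of the QJVF matrix concrete, the sketch below computes one volume per 4-joint combination per frame. It is only a stand-in under stated assumptions: the paper's volume expression (based on two truncated triangular prisms) is not reproduced here, so the sketch substitutes the tetrahedron volume given by the scalar triple product, and it omits the 10% selection step. The names `tetra_volume` and `quad_volume_features` and the toy coordinates are hypothetical.

```python
from itertools import combinations


def tetra_volume(p0, p1, p2, p3):
    """Unsigned volume of the tetrahedron spanned by four 3D points.

    Stand-in for the paper's quad joint volume: |a . (b x c)| / 6, where
    a, b, c are the edge vectors from p0 to the other three points.
    """
    ax, ay, az = (p1[k] - p0[k] for k in range(3))
    bx, by, bz = (p2[k] - p0[k] for k in range(3))
    cx, cy, cz = (p3[k] - p0[k] for k in range(3))
    det = (ax * (by * cz - bz * cy)
           - ay * (bx * cz - bz * cx)
           + az * (bx * cy - by * cx))
    return abs(det) / 6.0


def quad_volume_features(sequence):
    """Return a C(J, 4) x T matrix of volumes, one row per 4-joint combination.

    sequence: list of T frames; each frame is a list of J (x, y, z) tuples.
    """
    num_joints = len(sequence[0])
    quads = list(combinations(range(num_joints), 4))
    return [
        [tetra_volume(*(frame[j] for j in quad)) for frame in sequence]
        for quad in quads
    ]


# Toy input: 5 joints over 2 frames; the second frame is the first scaled by 2.
frame_a = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
frame_b = [(2 * x, 2 * y, 2 * z) for x, y, z in frame_a]
features = quad_volume_features([frame_a, frame_b])
```

For a 39-joint skeleton the full enumeration has 82251 rows per frame, which is why the text prunes slowly varying quadrilaterals before building the maps.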
In general, the above process can be extended to skeletal data captured using sensors like Kinect or a mocap system with a camera setup different from the one used in this work. Therefore, we proceed to investigate the performance of QJVFs on the publicly available HDM05 [12] and CMU [13] mocap datasets and the NTU RGB D [14] Kinect dataset, along with our own 3D mocap dataset KLHA3D-102. The QJVF feature matrix can be used for training the classifiers directly or can be encoded as color images for training deep convolutional neural networks. Despite the success of classifiers like HMM and DTW on such time series data, their operating efficiency reduces on large datasets, such as the one used in this work. Hence, we encode the QJVFs into color coded pixels using the procedures from our earlier work [17]. The volume data is color coded into RGB planes using the 'jet' color map to form quad joint volume feature maps (QJVMs), following the standard mapping procedure [54].
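The colour coding step can be illustrated as follows. This is a sketch under assumptions, not the mapping procedure of [54]: it normalizes a feature matrix to [0, 1] with a global min-max (an assumed choice) and applies a common piecewise linear approximation of the 'jet' colour map. The names `jet_rgb` and `encode_feature_map` are hypothetical.

```python
def jet_rgb(value):
    """Approximate 'jet' colour for a value in [0, 1] as an (R, G, B) tuple.

    Uses a piecewise linear approximation: blue at the low end, through
    cyan, green and yellow, to red at the high end.
    """
    def clamp(x):
        return max(0.0, min(1.0, x))

    red = clamp(1.5 - abs(4.0 * value - 3.0))
    green = clamp(1.5 - abs(4.0 * value - 2.0))
    blue = clamp(1.5 - abs(4.0 * value - 1.0))
    return tuple(int(round(255 * channel)) for channel in (red, green, blue))


def encode_feature_map(matrix):
    """Min-max normalize a 2D feature matrix and colour code every entry."""
    flat = [x for row in matrix for x in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1.0  # avoid division by zero on flat input
    return [[jet_rgb((x - lowest) / span) for x in row] for row in matrix]


# Toy 2 x 2 feature matrix (hypothetical values).
image = encode_feature_map([[0.0, 1.0], [2.0, 4.0]])
```

The output is a rows x columns grid of RGB triples, i.e. a features x frames x 3 image of the kind that the CNNs in this work consume.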

C. JUMBLE SKELETAL JOINTS
The color coded feature maps represent joint variations in 3D skeletal actions as pixels on an image. These pixel patterns are learned by deep CNN models for recognition of human actions across classes. The studies on these feature maps have revealed two interesting observations.
1) The joint ordering is fixed at the start of the experiment by the capturing sensors, and it can be modified by the user after recording based on missing joint information.
2) Training and testing with different joint orders has not been conducted on deep learning models to understand its implications for the overall performance of skeletal based action recognition systems.
Hence, this work proposes to explore the impact and importance of joint ordering in skeletal based human action recognition tasks from feature maps to training a deep network.
The primary goal is to create a jumbled joint ordering, given the sensor generated skeletal joints. The initial skeletal joint ordering in our motion captured KLHA3D-102 is shown in Fig. 1(e). This is the joint ordering that was selected during data capture. To create a different ordering of joints, we used the Fisher-Yates shuffle algorithm [55]. The function shuffles the input list of $J$ joints randomly to produce a different joint ordering. In this work, the shuffle routine was called 100 times with the original $J$ joint input in every instance. Out of 300, we selected 100 joint orders that were found to be uniquely random through a cross correlation coefficient between the original and the generated joint orders. Specifically, the selected 100 joint orders are highly discriminating and independent. Hence, we selected these 100 shuffled joint orders for training and testing the proposed problem through deep networks. Fig. 1 shows the shuffled joints projected onto the human skeleton.
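The shuffling and selection step can be sketched as below. The Fisher-Yates routine is the standard algorithm; the selection rule is an assumption, since the text only says that a cross correlation coefficient between the original and generated orders was used. Here candidates are ranked by the absolute Pearson correlation with the base order and the least correlated ones are kept. The names `fisher_yates_shuffle`, `pearson` and `select_joint_orders` are hypothetical, and the fixed seed is only for reproducibility of the example.

```python
import random


def fisher_yates_shuffle(items, rng):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def select_joint_orders(num_joints, num_candidates, num_keep, seed=0):
    """Generate candidate joint orders and keep the least correlated ones.

    Assumed rule: rank unique candidates by |correlation| with the base
    order 0..J-1 and keep the num_keep lowest.
    """
    rng = random.Random(seed)
    base = list(range(num_joints))
    # dict.fromkeys drops duplicate orders while preserving generation order.
    candidates = list(dict.fromkeys(
        tuple(fisher_yates_shuffle(base, rng)) for _ in range(num_candidates)
    ))
    ranked = sorted(candidates, key=lambda order: abs(pearson(base, order)))
    return [list(order) for order in ranked[:num_keep]]


# 39 joints, 300 candidates, 100 kept (counts taken from the text).
orders = select_joint_orders(num_joints=39, num_candidates=300, num_keep=100)
```

Each returned list is a permutation of joint indices that can be used to reindex every frame before the feature maps are computed.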
However, reproducing exactly the same joint orders is not possible because they are generated randomly. Hence, we performed the experiments 5 times from scratch on different DL frameworks and across action dataset features. Testing is then carried out to find the number of randomly ordered joint training sets necessary to achieve good recognition accuracies.
The methods for creating feature maps were applied to these 100 joint orders. In addition to our motion capture dataset, we experimented to assess the universality of the proposed framework across benchmark skeleton action datasets such as NTU RGB D [14], MSRAction3D [15], HDM05 [12] and CMU [13]. HDM05 and CMU are recorded on a 3D motion capture platform, whereas the others are based on the Kinect sensor. Joint shuffling and spatio temporal feature map creation are consistent across datasets. Existing deep networks such as CNN, ResNets, GoogleNet and recurrent CNN learned the above feature maps with a fairly small training loss but could not generalize during validation. The validation errors became constant after 50 epochs for most of the feature maps across datasets. Hence, to improve the recognition accuracies, we built a grid-like circular transferable feature model that we call a spectrally enriched graded CNN (SgCNN). The following section describes the SgCNN architecture, training and testing procedures.
FIGURE 5: Three feature maps from our previous works for 10 jumbled joints.

D. SPECTRALLY GRADED CNN
The aim of Spectrally Graded CNN (SgCNN) is to rotate the multidimensional features around the network to create a nonlinear feature vector that can generalize from the input data samples. This approach is used by ResNets and RNNs to reinforce lost data and avoid vanishing gradients. These networks are very deep; the minimum number of layers has been found to be around 20. Our SgCNN differs from existing networks in three respects.
1) It processes multiband features simultaneously to generate a highly nonlinear feature vector and thus avoid over- or underfitting.
2) It does not use a dropout layer to induce random feature nonlinearity before the dense layer.
3) It is computationally efficient and requires less training.
The SgCNN architecture is shown in Fig. 4. Before applying the proposed SgCNN to QJVMs and other maps, we needed to evaluate its performance on standard image datasets. For this, we used several image and video datasets, namely Fruits-360 [56], Food-101 [57], Caltech-256 Object Categories [58], KTH-Animal [59], and UCF Sports [60]. These were used to compare the proposed SgCNN against state-of-the-art network architectures such as CNN8 [22], VGG16 [61], VGG19 [62], ResNet-50 [63], GoogleNet [64] and SENet-154 [65]. The results of this experimentation are presented in Table 1. They gave us confidence in the ability of the SgCNN architecture to learn patterns at multiple resolutions simultaneously in images. The next subsection describes the training of the SgCNN on randomly shuffled skeletal joint feature maps.

E. TRAINING SGCNN
To implement the SgCNN algorithm, we used Python 3.6 with the TensorFlow library to train the model. We used the same hyperparameters for all datasets, except for the learning rate, which was reassigned during training for each dataset. Specifically, we decreased the learning rate exponentially from 0.01 until the error became constant. At the start of the training phase for each dataset, we initialized the network's weights and bias parameters randomly from a zero-mean Gaussian distribution with variance 0.01.
The SgCNN learned by updating its weights and bias parameters using the backpropagation gradient descent algorithm. To contain the validation errors during training, we applied an L2 weight regularizer after each convolutional layer. This enabled the SgCNN to develop uniformity in weights across layers during training. We applied ReLU and softmax activations in the convolutional and dense layers, respectively. Finally, we used a fixed batch size of 32 for training, based on the image resolution and the amount of GPU memory available. The training was performed on an NVIDIA GTX 1070 GPU with 8GB of memory. The SgCNN model was built from scratch with a Keras frontend and a TensorFlow backend.
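The learning rate schedule and weight initialization described in this subsection can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the per-epoch decay factor `DECAY_RATE` and the helper names `exponential_lr` and `gaussian_init` are hypothetical and do not come from our implementation; only the initial learning rate of 0.01 and the variance of 0.01 are taken from the text.

```python
import math
import random

INITIAL_LR = 0.01        # starting learning rate stated in the text
WEIGHT_VARIANCE = 0.01   # variance of the zero-mean Gaussian stated in the text
DECAY_RATE = 0.95        # ASSUMED per-epoch decay factor (not given in the text)


def exponential_lr(epoch, initial_lr=INITIAL_LR, decay_rate=DECAY_RATE):
    """Learning rate after `epoch` epochs of exponential decay."""
    return initial_lr * decay_rate ** epoch


def gaussian_init(n_weights, variance=WEIGHT_VARIANCE, seed=None):
    """Draw weights from a zero-mean Gaussian with the given variance."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [rng.gauss(0.0, std) for _ in range(n_weights)]


# Learning rates for the first five epochs and a toy set of initial weights.
schedule = [exponential_lr(e) for e in range(5)]
weights = gaussian_init(10_000, seed=0)
```

Note that a variance of 0.01 corresponds to a standard deviation of 0.1, which is the quantity most initializers expect.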

F. TESTING AND EVALUATION
Testing sets of different proportions were carved out using multiple combinations of spatio temporal feature maps constructed by varying the skeletal joint orderings. Here, we address our third objective, which is to find the optimal number of joint order combinations required to achieve joint order independence. Consequently, the testing process was exhaustive and included multiple executions across combinations of maps in all the considered datasets. The performance of the SgCNN was evaluated based on recognition accuracies averaged across datasets.

IV. EXPERIMENTATION AND ANALYSIS
Exhaustive experiments were designed and conducted to discover which spatio temporal feature maps, and which train test ratios on deep networks, are best suited to deal with skeletal joint order variations across datasets for action recognition tasks. We start by evaluating QJVMs on SgCNN against various feature maps and deep networks with the base joint order on the KLHA3D102 dataset. The base joint order is the initial joint ordering followed in the action datasets. The jumbled joint order is obtained by randomly shuffling the skeletal joints using the Fisher-Yates shuffle. The obtained results were then compared against benchmark datasets. Next, the above experiments are repeated for jumbled joints to identify the type of skeletal feature maps that achieve joint order independence through optimal training on different network architectures. Finally, the random joint shuffle routine is called 5 times on each of 5 differently configured computer systems to generate 25 randomly shuffled skeletal joint ordered maps per class for testing and to identify the best in class deep networks that allow joint independent learning.
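The joint jumbling step mentioned above can be illustrated with a minimal Fisher-Yates sketch. The function names `fisher_yates_shuffle` and `reorder_joints`, the fixed seed and the toy two-frame sequence are assumptions made for illustration; the text only states that the Fisher-Yates shuffle is applied to the skeletal joints.

```python
import random


def fisher_yates_shuffle(n_joints, rng):
    """Return a random permutation of the joint indices 0..n_joints-1."""
    order = list(range(n_joints))
    # Walk from the last position down, swapping each element with a
    # uniformly chosen element at or before it.
    for i in range(n_joints - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        order[i], order[j] = order[j], order[i]
    return order


def reorder_joints(sequence, order):
    """Apply one joint order to every frame of a skeletal sequence.

    `sequence` is a list of frames; each frame is a list of (x, y, z) joints.
    """
    return [[frame[k] for k in order] for frame in sequence]


rng = random.Random(42)                 # fixed seed only for repeatability
order = fisher_yates_shuffle(39, rng)   # 39 joints, as in KLHA3D102

# Toy two-frame sequence: joint k sits at (k, k, k) plus a per-frame offset.
sequence = [[(k + t, k + t, k + t) for k in range(39)] for t in range(2)]
jumbled = reorder_joints(sequence, order)
```

The same permutation is applied to every frame, so the temporal trajectory of each joint is preserved and only its position in the joint list changes.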

A. SKELETAL ACTION DATASETS
The KLHA3D102 dataset is captured with an 8 camera Vicon motion capture system [18]. The human skeleton in our dataset has 39 joints from head to toe. The joints in 3D mocap are placed manually by pre-determining the highly articulated joints on the human body. In total, KLHA3D102 consists of 102 classes with 10 subjects, each repeating the action 10 times. Hence we have 102 × 10 × 10 = 10200 skeletal actions with a base joint order representation. In order to achieve a skeletal joint independent action recognition model, we now consider expanding the base joint order into multiple random jumbled joint orders. As discussed in section III-C, 100 shuffled joint ordered skeletons were generated. Subsequently, associated action datasets were formulated from these jumbled skeletons. Finally, feature maps were extracted on these jumbled joint datasets. The complete jumbled joint dataset for a particular feature type has 102 × 10 × 10 × 100 = 1020K feature maps. The size of the feature map images is fixed to 256 × 256 for optimum loading on the GPU during training. Similarly, the above process is repeated to create maps across the 10 feature types discussed in section III-A.
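The dataset sizes quoted above for KLHA3D102 follow directly from the stated factors, as the short calculation below shows. The variable names are illustrative only.

```python
# Factors for KLHA3D102 as stated in the text.
classes, subjects, repetitions = 102, 10, 10
joint_orders = 100    # shuffled joint orders generated per action
feature_types = 10    # feature types discussed in section III-A

base_actions = classes * subjects * repetitions        # base joint order only
maps_per_feature = base_actions * joint_orders         # one feature type
maps_all_features = maps_per_feature * feature_types   # all feature types

print(base_actions, maps_per_feature, maps_all_features)
# prints: 10200 1020000 10200000
```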
We followed the same procedure to create the jumbled joint dataset for our KLYOGA3D. This is a 42 class yoga skeletal action dataset recorded with 10 subjects and 5 repetitions. The total size of jumbled joint feature maps on the yoga dataset is 42×10×5×100=210K per feature type. For all 10 feature types combined, we have 10200K maps on KLHA3D102 and 2100K on KLYOGA3D. Apart from our KLHA3D102, we also used the publicly available 3D mocap action datasets HDM05 and CMU. Of the two, HDM05 is less noisy and consists of 70 action classes with 5 subjects performing an action several times. In this work, we used 70 × 30 × 5 × 100 × 10 = 1575000 action samples for training and testing. The CMU dataset in this work is carefully curated to avoid missing and noisy marker information. Consequently, the CMU dataset used for training and testing has 30 × 30 × 10 × 100 × 10 = 9000000 samples, with 30 action classes, 10 subjects and 30 variations per subject. In contrast to our 39-joint skeleton, HDM05 and CMU are captured with 41-joint skeletons. Finally, to assess the usefulness of the proposed maps and the ML algorithm, we investigated Kinect skeletal action data with 25 joints from the NTU RGB D dataset. Our reconstructed NTU RGB D dataset has 60×30×10×100×10 = 1800000 action samples. In addition, these datasets were selected to have a 30 to 40% overlap among action classes.
The entire experimentation is divided into 3 clusters, each containing different experiments. In the first cluster (C1), we test our proposed QJVF feature maps and the SgCNN architecture across multiple features on the considered datasets. In C1, only the base joint order or a single random joint order skeleton is used, and the joint order is fixed throughout the experimentation. Experiments in C1 test the usefulness of our proposed spatio temporal feature maps, QJVFs, against the existing maps on deep networks. The second cluster (C2) is the core of this work. Here, we train deep networks to predict human actions from random joint ordered skeletal maps. We created a total of 100 random jumbled ordered joint feature maps per action per subject per repetition across datasets. The entire dataset was divided into train and test sets of different sizes to discover the train test ratios necessary for joint independence in skeletal action recognition systems across datasets. We recorded the end-to-end system accuracies over a multitude of these train test ratios and discovered a possible range for joint order independent learning by the deep learning models. Additionally, we identified the joint feature maps best suited to achieving our proposed objective.
Finally, cluster C3 is designed to validate the proposed jumbled joint independent learning across multiple machines. This phase is necessary to ensure that the required number of training samples does not change by a large margin across different hardware configurations. The proposed method generates the joint orders randomly, so they fluctuate between experiments. Therefore, we performed the experiments 5 times on 5 different hardware configurations to determine the possibility of joint order independent learning for skeletal based action recognition.

B. C1: MONOSKEL RESULTS
To begin with, we focus on the performance of our proposed QJVMs and the novel SgCNN architecture in the traditional setting where the skeletal joint order is unaltered throughout the experiment, which we call MONOSKEL. The focus is to test the effectiveness of quad joint relationships for skeletal based action recognition on deep learning architectures. We also test the performance of the proposed SgCNN for classification tasks. We compare and analyze the test results with respect to different state-of-the-art maps and networks for skeletal action recognition.

1) Evaluating QJVM and SgCNN on KLHA3D102
The performance metric for evaluation, maintained uniformly across the work, is mean recognition accuracy. Here, we divided the entire KLHA3D102 dataset into multiple training units with different train test ratios. Specifically, we identify the MONOSKEL Accuracy Maximization Samplers (AMS) on the training data. The AMS is the minimum number of training samples necessary to reach maximum recognition accuracy. After many different iterations, we chose to start at 20 and go up to 80 training samples per class in fixed increments.
The state-of-the-art networks CNN8 [22], VGG16 [61], VGG19 [62], RESNET-50 [63], GoogleNet [64] and SENet-154 [65], which were highly competitive during the ImageNet classification challenges, are considered for validating the proposed SgCNN. Table 2 gives the complete results of experimentation in C1. The proposed QJVMs consider 4-joint combinations to calculate features instead of 2 or 3. Although 4-joint features need a large computational space, they are relatively richer in characterizing joint dependencies across actions. They are similar to 2 or 3 joint features except that the 4-joint features are extracted from the closed 3D volume between joint spaces. Hence, the QJVM feature maps show rich pixel patterns that are necessary for discriminating closely related actions, and all the state-of-the-art CNNs for image classification generalize well on the proposed QJVMs. However, the least successful are the joint positional maps (JPMs), due to non discriminating pixel patterns between closely related actions. We can also observe that the maps with differentiating features such as JVM and JaM have produced good recognition accuracies. Similarly, higher dimensional features such as JpMs and JADMs are not far behind the differentiating features.
Overall, we found that a highly relational feature on the skeletal joints has provided good action recognition capabilities.
Secondly, the deep networks used in this work have already proved their strength in the image classification space. However, the amount of data used for training these models here is quite different from their usual datasets. All the models were trained from scratch on an 8GB GTX1070 GPU with the same initial hyper parameters. Table 2 shows the recorded accuracies on different feature maps. Since all the networks used are state-of-the-art, their accuracies across maps did not differ by a large margin. Interestingly, our proposed SgCNN proved to be competitive with these models. What separates SgCNN from the state-of-the-art is the computational complexity and memory usage during training, which are tabulated in the last two rows of table 2. In particular, these rows show that the proposed network has fewer trainable parameters and occupies less memory. We attribute this to the parallel architecture and gyroscopic filter kernels that facilitate hyper hierarchical feature representation across multiple channels. Moreover, filter kernels larger than 9 were found to yield little improvement in recognition accuracies on images, and hence 9 was the maximum size considered for SgCNN.
Finally, the first column in table 2 shows the number of training maps considered per feature per deep network. This is used to identify the AMS necessary to achieve prediction confidence of the network. From table 2, the AMS lies in the range of 60 to 80 training samples of MONOSKEL data. However, the AMS depends on the GPU system used. We tested on 4 other GPU configurations and found a variation of around ±3% in accuracy levels. To summarize, the optimal value of AMS for our action dataset ranges between 60 and 80 samples per class when the skeletal joint orders are unchanged during training and testing.
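One possible way to read an AMS value off an accuracy curve is sketched below. It is an illustrative interpretation, not the procedure used to produce table 2: the helper `accuracy_maximization_samples`, the `tolerance` parameter and the accuracy values in `toy_curve` are all hypothetical placeholders and are not results from this work.

```python
def accuracy_maximization_samples(accuracy_by_size, tolerance=0.5):
    """Smallest training-set size whose accuracy lies within `tolerance`
    percentage points of the best accuracy observed (ASSUMED criterion)."""
    best = max(accuracy_by_size.values())
    for size in sorted(accuracy_by_size):
        if best - accuracy_by_size[size] <= tolerance:
            return size


# Placeholder accuracies (percent) per training samples per class.
# These numbers are made up for illustration and are NOT from table 2.
toy_curve = {20: 61.0, 40: 78.5, 60: 90.2, 80: 90.6}

ams = accuracy_maximization_samples(toy_curve)
strict_ams = accuracy_maximization_samples(toy_curve, tolerance=0.0)
```

With a nonzero tolerance the AMS becomes a range-like quantity, which matches the way it is reported here as an interval of training samples per class rather than a single value.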

2) MONOSKEL Across Benchmark datasets
To validate the proposed framework for skeletal action recognition with respect to different data sources, we applied it to KLYOGA3D and to the benchmark datasets NTU RGB D, UTKINECT, MSRACTION3D, HDM05 and CMU. Since these datasets were discussed at the start of this section, we present the results directly in table 3. The training and testing samples differ in each case because the number of classes and the number of images per class are uneven across datasets. Here the intention is to test the performance of the proposed framework and not to pick the best possible solution for action recognition. Table 3 builds confidence in our proposed framework through computed mean recognition accuracies that are close to normal. However, the accuracies in table 3 are on the lower side when compared to table 2, because all of these datasets except our KLYOGA3D are noisy. Together, the results in table 3 support the proposed framework as a novel interface for skeletal action recognition. In the following section, we present cluster C2, where the networks are taught to learn features from jumbled skeletal joints.

C. C2: JUMBLESKEL RESULTS
In this cluster, we present the results of skeletal action recognition using features constructed from jumbled joints on deep networks. This cluster is the most captivating part of the entire experimentation. The focus is to discover the JUMBLESKEL AMS on a particular set of features. We extend this by analyzing the networks on which maximum accuracy is achievable. Finally, we report the results of JUMBLESKEL features from different skeleton sources when they are learned by the deep networks.

1) Evaluating JUMBLESKEL on KLHA3D102 dataset
The 1020000 jumbled joint feature maps cover all actions from different subjects with multiple orientations. One of the objectives is to find the AMS value that can provide insight into learning on jumbled joint skeleton data. To accomplish this, we reduced the 1020000 jumbled maps to 100 feature maps per class per subject in one orientation. We therefore have 100 jumbled feature maps per class, which contain data from a single subject in a particular orientation. This brings uniformity between the experiments in clusters C1 and C2, which is important for a deeper insight into the performance of JUMBLESKEL compared to MONOSKEL action recognition. Finally, we performed the experiment on all subjects in all orientations and averaged the results across experiments. Similar to the previous section, the number of training samples is increased by 20 per experiment. The remaining samples are used for testing. In each training set, 20% of the samples are kept for validation. The same networks are trained from scratch with all the hyper parameters discussed in section III, and the hyper parameters are kept constant across networks. Table 4 presents the results of our experimentation for different AMS values on KLHA3D102 data. Here the random joint maps are from a single run of the Fisher-Yates shuffle on a GTX1070 8GB GPU. Table 4 is structured like table 2 to help readers understand the difference between learning on the unshuffled or base joint order and learning in shuffled mode. The mean recognition accuracy increased as the number of training samples increased. Subsequently, it became reasonably consistent in the AMS range of 80 and above. This happened only for SgCNN, whereas for the other networks it required more than 80 samples.
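The splitting protocol described above, for one class with 100 jumbled maps, can be sketched as follows. The helper name `split_jumbled_maps`, the random selection of maps and the exact list of training sizes are assumptions for illustration; the text specifies only the step of 20, the use of the remainder for testing and the 20% validation share of each training set.

```python
import random


def split_jumbled_maps(map_ids, n_train, val_fraction=0.2, seed=0):
    """Split one class's jumbled maps into train, validation and test sets."""
    ids = list(map_ids)
    random.Random(seed).shuffle(ids)   # ASSUMED: maps are picked at random
    train_pool, test = ids[:n_train], ids[n_train:]
    n_val = int(round(n_train * val_fraction))  # 20% of the training pool
    val, train = train_pool[:n_val], train_pool[n_val:]
    return train, val, test


# 100 jumbled maps per class; the training size grows in steps of 20.
for n_train in (20, 40, 60, 80):
    train, val, test = split_jumbled_maps(range(100), n_train)
    print(n_train, len(train), len(val), len(test))
```

For example, with 80 training samples per class this gives 64 maps for fitting, 16 for validation and 20 for testing.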
We were indeed forced to increase the number of jumbled joint features to 90 for training the other networks. GoogleNet and VGG16 reached their maximum accuracy at 80, whereas the others reached it at 88 jumbled joint features per class. Further, the SgCNN achieved these results with comparatively lower computational cost than the other networks. Additionally, our network did not suffer from the vanishing gradient problem, which we encountered when training the state-of-the-art models and which eventually required retraining them with weight regularizers. The better performance of SgCNN is attributed to the multiple hierarchical filter kernels applied across convolutional layers. In short, variations in the joint order of the skeleton can effectively be learned by a deep network, which can then identify a differently ordered joint action class with around 90% accuracy.
Table 5 shows the computed mean recognition accuracies on jumbled features from multiple sources. The variations in results were found to be similar to those of table 3. However, the accuracies across features and networks are approximately equal to those of table 4. This consistency in mean accuracy can be attributed to the fact that the networks have learned to characterize the jumbled features and have become powerful enough to generalize on the noisy skeletal samples. Notably, table 5 has better performing networks compared to table 3. This is an interesting observation showing that jumbled joint representations are helpful in improving the quality of skeletal action recognition systems. The final cluster C3 evaluates the universality of the proposed framework for action recognition.

D. C3: OMNIPRESENCE OF JUMBLESKEL FRAMEWORK
This part of the work evaluates the proposed concept of joint independent learning in deep networks for skeletal action recognition through iterative execution on multiple machines. For this purpose, we used 5 GPUs located on 5 different machines, i.e. 3 laptops, a workstation and a high performance computing (HPC) center located on our university campus. The GPUs used are an NVIDIA GTX1070 8GB GPU with 16GB RAM, a 4GB Quadro P100, a 6GB TESLA K20M and two 940MX 4GB GPUs. These machines have different memory configurations. However, all GPUs are from NVIDIA and data management is performed using the CUDA architecture.
The entire framework is executed end-to-end on each of these machines. The feature maps were extracted on each machine and used for training from scratch. This was necessary to ensure that the proposed joint order independent learning is implementable in practice, since the skeletal joint shuffling process is random and generates different joint orders on each execution of the routine, whether on the same machine or on a different one. This entire cluster uses only the KLHA3D102 dataset as input. All the hyper parameters were kept constant across machines and iterations. Each model is executed 5 times on each of the 5 machines, from feature extraction to action prediction.
The mean recognition accuracy has been averaged across the dataset and the iterations. Table 6 presents the mean accuracies on different GPUs for QJVM feature maps trained on multiple models. Since we have the AMS for our KLHA3D102 dataset, we used 80 samples for training each of these networks and tested with the remaining 20 samples of JUMBLESKEL. We found that the recognition accuracies deviate fractionally between iterations on a machine, and hence they were averaged across iterations along with the dataset samples. The results in table 6 closely replicate those of table 4, indicating that a constant joint ordering can be replaced by an arbitrary one. In short, we found through experimentation and subsequent analysis that deep learning models can be trained on feature maps from a random set of joint orders to estimate skeletal actions with any joint representation. However, we expect the results to differ slightly between machines, as demonstrated in this cluster.
In conclusion, the minimum number of randomly ordered joint maps required for achieving independence is found to be above 80 samples per class. With this, a maximum accuracy of 86.56% was achieved with our proposed QJVMs on SgCNN, which was the highest among the state-of-the-art methods. Further increasing the AMS improved the accuracies only fractionally across models and features for all the considered skeletal action datasets. The training and validation losses hovered around 0.00143 and 0.00054, respectively, after 80 training samples. All the models were run for 200 epochs, and the accuracies reported in this work are average rates at the 200th epoch.

E. THREAT ANALYSIS
Different randomly generated joint orders were added as test samples, which we call the uncontrolled group. This uncontrolled group is tested against the control group of test samples from the actual datasets. If the results from the uncontrolled group are close enough to those of the control group, the model outputs can be considered unaffected by the joint ordering. In our experimentation, the recognition accuracies from these two groups were separated by a margin of ±2.6% across all datasets. This process has nullified the internal threats to a large extent.
Testing on multiple machines was performed to counter the external threats that arise from generating the random joint sequences on different machines and testing the resulting action feature maps on multiple machines, as described in the last cluster C3.

V. CONCLUSIONS
A joint order independent learning method for skeletal based action recognition is proposed, evaluated and validated along with a recognition method. Skeletal action datasets from mocap and Kinect are used for experimentation and analysis.
Further, the joint order variational data, called jumbled joints, is created using a random shuffling mechanism on the base skeletal joint data. The conclusions are threefold. First, the new spatio temporal joint feature maps, QJVMs, have been shown to have discriminating pixel patterns across closely related actions. In addition, a Spectrally Graded CNN architecture is developed for image classification tasks using multiple filter kernel sizes, which enhance the nonlinearity in learning through hierarchical receptive fields across the network. Second, MONOSKEL training and testing with different feature maps from various datasets on deep learning frameworks has shown that the minimum AMS necessary is in the range of 60 to 80 samples per class. The mean accuracies across various skeletal action datasets were found to be in the range of 88 to 97% for features that encode some kind of joint to joint relationship, and the more joints are involved in a relationship, the better the pixel patterns. Third, this study shows that it is possible to use different joint orderings (JUMBLESKEL) for skeletal action recognition. It further indicates that better joint relationships in features greatly increase the network's capacity to generalize. We also found through exhaustive experimentation on multiple machines that an AMS of 80 or more training samples per class is necessary to develop an omnipresent 3D skeletal action recognition system. This study concludes that it is possible to develop a joint order independent skeletal action recognition system with joint relationship feature maps and deep learning networks.