A Real-Time 3-Dimensional Object Detection Based Human Action Recognition Model

Computer vision technologies have greatly improved in the last few years. Many problems have been solved using deep learning merged with more computational power. Action recognition is one of society's problems that must be addressed. Human Action Recognition (HAR) may be adopted for intelligent video surveillance systems, and the government may use the same for monitoring crimes and security purposes. This paper proposes a deep learning-based HAR model, i.e., a 3-dimensional Convolutional Network with multiplicative LSTM. The suggested model makes it easier to comprehend the tasks that an individual or team of individuals completes. The four-phase proposed model consists of a 3D Convolutional neural network (3DCNN) combined with an LSTM multiplicative recurrent network and Yolov6 for real-time object detection. The four stages of the proposed model are data fusion, feature extraction, object identification, and skeleton articulation approaches. The NTU-RGB-D, KITTI, NTU-RGB-D 120, UCF 101, and Fused datasets are some used to train the model. The suggested model surpasses other cutting-edge models by reaching an accuracy of 98.23%, 97.65%, 98.76%, 95.45%, and 97.65% on the abovementioned datasets. Other state-of-the-art (SOTA) methods compared in this study are traditional CNN, Yolov6, and CNN with BiLSTM. The results verify that actions are classified more accurately by the proposed model that combines all these techniques compared to existing ones.


I. INTRODUCTION
Action identification in videos is one of the crucial ongoing problems in computer vision and artificial intelligence. Action recognition in live video is essential for developing intelligent environments and cutting-edge security systems. It has uses in a variety of fields, including human-machine interfaces [1], monitoring systems [2], and visual comprehension [3]. Human behavior comprises both voluntary and non-voluntary activities [4]. Manually identifying these actions is challenging, and for this reason various strategies have been presented in the literature. The models suggested in the literature rely on conventional techniques, including geometric, point, texture, and shape features. Deep learning methods are employed to address the difficulty in HAR of distinguishing between various human actions. Deep learning represents data using layers such as convolution layers, pooling layers, ReLU, fully connected layers, dense layers, and SoftMax activation layers [5]. Deep learning methods include supervised learning, unsupervised learning, hierarchical models, and probabilistic models. The training samples help in evaluating the performance of any deep learning model [5].
Other significant challenges of HAR are: (i) recognizing the focal point in the current frame of a video sequence, (ii) lighting conditions, shadows, occlusions, and background complexity in video sequences, which lead to inefficient classification of actions, (iii) motion variations that cause wrong actions to be captured, and (iv) imbalanced datasets. Some of these challenges, namely occlusions, shadows, blurriness, background complexity, and imbalanced datasets, are considered in the present paper, and the rest are planned for future work. The proposed model is evaluated on a real-time video taken from a YouTube channel to classify the actions of people in an office, and the evaluation shows that the suggested model handles challenges such as occlusions, blurriness, and background complexity.
The research introduces a novel four-phase human action recognition model that utilizes object identification, skeleton articulation, and 3D convolutional network approaches, which aid in resolving the key HAR difficulties mentioned above. The suggested approach creates 3DCNLSTM (3-dimensional Convolutional Network with LSTM) by combining a multiplicative Long Short-Term Memory (LSTM) recurrent neural network with a 3D CNN to process the videos. This method uses LSTM in conjunction with the factorized hidden-to-hidden transition of multiplicative recurrent neural networks (mRNN) to help produce quick and effective results. LSTM and mRNN have been combined in natural language processing [6]. Classification is enhanced by incorporating feature extraction, object identification, and skeleton articulation techniques into the suggested model. The model's novelty lies in combining the skeleton articulations of the persons involved, the classification of objects appearing in the scene, and the features extracted from image sequences into a single neural network.
The KITTI dataset [7], the NTU-RGB-D and NTU-RGB-D 120 datasets [8], the UCF 101 dataset [9], and the fused dataset are used to assess the proposed model. Combining all the modules into a single neural network is a tedious task; hence the proposed model is a combination of data fusion with Dasarathy's technique, feature extraction with the Xception model merged with multiplicative LSTM, real-time object detection with finetuned YOLOv6 [10] merged with multiplicative LSTM, and the skeleton articulation technique. The paper is organized as follows: the related review is presented in Section II. Section III presents the proposed model design and the deep learning models for feature extraction, object identification, and skeleton articulation. The experimental findings are presented in Section IV, and the study is concluded in Section V.

II. RELATED REVIEW
In recent years, HAR has emerged as a significant research area [11].CCTV surveillance [12], the field of robotics [13], authentication [14], smart healthcare systems [15], and various other technologies are only a few of the many uses for HAR.For the recognition of human action, researchers created numerous deep learning models.The effectiveness and speed of deep learning attracted the attention of researchers.A deep neural network incorporating CNN and Bi-directional LSTM was proposed by Soni et al. [16] to recognize the human activity.The suggested model is tested on the UCI-HAR and UCI-WISDM datasets, and both datasets showed a 97.96% and 97.15% accuracy, respectively, for the model's performance.On the vanKasteren, CASAS Kyoto, and CASAS Aruba datasets, Patricia et al. [17] applied twelve classification approaches, including Logistic Regression, OneR, Attribute Selected, J48, Random Subspace, Random Forest, Random Committee, Bagging, JRip, Random Tree, and REP Tree.The study shows that logistic regression and OneR achieved an accuracy higher than 90%.Table 1 presents a detailed review of other methods used for HAR with their respective research gaps.

III. METHODOLOGY
A novel four-phase model is proposed to improve Human Action Recognition over existing methods. The primary goal of this approach is to merge four distinct components into one integrated neural network. However, capturing all human activities on one platform is a tedious task, so the proposed model tries to classify as many actions as it can. The whole architecture of the proposed model is depicted in Fig. 1. The model has four phases. Data fusion is the first phase, in which existing datasets are combined to generate a new dataset. The second and third phases extract features and classify them according to the skeleton articulations of the selected objects in an image. The fourth phase provides the results. The proposed model combines four modules in a single neural network: data fusion, 3D CNN with multiplicative LSTM, object detection with finetuned YOLOv6 and multiplicative LSTM, and the skeleton articulation technique with multiplicative LSTM. The data fusion module is not shown in the figure because it is a step before data pre-processing, where different datasets are merged to form a bigger dataset. All these modules, except data fusion, have their own multiplicative LSTM layers. A single CNN with an LSTM is not able to recognize specific actions; hence the need arises to merge four modules in a single neural network to identify human actions accurately.
Video sequences with n frames, shown in Fig. 1, are divided into k frames and passed to three different modules. The first k frames are passed to the 3DCNN. In the 3DCNN, the Xception module is implemented with transfer learning for classification purposes. The Xception neural network (XNN) is an 'extreme Inception' model: a 71-layer deep neural network that is more efficient than the Inception v3 model. Instead of partitioning the input data into segments before performing a 1×1 convolution to determine cross-channel correlation, XNN maps the spatial correlation for every output channel separately. Hence, Xception is built from depthwise separable convolutions, that is, a depthwise convolution followed by a pointwise (1×1) convolution. Transfer learning is a technique in which a model is initialized with weights from a pre-trained model such as Xception and is then used either as a feature extractor or with its last layers fine-tuned. In this study, Xception with transfer learning is used for better results [33]. The Xception module computes convolutions with 1×1, 3×3, and 5×5 filter sizes, all in parallel. Two further layers, max pooling and average pooling, come before the fully connected layer. The Xception module uses transfer learning with the weights acquired during ImageNet training. The Xception module outputs a 4096-element feature vector per frame. In this approach, 90 frames are obtained for each video at a frame rate of 35 Hz. Feature vectors are obtained for each frame, and the first multiplicative LSTM layer receives an input vector (F) of 90×4096 values. The first multiplicative LSTM outputs a vector (T1) of another 4096 values that represents the input sequence; this output vector is concatenated with the outputs of the other modules before being passed to a fully connected layer. Feature maps obtained at different layers during Xception inference are shown in Fig. 2.
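The text does not say how the k = 90 frames are chosen from an n-frame video. As a minimal sketch, assuming uniform temporal sampling (an assumption, not the authors' stated method), the frame indices could be picked as follows; `sample_frame_indices` is a hypothetical helper name.

```python
import numpy as np

def sample_frame_indices(n_frames: int, k: int = 90) -> np.ndarray:
    """Pick k frame indices spread evenly over a video of n_frames frames.

    Uniform sampling is an assumption made for illustration. If the video
    has fewer than k frames, some indices repeat.
    """
    if n_frames <= 0 or k <= 0:
        raise ValueError("n_frames and k must be positive")
    # linspace covers the first and last frame; rounding gives integer indices
    return np.linspace(0, n_frames - 1, num=k).round().astype(int)

# Example: a 300-frame clip reduced to the 90 frames used by each module
indices = sample_frame_indices(300)
```

Each module would then read the same 90 frames, so the per-frame feature, object, and skeleton vectors stay aligned in time.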
In the second module, the finetuned YOLOv6 is used for object detection [10]. YOLOv6 balances speed and accuracy. It works on an anchor-free paradigm, which helps increase the speed by 51%. The finetuned YOLOv6 used in this study reduces, to some extent, HAR challenges such as occlusions, background complexity, and blurriness [10]. Gupta et al. [10] proposed a novel finetuned YOLOv6 object detection model whose parameters are tuned to reduce dimensionality. Tuning the parameters reduces the model's accuracy, so a transfer learning algorithm is applied to recover it. This object detection module is given k frames as input. The vector in this module comprises 61 objects, each with six parameters that describe the detection confidence and the bounding box position. If comparable objects appear in a single frame, only the object with the highest confidence score is kept, which reduces the redundancy of objects in that frame. These output values undergo batch normalization, and the result is supplied as input to the second multiplicative LSTM layer.
An output vector (T2) from the second multiplicative LSTM layer is concatenated with the outputs of the other modules before being sent to the fully connected layer.
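The redundancy filter described above can be sketched as follows. The layout of the six parameters is not given in the text, so the order `(class_id, confidence, x0, y0, x1, y1)` and the zero padding for empty slots are assumptions made for illustration; `pack_detections` is a hypothetical helper.

```python
import numpy as np

MAX_OBJECTS = 61   # objects kept per frame, as stated in the text
PARAMS = 6         # parameters per object, as stated in the text

def pack_detections(detections):
    """Keep the highest-confidence detection per class and pack into 61x6.

    `detections` is an iterable of (class_id, confidence, x0, y0, x1, y1)
    tuples; this parameter order is assumed, not taken from the paper.
    """
    best = {}
    for det in detections:
        class_id, conf = det[0], det[1]
        # Comparable objects: keep only the most confident one per class
        if class_id not in best or conf > best[class_id][1]:
            best[class_id] = det
    kept = sorted(best.values(), key=lambda d: d[1], reverse=True)[:MAX_OBJECTS]
    frame_vector = np.zeros((MAX_OBJECTS, PARAMS), dtype=np.float32)
    for row, det in enumerate(kept):
        frame_vector[row] = det
    return frame_vector

frame_vector = pack_detections([
    (0, 0.90, 10, 20, 50, 80),   # person, high confidence
    (0, 0.50, 12, 22, 48, 79),   # same class, lower confidence: dropped
    (2, 0.70, 5, 5, 30, 30),
])
```

With this fixed-size layout, a 90-frame sequence always yields 90 × 366 values, regardless of how many objects appear in each frame.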
In the third module, the skeleton articulations of the persons selected in the scene are computed. The k frames are provided as input to the OpenPose module. The OpenPose module is a Python library with an embedded CNN trained on the COCO dataset. OpenPose returns its output in the form of a heatmap, Part Confidence Maps (PCMs), and Part Affinity Fields (PAFs). The 90 frames are passed to this module, and OpenPose returns 18 coordinates, out of 135 key points, for each of the two tallest persons selected in the given frame, forming an output vector (P) of 90×2×18 values. This output forms the input to the third multiplicative LSTM layer. The output vector (T3) from the third multiplicative LSTM layer is combined with the outputs of the earlier modules before being passed to the fully connected layer. The output vectors of all the multiplicative LSTM layers are concatenated (T1 || T2 || T3) and used as the input to a fully connected layer. In the final stage, the action is assigned to one of the action classes. The remainder of this section provides a thorough explanation of each module used in the suggested model.
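The concatenation step can be illustrated with a short NumPy sketch. Only the 4096-element size of T1 comes from the text; the sizes chosen for T2 and T3, the number of classes, and the random weights are placeholders assumed for the example, and `fuse_and_score` is a hypothetical name.

```python
import numpy as np

def fuse_and_score(t1, t2, t3, weights, bias):
    """Concatenate the three mLSTM outputs (T1 || T2 || T3) and apply one
    fully connected layer, returning unnormalized class scores."""
    t = np.concatenate([t1, t2, t3])
    return weights @ t + bias

rng = np.random.default_rng(0)
t1 = rng.normal(size=4096)   # feature branch (size from the text)
t2 = rng.normal(size=366)    # object branch (size assumed)
t3 = rng.normal(size=36)     # skeleton branch (size assumed)
num_classes = 101            # e.g. UCF101; placeholder
weights = rng.normal(size=(num_classes, t1.size + t2.size + t3.size))
bias = np.zeros(num_classes)
scores = fuse_and_score(t1, t2, t3, weights, bias)
```

Because the three branches are only joined at this point, each branch can be trained or evaluated separately before fusion.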

A. DATA FUSION
Data are raw facts that have not been processed; after processing, data become useful information. When working on HAR, it is impossible to classify every action a human performs. Humans are constantly performing some activity: even a person who is sitting or standing idle is performing the act of sitting idle or standing idle. Hence, datasets are fused to form a larger dataset that covers many actions. While merging datasets, it is necessary to account for redundant data and differing data types. Using the merge technique, the datasets have been merged into two datasets, one with images of humans performing different actions and another with videos of human actions. The video dataset includes 184680 videos across 281 classes, and the image dataset has 79282 images from 148 classes. This study uses Dasarathy's data fusion technique [34]. The technique is divided into five categories: DAI-DAO, DAI-FEO, FEI-FEO, FEI-DEO, and DEI-DEO. The last of these, DEI-DEO (Decision In-Decision Out), is known as decision-based fusion and helps in fusing decisions. The data fusion technique reduces, to some extent, one of the challenges of HAR, the imbalanced dataset. An imbalanced dataset hinders class separation and evaluation and results in poor model performance. Hence, the fused dataset approach is used in this situation.

B. 3DCNN WITH MULTIPLICATIVE LSTM
The first module is 3D CNN with multiplicative LSTM (Fig. 1), a model that extracts features with a CNN and provides them as input to layers of multiplicative LSTM. LSTM learns long-term and short-term dependencies. The LSTM layer receives the final result of the pooling layer as input, before the dense layer. The CNN model performs convolutions on the input with three filters: 1×1, 3×3, and 5×5. A 3-dimensional convolutional neural network (3DCNN) is similar to a 2-dimensional convolutional neural network, except that in a 3DCNN the kernel can slide in three directions, whereas in a 2DCNN the kernel slides in two directions only. A 3DCNN has two parts, a feature extractor and a classifier. Unlike a 2DCNN, a 3DCNN uses a 3D filter to perform convolutions and produces a 3D volume as output. The layers of a 3DCNN convolve the input by shifting filters vertically, horizontally, and across the depth of the input video frames or 3D picture. The multiplicative LSTM layers receive the outcome of the classification layer as input. A multiplicative recurrent neural network (mRNN) and an LSTM are combined to create the multiplicative LSTM [7]. An LSTM has three gates: input gate a, output gate b, and forget gate c. The previous hidden state $h_{t-1}$ and the input $x_t$ feed the next hidden state of the LSTM, which is given by
$$\hat{h}_t = W_{hx} x_t + W_{hh} h_{t-1} \qquad (1)$$
where $\hat{h}_t$ = current hidden state, $W_{hx} x_t$ = weighted input $x_t$, and $W_{hh} h_{t-1}$ = weighted previous hidden state.
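The three-direction sliding of a 3D kernel described above can be made concrete with a deliberately naive NumPy sketch. It is a single-channel, stride-1, no-padding illustration written for clarity, not the authors' implementation, and `conv3d_valid` is a hypothetical name.

```python
import numpy as np

def conv3d_valid(volume: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a 3D kernel over (depth, height, width) with stride 1, no padding.

    Like most deep learning "convolutions", this is a cross-correlation
    (the kernel is not flipped).
    """
    d, h, w = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    for z in range(out.shape[0]):          # across frames (depth)
        for y in range(out.shape[1]):      # vertically
            for x in range(out.shape[2]):  # horizontally
                patch = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(patch * kernel)
    return out

# 8 frames of 6x6 pixels convolved with a 3x3x3 kernel give a 6x4x4 volume
clip = np.ones((8, 6, 6))
response = conv3d_valid(clip, np.ones((3, 3, 3)))
```

A 2D convolution would loop over `y` and `x` only; the extra loop over `z` is what lets the filter respond to motion across consecutive frames.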
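The three gates of the LSTM, input gate a, output gate b, and forget gate c, are stated as:
$$a_t = \sigma(W_{ax} x_t + W_{ah} h_{t-1}) \qquad (2)$$
$$b_t = \sigma(W_{bx} x_t + W_{bh} h_{t-1}) \qquad (3)$$
$$c_t = \sigma(W_{cx} x_t + W_{ch} h_{t-1}) \qquad (4)$$
where σ = sigmoid function and W = weight vector.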
The relationship between the components of the input gate and output gate determines what data should be stored and what data should be deleted at each transition. The input gate creates an internal state vector called $d_t$ and decides how much input should be sent to each hidden unit. Forget gate c determines how much of the previous internal state $d_{t-1}$ is preserved. The internal state is defined in (5). The output gate b helps preserve relevant information that may not be helpful for the current output but will be useful later. In (5), the internal state vector $d_t$ is obtained by the element-wise product (⊙) of forget gate c and output gate b with the previous internal state $d_{t-1}$ and the current hidden state $\hat{h}_t$. An intermediate state $m_t$ from a multiplicative recurrent neural network is combined with each gate of the LSTM to form the multiplicative LSTM. The max pooling layer is applied to the model before all of these convolutions are concatenated, to improve the feature extraction strategy. The average pooling layer presents the main features of the images before classification. This layer's output is used for processing instead of the output of the entire network. A Softmax activation function, denoted as ȿ, is used to map the output vectors to real numbers between 0 and 1. This activation function yields the normalized distribution in (11) and (12), where c denotes the classes and a the actions.
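For readers who want to see the recurrence end to end, the following is a minimal NumPy sketch of one common formulation of the multiplicative LSTM, written in this section's notation (gates a, b, c, internal state d, intermediate state m). The exact equations used by the authors are not reproduced here, so the update rules, the weight names, and the random initialization should be read as assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mlstm(input_size, hidden_size, seed=0):
    """Random weights for illustration only (not trained parameters)."""
    rng = np.random.default_rng(seed)
    params = {
        "W_mx": rng.normal(0.0, 0.1, (hidden_size, input_size)),
        "W_mh": rng.normal(0.0, 0.1, (hidden_size, hidden_size)),
    }
    for name in ("h", "a", "b", "c"):
        params[f"W_{name}x"] = rng.normal(0.0, 0.1, (hidden_size, input_size))
        params[f"W_{name}m"] = rng.normal(0.0, 0.1, (hidden_size, hidden_size))
    return params

def mlstm_step(p, x_t, h_prev, d_prev):
    # Intermediate state: factorized, input-dependent hidden-to-hidden transition
    m_t = (p["W_mx"] @ x_t) * (p["W_mh"] @ h_prev)
    # m_t takes the place of h_{t-1} in the candidate state and in every gate
    h_hat = p["W_hx"] @ x_t + p["W_hm"] @ m_t
    a_t = sigmoid(p["W_ax"] @ x_t + p["W_am"] @ m_t)   # input gate
    b_t = sigmoid(p["W_bx"] @ x_t + p["W_bm"] @ m_t)   # output gate
    c_t = sigmoid(p["W_cx"] @ x_t + p["W_cm"] @ m_t)   # forget gate
    d_t = c_t * d_prev + a_t * np.tanh(h_hat)          # internal state
    h_t = b_t * np.tanh(d_t)                           # new hidden state
    return h_t, d_t

def run_mlstm(p, sequence):
    """Run the cell over a (frames, features) sequence; return the last hidden state."""
    hidden_size = p["W_mh"].shape[0]
    h, d = np.zeros(hidden_size), np.zeros(hidden_size)
    for x_t in sequence:
        h, d = mlstm_step(p, x_t, h, d)
    return h

# Toy sizes: 90 frames of 8 features, 4 hidden units
summary = run_mlstm(init_mlstm(8, 4), np.ones((90, 8)))
```

The only difference from a plain LSTM step is the first line of `mlstm_step`: the intermediate state lets each input select a different effective hidden-to-hidden transition.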
The softmax function is used to obtain the probability of each action a over the classes c. Each video sequence in the proposed model is processed at a frame rate of 35 Hz over the course of 90 frames. Each frame's 4096-element feature vector is obtained before being sent to the LSTM. The input is normalized using the batch normalization method to scale the pixel values between -1 and 1.
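A short sketch of the two normalizations mentioned here: the softmax that maps class scores to probabilities in (0, 1), and a rescaling of pixel values to [-1, 1]. The linear rescaling of 8-bit pixels shown below is an assumption made for illustration; the text attributes the scaling to batch normalization without giving details.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Map a vector of class scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def scale_pixels(frame: np.ndarray) -> np.ndarray:
    """Linearly rescale 8-bit pixel values from [0, 255] to [-1, 1] (assumed scheme)."""
    return frame.astype(np.float32) / 127.5 - 1.0

probabilities = softmax(np.array([2.0, 1.0, 0.1]))
scaled = scale_pixels(np.array([0, 128, 255], dtype=np.uint8))
```

The predicted action is the class with the largest probability, that is, `int(np.argmax(probabilities))`.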

C. OBJECT DETECTION WITH FINETUNED YOLOV6 AND MULTIPLICATIVE LSTM
You-Only-Look-Once (YOLO) version 6 is a real-time object identification technique that uses a CNN to identify objects in pictures and videos. YOLOv6 is a high-performing, single-stage detector with an effective design. YOLOv6 performs better than all of the earlier iterations of YOLO in terms of both accuracy and inference speed. This paper uses a hidden layer pruning approach that reduces the total number of parameters and the network depth of YOLOv6, making it a lightweight network. Model pruning decreases the detection accuracy, so a transfer learning technique is applied to the optimized YOLOv6 network to improve it [10]. YOLOv6 uses a decoupled head. A network with a decoupled head has more layers in the head part, which contributes to improved performance. The decoupled head receives the neck information directly and uses it for simultaneous objectness, classification, and regression tasks. The YOLOv6 model comprises three components: the backbone, the neck, and the decoupled head. During the training phase, YOLOv6 employs reparameterized VGG blocks with skip connections. The COCO dataset is used to train YOLOv6 [35]. Unlike earlier YOLO versions, the finetuned YOLOv6 model uses two loss functions, Varifocal Loss and Distribution Loss. The Varifocal Loss function is used for classification, and the Distribution Loss function is used for box regression. The Varifocal Loss function uses BCE (Binary Cross Entropy). Distribution Loss depends on the probability of the target box, as discussed in [10].
The finetuned YOLOv6 processes frames at 70 Hz, and the video sequences are limited to 32 FPS, so the detector does not slow the process. This module forms a vector of 61 objects with six parameters per object. If objects of the same type with different confidence scores appear in an image, the object with the highest confidence score is selected. For humans appearing in the image, the tallest humans are chosen. After processing 90 frames with finetuned YOLOv6, the total output is 90 * (61 * 6) = 90 * 366 = 32940 values. These data are batch-normalized with respect to the image's height and width. Fig. 3 shows some of the video frames representing different actions, such as PullUps, Biking, HulaHoop, Skiing, Playing Guitar, JavelinThrow, and others.

D. SKELETON ARTICULATION TECHNIQUE WITH MULTIPLICATIVE LSTM
The last module corresponds to the skeletons of the tallest humans involved in a scene. The structure in this study uses the 18 coordinates that OpenPose returns [36]. OpenPose is a real-time, multi-person human pose identification toolkit, written in Python, that uses 135 key points to identify the human body. OpenPose comprises a CNN model trained on the COCO dataset. These skeletons are helpful when there is a need to see the movement of one or two people in 90-frame sequences. The frames help determine the bounding boxes around people moving in a scene, so there is no need to translate the image. For instance, it makes no difference whether a person walks in the middle of the room or close to a window, because a bounding box will follow him wherever he goes. In the first step, the image is passed through the 3D CNN architecture, which extracts feature maps. Part Confidence Maps (PCMs) and Part Affinity Fields (PAFs) are created by further processing these feature maps [37]. Finally, the PCMs and PAFs are processed by a bipartite matching algorithm that generates the skeletons.
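The selection of the tallest persons can be sketched as below. The input layout, an array of shape (people, 18, 3) holding `[x, y, confidence]` per key point, and the height measure (vertical extent of the visible key points) are assumptions made for the example; `two_tallest` is a hypothetical helper and does not call OpenPose itself.

```python
import numpy as np

NUM_JOINTS = 18  # coordinates per person used in this study

def two_tallest(people, max_people: int = 2) -> np.ndarray:
    """Return the (x, y) joints of the tallest persons as (max_people, 18, 2).

    Height is estimated as the vertical extent of the key points whose
    confidence is positive. Missing persons are left as zeros.
    """
    people = np.asarray(people, dtype=np.float32).reshape(-1, NUM_JOINTS, 3)
    heights = []
    for person in people:
        visible = person[person[:, 2] > 0]
        if len(visible):
            heights.append(float(visible[:, 1].max() - visible[:, 1].min()))
        else:
            heights.append(0.0)
    order = np.argsort(heights)[::-1][:max_people]   # tallest first
    selected = np.zeros((max_people, NUM_JOINTS, 2), dtype=np.float32)
    for slot, idx in enumerate(order):
        selected[slot] = people[idx, :, :2]
    return selected

def toy_person(height: float) -> np.ndarray:
    """Synthetic skeleton: joints spread vertically over `height` pixels."""
    person = np.zeros((NUM_JOINTS, 3), dtype=np.float32)
    person[:, 0] = height                          # x marks which person this is
    person[:, 1] = np.linspace(0, height, NUM_JOINTS)
    person[:, 2] = 1.0                             # all joints visible
    return person

frame_people = np.stack([toy_person(10), toy_person(50), toy_person(30)])
skeleton_vector = two_tallest(frame_people)
```

Stacking this result over 90 frames gives the fixed-size skeleton input described for the third multiplicative LSTM layer.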
Any body part that can be found at any pixel is represented in a 2D confidence map. The confidence map (C) is computed as:
$$C = (C_1, C_2, \ldots, C_j), \quad C_j \in \mathbb{R}^{w \times h} \qquad (14)$$
where j = the number of body part locations.
PAF is computed as follows:
$$P = (P_1, P_2, \ldots, P_x), \quad P_x \in \mathbb{R}^{w \times h \times x}, \; x \in 1 \ldots x \qquad (15)$$
The loss between the predictions and the ground truth of the PCMs and PAFs is calculated using an L2 loss function, where $L^*_C$ = ground truth value for the Part Affinity Field, $P^*_j$ = ground truth value for the Part Confidence Map, and W = a binary mask with W(p) = 0 at pixels p where the annotation is missing, which prevents extra loss from being added at those pixels.
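The role of the binary mask W in the L2 loss can be sketched as follows. The array shapes (a map of height × width × channels and a mask of height × width) are assumptions for the example, and `masked_l2_loss` is a hypothetical name rather than a function from OpenPose.

```python
import numpy as np

def masked_l2_loss(predicted: np.ndarray, ground_truth: np.ndarray,
                   mask: np.ndarray) -> float:
    """Sum of squared differences, skipping pixels where mask == 0.

    predicted, ground_truth: (height, width, channels) maps
    mask: (height, width) binary mask, 0 where the annotation is missing
    """
    squared_error = (predicted - ground_truth) ** 2
    # Broadcast the per-pixel mask over the channel axis
    return float(np.sum(mask[..., None] * squared_error))

predicted = np.zeros((4, 4, 2))
ground_truth = np.ones((4, 4, 2))
mask = np.ones((4, 4))
mask[0, 0] = 0            # unannotated pixel: contributes no loss
loss = masked_l2_loss(predicted, ground_truth, mask)
```

Without the mask, a correct prediction at an unannotated pixel would be penalized as if it were wrong; setting W(p) = 0 removes that pixel from the sum.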

E. COMBINING DATA FUSION, 3DCNN WITH MULTIPLICATIVE LSTM, OBJECT DETECTION WITH FINETUNED YOLOV6 AND MULTIPLICATIVE LSTM AND SKELETON ARTICULATION TECHNIQUE WITH MULTIPLICATIVE LSTM
Datasets are combined with the help of data fusion techniques, as discussed. The proposed model 3DCNLSTM receives an input of 90 frames covering all the activities. Each module processes each frame to obtain a set of features. All the relevant features are filtered and normalized; they comprise 61 objects, the coordinates of two humans, and the 4096-element feature vectors obtained by the CNN. The LSTM processes the generated features and helps reduce the dimensions of the data. The LSTM provides output in the form of three vectors, that is, F(f0, ..., f2047), O0(x0, y0, x1, y1) ... O60(x0, y0, x1, y1), and P(x0, y0, ..., x17, y17), where F is the feature vector, O is the movement vector, and P is the person vector. As indicated in (21), all of these results are concatenated and given as a single input T to a fully connected layer. Dropout layers are also added to the model to avoid overfitting. The rest of the section describes the working of the model mathematically.
Let Df be a vector produced by the 3D CNN model for frame f with dimensions (8 * 8 * 2048). The vector is pooled by the average pooling layer (Ff = avg(Df)). Each video sequence is divided into 90 frames. Hence these 90 frames are transformed into vectors of size (90 * 2048) that are given as input to the LSTM layer. This first LSTM layer produces the output vector
$$T_1 = \mathrm{mLSTM}(\mathrm{vector}(F_0, \ldots, F_{89}))$$
The object detection module provides an output vector Zf with dimensions (b, rx, ry, x), where b represents the bounding boxes, rx and ry are the grid cells that contain objects, and x is the output from finetuned YOLOv6. This module also provides another vector, [ax, ay, w, h, con], where ax and ay are the position of an object, w and h are its width and height, and con is the confidence score. This vector helps identify the objects with the highest confidence score. It acts as a filter that produces 61 object vectors, O0(x0, y0, x1, y1) ... O60(x0, y0, x1, y1), for a single frame f. The second LSTM layer processes these object vectors and produces the output vector T2.
The skeleton articulation module produces a vector Sf with dimensions (h, k, t), where h is the number of humans detected, k is the number of joints in a body part, and t is the data [x, y, con], where (x, y) are the coordinates of the joints and con is the confidence score. This vector is batch normalized to produce the vectors of two persons, Hf = [person0(x0, y0, ..., x17, y17), person1(x0, y0, ..., x17, y17)]. The third and final LSTM layer processes this input and produces the vector
$$T_3 = \mathrm{mLSTM}(\mathrm{vector}(P_0, \ldots, P_{89})) \qquad (22)$$
All the vectors generated by the LSTM layers are concatenated (T1 || T2 || T3) and then given as input to a fully connected layer. Fig. 5 depicts the data flow through the suggested model. In this paper, the proposed model is tested as a whole, and each of the other models is also tested separately to check the efficiency of each module used. The model has just one fully connected layer, which is adequate to produce good results. The inputs as well as the outputs of the model affect how the neurons in this layer act. Dropout layers were employed to prevent overfitting. The experiment was run for 50, 100, 200, and 500 epochs, with 500 epochs producing the best results. Table 2 shows the outcomes of the different models used in this study compared with other models for 500 epochs on the UCF101 dataset. Additionally, the suggested model is compared with the state-of-the-art CNN with Bidirectional LSTM model, and the findings demonstrate that the proposed model outperforms it. The cited CNN with BiLSTM work is based on IoT modules and smartphones; in this study, only its neural network (composed of CNN and BiLSTM) is implemented with the different datasets used for the experiment, and the results are captured. The proposed model is also compared with a deep residual convolutional neural network for human activity recognition [25]. This DTR-HAR model is trained on all the datasets described in this study, and the results showed that DTR-HAR achieved an average accuracy of 90%. Table 3 depicts the results of the various models used in this study compared with other models for 500 epochs on the KITTI dataset. Table 4 depicts the corresponding results on the NTU-RGB-D dataset, Table 5 on the NTU-RGB-D 120 dataset, and Table 6 on the Fused dataset.
Fig. 6(a), (b), and (c) presents the proposed model's total loss vs. total validation loss and total accuracy vs. total validation accuracy as well as the precision-recall curve for multiple classes.
All the evaluations have been obtained at each activity level.An average of all the accuracies, precision, F1 score, and Recall is calculated for evaluating the performance of different models on different datasets.Table 7 shows the results, and Fig. 7 shows the Precision-Recall curve with precision and recall scores.
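The averaging of per-activity scores can be sketched from a confusion matrix as follows. The convention that rows are true classes and columns are predicted classes, and the use of an unweighted (macro) average, are assumptions made for the example; the toy matrix is illustrative and not taken from the reported results.

```python
import numpy as np

def macro_metrics(confusion) -> dict:
    """Accuracy plus macro-averaged precision, recall, and F1.

    Rows are assumed to be true classes and columns predicted classes.
    """
    cm = np.asarray(confusion, dtype=float)
    true_positive = np.diag(cm)
    predicted_total = cm.sum(axis=0)
    actual_total = cm.sum(axis=1)
    precision = np.divide(true_positive, predicted_total,
                          out=np.zeros_like(true_positive),
                          where=predicted_total > 0)
    recall = np.divide(true_positive, actual_total,
                       out=np.zeros_like(true_positive),
                       where=actual_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(true_positive), where=denom > 0)
    return {
        "accuracy": float(true_positive.sum() / cm.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }

# Toy two-class example
scores = macro_metrics([[8, 2],
                        [1, 9]])
```

Macro averaging gives every activity the same weight, which matters when the class sizes are imbalanced.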
A few classes are taken into consideration in order to streamline the outcomes and keep the confusion matrix simple to understand. The classes considered are shown in Table 8; classes are condensed so that the confusion matrix can be displayed properly. The confusion matrix is shown in Fig. 8. The results showed that the proposed model has overcome some of the challenges of HAR, such as occlusions, blurriness, and background complexity. It can be seen that the objects present in the background do not create confusion between human actions, which mitigates the background complexity problem. Further, the person bending in one of the images appears blurred to the naked eye, yet the proposed model is able to detect the exact action of the person, reducing both the occlusion and the blurriness problems.

V. CONCLUSION
The article presents a combination of different techniques in a single neural network named 3DCNLSTM.The study analyses other techniques as well.Several datasets have been used, namely, UCF101, NTU-RGB-D, KITTI, and NTU-RGB-D 120, for comparing various techniques with the proposed model.The data fusion techniques are also used to form a fused dataset that consists of 79282 images belonging to 148 classes and 184680 videos of 281 classes.Classic CNN, YOLOv6, CNN with BiLSTM, and DTR-HAR are also trained on these datasets and are compared with the proposed model.The suggested model outperforms existing models with the highest accuracy.The proposed model combines four techniques, namely, data fusion, feature extraction, object detection, and skeleton articulation techniques in a single neural network.Multiplicative LSTM has been applied to improve the suggested model's effectiveness.In the future, the model can be enhanced further by using deeper convolutional networks for better feature extraction, and this model may be integrated with humanoids.This model can be used to track the activities of old age people living alone in their homes and hence can be used as an assistant for them.

The NTU-RGB-D dataset has 56880 videos of 60 classes, NTU-RGB-D 120 has 114480 video samples of 120 different classes, and UCF101 has 13320 videos of 101 different classes. The KITTI dataset has 7481 training images and 7518 validation images of 69 classes. The discussed datasets are combined to create a single set of data for training the suggested model. After the datasets are combined, the suggested model is trained using 79282 pictures from 148 classes and 184680 video clips from 281 classes. The motivation behind the work is to correctly classify input data from different video sequences into their activity category to enhance video surveillance features and security systems. HAR plays a vital role in classifying various activities performed by subjects in videos. The main objectives and contributions of the work are:
• A unique and novel four-phase model is proposed that combines four different modules into a single neural network. The model utilizes 3D CNN with a multiplicative recurrent LSTM network and a finetuned YOLOv6 model to enhance the classification and object detection process for actions in video sequences.
• YOLOv6 itself is a novel model on which little research has been carried out; hence the finetuned, transfer learning-based YOLOv6 model is used for the object detection module of the proposed model, along with multiplicative LSTM.

FIGURE 1. Architectural design of the suggested model 3DCNLSTM.

FIGURE 2. Feature maps of input images obtained during Xception with transfer learning: (a) input image from the Fused dataset, (b) batch normalized image, (c) feature map at the first layer, and (d) feature map at the last (48th) layer.

• DAI-DAO (Data In-Data Out): In this category, raw data is both input and output. Data fusion is carried out after the data is collected from the sensors. The algorithms used are based on single-image processing.
• DAI-FEO (Data In-Feature Out): Features that help describe the data are extracted from the raw data.
• FEI-FEO (Feature In-Feature Out): In this category, features are both input and output. The data fusion technique is applied to the features to refine them or to obtain new features.
• FEI-DEO (Feature In-Decision Out): At this level, features are used as input, and the result is a set of decisions.

FIGURE 3. Different frames from video sequences.

FIGURE 4. (a) OpenPose skeletons for different frames of different video sequences from the fused dataset, (b) estimation of pose on fused dataset.

FIGURE 6. (a) Total loss and total validation loss of the proposed model, (b) total accuracy and total validation accuracy of the proposed model, and (c) precision-recall curve for multiple classes.

FIGURE 8. (a) Confusion matrix of the proposed model for training, (b) confusion matrix of the proposed model for testing, and (c) confusion matrix showing improvement in each class prediction.
The confusion matrix at training time is shown in Fig. 8(a) and at testing time in Fig. 8(b). The confusion matrix in Fig. 8(c) shows the improvement in each class prediction with the proposed model. The proposed model is also tested in a real-time environment. For real-time analysis, 3DCNLSTM is tested on a YouTube video, and the resulting action recognition is shown in Fig. 9. The figure depicts the pose estimations along with the recognized actions in different situations. The action set was selected randomly for testing in the real-time environment. The video shows a workspace where three people are walking, running, bending, jumping, and sometimes kicking, and it is clear from Fig. 9 that the proposed model identifies the human actions and pose estimations well.

FIGURE 9. Real-time analysis of proposed 3DCNLSTM model on a random video from YouTube.