Human Action Performance Using Deep Neuro-Fuzzy Recurrent Attention Model

Most computer vision work on human actions has focused on recognizing and classifying actions rather than on the intensity with which they are performed. Indexing the intensity, which characterizes the performance of a human action, is challenging because of the uncertainty and information deficiency in video inputs. To address this uncertainty, we couple fuzzy logic rules with a neural action recognition model to rate the intensity of a human action as intense or mild. In our approach, a Spatio-Temporal LSTM generates the weights of the fuzzy-logic model, and we demonstrate through experiments that indexing action intensity is possible. We analyze the integrated model by applying it to videos of human actions performed with different intensities and achieve an accuracy of 89.16% on our generated intensity indexing dataset. The integrated model demonstrates the ability of a neuro-fuzzy inference module to effectively estimate the intensity index of human actions.


Introduction
Recently, action recognition based on supervised deep learning has attracted considerable interest in the computer vision research community due to its numerous applications in video analytics, surveillance, security, sports analysis, and human-computer interaction [1]. Researchers all over the world are studying various techniques to propose models with better performance [2][3][4]. Despite these efforts, the field still poses many challenges, including intra-class variation, viewpoint orientation, occlusion, varying motion speed, and different styles of background clutter. A drawback of the supervised deep learning approach to action recognition is that little attention is given to predicting the intensity of the action [5][6][7][8]. Determining the intensity of an action is crucial in settings such as bullying and violence detection in school, at work, at home, in public areas, and in prison [9][10][11][12]. Intensity indexing can also be used for detecting aggressive behavior in applied behavior analysis (ABA) [13], a proven assessment and treatment model for Autism Spectrum Disorder (ASD) [14] and other severe mental disorders [15]. In the context of ASD, an intensity index can aid caretakers in assessing danger in patients' behavior and prevent serious health consequences such as concussion from head banging [16].
The action intensity index is defined as a measure of kinetic intensity used to determine whether a specific action is performed with high or low intensity. Kinetic intensity is the amount of kinetic power it takes to perform a certain action, and can be applied to the concept of indexing the intensity of human actions [17,18]. The kinetic power of a certain action is directly proportional to the velocity and the mass of the moving object [17]. However, in the context of human activities, which involve the movement of human joints, the kinetic power depends on the velocity of the joints engaged in the main activity [19], as well as the number of joints and the extent to which they are engaged [18,20]: more moving joints utilizing more joint power results in greater kinetic power and intensity. The intensity of human actions cannot be generalized into a single, crisp formula, as it varies from person to person. Intensity is rather a subjective term in which some level of uncertainty is always present, often expressed using imprecise language. Furthermore, measurement inaccuracies are inevitable from a 2D video. Therefore, to measure the intensity of an action from an input video, a mathematical model is required that accounts for such uncertainties and inaccuracies by modelling and minimizing their effects [21].
While deep learning based models can help with learning adaptation and scaling up to more general applications [22], they cannot capture data or model uncertainty [23]. In addition, deep learning based models lack the human-like ability to interpret imprecise information. Fuzzy inference systems, on the other hand, provide an inference mechanism for uncertainty and enable the qualitative interpretation of actions. In the context of intensity indexing, deep learning based models also encounter serious problems when the dataset is biased towards a specific way of performing an action. These models are not able to learn dissimilarities in human motions when actions are performed with various intensities. However, adaptive fuzzy systems can generate membership functions for different types of target action intensities [24]. To enable our system to deal with the uncertainty and varied nature inherent in this application, we propose a hybrid system combining fuzzy logic and deep recurrent neural networks. Such integration has proven effective in a wide variety of real-world problems [25][26][27].
Our proposed methodology is an attentive neuro-fuzzy system designed to recognize qualitative differences in human actions and to self-adapt to different intensities. Inspired by the model proposed by [28], our model utilizes recurrent neural networks to detect actions from spatio-temporal patterns of human poses, in tandem with an adaptive fuzzy inference system to learn the various human motions used to perform actions with different intensities and then estimate the action's intensity.
The integrated model can successfully learn the unique way a specific action with a certain intensity is performed, as well as estimate the intensity of the respective action. Experimental results demonstrate the effectiveness of the integrated model in recognizing action movements of different intensities. To the best of our knowledge, our framework is the first to index the intensity of an action from an input video. Our contributions in this paper are:
• We propose a novel hybrid model based on a fuzzy inference system coupled with a spatio-temporal LSTM action recognition module to jointly determine the intensity index of the recognized action.
• Our work provides a case study on a generated dataset of human actions with two intensity indexes, intense and mild, to evaluate the performance of our model in more fine-grained recognition of actions and intensities.
[Figure caption] These coordinates act as an input to the Spatio-Temporal LSTM, which detects the actual action and generates the attention weights. These weights are the input to the Kinetic Fuzzy Intensity Analysis, which generates the intensity score. This intensity score dynamically updates the fuzzy logic rules and is also used to determine the Intensity Index of the performed action.

Related Work
The related work for the model is based on the deep learning components that are used for data pre-processing (as discussed in the methodology section of this paper) and for training the model on action recognition with supervised learning. Our model leverages deep learning components as well as neuro-fuzzy systems to dynamically generate fuzzy logic rules to detect the intensity of various human actions. To detect actions using human key-point coordinates, the model requires spatio-temporal information about the scene, so the task can also be seen as a time series problem. To address this, much research has been done in the recent past on models that can effectively predict actions from key-point coordinates [29]. Traditional methods relied on hand-crafted features to represent the inter-frame relationship of the key-point coordinate sequence [30][31][32][33][34][35]. Recent studies have utilized deep learning techniques to detect and predict relationships by using spatio-temporal information in a collection of frames. [36] designed a Fine-to-Coarse Deep Convolutional Neural Network (CNN) with fully connected layers that extract the spatio-temporal and spatial features of a key-point coordinate sequence. Furthermore, a 3D-CNN with a 3D filter kernel has also been shown to learn spatio-temporal information [35]. To capture temporal information, research has been done on predicting actions using Recurrent Neural Networks (RNN) based on Long Short-Term Memory (LSTM) or attention models. There have been a few studies in which an RNN-based model has been used to predict actions from human key-point coordinates [37][38][39][40]. Recent research also points to the use of a Convolutional-Recurrent Neural Network (CRNN), where the CNN extracts features from the input frames and its output is fed to an LSTM to extract the temporal dynamics. State-of-the-art results were achieved with the use of a graphical neural network [41,42]. Compared to the graphical neural network and LSTM, CNNs display better results for learning to represent images in terms of key-point coordinates [43,44], but their performance drops when dealing with long spatio-temporal sequences.
While deep learning models can achieve better scalability and can generalize better, they fall short in capturing data uncertainty, subjectivity, and human-like reasoning [22]. Fuzzy logic can capture uncertainty and subjectivity and supports human-like reasoning [45]. The proposed idea is to use fuzzy inference on top of a deep learning action recognition module to index the intensity of the action as either mild or intense [5][6][7][8]. Indexing the intensity is a subjective task and involves a certain amount of uncertainty from individual to individual. It requires adaptive learning, which cannot be obtained by just stacking various modules sequentially.

Methodology
This paper proposes a novel neuro-fuzzy system using recurrent neural networks and fuzzy inference systems, which adaptively performs fine-grained recognition of human action intensity indexes. As shown in Figure 2, our methodology consists of three processing sections: data preprocessing, action recognition, and intensity indexing. First, the preprocessing section transforms the input video of an action into a tensor of the human key-point coordinates over time using a pose detection algorithm. This tensor is then passed to an LSTM network to recognize the human action based on the spatio-temporal patterns in the tensor. The LSTM model is equipped with two self-attention mechanisms [46], one over the time frames and another over the coordinates. The attention weights, along with the coordinates tensor, are then fed to the kinetic fuzzy intensity analysis module, which computes an initial intensity score based on fuzzy entropy measures. The fuzzy inference module converts the intensity score and the attention weights into fuzzy sets using an adaptive membership function. Using the truth values of these fuzzy sets, our methodology defines the fuzzy rules through which the final intensity index is determined. Finally, the spatio-temporal LSTM's loss function is updated with a customized penalty term to further adapt to the distinct movements of intense and mild actions.

Data Pre-processing
Before the raw data can be input into the action recognition module, human key-point coordinates must be generated using a pose estimation technique. Using human key-point coordinates to train the action recognition module helps to reduce the effect of background clutter [37,38]. It also reduces the computational complexity compared to using the entire image/video to train the module [39,40]. We also need to feed the human key-point coordinates to the neuro-fuzzy section for qualitative action recognition. To extract the human key-point coordinates, we use the model described by [47][48][49], which achieves state-of-the-art results on multiple public benchmarks for pose estimation and human key-point detection.
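As a minimal sketch of the tensor that the preprocessing stage hands to the recurrent model, the snippet below flattens the key-points of each frame into one feature vector per time step. The layout (2 persons, 15 joints, 3 coordinates per joint) is an assumption borrowed from the SBU setup described in the experiments, and `flatten_frame` and `video_to_tensor` are hypothetical helper names, not functions of the pose estimator of [47][48][49].

```python
def flatten_frame(frame):
    """Flatten one frame into a single feature vector.

    frame: list of persons, each a list of (x, y, z) joint coordinates.
    """
    return [coord for person in frame for joint in person for coord in joint]


def video_to_tensor(frames):
    """Turn a list of frames into a T x D nested list (T time steps, D features)."""
    return [flatten_frame(frame) for frame in frames]


# Hypothetical clip: 4 frames, 2 persons, 15 joints, 3 coordinates per joint.
dummy_frame = [[(0.0, 0.0, 0.0)] * 15 for _ in range(2)]
tensor = video_to_tensor([dummy_frame] * 4)
print(len(tensor), len(tensor[0]))
```

Each row then has 2 x 15 x 3 = 90 features, one row per time frame.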

LSTM with Spatio-Temporal Attention
To capture the sequential patterns of key-point coordinates, we utilize an LSTM, a type of RNN [50][51][52]. For supervised deep learning, various LSTM models have been developed for action detection [40,43,[53][54][55]. Inspired by Liu's (2016) Spatio-Temporal LSTM model, we apply a similar model that is equipped with a spatio-temporal mechanism to recognize the performed action and learn the exclusive motion patterns of the action. In other words, our LSTM model utilizes two attention mechanisms [56]: attention over the time frames, and attention over the various key-point coordinates. Such spatio-temporal attention helps the model to understand an action despite variation among individuals performing the same action with a certain intensity index, such as walking fast or punching hard. One attention mechanism is implemented on top of the recurrent architecture of the LSTM cells, and the other is implemented across the units of the input and hidden states, so that the model can selectively focus on the time frames as well as on the human key-point coordinates (see Figure 3). These two attention mechanisms indicate how strongly the human key-point coordinates are engaged in the detected action in each time frame. In addition to learning the possible behavioral variation in performing an action, the weights of these two attention mechanisms are used to measure the kinetic intensity score and to determine the fuzzy inference of the intensity index.

Kinetic Intensity Score using Fuzzy Entropy
Once the attention LSTM model is trained to recognize the performed action, we utilize the parameters of the attention vectors along with fuzzy entropy measures to compute an initial intensity score for an instance of the action. This initial kinetic intensity score is used to dynamically generate fuzzy rules that specify the index of the intensity as intense or mild. As shown in Figure 3, the Spatio-Temporal LSTM model is equipped with a self-attention mechanism [46,57] that detects the time frames in which the detected action is happening by extracting a linear combination of the hidden states to the output, and that generates the temporal attention weights. These weights denote the amount of influence of each time frame on the final inference. In other words, they determine whether, and with what confidence level, an action is observed in each time frame. We can utilize the distribution of these weights to measure the intensity of the action. For instance, the faster an action happens, the fewer the time frames in which it is observed, and the resulting distribution of temporal attention weights is proportionally denser. Therefore, the entropy of these attention weights has an inverse relationship with the intensity and speed of an action.
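The inverse relationship between entropy and speed can be illustrated with a small sketch. It uses plain Shannon entropy as a stand-in for the fuzzy entropy of the paper, and the attention scores are made-up values, not outputs of the trained model.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(weights):
    """Shannon entropy (in nats) of a normalized weight vector."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Hypothetical temporal attention scores over 8 frames.
fast_action = softmax([0.0, 0.0, 6.0, 6.0, 0.0, 0.0, 0.0, 0.0])  # short burst
slow_action = softmax([1.0] * 8)                                  # spread evenly

h_fast = shannon_entropy(fast_action)
h_slow = shannon_entropy(slow_action)
print(h_fast < h_slow)
```

Attention concentrated on a few frames (the fast action) yields a lower entropy than attention spread over all frames, which is the property the intensity score relies on.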
The intensity of an action depends on the kinetic energy of the limbs engaged in performing the action, which are represented as key-point coordinates. Shan et al. [19] formulate this kinetic energy through the movement of the key-point coordinates over the video frames. We therefore account for this kinetic energy by adding it to the attention distribution as fuzzy membership weights and computing their fuzzy entropy. Following [19], the weights are the change of the coordinates' locations from the previous frame multiplied by their corresponding attention weights. Using the fuzzy entropy methods of [45,58], we calculate the fuzzy entropy $H_{fuzzy}(a_t, \mu_t)$ of the attention vector, which is inversely related to intensity (Equation 3), where $x_t$ is the input at time frame $t$, $\mu_t$ is the membership weight for $H_{fuzzy}(a_t, \mu_t)$ at time $t$, and $a_t$ is the attention weight over time frame $t$.
Furthermore, the intensity of an action also depends on the number of engaged joints [18]. As a concrete example, an intense punch, in comparison to a mild one, includes the movement of a greater number of joints across more dimensions, such as hip rotation and non-dominant hand movement. Zhang et al. [20] use the same concept to quantify the intensity of human facial actions by the number of engaged coordinates and how much they are engaged. As such, we also extract multi-dimensional attention over the human key-point coordinates. Just as we calculated the fuzzy entropy of the temporal attention, we also calculate the fuzzy entropy of the dimensional attention, which is directly related to intensity (Equation 4). The fuzzy weights are the product of the temporal attention and the dimensional attention over every time frame, where $a_t$, the attention weight over time frame $t$, is the fuzzy weight, and $a_{(j,t)}$ is the attention weight over key-point coordinate $j$ (i.e., human joint) at time frame $t$.
Finally, considering both kinetic energies, which have been used similarly in the literature on action intensity [18][19][20], we formulate intensity as the ratio of the fuzzy entropy in Equation 4 to the fuzzy entropy in Equation 3. In other words, we measure the kinetic intensity as the fuzzy entropy of the attention weights over the coordinates' locations divided by the fuzzy entropy of the attention weights over the time frames (Equation 5), where $I$ is the intensity score and $a_{(j,t)}$ is the attention weight over key-point joint $j$ at time frame $t$.
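The ratio described above can be sketched as follows. This is an illustration under simplifying assumptions: plain Shannon entropy replaces the fuzzy entropy measures, the movement-based membership weights are omitted, and the attention vectors are invented. The function name `intensity_score` and the `eps` guard are hypothetical.

```python
import math


def entropy(weights):
    """Shannon entropy (in nats), used here as a stand-in for fuzzy entropy."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def intensity_score(temporal_att, joint_att, eps=1e-12):
    """Entropy over joints divided by entropy over time frames."""
    return entropy(joint_att) / (entropy(temporal_att) + eps)


# Hypothetical intense action: brief in time, many joints engaged.
intense = intensity_score(
    temporal_att=[0.05, 0.45, 0.45, 0.05],
    joint_att=[0.25, 0.25, 0.25, 0.25],
)
# Hypothetical mild action: spread over time, one dominant joint.
mild = intensity_score(
    temporal_att=[0.25, 0.25, 0.25, 0.25],
    joint_att=[0.7, 0.1, 0.1, 0.1],
)
print(intense > mild)
```

The numerator grows when more joints share the attention, and the denominator shrinks when the action is concentrated in few frames, so both effects push the score up for intense actions.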

Fuzzy Inference for Intensity Indexing
As mentioned in Section 1, intensity is not a precise term, and there is no general formula to measure a crisp value for it. Therefore, after computing the kinetic intensity score, our methodology uses an adaptive fuzzy inference system to detect the intensity index based on both the kinetic intensity score $I$ computed in the previous section and the distribution of the joints' attention weights $q = \{q_j\}$ as motion patterns. These two values are fed to our fuzzy inference system as crisp input values. This procedure is illustrated in Figure 4.

Fuzzification of Intensity Score and Joints' Distribution
Using dynamically learned membership functions, these crisp input values are mapped to fuzzy sets: $I = \{I^{mld}, I^{int}\}$ and $P_j = \{P_j^{int}, P_j^{mld}\}$, which denote the partitioning of the intensity score and of the attention weight corresponding to joint $j$, respectively, into mild and intense regions. Our fuzzy inference system treats these fuzzy sets as rough estimations of the intensity index. The final intensity index output is then computed from these rough estimations using fuzzy logic.
Our model dynamically learns fuzzy membership functions for these fuzzy sets, i.e., $\mu_I$ and $\mu_{P_j}$, based on the previously computed kinetic intensity score and the distribution of the corresponding attention weights. Using the average intensity score and the common triangular shape, the fuzzy membership $\mu_I$ is formulated in Equation 6, where $\mu_I^{mld/int}(I)$ refers to the truth values of $I^{mld}$ and/or $I^{int}$, respectively. $\bar{I}$ is the averaged intensity score, which is updated dynamically. $\sigma$ defines the spread of the fuzzy set; larger values denote that more uncertainty is assumed to exist in the data [21].
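To make the idea of an adaptive triangular membership concrete, the sketch below keeps a running average of the intensity score and evaluates two triangular functions around it. The placement of the triangles (mild centered at the average minus σ, intense at the average plus σ, each with half-width 2σ) is an assumption made for illustration; it is not the paper's Equation 6. `IntensityFuzzifier` is a hypothetical name.

```python
def triangular(x, center, half_width):
    """Triangular membership: 1 at the center, 0 beyond center +/- half_width."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


class IntensityFuzzifier:
    """Maps a crisp intensity score to (mild, intense) truth values."""

    def __init__(self, sigma):
        self.sigma = sigma   # spread of the fuzzy sets
        self.mean = 0.0      # running average of the intensity score
        self.count = 0

    def update_mean(self, score):
        """Incrementally update the running average with a new score."""
        self.count += 1
        self.mean += (score - self.mean) / self.count

    def memberships(self, score):
        """Assumed layout: mild peaks below the average, intense above it."""
        mild = triangular(score, self.mean - self.sigma, 2 * self.sigma)
        intense = triangular(score, self.mean + self.sigma, 2 * self.sigma)
        return mild, intense


fuzzifier = IntensityFuzzifier(sigma=0.5)
for s in [0.6, 1.0, 1.4]:  # hypothetical intensity scores
    fuzzifier.update_mean(s)
print(fuzzifier.memberships(1.4))
```

With this layout a score equal to the running average is equally mild and intense, and a larger σ widens the overlap between the two sets.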
Algorithm 1: Summary of the first stage of our fuzzy inference system, the fuzzification of the intensity score $I$ and the joints' distribution $q = \{q_j\}$. This process is performed for every input video and updates its values to dynamically adapt to different action intensity indexes.
    $\bar{I}$: average intensity score
    $\overline{\Delta H}$: average difference between the cross-entropy of mild and intense distributions
    $C^{int}$: collection of joint attention weights for intense
    $C^{mld}$: collection of joint attention weights for mild
    $p^{int}$: probabilistic joint distribution for intense
    $p^{mld}$: probabilistic joint distribution for mild
    for every input $i$: $(a^i_t, a^i_{(j,t)}, x^i)$ do
        procedure FUZZIFIER($a^i_t$, $a^i_{(j,t)}$, $x^i$)
            $I^i$ is calculated (Eq. 5), and $\bar{I}$ is updated (Eq. 6)
            truth value $\mu_I^{mld/int}(I^i)$ is calculated (Eq. 6)
            *UPDATE-JOINT-DIST (explained at the bottom)
            $q^i = \{q^i_j\}$ is calculated (Eq. 10)
            for every joint $j$ do
                $\Delta H$ is calculated between $q_j$ and $P_j^{mld/int}$
                $\overline{\Delta H}$ is updated
                $\mu_{P_j}^{mld/int}(q_j)$ is calculated (Eq. 11)
            end for
            return $\mu_{P_j}^{mld/int}(q^i_j)\ \forall j$, $\mu_I^{mld/int}(I^i)$
        end procedure
    end for
    procedure *UPDATE-JOINT-DIST($a_{(j,t)}$, $\mu_I^{mld/int}$)
        if the truth value of $I^{mld}$ exceeds that of $I^{int}$ then
            append to $C^{mld}$ (Eq. 7)
            update $p^{mld}$ (Eq. 9)
        else
            append to $C^{int}$ (Eq. 8)
            update $p^{int}$ (Eq. 9)
        end if
    end procedure
The membership function $\mu_{P_j}$ is adaptively computed using the membership function of Equation 6 and the categorized distribution of joints' attention weights, along with a normalized cross-entropy distance. The process works as follows. First, the model stores the intensity scores of every action as well as the relative attention weights of the spatio-temporal LSTM network. Next, by comparing the truth values of $I^{mld}$ and $I^{int}$, we categorize the stored attention weights into the mild and intense collections (Equations 7 and 8), where $i \in \{1, 2, 3, ..., N\}$ is the sample index, $j \in \{1, 2, 3, ..., J\}$ is the index of the human key-point/joint, $N$ is the number of samples, $J$ is the number of joint coordinates, $T_i$ is the number of time frames in the $i$-th sample, and $a_{ij}$ is the corresponding joint attention weight averaged over the time frames.
Every weight is multiplied by the corresponding $\mu_I$ to highlight those with higher certainty. Then, by taking the average over the samples of every action, we derive a customized distribution of joints' weights for each of the intense and mild categories. We calculate the softmax of these weights to convert them into probabilistic distributions (Equation 9). Similarly, we derive a probabilistic distribution of joints' weights for every new input sample (Equation 10). Finally, $\mu_{P_j}(q)$ is computed from a normalized cross-entropy distance between $q_j$ and $(p_j^{int}, p_j^{mld})$ with a triangular shape (Equation 11), where $\Delta H = H(P_j^{int}, q_j) - H(P_j^{mld}, q_j)$, in which $H(P_j^{int}, q_j)$ and $H(P_j^{mld}, q_j)$ are the cross-entropies between the softmax activations of the attention weight distributions of the intense and mild categories (Equation 9) and those of the current input sample (Equation 10). $\overline{\Delta H}$ is the average of these differences over all stored samples, and $\sigma$ defines the spread of the fuzzy set, similar to $\sigma$ in Equation 6.
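The cross-entropy difference $\Delta H$ can be sketched as below. The sketch sums over all joints rather than working joint by joint, and the raw weight vectors are hypothetical; the normalization and the triangular mapping of Equation 11 are not reproduced here.

```python
import math


def softmax(values):
    """Convert raw joint weights into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum_j p_j * log(q_j)."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))


def delta_h(p_int, p_mld, q):
    """Difference H(p_int, q) - H(p_mld, q), summed over joints."""
    return cross_entropy(p_int, q) - cross_entropy(p_mld, q)


# Hypothetical averaged joint weights for the two categories and one input.
p_int = softmax([1.0, 1.0, 1.0, 0.0])
p_mld = softmax([3.0, 0.0, 0.0, 0.0])
q = softmax([0.5, 0.2, 0.2, 0.1])
print(delta_h(p_int, p_mld, q))
```

By construction the difference is antisymmetric in the two category distributions, so swapping mild and intense flips its sign.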
Algorithm 1 summarizes the step-by-step fuzzification procedure. The procedure maintains two average values, of the intensity score and of the cross-entropy difference, which are updated with every input video. It also maintains two collections that store the joint attention weights for the mild and intense categories, which are classified by comparing the truth values of the intensity scores; by taking an average and a softmax, two probabilistic distributions are extracted for the corresponding categories. These collections and their corresponding probabilistic distributions are dynamically appended to and updated with every new input. The procedure takes the time frames' and joints' attention weights as well as the key-point coordinates as input, computes the intensity score $I$ based on Equation 5, updates its average value, maps it to the fuzzy set $I$ using the triangular fuzzy membership function of Equation 6, and updates the collections and their corresponding distributions. Next, it computes $q$ (Equation 10) and the cross-entropy between the mild and intense distributions, updates the cross-entropy average value, and maps $q$ into the fuzzy set $P$ using the fuzzy membership function of Equation 11. Finally, the procedure returns the truth values of $I^{mld}$, $I^{int}$, $P^{mld}$, and $P^{int}$.
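The collection update described for Algorithm 1 might look roughly like the following sketch. The class name `JointDistributions`, the tie-breaking rule (ties go to intense), and the choice to scale by the larger of the two truth values are assumptions made for illustration.

```python
import math


def softmax(values):
    """Convert raw joint weights into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class JointDistributions:
    """Stores joint attention weights per category and derives p_mld / p_int."""

    def __init__(self):
        self.collections = {"mild": [], "intense": []}

    def update(self, joint_att, mu_mild, mu_intense):
        """Append certainty-scaled weights to the category with the higher truth value."""
        category = "mild" if mu_mild > mu_intense else "intense"
        certainty = max(mu_mild, mu_intense)
        self.collections[category].append([certainty * a for a in joint_att])
        return category

    def distribution(self, category):
        """Average the stored weights over samples, then apply softmax."""
        rows = self.collections[category]
        if not rows:
            raise ValueError("no samples stored for category: " + category)
        n_joints = len(rows[0])
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(n_joints)]
        return softmax(mean)


dists = JointDistributions()
dists.update([0.5, 0.3, 0.2], mu_mild=0.2, mu_intense=0.8)
dists.update([0.1, 0.1, 0.8], mu_mild=0.9, mu_intense=0.1)
print(dists.distribution("intense"))
```

Because the collections only grow, the category distributions shift gradually as new videos arrive, which is what lets the membership functions adapt.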

Fuzzy Rules and Inference for Final Indexing
As mentioned in Section 3.4.1, the final intensity index output is inferred by applying fuzzy logic principles to the input sets ($I$ and $P_j$ for all $j$). Specifically, the input sets are passed through IF-THEN fuzzy logic rules, and by combining these rules the final output fuzzy sets are inferred, which denote the intensity index of the performed action.
Initially, an intermediate output set is extracted by a linear combination of the following intermediate fuzzy rules, which we have for every set $P_j$:
$R_j^{mld/int}$: IF $q_j$ is $P_j^{mld/int}$ THEN $q$ is $P^{mld/int}$, weight $= \alpha_j$ (12)
where $R_j^{mld/int}$ is a set of two rules for every joint coordinate $j$, and $P^{mld/int}$ refers to the $P^{mld}$ and/or $P^{int}$ members of the intermediate output fuzzy set $P$ (i.e., $P = \{P^{mld}, P^{int}\}$), which denotes the aggregated categorization of the joints' distribution into mild and/or intense. Each rule $R_j^{mld/int}$ refers to the corresponding joint's individual decision on the aggregated categorization, whose role is weighted by $\alpha_j$. Next, we combine the inferences of these rules using the linear combination of their output fuzzy membership functions [59] to compute the overall membership function of the intermediate output set. This process is an adaptive filter, as the $\alpha_j$ are adaptively learned during training on the intensity indexing dataset [21]. Since the attention weights capture the exclusive patterns of the action motions, the fuzzy rules dynamically adapt to every category and index of actions. The fuzzy membership function of $P^{mld/int}$ is formulated as follows:
$\mu_P^{mld/int}(q) = \sum_j \alpha_j \, \mu_{P_j \cdot R_j}^{mld/int}(q_j)$ (13)
where $\mu_P^{mld/int}(q)$ is the intermediate fuzzy membership function, and $\mu_{P_j \cdot R_j}^{mld/int}(q_j)$ is the fuzzy membership function for every rule, which measures the truth of the relation between $P^{mld/int}$ and every joint's fuzzified distribution set $P_j^{mld/int}$.
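The weighted linear combination of the per-joint rule outputs can be sketched as follows. The function name `aggregate_joint_rules` and all numeric values are hypothetical; in the paper the weights $\alpha_j$ are learned during training.

```python
def aggregate_joint_rules(joint_memberships, alphas):
    """Combine per-joint truth values into the intermediate set P.

    joint_memberships: list of (mild, intense) truth values, one pair per joint.
    alphas: per-joint rule weights alpha_j.
    """
    mild = sum(a * m for a, (m, _) in zip(alphas, joint_memberships))
    intense = sum(a * i for a, (_, i) in zip(alphas, joint_memberships))
    return mild, intense


# Hypothetical truth values for three joints and their rule weights.
memberships = [(0.2, 0.8), (0.6, 0.4), (0.1, 0.9)]
alphas = [0.5, 0.3, 0.2]
mu_p_mild, mu_p_intense = aggregate_joint_rules(memberships, alphas)
print(mu_p_mild, mu_p_intense)
```

A joint with a large $\alpha_j$ dominates the aggregated decision, so the learned weights act as a filter over which joints matter for a given action.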
The final output inference of the intensity index is predicted using the intermediate output set and the fuzzified set of the intensity score (i.e., $\mu_I$), passed through the following final fuzzy inference rules:
$R^{mld}$: IF $I$ is $I^{mld}$ AND $q$ is $P^{mld}$ THEN $y$ is $Y^{mld}$ (14)
$R^{int}$: IF $I$ is $I^{int}$ AND $q$ is $P^{int}$ THEN $y$ is $Y^{int}$ (15)
where $y$ is the final inference value, which belongs to the final index set $Y = \{Y^{mld}, Y^{int}\}$. The AND operation is performed on the fuzzy sets $I^{mld/int}$ and $P^{mld/int}$ using the "AND-type" inference introduced by [28], which comprises a linear combination of the t-norm and s-norm of the truth values $\mu_I^{mld/int}(I)$ and $\mu_P^{mld/int}(q)$, according to Equation 16, where the parameter $\lambda$ is found during learning, along with $\alpha_j$, subject to the constraint $0 < \lambda < 1$.
Algorithm 2: Summary of the fuzzy inference stage, which takes the membership functions of the fuzzified sets from Algorithm 1 and returns the final inference for the intensity index. This procedure is performed for every input and returns the final inference output fuzzy set whose membership function has the maximum truth value.
    $R_j^{mld/int}$: intermediate fuzzy inference rules for every joint $j$ (Eq. 12)
    $\alpha_j$: represents the role of every joint in the final inference (Eq. 11)
    $R^{mld/int}$: final fuzzy inference rules (Eq. 14 and Eq. 15)
    $Y^{mld/int}$: final output sets for mild and intense indexes
    for every input $i$ do
        obtain the truth values $\mu^{mld/int}$ of the fuzzified sets (Algorithm 1)
        for every joint $j$ do
            combination of intermediate inferences:
            $\mu_P^{mld}(q)$ += $\alpha_j \, \mu_{P_j}^{mld}(q_j)$ (Eq. 13, $R_j^{mld}$)
            $\mu_P^{int}(q)$ += $\alpha_j \, \mu_{P_j}^{int}(q_j)$ (Eq. 13, $R_j^{int}$)
        end for
        compute $\mu_Y^{mld/int}$ with the AND-type inference (Eq. 14, Eq. 15, and Eq. 16)
        return the output set with the maximum truth value
    end for
Finally, the intensity index is predicted by comparing the truth values of $Y^{mld}$ and $Y^{int}$, i.e., $\mu_Y^{mld}$ and $\mu_Y^{int}$. Algorithm 2 summarizes the process of our fuzzy inference system, which outputs the final inference for the intensity index based on the input fuzzified sets $I^{mld/int}$ and $P^{mld/int}$ from Section 3.4.1, along with the intermediate and final fuzzy logic rules, $R_j^{mld/int}$ and $R^{mld/int}$ respectively, and the combination of their output fuzzy sets. The output sets of the intermediate rules $R_j^{mld/int}$ are combined using a linear combination method with the weights $\alpha_j$, which express the degree of belief in each rule. The output sets of the final inference rules $R^{mld/int}$ are computed using the AND-type inference of $I^{mld/int}$ and $P^{mld/int}$ [28]. Finally, the intensity index is inferred by comparing the truth values of the final output fuzzy sets $Y^{mld}$ and $Y^{int}$, corresponding to the mild and intense indexes.
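A minimal sketch of the final inference step is given below. It assumes the minimum as t-norm and the maximum as s-norm, and it assumes that $\lambda$ weights the t-norm and $(1 - \lambda)$ the s-norm; the source only states that a linear combination of the two is used, so both choices are illustrative. `and_type` and `infer_index` are hypothetical names.

```python
def and_type(mu_a, mu_b, lam):
    """AND-type aggregation: lam * t-norm + (1 - lam) * s-norm.

    Assumes min as the t-norm and max as the s-norm.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return lam * min(mu_a, mu_b) + (1.0 - lam) * max(mu_a, mu_b)


def infer_index(mu_i, mu_p, lam):
    """Return the index with the highest output truth value, plus all truth values.

    mu_i, mu_p: dicts with 'mild' and 'intense' truth values for I and P.
    """
    mu_y = {k: and_type(mu_i[k], mu_p[k], lam) for k in ("mild", "intense")}
    return max(mu_y, key=mu_y.get), mu_y


# Hypothetical truth values from the fuzzification stage.
index, mu_y = infer_index(
    mu_i={"mild": 0.1, "intense": 0.9},
    mu_p={"mild": 0.3, "intense": 0.7},
    lam=0.6,
)
print(index, mu_y)
```

Values of λ close to 1 make the rule behave like a strict AND, while values close to 0 relax it toward an OR.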

Loss Function Update
The ST-LSTM is initially trained on action recognition data whose samples have similar intensity. However, actions performed with different intensities involve different motion patterns. Consequently, the pre-trained ST-LSTM may pay attention to the wrong joint coordinates once applied to the generated dataset, which has samples of different intensities. Therefore, we add a penalty term to the loss function to force the model to pay attention to the intended joint coordinates by penalizing the wrong attention weights. For this purpose, we use the cross-entropy as a distance between the input joints' distribution of Equation 10 and those of the mild and intense categories computed from Equation 9. As such, the action recognition module of our methodology also adapts to the unique way a certain action-intensity is performed, e.g., 'intense punching' vs. 'mild punching'. The addition of this penalty term, in turn, leads to further adaptation of the kinetic fuzzy intensity score and of the output fuzzy rules. The loss function of the LSTM model with the aforementioned penalty term added is given in Equation 17, where the first log-based term denotes the cross-entropy used as a distance function between the real label and the computed softmax of the final output, i.e., $y$ and $\hat{y}$ respectively, and $l$ is the index of the recognizable actions considered in the model. The penalty term is added through the Lagrange multiplier $\lambda$ [60-63], which increases with the number of input samples related to every action. The second log-based term is the cross-entropy penalty: $q$ denotes the input's distribution of attention weights over the joint coordinates (Equation 10), $p$ denotes those of the mild and intense categories (Equation 9), and $j$ is the index of the human key-point/joint coordinates. The loss function in Equation 17 enables the model to distinguish mild and intense actions. It improves action recognition accuracy when the action dataset includes mild and intense intensity indexes.
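The structure of the penalized loss can be sketched as below. The exact form of Equation 17 is not reproduced; the sketch assumes the total loss is the classification cross-entropy plus $\lambda$ times the cross-entropy between a category distribution $p$ and the input distribution $q$. `penalized_loss` and all numeric values are hypothetical.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """-sum(target * log(predicted)), with a small floor to avoid log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def penalized_loss(y_true, y_pred, p_category, q_input, lam):
    """Classification cross-entropy plus a weighted attention penalty (assumed form)."""
    classification = cross_entropy(y_true, y_pred)
    penalty = cross_entropy(p_category, q_input)
    return classification + lam * penalty


# Hypothetical one-hot label, prediction, and joint attention distributions.
y_true = [0.0, 1.0, 0.0]
y_pred = [0.1, 0.8, 0.1]
p_intense = [0.7, 0.2, 0.1]
loss_close = penalized_loss(y_true, y_pred, p_intense, [0.7, 0.2, 0.1], lam=0.5)
loss_far = penalized_loss(y_true, y_pred, p_intense, [0.1, 0.2, 0.7], lam=0.5)
print(loss_close < loss_far)
```

With the classification term held fixed, attention that deviates from the category distribution incurs a larger loss, which is the pressure the penalty term is meant to apply.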

Experiments
In this section, we elaborate upon: (i) our experimental setup, (ii) results of these experiments, (iii) the experimental discussions, in which we discuss the performance and limitations of our experiments.

Generated Dataset for Intensity Indexing
To evaluate the intensity indexing scheme and the whole integrated model, we generated an additional dataset of human actions with two intensity indexes: (i) intense and (ii) mild. In part, our objective was to minimize the data requirements for our model; therefore, we employed a fuzzy system for action intensity indexing, which requires only a small amount of data. The choice of a fuzzy system allows for the use of the pre-existing SBU dataset, which is small compared to the UCF101 [64] and NTU RGB+D [65] datasets. The SBU dataset contains 8 classes, 3 of which (exchanging object, departing, and shaking hands) cannot be differentiated as mild or intense due to the limitations of the model. Therefore, we extract the other 5 classes: approaching, punching, kicking, hugging, and pushing. For each of these classes, we generated 100 intense and 100 mild videos, so the generated dataset includes 1000 samples. The classification of action intensity is subjective in nature, because the perception of action intensity varies from person to person and depends on physical attributes (e.g., sex, age, height, BMI). With that in mind, for our generated dataset annotation, we asked students with similar physical attributes to perform each action with a certain intensity, and the classification was based on the subject's own perception of their performed actions. Generating more clusters (mild, medium, intense) for intensity indexing requires more data. Because we had to generate our own dataset with mild and intense indexes to evaluate the intensity indexing scheme, we decided to keep just two clusters for simplicity. More clusters can be added in future work.

Spatio-Temporal LSTM
First, we utilize the SBU Kinect Interaction dataset [66], which is used for 3D classification of human key-point coordinates into an action class, to train the spatio-temporal LSTM. Each video in the SBU dataset is restricted to 2 people, and each person has 15 joints targeted as key-point coordinates in each frame. Following Yun et al. [66], we apply a 5-fold cross validation scheme to evaluate the action recognition module. As shown in Table 1, our ST-LSTM model with attention mechanism improves on the accuracy of the model of [37] and achieves state-of-the-art performance on the SBU dataset.
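For readers unfamiliar with the evaluation scheme, a 5-fold split can be sketched with the standard library alone. The helper name `k_fold_indices`, the seed, and the sample count of 200 (the per-class size of the generated dataset) are illustrative choices, not details of the authors' pipeline.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(200, k=5):
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging the five test scores uses every sample once for evaluation.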

Action Recognition
We use the trained Spatio-Temporal LSTM module and fine-tune it with the additional dataset, which consists of actions performed with various intensities. Since mild and intense actions are performed differently in terms of motion patterns, the accuracy of the ST-LSTM module drops significantly, by up to 19%. Therefore, we dynamically update the LSTM model with the results of the fuzzy inference system, according to Equation 17, and re-evaluate it on the generated dataset to see how the accuracy of the action recognition module is influenced by the integration of the LSTM and intensity indexing modules. Table 2 reports the re-evaluation results on our generated dataset, showing an average decrease of 2.75% in overall accuracy.

Intensity Indexing
We use the additional generated dataset to evaluate the performance of our intensity indexing methodology. Table 3 shows the action intensity indexing performance of our model on the generated dataset. We considered the fuzzy inference rules in Equation 14 and Equation 15 separately to measure the F1 score, and report the averaged results. Due to the strictness of the fuzzy inference rules, the precision of the intensity indexing is comparatively higher than the other metrics. Using both fuzzy rules jointly would yield higher precision; however, some samples would then not be classified as either intense or mild.
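The trade-off between precision and unclassified samples can be illustrated with a small sketch. The labels, the use of `None` to mark a sample that neither rule classifies, and the function name `precision_recall_f1` are illustrative assumptions and not part of the proposed model.

```python
def precision_recall_f1(y_true, y_pred, positive="intense"):
    """Precision, recall and F1 for one class.

    A prediction of None means that no rule fired and the sample
    stays unclassified (an assumption made for this illustration).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    # Unclassified positives count as misses: they lower recall, not precision.
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = ["intense", "intense", "intense", "mild", "mild", "mild"]
# Strict rules: only confident predictions, two samples left unclassified.
y_strict = ["intense", "intense", None, "mild", "mild", None]
# Lenient rules: every sample gets an index, at the cost of one false positive.
y_lenient = ["intense", "intense", "mild", "mild", "mild", "intense"]
```

On this toy data the strict predictions reach a precision of 1.0 for the intense class with a recall of 2/3, while the lenient predictions classify every sample but reach a precision of only 2/3.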
Actions like hugging and approaching are hard to separate into intense and mild. The fuzzy module of the proposed system takes as input the attention weights generated by the spatio-temporal LSTM. These attention weights are of two types: one over the time frames, and another over the key-point coordinates in every frame. The key-point coordinates for approaching and hugging do not differ much between the intense and mild classes, resulting in similar attention weights for both intensity indexes, which lowers the accuracy of the model, as seen in Table 3.
We further compare our methodology with multi-task learning baselines implemented on top of our ST-LSTM to understand the role of the fuzzy kinetic analysis. The evaluation results in Table 4 demonstrate the significance of the kinetic fuzzy intensity analysis and indexing modules of our methodology. As in the evaluation of the action recognition module, we use 5-fold cross validation to evaluate the intensity indexing algorithm for each action class, since k-fold cross validation is suited to evaluating a model with limited data samples. In our 5-fold cross validation we use 4 folds for training and the remaining fold for testing. For the baselines, the input features to the SVM and DNN are the attention weights produced by the two attention mechanisms described above: one over every frame in the video and another over the key-point coordinates in every frame. The SVM, regression, and fuzzy modules classify the action intensity as intense or mild, whereas the ST-LSTM recognizes the action; together, the output is the action plus its nature in terms of intensity. Regression and SVM were included to compare their performance with the fuzzy module on the task of intensity indexing from the attention weights generated by the ST-LSTM.
As mentioned in Section 3.4.1, our model dynamically learns the motion of every index of each action category through the weighted distribution of joints corresponding to the attention weights of the LSTM module. Figure 5 shows these distributions for the actions of punching and kicking with mild and intense indexes. We fit a generalized bell function [69] to these distributions by assigning 1.0 to them if the intensity score is above the average Ī and the cross entropy with the intense distribution is less than that with the mild distribution, 0.5 if the intensity score is less than the average but the cross entropy with the intense distribution is still less, and 0 otherwise. The membership functions in Figure 5 are fitted to the joints' attention weights extracted from the generated dataset and show, at the aggregate level, the distinct distributions of these weights for intense and mild actions. The figure illustrates the difference between mild and intense actions in terms of the joints' movement and the weight with which the action recognition module attends to them. As shown in the figure, the attention weights for intense actions tend to have higher variance, whereas for mild actions they are rather dense around the average value. In addition, intense actions tend to have a greater number of joints with significant attention weights, while for mild actions the attention weight of only one joint has a significant value and the rest have trivial values below 0.2. This stands to reason, since our model generates fuzzy membership values of the intensity indexes for every joint's motion patterns and uses them for the final indexing inference.
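For reference, the standard form of the generalized bell membership function and the target assignment described above can be sketched as follows. This is a minimal sketch under stated assumptions: the parameter names `a` (width), `b` (slope) and `c` (center) follow the common textbook convention, and the helper `membership_target` and its argument names are hypothetical, introduced only for illustration.

```python
def generalized_bell(x, a, b, c):
    """Generalized bell membership: 1 / (1 + |(x - c) / a| ** (2 * b))."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def membership_target(intensity_score, mean_score, ce_intense, ce_mild):
    """Target value assigned to a distribution before fitting.

    ce_intense and ce_mild are the cross entropies with the intense
    and mild distributions, respectively (hypothetical argument names).
    """
    closer_to_intense = ce_intense < ce_mild
    if closer_to_intense and intensity_score > mean_score:
        return 1.0
    if closer_to_intense and intensity_score < mean_score:
        return 0.5
    return 0.0
```

The function peaks at 1 when `x` equals `c` and falls to 0.5 at a distance of `a` from the center, so `a` controls the width of the plateau and `b` the steepness of its shoulders.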

Restrictions
In this subsection, we discuss the restrictions of our fuzzy recurrent attention model in detail, i.e. false positives, false negatives and explainable cases of misclassification. The action recognition module has a separate pre-processing unit whose function is not to achieve a contextual understanding of the scene but merely to extract the human key-point coordinates. Therefore, the model has a limited understanding of the scene, which does not include details such as camera angle, the subjects' directions, etc. Figure 6 demonstrates such weaknesses of the action recognition module with concrete examples of samples that were misclassified due to the limited contextual understanding of the scene.
In Figure 6a, the action performed is hugging, but the model failed to predict it. Since our model is trained on, and predicts from, human key-point locations, it could not recognize the action as hugging; had both subjects been facing one another, the prediction could have been correct. In Figure 6b, the two humans are shaking hands, but because the camera angle is not sideways, the key-point locations do not indicate a handshake and the model fails to recognize the action. As for the intensity indexing unit, the temporal distribution of attention weights over the time frames does not distinguish between the key points of the individual subjects performing the action, while the speed of action, in Equation 3, refers to actions performed by each individual subject. As such, the model might malfunction in cases where two subjects perform an action that had previously been performed by a single subject. As a concrete example, in Figure 6c, the two humans are walking slowly toward each other, but since the key points of both humans are approaching each other fast, the temporal attention over the time frames indicates an intense action, whereas in reality the action performed was mild.
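The effect in the last example can be illustrated with a toy calculation. The sketch below is not Equation 3 and does not reproduce the attention mechanism; the one-dimensional positions, the function `mean_speed`, and all numeric values are assumptions chosen only to show why the gap between two slowly moving subjects can shrink twice as fast as either subject moves.

```python
def mean_speed(positions, dt=1.0):
    """Average absolute displacement per unit time along one axis."""
    steps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(steps) / (len(steps) * dt)


# Two subjects walking toward each other, each at 0.5 units per frame.
subject_a = [0.0, 0.5, 1.0, 1.5, 2.0]
subject_b = [10.0, 9.5, 9.0, 8.5, 8.0]
gap = [b - a for a, b in zip(subject_a, subject_b)]

per_subject = mean_speed(subject_a)  # speed of a single subject
closing = mean_speed(gap)            # speed at which the gap shrinks
```

With these values, `closing` is twice `per_subject`, so a measure that reacts to how quickly the two sets of key points converge can register fast motion even though each subject moves slowly.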

Conclusion
In this paper, we incorporate fuzzy logic based inference into neural-based action recognition systems to tackle the task of intensity indexing from video inputs. We propose a hybrid model that combines a spatio-temporal LSTM network equipped with an attention mechanism with kinetics and fuzzy logic concepts and a fuzzy inference system. Our research shows that the implemented fuzzy logic component is able to handle the uncertainty inherent in interpreting action intensity from a video. The model achieved a testing accuracy of 89.16% on our generated dataset for the task of intensity indexing. We were also able to determine the dynamic fuzzy logic rules that detect the intensity index for different action classes.
In the future, to help detect aggressive and bullying behavior, we suggest using an enhanced version of our fuzzy recurrent attention model to perform action recognition with more classes of actions, involving one or multiple objects which are intra-related. By enhanced version, we mean future work in which we intend to make the model more sophisticated by adding more action classes and more intensity indexes than the two used in the current version (intense and mild). Along with these modifications, we also aim to improve the adaptive learning of the model.

Figure 1: The distribution of attention weights (over joints and time) changes with the intensity index of the performed action. The line plot indicates attention over time frames and the bar plot indicates attention over the joint movements. Even though both subjects are performing the same action, the distributions of attention weights are different.

Figure 2: The camera frames act as an input to the data pre-processing stage where the human key-point coordinates are generated.

Figure 3: The attention over human key-points identifies the body parts involved in the action, and the attention over time frames identifies the frames in which the action is being performed. The attention weights also show the significance of the respective frames for recognizing the action.

Figure 4: The procedure for our fuzzy inference system. The crisp input values are mapped into fuzzy sets through fuzzification processes, i.e. fuzzifiers. Next, through intermediate rules and a linear combination of the fuzzy inputs, an intermediate inference is computed. The linear combination uses adaptively trained weights, i.e. an adaptive filter. The final inference is performed on the intermediate output set and the fuzzified intensity score flowing through the final inference rules.

Figure 5: We fit the generalized bell membership function to these distributions by assigning a membership score of 1 if the detected index is intense, and 0.5 if the distribution is closer to the intense category but the final intensity index is not detectable, i.e. the intensity score is below the threshold. The attention weights are multiplied by the temporal attention of the time frames.

Figure 6: Examples of a false negative, an incorrect prediction, and a false positive due to poor camera angle, produced by the Spatio-Temporal LSTM model on test videos.

Table 2: The re-evaluation results of our action recognition model on the generated dataset. The loss function has been updated according to Equation 17, which helps keep the model performance from decreasing significantly.

Table 3: Experimental results of the intensity indexing performed on the generated dataset. There is only a subtle difference between the mild and intense variations of the hugging and approaching classes, which causes the accuracy to drop.

Table 4: Baseline model comparison. Experimental results on the generated dataset show that the fuzzy module outperforms multi-task learning baselines implemented using regression, SVM, and DNN on top of the ST-LSTM.