An LSTM-Based Approach for Understanding Human Interactions Using Hybrid Feature Descriptors Over Depth Sensors

Over the past few years, automatic recognition of human interactions has drawn significant attention from researchers working in the field of Artificial Intelligence (AI). Feature extraction is one of the most critical tasks in developing efficient Human Interaction Recognition (HIR) systems, and recent research in computer vision suggests that robust features lead to higher recognition accuracies. Hence, an improved HIR system has been proposed in this paper that combines 2D and 3D features extracted using machine learning and deep learning techniques. These discriminative features result in accurate classification and help avoid misclassification of similar interactions. Ten keyframes have been extracted from each video to reduce computational complexity. Next, these frames have been preprocessed using image normalization and noise removal techniques. The Region Of Interest (ROI), which contains the two humans involved in the interaction, has been extracted using motion detection. Then, the human silhouettes have been segmented using the GrabCut algorithm. Next, the extracted silhouettes have been converted into 3D meshes and their heat kernel signatures (HKS) have been obtained to extract key body points. A Convolutional Neural Network (CNN) has been used to extract full-body features from 2D full-body silhouettes. Then, topological and geometric features have been extracted from the key body points. Finally, the combined feature vector has been fed into a Long Short-Term Memory (LSTM) network and each interaction has been recognized using a Softmax classifier. The proposed system has been validated via extensive experimentation on three challenging RGB+D datasets. Recognition accuracies of 91.63%, 90.54%, and 90.13% have been achieved with the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets respectively.
The results of extensive experiments performed on the proposed system suggest that it can be used effectively for various applications, such as security, surveillance, health monitoring, and assisted living.


I. INTRODUCTION
The task of Human-Human Interaction (HHI) recognition involves detecting and understanding social interactions between two humans. These interactions can be everyday activities like talking, passing objects, hugging, and waving. Similarly, these can be assisted living activities such as helping a person stand up, helping another person walk, or drawing another person's attention. Moreover, suspicious activities including touching someone's pocket, pushing someone, or fighting are also of interest for researchers in this field. HIR has become a trending topic in the field of artificial intelligence because of its wide range of applications, including security [1]-[3], content-based video retrieval [4]-[6], healthcare [7]-[11], and surveillance [12]-[15].
(The associate editor coordinating the review of this manuscript and approving it for publication was Antonio Piccinno.)
Even though significant progress has been made in this regard and many efficient HIR systems have been developed for various purposes, detecting human interactions remains challenging for multiple reasons, such as different viewpoints, changes of clothing, poor lighting, different interactions containing similar motions, and the unavailability of large datasets. However, low-cost depth sensors, such as Microsoft Kinect [16], are now being used extensively since these are not as affected by lighting conditions as RGB cameras. Moreover, many interactions seem similar and are often misclassified. For example, two humans exchanging a very small object may look very similar to two people shaking hands. Conversely, the same interaction appears different when viewed from different viewpoints. Hence, it is very important to extract distinctive features from images that can differentiate between two interactions that look the same.
This research paper proposes a novel approach for efficient video-based human interaction recognition using both machine learning and deep learning techniques. Human silhouettes have been extracted from both RGB and depth frames using GrabCut. Additional masking has been used to improve the output of the GrabCut algorithm in complex scenarios. Next, the full-body RGB and depth silhouettes have been fed into two separate CNN models and the extracted features have been concatenated. Then 3D meshes have been generated from the full-body silhouettes and their heat kernel signatures have been obtained. These have been used to extract six key body points, from which topological and geometric features have been extracted. The three different types of features have been combined and fed into an LSTM. Finally, the Softmax classifier has been used for interaction recognition. Three publicly available datasets have been used that provide RGB, depth, and skeletal information of human interactions. The major contributions of this research work include:
• Silhouette segmentation from both RGB and depth images using the GrabCut algorithm.
• Training and concatenation of two separate CNN models for RGB and depth images.
• Detection of key points via heat kernel signatures based on geodesic distance.
• Extraction of topological and geometric features using key body points.
• Extensive experimentation on three large and challenging RGBD video datasets.
Section II of the paper describes related research work, and the proposed system architecture is discussed in Section III. Section IV presents the implementation details and results of the proposed method. Section V contains a discussion of various aspects of the designed system, and Section VI concludes the paper and proposes future work.

II. RELATED WORK
Researchers have been actively contributing to the development of efficient HIR systems. The existing systems have been divided into two categories: marker-based systems and video-based systems. Research falling into each category is discussed in detail below.

A. MARKER-BASED HIR SYSTEMS
In marker-based HIR systems, different types of sensors, for example, reflective spheres, light-emitting diodes, and infrared markers, are mounted on the bodies of the humans whose movements are being monitored. These systems are commonly used for rehabilitation treatments [17]. For example, a marker-based motion tracking system is proposed in [18] to analyze the movement of various body parts. The authors have argued that accurate detection of movement of different parts can result in better therapeutic decisions. However, the system was evaluated on a small dataset of only 10 real patients. Similarly, the authors in [19] attached an IR camera and an infrared emitter with a passive hand skateboard training device for conventional upper limb training. The proposed device was used to train eight patients with abnormal upper limb function. After four weeks of training, all the patients were able to move the hand skateboard along the designated 'figure of eight' path.
Capturing body movements is also critical for sports. Hence, researchers have used marker-based sensors for movement detection in walking gait [20], discus [21], dressage [22], and swimming [23] activities. Esfahani et al. [24] developed a trunk motion system (TMS) using printed body-worn sensors (BWS). Twelve BWSs were printed on stretchable clothing to measure the 3D trunk movements and a neural network data fusion algorithm was used to integrate the data from sensors. However, one shortcoming of these marker-based techniques is that they require the installation and calibration of multiple cameras. Hence, these systems are quite expensive. Moreover, they can only encode two-dimensional motion information.

B. VIDEO-BASED HIR SYSTEMS
In video-based HIR systems, video cameras are used to record human interactions. In such systems, the first step is to extract important features or interest points [25], [26]. Based on these distinctive features, the interaction that has been performed in the video is identified. Khan et al. [27] proposed a deformable part-based modeling technique to detect the body parts of a patient and track them in subsequent frames. Their system then performed movement analysis to detect various movement disorders in infants. They captured the data in a local hospital using Microsoft Kinect but it was only RGB data. Khan et al. [28] proposed a system for analyzing a patient's body movements during Vojta therapy. They proposed the use of color features and pixel locations for segmenting the patient's body in the images. Then they employed a multi-dimensional feature vector to classify the correct movements using multiclass SVM.
Some researchers [29], [30] also prefer extracting various features and then combining them since hybrid features have yielded better classification results in the past. For example, Jalal et al. [31] combined four different types of features including blobs, multiple orientations, Fourier transforms, and geometrical points. Similarly, the hybrid features introduced in [32] included energy, sine, distinct body parts movements, and 3D Cartesian views of smoothing gradients. The authors of [33] also used a hybrid of four different local descriptors: spatio-temporal features, energy-based features, shape-based angular and geometric features, and Motion-Orthogonal Histograms of Oriented Gradients (MO-HOG). However, all these approaches only employed 2D features.

III. THE PROPOSED APPROACH
The proposed system can be divided into four main sections: image preprocessing, image segmentation, feature extraction, and interaction recognition. The methodologies used in each section and their results are discussed in detail below. Fig. 1 shows a flow chart of the proposed system architecture.

A. IMAGE PREPROCESSING
The RGB videos taken from the NTU RGB+D dataset have been converted into image frames at the rate of 31 frames per second. The image frames of the other two datasets were already available. Since there are multiple videos for each interaction class and each video consists of a large number of image frames, 10 keyframes have been extracted from each video to reduce complexity. The extracted frames have been normalized and noise has been removed from them.
Finally, regions of interest (ROI) have been extracted from each frame. These four subsections are explained in detail below. Moreover, Algorithm 1 explains each step of the preprocessing stage.

1) KEYFRAME EXTRACTION
The number of frames varies from video to video. So, to get a fixed number of frames, 10 keyframes have been extracted from each video of every dataset. To extract the keyframes of a video, the histograms of all the image frames have been obtained. The histogram of an image $x$ can be computed using (1).
$$p_x(i) = \frac{n_i}{N} \tag{1}$$
where $n_i$ is the number of pixels with intensity $i$ and $N$ is the total number of pixels in the input image. Then the histograms of every two consecutive frames have been compared and their differences have been stored in a sorted array. The indices corresponding to the top ten differences have been fetched and the images at those indices are referred to as keyframes. In other words, these frames are the ones with the highest differences in their histograms.
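The keyframe selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, frames are assumed to be flat lists of integer gray levels, and the sum of absolute bin differences is an assumed choice of histogram distance, since the text does not name one.

```python
def histogram(pixels, levels=256):
    """Normalized intensity histogram: p(i) = n_i / N."""
    counts = [0] * levels
    for value in pixels:
        counts[value] += 1
    total = len(pixels)
    return [c / total for c in counts]


def histogram_difference(h1, h2):
    """Sum of absolute bin differences (an assumed distance measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def select_keyframes(frames, k=10, levels=256):
    """Indices of the k frames whose histogram differs most from the previous frame."""
    hists = [histogram(f, levels) for f in frames]
    # Each difference between frame j and frame j + 1 is attributed to frame j + 1.
    diffs = [(histogram_difference(hists[j], hists[j + 1]), j + 1)
             for j in range(len(hists) - 1)]
    top = sorted(diffs, reverse=True)[:k]
    return sorted(index for _, index in top)


# Toy video: five 4-pixel grayscale frames with 4 intensity levels.
frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],  # small change
    [3, 3, 3, 3],  # large change
    [3, 3, 3, 3],  # no change
    [0, 0, 3, 3],  # medium change
]
keyframes = select_keyframes(frames, k=2, levels=4)
print(keyframes)
```

Attributing each difference to the later frame of the pair is a design choice made here for illustration; the text only states that the frames at the indices of the largest differences are kept.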

2) IMAGE NORMALIZATION
The purpose behind image normalization is to change the pixel values of an image to a common scale so that the image appears more normal to the senses. The depth images in two out of the three datasets used are too dark to be seen by the naked eye. Moreover, features on drastically different scales can be problematic for an HIR system. In other words, features with a larger scale will dominate others and cause the system to make inaccurate assumptions. Hence, all images have been normalized. Each pixel $x_i$ in the normalized image is computed from the corresponding pixel $I_{org}(i)$ of the original image using (2).
$$x_i = \frac{I_{org}(i) - E(I_{org})}{\sqrt{Var(I_{org})}} \tag{2}$$
where $E(I_{org})$ and $Var(I_{org})$ are the mean and variance of the original image $I_{org}$.
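As a worked illustration, the sketch below applies zero-mean, unit-variance normalization to a small list of pixel values. It assumes that the normalization subtracts the image mean and divides by the standard deviation, which is the form suggested by the mean and variance terms above; the function name and the sample values are hypothetical.

```python
import math


def normalize(pixels):
    """Subtract the image mean and divide by the image standard deviation."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(variance)
    if std == 0:
        # A constant image carries no contrast; map every pixel to zero.
        return [0.0] * len(pixels)
    return [(p - mean) / std for p in pixels]


# A "dark" image whose values occupy only a narrow low range.
dark = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = normalize(dark)
print(normalized)
```

After normalization the values have zero mean and unit variance regardless of how dark the original image was, so images from different sensors end up on a common scale.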

3) NOISE REMOVAL
The technique of "non-local means denoising" has been used to remove noise from the images. Local-means filters replace the value of a pixel with the mean of a group of pixels surrounding it. However, a non-local means filter takes the weighted mean of all the pixels in the image. The weight of each pixel depends on how similar it is to the target pixel. A pixel $u(p)$ at point $p$ in the denoised image is obtained from the pixels $v(q)$ at all points $q$ in the original image, as defined by (3).
$$u(p) = \frac{1}{C(p)} \sum_{q} v(q)\, f(p, q) \tag{3}$$
where $f(p, q)$ is the weight and $C(p)$ is a normalization factor defined by (4).
$$C(p) = \sum_{q} f(p, q) \tag{4}$$
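To make the weighting concrete, the following sketch applies non-local means to a one-dimensional signal. It is a simplified illustration under stated assumptions: the weight f(p, q) is taken to be a Gaussian of the squared distance between small patches around p and q, and the parameters `patch_radius` and `h` are hypothetical names for the patch size and filtering strength. A practical system would use an optimized 2D implementation instead.

```python
import math


def nlm_denoise_1d(signal, patch_radius=1, h=10.0):
    """Non-local means on a 1D signal: each output is a weighted mean of all samples."""
    n = len(signal)

    def patch(center):
        # Clamp indices at the borders so every patch has the same length.
        return [signal[min(max(center + k, 0), n - 1)]
                for k in range(-patch_radius, patch_radius + 1)]

    patches = [patch(i) for i in range(n)]
    denoised = []
    for p in range(n):
        weighted_sum = 0.0
        norm = 0.0  # C(p), the sum of all weights for this pixel
        for q in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(patches[p], patches[q]))
            weight = math.exp(-dist2 / (h * h))  # f(p, q), assumed Gaussian
            weighted_sum += weight * signal[q]
            norm += weight
        denoised.append(weighted_sum / norm)
    return denoised


# A noisy step: low values followed by high values.
noisy = [10, 12, 9, 11, 50, 52, 49, 51]
clean = nlm_denoise_1d(noisy, patch_radius=1, h=10.0)
print([round(v, 2) for v in clean])
```

Because samples on the other side of the step have very different patches, they receive almost no weight, so the edge is preserved while the noise within each plateau is averaged out.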

4) ROI EXTRACTION
The regions of interest have been extracted from images through motion detection. Frame differencing technique has been used to detect motion in subsequent frames and rectangular boxes have been drawn over the points where motion has been detected. Since different body parts show different movements, rectangular regions of various sizes for different body parts have been obtained. Each region has a starting point x, y, width w, and height h. A minimum area condition has also been set for these rectangular regions to be considered valid. The minimum and the maximum values of x and y and the maximum values of w and h have been obtained. Finally, one rectangular region comprising all the smaller regions has been extracted as the region of interest. Its starting position is the minimum value of x and y obtained from all the smaller regions and its width is equal to the maximum value of x added to the maximum value of w. Similarly, its height is equal to the maximum value of y added to the maximum value of h.
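The merging of the smaller motion rectangles into one region of interest can be sketched as follows. This is an illustrative variant rather than the exact rule stated above: it computes the tight enclosing rectangle from the right and bottom edges of the valid boxes, and the function name and the `min_area` threshold value are assumptions.

```python
def merge_regions(regions, min_area=0):
    """Return one (x, y, w, h) rectangle enclosing all sufficiently large regions."""
    valid = [(x, y, w, h) for (x, y, w, h) in regions if w * h >= min_area]
    if not valid:
        return None
    x0 = min(x for x, _, _, _ in valid)
    y0 = min(y for _, y, _, _ in valid)
    x1 = max(x + w for x, _, w, _ in valid)  # right-most edge
    y1 = max(y + h for _, y, _, h in valid)  # bottom-most edge
    return (x0, y0, x1 - x0, y1 - y0)


# Two body-part regions and one tiny spurious detection.
boxes = [(40, 30, 20, 60), (90, 50, 30, 40), (5, 5, 2, 2)]
roi = merge_regions(boxes, min_area=100)
print(roi)
```

The minimum area condition discards the 2 × 2 detection, so it does not stretch the region of interest toward the image corner.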

B. IMAGE SEGMENTATION
Image segmentation is the process of segmenting the image into two parts: foreground and background. The GrabCut algorithm proposed by Rother et al. [34] has proven to be an efficient foreground extraction technique. It takes a rectangular region as input and assumes that all the pixels outside that region belong to the background. Then it uses a Gaussian Mixture Model (GMM) to model the area inside the rectangle, labeling each pixel as probable background or probable foreground depending upon its relation to the provided data.
Using this pixel distribution, a weighted graph is created. All pixels are treated as nodes in the graph. Then two additional nodes are added: the Source node and the Sink node. Every foreground pixel is connected to the Source node and every background pixel is connected to the Sink node. The weights of the edges connecting pixels to the Source and Sink nodes depend on the probability of a pixel belonging to the foreground or background. The weights between the pixels depend on pixel similarity, that is, if there is a large difference in pixel color, the edge between them will get a low weight and vice versa. Next, the graph is segmented using a Min-Cut algorithm. The graph is cut so that the Source node and the Sink node are separated at minimum cost, where the cost function is the sum of the weights of all the edges that are cut. After cutting the graph, all the pixels connected to the Source node are labeled foreground and those connected to the Sink node are labeled background. The process continues until convergence.
However, in some cases, the extracted foreground contains portions that belong to the background. This problem came up while segmenting images from the NTU RGB+D dataset. In all those images, a major portion of the region of interest is the floor. Hence, a floor mask has been created by extracting a certain range of the intensity values from the original image and the GrabCut output is masked to get accurate results. The results of the segmentation process are shown in Fig. 2.

C. KEY BODY POINTS SELECTION
First, the full-body silhouettes have been converted into 3D meshes [35] as shown in Fig. 3. The center points of the 3D meshes have been taken as the source, and the geodesic distance-based heat kernel signatures (HKS) of the 3D meshes have then been obtained as shown in Fig. 4. HKS, as introduced by Sun et al. [36], is based on the heat kernel, which is a fundamental solution to the heat equation. The heat equation describes the variation of heat distribution with time. HKS is one of the many recent shape descriptors which are based on the Laplace-Beltrami operator associated with the shape [37]. The thermal diffusion process can be described by the heat equation as given in (5).
$$\left(\Delta + \frac{\partial}{\partial t}\right) u(x, t) = 0 \tag{5}$$
where $\Delta$ is the Laplace-Beltrami operator and $u(x, t)$ is the heat distribution at any point $x$ at a given time $t$. The solution to this heat equation can be expressed as in (6).
$$u(x, t) = \int h_t(x, y)\, u_0(y)\, dy \tag{6}$$
where $h_t(x, y)$ is called the heat kernel function and $u_0(y)$ is the initial heat distribution. The eigendecomposition of the heat kernel is expressed in (7).
$$h_t(x, y) = \sum_{i=0}^{\infty} e^{-\lambda_i t}\, \phi_i(x)\, \phi_i(y) \tag{7}$$
where $\lambda_i$ and $\phi_i$ are the $i$-th eigenvalue and eigenfunction of $\Delta$. For a concise feature descriptor, HKS restricts the heat kernel only to the temporal domain.
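Once eigenvalues and eigenfunctions are available, the spectral form of the heat kernel can be evaluated directly, and the signature at a point is the kernel with both arguments equal. The sketch below is only an illustration under stated assumptions: it uses the graph Laplacian of a three-vertex path as a stand-in for the Laplace-Beltrami operator of a mesh, its eigenpairs are hard-coded, and the function names are hypothetical.

```python
import math


def heat_kernel(eigenvalues, eigenvectors, x, y, t):
    """h_t(x, y) = sum_i exp(-lambda_i * t) * phi_i(x) * phi_i(y)."""
    return sum(math.exp(-lam * t) * phi[x] * phi[y]
               for lam, phi in zip(eigenvalues, eigenvectors))


def heat_kernel_signature(eigenvalues, eigenvectors, x, t):
    """HKS keeps only the diagonal of the heat kernel: h_t(x, x)."""
    return heat_kernel(eigenvalues, eigenvectors, x, x, t)


# Eigenpairs of the Laplacian of the path graph 0 - 1 - 2 (an assumed toy shape).
eigenvalues = [0.0, 1.0, 3.0]
eigenvectors = [
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],
    [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)],
    [1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6)],
]

for vertex in range(3):
    signature = [heat_kernel_signature(eigenvalues, eigenvectors, vertex, t)
                 for t in (0.0, 1.0, 10.0)]
    print(vertex, [round(s, 4) for s in signature])
```

The middle vertex, which has two neighbors, retains less heat at t = 1 than the end vertices, which is the kind of local shape difference that the signature is meant to capture.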
After obtaining the heat kernel signatures, all vertices in a mesh are grouped into multiple clusters based on their color or intensity value. Moreover, the centroid of each cluster is detected and is stored as a key body point. In this way, six key body points are obtained for each silhouette. When a geodesic path is drawn from the source vertex to the other five target vertices, a 2D leaf skeleton model is obtained as shown in Fig. 5. Algorithm 2 explains the process of extraction of key body points from full-body silhouettes.

D. FEATURE EXTRACTION
This section can be divided into two phases. In the first phase, a Convolutional Neural Network (CNN) has been used to extract features from full-body silhouettes. Full-body silhouettes have been extracted from the segmented images by removing the black background from the images and making them transparent. In the second phase, topological and geometric features have been extracted using key body points. Both these phases are described in the following sub-sections.

1) FULL BODY SILHOUETTES: CNN-BASED FEATURES
For extraction of features from images, a convolutional neural network has been used. The transfer learning approach has been employed, which includes using VGG16 as the base model and then fine-tuning its weights according to the used datasets. Visual Geometry Group-16 layers deep (VGG16) [38] is a CNN model that achieved 92.7% top-5 test accuracy on the ImageNet dataset, which has 1000 classes. Fig. 6 shows all the layers in the VGG16 model. All images have been reshaped to 224 × 224 × 3 to match the desired input size of the VGG16 model. After passing through the VGG16 base model, each input image is reduced to a feature map of size 7 × 7 × 512. These feature maps are then used to train the proposed CNN model.
There are three convolutional layers in the proposed model with 128, 64, and 32 filters respectively. The size of each filter is 3 × 3. The convolutional layers compute the output of neurons that are connected to local regions in the input. Convolution is similar to sliding a filter over an image, computing the dot product of filter weights and image pixels. Rectified Linear Unit (ReLU) is used as the activation function for all three convolutional layers. It simply sets all the negative values to zero as shown in (9).
$$f(x) = \max(0, x) \tag{9}$$
The convolutional layers are followed by a batch normalization layer. The pixels $x_k$ of input images of each batch are normalized using (10).
$$\hat{x}_k = \frac{x_k - E(x_k)}{\sqrt{Var(x_k)}} \tag{10}$$
where $E(x_k)$ is the mean and $Var(x_k)$ is the variance of pixel values. The batch normalization layer is followed by a flatten layer that remaps the output of the batch normalization layer to a column vector. Lastly, a dropout layer with a rate of 0.2 has been used to avoid overfitting. Two such models have been trained: one for RGB images and one for depth images. The two CNN models are then concatenated. Table 1 shows a summary of the CNN model.
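The sliding dot product and the ReLU activation can be illustrated on a tiny single-channel image. This is a didactic sketch only: the function names are hypothetical, the filter is a hand-picked vertical edge detector rather than a learned one, and no padding or stride options are modeled.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


def relu(feature_map):
    """Set every negative response to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# A bright vertical band on a dark background.
image = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
# 3 x 3 filter that responds positively to dark-to-bright transitions.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
response = conv2d_valid(image, kernel)
activated = relu(response)
print(response)
print(activated)
```

The filter responds positively at the left edge of the band and negatively at the right edge; ReLU keeps only the positive responses.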

2) KEY BODY POINTS: TOPOLOGICAL FEATURES
Topology can be defined as the spatial relationship between adjacent or neighboring features. Topological features are the properties of a geometric object that are preserved under continuous deformations. In the proposed architecture, four types of topological features have been extracted using the key body points:
1. Geodesic distance from the source.
2. Geodesic path.
3. Connected faces.
4. Nearest neighbors.
A mesh is a collection of vertices, edges, and faces that describe the shape of a 3D object. Every single point in a mesh is a vertex, a line connecting two vertices is an edge, and a flat surface enclosed by edges is called a face. In the proposed approach, the 3D meshes have been converted into graph models and these four topological features have been extracted for each key point.
First, the geodesic distance $gd_i$ between the source vertex and each key point or target vertex has been obtained. Geodesic distance gives the distance between two vertices in a graph along the shortest path between the vertices. Hence, unlike Euclidean distance, geodesic distance considers the shape of the object while computing the distance between two points. If any two vertices are not connected in a graph, the geodesic distance between them will be infinite. After storing the value of geodesic distance, an array of all the vertices lying on this shortest path from source to target vertex has been stored as the geodesic path $gp_i$. For finding the connected faces, each key point or target vertex has been compared with the three vertices in each face of the mesh. In this way, the faces containing one or more of these target vertices have been found and stored as connected faces $cf_i$. Finally, the distance of each target vertex from all other vertices in the graph has been computed and stored in a sorted array. Then the top 128 vertices corresponding to the shortest 128 distances have been acquired. These have been stored as the nearest 128 neighbors $nn_i$. These features are shown in Fig. 7. Hence, for each key point, a topological feature vector $[gd_i, gp_i, cf_i, nn_i]$ has been obtained.
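The four topological features can be sketched on a toy triangle mesh using Dijkstra's algorithm for the shortest paths. This is a minimal sketch and not the authors' code: the function names are hypothetical, edge weights are assumed to be Euclidean edge lengths, and the nearest neighbors are assumed to be ranked by geodesic (graph) distance, since the text does not say which distance is used.

```python
import heapq
import math


def build_graph(vertices, faces):
    """Adjacency map with Euclidean edge lengths, built from triangular faces."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            length = math.dist(vertices[u], vertices[v])
            graph[u][v] = length
            graph[v][u] = length
    return graph


def dijkstra(graph, source):
    """Shortest distances from source; unreachable vertices stay at infinity."""
    dist = {v: math.inf for v in graph}
    prev = {}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def topological_features(vertices, faces, source, target, k=128):
    graph = build_graph(vertices, faces)
    dist_from_source, prev = dijkstra(graph, source)
    gd = dist_from_source[target]                      # geodesic distance
    gp = []                                            # geodesic path
    if gd < math.inf:
        gp = [target]
        while gp[-1] != source:
            gp.append(prev[gp[-1]])
        gp.reverse()
    cf = [i for i, face in enumerate(faces) if target in face]  # connected faces
    dist_from_target, _ = dijkstra(graph, target)
    ranked = sorted((d, v) for v, d in dist_from_target.items() if v != target)
    nn = [v for _, v in ranked[:k]]                    # k nearest neighbors
    return gd, gp, cf, nn


# Toy mesh: three triangles forming a flat strip.
vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 0, 0), (2, 1, 0)]
faces = [(0, 1, 2), (1, 3, 2), (3, 4, 2)]
gd, gp, cf, nn = topological_features(vertices, faces, source=0, target=4, k=2)
print(gd, gp, cf, nn)
```

On a real body mesh `k` would be 128 as in the text; a small value is used here only because the toy mesh has five vertices.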

3) KEY BODY POINTS: GEOMETRIC FEATURES
Similar to topological features, some geometric features have also been obtained using the key body points. Ten triangular shapes have been drawn by joining different combinations of key points as shown in Fig. 8. These key points are labeled as left hand (LH), right hand (RH), left foot (LF), right foot (RF), head (H), and torso (T). Finally, the feature vector is updated as the geometric features are added to it. Algorithm 3 explains how these topological and geometric features have been extracted and concatenated in the proposed system.

E. INTERACTION RECOGNITION
At this stage of the proposed model, the interaction performed in the input video is recognized. After concatenating the different features extracted using full-body silhouettes and key body points, the feature vector has been fed into an LSTM model which is followed by a dense layer and a Softmax classifier. Hence, this section is subdivided into two subsections: LSTM and Softmax classifier.

1) LSTM
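Long Short-Term Memory (LSTM) [39] is a special type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies. The cell structure of LSTM is shown in Fig. 9. The working of LSTM is described as follows:
1. The output value at the previous time step $h_{t-1}$ and the input value at the current time step $x_t$ are entered into the forget gate, and the output value of the forget gate $f_t$ is obtained using (11).
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{11}$$
2. The output value at the previous time step $h_{t-1}$ and the input value at the current time step $x_t$ are also entered into the input gate. The output value $i_t$ and the candidate cell state $\tilde{c}_t$ of the input gate are obtained using (12) and (13).
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{12}$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \tag{13}$$
3. The current cell state $c_t$ is updated using (14).
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{14}$$
4. The output $h_{t-1}$ and input $x_t$ are received as input values at the output gate at time $t$, and the output of the output gate $o_t$ is obtained using (15).
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{15}$$
5. Finally, the output value of LSTM is calculated by using the output of the output gate $o_t$ and the state of the cell $c_t$, as shown in (16).
$$h_t = o_t \odot \tanh(c_t) \tag{16}$$
Here $W$ and $b$ denote the weight matrices and bias vectors of the respective gates, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.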
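The five steps above can be traced with a single LSTM cell whose input and hidden state are scalars. This is a didactic sketch with hypothetical parameter values, not the trained model: each gate is given one weight for the previous output, one weight for the current input, and one bias, which is the scalar special case of the standard LSTM formulation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a scalar input and a scalar hidden state."""
    def gate(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x_t + b)

    f_t = gate("forget", sigmoid)          # step 1: forget gate
    i_t = gate("input", sigmoid)           # step 2: input gate
    c_hat = gate("candidate", math.tanh)   # step 2: candidate cell state
    c_t = f_t * c_prev + i_t * c_hat       # step 3: cell state update
    o_t = gate("output", sigmoid)          # step 4: output gate
    h_t = o_t * math.tanh(c_t)             # step 5: LSTM output
    return h_t, c_t


# Hypothetical (w_h, w_x, b) triples for each gate.
params = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.5, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:  # a short toy sequence, one value per keyframe
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

In the proposed system the input at each time step would be the concatenated feature vector of one keyframe rather than a single number, and the weights would be learned during training.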

2) SOFTMAX
The Softmax classifier has been used to recognize human interactions. The Softmax function is a popular choice for multiclass classification [40]. It is an activation function that computes the probabilities of all the classes based on the output of the fully connected layer. Each probability lies between 0 and 1 and the sum of these probabilities is always equal to 1. It is trained with the cross-entropy loss. The Softmax output for each class is computed using (17).
$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}} \tag{17}$$
where $\mathrm{softmax}(z)_j$ is the probability of class $j$, $z$ is the vector of inputs to the output layer, $j$ indexes the output units, and $n$ is the total number of classes.
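A short numerical example may help here. The sketch below computes Softmax probabilities and the cross-entropy loss for one sample; the function names and the input values are hypothetical, and subtracting the maximum input before exponentiation is a common numerical safeguard that does not change the result.

```python
import math


def softmax(logits):
    """Convert the inputs of the output layer into class probabilities."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


# Hypothetical outputs of the dense layer for three interaction classes.
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))
loss = cross_entropy(probs, true_class=0)
print([round(p, 3) for p in probs], predicted, round(loss, 3))
```

The predicted interaction is the class with the largest probability, and the loss shrinks toward zero as the probability of the correct class approaches 1.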

IV. EXPERIMENTAL SETUP AND RESULTS
This section explains the details of the experiments conducted to validate the proposed system. All the processing and experiments have been performed using Python 3.8 with TensorFlow 2.5.0 and Keras 2.4.3. A hardware system with an Intel Core i5 processor and 64-bit Windows 10 has been used. The system has 8 GB of RAM and a 5 GHz CPU.
The proposed system has been tested on three different datasets and the recognition accuracies for each interaction class have been computed in the form of their confusion matrices along with precision, sensitivity, and F1 scores. For further validation, the accuracies have been compared with those of other State-Of-The-Art (SOTA) methods. This section is further divided into two sections: dataset description and experimental results.

A. DATASETS
The three datasets that are used for experimentation are the SBU Kinect Interaction dataset [41], the NTU RGB+D dataset [42,43], and the ISR-UoL 3D social activity dataset [44]. Details of each dataset are given in the following subsections:

B. EXPERIMENTS AND RESULTS
For validating the performance of the proposed system, different metrics have been used. The experimentation phase has been divided into two categories: classification accuracy of each class in terms of confusion matrix, precision, sensitivity, and F1 score, and comparison of the proposed system with other state-of-the-art methods. The results for each stage are given in the following sub-sections.

1) INDIVIDUAL CLASS ACCURACY
The results of the proposed model's performance are given in the form of confusion matrices showing true positives, true negatives, false positives, and false negatives for each class individually. The confusion matrices for SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets are given in Tables 3, 4, and 5 respectively. It can be observed from these tables that the interaction classes of all three datasets achieved high recognition rates, with mean accuracy rates of 91.63%, 90.54%, and 90.13% on the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets respectively. However, there is still some confusion between interaction classes that involve similar actions such as the departing and approaching interactions in the sports dataset. Similarly, the shaking hands and exchanging an object interactions of the SBU Kinect Interaction dataset are confused with each other as shown in Table 3. Table 4 shows that the pat on back and point finger interactions of the NTU RGB+D dataset are confused with each other. As seen in Table 5, there is confusion between the hugging and shaking hands interactions of the ISR-UoL 3D social activity dataset.
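Tables 6, 7, and 8 show the precision, sensitivity, and F1 scores of each class in SBU Kinect Interaction, ISR-UoL 3D social activity, and NTU RGB+D datasets respectively. The precision, sensitivity, and F1 scores of all the interaction classes for each dataset have been calculated as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
where TP, FP, and FN stand for True Positives, False Positives, and False Negatives respectively.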
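The per-class scores follow directly from a confusion matrix. The sketch below assumes that rows hold the actual classes and columns hold the predicted classes; the function name is hypothetical and the counts are toy values, not results from the paper.

```python
def per_class_metrics(confusion):
    """Return (precision, sensitivity, F1) for every class of a square confusion matrix."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # actual k, predicted otherwise
        fp = sum(confusion[r][k] for r in range(n)) - tp   # predicted k, actually otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        metrics.append((precision, sensitivity, f1))
    return metrics


# Toy 3-class confusion matrix (rows: actual, columns: predicted).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
scores = per_class_metrics(confusion)
for label, (p, s, f1) in enumerate(scores):
    print(label, round(p, 3), round(s, 3), round(f1, 3))
```

In this toy matrix the first two classes are confused with each other, which lowers their scores, while the third class is recognized perfectly.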

2) COMPARISON WITH STATE-OF-THE-ART METHODS
In this section, the proposed method is compared with different methodologies adopted by researchers for HIR in recent years. The action recognition accuracies of each evaluated methodology are used for comparison with the proposed system. Tables 9, 10, and 11 give the comparison of the proposed system with other state-of-the-art (SOTA) systems evaluated on SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets respectively. The results show that the proposed hybrid descriptors are more robust than the different types of features used in the recent SOTA systems.

V. DISCUSSION
The proposed system is a complete HIR solution that should be applicable to many real-world problems involving the tasks of human behavior monitoring, security, surveillance, and managing smart homes. It is designed for RGB+D datasets but can also be used with RGB-only or depth-only datasets by using only one stream of the proposed CNN model and skipping the model concatenation stage. Each step from the preprocessing stages to the classification stage contributes to the improved performance achieved by the system. The proposed feature extraction method successfully extracts robust features, which in turn play a critical role in accurate classification of the interactions. Using two CNN models for training RGB and depth images separately and then concatenating the models gives better results than those obtained by concatenating the RGB and depth images first and then training on the resulting four-channel images using only one CNN model. Moreover, since all three datasets contain video sequences, the LSTM-based classification step gives accurate results.
Despite yielding good results, the proposed system is not without limitations. The proposed 2D leaf skeleton model for the detection of key body points can only extract six key points so far. However, better accuracies can be achieved if more key points are identified and their features are extracted. Moreover, the proposed system is very extensive and computationally expensive. The time complexity of the proposed system is shown in Table 12.

VI. CONCLUSION
In this paper, a novel HIR framework has been proposed that uses both machine learning and deep learning techniques for feature extraction from 2D human silhouettes and 3D meshes. Using efficiently segmented silhouettes of the humans from images, multiple features have been extracted from full-body silhouettes and from key body points of their corresponding 3D meshes. The features have then been fed to an LSTM and a Softmax-based classifier. The proposed system achieved average accuracies of 91.63%, 90.54%, and 90.13% with the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets respectively.
In the future, the authors plan to shift their focus to the task of human-object interaction recognition and investigate new features and modeling techniques for better classification results.