Introduction
Interpreting a human’s movements in a video, also known as human activity recognition (HAR), aims to automatically deduce activities and context from video data by extracting detailed information about the human movements and the environment in which the activities are performed. In a typical setting, the video is divided into segments such that each segment contains a human action, and the aim of the system is to classify each activity into the category it belongs to [1]. Several HAR studies utilizing deep learning and machine learning have been carried out recently, but few of them focus on developing a framework for elderly people. In 2017, there were 962 million persons in the world who were 60 or older, more than double the 382 million in that age group in 1980. Over 2.1 billion senior individuals are predicted to exist by 2050, more than double the current number [2]. As a result of the enormous scientific advancement that has propelled human society into the global aging era over the past few decades, elderly care is one of the most recent applications of HAR. Our ability to live longer and better lives has changed as a result of this development, and there is rising optimism that most senior citizens will be able to live independently in their homes and take part in communal life. Various security methodologies can be used to provide security to individuals [3]. Even though technological advancements have made it possible for the majority of senior citizens to lead healthy and active lives, it is still risky for elderly individuals to live alone as they age because of both physiological issues and loneliness. Artificial intelligence is becoming more popular for HAR due to its capacity for self-improvement and reliable classification models [4]. The challenge of deriving useful features from sensor data has become easier since deep learning was introduced to the HAR domain [5]. The advancement of deep learning has facilitated the creation of numerous models such as CNNs [6], [7]; transfer learning, which allows the reuse of previously trained knowledge by training a recognition model on one dataset and then applying it to a different testing dataset, as in Inception [8], [9], ResNets [10] and VGG-16 [11]; and hybrid deep learning models such as the fusion of CNN-LSTM [12] and of Inception with ResNets [13].
Unfortunately, none of these algorithms has been applied effectively to elderly datasets. Since elderly persons’ movements differ from those of young people for various reasons, mainly health issues, a model designed particularly for the elderly cannot be validated on young people’s data, as the results might differ in real time. An AI-based framework has been proposed for managing these kinds of risks [14], [15]. A system whose objective is to recognize elderly activities must have efficient prediction capabilities and a high recognition rate. These traits can be used to predict the abnormal behavior of an elderly person living alone. Hence, the system will help to prevent any kind of mishap in advance by raising an alarm to the caretaker.
The proposed research work answers the following research question: How can a new workflow be proposed to address the HAR problem for the elderly and be evaluated on a real-time dataset?
To answer this question, we propose a personalized multi-input feature fusion method for the elderly (goldenAGER), which extracts light-, illumination- and shadowing-invariant features using HOG and complex patterns using VGG-16 to recognize similar activities. To overcome the difficulty of the high dimensionality of the features, these feature maps are passed through a deep neural network. The final feature vectors are then concatenated to recognize different actions. The proposed approach will aid in the development of a strong HAR system to classify and recognize elderly activities more accurately. No publicly available dataset providing video data of elderly activities exists to date. Therefore, we have created a HAR dataset in which the human action data are collected from elderly people. The videos of the elderly volunteers were captured using a static camera. The proposed approach has been evaluated on the primary dataset and on the publicly available Microsoft Research (MSR) action detection dataset, and our model performs well on both. The major contributions of this work can be summed up as follows:
We propose a personalized multi-input feature fusion model which takes into account both handcrafted and self-learnt features as two different input sources to classify the actions performed by the elderly.
A novel algorithm, goldenAGER, is proposed to overcome the difficulty of extracting essential features from videos in order to recognize the activities of an elderly person.
During experiments, the proposed model is tested on our primary dataset and on the standard MSR action detection dataset, yielding high recognition performance of 95 and 93.08 percent, respectively. Our model outperforms other approaches, demonstrating its suitability and viability for elderly monitoring and abnormal activity prediction applications.
The paper is organized as follows: Section II focuses on the related study of HAR approaches using machine learning and deep learning algorithms. Section III elaborates the process of primary data collection and describes the publicly available dataset used in the present research. Section IV explains the feature extraction process in detail. The design of the goldenAGER algorithm and the model is presented in Section V. Section VI presents experimental results and evaluates the algorithm on extensive performance metrics; this section also includes a discussion of the performance of the proposed algorithm in comparison to existing algorithms. Section VII concludes the proposed methodology and outlines the future scope.
Related Study
Human activities are the various behaviors that a person can perform over time in a certain setting. There is no perfect mechanism for recognizing human activity; therefore, researchers have been working on this subject for more than 20 years. Human activity recognition remains highly complex because it is easily influenced by numerous factors, including background noise, varying lighting and illumination, and intra-class variability [9], [16], [17].
A. Traditional Machine Learning-Based
A review and description of traditional machine learning methods using decision trees, k-nearest neighbours (KNN), and support vector machines (SVM) was published by Jobanputra et al. [18]. Based on the dataset utilised, the model trained, and the best features extracted, the system has been tested and its accuracy validated. This study reveals that there are numerous difficulties with conventional machine learning methodologies. For instance, while collecting data from sensors, it is mandatory for a person to wear many sensors, and the placement of these sensors becomes a significant problem. Even when using typical handcrafted machine learning approaches, feature extraction can be challenging.
To evaluate the efficacy of human actions recognition using classic classification algorithms including SVM, KNN, random forest (RF), and Naive Bayes, Subasi et al. [19] conducted a comparative analysis. Comparing their outcomes reveals that SVM has the highest classification accuracy. By monitoring various activities, such as walking and running, sitting and standing, Kececi et al. [20] investigated machine learning algorithms for user authentication. By implementing ten different machine learning algorithms on the HugaDB database, they proposed a person recognition system. The technique provides significant accuracy, and samples from 18 subjects were used to collect the data.
Zhang et al. proposed enhancements to the best-in-class activity recognition system, particularly for the activity representation and classification methods [21]. In this analysis, representation and classification methodologies are reviewed and examined. They describe the current literature using a dataset they have created as well as detailed categorization methodologies.
Traditional machine learning techniques have shown outstanding results for local feature extraction during the past ten years, according to Wei et al. [22]. Traditional and handcrafted algorithms still have a number of drawbacks, like the fact that they take a long time to complete and that feature engineering selection is a highly challenging task because of high-dimensional features and sophisticated processing.
Nowadays, researchers are using deep learning to create new methods for improved video-based human activity recognition because of the shortcomings and difficulties of standard machine learning approaches.
B. Deep Learning Based Approaches For Recognising Human Actions
Deep learning-based approaches employ the current end-to-end architecture to learn and represent high-level features while also categorizing them, in contrast to standard machine learning-based methods. These end-to-end designs learn the ideal qualities and adjust the parameters based on the data [23].
Liu et al. [24] suggested a technique for identifying humans in a brief video and spotting single-human movement that also involves human-human conversation. According to Hammerla et al. [25], deep, convolutional, and recurrent algorithms have been used that outperform state-of-the-art methods. They have also given directions for researchers who want to apply deep learning to address such challenges.
By choosing to focus on specific aspects of the input frame, Muhammad et al. [23] introduced a bi-directional long short-term memory-based attention method for detecting human actions from videos. For the extraction of the spatial and temporal patterns of different human actions from videos, He et al. [26] proposed a DB-LSTM system. Although deep learning has achieved notable success recently, action recognition remains a difficult issue due to occlusions, crowded backgrounds, camera motion, etc.
Guo et al. [27] integrated Wi-Fi signals along with video streams, employed convolutional neural networks, and applied statistical analysis approaches to improve the recognition of human actions in videos. Nieto-Hidalgo et al. [28] suggested a vision-based gait analysis technique that may be applied on a smartphone with cloud computing. Cui et al. [29] created the YOLO-NSG (you only look once) method in the area of deep machine learning to lighten the load of inspectors by automatically spotting crime scenes in video surveillance.
A capsule neural network was suggested by Basu et al. [30] to perceive human action coupled with visual layout. This is accomplished using deep CNNs. A new architecture based on a 2D CNN was proposed by Gholamrezaii and Taghi Almodarresi [31] to recognize human actions using only convolutional layers. The performance of the model does not change when the pooling layers are removed, but the computing time drops significantly. HAR has developed into a significant topic of research for computer vision over the past ten years. Due to its superior performance, deep learning has been the primary focus of most experiments. However, extracting the essential features from vision-based data is still a challenging task [32]. Multi-feature extraction methods are used to extract the complex patterns in an image [33].
C. Multi-Feature Fusion Based Approaches To Recognise Human Actions
Kim et al. [34] recognized nine everyday routine activities of older people by extracting skeleton joint features. They suggested a video-based HAR framework that identifies typical indoor activities of elderly persons based on their skeletons’ joint characteristics. Each human silhouette is given a collection of 23 joints after depth maps are processed to track them and create skeletal data on body joints. Following that, skeletal joint features are generated using the joint information. Finally, a hidden Markov model is employed to identify a variety of human behaviors utilizing these characteristics. The proposed approach outperformed the other existing approaches. Kaur and Gupta [35], [36] present major challenges for the development of AI-based systems.
In order to categorize actions like walking, talking, etc., Seemanthini et al. [37] introduced a novel technique based on cluster segmentation for recognizing human action. The proposed model used the HOG descriptor for feature extraction. In contrast to current approaches, this system is able to accurately identify human objects of various sizes and shapes, producing a proficient action recognition method.
Khan et al. [35] integrated a hybrid model for activity recognition which makes use of a convolutional neural network (CNN) for extracting spatial properties and a long short-term memory (LSTM) network for learning temporal information. The accuracy achieved was better than that of existing approaches. They also proposed a new dataset with 12 classes of human actions.
Hayat et al. [2] implemented existing approaches to identify elderly activity. They applied a publicly available University of California Irvine (UCI) dataset to identify human activity. The study concludes that out of all machine learning classifiers, SVM performs better but with a very low computation time, and out of all other available deep learning models, LSTM performs better.
Zerrouki et al. [38] developed an effective human action recognition technique that benefits from the adaptive boosting algorithm’s excellent discrimination ability. Using experimental data from two publicly accessible fall detection datasets from the Universidad de Málaga and the University of Rzeszow, they evaluated the efficacy of their methodology. Furthermore, they have shown that when it comes to distinguishing different human gestures, the suggested method beats state-of-the-art classifiers based on neural networks, K-nearest neighbor, support vector machines, and naive Bayes. A new approach to expressing and detecting human activity was proposed by Hbali et al. [39]. The proposed approach computes spatial and temporal channels from 3D skeleton coordinates and combines them to represent each frame of a human activity sequence. The technique employing overly randomized trees was trained and validated using the MSR Daily Activity 3D and MSR Action 3D datasets.
Huan et al. [40] provided a method for extracting multi-layer features from hybrid CNN and BiLSTM. To improve the performance of human complex activity recognition, features from different sets that are obtained via trained networks, manual extraction, and location information are combined to produce mixed features. The proposed approach is tested on the PAMAP2 and UT-Data datasets. The results demonstrate a method that greatly outperforms the state-of-the-art methods now in use. This method is based on a number of qualities addressed in this research.
Kiran et al. [41] suggested a multi-feature fusion model for HAR. The global average pool and fully connected layers are used to extract features, and canonical correlation analysis combines the characteristics of the two levels. Finally, the selected features are passed to several classifiers for final classification, which performed quite well in classifying the actions.
Chan et al. [42] proposed a unique approach in which three networks are used to determine the position of the human, the objects in the scene most related to the human, and the entire situation, which includes the actors and everything else in the scene. Three parallel networks are utilized to incorporate these factors, followed by feature fusion and convolutional neural networks. An SVM classifier then performs the classification.
A spatio-temporal feature fusion approach was put forth by Yao et al. [43] to gather the crucial visual data for recognition. Moreover, they expanded on the bag of words representation by adding a compressed spatio-temporal video representation. Their tests on two well-known datasets demonstrate effective performance.
Huynh-The et al. [44] proposed a feature fusion approach called deep convolutional neural network (DCNN) to simultaneously extract spatiotemporal information. In this method, the SVM classifier is used to train the hybrid feature, which mixes the deep and handcrafted features. The model was tested on three benchmark datasets for human activity recognition.
Table 1 summarizes the existing approaches. A multi-modal learning methodology utilising video and Wi-Fi feature fusion was developed by Guo et al. [27]. Using the proposed Vi-Wi15 dataset, they achieved 94.2% accuracy. Basu et al. [30] proposed a capsule neural network, extracting features using VGG-16 to recognize indoor home scenes. The model was trained and tested on a public dataset with an accuracy of 76.9%. Gholamrezaii and Taghi Almodarresi [31] proposed a 2D CNN model with no pooling layer for activity recognition. They used a publicly available HAR dataset and achieved 95% accuracy. Kim et al. [34] proposed a skeleton joints feature detector by concatenating the features collected from the centroid point, magnitude and joint distances. They created an elderly dataset with 9 activities involving eating, drinking, lying down, standing up, walking, exercise, etc. Using the proposed method, they achieved 84.33% accuracy. Seemanthini and Manjunath [37] proposed a model based on cluster segmentation. They used HOG+SVM to recognize human activity on a publicly available dataset, giving 89.59% recognition accuracy; the number of classes and activity types in the dataset are not mentioned in the paper. Since this model is trained and tested on a publicly available dataset, the results might vary if it were used on elderly people. Hbali et al. [39] proposed a skeleton-based approach for elderly monitoring systems. They used publicly available datasets involving pickup, throw, two-hand wave, hand clap, etc. activities, with an overall accuracy of 86.1%. Similarly, Abdelbaky and Saleh Aly [45] proposed a spatio-temporal feature fusion HAR model. The model gives 93.3% accuracy, but it has been trained and tested on a sports dataset, and all the activities considered for recognition are sports activities such as dive, golf, kicking, ride horse and run; therefore, this model is also not applicable to elderly people. Khaire et al. [46] proposed a decision-level fusion approach by fusing ConvNet and SoftMax scores. They used 4 publicly available datasets with different numbers of classes, including activities such as swipe, wave, clap, throw, boxing, squat and draw circle. The proposed model achieved 94.60%, 92.82%, 93.06% and 96.26% accuracy, respectively, on each dataset. But because these datasets are collected from normal people, the results may vary on elderly people. Zhang et al. [47] proposed a handcrafted and learned feature fusion method in which the features collected from discrete wavelet techniques and a CNN-RNN model are fused, with accuracies of around 95.1%, 70% and 72.5%, respectively, on three different publicly available datasets. But these datasets have been collected from YouTube videos of people of all ages, not specifically elders; therefore, the results might vary for elderly people. Most of the research has been focused on healthy young people, so these approaches are only partially applicable as a framework for elderly monitoring and abnormal activity recognition systems. Our proposed method uses a handcrafted and learned feature fusion method with accuracies of 95% and 93.08% on the primary and MSR datasets, respectively. Since our model has been trained and tested on a dataset collected specifically from the elderly, it is well suited to recognize elderly people’s activities.
As discussed in the previous section, traditional handcrafted machine learning algorithms and deep learning algorithms have shown remarkable results for feature extraction and human activity recognition, but due to some limitations and challenges in extracting features using these methods, multi-feature fusion methods are used.
Objectives
After a thorough analysis of the existing approaches with a focus on their present limitations, the following objectives have been set forth to address the significant research gap.
Objective 1: To propose a HAR model that can be incorporated as an application to continuously monitor the activities of an elderly person and predict abnormal activity.
Objective 2: To develop a novel algorithm which recognizes the elderly activities such as Carry, Clap Hands, Pickup, Pull, Push, Sit Down, Stand Up, Throw, Walk and Wave Hands.
Objective 3: To explore the significant features such as important landmarks, edges, textures and complex patterns in the videos to identify these activities.
Objective 4: To evaluate the goldenAGER algorithm on extensive performance metrics such as Precision, Recall, F1-score, Cohen’s kappa Coefficient and ROC curve.
Objective 5: To test and compare the proposed algorithm with existing algorithms.
Data Collection
The dataset used in this work has been collected from elderly volunteers over the age of 60 and contains video sequences of 10 classes of continuous human actions. The sequences were taken in a homogeneous environment using a static camera.
Sample images for each of the 10 kinds of elderly movements are shown in Figure 1. Videos of 10 s were captured for each class of activity and then converted into image frames to serve as input to the feature extractor. Thirty frames were captured every second. Around 2% of the frames recorded the same information with only a very slight difference; therefore, only one such frame was kept. The final frame rate was 15 frames per second. The covered activities are carry, clap hands, pickup, pull, push, stand up, sit down, throw, walk and wave hands. The sample distribution of each class of activity is shown in Figure 2, which represents the percentage of samples in each activity out of 20270 total images.
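A minimal sketch of this frame-extraction step is given below, assuming OpenCV; the file paths, naming scheme and the extract_frames helper are illustrative, not the authors' code. Keeping every second frame of the 30 fps recordings yields the 15 fps frame rate described above.

import cv2
import os

def extract_frames(video_path, out_dir, keep_every=2):
    """Save every `keep_every`-th frame of a video clip as a JPEG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of the clip
            break
        if idx % keep_every == 0:       # 30 fps -> 15 fps when keep_every=2
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:04d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example usage (paths are placeholders):
# extract_frames("carry_subject01.mp4", "frames/carry", keep_every=2)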
The quantity of samples gathered for each activity is detailed in Table 2. The data is split into training and testing sets with an 80/20 split: 80% of the samples are used for training and 20% for testing, in order to determine how the algorithm behaves in the presence of unobserved data.
MSR Action Detection Dataset: We have also used the MSR Action dataset, employed for experiments on action detection, to validate our model. It has ten subjects and twenty different action types, and each subject carries out the tasks two or three times. This dataset has the advantage of containing action sequences from several angles; for instance, there are significant differences between the several ways in which the same action is performed [48]. We have compared the performance of our model on the MSR Action dataset as well. The frames were captured from the video sequences as shown in Figure 3.
Feature Extraction
A. Extracting HOG Feature Descriptors
Since humans can appear in a variety of ways and adopt a variety of poses, detecting them in images can be difficult. The first need is a strong feature set that enables clean human form discrimination even in cluttered backgrounds with challenging lighting. To fulfil this requirement, [49] proposed HOG for human detection.
This section describes how HOG features of humans are extracted. The steps below demonstrate the entire procedure for obtaining the HOG characteristics:
Pre-processing: RGB frames collected from the dataset are of size 480*640. All frames are resized to a uniform smaller size using Python OpenCV libraries.
Calculate the Gradients of the image: Combining the image’s magnitude and angle produces the gradient.
G_{x} and G_{y} are first calculated for each pixel in a block of 3\times 3 pixels using the following formulas, where r and c denote the row and column respectively:\begin{align*} G_{x}(r,c) &= I(r, c+1) - I(r, c-1) \tag{1}\\ G_{y}(r,c) &= I(r-1, c) - I(r+1, c) \tag{2}\end{align*}
After that, each pixel’s gradient magnitude and direction are calculated from the following formulas:\begin{align*} Magnitude (\mu) &= \sqrt {(G_{x})^{2} + (G_{y})^{2}} \tag{3}\\ Angle (\theta) &= \tan^{-1} (G_{x} / G_{y}) \tag{4}\end{align*}
Figure 4 explains the process of calculating the gradient magnitude and direction in detail.
Block Normalization: Two cells (8\times 8 each) are combined to make a 16\times 16 block. This clubbing is carried out in an overlapping way with an 8-pixel stride, and each cell’s nine-point histogram is concatenated to create a 36-point vector for the four cells in the block, represented as\begin{equation*} fb_{i} = b_{1}, b_{2}, b_{3}, \ldots, b_{36} \tag{5}\end{equation*}
where fb_{i} is the feature vector belonging to one block. The values of fb_{i} are then normalized by L2 normalization. By normalising the blocks, the impact of differences in contrast between images of the same class is lessened.
Getting the HOG feature vector: The image is divided into 8*8 cells with (2, 2) cells per block, i.e., blocks of size 16*16 pixels. The final feature vector is calculated by combining all 126 block vectors; since each block vector has size 36, a 126*36 = 4536 dimensional feature vector is formed.
\begin{equation*} F = f_{1}, f_{2}, f_{3}, \ldots, f_{k}, \quad \text{where } k = 4536 \tag{6}\end{equation*}
Here f_{1}, f_{2}, f_{3}, \ldots, f_{k} are the feature dimensions. Figure 5 shows the visualization of HOG features on a sample image from the dataset.
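The whole HOG pipeline above can be sketched with scikit-image's hog() using the reported settings (9 orientation bins, 8x8-pixel cells, 2x2-cell blocks with L2 normalisation, frames resized to 80x120); this is an illustrative reconstruction rather than the authors' exact implementation, but it reproduces the 4536-dimensional vector of Eq. (6).

import cv2
from skimage.feature import hog

def hog_descriptor(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)   # step 2: RGB -> grayscale
    gray = cv2.resize(gray, (80, 120))                    # step 3: resize (width, height)
    features = hog(
        gray,
        orientations=9,            # 9 histogram bins per cell
        pixels_per_cell=(8, 8),    # 8x8-pixel cells
        cells_per_block=(2, 2),    # 16x16-pixel blocks
        block_norm="L2",           # L2 block normalisation
        feature_vector=True,
    )
    return features                # 126 blocks x 36 values = 4536, matching Eq. (6)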
B. Extracting Features Using VGG-16
An extremely deep neural network called VGG-16, developed by [50], is used for both feature extraction and image classification. We have used VGG-16 in this work as a feature extractor. The 14 million images in the ImageNet dataset, which is divided into 1000 classes, were used to pretrain the VGG-16 model. The ImageNet collection contains images with RGB channels and a fixed size of 224*224. As a result, our input is a matrix of (224, 224, 3).
As seen in Figure 6, this model analyses the input image and generates a 1000-value feature vector. Instead of training on a smaller dataset, it is a good idea to import the VGG-16 weights. The weights for the VGG-16 model are imported from the data of the Visual Geometry Group, University of Oxford [50]. The weights from the source are provided in MATLAB format and are compatible with MatConvNet, a MATLAB toolbox for CNNs. In this study, the VGG-16 weights are converted from MATLAB format to a Keras-compatible format using machine learning libraries. The total number of trainable parameters in VGG-16, going through each layer, is approximately 138 million. Transfer learning helps researchers focus on the real problem while saving computation time. In the proposed model, the 1000 deep features from the imported model are combined with the HOG features. Figures 7, 8, 9 and 10 represent VGG-16 low-level and high-level features in detail.
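A hedged sketch of this feature-extraction branch is shown below: the Keras VGG-16 application is loaded with ImageNet weights and its full classifier head, so each 224x224x3 frame yields the 1000-value output vector used as the deep feature. Loading the weights directly through Keras is assumed here as an equivalent shortcut to the MATLAB-to-Keras conversion described above.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

vgg16 = VGG16(weights="imagenet", include_top=True)   # keeps the 1000-way output layer

def vgg16_features(frame_path):
    img = image.load_img(frame_path, target_size=(224, 224))   # resize to 224x224x3
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))             # ImageNet preprocessing
    return vgg16.predict(x, verbose=0)[0]                       # 1000-dim feature vector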
C. Intra-Class Variations
The same person repeating the same activity with the exact same execution is extremely uncommon. Moreover, even when executing the same action, every person behaves differently. For instance, the walking action may differ for every person as each individual walks at his own pace, the throwing action may differ, and so on [32]. After concatenating the features, the intra-class variation of different action classes is analyzed by finding Spearman’s correlation coefficient [51], [52] as shown in equation (7).\begin{equation*} Corr(\phi _{i})= 1- 6d^{2}(x,y)/n(n-1) \tag{7}\end{equation*}
\begin{equation*} d(x,y)= \sum |m(x)-n(x)| \tag{8}\end{equation*}
The distance between two feature samples x and y of the same class in the dataset is measured by d(x,y). If the distance value is minimal, the characteristics are identical. An ideal action recognition system should be able to generalize over variations of a class in order to recognize a previously unseen instance. The Spearman’s correlation coefficient values for the 10 classes in the primary dataset are Carry (+0.41), Clap Hands (+0.59), Pickup (−0.42), Pull (−0.08), Push (−0.21), Sit Down (+0.77), Stand Up (+0.89), Throw (−0.91), Walk (−0.82) and Wave Hands (+0.73). The correlation values vary between −1 and +1. Negative values show that the feature samples are uncorrelated or weakly correlated, which means that there is a large variation between the different feature samples of the same class. It can be seen from the values that the Pick Up, Push, Pull, Throw and Walk classes have large intra-class variation because of directional changes. Moreover, one person may use both hands for the pick up, push and pull actions while another uses one hand.
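As an illustrative check of Eq. (7), Spearman's rank correlation between two fused feature vectors of the same class can be computed with SciPy; spearmanr is assumed here as a stand-in for the authors' own computation.

from scipy.stats import spearmanr

def intra_class_correlation(features_a, features_b):
    rho, _ = spearmanr(features_a, features_b)
    return rho   # close to +1 => similar samples; negative => large intra-class variation

# e.g. rho = intra_class_correlation(fused_vec_walk_1, fused_vec_walk_2)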
An action recognition model was proposed in [53] to handle intra-class variations using HOG+HOF descriptors, achieving 93.13% accuracy. Our proposed method outperforms this existing method. The block normalization step used in the HOG feature extraction process improves the performance of the HAR model by extracting discriminative features.
Using a fine-tuned CNN model enhances generalization rather than training the model on smaller datasets. A transfer learning strategy called “fine-tuning” focuses on using the information gained from addressing one problem to solve another that is comparable. By using pre-trained weights, which are created from larger databases through fine-tuning, the overfitting issue can also be avoided. This happens because of the multiple layers and characteristics that CNNs have. CNN models usually identify patterns that are especially pertinent to the smaller datasets on which they are trained. When the model uses fine-tuning, the filters can be reused. VGG-16 is used in this paper as a transfer learning strategy because it has a significant capacity for discrimination. As a result, the HOG and VGG-16 combination handles intra-class variation very effectively.
Algorithm and Model Design
Feature fusion makes it easier to properly learn features for a rich description of their internal information. After collecting features from the HOG feature descriptor and the VGG-16 feature extractor, the feature maps are passed through three hidden layers of a CNN and a dropout layer to reduce the number of features. After that, 256 HOG features and 256 VGG-16 features are concatenated. The final feature vector is then passed through two input layers, which contain the features and labels separately and in turn classify the output actions.
A. Benefits Of Multi-Feature Fusion
Using the HOG feature descriptor, the distribution of intensity gradients and edge directions can be described to capture local object shape and appearance. The image is separated into cells (small connected regions), and for the pixels within each cell a gradient histogram is created. The combination of these histograms becomes the feature descriptor. The local histograms are contrast-normalized to increase accuracy by generating an intensity measure over a larger area of the image, referred to as a block, and using this value to normalize all cells within the block. Better invariance to variations in illumination and shadowing is the effect of this normalization.
To discriminate between various human activities, low-level features like edges and corners are required. Additionally, after normalizing the cells over wider areas, like blocks, the features become locally invariant to illumination conditions.
Transfer learning is a machine learning technique that uncovers hidden patterns and important information in one particular domain before applying and transferring this knowledge to a related but different area. In transfer learning, we take advantage of the model’s ability to generalize what it has learnt for one task to a different, related problem. Compared to building a model from scratch, transfer learning uses pre-trained models that require less computing work to fine-tune. VGG-16 is a model pretrained on the ImageNet dataset containing 14 million images belonging to 1000 classes; the ImageNet dataset is exclusively trained for object recognition. We have used VGG-16 in this paper because of its efficiency on popular datasets such as the Stanford-40 action dataset [14] and the Multi-View Human Action Recognition (MVHAR) datasets [15].
HOG and VGG-16 can efficiently operate together to collect crucial action features. Important low-level action features are gathered via the HOG features. The VGG-16 characteristics generalize the local features collected from human activity to deal with high-level features. VGG-16 extracts features at an abstract level by employing various action patterns. Given that HAR relies on two distinct features that are produced using two different approaches, the multi-feature fusion of HOG and VGG-16 improves the effectiveness of HAR.
The model is divided into two phases. In Phase 1, features are extracted using two different methods, handcrafted feature extraction and transfer learning feature extraction; both methods are explained in detail in the previous section. After obtaining both feature descriptors, the output of Phase 1 is passed as two different inputs to Phase 2. In this phase, goldenAGER, the multi-input feature fusion algorithm, is applied to classify the various elderly activity classes and recognize their actions. The two feature sets are passed as two different inputs to the deep neural network: one input is the HOG feature vector, and the second is the VGG-16 feature vector together with the pre-trained VGG-16 weights. Along with the convolutional layers, activation functions are also used to ensure that only valuable information is gathered, speeding up computation. Pooling layers are used to address memory size issues so that the information acquired and its associated data are condensed to a smaller memory size. The model is explained in detail in Figure 12. The list of abbreviations and notations used in goldenAGER is given in Table 3.
Phase 1: Feature extraction
INPUT: Video sequences captured from static camera
OUTPUT: Two Feature Vectors
Initialize frame rate and image size = 480*640 pixels
Convert RGB images to Grayscale images
Resize the image to 80*120 pixels
For each pixel in a block of 3*3 pixels do
Calculate the gradients G_{x} and G_{y}
\begin{align*} G_{x}(r,c) &= I(r, c+1) - I(r, c-1) \tag{9}\\ G_{y}(r,c) &= I(r-1, c) - I(r+1, c) \tag{10}\end{align*}
Compute Magnitude and Direction of the Gradient
\begin{align*} Magnitude (\mu) &= \sqrt {(G_{x})^{2} + (G_{y})^{2}} \tag{11}\\ Angle (\theta) &= \tan^{-1} (G_{x} / G_{y}) \tag{12}\end{align*}
Divide gradient image into 8*8 cells
Find histogram bins for each cell
Group 2*2 cells to form overlapping blocks
Normalize the blocks.
Get the HOG descriptor for the image End for
Save HOG feature vector
Repeat step 1 for the VGG-16 feature extractor
Resize the image to 224*224*3
Use the pretrained weights of ImageNet dataset
Convert the image frames to NumPy array
Apply VGG-16 model
Save VGG-16 feature vector
Return HOG and VGG-16 feature vector
Phase 2: goldenAGER: multi-input feature fusion algorithm
INPUT 1: HOG feature vector
INPUT 2: VGG-16 feature vector + pre-trained weights
OUTPUT: Recognized action class
Pass Input 1 to hidden layers
Evaluate individual neuron weights w_{ij} during the training stage by passing the training video data
For each Layer do
Apply ReLU activation function
\begin{equation*} F(x) = \max(0, x) \tag{13}\end{equation*}
Use ADAM optimizer to deal with non-stationary activities
\begin{align*} m_{t} &= \beta _{1} m_{t-1} + (1-\beta _{1})\,[\delta L /\delta w_{t}] \\ v_{t} &= \beta _{2} v_{t-1} + (1-\beta _{2})\,[\delta L/ \delta w_{t}]^{2} \tag{14}\end{align*}
Use Dropout Layer to prevent overfitting
Apply Categorical Cross Entropy loss function to minimize error value
\begin{equation*} Loss = - \sum _{i=1}^{\text{Output Size}} y_{i} \cdot \log \hat {y}_{i} \tag{15}\end{equation*}
Compute feature vector dimensions
\begin{align*} [m,m,ch] * [f,f,ch] = \big[(m+2p-f)/s+1\big] \times \big[(m+2p-f)/s+1\big] \times n \tag{16}\end{align*}
End for
Repeat steps 1 and 2 for Input 2
Concatenate both the feature vectors
Pass concatenated feature vector to 2 different input layers, one for features and other one for labels.
Apply ReLU activation function to check the feature maps for any negative values
Use Dropout Layer again to prevent overfitting
Pass the final feature vector to trained model for classification
Return Recognized Action Class
Figure 11 depicts the proposed model architecture. Basically, the model takes two inputs: one from the HOG descriptor and the other from the VGG-16 feature extractor. A CNN model is used to classify the actions. Deep neural networks that use convolutional layers, convolutional operations, activation functions, pooling layers and SoftMax functions make up the CNN class. A CNN can be fed an RGB image directly as input. The RGB image is subsequently processed by the CNN model using a variety of filters, producing feature maps. The essential hyperparameters are the size of the filters, the total number of filters and the padding type; the depth of the output feature map is determined by these hyperparameters. The size of the feature maps after convolution can be determined using Equation (16), and the convolution operation itself is given by\begin{equation*} (I*H) [m,n] = \sum _{j} \sum _{k} H[j,k] \, I[m-j,n-k] \tag{17}\end{equation*}
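As a quick illustration of the output-size relation in Eq. (16), the following hypothetical helper computes the spatial size of the feature map produced by an f x f filter with padding p and stride s:

def conv_output_size(m, f, p=0, s=1):
    # output spatial size of an m x m input: (m + 2p - f)/s + 1, per Eq. (16)
    return (m + 2 * p - f) // s + 1

# e.g. a 224x224 input with a 3x3 filter, padding 1 and stride 1 keeps its size:
# conv_output_size(224, 3, p=1) == 224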
The feature vector dimensions are given in Figure 11. After calculating the feature vector, the ReLU activation function is applied to bring nonlinearity into the model. After that, a pooling layer is applied to reduce the feature vector size by downsizing the image while keeping the essential features. To extract the most useful features from an input image, a block made up of a convolution layer, ReLU and a max pooling layer is repeated. The 3-dimensional feature maps are then flattened into 1-dimensional versions. The last fully connected layer has a SoftMax activation function that generates probabilities between 0 and 1 for the relevant output class. The CNN model employed in this paper is VGG-16, shown in Figure 3.
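A hedged Keras (functional API) sketch of this two-branch design is given below: each branch reduces its input feature vector to 256 units through ReLU hidden layers with dropout, the two 256-dimensional vectors are concatenated, and a softmax layer classifies the 10 actions. Hidden-layer widths other than the 256-unit bottleneck and the 10-way output are illustrative assumptions, not the authors' exact configuration.

from tensorflow.keras import layers, models, optimizers

def branch(inputs, name):
    # three hidden layers plus dropout, ending in the 256-feature bottleneck
    x = layers.Dense(1024, activation="relu", name=f"{name}_fc1")(inputs)
    x = layers.Dense(512, activation="relu", name=f"{name}_fc2")(x)
    x = layers.Dropout(0.5, name=f"{name}_drop")(x)
    return layers.Dense(256, activation="relu", name=f"{name}_fc3")(x)

hog_in = layers.Input(shape=(4536,), name="hog_features")     # Input 1: HOG vector
vgg_in = layers.Input(shape=(1000,), name="vgg16_features")   # Input 2: VGG-16 vector

fused = layers.concatenate([branch(hog_in, "hog"), branch(vgg_in, "vgg")])
fused = layers.Dropout(0.5)(fused)
out = layers.Dense(10, activation="softmax", name="action_class")(fused)

model = models.Model(inputs=[hog_in, vgg_in], outputs=out)
model.compile(optimizer=optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit([hog_train, vgg_train], y_train_onehot, validation_split=0.2, epochs=...)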
Results and Discussions
The success of a HAR model depends upon three major factors: the model architecture, the loss function design and the hyperparameter selection. In our proposed model, we have carefully worked on all these important factors.
A. Model Architecture
In our proposed model, HOG features and VGG-16 features are concatenated. The benefit of concatenation is that feature fusion provides rich description of the internal information about the activities being performed.
One of the main parameters of the HOG feature vector is the cell size. If the cell size is too small, the descriptor cannot distinguish between valuable and irrelevant features; on the other hand, if we choose a larger cell size, a larger portion of the image is compressed and some important information may be lost. Table 4 shows the accuracy of HOG+SVM on the primary dataset and the MSR Action dataset. It can be seen from the table that a cell size of 32×32 does not capture many shape features, whereas 4×4 captures a lot of shape features but enlarges the HOG feature vector and slows down training. As a result, 16×16 is the cell size used in this study. Another main parameter is the number of orientation bins. Choosing too few orientation bins leads to information loss, while choosing too many spreads the information over all bins and lowers the performance of HAR. A proper setting of these two parameters makes the HOG feature extractor well suited to categorizing the peculiarities of elderly actions.
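For reference, the HOG+SVM baseline of Table 4 can be sketched with scikit-learn as below; the RBF kernel and the placeholder arrays hog_train, hog_test, y_train and y_test are assumptions, and the cell size passed to the HOG extractor can be varied (4x4, 16x16, 32x32) to reproduce the comparison discussed above.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel="rbf")             # kernel choice is an assumption
svm.fit(hog_train, y_train)         # hog_train: HOG vectors of the training frames
print(accuracy_score(y_test, svm.predict(hog_test)))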
The benefit of using VGG-16 is that it is pre-trained on millions of images, which gives a more accurate description of the features. Combining the two feature vectors collects all the important information from the image: while HOG works on low-level features, VGG-16 works on abstract-level features to extract complex patterns from the image.
The preparation of HAR models necessitates a GPU. This model is trained on an NVIDIA GeForce GTX 1650 with 16GB RAM, on a machine with an Intel Core i5 10th-generation processor with a clock speed of 2.50GHz. Keras has been used to build the deep neural network architecture. We have used ReLU as the activation function because it brings non-linearity into the model [46]: it checks the feature maps and replaces any negative values with zero.
B. Loss Function Design
To reduce the error, the categorical cross-entropy loss function is utilized; it is a loss function used in multi-class classification. The formula to calculate this loss function is as follows:\begin{equation*} Loss = - \sum _{i=1}^{\text{Output Size}} y_{i} \cdot \log \hat {y}_{i} \tag{18}\end{equation*}
This loss measures how well two discrete probability distributions can be separated from one another. In this instance, y_{i} is the ground-truth (one-hot) value and \hat{y}_{i} is the predicted probability for class i.
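As a tiny numeric illustration of Eq. (18) (not taken from the paper), the loss for one sample reduces to the negative log of the probability assigned to the true class:

import numpy as np

y_true = np.array([0, 0, 1, 0])           # one-hot label, true class index 2
y_pred = np.array([0.1, 0.1, 0.7, 0.1])   # predicted class probabilities
loss = -np.sum(y_true * np.log(y_pred))   # = -log(0.7) ≈ 0.357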
C. Hyperparameters Selection
Hyperparameter selection is one of the important factors in the design of a HAR model. Table 5 gives the details of the hyperparameters selected in the proposed model. As discussed in the previous sections, the cell size and the number of orientation bins are essential hyperparameters for HOG. For the VGG-16 model, the output image size and the output feature map dimensions are the important hyperparameters, as shown in Figure 12. The number of hidden layers, epochs, dropout rate, optimizer, etc., are other important hyperparameters for the CNN model.
The model was trained and tested on the real-time elderly dataset, showing outstanding performance. Figure 12 presents the confusion matrix plotted for each class of the primary dataset, and Figure 13 depicts the confusion matrix for the MSR action detection dataset. The model gets slightly confused between push and pull because of the strong intra-class correlation between these two activities. For all the other activities, the model behaves well, showing 99% accuracy for wave hands, 96% for carry and sit down, 96% for push and walk, 95% for pickup, 91% for stand up and 82% for throw. Figure 14 and Figure 15 represent the training and validation accuracy of the proposed model on the two different datasets.
D. Performance Measures
The precise classification of each class is equally significant, and this is captured by the F-measure (F1). It evaluates the model more effectively than precision alone because it takes into account each class’s recall and precision when computing the score. The formulas to calculate precision, recall and the F1 score are as follows.
Precision:\begin{equation*} P = \frac {TP}{TP+FP} \tag{19}\end{equation*}
Recall:\begin{equation*} R = \frac {TP}{TP+FN} \tag{20}\end{equation*}
F1-Score:\begin{equation*} F1 = \sum _{i} 2\, w_{i}\, \frac {\text{precision}_{i} \cdot \text{recall}_{i}}{\text{precision}_{i} + \text{recall}_{i}} \tag{21}\end{equation*}
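As a hedged sketch, the per-class precision, recall and weighted F1 score of Eqs. (19)-(21) can be obtained directly from the test predictions with scikit-learn; y_test and y_pred_classes are assumed placeholder arrays of true and predicted class labels.

from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred_classes, average="weighted")   # weights w_i follow class support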
1) Cohen’s Kappa Coefficient
The agreement between two raters who each assign N items to C mutually exclusive categories is measured by Cohen’s kappa [54]. The formula for K is \begin{equation*} K = \frac{Pr(a) - Pr(e)}{1 - Pr(e)} \tag{22}\end{equation*}
We have calculated Cohen’s kappa score for the proposed model, and the score comes out to be K = 0.94, which means that the model is performing well.
2) Receiver Operating Characteristic Curve (ROC Curve)
The ROC curve is an evaluation metric used to measure multiclass performance. In general, there are two different methods for selecting multiclass performance indicators: OvR, i.e., one versus rest, and OvO, i.e., one versus one. OvR assesses multiclass models by contrasting one class against all the others simultaneously: we choose one class and treat it as the “positive” class, while the remaining classes are treated as the “negative” class. OvO, which stands for “one versus one,” is quite similar to OvR except that all possible two-class combinations of the dataset are compared rather than one class against the rest. In this paper, we have used the OvR method for measuring the performance of the model. Figure 16 represents the ROC curve for one class versus the rest of the classes. The area comes out to be 1, which means that the model is distinguishing between the classes efficiently.
Let F(p) and G(p) denote the cumulative distribution functions of the classifier score p(x) for the negative (y = 0) and positive (y = 1) classes, respectively. The ROC curve is then defined by plotting G(p) on the vertical axis against F(p) on the horizontal axis. This curve necessarily falls within a unit square, and an ROC curve lying in the upper left triangle of the square represents a good classification rule [55]. Figure 17 represents the ROC curve for all the classes.
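The remaining metrics can be sketched the same way: Cohen's kappa of Eq. (22) on the predicted labels and a one-vs-rest ROC AUC from the softmax probabilities; y_pred_probs is an assumed n x 10 probability matrix and the variable names are placeholders.

from sklearn.metrics import cohen_kappa_score, roc_auc_score

kappa = cohen_kappa_score(y_test, y_pred_classes)                  # reported K ≈ 0.94
auc_ovr = roc_auc_score(y_test, y_pred_probs, multi_class="ovr")   # OvR multiclass AUC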
E. Discussions
The suggested approach develops a customized multi-input feature fusion model for monitoring older people. The study employs handcrafted and self-learnt techniques, using HOG and VGG-16, to recognize elderly daily routine behavior and forecast aberrant activity on the primary and open datasets. Two separate feature sets are taken into consideration as the prediction model is trained over 80 percent of the dataset, and deep neural networks are used to recognize the activity. The outcome analysis of the suggested model is discussed in this section. HOG+SVM is used in [37] to build a human activity recognition system; that model was tested on an open video dataset with 89.59% accuracy. We have also used HOG+SVM on our primary dataset and the public dataset; the results were less accurate because of the noisy background in the MSR Action Detection Dataset. Similarly, VGG-16 is used in [30] for indoor home scene recognition with an accuracy of 71%. When we used VGG-16 on our datasets, the accuracy scores were slightly better, as shown in Table 7, but the reason behind the lower accuracy is the increase in the number of feature dimensions resulting in dying ReLU. The proposed goldenAGER algorithm integrates handcrafted and self-learnt features, but feature fusion often leads to high-dimensional features. goldenAGER uses a series of pooling layers and dense layers to reduce the feature dimensionality, which, in turn, saves computational resources when training the model. Moreover, the existing models lack the ability to find strong intra-class correlation between activities; our model uses Spearman’s correlation coefficient to find intra-class variations between activities of the same class.
We have trained and tested our primary dataset and the MSR action detection dataset on three different models. We have also compared our multi-input feature fusion model with the VGG-16 model in Table 7. To design a HAR model using a CNN, the 1000 features of VGG-16, as shown in Figure 7, are enhanced by using hidden layers and dropout layers. These features are then integrated into a SoftMax layer consisting of 10 different classes. Three hidden layers connect the 1000 features in a manner similar to the second branch of the proposed model, as shown in Figure 12, providing 256 total features. To categorize the 10 different actions, these 256 features are again fed to SoftMax. The evaluation of the proposed model in contrast to the VGG-16 CNN model explains the advancements made by fusing CNN and HOG characteristics. Figure 14 and Figure 15 show the accuracy graphs. In comparison to the CNN model, the proposed model extracts more detailed features, which enables the deep neural network to easily detect more complex patterns of different actions. As a result, the training of the proposed model becomes faster in each training cycle.
A multi-input feature fusion model is thus created for action recognition in this paper. Using hand-engineered and self-learnt feature descriptors, our proposed model can extract the features separately. At the same time, the filters in different channels are not shared, which ensures that each input can have a different convolutional setting for its features. Each branch of the multi-input deep convolutional neural network proposed in this study can rely on the deep CNN to extract features, and the features extracted from each input are concatenated through the concatenate layer. This network can be thought of as a data fusion method at the feature level.
The fused feature vector is then forwarded to the fully connected network for feature fusion. Without relying on intricate feature engineering, this effective feature-level fusion technique can automatically fuse the most distinctive features from each channel and remove redundant features between channels. The entire procedure is end-to-end, which is better for subsequent real-time decision-making. Table 7 reports the accuracy scores of the three models. It can be seen from the table that HOG+VGG-16 performs better than using HOG or VGG-16 individually. The first model uses HOG as a feature descriptor and SVM for classification of the activity type; HOG+SVM gives an accuracy of 85.92% on the primary dataset and 71.17% on the MSR Action Detection Dataset. The accuracy is lower because HOG can extract local features such as edges and corners from an image but is unable to identify complex patterns. The VGG-16 model applied to both datasets gives accuracies of 82.31% and 75.84%, respectively, which is also comparatively low because it is not invariant to illumination changes like HOG. The combination of the two models gives better results for both the primary dataset and the publicly available dataset, with accuracies of 95% and 93.08%, respectively. While HOG works on low-level features, VGG-16 works on abstract-level features to extract complex patterns from the image. Table 8 presents the performance comparison of our proposed method with other existing methods on the MSR action detection dataset. It can be seen clearly from the table that our method outperforms the existing methods.
The proposed model can be used as a framework for many applications including continuous monitoring of elderly people. However, the model can only currently recognize a finite set of actions. As a result, the training data should be enhanced to include real-world activities for elderly.
Conclusion
Automatic recognition and analysis of human activities, especially elderly activities, is still a complicated task. The substantial intra-class variation between various activities performed by senior people due to the age factor is the key problem in the design of such a system. We have used Spearman’s correlation coefficient to find the intra-class correlation between the activities. The values for the 10 classes in the primary dataset, i.e., Carry, Clap Hands, Pickup, Pull, Push, Sit Down, Stand Up, Throw, Walk and Wave Hands, are +0.41, +0.59, −0.42, −0.08, −0.21, +0.77, +0.89, −0.91, −0.82 and +0.73, respectively. Activities such as Pick Up, Pull, Push, Throw and Walk have strong intra-class variation. A HAR model’s accuracy is affected by a number of problems, including illumination variations and variations in performing the same activity. The most recent CNN-based models perform admirably and increase accuracy; however, they fall short in differentiating between similar actions. In order to solve this issue, a personalized multi-input feature fusion algorithm, goldenAGER, that takes into account both handcrafted and self-learned features is proposed in this study. To distinguish different human action patterns, VGG-16 and HOG features are extracted. The algorithm is trained and tested on a real-time dataset collected from elderly volunteers with 10 different action classes. Our method performs well on this dataset with an accuracy of 95%. There exist multi-feature fusion models for the same application in the literature, but our proposed model differs in its model architecture, optimization strategy, loss function design and hyperparameter selection. The proposed method is compared with the existing methods, showing that our model aids in the development of a strong HAR system to classify and recognize elderly activities more accurately.
The proposed method recognizes only the simple activities mentioned above, which could be considered the major limitation of the proposed work. In the future, we intend to recognize more complex activities that involve human-human or human-object interaction, such as handshake, hug, brushing teeth, cleaning the floor, cooking and watching TV, to check the validity of this model on these types of activities. We also plan to add activities like falling to prevent life-threatening risks. We will also compare the proposed model, fusing HOG features with other off-the-shelf CNN classifiers such as VGG-19, Wide-ResNet and ViT.
ACKNOWLEDGMENT
This work forms part of the Ph.D. research of Diana Nagpal.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this article.