Introduction
Human Action Recognition (HAR), a crucial component of computer vision, lies at the intersection of the latest sensor technologies, Machine Learning (ML), and Deep Learning (DL). This intricate process involves the careful identification and classification of human actions based on raw data collected from diverse sources, such as body-worn sensors, smartphones, and cameras. By analyzing patterns of human movement, HAR systems enable a deeper understanding of behavior across various environments, offering valuable insights for applications ranging from healthcare to law enforcement [1].
The realm of HAR encompasses a wide range of practical applications, including the enhancement of surveillance systems for crime prevention and the improvement of user interactions in smart home settings. In recent years, the field has advanced significantly, with HAR systems evolving from basic motion sensors to sophisticated networks capable of intricate pattern recognition and real-time processing [2], [3], [4]. Figure 1 demonstrates the utilization of DL techniques on sensory data for time-series categorization, which plays a vital role in understanding temporal changes and identifying human actions using various sensor modalities [5], [6].
HAR has made significant progress, but still faces considerable challenges. Throughout its development, the field has achieved important milestones, encountered obstacles, and adapted to emerging technologies. Initially, the main challenge was to achieve high accuracy in various complex real-world scenarios. This requires a deep understanding of the intricacies of human movement and real-time data processing. Researchers have addressed this challenge by creating inclusive datasets covering a wide range of human activities and demographics. They also focused on reducing the biases in algorithmic models and utilizing multimodal data to enhance the robustness and accuracy of HAR systems [5], [6].
Furthermore, integrating HAR with emerging technologies, such as augmented reality (AR) and virtual reality (VR), has opened up new possibilities, providing more immersive and interactive experiences. This integration highlights the dynamic nature of the field as it adapts to the potential offered by these technologies [7], [8]. In this review, our objective is to offer a comprehensive overview of the development of HAR. We discuss key milestones, the technological components that have supported its evolution, and recent advancements. Additionally, we explore the essential features necessary for accurate recognition, address the ongoing challenges that shape the future of HAR research, and critically evaluate current approaches. The main contributions of this review are as follows:
A thorough investigation of the most recent advancements in HAR, encompassing methodologies based on ML or DL, benchmark datasets, and the metrics employed to assess their effectiveness.
A detailed presentation of a standard HAR system’s architecture, outlining the crucial technological components and their interaction.
A critical evaluation of current approaches in HAR using standard metrics.
The remainder of this paper is structured as follows: Section II provides an in-depth explanation of the methodology employed for the literature search and selection, Section III summarizes the main findings, and Section IV presents concluding remarks and potential avenues for future research.
Materials and Methods
This section outlines the methodological framework used to conduct a systematic review of the literature on HAR systems. The primary objective of this systematic approach is to ensure comprehensive coverage and robust evaluation of the relevant research published within the scope of this study.
The selection process is depicted in Figure 2, which adheres to PRISMA guidelines.
The PRISMA flowchart illustrates the meticulous selection process, starting from the initial identification of records to the ultimate inclusion for qualitative synthesis.
The process of a HAR system is depicted in this flowchart, which visually represents the interconnected stages crucial for identifying human actions. Every step, from data collection through classification, plays a vital role and builds on the stages before it, ultimately enhancing the accuracy and effectiveness of the entire system.
A. Search Strategy
To comprehensively gather studies related to HAR based on sensor and vision technologies, a thorough search strategy was implemented. This strategy involved systematically exploring various electronic databases, such as Google Scholar, Web of Science, Scopus, IEEE Xplore, and ScienceDirect. By utilizing a combination of specific keywords like “human action recognition,” “daily life activities,” “sensor data,” and “vision data,” along with Boolean operators (AND, OR, NOT), the search was carefully refined to obtain relevant results. The search strings used for each database can be found in Table 1.
B. Inclusion and Exclusion Criteria
The inclusion criteria were precisely defined to select studies that significantly contributed to the understanding of HAR systems. Criteria for exclusion were established to uphold the quality of the review, ensuring a focus on original empirical research.
Inclusion criteria included:
Studies published in English from 2000 to 2023.
Research focusing on HAR within the context of the Activities of Daily Living (ADL).
Studies based on sensors or vision data.
Articles published in peer-reviewed journals or conference proceedings.
Studies with accessible full text.
Exclusion criteria applied:
Non-English publications.
Duplicate studies within the databases.
Review articles and meta-analyses (retained only for background context).
Studies with inaccessible full text or published in low-quality venues.
C. Selection Process
A total of 164 records were initially identified: 160 from the database searches and 4 from other relevant sources. After eliminating duplicates, 128 distinct records were retained and screened on the basis of their titles and abstracts to determine their relevance.
D. Data Extraction and Analysis
To maintain consistency and objectivity throughout the review process, a standardized form was used for data extraction. This form encompasses various details, such as the authors, publication year, methodologies employed, datasets used, evaluation metrics, and principal findings. Subsequently, a mixed-methods approach was employed to synthesize the extracted data, enabling the identification of prevailing trends, state-of-the-art techniques, and persistent challenges within the domain of HAR research.
E. Rationale for Inclusion and Exclusion Criteria
The inclusion and exclusion criteria were established to focus the review on methodologically robust studies that directly pertained to the sensor- and vision-based HAR systems. This deliberate selection aimed to ensure the academic rigor and relevance of the review.
Summary of Key Observations
This section provides an overview of the key findings from the present study, which include the structure of HAR systems and their applications, datasets used, algorithms employed, and the challenges faced during the research process.
A. The HAR System Architecture
HAR systems have been developed to analyze raw sensory data and derive meaningful insights regarding human behavior. The initial step involves collecting data from various sensors, including cameras, inertial measurement units, and microphones, which capture multi-modal data reflecting human movement. To ensure data quality and standardization, the collected data undergoes preprocessing, which includes labeling the data with relevant action descriptors and normalizing the input scales. These preprocessing steps are essential for cleaning and preparing the data for feature representation. Subsequently, sophisticated techniques are employed to extract significant features from the preprocessed data. In the classification stage, advanced DL or artificial intelligence (AI) models interpret these features to categorize complex actions. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are commonly utilized algorithms for discerning and classifying actions.
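To make this pipeline concrete, the following minimal sketch walks through windowing, hand-crafted feature extraction, and classification on synthetic accelerometer data. The window length, step size, feature set, and classifier choice are illustrative assumptions rather than a prescription from the surveyed literature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def sliding_windows(signal, labels, win=128, step=64):
    """Segment a (time, channels) sensor stream into fixed-size windows."""
    X, y = [], []
    for start in range(0, len(signal) - win, step):
        X.append(signal[start:start + win])
        y.append(np.bincount(labels[start:start + win]).argmax())  # majority label
    return np.array(X), np.array(y)

def extract_features(windows):
    """Per-window, per-channel statistics (an illustrative feature set)."""
    return np.concatenate([windows.mean(axis=1),
                           windows.std(axis=1),
                           windows.min(axis=1),
                           windows.max(axis=1)], axis=1)

# Synthetic stand-in for a triaxial accelerometer stream with 3 activity classes.
rng = np.random.default_rng(0)
stream = rng.normal(size=(10_000, 3))
labels = rng.integers(0, 3, size=10_000)

X, y = sliding_windows(stream, labels)
X = StandardScaler().fit_transform(extract_features(X))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")  # near chance on random data
```

In a real system, the random stream would be replaced by labeled sensor recordings, and the hand-crafted features could be swapped for learned CNN or RNN representations.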
This architectural framework serves as a fundamental basis for comprehending the intricate challenges associated with HAR. These challenges encompass the handling of high-dimensional data, ensuring reliable feature extraction in varying environmental conditions, and selecting appropriate classification algorithms that offer both accuracy and prompt response times. The subsequent sections of this study delve into these challenges in greater detail, proposing potential solutions and discussing diverse applications and the range of techniques employed to refine HAR systems.
Figure 4 provides a taxonomy that encapsulates the existing literature on HAR, illustrating the extensive scope and depth of the field. The following sections will further explore various applications of HAR, showcasing the wide array of techniques and algorithms that enhance the capabilities of these systems.
Taxonomy of HAR. This taxonomy outlines the various facets of HAR research, encompassing methodologies, applications, and challenges, providing a structured overview of the field.
B. Application Areas
This section provides an overview of the various fields where HAR finds its applications, with particular emphasis on its important contributions to ADL, surveillance, Human-Computer Interaction (HCI), and competitive sports.
1) HAR in Daily Life Actions
To enhance human well-being, a nuanced understanding of daily activities facilitated by advanced HAR technologies such as PoseNet and GHUM is crucial. These models, which utilize convolutional neural networks, have significantly advanced human pose estimation in images and videos, leading to breakthroughs in fields such as smart home systems, elderly care, and physiotherapy [9], [10], [11], [12].
PoseNet’s efficiency in detecting and tracking angular movements was highlighted by its impressive accuracy rate of 97.6% for 2D pose detection. This capability has been instrumental in telehealth systems for home-based rehabilitation, allowing for the precise monitoring and adjustment of rehabilitation exercises [10]. GHUM, with its advanced bone-based sensing technique, offers robust pose estimation in nonintrusive monitoring scenarios, making it particularly useful in elderly care environments for fall detection and activity monitoring [13]. These technologies ensure safety and health monitoring in smart homes by providing real-time alerts in case of abnormalities [14].
Despite these advancements, HAR systems face challenges related to their sensitivity to varying lighting conditions and camera angles. These environmental factors can significantly affect the accuracy of pose estimation, with deviations of up to 15% caused by variations in the lighting conditions [15]. To address these issues and enhance the robustness of HAR systems, multiview models and image enhancement techniques have been developed [16]. Additionally, recognizing individual actions in complex scenarios involving multiple people poses challenges. Recent studies explored advanced detection models to address these complexities [11].
Gong et al. [17] presented a novel DL model that effectively improved the recognition accuracy of activity classes within datasets, particularly in complex activity scenarios. By incorporating this model, recognition accuracy was enhanced by approximately 10% [17], [18]. This significant development addressed the persistent challenge of the semantic gap in HAR.
Ongoing research endeavors are expected to prioritize the enhancement of data quality, model adaptability, and real-time processing in HAR technologies. These advancements are of utmost importance for accurately interpreting intricate daily human actions and seamlessly integrating HAR into our everyday lives. Ultimately, these advancements have the potential to revolutionize interactions with technology.
2) HAR in HCI
The integration of HAR into HCI has brought about a significant transformation in how users interact with technology. By integrating HAR capabilities, such as gesture control and eye tracking, systems have been able to enhance the user experience by making interactions more intuitive and responsive. These technologies have become increasingly prevalent in various applications, ranging from smart home interfaces to healthcare monitoring, and have significantly improved the seamless interaction between humans and technology [19], [20].
One notable example of the impact of HAR technologies is in smart home systems, where they have achieved accuracy rates of up to 90% in gesture recognition. This high level of accuracy enables users to control home appliances more naturally and efficiently [21]. In the context of healthcare monitoring, HAR systems play a crucial role in facilitating patient interaction with medical devices and thereby contribute to the ease and accuracy of patient data collection.
Overall, integrating HAR into HCI has revolutionized user interactions with technology, making them more intuitive, responsive, and seamless. Nevertheless, limitations persist across domains: research shows that gesture recognition accuracy can decrease by up to 20% in poorly lit environments [22]. Furthermore, the latency of eye-tracking systems, which is crucial for interactive applications, can vary, with some systems reporting latencies of up to 50 ms, which may affect the user experience in fast-paced applications [23].
Current research aims to overcome these limitations by developing robust HAR algorithms specifically designed to adapt to different environmental conditions and user behaviors, thereby enhancing accuracy and minimizing latency. For instance, the use of ML techniques in dynamic mathematics software has resulted in a 15% enhancement in user interaction efficiency by predicting user intent [24].
In the future, the field of HAR for HCI is expected to witness further advancements with the integration of AR/VR technologies. These advancements are geared towards creating more immersive and interactive experiences, potentially reducing the perceived latency in user interactions and enhancing the naturalness of user interface controls. The integration of HAR with AR/VR holds the promise of opening up new possibilities for HCI, fostering greater user engagement, and facilitating the development of innovative interaction paradigms.
3) HAR in Competitive Sports
The integration of HAR into competitive sports has brought about significant changes, particularly in the utilization of biomechanical data to enhance athletic performance and develop strategies for preventing injuries. Sports analytics relies heavily on technologies such as accelerometers and GPS tracking devices, which provide detailed insights into athlete movements and training loads. These insights are crucial for personalizing training programs and minimizing the risk of injury [15], [25], [26].
For example, in track and field, HAR technologies have been employed to optimize running technique. The data showed that athletes could achieve up to a 5% improvement in running efficiency by adjusting their form based on HAR feedback [27]. Similarly, in team sports, HAR systems have played a vital role in reducing injury rates by approximately 20% through tailored training programs and the early detection of fatigue [28], [29].
Despite the achievements of HAR, several challenges need to be addressed. One of these challenges is the variable accuracy of HAR systems under various environmental conditions. For instance, in outdoor sports, data collection can be hindered by weather variations, resulting in a decrease in data reliability of up to 10% under adverse weather conditions [30], [31], [32]. Moreover, the complexity of athlete movements and the presence of sports equipment can affect accuracy.
In the future, advancements in HAR for sports will primarily focus on improving the precision of collected data and developing more sophisticated analytics. The integration of HAR with ML, DL, and AI has great potential for transforming athletic performance analysis into predictive analytics. By utilizing HAR data for real-time feedback and for predicting injury risks, more effective training strategies and injury prevention measures can be implemented [33]. The evolving role of HAR in competitive sports is expected to provide deeper insights into athletic performance and become an integral aspect of sports training and performance enhancement.
4) HAR in Surveillance
The significance of HAR in surveillance, particularly in security and elder care, has grown significantly. HAR systems play a crucial role in improving the efficiency and privacy of real-time action identification in surveillance operations, thereby contributing to monitoring environments and detecting unusual activities [34], [35]. In the context of elderly care facilities, HAR systems have proven to be effective in reducing false-alarm rates by up to 30%, leading to a notable improvement in emergency response effectiveness [36].
The future of HAR in surveillance is characterized by the integration of AI tools, which are expected to enable the development of more context-aware and intelligent systems. This integration aims to enhance the ability of the system to accurately interpret complex scenarios, with recent advancements demonstrating a 25% improvement in detecting subtle activities [37]. Such advancements are particularly crucial in domains such as elderly care, where accurately distinguishing normal behaviors from potential emergencies can significantly impact the safety and well-being of individuals.
The application of HAR technology in different settings poses a significant challenge. In particular, outdoor security scenarios often present difficulties for surveillance systems owing to unpredictable weather conditions and varying lighting. These factors can have a detrimental effect on the accuracy of activity detection, reducing it by as much as 20% [38], [39]. Ensuring the reliability and precision of HAR systems under such diverse conditions, including low light and occlusions, is crucial for applications in security and healthcare, and ongoing research is focused on enhancing their robustness accordingly. This involves the development of advanced algorithms and the integration of ML/DL techniques to improve adaptability to changing conditions and camera perspectives. As HAR technologies continue to progress, their integration with AI is expected to result in more sophisticated surveillance systems that enhance the efficiency of both action detection and contextual interpretation, aligning surveillance operations more closely with real-world requirements [40], [32], [41].
C. Datasets
The HAR field relies heavily on a wide range of datasets, each providing unique insights and presenting fundamental challenges that are crucial for advancing HAR technologies. The datasets used in HAR include both RGB and depth data, which are distinct modalities: while Red Green Blue (RGB) images capture color information, depth images encode the distance of each point from the image plane, typically acquired via Time-of-Flight (TOF) technology. The combined RGB-D format merges color and depth data, associating the objects in the RGB images with their distances from the image plane. These datasets are pivotal for developing and testing HAR algorithms and significantly contribute to the progress of the field [42].
1) Benchmark Datasets
Benchmark datasets play a vital role in HAR research because they serve as a fundamental foundation for the development and assessment of models. These datasets hold immense importance, as they provide controlled settings for testing and enhancing HAR algorithms.
KTH: Introduced by Schuldt et al. [43], this dataset comprises six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed by 25 subjects in four different scenarios, yielding 2,391 video sequences at a resolution of 160×120 pixels.
UT-Kinect: Presented by Xia et al. [45], this dataset is commonly used for action recognition based on depth sequences. It was acquired using a single stationary Kinect camera and features 10 action types: walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving hands, and clapping hands. Each action type was performed by ten different subjects, resulting in a total of 100 samples. The dataset includes three channels of information: RGB frames (640×480), depth maps (320×240), and skeleton joint locations.
MSR 3D Action: This dataset comprises twenty distinct action categories and ten subjects, with each subject performing each action two or three times. A total of 567 depth map sequences were recorded at a resolution of 320×240.
Florence 3D Action: This dataset comprises nine everyday activities: wave, drink from a bottle, answer the phone, clap, tie lace, sit down, stand up, read/watch, and bow. It features ten subjects, each of whom performed the actions two or three times, yielding 215 video clips.
NTU RGB+D: As utilized by Tu et al. [49], this large-scale dataset is a comprehensive resource for RGB-D HAR, encompassing 56,880 samples from 60 action classes across 40 subjects. The action categories include 40 daily actions, nine health-related actions, and 11 mutual actions, occurring in 17 different scene conditions captured by three cameras at various horizontal imaging viewpoints. Providing multimodal information, including depth maps, 3D skeleton joint positions, RGB frames, and infrared sequences, the dataset is publicly available for non-commercial use. Although advantageous for its extensive samples, action categories, and scene conditions, it suffers from the comparatively low resolution of its depth and infrared data.
RGBD-HuDaAct: Explored by Zhao et al. [50], this dataset targets daily-activity recognition from synchronized RGB and depth video captured with a Kinect sensor. It covers 12 categories of everyday activities (e.g., making a phone call, mopping the floor, drinking water) performed by 30 subjects and was designed with assisted-living monitoring in mind. Its synchronized color and depth channels make it well suited to multimodal methods, although its constrained laboratory setting limits scene diversity.
UCF-101: Examined by He et al. [51], this dataset comprises 101 categories, including sports and human-object interactions, across 13,320 clips sourced from YouTube. It presents challenges through variations in camera motion, object appearance, and environmental conditions. Its advantages include a diverse collection and robust model performance; its limitations include limited resolution and potential noise in the action labels. Suited to applications such as surveillance, it may be less appropriate for recognizing simple actions that other modalities capture more directly. Recommendations for future development include enhancing the RGB video resolution, refining action labels, and incorporating additional modalities, such as depth or skeleton data, for improved recognition.
Each dataset has distinct characteristics that make it suitable for different aspects of HAR research. However, limitations such as realism, diversity, and complexity should be considered when selecting a dataset for a specific research goal.
2) User-Generated Datasets
HAR is a key area of interest in computer vision research, particularly in applications such as surveillance systems, home monitoring [52], [53], [54], [55], [56], [57], [58], and sensor-based senior monitoring [59]. Work on human motion-based characteristics for detection and classification, which involves preprocessing techniques such as spatio-temporal filtering, background subtraction, and optical flow [60], along with applications in video surveillance [61] and virtual reality [62], often leverages both benchmark and user-generated datasets.
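As a minimal sketch of one such preprocessing step, the snippet below extracts foreground silhouettes with OpenCV's MOG2 background subtractor; the input file name, history length, and threshold values are illustrative assumptions.

```python
import cv2

# Hypothetical input video; any ADL clip would do.
cap = cv2.VideoCapture("activity_clip.avi")
# MOG2 models the background online; these parameter values are illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25,
                                                detectShadows=True)
silhouettes = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Suppress shadow pixels (marked as 127 by MOG2) and smooth the silhouette.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    silhouettes.append(mask)
cap.release()
```

The resulting binary masks can then feed silhouette-based feature extraction of the kind discussed in Section on silhouette sequences below.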
a: ViHASi
Ragheb et al. [63] developed the ViHASi dataset, a large collection of synthetic human activities designed for testing action identification algorithms. This dataset included synchronized perspectives from multiple cameras, various actors, and a wide range of action classes, thereby offering a comprehensive testing ground for long video sequences with numerous action samples.
b: Smartphone-Based Datasets
Micucci et al. [64] examined smartphone accelerometer datasets for detecting ADLs. They highlighted the importance of feature selection in categorizing falls, using the UniMiB SHAR dataset as a case study. However, datasets such as MobiAct, RealWorld, and UMA Fall demonstrate a gender imbalance, predominantly featuring male subjects, which should be considered in future research.
c: VidOR
Shang et al. [65] introduced the VidOR dataset containing annotated films with object categories and predicates. Their work included an automated pipeline for labeling user-submitted videos and extensive annotation analysis, offering rich resources for video object and relation recognition.
Table 4 offers a more comprehensive perspective on the performance of classifiers in terms of accuracy, considering both benchmark and user-generated datasets.
3) Summary on Comparing HAR Datasets
In summary, notable advancements have been achieved in creating HAR datasets, contributing significantly to the progress of this field. A thorough examination of different datasets demonstrated substantial improvements in addressing complex human actions and interactions. However, challenges still need to be addressed, particularly in terms of dataset diversity and representativeness, as these factors can impact the universality and effectiveness of HAR systems. It is crucial to obtain datasets that encompass a wide range of human actions under various conditions. Future developments should focus on enhancing the comprehensiveness of datasets by prioritizing inclusivity and variability to better reflect real-world scenarios. This approach will further refine HAR technologies, resulting in more precise and adaptable systems, ultimately pushing the boundaries of what can be achieved in human-action recognition.
D. Techniques/Algorithms
Technological progress has had a profound influence in different areas, particularly in the development of innovative techniques in image detection, computer vision, and facial recognition, which have played a vital role in advancing HAR with applications in training, security, video surveillance, and automated observation. Figure 5 illustrates the progression of the HAR strategies over time, emphasizing important milestones in the field.
1) Supervised Learning
Supervised learning methods have played a crucial role in advancing HAR, as evidenced by numerous studies that show their effectiveness.
Random Forest (RF): Xu et al. [71] demonstrated the efficacy of RF in handling accelerometer data, achieving superior accuracy and adaptability to environmental limitations compared with conventional approaches.
ML with Wearable Sensors: The authors of [72] employed ML algorithms in conjunction with wearable sensors found in smartwatches, highlighting the practical implications of these technologies in the realm of health and safety monitoring.
3-D CNN for Video Surveillance: Almaadeed et al. [73] presented a groundbreaking Three-Dimensional Convolutional Neural Network (3D CNN) structure that effectively enhances the analysis of action sequences and strengthens the capabilities of video surveillance.
Structured-Tree Neural Networks: In their study, Khan et al. [69] introduced an innovative structured-tree neural network trained on the NTU RGB+D dataset. This approach provides a distinct perspective for analyzing human motion in the context of 3D action recognition.
Enhancement of Pre-trained CNNs: Ozcan and Basturk [74] investigated the performance enhancement of the NASNet-Large architecture in pre-trained CNNs. They conducted experiments to demonstrate the improved accuracy achieved by fine-tuning training parameters.
Two-Stream Neural Network for AIoT-based Surveillance: Ullah et al. [75] proposed a novel two-stream neural network architecture that facilitates the real-time detection of events in AIoT-based surveillance systems. Their work emphasized the significance of this approach in resource-constrained environments, showcasing its potential for efficient event identification.
2) Human–Robot Interaction (HRI)
The integration of HAR into HRI is crucial for developing intelligent robotic systems. Mojarad et al. [76] proposed a hybrid approach that combines ontology and ML techniques, thereby enhancing a robot's understanding and interaction capabilities within human environments. Another study, by Martinez et al. [7], focused specifically on the role of HAR in robotic imitation learning, which is crucial for robots to acquire the ability to learn and mimic human activities for social interaction and assistance. Additionally, Rea et al. [77] identified key moments in human actions, contributing significantly to more intuitive and responsive robot behavior in real time and thereby advancing the field of HRI.
3) Silhouette Sequences
The integration of 3D skeletal posture estimation with 2D forms for real-time low-dimensional feature extraction and fusion is highlighted in silhouette sequences of HAR. Chaaraoui et al. [78] successfully combined skeletal and silhouette features, resulting in a significant improvement in recognition rates for capturing dynamic human actions. Elharrouss et al. [79] introduced an innovative approach that analyzed silhouettes by incorporating time as a third spatial dimension, thereby showcasing advancements in dynamic human action capture. Expanding the field, Murtaza et al. [80] and Maity et al. [81] focused on multiview HAR and silhouette normalization to enhance accuracy and computational efficiency. Ramya and Rajeswari [82] explored the use of distance transforms and entropy for silhouette analysis and demonstrated the versatility of silhouette-based methods across various settings.
4) Computational Modeling
Computational modeling in HAR centers on the interplay between the hardware capability and algorithm efficiency. Meng et al. [83] deployed Field-Programmable Gate Array (FPGA) technology to enable real-time on-device processing in HAR, which is crucial for immediate application requirements, such as surveillance. Liu et al. [84] developed bioinspired models that emulate the neural processing of the visual cortex, streamlining computational demands. Several recent studies [85], [86] presented FPGA-compatible architectures that prioritize computational frugality while maintaining accuracy, thereby broadening the scope of HAR in portable and embedded devices. Javed et al. [87] focused on the practical integration of HAR into everyday life through smartphones, thereby propelling the field towards user-centered applications.
5) Graph-Based Approaches
Graph-based representations have proven highly valuable for HAR in various domains. Aoun et al. [88] demonstrated the effectiveness of graphs for managing spatiotemporal data and skeleton-based structures in the context of HAR. Similarly, Li and Leung [89] employed graph kernels to analyze action similarity and achieved remarkable results in the analysis of 3D skeletal data from depth-captured benchmark datasets. Mondal et al. [90] took a step further by developing an end-to-end fast Graph Neural Network (GNN) that transforms time-series data into structural graph representations, thereby introducing a novel dimension to HAR. Ahmad et al. [91] explored various Graph Convolutional Network (GCN)-based methods, including reinforcement learning and encoder-decoder GCN models, shedding light on the continuous advancements in this field. Zhou et al. [92] applied graph-based techniques to long-term activity patterns, effectively integrating local temporal and global semantic relations to gain a comprehensive understanding of human actions.
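To illustrate the core operation behind such skeleton-graph methods, the sketch below builds a normalized adjacency matrix for a toy five-joint skeleton and applies one graph-convolution step. The topology, feature sizes, and random weights are illustrative assumptions, not any cited model.

```python
import numpy as np

# Toy 5-joint skeleton: edges encode the kinematic tree (illustrative topology).
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
N = 5

A = np.eye(N)                        # self-loops keep each joint's own feature
for i, j in EDGES:
    A[i, j] = A[j, i] = 1.0
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt  # symmetrically normalized adjacency

def gcn_layer(X, W):
    """One graph-convolution step: aggregate neighbors, then project (ReLU)."""
    return np.maximum(A_hat @ X @ W, 0.0)

# X: per-joint 3D coordinates for one frame; W: learned weights (random here).
X = np.random.randn(N, 3)
W = np.random.randn(3, 16)
H = gcn_layer(X, W)   # (5, 16) joint embeddings, later pooled over joints/time
print(H.shape)
```

Stacking such layers over the skeleton graph, and extending the adjacency across time, is the basic mechanism underlying the GCN variants surveyed above.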
6) Deep Learning Architectures
Various DL architectures have been used in HAR to distinguish intricate human actions. This discourse delves into multiple prominent approaches, assessing their adoption and suitability, while emphasizing their advantages and limitations.
One such approach is SlowFast, a dual-stream network capable of simultaneously capturing spatial and temporal features, which makes it well suited to analyzing complex and diverse human motions [93]. SlowFast was further enhanced by incorporating a You Only Look Once (YOLO) model and a temporal attention mechanism, enabling spatial localization and temporal alignment [94]. However, SlowFast has some limitations, including high computational costs, sensitivity to hyperparameters, and challenges in effectively handling occlusions and background clutter.
I3D ResNet50 is a network that combines 3D convolution with ResNet50 feature extraction. It is designed to be adaptable to varying activity durations and has demonstrated superior performance on the Kinetics dataset [95]. To make it suitable for real-time HAR on mobile devices, it was optimized by reducing its complexity using a lightweight RGB model and a knowledge distillation technique [96]. However, this approach has some limitations, including high memory consumption, low efficiency, and poor generalization to unseen domains.
In contrast, the two-stream CNN introduces separate spatial and temporal streams. This architecture has proven effective for recognizing diverse human actions and has achieved competitive results on the UCF-101 and HMDB-51 datasets [97]. To enhance its performance in skeleton-based HAR, it has been fused with capsule networks [98]. Despite its strengths, the two-stream CNN relies heavily on optical flow computations, which are computationally expensive; in addition, its fusion efficiency is relatively low, and it lacks spatiotemporal coherence.
The 3D CNN with LSTM is a powerful network that combines the capabilities of 3D CNN with Long Short-Term Memory (LSTM) for spatiotemporal feature learning. This network recognizes prolonged sequential patterns and has shown impressive performance on benchmark datasets such as KTH and Weizmann [44]. It is commonly used for abnormal behavior recognition by employing a multi-scale feature fusion module and a multitask learning scheme to enhance its capabilities [99]. However, it is important to note that this approach has certain limitations. First, it exhibits high complexity, which makes it computationally expensive and time-consuming. Second, its robustness is relatively low, making it sensitive to variations and noise in input data. Finally, it faces challenges in handling noisy and irregular actions that can affect performance in real-world scenarios.
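As a concrete illustration of this hybrid design, the following minimal PyTorch sketch stacks a small 3D convolutional feature extractor over an LSTM that summarizes the temporal sequence. The layer sizes, clip dimensions, and class count are illustrative assumptions rather than the architecture of any cited work.

```python
import torch
import torch.nn as nn

class C3DLSTM(nn.Module):
    """Minimal 3D-CNN + LSTM HAR model; all layer sizes are illustrative."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),             # pool space, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse space, keep time
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clips):                    # clips: (B, 3, T, H, W)
        feats = self.conv(clips)                 # (B, 32, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        _, (h, _) = self.lstm(feats)             # last hidden state summarizes clip
        return self.head(h[-1])                  # (B, num_classes)

logits = C3DLSTM()(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 6])
```

The 3D convolutions capture short-range spatiotemporal patterns, while the LSTM models the prolonged sequential dependencies noted above.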
On the other hand, transformer-based networks leverage attention mechanisms to capture long-range dependencies in temporal sequences. These networks have shown great promise in modeling complex human actions and have achieved remarkable results in various natural language processing tasks [100]. Recently, efforts have been made to develop lightweight transformer models specifically designed for HAR on mobile devices [101]; these models aim to reduce computational cost and model size while maintaining satisfactory performance. However, transformer-based networks also have limitations. First, they often require significant computational resources, which can be challenging for resource-constrained devices. Second, their large model sizes can impose storage and memory constraints. Finally, like many DL models, transformer-based networks are vulnerable to adversarial attacks, which can compromise their performance and reliability in real-world applications.
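A minimal sketch of this idea applied to HAR is given below: a transformer encoder attends over a sequence of per-frame features and classifies from a learned [CLS] token. The feature dimension, sequence length, and head/layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HARTransformer(nn.Module):
    """Minimal transformer encoder over per-frame features; sizes illustrative."""
    def __init__(self, feat_dim=64, num_classes=6, seq_len=30):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos = nn.Parameter(torch.zeros(1, seq_len + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                        # x: (B, T, feat_dim) frame features
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos[:, : x.size(1) + 1]
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from the [CLS] token

logits = HARTransformer()(torch.randn(2, 30, 64))
print(logits.shape)  # torch.Size([2, 6])
```

Self-attention lets every frame attend to every other frame directly, which is what gives transformers their advantage over recurrent models on long-range temporal dependencies.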
Capsule Networks (CapsNets) are a type of neural network that focuses on capturing hierarchical representations of features. They excel in handling spatial hierarchies and have demonstrated impressive performance on popular datasets such as MNIST and CIFAR-10 [102], and they have been combined with GCNs for skeleton-based HAR [103]. However, it is important to acknowledge the limitations of CapsNets, including their high sensitivity to hyperparameters, limited scalability, and challenges in modeling temporal features.
These architectural approaches collectively contribute to the evolving landscape of HAR by offering unique strengths and by addressing specific challenges in understanding and recognizing diverse human actions. The selection of an appropriate architecture depends on the intricacies of the targeted recognition task, the characteristics of the dataset being used, and computational considerations.
A comprehensive history of research on action and behavior recognition is presented in Table 5. Although many studies have provided an overview of different methods for detecting human actions, they often overlook a detailed analysis of the advantages and disadvantages associated with each approach.
E. Open Challenges and Limitations
HAR systems encounter various obstacles, particularly in the realm of image analysis on a global scale, where each pixel plays a role in the final descriptor. Conventional approaches require manual detection and background subtraction, which makes them susceptible to variations in lighting, background noise, and visibility. In the realm of surveillance, video analysis is employed by law enforcement and security agencies to ensure public safety, monitor events, and resolve crime. A substantial amount of data acquired from surveillance systems necessitates algorithms that strike a balance between swift training and detection, high reliability, and the ability to learn from limited datasets [108]. The following sections discuss the common challenges faced by HAR systems.
1) Data Collection and Pre-Processing
Accurate prediction models rely heavily on effective data collection and pre-processing. In their study, Jiang et al. [109] emphasized the significance of accurately recording affected signals by considering the diversity of human activities across different locations and the influence of ambient sensors. The process of data labeling varies, as some scenarios involve manual labeling, whereas others adopt alternative approaches.
Brezmes et al. [110] used smartphones to identify and classify six distinct movements. Anguita et al. [111] employed Support Vector Machines (SVMs) to model human behavior using data from a waist-worn smartphone inertial sensor. Kose et al. [112] tracked real-time activity using a smartphone accelerometer. Fuentes et al. [113] and Lara et al. [114] employed smartphone accelerometers for real-time motion detection, with Lara et al. [114] further enhancing their processing techniques by incorporating ML methods. Lee et al. [115] addressed the issue of noise in accelerometer data by focusing on the vector magnitude of the signal.
Data collection and preprocessing pose various challenges, such as the management of malformed records, outliers, and anomalies. ML algorithms may occasionally disregard outliers, whereas the presence of duplicate frames can confuse models. It is imperative to address these concerns, as they can significantly influence decision-making processes and overall model performance.
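A minimal sketch of such a cleaning step is shown below, assuming a hypothetical timestamped triaxial accelerometer log; the duplicate-removal, MAD-based clipping, and interpolation choices are illustrative, not a prescription from the cited studies.

```python
import numpy as np
import pandas as pd

# Hypothetical raw accelerometer log: timestamped x/y/z samples at 50 Hz.
df = pd.DataFrame({
    "t": np.arange(1000) * 0.02,
    "x": np.random.normal(0, 1, 1000),
    "y": np.random.normal(0, 1, 1000),
    "z": np.random.normal(9.8, 1, 1000),
})

# 1) Drop exact duplicate frames that would otherwise bias the model.
df = df.drop_duplicates()

# 2) Clip outliers to a robust range (median +/- 4 * MAD per axis).
for axis in ("x", "y", "z"):
    med = df[axis].median()
    mad = (df[axis] - med).abs().median()
    df[axis] = df[axis].clip(med - 4 * mad, med + 4 * mad)

# 3) Fill any remaining gaps by interpolation before windowing.
df = df.interpolate().reset_index(drop=True)
```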
2) Dataset Modeling and Configuration
Modeling actions from video sequences poses significant challenges. In their study, Zhang et al. [116] introduced motion context, a representation that combines image and motion information. This representation is robust to variations in action size and effectively captures the 3D characteristics of actions, leading to improved results for specific datasets. However, obstacles remain to be overcome, such as space-blind motion words (MWs) and limitations of graphical models when dealing with limited video datasets.
Advancements in multimedia and computer vision have recently focused on the detailed analysis and understanding of videos, including objects and their relationships. However, current evaluations often rely on small datasets or indirect metrics. To address this issue, Shang et al. [65] proposed a cost-effective annotation pipeline. This pipeline addresses challenges related to keyframe creation, task decomposition, and other factors such as free camera motion, illumination, object deformation, and keyframe generation. By enabling large-scale annotation, this approach overcomes the limitations of previous evaluations.
3) The Role of Open-Access and Commercial Tools in HAR
The development and deployment of HAR systems are significantly enhanced by the availability of open-access and commercial tools. Open-access tools, such as OpenPose and DeepPose, offer cost-effective and easily accessible solutions, enabling researchers and developers to prototype and experiment quickly [117]. These tools are often equipped with pre-trained models and user-friendly interfaces, expanding the accessibility of HAR technology.
On the other hand, commercial tools such as Microsoft Kinect and Intel RealSense provide advanced capabilities and robustness for professional and large-scale applications [118]. These tools typically offer comprehensive support and integration options, making them well-suited for complex surveillance systems or commercial activity monitoring solutions.
The decision to opt for either open-access or commercial tools is contingent upon various factors, including the financial resources available, the scale of implementation, and the specific requirements of the HAR system [119]. Open-access tools play a crucial role in facilitating research and fostering innovation, but they may lack the ability to be tailored to specific needs and may not offer the same level of support as their commercial counterparts. Conversely, commercial tools provide advanced functionalities but can be prohibitively expensive and less adaptable for research purposes. This dichotomy poses a challenge when selecting tools for HAR scenarios because it necessitates striking a balance between advanced features, customization options, and cost-effectiveness.
4) Analyzing a Dataset Based on Input From Images or Video Frames
Video analysis in HAR involves processing a sequence of static images and video frames. The extraction and classification of frames from videos pose several challenges, including variations between different classes, similarities within the same class, poor image quality, interactions between individuals in a group, long-distance shots, complex backgrounds, and interactions involving multiple subjects [120], [121].
Most HAR datasets consist of video clips that capture the actions of ADLs. Analyzing every frame in these videos can be computationally and memory intensive. To address this issue, Serpush and Rezaei [122] proposed a method that categorizes selected frames from a video, thereby improving processing speed and enabling real-time applications. Typically, a subset of 30 frames is considered sufficient for accurate categorization [123]. Activities and confidence levels are then determined by averaging the categorization results of the selected frames. However, this approach may encounter difficulties with videos involving multiple concurrent activities, which complicate the recognition and categorization of actions.
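The sketch below illustrates this frame-subset strategy, uniformly sampling 30 frames and averaging per-frame class probabilities; the classifier interface and the softmax-output assumption are hypothetical stand-ins rather than the exact pipeline of [122].

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=30):
    """Uniformly spread `num_samples` indices across a clip."""
    return np.linspace(0, num_frames - 1, num_samples).astype(int)

def classify_clip(frames, frame_classifier, num_samples=30):
    """Average per-frame class probabilities over a sampled subset.

    `frame_classifier` is any model mapping one frame to a probability
    vector over activity classes (assumed softmax output).
    """
    idx = sample_frame_indices(len(frames), num_samples)
    probs = np.stack([frame_classifier(frames[i]) for i in idx])
    mean = probs.mean(axis=0)            # fuse by averaging
    return mean.argmax(), mean.max()     # predicted class and confidence

# Toy usage: 120 random "frames" and a dummy 5-class classifier.
frames = [np.random.rand(64, 64, 3) for _ in range(120)]
dummy = lambda f: np.full(5, 0.2)
print(classify_clip(frames, dummy))
```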
5) Metrics for Performance Evaluation
More than 650 million people worldwide live with disabilities. They should be provided with monitoring systems that are both reliable and precise, with high recall to reduce action-recognition mistakes [124]. ML techniques such as RF, KNN, ANN, and SVM have been used for HAR [125], [126], [127]. However, a HAR system is only acceptable if it performs well in terms of the evaluation metrics. Well-known and widely used evaluation metrics include accuracy, precision, recall, and F-score [128]. Accuracy (Equation 1) is the proportion of correct predictions among all predictions made by the model. Precision (Equation 2) is the proportion of predicted positives that are truly positive. Recall (Equation 3) is the proportion of actual positive cases in a given dataset that are detected correctly. The F-score/F1-score (Equation 4) evaluates binary classification systems that label samples as 'positive' or 'negative' by combining precision and recall into their harmonic mean.\begin{align*} Accuracy &= \frac {TP+TN}{TP+TN+FP+FN} \tag{1}\\ Precision &= \frac {TP}{TP+FP} \tag{2}\\ Recall &= \frac {TP}{TP+FN} \tag{3}\\ F_{1} &= 2 \times \frac {Precision \times Recall}{Precision + Recall} \tag{4}\end{align*}
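For reference, Equations (1)-(4) translate directly into code; the confusion-matrix counts below are an illustrative example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations (1)-(4) computed from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example: 80 true positives, 90 true negatives, 10 FP, 20 FN.
print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
# accuracy 0.85, precision ~0.889, recall 0.80, f1 ~0.842
```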
A sensor-based Body Area Network (BAN) DL model, InnoHAR, was developed in [129] by integrating an inception neural network with a recurrent neural network. With this method, the above metrics can be evaluated on time-series features in the form of multi-channel sensor waveforms.
Bao and Intille [130] developed and evaluated classifiers based on acceleration data, finding that the thigh- and wrist-mounted units, two of the five small biaxial accelerometers worn, were sufficient for recognition. Their method computed features such as the correlation, mean, and frequency-domain entropy. They found that their decision-tree-based classifier performed best at detecting ordinary activities, with an overall accuracy of 84%.
Lee et al. [115] attempted to improve accuracy and precision by letting users' smartphones provide the data for a 1D CNN-based triaxial accelerometer model. They converted the X, Y, and Z acceleration data into vector-magnitude data before using it to train a 1D CNN. Their 1D CNN-based ternary action recognition exceeded the baseline Random Forest approach, which achieved 89.10%.
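The sketch below shows the vector-magnitude transformation followed by a small 1D CNN; the network layout, window size, and class set are illustrative stand-ins, not the authors' exact model.

```python
import numpy as np
import torch
import torch.nn as nn

def vector_magnitude(xyz):                     # xyz: (B, T, 3) triaxial windows
    """Collapse X/Y/Z into a single orientation-robust channel."""
    return np.sqrt((xyz ** 2).sum(axis=-1))    # (B, T)

model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 3),                          # three-class (ternary) output
)

windows = np.random.randn(8, 128, 3)           # 8 windows of 128 samples
vm = torch.tensor(vector_magnitude(windows), dtype=torch.float32).unsqueeze(1)
print(model(vm).shape)                         # torch.Size([8, 3])
```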
Bayat et al. [131] proposed an HAR system using a digital low-pass filter that isolates gravitational acceleration from body acceleration, achieving 91.15% accuracy using the average of probabilities as the fusion technique.
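A minimal sketch of this gravity/body separation is given below, using a zero-phase Butterworth low-pass filter; the 0.3 Hz cutoff, filter order, and sampling rate are common choices used here as assumptions rather than the authors' exact settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity(acc, fs=50.0, cutoff=0.3, order=3):
    """Separate gravity from body acceleration with a low-pass filter."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    gravity = filtfilt(b, a, acc, axis=0)   # zero-phase low-pass = gravity
    body = acc - gravity                    # residual = body motion
    return gravity, body

# Toy signal: near-constant gravity on z plus higher-frequency body motion.
t = np.arange(0, 10, 1 / 50.0)
acc = np.stack([0.5 * np.sin(2 * np.pi * 2 * t),        # x: 2 Hz motion
                np.zeros_like(t),
                9.8 + 0.3 * np.sin(2 * np.pi * 3 * t)], axis=1)
gravity, body = split_gravity(acc)
print(gravity[:3, 2], body[:3, 2])
```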
Attal et al. [132] used chest, right-thigh, and ankle sensors in combination with preprocessing and data-classification approaches for supervised classification, employing SVM, KNN, RF, and mixture models [133], and evaluated the results using the F-measure, recall, and precision. Three MTx inertial measurement units, placed on the chest, right thigh, and left ankle, were used to classify 12 static, dynamic, and transitional activities. Unsupervised classification techniques can quickly build models from unlabeled data and are computationally inexpensive, whereas supervised techniques are more accurate when working with raw data or with extracted or selected features. A real-world test demonstrated that RF performed better than SVM, while the SLGMM proved harder to apply in practice. Additionally, the Hidden Markov Model, Gaussian Mixture Model, and K-means can handle temporal and sequential data.
Lawal and Bano [134] generated image sequences from two sensor streams using trained CNNs and combined them to forecast human-activity classes on a public dataset, achieving an F-score of 0.87.
Conclusion and Future Work
The impact of HAR on daily life, real-time situations, and collaborative efforts has been discussed extensively in the literature. This paper offers a comprehensive review of the current state of HAR systems. The findings from the various sources cited in this paper highlight several key points. In addition, a summary of the taxonomy of HAR techniques and methods is presented in Table 7.
Researchers have focused on various aspects of HAR, including user behavior, static and dynamic living activities, and lifestyles. However, limited attention has been paid to real-time HAR for medical security. The complexity of real-time actions, hardware and technical limitations, and data scarcity pose challenges for the development of HAR systems. Meeting the demands of real-time public or occupied spaces and adapting to changing scenery requires significant computing power. Furthermore, the availability of high-quality real-time data is limited, and existing literature falls short of addressing real-time activities. For instance, Kinect devices, with their limited field of view, can only detect falls and unusual behavior in smart homes.
Privacy concerns arise in vision-based HAR systems that rely on cameras. Some individuals may be reluctant to have their data, including images and videos, stored permanently, raising privacy concerns.
Sensor-based HAR has advantages and disadvantages similar to those of other technologies. Some sensors are worn by individuals, whereas others are embedded in buildings and vehicles. Wearing body-attached sensors may cause discomfort; however, they provide versatile and detailed motion data and offer the advantage of capturing real-time data, despite the potential for data loss.
The widespread use of smartphones has made HAR sensors readily available for the general population. This accessibility opens opportunities for novel combinations of devices and sensors in HAR research.
In the literature, several HAR implementations utilize both ML and DL methods. Researchers have suggested the use of hybrid HAR approaches. The following are several potential areas that can be explored in future research.
Researchers involved in HAR should focus on addressing two key aspects: (1) acquiring sufficient data for HAR systems and (2) evaluating their performance effectively.
Previous evaluations have primarily concentrated on specific aspects of HAR, such as an integrated HAR with artificial intelligence frameworks [135] or the categorization of video frames [122] to enhance real-time processing. To provide valuable and accurate insights into human behavior and interactions with the environment, it is recommended to establish connections between HAR and Computer Vision.
While the KTH and Weizmann datasets have been extensively studied and utilized owing to their simplicity and limited number of videos, the MSR dataset, which captures behavior from multiple cameras, has been identified as a valuable benchmark. However, further research is necessary to develop more suitable datasets that encompass a wider range of activities and perspectives, thereby enabling the training of state-of-the-art DL models.
This review article aims to establish a foundation for further investigation into significant issues and potential advancements in the field.
List of Abbreviations
3D CNN | Three-Dimensional Convolutional Neural Network |
ADL | Activities of Daily Living |
BAN | Body Area Network |
CNN | Convolutional Neural Network |
DKN | Dynamic Kernel Network |
DL | Deep Learning |
FPGA | Field-Programmable Gate Array |
GCN | Graph Convolutional Network |
HAR | Human Action Recognition |
HCI | Human-Computer Interaction |
ML | Machine Learning |
RF | Random Forest |
RGB | Red Green Blue |
SDF | Signed Distance Function |
SVLRM | Subspace Video Linear Regression Model |
SVM | Support Vector Machine |
TOF | Time-of-Flight |
YOLO | You Only Look Once |