Exploring Human Pose Estimation and the Usage of Synthetic Data for Elderly Fall Detection in Real-World Surveillance

The world’s elderly population continues to grow at an unprecedented rate, creating a need to monitor the safety of an aging population. One of the current problems is accurately classifying elderly physical activities, especially falling down, and delivering prompt assistance to someone in need. Owing to advancements in deep learning research, vision-based solutions are employed for action recognition. One popular approach is human pose estimation based action recognition or fall detection. Nevertheless, due to the lack of large-scale elderly fall datasets and the persistence of numerous challenges such as varying camera angles, illumination, and occlusion, accurately classifying falls has been problematic. To address these problems, this research first carried out a comprehensive study of the AI Hub dataset, collected from the real lives of elderly people, in order to benchmark the performance of state-of-the-art human pose estimation methods. Second, owing to the limited number of real datasets, augmentation with synthetic data was applied and the performance improvement was validated based on changes in accuracy. Third, this study shows that a Transformer network applied to elderly action recognition outperforms LSTM-based networks by a noticeable margin. Lastly, by observing the quantitative and qualitative performance of different networks, this paper proposes an efficient solution for elderly activity recognition and fall detection in the context of surveillance cameras.

viding better elderly care and a longer life expectancy. Fortunately, with the advent of modern technologies, assisted living has become a more accessible way to monitor the behavior of an aging population. In the last decade, researchers have detected falls with wearable devices, environmental sensors, and cameras [2]. Sensors and wearable devices may offer a quick response time and better fall-detection accuracy. Nevertheless, these methods fail when a person forgets to wear the device or falls in an untracked area [3]. Thus, a more efficient approach is to take advantage of the increasing number of surveillance cameras, which are easy to set up and receive consistent data by tracking an entire area. However, the rapid increase of surveillance cameras raises serious concerns about privacy and discriminatory bias against specific groups of people [4]. Among many methods, an effective one to avoid the … surroundings that differ from daily living environments. Furthermore, most fall detection datasets involve young people, whose speed and actions differ from those of the elderly [11]. As a result, when an action recognition model is trained on such datasets and tested in real-world settings, these differences may result in unreliable action recognition [12]. Consequently, it is assumed that these methods do not exhibit generalization capabilities for elderly physical action recognition in real-world environments. To address these problems, a qualitative and quantitative study was carried out on the AI Hub dataset [13], which covers 12 types of falls and daily life activities.

The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.

The major contributions of this paper are threefold: … This paper is organized in the following order.
Section II outlines a literature review of public fall detection datasets and pose-based human action recognition and fall detection, in addition to synthetic data usage. Section III introduces the proposed methodology. Section IV presents implementation details, describing the qualitative and quantitative results. Section V discusses the findings of the research. Finally, Section VI draws conclusions from this research. Code and video explanations are publicly available at https://sard0r.github.io/.

Because deep learning models require a large amount of data, it is crucial to select training and testing datasets that achieve high performance in real-world applications. For this reason, this research compared several vision-based public datasets, and a dataset that fulfills the real-world requirements was chosen. … at a community center, in a parking lot, a park, a market, a residential alley, a subway station, on a footbridge, and in front of apartment complexes.

This research was carried out using the AI Hub dataset for a number of reasons. First, because it was collected in 10 different environments, there is great variance in camera-to-subject distances and viewing angles. Second, unlike laboratory environments, real-world surveillance cameras stream continuously, day and night. Thus, as the videos were recorded both at night and during the daytime, the dataset represents different levels of complexity. Third, it was collected in places where the elderly usually happen to be in need (for example, hospitals and community centers), enabling us to evaluate the performance of human pose estimation methods in real-world assisted living scenarios.

In recent years, pose-based human action recognition has attracted a lot of attention owing to big improvements in human pose estimation. … [20] built a robust human fall detection system using long short-term memory (LSTM) and GRU networks based on the OpenPose pose estimator, and reported 99% sensitivity using the Le2i and URFD datasets. Lin et al. [21] approached the problem of fall detection with a similar method, introducing a fall detection framework utilizing OpenPose as a feature extractor and LSTM and GRU for classification, achieving 98.2% accuracy on the URFD dataset. … [32].

To evaluate the performance improvement from using synthetic data, KIST SynADL [11] was utilized for a number of reasons. Firstly, the body shapes and motions of the characters were captured from real actions by elderly people. Secondly, it was designed especially for augmenting realistic elderly datasets for action recognition by smart surveillance and care robots. Finally, it is a large synthetic dataset including 15 characters who performed 55 actions of elderly people in four different environments. Although the article by Wang et al.
[11] is closely related to our research, it only considered synthetic data usage targeting the recognition of daily activities from a care robot's view. Therefore, it utilized realistic datasets [33] and [12], which were captured in laboratory settings from a side view for application by care robots. By contrast, our study focuses on the usage of synthetic data for real-world surveillance applications, which is explored here for the first time.
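The OpenPose-plus-LSTM/GRU pipelines cited above all share one preprocessing step: per-frame keypoints are normalized and flattened into a feature sequence before being fed to the recurrent classifier. The sketch below illustrates that step in pure Python; the function names and the centering/scaling scheme are our own illustrative assumptions, not the cited authors' exact code.

```python
# Sketch: turn per-frame 2D keypoints into a normalized feature sequence
# suitable for an LSTM/GRU-style action classifier. The normalization
# scheme (center on mean joint, scale by bounding-box spread) is an
# assumption for illustration.

def normalize_pose(keypoints, eps=1e-6):
    """Center a pose on its mean joint and scale to unit spread.

    keypoints: list of (x, y) tuples, one per body joint.
    Returns a flat feature vector [x1, y1, x2, y2, ...].
    """
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    spread = max(max(xs) - min(xs), max(ys) - min(ys), eps)
    feats = []
    for x, y in keypoints:
        feats.extend([(x - cx) / spread, (y - cy) / spread])
    return feats

def poses_to_sequence(frames):
    """Convert a list of per-frame keypoint sets into a feature sequence."""
    return [normalize_pose(f) for f in frames]
```

Normalizing in this way makes the classifier's input invariant to where the person stands in the frame and to their apparent size, which matters under the varying camera distances discussed above.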

In this section, a performance evaluation methodology for human pose estimation methods is introduced as a solution to elderly action recognition and fall detection. The overall process of the proposed methodology is illustrated in Fig. 1. The main aim of this research is to explore the performance of human pose estimation methods for elderly action recognition and fall detection in real-world surveillance, and to illustrate the degree of improvement through synthetic data exploitation. To evaluate the potential of pose estimation for fall detection problems, five human pose estimation methods are considered with two prominent action recognition networks. In general, the raw video frames from the chosen real dataset are first fed into the selected pose estimation methods, which output sets of human poses. Next, qualitative and quantitative evaluations were performed for each pose estimation method utilizing the extracted human pose … (IC), is calculated as follows:

IC = (1 / (T × K)) Σ_{t=1..T} Σ_{k=1..K} C_{t,k}    (2)

In equation (2), C is the confidence of each keypoint, K denotes the total number of keypoints, and T stands for the number of frames.
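Equation (2) amounts to averaging every keypoint confidence over all K keypoints in all T frames. A minimal sketch of that computation, assuming confidences arrive as a T x K nested list (the data layout is our assumption; the symbols follow the paper's definitions):

```python
# Average keypoint confidence as in equation (2): sum the confidence C
# of every keypoint over K keypoints in each of T frames, then divide
# by T * K.

def average_confidence(confidences):
    """confidences: T x K nested list of per-keypoint confidence scores."""
    T = len(confidences)
    K = len(confidences[0])
    total = sum(c for frame in confidences for c in frame)
    return total / (T * K)
```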

Owing to many factors, such as occlusion, dim light, and … In equation (4), n represents the total number of missing keypoints in an image.

To evaluate the performance of the action recognition models, four metrics derived from the confusion matrix [41] (accuracy, recall, precision, and F1-score) were chosen. Accuracy can be defined as the ratio of the total number of correct predictions to the total number of predictions. For binary classification, recall, precision, and F1-score can be defined as follows. Precision is a measure of correct predictions out of all positive predictions. Recall (or sensitivity) is a measure of the accurately classified cases out of all positive cases. F1-score combines precision and recall into a single metric by taking their harmonic mean. However, because this study classifies four action classes, the evaluation method differs from the binary classification problem. For this study, precision, recall, and F1-score for each action class were calculated separately, after which the weighted average of each evaluation metric over all action classes was computed.

To evaluate the robustness of the chosen pose estimation methods, quantitative experiments were conducted using the AI Hub dataset. First, performance of the pose-estimation-based action recognition pipelines was compared. Results for accuracy, precision, recall, and F1-score in both action recognition pipelines are shown in Table 5. Both Transformer- and LSTM-based pipelines achieved good performance when coupled with AlphaPose. The worst performance was observed when MoveNet was used as the pose feature extractor in the action recognition pipeline. The best pipeline used AlphaPose for pose features and a Transformer for action classification; its results for accuracy, recall, precision, and F1-score were 89.22%, 90%, 89%, and 89%, respectively. … average confidence for keypoints in T frames was compared.
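The weighted-average evaluation described above can be sketched directly from a confusion matrix: compute per-class precision, recall, and F1, then weight each class by its support (number of true samples). A minimal pure-Python version, with the matrix layout as our assumption:

```python
# Weighted-average precision/recall/F1 from a confusion matrix, where
# cm[i][j] = count of samples with true class i predicted as class j.
# Each class's metric is weighted by its support before averaging.

def weighted_metrics(cm):
    n = len(cm)
    total = sum(sum(row) for row in cm)
    w_prec = w_rec = w_f1 = 0.0
    for i in range(n):
        tp = cm[i][i]
        support = sum(cm[i])                       # true samples of class i
        predicted = sum(cm[r][i] for r in range(n))  # predicted as class i
        prec = tp / predicted if predicted else 0.0
        rec = tp / support if support else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        weight = support / total
        w_prec += weight * prec
        w_rec += weight * rec
        w_f1 += weight * f1
    accuracy = sum(cm[i][i] for i in range(n)) / total
    return accuracy, w_prec, w_rec, w_f1
```

With balanced classes the weighted recall equals accuracy; with imbalanced classes (common in fall data, where falls are rare) the weighting keeps a small class from dominating or vanishing from the summary score.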

The reason was that the AI Hub dataset did not provide ground truth keypoints for body joints. Therefore, the mAP … Lastly, for a fair evaluation of the models' action recognition performance, the important characteristics of the Transformer and LSTM were measured when trained and tested on the AI Hub data. Table 6 shows the number of parameters, MFLOPs, GPU memory requirements, and inference time of the Transformer and LSTM models. The table shows that both models had comparable inference speeds. It is also worth mentioning that the Transformer had 3.6 times more parameters.

In the above, we saw an increase in performance when adding synthetic data. Next, we examined the trend in performance improvement by adding synthetic data incrementally. Graphs … the improvement was noticeable in the early steps and tended to decrease marginally with more synthetic data. The highest increase can be attributed to the AlphaPose with Transformer pipeline, at a 5.16% improvement, where accuracy went from 89.22% to 94.35%. The lowest increase was obtained from the OpenPifPaf and LSTM pipeline (an increase in accuracy of 2.46%). The mean increase in accuracy for the Transformer and LSTM action recognition models, when coupled with all pose estimation methods, was 3.96% and 3.77%, respectively. It is worth noting that the trend in the performance increase generalized similarly for both Transformer- and LSTM-based action classification models.

To visualize the performance improvement after adding synthetic data, a Transformer pipeline was used. Keypoints were extracted with AlphaPose and given as input to the Transformer model. Three scenarios were tested with two pipelines where, in all cases, the test data were the same three samples of falling down from the AI Hub dataset.
In the first pipeline, the Transformer-based action recognition model was trained only on real (AI Hub) data. Next, 6400 synthetic video samples were added to the real data to create a larger training set. Then, the second Transformer-based pipeline was trained on this larger set. Performance improvements after adding synthetic data are shown in Table 8 for the three cases shown in Fig. 11, where each row represents a single failed video clip with different snapshots. When the pipeline was trained only on real data, all three cases failed, with high probabilities assigned to the wrong category. The pipeline trained with the additional synthetic dataset predicted the correct class with high probability. One possible explanation is that the nature of falling down is diverse, and there is not enough falling-down data in the AI Hub dataset to capture such diversity. However, the synthetic KIST SynADL dataset can be used to cover some of the missing action variations.
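The incremental protocol above (keep the real set fixed, add synthetic clips in growing increments, retrain at each step) can be sketched as a simple schedule builder. The step size and list-based representation are illustrative assumptions; 6400 matches the sample count quoted in the text.

```python
# Sketch of the incremental synthetic-augmentation schedule: the real
# training set stays fixed while synthetic samples are added in
# increments of `step`, producing one training set per retraining run.

def make_training_schedules(real_samples, synthetic_samples, step=1600):
    """Yield training sets containing 0, step, 2*step, ... synthetic samples."""
    schedules = []
    for n in range(0, len(synthetic_samples) + 1, step):
        schedules.append(real_samples + synthetic_samples[:n])
    return schedules
```

Training one model per schedule and plotting accuracy against the synthetic count is what produces the diminishing-returns curve reported above: large gains at the first increments, marginal gains near the full 6400 samples.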

FIGURE 11. Improved failure cases after including synthetic data in training. All three scenarios failed when the model was trained only on real data. Adding synthetic data and then retraining the model successfully corrected these failure cases.

It is noteworthy that top-down approaches performed better …

Another finding of this work is that the Transformer model showed inference speed comparable to the LSTM model. Thus, in real-world applications, using AlphaPose in combination with a Transformer model can be expected to deliver high accuracy with low latency. This approach can be optimized to target real-time applications, especially in monitoring elderly people, where the problem must be solved instantly to prevent sudden accidents.
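Inference-speed comparisons like the one above reduce to a simple wall-clock benchmark per clip. A minimal sketch, where `model` stands in for any pose-feature classifier (the callable and warm-up count are our assumptions, not the paper's measurement protocol):

```python
# Minimal per-sample inference latency benchmark. A few warm-up calls
# are excluded from timing so one-time setup costs do not skew the mean.
import time

def mean_latency_ms(model, inputs, warmup=2):
    """Return mean wall-clock latency per input, in milliseconds."""
    for x in inputs[:warmup]:
        model(x)                      # warm-up runs, not timed
    start = time.perf_counter()
    for x in inputs:
        model(x)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(inputs)
```

For GPU models, per-batch timing should also synchronize the device before reading the clock, otherwise queued asynchronous kernels make latency look artificially low.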

Exploiting state-of-the-art human pose estimation methods for pose-based action recognition and fall detection was the main emphasis of this study. Specifically, this paper explored action classification for elderly-care-monitoring applications that include fall detection. For pose estimation, we used five methods: AlphaPose, DCPose, OpenPose, OpenPifPaf, and MoveNet. LSTM and Transformer networks were explored as potential methods to model action sequences. Perhaps most importantly, this study examined the benefits of using synthetic data for pose-based action recognition and fall detection, given the limited amount of real-world data. AlphaPose was found to be the most accurate human pose estimator in surveillance scenarios. The Transformer outperformed the LSTM in accuracy, precision, recall, and F1-score. Results show that the exploitation of synthetic data improved action recognition performance significantly. A limitation of this research is that quantitative evaluations were performed only on indoor data. As future research, one can extend the present work by considering observations from outdoor elderly human behavior datasets as well.