Comparing Cross-Subject Performance on Human Activities Recognition Using Learning Models

Human activity recognition (HAR) plays a vital role in fields like ambient assisted living and health monitoring, where cross-subject recognition remains one of the main challenges owing to the diversity among users. Although recent studies have achieved satisfactory results under non-cross-subject conditions, recognition performance degrades significantly under the cross-subject criterion. In this paper, we evaluate three traditional machine learning methods and five deep neural network architectures under the same metrics on three popular HAR datasets: mHealth, PAMAP2, and UCIDSADS. The experimental results show that traditional machine learning approaches are generally more robust in new-subject scenarios under strict leave-one-subject-out cross-validation. Further analysis indicates that hand-crafted features are one major reason for the better performance of traditional machine learning on cross-subject HAR, while deep learning is more prone to learning subject-dependent features during end-to-end training. A novel training strategy for decision-tree-based methods is also proposed in this paper, improving the random forest model to a competitive average F1-score (accuracy) of 94.49% (95.09%), 91.64% (92.21%), and 92.70% (93.29%) on the three datasets, compared with state-of-the-art solutions for cross-subject HAR.

INDEX TERMS Cross-subject, deep learning, human activity recognition, leave one subject out, traditional machine learning.

The associate editor coordinating the review of this manuscript and approving it for publication was Siddharth Tallur.

I. INTRODUCTION
Sensor-based HAR using wearable devices equipped with an accelerometer, gyroscope, and magnetometer has gained more attention recently on account of its ability to provide a portable, private, continuous, non-invasive, and low-cost recognition service, compared to vision-based HAR [7], which faces challenges in privacy protection, resource consumption, and blind areas. A typical framework of HAR is shown in Fig. 1, where the general process of a HAR algorithm includes four stages: sensor data acquisition, data pre-processing, off-line feature extraction and model training, and online activity classification. In the data acquisition stage, IMU sensors can be found in glasses [8], phones [9], watches or wrist bands [10], chest patches [11], shoes [12], etc., directly reflecting the subject's behavior, which is tightly related to physical locations throughout the body. Since the measured signals suffer from inherent sensor drift and the subject's unconscious movements, median filters and low-pass filters are common data-cleaning methods in the pre-processing stage to eliminate noisy interference and redundant information [5], [13], [14]. Besides, continuous data segmentation is also necessary at this stage, dividing the signal into sliding windows with or without overlap [15]. The feature extraction and model training stage plays a vital role in detecting significant low-dimension patterns in the raw high-dimension sensor input. According to the feature extraction method, current HAR solutions can be divided into two categories: hand-crafted feature …

The main contributions of this paper are as follows:
1. This paper conducts a comprehensive, strict cross-subject evaluation of traditional machine learning models and commonly used deep learning models in new-subject scenarios of HAR applications. We have performed experiments with traditional machine learning and deep learning models on three publicly available datasets, and the impact of hand-crafted features is further analyzed and discussed.
2. A novel training criterion for decision-tree-based learning models is proposed, which tries to discriminate different classes while ignoring the diversity of the various subjects. This improvement increases the recognition accuracy of the random forest and shows comparable performance with state-of-the-art cross-subject HAR solutions.
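The sliding-window segmentation mentioned in the pre-processing stage can be sketched as follows. This is a minimal illustration: the window length, overlap, and the 9-axis signal layout are assumptions for the example, not the settings used in the experiments below.

```python
import numpy as np

def sliding_windows(signal, window_size, overlap=0.5):
    """Segment a (T, C) multi-axis signal into fixed-length windows.

    `window_size` is in samples; `overlap` is the fraction shared by
    consecutive windows (0 gives non-overlapping windows).
    """
    step = max(1, int(window_size * (1.0 - overlap)))
    starts = range(0, len(signal) - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

# 10 s of a 9-axis IMU stream sampled at 50 Hz, cut into windows of
# 128 samples with 50% overlap
stream = np.zeros((500, 9))
windows = sliding_windows(stream, window_size=128, overlap=0.5)
print(windows.shape)  # (6, 128, 9)
```

Each window then becomes one sample for feature extraction or for direct input to a neural network.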
The rest of this paper is organized as follows. Related work is presented in Section II. Section III explains the chosen datasets, the evaluation criteria, and the settings of the traditional machine learning and deep learning models. Section IV presents the experimental results of the different models in cross-subject activity recognition with detailed analysis and discussion. Finally, Section V concludes this paper.

II. RELATED WORK
Simple time-domain and frequency-domain features are commonly used in HAR [31], [32], [33], such as the harmonic mean, standard deviation, and Pearson correlation coefficient. These hand-crafted features are used to train a recognition model such as random forest, decision tree, SVM, or KNN, as shown in Fig. 1. Casale et al. [34] utilized a set of 20 computationally efficient features to recognize 5 basic daily activities. Their random forest reached 94% recognition accuracy, outperforming a single decision tree and boosted trees. With the aid of feature selection and sensor data fusion techniques, Ayman et al. […] The authors of [46] found that features obtained by hybrid deep-learning architectures combining CNN and LSTM had advantages in discovering both short-term and long-term temporal relationships in the data.

The heterogeneity introduced by different subjects can significantly reduce the accuracy of activity recognition. Ravi et al. [47] conducted an experiment on 2 subjects wearing an accelerometer on the waist and recorded eight daily activities on different dates. They found that over 99% accuracy was achieved in cross-validation when the two subjects' data were mixed for training and testing, but only 65% accuracy when each subject's data was used exclusively as either the training or the testing set. Janidarmian et al.
[33] evaluated different traditional machine learning methods for HAR using accelerometer data from 14 public datasets covering 8 independent positions and 8 daily activities (walking, running, jogging, biking, standing, sitting, lying, and going up and down stairs). In the non-cross-subject 10-fold evaluation, the average classification accuracy over the 8 positions was 96.44% ± 1.62%; however, it decreased to 79.92% ± 9.68% in the LOSO cross-subject evaluation.

The study in [64] examined a traditional machine learning model and two commonly used deep learning models (CNN and LSTM) on HAR in terms of accuracy, memory consumption, real-time performance, etc. They found that random forest is the best model for memory-limited applications, while the best model considering both complexity and performance is the linear-kernel SVM. The two deep neural networks are comparable in performance, but their increased complexity makes them hard to deploy in real use cases. Gholamiangonabadi et al. [41] compared the cross-subject HAR performance of a feed-forward neural network and a CNN, and the results showed that a CNN architecture with two convolutions and one-dimensional filters had the best generalization ability.

The study in [65] observed that traditional solutions outperform deep methods under the same metric on HAR, and the reason remained unclear. In this paper, we conduct a comprehensive comparison between traditional machine learning and deep learning methods on HAR under strict LOSO validation, and further analyze the experimental results. Different from studies like [45], [46], the hyper-parameter settings of the traditional machine learning models are clarified in detail in this paper, together with an explicit definition of strict LOSO cross-validation.

To comprehensively evaluate the cross-subject activity recognition performance of traditional machine learning and deep learning, we selected 3 datasets of different scales, containing multiple subjects and covering simple, complex, and similar activities.

The mHealth dataset contains body motion and vital sign recordings from 10 subjects. Each subject performed 12 activities in an out-of-lab environment without any constraints. 3 IMU sensors were placed on the subject's chest, right wrist, and left ankle to measure the 3-axis acceleration (m/s²), 3-axis angular velocity (deg/s), and 3-axis magnetic field (G/s), respectively. Besides, the sensor placed on the chest also provides 2-lead ECG measurements. The sampling frequency of all sensors is 50 Hz.

The PAMAP2 dataset is a benchmark for daily activity recognition. It was recorded from 9 subjects (8 males and 1 female, aged from 24 to 32) wearing three IMUs placed on the arm, chest, and ankle, respectively, and consists of 12 activities including simple activities (such as sitting and running) and complex activities (such as cleaning and ironing). The sensor data were recorded at 100 Hz.

The UCIDSADS dataset was specially constructed for daily and sports activity recognition. It comprises 19 activities …

… comparison benchmark, this paper directly uses the N groups of hyper-parameters obtained through the cross-subject LOSO cross-validation mentioned above as the model configuration (i.e., the validation process is skipped) and performs a non-cross-subject 5-fold cross-test (80% of the samples for the training set and the remaining 20% for the testing set) on each dataset. Finally, the average classification performance of the N groups of hyper-parameters is taken as the non-cross-subject recognition result of the model.
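The strict LOSO protocol above holds out each subject's data entirely for testing while training on all remaining subjects. A minimal sketch using scikit-learn's LeaveOneGroupOut on synthetic data (the hyper-parameter search described above is omitted, and the data shapes are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))            # 120 feature vectors, 8 features each
y = rng.integers(0, 3, size=120)         # 3 activity classes
subjects = np.repeat(np.arange(6), 20)   # 6 subjects, 20 windows per subject

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    # each fold trains on 5 subjects and tests on the held-out one
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    scores.append(f1_score(y[test_idx], pred, average="macro"))

print(len(scores))  # one macro-F1 score per held-out subject -> 6
```

The per-subject scores are what the confidence intervals in Section IV are computed over.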

Model design and hyper-parameter selection need to avoid overfitting in order to overcome the impact of new-subject scenarios. For traditional machine learning models, this paper incorporates the parameters related to overfitting into the hyper-parameter search space, such as the maximum tree depth of the random forest and the regularization parameter of the SVM. For deep learning models, effective generalization methods such as dropout, batch normalization, and L2 regularization are fully utilized in the network structure design. For HAR, a lightweight deep learning model is sufficient to achieve satisfactory recognition performance [66], while too many trainable parameters often carry the risk of …

… Table 3 have 9 dimensions. We extracted the corresponding features listed in Table 3 from the data frame after dimension expansion (including original time-domain data, amplitude time-domain data, original frequency-domain data, and amplitude frequency-domain data). The features were further normalized to a distribution with mean 0 and variance 1 according to (4), where f_µ and f_σ are the mean and standard deviation of the input feature f. Before normalization, we delete features that are not distinct enough, i.e., those with f_σ < 0.01. The numbers of extracted and actually used features on the three datasets are listed in Table 4. Finally, the concatenated features are used as the input of the traditional machine learning classifiers listed in Table 2.

We normalized the filtered data by (4) before feeding it into the deep neural network models. To fine-tune the deep learning models described above, we evaluate the hyper-parameter ranges in Table 5, where C denotes the number of axes, which is 9 times the number of IMUs in … Fig.
6 and 7 show the box plots of the F1-score and accuracy of the traditional machine learning and deep learning models on the 3 datasets, where each box extends from the first quartile to the third quartile of the data, with a line at the median. Note that the blue boxes are non-cross-subject results, while the orange boxes are cross-subject results. Tables 6 and 7 list the average accuracy and F1-score of the traditional machine learning and deep learning models on the 3 datasets, together with the 95% confidence limits. Since the number of cross-validation folds in LOSO is small, we use the t-distribution for an unbiased 95% confidence interval, µ ± t₀.₉₇₅(n − 1) · s/√n, where n denotes the number of users in the different datasets, µ is the average of the samples x₁, …, xₙ, and s is their sample standard deviation. The following insights can be obtained: (1) Under the non-cross-subject test, all models except LSTM achieved nearly perfect performance, and the traditional machine learning models got the …

In addition, by analyzing the confusion matrix of each subject, we found that in cross-subject activity recognition, deep learning models are more likely to misclassify some activities almost entirely, resulting in a significant drop in overall recognition accuracy. For instance, the static activity A1: standing still of subject 1 is entirely misclassified as A8: knees bending in the mHealth dataset by the deep learning models, as shown in Fig. 9(a), (c), and (e), while in the traditional machine learning cases the classification remains accurate, as seen in Fig. 9(b), (d), and (f).
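The t-distribution interval used above is standard; a short sketch with SciPy (the per-subject scores here are toy values, not the paper's results):

```python
import numpy as np
from scipy import stats

def t_confidence_interval(scores, confidence=0.95):
    """Confidence interval for the mean of few LOSO scores via Student's t."""
    n = len(scores)
    mu = np.mean(scores)
    sem = np.std(scores, ddof=1) / np.sqrt(n)       # unbiased standard error
    half = stats.t.ppf(0.5 + confidence / 2, df=n - 1) * sem
    return mu - half, mu + half

f1_per_subject = [0.93, 0.95, 0.91, 0.96, 0.94]     # toy per-subject scores
lo, hi = t_confidence_interval(f1_per_subject)
```

With only a handful of folds, the t quantile is noticeably wider than the normal quantile, which is why the t-distribution is preferred here.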

616
Nevertheless, some traditional machine learning models can also make completely wrong recognitions. For example, A7: standing in an elevator in UCIDSADS is incorrectly classified as A8: moving in an elevator for subject 4 by the SVM, as shown in Fig. 10(d), which is similar to the behavior of BLSTM and CNN-LSTM in Fig. 10(c) and (e), while the classification is relatively correct using Conv2d-CNN, KNN, and random forest, as shown in Fig. 10(a), (b), and (f).

TABLE 6. The average accuracy and 95% confidence limits of different learning models on the three datasets using the non-cross- and cross-subject evaluation criteria. The best accuracy among all the methods is highlighted.

TABLE 7. The average F1-score and 95% confidence limits of different learning models on the three datasets using the non-cross- and cross-subject evaluation criteria. The best F1-score among all the methods is highlighted.

… can reduce the data distribution differences between different users, as shown in Fig. 11(b). This is one of the major … language processing, making it hard for deep learning models with a large number of parameters to extract general features. To compare the effect of traditional hand-crafted features and features automatically extracted by the neural network on cross-subject recognition, the feature extraction part was removed from the deep learning models and only the fully connected layers were retained, forming an MLP classifier. According to the criteria defined in Section III-B, the activity recognition performance of the MLP using hand-crafted features as input is evaluated under both non-cross-subject and cross-subject conditions. The experimental results are shown in Table 8: the MLP using hand-crafted features achieves better cross-subject recognition results than the five deep learning models on all three datasets, and its average F1-score and accuracy are comparable with SVM and random forest, demonstrating the superiority of traditional hand-crafted features in cross-subject recognition.

One of the key solutions for the cross-subject recognition problem is to maximize the discrimination among different classes while ignoring the varying distributions of subjects, which is performed by transfer learning in deep neural networks, as mentioned in Section II. In this paper, we propose a novel training strategy for decision-tree-based learning methods under this principle to cope with cross-subject scenarios.
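The principle above — preferring tree splits whose child nodes are pure in activity labels but mixed in subject labels — can be sketched as a modified split score. This is a minimal illustration of the idea under our own assumed weighting; it is not a reproduction of the paper's exact objective function.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum(p^2) over the label proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(y_cls, y_subj, mask, alpha=0.2):
    """Score a candidate split (mask = samples going left): reward low
    activity-label impurity and HIGH subject-label impurity in children."""
    wl = mask.mean()
    wr = 1.0 - wl
    class_term = wl * gini(y_cls[mask]) + wr * gini(y_cls[~mask])
    subject_term = wl * gini(y_subj[mask]) + wr * gini(y_subj[~mask])
    return -(1 - alpha) * class_term + alpha * subject_term  # to be maximized

y_cls = np.array([0, 0, 0, 1, 1, 1])   # activity labels
y_subj = np.array([0, 1, 2, 0, 1, 2])  # subject labels
good = split_score(y_cls, y_subj, np.array([1, 1, 1, 0, 0, 0], bool))
bad = split_score(y_cls, y_subj, np.array([1, 0, 1, 0, 1, 0], bool))
print(good > bad)  # the class-pure, subject-mixed split scores higher
```

Here the `alpha` weight plays the role of the paper's α ∈ [0, 1): with α = 0 the score reduces to the standard class-impurity criterion.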

VOLUME 10, 2022

FIGURE 6. The box plot of F1-score of different learning models on the three datasets using the non-cross- and cross-subject evaluation criteria.

Recall that the Gini impurity used when growing the decision trees of the random forest indicates the label diversity of the data in the current node, as shown in (1). … of distinguishing the subjects, which can be formulated as a new objective, where p_j,l and p_j,r are the sample proportions of subject j in the left and right child nodes, respectively. A parameter α ∈ [0, 1) is set to represent the importance of the Gini impurity over the subject labels; the original criterion of finding the best split is then rewritten as jointly minimizing the Gini impurity over activity labels while maximizing it over subject labels, weighted by α.

… training strategy, whose cross-subject performance is better than that of all the other methods shown in Fig. 6 and 7. Paired t-tests on the F1-score are conducted between the original random forest and the modified one to determine the degree of significant difference at the chosen significance level. … diverse [26], which makes it hard for the modified random forest to find a balanced tree-node split between subject labels and activity labels.

We further explore the behavior of the modified objective function by varying the number of decision trees in the random forest, using a set of fine-grained α values with a small step. … the average recognition accuracy using different numbers of … For instance, the modified objective function achieves better results on UCIDSADS, with its larger data scale and even label distribution, while on PAMAP2 we find a rapid performance degradation, which further explains why the null hypothesis nearly fails to be rejected. In [55] and [58], labeled and unlabeled target samples were used when training the models, while the other studies used only training data to construct the classifiers.

TABLE 9.
The average accuracy and F1-score of the random forest with the modified training strategy. The p-value from the paired t-tests on the F1-score is also presented.

… scenarios in terms of computational complexity and generalization, which, however, does not mean that deep learning methods are useless in cross-subject recognition. The characteristics of end-to-end training and automatic feature extraction make deep learning models flexible and easy to extend. For example, fine-tuning a trained deep learning model with a small number of labeled samples from the new target subject can quickly reduce the differences in data distribution and yield a personalized classification model. Table 11 shows the improved cross-subject recognition performance of the Conv2d-CNN model after fine-tuning, where n-shot means the number of samples taken from the testing subject. Traditional machine learning methods are limited by their training procedure and cannot fine-tune a pre-trained model, so we instead re-train the random forest model under the condition of leaking a small number of target samples. As shown in Table 12, on the mHealth dataset with its small data scale, the random forest model with leaked testing samples is slightly better than Conv2d-CNN, while on the UCIDSADS dataset with its larger data scale, the fine-tuned Conv2d-CNN performs better.
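The re-training protocol for the random forest with a few leaked target samples might look like the following sketch. The data, n_shot value, and model settings are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# source subjects' data and the held-out target subject's data
X_train = rng.normal(size=(150, 12))
y_train = rng.integers(0, 3, size=150)
X_target = rng.normal(size=(40, 12))
y_target = rng.integers(0, 3, size=40)

n_shot = 5  # number of labeled target samples "leaked" into training
leak = rng.choice(len(X_target), size=n_shot, replace=False)
X_aug = np.vstack([X_train, X_target[leak]])
y_aug = np.concatenate([y_train, y_target[leak]])

# trees cannot be fine-tuned incrementally, so the forest is re-trained
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_aug, y_aug)

# evaluate only on the target samples that were NOT leaked
test_mask = np.ones(len(X_target), bool)
test_mask[leak] = False
acc = clf.score(X_target[test_mask], y_target[test_mask])
```

Keeping the leaked samples out of the test set avoids inflating the reported cross-subject accuracy.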

This paper has some limitations. First, the datasets we use are small and complete, with relatively even distributions of activity labels. We have not covered situations with large amounts of missing data or significantly uneven labels, like the last subject in PAMAP2, which might be an advantage for the deep learning case.

FIGURE 13. The comparison of accuracy for each individual subject using the original and modified random forest on the three datasets. The x-axis denotes the different subjects.

Second, the explanation of why traditional machine learning performs better than deep learning on cross-subject HAR is limited: the decision boundaries of the different methods have not been explicitly examined in each LOSO test. Third, the deep learning architectures are inspired by previous studies, and we have not evaluated whether the structure has an implicit impact on the result (e.g., the number of convolutional layers in the CNN). Finally, although statistically significant, the improvement from the modified training process in the random forest is small, and we have not validated the methodology on tasks other than HAR to provide sufficient evidence of its superiority.

V. CONCLUSION
In this paper, five deep neural network models and three traditional machine learning models are trained and evaluated on three classic HAR datasets: mHealth, PAMAP2, and UCIDSADS. A strict cross-subject LOSO test is deployed to simulate new-subject scenarios and to evaluate the generalization performance of the deep neural networks and traditional machine learning in cross-subject recognition. The results indicate that all models experience significant performance degradation due to the heterogeneity among subjects, compared to non-cross-subject recognition. In general, the traditional machine learning methods using hand-crafted features achieve better cross-subject recognition than the deep learning models on the three datasets, and the analysis shows that automatic end-to-end feature extraction with deep neural networks is more susceptible to distribution differences between users and more prone to learning user-dependent features from the training sets. This paper also provides a novel decision-tree-based training strategy, which makes the random forest model achieve the best cross-subject HAR performance among all the evaluated learning models, with results competitive with state-of-the-art cross-subject HAR solutions. In detail, the average F1-scores (accuracies) on the three datasets are 94.49% (95.09%), 91.64% (92.21%), and 92.70% (93.29%). Future work will attempt other complex datasets and other learning frameworks like AdaBoost, GAN, and VAE to find the best solution for cross-subject HAR applications. The effectiveness of the proposed learning strategy for decision-tree-based methods will be further evaluated on other cross-subject applications such as handwriting classification and speech recognition.

ACKNOWLEDGMENT
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions, and would also like to thank Takashi Mifune from irasutoya.com for providing the publicly available figure samples presented in Fig. 1.