Human Activity Recognition Based on Acceleration Data From Smartphones Using HMMs

Smartphones are among the most popular wearable devices for monitoring human activities. Several existing methods for Human Activity Recognition (HAR) using data from smartphones are based on conventional pattern recognition techniques, but they rely on handcrafted feature vectors. This drawback is overcome by deep learning techniques, which unfortunately require lots of computing resources while generating less interpretable feature vectors. The current paper addresses these limitations through the proposal of a Hidden Markov Model (HMM)-based technique for HAR. More formally, the sequential variations of spatial locations within the raw data vectors are initially captured in Markov chains, which are later used for the initialization and the training of HMMs. Meta-data extracted from these models are then saved as the components of the feature vectors. The meta-data correspond to the overall proportion of time spent by the model observing each symbol over a long time span, irrespective of the state from which this symbol is observed. Classification experiments involving four classification tasks have been carried out on the recently constructed UniMiB SHAR database, which contains 17 classes, including 9 types of activities of daily living and 8 types of falls. The proposed approach achieves best accuracies between 92% and 98.85% for all the classification tasks. This performance is more than 10% better than prior work for 2 out of 4 classification tasks.


I. INTRODUCTION
Human Activity Recognition (HAR) has been gaining importance for decades owing to its capability to learn meaningful, high-level knowledge about various types of human activities, including (but not limited to): 1) Ambulation: walking, running, climbing stairs, etc.
The associate editor coordinating the review of this manuscript and approving it for publication was Yue Zhang.
More detailed descriptions of the existing types of human activities are available in [1] and [2]. A review of state-of-the-art techniques for abnormal HAR is also proposed in [3]. HAR falls into two main categories: video-based HAR and sensor-based HAR. Video-based HAR performs high-level analysis of videos or images containing human motions captured by cameras. No further details related to this category are provided in the current paper, which rather focuses on the second category.
Nevertheless, relevant surveys on video-based HAR are available in [4]-[6]. Sensor-based HAR is more popular and widely used because it better preserves privacy than video-based HAR. It relies on motion data from several types of smart sensors, including: 1) Body-worn sensors: These sensors are worn by the user to describe the body movements. They are generally embedded in smartphones, watches and standalone devices, which include sensors such as accelerometers, gyroscopes, etc. 2) Object sensors: They are attached to objects to capture object movements. Radio frequency identifiers (RFID) deployed in smart home environments or accelerometers fixed on objects (e.g., glass, cup) are generally used for this purpose. 3) Ambient sensors: They are used to capture the interaction between humans and the environment in a smart environment. There are many kinds of ambient sensors such as radars, microphones, pressure sensors, temperature sensors, WiFi, Bluetooth, etc. 4) Hybrid sensors: Here, the three former types of sensors are combined. Further details are available in [7]. Besides HAR, the aforementioned sensors are also adapted to several other topics, including indoor positioning methods [8] and pedestrian dead reckoning [9], [10]. Detailed presentations of these types of sensors and related papers are provided in [11]. When the experiments are performed in smart homes with a variety of sensors, the HAR data processing needs to be distributed over a group of heterogeneous, autonomous and interacting entities in order to be more efficient. An efficient multiagent approach for HAR in this context has recently been proposed in [12].
HAR can be treated as a typical pattern recognition problem, and existing papers in HAR can be organized into two main categories: 1) Conventional pattern recognition techniques [13]-[46], where the feature extraction and the model building steps are separated. 2) Deep learning techniques [43], [44], [47]-[88], where the feature extraction and model building processes are performed simultaneously in the deep learning models. Conventional pattern recognition techniques embed the following drawbacks, analyzed in [11]: 1) The features are extracted via a handcrafted process that heavily relies on human experience and domain knowledge. 2) Only shallow features can be learned according to human expertise. Such shallow features can only be used for recognizing low-level activities (walking, running, etc.), but they can hardly enable the accurate inference of complex activities like having a coffee, for example. 3) These techniques often require a large amount of well-labeled data to train the model. However, most of the activity data remain unlabeled in real applications. The aforementioned drawbacks are overcome by deep learning sensor-based solutions. However, deep learning techniques embed the following limitations, analyzed in [2]: 1) They require lots of computing resources.
2) The parameters of the resulting models are difficult to adjust.
3) The components of the resulting feature vectors are less interpretable. More details related to this limitation are available in [89], where an overview of explainable artificial intelligence for deep neural networks is proposed. The current paper addresses these limitations through the proposal of a Hidden Markov Model (HMM)-based technique for HAR which derives interpretable feature vectors from the model's meta-data. The parameters of the resulting HMMs are understandable and can therefore be easily adjusted. Furthermore, the models' training time is reasonable compared to deep models. Raw data from triaxial smartphone sensors are preferred here because studies demonstrated that samples from smartphone sensors (e.g., accelerometer and gyroscope) are accurate enough to be used in the clinical domain, such as for ADL recognition [23].
More precisely, given a signal window w, we first represent w as a sequence w = w 1 . . . w T of 3-dimensional vectors. The sequential variations of spatial locations within these 3-dimensional vectors are then captured to transform w into the Markov chain δ w , whose content later serves to fix the parameters of an initial HMM associated with w. These parameters are then iteratively adjusted at each iteration of the Baum-Welch algorithm to obtain the final HMM λ w . Thereafter, meta-data derived from λ w are saved as the components of the descriptor vector − → w associated with w. The performances of the proposed approach are evaluated through flat classification experiments on UniMiB SHAR [23], a database of acceleration patterns captured by smartphones, constructed in 2017 for the objective evaluation of ADL recognition and fall detection techniques.
The rest of this paper is organized as follows: The state of the art is presented in Section II, followed by a summarized presentation of HMMs in Section III. A detailed description of the approach proposed in this paper is given in Section IV. Experimental results are presented in Section V and the last section is devoted to the conclusion.

II. STATE OF THE ART
A. RELATED WORK
1) CONVENTIONAL PATTERN RECOGNITION TECHNIQUES
Conventional pattern recognition solutions for HAR generally rely on the process depicted in Figure 1. The classifiers trained during this process include, among others [25]:
• Hidden Markov models (HMMs) [36], [42]-[46].
4) Classification step: The previously trained model (classifier) is now used for inferring the corresponding human activity. The accuracy is the metric most used for evaluating the performances of these classifiers. Other metrics like the F1-measure, the precision and the recall are also used, though more rarely.

2) DEEP LEARNING TECHNIQUES
With deep learning techniques for HAR, the feature extraction and model building processes are performed simultaneously in the deep learning models, as shown in Figure 2. Here, the feature vectors are automatically learned through the network χ instead of being manually designed. A detailed study of deep neural networks for HAR is available in [90]. An evaluation framework allowing a rigorous comparison between handcrafted features and features generated by several deep models is proposed in [91]. The following deep models are the most often used in HAR: • Deep Neural Networks (DNN) [47]-[50].
• Hybrid models [7], [49], [60], [67], [72]. Although handcrafted features are considered a drawback in HAR, these features can nevertheless enhance the performances of CNNs in some face-related problems, including age/gender estimation, face detection and emotion recognition [92]. Handcrafted features have also been combined with CNN-generated features for HAR [73].
Among the datasets used in the existing deep learning techniques listed in Table 1, OPPORTUNITY, PAMAP2 and UCI-HAD are the most frequently used. However, these datasets have not been considered in the current work for several reasons. OPPORTUNITY and PAMAP2 were designed with only 4 and 9 participants respectively, which are low values. Additionally, the subset of human activities selected by the authors for these two datasets varies from one work to another. The dataset UCI-HAD was designed with enough participants (30), but it only enables the identification of 6 human activities. The advantages of the dataset UniMiB SHAR (selected here) compared to the other publicly available datasets listed in Table 1 are thoroughly analyzed in [23]. A summarized description of the UniMiB SHAR dataset is given in Section V-A.

B. PROBLEM STATEMENT
Existing approaches for HAR rely on conventional pattern recognition techniques or deep learning techniques for deriving feature vectors from the raw data acquired by diverse sensors. These feature vectors are used for classification purposes. Conventional pattern recognition techniques only enable learning shallow features extracted via handcrafted processes, and they require a large amount of well-labeled data. Deep learning techniques for HAR overcome these drawbacks, but they require lots of computing resources and generate less interpretable feature vectors. Additionally, it is challenging to adjust the parameters of the resulting deep models.
The current paper attempts to provide a solution to these limitations. This solution is based on the following observation: every human activity is a sequential process; hence, it embeds a natural temporality that is meaningful for its characterization. For this reason, human activity is generally captured at a precise sampling frequency by dedicated sensors which sequentially record several values in a raw data vector. Unfortunately, the natural temporality embedded in the raw data vectors is ignored during the computation of existing feature vectors. Our opinion is that the sequential variations of spatial locations within each raw data vector enable the derivation of relevant feature vectors which may provide a better characterization of the considered human activity.
The current paper follows the principle presented in Figure 3 to analyze these sequential variations in order to generate one new feature vector − → w from each raw data vector w through a machine learning process. HMMs have been selected in this paper because, on the one hand, they are suitable for sequential data and their training time is reasonable compared to deep models; on the other hand, these models are managed by algorithms whose robustness and efficiency are well established. HMMs have already been used in HAR systems as classifiers [36], [42], [45], [46]. They have also been combined with deep learning techniques [43], [44]. However, they are used here in a different way and for a very different purpose (i.e., feature vectors are extracted from their meta-data).

5) The initial state probability distribution π.
A HMM λ = (A, B, π) can generate an observation sequence o = o 1 o 2 . . . o X composed of X symbols observed by the sequence of states q = q 1 q 2 . . . q X , as described in the Markov chain (MC) shown in Figure 4. In order to obtain the MC presented in Figure 4, the following algorithm is executed: 1) Select the initial state s j ∈ S according to the distribution π and set x = 0. 2) Set x = x + 1 and change the current state to q x = s j . 3) Select the symbol o x ∈ ϑ to be observed at state q x according to the distributions in B. 4) If (x < X) go to step 5, else terminate. 5) Select the state transition to be realized from the current state q x to another state s j ∈ S according to the distribution A, then go to step 2.

C. HMM TRAINING
Given a training observation sequence, the parameters of a HMM λ are iteratively re-estimated so that the re-estimated model better fits the sequence. The Baum-Welch algorithm [94] is generally used to perform this re-estimation. This algorithm runs in θ(γ .X .N 2 ), where γ is the user-defined maximum number of iterations. In this paper, the value γ = 100 is selected following [95].
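The five-step generation procedure above can be sketched as follows. The array-based representation of λ = (A, B, π) and the function name are illustrative choices of ours; the paper does not provide an implementation.

```python
import numpy as np

def sample_hmm(A, B, pi, X, rng):
    """Generate an observation sequence o_1..o_X and its state sequence
    q_1..q_X from a HMM lambda = (A, B, pi), following the five-step
    procedure described above."""
    N = len(pi)                       # number of states
    q, o = [], []
    j = rng.choice(N, p=pi)           # step 1: initial state drawn from pi
    for _ in range(X):
        q.append(j)                   # step 2: current state q_x
        o.append(rng.choice(B.shape[1], p=B[j]))  # step 3: symbol from B
        j = rng.choice(N, p=A[j])     # step 5: next state drawn from A
    return q, o
```

The while-loop structure of the original algorithm is replaced by a fixed-length loop, which is equivalent since the termination test only compares x to X.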

D. STATIONARY DISTRIBUTION OF A HMM
The stationary distribution ϕ of a HMM λ estimates, for each state s j , the overall proportion of time spent by λ in state s j over a long time span. ϕ can be extracted from any line of the matrix A r = A × A × . . . × A (r times) when r → +∞. Therefore, the computation of ϕ requires θ(r.N 3 ) arithmetic operations.
FIGURE 5. Extraction of the 3 data vectors x, y and z from the raw data generated by the triaxial accelerometer of a smartphone located inside the subject's waist pocket during a fall.
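The extraction of ϕ from a line of A r can be sketched with a plain matrix power; the function name and the default r = 100 (the value used later in this paper) are our own choices.

```python
import numpy as np

def stationary_distribution(A, r=100):
    """Approximate the stationary distribution phi of a HMM whose
    state-transition matrix is A by extracting one line of A^r for a
    large user-defined r: every line of A^r converges to phi."""
    Ar = np.linalg.matrix_power(A, r)
    return Ar[0]                  # any line works once A^r has converged
```

For an ergodic chain, ϕ also satisfies ϕ = ϕA, which gives a quick correctness check of the approximation.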

IV. THE PROPOSED APPROACH
A. MAIN IDEA
Given a human activity, we assume that the experimental data are recorded by a triaxial smartphone accelerometer which generates 3 vectors x = (x 1 , . . . , x T ), y = (y 1 , . . . , y T ) and z = (z 1 , . . . , z T ) for each signal window w. These vectors are respectively obtained after sampling the signal on each Cartesian axis at the same sampling frequency. Figure 5 depicts this process for fall detection, where the smartphone is located inside the subject's waist pocket. In that figure, a signal window extracted from the raw data generated by the smartphone's triaxial accelerometer is sampled on each Cartesian axis to derive the 3 data vectors x, y and z.
Existing techniques for HAR generally concatenate these 3 raw data vectors to obtain a unique vector w = (x 1 , . . . , x T , y 1 , . . . , y T , z 1 , . . . , z T ). All the (3T )-dimensional raw data vectors resulting from the application of this principle on all the signal windows of the database are then used for handcrafted feature extraction or for deep feature extraction. The main idea of this work is related to the fact that the former raw data vectors x, y and z can also be viewed as one sequence w = w 1 w 2 . . .w T composed of 3-dimensional data vectors as depicted in Figure 7 where each w i = (x i , y i , z i ) with (1 ≤ i ≤ T ). Hence, the sequential variations of the spatial locations within the T vectors composing w can be captured into a MC δ w which can later be used to initialize and train a dedicated HMM λ w . Thereafter, one single feature vector − → w derived from the model's meta-data can finally be associated with w for classification purposes.
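The switch from one concatenated (3T)-dimensional vector to a sequence of T 3-dimensional vectors is a simple reshaping; a minimal sketch (the function name is ours):

```python
import numpy as np

def as_sequence(x, y, z):
    """View the three axis vectors (each of length T) as one sequence
    w = w_1 w_2 ... w_T of 3-dimensional vectors w_i = (x_i, y_i, z_i),
    instead of concatenating them into one (3T)-dimensional vector."""
    return np.stack([x, y, z], axis=1)   # shape (T, 3)
```

Row i of the result is exactly the vector w i = (x i , y i , z i ) used in the remainder of the approach.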

B. METHODOLOGY
As summarized in Figure 6, the proposed methodology for deriving the feature vector − → w associated with the sequence w = w 1 w 2 . . .w T is composed of the following three steps: 1) The spatial locations of the T data vectors composing the input sequence w = w 1 w 2 . . .w T are captured by transforming w into the MC δ w through a calculation involving each w i and the K central vectors derived from the off-line K-means clustering of the training data vectors. More explanations about this step are provided in Section IV-C.
2) The content of δ w is used for initializing a HMM, which is then trained using the Baum-Welch algorithm to learn the sequential variations occurring inside δ w (and consequently, inside w). The resulting model is λ w . Section IV-D is devoted to the presentation of this step. 3) Meta-data are finally extracted from λ w to derive the feature vector − → w . More precisely, − → w has M components, where M is the number of symbols of λ w . The k th component w k of − → w is the overall proportion of time spent by λ w observing the symbol v k ∈ ϑ over a long time span, irrespective of the state from which v k is observed. This step is fully presented in Section IV-E.

C. TRANSFORMATION INTO MARKOV CHAIN
As shown in Figure 4, a MC is composed of symbols and states, both belonging to finite sets. In order to transform a signal window into a MC, these two finite sets must be defined.
To determine the set of symbols, we first cluster the data vectors derived from all the signal windows found in the training database. More formally, let H = {H 1 , . . . , H n } be the set of activities found in the training database, each activity being represented by |H j | signal windows, with (1 ≤ j ≤ n). Given that each signal window w is now considered as a sequence w 1 . . .w T of 3-dimensional data vectors, the experimental database becomes a collection composed of T × Σ_{j=1}^{n} |H j | data vectors. The k-means clustering algorithm [96] is then executed off-line to organize this collection of training data vectors into K clusters, where K is a positive user-defined integer. The resulting set ϑ = {v 1 , . . . , v K } of clusters is finally considered as the set of symbols of the model, in such a way that all the vectors found inside a given cluster are associated with the same symbol. If we note v(w i ) the cluster containing the data vector w i , then the signal window w = w 1 . . .w T is associated with the sequence of symbols v(w 1 ). . .v(w T ). The k-means clustering algorithm is preferred in this work due to its simplicity of implementation and the quality of its resulting clusters.
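The off-line clustering step can be sketched with a plain Lloyd k-means over the collection of 3-dimensional training vectors. The paper uses a Matlab k-means; the deterministic initialization below is a simplification of ours, not the authors' setup.

```python
import numpy as np

def kmeans(X, K, iters=50):
    """Plain Lloyd k-means over the collection X of all 3-dimensional
    training data vectors.  The K cluster indices play the role of the
    symbols: the label of w_i identifies its cluster v(w_i)."""
    idx = np.linspace(0, len(X) - 1, K).astype(int)  # simple deterministic init
    centers = X[idx].copy()
    for _ in range(iters):
        # assign every vector to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```

Mapping a new window w = w 1 . . .w T to its symbol sequence v(w 1 ). . .v(w T ) then amounts to assigning each w i to its nearest center.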
To determine the set of states, we focus on the spatial locations of the data vectors inside each cluster. More formally, consider a data vector w i and let ṽ(w i ) be the center vector of cluster v(w i ). We first evaluate the distance between w i and ṽ(w i ), then we compare the resulting distance to the highest distance between any data vector of cluster v(w i ) and ṽ(w i ). This comparison leads to the computation of a percentage α(w i ) which spatially characterizes each data vector w i inside its cluster. Given a selected distance measure dist between vectors, the computation scheme of α(w i ) is shown in (1).
Our objective is to consider all the possible values of α(w i ) as the set of states. In these conditions, Figure 8 depicts the resulting 'pseudo' MC δ w associated with w = w 1 w 2 . . .w T .
If the value of m is very high, the width of each slice s j becomes tiny, and all the elements in s j converge to the unique value (100/m) × j. In that case, the elements of s j can be approximated by this single value, which we identify here by the index j of slice s j . Under these conditions, the finite set {s 0 , s 1 , . . . , s m } of slices can be considered as the set of states. This reasoning enables defining the valid MC δ w associated with w = w 1 w 2 . . .w T by replacing every α(w i ) appearing in δ w with the value β(w i ), which is the index j of the slice s j containing α(w i ), as shown in (3). Figure 9 shows the resulting MC.
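The computation of α(w i ) and β(w i ) can be sketched as follows. Since Eqs. (1)-(3) are not reproduced in the text, the slice width 100/m and the rounding rule below are our reading of them, not a verbatim transcription.

```python
import numpy as np

def alpha_beta(w, labels, centers, m=50):
    """Spatially characterise each data vector w_i inside its cluster:
    alpha(w_i) is the distance from w_i to its cluster center, as a
    percentage of the largest such distance in that cluster (Eq. (1));
    beta(w_i) is the index j of the slice s_j of [0, 100] that contains
    alpha(w_i) (Eqs. (2)-(3)), i.e. the state of the MC delta_w."""
    d = np.linalg.norm(w - centers[labels], axis=1)  # dist(w_i, center)
    alpha = np.zeros(len(w))
    for k in np.unique(labels):
        in_k = labels == k
        dmax = d[in_k].max()
        if dmax > 0:
            alpha[in_k] = 100.0 * d[in_k] / dmax
    beta = np.rint(alpha * m / 100.0).astype(int)    # state index in {0..m}
    return alpha, beta
```

With m = 50 (the value used later in the paper), each β(w i ) falls in {0, . . . , 50}, matching the 51 states reported in Section V-B.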
Proceeding this way, δ w effectively embeds information related to the sequential variations of spatial locations within w because: 1) The sequence v(w 1 ). . .v(w T ) of symbols embeds information related to the sequential variations of clusters within w.
2) The corresponding sequence β(w 1 ). . .β(w T ) of states embeds information related to the sequential variations of spatial locations inside the various clusters within w. Consequently, if a HMM λ w is initialized and trained according to the content of δ w , this model will learn all the information related to these sequential variations.

D. HMM INITIALIZATION AND TRAINING
1) DESIGN OF THE INITIAL HMM
Given a positive user-defined constant ε, the parameters of the initial HMM λ 0 w associated with w are set to statistically capture the state transitions and the symbol probability distributions from the content of δ w as follows: 1) The set of symbols is the set ϑ = {v 1 , . . . , v K } of clusters generated by the k-means clustering algorithm, the number K = M of clusters (symbols) being user-defined.
2) The set of states is S = {s 0 , s 1 , . . . , s m } whose content is computed in (2) where m is the user-defined number of slices used to split the interval [0, 100]. Consequently, the number of states is N = m + 1.
3) The probability of transiting from state s j to state s k is calculated in (4), where transit(s j , s k , δ w ) is the number of transitions from state s j to state s k in δ w and transit(s j , −, δ w ) is the number of transitions from state s j to any destination in δ w .
4) The probability to observe symbol v k at state s j is calculated in (5), where observe(v k , s j , δ w ) is the number of times symbol v k is observed at state s j in δ w , and observe(−, s j , δ w ) is the number of occurrences of state s j in δ w .
5) The probability that the observation starts with state s j is calculated in (6), where start(s j , δ w ) = 1 if δ w starts with state s j , 0 otherwise.
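The five initialization rules above can be sketched as follows. Eqs. (4)-(6) are not reproduced in the text, so the exact denominators below (counts plus ε) are our reading of the description; the function name and the aligned-sequence representation of δ w are ours.

```python
import numpy as np

def initial_hmm(states, symbols, N, M, eps=1.0):
    """Build lambda^0_w = (A0, B0, pi0) from the MC delta_w, given as the
    aligned state sequence beta(w_1)..beta(w_T) (`states`) and symbol
    sequence v(w_1)..v(w_T) (`symbols`).  Following Eqs. (4)-(6), eps is
    added to every denominator to avoid divisions by zero."""
    trans = np.zeros((N, N))
    obs = np.zeros((N, M))
    pi0 = np.zeros(N)
    for t in range(len(states) - 1):
        trans[states[t], states[t + 1]] += 1.0   # transit(s_j, s_k, delta_w)
    for s, v in zip(states, symbols):
        obs[s, v] += 1.0                         # observe(v_k, s_j, delta_w)
    A0 = trans / (trans.sum(axis=1, keepdims=True) + eps)
    B0 = obs / (obs.sum(axis=1, keepdims=True) + eps)
    pi0[states[0]] = 1.0 / (1.0 + eps)           # start(s_j, delta_w) = 1
    return A0, B0, pi0
```

Because of the ε in every denominator, no line of A0, B0 or pi0 sums to 1; this is precisely the situation corrected by the readjustment of Section IV-D2.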

2) READJUSTMENT OF THE INITIAL HMM
The parameters of λ 0 w are not probability distributions. This inconvenience is intentionally introduced by adding ε to the denominators of its various components in order to avoid possible divisions by zero and zero probabilities. In this work, we experimentally fixed ε = 1. An equitable redistribution of the missing quantity is applied to each element of each line of λ 0 w = (A 0 w , B 0 w , π 0 w ) to obtain the readjusted initial model λ 1 w = (A 1 w , B 1 w , π 1 w ), whose parameters are:
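The equitable redistribution can be sketched as follows (the function name is ours): each line's missing probability mass is shared equally among all elements of that line.

```python
import numpy as np

def readjust(P):
    """Equitable redistribution: add to every element of each line the
    same share of that line's missing probability mass, so that each
    line of lambda^0_w becomes a proper distribution in lambda^1_w."""
    P = np.atleast_2d(np.asarray(P, dtype=float))
    missing = 1.0 - P.sum(axis=1, keepdims=True)
    return P + missing / P.shape[1]
```

Applied to A 0 w , B 0 w and π 0 w (the latter viewed as a single line), this yields a valid HMM ready for training.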

3) TRAINING OF THE HMM
The readjusted initial HMM λ 1 w is trained to learn the sequential variations occurring inside δ w using the Baum-Welch algorithm. The resulting HMM λ w is the final model associated with w. During this training phase, the training sequences are exclusively composed of symbols appearing in δ w .
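The training step can be sketched with a compact, scaled Baum-Welch re-estimation for a discrete-symbol HMM. This is only an illustrative sketch (the paper's C implementation is not provided); the paper fixes the maximum number of iterations at γ = 100.

```python
import numpy as np

def baum_welch(A, B, pi, obs, iters=100):
    """Scaled Baum-Welch re-estimation of (A, B, pi) on the symbol
    sequence `obs`; returns the re-estimated model and the
    log-likelihood computed during the last forward pass."""
    A = np.array(A, dtype=float); B = np.array(B, dtype=float)
    pi = np.array(pi, dtype=float)
    obs = np.asarray(obs); T, N = len(obs), len(pi)
    for _ in range(iters):
        # scaled forward pass
        al = np.zeros((T, N)); c = np.zeros(T)
        al[0] = pi * B[:, obs[0]]; c[0] = al[0].sum(); al[0] /= c[0]
        for t in range(1, T):
            al[t] = (al[t - 1] @ A) * B[:, obs[t]]
            c[t] = al[t].sum(); al[t] /= c[t]
        # scaled backward pass
        be = np.ones((T, N))
        for t in range(T - 2, -1, -1):
            be[t] = A @ (B[:, obs[t + 1]] * be[t + 1]) / c[t + 1]
        g = al * be                         # state posteriors gamma_t(i)
        xi = np.zeros((N, N))               # expected transition counts
        for t in range(T - 1):
            xi += np.outer(al[t], B[:, obs[t + 1]] * be[t + 1]) * A / c[t + 1]
        # re-estimation of pi, A and B from the posteriors
        pi = g[0].copy()
        A = xi / g[:-1].sum(axis=0)[:, None]
        newB = np.zeros_like(B)
        for k in range(B.shape[1]):
            newB[:, k] = g[obs == k].sum(axis=0)
        B = newB / g.sum(axis=0)[:, None]
    return A, B, pi, np.log(c).sum()
```

Being an EM procedure, each iteration never decreases the likelihood of the training sequence, which offers a simple sanity check of the implementation.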

E. FEATURE VECTOR COMPUTATION
The feature vector − → w = (w 1 , . . . , w M ) associated with w is finally derived from λ w = (A w , B w , π w ) by analyzing the behavior of λ w regarding each symbol v k . More precisely, we propose to consider w k as the overall proportion of time spent by λ w observing symbol v k over the long term, irrespective of the state from which this observation is realized. In order to compute w k , one must first evaluate the overall proportion of time spent by λ w observing v k in each state s i over the long term. The value of w k is finally obtained by repeating this process for every state s i and summing the resulting proportions (7).
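The whole meta-data extraction reduces to weighting each state's symbol distribution by that state's stationary probability and summing over states. A minimal sketch (function name and the default r = 100 are ours):

```python
import numpy as np

def feature_vector(A_w, B_w, r=100):
    """Meta-data extraction (Eq. (7)): component k of the feature vector
    is the long-run proportion of time lambda_w spends observing symbol
    v_k, irrespective of the emitting state, i.e. the stationary
    probability of each state weighted by that state's probability of
    observing v_k, summed over all states."""
    phi = np.linalg.matrix_power(A_w, r)[0]   # stationary distribution
    return phi @ B_w                          # shape (M,): one entry per symbol
```

Since ϕ and every line of B w are probability distributions, the resulting components are nonnegative and sum to 1, which makes them directly interpretable as time proportions.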

A. EXPERIMENTAL DATASET
Among the publicly available databases recorded with smartphones listed in Table 1, the dataset UniMiB SHAR [23] has been selected in this work. It is a database of acceleration patterns measured by smartphones, intended as a common benchmark for the objective evaluation of both ADL recognition and fall detection techniques. This dataset contains 17 human activities, including 9 different types of ADLs and 8 different types of falls. Table 3 presents the description of each ADL/fall.
During the construction of the selected database, human activities were performed by 30 subjects (including 24 females) between 18 and 60 years of age. Each ADL/fall type was performed twice by each subject: the first time with the smartphone in the right pocket and the second time with the smartphone in the left pocket. Signal windows of 3 s each were saved during every experimental trial of a given ADL/fall performed by each subject. For each signal window w, the accelerometer recorded three data vectors (samples) x, y and z, each having T = 151 components. The database contains a total of 11,771 samples not equally distributed across activity types: 7,579 samples describing ADLs and 4,192 samples describing falls. Further details about the data acquisition, the experimental protocols, the characteristics of the subjects, the signal segmentation and the signal processing are available in [23].

B. EXPERIMENTAL SETTINGS
The classification experiments performed in this work were realized on a personal computer having 16 GB of main memory and an Intel(R) Core(TM) i7-8665U CPU @ 1.90 GHz (2.11 GHz). We evaluated four different classification tasks, following [23]: 1) AF-17, which contains 17 classes (9 ADL classes and 8 FALL classes). 2) A-9, which contains 9 ADL classes. 3) F-8, which contains 8 FALL classes. 4) AF-2, which contains 2 classes obtained by considering all the ADLs as one class and all the FALLs as one class. The Euclidean distance was selected in this work as the distance dist between two vectors required in (1). Equation (2) was used to split the interval [0, 100] into 51 slices (i.e., we fixed m = 50), following [97], where the authors used analogous reasoning to split the same interval in order to compare finite sets of histograms using HMMs. Hence, the number of states of the HMMs designed in the current work is 51.
To analyze the impact of the user-defined number K of clusters discovered by the k-means clustering algorithm on the performances of the proposed approach, we experimented with the following 5 values of K: 20, 40, 60, 80 and 100. Consequently, the number of symbols of the HMMs designed in this work varies accordingly. The first step of the machine learning process presented in Figure 6, corresponding to the transformation of every signal window w into the MC δ w , was entirely developed in Matlab. This choice was dictated by the fact that the database files were available as Matlab tables. Therefore, we executed a Matlab version of the k-means clustering algorithm during this step. Depending on the classification task and on the selected number of clusters, the off-line clustering step could sometimes take over an hour.
The two remaining steps of the proposed machine learning process were both developed in the C language. Given a signal window w and its associated HMM λ w = (A w , B w , π w ), the stationary distribution ϕ w of λ w is obtained in this work by extracting the first line of the matrix (A w ) r with r = 100. After discovering the K clusters for each classification task, the computation of each feature vector − → w associated with w took between 50 and 3500 ms, depending on the content of the signal window. The overall time taken for the computation of the feature vectors associated with all the signal windows varied from one classification task to another, depending on the considered number of signal windows. For each classification task in {AF-2, F-8, A-9, AF-17} and for each number of clusters in {20, 40, 60, 80, 100}, the resulting descriptor vectors have been saved into one online available 'arff' file, which is taken as input by WEKA. The Matlab items (codes and tables) required to perform the off-line k-means clustering and the transformation into Markov chains are also available through the same URL.

C. CLASSIFICATION PERFORMANCES
Classification experiments were realized with the WEKA software [98] through 5-fold cross-validation, following [23]. The following classifiers have been selected in this paper, with their corresponding names in WEKA shown in brackets: 1) k-NN (IBk) with k = 1, used with both the Euclidean and the Manhattan distances. 2) SVMs with polynomial kernel (SMO). 3) Multilayer Perceptron (MLP). 4) Decision trees (J48). 5) Random Forest (RF), i.e., bootstrap-aggregated decision trees with 300 bagged classification trees. Table 4a presents the best classification accuracies for each classification task using the proposed feature vectors, irrespective of the number of clusters generating these accuracies. According to Table 4a, the selected classifiers can be ranked in descending order of performance for all the classification tasks as follows: RF, IBk (Manhattan), IBk (Euclidean), J48, MLP and SMO. The content of Table 4a demonstrates the high quality of the proposed descriptor vectors derived from the proposed HMM-based learning process, with best accuracies always above 75%, 84% and 92% for the J48, IBk and RF classifiers respectively. This table also reveals the unsatisfactory performances exhibited by the SMO and MLP classifiers for the AF-17 classification task and the very poor performances of these same classifiers for the F-8 classification task. Detailed classification performances for each classification task are presented in Tables 4b to 4f for 20, 40, 60, 80 and 100 clusters respectively. According to these tables, the variations of the user-defined number K of clusters do not significantly influence the performances of the proposed technique. Indeed, the gaps between the classification accuracies for the 5 experimental values of K are low. Consequently, it is not worth selecting a high value of K. We therefore recommend low values between 20 and 40.

D. COMPARISONS WITH RELATED WORK
We have compared the best classification performances obtained in this paper with those obtained in [23], where the authors conducted classification experiments on the same database and for the same classification tasks using classifiers that include SVMs with a radial basis kernel.
Comparison results presented in Table 5 reveal that the approach proposed in this paper always outperforms [23], with positive accuracy gains reaching +13.45% and +10.36% for F-8 and AF-17 respectively.
TABLE 5. Comparison with [23]. Accuracies are in (%). The best accuracies are in bold.

E. TIME COST
1) THEORETICAL TIME COST
The main contribution of the current paper is the computation of the proposed feature vector − → w associated with an input sequence w of raw data vectors, as can be observed in Figure 6. This computation embeds a k-means clustering whose time cost is not considered in this evaluation because it is realized off-line. The remaining steps of the computation of − → w are: 1) The transformation of w into the Markov chain δ w (see Section IV-C).
2) The HMM initialization using the content of δ w to obtain the initial model λ 0 w (See Section IV-D1).
3) The readjustment of λ 0 w to obtain λ 1 w (See Section IV-D2). 4) The HMM training of λ 1 w with the Baum-Welch algorithm to obtain the final model λ w (See Section IV-D3).
5) The extraction of meta-data from λ w to derive − → w (see Section IV-E). The time cost of the first three steps was experimentally very low compared to the time cost of the last two steps. Consequently, only the HMM training and the meta-data extraction are time consuming. According to Section III-C, the HMM training phase runs in θ(γ .T .(m + 1) 2 ). The main operation realized during the meta-data extraction is the computation of the stationary distribution of the HMM, which runs in θ(r.(m + 1) 3 ) as stated in Section III-D. Therefore, the overall time cost of our contribution is approximated by θ(r.(m + 1) 3 + γ .T .(m + 1) 2 ), where: 1) r is the user-defined number of matrix products needed to compute the stationary distribution; in this paper, r = 100. 2) γ is the user-defined maximum number of iterations of the Baum-Welch algorithm; in this paper, γ = 100. 3) T is the number of vectors in the sequence w; in this paper, T = 151. 4) m is the user-defined number of slices used to split the interval [0, 100]; in this paper, m = 50. This time cost can be further reduced by gradually reducing the values of parameters like r or γ without negatively impacting the classification results. If the stationary distribution is discovered after r 0 iterations with (r 0 < r), it will not change during the (r − r 0 ) remaining iterations. Similarly, if the Baum-Welch algorithm reaches its local optimum after γ 0 iterations with (γ 0 < γ ), this local optimum will not change during the (γ − γ 0 ) remaining iterations. The experimental value T = 151 cannot be modified because it was fixed during the design of the experimental database. Similarly, the experimental value m = 50 cannot be changed here because it was fixed after several experiments performed in [97].

2) EXPERIMENTAL TIME COST
In order to measure the execution speed of the two most time-consuming stages of the processing chain (i.e., the HMM training and the meta-data extraction), we executed and benchmarked the program on two different architectures: a desktop and the Nvidia Jetson TX2 [99], which is comparable in terms of hardware to an embedded platform.
The main characteristics of the experimental desktop are the following: The experiment presented in this section was performed using the 11,771 MCs obtained when the k-means algorithm is executed with k = 20 (our smallest experimental value of k).
The Nvidia Jetson TX2 has 8 GB of 128-bit DDR4 main memory. It runs L4T (Linux for Tegra, i.e., Linux kernel 4.9) and has a Parker SoC consisting of:
1) A Pascal GPU with 256 CUDA cores (not used in this experiment).
2) One HMP cluster (6 cores) including 2 Denver cores (a custom core designed by Nvidia to run the ARMv8 ISA) and 4 Arm Cortex-A57 cores (also running the ARMv8 ISA).
All the cores are compatible with the ARMv8 ISA, the 64-bit architecture of Arm. The Nvidia Jetson TX2 has several power supply modes that influence the frequency of the different cores. The modes used during our tests were the 'Max-N' mode, which allows each core to reach 2 GHz, and the 'Max-P Core-All' mode, which allows each core to run at 1.4 GHz (i.e., a trade-off between performance and energy consumption). During the current experiment, the program was executed on cores excluded from the Linux scheduler, dedicating each core entirely to the program and thus not distorting the performance measurements. The following units were selected to measure the performance of our program using the Performance Monitoring Unit and syscalls:
1) Milliseconds (ms)
2) Instructions per cycle (IPC)
3) Instructions per second (IPS)
4) Floating point operations per second (FLOPS)
Tables 6 and 7 present the performance of our program on the experimental devices. According to these tables, the best performance is obtained by the desktop, with a mean time cost of less than 2 seconds. Nevertheless, the performance on the Nvidia Jetson TX2 is also very interesting, with a mean time cost of around 12 seconds when the Denver core at 2 GHz is used. A wattmeter was additionally used to measure the energy consumption in order to deduce the Average Power Consumption (APC) of each experimental device. This enabled us to calculate the Figure Of Merit (FOM) of each device by multiplying the Average Execution Time (AET) by the APC, as shown in Equation (8).
According to the FOM presented in Table 8, the Nvidia Jetson TX2 is 2.138 times more efficient than the 4.6 GHz desktop for the execution of our program.
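Equation (8) amounts to a single product, FOM = AET × APC (seconds × watts = joules per run, so a lower FOM means a more efficient device). The sketch below illustrates the computation with purely hypothetical AET and APC values; the actual measured values are those reported in Table 8.

```python
def figure_of_merit(aet_s, apc_w):
    """Figure Of Merit from Equation (8): FOM = AET * APC.
    aet_s: Average Execution Time in seconds.
    apc_w: Average Power Consumption in watts.
    The product is the average energy (in joules) spent per run."""
    return aet_s * apc_w

# Hypothetical numbers for illustration only (not the values of Table 8):
fom_desktop = figure_of_merit(aet_s=2.0, apc_w=65.0)   # 2 s at 65 W -> 130 J
fom_jetson = figure_of_merit(aet_s=12.0, apc_w=7.5)    # 12 s at 7.5 W -> 90 J
efficiency_ratio = fom_desktop / fom_jetson            # > 1: Jetson more efficient
```

Even with a much longer execution time, a low-power device can thus obtain a better (lower) FOM, which is the pattern observed for the Jetson TX2 in Table 8.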

F. MAIN ASSETS
The technique proposed in this paper:
1) Considers the sequential variations of spatial locations inside the raw data vectors, unlike existing techniques.
2) Uses HMMs for the extraction of feature vectors, unlike existing techniques which only use these models during the classification step.
3) Generates feature vectors whose components are interpretable, unlike existing techniques generating handcrafted or less interpretable feature vectors.
4) Performs the feature vector extraction in reasonable time compared to deep learning techniques.
5) Efficiently performs HAR and outperforms prior work on the selected database.
6) Has been demonstrated viable on embedded processors found in 2014-era mobile phones and tablets, namely Arm and Denver cores clocked at 1.4 GHz and 2 GHz, which suggests that current mobile phones and tablets would come much closer to the performance of a PC.

VI. CONCLUSION
This paper addresses the problem of HAR based on acceleration data from smartphones. Existing approaches for this purpose rely either on conventional pattern recognition techniques or on deep learning techniques. Conventional pattern recognition techniques generate shallow handcrafted feature vectors which heavily rely on human experience/expertise. Deep learning techniques are preferable, but they require lots of computing resources while generating less interpretable feature vectors. The current paper attempts to overcome these limitations by proposing an efficient HMM-based technique that generates interpretable feature vectors at a reasonable time cost, with a demonstrated feasibility of implementation on embedded processors, namely Denver and Arm cores. Four different classification tasks have been tested on the UniMiB SHAR dataset containing 17 human activities, including 9 types of ADLs and 8 types of falls. Classification results have demonstrated the efficiency of the proposed approach, with best accuracies between 92% and 98.85% over all the classification tasks. This performance is more than 10% better than the state of the art for two of the classification tasks.
The main contribution of the current work is the HMM-based sequential learning of the sample (raw data vector) w associated with a human activity. Meta-data extracted from the resulting HMM λ_w are then used for deriving the corresponding feature vector w⃗. Consequently, the number of samples in the experimental database does not impact the components of w⃗, since each sample in the database is handled individually, irrespective of the other samples. For this reason, the proposed approach should still exhibit good classification results even for large-scale databases; only the overall computation time for all the samples in the database will increase in these conditions. Parallel computation of all the feature vectors in the database can also be implemented to reduce this overall computation time.
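Because each sample is processed independently of the others, the per-sample pipeline parallelizes trivially over a process pool. The sketch below illustrates this with a placeholder computation standing in for the actual chain (Markov chain construction, HMM initialization/training, meta-data extraction); the function names are ours, not from the paper.

```python
from multiprocessing import Pool

def extract_feature_vector(sample):
    """Stand-in for the per-sample pipeline described in the paper
    (Markov chain -> HMM initialization -> Baum-Welch training ->
    meta-data extraction). The placeholder body below simply returns
    the mean of the sample as a one-component 'feature vector'."""
    return [sum(sample) / len(sample)]

def extract_all(samples, workers=4):
    """Map the per-sample pipeline over the whole database in parallel.
    Samples never interact, so this is an embarrassingly parallel map."""
    with Pool(workers) as pool:
        return pool.map(extract_feature_vector, samples)

if __name__ == "__main__":
    # Two toy 'raw data vectors'; real samples have T = 151 vectors each.
    feats = extract_all([[1, 2, 3], [4, 5, 6]], workers=2)
```

With p workers, the overall feature extraction time for N samples drops from roughly N individual runs to about N/p, without changing any individual feature vector.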
The current work has a dual impact on further research in HAR. Firstly, it has been theoretically and experimentally demonstrated that learning a human activity as a sequential process enhances the quality of the resulting feature vectors and, consequently, induces better classification results. Much of the research in HAR only considers discrete activities as opposed to activities in a continuum; the proposed method enables extracting salient features from a stream of data using HMMs. Secondly, the proposed method is an advance towards real-time implementation, since it can be efficiently ported to embedded platforms. Indeed, the current work uses HMMs for generating the feature vectors, and the resulting feature vectors exhibit good classification performance even with a basic classifier like the k-NN. Given that efficient hardware implementations of the Baum-Welch [101] and k-NN [102] algorithms on Field-Programmable Gate Array (FPGA) chips are available, this method can therefore be deployed on hardware platforms with a lower energy footprint than GPUs, using fewer FPGA resources thanks to the simplicity of the implementation compared to CNN or DNN approaches.