W-Trans: A Weighted Transition Matrix Learning Algorithm for the Sensor-Based Human Activity Recognition

Sensor-based human activity recognition has been widely applied in behavior tracking, health monitoring, indoor localization, and other areas. Using activity continuity to assist activity recognition is an important research issue, in which the activity transition matrix, which describes how activities change in real scenarios, is the most important parameter. Because the classic transition matrix learning algorithm cannot fuse the weights of sample classification results, a weighted transition matrix learning algorithm is proposed in this paper. First, the basic definitions of an improved Hidden Markov Model (HMM) that fuses the weights of classification results are given. Then, the recursive formula of transition matrix learning is derived, and the learning algorithm W-Trans is put forward. Finally, the proposed algorithm is simulated on public data sets. The evaluation results show that the proposed algorithm outperforms the classical Baum-Welch algorithm under both the cosine similarity and the Euclidean distance metrics. The advantage of our method is further verified by applying W-Trans to current activity recognition post-process methods.


I. INTRODUCTION

A. BACKGROUND
In the past decade, sensor-based human activity recognition [1], [2] has been a key research field in both industry and academia. In this research, custom-made devices integrating multiple inertial sensors (e.g., accelerometer, gyroscope) are attached to the human body. These devices capture body movement and generate real-time sensor data as humans perform activities such as walking and running. This research has enabled many applications, such as behavior tracking [3], [4], health monitoring [5], [6], and indoor localization [7], [8].
Currently, supervised learning is the most common technique for activity recognition [9]: a classification model is first trained with labeled samples and then applied to classify unlabeled ones [10]. However, several challenges still affect the performance of sensor-based activity recognition [11], [12]. First, for the same activity, the sample distributions differ from person to person, which makes it difficult to train a universal recognition model for all users [13]. Second, for mobile devices such as smartphones, the device location on the body is usually not fixed, and for the same activity the sensing data obtained from different body locations are inconsistent [14]. Removing the influence of device location is therefore a great challenge [15]. To overcome these shortcomings, many studies combine supervised learning with activity continuity, which is independent of the sample distribution.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Xiang.
When activity continuity is applied to activity recognition, the transition matrix is the most important parameter. It describes the probabilities that a given activity persists or changes to another activity at the next time step in a scene. This parameter is the basis for calculating the confidence of recognition results and for smoothing the activity result sequence. Since it depends on the application scenario, it needs to be learned from the sequence of classification results of the activities in the scene. Currently, the Baum-Welch algorithm [16], [17] is the most commonly used algorithm for this purpose, but it still has room for improvement.

B. RELATED WORKS
This section briefly introduces current methods for applying activity continuity to activity recognition, which has long been a research issue in this field. As early as 2008, a computationally inexpensive methodology [18] for temporal smoothing of classification results was proposed, which can be coupled with any classifier with minimal training to classify continuous sequences. The Hierarchical Support Vector Machine and Context-based Classification (HSVMCC) was proposed in [19] to recognize human activities when the sampling rate is lower than the frequency of activities. These two methods use naive modification strategies and do not consider activity transitions. Their performance degrades noticeably when one activity changes to another.
Other studies take activity transitions into account in sequence smoothing. The Activity Recognition Shell (ARShell) was proposed in [20], in which a Markov smoother was applied to post-process the results generated by the Google recognition service. The Lowest Cumulative Cost Activity Sequence (LCCAS) was put forward in [21], in which a similar policy was adopted to perform sequence smoothing. However, both methods follow straightforward schemes that obtain only locally optimal solutions [22]. Other researchers applied the HMM [23], [24] to provide global solutions. In [25], the HMM was adopted to smooth out the accidental misclassifications generated by supervised learning schemes. The HMM and the ensemble HMM were applied to recognize activities in [26] and [27], respectively. Because of its simplicity and effectiveness, the HMM is still the most popular model in current applications.
In the past several years, with the great success of deep learning in image processing and speech recognition, many researchers have applied deep learning models to activity recognition. To model the activity sequence, deep sequence models such as the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) were adopted in [28]-[31], with meaningful results. However, the advantage of deep learning models depends on sufficient training data and powerful computing capability, which wearable devices cannot provide [32]. For sequence models in particular, since activity transitions differ from scene to scene, these models need to be trained independently for each specific scene, which is impractical for wearable devices. Therefore, there is still a long way to go before deep sequence models can be applied to activity recognition in practice [33]. In addition, some applications, such as the Google activity recognition service, provide only the activity results but not the raw sensor data [20], so deep learning models cannot make use of them. In summary, classical models such as the HMM are still the most commonly used models in current activity recognition systems, and the activity transition matrix is still the most crucial parameter for these classical models.

C. MOTIVATION
In classical sequence models such as the HMM, the activity transition matrix, which describes the transition probabilities between different activities, is the most important parameter. An accurate transition matrix makes the use of activity sequences in activity recognition effective. Currently, this parameter can be trained with the classical Baum-Welch algorithm [16], [17], which is a basic algorithm in HMM theory. However, it has a shortcoming for sensor-based activity recognition. The Baum-Welch algorithm takes the sequence of recognized activity labels as its input, whereas the recognized labels produced by most supervised learning algorithms carry additional information, namely the weights of these labels.
Take a set of two activities as an example: one sample is recognized as (0.9, 0.1) and another as (0.6, 0.4). Their final labels are the same, but their classification weights are quite different. Low confidence indicates a low probability that the sample is correctly classified, while high confidence indicates the opposite. Thus, these two samples should have different influences when training the transition matrix. The classical Baum-Welch algorithm ignores this important feature. Fusing classification weights into transition matrix learning is therefore a direction for improvement.
This paper proposes a weighted transition matrix learning algorithm named W-Trans, which continues our previous works [22], [34]. The key concept of weighted observation probability, described in the following section, was first proposed in [34] to identify the confidence level of classification results. This concept was then applied in [22] to smooth the classification result sequence in order to improve the recognition accuracy. This paper focuses on transition matrix learning based on the concept of weighted observation probability. The main contributions of this paper are as follows.
1) The definitions of weighted observation probability are described, and the improved HMM named Weight Observation Hidden Markov Model (WOHMM) is introduced.
2) The recursive formula of transition matrix learning is derived, and the W-Trans for WOHMM is put forward.
3) The W-Trans is evaluated with the metrics of cosine similarity and Euclidean distance on two public data sets.
4) This algorithm is applied to current post-process methods and is verified to be effective.

II. METHOD
This section first introduces a common framework of activity recognition and points out the position of transition matrix learning in this framework. Then the formal description of WOHMM is given. Finally, the W-Trans is put forward.

A. THE FRAMEWORK OF ACTIVITY RECOGNITION
To make the purpose of this paper clear, we first introduce the common framework of activity recognition, shown in Fig. 1. Activity recognition is divided into three steps. In the first step, a variety of data, including acceleration and angular velocity, are collected via the programming interfaces provided by wearable devices. These sensing data are divided into fixed-size samples, and feature vectors are extracted from those samples. This step is summarized as feature extraction, shown in the upper area of Fig. 1. In the second step, the feature vectors are identified by the trained classifier to generate a classification result sequence, which is $H$ in the middle area. Generally, $H$ is normalized [35] with Equation (1), where $\min(h_j)$ is the minimum element of vector $h_j$. Note that some classification models, such as Convolutional Neural Networks (CNN), combine the first two steps into one. Normalization yields the normalized result sequence $O$, which is the input of the sequence-based strategies shown in the lower area of Fig. 1.
The sequence-based strategies address several sub-problems, such as identifying the probability that each label was correctly recognized and smoothing the result sequence to correct wrong labels. The basis of these sub-problems is obtaining the activity transition matrix for a specified scene. The classical Baum-Welch algorithm learns the transition matrix from the recognized label sequence $L$, whereas our proposed W-Trans learns this parameter from the normalized result sequence $O$. The following subsections give the basic definitions and detail our algorithm.

The WOHMM is shown in Fig. 2. It consists of three sequences, $A = \{a_1, a_2, \cdots, a_T\}$, $L = \{l_1, l_2, \cdots, l_T\}$ and $O = \{o_1, o_2, \cdots, o_T\}$, where $a_t$, $l_t$ and $o_t$ are the hidden state, the observation and the observation vector at time $t$, respectively, with $a_t \in S$ and $l_t \in R$. Here $S = \{s_1, s_2, \cdots, s_N\}$ is the set of hidden states and $R = \{r_1, r_2, \cdots, r_M\}$ is the set of observations. In $A$, the hidden state may remain constant or change to another state as time goes on. The transition probabilities between different states constitute the transition matrix, written as

$$Q = \left[ q_{s_i s_j} \right]_{N \times N}, \tag{2}$$

where $0 \le q_{s_i s_j} \le 1$ and $\sum_{j=1}^{N} q_{s_i s_j} = 1$. $q_{s_i s_j}$ represents the probability $P(a_{t+1} = s_j \mid a_t = s_i)$, that is, the probability that the state at $t+1$ is $s_j$ given that the state at $t$ is $s_i$. Corresponding to $A$, $L$ is the observation sequence, in which $l_i$ is the observation at $t = i$. As in the HMM, the observation matrix is defined as

$$P = \left[ p_{s_i r_j} \right]_{N \times M}, \tag{3}$$

where $0 \le p_{s_i r_j} \le 1$ and $\sum_{j=1}^{M} p_{s_i r_j} = 1$. $p_{s_i r_j}$ is the probability that the state $s_i$ is observed as $r_j$, that is, $p_{s_i r_j} = P(l_t = r_j \mid a_t = s_i)$. In activity recognition, the observation matrix corresponds to the confusion matrix, which can be obtained from experimental results.
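The correspondence between the confusion matrix and the observation matrix can be illustrated with a short sketch. It assumes that rows of the confusion matrix are true activities and columns are recognized labels, so that row-normalizing the counts gives $p_{s_i r_j}$. The function name `observation_matrix_from_confusion` and the counts are hypothetical and are not taken from the paper's tables.

```python
def observation_matrix_from_confusion(confusion_counts):
    """Row-normalize confusion counts so each row sums to 1.

    Assumption: row i holds the counts for true activity s_i and
    column j the number of times it was recognized as label r_j.
    """
    P = []
    for row in confusion_counts:
        total = sum(row)
        P.append([count / total for count in row])
    return P


# Hypothetical confusion counts for three activities (illustration only).
confusion = [
    [90, 10, 0],
    [5, 80, 15],
    [0, 20, 80],
]
P = observation_matrix_from_confusion(confusion)
```

Each row of `P` is then a probability distribution over recognized labels, which is the row-stochastic constraint stated for the observation matrix.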
Based on the observation matrix, we define the weighted observation probability $\hat{p}_{s_i l_t} = P(l_t \mid a_t = s_i, o_t)$, which is the probability that hidden state $s_i$ is observed as $l_t$ given the corresponding observation vector $o_t = (o_{t r_1}, o_{t r_2}, \cdots, o_{t r_M})$. This probability is calculated as the cosine similarity of $p_{s_i} = (p_{s_i r_1}, p_{s_i r_2}, \cdots, p_{s_i r_M})$ and $o_t$:

$$\hat{p}_{s_i l_t} = \frac{p_{s_i} \cdot o_t}{\lVert p_{s_i} \rVert \, \lVert o_t \rVert} = \frac{\sum_{j=1}^{M} p_{s_i r_j}\, o_{t r_j}}{\sqrt{\sum_{j=1}^{M} p_{s_i r_j}^{2}}\, \sqrt{\sum_{j=1}^{M} o_{t r_j}^{2}}}. \tag{4}$$

The physical significance of $\hat{p}_{s_i l_t}$ is as follows. $p_{s_i}$ is the prior probability distribution, which describes the probabilities of the different observations when the hidden state is $s_i$, whereas $o_t$ gives the actual observation probabilities at time $t$. If $p_{s_i}$ and $o_t$ are highly similar, the probability that the hidden state is $s_i$ is high. In other words, when the actual observation vector at time $t$ is $o_t$, the probability that hidden state $s_i$ is observed as $l_t$ should be high for this sample. Conversely, if the similarity between $p_{s_i}$ and $o_t$ is very low, the corresponding probability should be low. To fuse the observation vector, $p_{s_i l_t}$ in the HMM is replaced by $\hat{p}_{s_i l_t}$, and we name the new HMM the WOHMM.
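As a small illustration of the weighted observation probability, the sketch below computes the cosine similarity between each row of an observation matrix and an observation vector, using the two example vectors (0.9, 0.1) and (0.6, 0.4) from the motivation. The observation matrix values and the function names are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_observation_probs(P, o_t):
    """Return one weighted observation value per hidden state (row of P)."""
    return [cosine_similarity(row, o_t) for row in P]


# Hypothetical observation matrix for two activities (illustration only).
P = [
    [0.9, 0.1],
    [0.2, 0.8],
]

confident = weighted_observation_probs(P, [0.9, 0.1])
uncertain = weighted_observation_probs(P, [0.6, 0.4])
```

Both samples receive the same label, but the confident sample separates the two hidden states much more strongly than the uncertain one, which is the information a label-only learner discards.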
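As in the HMM, the initial vector $\pi$ is defined as

$$\pi = \left( \beta_{s_1}, \beta_{s_2}, \cdots, \beta_{s_N} \right),$$

where $0 \le \beta_{s_i} \le 1$ and $\sum_{i=1}^{N} \beta_{s_i} = 1$. $\beta_{s_i}$ is the probability that the hidden state is $s_i$ at $t = 1$, that is, $\beta_{s_i} = P(a_1 = s_i)$.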
To sum up, the WOHMM is written as

$$\lambda = (Q, P, \pi),$$

which contains three parts: the transition matrix $Q$, the observation matrix $P$ and the initial state vector $\pi$. As $P$ and $\pi$ can be obtained through experiments, our parameter learning problem is to train the optimal transition matrix $Q$ given the observation sequences $O$ and $L$. The following section gives the learning algorithm W-Trans.

C. OUR PARAMETER LEARNING ALGORITHM
Before the details of W-Trans, we introduce two variables.
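1) The forward probability, written here as $\alpha_t(s_k)$, is the probability that the observation sequence between $1$ and $t$ is $\{l_1, l_2, \cdots, l_t\}$ and the hidden state at time $t$ is $s_k$, given the WOHMM parameter $\lambda$ and the observation vector sequence $\{o_1, o_2, \cdots, o_t\}$:

$$\alpha_t(s_k) = P(l_1, l_2, \cdots, l_t, a_t = s_k \mid \lambda, o_1, o_2, \cdots, o_t).$$

This value can be obtained recursively by

$$\alpha_1(s_k) = \beta_{s_k}\, \hat{p}_{s_k l_1}, \qquad \alpha_t(s_k) = \Big[ \sum_{j=1}^{N} \alpha_{t-1}(s_j)\, q_{s_j s_k} \Big] \hat{p}_{s_k l_t}, \quad t = 2, \cdots, T,$$

where $\hat{p}_{s_k l_t}$ is given by Equation 4.

2) The backward probability, written here as $\gamma_t(s_k)$, is the probability that the observation sequence between $t+1$ and $T$ is $\{l_{t+1}, l_{t+2}, \cdots, l_T\}$, given $\lambda$, the hidden state $s_k$ at time $t$, and the observation vector sequence $\{o_{t+1}, o_{t+2}, \cdots, o_T\}$:

$$\gamma_t(s_k) = P(l_{t+1}, l_{t+2}, \cdots, l_T \mid \lambda, a_t = s_k, o_{t+1}, o_{t+2}, \cdots, o_T).$$

This value can be obtained recursively by

$$\gamma_T(s_k) = 1, \qquad \gamma_t(s_k) = \sum_{j=1}^{N} q_{s_k s_j}\, \hat{p}_{s_j l_{t+1}}\, \gamma_{t+1}(s_j), \quad t = T-1, \cdots, 1,$$

where $\hat{p}_{s_j l_{t+1}}$ is given by Equation 4.

Following the Baum-Welch algorithm, the logarithmic likelihood function of the complete data is $\log P(L, A \mid \lambda, O)$, where

$$P(L, A \mid \lambda, O) = \beta_{a_1} \hat{p}_{a_1 l_1}\, q_{a_1 a_2} \hat{p}_{a_2 l_2} \cdots q_{a_{T-1} a_T} \hat{p}_{a_T l_T}, \tag{9}$$

$A$ is a possible hidden state sequence corresponding to $L$ and $O$, and $\lambda$ denotes the WOHMM parameters. The Q function of the EM algorithm is constructed as

$$Q(\lambda, \bar{\lambda}) = \sum_{A} \log P(L, A \mid \lambda, O)\, P(L, A \mid \bar{\lambda}, O), \tag{10}$$

where $\lambda$ denotes the parameters to be maximized and $\bar{\lambda}$ denotes their current values. Substituting (9) into (10), we get

$$Q(\lambda, \bar{\lambda}) = \sum_{A} \log \beta_{a_1}\, P(L, A \mid \bar{\lambda}, O) + \sum_{A} \Big( \sum_{t=1}^{T} \log \hat{p}_{a_t l_t} \Big) P(L, A \mid \bar{\lambda}, O) + \sum_{A} \Big( \sum_{t=1}^{T-1} \log q_{a_t a_{t+1}} \Big) P(L, A \mid \bar{\lambda}, O), \tag{11}$$

which is composed of three parts: the summations over the initial vector, the observation probability and the transition probability, respectively. As the initial vector and the observation matrix do not need to be updated, only the third part of Equation 11 has to be maximized in order to maximize the Q function. Expanding the third part and adding the row constraints $\sum_{j=1}^{N} q_{s_i s_j} = 1$ with Lagrange multipliers $\mu_i$ gives the Lagrange function

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log q_{s_i s_j}\, P(L, a_t = s_i, a_{t+1} = s_j \mid \bar{\lambda}, O) + \sum_{i=1}^{N} \mu_i \Big( \sum_{j=1}^{N} q_{s_i s_j} - 1 \Big).$$

Taking the partial derivative with respect to $q_{s_i s_j}$ and setting the result to 0 yields

$$q_{s_i s_j} = \frac{\sum_{t=1}^{T-1} P(L, a_t = s_i, a_{t+1} = s_j \mid \bar{\lambda}, O)}{\sum_{t=1}^{T-1} P(L, a_t = s_i \mid \bar{\lambda}, O)}.$$

The numerator and denominator are decomposed with the forward and backward probabilities as

$$P(L, a_t = s_i, a_{t+1} = s_j \mid \bar{\lambda}, O) = \alpha_t(s_i)\, \bar{q}_{s_i s_j}\, \hat{p}_{s_j l_{t+1}}\, \gamma_{t+1}(s_j)$$

and

$$P(L, a_t = s_i \mid \bar{\lambda}, O) = \alpha_t(s_i)\, \gamma_t(s_i),$$

where $\bar{q}_{s_i s_j}$ is the current transition probability and $\alpha_t$, $\gamma_t$ are computed under $\bar{\lambda}$. Therefore,

$$q_{s_i s_j} = \frac{\sum_{t=1}^{T-1} \alpha_t(s_i)\, \bar{q}_{s_i s_j}\, \hat{p}_{s_j l_{t+1}}\, \gamma_{t+1}(s_j)}{\sum_{t=1}^{T-1} \alpha_t(s_i)\, \gamma_t(s_i)},$$

which is the recurrence formula for transition probability learning. According to this formula, the W-Trans is shown as Algorithm 1.
VOLUME 8, 2020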
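Algorithm 1 itself is not reproduced in this text. The following is a minimal, dependency-free sketch of the idea described above, assuming that the update is the standard Baum-Welch transition re-estimation with the emission term replaced by the cosine-based weighted observation probability of Equation 4. The function names, the uniform starting matrix, the per-step rescaling (added only to avoid numerical underflow, it does not change the ratio in the update) and the toy data are all assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def normalized(vec):
    """Scale a vector with a positive sum so that it sums to 1."""
    total = sum(vec)
    return [x / total for x in vec]


def w_trans(O, P, pi, n_iter=10):
    """Estimate the transition matrix from observation vectors O (T x M),
    keeping the observation matrix P (N x M) and initial vector pi fixed."""
    T, N = len(O), len(P)
    # B[t][k]: weighted observation probability of state k at time t.
    B = [[cosine(P[k], O[t]) for k in range(N)] for t in range(T)]
    Q = [[1.0 / N] * N for _ in range(N)]  # assumed uniform starting point
    for _ in range(n_iter):
        # Forward pass, rescaled at every step.
        fwd = [normalized([pi[k] * B[0][k] for k in range(N)])]
        for t in range(1, T):
            fwd.append(normalized([
                sum(fwd[t - 1][j] * Q[j][k] for j in range(N)) * B[t][k]
                for k in range(N)]))
        # Backward pass, rescaled at every step.
        bwd = [[1.0] * N for _ in range(T)]
        for t in range(T - 2, -1, -1):
            bwd[t] = normalized([
                sum(Q[k][j] * B[t + 1][j] * bwd[t + 1][j] for j in range(N))
                for k in range(N)])
        # Expected transition counts, normalized per time step.
        counts = [[0.0] * N for _ in range(N)]
        for t in range(T - 1):
            xi = [[fwd[t][i] * Q[i][j] * B[t + 1][j] * bwd[t + 1][j]
                   for j in range(N)] for i in range(N)]
            z = sum(sum(row) for row in xi)
            for i in range(N):
                for j in range(N):
                    counts[i][j] += xi[i][j] / z
        # Row-normalizing the counts is the transition update.
        Q = [normalized(row) for row in counts]
    return Q


# Toy data (hypothetical): three activities, each lasting 5 samples, cycling 0 -> 1 -> 2.
P_demo = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
states = [s for _ in range(20) for s in (0, 1, 2) for _ in range(5)]
O_demo = [P_demo[s] for s in states]  # noise-free confidence vectors
Q_learned = w_trans(O_demo, P_demo, [1 / 3] * 3, n_iter=10)
```

On this toy sequence each learned row sums to 1 and the largest entry of each row lies on the diagonal, reflecting that activities tend to persist. A practical implementation would also need to guard against zero-length observation vectors and states that never receive any posterior mass.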
In Algorithm 1, the first loop indicates iteration times, which depends on specific scenarios. For each iteration, a new Q is calculated based on the previous iteration. In this step,

III. EXPERIMENTS
In this section, we compare W-Trans with the classic Baum-Welch algorithm on two public data sets. We first introduce the experimental design, then give the final experimental results for different classifiers on the two data sets, and finally discuss the influence of the algorithm parameters.

A. EXPERIMENTAL DESIGN
First, SARD [36] and HAPT [37] are selected as the data sets in our experiments. Both contain acceleration and gyroscope data collected with smartphones. SARD was produced by the University of Twente, while HAPT is shared through the UCI Machine Learning Repository. Five common daily activities (walking, running, upstairs, downstairs and standing) were chosen for SARD, and six daily activities (walking, upstairs, downstairs, sitting, standing and lying) were considered for HAPT. Second, we divided the sensing data into fixed-size samples with a half-overlapping sliding window. With the window size set to 1 second, we obtained 27000 and 11000 samples from SARD and HAPT, respectively. Finally, we extracted features with a different method for each data set. For SARD, we extracted features such as the mean, variance, average cross rate, maximum, minimum and 10 low-frequency coefficients of the Fast Fourier Transform (FFT) [38]. Since HAPT already provides 561 features, we selected only the top 30 according to their scores under the ReliefF algorithm [39].
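The half-overlapping sliding window and a few of the listed time-domain features can be sketched as follows. The sampling rate of 50 Hz, the synthetic signal and the function names are assumptions for illustration; the sketch covers only mean, variance, maximum and minimum, not the full feature set used in the experiments.

```python
def half_overlap_windows(samples, window_size):
    """Split a 1-D signal into fixed-size windows that overlap by half."""
    step = max(1, window_size // 2)
    return [samples[i:i + window_size]
            for i in range(0, len(samples) - window_size + 1, step)]


def basic_features(window):
    """Compute a few simple time-domain features for one window."""
    mean = sum(window) / len(window)
    variance = sum((x - mean) ** 2 for x in window) / len(window)
    return {"mean": mean, "variance": variance,
            "max": max(window), "min": min(window)}


rate_hz = 50  # assumed sampling rate; a 1-second window then holds 50 readings
signal = [float(i % 7) for i in range(200)]  # synthetic single-axis signal
windows = half_overlap_windows(signal, window_size=rate_hz)
features = [basic_features(w) for w in windows]
```

With a half overlap, consecutive windows start half a window apart, so a signal of 200 readings and a window of 50 readings gives 7 windows.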
Before simulating Algorithm 1, we recognized the activities with the supervised learning method [14]. The samples of each user were selected as the test set in turn, and the samples of the other users constituted the training set. We selected two popular classifiers, SVM and ELM, to obtain the recognized label and the corresponding vector for each sample. Then we randomly rearranged the order of the samples to reconstruct different activity sequences. In this step, the duration of each activity was set to a specified value $n$, which means that an activity lasts $n$ samples once it occurs. The transitions between different activities were randomly specified. An example is shown in Fig. 3, in which the activity duration is set to 3. In this example, w denotes walking, s denotes standing, and r denotes running. As shown in Fig. 3, when walking occurs, it lasts 3 samples. Then the activity randomly changes to standing, which also lasts 3 samples. Next, it randomly changes to running.
With this reconstruction, we can simulate activity sequences in different scenarios. After reconstruction, the diagonal elements of the actual transition matrix are $\frac{n-1}{n}$ and the other elements are close to $\frac{1}{n(m-1)}$, where $m$ is the number of activities. The reconstructed sequence is the input of the parameter learning methods. In simulating Algorithm 1, the initial state vector was set to follow the uniform distribution. The observation matrices corresponding to SARD and HAPT are given in Tables 1 and 2, respectively, and were built from the confusion matrices of activity recognition. Because SARD and HAPT contain different numbers of activities, the sizes of the matrices differ.
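The structure of the actual transition matrix can be checked with a small simulation. The sketch below works on a sequence of activity labels only, assuming that each segment lasts exactly $n$ samples and that the next activity is drawn uniformly from the other $m-1$ activities. The function names, the seed and the segment count are hypothetical.

```python
import random


def reconstruct_sequence(num_activities, duration, num_segments, seed=0):
    """Build a label sequence where each activity lasts `duration` samples."""
    rng = random.Random(seed)
    current = rng.randrange(num_activities)
    sequence = []
    for _ in range(num_segments):
        sequence.extend([current] * duration)
        # Assumption: the next activity is chosen uniformly among the others.
        current = rng.choice([a for a in range(num_activities) if a != current])
    return sequence


def empirical_transition_matrix(sequence, num_activities):
    """Count consecutive label pairs and row-normalize the counts."""
    counts = [[0] * num_activities for _ in range(num_activities)]
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]


m, n = 3, 5  # number of activities and activity duration
seq = reconstruct_sequence(m, n, num_segments=3000)
Q_actual = empirical_transition_matrix(seq, m)
```

For $m = 3$ and $n = 5$ the diagonal entries come out close to $(n-1)/n = 0.8$ and the off-diagonal entries close to $1/(n(m-1)) = 0.1$.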
To demonstrate the advantages of our algorithm, we need metrics that evaluate the similarity between the learned transition matrix and the actual one. As the cosine similarity and the Euclidean distance are among the most popular similarity measures, we use these two metrics in the following experiments. The cosine similarity is defined as

$$d_{cosine}(Q, \hat{Q}) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{q}_{ij}\, q_{ij}}{\lVert \hat{Q} \rVert_F\, \lVert Q \rVert_F},$$

where $\hat{q}_{ij}$ and $q_{ij}$ are the learned and actual transition probabilities, respectively, and $\lVert Q \rVert_F$ is the Frobenius norm of matrix $Q$.
The Euclidean distance is defined as

$$d_{eular}(Q, \hat{Q}) = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \hat{q}_{ij} - q_{ij} \right)^{2}}.$$

These two metrics evaluate the results from different views. For the cosine similarity, high values indicate good results, whereas for the Euclidean distance, low values indicate good results.
For example, let $Q$ be the actual transition matrix derived from sequence reconstruction, and let $Q_1$ and $Q_2$ be learned by two different algorithms, denoted Alg1 and Alg2. Then $d_{cosine}(Q, Q_1)$, $d_{cosine}(Q, Q_2)$, $d_{eular}(Q, Q_1)$ and $d_{eular}(Q, Q_2)$ can be calculated. If $d_{cosine}(Q, Q_1) > d_{cosine}(Q, Q_2)$, Alg1 is better under the cosine similarity. If $d_{eular}(Q, Q_1) > d_{eular}(Q, Q_2)$, Alg2 is better under the Euclidean distance. The following subsections show the comparison results based on these two indicators.
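The comparison logic can be sketched in a few lines. The code assumes that the cosine similarity treats each matrix as a flattened vector normalized by its Frobenius norm, and that the distance is the Euclidean (Frobenius) distance between the two matrices. The matrices and function names are hypothetical and serve only to illustrate how two learned matrices would be ranked.

```python
import math


def frobenius_norm(Q):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in Q for x in row))


def matrix_cosine_similarity(Q, Q_hat):
    """Cosine similarity between two matrices of the same shape."""
    dot = sum(a * b for ra, rb in zip(Q, Q_hat) for a, b in zip(ra, rb))
    return dot / (frobenius_norm(Q) * frobenius_norm(Q_hat))


def matrix_euclidean_distance(Q, Q_hat):
    """Euclidean distance between two matrices of the same shape."""
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(Q, Q_hat) for a, b in zip(ra, rb)))


Q_true = [[0.8, 0.2], [0.2, 0.8]]      # hypothetical actual matrix
Q_alg1 = [[0.75, 0.25], [0.25, 0.75]]  # hypothetical result of Alg1
Q_alg2 = [[0.6, 0.4], [0.4, 0.6]]      # hypothetical result of Alg2

cos1 = matrix_cosine_similarity(Q_true, Q_alg1)
cos2 = matrix_cosine_similarity(Q_true, Q_alg2)
dist1 = matrix_euclidean_distance(Q_true, Q_alg1)
dist2 = matrix_euclidean_distance(Q_true, Q_alg2)
```

Here Alg1 has both the higher cosine similarity and the lower distance, so both metrics rank it as closer to the actual matrix.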

B. COMPARISON RESULTS
As the transition matrix is a parameter of the HMM, the experiments compare Algorithm 1 only with the Baum-Welch algorithm [16]. Fig. 4 shows the mean results under different metrics and data sets, where W-Trans is our method and Baum-Welch is the original algorithm. For these results, the number of iterations was set to 10 and the activity duration was set to 5. The selection of these two parameters is discussed in the following sections. Fig. 4 includes four subfigures. The first two show the comparison results of the cosine similarity on the two data sets, and the last two show the results of the Euclidean distance. Each subfigure shows four results, obtained from the different parameter learning algorithms and classifiers.
As shown in Fig. 4(a) and (b), W-Trans provides clearly higher cosine similarity than the Baum-Welch algorithm regardless of the classifier, which means that the transition matrix learned by our method is closer to the actual one. In the other two subfigures, W-Trans gives clearly lower Euclidean distance. As a lower Euclidean distance means a better result, our method performs better.
Comparing the results obtained with different classifiers, we see that SVM performs better than ELM regardless of the parameter learning algorithm. Analysis of the classification results shows that the classification accuracy of SVM is slightly better than that of ELM, so the classification result sequence of SVM contains fewer incorrect results. In parameter learning, misclassified results have a marked influence on the final transition matrix, so the performance with ELM is worse. This comparison reveals that the activity classification accuracy is an important factor influencing the final transition matrix.
Comparing the results on different data sets, we find that HAPT provides better performance. The reasons are as follows. Table 2 shows that the activities in HAPT can be divided into two groups. The first group contains walking, upstairs and downstairs, and the second group consists of the remaining activities. The confusion probabilities between these two groups are 0, so activities in one group are never recognized as activities of the other group in the recognition result sequence. In parameter learning, the activity transition probabilities between groups are therefore easy to determine. The only factor that affects the final transition matrix is the confusion matrix within each group, which is a simpler problem. This also reveals that simplifying the confusion matrix is an important direction for improving the algorithm in practical applications.

C. INFLUENCE OF ITERATION TIME
In Algorithm 1, the number of iterations is an important parameter affecting performance. Section III-B set this parameter to 10, and this subsection discusses how that value was obtained. During simulation, we calculate the cosine similarity and the Euclidean distance at each iteration. As SVM and ELM provide similar results, this subsection presents only the results classified by SVM. Fig. 5 shows the experimental results as the number of iterations changes from 1 to 15. The activity duration is 5 for these results. As shown in Fig. 5, the Euclidean distance declines and the cosine similarity improves for both methods as the number of iterations increases. When the number of iterations is smaller than 5, the curves change markedly, and then they flatten. This is because the learning process converges and the transition matrix gradually approaches the optimal matrix. When the number of iterations is 1 or 2, W-Trans performs worse than the Baum-Welch algorithm, and afterwards it performs better. The reasons are as follows. After the original observation probability is replaced with the weighted observation probability, the differences between the probabilities observed by the same hidden state become small, which slows the convergence rate. The performance is therefore worse after only a few iterations. Because of this slower convergence, W-Trans eventually obtains a more accurate result.
Comparing the results on different data sets, we find that the results on HAPT converge faster. For SARD, the transition matrix converges after about 6 iterations, whereas for HAPT it converges after about 4. The reason is simple. As the observation matrix of HAPT is divided into two groups, transition matrix learning on HAPT is effectively divided into two sub-problems, each containing three activities. A parameter learning problem with 3 activities converges faster than one with 6 activities.

D. INFLUENCE OF ACTIVITY DURATION
In the above sections, activity sequences were randomly reconstructed: the duration of each activity was set to a specified value and the transitions between different activities were randomly specified. In this step, the activity duration is an important parameter affecting the final results. This section discusses its influence. Fig. 6 shows the comparison results on the SARD data set as the activity duration changes from 2 to 15. As the results obtained with the two classifiers are similar, these figures show only the results of SVM.
As shown in Fig. 6, the learned transition matrix approaches the optimal value as the activity duration increases. The main reasons are as follows. When the activity duration is small, the activity changes frequently, and learning the transition matrix from the result sequence is susceptible to incorrect recognition results. As the number of consecutive samples increases, the transition pattern between activities becomes more evident and is easier to capture from the activity result sequence, while the influence of misclassified results becomes smaller.
Comparing the results of the two algorithms in Fig. 6, we find that when the activity duration is 2, the performance of W-Trans is slightly worse than that of Baum-Welch. As the activity duration increases, W-Trans gradually outperforms Baum-Welch. The main reasons are as follows. Since the difference in weighted observation probability between hidden states is small, W-Trans is more susceptible to misclassified results when the number of consecutive samples is small, which makes it perform worse. As the activity duration increases, the effect of misclassified results becomes small, and the results of W-Trans improve. However, a parameter learning algorithm always requires a long sequence of recognition results. For a long activity sequence, an activity duration of 2 means that a person changes activity about every 2 seconds, which does not happen in practical applications. Therefore, this weakness has little effect on activity recognition in real scenes.

IV. APPLICATION VERIFICATION
The previous section evaluated the performance of W-Trans by comparing it with the classical transition matrix learning algorithm under the metrics of cosine similarity and Euclidean distance. To further verify the validity of the method, this section applies W-Trans to activity post-processing, which is an important issue in sensor-based activity recognition. Current popular post-process methods include WOODY [22], HMM [40], CRF [41] and LCCAS [21]. Among these, WOODY, HMM and LCCAS all treat the activity transition matrix as a basic parameter, which is learned by the classical Baum-Welch algorithm. This section evaluates these post-process methods when the Baum-Welch algorithm is replaced with W-Trans.
In this section, SARD and HAPT are again selected as the experimental data sets. The experimental designs of data preprocessing, feature extraction, training and classification, and sequence reconstruction are the same as in Section III-A. The activity duration is set to 5. After obtaining the normalized recognition result sequence, we first learn the activity transition matrix with the Baum-Welch algorithm and with W-Trans, respectively. The learned transition matrix and the activity result sequence are then used as inputs of the post-process algorithms. Fig. 7 shows the average recognition rates of seven post-process methods for different classifiers and data sets. Fig. 7 has four sub-figures, each plotting the recognition rates for one combination of data set and classifier. Each sub-figure contains seven methods. Original represents the classification without any post-process. HMM, LCCAS and WOODY are the methods in [40], [21] and [22], respectively. These three methods are included in our experiments as baselines, in which the Baum-Welch algorithm is applied as the transition matrix learning algorithm. The remaining three methods, HMM-W, LCCAS-W and WOODY-W, are the improved post-process methods, in which the transition matrix learning algorithm is replaced by W-Trans.
Comparing each post-process method with its improved counterpart, we find that the activity recognition rates are improved in all cases. The best result is WOODY-W in Fig. 7(d), which increases the recognition rate by 1.85%. Even the worst result, LCCAS-W in Fig. 7(a), improves the recognition rate by 0.31%. The reason is as follows. The transition matrix is the most important parameter in post-processing, as it describes the transition relationship between activities. A more accurate transition matrix reduces the correction error in estimating the result probability compared with an inaccurate one, which in turn increases the recognition rate. These results show that the activity recognition rate improves consistently after the Baum-Welch algorithm is replaced with our approach in post-process methods.

V. CONCLUSION
Sensor-based human activity recognition has been widely applied in behavior tracking, health monitoring, indoor localization, and other areas. Applying activity continuity to assist activity recognition is an important research issue, in which the HMM is a commonly used theoretical tool because of its simplicity and effectiveness. As the most important parameter, the transition matrix should be learned dynamically from the classification result sequence.
Because the classical transition matrix learning algorithm cannot use the weights of classification results, an improved HMM named WOHMM is introduced in this paper. Based on the definitions of WOHMM, the weighted transition matrix learning algorithm W-Trans is proposed. In the evaluation, public data sets are used for activity recognition and transition matrix learning, and the metrics of cosine similarity and Euclidean distance are adopted to measure the performance of the different methods. The experimental results show that our algorithm outperforms the classical Baum-Welch algorithm.
Several future research directions follow from our method. In real scenes, the sample distribution in the activity sequence is often unbalanced. For example, an activity such as sitting usually lasts a long time, while other activities such as running and going upstairs are relatively short. The transition matrix learning algorithm in this paper does not consider this issue, and its effect on the proposed algorithm needs to be studied in future research. In addition, the proposed W-Trans is a general algorithm whose use is not limited to sensor-based activity recognition. Other sequence-related problems with observation weights can also use this algorithm to learn the transition matrix, and applying it to other fields is worth studying.