Can Wearable Devices and Machine Learning Techniques Be Used for Recognizing and Segmenting Modified Physical Performance Test Items?

Assessment of physical performance is essential to predict the frailty level of older adults. The modified Physical Performance Test (mPPT) clinically assesses the performance of nine activities: <italic>standing balance</italic>, <italic>chair rising up & down</italic>, <italic>lifting a book</italic>, <italic>putting on and taking off a jacket</italic>, <italic>picking up a coin</italic>, <italic>turning 360°</italic>, <italic>walking</italic>, <italic>going upstairs</italic>, and <italic>going downstairs</italic>. The duration of each activity is the primary evaluation criterion. In this study, wearable devices are leveraged to recognize mPPT items and predict their duration automatically. This potentially allows frequent follow-up of physical performance and facilitates more appropriate interventions. Five devices, each including an accelerometer and a gyroscope, were attached to the waist, wrists, and ankles of eight younger adults. The system was evaluated along three aspects: machine learning models, sensor placement, and sampling frequencies. The non-causal six-stage temporal convolutional network using 6.25 Hz signals from the left wrist and right ankle obtained the best performance. The duration prediction error ranged from 0.63±0.29 s (<italic>turning 360°</italic>) to 8.21±16.41 s (<italic>walking</italic>). The results suggest the potential of the proposed system for the automatic recognition and segmentation of mPPT items. Future work includes improving the recognition performance of <italic>lifting a book</italic> and implementing frailty score prediction.


I. INTRODUCTION
FRAILTY level has been a standard scale for evaluating the ageing process of older adults [1]. One of its essential components is physical frailty, which is related to the ability to live independently [2], [3]. The assessment of physical frailty is normally dependent on the performance of physical activities (PAs) [4]. Furthermore, monitoring PA performance is beneficial for 1) older adults, to maintain their physical health and prevent or postpone frailty; 2) doctors and physiotherapists, to decide on physical treatments; 3) healthcare providers, to personalize services; 4) governments, to arrange health service resources [1], [2], [5].
Physical tests are typically applied to assess the performance of PAs, such as the modified physical performance test (mPPT) [6], the multi-dimensional risk appraisal for older people (MRA-O) system [7], and the short physical performance battery (SPPB) test [8]. Clinically, these tests are conducted sporadically, e.g., every one to three months, or only after incidents such as a fall. Hence, subtle changes in PA performance cannot be monitored continuously and promptly. Furthermore, conducting the tests requires the supervision and recording of doctors or physiotherapists, which increases their workload and is inconvenient for patients, who need to travel for the test.
This study aims to investigate whether multiple wearable devices combined with machine learning techniques can be used to recognize and segment physical test items automatically. The mPPT, consisting of nine items, is selected as the ground-truth test and will be introduced in Section III. Three machine learning algorithms, i.e., a support vector machine (SVM) and two deep neural networks (DNNs), will be compared at the recognition and segmentation level. The segmentation result is converted into the duration of each item, which is the primary evaluation criterion of the mPPT. Results will be presented in Section IV, with the discussion in Section V and the conclusion in Section VI.

II. RELATED WORK

A. Activity Recognition & Segmentation Application in Frailty
Although wearable devices have already been utilized in activity recognition, few studies have employed the recognition result to measure the physical frailty of older adults. In the study conducted by Razjouyan et al. [9], one tri-axial accelerometer sensor (PAMSys) was adopted to collect movement signals of the chest during daily life, with 163 older adults as participants. The collected signals were used to extract features, which were grouped into four patterns: sleep quantity (sleep time, sleep onset latency, and time in bed), PA types (walking, sitting, and standing), PA intensity (sedentary, light, and moderate-to-vigorous), and steps (number of steps and prolonged stepping bouts). The Fried frailty index [10] was used as the ground-truth scale, classifying frailty into three levels: non-frail, pre-frail, and frail. The embedded decision tree model was trained and tested in five-fold cross-validation with all participants' data mixed together. The results showed that the combination of PA types, PA intensity, and step features achieved the highest sensitivity (0.918) in classifying non-frail and frail.

(This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
In the study of Abril-Jiménez et al. [11], a wristband and a smartphone were combined with Bluetooth beacons to predict frailty levels. The applied sensors in the smartphone were GPS, inertial measurement unit (IMU), Bluetooth, and Wi-Fi, which were used to collect information on physical movements and daily travel trajectory. The frailty evaluation patterns were categorized into four aspects: activity movement, weekly visiting, transport daily usage, and system engagement.
Their correlations with the frailty levels were analyzed separately. The results showed that walking speed, distance, and number of steps were sensitive to changes in the frailty level. Similarly, in study [12], physical activities such as standing, sitting, lying, gait, and standing balance were shown to be related to frailty. Five IMUs, comprising accelerometers, gyroscopes, and magnetometers, were attached to the shanks, thighs, and lower back.
Unlike the studies discussed above, study [13] focused on predicting the frailty of the arm using gyroscopes attached to the upper arm. The trauma-specific frailty index (TSFI) [14] was the gold standard and included two levels: frail and non-frail. In this study, participants were required to perform elbow flexion continuously for 20 s. Multivariate regression was applied to predict the score, which was further used to predict whether a subject was frail or non-frail. The classification on the test set reached an F1-score of 0.80.
The limitations of these studies are that:
• The frailty scales they adopted categorized frailty into two or three levels: non-frail, (pre-frail,) and frail. However, the severity level of frailty, which is also valuable information for applying interventions to postpone the progression of frailty [15], is seldom evaluated due to the frailty assessment tools previous studies selected.
• In study [9], the model was only tested in five-fold cross-validation; hence, the robustness of the model on unseen users has not been verified yet.

B. Activity Recognition & Segmentation
In this study, the monitoring of mPPT items is separated into two steps: first recognizing and then segmenting activities. The segmentation results are used to predict the performance duration of every activity, which is required to calculate the final mPPT frailty score. The segmentation performance builds on the recognition result; hence, the recognition algorithms, i.e., the machine learning models, are discussed here.
Machine learning models are generally categorized into two groups: traditional machine learning models and DNN models. Besides the SVM applied in the study of Razjouyan et al. [9], other traditional machine learning models, e.g., random forest (RF) [16], naïve Bayes (NB) [16], and artificial neural network (ANN) [16], [17], have also been used for PA recognition. One challenge of traditional machine learning models is that they typically use informative hand-crafted features as input. This requires representative features to be designed based on domain knowledge of the context of the machine learning task.
In comparison, DNN models are end-to-end algorithms, which directly apply raw signals as input. Chen et al. [18] developed a convolutional neural network (CNN) to classify eight PAs, i.e., falling, running, jumping, walking, walking quickly, step walking, going upstairs, and going downstairs. Three tri-axial accelerometer sensors (sampling rate of 100 Hz) were placed in a cloth pocket, in a trouser pocket, and at the waist. Around 86 participants' data were used for training and the remaining 14 participants' data for testing. The CNN model outperformed the SVM model, with an average recognition accuracy of 93.8%. However, detailed information on the dataset, e.g., whether it was balanced, was not given. Accuracy may not be a suitable metric for imbalanced data, as it mainly reflects the true positives of the majority class. Instead, the F1-score is preferred, such as the micro F1 and segmentation f1@k scores deployed in this study.
Besides CNN models, recurrent neural network (RNN) models have also been used for PA recognition to extract temporal features and recognize sequential activities. Two widely used RNN units are the gated recurrent unit (GRU) and long short-term memory (LSTM). With fewer gates, the GRU unit is faster to train and easier to generalize than the LSTM unit [19], [20]. RNN models are normally combined with CNN units to extract both temporal features and features of single or cross sensor channels [21]–[23]. For example, in the study of Xu et al. [23], Inception V1 modules [24] were combined with GRU units. In an Inception V1 module, three CNN units extract features in parallel. The combination of Inception V1 units and GRU units is more efficient in extracting features from IMU signals for activity recognition than a plain CNN or CNN-LSTM model.
The aforementioned models were generally applied to fixed-length samples, with a sliding window splitting the signals first. One drawback of the sliding window is that the appropriate window length depends on the duration and movement complexity of the classified activities. To avoid splitting signals, the temporal convolutional network (TCN) [25] was proposed and applied to variable-length samples. The TCN model is composed of a hierarchy of temporal CNNs, which updates the prediction of every sample simultaneously. The TCN model can extract spatial and temporal information simultaneously, without RNN units. There are two types of TCN models: the dilated TCN and the encoder-decoder TCN (ED-TCN). Besides temporal CNN layers, the ED-TCN includes pooling and upsampling layers. Several studies have compared these two models with LSTM and CNN models [26], [27]. In the study of Nair et al. [26], the models were tested on the UCI HAR dataset (six PAs) [28], with 70% of the data for training. The ED-TCN achieved an F1-score of 0.946, followed by the dilated TCN with an F1-score of 0.938. The CNN model achieved the highest F1-score of 0.976, but with hand-crafted features as input. Therefore, although the TCN models achieved slightly lower F1-scores, they were verified to be efficient in dealing with variable-length sequence data without using the sliding window or selecting features.
Two limitations of these studies are that:
• their inputs were still samples segmented by the sliding window; thus, the robustness of TCN models on data with varied lengths has not been verified with IMU signals;
• the models were mostly tested with all participants' data mixed, which is not appropriate for the actual application, where the test data is unseen by the model.
The contributions of the present study are:
• Activities related to the frailty of all body parts will be monitored, namely the activities included in the validated mPPT test. In addition, this test divides frailty into more detailed levels: non-frail, mildly frail, moderately frail, and unable to be functional.
• The (MS-)TCN model is used to analyse IMU signals, compared with the SVM and Incep V1-GRU models. Models are trained in the leave-one-person-out (LOPO) mode.
• The optimal sensor placement and sampling frequencies are investigated.
• The mPPT items will be recognized and also segmented.
The segmentation result will then be used to predict each activity's duration, which is required to calculate the mPPT test score.

III. METHODS

A. Data Acquisition
Five wearable devices from Byteflies [29] were applied to collect the movement signals. Each sensor contained one tri-axial accelerometer (Ax, Ay, Az) and one tri-axial gyroscope (Gx, Gy, Gz), with a sampling rate of 100 Hz. One sensor was attached to each wrist, one was carried in the trouser pocket, and one was attached to each ankle.
Five males and three females, aged 24 to 33 years old, took part in the experiment. All participants were right-handed. Every participant performed the mPPT test for three rounds, and they were required to move slower than their natural movement speed to mimic the movement behaviour of older adults. Participants followed the researcher's instruction to perform the mPPT test items in a fixed order: standing balance, chair rising up & down, lifting a book, putting on and taking off a jacket, picking up a coin, turning 360°, walking 50 ft, climbing one flight of stairs (ten steps), and climbing a maximum of four flights of stairs. The experiment was conducted in two places: one staircase for climbing stairs and one indoor room for the other seven items. Detailed information about the experiment can be found in [30]. Participants were asked to wave their hands as a flag activity at the start and end of each item. One researcher instructed the participant, and another researcher annotated the activities. The experiment was approved by the research ethics committee UZ/KU LEUVEN (EC RESEARCH), with the assigned serial number S62736. All participants voluntarily participated in the experiment by signing the informed consent form.

B. Data Analysis
The data analysis process is shown in Fig. 1. It includes four stages, each of which is explained as follows.
1) Signal Pre-Processing: The collected signals are annotated and segmented based on the researcher's annotation record. The signals of transition movements are removed. Hence, the segmented signals are categorized into ten classes: standing balance, lifting a book, turning 360°, putting on and taking off a jacket, picking up a coin, chair rising up & down, walking 50 ft, going upstairs, going downstairs, and waving hands.
2) Downsampling: As proposed by Farha et al. [31], in activity segmentation, downsampling can mitigate over-segmentation problems, at the cost of reduced precision in locating the segmentation boundaries. In addition, the longest sequence in this study has 31963 samples, which is long for the TCN model, introduced in Section III-C1. Multiple frequencies (50 Hz, 25 Hz, 12.5 Hz, and 6.25 Hz) were tested with the TCN model, and 6.25 Hz (decimation factor of 16) was finally selected, as it achieved a similar F1-score but a higher segmentation performance compared to the other frequencies. The result is shown in Table III. All models are tested with signals at 100 Hz and 6.25 Hz.
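As an illustration of the decimation step, the sketch below downsamples a synthetic 100 Hz signal by a factor of 16 to 6.25 Hz using SciPy. This is only a sketch: the exact anti-aliasing filter used in the study is not specified, and `scipy.signal.decimate` applies its default low-pass filter before discarding samples.

```python
import numpy as np
from scipy.signal import decimate

fs_in, factor = 100, 16          # 100 Hz -> 6.25 Hz (decimation factor = 16)
t = np.arange(0, 10, 1 / fs_in)  # 10 s of a synthetic 1 Hz accelerometer axis
signal = np.sin(2 * np.pi * 1.0 * t)

# decimate() low-pass filters before downsampling to limit aliasing;
# plain slicing (signal[::16]) would fold high-frequency content back in.
downsampled = decimate(signal, factor)

print(len(signal), len(downsampled), fs_in / factor)  # 1000 63 6.25
```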
3) Sliding Window: The sliding window is applied to split the signals for all models except the TCN models. If there are multiple classes within one window, the window is annotated with the majority class. The window size was tuned in study [32]. The optimal size is 2 s with 95% overlap (190 samples) for the 100 Hz sampling frequency and 2 s with 92% overlap (12 samples) for the 6.25 Hz sampling frequency. Table I lists, for each class, the total performing duration based on the annotation and the number of sliding windows of the 100 Hz signals.
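The windowing with majority-class labelling can be sketched as follows, using the 100 Hz settings above (window of 200 samples, step of 10 samples for 95% overlap). The helper name `sliding_windows` and the toy data are illustrative, not from the study.

```python
import numpy as np
from collections import Counter

def sliding_windows(signal, labels, win=200, step=10):
    """Split a signal into overlapping windows; each window takes the
    majority label of the samples it covers (200 samples with a 10-sample
    step corresponds to a 2 s window with 95% overlap at 100 Hz)."""
    windows, window_labels = [], []
    for start in range(0, len(signal) - win + 1, step):
        windows.append(signal[start:start + win])
        # majority class among the samples covered by this window
        window_labels.append(Counter(labels[start:start + win]).most_common(1)[0][0])
    return np.stack(windows), np.array(window_labels)

x = np.random.randn(1000, 6)         # 10 s of 6-channel IMU data at 100 Hz
y = np.array([0] * 600 + [1] * 400)  # two annotated activities
W, L = sliding_windows(x, y)
print(W.shape)  # (81, 200, 6)
```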

C. mPPT Items Recognition
This study compares three machine learning algorithms: dilated TCN models, the Incep V1-GRU model, and the SVM. The set-up of every model is discussed as follows.

1) Machine Learning Algorithms:
a) Dilated TCN: The input of the dilated TCN model is the raw signal of one sequence, i.e., the data sequence of one complete mPPT test, in other words, the data before windowing. There are in total 24 data sequences, three sequences per participant.
Two types of dilated TCN models are tested: the causal and the non-causal one [25]. In the causal TCN model, while predicting the n-th sample, only the previous samples are considered. In contrast, the non-causal TCN model considers all samples in the receptive field before and after the considered sample.
The multi-stage TCN (MS-TCN) model [31] is also investigated in this study. This model includes multiple stages; every stage has the same architecture as the normal TCN, and the output of the previous stage is processed by a 1×1 convolutional layer with the softmax activation to alter the dimension. The architecture is shown in Fig. 2. With this architecture, the model effectively alleviates the over-segmentation problem without changing the receptive field [31]. The loss function used in the normal TCN model is the classification loss (L_cls), considering the result of each sample, as shown in (1):

L_cls = (1/T) Σ_t −log(y_{t,c})    (1)

where T is the sequence length and y_{t,c} is the predicted probability of the annotated class c at time sample t. In contrast, the MS-TCN model calculates the loss of every single stage, combining the classification loss with a smoothing loss (L_{T-MSE}) to smooth the predictions over the time samples, as shown in (2), (3), and (4); the final loss is given in (5):

L_{T-MSE} = (1/(TC)) Σ_{t,c} Δ̃²_{t,c}    (2)
Δ̃_{t,c} = Δ_{t,c} if Δ_{t,c} ≤ τ, otherwise τ    (3)
Δ_{t,c} = |log y_{t,c} − log y_{t−1,c}|    (4)
L = Σ_s (L_cls + λ L_{T-MSE})    (5)

where C is the number of classes, s indexes the stages, τ (= 4) is the threshold value, and λ (= 0.15) is the weight of L_{T-MSE} [31]. In this study, the architecture of a single stage is the same as the normal TCN, and the number of stages is tuned in the range [2:8].
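The truncated-MSE smoothing term of MS-TCN [31] can be illustrated with a small NumPy sketch. This is a simplified, unbatched version for one stage; `tmse_loss` is a hypothetical helper name, and a training implementation would compute it on the model's log-probabilities with gradients.

```python
import numpy as np

def tmse_loss(probs, tau=4.0, lam=0.15):
    """Truncated MSE smoothing term of MS-TCN.
    probs: (T, C) per-frame class probabilities of one stage."""
    log_p = np.log(np.clip(probs, 1e-8, 1.0))
    delta = np.abs(log_p[1:] - log_p[:-1])     # frame-to-frame change
    delta = np.minimum(delta, tau)             # truncate at tau
    t, c = probs.shape
    return lam * np.sum(delta ** 2) / (t * c)  # weighted by lambda

# A perfectly smooth prediction incurs zero smoothing penalty...
T, C = 50, 10
smooth = np.full((T, C), 1.0 / C)
print(tmse_loss(smooth))  # 0.0

# ...while rapidly alternating predictions (over-segmentation) are penalised.
jumpy = np.tile([[0.9, 0.1], [0.1, 0.9]], (5, 1))
print(tmse_loss(jumpy) > 0)  # True
```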
The tuned hyper-parameters of the TCN models are the kernel size (K_size), the number of blocks (N_block), the number of filters (N_filter), and the number of layers (N_layer). The receptive field of a TCN whose i-th layer uses dilation 2^i follows from these hyper-parameters, as shown in (6):

RF = 1 + (K_size − 1) · Σ_{i=0}^{N_layer−1} 2^i    (6)

Considering the large number of combinations of hyper-parameters, the Tree of Parzen Estimators (TPE) method [33] is used to tune them. This method is based on the random-search algorithm and Bayesian theory: the combination of hyper-parameters with the higher Bayesian probability is searched based on the recognition result of the previous combination. The hyper-parameters were tuned for five rounds, and the combination with the highest result was selected. The models were trained for 100 epochs. Early stopping was applied: the training process stopped if the F1-score on the validation dataset did not increase by more than 0.005 within 10 epochs. The tuned results of the TCN models are K_size = 2 (causal TCN) / 3 (non-causal TCN), N_block = 1, N_filter = 256, and N_layer = 12 (100 Hz sampling frequency) / 7 (6.25 Hz sampling frequency). Therefore, for signals sampled at 100 Hz, the receptive field is 40.96 s for the causal TCN and 81.91 s for the non-causal TCN. For signals sampled at 6.25 Hz, the receptive field is 20.48 s for the causal TCN and 40.8 s for the non-causal TCN.

b) SVM: The hand-crafted features used as input of the SVM are listed in Table II. The features are extracted separately from the x-, y-, and z-axes and the magnitudes of the accelerometer and gyroscope signals, where the magnitudes are defined in (7) and (8):

|A| = √(A_x² + A_y² + A_z²)    (7)
|G| = √(G_x² + G_y² + G_z²)    (8)

The correlation between axes (Corr) and the zero-crossing count are not extracted from the magnitude of the accelerometer or gyroscope. Detailed information on the features can be found in [34]. In total, 74 (= 17 × 4 (axes + magnitude) + 2 × 3 (axes)) features are extracted from one accelerometer or gyroscope sensor.
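The receptive-field figures above can be reproduced with a short helper, assuming the dilation doubles per layer as in the dilated TCN of [25]:

```python
def receptive_field(kernel_size, n_layers):
    """Receptive field (in samples) of a dilated TCN whose layer i uses
    dilation 2**i; each layer widens the field by (kernel_size - 1) * 2**i."""
    rf = 1
    for i in range(n_layers):
        rf += (kernel_size - 1) * 2 ** i
    return rf

# The values reported above for 100 Hz (12 layers) and 6.25 Hz (7 layers):
print(receptive_field(2, 12) / 100)   # causal (K_size=2), 100 Hz  -> 40.96 s
print(receptive_field(3, 12) / 100)   # non-causal (K_size=3)      -> 81.91 s
print(receptive_field(2, 7) / 6.25)   # causal, 6.25 Hz            -> 20.48 s
print(receptive_field(3, 7) / 6.25)   # non-causal                 -> 40.8 s
```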
A one-versus-one multiclass SVM classifier is used with the RBF kernel. The regularisation parameter C and the γ value of the RBF kernel are optimised. The optimised value of C is 10 (searched over {10^0, 10^1, 10^2, 10^3}) and the optimised value of γ is 0.01 (searched over {10^−5, 10^−4, 10^−3, 10^−2, 10^−1}).
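A hypothetical sketch of this classifier configuration with scikit-learn is shown below. The toy feature matrix stands in for the 74 hand-crafted features per sensor; the hyper-parameter values are the optimised ones reported above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-in for 74-dimensional hand-crafted feature windows, three classes:
X = np.vstack([rng.normal(c, 0.3, size=(30, 74)) for c in range(3)])
y = np.repeat([0, 1, 2], 30)

# One-versus-one RBF SVM with the optimised hyper-parameters (C=10, gamma=0.01).
clf = SVC(kernel="rbf", C=10, gamma=0.01, decision_function_shape="ovo")
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on this well-separated toy set
```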

2) Model Training & Evaluation of Recognition Performance:
Each participant has three datasets, each representing one pass through the mPPT test. Models are trained and tested in two modes:
• Participant independent: the dataset is separated into three parts: training, validation, and testing. In each iteration, two datasets of one participant were used for testing, and the remaining dataset of the test participant was not used.
• Transfer learning: the remaining dataset of the test participant is added to the training set, so that the model sees part of the test participant's data.
The models were trained on [35], with one Tesla T4 GPU and 25.46 GB of RAM. As shown in Table I, the dataset is imbalanced. To address this problem, the weight of each class is set to the inverse of its frequency. Thus, while calculating the loss function, the loss of a minority class has a higher weight than that of the majority class.
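The inverse-frequency class weighting can be sketched as follows. The normalisation to a mean weight of 1 is an illustrative choice, not necessarily the paper's, and the helper assumes every class occurs at least once.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Weight each class by the inverse of its frequency so that minority
    classes contribute more to the loss than the majority class."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    weights = 1.0 / counts                       # assumes no empty classes
    return weights / weights.sum() * n_classes   # normalise to a mean of 1

labels = np.array([0] * 80 + [1] * 15 + [2] * 5)  # imbalanced toy labels
weights = inverse_frequency_weights(labels, 3)
print(weights.round(3))  # rarest class gets the largest weight
```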
The item recognition performance of the models is evaluated using the following criteria:
• F1-score: adopted as the evaluation metric for comparing the performance of the same model using signals with different sampling frequencies.
• Computation duration: the training and prediction durations over the whole dataset are calculated separately.

D. mPPT Items Segmentation
1) Segmentation Process: Based on the predicted windows/time samples, segments are established from the predicted classes at discrete time instances. Each predicted segment is compared with the ground-truth annotated segment. The item segmentation process is illustrated in Fig. 4.
Neighbouring windows with the same prediction are merged into one group, which represents one mPPT item. The item's duration is calculated by counting the number of covered windows or time samples and converting it into seconds.
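The grouping step can be sketched with the standard library; `windows_to_segments` is a hypothetical helper, and `step_s` is the time between consecutive predictions (e.g. 1/6.25 s per time sample, or the window step for windowed models).

```python
from itertools import groupby

def windows_to_segments(pred, step_s):
    """Merge consecutive windows/time samples with the same predicted class
    into segments, reporting (class, start time, duration) in seconds."""
    segments, start = [], 0
    for label, run in groupby(pred):
        n = len(list(run))  # number of consecutive identical predictions
        segments.append((label, start * step_s, n * step_s))
        start += n
    return segments

pred = [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(windows_to_segments(pred, step_s=0.5))
# [(0, 0.0, 1.5), (1, 1.5, 1.0), (2, 2.5, 2.0)]
```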

2) Evaluation Metrics of Segmentation Performance:
• f1@k: the f1@k was proposed by Lea et al. [25] and refers to the F1-score of the segmentation with a threshold value k on the intersection-over-union (IoU). If the IoU of a predicted segment is larger than k, the predicted segment is a true positive (TP); otherwise, it is a false positive (FP). The unpaired actual segments are false negatives (FN), as shown in Fig. 4. In this study, the k values are 0.1, 0.25, and 0.5. The average f1@k over all classes is calculated and serves as the main evaluation metric, particularly f1@k (k = 0.5).
• Absolute duration prediction error: the absolute error between the predicted duration of one segment and the annotated one.
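The segmental f1@k can be sketched as follows. This is a simplified greedy matching (each ground-truth segment may be matched at most once); the exact matching procedure of Lea et al. [25] may differ in details, and segments are assumed to be (class, start, end) tuples.

```python
def f1_at_k(gt, pred, k=0.5):
    """Segmental F1@k: a predicted segment is a true positive if its IoU
    with an unmatched ground-truth segment of the same class exceeds k."""
    matched = [False] * len(gt)
    tp = fp = 0
    for c, ps, pe in pred:
        best_iou, best_j = 0.0, -1
        for j, (gc, gs, ge) in enumerate(gt):
            if gc != c or matched[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > k:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)  # unpaired ground-truth segments
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

gt = [(0, 0, 10), (1, 10, 20)]
pred = [(0, 1, 10), (1, 10, 18), (1, 18, 20)]  # last fragment is over-segmentation
print(f1_at_k(gt, pred, k=0.5))  # 0.8  (2 TP, 1 FP, 0 FN)
```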

E. Optimal Sensor Placement
The optimal sensor placement is investigated with the best machine learning model. First, the signals of every Byteflies sensor are applied separately. Then, sensors are added one by one until all five sensors are included. Each combination of sensors is evaluated based on the f1@k (k = 0.5); if the f1@k values are similar, the F1-score is used as the second evaluation criterion.

IV. RESULTS

Table III presents the activity recognition & segmentation results of all models using signals at frequencies of 100 Hz and 6.25 Hz. The TCN models performed better than the other models in both activity recognition and segmentation, with the non-causal TCN model (6.25 Hz) achieving the highest f1@k (k = 0.5) of 0.277. Comparing the TCN models, the non-causal TCN outperforms the causal TCN model in both recognition and segmentation results: the f1@k (k = 0.5) of the non-causal model (6.25 Hz) is around 0.06 higher than the causal model's. In addition, the training of the TCN models is significantly shorter than that of the SVM and the Incep V1-GRU model.

A. Activity Recognition & Segmentation in the Participant Independent Mode
Considering the sampling frequency, downsampling increases the F1-scores and f1@k results of the TCN models. In addition, with the 6.25 Hz sampling frequency, the computation load is lower. In contrast, after downsampling to 6.25 Hz, the performance of the SVM and Incep V1-GRU models is reduced; notably, the f1@k of the SVM model decreased by more than 0.38.
Following this comparison, the non-causal TCN configuration using signals sampled at 6.25 Hz is applied to the MS-TCN models, without and with the transfer learning method, as shown in Table IV. The performance of the MS-TCN (1 stage) is better than the normal TCN, with its f1@k (k = 0.5) value around 0.05 higher. Among the MS-TCN models, the MS-TCN (6 stages) obtains the highest segmentation performance, with an f1@k (k = 0.5) of 0.705 in the participant independent mode.

B. Activity Recognition & Segmentation With the Transfer Learning Method
The MS-TCN (6 stages) is trained with the transfer learning method and used to investigate the best sensor placement, because it obtains the highest performance compared to the other models. As listed in Table IV, including part of the test participant's data in training improves the robustness of the model on unseen participants, with the F1-score increasing from 0.709 to 0.758 and the f1@k scores increasing by over 0.1. Table V lists the results of single sensors and the optimal combinations of sensors when the number of sensors is increased from two to five. The optimal sensor combination is the sensors on the left hand and right ankle, with an f1@k (k = 0.5) of 0.850, 0.11 higher than using all five sensors and only 0.18 lower than the highest result, which includes the sensor on the left ankle. In addition, the number of worn sensors is reduced. The recognition confusion matrix is presented in Fig. 6. The absolute duration prediction errors of the non-causal MS-TCN model (6 stages) using the signals of the left hand and right ankle are shown in Table VI. The duration prediction errors averaged over all mPPT items for participants 4-7 are about 1 s, which is acceptable as this is the duration precision in the mPPT test. The largest errors are going upstairs (16.16 s) and going downstairs (10.48 s) for P1, and walking 50 ft (48.16 s) and standing balance (28.24 s) for P2. An example of a poor result for P2 is shown in Fig. 5.

V. DISCUSSION

A. Machine Learning Models
Different sampling rates impact the models differently. For the TCN models, downsampling results in a higher segmentation performance, as proposed in study [31]. However, for the SVM model, signals at 100 Hz achieved higher segmentation results. This is because, within the same window length (2 s), a larger sample size yields less spread in the extracted statistical features. Therefore, the window at 100 Hz contains more samples and is more informative, resulting in a better segmentation performance. The impact of the frequency on the Incep V1-GRU model is insignificant, which could be because it is more influenced by the dataset size than by the sampling rate.

TABLE IV: The F1-score and f1@k results of MS-TCN models without and with the transfer learning method.

The MS-TCN (six stages) outperformed the other models in recognizing and segmenting mPPT items. In addition, it requires neither the sliding window method nor extracting hand-crafted features. However, the MS-TCN models still have limitations: 1) the non-causal TCN model considers the complete sequence during prediction; hence, it can only be applied for offline data analysis; 2) all participants performed the mPPT test items in the same order, so whether the model can be used for a random-order sequence has not been verified yet; 3) although participants were asked to slow down, their performance still differs from that of older adults, even frail older adults, e.g., regarding increased step instability and a reduced ability to keep balance [36]; 4) the physical meaning of the features extracted by the MS-TCN has not been explored yet.

B. The Impact of the Transfer Learning Method
To overcome the impact of inter-subject differences in the participant independent mode, the transfer learning method proved effective in improving the generalizability of the model. In a real-life application, it can be implemented with a data pre-collection session for new users. During this session, movement signals can be collected while users learn how to perform the mPPT items and wear the devices.

C. Sensor Placement
The optimal sensor combination is a sensor on the left hand and one on the right ankle. The sensor on the ankle is more sensitive to movements of the lower extremities, such as walking, as proposed in study [37]. In addition, the non-dominant hand performs better than the dominant hand when combined with the right ankle in recognizing mPPT items, in line with study [38]. As a result, with only two sensors, the segmentation performance is even higher than with five sensors. Moreover, the experimental setup is simpler, which makes it more convenient for users to perform the mPPT test by themselves. However, from the confusion matrix we note that this combination does not help to differentiate going upstairs, going downstairs, and walking, which are similar activities performed in different directions. Features extracted from the transitional acceleration and from IMU signals between knee and ankle [39] can be considered in future work.

D. MPPT Items
The confusion matrix of the non-causal MS-TCN (6 stages) model is shown in Fig. 6. Waving hands is confused with all other activities except going downstairs. Among the mPPT items, lifting a book achieved the lowest recognition performance. The reason could be that this activity has the smallest number of samples, not enough for the model to learn properly. In addition, participants were required to mimic the behaviour of older adults, e.g., by slowing down their movement speed. Hence, the collected movement signals of this activity could be similar to those of waving hands without moving the arms. In addition, going upstairs and going downstairs are confused with each other.
Regarding the segmentation results, P2 obtained the largest duration prediction errors in walking 50 ft and going upstairs. When checking the performance of P2, (s)he performed the mPPT test at a much slower speed than the others, resulting in a much longer duration for each activity. Moreover, (s)he stopped multiple times during the walking and going-upstairs tests, which can also happen with older adults, and these pauses were misclassified as standing balance, as shown in Fig. 5. This misclassification implies a limitation of the dataset: it does not include all the variability of performing these items; hence, the model did not learn that stopping can occur during the walking/going upstairs test. In addition, a larger receptive field of the TCN model may be needed to grasp the context.

This study predicted the duration of most mPPT items with an error of less than 4 s, except for walking 50 ft, going upstairs, and going downstairs; however, the mPPT test score cannot be automatically predicted yet. What is still required to calculate the mPPT score is: 1) detecting the steadiness and continuity of turning 360°; 2) counting the number of finished steps when going up-/downstairs; 3) detecting the different feet positions of standing balance. A solution could be using multiple types of sensors, e.g., a pressure sensor to record the feet positions and movements. Furthermore, the zero velocity update (ZUPT) method, which measures the velocity and location of the feet, could be applied to detect the continuity and stability of the gait and to count the number of climbed stairs [40].

VI. CONCLUSION
This study proposes using five wearable devices to recognize and segment mPPT items. Three machine learning models and the optimal sensor placement are investigated, with signals sampled at 100 Hz and 6.25 Hz. The transfer learning method is applied to improve the robustness of the model on unseen participants. The results indicate that the non-causal MS-TCN model (6 stages), using the downsampled 6.25 Hz signals and only the sensors on the left hand and right ankle, outperforms the SVM, the Incep V1-GRU model, and the normal (non-)causal TCN models. The results imply the potential of this system to predict the duration of most mPPT items; however, the duration prediction errors still vary among participants. Future work can mainly focus on improving the recognition performance of some mPPT items, such as lifting a book, and on testing the feasibility of the model on older adults.

ACKNOWLEDGMENT
The assistance provided by Ahmed Youssef Ali Amer and Benjamin Filtjens (KU Leuven) was greatly appreciated. In addition, the authors thank all the participants who volunteered in the experiment.