A Novel CNN-BiLSTM Ensemble Model With Attention Mechanism for Sit-to-Stand Phase Identification Using Wearable Inertial Sensors

Sit-to-stand transition phase identification is vital for controlling a wearable exoskeleton robot that assists patients to stand stably. In this study, we propose a method for segmenting and identifying the sit-to-stand phases using two inertial sensors. First, we divided the sit-to-stand transition into five phases, namely, the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase, based on the preprocessed acceleration and angular velocity data. We then employed a threshold method to recognize the initial sitting and stable standing phases. Finally, we designed a novel CNN-BiLSTM-Attention algorithm to identify the three transition phases, namely, the flexion momentum phase, the momentum transfer phase, and the extension phase. Fifteen subjects were recruited to perform sit-to-stand transition experiments under a specific paradigm. Combining acceleration and angular velocity features for sit-to-stand phase identification was validated to improve model performance, and the integration of the CNN, Bi-LSTM, and Attention modules demonstrated the soundness of the proposed architecture. The experimental results showed that the proposed CNN-BiLSTM-Attention algorithm achieved the highest average classification accuracy of 99.5% for all five phases when compared with both traditional machine learning algorithms and deep learning algorithms on our customized dataset (STS-PD). The proposed sit-to-stand phase recognition algorithm could serve as a foundation for the control of wearable exoskeletons and is important for the further development of intelligent wearable exoskeleton rehabilitation robots.


I. INTRODUCTION
Transition from sitting to standing is essential in the daily lives of patients with lower limb motor dysfunction [1], [2], [3]. Employing a wearable exoskeleton robot to assist patients with sit-to-stand transitions is an effective way for patients with weak lower limb muscle function, balance disorders, and poor coordination [4]. Furthermore, the utilization of a wearable exoskeleton robot not only reduces the risk of falls in patients, but also reduces the cost of care [5], [6]. In the control of wearable exoskeleton robots, accurately identifying the sit-to-stand transition phases is vital for determining the assisting moments.
Researchers have performed studies on phase segmentation and phase identification during the sit-to-stand transition. Schenckman et al. [7] analyzed the kinematic and kinetic characteristics of healthy adults using an optoelectronic camera and force plates. They divided the sit-to-stand transition into the flexion momentum phase, momentum transfer phase, extension phase, and stabilization phase. Roebroeck et al. [8] used a motion capture system to calculate joint angles, angular velocities, and angular accelerations, and then divided the sit-to-stand transition into three phases based on the center-of-mass velocities in the horizontal and vertical directions: the acceleration phase, the transition phase, and the deceleration phase. Kralj et al. [9] measured the ground reaction force using a force plate and analyzed the kinematic characteristics of each phase of the sit-to-stand transition. They divided the sit-to-stand transition into seven events, including quiet sitting, initiation, seat unloading, seat off, ascending, stabilization, and quiet standing. Etnyre et al. [10] defined 11 events of the sit-to-stand transition using a vertical, anterior, and lateral three-dimensional force platform. Norman-Gerum and McPhee [11] divided the sit-to-stand transition into six events and five phases based on angular data from the trunk, hip, knee, and ankle, as well as vertical ground reaction forces. These events included the start of the sit-to-stand (STS), start of seat unloading, seat-off, end of momentum transfer, beginning of stabilization, and end of STS. In summary, these sit-to-stand phase segmentation methods all rely on instrumentation such as video acquisition systems and force plates [12]. As a result, they can only be used for data acquisition and analysis in stationary scenes and cannot be applied to the motion control of exoskeleton robots that assist patients in completing sit-to-stand transitions.
Wearable inertial measurement units (IMUs) have the advantages of portability and low cost and the potential to be used outdoors. IMUs have been widely used in various fields such as motion recognition, health monitoring, and human-computer interaction [13], [14]. Maswadi et al. [15] denoised raw triaxial acceleration data using Gaussian filters, extracted time and frequency domain features, and fed them into NB and DT classifiers, achieving accuracies of 89.5% and 99.9%, respectively, for activity recognition including sitting, standing, walking, lying down, and sit-to-stand. Martinez-Hernandez and Dehghani-Sanij [16] used a Bayesian classifier to identify the sitting, sit-to-stand transition, and standing states based on acceleration signals; additionally, they identified the three transition phases of the sit-to-stand transitions. Wannenburg and Malekian [17] proposed the K-Nearest Neighbors (KNN) algorithm for the offline recognition of five types of activities (sitting, standing, lying, walking, and jumping), achieving the highest classification accuracy of 99.01%.
Compared with traditional machine learning algorithms, deep learning algorithms utilize deep network structures to automatically extract features from raw data, resulting in improved model generalization. Yen et al. [18] automatically extracted features using 1D CNN convolutional layers from three-axis acceleration and angular velocity data; the output layer provided the probability of each of six activities: walking, walking upstairs, walking downstairs, sitting, standing, and lying down. Xia et al. [19] proposed a deep neural network combining CNN and LSTM to automatically extract activity features to recognize human activities. Geravesh and Rupapara [20] developed a Multi-Layer Perceptron (MLP) network architecture by integrating acceleration and angular velocity features; the architecture consisted of a stack of dense layers and achieved an accuracy of 98% in recognizing six daily activities, including sitting, standing, walking, biking, climbing stairs up, and climbing stairs down. To sum up, previous studies utilized IMU sensors to recognize human activities; however, they did not specifically focus on the phase segmentation and identification of sit-to-stand transitions.
Given the above problems, this paper proposed a method for segmenting and identifying the sit-to-stand phases using two IMUs. First, we defined the sit-to-stand transition as five phases, namely, the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase, based on the preprocessed acceleration and angular velocity data from two inertial sensors. Second, a threshold method was used to recognize the initial sitting phase and the stable standing phase. Finally, we designed a novel CNN-BiLSTM ensemble model with an attention mechanism to identify the three transition phases, namely, the flexion momentum phase, the momentum transfer phase, and the extension phase. These methods improved the accuracy of sit-to-stand transition phase identification, which is vital in the control of a wearable exoskeleton robot for assisting patients to stand stably.

II. METHODS
In this study, we proposed a novel method for segmenting and identifying the sit-to-stand transition using two IMUs. Our proposed method defined five phases including the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase. The overall flowchart is shown in Fig. 1. First, the raw acceleration data and angular velocity data from the two IMUs were collected and pre-processed. Second, we identified the initial sitting phase and the stable standing phase using the thresholding method. Finally, we identified the flexion momentum phase, the momentum transfer phase, and the extension phase based on the CNN-BiLSTM-Attention algorithm.

A. Sit-to-Stand Phase Segmentation
The sit-to-stand transition, from initial sitting to stable standing, comprises five phases: the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase, as shown in Fig. 2. The initial sitting phase is characterized by a ground reaction force that is approximately 24% of the body weight. During this phase, the ground reaction force remains constant and there is no intention to stand. The flexion momentum phase begins when the trunk and pelvis start tilting forward from the initial sitting position. In this phase, the femur, calves, and feet remain stationary while the ground reaction force gradually increases. The flexion momentum phase ends just before the hips leave the seat, at which moment the seat reaction force is 0 N. The momentum transfer phase is defined as the period starting when the hips leave the seat and ending when the knee moment reaches its maximum value, coinciding with the maximum value of the ground reaction force. The extension phase is defined as the period starting when the knee moment reaches its maximum value and ending when the hip, knee, and ankle joints align and the body reaches a fully upright position, with the ground reaction force approximately equal to the total body weight. The stable standing phase is defined as the period during which the ground reaction force is maintained at the same level as the total body weight, without any intention to sit down.

Fig. 2. Sit-to-stand transition phase segmentation. Consider a subject weighing 69 kg as an example. The vertical ground reaction forces under the feet and the seat are presented by the blue and green solid lines, respectively. The dotted lines and arrows delineate the five phases: the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase.
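The ground-reaction-force levels above suggest a simple rule for locating the two static phases. The sketch below is illustrative only (the helper name `label_static_phases` and the 5% tolerance are assumptions, not from the paper): it labels samples whose foot GRF stays near 24% of body weight as initial sitting and those near 100% as stable standing.

```python
import numpy as np

def label_static_phases(grf_feet, body_weight_n, tol=0.05):
    """Label samples as initial sitting (0), stable standing (4), or
    part of the transition (-1), from the vertical foot GRF.

    Hypothetical sketch: sitting is detected where the foot GRF stays near
    24% of body weight, standing where it stays near 100% (values taken from
    the phase definitions above); everything else belongs to the transition.
    """
    grf = np.asarray(grf_feet, dtype=float)
    labels = np.full(grf.shape, -1, dtype=int)
    labels[np.abs(grf - 0.24 * body_weight_n) < tol * body_weight_n] = 0
    labels[np.abs(grf - body_weight_n) < tol * body_weight_n] = 4
    return labels
```

In practice the paper uses Euler-angle thresholds from the thigh sensor for this step (Section V); the GRF version here only mirrors the phase definitions for illustration.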

B. Feature Selection
The acceleration after removing the gravity component was calculated using the relationship between quaternions and 3D rotations. The calculation is shown in (1)-(2): the measured acceleration in the inertial sensor frame was first transformed into the world coordinate system, and the gravitational acceleration component was then subtracted to obtain the linear acceleration:

a_W = q ⊗ a_I ⊗ q*    (1)
a = a_W - a_g    (2)

where a denotes the linear acceleration, a_W denotes the measured acceleration in the world coordinate system, a_I denotes the measured acceleration in the inertial sensor frame, a_g = [0, 0, 9.8] is the gravitational acceleration, q is the orientation quaternion calculated from the built-in sensor fusion algorithm of the IMU, q* is the conjugate of the quaternion, and ⊗ denotes quaternion multiplication.
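The gravity-removal step in (1)-(2) can be sketched with NumPy as follows, assuming a [w, x, y, z] Hamilton quaternion convention (the function names are illustrative, not from the paper):

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of two quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def linear_acceleration(a_imu, q):
    """Rotate the sensor-frame acceleration into the world frame with
    q ⊗ a_I ⊗ q* (eq. (1)), then subtract gravity (eq. (2))."""
    q = np.asarray(q, dtype=float)
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    a_quat = np.concatenate(([0.0], a_imu))            # pure quaternion [0, a]
    a_world = quat_mul(quat_mul(q, a_quat), q_conj)[1:]
    return a_world - np.array([0.0, 0.0, 9.8])
```

For an upright, stationary sensor (identity quaternion, measured acceleration equal to gravity) the linear acceleration is zero, as expected.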
To remove the noise introduced during signal acquisition, a 12th-order low-pass Butterworth filter was applied to the vertical acceleration and angular velocity signals [21]. The cut-off frequency of the filter was set to 1.7 Hz. The resulting filtered signals were then used as features for the model input.
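A minimal sketch of this filtering step with SciPy; the second-order-sections form (which keeps a 12th-order design numerically stable) and the zero-phase forward-backward application are implementation assumptions not stated in the paper:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 100.0     # IMU sampling frequency (Hz)
CUTOFF = 1.7   # cut-off frequency of the low-pass filter (Hz)

# 12th-order low-pass Butterworth filter, designed as second-order sections.
sos = butter(N=12, Wn=CUTOFF / (FS / 2), btype="low", output="sos")

def smooth(signal):
    """Apply the filter forward and backward so the result has no phase lag."""
    return sosfiltfilt(sos, signal)
```

A constant signal passes through unchanged (unit DC gain), while components well above 1.7 Hz are strongly attenuated.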

C. CNN-BiLSTM-Attention Ensemble Model
To accurately identify and understand the dynamics of the sit-to-stand transitions in the context of controlling an exoskeleton robot, we designed the CNN-BiLSTM-Attention architecture in this study. The decision to utilize this specific architecture was motivated by several key factors. First, the CNN component can effectively extract spatial features from pre-processed time series data. Second, the BiLSTM component is adept at modeling temporal dependencies, allowing for a more comprehensive understanding of the input data. Finally, the Attention mechanism redistributes the weights of the extracted spatial and temporal features and integrates them into more salient features. The network architecture of the CNN-BiLSTM-Attention ensemble model is illustrated in Fig. 3. The whole network consists of three cascaded sub-networks: CNN [22], Bi-LSTM [23], and the Attention mechanism [24]. The selected hyperparameters of the model are listed in TABLE I.
1) Convolutional Neural Network (CNN): The convolutional layer is the most important unit in a CNN. In the spatial dimension, it performs feature extraction on time series data using convolutional kernels, as shown in (3). We utilized two 1D convolutional layers: the first employed 256 convolutional kernels for feature extraction, while the second employed the same number of kernels to perform deeper feature extraction on the output of the first layer. Each convolutional kernel had a size of 2 × 2, a stride of 1, and a padding of 1.

Y_{i,j} = f( Σ_{m,n} W_{m,n} ⊗ X_{i+m,j+n} + b )    (3)
where Y_{i,j} is the output feature mapping value, W_{m,n} denotes the m × n weight matrix of the convolution kernel, X_{i+m,j+n} is the input feature mapping value, b is the bias vector, ⊗ is the matrix multiplication operation, and f(·) is the nonlinear activation function. The ReLU activation function is used to calculate the feature map, as shown in (4):

f(x) = max(0, x)    (4)
During deep neural network training, the input distribution at each layer changes continuously as the weight parameters are updated, resulting in slower convergence. A batch normalization layer is therefore added after the convolutional layer to improve the performance and stability of the network. By normalizing each mini-batch of data, it makes the input distribution of each layer more stable, improving the efficiency and accuracy of model training. Since the max-pooling layer provides downsampling, feature extraction, translation invariance, and data normalization, it helps the model better handle time series data and enhances its robustness to noise and fluctuations. Therefore, after the convolutional and batch normalization layers, a max-pooling layer was used to further extract salient features for the sit-to-stand phase identification.
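As an illustration of eqs. (3)-(4), here is a minimal single-kernel 1-D convolution in NumPy. The real layers use 256 learned kernels each; this toy version only shows the sliding-window weighted sum, bias, and ReLU that the equations describe.

```python
import numpy as np

def relu(x):
    """Eq. (4): f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def conv1d(x, w, b=0.0, padding=1, stride=1):
    """Minimal 1-D instance of eq. (3): slide kernel w over x, add the bias,
    and apply the ReLU activation. Illustrative sketch only."""
    x = np.pad(np.asarray(x, dtype=float), padding)
    k = len(w)
    out = [x[i:i + k] @ w + b for i in range(0, len(x) - k + 1, stride)]
    return relu(np.array(out))
```

With stride 1 and padding 1, as in the paper's layers, the output length matches the padded input minus the kernel extent.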
2) Bidirectional LSTM Network (Bi-LSTM): A bidirectional LSTM network is used to better extract temporal features from time series data. The bidirectional LSTM network consists of two LSTM layers, each with 256 memory cells, which differ only in the direction in which information is passed: the first LSTM layer processes the sequence in chronological order and the second in reverse chronological order. The unfolded hidden state h_t is calculated by the following equation:

h_t^(1) = f(U h_{t-1}^(1) ⊕ W x_t ⊕ b),  h_t^(2) = f(U h_{t+1}^(2) ⊕ W x_t ⊕ b)    (5)

where h_{t-1}^(1) and h_{t+1}^(2) are the hidden states of layer 1 at time t-1 and of layer 2 at time t+1, respectively, U ∈ R^{D×D} is the state-state weight matrix, W ∈ R^{D×M} is the state-input weight matrix, b ∈ R^D is the bias vector, x_t is the input at time t, f(·) is the nonlinear activation function, and ⊕ is the vector addition operation.
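The bidirectional recurrence can be sketched in NumPy as follows. For brevity the two directions share one set of weights here, which is an assumption; real Bi-LSTM layers learn separate parameters per direction and use gated cells rather than this plain recurrence.

```python
import numpy as np

def bidirectional_states(X, U, W, b, f=np.tanh):
    """Sketch of eq. (5): run h_t = f(U h_{t-1} + W x_t + b) forward over the
    sequence, run a second pass backward (where h_t depends on h_{t+1}),
    and concatenate the two hidden states at each time step."""
    T_len, D = len(X), U.shape[0]
    fwd = np.zeros((T_len, D))
    bwd = np.zeros((T_len, D))
    h = np.zeros(D)
    for t in range(T_len):                 # chronological pass
        h = f(U @ h + W @ X[t] + b)
        fwd[t] = h
    h = np.zeros(D)
    for t in reversed(range(T_len)):       # reverse-chronological pass
        h = f(U @ h + W @ X[t] + b)
        bwd[t] = h
    return np.concatenate([fwd, bwd], axis=1)
```

Concatenating the two passes is what gives each time step access to both past and future context.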
3) Attention Mechanism: Using the hidden state of the last unit of the bidirectional LSTM network as the query vector, the Attention mechanism is introduced to calculate the correlation with all the input temporal hidden states. Redistributing the feature weights extracted by the Bi-LSTM enables the model to select and utilize salient features more accurately, thus improving the model's performance in temporal classification tasks. The output O of the attention layer is calculated from (6)-(8):

e = tanh(W_a X)    (6)
α = softmax(e)    (7)
O = X α^T    (8)

where X = [x_1, x_2, ..., x_t] denotes the features captured by the Bi-LSTM model, W_a denotes the matrix of weight coefficients, α is a vector representing the attention weights of the features X, and T is the transpose operation.
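A compact NumPy sketch of the attention computation in (6)-(8). The tanh scoring and the reduction over the feature dimension are one common formulation and are assumptions here; the paper does not show the equations in full.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def attention(X, W_a):
    """Sketch of eqs. (6)-(8): score each hidden state, normalize the scores
    into attention weights, and form the weighted sum of the states.
    X has shape (D, T): one Bi-LSTM hidden state per column."""
    scores = np.tanh(W_a @ X).sum(axis=0)   # one scalar score per time step
    alpha = softmax(scores)                  # attention weights over time
    return X @ alpha                         # weighted feature vector, shape (D,)
```

Because the weights sum to one, the output is a convex combination of the hidden states, so more salient time steps dominate the pooled feature.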
III. EXPERIMENTAL VALIDATION

Human Activity Recognition (HAR) public datasets, such as University of California Irvine-Human Activity Recognition (UCI-HAR) [25], Wireless Sensor Data Mining (WISDM) [26], and OPPORTUNITY [27], are commonly used for recognition. These public datasets only include human daily activities such as standing, lying, walking, going up and down stairs, and jogging, as well as a variety of gesture actions and movement patterns. As a result, there is no public dataset covering the sit-to-stand transitions. Therefore, it was necessary to perform the following experiments and build a customized dataset for training the CNN-BiLSTM-Attention model and validating the effectiveness of the sit-to-stand phase identification. Additionally, these public datasets were also used to further verify the generalizability of our proposed classification model.

A. Subjects
Fifteen subjects (12 male, 3 female; age 26 ± 3 years; height 168 ± 7.05 cm; weight 60 ± 15.2 kg) without any reported musculoskeletal diseases or severe neurological injuries were recruited to perform the designed experiments. Kinetic and kinematic data were collected by force plates and inertial sensors. All participants provided written informed consent before enrolment in the study, which was approved by the Ethics Committee of the Affiliated Second Hospital, Harbin Medical University (KY2022-173).

B. Experimental Settings
The experimental equipment consisted of two AMTI force plates (BP400600-OP-1000, USA), a set of Optima signal amplifiers, two wireless inertial sensors (MTw, Xsens Technologies BV, Enschede, The Netherlands), and an Awinda Station receiver. All algorithms were run on a personal computer with Microsoft Windows 10, an Intel Core i7-13700K processor, 32 GB RAM, and an NVIDIA GeForce RTX 3090 GPU. The experimental acquisition setup is shown in Fig. 4. The force plate and inertial sensor systems were synchronized using a synchronization cable. During the experiment, the ground reaction force collected by the force plates and the kinematic data collected by the inertial sensors were measured at the same sampling frequency of 100 Hz. First, the foot-ground and hip-ground interaction forces in the vertical, anterior-posterior, and lateral directions were simultaneously captured by the two force plates located underneath the feet and the seat; the measured force trends were repeatable across trials and within reasonable limits. Second, the inertial sensors were securely attached to the body at the L5 and outer thigh positions using restraint belts [13], and the sensor wearing positions were strictly defined. Finally, during the sit-to-stand transitions, we acquired the 3-axis acceleration and 3-axis angular velocity signals.

C. Experimental Protocol
All participants performed the sit-to-stand transitions under a specific experimental protocol. A seat without armrests or backrests was used, with a height of about 85% of each participant's knee height. Participants were stabilized in the initial sitting position with the buttocks in contact with the seat, eyes looking forward, arms relaxed and crossed in front of the chest, trunk vertical, thighs horizontal to the ground, calves vertical to the ground, and feet parallel and spaced shoulder-width apart. When the subject heard the "stand up" command from the experimenter, the subject actively stood up from the seat at a natural speed until consciously maintaining the stable standing position without swaying. Then, when the experimenter gave the "stop" command, the trial was completed. After a two-second interval, participants returned to the initial sitting position and repeated the process five times following the above procedure.

D. Dataset Customization
The sit-to-stand phases were manually segmented and labeled using the preprocessed vertical acceleration and vertical angular velocity feature vectors. A fixed-width sliding-window segmentation was applied for the time series classification task, considering the different maximum elapsed times of each phase during the sit-to-stand transitions. Specifically, the flexion momentum phase lasted up to 0.8 s, the momentum transfer phase up to 0.40 s, and the extension phase up to 1.28 s. The sliding window used for segmentation was therefore set to the longest elapsed time of 1.28 s. The whole customized dataset used for training the model is called STS-PD. It took approximately 15 hours to build this dataset, and the dataset was divided into six categories. To address the imbalance in the number of samples per category, data augmentation techniques were employed to improve the model's generalization ability. The dataset consisted of approximately 27,276 samples, which were randomly split into 70% for training and 30% for evaluating the accuracy of the trained model. An overview of the collected data sample set is presented in TABLE II.
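The windowing described above can be sketched as follows; the window step (overlap) and the majority-vote labeling are assumptions, since the paper reports only the 1.28 s window length.

```python
import numpy as np

def sliding_windows(signal, labels, fs=100, win_s=1.28, step=10):
    """Cut a (samples, channels) time series into fixed-width windows of
    1.28 s (the longest phase duration) and label each window by a majority
    vote over its per-sample phase labels. Step size is an assumption."""
    win = int(win_s * fs)                                  # 128 samples at 100 Hz
    X, y = [], []
    for start in range(0, len(signal) - win + 1, step):
        X.append(signal[start:start + win])
        seg = labels[start:start + win]
        y.append(np.bincount(seg).argmax())                # majority label
    return np.array(X), np.array(y)
```

The resulting window array can then be split 70/30 into training and test sets, as described above.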

IV. RESULTS

A. Performance of CNN-BiLSTM-Attention Algorithm for Identifying Sit-to-Stand Phases
The classification performance of the model is highly dependent on feature selection. First, we acquired the 3-axis acceleration signals and 3-axis angular velocity signals, as illustrated in Fig. 5. Second, we used three types of features as model inputs: a single acceleration feature, a single angular velocity feature, and a combined acceleration and angular velocity feature. Finally, the classification accuracy of the sit-to-stand phases for each model is presented in Fig. 6. We observed that the classification accuracy of the combined features was higher than that of the single features, confirming the effectiveness of the combined features in improving the model's performance (Fig. 6). Therefore, the CNN-BiLSTM-Attention model utilized the combined acceleration and angular velocity features as inputs for identifying the phases of the sit-to-stand transitions. The accuracy and loss curves during model training and validation are shown in Fig. 7. The proposed CNN-BiLSTM-Attention algorithm achieved 99.5% mean accuracy in the sit-to-stand phase classification.

TABLE II
OVERVIEW OF THE CUSTOMIZED DATASET

B. The Influence of Different Network Structure Combinations on Classification Performance
Ablation studies were performed to verify the effectiveness of the proposed model structure. The CNN-BiLSTM-Attention model was compared with the CNN, Bi-LSTM, and CNN-Bi-LSTM models, respectively, to observe the effect of each module on classification performance. The average classification accuracies of these four models for the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase are presented in Fig. 8. The proposed CNN-BiLSTM-Attention algorithm recognized each phase of the sit-to-stand transition with an average recognition accuracy of 99.5%, as shown in Fig. 8, although its accuracy was relatively lower for the flexion momentum and momentum transfer phases. The CNN-BiLSTM structure outperformed the CNN and Bi-LSTM networks by 5.2% and 1.3%, respectively. Building upon this, the attention module was introduced, resulting in the best performance in recognizing the sit-to-stand transition phases (approximately 99.5%). The accuracy with the attention module was consistently higher than that of the CNN-BiLSTM, and the attention module significantly affected recognition accuracy (p < 0.05). This improvement of 3.3% over the CNN-BiLSTM confirmed the validity of each module in the improved network structure, as shown in TABLE III.

C. Performance of CNN-BiLSTM-Attention Versus Different Classification Models
This paper aims to demonstrate the feasibility of the CNN-BiLSTM-Attention model for classifying the sit-to-stand phases. The study compared traditional machine learning algorithms, such as SVM, NB, 1NN, DT, LR, and RF, and deep learning algorithms, including CNN, LSTM, CNN-Bi-LSTM, Bi-LSTM, and Gated-Transformer [28]. To further verify the generalizability of the proposed model, each activity in the three public human activity datasets (UCI HAR, WISDM, and OPPORTUNITY) was identified, respectively. The results were evaluated on the test dataset using the trained model (TABLE IV). Accuracy, precision, recall, and F1-score were selected as the evaluation metrics. The experimental results demonstrated that the proposed CNN-BiLSTM-Attention model outperformed the other benchmark models in terms of accuracy and F1-score on all datasets.
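The four metrics can be computed with scikit-learn as in the sketch below; the macro averaging mode is an assumption, since the paper does not state how per-class scores were aggregated.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    """Compute the four evaluation metrics used above for multi-class
    predictions. Macro averaging (an assumption) weights every phase
    equally, which matters under class imbalance."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```

Macro-averaged F1 is the natural companion metric here because, as noted below for WISDM and OPPORTUNITY, overall accuracy alone can be misleading on imbalanced datasets.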

V. DISCUSSION
In this paper, we proposed a method for segmenting and identifying the five sit-to-stand phases using two inertial sensors. We also verified the effectiveness of the combined features, as well as the integration of the CNN, Bi-LSTM, and Attention modules, in improving the performance of the CNN-BiLSTM-Attention algorithm. When the CNN-BiLSTM-Attention algorithm was applied to the sit-to-stand transition phases, it achieved the highest average classification accuracy (approximately 99.5%). To verify the generality and accuracy of the CNN-BiLSTM-Attention algorithm, we compared traditional machine learning algorithms and deep learning algorithms on the UCI HAR, WISDM, and OPPORTUNITY public datasets and the STS-PD customized dataset. Martinez-Hernandez et al. [16] employed a Bayesian classifier to identify three activity states (sitting, sit-to-stand transitions, and standing) using acceleration signals, while also identifying the three corresponding transition phases of the sit-to-stand transitions. Johan et al. [17] proposed the K-Nearest Neighbors (KNN) algorithm, achieving the highest classification accuracy of 99.01% for the offline recognition of five activities, including sitting, standing, lying, walking, and jumping. Kun Xia et al. [19] proposed a deep neural network combining CNN and LSTM to automatically extract activity features for human activity recognition. Shahab et al.
[20] developed a Multi-Layer Perceptron (MLP) network architecture that achieved 98% accuracy in recognizing six daily activities, including sitting, standing, walking, biking, climbing stairs up, and climbing stairs down. However, these previous studies did not focus on segmenting and identifying the phases of the sit-to-stand transitions. We combined acceleration and angular velocity as inputs to the CNN-BiLSTM-Attention model and cascaded three network structures, CNN, Bi-LSTM, and Attention, to achieve the recognition of the sit-to-stand transition phases with an accuracy of 99.5%. However, as shown in Fig. 7, there is slight overfitting of the model, which may be attributed to the following factors: 1) insufficient training data; 2) inadequate application of regularization techniques; 3) excessive hyperparameters (TABLE I). To address this overfitting, several approaches can be adopted in the future: 1) increasing the data size through data augmentation techniques; 2) applying regularization constraints such as L2 regularization; 3) reducing the complexity of the model.
Both the position and the number of the worn inertial sensors determine the two feature variations, acceleration and angular velocity, which affect the optimal feature selection. We used two inertial sensors in the experimental data acquisition phase: one sensor was worn on the outer thigh and the other at the L5 position. When studying human movement from a sitting to a standing position, inertial sensors are usually placed on the sternum, wrist, back (L4-L5), thigh, tibia, etc. Literature results suggest that wearing the sensor at the L5 position is the most effective [13]. However, if only one inertial sensor were worn at the L5 position, it would not be possible to distinguish whether the subject was in the initial sitting phase or the stable standing phase, since both have zero acceleration and angular velocity. Therefore, in this study, one inertial sensor was worn on the outer thigh to accurately distinguish between these two static phases using an Euler angle threshold, and the other sensor was worn at the L5 position to distinguish the sit-to-stand transition phases using the CNN-BiLSTM-Attention algorithm. The classification performance of the model heavily depends on the selected features. In the experimental results section, this paper compared the classification accuracies of the single acceleration feature, the single angular velocity feature, and the combined acceleration and angular velocity feature using different classifiers for sit-to-stand transition phase recognition, as shown in Fig. 6. The combination of features is more relevant to the target category than a single feature; it provides deeper information and enhances the classification of the sit-to-stand transition phases. Ultimately, the combined acceleration and angular velocity features achieved the highest classification accuracy of 99.5% with the CNN-BiLSTM-Attention algorithm proposed in this work.
To verify the reasonableness and effectiveness of the designed network structure, we performed ablation experiments to study the effect of each of the CNN, Bi-LSTM, and Attention modules on the classification performance. The hyperparameters, such as the optimizer, number of filters, and batch size, were set according to manual experience, and the specific values are shown in TABLE I. The classification results were presented through a confusion matrix, as shown in Fig. 8, which reports the average classification accuracy and the accuracy for each phase. The CNN-BiLSTM network structure outperformed the CNN network and the Bi-LSTM network by 5.2% and 1.3% in average classification accuracy, respectively. CNN networks are primarily used in image-related fields, where they are effective at extracting feature information in the spatial dimension. Bi-LSTM networks, on the other hand, utilize both past and future contextual information, enhancing the modeling of long-term dependencies and facilitating the extraction of feature information in the time dimension. Therefore, by combining the strengths of CNN and Bi-LSTM networks, spatial-temporal feature information was extracted, providing richer features for model training; as a result, the average classification accuracy reached 96.2%. Furthermore, the Attention module was introduced, which further improved the average classification accuracy by 3.3% compared to the CNN-BiLSTM network structure. This module redistributed and integrated the weights of the extracted spatial-temporal features into more salient features, enhancing the robustness and generalization of the model.
The vertical ground reaction force obtained from the force plates was used as the gold standard. Two IMUs were used to classify the sit-to-stand transition process into different phases, including the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase. The proposed CNN-BiLSTM-Attention model was utilized to classify each phase, as shown in Fig. 8. The accuracy of classifying the flexion momentum phase and the momentum transfer phase was relatively low compared to the other sit-to-stand phases. This can be attributed to the short durations of the trunk leaning forward and the buttocks leaving the seat, the use of a fixed sliding window for phase segmentation leading to overlapping phases, the experimental protocol design, and the variability in stand-up movements among individuals. These factors affect the accuracy of the sit-to-stand phase classification.
To evaluate the classification performance and generalization of the proposed CNN-BiLSTM-Attention algorithm, several evaluation indexes were selected, including accuracy, precision, recall, F1-score, and training time. Six traditional machine learning algorithms, such as SVM, NB, and 1NN, and six deep learning methods, such as CNN, LSTM, and CNN-Bi-LSTM, were tested on three public datasets (UCI HAR, WISDM, and OPPORTUNITY) as well as on the STS-PD customized dataset. The evaluation results of the different network models on the different datasets are shown in TABLE IV. It is important to note that no single algorithm in machine learning can be universally applied to all problems, and the trade-off between accuracy and speed should be considered based on the specific application scenario. While the 1NN algorithm is simple and computationally fast on the customized dataset, the main goal of this paper is to accurately identify the individual phases of the sit-to-stand transition. Furthermore, given the data imbalance observed in the WISDM and OPPORTUNITY datasets, the overall classification accuracy alone cannot fully assess model performance. The F1-score, which combines precision and recall over the total number of correctly recognized samples, therefore serves as an important metric for evaluating the model. Among the traditional machine learning algorithms, LR performed the worst because only two features were included in the model training data, which is insufficient for a good fit. On the other hand, RF, as an ensemble learning algorithm, significantly improved the classification accuracy compared to the other weak classifiers. Neural networks showed superior model fit compared to traditional machine learning algorithms. Transformer networks have achieved state-of-the-art performance in various tasks such as natural language processing, computer vision, and time series classification. However, due to the complexity of
their model structure, the computational cost is too high for this task. Therefore, the CNN-BiLSTM-Attention model proposed in this paper was selected to identify the phases of the sit-to-stand transition, achieving an F1-score of 99.5%.
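As a concrete illustration of why the F1-score complements accuracy under class imbalance, both metrics can be computed directly from a confusion matrix. The matrix below is a made-up two-class example, not a result from this study: the majority class inflates overall accuracy, while the macro-averaged F1 exposes the poorly recognized minority class.

```python
import numpy as np

def macro_f1(cm):
    """Macro-averaged F1 from a square confusion matrix (rows = true class)."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)  # column sums = predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)     # row sums = true counts
    f1 = np.where(precision + recall > 0,
                  2 * precision * recall / (precision + recall), 0.0)
    return f1.mean()

# Toy imbalanced confusion matrix: 100 majority-class vs 10 minority-class samples.
cm = np.array([[98, 2],
               [8, 2]])
accuracy = np.trace(cm) / cm.sum()   # 100/110, roughly 0.909
print(round(accuracy, 3), round(macro_f1(cm), 3))
```

Here accuracy stays above 0.9 even though the minority class is recognized only 20% of the time, which is why F1 was reported alongside accuracy on the imbalanced public datasets.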
Although the study achieved good results, there are still a few limitations. First, the experiment recruited a limited number of healthy subjects. To enhance the generalizability of the algorithm, future experiments should involve a more diverse population and collaborate with hospitals to study patients with motor dysfunction. Second, it is noteworthy that IMUs, being physical sensors, cannot directly sense human motion intention, and in this study they supported only offline detection. Future research should therefore consider incorporating multimodal signals, such as electromyography and electroencephalography, to enable real-time online detection. Finally, the proposed algorithm is computationally expensive and time-consuming; more efficient network model structures should be explored in future studies.

VI. CONCLUSION
In this paper, we divided the sit-to-stand transition into five phases based on two IMUs: the initial sitting phase, the flexion momentum phase, the momentum transfer phase, the extension phase, and the stable standing phase. We then proposed a threshold method and a CNN-BiLSTM-Attention algorithm to accurately identify these phases. Through experimental validation, we demonstrated that the proposed CNN-BiLSTM-Attention algorithm achieved an average classification accuracy of 99.5%, outperforming both traditional machine learning algorithms and deep learning algorithms. Accurate recognition of sit-to-stand transition phases plays a critical role in effectively controlling a wearable exoskeleton robot and assisting patients in standing stably. Moreover, sit-to-stand rehabilitation training can help restore neuromuscular and balance functions, improve lower limb strength and the ability to perform daily activities, and promote brain plasticity and functional reorganization.

Fig. 1. Overall flowchart of the identification of the sit-to-stand transition based on two IMUs.

Fig. 3. The architecture of the designed CNN-BiLSTM-Attention model. The combined acceleration and angular velocity features are used as model inputs and are passed sequentially through the CNN layer, the Bi-LSTM layer, and the Attention layer; the classification results are then output by the SoftMax classifier in the fully connected layer.

TABLE I LIST OF SELECTED HYPER-PARAMETERS

where h(1)_t and h(2)_t are the hidden states of layer 1 and layer 2 at time t, and h(1)_{t-1} and h(2)_{t-1} are the corresponding hidden states at time t-1.

Fig. 5. Sit-to-stand IMU signals from the sensor worn on L5. (a) The raw and filtered linear acceleration signals. (b) The raw and filtered angular velocity signals. The red dashed lines and arrows mark the individual phases corresponding to the force plates.

Fig. 8. Confusion matrices for the different classification models. The values on the diagonal (blue cells) represent the individual phase classification accuracies, and the other cells show the proportion of incorrectly labelled observations relative to the total number of observations.

TABLE III ABLATION EXPERIMENT OF EACH MODULE

TABLE IV PERFORMANCE EVALUATION OF DIFFERENT CLASSIFICATION MODELS ON PUBLIC DATASETS AND CUSTOM DATASETS