Accuracy Comparison of CNN, LSTM, and Transformer for Activity Recognition Using IMU and Visual Markers

Human activity recognition (HAR) has applications ranging from security to healthcare. Typically these systems are composed of data acquisition and activity recognition models. In this work, we compared the accuracy of two acquisition systems: Inertial Measurement Units (IMUs) vs Movement Analysis Systems (MAS). We trained models to recognize arm exercises using state-of-the-art deep learning architectures and compared their accuracy. MAS uses a camera array and reflective markers. IMU uses accelerometers, gyroscopes, and magnetometers. Sensors of both systems were attached to different locations of the upper limb. We captured and annotated 3 datasets, each one using both systems simultaneously. For activity recognition, we trained 8 architectures, each one with different operations and layer configurations. The best architectures were a combination of CNN, LSTM, and Transformer, achieving test accuracy from 89% to 99% on average. We evaluated how feature selection reduced the sensors required. We found that both IMU and MAS data were able to correctly distinguish the arm exercises. CNN layers at the beginning produced better accuracy on challenging datasets. IMU had advantages over other acquisition systems for activity recognition. We analyzed the relations between model accuracy, signal waveforms, signal correlation, sampling rate, exercise duration, and window size. Finally, we proposed the use of a single IMU located at the wrist and a variable-size window extraction.


I. INTRODUCTION
Human activity recognition (HAR) studies the capture systems and algorithms used to recognize activities performed by people in any situation. Data for this task can be captured by inertial sensors, cameras with visual markers, cameras using human pose estimation, EEG, or EMG. All these sensors produce time series data; consequently, the algorithms able to classify these activities are recurrent neural networks, convolutional neural networks, Long Short-Term Memory networks, and Transformer networks [1], [2]. HAR is an essential field of study in computer vision and artificial intelligence. It has many potential applications in various industries including security, surveillance, and healthcare [3], [4]. HAR systems are used to monitor the movements of elderly individuals in care facilities and to alert caregivers if they fall or exhibit other signs of distress [5], [6], [7], [8], [9]. HAR technology is also used in sports training to help athletes measure their performance by providing real-time feedback on their exercises [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Dost Muhammad Khan.
One of the key challenges in human activity recognition is the high variability of human movements and activities. Due to this variability, traditional machine learning algorithms often struggle to accurately classify movements and activities [11]. To overcome this challenge, researchers have developed a range of techniques and approaches, including deep learning and other advanced machine learning methods. Our work aims to identify deep learning architectures that achieve higher accuracies.
One capture system is the Movement Analysis System (MAS), which detects the movement of human joints and limbs. MAS uses infrared cameras and visual markers to measure the physical world and obtain the dimensions and positions of objects. Data is generated using specialized software that analyzes images taken from different angles and positions to create a 3D model of the object [12]. MAS takes the video stream as input and estimates the positions of the visual markers worn by a person [13].
Another capture system is the use of inertial measurement units (IMUs). An IMU is a system of sensors that measures three-axis acceleration, angular velocity, and magnetic field [14]. IMUs are attached to different parts of the human body to identify activities such as walking, running, or jumping. This makes IMUs suitable for healthcare applications such as rehabilitation programs, where they provide feedback on movements and monitor progress during therapy sessions [15]. Our work aims to identify the advantages and disadvantages of IMU and MAS capture methods.
HAR has three levels of abstraction in exercise recognition: full body exercise, single limb exercise, and stages inside a single exercise. We captured data and built models to classify upper limb exercises and to distinguish the flexion or extension stage inside an exercise. Most HAR studies [7], [10], [16], [17], [18], [19], [20], [21] use exercises that involve the entire human body and differ clearly from one another, such as walking, climbing stairs, sitting, jumping, or running. Our work studies 6 specific exercises of the upper limb that share common behavior, making them harder to distinguish.
In this work, we compared the accuracies of two acquisition systems: Inertial Measurement Unit (IMU) vs Movement Analysis System (MAS). We studied the time series data using plots, labeling the signals according to each exercise, to understand graphically which sensor discriminates exercises better (accelerometer, gyroscope, magnetometer, or visual marker). We designed 8 state-of-the-art deep learning architectures using different layers and operations, trained them to recognize arm exercises, and compared their accuracy. We compared the performance of every architecture and capture method to identify the best combination. We applied feature selection to identify the minimum set of sensors and locations needed to correctly recognize exercises. We studied the effect of different sampling rates, exercise durations, and window sizes. Finally, we proposed the use of a single IMU located at the wrist and a variable-size window extraction.

II. RELATED WORKS

A. ACTIVITY RECOGNITION USING IMU
The use of IMUs for human activity recognition has become popular in recent years due to the widespread availability of sensors in wearable devices such as smartphones, wrist watches, and fitness trackers. Advances in machine learning and deep learning allowed the development of sophisticated algorithms for recognizing human activities from IMU data. These algorithms involve many kinds of artificial neural network architectures that process data with spatial and temporal characteristics. Neural networks are effective in recognizing complex patterns in time series data and can be trained on large datasets to achieve high accuracy.
Monitoring and analyzing human motion can provide valuable information for various applications. In 2014, Ronao and Cho [16] proposed a multi-task learning approach for human activity recognition. Using accelerometer data, the system uses a single CNN to learn multiple tasks, including activity recognition and pose estimation. The authors found that this approach improved the overall accuracy of the activity recognition system. Then, in 2017, Yarnan et al. proposed a framework for detecting arm and human activities based on data fusion from inertial measurement units (IMUs) and surface electromyography (EMG) sensors [22]. Supervised and unsupervised machine learning algorithms were used to train the models and obtain evaluation indicators. The combined IMU and EMG data outperformed the IMU data alone and the EMG data alone, significantly reducing the error in determining activities for supervised algorithms.
In 2018, Xiong et al. [17] proposed a two-stage model for recognizing activities from accelerometer data. The first stage of the model uses a CNN to extract spatial and temporal features from the data. The second stage then uses a recurrent neural network (RNN) to combine features from the CNN with contextual information to recognize the activity.
In 2019, Sarcevic et al. developed a system to detect arm and body movements using wrist sensors that contained an accelerometer, a gyroscope, and a magnetometer [6]. Multiple datasets were tested using various feature extraction approaches, sampling frequencies, processing window widths, and sensor combinations. The authors achieved almost 90% accuracy on validation data.
Lu and Tong [18] worked on HAR using a single 3-axis accelerometer, focusing on movement monitoring using wearable sensors and devices. Their method consists of encoding 3-axis signals as 3-channel images using a modified recurrence plot. Then, residual neural networks were used to classify the images and, thus, the signals. As a result, the authors obtained highly competitive accuracy and good efficiency on the ASTRI motion dataset, which contains data on human hand movements, and on the ADL dataset of wrist-worn accelerometer data.
The work of Avilés et al. presented a framework to recognize user movement using a smartphone equipped with a tri-axial accelerometer and a tri-axial gyroscope sensor. The framework used three parallel CNNs for local feature extraction, later fused in the classification stage. The whole CNN scheme is based on a feature fusion of a fine CNN, a medium CNN, and a coarse CNN [10]. The algorithm successfully classified six human activities: walking, walking upstairs, walking downstairs, sitting, standing, and laying.
Yen et al. proposed a wearable device capable of recognizing six basic activities using deep learning and data from a gyroscope and an accelerometer [19]. They used waist devices worn by dialysis patients, whose activities could not be accurately determined using wrist devices. The model achieved recognition rates of 95.99% and 93.77%. That same year, Lemieux and Noumeir proposed a hierarchical CNN model for human activity recognition [23]. The model consists of two levels of CNNs. The first level extracts spatial features from the accelerometer data and the second level combines the spatial features with temporal information to recognize the activity.
In 2020, Clouthier et al. analyzed the movement of athletes [20]. They collected optical motion data on 417 athletes performing 13 athletic movements. The authors trained an existing deep neural network architecture that combines convolutional and recurrent layers. They obtained classification accuracies of 90.1% and 90.2% for full body measurements. The authors concluded that classifying athletic movements using wearable sensors was feasible.
More recent research is the work of Uddin and Soylu [7], focused on the well-being of elderly people using wearable sensors to detect unprecedented events such as falls or other health risks. The authors proposed a "body sensor-based activity modeling and recognition system using time-sequential information-based deep Neural Structured Learning (NSL)" [7]. The algorithm is powered by data from multiple wearable sensors, which then undergo statistical feature processing. The framework is powered by kernel discriminant analysis (KDA) and long short-term memory (LSTM) based models. The authors achieved around 99% recall on the mobile health application dataset (MHEALTH) [24]. The framework also surpassed the recall rate of other algorithms, such as deep belief networks, convolutional neural networks, and recurrent neural networks.
Another recent work is the paper of Han et al., which focused on enhancing the convolution capacities of CNNs instead of modifying the architectures [25]. The authors proposed the idea of heterogeneous convolution for activity recognition tasks. All filters within a specific convolutional layer are separated into two uneven groups. The authors examined the effectiveness of the framework on several benchmark HAR datasets, finding that heterogeneous convolution is simple to integrate into convolutional layers without adding parameters or computational overhead. In the same year, Luwe et al. proposed a "hybrid deep learning model that amalgamates a one-dimensional Convolutional Neural Network with a bidirectional long short-term memory (1D-CNN-BiLSTM) model for wearable sensor-based human activity recognition" [26]. This one-dimensional neural network transforms the time series information from the sensor into representative features, which are then encoded by the bidirectional LSTM. The authors found the approach outperformed the existing methods, obtaining a recognition rate of 95.48% on the UCI-HAR dataset, 94.17% on the Motion Sense dataset, and 100% on the Single Accelerometer dataset.

B. ACTIVITY RECOGNITION USING MOVEMENT ANALYSIS SYSTEM
In 1999, Ramsey and Wretenberg published a research paper reporting on the use of intracortical pins to measure knee movement as an alternative to the use of reflective markers [27]. The authors found that their method allowed them to take more precise readings with low error.
In 2005, Cutti et al. proposed an experiment to test the error when using reflective markers for photogrammetric measurements [28]. The authors put the markers on different subjects and made them execute different movements. Then, the readings affected by the error were compared with normal readings. The authors concluded that the error has a strong influence and should not be ignored, opening the way for new research that could compensate for this error.
Tokarczyk and Mazur compiled different Movement Analysis System techniques and methods [8]. They presented the advancements of two Movement Analysis System techniques. The first is Moiré's method of stripes, which involves overlaying two sets of parallel stripes with slightly different spacings to create a moiré pattern, which can be used on the human body to detect conditions such as scoliosis. The second method uses multiple video cameras around the body, in conjunction with physical markers that the camera system can easily detect and read.
In 2008, Van Andel et al. carried out research to determine a standardization protocol for the clinical application of upper extremity movement analysis [9]. The authors developed measurement methods for hand orientation in different movements, using a stereophotogrammetric recording of active LED markers with a camera system. The wrist, elbow, shoulder, and scapula joint angles were analyzed, and minimum/maximum angles were determined. This way, the authors determined the trajectories and angles of all the movements, laying the basis for developing more precise and standardized reports on movements that would allow for future comparisons with pediatric and/or pathologic movement patterns. MAS generates easy-to-understand reports on movements because it outputs marker positions [9].
Jaén-Vargas et al. used wearable sensors and two reflective markers (mocap) to recognize the activities of walking, sit-to-stand, and squatting [21]. The authors evaluated the performance of four deep learning networks: deep neural network, CNN, LSTM, and a combination of CNN and LSTM. The authors found that a hybrid network (CNN-LSTM) was better than an individual network. The hybrid approach accounted for class imbalance, making it more versatile, obtaining 99% accuracy in both datasets and an F1 score of 99% and 87% with wearable sensors and reflective markers, respectively. The authors also found that the use of wearable sensors yielded better results.
106652 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In 2022, Jaén-Vargas et al. analyzed the performance of deep neural networks, CNN, LSTM, and CNN-LSTM when varying the sliding window size [29]. The sliding window is a technique in which a fixed-size window is moved over a time series, processing the information in divided sections. The intention was to find an optimal window size for HAR using a sampling rate of 100 Hz. Small windows of 5, 10, 15, 20, and 25 frames and long windows of 50, 75, 100, and 200 frames were compared. The results showed that windows from 20 to 25 frames were optimal, obtaining an accuracy of 99.07% and an F1-score of 87.08% on sensor data and an accuracy of 98.8% and an F1-score of 82.80% on MOCAP data.
A particular field of research that involves the use of cameras for movement recognition is the study of human gait using silhouettes and skeletons. In this field, an important work is the paper of Cicirelli et al. [30], which is a review of human gait analysis with application in neurodegenerative diseases. They compared sensors, features, and processing methodologies, where deep networks such as CNNs or LSTMs achieved the best results. In another gait-related work [31], the authors performed gait analysis classification for neurodegenerative diseases using support vector machines (SVMs) on optical motion capture data and achieved 99.1% accuracy. This year, three relevant works in gait analysis were published. The first one is the paper of Shayestegan et al. [32], in which they applied a Dual-Head Attentional Transformer-LSTM (DHAT-LSTM) to kinetic data to classify stages of gait disorders and achieved an accuracy of 81%. In the work of Cheriet et al. [33], a Multi-Speed Transformer Network was applied to video data to classify stages of neurodegenerative diseases and achieved an accuracy of 96.9%. Finally, Cosma, Catruna, and Radoi [34] published a paper where they used Self-Supervised Vision Transformers on human gait video data as a biometric authentication method.
A summary of the papers considered in the Related Works section is presented in Table 1.

III. EQUIPMENT AND DATASETS ACQUISITION

A. EQUIPMENT
Three different Inertial Measurement Units were used. Sensors from different manufacturers were used to capture variability in sensor sensitivity and sampling rate. The first two devices, the MPU9250 and the Trigno Avanti [35], are inertial sensors composed of an accelerometer, a gyroscope, and a magnetometer, each one with 3 axes. The third IMU device used was the Metawear, composed of a 3D accelerometer and a 3D gyroscope. The three systems send data to a PC using Bluetooth communication, using their software to capture data simultaneously from all devices at all locations. The sensors were mounted on the shoulder, forearm, arm, and hand, as described in Figure 1 and in the work of Tobar et al. [36].
The equipment used for movement analysis was a Kinescan/IBV. It is a movement analysis system based on visual markers that uses rigid segment models. It captures PAL video at 25 frames per second. The system has 10 cameras located at a height of 2.4 m and distributed around the person. The analysis of object movement is performed by tracking the position of markers. These markers are small spheres covered with reflective material attached to the subject. The system provides position and speed for all markers. The markers are placed as follows: one on the shoulder, three on the forearm, two on the elbow, three on the arm, two on the wrist, and one on the hand. Figure 2 shows the distribution of the markers and the cameras used.

B. ARM EXERCISES DESCRIPTION
The datasets were recorded while the subjects performed the following arm exercises. Each exercise is described as:
• Elbow Flexion-Extension: During elbow flexion the angle formed by the elbow joint decreases. The forearm approaches the arm. During elbow extension the angle formed by the elbow joint increases. The forearm moves away from the arm. Figure 3.a shows the exercise.
• Hand Pronation-Supination: Starts with the hand and forearm aligned, the palm facing upwards with the thumb facing outwards. A rotation results in the palm facing downward with the thumb facing inwards. The exercise is shown in Figure 3.b.
• Shoulder Abduction-Adduction: Abduction is a lateral movement of the entire upper limb away from the trunk until the arm forms a 90-degree angle with the trunk. Adduction is the lateral movement that brings the upper limb closer to the trunk. Figure 4 shows the exercise. In extension, it returns to the starting position. Figure 6 shows the exercise.
• Internal and External Shoulder Rotation: This exercise begins with the arm next to the trunk, with the arm and forearm forming a 90-degree angle in an L shape. The forearm begins next to the stomach and then moves away horizontally from the body. Figure 7 shows the exercise.

C. DATASET ACQUISITION
A group of 10 people without arm diseases, between 20 and 25 years old, were involved in the data acquisition.
The acquisition was separated into 3 sessions of people performing arm exercises. For each session, a person starts in a neutral position and performs 10 repetitions of each exercise. At the end of the exercise, they return to a neutral position and repeat the process for the next exercise.
Each session was recorded with Movement Analysis System (MAS) and IMU instruments simultaneously. Exercises were performed one after the other, and data for each exercise was recorded and labeled as a whole. Within the 3 sessions, 7 sub-datasets were created: 3 for MAS and 4 for IMU. Within the last session, 2 different IMU devices were used. Dataset 1 has 2 elbow exercises, Dataset 2 has 3 shoulder exercises, and Dataset 3 has 3 elbow and 3 shoulder exercises.
Table 2 shows the names of the datasets, instruments used, and their capture frequencies. Tables 3 and 4 describe the features captured from IMUs and Movement Analysis System. Each feature is described by its type (Acc, Gyro, Mag, Visual marker), location, and sensor axis. For IMU data, sensors were located at the shoulder, forearm, arm, and hand. For Movement Analysis System data, markers were located as follows: one on the shoulder, three on the forearm, two on the elbow, three on the arm, two on the wrist, and one on the hand.

D. DATASET DESCRIPTION
Each sample was built as a matrix where the columns represent the features and rows represent the length given by the window size.The label for the sample was the most common label from the array of data points.
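The sample construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_windows`, the `step` parameter, and the choice of non-overlapping windows are hypothetical and are not taken from the released code.

```python
import numpy as np

def make_windows(signal, labels, window_size, step=None):
    """Cut a (timesteps, features) array into fixed-size windows.

    Each window is labelled with the most frequent per-timestep label inside it.
    `step` is a hypothetical parameter; by default windows do not overlap.
    """
    step = step or window_size
    samples, sample_labels = [], []
    for start in range(0, len(signal) - window_size + 1, step):
        stop = start + window_size
        samples.append(signal[start:stop])
        values, counts = np.unique(labels[start:stop], return_counts=True)
        sample_labels.append(values[np.argmax(counts)])
    return np.stack(samples), np.array(sample_labels)

# Toy recording: 1000 timesteps, 24 features, two exercises back to back.
signal = np.random.default_rng(0).normal(size=(1000, 24))
labels = np.array([0] * 450 + [1] * 550)
X, y = make_windows(signal, labels, window_size=200)
print(X.shape, y)
```

The third window spans timesteps 400 to 599 and contains both labels, so it receives the majority label; the resulting tensor has the layout samples × window size × features.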
IMU data was captured at frequencies of 1, 25, and 50 Hz with 18, 24, and 27 features. IMU data used window sizes of 10, 20, 40, and 200 data points. A single IMU dataset has at most 500k timesteps and at most 4500 samples. According to the window size, we have data tensors of, e.g., 2511 × 200 × 24 (samples × window size × features). The acquisition sampling rate determines the amount of data for an exercise duration. If the sampling rate is too low, less data is generated, the movement is not captured correctly, and the exercise is not recognized correctly.
Visual markers data was captured at frequencies of 50 and 200 Hz with 18, 24, and 36 features. Visual markers data used window sizes of 100 and 200 data points. A single visual marker dataset has at most 545k timesteps and 5450 samples. According to the window size, we have data tensors of, e.g., 1350 × 200 × 24 (samples × window size × features).
For training, we used 100 epochs and a batch size of 64 samples for both acquisition systems. Available data was split using 75% for training, 10% for validation, and 15% for testing. All information related to datasets is shown in Table 5.
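A 75/10/15 split of the windowed samples can be illustrated as follows. The helper `split_indices`, the shuffling, and the fixed seed are assumptions made for this sketch; the source does not state how samples were assigned to each subset.

```python
import numpy as np

def split_indices(n_samples, train=0.75, val=0.10, seed=0):
    """Shuffle sample indices and split them 75/10/15 (test takes the remainder)."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

# 2511 samples, as in the IMU tensor example given earlier.
train_idx, val_idx, test_idx = split_indices(2511)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting by index keeps the data tensor and the label array aligned, since both can be indexed with the same arrays.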

E. TIME SERIES SIGNALS
The dataset is a multichannel time series. Each channel relates to a location, sensor, and axis. The first row in Figure 8 is from the arm, gyroscope, and x-axis.
IMU signals are shown on the left of Figure 8 and visual marker signals on the right. Signals from both sensors are pseudo-periodic with square-like waveforms. Some channels allow us to distinguish movements easily. Some channels are highly correlated. All channels are bounded in amplitude and do not show outliers.
Different exercises show responses on different sensors, locations, and axes. During elbow flexion-extension, the arm remains locked, and the forearm moves. The sensors located at the forearm show the most movement while the arm sensors remain still. To recognize different kinds of movements, it is important to have independent information sources, e.g., gyro and accelerometer, located in at least two different positions. The datasets can be accessed through the GitHub repository of this work.
Each exercise repetition depicts a square-like waveform and builds up a pseudo-periodic signal as in Figure 9. All studied arm exercises have two stages: flexion and extension. These two stages are reflected in a square-like waveform with two levels: a low level for extension and an upper level for flexion. When the limb moves from flexion to extension, a slope is visible in the signal. The two levels and the slope are easily recognizable in the signals of IMU and visual markers.
Given the square-like waveform, we can identify the movement duration. For example, in dataset 1 the time needed to perform an exercise was 90 seconds while in dataset 2 the time was 2 seconds. The window size has to be chosen correctly according to the movement duration and sampling rate.
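The relation between movement duration, sampling rate, and window size is simple arithmetic, sketched below. The function `window_size_for` and its `repetitions` parameter are hypothetical names introduced only for this example.

```python
def window_size_for(duration_s, sampling_rate_hz, repetitions=1):
    """Number of data points needed to cover `repetitions` movements."""
    return max(1, round(duration_s * sampling_rate_hz * repetitions))

# A 2-second movement sampled at 50 Hz spans 100 data points,
# while the same movement sampled at 25 Hz spans only 50.
print(window_size_for(2, 50))
print(window_size_for(2, 25))
```

A window much shorter than this value sees only part of a movement, while a much longer one mixes several repetitions or exercises in a single sample.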
Figure 10 shows IMU signals of ten repetitions of elbow flexion-extension. The IMU is located at the arm and includes a gyroscope, an accelerometer, and a magnetometer. Each movement repetition depicts a square-like waveform in the gyro and magnetometer. The square-like waveform has a lower level for flexion and an upper level for extension. Gyro and magnetometer data are easily interpretable to recognize an exercise. The accelerometer shows peaks going up and down, making it harder to distinguish the movement being performed.
The x and z axes of the arm gyroscope are highly correlated. The y and z axes of the arm magnetometer are correlated, as are the x and y axes of the arm accelerometer.
The IMU signals show smaller amplitude differences than the visual markers, as presented in Figures 10 and 11. Gyroscope and magnetometer data show a cleaner square waveform than the visual marker data. Accelerometer data is centered at zero.
Figure 11 shows arm marker signals of ten repetitions of elbow flexion-extension. There are three markers located at the arm, each one with three axes. The x-axis position signals from the markers at the lower, middle, and upper arm are highly correlated. The same behavior occurs for the y-axis and z-axis signals of the markers at the lower, middle, and upper arm.
Position signals have a different offset at each repetition, and the offset drifts over time. This makes the exercises harder to distinguish than with the IMU signals. The offsets describe the position of the person relative to the camera system. When the person moves inside the measured area, the position signal changes. The height and arm size of the person also affect the signal values.
Similar to IMU data, marker data show two levels. We can observe a higher signal level when the forearm is up and a lower level when the arm is down.
Marker signals are affected by occlusion, the angle and distance to the cameras, and the posture and height of the person. IMU is not affected by these variables. We can observe that the lower, middle, and upper marker signals at the arm and forearm are similar because the markers sit on the same rigid segment. To show the signal correlation of IMU and marker data, we computed the correlation matrices using all the available features. From Figure 12, we note that MAS signals are highly correlated.
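A correlation matrix of this kind can be computed directly from a (timesteps, features) array. The sketch below uses synthetic data, not the recorded datasets: three hypothetical markers on one rigid segment follow the same square-like motion with different offsets, and a fourth channel is unrelated noise.

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlation between the columns of a (timesteps, features) array."""
    return np.corrcoef(data, rowvar=False)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
base = np.sign(np.sin(2 * np.pi * 0.5 * t))  # square-like movement signal
# Three synthetic "markers" on one rigid segment: same motion, different offsets.
markers = np.column_stack(
    [base + offset + 0.05 * rng.normal(size=t.size) for offset in (0.0, 0.3, 0.6)]
)
independent = rng.normal(size=(t.size, 1))  # an unrelated channel
corr = correlation_matrix(np.hstack([markers, independent]))
print(np.round(corr, 2))
```

The constant offsets do not change the correlation, so the three marker channels are almost perfectly correlated with each other, which mirrors the redundancy observed among markers placed on the same segment.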

IV. METHODOLOGY

A. PREPROCESSING
Before training, all datasets were preprocessed following these steps: normalization, reshaping into 3D arrays with dimensions (samples, window size, features), and label encoding. Additionally, in every training process, the dataset was split into 5 parts using k-fold cross validation [38] to alleviate the effects of small datasets and class imbalance.
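The preprocessing steps can be sketched with NumPy as follows. Several details are assumptions made for illustration: the source does not specify the normalization, so per-feature z-scoring is used here; one-hot encoding is chosen because it matches a categorical cross entropy loss; and the folds are plain shuffled folds rather than stratified ones. All function names are hypothetical.

```python
import numpy as np

def zscore(X):
    """Normalize each feature over all samples and timesteps. X: (samples, window, features)."""
    mean = X.mean(axis=(0, 1), keepdims=True)
    std = X.std(axis=(0, 1), keepdims=True)
    return (X - mean) / np.where(std == 0, 1.0, std)

def one_hot(labels):
    """Map arbitrary labels to one-hot rows; returns (encoded, sorted class list)."""
    classes, idx = np.unique(labels, return_inverse=True)
    return np.eye(len(classes))[idx], classes

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, near-equal folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

X = np.random.default_rng(1).normal(5.0, 2.0, size=(100, 200, 24))
y = np.array(["flexion", "extension"] * 50)
Xn = zscore(X)
Y, classes = one_hot(y)
folds = list(kfold_indices(len(Xn)))
print(Xn.shape, Y.shape, classes, len(folds))
```

Each fold serves once as the held-out part while the remaining four are used for training, so every sample is evaluated exactly once.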

B. ARCHITECTURES
For training and testing, eight neural network architectures were considered and evaluated. The main components of the architectures are LSTM, 1D convolutional layers, and transformer encoders. All architectures used categorical cross entropy as their loss function and Adam as their optimizer.
Long Short-Term Memory (LSTM) is a recurrent neural network layer used in deep learning architectures. It comprises memory cells and gates that allow the network to selectively store or discard information over time [39]. The basic LSTM cell consists of three gates. The input gate determines how much new information is added to the memory cell, the forget gate decides how much old information should be removed, and the output gate regulates the amount of information outputted from the cell. Additionally, each gate has its own set of learnable parameters, allowing the network to adaptively adjust the amount of information stored or discarded based on the input data, making LSTM effective in processing sequential data [40].
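To make the role of the three gates concrete, the following NumPy sketch implements one time step of a standard LSTM cell and runs it over a window of sensor data. It is a textbook formulation with random, untrained weights; the parameter layout (gates stacked in the order input, forget, output, candidate) and the sizes used are assumptions of this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*hidden, features), U: (4*hidden, hidden), b: (4*hidden,)
    hold the stacked parameters of the input, forget, and output gates
    plus the candidate cell update, in that order.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate: how much new information enters
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate: how much old memory is kept
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much of the cell is exposed
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values for the memory cell
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
features, hidden, window = 24, 8, 200
W = rng.normal(scale=0.1, size=(4 * hidden, features))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(window, features)):  # run over one window of sensor data
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape)
```

The final hidden state `h` summarizes the whole window and is what a following dense layer would classify.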
One-dimensional (1D) convolution is a mathematical operation frequently used in signal processing and deep learning, which involves sliding a small window or kernel over a one-dimensional input signal, computing the dot product between the kernel and the signal at each position, and generating a new output signal. The output signal is a compressed representation of the input signal, highlighting patterns and features relevant to the task at hand [41]. 1D convolutional layers can also be used to process sequential data by learning a set of filters [42].
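The sliding dot product can be written out explicitly. The sketch below is a minimal single-channel example with a hand-picked kernel; in a trained network the kernel values are learned, and deep learning frameworks apply the kernel without flipping it, which is the convention followed here.

```python
import numpy as np

def conv1d(signal, kernel):
    """'Valid' 1D convolution as used in deep learning (cross-correlation):
    slide the kernel over the signal and take a dot product at each position."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

# A square-like signal and a difference kernel that responds to level changes.
signal = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=float)
kernel = np.array([-1.0, 1.0])
edges = conv1d(signal, kernel)
print(edges)
```

The output is nonzero only where the signal changes level: positive at the rising edge and negative at the falling edge, which is the kind of feature a filter can extract from square-like exercise signals.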
The Transformer architecture is a framework typically used for natural language processing (NLP), but it can also be applied to other sequential data because of its attention mechanism. The attention mechanism extracts information from the entire sequence by using a weighted sum of all the past states of the encoder, generating a matrix. This means that all parts of the sequence are weighted by their actual importance and the overall context is considered, prioritizing the elements (words, in NLP) with higher weight and allowing the model to focus on the right element of the input to predict the next element of the output [43], [44]. This attention mechanism is improved by using multi-head attention, which applies self-attention to different segments of the input, allowing the transformer to have better discrimination capabilities. As each head produces its own resulting matrix, all matrices are concatenated and multiplied by an additional weight matrix, generating an output matrix that contains information from all the heads [43], [45].
The code can be found in our GitHub repository. The summary of the structures of the architectures is presented in Figure 13. The architectural description used in this work is presented below.
• Architecture 1 (LSTM + Dropout + Dense + Dense): This architecture was based on "Implementing LSTM for Human Activity Recognition using Smartphone Accelerometer data" [46]. The architecture comprises an LSTM layer of 128 neurons, a dropout layer, and two fully connected layers. The LSTM layer allows for time series data analysis due to its ability to handle variable-length input sequences, noisy data, and missing data. Thus, the network can make accurate predictions based on past observations. This network was originally evaluated with the Wireless Sensor Data Mining (WISDM) dataset, obtaining an accuracy of 96.20%.
• Architecture 2 (LSTM + Dropout + LSTM + Dropout + Dense): This architecture was based on the GitHub repository "Human-Activity-Recognition" [47]. This architecture is an extension of the previous one, adding a second LSTM layer after the first dropout layer, followed by another dropout layer and then a fully connected layer. This network was originally evaluated with the UCI-HAR dataset, obtaining an accuracy of 93.17%.
• Architecture 3 (Conv1D + Conv1D + Dropout + Max Pooling + Flatten + Dense + Dense): This architecture was taken from the GitHub repository "ETFA-Workshop" [48]. In contrast to the previous architectures, this network focuses on using convolutional layers. The architecture comprises two 1D convolutional layers of 64 filters, a dropout layer to avoid overfitting, a max pooling layer to reduce dimensionality, and a final section of a flatten layer and two fully connected layers. Convolutional networks can be effective for time series analysis, as they are good at extracting features from the input data, which can be useful for identifying patterns and trends in the data. Also, convolution allows downsampling the data, reducing the computational complexity of the model and making it easier to train and run. This network was originally evaluated with the UCI-HAR dataset, obtaining an accuracy of 89.89%.
• Architecture 4 (Conv1D + Max Pooling + LSTM + Dropout + Dense + Dense): This architecture was assembled empirically as a combination of convolution and LSTM, to analyze the effectiveness of putting convolutional layers at the start of an LSTM network. As mentioned previously, convolution is useful for extracting features and reducing the complexity of the data, along with helping to reduce the amount of noise and irrelevant information. This way, the LSTM layer can work with more refined data and get better results.
• Architecture 5: [...] dropout and two dense layers. In addition to the benefits of using convolutional layers at the start of the architecture, the bidirectional LSTM allows the network to capture information from past and future inputs, helping it better capture dependencies and relationships between different parts of the sequence. This network was originally evaluated with the UCI-HAR dataset, obtaining an accuracy of 95.48%.
• Architecture 6 (LSTM + Dropout + Reshape + Conv1D + Dropout + Dense + Dense + Dense): This architecture was assembled empirically to test the performance and effects of placing convolutional layers at the end of the architecture instead of at the start.
• Architecture 7 (LSTM + Dropout + Dense + Dense + Dense): A simple LSTM architecture with 64 neurons in the first layer, a dropout layer, and three dense layers. The purpose of this network is to test how LSTM performs on its own, without any particular enhancements.
• Architecture 8 (Normalization + Position Embedding + Transformer Encoder + Normalization + Dense): This architecture was based on the proposed model and findings of the paper "Wearable Sensor-Based Human Activity Recognition with Transformer Model" [49], which used a unidirectional Transformer-based architecture as an improvement for HAR. The network is composed of a normalization layer, a positional embedding layer coupled with a sum of weights, a transformer encoder, another normalization layer, and a fully connected layer. The encoder itself contains more layers: first, there is a normalization layer, then a multi-head attention layer that does the main work, and a dropout layer. Then, a sum of weights processes the results obtained previously, which go to a normalization layer, a feedforward network, and a dropout layer. Before exiting the encoder, a final sum of weights is performed. Using this architecture, the authors take advantage of the benefits of using multi-head attention, described previously. This network was originally evaluated with the KU-HAR dataset, obtaining an accuracy of 99.20%.
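To make the layer descriptions more concrete, the following sketch traces how the data shape changes through the convolutional part of Architecture 3. It is only an illustration: the kernel size of 3, the pool size of 2, the unpadded ("valid") convolution, and the 200-point, 24-feature input window are assumptions chosen for the example, since the text specifies only the size of 64 for each convolutional layer.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of an unpadded ('valid') 1D convolution along time."""
    return (length - kernel_size) // stride + 1


def max_pool_out_len(length, pool_size):
    """Output length of non-overlapping max pooling (stride == pool_size)."""
    return length // pool_size


def trace_architecture_3(window_size, n_features, filters=64,
                         kernel_size=3, pool_size=2):
    """Return (layer name, output shape) pairs up to the flatten layer.

    kernel_size and pool_size are assumed values, not taken from the paper.
    """
    shapes = [("input", (window_size, n_features))]
    length = conv1d_out_len(window_size, kernel_size)
    shapes.append(("conv1d_1", (length, filters)))
    length = conv1d_out_len(length, kernel_size)
    shapes.append(("conv1d_2", (length, filters)))
    # Dropout does not change the shape.
    shapes.append(("dropout", (length, filters)))
    length = max_pool_out_len(length, pool_size)
    shapes.append(("max_pooling", (length, filters)))
    # Flatten merges the time and channel axes into one vector.
    shapes.append(("flatten", (length * filters,)))
    return shapes


for name, shape in trace_architecture_3(window_size=200, n_features=24):
    print(f"{name:12s} {shape}")
```

Under these assumptions, the pooling layer halves the time axis before the dense layers, which is the dimensionality reduction mentioned in the description of Architecture 3.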

C. FEATURE SELECTION
Because many of the available features (at most 36) are correlated, we propose finding the most important ones using random forest feature selection. After the feature selection, we evaluated the 8 architectures with the reduced number of features. Finding the most important features for IMU and MAS is relevant because it reduces the computational complexity, the number of physical sensors required, sensor information redundancy, and possible overfitting. Figure 12 shows a higher correlation in MAS than in IMU signals because of the three markers located at each position.
The algorithm used for feature selection, random forest (RF) [50], [51], is an ensemble learning method that combines multiple decision trees to improve the accuracy of the model. The algorithm creates an ensemble of decision trees during training, where each tree is trained on a different subset of data and features, i.e., bootstrap aggregating (bagging). The majority vote of the individual trees determines the final output of the algorithm. Each node in a tree makes a binary decision according to a single feature. RF identifies the features that best split the data and ranks them by importance score [52].
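The two ensemble ingredients described above, bootstrap resampling and majority voting, can be sketched in a few lines. This is a minimal illustration and not the implementation used in this work; the function names, the toy rows, and the example labels are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # stand-in for training samples
print(bootstrap_sample(rows, rng))

# Hypothetical votes from three trees for one window.
print(majority_vote(["flexion", "pronation", "flexion"]))
```

Each tree sees a different resampled training set, so the trees disagree on some windows, and the vote smooths out their individual errors.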
A random forest was trained on every dataset. Feature selection was based on the importance score from the random forest. To select the number of important features for training, we began with a small number of features ranked by importance. Then, we added features until the model no longer showed any improvement in accuracy or showed signs of overfitting. Tables 8 to 10 show the best sensors and locations found and their score for each dataset. Figure 15 shows the training curves of the best architectures for every dataset using only the best features.
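The incremental selection procedure described above can be sketched as follows. This is a simplified illustration: it stops at the first feature that brings no accuracy gain and does not model the overfitting check. The function `select_features`, the feature names, the importance scores, and the accuracies are all hypothetical and are not results from this work.

```python
def select_features(importances, evaluate, start=1):
    """Add features in decreasing importance until accuracy stops improving.

    importances: dict mapping feature name -> importance score.
    evaluate: callable that takes a list of feature names and returns accuracy.
    """
    ranked = sorted(importances, key=importances.get, reverse=True)
    selected = ranked[:start]
    best_accuracy = evaluate(selected)
    for feature in ranked[start:]:
        accuracy = evaluate(selected + [feature])
        if accuracy <= best_accuracy:
            break  # no further gain, stop adding features
        selected = selected + [feature]
        best_accuracy = accuracy
    return selected, best_accuracy


# Hypothetical importance scores and accuracies, for illustration only.
toy_importances = {"acc_x": 0.30, "acc_y": 0.25, "gyro_z": 0.20,
                   "mag_x": 0.15, "mag_y": 0.10}
toy_accuracy_by_count = {1: 0.80, 2: 0.90, 3: 0.97, 4: 0.97, 5: 0.96}


def toy_evaluate(features):
    """Stand-in for training and testing a model on the given features."""
    return toy_accuracy_by_count[len(features)]


selected, accuracy = select_features(toy_importances, toy_evaluate)
print(selected, accuracy)
```

In practice, `evaluate` would train and test one of the architectures on the chosen feature columns, which makes each step expensive; ranking by importance first keeps the number of such evaluations small.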

V. RESULTS
We trained models using the 8 architectures and the 7 subdatasets. Our metrics were test accuracy, precision, recall, and F1 score. Each metric was the average of 5 repetitions using k-fold cross validation. The analysis of the results needs to consider the exercise duration, sampling rate, and window size.
Table 7 shows the summary of the test accuracy.
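For reference, the four metrics and their averaging over folds can be computed as in the sketch below. The macro averaging of precision, recall, and F1 across classes is an assumption, since the text does not state which averaging was used; the function names and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {"accuracy": accuracy, "precision": sum(precisions) / n,
            "recall": sum(recalls) / n, "f1": sum(f1s) / n}


def mean_over_folds(fold_metrics):
    """Average each metric over the folds (or repetitions)."""
    keys = fold_metrics[0].keys()
    return {k: sum(m[k] for m in fold_metrics) / len(fold_metrics)
            for k in keys}


# Hypothetical labels for one fold.
y_true = ["flexion", "flexion", "pronation", "pronation"]
y_pred = ["flexion", "pronation", "pronation", "pronation"]
print(classification_metrics(y_true, y_pred))
```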

VI. DISCUSSION
In this section, we present the most relevant results and how they relate to previous works.
Test accuracy summary: Table 7 includes the capture conditions and test accuracy for all the experiments. We discuss how the sampling rate, window size, and exercise duration affect the test accuracy.
In Dataset 1, we have IMU sampling at 1 Hz and MAS at 200 Hz. IMU data at 1 Hz can distinguish the exercises because the participants took 90 seconds to complete each exercise repetition. They stayed approximately 40 seconds at flexion and 40 seconds at extension. The change from flexion to extension took 3 seconds. The exercises are elbow flexion-extension and elbow pronation-supination. These exercises mainly involve the movement of the hand and forearm. The best features were accelerometers located at the hand. This suggests we only need a wrist smartwatch equipped with an accelerometer and gyroscope, as in [26], where the authors suggest using a single accelerometer.
Dataset 1 IMU has 27 features and 4 selected features, both with 97% acc. Dataset 1 MAS has 36 features and 15 selected features, both with 89% acc. The reduced accuracy in MAS is related to occlusion during the pronation-supination exercise. In addition, the MAS system cannot correctly detect small movements.
There is no improvement from using fewer features because of the problems with occlusion and small movements. More features were needed from MAS than from IMU because MAS does not have enough information to distinguish the exercises.
In Dataset 2, we have IMU sampling at 370 Hz and MAS at 200 Hz. Both rates achieve a test accuracy of 99%. We have a good amount of data and used a window size of 200 points. Dataset 2 contains fully extended arm exercises, which are easily distinguished due to visible markers and large movements. This is the only dataset with acceleration estimated from the MAS system. Using only 3 position features we achieve 99% accuracy; the good performance is related to the exercise, not to the use of acceleration from MAS. MAS data could be augmented with the marker velocities as an additional feature. IMU data could be augmented with additional features from the accumulation and derivative. Dataset 2 IMU has 24 features and 6 selected features, both with 99% acc. Dataset 2 MAS has 24 features and 3 selected features, both with 99% acc. The selected features were located at the arm, forearm, and shoulder. IMU features at the hand were not selected because those sensors were affected by large accelerations and were not able to distinguish the exercises. Hand acceleration due to different arm lengths did not show any effect on the exercise classification.
In Dataset 3, we have MPU9250 sampling at 1 Hz, Metawear at 50 Hz, and MAS at 50 Hz. This dataset contains the exercises from Dataset 1 plus those from Dataset 2. D3 MPU9250 at 1 Hz is the first dataset that shows poor overall performance, with 13% to 85% accuracy. This is because sampling at 1 Hz is too low for a movement that takes 4 seconds to complete. We used a window size of 40 points, which captures multiple exercise repetitions and produced poor accuracy. Metawear showed low performance, 68% to 92%, because the system presented data transmission problems. The MAS system did not show any problems with a sampling rate of 50 Hz and a window size of 100 points. This result shows we can use low sampling rates of around 50 Hz.
Dataset 3 MPU9250 has 27 features with 84% acc and 5 selected features with 89%. D3 Metawear has 18 features and 12 selected features with 92% acc. D3 MAS has 36 features with 99% acc and 15 selected features with 98% acc. MAS requires more features than IMU because of the complexity of the exercises, marker occlusion, and difficulty in detecting small movements.
Training curves: During training, it takes at most 100 epochs to achieve a steady state of accuracy and loss. All datasets show loss reduction and accuracy increase over time.
Failure cases: The most challenging datasets were D2-Trigno and D3-MPU9250. D3-MPU9250 showed poor test accuracy, which is related to a sampling rate of 1 Hz and a window size of 40 points. The sampling rate is too slow for a movement that takes 4 seconds to complete. The window size is too long, capturing multiple repetitions in a single window. Capturing multiple exercise repetitions in a single window leads to poor performance because aligned signals are easier to recognize. The accuracy of the challenging datasets increased using selected features. The best architectures stand out by achieving better performance on challenging datasets.
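For readers unfamiliar with window extraction, the sketch below shows a fixed-size sliding window over a multi-channel recording. It is a minimal illustration: the `step` parameter (and hence the overlap between windows) and the two-channel toy recording are assumptions, since the text reports only the window sizes.

```python
def sliding_windows(samples, window_size, step):
    """Split a list of per-timestep rows into fixed-size windows.

    samples: list of rows, one row of feature values per timestep.
    Trailing samples that do not fill a whole window are dropped.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(samples) - window_size
    return [samples[s:s + window_size]
            for s in range(0, last_start + 1, step)]


# Toy recording: 100 timesteps, 2 hypothetical channels.
recording = [[t, 2 * t] for t in range(100)]
windows = sliding_windows(recording, window_size=40, step=20)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

With a fixed window, the number of repetitions covered by one window depends entirely on the sampling rate and the exercise duration, which is why a 40-point window at 1 Hz spans several 4-second repetitions.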
Confusion matrices: The confusion matrices reflect the correct and incorrect classifications for each dataset using the best features and the best model. The hardest exercise to recognize was elbow pronation-supination for the MAS dataset.
Effect of window size, sampling rate, and exercise duration: We found the window size should be variable because it is proportional to the sampling rate and exercise duration. For example, in D1-MPU9250, a window size of 10 points is enough to recognize an exercise sampled at 1 Hz with a duration of 90 points. In this case, the ratio of window size to exercise duration is 10/90 ≈ 0.11. In D3-MPU9250, a window size of 40 points was too big to recognize an exercise sampled at 1 Hz with a duration of 4 points. In this case, the ratio of window size to exercise duration is 40/4 = 10. A window should capture less than a single repetition to achieve good accuracy. In our results, ratios below 0.5 (50%) achieved high accuracy.
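The two ratios quoted above can be reproduced with a short calculation. The helper name `window_ratio` is hypothetical; the window sizes, sampling rates, and durations are the values given in the text.

```python
def window_ratio(window_size, sampling_rate_hz, exercise_duration_s):
    """Ratio of window length to one exercise repetition, both in points."""
    exercise_points = sampling_rate_hz * exercise_duration_s
    return window_size / exercise_points


# (window size in points, sampling rate in Hz, exercise duration in s)
configs = {
    "D1-MPU9250": (10, 1, 90),
    "D3-MPU9250": (40, 1, 4),
}
for name, (window, rate, duration) in configs.items():
    ratio = window_ratio(window, rate, duration)
    print(f"{name}: ratio = {ratio:.2f}, below 0.5: {ratio < 0.5}")
```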
We found the sampling rate should be proportional to the exercise duration. In daily activities, a muscular-focused exercise or a full-body exercise takes around 2 seconds to perform. To capture a well-defined shape of the movement, we found the sampling rate should be around 60 Hz. A sampling rate of 1 Hz was insufficient to correctly recognize exercises. Sampling rates of 200 and 370 Hz did not show any improvement and generated too much information.
Best model architectures: The best architectures were the same using all features and selected features. The 3 best architectures were: 2LayerConv 93.8% acc, 2LayerConv+BiLSTM 92.3% acc, Conv+LSTM 90% acc. The networks analyzed are divided into LSTM first, CNN first, and Transformer.
The best architectures have two convolutional layers at the beginning. The networks are learning filters with the shape of the multiple-channel signals. During the signal analysis performed in the time series section, we noted that each exercise shows a well-defined shape.
LSTM first networks achieved lower test accuracy on the challenging datasets.
Convolution + BiLSTM has good performance, showing the importance of convolution and of bidirectional learning on LSTM. The Transformer did not achieve good performance alone. The transformer was a unidirectional encoder. Accuracy could be improved using a bidirectional decoder architecture.
Feature reduction: Using as few as 4 features, IMU and MAS systems achieved high accuracy above 89%. Feature reduction was feasible without affecting accuracy.
To achieve good accuracy with few features, a single IMU with accelerometer and gyroscope should be located at the wrist. Therefore, we recommend a single IMU in a smartwatch. Feature selection shows the IMU acceleration variable is the most representative, similar to [16], [17], [18], [23], and [26]. The test accuracies above 89% found for our datasets using the 8 models were similar to the accuracies found in [26], [46], [48], [49], and [53].

VII. CONCLUSION
The duration of daily exercises is variable, so we need a variable window size. The window size is proportional to the equipment sampling rate and exercise duration. In our experiments, a sampling rate of 60 Hz was able to distinguish arm movements.
State-of-the-art deep learning models were able to correctly classify exercises. The best architectures were 2LConv 93.8%, 2LConv+BiLSTM 92.3%, and Conv+LSTM 90%. Feature selection led to a reduction to as few as 4 features for IMU or MAS, with accuracies higher than 89%.
IMU allows measuring the activity of people simultaneously in any environment, even in water. IMU is portable and can be used in all kinds of daily activities. MAS needs an equipment room, a complex video capture system, and wearable markers. The use of cameras interferes with people's privacy inside the everyday environment.
The IMUs achieve high accuracy, low cost, and noise reduction compared to other instruments such as EEG or EMG, where the signals are naturally mixed from the source. The accuracy improves using independent sources, such as independent axes in accelerometers and gyroscopes [6], [22]. In MAS, the visual markers have problems related to occlusion, distance, size of movements, and out-of-plane movements [36], e.g., pronation-supination and shoulder rotation.
From our experiments detailed in Table 7, we found a relation for estimating the window size, given by Equation 1. Two-second exercises captured at 60 Hz will require window sizes between 30 and 60 points.
$$\frac{\text{window size}}{\text{exercise duration (points)}} < 0.5, \qquad \frac{\text{window size}}{\text{exercise duration (seconds)} \times \text{sampling rate}} < 0.5 \tag{1}$$
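As a worked instance of the relation above, substitute the two-second exercise captured at 60 Hz. The symbols $w$ (window size in points), $d$ (exercise duration in seconds), and $f_s$ (sampling rate in Hz) are shorthand introduced here for illustration only.

```latex
\[
\frac{w}{d \, f_s} < 0.5
\quad\Longrightarrow\quad
w < 0.5 \times 2\,\mathrm{s} \times 60\,\mathrm{Hz} = 60 \text{ points}
\]
```

One repetition spans $2 \times 60 = 120$ points, so the upper end of the 30 to 60 point range corresponds to a ratio of 0.5 and the lower end to a ratio of 0.25.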

VIII. FUTURE WORK
A limitation of this work was the number of persons involved, which does not reflect the behavior of the signals for a population sample. The 10 persons chosen for this study represent only a test group to validate the capture methodology and evaluate the performance of deep learning architectures. The people involved had different arm lengths, and the differences in acceleration did not cause misclassifications. Another limitation of this work was the reduced number of samples captured, at most 540k timesteps for an exercise.
It is important to capture data about the elderly in their daily lives. The data needs to cover multiple people over several weeks. To evaluate overfitting and to perform robust statistical accuracy tests, we need larger datasets with more people and more exercises, including full-body exercises.
With larger datasets, we can identify better algorithms for data extraction and models to recognize a wider range of movements. To achieve robust models, we need to perform data augmentation by adding noise, scaling, offsets in time, and random signal erasing. To have a better window extraction, we propose replacing the fixed sliding window with dynamic time warping; wavelet transform and spectrogram for scale and location; anchors as in image segmentation (different-sized sliding windows); or sliding window alignment by regression as in image segmentation. Feature reduction suggested it is possible to achieve high accuracy with few sensors.
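The four augmentations named above could look like the following for a single-channel signal. This is a sketch of one possible reading, not a procedure evaluated in this work: Gaussian noise, a circular shift for the time offset, and zero filling for the erased segment are all assumptions, as are the function names and parameter values.

```python
import random


def add_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def scale(signal, factor):
    """Multiply the amplitude by a constant factor."""
    return [x * factor for x in signal]


def time_shift(signal, offset):
    """Circularly shift the signal by `offset` samples."""
    offset %= len(signal)
    if offset == 0:
        return list(signal)
    return signal[-offset:] + signal[:-offset]


def random_erase(signal, length, rng, fill=0.0):
    """Replace a randomly placed segment of `length` samples with `fill`."""
    start = rng.randrange(0, len(signal) - length + 1)
    return signal[:start] + [fill] * length + signal[start + length:]


rng = random.Random(42)
window = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
augmented = random_erase(
    time_shift(scale(add_noise(window, 0.05, rng), 1.1), 2), 2, rng)
print(augmented)
```

For multi-channel windows, the same shift and erase positions would normally be applied to all channels of a window so that the channels stay aligned in time.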
We propose to use a single IMU located at the wrist, worn as a smartwatch. The challenge is to recognize exercises with a single IMU and implement a variable window size.
To classify exercises with a single IMU we need to capture the variations in signal amplitudes, shapes, and durations.
A single IMU smartwatch solution with HAR capabilities is interesting for elderly care, athletes, and healthcare. The solution involves a smartwatch to capture and send data to a processing unit. The data necessary would be at 60 Hz and 6 features, making this solution feasible.
Our main contributions were:
• Capture methodology and comparison for IMU and MAS acquisition systems.
• We analyzed the relations between 8 architectures accuracies, signal waveforms, signals correlation, sampling rate, exercise duration, and window size.
• Feature reduction analysis.
• Deep learning models were able to recognize human exercises. The next challenge is to classify using a single IMU and use a variable window size.

FIGURE 2. Visual markers and cameras distribution.

FIGURE 5. Horizontal shoulder flexion-extension. (a) Flexion with an adduction of 140°, (b) 90° abduction in the frontal plane [37].

• Vertical Shoulder flexion-extension: This movement begins with the upper limb close to the trunk. During flexion, the arm moves frontally in a vertical manner until the upper limb reaches a horizontal position. In extension, it returns to the starting position. Figure 6 shows the exercise.

FIGURE 9. Visual marker at the forearm. A single period of elbow flexion-extension movement. The blue line is the flexion state, and the red line is the extension state.

FIGURE 10. Dataset 1. MPU9250 located at the arm during elbow flexion-extension. Plot of accelerometer, gyroscope, and magnetometer signals in the x, y, and z axes.

FIGURE 11. Dataset 1. Visual markers located at the lower, middle, and upper arm during elbow flexion-extension. Plot of the marker position signal in the x, y, and z axes.

FIGURE 13. Graphical representation of the eight architectures.

FIGURE 14. Plot of best training curves for every dataset using all features.

FIGURE 15. Plot of best architectures for every dataset using best features.

FIGURE 16. Confusion matrix of the best-performing architecture of every dataset (best features).

TABLE 1. Summary of related works on HAR.

TABLE 2. Datasets and instruments.

TABLE 4. Movement analysis system features.

TABLE 5. Summary of datasets capture conditions.

TABLE 6. Summary of sampling rate, exercise duration, and window size.

Confusion matrices are shown in Figure 16. Training curves are shown in Figures 14 and 15. The best features are listed in Tables 8 to 10. The best architectures for each dataset are listed in Table 11, and the best architectures using only the best features in Table 12.

TABLE 7. Mean test accuracy for all models. All features and best features. Datasets description.

TABLE 8. Best features for dataset 1, ordered by score.

TABLE 9. Best features for dataset 2, ordered by score.

TABLE 10. Best features for dataset 3, ordered by score.

TABLE 11. Results for all datasets using all features.

TABLE 12. Best results for all datasets using best features.