Light Residual Network for Human Activity Recognition using Wearable Sensor Data

This letter addresses the problem of human activity recognition (HAR) of people wearing inertial sensors, using data from the UCI-HAR dataset. We propose a light residual network that obtains an F1-Score of 97.6%, outperforming previous works while drastically reducing the number of parameters by a factor of 15, and thus the training complexity. In addition, we propose a new benchmark based on leave-one-(person)-out cross-validation to standardize and unify future classifications on the same dataset, and to increase the reliability and fairness of comparisons.

This letter focuses on the HAR problem using data from wearable inertial sensors, applied to the classification of activities of daily living (ADL). Recent approaches to this problem have applied deep learning techniques, obtaining high classification rates [19], [20], [21], [22], [23], [24], [25], [26], [27]. However, deep learning approaches demand high computational power, long training times, and high energy consumption [28], [29]. A reduction in those demands is desirable, in particular for applications based on wearables, smartphones, or Internet of Things technologies [30].
In this letter, we introduce a light residual network for the HAR problem using temporal data from wearable inertial sensors. Our architecture is a modification of ResNet18 [31], in which we have reduced the number of residual blocks and adapted the kernels to the 1-D nature of the temporal signals provided by inertial sensors. As a result, our model drastically reduces the number of trainable parameters from several million to 234 950, improving efficiency and reducing complexity. The full architecture is shown in Fig. 1.
We tested our approach on the popular ADL dataset UCI-HAR [4], which is one of the most cited in the UC Irvine Machine Learning Repository [32]. Our classification results outperform previous approaches while reducing the complexity of the model, the training time per epoch, and thus the energy consumption.
Previous works using the UCI-HAR dataset use the original fixed division of participants for training and testing, which limits the insight into per-participant performance. Therefore, we propose a benchmark based on leave-one-out cross-validation (LOOCV), iteratively leaving one participant out for testing while training with the rest, which is a more standardized way of comparing results in these kinds of problems. We believe this benchmark will better unify comparisons and will increase their reliability and fairness.
In summary, the contributions of this letter are threefold. First, we present a light architecture that outperforms previous deep learning works on the UCI-HAR dataset. Second, our simplified architecture drastically reduces the complexity of the model. Third, we propose a standard benchmark to unify future comparisons.

II. RELATED WORK
In particular, the work in [23] presents a comparison between automatic features from a CNN and human-crafted features (HCFs), and confirms that CNN-based features provide performance comparable to the best set of HCFs. Moreover, Ronald et al. [26] presented a CNN residual network with a modified inception module to improve the predictions. Also, a bidirectional LSTM is proposed in [20] to explore the impact of temporal features on the classification performance. Similarly, the work in [22] defines a model based on stacked LSTMs, which improves performance. In [24], a residual bidirectional LSTM is proposed for better temporal feature extraction, improving the classification while avoiding the vanishing gradient problem.
The work in [21] introduces a hybrid CNN+LSTM model for temporal feature extraction, obtaining better results than models based on LSTM, LSTM+dense layers, and CNN+LSTM+dense layers. In [25], a comparison among different hybrid models showed the best results for a combination of CNN+LSTM with a self-attention mechanism, which keeps a good balance between performance and the number of parameters. Still, our light model has fewer parameters than [25] while achieving better classification results.
A parallel two-branch model is presented in [19], where the first branch uses residual attention blocks for spatial feature extraction, and the second branch applies a bidirectional GRU with self-attention for the temporal features. While the classification rates are high, the number of parameters remains very high (1.6 million). Finally, the work in [27] proposes a multifrequency channel attention framework combined with residual networks and obtains the best classification results so far. In comparison, our light model slightly outperforms [27] while reducing its number of parameters by a factor of 15.
Our light architecture outperforms all previous methods while drastically reducing the number of trainable parameters to 234 950, improving efficiency and reducing complexity.

III. HAR USING INERTIAL SENSORS
In this letter, we address the HAR problem using temporal data from inertial sensors worn by a person while carrying out different ADLs. For this, we use the UCI-HAR dataset [4], which contains inertial data from different participants executing different activities. We focus on the six activities in the dataset: walking (WA), walking upstairs (WU), walking downstairs (WD), sitting (SI), standing (ST), and laying (LA). The dataset contains data from 30 participants who wore a smartphone on their waist while performing the activities. The temporal signals were obtained from the inertial sensors inside the smartphone and are composed of nine 1-D signals: triaxial acceleration from the accelerometer, triaxial estimated body acceleration, and triaxial angular velocity, all sampled at a frequency of 50 Hz. The signals were preprocessed using a Butterworth low-pass filter to separate body acceleration and gravity. Each 1-D temporal signal is divided into windows of 2.56 s containing 128 samples each, with an overlap of 50%. In total, the dataset contains 10 299 signal windows of 128 samples each [4].
The signal from each inertial sensor is represented as $x_w^c = (x_{w,1}^c, \ldots, x_{w,|S|}^c)$, with $s \in S = \{1, \ldots, 128\}$ and $c \in C = \{1, \ldots, 9\}$, where $x_w^c$ denotes the sample vector from window $w$ and cue $c$. Thus, each input to our architecture is composed of nine parallel 1-D window vectors, $X_w = [x_w^1, \ldots, x_w^9]$, which translates into a tensor $X_w$ with dimensions $(|C| \times 1 \times |S|)$, i.e., $(9 \times 1 \times 128)$, corresponding to (depth, height, width). Thus, our classification problem consists of labeling each tensor $X_w$ with one of the six activities $\{WA, WU, WD, SI, ST, LA\}$, as shown in the input and output of Fig. 1.
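As a minimal sketch of this input representation (assuming the nine filtered 50 Hz signals are available as a NumPy array of shape (9, T); the helper name make_windows is ours, not part of the dataset tools), the windowing and tensor construction could look as follows:

```python
import numpy as np

def make_windows(signals, window=128, overlap=0.5):
    """Slice nine parallel 1-D signals into 50%-overlapping windows of
    128 samples (2.56 s at 50 Hz), one (9 x 1 x 128) tensor X_w per window."""
    step = int(window * (1.0 - overlap))            # 64-sample hop (1.28 s)
    num_windows = (signals.shape[1] - window) // step + 1
    windows = np.stack(
        [signals[:, i * step : i * step + window] for i in range(num_windows)]
    )                                               # (num_windows, 9, 128)
    return windows[:, :, np.newaxis, :]             # (num_windows, 9, 1, 128)

# Illustrative usage with 10 s of synthetic data (T = 500 samples at 50 Hz).
raw = np.random.randn(9, 500).astype(np.float32)
print(make_windows(raw).shape)  # (6, 9, 1, 128)
```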

IV. LIGHT RESIDUAL NETWORK
We propose a light CNN with residual connections based on ResNet18 [31]. Our model reduces the number of residual blocks to four in order to keep a balance between classification performance and model complexity. Moreover, instead of 2-D kernels, we defined 1-D kernels that adapt better to the 1-D input signals (see Section III). As a consequence, the total number of trainable parameters in our model is reduced to 234 950.
Our architecture is shown in Fig. 1. The input contains the raw signal data in tensor $X_w$ (cf. Section III). The initial structure includes a convolutional layer, batch normalization, and a rectified linear unit (ReLU), with a final max pooling layer. Afterward, we have four residual blocks (1a, 1b, 2a, and 2b), each composed of two convolutional layers (see bottom of Fig. 1). Blocks 1a and 1b contain 64 filters, while blocks 2a and 2b contain 128 filters. Finally, the features are transformed by an average pooling layer and a fully connected layer into the six output probability labels, which are normalized using a Softmax function. Table 1 gives the most important parameters.
As in some previous works, e.g., [19] and [23], our input tensor has dimensions (9, 1, 128), which adapts better to the 1-D nature of the input data. This also allows us to simplify the architecture, reduce the parameters, and prioritize the correlations among all signal cues. Other works, like [20], used a tensor with dimensions (1, 128, 9) to suit their specific architectures, but those architectures grow in complexity and number of parameters.
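For illustration, a minimal PyTorch sketch of the structure described above is given below. The filter counts (64 and 128) and the block layout follow Fig. 1, but the kernel sizes, strides, and stem configuration are our own assumptions, since the exact values are listed in Table 1; consequently, this sketch will not reproduce the exact parameter count of 234 950.

```python
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """Two 1-D convolutions with batch normalization and a skip connection
    (the common structure of blocks 1a, 1b, 2a, and 2b in Fig. 1)."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel, stride, padding=kernel // 2)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel, 1, padding=kernel // 2)
        self.bn2 = nn.BatchNorm1d(out_ch)
        # 1x1 projection on the skip path when the shape changes.
        self.shortcut = (
            nn.Sequential(nn.Conv1d(in_ch, out_ch, 1, stride), nn.BatchNorm1d(out_ch))
            if (in_ch != out_ch or stride != 1) else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

class LightResNet(nn.Module):
    """Conv stem, four residual blocks (64, 64, 128, 128 filters),
    average pooling, and a fully connected layer over six activities."""
    def __init__(self, num_classes=6):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(9, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(64), nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=3, stride=2, padding=1),
        )
        self.blocks = nn.Sequential(
            ResidualBlock1D(64, 64),             # block 1a
            ResidualBlock1D(64, 64),             # block 1b
            ResidualBlock1D(64, 128, stride=2),  # block 2a
            ResidualBlock1D(128, 128),           # block 2b
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, num_classes)
        )

    def forward(self, x):          # x: (batch, 9, 1, 128)
        x = x.squeeze(2)           # drop the height dimension -> (batch, 9, 128)
        return self.head(self.blocks(self.stem(x)))
```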

V. EXPERIMENTS
We tested our light architecture on the UCI-HAR dataset [4]. We trained with a learning rate of 0.0008, scaled by a factor of 0.4 every 50 epochs, and a batch size of 16. We trained for 300 epochs and used the Adam optimizer with a weight decay of 0.0001. We implemented our model on a Quadro RTX 6000 GPU using PyTorch 1.10.1+cu102 on Ubuntu 18.04.6 LTS.
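A minimal sketch of this training configuration in PyTorch follows (the scheduler class is our choice for implementing the factor-of-0.4 schedule, and the random tensors are stand-ins for the real window data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = LightResNet()  # the sketch from Section IV
optimizer = torch.optim.Adam(model.parameters(), lr=0.0008, weight_decay=0.0001)
# Multiply the learning rate by 0.4 every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.4)
criterion = torch.nn.CrossEntropyLoss()

# Stand-in data with the shapes from Section III; replace with the real windows.
X = torch.randn(320, 9, 1, 128)
y = torch.randint(0, 6, (320,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

for epoch in range(300):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```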
Table 2 compares our average F1-Score, accuracy, and number of parameters with previous works, with our model outperforming them (entries marked with "-" indicate that the value was not reported in the corresponding paper). These results show that our light model obtains better results with far fewer parameters, thus reducing the complexity of the model.
In addition, Table 3 details the confusion matrix for this first experiment. The main confusions occur between the sitting activity SI and the standing activity ST. This result is in accordance with previous works [19], [20], [22], [24], [25]. In addition, our model reduces the confusion among the activities WA, WU, and WD, improving over previous results [21], [22], [24], [25]. Our light model does not apply temporal relations between consecutive tensors. Instead, we used a 1-D input and focused on finding the optimal number of residual blocks to reduce complexity and increase efficiency. An additional ablation study shows that four residual blocks provide the best F1-Score while keeping the number of parameters low (see the Appendices in [35]).
In the previous experiments, we followed the original experimental conditions presented in [4] to keep a fair comparison with previous works. However, those conditions are restricted to a single pair of training and test sets with a fixed number of participants. To unify future comparisons, we present a benchmark based on LOOCV, where one participant is iteratively left out for testing and the rest (29) are used for training. We applied our light model to this benchmark, and the global confusion matrix is given in Table 5. The biggest confusion occurs between SI and ST, but in small percentages. Tables 2 and 5 present similar behaviors, which confirms the robustness of our model across different training and test sets.
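As a sketch of the proposed protocol, the splits can be generated with scikit-learn's LeaveOneGroupOut, using participant IDs as groups. Here the arrays are random stand-ins for the real dataset, and train_and_score is a hypothetical helper, not part of our released code:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X: one (9, 1, 128) tensor per window; y: activity label per window;
# subjects: participant ID (1..30) per window, used as the group key.
X = np.random.randn(10299, 9, 1, 128).astype(np.float32)
y = np.random.randint(0, 6, size=10299)
subjects = np.random.randint(1, 31, size=10299)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    held_out = subjects[test_idx][0]
    # Hypothetical helper: train on 29 participants, then report
    # precision/recall/F1 on the held-out participant (cf. Table 4).
    # metrics = train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
    print(f"testing on participant {held_out}: {len(test_idx)} windows")
```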
Finally, Table 4 gives the average values of the precision, recall, and F1-Score metrics over the six activities for each participant in the LOOCV benchmark. We think the low results for participant 14 may be due to a problem in the data collection. This could be contrasted with other works using our proposed leave-one-person-out scheme, which is another reason to support the use of our benchmark. Further results on the benchmark, and the code for our model, are available in the Appendices in [35]. We also carried out two additional tenfold cross-validation studies, with and without stratification, obtaining F1-Scores of 97.1% and 96.7%, respectively (see the Appendices in [35]).

VI. CONCLUSION
We presented a new light architecture for the HAR problem that outperforms previous works. We simplified the deep learning models to increase their suitability for real-life wearable-based applications. Future work will extend our study to new datasets and will investigate further reductions in model complexity.

Fig. 1. Light residual architecture. The input signal window is represented by tensor $X_w$ (cf. Section III). The four residual blocks share a common architecture (bottom diagram). The output is one of the six activities: walking (WA), walking upstairs (WU), walking downstairs (WD), sitting (SI), standing (ST), and laying (LA).

TABLE 1. Parameters for the light residual model

TABLE 2. Comparison with previous works

TABLE 3. Confusion matrix for the first experiment

TABLE 4. Classification results for the LOOCV benchmark

TABLE 5. Global confusion matrix for the LOOCV benchmark