Boosting Inertial-Based Human Activity Recognition With Transformers

Activity recognition problems such as human activity recognition and smartphone location recognition can improve the accuracy of navigation and healthcare tasks that rely solely on inertial sensors. Current learning-based approaches for activity recognition from inertial data employ convolutional neural networks or long short-term memory architectures. Recently, Transformers were shown to outperform these architectures for sequence analysis tasks. This work presents an activity recognition model based on Transformers that offers an improved and general framework for learning activity recognition tasks. For evaluation purposes, several datasets, with more than 27 hours of inertial data recordings collected by 91 users, are employed. Those datasets represent different user activity scenarios with varying difficulty. The proposed approach consistently achieves better accuracy and generalizes better across all examined datasets and scenarios. A codebase implementing the described framework is available at: https://github.com/yolish/har-with-imu-transformer.


I. INTRODUCTION
Human activity recognition (HAR) and smartphone location recognition (SLR) aim to identify the user activity from sensory data. HAR measurements can be collected using video [1], utilizing channel state information of WiFi signals [2], [3], radar [4] or by sensors installed in wearable devices such as inertial sensors (accelerometers and gyroscopes) or ambient environment sensors (temperature and humidity). HAR has numerous applications, relying on one or more of these sensors, including surveillance [5], gesture recognition [6], [7], gait analysis [8], healthcare [9], [10], and indoor navigation [11], [12]. Due to its wide applicability, it has been addressed and surveyed extensively in the literature [13]-[20]. In SLR, the user's actions are reflected through changes in the location of the smartphone. For example, consider a walking pedestrian where the smartphone is placed in their trousers' front right pocket (pocket mode). The pedestrian can remove the phone to send a message (texting mode) and then continue holding the phone while walking (swing mode).
The associate editor coordinating the review of this manuscript and approving it for publication was Larbi Boubchir .
HAR and SLR play a particularly important role in navigation solutions which rely solely on the smartphone inertial sensors. Specifically, HAR and SLR were shown to improve the accuracy of traditional pedestrian dead reckoning (PDR) by using it as a prior [21]- [25]. SLR was also shown to improve the performance of other navigation-related problems such as step length estimation [26]- [28] and adaptive attitude and heading reference system (AHRS) [29].
Given the emerging importance of HAR and SLR for navigation performance, different learning-based approaches were proposed to reason about inertial sensory data. Earlier methods for performing HAR for PDR relied on classical machine learning techniques [22]. Hand-crafted features were typically extracted from the raw signal and classified with classical machine learning methods such as Support Vector Machines and Decision Trees [21], [23]. More recently, feed-forward networks (FFN) and long short-term memory (LSTM) architectures were proposed for this task [24], [25], removing the burden of feature engineering while achieving improved accuracy. Recent detailed and extensive surveys describing traditional and deep learning techniques for HAR are available for the interested reader [13], [18]-[20]. There, various types of deep learning approaches such as convolutional neural networks (CNNs), recurrent neural networks, stacked autoencoders, temporal convolutional networks, and variational autoencoders are reviewed.
Recently, CNN and LSTM architectures were shown to improve SLR performance, compared to other learning-based approaches [30]. Methods coupling SLR with step length estimation proposed to use CNNs with or without LSTM [28], [30] or employed LSTM for SLR, similarly to previous works learning SLR for PDR [26]. Interestingly, CNN architectures with/without LSTMs yielded on-par performance, suggesting that LSTMs do not necessarily add an informative temporal aggregation, which is missing from CNNs, for this task [30].
In this work, Transformers [31] are proposed for learning inertial-based HAR and SLR problems. Transformers implement an attention-based encoder-decoder architecture for sequence analysis. Attention mechanisms [32] learn to aggregate information from the entire sequence. By stacking attentional layers which scan the sequence, Transformers generate position- and context-aware representations. This method was shown to outperform recurrent neural networks (RNNs) and LSTMs for various sequence-based problems in Natural Language Understanding and Computer Vision, achieving state-of-the-art performance [31], [33]-[36]. Here, a Transformer-based architecture is presented for performing both HAR (classifying common user dynamics, such as walking, standing, running, stairs and so on) and SLR (classifying smartphone locations, such as talking, pocket or swing). The proposed approach is the first Transformer-based architecture to serve as a general framework for inertial-based activity recognition tasks.
In order to evaluate the proposed approach, multiple HAR and SLR datasets collected by a total of 91 users with more than 27 hours of inertial data recordings are employed. Across all datasets, and considering various scenarios with changing difficulty, the proposed approach demonstrates a consistent boost to accuracy and robustness.
In summary, the main contributions of the paper are as follows:
1) Derivation of a framework for inertial data classification with Transformers. The proposed approach is the first to present a Transformer-based architecture serving as a general framework for activity recognition in both HAR and SLR tasks.
2) A Transformer network architecture design for handling inertial measurements, along with a publicly available implementation.
3) Evaluation of the proposed framework on three commonly used classification tasks: HAR, SLR and a combination of the two, smartphone and human activity recognition (SHAR), demonstrating a consistent improvement in accuracy and better generalization across datasets.
The rest of the paper is organized as follows: Section II describes the proposed Transformer-based architecture for classification with inertial data. Section III reviews the datasets used in this work while Section IV presents the results. Finally, Section V gives the conclusions of this research.

II. INERTIAL DATA CLASSIFICATION WITH TRANSFORMERS
An Inertial Measurement Unit (IMU) measures the specific force f ∈ R 3 and angular velocity w ∈ R 3 vectors over time. These two outputs are typically concatenated and aggregated depending on the sensor's recording frequency, such that a sample S ∈ R k×6 represents a sequence of k measurements (i.e., recorded in a window of size k). In this work, the problem of activity recognition from inertial data is modelled as a sequence-to-one problem, where the input is a learned sequential embedding of the raw sensory measurements and their temporal positions. Following the success of Transformers in text classification [33], [37] and image recognition [36], [38], a Transformer Encoder is proposed for summarizing a sequence of (embedded) inertial measurements into a latent vector. A Multi-Layer Perceptron (MLP) with SoftMax can then be applied to output the class probability distribution, similarly to standard classifier heads used in sequence-to-one architectures [33], [36].
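The aggregation of raw measurements into fixed-size samples can be sketched as follows; this is a minimal illustration of the windowing described above, not the released implementation (the function name and the non-overlapping windowing policy are assumptions):

```python
import numpy as np

def window_imu(stream: np.ndarray, k: int) -> np.ndarray:
    """Segment a continuous IMU recording of shape (n, 6) -- specific force
    and angular velocity concatenated per timestamp -- into non-overlapping
    samples S of shape (k, 6); trailing measurements that do not fill a
    full window are discarded."""
    n = (stream.shape[0] // k) * k
    return stream[:n].reshape(-1, k, stream.shape[1])

# Example: a 5-second recording at 50Hz, split into windows of k = 50 samples.
stream = np.random.randn(250, 6)
samples = window_imu(stream, k=50)
print(samples.shape)  # (5, 50, 6)
```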

A. NETWORK ARCHITECTURE
The scheme proposed in this paper is depicted in Figure 1. Given a sample of inertial measurements S ∈ R k×6 , a series of four 1D convolutions is applied with GELU non-linearity. This step embeds the raw data in a higher dimension d, generating a latent embedding E S ∈ R k×d (latent features).
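A hedged PyTorch sketch of this embedding step follows; the depth (four 1D convolutions), the GELU activations and the output dimension d = 64 come from the text, while the kernel sizes and channel progression are assumptions:

```python
import torch
import torch.nn as nn

d = 64  # latent dimension from the implementation details

# Four Conv1d layers with GELU, embedding the 6 IMU channels into R^d.
# Kernel sizes here are illustrative placeholders.
backbone = nn.Sequential(
    nn.Conv1d(6, d, kernel_size=1), nn.GELU(),
    nn.Conv1d(d, d, kernel_size=1), nn.GELU(),
    nn.Conv1d(d, d, kernel_size=1), nn.GELU(),
    nn.Conv1d(d, d, kernel_size=1), nn.GELU(),
)

S = torch.randn(8, 50, 6)          # a batch of samples, each of shape (k=50, 6)
E_S = backbone(S.transpose(1, 2))  # Conv1d expects (batch, channels, k)
E_S = E_S.transpose(1, 2)          # back to (batch, k, d): the latent features
print(E_S.shape)  # torch.Size([8, 50, 64])
```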
Similarly to state-of-the-art Transformer-based architectures for sequence classification [33], [36], a class token C ∈ R^d is prepended to the embedded sequence. In addition, an embedding E_{P_i} ∈ R^d for each position P_i in the sequence (including the class token) is learned and added to the latent sequence representation. The initial input Z_0 to the Transformer Encoder is thus given by:

Z_0 = [C; E_S] + E_P, (1)

with t = k + 1, where E_P ∈ R^{t×d} stacks the learned positional embeddings and Z_0 ∈ R^{t×d}.
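Building the encoder input from a learned class token and learned positional embeddings can be sketched as follows; variable names are illustrative and not taken from the released code:

```python
import torch
import torch.nn as nn

k, d, batch = 50, 64, 8
class_token = nn.Parameter(torch.zeros(1, 1, d))      # learned token C
pos_embed = nn.Parameter(torch.zeros(1, k + 1, d))    # one slot per position, incl. C

E_S = torch.randn(batch, k, d)             # latent features from the CNN backbone
C = class_token.expand(batch, -1, -1)      # broadcast the class token over the batch
Z0 = torch.cat([C, E_S], dim=1) + pos_embed  # (batch, t, d) with t = k + 1
print(Z0.shape)  # torch.Size([8, 51, 64])
```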
A standard Encoder architecture [31] is employed, stacking L layers, each consisting of a self multi-head attention (MHA) layer and an MLP. In the proposed implementation, the MLP block includes two fully connected (FC) layers with a hidden dimension of 2·d and GELU non-linearity. The MHA operation is the core of the Transformer architecture. Given three sequences of length t and dimension d, namely, a query Q ∈ R^{t×d}, a key K ∈ R^{t×d} and a value V ∈ R^{t×d}, each head h computes a weighted aggregation of V with respect to Q:

head_h(Q, K, V) = A_h · (V W_h^V), (2)

with:

A_h = softmax((Q W_h^Q)(K W_h^K)^T / sqrt(d')), (3)

where d' = d / n_h and n_h is the number of heads. The matrices W_h^Q, W_h^K, W_h^V ∈ R^{d×d'} are linear projections from d to d'. The outputs of (2) from all the heads are concatenated across the channel dimension. The resulting updated representation is a weighted aggregation of the sequence at each position, based on the relative importance of all other positions. In self MHA (sMHA), Q, K and V are taken to be the same sequence.
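The per-head computation described above can be illustrated in a few lines of NumPy; this is a pedagogical sketch of a single attention head, not the (batched, fused) PyTorch implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V, Wq, Wk, Wv):
    """One attention head: project the sequences to dimension d' and
    aggregate V by the softmax-normalized, scaled query-key similarities."""
    q, k_, v = Q @ Wq, K @ Wk, V @ Wv          # each (t, d')
    d_prime = q.shape[-1]
    A = softmax(q @ k_.T / np.sqrt(d_prime))   # (t, t) attention weights
    return A @ v                               # (t, d') aggregated values

t, d, n_h = 51, 64, 8
d_prime = d // n_h
rng = np.random.default_rng(0)
Z = rng.standard_normal((t, d))                # self-attention: Q = K = V = Z
Wq, Wk, Wv = (rng.standard_normal((d, d_prime)) for _ in range(3))
out = attention_head(Z, Z, Z, Wq, Wk, Wv)
print(out.shape)  # (51, 8)
```

Concatenating the outputs of all n_h heads restores the channel dimension d, as described in the text.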
Each layer l, l = 1..L in the Transformer Encoder performs the following computation, passing the input through a LayerNorm (LN) [39] before each module and adding it back with a residual connection:

Z'_l = sMHA(LN(Z_{l-1})) + Z_{l-1}, (4)
Z_l = MLP(LN(Z'_l)) + Z'_l. (5)

The output of the Transformer Encoder at the position of the class token represents a temporally aware aggregation of the input sequence:

Y_C = Z_L^0. (6)

Y_C is provided as an input to a classifier head, consisting of LN and FC layers with GELU non-linearity and Dropout, reducing the dimension to d/4. A second FC layer maps d/4 to the number of classes. A Log SoftMax is applied to the output vector in conjunction with a Negative Log Likelihood (NLL) loss to learn the multi-class classification task.
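The per-layer computation and the classifier head described above can be sketched as follows; a hedged single-layer sketch (the full model stacks L = 6 such layers), with module names chosen for illustration:

```python
import torch
import torch.nn as nn

d, n_h, n_classes = 64, 8, 21

# Pre-LN encoder layer: sMHA and MLP, each wrapped with LayerNorm and a residual.
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
smha = nn.MultiheadAttention(d, n_h, batch_first=True)
mlp = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

# Classifier head: LN and FC reducing to d/4, then FC to the number of classes.
head = nn.Sequential(
    nn.LayerNorm(d), nn.Linear(d, d // 4), nn.GELU(),
    nn.Dropout(0.1), nn.Linear(d // 4, n_classes),
)

Z = torch.randn(8, 51, d)                       # (batch, t, d)
h = ln1(Z)
Z = smha(h, h, h, need_weights=False)[0] + Z    # self-attention + residual
Z = mlp(ln2(Z)) + Z                             # MLP + residual
Y_C = Z[:, 0]                                   # output at the class-token position
log_probs = torch.log_softmax(head(Y_C), dim=-1)
print(log_probs.shape)  # torch.Size([8, 21])
```

Pairing the Log SoftMax output with `nn.NLLLoss` yields the standard cross-entropy training objective.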

B. IMPLEMENTATION DETAILS
The proposed architecture is implemented in PyTorch [40]. The latent dimension d is set to 64 for the convolutional backbone and the positional embedding. The Transformer Encoder consists of six layers, each with an eight-head sMHA block and an MLP. Finally, a Dropout of p = 0.1 is used for both the Transformer Encoder and the classifier head. The implementation of this framework (the proposed architecture and its training and testing) is publicly available at: https://github.com/yolish/har-with-imu-transformer.

III. DATASETS
Three datasets are employed in order to evaluate the proposed approach. The first dataset represents the SLR problem (SLR dataset), containing five different common smartphone locations. This dataset was created by combining six different SLR datasets. The second dataset considers the HAR problem (HAR dataset). It consists of six different human dynamics, where the stairs class is divided into two separate classes: walking upstairs (Stairs up) and walking downstairs (Stairs down). The third dataset contains data with combined SLR and HAR class labeling, and is referred to as the SHAR dataset. For example, the class ''Walking Pocket'' refers to a scenario of a human walking (HAR) with the smartphone placed in their pocket (SLR). This dataset includes 21 classes.
In total, the SLR, HAR and SHAR datasets contain 27.76 hours of recordings made by 91 people. Each dataset contains many different files. Each file is named after the user who made the recording together with a description of its type (e.g. user1 walking texting), and can have a different time duration. When creating each unified dataset, all files from all users were merged into a single file. The specific class labels and data properties of each dataset are summarized in Table 1.

A. SLR
The SLR dataset consists of six different datasets. In all six datasets, the smartphone was placed in at least one of five locations: Texting, Pocket, Swing, Talking and Body, while the user was walking. In most of the datasets, there was no limitation on how the smartphone was held or on the walking characteristics. From these datasets, only the normalized accelerometer readings are used.
The first dataset [30] contains 164min of recorded data in four smartphone locations: Texting, Pocket, Swing and Talking. It has three different sampling rates: 25, 50, and 100Hz, all recorded by a single user with a single smartphone. The second dataset [41] was created for PDR applications, not related to SLR, using eight people. Since the recordings were made while the users were walking with a smartphone, this dataset can also be used for SLR. It contains three smartphone locations: Pocket, Texting and Body, with a total of approximately 70min of data, recorded at 200Hz. Similarly, the third dataset [26] was recorded to examine deep-learning PDR, but can also be used for the SLR problem. There, eight people were recorded at a sampling rate of 100Hz for about 240min, while the smartphone was in the Pocket or Texting location. The fourth dataset [30] has recordings of four locations: Texting, Pocket, Swing and Talking, made by six people, with three different sampling rates (25, 50, and 100Hz) and six different smartphones, yielding a total of 15min of recorded data. The fifth dataset [42] was recorded for HAR applications and is also included in the HAR dataset (Section III-B). For the SLR dataset, only the walking part is used. The dataset was recorded by 24 people using a smartphone in their pocket with a sampling rate of 50Hz. The last (sixth) dataset [43] was also created for HAR research. There, the goal was to evaluate HAR performance with smartphone and smartwatch recordings. For the SLR dataset, walking recordings from the smartwatches were employed, since they share the same dynamics as smartphones in a swing motion; recordings from this dataset were also used for the Body class. The recordings were made by ten people, with a sampling rate of 50Hz, for a total of 48min.
To summarize, the combined SLR dataset has 3,383,950 samples (in each accelerometer axis) recorded by 57 people using different sampling rates of 25-200Hz. The distribution of the samples in each class is shown in Figure 2. Note that no pre-processing was performed on the raw data. All of the datasets were stacked together into a single one regardless of the sampling rate at which they were recorded. The motivation for doing so was to make the network robust to the sampling rate, since in real time the user is free to choose any sampling rate within a predefined range.

B. HAR
The HAR dataset was collected with an iPhone 6s kept in the participant's front pocket [42], [44]. 24 participants (10 women and 14 men) with varying age, weight, and height performed six activities in 15 trials under the same environment and conditions: Walking, Jogging, Sitting, Standing, Stairs down and Stairs up. This dataset has 1,304,950 samples in each accelerometer and gyroscope axis. The sampling rate was 50Hz, leading to a total of 435min. The distribution of the samples in each class is shown in Figure 3.

C. SHAR
The SHAR dataset is derived from a dataset created by [43] in order to evaluate how and when the various motion sensors available on a smartphone can best be exploited (individually or combined) for improving activity recognition. To that end, [43] collected recordings, at a sampling rate of 50Hz, of ten participants during seven physical activities: Walking, Sitting, Standing, Jogging, Biking, Upstairs (walking upstairs) and Downstairs (walking downstairs). Each of those participants was equipped with five smartphones in five different locations: right jeans pocket, left jeans pocket, on the belt towards the right leg using a belt clip, on the right upper arm and on the right wrist. The SHAR dataset is constructed with combined SLR and HAR labels (classes) capturing both the physical activity and the smartphone location. For this purpose, the right and left pocket recordings from [43] were first united under the Pocket label. Recordings from [43] were then labelled based on three smartphone locations and seven human activities: [Pocket, Belt, Uparm (upper arm)] x [Walking, Sitting, Standing, Jogging, Biking, Upstairs, Downstairs], yielding a total of 21 classes. Each class has 87,000 samples, except the three upstairs classes (Upstairs Belt, Upstairs Pocket, Upstairs Uparm), which have 60,900 samples. Thus, this dataset has 1,748,700 samples in each accelerometer and gyroscope axis, resulting in 583min of recorded data.

IV. RESULTS

A. EXPERIMENTAL SETUP
For evaluation purposes, the method proposed in this paper is compared to a CNN model shown to achieve the best performance on different SLR tasks [30]. This model consists of a convolutional encoder and a classifier head. The encoder includes two 1D convolutional layers with ReLU non-linearity, followed by Dropout and max pooling. The classifier head consists of two FC layers with ReLU non-linearity. Log SoftMax is applied to the output of the final FC layer.
Each dataset is arbitrarily split into a train set and test set (where 85% of the samples, on average, are selected for the train set), while ensuring all classes are represented in both sets.
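A per-class split of this kind can be sketched as follows; this is an illustrative NumPy implementation that guarantees every class is represented in both sets, not the split procedure of the released code:

```python
import numpy as np

def stratified_split(labels, train_frac=0.85, seed=0):
    """Split sample indices class by class, so that every class appears
    in both the train and the test set."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = max(1, int(len(idx) * train_frac))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy example: 5 classes with 100 samples each, split 85% / 15%.
labels = np.repeat(np.arange(5), 100)
tr, te = stratified_split(labels)
print(len(tr), len(te))  # 425 75
```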
Both models (the CNN model and the proposed approach) are optimized using Adam, with β1 = 0.9, β2 = 0.999 and ε = 10^-10. A batch size of 128 and a weight decay of 10^-4 are employed. An initial learning rate of λ = 10^-4 is used and further reduced by half every m epochs, depending on the experiment (m is set to the same value for both models). Note that in order to support a fair comparison, all hyperparameters, except for the number of epochs, are not fine-tuned and are kept fixed for both models. Each model is trained for up to 30 epochs for small datasets and for up to 80 epochs for larger datasets. The full configuration used for training is available in the shared codebase.
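In PyTorch, this optimizer and schedule configuration can be sketched as follows; `model` is a stand-in module and the value of m is a placeholder for the per-experiment halving interval:

```python
import torch

model = torch.nn.Linear(64, 21)  # stand-in for either network

# Adam with the stated hyper-parameters: betas, eps, weight decay, initial lr.
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, betas=(0.9, 0.999),
                             eps=1e-10, weight_decay=1e-4)

m = 10  # placeholder: halving interval in epochs, set per experiment
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=m, gamma=0.5)

criterion = torch.nn.NLLLoss()  # pairs with the models' Log SoftMax outputs
```

Calling `scheduler.step()` once per epoch halves the learning rate every m epochs, matching the schedule described above.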
For ease of description, from here on the CNN model and the proposed approach are referred to as IMU-CNN and IMU-Transformer, respectively. The following sections describe different experiments aimed at evaluating performance, robustness and generalization across different activity recognition scenarios.

B. THE EFFECT OF WINDOW SIZE
An IMU sample contains measurements aggregated over a window of time. The size of the window, k, determines the length of the sequence passed to the model. In general, in HAR, SLR or SHAR tasks, it is desirable to work with the smallest window size that still achieves a target accuracy. In the datasets employed, the sampling rate varies between 25-200Hz. Considering the slowest sampling rate, a window size of 50 corresponds to a time duration of two seconds. Larger window sizes increase the probability of a mode change during a single window, which is an undesired behavior. With this motivation in mind, the IMU-CNN and IMU-Transformer models were trained using decreasing window sizes starting from 50: 50, 38, 26 and 14. Since the SHAR dataset represents both HAR and SLR tasks, it was chosen for this analysis (training and testing).

Figure 4 shows the results obtained with the two models. The best accuracy is achieved when using a window of 50 samples, with a gap of 1.8% in favor of the IMU-Transformer model. For smaller windows the improvement gap grows to 4.2% (k = 26, 38) and 4.9% (k = 14). For both models, a similar trend emerges, where a more notable degradation occurs when using k = 26, 38. This can be explained by events of class mixture per sequence, which are less significant and frequent when using the smallest window size (14). When considering the variance in accuracy, the intra-difference (within the smaller windows) is smaller than the inter-difference with respect to the original window size. In addition, a more significant degradation in performance is observed with the IMU-CNN model. Specifically, when comparing the performance between k = 50 and k = 14, 26, 38, the accuracy decreases by 4.5% on average with IMU-CNN, compared to only 1.9% with IMU-Transformer. Hence, the IMU-Transformer model not only consistently improves the performance, regardless of the window size, but is also more robust to smaller window sizes, making it a favorable option for real-time applications. Based on the results above, a window size of 50 was selected for further experiments and analysis.

C. ACTIVITY RECOGNITION PERFORMANCE ACROSS DIFFERENT DATASETS
In order to evaluate the performance per task, the IMU-CNN and IMU-Transformer models were trained and tested on the SLR, HAR and SHAR datasets. Table 2 gives the accuracy for each dataset and the mean accuracy across datasets.

TABLE 2. Results obtained for the SLR, HAR and SHAR datasets. The accuracy (%) for a CNN model (IMU-CNN) and the proposed architecture (IMU-Transformer) is reported per dataset and overall (mean performance).

IMU-Transformer consistently achieves better accuracy, with a 2% improvement on average. The performance of both models depends on the dataset, with decreasing accuracy for the SLR, HAR and SHAR datasets, respectively. In addition, both models achieve a significantly higher accuracy (> 8%) on the SLR dataset compared to the two other tasks. These results are consistent with previous observations on the SLR dataset (distinct patterns between classes that are relatively easy to learn [30]) and suggest that the IMU-Transformer model learns IMU data better than the IMU-CNN model, regardless of how challenging the specific dataset is.

D. CHALLENGING MODES AND GENERALIZATION
Due to the underlying dynamics, some modes are more challenging to learn than others, for example, the stairs-related dynamics expressed in the stairs-up and stairs-down scenarios. In order to evaluate the proposed approach in this scenario, the SHAR dataset is subset by taking samples only from the following six classes: Downstairs Belt, Downstairs Pocket, Downstairs Uparm, Upstairs Belt, Upstairs Pocket, Upstairs Uparm. Figure 5 presents the confusion matrix for the IMU-CNN model, showing a total accuracy of 86.6%. Four out of the six modes achieved more than 91% accuracy, yet Downstairs Pocket and Upstairs Uparm achieved only 73% and 72%, respectively. In particular, about 28% of the Upstairs Uparm samples were misclassified as Downstairs Uparm. In a similar manner, Figure 6 shows the confusion matrix for the IMU-Transformer model, achieving a total accuracy of 92.3%, corresponding to a 5.7% improvement. Focusing on the Upstairs Uparm mode, the 28% misclassification of the IMU-CNN model is reduced to 7.5%.

Another important aspect of model performance is its ability to generalize across datasets. For this purpose, the stairs experiment is further extended by training on the SHAR dataset but evaluating on the HAR test set. In order to obtain class compatibility, the six stairs-related SHAR classes are collapsed into two classes: Stairs up (for Upstairs Belt, Upstairs Pocket and Upstairs Uparm) and Stairs down (for Downstairs Belt, Downstairs Pocket and Downstairs Uparm). The HAR dataset is then subset with these two classes. The results of this experiment are depicted in Figures 7-8 for the IMU-CNN and IMU-Transformer models, respectively. The IMU-CNN reached a total accuracy of 33.8%, where most of the error (72% of the samples) is attributed to the misclassification of Stairs up as Stairs down. IMU-Transformer significantly improved the total accuracy (80.2%) and obtained a symmetrical behaviour between the two classes.
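The class-collapsing step used to align the SHAR labels with the HAR labels can be sketched as a simple mapping; the class names follow the text, while the mapping helper itself is illustrative:

```python
import numpy as np

# Collapse the six stairs-related SHAR classes into the two HAR stairs classes.
to_har = {
    "Upstairs Belt": "Stairs up", "Upstairs Pocket": "Stairs up",
    "Upstairs Uparm": "Stairs up",
    "Downstairs Belt": "Stairs down", "Downstairs Pocket": "Stairs down",
    "Downstairs Uparm": "Stairs down",
}

shar_labels = np.array(["Upstairs Belt", "Downstairs Pocket", "Upstairs Uparm"])
har_labels = np.array([to_har[c] for c in shar_labels])
print(har_labels)  # ['Stairs up' 'Stairs down' 'Stairs up']
```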

E. SUMMARY OF RESULTS
The results of the experiments described in this paper are summarized in Table 3. The proposed approach and a CNN model were extensively evaluated on three datasets with more than 27 hours of recordings, collected by 91 users, considering (1) the target tasks, namely SLR, HAR and SHAR, (2) the effect of the window size (SHAR-50/38/26/14 in Table 3), (3) challenging dynamics (SHAR-Stairs in Table 3) and (4) generalization (SHAR/HAR-Stairs in Table 3).
For each experiment conducted, Table 3 reports the difference between the accuracy of the proposed approach (IMU-Transformer) and that of the IMU-CNN baseline. In addition to accuracy, the mean runtime was also evaluated (using the SHAR dataset). When run on a GPU (Tesla V100, 16GB), the IMU-CNN model classifies a sample with a window size of 50 in 0.48ms on average, versus 4.76ms with the IMU-Transformer model. When tested on a CPU, the average inference time increases, with 3.93ms and 14.3ms for the IMU-CNN and IMU-Transformer, respectively. In both cases, the classification runtime is negligible compared to the interval at which a classification is expected, even when considering the highest sampling rate (200Hz with a window size of 50, corresponding to 250ms). Integrating the proposed framework in a real-time application involves deploying the trained model (weights) and executing its forward pass with an inertial signal aggregated over a predefined time window.

V. CONCLUSION
This paper presents a deep learning framework for activity recognition. The proposed approach employs Transformers, which have been successfully used for sequence analysis tasks in other domains, for performing sequence aggregation using attention. To date, this is the first time a Transformer architecture is employed for inertial-based activity recognition. Three types of datasets, with more than 27 hours of recordings collected by 91 users, were used for an extensive evaluation: 1) smartphone location recognition, created from six different datasets, 2) human activity recognition and 3) combined smartphone location and human activity recognition. In addition, more challenging scenarios were addressed: 1) classification of stairs up/down motion in three different smartphone locations and 2) testing on the second (HAR) dataset with a network trained on the third (SHAR) dataset for stairs-only data.
Throughout multiple experiments, representing different activity recognition scenarios and settings, the proposed approach demonstrates improved prediction accuracy that transfers better between datasets compared to a CNN-based solution. While this approach is an order of magnitude slower (4.76ms vs. 0.48ms), its runtime is negligible compared to the interval at which a classification is expected, even when considering the highest sampling rate.
An immediate extension of the proposed framework is to evaluate it with more user and smartphone modes. In addition, further enhancements can be made to the proposed framework, leveraging on Transformers acceleration techniques. Finally, transfer learning can be investigated to evaluate whether a model trained on one dataset can serve as a better starting point for new models trained on incoming datasets.
Since the proposed approach performs classification of inertial data, it can be directly applied to other inertial-based classification tasks. In addition, it can be further adapted to handle other sensory data collected in a sequential manner for activity recognition, by simple modifications to the CNN backbone. In order to support reproduction of the results and an easy transfer to other domains, the implementation of the proposed framework is available at: https://github.com/yolish/har-with-imu-transformer.
YOLI SHAVIT received the B.Sc. degree in computer science and life science from Tel Aviv University, the M.Sc. degree in bioinformatics from Imperial College London, and the Ph.D. degree in computer science from the University of Cambridge. She is currently a Postdoctoral Researcher with Bar-Ilan University and a Principal Research Scientist with the Huawei Tel Aviv Research Center, Toga Networks, a Huawei Company. Her research interests include algorithms in deep learning and their applications to real-life domains and visual and multi-sensor localization problems.