ConvAE-LSTM: Convolutional Autoencoder Long Short-Term Memory Network for Smartphone-Based Human Activity Recognition

The automatic recognition of human activities from time-series smartphone sensor data is a growing research area in smart and intelligent health care. Deep learning (DL) approaches have exhibited improvements over traditional machine learning (ML) models in various domains, including human activity recognition (HAR). Several issues are involved with traditional ML approaches; these include handcrafted feature extraction, which is a tedious and complex task involving expert domain knowledge, and the use of a separate dimensionality reduction module to overcome overfitting problems and hence provide model generalization. In this article, we propose a DL-based approach for activity recognition with smartphone sensor data, i.e., accelerometer and gyroscope data. Convolutional neural networks (CNNs), autoencoders (AEs), and long short-term memory (LSTM) networks possess complementary modeling capabilities, as CNNs are good at automatic feature extraction, AEs are used for dimensionality reduction and LSTMs are adept at temporal modeling. In this study, we take advantage of the complementarity of CNNs, AEs, and LSTMs by combining them into a unified architecture. We evaluate the proposed architecture, namely, “ConvAE-LSTM”, on four different standard public datasets (WISDM, UCI, PAMAP2, and OPPORTUNITY). The experimental results indicate that our novel approach is practical and improves on existing state-of-the-art smartphone-based HAR methods in terms of computational time, accuracy, F1-score, precision, and recall.


I. INTRODUCTION
Human activity recognition (HAR) has been a popular research area for several decades due to its wide applications in smart health care, ambient assisted living, disease prediction, video surveillance, remote health care and so on [1], [2]. According to a report released by the United Nations (UN) [3], the worldwide population of elderly people is expected to reach 2 billion by 2050. Elderly people need special attention and care, as most elderly people suffer from many diseases. Moreover, the doctor-to-patient population ratio was determined to be 1:1800. The monitoring of real-time human physical activities, particularly the daily living activities (DLAs) [4] of elderly people, is an indispensable aspect of smart health care and can effectively enhance medical rehabilitation and elderly care. Daily lifestyles have significant impacts on several critical diseases. Therefore, daily physical activity monitoring provides an important health indicator [5]. The identification and classification of human physical activity are widely used to monitor, analyze and understand various postures across a variety of applications and systems.
Various sensor-based HAR frameworks have been proposed in the literature, such as smartphone sensor-based, body-worn sensor-based and audio/video data-based frameworks. Body-worn sensors are not comfortable for users, and audio/video data raise several privacy concerns. Moreover, both body-worn sensor signals and audio/video signals require complex signal processing techniques [6]. The audio recording of any long-term activity becomes noisy due to background noise or white noise. Therefore, an audio signal alone at any given moment does not provide valuable information. It may also be difficult to differentiate between two pieces of audio information. Therefore, audio data are insufficient on their own for recognizing some basic activities. The collection of video data turns out to be difficult in populated locations, in locations where many physical obstacles exist, or when brightness is low [7].
To infer the descriptions of human behaviors and transport modes, sensor data are also obtained by using smartphones. In human activity monitoring and understanding, the tactile information provided by smartphones is valuable for analysis because it offers distinct advantages over other sensor modalities. Normally, smartphone sensor-based physical HAR systems are motivated by their ubiquity, discretion, inexpensive installation procedures, noninvasive properties and ease of use [8]. By utilizing a smartphone, continuous data can be collected while performing any type of physical activity. Moreover, mobile health-related data monitoring becomes more elegant and accurate due to the variety of built-in smartphone sensors [9]. Several built-in smartphone sensors can be used to collect data for HAR. However, the most commonly used built-in sensors are accelerometers and gyroscopes [10]-[13].
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Smartphone sensor-based datasets contain multivariate time-series data. Local dependency is an intrinsic property of time-series data. Moreover, human activity signals are hierarchical and translation-invariant in nature [14], and they are rich in dynamic information regarding the underlying system. As a result, the need to accurately model such high-dimensional datasets is increasing. Physical activities have their own distinctive features. Hence, in addition to handcrafted feature extraction, several methodological challenges are involved in HAR, such as imbalanced datasets, intraclass variability, interclass similarity, the empty class problem [15] and the multiclass window problem [16].
Recently, smartphone-based HAR systems utilizing conventional machine learning (ML) algorithms or deep learning (DL) approaches have gained popularity [17]. Feature engineering is a dominant phase in traditional ML methods because it extracts the relevant features that are responsible for differentiating various activity patterns. The accurate performance of HAR solutions greatly depends on the feature engineering of raw signals [18]. Then, the features are fed to classifiers to recognize human activities [6], [19]. Without an adequate feature engineering process, conventional classifiers fail to competently and accurately identify physical activities. Hence, to provide sensory data in an appropriate form, complex data preprocessing techniques are required, and handcrafted features are extracted from the acquired sensory data based on expert domain knowledge [11]. Finally, the handcrafted feature vector is fed to the conventional classifiers to recognize various human physical activities. However, past research has shown that some of these handcrafted features are good at distinguishing one activity but not as good at recognizing others [20]. Moreover, different research domains require different handcrafted feature vectors to properly handle classification problems.

A. MOTIVATION
Automatic feature learning capabilities make DL algorithms more popular than conventional ML algorithms. DL algorithms can extract relevant features efficiently without any manual assistance while simultaneously identifying human activities [21]-[23]. DL approaches have proven their outstanding predictive capabilities in speech and image recognition, intelligent gamification, and natural language processing. In the HAR literature, numerous DL approaches have been investigated and applied. Convolutional neural networks (CNNs) are popular supervised DL methods that are used in the HAR domain (see Figure 1) to overcome the problem of handcrafted feature extraction. CNN layers are used to automatically extract features from raw time-series data without human intervention. In a CNN, the number of feature maps often increases with the network depth, causing the computational complexity of the architecture to also increase. To overcome this problem, a dimensionality reduction technique can be employed to reduce the number of feature maps. Moreover, existing DL-based HAR systems require large quantities of labeled training data to achieve good performance. However, in a real scenario, a large volume of labeled training data is not easy to obtain because the task of creating such a volume is tedious, time-consuming, laborious, and expensive. In real-time HAR applications in particular, labeled data are scarce. To overcome these issues, in this research work, we take advantage of an autoencoder (AE), which possesses the property of unsupervised feature learning and enables dimensionality reduction with convolutional layers [24].
A convolutional AE (ConvAE) is a type of AE in which the nonlinear transformation is performed by a CNN [25]. This is the motivation behind the use of a combination of a CNN and an AE in our proposed architecture. Convolutional layers are less computationally complex than fully connected layers [26]. Since CNNs primarily work in vector spaces, learning the high-dimensional properties of input time series is more difficult. As a result, CNN architectures alone cannot efficiently predict time-series signal data [27]. Moreover, CNNs cannot extract temporal features, leading to a reduction in activity recognition accuracy. Long short-term memory (LSTM) networks are adept at sequential learning because they carry signal information across time steps [27]. By leveraging the complementary strengths of a CNN and an LSTM neural network, the combination of CNN and LSTM models preserves both spatial and temporal information and performs better in terms of sequential learning [27], [28].
Motivated by the architectures of CNNs, AEs, and LSTM networks, this work proposes an integrated architecture using ConvAE and LSTM for recognizing human activities. Our exhaustive literature study also suggests that this combination is novel. The end-to-end model in the proposed work consists of the following three distinct modules.
• A convolutional AE module.
• An LSTM module.
• A fully connected layer followed by a softmax function.

B. CONTRIBUTIONS
The contributions provided in this paper are as follows.
1) We perform an extensive literature survey regarding DL-based HAR, which can help readers understand and compare the state-of-the-art methods in this domain.
2) We propose ConvAE-LSTM, which is a novel DL architecture that (a) can automatically extract features from unlabeled raw sensory data, (b) uses fewer parameters owing to its convolution layer, which reduces the risk of overfitting, (c) reduces the required computational time and (d) enhances the accuracy of HAR.
3) We demonstrate the effectiveness of our proposed ConvAE-LSTM network through empirical experiments on two different standard public smartphone sensor-based HAR datasets in the same experimental environment.
4) Additionally, we compare the experimental results obtained using our proposed method with those of some state-of-the-art methods drawn from the HAR literature.

II. RELATED WORK
A substantial number of reviews have been conducted regarding the recognition of human physical activities using various approaches, as elucidated in [11], [29], [30], which include both ML and DL approaches. Various shallow machine learning approaches have been used in HAR solutions. For instance, in [6], [12], [13], [31]-[36], the authors emphasized only accuracy based on various ML algorithms by using 17 different baseline time-frequency domain features mentioned in [11]. Previously, the researchers in [29], [32], [37]-[45] used various feature extraction and/or selection methods before feeding the obtained data to a classification algorithm for recognizing diverse human activities. ML models rely on handcrafted features in the HAR domain, and such features require expert domain knowledge. Moreover, they increase the time complexity of the resulting model. To overcome these issues, researchers have started exploring DL approaches, as DL models possess automatic feature extraction capabilities. In this section, we establish an underlying basis for DL-based HAR while looking at some of the earlier works that are relevant to our approach and how their methods differ from ours.

A. DEEP LEARNING FOR HAR
Relevant features can be automatically learned by DL algorithms. As a result, in the case of smartphone-based HAR models, DL algorithms achieve remarkably high performance. Because of their hierarchical feature extraction capacities, CNNs are gaining prominence in the HAR domain. CNNs containing at least one convolutional layer and one pooling layer followed by at least one fully connected layer have acquired fame because of their capability of learning unique representations from images or speech while capturing local dependencies and distortion invariance [8]. In [18], the authors proposed a CNN-based HAR model by capturing the local dependencies and scale-invariant features of activity signals. To prove the efficiency of the proposed framework, the authors used three public datasets: OPPORTUNITY [82], [83], Skoda [84], and Actitracker [85]. Ronao and Cho [14] proposed another CNN method for HAR by using the UCI public dataset and handcrafted features. Another work proposed in [8] used a 1D-CNN to recognize human activities in the UCI public dataset [11] with extra temporal fast Fourier transformation (FFT) information.
Ravi et al. [46] presented a HAR framework using convolutional layers and shallow features obtained from smartphone sensors and wearable sensors. The WISDM [10] and DAPHNet-FoG [86] datasets were used in this study. Bevilacqua et al. [87] suggested a CNN-based HAR framework that uses five distinct sensors, including an accelerometer and a gyroscope, to recognize 16 different lower-limb actions. In another work, Jiang et al. [47] described a CNN-based HAR framework using the UCI-HAR public dataset. The adaptive moment estimation (Adam) optimization technique was employed in this study. Another HAR framework proposed by Ignatov [48] used a combination of manually extracted features and automatically extracted features obtained from a CNN. To demonstrate the efficiency of the proposed framework, the authors used two popular public datasets (WISDM and UCI). They also experimented without extracting handcrafted features from the UCI dataset. A body-worn sensor-based HAR framework was proposed by Rueda et al. [49] using a CNN. Three different datasets were used in their experiments to prove the efficiency of the proposed model; two different public datasets, OPPORTUNITY and PAMAP2, were used in [88]. In [50], the authors proposed an HAR framework using a 2D-CNN and calculated both the accuracy and computational time of the developed approach.
In [51], the authors suggested a HAR framework using a CNN architecture for the “UCI-HAR” public dataset. Moreover, the authors calculated the training and testing times of their approach as 3.4274 seconds and 372.6 ms, respectively. Zebin et al. [52] proposed an HAR model to recognize five different activities, namely, “walking on a level surface”, “walking upstairs”, “walking downstairs”, “remaining sedentary” and “sleeping”, by utilizing a CNN. Waist-mounted inertial sensors such as an accelerometer and a gyroscope were used to collect the data. In [53], the authors proposed a CNN-based HAR solution using smartphone accelerometer, magnetometer, gyroscope, and barometer sensor data. In this work, the authors identified nine different activities. In [54], the authors proposed a 2D deep CNN architecture to solve an HAR problem. In this work, the authors used a separate data compression technique for smartphone sensor data. Gamble and Huang [55] described a 1D-CNN architecture that uses accelerometer and gyroscope smartphone sensor data to identify human physical activities. Cruciani et al. [56] proposed a smartphone sensor-based and audio-based HAR method using a CNN. They used the UCI-HAR dataset, a real-world extrasensory dataset [89] and the DCASE 2017 dataset. Yen et al. [57] suggested a CNN-based HAR framework using the smartphone-derived UCI-HAR public dataset and self-collected data from wearable sensors. Wan et al. [58] suggested an HAR framework incorporating three different deep learning methods, namely, a CNN, LSTM, and bidirectional LSTM, with two different public datasets: the smartphone sensor-based UCI dataset and the wearable sensor-based PAMAP2 dataset.
In [59], a sparse AE (SAE) was used to automatically learn features, and based on this concept, a smartphone-based HAR framework was proposed. Three different channels (an accelerometer, a gyroscope, and the magnitudes of both sensors) were used by the authors. Statistical metrics were used to demonstrate the achieved performance measures. In [60], the authors proposed an AE-based HAR system built on various video datasets. Utilizing a stacked autoencoder, a smartphone sensor-based HAR system was proposed by Almaslukh et al. [61]. In [62], the authors identified eight locomotion and transportation activities via an adversarial AE. Data were collected by using smartphone sensors at four different positions (torsos, bags, hips, and hands) to perform the experiment. Via a combination of stacked denoising AEs and LightGBM, an HAR solution was proposed in [63] using the UCI dataset. Ozcan and Basturk [64] proposed a stacked AE-based HAR system. The authors used both WISDM and UCI smartphone-based sensor data to perform their experiments. Recently, a regularized AE-based HAR framework was proposed in [65] using body-worn sensors. Based on the idea that one encoder is associated with one class, an ensemble of autonomous AE-based HAR solutions was proposed in [66]. In this study, the authors used three different datasets: WISDM, MHealth and PAMAP2. A typical AE-based HAR system was proposed in [67] using body-worn sensor data such as the PAMAP2, OPPORTUNITY, USC-HAD, and DAPHNet datasets.
Geng and Song [68] proposed an HAR solution using video data (KTH dataset). In this study, the authors used a CNN with a convolutional AE. Varamin et al. [69] proposed a deep convolutional AE-based approach to identify human activities using both a smartphone sensor and body-worn sensors with matching ratios of 94.9% and 84.9%, respectively. In this study, the authors emphasized only unsupervised feature learning concepts.
A context-aware HAR framework was proposed in [70]. A multilayer LSTM with batch normalization was used by the authors to recognize static and dynamic physical activities using body-worn inertial sensors. However, the computational cost and memory requirements were quite high, as edge computing was used in this study. Yu and Qin [71] suggested an HAR framework using bidirectional LSTM with the UCI-HAR dataset. Zhao et al. [72] suggested a residual bidirectional LSTM architecture to identify different human activities using both UCI smartphone sensor and body-worn sensor (OPPORTUNITY) datasets.
In [73], the authors proposed CNN- and LSTM-based HAR solutions using two different public datasets collected by wearable sensors. In [74], the authors used a CNN followed by an LSTM-based DL architecture for HAR by using the UCI smartphone-based dataset. Wang et al. [75] suggested a HAR framework using a CNN in combination with LSTM. The HAPT public wearable sensor dataset was used by the authors to prove the effectiveness of their work. Mutegeki and Han [76] used the UCI smartphone sensor dataset to identify human activities using a CNN-LSTM architecture. Ercolano and Rossi [77] proposed a CNN-LSTM-based architecture using video data (the CAD-60 dataset) for HAR. In all the aforementioned works, researchers used combinations of CNNs and LSTM to extract spatial and temporal features.
Ye et al. [78] suggested a two-stream convolutional network-based ''convolutional LSTM'' architecture to recognize various daily life activities. They used the HMDB51 and UCF101 video datasets and extracted features by using the convolution layer of the CNN.
Xia et al. [80] suggested an LSTM-CNN-based HAR framework to identify different daily life activities using three different datasets: UCI, WISDM, and OPPORTUNITY. In this study, a two-layer LSTM network, followed by a convolutional layer, was used. Two additional layers, global average pooling (GAP) and the batch normalization layers, were used to model parameters and speed up the convergence of the network, respectively. After convolution, the fully connected layer was replaced by a GAP layer.
Zou et al. [81] proposed an AE long-term recurrent convolutional network (AE-LRCN)-based HAR framework that consists of three different modules: an AE, a CNN, and LSTM. The proposed framework can identify five different activities: emptying, sitting, walking, running and standing.
Xu et al. [79] suggested “InnoHAR”, which is the combination of an inception neural network and an RNN with different scale-based convolution kernels, for HAR. Karim et al. [90] suggested the use of multivariate LSTM-FCNs for time-series classification. The authors used an HAR dataset and 34 other datasets obtained from different domains to demonstrate the performance of the proposed model. In this work, a squeeze-and-excitation block was incorporated to improve the performance of the proposed model.
We summarize the aforementioned state-of-the-art DL-based HAR methods in Table 1. From Table 1, we can easily verify that for smartphone-based HAR systems, UCI and WISDM are the most popular public standard smartphone sensor datasets used in previous works. Moreover, we can also identify a research gap: a convolutional AE with LSTM is a novel architecture by which we can obtain higher accuracy with permissible computational times in HAR domain applications. In [81], the authors used a combination of an AE, a CNN, and LSTM, although this approach exhibited several clear differences from our method. First, we introduce a combination of convolutional layers with an AE, and then the output of the convolutional AE passes through LSTM. Conversely, in [81], the authors used three different modules, where the input first passed through the AE, then through the CNN, and finally through LSTM. In our work, we take advantage of the convolutional layer of a CNN in combination with an AE and LSTM. Second, in [81], the authors used channel state information (CSI) frames as inputs. In contrast, in this study, we take time window segments of raw signals as inputs. Third, in [81], to prove the efficiency of the proposed AE-LRCN architecture, the authors used a self-collected dataset for activities such as emptying, sitting, walking, running and standing. In this paper, to exhibit the efficiency of our proposed architecture, we use four popular public datasets (UCI, WISDM, OPPORTUNITY, and PAMAP2).

III. PRELIMINARIES
The proposed DL architecture is a combination of a convolutional AE and LSTM. The convolutional AE combines the convolutional filtering of CNNs with the unsupervised pretraining of AEs. Therefore, to understand the concept of a convolutional AE, it is necessary to separately understand the concepts of both the CNN and AE.

A. CNNs
Recently, CNNs have achieved great successes in various domains, such as image classification and speech recognition, due to their ability to learn locally connected features. Generally, CNNs consist of three types of layers: convolution layers, pooling layers and fully connected layers [91]. The convolution layers are the fundamental building blocks of a CNN architecture and perform feature extraction. Input feature map downsampling is performed by the pooling layers, and the fully connected layers are used for classification. Both max-pooling and average-pooling layers are commonly employed to perform local maximization and averaging operations on the input features, respectively. Motivated by [92], we present the max-pooling layer concept by using the following equation:

$$mp_i^{l+1}(t) = \max_{(t-1)p+1 \le k \le tp} q_i^l(k).$$

An average-pooling layer is represented as follows:

$$ap_i^{l+1}(t) = \frac{1}{p} \sum_{k=(t-1)p+1}^{tp} q_i^l(k),$$

where $q_i^l(k)$ is the value of the $k$th neuron in the $i$th feature map of the $l$th layer, $t$ denotes the $t$th moving step of the filter, $p$ is the width of the pooling filter, and $mp_i^{l+1}$ or $ap_i^{l+1}$ represents the corresponding output in the $(l+1)$th layer provided by the max-pooling or average-pooling operation, respectively.
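As a small worked illustration of the two pooling operations described above, the following Python sketch applies non-overlapping max pooling and average pooling of width p to a single 1D feature map. The function names, the toy feature map and the choice of a stride equal to the filter width are assumptions made for this example only; they are not taken from the implementation used in this work.

```python
from typing import List


def max_pool_1d(q: List[float], p: int) -> List[float]:
    """Max pooling with filter width p (stride assumed equal to p)."""
    steps = len(q) // p  # trailing samples that do not fill a window are dropped
    return [max(q[t * p:(t + 1) * p]) for t in range(steps)]


def avg_pool_1d(q: List[float], p: int) -> List[float]:
    """Average pooling with filter width p (stride assumed equal to p)."""
    steps = len(q) // p
    return [sum(q[t * p:(t + 1) * p]) / p for t in range(steps)]


# Hypothetical feature map with six neurons, pooled with width p = 2.
feature_map = [1.0, 3.0, 2.0, 8.0, 5.0, 4.0]
pooled_max = max_pool_1d(feature_map, p=2)  # one value per moving step t
pooled_avg = avg_pool_1d(feature_map, p=2)
print(pooled_max, pooled_avg)
```

Both operations halve the length of the feature map here, which is the downsampling effect attributed to the pooling layers.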
In comparison with a fully connected layer, a convolution layer has far fewer parameters due to sparse connectivity and weight sharing, thereby reducing the possibility of overfitting. However, due to the tremendous popularity of traditional CNNs in HAR, the recently proposed CNN-based HAR models adopt 1-2 fully connected layers as classifiers [57], [87]. Although fully connected layers can adequately perform classification, their large number of parameters increases the risk of overfitting.

B. CONVOLUTIONAL OPERATION
The given input data are processed by the convolution kernel, which produces processed features as outputs. These processed features are known as feature maps. The multiple kernels that reside in convolution layers are used to extract the relevant features. Motivated by [92], the convolutional operation is denoted by

$$y_i^{l+1}(j) = w_i^l * x^l(j) + b_i^l,$$

where $w_i^l$ and $b_i^l$ represent the weight and bias of the $i$th kernel in the $l$th layer, $x^l(j)$ denotes the $j$th local region of layer $l$, $*$ denotes the dot product between the kernel and the local region, and $y_i^{l+1}(j)$ is the resulting value at position $j$ of the $i$th feature map in layer $l+1$.
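The kernel operation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a single 1D input channel, a single kernel, a stride of one and no padding. The names `conv1d_valid` and `relu` and the toy values are hypothetical, and the ReLU helper is included only because a nonlinear activation is typically applied to the feature map next.

```python
from typing import List


def conv1d_valid(x: List[float], w: List[float], b: float) -> List[float]:
    """Slide kernel w over x; each output is the dot product with one local region plus the bias.

    As in most DL frameworks, the kernel is not flipped (cross-correlation).
    """
    k = len(w)
    return [
        sum(w[m] * x[j + m] for m in range(k)) + b
        for j in range(len(x) - k + 1)
    ]


def relu(values: List[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


# Hypothetical five-sample signal and a two-tap difference kernel.
signal = [1.0, 2.0, 0.0, -1.0, 3.0]
kernel = [1.0, -1.0]
feature_map = conv1d_valid(signal, kernel, b=0.5)
activated = relu(feature_map)
print(feature_map, activated)
```

The same kernel weights are reused at every position, which is the weight sharing that keeps the parameter count of a convolution layer low.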
Generally, after the convolution operation, a nonlinear transformation is applied by using an activation function. A commonly used activation function is the rectified linear unit (ReLU), which is represented by

$$f(z) = \max(0, z).$$

The basic principles of AEs were proposed in [93]. An AE is a feedforward neural network that accepts $x \in \mathbb{R}^d$ as an input and first maps it to a latent representation $h \in \mathbb{R}^d$ to produce an output under certain constraints. An AE encodes and decodes the given inputs to produce unsupervised pretraining data. The deterministic encoding function used to construct a nonlinear mapping for the given input $x$ is as follows:

$$h = \sigma(wx + b),$$

where $\sigma$ is the nonlinear activation function and $w$ and $b$ are the weights and biases, respectively. The decoding function used to reconstruct the input vector $x$ from the encoded features is as follows:

$$\hat{x} = \sigma(\hat{w}h + \hat{b}),$$

where $\hat{w}$ and $\hat{b}$ are the weights and biases of the decoder, respectively, and $\hat{x}$ is the reconstruction of $x$.

IV. THE PROPOSED MODEL
Multivariate time-series data are collected from built-in smartphone-based sensors to identify various human activities. Fine-grained information can be obtained by using sensory data from sensors such as triaxial accelerometers and gyroscopes. However, the data collected using smartphone sensors are noisy and not in an appropriate form. It is not possible to recognize fundamental patterns with such noisy raw sensory data. To remove this noise, conventional filtering techniques such as low-pass, high-pass, and median filtering are used. After removing the noise, feature engineering is applied to extract relevant features, and ultimately, the extracted feature vector is fed to the classifier as its input. As mentioned earlier, conventional noise removal, feature engineering, and classification methods require substantial human expertise and intervention, and they fail to reveal the temporal interdependence of data [81]. CNNs are popularly used for automatic feature extraction in several domains, including HAR. CNNs, however, rely on backpropagation to train the kernels/weights used for convolution, which takes a long time.
In both ML and DL architectures, time window-based sensor data segmentation is required to assign a single activity class. Researchers approximate the size of the sliding window over the sensor data streams to extract features in cases involving labeled data [22]. Sometimes, this strategy leads to the loss of important activity information. Activity recognition accuracy may increase with an increase in the length of the segments, but a long window size causes response time delays in real-time HAR. Hence, an unsupervised feature learning approach such as an AE is beneficial in scenarios in which we do not have labeled data.
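The time window-based segmentation described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used in this work: the window length of 128 samples matches the time frames used later in this section, whereas the 50% overlap (a step of 64 samples), the majority-vote labeling of windows that span two activities, and the function name `sliding_windows` are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def sliding_windows(
    samples: Sequence[float],
    labels: Sequence[str],
    window: int = 128,
    step: int = 64,
) -> List[Tuple[List[float], str]]:
    """Cut a sensor stream into fixed-length windows and give each a single activity label."""
    segments = []
    for start in range(0, len(samples) - window + 1, step):
        end = start + window
        # Assumption: a window that spans two activities takes the majority label.
        majority_label = Counter(labels[start:end]).most_common(1)[0][0]
        segments.append((list(samples[start:end]), majority_label))
    return segments


# Hypothetical single-axis stream: 200 samples of walking followed by 120 of sitting.
stream = [float(i) for i in range(320)]
stream_labels = ["walking"] * 200 + ["sitting"] * 120
segments = sliding_windows(stream, stream_labels)
print(len(segments), [label for _, label in segments])
```

The third window in this toy stream contains samples from both activities, which illustrates how a fixed window size can blur activity boundaries and lose information.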
The proposed model consists of three modules, as represented in Figure 4. The first module is a convolutional AE, which consists of a convolution layer, a pooling layer and a deconvolution layer. The output of the convolutional AE is passed through a flatten layer to change it to the LSTM input format. The LSTM output is passed through a fully connected layer to obtain a high-level representation. Finally, a softmax layer is used for the final human physical activity recognition step. Thus, this model is capable of identifying the temporal dependencies among the time-series signals acquired through smartphone sensors.
As mentioned, a flatten layer is added after the deconvolution layer in the convolutional AE to format the feature data for the LSTM layer. This is because the output format of the convolution layer is different from the input format of the LSTM layer. The time-distributed wrapper provided by the Keras library in Python takes a layer as an argument and applies it to every temporal slice of its input, so that the convolutions are applied to the signal while its temporal order is maintained for the LSTM layers [94]. As the time-distributed layer works with a 3D data format, we need to reshape each input window of 128 time frames, together with the corresponding number of signal channels. The 128 time frames are divided into four slices of 32 time frames each.
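The reshaping step described above can be illustrated without any DL library. The sketch below splits one window of 128 time frames into four consecutive slices of 32 frames, giving a (slices, frames, channels) layout for each window. The channel count of six (a triaxial accelerometer plus a triaxial gyroscope) and the helper name `split_into_slices` are assumptions for this example.

```python
from typing import List


def split_into_slices(window: List[List[float]], n_slices: int) -> List[List[List[float]]]:
    """Split a (frames, channels) window into n_slices consecutive, equally long slices."""
    if len(window) % n_slices != 0:
        raise ValueError("window length must be divisible by the number of slices")
    slice_len = len(window) // n_slices
    return [window[i * slice_len:(i + 1) * slice_len] for i in range(n_slices)]


# Assumption: six channels (triaxial accelerometer + triaxial gyroscope).
n_channels = 6
window = [[float(t)] * n_channels for t in range(128)]  # toy data: the value equals the frame index
slices = split_into_slices(window, n_slices=4)
print(len(slices), len(slices[0]), len(slices[0][0]))
```

Because the slices are kept in their original order, a layer applied to each slice independently still hands the LSTM a sequence that respects the temporal order of the signal.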
The three aforementioned modules used in the proposed deep learning architecture are explained as follows.

A. CONVOLUTIONAL AE
One of the AE variations is the convolutional AE [69], in which the fully connected layers are replaced by convolutional layers. Convolutional AEs have the advantages of both convolutional layers and the unsupervised pretraining capability of an AE. In contrast to the conventional AE network, the convolutional AE contains convolutional layers in the encoder and deconvolution layers instead of a fully connected layer in the decoder. Our proposed convolutional AE includes convolution, pooling, and deconvolution layers, as presented in Figure 2. The encoder consists of one convolution layer and a pooling layer. The decoder consists of a deconvolution layer. Encoding the result of the convolution operation with a max-pooling layer permits higher-layer representations that are invariant to small input translations and reduces the computational cost of the proposed approach [26]. The convolution-deconvolution layer is followed by an activation function, which is represented as follows [95]:

$$h^k = \sigma\Big(\sum_{l \in L} x^l * w^k + b^k\Big),$$

where
• $h^k$ = the latent representation of the $k$th feature map of the current layer,
• $\sigma$ = the activation function,
• $x^l$ = the $l$th feature map in the set $L$ of feature maps of the previous layer,
• $w^k$ and $b^k$ = the filter and bias of the $k$th feature map, and
• $*$ = the convolution operation, which can be a valid or a full convolution.

For instance, if the size of a feature map $x^l$ is $p \times p$ and the size of the filter is $q \times q$, then after performing the valid convolution, the size becomes $(p-q+1) \times (p-q+1)$, and after performing the full convolution, the size becomes $(p+q-1) \times (p+q-1)$ [95].
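The size arithmetic for valid and full convolutions can be checked with a short calculation. The following Python sketch is only a worked example of the two formulas given above; the feature map side of 32 and the filter side of 3 are hypothetical values, and the pooling layer is left out so that only the effect of the two convolution types is shown.

```python
def valid_conv_size(p: int, q: int) -> int:
    """Side length after a valid convolution of a p x p map with a q x q filter."""
    return p - q + 1


def full_conv_size(p: int, q: int) -> int:
    """Side length after a full convolution of a p x p map with a q x q filter."""
    return p + q - 1


# Hypothetical sizes: a 32 x 32 feature map and a 3 x 3 filter.
p, q = 32, 3
encoded_side = valid_conv_size(p, q)             # the valid convolution shrinks the map
decoded_side = full_conv_size(encoded_side, q)   # a full convolution with the same filter size restores it
print(encoded_side, decoded_side)
```

A valid convolution followed by a full convolution with the same filter size returns the original side length, which is why this pairing suits an encoder and decoder that must reconstruct an input of the same size.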
By utilizing the maximum activity within the input feature maps, a max-pooling layer pools features, and according to the size of the pooling kernel, it constructs reduced-size output feature maps.

B. LSTM
Temporal features in time-series sensor data have great importance when modeling human movement [23]. Recently, recurrent neural networks (RNNs), most notably those based on LSTM [96], have achieved impressive performance in different domains, including HAR. The LSTM architecture is responsible for extracting the temporal features from sensory signals because it can model temporal characteristics and long-term dependencies. The conventional architecture of LSTM [81] is represented in Figure 3. In our proposed model, the convolutional AE, as explained in Section IV-A, is followed by an LSTM model. The compressed features output by the convolutional AE are the inputs of the LSTM, which deduces the latent temporal interactions throughout the time frames. According to Figure 3, at time frame $t$, $x_t$ is the input signal and $h_t$ is the hidden state. At time frame $t-1$, $C_{t-1}$ is the memory cell state. $w_f$, $w_i$, $w_c$, and $w_o$ and $b_f$, $b_i$, $b_c$, and $b_o$ are the weights and biases, respectively. $\sigma$ and $\tanh$ are the activation functions. In the first step, the LSTM determines how much of the previous information in the cell state $C_{t-1}$ is retained by using a forget gate as follows:

$$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f),$$

where $f_t$ takes values between 0 and 1; a value of 0 blocks the information completely, and a value of 1 lets it pass completely. In the next step, the LSTM calculates the upcoming information to be stored by using a two-step process. The first part regulates the values to be updated via the following equation:

$$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i).$$

The second part determines a candidate state value $\tilde{C}_t$ by using the following equation:

$$\tilde{C}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c).$$

In the third step, the LSTM determines the current state $C_t$ by using the following equation:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t.$$

As exhibited in Figure 3, the hidden network output $h_t$ is a filtered version of the compressed cell state $\tanh(C_t)$. The part of the information that should be preserved is calculated by using the sigmoid layer $o_t$, which is determined according to the following equation:

$$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o).$$

Ultimately, the final hidden output $h_t$ is expressed as

$$h_t = o_t \odot \tanh(C_t).$$
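To make the sequence of gate computations concrete, the sketch below performs a single LSTM time step in plain Python for the simplest possible case of a scalar input and a scalar hidden state. It follows the standard LSTM formulation; the function `lstm_step`, the dictionary layout of the weights and the all-zero parameter values are assumptions chosen so that the result is easy to verify by hand, and they do not reflect trained parameters.

```python
import math
from typing import Dict, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(
    x_t: float,
    h_prev: float,
    c_prev: float,
    w: Dict[str, Tuple[float, float]],
    b: Dict[str, float],
) -> Tuple[float, float]:
    """One LSTM step; w[g] holds the weights applied to (h_prev, x_t) for gate g in 'f', 'i', 'c', 'o'."""

    def pre_activation(gate: str) -> float:
        return w[gate][0] * h_prev + w[gate][1] * x_t + b[gate]

    f_t = sigmoid(pre_activation("f"))        # forget gate: how much of c_prev is kept
    i_t = sigmoid(pre_activation("i"))        # input gate: how much new information is written
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # hidden output
    return h_t, c_t


# With all-zero parameters every gate equals 0.5 and the candidate state is 0.
zero_w = {gate: (0.0, 0.0) for gate in "fico"}
zero_b = {gate: 0.0 for gate in "fico"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, w=zero_w, b=zero_b)
print(h_t, c_t)
```

In this toy setting, half of the previous cell state is retained, which shows that the forget gate scales the stored information rather than switching it strictly on or off.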

C. FULLY CONNECTED AND SOFTMAX CLASSIFICATION LAYERS
Fully connected layers are used to combine the high-level representations learned by the preceding layers. In this work, the LSTM outputs are fed into two hidden layers, and ultimately, a softmax layer is used for the final activity identification step.
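The final classification step can be illustrated with a small NumPy sketch of the softmax function. The logits below are hypothetical scores for six activity classes, and the function name `softmax` is an illustrative choice.

```python
import numpy as np


def softmax(logits):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Hypothetical outputs of the last hidden layer for six activity classes.
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0])
probs = softmax(logits)
predicted_class = int(np.argmax(probs))   # index of the most probable activity
```

The predicted activity is the class with the largest probability.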

V. PERFORMANCE EVALUATION
We present the experimental results of our proposed method (ConvAE-LSTM) on two smartphone sensor-based public standard datasets (UCI [11] and WISDM [10]) and two body-worn, sensor-based public standard datasets (OPPORTUNITY [82] and PAMAP2 [88]) in this section.
In this article, we mainly focus on smartphone sensor-based HAR. Hence, we present detailed experimental results obtained using the UCI and WISDM datasets. To demonstrate the efficiency of our proposed model, we also present the experimental results obtained by using OPPORTUNITY and PAMAP2 in terms of accuracy and F1-scores.

A. DATASET
To exhibit the efficiency of the proposed method, we use two popular public standard smartphone sensor-based HAR datasets and two body-worn, sensor-based public standard datasets that represent both static and dynamic activities. The standard datasets used are explained as follows:
• UCI dataset [11]: This standard dataset is taken from the publicly available "University of California Irvine (UCI) Machine Learning" repository. This is a balanced dataset, as shown in Figure 5. This dataset was collected from 30 subjects aged between 19 and 48 years who performed 6 different daily life activities, namely, "sitting", "standing", "walking", "lying", "walking upstairs" and "walking downstairs". To collect the data, a waist-mounted smartphone (Samsung Galaxy S II) with a built-in accelerometer and gyroscope was used. This dataset was collected in a laboratory environment with proper surveillance. This collected dataset consists of 10,299 instances in total. Triaxial linear acceleration and angular velocity measurements were collected at a constant sampling rate of 50 Hz.
• WISDM Actitracker dataset [10]: This standard public dataset is provided by the "Wireless Sensor and Data Mining (WISDM)" lab. The dataset was collected from 36 subjects using smartphone accelerometer sensors. Each subject was asked to perform 6 different human physical activities, namely, "sitting", "standing", "walking", "jogging", "walking upstairs" and "walking downstairs". This dataset was also collected in a laboratory environment with proper surveillance. In this dataset, the total number of instances is 1,098,207. The 3-axial linear acceleration measurements were collected at a constant sampling rate of 20 Hz.
• OPPORTUNITY dataset [82] : This dataset consists of complicated naturalistic activities including a large number of atomic activities (over 27,000) recorded in a sensory-rich environment at a constant sampling rate of 30 Hz. It includes recordings of 12 participants obtained using 15 networked sensor systems with 72 sensors from 10 different modalities that are embedded in the environment, in objects, and on the body. We only consider the on-body sensors, including the 3-axial accelerometer and inertial measurement units. This is an 18-class classification problem (including the null class).
• PAMAP2 dataset [88]: This dataset contains recordings from 9 subjects (8 men and 1 woman) who were asked to perform 18 lifestyle tasks, including household chores, at a constant sampling rate of 100 Hz. Over the course of 10 hours, data from inertial measurement units on the hand, chest, and ankle were collected, including accelerometer, gyroscope, magnetometer, temperature, and heart rate data. The resulting dataset has 52 dimensions.
Using a continuous sequence of sensory data, an end-to-end HAR model is implemented in this work. During this process, a sequence of short time-series segments is extracted from the raw sensory data. To preserve the temporal relationships between the samples of a given activity, a sliding window with a 50% overlapping rate is used to segment the collected raw sensory data. For the above datasets, the length of the sliding window is 128 with a step size of 64.
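The segmentation step can be sketched as follows, using the window length of 128 and step size of 64 stated above. The function name `sliding_windows`, the three-channel layout, and the random signal are assumptions made for illustration; how activity labels are assigned to each window is not covered here.

```python
import numpy as np


def sliding_windows(signal, window=128, step=64):
    """Split a (n_samples, n_channels) array into overlapping fixed-length windows."""
    signal = np.asarray(signal)
    starts = range(0, len(signal) - window + 1, step)
    if len(starts) == 0:
        # The recording is shorter than one window.
        return np.empty((0, window) + signal.shape[1:])
    return np.stack([signal[s:s + window] for s in starts])


# Hypothetical recording: 320 samples of a triaxial accelerometer.
raw = np.random.default_rng(0).normal(size=(320, 3))
windows = sliding_windows(raw)
print(windows.shape)  # (4, 128, 3)
```

Consecutive windows share 64 samples, which corresponds to the 50% overlapping rate.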

B. EXPERIMENTAL SETUP
In the training phase, forward calculation is performed on the training set to obtain the network output. Afterward, the cross-entropy errors between the predicted outputs and the actual outputs are calculated. Then, the Adam optimizer is used to backpropagate the errors through the sequence of layers to update the parameters of our proposed network. Adam [97] calculates an adaptive learning rate for each parameter and uses it to optimize the parameters with respect to the objective function. The Keras API permits one to move from an idea to the end result with the least possible delay [94]. During this experiment, we build a sequential Keras model (version 2.4.3) with TensorFlow as the backend (version 2.3.1). For our experiments, we use a single GPU (NVIDIA GTX 1060 with 6 GB of memory).
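To make the training step concrete, the following sketch shows the cross-entropy error for one sample and a single Adam update on a list of scalar parameters. It is an illustration only, not the Keras implementation used in the experiments. The learning rate of 0.001 is the value reported in this section, whereas the values of `beta1`, `beta2`, and `eps` are the defaults from the original description of Adam and are assumptions here, as are the function names.

```python
import math


def cross_entropy(probs, true_index):
    """Cross-entropy error of one sample given predicted class probabilities."""
    return -math.log(probs[true_index])


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g          # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)               # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Hypothetical example: one parameter with gradient 2.0 at the first step.
loss = cross_entropy([0.1, 0.9], true_index=1)
params, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

At the first step, the bias-corrected update has a magnitude close to the learning rate regardless of the gradient scale, which is why the per-parameter step size is described as adaptive.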
To perform the experiment, the first two datasets are divided into two different groups: 70% of the volunteers are selected for training, and 30% are used for testing the proposed HAR solution. Hence, the same subjects' data are not included in both the training and testing sets. In our experiment, simple 5-fold cross validation is used to generate multiple training and validation splits from the training set, as k-fold cross validation is less computationally complex than other methods such as leave-one-out cross validation [98]. Leave-one-subject-out (LOSO) cross validation is also performed to provide a more comprehensive evaluation. We use data from one subject for testing and those from the remaining subjects for training. This cross-subject test is more rigorous because the test subject's data are hidden from the models during training, making it a more realistic setting for validating the models' generalization abilities. For all the datasets, 1D convolution is performed in the input layer of the CNN. In our experiment, a ReLU is used as the activation function for the convolution layers, which have a kernel size of 3, a stride of 2, and 64 filters. Similarly, in the max-pooling layer, the pooling size and stride are both 2. The learning rate is set to 0.001. The optimizer calculates and updates the network parameters that affect the model training and output processes to approximate or reach the optimal value, thereby reducing the loss function.
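The two subject-wise evaluation protocols described above can be sketched in plain Python. The helper names `subject_split` and `loso_folds`, the random shuffling with a fixed seed, and the toy subject identifiers are assumptions for illustration; the text does not state how the training and testing subjects were chosen.

```python
import random


def subject_split(subject_ids, train_fraction=0.7, seed=0):
    """Assign whole subjects to either the training or the testing group."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)            # assumed: random assignment
    n_train = round(train_fraction * len(subjects))
    return set(subjects[:n_train]), set(subjects[n_train:])


def loso_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical window-level subject labels: 30 subjects with 4 windows each.
subject_of_window = [s for s in range(1, 31) for _ in range(4)]
train_subjects, test_subjects = subject_split(subject_of_window)
folds = list(loso_folds(subject_of_window))
```

Because the split is made over subjects rather than over windows, no subject contributes data to both groups.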

C. EXPERIMENTAL RESULTS
In this section, we discuss the experimental results of the proposed method in terms of accuracy, precision, recall, computational complexity, and testing time by using the first two datasets. The experimental results obtained with the other two datasets are provided later. To prove the capability of our proposed model, we also compare the results of our proposed model with those of other commonly used DL approaches, such as a CNN, LSTM, an AE, CNN-LSTM, and a convolutional AE. In this experiment, we take the simple CNN and LSTM architectures as proposed in [58]. In the cases of the AE and convolutional AE, we utilize a max-pooling layer for encoding, which is similar to our proposed method (ConvAE-LSTM).

1) UCI DATASET
Utilizing the UCI-HAR dataset, we perform exhaustive experiments on various DL approaches, such as a CNN, LSTM, an AE, CNN-LSTM, a convolutional AE, and the proposed method (ConvAE-LSTM). Table 2 demonstrates the training time, testing time, and testing accuracy of the aforementioned DL approaches, including our proposed model. All the DL approaches are run in the same experimental environment to demonstrate the effectiveness of our proposed method. The computational time of our proposed model is very competitive with those of the aforementioned DL approaches in the same computational environment. Moreover, the computational time of ConvAE-LSTM is very competitive with that of the state-of-the-art approach proposed in [51], where the computational times for training and testing are 3.4274 s and 372.6 ms, respectively, when using the CNN with the UCI dataset. The testing accuracy of ConvAE-LSTM is 98.14%, which is much higher than that of other popular DL approaches. The computational times of the AE and convolutional AE are the lowest among all the approaches listed in Table 2; however, their accuracies are very poor in comparison with that of our proposed approach, as these two methods do not consider the temporal dependencies among the raw sensory time-series data. Table 3 demonstrates the detailed classification results of our proposed model. In this proposed model, as we take both a convolutional AE and LSTM in combination, the F1-scores of activities such as "walking", "walking downstairs" and "walking upstairs" are 99%, 100%, and 94%, respectively. Therefore, we can conclude that our proposed method can distinguish similar activity patterns very efficiently, which is not achieved by using only the CNN method, as mentioned in the HAR literature.
Similarly, for static activities such as "sitting", "lying" and "standing", we achieve F1-scores of 97%, 98%, and 98%, respectively, which are much better than those of any CNN-based HAR solution. From the experimental results, we can conclude that our proposed method not only efficiently differentiates between static and dynamic activities but can also efficiently identify similar activity patterns. Figure 7 depicts the confusion matrix of the different activities in the testing set. We also analyze the activity recognition accuracy of the proposed ConvAE-LSTM method and compare its performance with that of UCI data-based state-of-the-art HAR solutions, namely, the CNN, a CNN + handcrafted features [8], LSTM with bidirectional LSTM [58], bidirectional LSTM alone [71], CNN-LSTM [74] and LSTM-CNN [80], as well as two popularly used shallow ML approaches (a random forest (RF) and a support vector machine (SVM)). Table 4 compares the average accuracy of ConvAE-LSTM with that of the aforementioned approaches. ConvAE-LSTM provides the best activity recognition accuracy (98.14%) among all tested approaches.
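The per-activity precision, recall, and F1-score values reported in such tables follow directly from a confusion matrix. A minimal sketch is given below; the function name `per_class_metrics` and the 2×2 toy matrix are illustrative assumptions, and the counts are invented rather than taken from Figure 7.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, truly another class
        fn = sum(confusion[c]) - tp                        # truly c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(confusion):
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Toy confusion matrix for two activities (hypothetical counts).
toy = [[8, 2],
       [1, 9]]
scores = per_class_metrics(toy)
overall = accuracy(toy)
```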

2) WISDM DATASET
Utilizing the WISDM dataset, we perform exhaustive experiments on various DL approaches, such as the CNN, LSTM, the AE, CNN-LSTM, the convolutional AE, and the proposed method (ConvAE-LSTM). Table 5 demonstrates the training times, testing times, and testing accuracies of the aforementioned DL approaches, including our proposed model. The computational time of our proposed model is very competitive with those of the other DL approaches in the same computational environment. The testing accuracy of the proposed model is 97.76%, which is much higher than that of other popular DL approaches. However, the training and testing times of the AE and convolutional AE are the lowest among all the mentioned approaches in Table 5, whereas their accuracies are very poor in comparison with that of our proposed approach, as these two methods do not consider the temporal dependencies among the raw sensory time-series data. It is pertinent to mention that the WISDM dataset is imbalanced, as depicted in Figure 6. In the case of an imbalanced dataset, several techniques are required to balance the dataset according to the HAR literature, and most conventional ML algorithms fail to classify imbalanced datasets properly. However, while performing our experiment, none of these techniques are applied to convert the imbalanced dataset into a balanced dataset. The imbalanced WISDM dataset is employed in our proposed model directly, and we obtain very high identification performance for all the activities, as the F1-score of each activity is greater than 95% and the overall accuracy is 99%. Hence, we can conclude that our proposed method provides an added advantage to overcome the imbalanced dataset issue. Table 6 shows the detailed classification results of our proposed method with WISDM. 
In this proposed method, as we utilize both a convolutional AE and LSTM in combination, the F1-scores for activities such as "jogging", "walking", "walking downstairs" and "walking upstairs" are 100%, 99%, 95%, and 97%, respectively. Therefore, we can conclude that our proposed method can distinguish similar activity patterns very efficiently, which is not achieved when using only the CNN method, as mentioned in the HAR literature. Moreover, for both static activities ("sitting" and "standing"), the F1-score is 99%, which is remarkable according to the HAR literature. Figure 8 depicts the confusion matrix for the different activities in the testing set.
We also analyze the activity recognition accuracy of our proposed ConvAE-LSTM method, and compare its performance with that of WISDM dataset-based standard HAR models, the CNN [18], a CNN + handcrafted features [48], an AE ensemble [66], the convolutional AE [69] and LSTM-CNN [80], as well as two shallow ML approaches (an RF and an SVM). Table 7 compares the average accuracy of ConvAE-LSTM with that of the aforementioned approaches. ConvAE-LSTM provides the best activity recognition accuracy (98.67%) among all tested approaches.

D. PERFORMANCE EVALUATION OF THE PROPOSED MODEL USING LOSO CROSS VALIDATION
We also perform LOSO cross validation to provide a more comprehensive evaluation. We use the data from one subject for testing and the data from the remaining subjects for training. This cross-subject test is more difficult because the test data are hidden from the models, making it a more realistic setting for validating the models' generalization abilities. After testing the models with a unique subject for each fold, we obtain different evaluation metric values, one from each fold. To assess the accuracy of the models, we take the mean ± SD of all the accuracy metrics. We perform the LOSO cross-validation evaluation technique with a 95% confidence level as follows:
• In LOSO cross validation, for each fold, we obtain different accuracy metrics. We calculate the mean and SD for each accuracy metric.
• After calculating the average accuracy, we calculate the error: error = 1 − accuracy.
• Next, we calculate the confidence interval for the classification error using

$$\text{error} \pm \text{const} \times \sqrt{\frac{\text{error} \times (1 - \text{error})}{n}}$$

where n is the number of samples used to evaluate the model and the value of the constant is 1.96, which is provided by the statistics for the 95% confidence level.
By using the aforementioned steps, we calculate the true accuracy and confidence interval with a 95% confidence level (the significance level is 0.05) for each of the models, as presented in Tables 8 and 9.
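The steps above can be sketched as a short function. It assumes the usual normal-approximation interval for a proportion, error ± 1.96 × sqrt(error × (1 − error) / n); the function name and the sample count in the example are hypothetical.

```python
import math


def error_confidence_interval(accuracy, n_samples, z=1.96):
    """Return (lower, upper) bounds of the classification error interval."""
    error = 1.0 - accuracy
    half_width = z * math.sqrt(error * (1.0 - error) / n_samples)
    # The error is a proportion, so the bounds are clipped to [0, 1].
    return max(0.0, error - half_width), min(1.0, error + half_width)


# Hypothetical example: 90% mean accuracy evaluated on 100 samples.
lower, upper = error_confidence_interval(accuracy=0.90, n_samples=100)
```

The interval narrows as the number of evaluation samples grows, since the half-width scales with 1/sqrt(n).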
Even when using LOSO cross validation, which is more realistic and difficult, the proposed technique outperforms the aforementioned state-of-the-art methods on both the UCI and WISDM datasets. By utilizing LOSO cross validation, the accuracies of the proposed model are 97.13% and 97.56% on the UCI and WISDM datasets, respectively, with a 0.05 level of significance. Similarly, the F1-scores of the proposed model are 97.08% and 97.38% on the UCI and WISDM datasets, respectively, with a 0.05 level of significance. The performance of the different models according to LOSO cross validation with a 95% confidence level is presented in Tables 8 and 9 for the UCI and WISDM datasets, respectively.

E. EXPERIMENTAL RESULTS OBTAINED ON THE OPPORTUNITY AND PAMAP2 DATASETS
We use LOSO cross validation to perform an experiment using these two body-worn sensor datasets. The performance of the different models according to LOSO cross validation with a 95% confidence level is presented in Table 10 for the OPPORTUNITY and PAMAP2 datasets. We achieve an accuracy of 95.69% and an F1-score of 95.54% on the OPPORTUNITY dataset and an accuracy of 94.33% and an F1-score of 94.46% on the PAMAP2 dataset. Our proposed method outperforms the existing methods on both datasets.

F. STATISTICAL ANALYSIS
To prove the generalization and robustness of our proposed technique, it is also necessary to perform a statistical test. In this study, we consider four different datasets: UCI [11], WISDM [10], PAMAP2 [88] and OPPORTUNITY [82]. We perform the Friedman test [100], [101], which is a nonparametric equivalent of the repeated-measures ANOVA technique. In our statistical test, we assume that the null hypothesis is as follows: "There are no significant differences among the model performances". The alternate hypothesis is as follows: "There is a significant difference among the model performances".
The following steps are executed to perform the Friedman test.
• First, represent the observed accuracy in a matrix $\{x_{ij}\}_{n \times k}$ with n rows and k columns, where the accuracies of 16 different models are presented corresponding to the 4 different datasets. In our experiment, n = 16 and k = 4.
• Then, for each dataset, separately calculate the ranks of the models.
• Replace the data with a new matrix $\{r_{ij}\}_{n \times k}$, where entry $r_{ij}$ is the rank of $x_{ij}$ within block i.
• Calculate the values $\bar{r}_{\cdot j} = \frac{1}{n} \sum_{i=1}^{n} r_{ij}$.
• Then, calculate the test statistic (Friedman test statistic) using the following formula: $Q = \frac{12n}{k(k+1)} \sum_{j=1}^{k} \left( \bar{r}_{\cdot j} - \frac{k+1}{2} \right)^2$.
• Finally, using a chi-squared distribution, approximate the probability distribution of Q. In this case, the p-value is given by $P(\chi^2_{k-1} \geq Q)$.
After performing the abovementioned test with α = 0.05, the observed level of significance satisfies p ≤ α. Hence, the result is statistically significant. However, this p-value is based on a single accuracy and thus may give an inappropriate result. Therefore, to adjust our statistical confidence measures based on the number of completed tests, we require multiple testing correction processes. The Bonferroni correction [102] is the simplest and most extensively used multiple testing correction method. If we use a significance threshold of α but run n independent tests, the Bonferroni adjustment only considers a score significant if the matching p-value ≤ α/n. A Bonferroni correction [102] is used to control the familywise Type-I error rate, resulting in an adjusted significance of 0.0031. In Table 11, the p-values are less than 0.31%, and we can statistically conclude that the proposed model outperforms the state-of-the-art models as the assumed null hypothesis is rejected. Hence, there are significant differences between the model performances.
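The ranking and test-statistic steps listed above, together with the Bonferroni threshold, can be sketched in plain Python. This is a generic illustration, not the authors' code: it assumes the standard Friedman formulation in which each row of the matrix is a block ranked across its k columns, tied values receive average ranks, and no tie correction is applied to Q. The function names and the toy matrix are hypothetical.

```python
def rank_with_ties(values):
    """Ranks starting at 1 for the smallest value; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(x):
    """Friedman statistic Q for a matrix with n rows (blocks) and k columns."""
    n, k = len(x), len(x[0])
    ranks = [rank_with_ties(row) for row in x]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    return 12 * n / (k * (k + 1)) * sum((rj - (k + 1) / 2) ** 2 for rj in mean_ranks)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


# Toy matrix (hypothetical accuracies): 3 blocks, 3 columns, identical ordering.
toy = [[0.90, 0.95, 0.98],
       [0.88, 0.93, 0.97],
       [0.91, 0.94, 0.99]]
q = friedman_statistic(toy)
threshold = bonferroni_threshold(0.05, 16)
```

Given Q, the p-value could be obtained from a chi-squared survival function with k − 1 degrees of freedom, for example `scipy.stats.chi2.sf(q, k - 1)` if SciPy is available. With α = 0.05 and 16 tests, the Bonferroni threshold is 0.003125, which agrees with the adjusted significance of 0.0031 quoted in the text.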

G. EFFECTS OF THE HYPERPARAMETERS ON THE PERFORMANCE OF THE PROPOSED MODEL
The performance of a classification model is heavily influenced by its hyperparameters. The impacts of three major hyperparameters, that is, the type of optimizer, the batch size, and the number of epochs, on model performance are presented in this section. Tests are run on the first two datasets, and the performance of the model is assessed by varying these hyperparameters. The accuracy is utilized as the criterion for evaluation.

1) IMPACT OF THE OPTIMIZER
An optimizer updates and estimates the network parameters that affect the model training and output processes to approximate or reach the optimal value, thereby reducing the loss function. This is an essential part of any DL approach. As a result, selecting an appropriate optimizer for DL training is critical. As illustrated in Figures 9 and 10, several common optimizers, such as Adam, RMSprop, SGD, AdaGrad, and AdaDelta, are empirically compared. The model trained with the Adam optimizer has the best fitting effect and the steadiest gradient descent curve. Hence, to train the CNN model, Adam is employed as the optimizer. Figures 15 and 16 present the accuracies and losses induced by different numbers of epochs on the WISDM dataset.

2) IMPACT OF THE BATCH SIZE
In regard to DL, minibatch processing is a popular technique for training neural networks. The gradient descent process may slow down due to the optimization of the cumulative error over the entire training set and possibly lead to a local optimum for the corresponding model. If the error due to one sample only is optimized in one iteration, the gradient descent step may fluctuate dramatically, resulting in training difficulty. Figures 11 and 12 depict the accuracies obtained with the five different batch sizes. When the batch size is set to 192, the accuracy is at its maximum for the WISDM public dataset, and when the batch size is 128, the accuracy is at its maximum for the UCI dataset.

3) EFFECT OF THE NUMBER OF EPOCHS
The number of epochs is a hyperparameter that plays an important role in a DL model's training process. The total number of epochs used helps us determine whether the model has been overtrained. Figures 13 and 14 present the accuracies and losses obtained with different numbers of epochs on the UCI dataset. The validation set is used to minimize overfitting as much as possible. We stop the training procedure when the validation error is minimal.
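One common way to realize this stopping rule is early stopping with a patience window, sketched below. The `patience` parameter, the function name `best_epoch`, and the validation losses are assumptions introduced for illustration; the text only states that training stops when the validation error is minimal.

```python
def best_epoch(val_losses, patience=3):
    """Return the epoch index with the lowest validation loss seen before stopping.

    Training is considered stopped once `patience` consecutive epochs
    have passed without a new minimum.
    """
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            break                      # no improvement for `patience` epochs
    return best_idx


# Hypothetical validation losses per epoch.
history = [0.90, 0.60, 0.50, 0.55, 0.52, 0.58, 0.60, 0.40]
stop_at = best_epoch(history, patience=3)
```

With a patience of 3, training in this toy run stops before the late improvement at the last epoch is reached, which shows that the choice of patience affects the selected epoch.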

H. COMPLEXITY ANALYSIS OF THE PROPOSED ARCHITECTURE
Suppose that $n_0$ is the number of input channels, $n$ is the number of filters, $s_1 \times s_2$ is the size of each filter, and $m_1 \times m_2$ is the size of the output feature map. The complexity of the convolution layer is $O(n_0 \cdot s_1 \cdot s_2 \cdot n \cdot m_1 \cdot m_2)$. In the case of the deconvolutional layer in the AE, only the dimensionality is decreased by the downsampling factor, so there is no effect on the complexity. The pooling layer is a fixed operation with no weighting factor. For a fully connected layer with $m$ input dimensions and $n$ output dimensions, the number of parameters is $(m + 1) \cdot n$. LSTM is local in terms of space and time; therefore, the overall complexity of an LSTM network per time step is equal to $O(w)$, where $w$ is the number of weights. The overall complexity is $O(((n_0 \cdot s_1 \cdot s_2 \cdot n \cdot m_1 \cdot m_2) + w) + i \cdot e)$, where $i$ is the input length and $e$ is the number of epochs.
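The two counting formulas above can be evaluated with a few lines of Python. The function names and the layer sizes in the example are hypothetical; the example assumes a 1D convolution over a 128-sample window with a kernel size of 3, a stride of 2, and no padding, which gives an output length of 63.

```python
def conv_layer_ops(n_in, n_filters, kernel_size, output_size):
    """Multiply-accumulate count n0 * s1 * s2 * n * m1 * m2 of a convolution layer."""
    s1, s2 = kernel_size
    m1, m2 = output_size
    return n_in * s1 * s2 * n_filters * m1 * m2


def dense_layer_params(m, n):
    """Parameter count (m + 1) * n of a fully connected layer, including biases."""
    return (m + 1) * n


# Hypothetical sizes: 9 input channels, 64 filters, 3x1 kernel, 63x1 output map.
ops = conv_layer_ops(n_in=9, n_filters=64, kernel_size=(3, 1), output_size=(63, 1))
params = dense_layer_params(m=64, n=6)   # 64 inputs feeding 6 activity classes
```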

VI. DISCUSSIONS
In this study, we propose a DL framework by combining a convolutional AE and LSTM. The fully connected layer of the CNN increases the computational time of the model as the number of parameters increases. Additionally, CNNs are efficient in extracting features from labeled data, which are very rare in real scenarios. To overcome these shortcomings, we combine the convolution layer with an AE, and the output of the convolutional AE is given as the input of the LSTM module to extract temporal features and make the proposed DL architecture more accurate and effective in recognizing human activities. In the traditional ML method, feature engineering is a challenging and tedious job. In contrast, DL approaches offer automatic feature learning. However, various DL approaches have their own merits and demerits. Hence, in this study, we consider the advantages of the convolution layer in combination with an AE for automatic feature extraction and for overcoming the overfitting issue. Moreover, sensor data streams are time series; hence, LSTM-based approaches with excellent sequential modeling capabilities are inherently appropriate. With the typical memory and computational resource restrictions, however, training an LSTM model on raw sensory data with a high sampling frequency is impractical. The suggested model not only avoids complicated data preprocessing and feature engineering techniques but also provides high recognition accuracy in an acceptable amount of computational time.
In this study, we mainly focus on smartphone-based HAR. Hence, for detailed experimentation, we consider smartphone-based public standard sensor data for exhaustive experiments. To prove the effectiveness of the proposed method, we also experiment with body-worn sensory data drawn from the "PAMAP2" and "OPPORTUNITY" public standard datasets and compare our results with those of state-of-the-art methods developed in other studies.
In the literature, convolutional AEs and LSTM are both popularly used for manifold data. Our proposed architecture is the combination of a convolutional AE and LSTM. However, in this paper, we do not show the experimental results obtained when using manifold data. In our future work, we can adopt manifold data to prove the effectiveness of our proposed framework.

VII. CONCLUSION
A novel DL approach in which a convolutional AE is followed by LSTM for HAR, namely, ConvAE-LSTM, is proposed in this paper. To establish the generalizability, potential, and efficacy of the suggested model, two standard smartphone sensor-based datasets (UCI and WISDM) and two standard body-worn sensor datasets (OPPORTUNITY and PAMAP2) are considered for experimentation. The proposed method achieves average precision, recall, F1-score, and accuracy values of 97%, 96.83%, 97.67%, and 98.14% on the UCI dataset and 98.17%, 98.33%, 98.17% and 98.67% on the WISDM dataset, respectively. Furthermore, we also explore the computational times of our proposed method and other commonly used DL approaches in the same experimental environment. The computational time of the proposed model is highly competitive with those of the other mentioned DL approaches. We also examine how several hyperparameters, such as the type of optimizer used, the number of epochs, and the batch size, affect the model performance. Finally, the model is trained with the best hyperparameters for the final design. To summarize, compared with the other tested DL approaches and two popularly used shallow ML approaches described in the HAR literature, the proposed ConvAE-LSTM model demonstrates consistently higher performance and has better generalization.
Despite the fact that much work has been done in this field, our findings show that many challenges remain unsolved, particularly in the area of activity recognition. Several aspects will be involved in future work. First, we will compare the proposed method with other recent DL-based methods and perform experiments on more available datasets by using other available classifiers. Second, in real-life applications, the applicability of the proposed method should be analyzed.