Ultra-Wideband Radar-Based Activity Recognition Using Deep Learning

With recent advances in the field of sensing, it has become possible to build better assistive technologies. This enables the strengthening of eldercare with regard to daily routines and the provision of personalised care to users. For instance, it is possible to detect a person’s behaviour based on wearable or ambient sensors; however, it is difficult for users to wear devices 24/7, as they would have to be recharged regularly because of their energy consumption. Similarly, although cameras have been widely used as ambient sensors, they carry the risk of breaching users’ privacy. This paper presents a novel sensing approach based on deep learning for human activity recognition using a non-wearable ultra-wideband (UWB) radar sensor. UWB sensors protect privacy better than RGB cameras because they do not collect visual data. In this study, UWB sensors were mounted on a mobile robot to monitor and observe subjects from a specific distance (namely, 1.5–2.0 m). Initially, data were collected in a lab environment for five different human activities. Subsequently, the data were used to train a model using a state-of-the-art deep learning approach, namely long short-term memory (LSTM). Conventional training approaches were also tested to validate the superiority of LSTM. As a UWB sensor collects many data points in a single frame, enhanced discriminant analysis was used to reduce the dimensions of the features through application of principal component analysis to the raw dataset, followed by linear discriminant analysis. The enhanced discriminant features were fed into the LSTMs. Finally, the trained model was tested using new inputs. The proposed LSTM-based activity recognition approach performed better than conventional approaches, with an accuracy of 99.6%. We applied 5-fold cross-validation to test our approach. We also validated our approach on a publicly available dataset.
The proposed method can be applied in many prominent fields, including human–robot interaction for various practical applications, such as mobile robots for eldercare.


I. INTRODUCTION
According to a 2017 report by the Department of Economic and Social Affairs in the United Nations, the population of older adults is increasing more rapidly than other age groups [1]. In 2015, one out of eight people worldwide was aged 60 years or older. By 2050, the number of older adults is expected to reach nearly 2.1 billion. A major challenge in working with an ageing population is the effective delivery of healthcare services [2]. Moreover, healthcare for older adults is a matter of great concern for their relatives. This is particularly true when older adults are alone at home, as they are at risk of being affected by unforeseen circumstances, such as falls. Recently, independent living among older adults has become a significant challenge from both social and economic perspectives. Therefore, assisting older adults with their well-being and autonomy has become a research topic of great interest [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Genoveffa Tortora.
Understanding the current state and context of users is crucial for assisting in their everyday lives. Human movement has been actively studied using various ambient sensors [4], [5]. Previously, video-based sensors have been used for human activity recognition and fall detection [6]. However, video-based sensors often face challenges in their use owing to privacy issues. In contrast, the XeThru ultra-wideband (UWB) radar is a non-contact ambient sensor that has no such privacy issues [7]-[9]. Hence, sensors such as UWB radars can be used for general robot navigation sensing and emergency analysis based on human body movement while preserving privacy, particularly for older adults living independently.
Sensors with wireless communication ability make human-machine interaction robots suitable for human behaviour and vital sign analysis [9]. Many researchers have explored ambient sensors for monitoring human behaviour and health status [10]-[15]. For instance, CASAS adopted machine learning tools for user behaviour analysis [10]. GatorTech is an earlier research project wherein many ambient sensors were used to provide user services, such as voice and behaviour recognition [11]. Zhang et al. [12] proposed an assisted living environment to help prolong the time for which older adults could live in their homes. SWEET-HOME, a French project, aimed at developing an assisted living technology mainly based on audio analysis [13]. Billias et al. [14] adopted ambient sensors, such as cameras and microphones, to analyse the daily activities of older users. Recently, Lio [15], a personal robot assistant developed by F&P Robotics, was introduced with a multifunctional arm. The robot could assist patients autonomously and provide several healthcare functions. During the long COVID-19 pandemic, additional functions, such as disinfection operations and remote detection of elevated body temperature, were performed by Lio.
Human activity recognition (HAR) and emergency detection have made significant progress in recent years through machine learning techniques [16]. Most previous HAR studies have relied on hand-crafted features, which are sometimes difficult to distinguish with sufficient accuracy to classify activities [17]. Conventional pattern recognition techniques, such as K-nearest neighbour (KNN) [18], support vector machines (SVMs) [19], artificial neural networks (ANNs) [20], and random forest (RF) [21], perform well in HAR and emergency detection. Meanwhile, in recent years, we have witnessed an incredible growth in machine learning research enabled by advancements in deep learning approaches. Deep learning has resulted in remarkable performance in many research areas, such as computer vision [22], business analytics, and natural language processing [23]. Recently, convolutional neural networks (CNNs) have shown significant improvement in classifying human activities [24]. Yang et al. [25] built a CNN-based architecture that could analyse multi-channel time-series data. A unified layer was introduced to merge multiple channels prior to classification. Moreover, the CNN-based multi-channel time-series architecture is task-dependent and is characterised by a higher discrimination accuracy for classifying human activities. Previous research has shown the use of UWB sensors to recognize human activities [26], [27]. Singh et al. [28] proposed a framework for HAR using point clouds generated by mmWave radar. Their activities were related to exercise. Sharma et al. [29] introduced a channel impulse response based HAR system which can recognize sitting, standing and lying positions.
It has been observed that the deep network structure in deep learning is more suitable than traditional machine learning approaches for supervised and incremental learning [30]. Thus, deep learning is an ideal approach for analysing human behaviour and health status using data from newly introduced sensors, such as XeThru UWB radars [8], [9], [31]. A recurrent neural network (RNN) is one of the most popular deep learning techniques for time-series data [30], [32]. An RNN is adopted to decode time-sequential data for modelling various events, such as an emergency due to an unusual heart rate. Therefore, a special type of RNN, namely long short-term memory (LSTM), is proposed in this study to classify the XeThru UWB radar data. The performance of conventional approaches, such as SVM, AdaBoost, multilayer perceptron (MLP), quadratic discriminant analysis (QDA), KNN, RF, and decision trees (DTs), has also been evaluated and compared with that of LSTM. Furthermore, as XeThru UWB radar sensors collect a large number of data points in a single frame, principal component analysis (PCA) and linear discriminant analysis (LDA) are introduced for dimensionality reduction. Figure 1 illustrates the basic flow of the proposed system in two steps, namely training (left-hand side) and testing (right-hand side).
In this study, we investigate whether a XeThru UWB radar sensor can recognise different complex human activities. The contributions of this work are two-fold:
• A XeThru UWB radar sensor is used with a novel LSTM-based approach to classify activities.
• Enhanced discriminant analysis (EDA), combining PCA with LDA, is proposed to reduce data dimensionality and extract significant features before feeding them into the classifiers.

A. SUBJECTS
This study aimed to classify five different activities using a UWB sensor. Overall, 13 participants were included in the study, six of whom were female. All participants were healthy, and their ages ranged from 22 to 50 years. All subjects voluntarily participated in the experiments, and written consent was obtained from all subjects before participation. The experiments and data collection were approved in advance by the Norwegian Centre for Research Data (NSD). All experiments were performed in accordance with relevant guidelines and regulations.

B. DATA ACQUISITION FROM UWB RADAR
The UWB radar has been used for imaging in sensing through walls [33]-[35], detecting humans [36], [37], assisting in public security [38], and recognising moving subjects [39], [40]. In the current study, we used XeThru X4, a compact impulse-radio UWB radar system on a chip, as shown in Figure 2. The radar is configurable and provides developers with a high degree of freedom to develop new applications, ranging from basic presence detection to vital sign analysis. The pulse transmitted by the radar can be configured within two bands, namely the lower and upper bands. The lower pulse generator enables transmission within the 6.00-8.50 GHz band, whereas the higher pulse generator enables transmission within the 7.25-10.20 GHz band. To capture the reflected energy, the radar applies a high-speed sampler with a sampling rate of 23.328 GS/s, which can sample up to 1,536 samples [8]. The distance from the radar to an object is called the slant range, which can be determined by

$$R = \frac{C\,T}{2}$$

where C is the speed of light and T is the time required for signal reflection. The divisor 2 is used because the radar signal travels to the target and then travels the same distance back to the radar. The radar system is dependent on waveform design in several ways. The range resolution is governed by the bandwidth (a wider bandwidth resolves finer range differences), and the signal-to-noise ratio (SNR) of the output signal is directly proportional to the waveform energy.
In contrast, the signal wavelength affects the radial velocity resolution [41]. Because of its short duration, good spectrum coverage, and ease of implementation in CMOS, a frequency-shifted Gaussian pulse can be considered an excellent candidate for a UWB radar. The frequency-shifted Gaussian pulse can be expressed as

$$g(t) = p(t)\cos(\omega_c t), \qquad p(t) = V_{TX}\exp\left(-\frac{t^2}{2\tau^2}\right)$$

where $p(t)$ denotes the Gaussian pulse envelope, $\omega_c$ is the angular centre frequency, and $\tau$ determines the −10 dB bandwidth.
The pulse amplitude $V_{TX}$ is dependent on the regulatory limits for the peak and average output power [42]. For swept-threshold (ST) sampling, the sweep time is dependent on the number of pulses ($n_{pulses}$) and the pulse repetition frequency (PRF).
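As a small worked example of the slant-range relation described above, the following sketch converts a round-trip delay into a range and derives the range spacing between consecutive samples from the 23.328 GS/s sampling rate and the 1,536-sample frame quoted in the text. The function names and the 10 ns example delay are illustrative assumptions and are not part of any XeThru software.

```python
# Illustrative sketch: slant range from round-trip time, R = C * T / 2.
C = 299_792_458.0   # speed of light in m/s
FS = 23.328e9       # sampler rate quoted in the text, in samples/s
N_SAMPLES = 1536    # maximum number of samples per frame quoted in the text


def slant_range(round_trip_time_s: float) -> float:
    """Range in metres; the factor 2 accounts for the out-and-back path."""
    return C * round_trip_time_s / 2.0


def range_per_sample() -> float:
    """Range spacing in metres between two consecutive samples."""
    return slant_range(1.0 / FS)


# Hypothetical example values, printed for inspection only.
print(f"10 ns round trip: {slant_range(10e-9):.3f} m")
print(f"range per sample: {range_per_sample() * 1e3:.2f} mm")
print(f"frame span:       {range_per_sample() * N_SAMPLES:.2f} m")
```

Under these numbers, one sample corresponds to roughly 6.4 mm of range, so a full frame spans a little under 10 m, which comfortably covers the 1.5–2.0 m observation distance used in this study.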
The matched filter radar equations are used to obtain the SNR for an ideal pulse-based radar receiver (RX) from a target, given its range $R$ and radar cross-section (RCS) $\sigma_{RCS}$ [43]:

$$\mathrm{SNR} = \frac{P_t\, G^2 \lambda^2 \sigma_{RCS}\, t_p}{(4\pi)^3 R^4 k_B T_0 F} \tag{5}$$

where $G$ is the antenna gain, $k_B$ is Boltzmann's constant, $\lambda$ is the wavelength, $T_0$ is the temperature in Kelvin, $P_t$ is the transmitted pulse power, $F$ is the RX noise factor, and $t_p$ is the pulse duration. The transmitted pulse is approximated as a rectangular windowed pulse with length $t_p$ such that the energy of the pulse $E_p$ equals

$$E_p = P_t\, t_p$$

In an ST-based pulse radar, (5) is modified by two additional quantities: $n_{steps}$, the number of steps in the threshold sweep, which causes an SNR loss, and $G_{ST}$, the SNR gain obtained from multiple threshold levels covering the noise region around the signal value, which results in an integration effect. The largest signal in the sampled frame, which occurs at the minimum distance, can be used together with the noise to determine the threshold sweep range. Assuming that $5\sigma$ of noise is sufficient as the maximum noise voltage, the number of threshold steps required is determined by the voltage gain of the RX front end, $A_{FE}$, and the received pulse amplitude $V_{RX}$. These quantities also determine the ST-SNR loss. The major advantage of using ST rather than a multi-bit system is that it only requires a 1-bit quantiser, which can increase the inherent linearity of the system and simplify the design. Moreover, with no reduction in the SNR, ST can be operated over longer consecutive ranges.
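To make the dependence of the SNR on range concrete, the sketch below evaluates the standard matched-filter form of the radar equation using the quantities named in the text (transmitted power, pulse duration, antenna gain, wavelength, RCS, range, temperature, and noise factor). All numerical values are placeholder assumptions chosen only for illustration; they are not the XeThru X4 specifications, and the ST-specific loss and gain terms are not modelled.

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant in J/K


def matched_filter_snr(p_t, t_p, gain, wavelength, rcs, rng, t0=290.0, noise_factor=1.0):
    """Linear SNR of an ideal pulse radar with pulse energy E_p = P_t * t_p."""
    pulse_energy = p_t * t_p
    numerator = pulse_energy * gain**2 * wavelength**2 * rcs
    denominator = (4.0 * math.pi) ** 3 * rng**4 * K_B * t0 * noise_factor
    return numerator / denominator


def to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


# Placeholder parameters (assumptions, not measured values).
params = dict(p_t=1e-3, t_p=1e-9, gain=4.0, wavelength=299_792_458.0 / 7.29e9, rcs=1.0)

snr_near = matched_filter_snr(rng=1.5, **params)
snr_far = matched_filter_snr(rng=3.0, **params)
print(f"SNR drop when the range doubles: {to_db(snr_near) - to_db(snr_far):.2f} dB")
```

Because the range enters as the fourth power, doubling the distance costs about 12 dB regardless of the other parameters, which is one reason the observation distance matters for radar-based activity recognition.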
In our study, we used a UWB impulse radar, which can sense vital sign data, such as presence, breathing patterns, and movement, from people who are either sitting on a bed or walking around. The returned UWB signal forms the required echo matrix, with frames corresponding to each range bin, as shown below [44]:

$$M = \big[\, m(x, y) \,\big], \qquad x = 1, \dots, X, \quad y = 1, \dots, Y$$

where x and y denote the range and time, respectively, while X is the extent of the range and Y is the extent of the time span of the data. Hence, the total size of the data is XY. The frame M further passes through PCA, which is discussed in more detail in the DATASET AND METRICS section.

III. DATASET AND METRICS
The obtained dataset comprised five normal activities: lying (A1), sitting on the bed with the legs on the bed (A2), sitting on the bed with the legs on the floor (A3), standing (A4), and walking (A5). An illustration of each activity is given in Figure 3. Figure 4 shows an A3 posture wherein a robot with a UWB sensor monitors a subject. The possible classification outcomes are based on true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). To measure the performance, we used the following metrics.
Accuracy is the ratio of the number of correctly predicted observations to the total number of observations.
Precision is the ratio of the number of correctly predicted positive observations to the total number of predicted positive observations.
Recall is the ratio of the number of correctly predicted positive observations to the total number of observations in the actual class.
VOLUME 9, 2021
F1-score is an overall measure of the accuracy of the model combining precision and recall.
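The four metrics defined above can be computed directly from a confusion matrix. The sketch below is a minimal illustration in plain Python; the 3 × 3 matrix is a made-up example (rows are actual classes, columns are predicted classes) and does not come from the paper's results.

```python
def overall_accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def class_metrics(cm, k):
    """Precision, recall, and F1-score for class k, treated one-vs-rest."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # actual k, predicted as another class
    fp = sum(row[k] for row in cm) - tp   # predicted k, actually another class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical confusion matrix for three classes.
example_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

print("accuracy:", overall_accuracy(example_cm))
print("class 0 (precision, recall, F1):", class_metrics(example_cm, 0))
```

The F1-score here is the harmonic mean of precision and recall, which is the usual way of combining the two into a single number.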
A. FEATURE EXTRACTION USING PRINCIPAL COMPONENT ANALYSIS (PCA)
We apply Gaussian kernel-based PCA to the input data to approximate the original data with fewer dimensions. PCA focuses on the directions of maximum variance in the new feature space and reduces the dimensions by retaining mainly the essential variations in the data. Because the data are nonlinear, a Gaussian kernel-based PCA was used [45]. The covariance matrix of the data is computed over the N events in the activity period after mapping them with a Gaussian kernel γ. Eigenvalue decomposition is then applied to this matrix, where E represents the principal components, α represents the eigenvalues, and K is the diagonal matrix of the eigenvalues. The features for an event are then obtained by projecting it onto the principal components. The size of the matrix E becomes t × m, where t is the dimension of each vector, m is the number of principal components to be considered, and K is an m × m diagonal matrix. Moreover, E maps the original coordinate system onto the eigenvectors. The eigenvector corresponding to the largest eigenvalue indicates the axis of largest variance, the eigenvector corresponding to the next largest eigenvalue indicates the orthogonal axis with the second largest variance, and so on. Typically, eigenvalues close to zero correspond to negligible variance and can thus be excluded. Hence, the m eigenvectors corresponding to the largest eigenvalues can be used to define the subspace. Figure 5 shows the top 20 eigenvalues corresponding to the first 20 eigenvectors.
PCA is a second-order statistics-based method of analysis that represents global information [46]. Applying PCA to human activities produces global features that represent frequently moving parts of the human body engaged in various activities [47], [48]. Because we apply EDA, which is a combination of PCA and LDA, for dimensionality reduction, the principal components extracted through PCA are passed through LDA.

B. FEATURE EXTRACTION USING LDA
LDA is popular in supervised classification approaches. It creates hyperplanes to separate the different classes. The hyperplanes maximise the separation between classes and minimise the intra-class variance. LDA, which is known to extract the best features and reduce the dimensionality of the data [49], projects the input data into a lower-dimensional space. The equations below define the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$:

$$S_B = \sum_{i=1}^{c} J_i\,(n_i - \bar{n})(n_i - \bar{n})^T$$

$$S_W = \sum_{i=1}^{c} \sum_{m_k \in C_i} (m_k - n_i)(m_k - n_i)^T$$

where $J_i$ is the number of vectors in the ith class $C_i$, c is the number of classes (number of activities), $n_i$ is the mean of class $C_i$, $\bar{n}$ is the mean of all vectors, and $m_k$ is a vector of a specific class. The optimal discrimination matrix is selected by maximising the ratio of the determinants of the between- and within-class scatter matrices as

$$D_{opt} = \arg\max_{D} \frac{\left|D^T S_B D\right|}{\left|D^T S_W D\right|}$$

where $D_{opt}$ is the set of discriminant vectors of $S_W$ and $S_B$ corresponding to the (c − 1) largest generalised eigenvalues λ, and can be obtained by solving (24):

$$S_B d_i = \lambda_i S_W d_i, \qquad i = 1, 2, \dots, t \tag{24}$$

where the rank of $S_B$ is (c − 1) or less; hence, the upper bound value of t is (c − 1). The PCA features of the different activities are then projected onto the discriminant vectors $D_{opt}$. Thus, the EDA features (i.e. L) are obtained, to which machine learning algorithms are applied for training and testing of activities.
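The two-stage reduction described in this section (Gaussian-kernel PCA followed by LDA) can be sketched with scikit-learn as below. This is only an illustration under stated assumptions: the data are synthetic Gaussian blobs standing in for flattened radar frames, and the sizes, the 20 retained principal components, and the default kernel width are arbitrary choices rather than the settings used in the paper.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_classes, n_per_class, n_raw = 5, 40, 200  # toy sizes, not the paper's

# Synthetic stand-in for radar frames: one shifted Gaussian blob per activity.
X = np.vstack([
    rng.normal(loc=k, scale=1.0, size=(n_per_class, n_raw))
    for k in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)

# Step 1: Gaussian (RBF) kernel PCA; step 2: LDA with at most c - 1 axes.
eda = make_pipeline(
    KernelPCA(n_components=20, kernel="rbf"),
    LinearDiscriminantAnalysis(n_components=n_classes - 1),
)
features = eda.fit_transform(X, y)
print(features.shape)
```

With five activities, LDA can return at most four discriminant axes, which matches the (c − 1) upper bound stated above; the resulting low-dimensional vectors play the role of the EDA features passed to the classifiers.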

C. CLASSIFICATION
We applied different classification algorithms to the dataset for comparative and performance analyses. SVM [50], introduced by Vapnik, uses support vectors. It has been widely used in HAR systems owing to its high classification performance [51], [52]. It creates hyperplanes to maximise the margins between classes. By minimising the cost function, the optimal solution can be obtained, namely, the solution that maximises the distance between the hyperplane and the nearest training point. Herein, a nonlinear multiclass SVM with a sigmoid kernel was used. The sigmoid kernel was chosen because it is widely used. We also tested RBF and Gaussian kernels, but they did not improve the results appreciably.
Adaptive boosting, known as AdaBoost [53], is used primarily for ensemble learning or meta-learning. It applies an iterative approach to learn from the mistakes of classifiers to improve their performance. AdaBoost has been widely used in HAR by researchers [54]. MLP [55] is also known as a feed-forward ANN. It consists of multiple hidden layers in addition to the input and output layers. With the help of error backpropagation, it can be trained to classify data that are not linearly separable. MLP has been used in ambient assisted living to recognise poses and to monitor dangerous situations [56]. QDA is closely related to LDA, but it does not assume that the covariance of each class is identical [57].
KNN is the simplest classification technique used for machine learning. The KNN algorithm determines the points from the training data that are close enough to be considered when selecting the class to predict a new observation [58]. The RF [59] method is used for both classification and regression problems. It generates multiple DTs based on the random selection of variables and data, and recognises dependent variables based on the DTs. RF has been widely used to recognise different human activities [60], [61]. In this study, 10 DTs were used to explore the classes.
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including the utility and probability of event outcomes. It is a well-known classifier used in machine learning. Its structure is similar to a flowchart in which each internal node represents a test of an attribute, such as whether a coin flip produces heads or tails. Each branch corresponds to a possible test outcome, and each leaf node corresponds to a class label. The decision is taken after applying all the features. The classification rules are based on the paths from the root to the leaf [62].
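The conventional classifiers described in this subsection can be compared in a few lines of scikit-learn. The sketch below is a hedged illustration: it uses a synthetic five-class dataset from `make_classification` rather than the radar data, keeps the sigmoid-kernel SVM and the 10-tree random forest mentioned in the text, and leaves every other hyperparameter at an assumed default.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five-activity feature set (not the paper's data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "SVM (sigmoid kernel)": SVC(kernel="sigmoid"),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "RF (10 trees)": RandomForestClassifier(n_estimators=10, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in scores.items():
    print(f"{name:22s} {acc:.3f}")
```

The accuracies printed on this toy problem say nothing about the radar results; the point is only the common fit-and-score pattern that makes such a side-by-side comparison straightforward.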

1) RECURRENT NEURAL NETWORKS
The events are represented based on time-sequential data from the sensor. A machine learning model able to encode time-sequential data is suitable for our purpose. Hence, RNNs were used in this study. They are one of the most widely applied deep learning methods for modelling events underlying time-sequential data. An RNN typically consists of recurrent relations within the model's hidden units, which connect its history (i.e. memory) to the present. RNNs often face vanishing gradient problems that cause challenges in processing long-term information. This phenomenon is known as the long-term dependency problem. LSTM, proposed by Hochreiter and Schmidhuber [63], can solve the vanishing gradient problem typical of RNNs. RNNs and LSTMs have performed well in various fields, such as handwriting and speech recognition [64]. Figure 8 shows a sample deep RNN consisting of N LSTM units. Each LSTM block consists of an input gate I, a forget gate F, and an output gate O. The input gate I is expressed as

$$I_t = s\left(Y_{LI} L_t + Y_{HI} H_{t-1} + a_I\right)$$

where Y represents a weight matrix, a is a bias, s is the sigmoid function, $L_t$ is the input feature vector at time t, and $H_{t-1}$ is the previous hidden state. The forget gate F is expressed as follows:

$$F_t = s\left(Y_{LF} L_t + Y_{HF} H_{t-1} + a_F\right)$$

The long-term memory is stored by a cell in a state vector B, which can be represented as

$$B_t = F_t \odot B_{t-1} + I_t \odot \tanh\left(Y_{LB} L_t + Y_{HB} H_{t-1} + a_B\right)$$

The output gate O is expressed as

$$O_t = s\left(Y_{LO} L_t + Y_{HO} H_{t-1} + a_O\right)$$

The hidden state H is represented as

$$H_t = O_t \odot \tanh\left(B_t\right)$$

The final output N can be determined by

$$N = \mathrm{softmax}\left(Y_N H_i + a_N\right)$$

where i indexes the LSTM unit and $H_i$ is the corresponding hidden state. We used a stacked LSTM with four hidden layers. The first three layers have 100 memory units (or smart neurons) each, followed by 50 memory units in the fourth layer. The best hyper-parameters were chosen by grid search. Finally, because this is a classification problem, we use a dense output layer with a softmax activation function to make predictions for five classes. Our model has 847,055 trainable parameters.
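To illustrate how the gates of one LSTM unit interact, the following NumPy sketch runs a single layer over a short sequence using the standard LSTM formulation (sigmoid gates, tanh candidate and output squashing). It is not the Keras model used in the paper: the input size, sequence length, and random weights are assumptions, and the helper `lstm_param_count` simply counts the weights of one such layer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w_x, w_h, b):
    """One time step of a standard LSTM layer.

    w_x: (4n, d), w_h: (4n, n), b: (4n,), stacked in the order
    input gate, forget gate, candidate, output gate.
    """
    n = h_prev.shape[0]
    z = w_x @ x_t + w_h @ h_prev + b
    i_gate = sigmoid(z[0:n])
    f_gate = sigmoid(z[n:2 * n])
    candidate = np.tanh(z[2 * n:3 * n])
    o_gate = sigmoid(z[3 * n:4 * n])
    c_t = f_gate * c_prev + i_gate * candidate   # updated cell state
    h_t = o_gate * np.tanh(c_t)                  # updated hidden state
    return h_t, c_t


def lstm_param_count(d, n):
    """Trainable weights of one LSTM layer with d inputs and n units."""
    return 4 * (n * (d + n) + n)


rng = np.random.default_rng(0)
d, n, steps = 4, 100, 10  # assumed: 4 input features, 100 units, 10 time steps
w_x = rng.normal(scale=0.1, size=(4 * n, d))
w_h = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(steps, d)):
    h, c = lstm_step(x_t, h, c, w_x, w_h, b)

print(h.shape, lstm_param_count(d, n))
```

Stacking several such layers, as the four-layer architecture described above does, means feeding the hidden-state sequence of one layer as the input sequence of the next; the final hidden state is then passed to a softmax layer for classification.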

IV. RESULTS AND DISCUSSION
In this section, we describe the experiments performed on the XeThru UWB sensor dataset to recognise various human activities. The dataset consisted of data from 13 participants, with a total of 65,000 samples for the five activities. Ten radar frames per second were used, with 1,535 data points in each frame. Therefore, the data size for each sample was 10 × 1,535. In total, we used nine different classification approaches, with 80% of the shuffled dataset used for training all models and 20% used for testing. Moreover, 10% of the training dataset was used for validation.
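The split described above can be reproduced on sample indices alone. The sketch below shuffles 65,000 indices, holds out 20% for testing, and then reserves 10% of the remaining training portion for validation; the random seed and the use of index arrays instead of the actual radar samples are assumptions made for illustration.

```python
import numpy as np

n_total = 65_000                      # total number of samples stated in the text
rng = np.random.default_rng(0)        # arbitrary seed for reproducibility
indices = rng.permutation(n_total)    # shuffle before splitting

n_test = n_total * 20 // 100          # 20% held out for testing
test_idx = indices[:n_test]
train_idx = indices[n_test:]

n_val = len(train_idx) * 10 // 100    # 10% of the training portion
val_idx = train_idx[:n_val]
fit_idx = train_idx[n_val:]

print(len(fit_idx), len(val_idx), len(test_idx))
```

With these proportions, 46,800 samples are used for fitting, 5,200 for validation, and 13,000 for testing.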
For performance measures, the accuracy, precision, recall, and F1-scores were evaluated for each classifier. The accuracy of each classifier is listed in Table 2. All classifiers were tested using the Python library scikit-learn. Two conventional classifiers, SVM and AdaBoost, performed poorly in classifying activities, with accuracies of less than 50%, as shown in Figures 10a and 10b, respectively. Neither classifier could distinguish between the standing and walking activities. MLP performed only slightly better than these classifiers, with an accuracy of 51.3%. This classifier could detect the lying posture well, as shown in Figure 10c.
The QDA and KNN classifiers performed moderately well and classified all activities with an accuracy of 75%. Figures 10d and 10e show the confusion matrices of QDA and KNN. QDA primarily misclassified the activities when the subjects were sitting on the bed with their legs on the bed or floor. RF and DT were the only two conventional approaches with accuracies of approximately 90%. The confusion matrices of these two classifiers are shown in Figures 10f and 10g, respectively. Finally, we tested the dataset using a state-of-the-art LSTM, built using Keras [65], with TensorFlow as the back end. It outperformed all other classifiers by achieving an excellent accuracy of 98%. Figure 10h shows the confusion matrix, while graphs of the model's accuracy and loss are shown in Figure 11.
As we had 1,535 data points in each frame, EDA was introduced to reduce the dimensionality. In EDA, PCA was first applied to reduce the dimensions. The performance was insufficient when using PCA alone, as shown in the scatter plot in Figure 6; thus, LDA was introduced. The combination of PCA and LDA, which we refer to as enhanced discriminant analysis (EDA), outperformed all other approaches. The 3D plot of the features extracted after application of LDA is shown in the scatter plot in Figure 7. The EDA with LSTM training model performed better than the LSTM model alone and demonstrated excellent performance in all activities, as shown in Figure 10i. The trained model and loss graphs are shown in Figure 12. The overall accuracies of all nine classifiers that were implemented are listed in Table 2. The precision, recall, and F1-scores of the top four classifiers alone are presented in Table 1. Furthermore, to validate our approach, 5-fold cross-validation was performed on LSTM and EDA with LSTM for HAR. The confusion matrices of both approaches are shown in Figure 9.
Afterwards, we continued our experiments using leave-one-subject-out validation. Initially, we took 5% of the data from the test subject, mixed it with the training data, and obtained an average accuracy of 86%. Figure 13 shows the average confusion matrix of our results. In the next experiments, we used no data from the test subject, i.e. we left one whole subject out, and obtained an average accuracy of 66%, as shown in Figure 14. Given the lower performance, we plan to improve the subject-independent validation results in the future by improving the model. However, during the experiments with similar data selections, we observed that none of the traditional approaches yielded accuracies of more than 60%.
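The strict leave-one-subject-out protocol described above, in which no data from the test subject reach the training set, can be sketched as a small index generator. The subject labels below are synthetic (13 subjects with an assumed 50 samples each), and the function name is hypothetical.

```python
import numpy as np


def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, test indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Synthetic labels: 13 participants, 50 samples each (toy size).
subjects = np.repeat(np.arange(13), 50)
folds = list(leave_one_subject_out(subjects))

for subject, train_idx, test_idx in folds[:2]:
    print(subject, len(train_idx), len(test_idx))
```

Each fold trains on twelve subjects and tests on the remaining one, so averaging the per-fold accuracies gives a subject-independent estimate of the kind reported for this experiment.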
Singh et al. [28] introduced exercise-based activities by using mmWave radar. Bouchard et al. [66] introduced 15 daily life activities based on ten males. No females participated in their study. Moreover, the age range was between 22 and 39. In our case, we have quite a diverse age range, and almost half of the participants were female, to avoid bias in the study. Sharma et al. [29] proposed a channel impulse response based activity recognition system, but their activities were limited to standing, sitting, and lying. In contrast, our activities also cover different sitting postures, i.e. with the legs either on the bed or on the floor. Furthermore, we introduced an EDA-based feature extraction approach. We also tested our approach on the dataset of Ahmed et al. [67] (as they used a similar UWB sensor), on which it showed promising results and outperformed their results. Figure 15 shows the confusion matrix, whose columns a to l represent the twelve different dynamic hand gestures proposed in [67]. To the best of the authors' knowledge, there is no current study based on UWB sensors that focuses on enhanced discriminant feature analysis before feeding the data into the machine learning algorithms.

V. CONCLUSION
In this study, a novel approach was proposed for HAR using a UWB sensor and state-of-the-art deep learning models. The proposed approach is beneficial for older adults because it is difficult for them to wear actigraphy devices 24/7 or to be monitored through an RGB camera, which could breach their privacy. In this work, a UWB sensor was mounted on a robot that observed the subjects from a certain distance. Because each UWB frame contains many data points, EDA was used to reduce the dimensions of the features before feeding them into the deep learning model. The results were compared with those of conventional approaches. The proposed approach was found to perform significantly better, with an accuracy of 99.6%. Moreover, 5-fold cross-validation was performed for generalisation of the system. Furthermore, we applied our approach to a publicly available dataset and obtained better results.
In the future, we intend to perform more complex experiments in real-time environments. Furthermore, we plan on extending the algorithms by introducing the heart rate into the monitoring system to detect emergencies. Moreover, we also need to acquire data from older people since the approach focuses on eldercare. Finally, we can also introduce multiple UWB sensors in the apartment or other thermal/depth-based sensors to localize the exact position of persons. In that context, different sensor fusion strategies might be explored.

ACKNOWLEDGMENT
In Figure 4, the first author demonstrates the experimental environment to give a better understanding of the setup. However, he was not a subject in the experiments, to avoid biasing the study.