Detection of Freezing of Gait Using Unsupervised Convolutional Denoising Autoencoder

At the advanced stage of Parkinson’s disease, patients may suffer from ‘freezing of gait’ episodes: a debilitating condition wherein a patient’s “feet feel as though they are glued to the floor.” The objective, continuous monitoring of the gait of Parkinson’s disease patients with wearable devices has led to the development of many freezing of gait detection models involving the automatic cueing of a rhythmic auditory stimulus to shorten or prevent episodes. The use of thresholding and manually extracted features or feature engineering returned promising results. However, these approaches are subjective, time-consuming, and prone to error. Furthermore, their performance varied when faced with the different walking styles of Parkinson’s disease patients. Inspired by state-of-the-art deep learning techniques, this research aims to improve the detection model by proposing a feature-learning deep denoising autoencoder that learns salient characteristics of Parkinsonian gait data applicable to different walking styles, eliminating the need for manually handcrafted features. Even with the elimination of manually handcrafted features, a halving of the data window size to 2 s, and a significant dimensionality reduction of the learned features, the detection model still achieved 90.94% sensitivity and 67.04% specificity, which is comparable to the original Daphnet dataset research.


I. INTRODUCTION
Parkinson's disease (PD) is the second most common age-related neurodegenerative disorder. It is estimated that more than 10 million people worldwide are living with PD and this number is expected to increase as the elderly population increases [1]. PD occurs when neurons in the brain gradually die, decreasing the neurotransmitter dopamine and causing abnormal brain activity. Although researchers have yet to discover the cause of PD, most of the symptoms of PD result from the decrease in the brain's dopamine levels. PD is characterized by both physical and psychological symptoms, and disease progression varies individually due to the diversity of the disease [2].
Common symptoms include tremors, muscle rigidity, slow movement, postural instability, and impairment of motor skills. Walking or gait difficulties in PD patients are exemplified by slow, small steps, shuffling, or rapid short steps. At the advanced stage, PD patients may suffer from freezing of gait (FOG) episodes. FOG is the brief episodic absence of forward feet progression despite the intention to walk [3]. In the words of PD patients, it is described that their “feet feel as though they are glued to the floor” [4]. These episodes increase the risk of falling, making FOG one of the most debilitating aspects of PD.
The associate editor coordinating the review of this manuscript and approving it for publication was Qi Zhou.
Since the progression of PD is assessed by walking characteristics, various technologies have been developed to enhance human motion analysis. Low-cost, low-power, and unobtrusive wearable devices, often comprising small computers and sensors, are worn by PD patients to objectively capture continuous movement information [5]. This assists diagnosis, the monitoring of daily activities of PD patients [6], [7], and the tracking of FOG occurrences even in their own homes [8]. In addition, medical researchers have discovered that cueing a rhythmic auditory stimulus (RAS) shows promising results in improving gait, enhancing gait speed, improving stability, and shortening or even preventing FOG episodes altogether.
In the last decade, many online FOG detection and prediction models have been developed by researchers using various techniques for the automatic cueing of a rhythmic auditory stimulus (RAS). Many of these returned promising results for this area of research. However, most, if not all, of the models still face difficulties and exhibit large variations in their performance when it comes to the different walking styles of PD patients, such as ''foot drop'' or ''very slow walk'' [8]. This is largely due to the design of the models using thresholding or manual feature extraction techniques in the classification of FOG. These models lacked salient features to represent distinct data characteristics for the different walking styles of PD patients.
Many researchers attempt to resolve this problem by introducing more features and parameter sets. The features are derived directly from the sensor data by computing the statistical and heuristic measurements, such as mean, standard deviation, and interquartile range or by deriving frequency-domain features from the Fourier transformed signal [9]. However, these approaches still need to be manually determined by the researcher. This means handcrafting new features and parameter sets becomes subjective, prone to error, and relies on the skillset and experience of the researchers in understanding the Parkinsonian gait. With the increased numbers of features and parameter sets, the model becomes more complex, and this often increases the latency of the FOG detection model in the process. In addition, manually handcrafting features is not only time-consuming, but it increases the difficulty of selecting optimal features, especially when the features are highly correlated to each other. Making a small change in one feature could easily affect other features and potentially deteriorate the performance of the model. Even if the optimal features are determined, the feature sets may not be transferrable to similar patterns, and the search for optimal features might have to be repeated for each PD patient.
The current trend for this area of research is the application of deep learning techniques in FOG detection models to reduce or eliminate the use of manually handcrafted features. In deep learning, feature representation of the data is automatically learned by two main operations, namely convolution and pooling (down-sampling). The convolutional and pooling layers are stacked alternately to extract salient features in a hierarchical manner whereby the first few layers extract primitive features while the deeper layers extract high-level features that are specific to the application. In recent works, there have been several attempts to detect FOG by classification with deep learning models [10], producing similar and comparable results. Even if the results produced are only similar and comparable to the ones from the use of manually handcrafted features, the application of deep learning techniques in the model greatly increases its complexity and its computational cost.
In this paper, an unsupervised feature learning method based on a deep denoising autoencoder is proposed for detecting FOG episodes. The proposed autoencoder exploits the capability of convolutional and pooling layers to leverage temporal structure and learn the salient features of gait data. Furthermore, the deep denoising autoencoder can produce a compact feature representation, therefore reducing the overfitting of the data. The proposed method is evaluated using a benchmark public dataset. The experimental results show that the proposed method can achieve a high detection accuracy and that its performance is comparable to, if not better than, that of the existing methods.
The remainder of this paper is organized as follows. Section II reviews the related works. In Section III, the proposed method is presented, and Section IV presents the experimental results and their discussion. Finally, the conclusions are presented in Section V.

II. RELATED WORKS
As one of the pioneers in Parkinsonian gait monitoring research with the use of wearable devices, Moore et al. analyzed stride data in terms of its vertical linear acceleration and pitch angular velocity from ankle-mounted sensors. They discovered small, variable stride lengths as gait characteristics in PD patients. Furthermore, they extended their research to include FOG detection in advanced PD patients. From the power spectra analysis on the frequency components of the gait acceleration data, they observed two distinct frequency bands: the locomotion band, with the mentioned Parkinsonian gait characteristics from 0.5 to 3 Hz, and the freeze band, with high-frequency acceleration components from 3 to 8 Hz during FOG episodes [11]. Due to the consistency of high-frequency bands during FOG episodes, the authors defined the freeze index (FI) as “the power in the freeze band divided by the power in the locomotor band” with selected threshold values. Any window whose FI exceeds the threshold is designated as a FOG episode. The FI was then multiplied by 100 and its natural logarithm taken to reduce large variations in its magnitude. With a window size of 6 s, they reported a 78% FOG detection rate using a global threshold for FI, indicating the strong correlation between the high-frequency bands and FOG episodes. Calibrating the thresholds to each of the patients increased the FOG detection rate to 89%. Since then, the FI has become the benchmark for assessing the severity of FOG in PD patients.
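As an illustration of the freeze index described above, the following minimal sketch computes the ratio of spectral power in the 3 to 8 Hz band to that in the 0.5 to 3 Hz band using a plain discrete Fourier transform. The function names, the 64 Hz sampling rate, the synthetic signals, and the choice to place the 3 Hz boundary in the freeze band are assumptions made for the example, not details taken from [11].

```python
import cmath
import math


def band_power(signal, fs, f_lo, f_hi):
    """Sum the DFT power of `signal` over frequencies in [f_lo, f_hi)."""
    n = len(signal)
    power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)
            )
            power += abs(coeff) ** 2
    return power


def freeze_index(signal, fs):
    """Power in the freeze band (3-8 Hz) over power in the locomotor band (0.5-3 Hz)."""
    freeze = band_power(signal, fs, 3.0, 8.0)
    locomotor = band_power(signal, fs, 0.5, 3.0)
    return freeze / locomotor if locomotor > 0 else float("inf")


# Synthetic 6 s windows at a hypothetical 64 Hz sampling rate
fs = 64
t = [i / fs for i in range(6 * fs)]
walking = [math.sin(2 * math.pi * 1.5 * x) + 0.1 * math.sin(2 * math.pi * 6 * x) for x in t]
trembling = [0.1 * math.sin(2 * math.pi * 1.5 * x) + math.sin(2 * math.pi * 6 * x) for x in t]

print(freeze_index(walking, fs), freeze_index(trembling, fs))
```

A window dominated by the 1.5 Hz component yields an index well below one, whereas a window dominated by the 6 Hz component yields an index well above one; a threshold between the two would separate them.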
Following that, Bächlin et al. conducted their research on 10 PD patients, each equipped with three sensors. These sensors were located on the shank (above the ankle), the thigh (above the knee), and the trunk (above the waist). The patients also wore a computer to record and analyze the 3D acceleration data from the sensors. Based on the works of [11], the authors introduced the improvement of reduced latency for real-time online detection and the inclusion of an energy threshold to distinguish between standing and other states. The energy threshold, labeled PowerTH, was used to distinguish standing and other states of the patient. Using 4s window sizes, a 256-point Fast Fourier Transform (FFT) was calculated for the power spectral density of the movement data. An algorithm was designed to detect FOG, achieving a sensitivity of 73.1% and a specificity of 81.6%. However, the performance of the proposed algorithm varied with the difference in the walking styles of patients. The proposed algorithm had difficulties distinguishing between slow walking periods and voluntary standing with very short FOG episodes. To further improve performance, the detection parameters were calibrated to each of the patients respectively, producing two patient-specific parameter sets for 'smooth' and 'intensified stepping' walking styles to achieve an average sensitivity and specificity of 88.6% and 92.4% respectively. The collected dataset was made public and known as the Daphnet FOG dataset. Due to the diversity in the walking styles of PD patients in the dataset, it is widely used and considered the benchmark dataset in Parkinsonian gait research.
With the advancement of computational power, many researchers have turned to machine learning techniques for the detection of FOG episodes. One of the first works to propose the use of machine learning techniques is reported in [12]. The authors proposed the use of smartphones and wearable accelerometers for online detection of FOG episodes and evaluated several machine learning algorithms on the Daphnet dataset. The correlation-based feature subset selection technique was performed to select 7 distinct features for the classifier training: the mean, standard deviation, variance, entropy of the frequency distribution, energy from the sum of squared discrete FFT, and the freeze and power thresholds from the research in [8].
Several machine learning classifiers have been built including Random Forests, Decision Trees and C4.5 Pruned Decision Trees, Naïve Bayes, Bayes Net, K-Nearest Neighbor, Multilayer Perceptron, and Decision Trees with Adaboost and Bagging. The evaluation is divided into patient-dependent and patient-independent experiments. The patient-dependent experiments refer to the training and testing of data from the same patient whereas patient-independent experiments refer to the generalized training and testing from the overall dataset. With 10-fold cross-validation on the classification models with the features selected in the patient-dependent experiments, the Decision Trees with Adaboost classifier recorded the highest detection accuracy of 99.69% and 99.96% for its sensitivity and specificity respectively for 4s window sizes. As for the patient-independent experiments, the evaluation of classifiers with the leave-one-patient-out technique is used. In the patient-independent experiments, the Random Forest classifier achieved the highest average performances of 66.25% sensitivity and 95.38% specificity for 4s window sizes, and 62.05% sensitivity and 95.15% specificity for 1s window sizes. The poor performance is due to the different walking styles of the PD patients. Similar work is reported in [13] whereby machine learning classifiers are built for classifying PD gait data from the ankle, thigh, and trunk sensors used in the Daphnet dataset into FOG, walking, and pre-FOG. Thirty (30) features were extracted from both time and frequency domains, such as entropy, skewness, and FFT coefficients. These results show the best performance of FOG detection, achieving sensitivity and specificity of 87.23% and 80.00% respectively. Another similar work is [14] wherein a k-Nearest Neighbor classifier was built to classify FOG, pre-FOG, and non-FOG events using ankle, thigh, and trunk sensor data from the Daphnet dataset. 
The leave-one-out scheme on an individual patient is used in the evaluation of the model. The proposed model achieved an average sensitivity and specificity of 94.1% and 97.1% for FOG detection.
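The patient-independent evaluation discussed above holds out all data from one patient per fold, so that a classifier is never tested on a patient it was trained on. A minimal sketch of such a split is given below; the function name and the patient identifiers are hypothetical.

```python
def leave_one_patient_out(patient_ids):
    """Yield (held_out, train_idx, test_idx) with one patient held out per fold."""
    for held_out in sorted(set(patient_ids)):
        train_idx = [i for i, p in enumerate(patient_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(patient_ids) if p == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical assignment of six data windows to three patients
patient_ids = ["S01", "S01", "S02", "S02", "S02", "S03"]
folds = list(leave_one_patient_out(patient_ids))

for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

By contrast, a patient-dependent experiment would draw both the training and the test windows from the same patient.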
In [15], a multistage classifier based on a support-vector machine (SVM) with only a single tri-axial accelerometer at the waist is proposed. The research carried out patient-independent experiments, with the data collected from 20 PD patients, out of which only 8 patients had presented FOG episodes. Two sets of features are built. The first set is a reduced set that consists of FFT and the second set contains additional features such as mean, standard deviation, entropy, peak amplitude, and FI. The SVM classifier was trained with sets of features according to different window sizes and varying the SVM configurations with linear and radial basis function (RBF) kernels on different weights and costs. The output from SVM is aggregated over time to determine the confidence level from the additional upper and lower FI thresholds introduced for FOG classification. This research concluded that, although the RBF kernel favors the reduced feature set, the linear kernel trained with FFT alone is sufficient to achieve similar accuracies. With the use of 128 samples from the 40 Hz data, equivalent to a window size of approximately 3.2 s, the linear SVM managed to achieve an accuracy of 98.7%. That said, it is largely limited by the one-minute lag between the appearance of the FOG episode and its detection, rendering it unsuitable for real-time FOG detection.
In [4], two SVM classifiers were built: one trained generically with the dataset from all patients with the leave-one-patient-out validation in patient-independent experiments, and the second with part of the dataset from each patient in patient-dependent experiments. This managed to improve the accuracy of the SVM classification to outperform the generic patient-independent classifier by [8] and [11]. The dataset was collected from 21 PD patients using a single tri-axial accelerometer at the waist. Their data preprocessing steps include signal conditioning and windowing, removing the high-frequency noise with a low pass filter, and proposing the use of 3.2s window sizes of 128 samples with a 50% overlap to prevent information loss between windows. From the data windows, a total of 55 features were extracted and categorized within 12 groups for the data of 21 PD patients. They also applied the Principal Component Analysis (PCA) to identify the principal components of the data and the Singular Value Decomposition (SVD) to identify the latent variables in the data for the real-time implementation of the feature extraction process. Due to its high performance in classifying binary data, the SVM classifier managed to achieve an average of 79.03% sensitivity and 74.67% specificity in the patient-independent experiments and an average of 88.09% for both the sensitivity and specificity in the patient-dependent experiments.
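The windowing step with a 50% overlap can be sketched as follows. This is a minimal illustration with hypothetical names and a dummy signal; a 128-sample window spanning 3.2 s implies a 40 Hz sampling rate, and a 50% overlap means each window starts 64 samples after the previous one.

```python
def sliding_windows(samples, window_size, overlap=0.5):
    """Split `samples` into fixed-size windows that share `overlap` of their length."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window_size * (1 - overlap)))
    return [
        samples[start:start + window_size]
        for start in range(0, len(samples) - window_size + 1, step)
    ]


# Dummy signal of 320 samples; 128-sample windows advance by 64 samples
signal = list(range(320))
windows = sliding_windows(signal, window_size=128, overlap=0.5)

print(len(windows), [w[0] for w in windows])
```

Because the second half of each window reappears as the first half of the next, an event that straddles a window boundary is still seen whole by at least one window.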
Another work by [16] applied the same sensor configuration and SVM technique with only slight changes in the feature extraction and validation techniques. The dataset was collected from 15 PD patients. Similar to previous works, frequency features were extracted with FFT and analyzed with PCA. However, the number of features was significantly decreased from 55 to 28 to reduce the computational load and cater for real-time implementation. The classifier was validated with the stratified 10-fold cross-validation and the leave-one-patient-out techniques on the 1.6 s window sizes with 64 samples. Even with a reduced number of features, the research managed to perform similarly to or outperform the previous systems in terms of accuracy. In [17], two inertial sensors are attached to the shin of 11 PD patients for FOG detection. The decision tree is used to select the relevant features and SVM is used to predict the FOG episodes. The proposed method achieved an accuracy of 92.0% on the leave-one-subject-out cross validation.
In [18], the temporal structure of the gait data is exploited using long short-term memory recurrent neural network (LSTM-RNN) by feeding the raw data directly to the network. Subsequently, the dependencies of the sequential data could be modelled, improving detection accuracy. To evaluate the performance of the proposed network, the authors performed a comparison of classical machine learning methods and handcrafted features. The features were extracted using hybrid discrete wavelet transform and FFT to distinguish between FOG and non-FOG episodes. Patient-dependent experiments were conducted, and it was reported that linear SVM achieved the best performance among the machine learning algorithms with an average accuracy of 79.48%. The proposed LSTM-RNN managed to achieve a better performance with an average accuracy of 83.83%. Another research work by [19], utilized transfer learning to leverage pre-trained networks for training the LSTM-RNN network. The results show that the proposed model achieved an average accuracy of 87.54% with a 1s window.
With recent developments in deep learning, the ability to extract features from time series data has improved considerably. Deep learning models, particularly the convolutional neural network (CNN), are highly appealing in time series data analysis because they offer an end-to-end pipeline for pattern classification and learn fundamental features directly from the data. One of the first works to propose a deep learning model for FOG detection is reported in [10]. The authors proposed a deep learning model that exploits spatially local correlation by encouraging the connectivity of adjacent-layer neurons. The model was hybridized with a CNN, forming 8 layers: 5 convolutional layers, 2 fully connected layers, and one output layer. Apart from the windowing strategy, an augmentation strategy was proposed that randomly shifted the window starting point and rotated each windowed signal to replicate the patient's movement. Furthermore, a stacking strategy was proposed to overlap consecutive window data with a stacking parameter and FFT to provide the model with more data samples. The evaluation of the model did not produce any improvements in the performance of FOG detection; the results are only comparable to those of researchers using signal processing or machine learning techniques [10]. However, this shows that automatic feature extraction with deep learning techniques can produce results comparable to manually extracted features from long feature extraction processes. This allows researchers to optimize the performance of the model to eliminate manually handcrafted feature extraction, feature selection, and classifier boosting techniques.
In another research, a deep CNN architecture for automatic feature learning and the detection of FOG episodes is proposed [20]. The proposed CNN architecture consists of three convolutional and pooling layers for feature learning and one SoftMax layer for classification. In the pre-processing stage, outliers were detected with the three-sigma rule and replaced with the median value of the time-series data. A sliding window is used to segment the data to a length of 256 (4s) as the input data for the CNN. The experiments were conducted using the Daphnet dataset. The generic patient-independent performance of the proposed architecture is evaluated with the leave-one-out technique, while the patient-dependent performance is evaluated with the 10-fold cross-validation technique. Similar work is reported in [21] whereby a deep learning architecture is proposed for detection of FOG episodes. The architecture consists of squeeze-and-excitation blocks to capture the interdependencies between channels of the feature maps and attention mechanism to encode the relationship between the features and the desired targets. These experiments were also conducted using the Daphnet dataset. The proposed model is evaluated with the 10-fold cross-validation and leave-one-out cross-validation. Although the proposed CNN improved the specificity performance, the overall sensitivity performance was still poor. This aligns well with the findings of [8], that the proposed architecture did not perform equally well for all PD patients due to the difference in the walking styles. In this case, as the SoftMax classifier can only provide linear classification, the performance might be improved with the use of classifiers with a non-linear classification capability.
Based on the literature review, the limitations of manually handcrafted features are apparent, especially in determining the optimal feature sets and their combinations to differentiate between FOG episodes and other states, such as standing or slow walking. The limitations of manually handcrafted features also affected the results, producing large variations in the performance of classification models when faced with different PD patient walking styles. Many of the researchers attempted to resolve this problem by introducing even more manually handcrafted feature sets from the frequency and statistical analysis of gait acceleration signal data in addition to the freeze and power index thresholding methods proposed by [8] and [11]. This leads to an explosion in the number of features; up to 55 feature sets were categorized into 12 groups [4]. However, more features do not necessarily mean the model will perform better as shown in [16] wherein the authors reduced the number of feature sets to 28 features to improve the performance of the model. Hence, the manually handcrafted features are not only subjective and varied throughout the works, but the different combinations of features also produce different results, which makes it difficult to select optimal features.
VOLUME 9, 2021
The introduction of CNN deep learning methods in recent years has contributed to the elimination of manually handcrafted features used in previous works. This reduces many tedious steps, complex calculations, and countless different manually handcrafted features from the FOG detection model. However, the use of CNN with its many hidden layers and nodes increases model complexity, which in turn often increases latency due to the increased delay lag between the FOG episode and its detection. This is one of the main concerns of researchers in many of the related works above, as the classification models need to work in real-time for the effective cueing of RAS.
Therefore, this research aims to propose an unsupervised deep denoising autoencoder to eliminate the use of the manually handcrafted features by learning the salient characteristics of PD gait acceleration data. Unlike the number of parameters, which increases in each layer of the CNN and increases the complexity of the model, the constriction of the hidden layer with the bottleneck layer of the autoencoder limits the number of nodes between the encoder and the decoder to be significantly smaller and simplifies the data for the classification of the FOG episodes. This prevents the autoencoder from directly copying inputs to the output and forces the autoencoder to prioritize the importance of the learned features, retaining only the important ones and reducing the dimensionality of data in the process. Only features learned from the encoder portion of the autoencoder are retained to be used with the classification model for the detection of FOG episodes. In the following section the proposed methodology is briefly outlined.

III. PROPOSED METHODOLOGY
An overview of the proposed method is illustrated in Figure 1, and it can be divided into two stages. First is the feature learning stage wherein the feature representation of the data is learned (automatically extracted) via a deep learning autoencoder (Figure 1, top). The next stage is the classification stage wherein the extracted features are used to detect FOG episodes using the gait acceleration data. The collection and pre-processing of this data is detailed in the following section.

A. DATA COLLECTION AND PRE-PROCESSING
To execute the proposed methodology, this research was carried out solely based on the Daphnet FOG dataset. The PD gait data was collected using three accelerometers that were attached on the shank (above the ankle), the thigh (above the knee), and the trunk (above the waist) of the patients, transmitting the acceleration data to a base station through a Bluetooth link at 64 Hz, corresponding to a time interval of approximately 15.6 ms. Ten (10) PD patients in a laboratory environment were asked to perform three walking tasks that represent the different aspects of daily walking. These tasks were done under the supervision of therapists and assistants for observation and safety purposes. The tasks included walking in straight lines with 180-degree turns, random walking with several stops and turns, and simulating activities of daily living such as getting drinks from the kitchen and returning to the room. The gait data acquired were annotated with ground truths: 0 for non-experimental periods, 1 for non-FOG episodes, and 2 for FOG episodes.
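The annotation scheme above can be turned into binary labels as in the following minimal sketch. It assumes that samples annotated 0 (outside the experiment) are discarded and that the remaining annotations are mapped to 0 (non-FOG) and 1 (FOG); the text does not state how the authors handled this step, and the function name and readings are hypothetical.

```python
def select_experiment_samples(samples, annotations):
    """Drop non-experimental samples (0) and map non-FOG (1) to 0, FOG (2) to 1."""
    kept, labels = [], []
    for sample, annotation in zip(samples, annotations):
        if annotation == 0:
            continue  # recorded outside the experiment
        kept.append(sample)
        labels.append(1 if annotation == 2 else 0)
    return kept, labels


# Hypothetical (x, y, z) accelerometer readings with their annotations
samples = [(70, 39, -970), (60, 49, -960), (50, 40, -950), (45, 42, -940)]
annotations = [0, 1, 2, 1]
kept, labels = select_experiment_samples(samples, annotations)

# Time between consecutive samples at 64 Hz, in milliseconds
sampling_interval_ms = 1000 / 64

print(labels, sampling_interval_ms)
```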
In the data pre-processing, the outliers were detected and eliminated by replacing data values exceeding three standard deviations with the median value of each attribute and removing the data if it exceeded four standard deviations. The three-sigma rule is commonly applied to data with normal distributions to retain at least 95% of the data. To prevent too much information loss, this research extends the tolerance of the outliers to the 99th percentile of the data to retain more of the original dataset. Due to this, most of the original data are still retained for the application of deep learning and machine learning later in the research. The data were normalized to a range of 0 to 1. The use of the resulting dataset for feature learning in this research is explained in the ensuing section.
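A minimal single-channel sketch of this pre-processing is shown below. It assumes that the thresholds are measured from the channel mean using the population standard deviation, and that removal at four standard deviations is applied before replacement at three; the function names and readings are hypothetical, and the authors' exact procedure may differ.

```python
import statistics


def clean_channel(values):
    """Drop values beyond 4 standard deviations; replace those beyond 3 with the median."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    median = statistics.median(values)
    cleaned = []
    for v in values:
        deviation = abs(v - mean)
        if deviation > 4 * std:
            continue  # extreme outlier: removed
        cleaned.append(median if deviation > 3 * std else v)
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical readings with one moderate (13.0) and one extreme (20.0) outlier
readings = [10.0, 11.0, 9.0, 10.5, 9.5] * 200 + [13.0, 20.0]
cleaned = clean_channel(readings)
normalized = min_max_normalize(cleaned)

print(len(readings), len(cleaned), min(normalized), max(normalized))
```

In this example the extreme reading is dropped and the moderate one is replaced by the median, so the normalized range is no longer stretched by either outlier.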

B. FEATURE LEARNING
The feature learning pipeline is based on an autoencoder variant called the denoising autoencoder [22]. An autoencoder is an unsupervised learning neural network model trained to learn useful features of the data to reconstruct the output to be as similar as possible to the original input, without the need for class labels. The autoencoder consists of two parts: the encoder, which encodes learned features of the input data, and the decoder, which reconstructs the output data from the learned features. Typically, a bottleneck constricts the hidden layer between the encoder and the decoder, significantly limiting the number of nodes to prevent the autoencoder from directly copying inputs to the output. This also forces the autoencoder to prioritize the importance of the learned features, and retain only the important ones, reducing the dimensionality of data in the process. Unlike the vanilla autoencoder (the simplest form of autoencoder), the denoising autoencoder is trained to reconstruct the output from corrupted inputs. This allows the autoencoder to learn more stable and robust features of the data. In this research, the data is corrupted with Additive White Gaussian Noise. The target noise level is based on the signal-to-noise ratio (SNR), where a lower SNR indicates more noise and therefore more heavily corrupted data windows. The target SNR is set to 20 in this research, which is deemed to be a reasonable amount of noise to corrupt the data. The SNR is expressed mathematically as follows:
$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{P_{\text{signal}}}{P_{\text{noise}}}$$
where $P_{\text{signal}}$ represents the average power of the original signal and $P_{\text{noise}}$ represents the average power of the noise. The average power is calculated in decibels (dB). Therefore, the data is converted into decibels prior to calculating the SNR for the noise generation. The noise generated is then added back to corrupt the original data and thereby form the corrupted data.
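The corruption step can be sketched as follows. This minimal example assumes that the target SNR of 20 is expressed in decibels and that the noise is zero-mean; the function names, the random seed, and the synthetic signal are hypothetical and only serve to show how a noise variance follows from a target SNR.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Corrupt `signal` with zero-mean white Gaussian noise at the target SNR (in dB)."""
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its corrupted copy."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical normalized signal, corrupted at a 20 dB target (assumed unit)
clean = [0.5 + 0.4 * math.sin(2 * math.pi * i / 64) for i in range(12800)]
noisy = add_awgn(clean, snr_db=20, rng=random.Random(0))

print(round(measured_snr_db(clean, noisy), 2))
```

At 20 dB the noise power is one hundredth of the signal power, so the corrupted windows remain recognizable while still differing from the clean targets that the autoencoder is trained to reconstruct.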
While a simple autoencoder can be trained with only one hidden layer (an input layer, a hidden encoder layer, and a decoder output layer), it is not sufficient for solving complex problems such as the one in this research. This requires the composition of a multi-layered non-linear architecture to generalize complex relationships between features efficiently. Deep architectures perform hierarchical feature learning by learning low-level features first and progressively higher-level features in the deeper layers of the network. By convention, a symmetrical architectural design is desired. The 15-layer deep denoising autoencoder in this research consists of an 8-layer encoder (including the input layer) and a 7-layer decoder as described in the summary of the architecture in Table 1 below.
The initial input shape of the deep denoising autoencoder is the same as the output shape of the last layer. This is given as 128 × 3, representing the input dimension and the number of channels respectively. The input dimension, 128, comes from the 2 s data windows from the data preparation process, while the second parameter, 3, comes from the x, y, and z channels of the accelerometer data. The output shapes change depending on the processes within the layers of the autoencoder. The bottleneck is the constricted layer that can tremendously reduce the dimensionality of the data and has dimensions 4 × f where f is the number of filters in the constricted layer. To extend the learning capabilities of the deep denoising autoencoder, the 1D convolutional layer is used to leverage the local temporal structure of the data and the 1D pooling layer is used to make the network less sensitive to small translations in the data. The encoder consists of three stacks of 1D convolutional and max-pooling layers followed by a convolutional layer that acts as the constricted layer (bottleneck). The number of filters in the convolutional layers increases from 16 to 48 as the network gets deeper. This allows the network to extract the features in a hierarchical manner, resulting in more discriminative features. In terms of the layer structure, the decoder is symmetric to the encoder. Up-sampling is performed after each convolutional layer to increase the dimension of the data to the original size. All 1D convolutional layers use the rectified linear unit activation function.
The mean squared error (MSE) is used to train the deep denoising autoencoder to reconstruct the original data from the corrupted data. The MSE measures the average squared difference between the predicted value (reconstruction), $\hat{x}_i$, and the true value (original), $x_i$, as shown in the following equation:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2$$
where $n$ is the dimension of the data. As the value is squared, this loss function is sensitive to any deviation from its true value. However, it only measures the magnitude of the error irrespective of its direction. To improve the reconstruction of the data, the Kullback-Leibler (KL) divergence is added to the loss function to minimize the divergence between the distribution of the true values and the predicted values. The loss function is defined as follows:
$$L = \mathrm{MSE} + \beta \, D(x \,\|\, \hat{x})$$
where $\beta$ is a small constant weight and $D(x \,\|\, \hat{x})$ is the KL divergence, which is defined as:
$$D(x \,\|\, \hat{x}) = \sum_{i=1}^{n} x_i \log \frac{x_i}{\hat{x}_i}$$
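A minimal sketch of a combined reconstruction loss of this kind is given below. It makes several assumptions that the text does not specify: the KL term is computed after clipping the values to a small positive constant and normalizing each window to sum to one (so that the divergence is non-negative), and β is set to a hypothetical value of 0.01. The function names are illustrative only.

```python
import math

EPS = 1e-7  # hypothetical clipping constant that keeps the logarithm finite


def mse(x, x_hat):
    """Mean squared error between the original and the reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_divergence(x, x_hat):
    """KL divergence between the two windows, each normalized to sum to one."""
    p = [max(a, EPS) for a in x]
    q = [max(b, EPS) for b in x_hat]
    p_sum, q_sum = sum(p), sum(q)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p, q)
    )


def reconstruction_loss(x, x_hat, beta=0.01):
    """MSE plus a weighted KL term; `beta` is an assumed small constant."""
    return mse(x, x_hat) + beta * kl_divergence(x, x_hat)


# Hypothetical original window and reconstruction, both scaled to [0, 1]
x = [0.2, 0.5, 0.9, 0.4]
x_hat = [0.25, 0.45, 0.8, 0.5]

print(mse(x, x_hat), kl_divergence(x, x_hat), reconstruction_loss(x, x_hat))
```

The MSE term penalizes point-wise deviations, while the KL term penalizes differences in how the values are distributed across the window; a perfect reconstruction drives both to zero.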

C. DETECTION OF FOG
Essentially, only salient characteristics of FOG episodes are encoded in the constricted layers, which represent the features of the data after the training procedure. The decoder is then removed from the network and replaced with a classifier as shown in Figure 1 (bottom). The classification task is framed as a binary classification problem: detecting the presence or absence of FOG episodes. A supervised learning approach is employed to model the mapping between the encoded features and the FOG episodes. In this research, various machine learning algorithms are evaluated to determine the optimal FOG detection model. The machine learning algorithms are naïve Bayes, SVM with RBF kernel, SVM with polynomial kernel, Random Forest, and Ensemble Voting. Each classifier is fine-tuned accordingly to obtain the best results, and these results are laid out in the next section.

IV. EXPERIMENTS AND RESULTS
The following subsections present a detailed analysis of the results obtained and are divided into three sections. The first section introduces the experimental setup, and the second section presents the results and their discussion. The final section is a comparative analysis of the proposed method with the state-of-the-art methods.

A. EXPERIMENT SETUP
The Daphnet dataset was used in the experiments to evaluate the proposed method. In the experiments, the gait data from the thigh sensor was chosen as it gave the best performance in the original Daphnet research. The data was segmented using a fixed sliding window technique with the window size set to 2s (or 128 samples per window) with no overlap. The dataset was partitioned into training and test sets with a ratio of 7:3. The deep denoising autoencoder was implemented using Keras with TensorFlow as its backend engine. Xavier initialization was used to initialize the network weights and the autoencoder was trained using the RMSprop optimizer. Early stopping and dropout were used to prevent overfitting the training data. A batch size of 32 was used, and the dropout rate was kept at 0.2 to avoid excessive sparsity, which may lead to the loss of too much information.
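The windowing and partitioning steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, trailing samples that do not fill a whole window are assumed to be discarded, and the 7:3 split is assumed to be a simple cut in window order because the text does not say whether the windows were shuffled.

```python
def segment_windows(samples, window_size=128):
    """Cut a sequence of (x, y, z) samples into non-overlapping windows.

    128 samples per 2 s window follows the text; dropping the incomplete
    trailing window is an ASSUMPTION.
    """
    n_windows = len(samples) // window_size
    return [samples[i * window_size:(i + 1) * window_size]
            for i in range(n_windows)]


def split_train_test(windows, train_fraction=0.7):
    """Split windows 7:3 in their existing order (ASSUMED, no shuffling)."""
    cut = int(round(len(windows) * train_fraction))
    return windows[:cut], windows[cut:]


# Hypothetical accelerometer stream of 1300 samples (placeholder values).
stream = [(float(i), 0.0, 0.0) for i in range(1300)]
windows = segment_windows(stream)
train, test = split_train_test(windows)
print(len(windows), len(train), len(test))
```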

B. RESULTS AND DISCUSSION
Due to the black-box nature of the neural network in the autoencoder, several initial trial runs were performed in the development of the deep denoising autoencoder to explore the influence of different parameters on the performance of the model. The parameters are the number of filters in the convolutional layers of the encoder and decoder, and the number of filters in the bottleneck layer. The baseline model as described in Table 2 has 8 filters in its bottleneck layer, with 16, 32, and 48 filters in the encoder layers and, symmetrically, 48, 32, and 16 filters in the decoder layers. For simplicity, each model is hereafter referred to by the number of filters in its encoder layers, such as 16-32-48 for the baseline model. Deviating from the baseline model, experiments are conducted by either reducing the number of filters in the convolutional layers to 8-16-32 or increasing it to 32-48-64. To determine the most suitable number of filters for the convolutional layers, stratified 10-fold cross-validation is applied to the supervised learning classifier of the FOG detection model. The average detection accuracy from the experiments conducted is tabulated in Table 2, with the best performance achieved by the 16-32-48 model.
Extending from the first set of experiments on the number of filters in the convolutional layers of the deep denoising autoencoder, the number of filters in its bottleneck layer was tuned as well, with 6 and 10 filters used in separate experiments. The bottleneck layer is set to a maximum of 10 filters, constricting the number of learned features to prevent the autoencoder from copying its inputs to its outputs. The average results for the stratified 10-fold cross-validation experiments on the number of filters in the bottleneck layer are tabulated in Table 3, which shows the best performance with 8 filters in the bottleneck layer.
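For readers unfamiliar with stratified k-fold cross-validation, the sketch below shows one simple way to build folds that preserve the FOG/non-FOG ratio. It is a plain-Python illustration with hypothetical function names and a hypothetical label vector; in practice a library routine would normally be used, and the text does not say which implementation the experiments relied on.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=10, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train indices, test indices) for each of the k folds."""
    folds = stratified_kfold_indices(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
        yield train_idx, test_idx


# Hypothetical imbalanced labels: 1 = FOG window, 0 = non-FOG window.
labels = [1] * 20 + [0] * 180
for train_idx, test_idx in train_test_splits(labels):
    print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Stratification matters here because FOG windows are the minority class; without it, some folds could contain very few FOG windows and give unstable sensitivity estimates.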
To summarize from Table 2 and Table 3, the deep denoising autoencoder with the 16-32-48 filters for its convolutional layers and 8 filters in its bottleneck layer learned the most salient characteristics in the Daphnet dataset. As for the FOG detection model, the representative classification model is chosen from the best performing model from the naïve Bayes, SVM with RBF and polynomial kernels, Random Forest, and Ensemble Voting classifiers. In this case, the Random Forest classifier produces the highest averages of sensitivity, specificity, and balanced accuracy compared to other classification models; therefore, it was selected as the FOG detection model. It is surmised that the Random Forest can generalize better as it is an ensemble learning technique with multiple decision tree classifiers.
Both the SVM with RBF and SVM with polynomial kernels also achieved good results. This is in line with the reputation of SVM in this area of research, as discussed in Section 2. Besides the efficient feature learning, the high performance of the SVM also shows that it is capable of mapping complex non-linear relationships with its use of the RBF and polynomial kernels. It is robust to outliers with the use of its regularization penalty error, and it can handle imbalanced datasets with the introduction of class weights as a cost-sensitive function.
For a more detailed analysis of the FOG detection model, the confusion matrix of the Random Forest classification model in Table 4 shows the predicted results of the test samples. From the confusion matrix, the number of false positive instances is rather high, leading to the low specificity of the classification model and signifying non-FOG windows incorrectly predicted as FOG episodes. Reflecting the real-life use case for the cueing of RAS, this means RAS may be cued when the patient is not freezing. In addition, the remaining false negatives mean some FOG episodes may not be detected by the FOG detection model, occasionally failing to cue RAS to help the PD patient to continue walking. One of the reasons for these errors is the different walking styles of PD patients, as indicated by past researchers such as [8], [12], [23].
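The three reported metrics follow directly from the four cells of a binary confusion matrix, with FOG as the positive class. The sketch below uses hypothetical counts chosen only to show the arithmetic; they are not the values from Table 4.

```python
def detection_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity and balanced accuracy.

    tp: FOG windows predicted as FOG      fn: FOG windows predicted as non-FOG
    tn: non-FOG predicted as non-FOG      fp: non-FOG predicted as FOG
    """
    sensitivity = tp / (tp + fn)   # lowered by false negatives
    specificity = tn / (tn + fp)   # lowered by false positives
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# HYPOTHETICAL counts for illustration only.
sens, spec, bal = detection_metrics(tp=90, fn=10, tn=60, fp=40)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} balanced={bal:.2f}")
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it less misleading than plain accuracy when non-FOG windows heavily outnumber FOG windows.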
While the deep denoising autoencoder can learn the salient characteristics of the Daphnet data, it tries to learn all the characteristics of the different walking styles of the PD patients as well. For example, an indication of an oncoming FOG episode may be movement slowing to the point of freezing, but a high volume of movement data from PD patients walking very slowly without FOG episodes leads the autoencoder to treat such movement as slow walking, so it is not flagged as a FOG episode. Because the volume of data for non-FOG episodes, including the minority of patients with different walking styles, exceeds the volume of data for FOG episodes, the autoencoder may not learn the characteristics of FOG episodes well. This problem can be addressed by collecting more data covering both the different walking styles of PD patients and FOG episodes. To compare this method with similar existing models, the following section discusses how each method measures up against the proposed method.

C. COMPARISON WITH EXISTING METHODS
For the objective evaluation of the feature learning and FOG detection model proposed in this research, its performance is benchmarked against the FOG detection models contributed by other researchers. However, due to the non-uniformity in the use and placement of sensors, along with the different methods of data collection and processing, the proposed model is benchmarked only against models tested on the Daphnet dataset. This significantly narrows down the potential benchmarking references to a handful that fall under the three main techniques discussed in Section 2.
The first reference is [8], the original research from which the Daphnet dataset originates. It uses thresholding techniques for the detection of FOG episodes. The references reported in [12], [13] feed manually handcrafted frequency features into a Random Forest supervised learning classifier. While the first three references rely on manually handcrafted features, the fourth, [18], attempts the detection of FOG episodes through modelling the sequence of input signals without the manual feature extraction process. Similarly, the fifth, [20], attempts the detection of FOG episodes through a deep learning convolutional neural network. Table 5 below summarizes these references in the order of their publication year, with the applied methods, window size, and the sensitivity, specificity, and balanced accuracy of each reference along with the proposed model.
A quick glance at Table 5 shows that, for models using 4s window sizes, the best performance with the highest sensitivity, specificity, and accuracy was achieved by [18], followed by [8], [12], [13] and [20]. Although the FOG detection model proposed in this research only uses 2s window sizes, its performance is comparable to those models using the 4s window size, and it surpasses the performance of the thresholding model contributed by [8] and the 1s and 2s window size models contributed by [12] and [13]. That said, the classification metrics in Table 5 only provide a comparison at a superficial level. Due to the differences in the data processing and FOG classification methods, a more in-depth analysis is required for an objective benchmarking of the proposed FOG detection model against the models contributed by other researchers.
Both the thresholding technique applied by [8] and the manual extraction of frequency and statistical features from the gait acceleration data applied by [12] and [13] rely on manually handcrafted features. This means the researchers had to manually produce a generalizing formula for the detection of FOG episodes, and the model relied heavily on that formula to detect the next FOG episodes. As discussed earlier, this typically involves the use and combination of many handcrafted features to find the right generalization formula, and it is highly dependent on the skillset and experience of the researcher. As there is no single generalization formula for all the different walking styles of PD patients, the manually handcrafted features cannot be transferred to others who walk differently. In contrast, the feature learning of the proposed model, which simply learns salient characteristics of the data, can be transferred to other walking styles as well.
While the 4s window size classification models by [12], [13] had a slightly better performance, this may be due to their training on data from all three sensor placements, compared with only the thigh sensor placement from the Daphnet dataset in this research. As discussed in Section 2, the study in [12] also focused more on patient-dependent experiments; that is, they calibrated the detection models with manually handcrafted features specifically designed for training and testing on data from the same patient, for each of the patients, in the detection of FOG episodes. Little description is provided of the data processing and classification techniques used for training and testing on the overall dataset in the patient-independent detection of FOG that contributed to the results. Furthermore, the study in [13] demanded heavy computation due to the complexity and number of the features that needed to be extracted. Hence, it is hard to conclude whether these models actually perform better.
The models reported in [18], [20], [21] eliminated manually handcrafted features with the use of deep learning methods. Although the evaluation of their performance produced results with a high detection accuracy, they are not directly comparable to the proposed model of this research due to their data processing techniques. The authors handled outliers by applying the three-sigma rule; any data three standard deviations away from the mean of the data was considered an outlier. The research replaces the data within the range of three-sigma to four-sigma with the median of the data and removes outliers beyond four-sigma. As acceleration data do not typically follow a normal distribution, this data processing method removed a lot of data from the training and testing of the model, including the data representing the different walking styles of the PD patients. With the removal of too much data, the model may have been evaluated based on the general walking styles of PD patients; there is no evidence that the contributed model can generalize to the different walking styles of PD patients.
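To make the outlier handling described above concrete, the sketch below applies the same rule to a one-dimensional signal: values between three and four standard deviations from the mean are replaced by the median, and values beyond four standard deviations are dropped. It is an interpretation of the description in this section rather than the cited authors' code; the function name and the choice of population standard deviation are assumptions.

```python
import statistics


def sigma_rule_filter(values, replace_from=3.0, remove_from=4.0):
    """Replace 3-4 sigma values with the median and drop values beyond 4 sigma.

    Mean, standard deviation and median are computed once on the raw data.
    Using the population standard deviation is an ASSUMPTION.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    median = statistics.median(values)
    cleaned = []
    for v in values:
        distance = abs(v - mean)
        if distance > remove_from * std:
            continue                    # beyond four sigma: removed
        if distance > replace_from * std:
            cleaned.append(median)      # between three and four sigma: replaced
        else:
            cleaned.append(v)           # within three sigma: kept
    return cleaned


# Hypothetical signal with one extreme sample.
signal = [-1.0, 1.0] * 50 + [50.0]
print(len(signal), len(sigma_rule_filter(signal)))
```

Because heavy-tailed acceleration data can place many genuine samples beyond three standard deviations, a filter like this can discard or flatten exactly the atypical gait patterns that the proposed model aims to retain.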
That aside, the use of CNN and LSTM-RNN typically involves large feature vectors for feature extraction and classification. In the case of [20], at least 20 filters were used with a kernel size of 33 in its convolutional layers, while the LSTM-RNN in [18] used 100 neurons in its hidden layer. The model reported in [21] consists of three convolutional layers and two squeeze-and-excitation blocks followed by an LSTM with an attention mechanism. This results in increased dimensionality of the data, which in turn increases the computational cost of the model as well. In comparison, the feature vectors of the proposed deep denoising autoencoder are relatively small, using only 8 filters with a kernel size of 5 in its bottleneck layer. This contributes to a significant dimensionality reduction in the data and a reduced computational cost, which is not only desirable in any learning algorithm but crucial in this research for low-latency detection.
Additionally, this research proposes a window size of only 2s, half the common 4s window size used by the benchmarked references, further contributing to the reduction in the model latency. Another advantage of autoencoder-based representation learning is the freedom to choose the preferred classification technique. Unlike classification with a CNN, which uses a fully connected layer, the proposed method allows any classification technique to be used to model the representation encoded by the deep denoising autoencoder. That is the case in this research, which uses the balanced Random Forest ensemble learning classifier.

V. CONCLUSION
This paper proposes a feature learning method using a deep denoising autoencoder for detecting FOG episodes in PD gait acceleration data. The proposed method automatically learns a feature representation of the data in an unsupervised manner, eliminating the need for manually handcrafted feature engineering. The deep denoising autoencoder is trained to minimize a cost function that includes the Kullback-Leibler divergence, which improves the reconstruction outputs. As a result, more salient features can be learned, improving the accuracy of the FOG detection model. We evaluate the proposed method on the benchmark Daphnet dataset. The results show that the proposed method achieves 90.94% sensitivity and 67.04% specificity. These results are comparable to the original Daphnet dataset research. The low specificity is attributed to the significant lack of data from the FOG class label compared with non-FOG class label data, which hinders feature learning of the deep denoising autoencoder.