Exploiting spectral and cepstral handwriting features on diagnosing Parkinson’s disease

Parkinson’s disease (PD) is the second most frequent neurodegenerative disease and is associated with several motor symptoms, including alterations in handwriting, also known as PD dysgraphia. Several computerized decision support systems for PD dysgraphia have been proposed; however, the associated challenges call for new approaches that enable more accurate diagnosis. This work therefore adds spectral and cepstral handwriting features to the temporal, kinematic, and statistics handwriting features already in use. First, we calculate temporal and kinematic features using the displacement signal; statistics features (SF) using the displacement and the horizontal and vertical displacements; and spectral (SDF) and cepstral (CDF) features using the displacement, the horizontal and vertical displacements, and the pressure. Since the employed dataset (PaHaW) contains only 37 PD patients and 38 healthy control (HC) subjects, as the second step we augment the smaller class of the training set until both classes are equal in size; we then augment both classes to enlarge the training data, adding random Gaussian noise in all augmentations. Third, the most relevant features are selected using the modified fast correlation-based filtering method (mFCBF). Finally, AutoML is employed to train and test more than ten plain and ensembled classifiers. Experimental results show that adding spectral and cepstral features to the temporal, kinematic, and statistics features greatly improves classification accuracy, reaching 98.57%. Our proposed model, with lower computational complexity, outperforms the best conventional state-of-the-art model for all tasks, which reaches 97.62%.


I. INTRODUCTION
Biometrics can be used for e-security and e-health [1] and can be grouped into two trait categories. Morphological biometrics, such as fingerprints or eye pupils, use direct measurements of physical traits of the human body [2], [3]. Behavioral biometrics, such as handwriting and drawing, use specific drawing and handwriting tasks performed by the subjects involved in data collection [4]. From a health perspective, online handwriting biometrics are more appealing and more informative about disease states, such as dementia, than other popular biometric traits, such as fingerprints or the iris [3], [4], because handwriting is part of routine functional activities from which evidence of disease-related impairment can be drawn.
In the last two decades, online handwriting processing has been employed in the computerized assessment of neurodegenerative disorders (e.g., Parkinson's disease (PD)) [5], [6]. Patients with PD experience progressive loss of dopaminergic neurons in the substantia nigra pars compacta (located in the midbrain), which is associated with cardinal motor symptoms such as bradykinesia, rigidity, resting tremor, or postural instability [7]-[9]. Therefore, especially during the clinical phase of the disorder, we can observe freezing of gait [10], hypokinetic dysarthria [11], hypomimia [12], or alterations in handwriting [6]. The latter were initially linked with micrographia, i.e., a progressive decline in the amplitude (vertical micrographia) or width (horizontal micrographia) of letters [13]. Nevertheless, micrographia is only one manifestation of altered handwriting in patients with PD; others include more pronounced changes in kinematics and dynamics. Letanneux et al. reported a connection to developmental dysgraphia and proposed a new and more general term, PD dysgraphia [14].
Recently, several decision support systems for diagnosing PD based on speech/voice analysis [15]-[20] or gait monitoring [21]-[23] have been proposed. However, compared with online handwriting processing, both speech assessment and gait monitoring require more technical equipment and are vulnerable to low signal quality in a non-controlled environment. Speech assessment requires high-quality recording conditions without background noise and further postprocessing of the recorded speech, including human-operated speech segmentation, which makes the whole process more difficult. Gait monitoring or tremor assessment techniques require specialized equipment, such as motion capture systems, accelerometers, and gyroscopes. In contrast, PD diagnosis based on handwriting processing can easily be administered at the clinic or at a patient's home. Handwriting acquisition is simple and natural and requires no timing or exhaustive repetitions.
A comprehensive review of the quantitative analysis of PD dysgraphia and its computerized diagnosis, covering works published until 2019, is available in [4]-[6], [14]. Furthermore, we review state-of-the-art designs published in 2020 and 2021, focusing on articles using online handwriting.
The rest of the paper is organized as follows: Section II reviews related works and presents state-of-the-art results obtained in PD diagnosis based on the PaHaW database. Section III describes the H2O platform used in this work. Section IV describes the PaHaW database. Section V describes the feature extraction process and the types of features obtained. Section VI presents a brief explanation of the modified version of the fast correlation-based filtering feature selection methodology. Section VII describes the front-end hyperparameters. Section VIII describes the experiments conducted and their results. Finally, in Section IX, we present final remarks, comments, and conclusions.

II. RELATED WORKS
This section reviews related works and state-of-the-art results obtained for PD diagnosis. Table I shows a summary of the state-of-the-art results.
Ammour et al. [24] quantitatively analyzed online handwriting in 28 PD patients and 28 age-matched healthy controls (HC). They quantified the performance of subjects (writing a text in Arabic letters) according to 1482 kinematic (velocity, acceleration, jerk, etc.), dynamic (pressure, tilt, azimuth, etc.), temporal (e.g., duration), and additional features. Using a semi-supervised approach (employing cluster analysis), they differentiated the PD group with 71.44% accuracy. Furthermore, they concluded that the fine motor skill complications of PD patients were mainly manifested in the kinematic feature set. Ali et al. [25] proposed a method for dealing with the highly unbalanced HandPD dataset. To improve PD detection accuracy on this dataset, they developed a cascaded learning system that cascades a Chi2 model with an adaptive boosting (AdaBoost) model. Experimental results confirmed that the proposed cascaded system outperforms six similar cascaded systems built with six state-of-the-art machine learning models.
Taleb et al. [26] introduced a PD diagnosis concept that uses convolutional neural networks (CNN) fed by spectrograms (calculated from various online handwriting/drawing tasks) and CNN bidirectional long short-term memory networks (CNN-BLSTM) fed by raw time series. On the publicly available dataset HandPDMultiMC, containing 21 PD patients and 21 HC, a classification accuracy of approximately 97.62% was achieved by combining CNN-BLSTM models trained with jittering and synthetic data augmentation. They trained a model with 204,060 parameters for one day on an 8 GB NVIDIA GTX 1080 GPU.
Gupta et al. [27] explored the effect of age and gender on the performance of classification models. They used the PaHaW database [28], containing 37 PD patients and 38 HC, whose subjects performed seven tasks including a sentence and isolated words. The data were parametrized using kinematic, entropic, and energetic features and fed into age- and gender-dependent support vector machine (SVM) models. A distinct set of discriminative features was observed in each category (age vs. gender). The results showed the classification accuracy of a general model improving from 75.76% to 83.75% on the female set and to 79.55% on the male set.
Aouraghe et al. [29] focused on the effect of progressing fatigue on PD dysgraphia. They enrolled 40 PD patients and 40 HC, who copied a multiline paragraph in Arabic letters. First, the paragraph was segmented into individual lines; each line was then processed separately using a set of temporal, kinematic, dynamic, spectral, entropy-based, and wavelet-based features. The feature space was modeled by a k-nearest neighbor (KNN) classifier, SVM, and decision trees. An accuracy of 92.86% was obtained when processing the last line of the paragraph, i.e., the line where the fatigue is most accentuated.
Deharab et al. [30] introduced a novel online handwriting parameterization using dynamic writing traces warping (DWTW). DWTW was applied to kinematic patterns of handwriting and returned parameters linked with the similarity between normative and pathological time series. The features were modeled using SVM and evaluated on the PaHaW dataset (29 PD and 32 HC; all eight tasks, including handwriting and drawing of the Archimedean spiral), and an accuracy of 88.33% was achieved.
Parziale et al. [31] addressed a recurrent issue in most published works: clinical interpretability. More specifically, authors frequently use handcrafted features poorly linked to physiological processes and employ less interpretable machine learning models (so-called black boxes).

Such black-box systems are unacceptable for clinicians. Therefore, Cartesian genetic programming (CGP), which provides a trade-off between performance and interpretability, was used in comparison with more common classifiers. The proposed methodology was evaluated on two datasets, PaHaW (37 PD and 38 HC; all tasks were used) and NewHandPD (31 PD and 35 HC; subjects performed a spiral and a meander). Using conventional temporal, kinematic, and dynamic features, CGP produced results that were more accurate than those of white-box methods (reaching 71.18% on PaHaW and 80.39% on NewHandPD) and more interpretable than those of black boxes.
Lamba et al. [32] analyzed basic temporal (e.g., duration) and kinematic (e.g., velocity, acceleration, jerk) measures in 62 PD patients and 15 HC (enrolled in the University of California, Irvine (UCI) Parkinson's disease spiral drawings dataset). Due to the high class imbalance, the synthetic minority oversampling technique was employed to balance the cohort. Next, the data were modeled by several machine learning models, e.g., SVM, AdaBoost, and XGBoost. Finally, a classification accuracy of 96.02% was reported for AdaBoost.
Diaz et al. [33] processed time series of online handwriting (including velocity, acceleration, jerk, displacement, pressure, etc.) using one-dimensional convolutions and bidirectional gated recurrent units (BiGRUs). This end-to-end pipeline was applied to PaHaW (37 PD and 38 HC; all tasks were used) and NewHandPD (31 PD and 35 HC; all tasks were used). The method provided competitive results (96.25% accuracy on PaHaW and 94.44% on NewHandPD), thus confirming the effectiveness of the sequence learning paradigm for processing sequential handwriting data.
Impedovo [34] investigated different velocity-based signal processing techniques for PD assessment, using kinematic, energy, and cepstral features. The energy and cepstral features are similar to the ones used in this work, but no filterbank is used, and hence the cepstral features are not calculated from filterbank outputs. An accuracy of 93.7% for all tasks, and 98.44% when the top three tasks are selected, was reported.
Mucha et al. [35] combined kinematic features with fractional-order derivatives and reported an accuracy of 97.14% for a continuous and repetitive task, the Archimedean spiral.
Finally, in [36], we proposed the use of spectral and cepstral features for emotion recognition; there, we concatenated these features with very simple temporal features.
This study explores new approaches to online handwriting parameterization, augmentation, analysis, and modeling, with a special focus on improved diagnostic accuracy. Furthermore, we explore the impact of the newly proposed spectral and cepstral features on classification accuracy and improve the pipeline by adding data augmentation and a modified fast correlation-based filtering feature selection method.

III. DATA MODELING
For data modeling, we use H2O AutoML [37], [38]. Automatic machine learning (AutoML) is the process of automating algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment; it eases the training and evaluation of machine learning models. AutoML includes many ML models; however, we limit the number of models to the ones shown in Table II. It also builds ensembles of the best models, and these ensembles outperform the individual models. Furthermore, it uses the area under the ROC curve as the default ranking metric for binary classification. The configuration is such that individual models are tuned using two-fold cross-validation. AutoML automatically performs Bayesian hyperparameter optimization.
Since a default performance metric for each machine learning task is specified internally, the leaderboard is sorted by that metric.
The ML models in Table II include stacked ensemble models. A stacked ensemble is an efficient ensemble method in which the predictions of machine learning algorithms are used as inputs to a second-layer learning algorithm. In the second layer, H2O ensembles all models (StackedEnsemble_AllModels) as well as the best of each family (StackedEnsemble_BestOfFamily), which includes the best model of each kind in the final ensemble.

IV. PAHAW DATABASE
This study employed the Parkinson's disease handwriting database (PaHaW), containing 37 PD patients and 38 age- and gender-matched HC enrolled at the Department of Neurology, St. Anne's University Hospital in Brno [28]. Besides age and gender, the PD group is described in terms of PD duration, Unified Parkinson's Disease Rating Scale part V score, and levodopa equivalent daily dose.
None of the subjects had a history or presence of any psychiatric symptom or any disease affecting the central nervous system other than PD. The acquisition was performed with the patients in their ON state, i.e., approximately one hour after taking levodopa. During the acquisition, the subjects were rested and seated in front of a table in a comfortable position. They completed a protocol on a printed template at a comfortable speed. The prefilled template was shown to the subjects; no restrictions were placed on the number of repeated syllables/words in the tasks or on their heights. The signals were recorded at a 133 Hz sampling rate using an Intuos 4M (Wacom Technology) digitizing tablet and a Wacom inking pen.

The protocol consists of the following eight tasks: Task 1 asks the user to draw, from the inside out, an Archimedean spiral; tasks 2, 3, and 4 ask the user to repetitively write the cursive letter "l," the syllable "le," and the trigram "les," respectively; tasks 5, 6, and 7 ask the user to write words with simple orthography and easy syntax, such that they are written in one continuous movement; finally, task 8 requires the user to write a longer sentence.
When the user was writing or drawing on the tablet (Fig. 3.1), the application captured the horizontal and vertical displacements of the pen tip along the x-axis, x(t), and the y-axis, y(t), respectively. Furthermore, the on-surface/in-air pen status (touching/not touching the tablet's surface), the altitude of the pen with respect to the tablet's surface, the pressure applied by the pen tip, the azimuth angle of the pen with respect to the tablet's surface, and the signal's timestamp were captured. Augmented feature vectors are generated as F_new = F_u + α·n, where F_u is the feature vector of a randomly chosen user, α is a value less than 0.2, and n is a vector of Gaussian random values. F_new, F_u, and n are vectors of equal dimension.
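The augmentation rule F_new = F_u + α·n, together with the two-stage scheme (balance the smaller class, then enlarge both classes), can be sketched in NumPy as follows. The helper names and the 80% enlargement fraction follow the text; the code itself is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def augment_features(F_u, alpha=0.2, rng=None):
    """Generate F_new = F_u + alpha * n, where n is a vector of standard
    Gaussian random values with the same dimension as F_u (alpha < 0.2)."""
    rng = np.random.default_rng() if rng is None else rng
    n = rng.standard_normal(F_u.shape)
    return F_u + alpha * n

def balance_and_augment(X_pd, X_hc, alpha=0.2, extra_frac=0.8, seed=0):
    """Step 1: top up the smaller class until both classes are equal in
    size. Step 2: enlarge both classes by augmenting a random fraction
    (extra_frac) of their users. Rows are per-user feature vectors."""
    rng = np.random.default_rng(seed)
    small, large = sorted([X_pd, X_hc], key=len)
    while len(small) < len(large):
        u = small[rng.integers(len(small))]       # random user to copy
        small = np.vstack([small, augment_features(u, alpha, rng)])
    out = []
    for X in (small, large):
        k = int(extra_frac * len(X))
        idx = rng.choice(len(X), size=k, replace=False)
        out.append(np.vstack([X] + [augment_features(X[i], alpha, rng)
                                    for i in idx]))
    return out
```

With 37 PD and 38 HC training rows, step 1 brings the smaller class to the size of the larger, and step 2 adds noisy copies of 80% of each class's users.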

V. FEATURE EXTRACTION
The front-end used in this study is shown in Fig. 5.1. This section describes the kinematic, statistics, spectral-domain, and cepstral-domain features used in the front-end. Definitions for calculating these features are provided in the next subsections, and their graphical representation is shown in Fig. 5

A. TEMPORAL AND KINEMATIC FEATURES
The row vector of temporal and kinematic features for each task and user, applied to displacement, is defined as in [34]. The per-user row vector is then the concatenation of the per-task row vectors, expressed using relational algebra: to concatenate, we transpose each row vector into a column vector, append the resulting column vectors using the union operation, and finally transpose the combined column vector back into a row vector.
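A minimal NumPy sketch of this step, assuming the 133 Hz sampling rate of the tablet: the specific measures below (duration plus mean and peak of velocity, acceleration, and jerk, estimated by finite differences) are a representative subset of the measures in [34], not their full list, and `concat_tasks` is the transpose/union/transpose concatenation described above.

```python
import numpy as np

FS = 133.0  # PaHaW sampling rate in Hz

def kinematic_features(x, y, fs=FS):
    """Temporal/kinematic row vector from pen-tip coordinates: duration
    plus mean and peak magnitude of velocity, acceleration, and jerk,
    estimated by successive finite differences of the displacement."""
    disp = np.hypot(np.diff(x), np.diff(y))   # displacement per sample
    vel = disp * fs                           # velocity
    acc = np.diff(vel) * fs                   # acceleration
    jerk = np.diff(acc) * fs                  # jerk
    feats = [len(x) / fs]                     # task duration
    for s in (vel, acc, jerk):
        feats += [np.mean(s), np.max(np.abs(s))]
    return np.asarray(feats)

def concat_tasks(per_task_vectors):
    """Concatenate per-task row vectors into one per-user row vector."""
    return np.concatenate(per_task_vectors)
```

For a user who performed all eight tasks, `concat_tasks` is applied to the eight per-task vectors.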

B. STATISTICS FEATURES
The statistics features are obtained from the kinematic and stroke signals [39]. For each task and user, the following statistics are computed over every signal in the task's signal set: the mean; the standard deviation; the outlier-robust range (99th percentile minus 1st percentile); the quartiles; the percentiles at 1, 5, 10, 20, 30, 90, 95, and 100; and the kurtosis. The set of tasks comprises the spiral, the letter "l," the syllable "le," the trigram "les," the three words, and the sentence. The row vector of the statistics features for each user is then obtained, again using relational algebra, by concatenating the per-task row vectors.
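The per-signal statistics can be sketched with NumPy alone; the kurtosis here is the fourth standardized moment (Pearson kurtosis, which equals 3 for a Gaussian), an assumption since the text does not state which convention [39] uses.

```python
import numpy as np

PCTS = [1, 5, 10, 20, 30, 90, 95, 100]  # percentiles named in the text

def stats_features(sig):
    """Statistics row vector for one signal: mean, standard deviation,
    outlier-robust range (99th - 1st percentile), quartiles, the
    percentiles listed above, and kurtosis."""
    sig = np.asarray(sig, dtype=float)
    robust_range = np.percentile(sig, 99) - np.percentile(sig, 1)
    quartiles = np.percentile(sig, [25, 50, 75])
    z = (sig - sig.mean()) / sig.std()
    kurtosis = np.mean(z ** 4)  # fourth standardized moment
    return np.concatenate([[sig.mean(), sig.std(), robust_range],
                           quartiles, np.percentile(sig, PCTS),
                           [kurtosis]])
```

The resulting 15-element vector is computed for every signal of every task and concatenated per user.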

C. SPECTRAL-DOMAIN FEATURES
The spectral-domain feature row vector for each task and user is computed from the corresponding signals, and the per-user row vector of spectral-domain features is obtained by concatenating the per-task row vectors using relational algebra.

D. CEPSTRAL-DOMAIN FEATURES
The row vector of the cepstral-domain features for each user is obtained in the same way, again using relational algebra, by concatenating the per-task cepstral row vectors.
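A minimal sketch of the spectral/cepstral pipeline as described elsewhere in the paper (magnitude spectrum, triangular filterbank, log, then DCT-II of the log filterbank energies). The concrete filterbank values (initial frequency, filter bandwidth, overlap, number of filters) are illustrative placeholders; they are the tunable hyperparameters discussed in the front-end hyperparameters section.

```python
import numpy as np

def triangular_filterbank(n_fft, fs, f0=0.5, bw=3.0, overlap=0.5,
                          n_filters=8):
    """Bank of triangular filters over the one-sided spectrum, starting
    at f0 Hz, each bw Hz wide, with successive filters overlapping by
    the fraction `overlap`."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    step = bw * (1.0 - overlap)           # spacing between filter centers
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        center = f0 + i * step + bw / 2.0
        fb[i] = np.clip(1.0 - np.abs(freqs - center) / (bw / 2.0),
                        0.0, None)
    return fb

def spectral_cepstral(sig, fs, n_ceps=6, **fb_kwargs):
    """Spectral features: filterbank energies of the magnitude spectrum.
    Cepstral features: DCT-II of the log filterbank energies."""
    spec = np.abs(np.fft.rfft(sig))
    fb = triangular_filterbank(len(sig), fs, **fb_kwargs)
    energies = fb @ spec
    logE = np.log(energies + 1e-12)
    k = np.arange(len(logE))
    ceps = np.array([np.sum(logE * np.cos(np.pi * n * (2 * k + 1)
                                          / (2 * len(k))))
                     for n in range(n_ceps)])  # DCT-II, unnormalized
    return energies, ceps
```

Applying the filterbank before the cepstral transform is what distinguishes these features from the filterbank-free cepstral features of [34].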

E. USERS FEATURE
The row feature vector F_u of each user is the concatenation, using relational algebra, of the temporal/kinematic, statistics, spectral-domain, and cepstral-domain row vectors. Additionally, we define the disease state (PD or HC) for each user and append it as a label to the feature row vector. The data frame is defined as the union over all users and can likewise be expressed in relational algebra notation. In this data frame, the rows represent the users, and the columns represent the features and each user's disease state.

VI. FEATURE SELECTION
Feature selection is a popular and common premodeling step in machine learning, especially in high-dimensional databases. Irrelevant features decrease the accuracy of data models because models also learn irrelevant information. Thus, selecting the right number of features increases the performance of the machine learning method.
Several feature selection methods exist. All of them aim to obtain the best features, and most do so by employing statistical correlation-based criteria; the major difference between these methods is the selection criterion. Each patient has a considerable number of features, so we reduce the feature dimensionality using a modified fast correlation-based filtering (FCBF) method [40].
FCBF is based on two correlation factors: the correlation between each feature and the output, and the correlations among the features themselves.
FCBF selection is based on two steps. In the first step, the selected features are those whose correlation with the output is higher than a threshold. In the second step, among the features retained in the first step, those whose pairwise correlations are lower than a second threshold are selected. Our modified version, mFCBF [36], differs from the original FCBF at step 5, where the selected feature is the one having the highest correlation with the output. Algorithm 5.1 shows pseudo-code for the mFCBF process. The mFCBF algorithm receives, as inputs, a data frame and the thresholds th1 and th2. th1 sets the lowest admissible correlation between each selected feature and the output; th2 sets the highest admissible correlation between features. By tuning th1 and th2, we can find the feature subset that maximizes the performance of the machine learning method. The result is a 2-D array in which one dimension represents the users and the other the selected features. Feature selectivity is controlled by the th1 and th2 values.
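The two-step mFCBF selection can be sketched as follows, assuming Pearson correlation as the correlation measure (the text does not state which measure [40] uses) and the default thresholds th1 = 0.15 and th2 = 0.7 mentioned above.

```python
import numpy as np

def mfcbf(X, y, th1=0.15, th2=0.7):
    """Modified FCBF sketch. Step 1: keep features whose |corr| with the
    output exceeds th1. Step 2: scan features in decreasing order of
    output correlation and keep a feature only if its |corr| with every
    already-kept feature is below th2 -- so when two features are
    redundant, the one with the higher output correlation wins."""
    out_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                         for j in range(X.shape[1])])
    candidates = [j for j in np.argsort(-out_corr) if out_corr[j] > th1]
    selected = []
    for j in candidates:  # ordered by output relevance
        redundant = any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) >= th2
                        for k in selected)
        if not redundant:
            selected.append(j)
    return X[:, selected], selected
```

Lowering th1 or raising th2 admits more features; the pair is tuned jointly with the filterbank hyperparameters.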
One way to visualize the relevance of features to improving performance is RadViz [41]. In RadViz, each data frame sample is represented inside a circle according to a physical metaphor: each point is attached to each feature with a force proportional to the value the sample takes in the corresponding series, so the final position is the equilibrium position of all the forces representing the features. Figs. 5.1 and 5.2 show the RadViz plots of the 2658 features and of the 47 selected features, respectively. RadViz shows the dominant proportional values (DPV) of the features; in the graph, higher cloud dispersion means higher DPV, and a higher DPV means that features can more easily be exploited to improve the classification. The 47 selected features have higher DPVs than the complete set of 2658 features.

VII. FRONT-END HYPERPARAMETERS
The spectral-domain features are a function of the filterbank bandwidth, the bandwidth of the individual filters in the filterbank, the filterbank's initial frequency, and the overlap between filters in the filterbank. Therefore, the hyperparameters for the final feature vector are these four filterbank parameters together with the selection thresholds th1 and th2, and for practical reasons a range of values is defined for each of them. A different set of features is selected for each combination of values, and each set of features produces a corresponding classification accuracy. Since one of these combinations is optimal, we search for the combination that maximizes the ML accuracy.
Since the augmentation of the training data, the user selection, and the Gaussian noise are random, we cannot know in advance which users and random sequences generate a better model. Therefore, we train and test the model for different sets of users and different random sequences, and we select the maximum accuracy.
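The search over hyperparameter combinations and random augmentation seeds can be organized as a grid from which the best accuracy is retained. The parameter ranges below are illustrative placeholders (the paper's actual ranges are not reproduced here), and `evaluate` stands in for one full train/test run.

```python
import itertools

def grid_search(evaluate, param_grid, n_seeds=5):
    """Return the best accuracy and the (params, seed) pair that
    produced it. `evaluate(params, seed)` trains and tests the model
    for one hyperparameter combination and one augmentation seed."""
    best = (-1.0, None)
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        for seed in range(n_seeds):
            acc = evaluate(params, seed)
            if acc > best[0]:
                best = (acc, (params, seed))
    return best

# Illustrative ranges only -- not the paper's actual values.
grid = {"f0": [0.1, 0.5], "filter_bw": [2, 4], "overlap": [0.25, 0.5],
        "th1": [0.1, 0.15], "th2": [0.6, 0.7]}
```

Iterating over seeds inside the grid captures the paper's practice of repeating training over different random user selections and noise sequences and keeping the maximum accuracy.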

VIII. RESULTS
Leave-Percentage-Out (LPO) was used for testing. Here, the data model is trained with all database registers except a held-out percentage, and the test is performed on the registers that were left out. This is repeated until all possibilities are checked, and the accuracies of all tests are then averaged. In our experiments, we leave 15% out.
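LPO with 15% held out can be sketched as repeated random splits whose test accuracies are averaged. The classifier below is a stand-in nearest-centroid model, not the AutoML ensemble, and the number of repeats is an illustrative choice.

```python
import numpy as np

def leave_percentage_out(X, y, pct=0.15, n_repeats=20, seed=0):
    """Average test accuracy over repeated random splits that each
    hold out `pct` of the registers."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(pct * len(X))))
    accs = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(X))
        test, train = idx[:n_test], idx[n_test:]
        # stand-in classifier: nearest class centroid
        centroids = {c: X[train][y[train] == c].mean(axis=0)
                     for c in np.unique(y[train])}
        preds = [min(centroids,
                     key=lambda c: np.linalg.norm(x - centroids[c]))
                 for x in X[test]]
        accs.append(np.mean(np.array(preds) == y[test]))
    return float(np.mean(accs))
```

In the experiments reported below, the held-out 15% plays the role of `test`, and the AutoML models replace the stand-in classifier.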
We sample the filterbank hyperparameters over the ranges defined above and find the combination in this sample space that maximizes the ML accuracy. Table III shows the accuracies for the different feature sets. The accuracy results for the temporal and kinematic features, with and without feature selection, are 88.87% and 80%, respectively. The accuracy results when the statistics features are concatenated, with and without feature selection, are 94.28% and 80%, respectively.
From these two experiments, we find that adding the statistics features to the temporal and kinematic features improves the accuracy. Table III also shows that when the spectral-domain features are further concatenated, the accuracy with and without feature selection is 97.14% and 82.85%, respectively. Finally, when the cepstral-domain features are concatenated as well, the accuracy with and without feature selection is 98.57% and 88.57%, respectively. The training data for all experiments were augmented by 80%, and the amplitude of the random Gaussian noise (α) was set to 0.2.

IX. CONCLUSIONS AND FUTURE WORK
We applied spectral and cepstral features to Parkinson's disease detection. Although spectral and cepstral features had previously been applied successfully to emotion detection, here we concatenate them with kinematic and statistics features. Similar features were used in [52], but without a filterbank; our filterbank-based formulation provides the flexibility to change the filterbank's bandwidth, the bandwidth of its filters, the overlap between filters, and the initial frequency to improve performance.
As the first step, we calculated the temporal and kinematic features using the displacement signal; the statistics features using the displacement and the horizontal and vertical displacements; and the spectral and cepstral features using the displacement, the horizontal and vertical displacements, and the pressure signals.
Since the employed dataset (PaHaW) contains 37 PD patients and 38 HC subjects, as the second step we augmented the smaller class of the training set until both classes were equal in size; next, we augmented both classes of the training data by randomly selecting 80% of the training patients' data and adding random Gaussian noise in all augmentations. For the third step, we selected the most relevant features using the mFCBF method. Finally, AutoML was employed to train and test more than ten plain and ensembled classifiers.
Experimental results show that adding spectral and cepstral features to the kinematic and statistics features greatly improves the classification accuracy, reaching a combined classification accuracy of 98.57%. This result shows that our proposed model outperforms the best state-of-the-art result, which sits at 97.62%. Moreover, the state-of-the-art model has higher computational complexity, requiring the training of a model with 204,060 parameters for one day on an 8 GB NVIDIA GTX 1080 GPU.