Parkinson’s Disease Detection Using Smartphone Recorded Phonemes in Real World Conditions

Parkinson’s disease (PD) is a multi-symptom neurodegenerative disease. There are no biomarkers; the diagnosis and monitoring of disease progression require clinical and functional symptom observation. Voice impairment is an early symptom of PD, and computerized analysis of voice has been proposed for early detection and monitoring of the disease. However, many studies show poor reproducibility, which is attributed to their experimental data having been collected under controlled conditions. To overcome the limitations of earlier works, this study investigated three sustained phonemes, /a/, /o/, and /m/, recorded using an iOS-based smartphone from 72 participants (36 people with PD and 36 healthy) in a typical clinical setting. A number of signal features were obtained, statistically investigated, and ranked to identify suitable feature sets, which were then classified using machine learning models. The results show that the combination of phonemes /a/+/o/+/m/ was best suited to differentiating the voice of people with PD from healthy control participants, with an average accuracy, sensitivity, and specificity of 100%, 100%, and 100%, respectively, using leave-one-out validation. The findings of this study could assist in clinical assessment and remote telehealth monitoring of people with parkinsonian dysarthria using smartphones.

1. Data were recorded in a normal clinical setting with background noise.
2. The recordings were made using a commercially available smartphone with default settings.
3. Only three phonemes were recorded, and the task was not dependent on language skills.
4. The performance was perfect, with 100% sensitivity and specificity, outperforming state-of-the-art methods.

A. PARTICIPANTS

Seventy-two age-matched volunteers, comprising 36 people with PD and 36 healthy participants as the HC group, participated in this study. The data can be found in our previously reported work [30]. All the people with PD had been diagnosed with PD within the last ten years based on procedures complying with the Queen Square Brain Bank criteria for idiopathic PD [33]. The presence of any advanced PD clinical symptoms, such as visual hallucinations, frequent falling, cognitive disability, or need for institutional care, was an exclusion criterion [34]. People with PD were recruited from the movement disorder clinic at Monash Medical Centre and the Dandenong Neurology clinic, while the HC participants were recruited from several retirement centers. Table 1 presents the participants' demographics, cognitive stage, health history, and UPDRS-III scores [35].

Figure 1 illustrates the block diagram of the proposed method for classifying PD from HC. As shown in Figure 1, three phonemes were recorded from PD and HC participants using a smartphone. Each phoneme was segmented before features were extracted from it, and machine-learning-based classification was applied to identify PD from HC. The details of each step are described below. The participants were asked to sustain each phoneme for as long as was comfortable, at their natural pitch and loudness.

During the recording, they held the smartphone as if they were making a phone call. The voices of the 72 participants (36 PD and 36 HC) were recorded using an iOS-based smartphone (iPhone 6S Plus) with its built-in microphone and default settings. Each recording was segmented, and features were extracted from each segment; recordings containing the voice of the instructor were removed. In the original recordings, the signal-to-noise ratio was 16-24 dB (average 19.26 dB), similar to typical Australian clinical conditions. The first step of feature extraction was to locate the time instances (t_i) and amplitudes (A_i) of the pulses in the recording representing the glottal vibration. The instantaneous period of the glottal wave (T_i) was calculated as the difference between subsequent pulse instances, T_i = t_{i+1} - t_i.
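The period computation above, and the local jitter commonly derived from it, can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: it assumes the pulse times t_i have already been detected, and `local_jitter` uses the standard relative-jitter definition (mean absolute difference of consecutive periods over the mean period).

```python
import numpy as np

def glottal_periods(pulse_times):
    """Instantaneous glottal periods T_i = t_{i+1} - t_i."""
    t = np.asarray(pulse_times, dtype=float)
    return np.diff(t)

def local_jitter(periods):
    """Relative local jitter: mean absolute difference of consecutive
    periods divided by the mean period (dimensionless)."""
    T = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(T))) / np.mean(T)

# A perfectly periodic pulse train (one pulse every 5 ms) has zero jitter.
t = np.arange(100) * 5.0            # pulse times in ms
T = glottal_periods(t)
print(local_jitter(T))              # 0.0 for a constant period
```

A healthy sustained phonation yields low jitter; tremor in the vocal folds perturbs the period sequence and raises it.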

Shimmer (abs, dB) was computed from the pulse amplitudes. The relation between the excitation in the signal and the turbulent noise created due to incomplete closure of the vocal folds could be captured by the glottal-to-noise excitation (GNE) features [41].
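Shimmer in dB can be computed from the pulse amplitudes A_i located earlier. The sketch below uses the standard definitions (mean absolute 20·log10 ratio of consecutive amplitudes for dB shimmer, and the relative variant); it is illustrative rather than the authors' exact implementation.

```python
import numpy as np

def shimmer_db(amplitudes):
    """Shimmer (dB): mean absolute base-10 log ratio of consecutive
    pulse peak amplitudes, scaled by 20."""
    A = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(20.0 * np.log10(A[1:] / A[:-1])))

def shimmer_relative(amplitudes):
    """Relative shimmer: mean absolute consecutive amplitude
    difference divided by the mean amplitude."""
    A = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(A))) / np.mean(A)

# A constant-amplitude pulse train has zero shimmer.
print(shimmer_db(np.full(50, 0.8)))   # 0.0
```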

GNE was computed using the steps proposed by Michaelis et al. [42]. The above-mentioned features mainly target the characterization of vocal fold dynamics, as these are affected in PD patients.
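The core idea of GNE, correlating Hilbert envelopes of different frequency bands, can be sketched as follows. This is a deliberately simplified version: the full method of Michaelis et al. also downsamples the signal and applies inverse filtering first, which is omitted here. Band envelopes are obtained from the band-limited analytic signal computed in the frequency domain.

```python
import numpy as np

def band_envelope(x, fs, lo, hi):
    """Hilbert envelope of x restricted to the band [lo, hi] Hz,
    via the band-limited analytic signal in the frequency domain."""
    n = len(x)
    X = np.fft.fft(x)
    f = np.fft.fftfreq(n, d=1.0 / fs)
    mask = (f >= lo) & (f <= hi)      # keep positive in-band frequencies
    A = np.zeros(n, dtype=complex)
    A[mask] = 2.0 * X[mask]
    return np.abs(np.fft.ifft(A))

def gne_simplified(x, fs, bw=1000.0, step=500.0, fmax=4500.0):
    """Simplified GNE: maximum correlation between Hilbert envelopes
    of sufficiently separated frequency bands."""
    los = np.arange(0.0, fmax - bw, step)
    envs = [band_envelope(x, fs, lo, lo + bw) for lo in los]
    best = -1.0
    for i in range(len(envs)):
        for j in range(i + 1, len(envs)):
            if los[j] - los[i] > bw / 2:          # skip adjacent bands
                r = np.corrcoef(envs[i], envs[j])[0, 1]
                best = max(best, r)
    return best

# A noise-free glottal-like pulse train excites all bands coherently,
# so its simplified GNE is close to 1.
fs = 10000
x = np.zeros(fs)
x[::50] = 1.0                                     # 200 Hz pulse train
print(gne_simplified(x, fs))
```

Turbulent noise from incomplete glottal closure decorrelates the band envelopes, lowering GNE, which is why it separates breathy PD voices from healthy ones.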

Since the articulators of the vocal tract, such as the tongue, jaw, and lips, are also affected by PD [43], we incorporated features that best characterize the vocal tract articulators, such as mel-frequency cepstral coefficients (MFCCs).

MFCCs measure the energy of the speech signal in each frequency band (Equation 15). Since PD also affects the articulators of the vocal tract, it is hypothesized that the MFCCs will differ between PD and HC.

Spectral analysis is used to understand the oscillatory trend of the signal but does not carry temporal information. The wavelet transform (WT) is a technique based on time-limited waves, referred to as wavelets, that performs multi-resolution time-frequency analysis. In this context, it converts the one-dimensional time-domain signal to a two-dimensional time-frequency representation without losing the temporal information. The discrete WT (DWT) decomposes the signal into different frequency bands, represented by approximation and detail coefficients, with each scale corresponding to a halving of the frequency. In this study, the recordings were decomposed at level 10, which covers the entire audible range of the recordings. The Daubechies mother wavelet with 10 vanishing moments (Db10) was chosen. Energy, entropy, and TKEO features were computed from each DWT approximation and detail coefficient band.
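The per-band energy, entropy, and Teager-Kaiser energy operator (TKEO) features can be sketched as below. This is an illustrative simplification: it uses a Haar wavelet and 5 levels for brevity instead of the Db10 decomposition at level 10 used in the study, and the entropy is the Shannon entropy of the normalized squared coefficients.

```python
import numpy as np

def tkeo(x):
    """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def haar_dwt_bands(x, levels):
    """Multilevel Haar DWT: detail coefficients of each level followed
    by the final approximation (coarsest band last)."""
    a = np.asarray(x, dtype=float)
    bands = []
    for _ in range(levels):
        if len(a) < 2:
            break
        if len(a) % 2:                        # pad to even length
            a = np.append(a, a[-1])
        detail = (a[0::2] - a[1::2]) / np.sqrt(2.0)
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
        bands.append(detail)
    bands.append(a)
    return bands

def band_features(x, levels=5):
    """Energy, Shannon entropy of normalized squared coefficients,
    and mean TKEO for every wavelet band."""
    feats = []
    for c in haar_dwt_bands(x, levels):
        e = np.sum(c ** 2)
        p = c ** 2 / e if e > 0 else np.full(len(c), 1.0 / len(c))
        ent = -np.sum(p * np.log2(p + 1e-12))
        tk = np.mean(tkeo(c)) if len(c) > 2 else 0.0
        feats.extend([e, ent, tk])
    return np.array(feats)

sig = np.sin(2 * np.pi * 50 * np.arange(2048) / 8000.0)
print(band_features(sig).shape)               # (18,): 3 features x 6 bands
```

Because the Haar transform here is orthonormal, the band energies sum to the total signal energy, which is a convenient sanity check on the decomposition.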

A large number of features increases the risk of overfitting, can lead to higher error, and increases the computational complexity [45], [46]; the exclusion of redundant features is therefore necessary [46]. During feature selection, the first step was to identify the features that were statistically different (p < 0.0001) between the two groups using the Mann-Whitney U test. Next, feature selection algorithms were applied to identify the best features. To remove algorithm bias, four different feature selection algorithms were compared: i) infinite latent feature selection (ILFS), ii) least absolute shrinkage and selection operator (LASSO), iii) Relief-F, and iv) unsupervised discriminative feature selection (UDFS).
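The first screening step can be sketched as follows. This is a minimal illustration of the Mann-Whitney U filter only (not the four ranking algorithms), using the normal approximation to the U statistic without tie correction, which is adequate for groups of 36 as in this study. The feature matrices `X_pd` and `X_hc` (subjects × features) are assumed inputs.

```python
import numpy as np
from math import erfc, sqrt

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U test p-value (normal approximation,
    no tie correction)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    ranks = np.empty(n1 + n2)
    order = np.argsort(np.concatenate([x, y]))
    ranks[order] = np.arange(1, n1 + n2 + 1)
    u1 = np.sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0   # U for the first group
    mu = n1 * n2 / 2.0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    return erfc(abs(z) / sqrt(2.0))                  # two-sided p-value

def screen_features(X_pd, X_hc, alpha=1e-4):
    """Indices of features whose PD/HC distributions differ at alpha."""
    return [j for j in range(X_pd.shape[1])
            if mann_whitney_p(X_pd[:, j], X_hc[:, j]) < alpha]

rng = np.random.default_rng(0)
pd_f = rng.normal(0.0, 1.0, (36, 3))
hc_f = rng.normal(0.0, 1.0, (36, 3))
pd_f[:, 1] += 5.0                     # feature 1 clearly separates groups
print(screen_features(pd_f, hc_f))
```

Only the features surviving this filter would then be passed to ILFS, LASSO, Relief-F, and UDFS for ranking.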
Here, w is the weight vector, b is a constant, C is a positive regularization parameter, and β_i is the slack variable. Applying the Lagrange multipliers α_i, for a vector x the decision function can be expressed as f(x) = sign(Σ_i α_i y_i (x_i · x) + b). For the nonlinear SVM, a nonlinear mapping function φ(x) is used to map the input features into a higher-dimensional feature space, making the samples more separable: f(x) = sign(Σ_j α_j y_j K(x_j, x) + b), where x_j are the support vectors and K(x_j, x) is the kernel function; for the polynomial kernel, K(x_j, x) = (x_j · x + 1)^d, and for the radial basis function (RBF) kernel, K(x_j, x) = exp(−γ‖x_j − x‖²). SVM details can be found in [47].
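The two kernels and the kernelized decision function can be written directly from the definitions above. This sketch evaluates a pre-trained model (the support vectors, multipliers α_j, labels y_j, and bias b are assumed given); it does not solve the SVM optimization itself.

```python
import numpy as np

def poly_kernel(xj, x, d=3):
    """Polynomial kernel K(x_j, x) = (x_j . x + 1)^d."""
    return (np.dot(xj, x) + 1.0) ** d

def rbf_kernel(xj, x, gamma=0.5):
    """RBF kernel K(x_j, x) = exp(-gamma * ||x_j - x||^2)."""
    diff = np.asarray(xj, float) - np.asarray(x, float)
    return np.exp(-gamma * np.dot(diff, diff))

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """f(x) = sign(sum_j alpha_j * y_j * K(x_j, x) + b)."""
    s = sum(a * y * kernel(xj, x)
            for a, y, xj in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1

# Kernel sanity checks against the closed-form definitions
print(poly_kernel([1.0, 0.0], [1.0, 0.0], d=2))   # (1*1 + 1)^2 = 4.0
print(rbf_kernel([0.0], [0.0]))                   # exp(0) = 1.0
```

With the linear kernel K(x_j, x) = x_j · x, the same decision function reduces to the linear SVM.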

In this study, SVMs with linear, polynomial, and RBF kernels were used.

We evaluated the model performance using the leave-one-out cross-validation (LOOCV) technique [48]. The LOOCV method uses N−1 subjects for model training and 1 for testing, repeated N times so that each subject is tested once. The final result is the mean of the individual evaluations. The details of model training and testing using LOOCV are illustrated in Figure 2. Accuracy, sensitivity, specificity, and F1-score were computed as performance metrics; the results are presented in Table 3. A representation of the top two features of each phoneme is shown in Fig. 5.
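The LOOCV protocol and the pooled metrics can be sketched as follows. To stay self-contained, this illustration uses a nearest-centroid classifier as a stand-in for the SVM used in the study; only the cross-validation loop and the metric definitions are the point here.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Predict the class whose training centroid is closest to x."""
    dists = {}
    for c in np.unique(y_train):
        dists[c] = np.linalg.norm(x - X_train[y_train == c].mean(axis=0))
    return min(dists, key=dists.get)

def loocv_metrics(X, y):
    """Leave-one-out CV: train on N-1 subjects, test the held-out one,
    then pool predictions into accuracy/sensitivity/specificity."""
    n = len(y)
    preds = np.empty(n, dtype=int)
    for i in range(n):
        mask = np.arange(n) != i
        preds[i] = nearest_centroid_predict(X[mask], y[mask], X[i])
    tp = np.sum((preds == 1) & (y == 1))
    tn = np.sum((preds == 0) & (y == 0))
    fp = np.sum((preds == 1) & (y == 0))
    fn = np.sum((preds == 0) & (y == 1))
    acc = (tp + tn) / n
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return float(acc), float(sens), float(spec)

# Two well-separated clusters -> perfect LOOCV scores
X = np.vstack([np.zeros((10, 2)), np.ones((10, 2)) * 5.0])
y = np.array([0] * 10 + [1] * 10)
print(loocv_metrics(X, y))            # (1.0, 1.0, 1.0)
```

Because every subject is held out exactly once, LOOCV gives a nearly unbiased performance estimate on small cohorts like the 72 participants here.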

A larger sample size is necessary for the training set to represent the modelled phenomena. However, with limited labelled data samples, which is often the case with medical data, the resultant model needs to be tested for robustness. Hence, the system performance as a function of the number of data points (participants) was evaluated and is presented in Fig. 6. The performance was obtained by increasing the number of participants from 8 to 50 in increments of 6. For this purpose, the training set was drawn from the complete dataset by stratified random sampling, which ensured that class balance was maintained in the training set. Each step was iterated ten times.

VOLUME 10, 2022

Parkinsonian speech is characterized by pauses, vocal tremors, breathy vocal quality, harsh voice quality, and dysfluency [4]. Speech disorders are related to several factors, such as the inability to perform habitual tasks, loss of fine control, weakness, tremor, and rigidity of the speech production muscles.
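The stratified subsampling used for the robustness analysis can be sketched as below. This is an illustration of the sampling step only: equal numbers of subjects are drawn from each class without replacement at every subset size, so class balance holds by construction. The 36/36 split mirrors the study's cohort.

```python
import numpy as np

def stratified_subset(y, n_total, rng):
    """Indices of a class-balanced random subset: n_total/2 subjects
    are drawn from each class without replacement."""
    idx = []
    for c in np.unique(y):
        pool = np.flatnonzero(y == c)
        idx.extend(rng.choice(pool, size=n_total // 2, replace=False))
    return np.array(sorted(idx))

rng = np.random.default_rng(42)
y = np.array([0] * 36 + [1] * 36)     # 36 PD + 36 HC as in the study
for n in range(8, 51, 6):             # 8, 14, ..., 50 participants
    sub = stratified_subset(y, n, rng)
    assert np.sum(y[sub] == 0) == np.sum(y[sub] == 1) == n // 2
print("class balance maintained at every subset size")
```

Repeating each subset size ten times with fresh draws, as in the study, averages out the sampling variability of small training sets.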

This study has investigated the use of the utterance of phonemes /a/, /o/, and /m/ for differentiating the voice of people with PD from HC. The classification results confirm that identifying the voice of HC from PD improves when the combination of phonemes /a/+/m/+/o/ is used. The results also indicate that, among the single phonemes, /o/ is more effective in differentiating the two groups than /a/ or /m/. The phoneme /a/ is produced while the tongue is pressed towards the jaw and the lips are wide open. Similarly, the production of the phoneme /m/ does not require the voice box muscles because the lips are closed and the air passes through the nasal cavity. On the other hand, the production of the phoneme /o/ requires more precise positioning of the tongue at a mid-height position and a small rounded position of the lips [50] than /a/ and /m/. Since the production of /a/ and /m/ does not require precise control of the tongue and lips, tremor or weakness in tongue or lip positioning should be more prominent in the production of /o/ than of /a/ and /m/. This supports our finding that PD and HC are better distinguished with /o/ than with /a/ and /m/. However, these are only logical deductions at this stage, and further research needs to be conducted to confirm them.

The corresponding results are shown in Table 5. Finally, the model was trained and tested without favoring hyperparameters tailored to a specific gender, so it is a gender-independent model.

A limitation of this study is that we did not consider factors such as accent, because all participants were from suburban Melbourne. There is also a need to test individuals multiple times to check the repeatability of the results, and to use multiple devices, whereas this study used only one phone. Another weakness of this study is that the people with PD were more than two years post-diagnosis and therefore not in the very early stage of the disease.

This study has investigated the use of sustained phonemes for the computerized diagnosis of PD, based on the utterance of three phonemes, /a/, /o/, and /m/, recorded using a handheld smartphone in real-world clinical conditions with an ambient signal-to-noise ratio of about 20 dB. A number of features showed significant differences between PD and HC. After feature selection from the three phonemes, /a/+/m/+/o/, the classifier differentiated between HC and PD with 100% accuracy. Two prominent differences between PD and HC based on the selected features are a decrease in voice energy and an increase in relative voice noise. The novelty of this study is the selection of acoustic features that are suitable for differentiating between PD and HC while using a handheld smartphone and that are not sensitive to clinical ambient noise. This study shows the potential of phoneme-based computerized diagnosis of PD performed remotely using a smartphone, with applications for assisting in the clinic or for telehealth.

The authors acknowledge the teams at Dandenong Neurology and RMIT University who collected the data and made it available online. Special thanks to Dr. Susmit Bhowmik and Dr. Sumaiya Kabir for their support and helpful discussions.