
Parkinson’s Disease Detection Using Smartphone Recorded Phonemes in Real World Conditions




Abstract:

Parkinson’s disease (PD) is a multi-symptom neurodegenerative disease. There are no biomarkers; the diagnosis and monitoring of the disease progression require clinical and functional symptom observation. Voice impairment is an early symptom of PD, and computerized analysis of voice has been proposed for early detection and monitoring of the disease. However, there is poor reproducibility of many studies, which is attributed to the experimental data having been collected under controlled conditions. To overcome the limitations of earlier works, this study has investigated three sustained phonemes: /a/, /o/, and /m/, which were recorded using an iOS-based smartphone from 72 participants (36 people with PD and 36 healthy) in a typical clinical setting. A number of signal features were obtained, statistically investigated, and ranked to identify the suitable feature sets. These were classified using machine learning models. The results show that a combination of phonemes /a/+/o/+/m/ was most suited to differentiate the voice of PD people from healthy control participants, with an average accuracy, sensitivity, and specificity of 100%, 100%, 100%, respectively, using leave-one-out validation. The findings of this study could assist in the clinical assessments and remote telehealth monitoring for people with parkinsonian dysarthria using smartphones.
Published in: IEEE Access ( Volume: 10)
Page(s): 97600 - 97609
Date of Publication: 12 September 2022
Electronic ISSN: 2169-3536

CC BY — This work is licensed under a Creative Commons Attribution 4.0 License: https://creativecommons.org/licenses/by/4.0/
SECTION I.

Introduction

Parkinson’s disease (PD) is the second most common neurodegenerative disorder [1], and its prevalence is expected to increase with the ageing population. There are no biomarkers to diagnose the disease; diagnosis requires observation of a complex set of symptoms in patients. Acoustic speech abnormalities have been reported even in early-stage PD patients, even when there is no perceptible dysarthria [2]. Several investigators have found impaired speech parameters in early-stage PD using objective acoustic measures [3], [4], and several studies have investigated the difference between the voice of PD and healthy control (HC) participants using different approaches [4], [5], [6], [7], [8], [9], [10], [11], [12].

Human speech requires fine motor control, cognitive abilities, auditory feedback, and muscle strength. Parkinsonian dysarthria can be characterized by reduced loudness, reduced speech prosody, imprecise articulation, a significantly narrower pitch range, longer pauses, vocal tremor, breathy vocal quality, harsh voice quality, and disfluency [4]. Differences in the voice parameters of sustained phonemes have been examined for detecting and monitoring PD [4], [13], [14]. A number of works have considered signal features previously used for speech studies such as speaker recognition [15], [16]. The investigation of sustained-phoneme and text-dependent speech modalities for PD screening is reported in [13]; however, such analysis has confounding factors such as language skills, vision, and hearing [17]. Tsanas et al. [18] extended this work to associate these features with the motor disability score of PD patients.

The use of non-linear and hybrid features such as the fractal dimension (FD), entropy [19], deep multivariate features [20], and linear predictive models [21], [22] has been proposed. Godino-Llorente et al. [12] proposed an articulatory biomarker based on the kinetic envelope trace of voice that achieved an accuracy of 85%. In [6], 132 features were extracted from phonemes recorded in a sound-treated booth with a head-mounted microphone to train support vector machine (SVM) and random forest classifiers, which achieved accuracies of 97.7% and 90.2%, respectively, in identifying PD from HC.

Signal features have often been selected based on the understanding of the disease [23], [24]. The difference between the voice of healthy people and those with PD has been observed in their pitch frequency, jitter, shimmer, and harmonics-to-noise ratio [25]. The pitch frequency, or fundamental frequency of the vocal cords, f_0, is the number of glottal vibration cycles per second. Jitter, the perturbation of the glottal vibration period, is influenced by the motor control, rigidity, and tremor of the larynx. Shimmer, the amplitude perturbation, is related to the glottal resistance and increases with a lack of fine muscle control. The harmonics-to-noise ratio (HNR) and noise-to-harmonics ratio (NHR) indicate the relative harmonic strength: they are ratios between the periodic (voiced) and non-periodic (noise) components of speech. These ratios fall with diminished glottal vibration, and a low HNR is an indicator of dysarthria. However, some of these parameters may also be affected by other factors such as age, gender, and ethnicity.

The above studies have shown that several signal features differ significantly between the voice of PD and HC. However, most studies have not considered real-world conditions, where there is background noise and there are differences between recording devices and conditions [26], [27], [28]. Only a few studies have used data recorded in a real-life clinical setup [19], [29], [30]. Therefore, further work is required to validate such methods for real-life scenarios, especially for remote monitoring of patients and other telehealth applications.

The aim of this study was to identify the most suitable signal classification method to differentiate between PD and HC when the recordings are made in real-world conditions. We investigated the phonatory parameters of three sustained phonemes and compared people with PD against HC. The data were recorded with smartphones in a typical clinical setting to check real-world suitability [31], [32]. Besides the statistical analysis, an SVM classifier was used to classify the voice into two classes: PD and HC. The proposed model provides the following advantages over the existing alternatives:

  1. Data were recorded in a normal clinical setting and with background noise conditions.

  2. The recordings were made using a commercially available smartphone with default settings.

  3. Only three phonemes were recorded, making the method independent of language skills.

  4. The performance was perfect, with 100% sensitivity and specificity, outperforming the state-of-the-art methods.

SECTION II.

Materials and Methods

A. Participants

Seventy-two volunteers, comprising 36 people with PD and 36 age-matched healthy participants as the HC group, took part in this study. The data can be found in our previously reported work [30]. All the people with PD had been diagnosed within the last ten years based on procedures complying with the Queen Square Brain Bank criteria for idiopathic PD [33]. The presence of any advanced PD clinical symptoms such as visual hallucinations, frequent falling, cognitive disability, or need for institutional care was an exclusion criterion [34]. People with PD were recruited from the movement disorder clinic at Monash Medical Centre and the Dandenong Neurology clinic, while the HC participants were recruited from several retirement centers. Table 1 presents the participants’ demographics, cognitive stage, and health history. The UPDRS-III scores [35] of all the participants show a clear difference between the groups, while the MoCA scores confirm that neither PD nor HC participants had cognitive impairment.

TABLE 1 Participants’ Demographics and Clinical Characteristics

The study protocol was approved by the ethics committee of Monash Health, Melbourne, Australia (LNR/16/MonH/319) and RMIT University Human Research Ethics Committee, Melbourne, Australia (BSEHAPP22-15KUMAR). Before the experiments, written consent was obtained from all the participants.

B. Methods

Figure 1 illustrates the block diagram of the proposed method for classifying PD from HC. As shown in Figure 1, three phonemes were recorded from PD and HC participants using a smartphone. Each phoneme was segmented before features were extracted from it, and machine learning based classification was applied to identify PD from HC. Each stage is described below.

FIGURE 1.

The block diagram of identifying PD from HC using sustained phonemes. The model is trained and tested using 72 PD and HC participants.

1) Voice Recording

Three sustained phonemes, /a/, /o/, and /m/, were recorded from each participant. The phonemes were selected to examine a range of voice production models [36]. The vowel /a/, as in “car”, is an open-back or low vowel, produced with the jaw wide open and the tongue inactive and low in the mouth; here, the vibration of the vocal folds dominates the sound of the vowel. The vowel /o/, as in “oh”, is a close-mid back vowel: the back of the tongue is positioned mid-high towards the palate, and the lips are rounded. The phoneme /m/ is a nasal phoneme produced by the vibration of the vocal folds with the air flowing through the nasal cavity. Although all three phonemes require control of the respiratory and laryngeal vocal fold muscles, there are considerable differences in the patterns of activation of the rostral muscles of articulation (of the pharynx, tongue, jaw, and lips).

The participants were asked to utter the phonemes for as long as was comfortable, at their natural pitch and loudness. During the recording, they held the smartphone as if they were making a phone call. The voice of the 72 participants (36 PD and 36 HC) was recorded using an iOS-based smartphone (iPhone 6S Plus) with its built-in microphone and default settings, while the participants were located in typical Australian clinic or office settings. The recordings were saved in single-channel uncompressed WAV format with a sampling frequency (f_s) of 48.1 kHz and 16-bit resolution. Each file contained one sustained phoneme of varied duration, as shown in Table 2. Between recordings there was a minimum of 15 seconds of rest.

TABLE 2 Duration of the Recordings

2) Automated Segmentation and Feature Extraction

All computations, including pre-processing, automated segmentation, and statistical analysis, were performed using Matlab 2018b (MathWorks) and Python. All the recorded phonemes were segmented using an envelope detection and thresholding approach, and the signal features were computed from each segment. Recordings containing the voice of the instructor were removed. In the original recordings, the signal-to-noise ratio was 16–24 dB (average 19.26 dB), typical of Australian clinical conditions. The first step of feature extraction was to locate the time instances (t_i) and amplitudes (A_i) of the pulses in the recording representing the glottal vibration. The instantaneous period of the glottal wave (T_i) was calculated as the difference between subsequent pulse instances, T_i = t_{i+1} - t_i.
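A minimal sketch of such an envelope-and-threshold segmenter is shown below, assuming a mono recording in a NumPy array; the smoothing window and threshold fraction are illustrative choices, not values reported in the paper.

```python
import numpy as np
from scipy.signal import hilbert

def segment_phoneme(x, fs, win_s=0.05, thresh_frac=0.1):
    """Return (start, end) sample indices of voiced regions."""
    env = np.abs(hilbert(x))                                  # amplitude envelope
    win = int(win_s * fs)
    env = np.convolve(env, np.ones(win) / win, mode="same")   # smooth the envelope
    mask = env > thresh_frac * env.max()                      # threshold it
    edges = np.flatnonzero(np.diff(mask.astype(int))) + 1     # state transitions
    if mask[0]:
        edges = np.r_[0, edges]
    if mask[-1]:
        edges = np.r_[edges, len(x)]
    return list(zip(edges[0::2], edges[1::2]))                # (onset, offset) pairs
```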

The first set of features comprised six jitter parameters: jitter absolute (jitter abs), jitter relative (jitter rel), period perturbation quotient-3 (jitter ppq3), period perturbation quotient-5 (jitter ppq5), period perturbation quotient-11 (jitter ppq11), and frequency modulation (jitter FM). Here, ppq3, ppq5, and ppq11 are the perturbations of the difference between T_i and the moving average of T_i with a window size of 3, 5, and 11, respectively. The equations for the jitter parameters [32] are given in (1) to (6):

\begin{align*} Jitter\,(abs)&=\frac{1}{N-1}\sum_{i=1}^{N-1}\left|T_{i+1}-T_{i}\right| \tag{1}\\ Jitter\,(rel)&=\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|T_{i+1}-T_{i}\right|}{\frac{1}{N}\sum_{i=1}^{N}T_{i}} \tag{2}\\ Jitter\,(ppq3)&=\frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|T_{i}-\frac{1}{3}\sum_{n=i-1}^{i+1}T_{n}\right|}{\frac{1}{N}\sum_{i=1}^{N}T_{i}} \tag{3}\\ Jitter\,(ppq5)&=\frac{\frac{1}{N-4}\sum_{i=3}^{N-2}\left|T_{i}-\frac{1}{5}\sum_{n=i-2}^{i+2}T_{n}\right|}{\frac{1}{N}\sum_{i=1}^{N}T_{i}} \tag{4}\\ Jitter\,(ppq11)&=\frac{\frac{1}{N-10}\sum_{i=6}^{N-5}\left|T_{i}-\frac{1}{11}\sum_{n=i-5}^{i+5}T_{n}\right|}{\frac{1}{N}\sum_{i=1}^{N}T_{i}} \tag{5}\\ Jitter\,(FM)&=\frac{\max_{i}T_{i}-\min_{i}T_{i}}{\max_{i}T_{i}+\min_{i}T_{i}} \tag{6} \end{align*}

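As a sketch, equations (1) to (6) translate directly into a few lines of Python, assuming the periods T_i have already been extracted from the pulse instants; the shimmer parameters below follow the same pattern with A_i in place of T_i.

```python
import numpy as np

def jitter_features(T):
    """Jitter parameters from a sequence of glottal periods T_i (Eqs. 1-6)."""
    T = np.asarray(T, dtype=float)
    N, mean_T = len(T), T.mean()
    absdif = np.abs(np.diff(T))
    feats = {
        "jitter_abs": absdif.mean(),                              # Eq. (1)
        "jitter_rel": absdif.mean() / mean_T,                     # Eq. (2)
        "jitter_fm": (T.max() - T.min()) / (T.max() + T.min()),   # Eq. (6)
    }
    for k in (3, 5, 11):                                          # Eqs. (3)-(5)
        h = k // 2
        dev = [abs(T[i] - T[i - h:i + h + 1].mean()) for i in range(h, N - h)]
        feats[f"jitter_ppq{k}"] = np.mean(dev) / mean_T
    return feats
```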

Six shimmer parameters were extracted from the segments: the absolute shimmer (shimmer abs, in dB), the relative shimmer (shimmer rel), amplitude perturbation quotient-3 (apq3), amplitude perturbation quotient-5 (apq5), amplitude perturbation quotient-11 (apq11), and amplitude modulation (shimmer AM). Here, apq3, apq5, and apq11 represent the perturbations of the difference between A_i and the moving average of A_i with a window size of 3, 5, and 11, respectively. The shimmer parameters are computed as described in (7) to (12):

\begin{align*} Shimmer\,(abs,dB)&=\frac{1}{N-1}\sum_{i=1}^{N-1}\left|20\log_{10}\left(\frac{A_{i+1}}{A_{i}}\right)\right| \tag{7}\\ Shimmer\,(rel)&=\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\left|A_{i+1}-A_{i}\right|}{\frac{1}{N}\sum_{i=1}^{N}A_{i}} \tag{8}\\ Shimmer\,(apq3)&=\frac{\frac{1}{N-2}\sum_{i=2}^{N-1}\left|A_{i}-\frac{1}{3}\sum_{n=i-1}^{i+1}A_{n}\right|}{\frac{1}{N}\sum_{i=1}^{N}A_{i}} \tag{9}\\ Shimmer\,(apq5)&=\frac{\frac{1}{N-4}\sum_{i=3}^{N-2}\left|A_{i}-\frac{1}{5}\sum_{n=i-2}^{i+2}A_{n}\right|}{\frac{1}{N}\sum_{i=1}^{N}A_{i}} \tag{10}\\ Shimmer\,(apq11)&=\frac{\frac{1}{N-10}\sum_{i=6}^{N-5}\left|A_{i}-\frac{1}{11}\sum_{n=i-5}^{i+5}A_{n}\right|}{\frac{1}{N}\sum_{i=1}^{N}A_{i}} \tag{11}\\ Shimmer\,(AM)&=\frac{\max_{i}A_{i}-\min_{i}A_{i}}{\max_{i}A_{i}+\min_{i}A_{i}} \tag{12} \end{align*}


The Teager-Kaiser energy operator (TKEO) measures the energy of a time-varying signal. It detects the amplitude and frequency modulation of a signal by estimating the product of the time-varying amplitude and frequency. The mean, standard deviation, and percentile values of the TKEO of the period contour T_0 and the amplitude contour A_0 were computed.
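A minimal TKEO sketch: the discrete operator is psi[n] = x[n]^2 - x[n-1]*x[n+1]. The particular percentiles summarized here (5th and 95th) are illustrative assumptions, as the paper does not state which were used.

```python
import numpy as np

def tkeo(x):
    """Discrete Teager-Kaiser energy operator."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def tkeo_stats(contour):
    """Summary statistics of the TKEO of a period or amplitude contour."""
    e = tkeo(contour)
    return {"mean": e.mean(), "std": e.std(),
            "p5": np.percentile(e, 5), "p95": np.percentile(e, 95)}
```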

HNR and NHR quantify the noise in the speech signal, which is due to incomplete closure of the vocal folds. The standard deviation of pitch was computed from the instantaneous pitch frequency f_0 = 1/T_0. HNR and NHR were calculated from the normalized autocorrelation function of the segment: with R_xx[T_0] denoting the autocorrelation peak at the lag corresponding to T_0, they are given by (13) and (14) [37], [38]:

\begin{align*} HNR&=10\log_{10}\frac{R_{xx}[T_{0}]}{1-R_{xx}[T_{0}]} \tag{13}\\ NHR&=1-R_{xx}[T_{0}] \tag{14} \end{align*}

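A sketch of (13) and (14) in Python; the pitch-lag search range of 60–400 Hz is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def hnr_nhr(segment, fs, fmin=60.0, fmax=400.0):
    """HNR (dB) and NHR from the normalized autocorrelation (Eqs. 13-14)."""
    x = segment - segment.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    r = r / r[0]                                       # normalize so r[0] = 1
    lo, hi = int(fs / fmax), int(fs / fmin)            # plausible pitch lags
    r_t0 = r[lo:hi].max()                              # peak at the pitch period
    hnr = 10 * np.log10(r_t0 / (1 - r_t0))             # Eq. (13)
    nhr = 1 - r_t0                                     # Eq. (14)
    return hnr, nhr
```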

The glottal quotient (GQ) measures the time required to open or close the glottis. The mean and standard deviation of the time when the vocal folds were apart (glottis open) or in collision (glottis closed) were also computed. The voice analysis toolbox [7], [8], [39], which uses the DYPSA algorithm [40], was used to compute GQ.

The glottal-to-noise excitation ratio (GNE) measures the noise in the signal; the turbulent noise created by incomplete closure of the vocal folds is captured by the GNE features [41]. GNE was computed using the following steps proposed by Michaelis et al. [42] (a code sketch follows the list):

  • Downsampling the phoneme recordings to 10 kHz and inverse filtering to detect each glottal cycle.

  • Computing the Hilbert envelope of each of a set of frequency bands.

  • Taking the maximum of the cross-correlations between pairs of envelopes whose center frequencies differ by more than half the bandwidth.
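A condensed sketch of this procedure, assuming 1000 Hz bands spaced 300 Hz apart (common choices in GNE implementations, not values stated in the paper); the inverse-filtering step is omitted and a zero-lag normalized correlation stands in for the full cross-correlation maximum.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample_poly

def gne(x, fs, bw=1000.0, step=300.0):
    """Simplified glottal-to-noise excitation ratio."""
    x10 = resample_poly(x, 10000, int(fs))                    # step 1: 10 kHz
    centers = np.arange(600.0, 4400.0, step)                  # band centers (Hz)
    envs = []
    for fc in centers:                                        # step 2: envelopes
        b, a = butter(3, [(fc - bw / 2) / 5000, (fc + bw / 2) / 5000],
                      btype="bandpass")
        envs.append(np.abs(hilbert(filtfilt(b, a, x10))))
    best = 0.0
    for i in range(len(envs)):                                # step 3: correlate
        for j in range(i + 1, len(envs)):
            if abs(centers[i] - centers[j]) > bw / 2:
                e1 = envs[i] - envs[i].mean()
                e2 = envs[j] - envs[j].mean()
                c = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
                best = max(best, c)
    return best
```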

The vocal fold excitation ratio (VFER) is a measure used to detect dysphonia. A proper glottal cycle results in synchronous excitation across different frequency bands; when the cycle is impaired, turbulence produces asynchronous, uncorrelated excitation across bands and thus a reduced VFER.

The features described above mainly characterize the vocal fold dynamics affected in PD. Since PD also affects the coordination of the articulators of the vocal tract, such as the tongue, jaw, and lips [43], we incorporated features that characterize the vocal tract, namely the mel-frequency cepstral coefficients (MFCCs).

The MFCCs measure the energy of the speech signal in each frequency band (equation (15)); it is therefore hypothesized that the MFCCs will differ between PD and HC.

\begin{equation*} MFCC_{n}=\sum_{k=1}^{K}E_{k}\cos\left[n(k-0.5)\frac{\pi}{K}\right] \tag{15} \end{equation*}

where n = 0, …, L, L is the number of MFCCs, and E_k is the mean energy of the kth frequency band. In addition to the MFCCs, features from the first and second time derivatives of the MFCCs, known as the delta and delta-delta coefficients respectively, were computed; these have been used for voice quality assessment [39], [44]. We computed 22 MFCC features.
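As a sketch of this feature block using librosa (an assumed implementation; the paper does not name its MFCC library, and the filename is hypothetical):

```python
import numpy as np
import librosa

y, sr = librosa.load("phoneme_a.wav", sr=None, mono=True)  # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=22)         # 22 coefficients
delta = librosa.feature.delta(mfcc)                        # first derivative
delta2 = librosa.feature.delta(mfcc, order=2)              # second derivative
# Summarize each coefficient track over time, e.g. by mean and standard deviation.
features = np.concatenate([m.mean(axis=1) for m in (mfcc, delta, delta2)] +
                          [m.std(axis=1) for m in (mfcc, delta, delta2)])
```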

Spectral analysis reveals the oscillatory content of a signal but does not carry temporal information. The wavelet transform (WT) is a technique based on time-limited waves, referred to as wavelets, that performs multi-resolution time-frequency analysis. It converts a one-dimensional time-domain signal to a two-dimensional time-frequency representation without losing the temporal information. The discrete WT (DWT) decomposes the signal into approximation and detail coefficients over successive frequency bands, with each scale halving the frequency range. In this study, the recordings were decomposed at level 10, which covers the entire audible range of the recordings, using the Daubechies 10 (db10) mother wavelet. Energy, entropy, and TKEO features were computed from each DWT approximation and detail coefficient band.
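A sketch of this decomposition with PyWavelets; the Shannon entropy estimate over normalized squared coefficients is one common choice and is an assumption on our part.

```python
import numpy as np
import pywt

def dwt_features(x):
    """Energy, entropy, and mean TKEO per band of a 10-level db10 DWT."""
    coeffs = pywt.wavedec(x, "db10", level=10)     # [cA10, cD10, ..., cD1]
    feats = []
    for c in coeffs:
        energy = np.sum(c ** 2)
        p = c ** 2 / (energy + 1e-12)              # normalized energy distribution
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy (assumed form)
        tk = c[1:-1] ** 2 - c[:-2] * c[2:]         # TKEO of the band coefficients
        feats.extend([energy, entropy, tk.mean()])
    return np.array(feats)
```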

C. Feature Selection

A large number of features increases the risk of overfitting, can lead to higher error, and increases the computational complexity [45], [46]; the exclusion of redundant features is therefore necessary [46]. The first step of feature selection was to identify the features that were statistically different (p < 0.0001) between the two groups using the Mann-Whitney U test. Next, feature selection algorithms were applied to identify the best features. To remove algorithm bias, four feature selection algorithms were compared: i) infinite latent feature selection (ILFS), ii) least absolute shrinkage and selection operator (LASSO), iii) Relief-F, and iv) unsupervised discriminative feature selection (UDFS).
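A sketch of the statistical pre-filter, assuming a feature matrix X of shape (subjects, features) and labels y (PD = 1, HC = 0):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def prefilter(X, y, alpha=1e-4):
    """Indices of features that differ between groups at p < alpha."""
    keep = []
    for j in range(X.shape[1]):
        _, p = mannwhitneyu(X[y == 1, j], X[y == 0, j],
                            alternative="two-sided")
        if p < alpha:
            keep.append(j)
    return np.array(keep)
```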

D. Model Training and Classification

A support vector machine (SVM) based machine learning classifier was deployed to assign the selected features to one of two classes: PD or HC. The details of the SVM classifier and the cross-validation are described below.

1) Support Vector Machine

Support vector machine (SVM) is a widely used supervised machine learning technique for classification. The decision boundaries or hyperplanes are developed based on the support vectors during training.

Let vector x denote the feature vector to be classified and y ∈ {+1, −1} its label. For a given training set {(x_i, y_i), i = 1, 2, …, n}, the separating hyperplane is obtained by maximizing the margin, i.e., by minimizing

\begin{equation*} J(w,\beta)=\frac{1}{2}w^{T}w+C\sum_{i}\beta_{i} \end{equation*}

subject to the constraint

\begin{equation*} y_{i}(w^{T}x_{i}+b)\geq 1-\beta_{i}, \quad \beta_{i}\geq 0 \end{equation*}

Here, w is the weight vector, b is a constant, C is a positive regularization parameter, and the β_i are the slack variables. Applying the Lagrange multipliers α_i, the solution and the decision function for a vector x can be expressed as

\begin{align*} w&=\sum_{i}\alpha_{i}y_{i}x_{i} \\ f(x)&=\sum_{i}\alpha_{i}y_{i}x_{i}^{T}x+b \end{align*}

For the nonlinear SVM, a nonlinear mapping function φ(x) is used to map the input features into a higher-dimensional feature space, making the samples more separable:

\begin{equation*} f(x)=\sum_{j}\alpha_{j}y_{j}K(x_{j},x)+b \end{equation*}

where the x_j are the support vectors and K(x_j, x) is the kernel function: K(x_j, x) = (x_j · x + 1)^d for the polynomial kernel and K(x_j, x) = exp(−γ‖x_j − x‖²) for the radial basis function (RBF) kernel. Further SVM details can be found in [47]. In this study, SVMs with linear, polynomial, and RBF kernels were used.
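A sketch of the three kernel configurations with scikit-learn (an assumed implementation; C, degree, and gamma values are illustrative, not tuned values from the paper):

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features, then fit an SVM with each kernel in turn.
svm_linear = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm_poly = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0))
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
```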

2) Cross Validation

We evaluated the model performance using the leave-one-out cross-validation (LOOCV) technique [48]. LOOCV uses N−1 subjects for model training and 1 for testing, repeated N times so that each subject is tested once. The final result is the mean of the individual evaluations. The model training and testing using LOOCV are illustrated in Figure 2. Accuracy, sensitivity, specificity, and F1-score were computed as performance metrics.
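A sketch of this evaluation with scikit-learn, reusing the svm_rbf pipeline above and assuming the feature matrix X and labels y (PD = 1, HC = 0) from the feature selection step:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import accuracy_score, recall_score, f1_score

# One held-out prediction per subject.
y_pred = cross_val_predict(svm_rbf, X, y, cv=LeaveOneOut())
accuracy = accuracy_score(y, y_pred)
sensitivity = recall_score(y, y_pred, pos_label=1)   # true positive rate (PD)
specificity = recall_score(y, y_pred, pos_label=0)   # true negative rate (HC)
f1 = f1_score(y, y_pred, pos_label=1)
```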

FIGURE 2.

Performance evaluation of the proposed model. The dataset consists of 72 PD and HC participants. The model performance is evaluated using the leave one out cross validation.

SECTION III.

Results

A. Statistical Analysis and PD Classification

The Anderson-Darling test confirmed that the voice parameters of the three sustained phonemes were not normally distributed for either group and were thus unsuitable for parametric tests. The group differences and the significance of each feature for PD vs. HC were therefore computed using the Mann-Whitney U test [49]. Features with p ≤ 0.0001 were input to the Relief-F feature selection algorithm to sort the most significant features from the pool of feature sets. The model performance as the number of sorted features of each phoneme varies, using the SVM classifier, is shown in Fig. 3. For all phonemes, the accuracy of the model increases with the number of features up to 15, remains almost unchanged between 15 and 40 features, and decays when features beyond 40 are included. Since non-significant features have very low separating capability, the inclusion of a large number of them can mislead the classifier and decrease the classification accuracy; as observed in Fig. 3, including more than 40 features reduces the model performance.

FIGURE 3.

The classification accuracy using Relief-F based feature selection techniques for phoneme /a/, /m/, /o/, and /a/+/m/+/o/ respectively. These results were computed using leave one subject out cross validation techniques.

The accuracy of the proposed model with the top 15 sorted features extracted from the individual phonemes /a/, /m/, and /o/ using the SVM with RBF kernel was 97.22%, 95.83%, and 98.66%, respectively. Based on the combined features extracted from pairs of phonemes, the PD classification accuracy was 97.22%, 98.66%, and 100% for /a/+/m/, /m/+/o/, and /a/+/o/, respectively. The accuracy of the proposed model reached 100% when the features obtained from the three phonemes /a/+/m/+/o/ were combined. The detailed performance of the proposed model for the different combinations of phonemes is shown in Table 3. Features extracted from phoneme /o/ identified PD from HC with higher accuracy than the other phonemes, and including features from phonemes /a/ and /m/ improved the performance further; the model performed best when features from all three phonemes were combined. The confusion matrix is shown in Fig. 4; it summarises the predicted and actual classes, providing an accurate assessment of performance through the true positives, true negatives, false positives, and false negatives.

TABLE 3 The Performance of the Model is Assessed on Both Individual and Combination of Phonemes
FIGURE 4.

Confusion matrix for PD vs. HC classification. The confusion matrix for individual and combination of phonemes are shown in the top and bottom of the figure respectively.

B. Computing the Effect Size and Spearman Correlation of Each Significant Feature

The statistically significant features of each phoneme were sorted and ranked by the ReliefF-based feature selection technique. The effect size computed by Cohen’s d and the Spearman correlation coefficient of the selected features of each phoneme are shown in Table 4. Each feature was assessed for statistical significance with the Mann-Whitney U test, and the corresponding p-value is listed in Table 4. A two-dimensional representation of the top two features of each phoneme is shown in Fig. 5.

TABLE 4 Effect Size, Spearman Correlation, and p-Value of the Top Five Features From Each Phoneme /a/, /m/, and /o/ Using the ReliefF-Based Feature Selection Algorithm
FIGURE 5.

Selected pair of smartphone-recorded phonemes features plotted in two-dimensional space with optimal decision boundary (black line) between PD and HC for phoneme /a/ (left), /m/ (middle), and /o/ (right).

C. Robustness of the Model

A large training sample is necessary for a model to represent the phenomena being modelled. With limited labelled data samples, which is often the case with medical data, the resulting model needs to be tested for robustness. Hence, the system performance was measured as a function of the number of training participants, which was increased from 8 to 50 in increments of 6; the results are presented in Fig. 6. For each size, the complete dataset was subdivided into a training set, drawn by stratified random sampling so that class balance was maintained, and a test set containing the remaining participants. Each step was iterated ten times and the results were averaged. The figure shows that accuracy improved with an increasing number of training subjects and plateaued at 14 subjects, with classification accuracy reaching above 95.00%.
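A sketch of this robustness experiment, again assuming X, y, and the svm_rbf pipeline from the earlier sketches:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

for n in range(8, 51, 6):                      # training-set sizes
    accs = []
    for seed in range(10):                     # ten stratified draws per size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=n, stratify=y, random_state=seed)
        accs.append(accuracy_score(y_te, svm_rbf.fit(X_tr, y_tr).predict(X_te)))
    print(n, np.mean(accs))                    # average accuracy for this size
```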

FIGURE 6.

Evaluation of model performance with different numbers of training subjects. The boxplot shows the distribution of the model's accuracy as the number of training subjects varies from 8 to 50. Each box represents the first quartile, median, and third quartile of the accuracy over ten iterations of randomly drawing that number of subjects into the training pool. The average accuracy of the ten iterations is shown as a circle in each box.

SECTION IV.

Discussion

People with PD often have dysarthria, or speech impairment, which may appear in phonatory, articulatory, prosodic, and linguistic aspects. The change is complex and characterized by reduced loudness, reduced speech prosody, imprecise articulation, a significantly narrower pitch range, longer pauses, vocal tremor, breathy vocal quality, harsh voice quality, and disfluency [4]. Speech disorders are related to several factors, such as the inability to perform habitual tasks, loss of fine control, weakness, tremor, and rigidity of the speech production muscles.

This study has investigated the use of the utterance of the phonemes /a/, /o/, and /m/ for differentiating the voice of people with PD from HC. The classification results confirm that identifying the voice of HC from PD improves when the combination of phonemes /a/+/m/+/o/ is used. The results also indicate that among the single phonemes, /o/ is more effective in differentiating the two groups than /a/ and /m/. The phoneme /a/ is produced with the jaw wide open and the tongue low and inactive in the mouth. Similarly, the production of the phoneme /m/ does not require precise positioning of the tongue or lips, because the lips are closed and the air passes through the nasal cavity. The production of the phoneme /o/, on the other hand, requires more precise positioning of the tongue at mid-height and a small, rounded position of the lips [50] than /a/ and /m/. Since the production of /a/ and /m/ does not require precise control of the tongue and lips, tremor or weakness in tongue or lip positioning should be more prominent in the production of /o/. This supports our finding that PD and HC are better distinguished with /o/ than with /a/ and /m/. However, these are only logical deductions at this stage, and further research needs to be conducted to confirm them.

It was also found that the MFCCs and the features from the first and second derivatives of the MFCCs of phonemes /a/, /m/, and /o/ were significantly different between PD and HC. Cepstral analysis identifies changes to the source and vocal tract characteristics, and this observation confirms that parkinsonian dysarthria is associated with such changes. The average log energy of phoneme /a/ was also found to be significantly different, which indicates the reduced source strength of PD.

The significant difference between PD and HC in the HNR and GNE of phoneme /o/ indicates weakened vocal cords, due to which the noise relative to the resonatory sound is higher in the voice of PD. The classification results show that the inclusion of these features improves the model performance; the classification accuracy was 100% when these features from the three phonemes /a/, /m/, and /o/ were used. PD is a multi-symptom disease with a complex display of symptoms, and while the analysis of each phoneme captures some of the symptoms, it is the combination of all three that appears to capture them all. The study also investigated the effect of the sampling frequency on differentiating between PD and HC: for sampling frequencies f_s = 48.1 kHz and 8 kHz, the model gave identical results, indicating that the relevant frequency range of interest lies below 4 kHz.

Further, this work explored the performance of the four feature selection algorithms for phoneme-based PD classification. Although ReliefF and ILFS performed slightly better than LASSO and UDFS, similar performance was observed for larger numbers of features. It was also observed that the top twenty features selected by any of the four feature selection algorithms gave above 95% classification accuracy.

The performance comparison of our approach with the existing state-of-the-art techniques in the literature is summarized in Table 5. As shown in the table, the performance of models for phonemes recorded in a noise-free, soundproof environment with a microphone varies from 89.5% to 97.7%, whereas it varies from 81% to 93.1% for phonemes recorded in a normal clinical setting. While the ambient noise reduced the performance of the models in the literature by 5.6% to 8.4%, our proposed model was less susceptible to ambient noise and identified PD from HC with 100% accuracy.

TABLE 5 The Comparison of the Proposed Model With the Existing Studies in the Literature for the Two-Class (PD vs. HC) Classification Problem

There are four major achievements of this study. Firstly, people with PD and healthy age-matched controls were found to differ most significantly in the production of the phoneme /o/, which is differentiable even with background noise and recorded using a handheld smartphone. The statistical analysis and classification results confirm that the voice features of phoneme /o/ discriminate people with PD from HC participants more accurately than /a/ and /m/, but the combination of phonemes /a/, /m/, and /o/ is the most accurate. Secondly, it has been shown that computerized assessment of the voice of people with PD is suitable for real-world, regular clinical settings with background noise, using a smartphone at a low sampling rate. Thirdly, the model requires only phonemes and is thus language independent. Finally, the model was trained and tested without favoring hyperparameters tailored to a specific gender, so it is gender independent.

A limitation of this study is that factors such as accent were not considered, because all participants were from suburban Melbourne. There is also a need to test each individual multiple times to check the repeatability of the results and to use multiple devices, whereas this study used a single phone. Another weakness is that the people with PD were more than two years post-diagnosis and not in the very early stage of the disease.

SECTION V.

Conclusion

This study has investigated the use of sustained phonemes for computerized diagnosis of PD based on the utterance of the three phonemes /a/, /o/, and /m/, recorded using a handheld smartphone in real-world clinical conditions with a signal-to-noise ratio of about 20 dB. A number of features showed significant differences between PD and HC. After feature selection from the three phonemes, /a/+/m/+/o/, the classifier differentiated between HC and PD with 100% accuracy. Two prominent differences between PD and HC based on the selected features are a decrease in voice energy and an increase in relative voice noise. The novelty of this study is the selection of acoustic features that are suitable for differentiating between PD and HC using a handheld smartphone and that are not sensitive to clinical ambient noise. This study shows the potential of phoneme-based computerized diagnosis of PD that can be performed remotely using a smartphone, with applications in the clinic and in telehealth.

ACKNOWLEDGMENT

The authors acknowledge the team at Dandenong Neurology and RMIT University who collected the data and made it available online. Special thanks to Dr. Susmit Bhowmik and Dr. Sumaiya Kabir for their support and helpful discussions.
