Disease Delineation for Multiple Sclerosis, Friedreich Ataxia, and Healthy Controls Using Supervised Machine Learning on Speech Acoustics

Neurodegenerative disease often affects speech. Speech acoustics can be used as objective clinical markers of pathology. Previous investigations of pathological speech have primarily compared controls with one specific condition and excluded comorbidities. We broaden the utility of speech markers by examining how multiple acoustic features can delineate diseases. We used supervised machine learning with gradient boosting (CatBoost) to delineate healthy speech from speech of people with multiple sclerosis or Friedreich ataxia. Participants performed a diadochokinetic task where they repeated alternating syllables. We subjected 74 spectral and temporal prosodic features from the speech recordings to machine learning. Results showed that Friedreich ataxia, multiple sclerosis and healthy controls were all identified with high accuracy (over 82%). Twenty-one acoustic features were strong markers of neurodegenerative diseases, falling under the categories of spectral qualia, spectral power, and speech rate. We demonstrated that speech markers can delineate neurodegenerative diseases and distinguish healthy speech from pathological speech with high accuracy. Findings emphasize the importance of examining speech outcomes when assessing indicators of neurodegenerative disease. We propose large-scale initiatives to broaden the scope for differentiating other neurological diseases and affective disorders.

I. INTRODUCTION
NEURODEGENERATIVE disease can alter speech due to impaired motor control and execution. Acoustic features of speech can be used as objective clinical markers for diseases of the central nervous system (CNS). Previous studies examining acoustic changes in neurodegenerative disease have primarily focused on differences between healthy controls and various patient populations, such as multiple sclerosis (MS) [1], [2], [3], [4], [5], Huntington's disease (HD) [6], [7], [8], Parkinson's disease (PD) [4], [5], [9], [10], and Friedreich ataxia (FA) [9], [11], [12]. These studies report that various acoustic features change as the disease progresses, and patients tend to exhibit slower and more variable speech rates, lower and more variable pitch, and reduced spectral clarity compared to healthy controls. Although machine learning has been used to differentiate healthy controls from single, well-defined patient populations (e.g., Parkinson's disease, spasmodic dysphonia), real-world machine learning implementations will encounter multiple different diseases that may have overlapping acoustic profiles. The present study aims to broaden the utility of these speech markers by determining how different acoustic profiles of speech may accurately identify specific neurodegenerative diseases. We move beyond discriminating between healthy and pathological speech by examining differences between neurodegenerative diseases with similar speech phenotypes (ataxia) simultaneously across multiple acoustic dimensions.
Clinicians use a combination of tools to diagnose neurodegenerative disease, including genetic sequencing, neurological scans (e.g., magnetic resonance imaging), and neuropsychological and motor tests. Comorbidities across modalities make differential diagnosis or genetic test selection (where possible) challenging and can increase the length of time to correct diagnosis (e.g., [13], [14]). Clinical acoustic markers have several advantages over traditional tools. First, speech can be recorded remotely in a home environment without the need to visit a specialist or hospital. Given that some populations with neurodegenerative disease are considered at-risk, remote identification reduces the risk of contracting potentially life-threatening pathogens. Remote testing is also more accessible for populations with limited mobility and those living in rural areas. Second, acoustic markers are obtained using noninvasive techniques. Invasive procedures (e.g., blood tests, surgery) can cause discomfort and have a risk of infection. These procedures can also impose large financial burdens, particularly when multiple tests are required due to misdiagnosis. Acoustic markers have the potential to alleviate these burdens and streamline processes by providing accessible, low-cost, and low-risk tests that can guide clinician decisions in the early stages of diagnosis.
Acoustic features of speech can be used to construct profiles of different patient populations. The most common acoustic features used to identify pathologies include speech rate (the number of syllables per second), pauses (the duration between utterances or syllables), and frequency information related to the pitch of the voice (fundamental frequency; f0) and its formants [15], [16]. These features reflect underlying cortical, subcortical, or cerebellar pathology of clinical populations leading to multiple speech subsystem impairments [7].
Speech can be elicited through specific tasks or naturalistic settings. The diadochokinetic (DDK) task is a common task in which the speaker repeats a syllable string (e.g., /PATAKA/) as quickly and clearly as possible for 10 seconds [17]. The DDK task is a controlled method of speech elicitation that allows high consistency between different speakers while remaining sensitive to speech performance [18]. Although other speech tasks (e.g., reading, semi-structured interviews) increase ecological validity, they may also increase cognitive load, which may induce speech changes based on individual differences like education level, language or reading impairments, or cognitive ability [19], [20]. Moreover, the linguistic content may encourage changes in prosodic features based on emphasis, stress patterns, and emotion, which may differ based on personality, accent, or emotional state. To avoid these concerns, the present study examined speech from a DDK task that was performed by healthy controls (HC) and two patient populations (FA, MS) using uniform practices (see Methods). We calculated acoustic features that have been examined in previous studies comparing HCs and various patient populations [1]-[9] and included several new acoustic features related to speech quality [15] and speech timing [21], [22] that may improve the differentiation of these groups.
Previous implementations of machine learning on speech have compared healthy controls with only a single patient group (cf. [16]). For example, machine learning approaches using acoustic features have shown high accuracy (>90%) when differentiating healthy control groups from patient groups with Parkinson's disease [23], spasmodic dysphonia [24], and various other vocal conditions (e.g., oral cancer or vocal fold nodules) [16]. Although these approaches are useful as initial triage for identifying pathological voice disturbances that require further investigation [25], they do not provide nuanced classification of the underlying pathology or disease phenotypes, potentially due to small sample sizes and, consequently, low accuracy [26]. This is especially relevant for deep learning models with hidden layers that reflect latent variables that are not defined and, therefore, do not aid in developing specific acoustic profiles that may characterize a disease [27]. We used an interpretable machine learning approach using gradient boosting that quantifies the contribution of each acoustic feature in distinguishing between healthy and pathological voices, and between multiple diseases [28].

II. METHODS

A. Participants
Healthy controls were recruited through advertisements within Australia. Clinical groups were recruited through medical centers in Australia. We recruited people diagnosed with multiple sclerosis (N = 112) and Friedreich ataxia (N = 73) as well as healthy controls (N = 229). All patient participants were diagnosed by a physician, and diagnoses of Friedreich ataxia were confirmed genetically. Demographic information of participant groups is shown in Table I. Some participants were recorded on more than one occasion, leading to a larger final number of speech tokens per group: multiple sclerosis (N = 787), Friedreich ataxia (N = 158), and healthy controls (N = 483). To ensure that results were not driven by profiles of individuals, data were averaged over participants, resulting in one data point per participant for training and test phases (see [29]).
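The per-participant averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `average_by_participant`, the participant identifiers, and the two-element feature vectors are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def average_by_participant(tokens):
    """Collapse repeated recordings into one feature vector per participant.

    tokens: iterable of (participant_id, group, feature_vector) tuples,
    one tuple per speech token (hypothetical layout).
    """
    rows_by_pid = defaultdict(list)
    group_by_pid = {}
    for pid, group, features in tokens:
        rows_by_pid[pid].append(features)
        group_by_pid[pid] = group

    averaged = {}
    for pid, rows in rows_by_pid.items():
        # zip(*rows) walks the recordings column by column (feature by feature).
        averaged[pid] = (group_by_pid[pid], [fmean(col) for col in zip(*rows)])
    return averaged


# Hypothetical tokens: participant p01 was recorded twice, p02 once.
tokens = [
    ("p01", "MS", [4.0, 0.25]),
    ("p01", "MS", [6.0, 0.75]),
    ("p02", "HC", [7.0, 0.125]),
]
per_participant = average_by_participant(tokens)
```

Averaging before the train-test split means that no participant can contribute data to both the training and the test set.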

B. Apparatus
A condenser headset microphone (AKG C520, AKG Acoustic, Vienna, Austria) positioned 8-10 cm from the mouth at an angle of 45° recorded speech. A Roland Quad-capture external soundcard connected to a Dell laptop captured speech using Audacity software [30] and Redenlab® software at a sampling rate of 44.1 kHz.

C. Procedure
Participants performed a DDK task where the syllables /PA/, /TA/, and /KA/ were repeated in an alternating fashion as many times as possible within one breath for a maximum of 10 seconds. Speech recordings were screened prior to feature analysis to manually remove speech artefacts and background noise.

D. Acoustic Feature Extraction
Acoustic features were extracted using custom-made MATLAB scripts that used standard signal processing functions from MATLAB [31], onset and offset detection algorithms [32], beat detection algorithms [22], music information retrieval [33], and speech analysis toolboxes [34], [35]. Acoustic features consisted of summary statistics (mean, standard deviation, coefficient of variation, minimum, maximum, range) of 74 variables that measure different aspects of speech qualia [15], resulting in an initial set of 444 features. These features include speech rate, utterance duration, pause duration, fundamental frequency, the first five formants, intensity, summed and peak energy across frequency bands, spectral decrease and spread, and a range of other spectral features used in the clinical acoustic marker literature (see Supplementary Materials for a full list and additional references).
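As an illustration of how six summary statistics over 74 variables yield 444 features, the sketch below computes the statistics for a single variable's time series. It is written in Python rather than the MATLAB used in the study, the function name `summarize` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption.

```python
from statistics import fmean, stdev


def summarize(values):
    """Six summary statistics for one acoustic variable (needs >= 2 values)."""
    mean = fmean(values)
    sd = stdev(values)  # sample standard deviation (assumption)
    lowest, highest = min(values), max(values)
    return {
        "mean": mean,
        "sd": sd,
        # Coefficient of variation; undefined when the mean is zero.
        "cv": sd / mean if mean else float("nan"),
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


example = summarize([1.0, 2.0, 3.0])  # toy values, not real acoustic data

# 74 variables x 6 statistics per variable = 444 features.
n_variables = 74
n_features = n_variables * len(example)
```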
The acoustic features used in the present study and their physiological and perceptual correlates have previously been described in detail in comprehensive reviews [15], [36], [37]. The fundamental frequency (f0) is the lowest frequency of a periodic waveform and is perceived as the pitch of a voice [38]. Formants are the resonant frequencies in the vocal tract that contribute to the timbre and quality of a voice [39].
Other measures of speech and sound quality were also used [40], [41]. Spectral centroid measures the center of mass of the frequency spectrum. Spectral slope is the slope of the linear regression line across the spectral amplitude values. Spectral flatness measures the uniformity of the frequency spectrum. Spectral decrease is the reduction in signal magnitude across higher frequency bands. Spectral spread measures the distribution of frequencies around the mean frequency in the spectrum. Spectral skew measures the asymmetry in the spectral distribution around its mean frequency. Spectral kurtosis measures the shape of the spectral distribution, indicating the presence of heavy tails or peaks. Spectral crest is the peak amplitude in a frequency spectrum, indicating its highest point. Spectral entropy is the degree of randomness in the distribution of spectral components. Spectral flux is the rate of local change of spectral magnitude and reflects shifts in energy distribution over time.
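Four of these descriptors can be sketched from a single magnitude spectrum as shown below. The formulas are common textbook definitions (centroid and spread as the first moment and the square root of the second central moment of the normalized spectrum, flatness as the ratio of geometric to arithmetic mean, entropy normalized to the range 0-1) and are assumptions here; the toolboxes used in the study may differ in detail. The function name and the toy spectra are hypothetical.

```python
import math


def spectral_descriptors(freqs, mags):
    """Centroid, spread, flatness, and normalized entropy of one spectrum.

    freqs: bin frequencies in Hz; mags: strictly positive magnitudes.
    """
    total = sum(mags)
    weights = [m / total for m in mags]  # spectrum treated as a distribution

    centroid = sum(f * w for f, w in zip(freqs, weights))
    spread = math.sqrt(sum((f - centroid) ** 2 * w for f, w in zip(freqs, weights)))

    # Flatness: geometric mean divided by arithmetic mean (1.0 = uniform).
    geometric_mean = math.exp(sum(math.log(m) for m in mags) / len(mags))
    flatness = geometric_mean / (total / len(mags))

    # Shannon entropy divided by its maximum, log2(number of bins).
    entropy = -sum(w * math.log2(w) for w in weights) / math.log2(len(weights))

    return {"centroid": centroid, "spread": spread,
            "flatness": flatness, "entropy": entropy}


bins_hz = [100.0, 200.0, 300.0, 400.0]
flat = spectral_descriptors(bins_hz, [1.0, 1.0, 1.0, 1.0])      # uniform spectrum
peaky = spectral_descriptors(bins_hz, [0.01, 10.0, 0.01, 0.01])  # one dominant bin
```

A uniform spectrum gives flatness and normalized entropy of 1, whereas a spectrum dominated by one bin gives values close to 0 and a centroid near that bin.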
Correlates of perceived loudness included acoustic intensity, amplitude, the alpha ratio, and energy as measured by wavelet analysis. Intensity is the power of a sound wave per unit area, perceived as loudness [42]. The alpha ratio is the ratio of the energy below 1 kHz and the energy between 1 and 4 kHz [43]. Amplitude is the magnitude of the maximum displacement of a wave from its equilibrium position and is also perceived as loudness [44]. Acoustic energy was quantified across different frequency components using Morlet wavelet transforms [45]. Both the summed and peak energy within frequency bands were measured, as was the frequency at which the energy peaked. Five frequency bands consisting of sub-bands between 1 Hz and 8,000 Hz were examined, specifically 1 Hz to 4,000 Hz (i.e., the "broad frequency range"), 75 Hz to 500 Hz (i.e., the "f0 frequency range"), 4 kHz to 8 kHz (i.e., the articulator and expiration spectrum or "high frequency range"), 75 Hz to 4,000 Hz (i.e., the common vocal frequency spectrum or "mid frequency range"), and 1 Hz to 75 Hz (i.e., the articulatory-unit and speech-unit range or "low frequency range"). The latter was measured to obtain energy metrics for articulation and speech rate [22], [46], [47].
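The three band-wise measures (summed energy, peak energy, and frequency of the peak) can be sketched as below, using the band edges listed above. Note the substitution: the study derived energy from Morlet wavelet transforms, whereas this sketch assumes an already computed power spectrum given as plain lists. The function name, the inclusive band edges, and the toy spectrum are hypothetical.

```python
# Band edges in Hz, as listed in the text.
BANDS_HZ = {
    "broad": (1.0, 4000.0),
    "f0": (75.0, 500.0),
    "high": (4000.0, 8000.0),
    "mid": (75.0, 4000.0),
    "low": (1.0, 75.0),
}


def band_metrics(freqs, power, band):
    """Summed energy, peak energy, and peak frequency within one band.

    Band edges are treated as inclusive (an assumption). Returns None
    when no spectral bin falls inside the band.
    """
    low_edge, high_edge = BANDS_HZ[band]
    in_band = [(f, p) for f, p in zip(freqs, power) if low_edge <= f <= high_edge]
    if not in_band:
        return None
    peak_freq, peak = max(in_band, key=lambda pair: pair[1])
    return {"summed": sum(p for _, p in in_band), "peak": peak, "peak_freq": peak_freq}


# Toy power spectrum with four bins (hypothetical values).
freqs = [50.0, 150.0, 1000.0, 5000.0]
power = [1.0, 4.0, 2.0, 0.5]
metrics = {band: band_metrics(freqs, power, band) for band in BANDS_HZ}
```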
Speech rate was also measured by determining the onsets and offsets of speech syllables based on amplitude, intensity, spectral flux, and the summed and peak energy in the five frequency bands described above. Onsets and offsets were obtained using the Schultz Musical Instrument Digital Interface Toolbox applied to the time series of these features [32], [48]. From the onsets and offsets, we determined speech rate (i.e., the time difference between consecutive onsets), speech duration (i.e., the time between the onset and offset of speech), and pause duration (i.e., the time between the offset of speech and the next onset). The stress rate of speech was also measured using a beat tracking algorithm that measures recurrent moments of increased energy [22], [49].
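Given detected onsets and offsets, the three timing measures defined above reduce to simple differences. The sketch below assumes paired, time-ordered onset and offset lists in seconds; the function name and example times are hypothetical, and the onset detection itself (performed with the toolbox cited above) is not reproduced.

```python
def timing_measures(onsets, offsets):
    """Inter-onset intervals, syllable durations, and pause durations.

    onsets[i] and offsets[i] mark the start and end of syllable i (seconds).
    """
    # Speech rate measure: time between consecutive onsets.
    inter_onset = [nxt - cur for cur, nxt in zip(onsets, onsets[1:])]
    # Speech duration: time from each onset to its own offset.
    durations = [off - on for on, off in zip(onsets, offsets)]
    # Pause duration: time from each offset to the next onset.
    pauses = [nxt - off for off, nxt in zip(offsets, onsets[1:])]
    return inter_onset, durations, pauses


# Three hypothetical syllables, each 125 ms long, starting every 250 ms.
onsets = [0.0, 0.25, 0.5]
offsets = [0.125, 0.375, 0.625]
inter_onset, durations, pauses = timing_measures(onsets, offsets)
```

Each inter-onset interval equals the corresponding syllable duration plus the pause that follows it.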

E. Machine Learning Procedure
We used CatBoost as our machine learning classification algorithm. CatBoost is an open-source decision tree-based algorithm with gradient boosting and hardware optimization [50]. The main advantages of CatBoost over other algorithms are that it builds symmetric trees, employs weighted sampling, and performs ordered boosting. It also lowers the weights of variables that are less useful in identifying groups. These features decrease the need for hyperparameter tuning and reduce the chance of overfitting [50]. Cross-validation was performed using 67%-33% train-test splits with 100 resamples using stratification to achieve the same balance for each class [28].
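The resampling scheme can be sketched as follows, under the assumption that stratification means each class contributes roughly 33% of its members to every test set. The classifier itself is omitted; the function name `stratified_split`, the rounding rule, and the use of the resample index as the random seed are hypothetical choices. The class sizes are the participant counts reported above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.33, seed=0):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, test = [], []
    for indices in indices_by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Each class sends the same fraction of its members to the test set.
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# One label per participant: 229 HC, 112 MS, 73 FA.
labels = ["HC"] * 229 + ["MS"] * 112 + ["FA"] * 73

# 100 resamples, each with a different seed.
splits = [stratified_split(labels, seed=resample) for resample in range(100)]
```

In each resample, a model would be fitted on the training indices and scored on the test indices, giving 100 performance estimates.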

F. Statistical Analysis
One-sample t-tests were conducted on Matthews correlation coefficients and F1 scores to assess if model performance surpassed chance levels. Effect sizes were measured using Cohen's d. Performance differences between groups were analyzed using an analysis of variance with group as a fixed factor and resample number as the random factor. Effect sizes for group differences were measured using generalized eta squared (η²_G). Spearman rank-order correlations were used to assess relationships between the acoustic features and disease severity scores (see Supplementary Materials). All analyses were performed using R software [51].
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
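For a one-sample test, Cohen's d and the t statistic are linked by t = d × √n. The sketch below (in Python rather than the R used for the analyses; function name and toy scores hypothetical) makes that relationship explicit and applies it to the overall test reported in the Results, t(99) = 114.56 with n = 100 resamples.

```python
import math
from statistics import fmean, stdev


def one_sample_t(values, mu0):
    """Return (t, Cohen's d, degrees of freedom) against a reference value."""
    n = len(values)
    d = (fmean(values) - mu0) / stdev(values)  # standardized mean difference
    t = d * math.sqrt(n)                       # t = d * sqrt(n)
    return t, d, n - 1


# Toy scores (hypothetical) tested against a reference value of 0.
t_stat, cohens_d, df = one_sample_t([1.0, 2.0, 3.0], mu0=0.0)

# Effect size implied by the reported t(99) = 114.56 with n = 100 resamples.
reported_d = 114.56 / math.sqrt(100)
```

The implied value of about 11.46 agrees with the Cohen's d reported for overall model performance.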

III. RESULTS
A one-sample t-test revealed that overall model performance as assessed by the Matthews correlation coefficient (M = 0.82, SD = 0.04) was significantly above chance (33% accuracy), t(99) = 114.56, p < 0.001, Cohen's d = 11.46. Classification accuracy between groups was assessed using F1 scores that equally weight precision and recall (see Supplementary Materials for full statistical analysis); these were also significantly better than chance for all groups (ps < 0.001) with large effect sizes for HC (Cohen's d = 32.29), MS (Cohen's d = 11.82), and FA (Cohen's d = 10.70). There was a significant main effect of group, F(2, 198) = 238.20, p < 0.001, η²_G = 0.49. As shown in Figure 1, classification accuracy was higher for HC compared to FA (p < 0.001) and MS (p < 0.001), and higher for FA compared to MS (p < 0.001). Receiver operating characteristic (ROC) curves for the model with average performance based on Matthews correlation coefficients are shown in Figure 2. The ROC area under the curve (AUC) values were 0.97 for HC, 0.98 for FA, and 0.96 for MS. These values indicate outstanding discrimination by the model [52].
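For reference, the multiclass Matthews correlation coefficient can be computed from a confusion matrix as sketched below, using the standard multiclass generalization of the binary formula. The function name and the two confusion matrices are hypothetical and do not reproduce the study's data.

```python
import math


def multiclass_mcc(confusion):
    """Matthews correlation coefficient for a K x K confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    true_counts = [sum(row) for row in confusion]        # row sums
    pred_counts = [sum(col) for col in zip(*confusion)]  # column sums

    numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Defined as 0 when the denominator vanishes (e.g., one class predicted only).
    return numerator / denominator if denominator else 0.0


perfect = multiclass_mcc([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
unrelated = multiclass_mcc([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
```

A perfect classifier scores 1, and predictions that carry no information about the true class score 0.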

A. Model Optimization
To measure the contribution of each acoustic feature for categorizing each group, the Shapley additive explanation (SHAP) values were examined. These show the probability of each outcome based on the information provided by each feature [53], [54]. To achieve a more parsimonious model, we performed the same machine learning procedures twice, including only features that produced SHAP values above criteria of 2% (n = 87) and 5% (n = 21) for at least one group (see Supplementary Materials for rankings). Overall model accuracy (Matthews correlation coefficient) significantly increased relative to the full model (M = 82.3%, SEM = 0.4%) for the 2% cut-off (p = 0.001; M = 83.7%, SEM = 0.4%) and the 5% cut-off (p < 0.001; M = 83.6%, SEM = 0.4%). Pairwise comparisons of accuracy between models for each group revealed significant increases in F1 accuracy between the full model and the 2% cut-off for all groups (ps < 0.002), and between the full model and the 5% cut-off for the HC and MS groups (ps < 0.03) but not the FA group (p = 0.11) (see Figure 1). These results suggest that high discrimination accuracy can be achieved with a reduced subset of 21 acoustic features. It should be noted, however, that larger subsets of variables may be required to achieve high discrimination accuracy if a broader scope of clinical groups is included.
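The cut-off rule (keep a feature if its group-wise average SHAP value exceeds the criterion for at least one group) can be sketched as below. The feature names and percentages are invented for illustration, and how the study scaled SHAP values to percentages is not specified here, so the input format is an assumption.

```python
def select_features(shap_percent, cutoff):
    """Keep features whose largest group-wise SHAP percentage exceeds cutoff.

    shap_percent: {feature_name: {group: average SHAP value in percent}}.
    """
    return [
        name
        for name, by_group in shap_percent.items()
        if max(by_group.values()) > cutoff
    ]


# Hypothetical group-wise average SHAP values (percent), not study results.
shap_percent = {
    "spectral_decrease_mean": {"HC": 7.1, "FA": 3.0, "MS": 6.2},
    "spectral_spread_sd": {"HC": 1.5, "FA": 2.4, "MS": 0.9},
    "formant3_min": {"HC": 0.4, "FA": 0.7, "MS": 1.1},
}

kept_at_2 = select_features(shap_percent, cutoff=2.0)
kept_at_5 = select_features(shap_percent, cutoff=5.0)
```

The retained feature names would then define the reduced input set for refitting the classifier.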

B. Optimal Clinical Acoustic Markers
We describe the top 21 features overall and the top 10 for each group (see Supplementary Materials for all features). As shown in Table II, the dominant acoustic features for accurate classification were spectral decrease, peak f0 energy, peak energy in the low, high, and broadband frequency ranges, low-frequency summed energy, utterance duration based on summed broadband energy (including low-, mid-, and high-frequency sub-bands), spectral spread, and acoustic intensity. Figure 3 shows that healthy controls were characterized by a less steep and less variable spectral decrease, a smaller spectral spread and range of energy produced in low frequencies, greater energy in low and f0 frequency bands, and shorter utterance durations. The FA group was characterized by low intensity and energy in low, high, and broadband frequency bands, a higher and more variable spectral spread, and longer utterance durations. The MS group was characterized by a steeper and more variable spectral decrease, as well as utterance durations and spectral spread values that fell between the control and FA groups (see link in Figure 3 note for figures of all acoustic features). Other acoustic features that were useful in delineating groups include metrics of speech timing (duration, speech rate, and stress rate [22]), spectral features (crest, slope, centroid, flatness, and entropy [55]), formants 1-5, and the alpha ratio [56].

IV. DISCUSSION
Our machine learning approach distinguished between healthy controls, people with multiple sclerosis, and people with Friedreich ataxia with high accuracy using acoustic properties of speech alone. These results indicate that multiclass supervised machine learning has the potential to discriminate between diseases, a step beyond mere healthy-pathological dichotomies. Through the accumulation of big data that merges speech data from various patient populations, we may be able to use machine learning to assist in the detection of specific diseases using acoustic markers.
There are numerous advantages to using acoustic markers to detect neurodegenerative disease, including the decreased risk and burden of travelling to a hospital to undergo a range of tests, some of which are invasive. Speech, on the other hand, can be recorded within a familiar and comfortable setting, using common household devices (e.g., smartphones). Smartphones have demonstrated relative robustness for obtaining acoustic clinical markers and, therefore, increase accessibility to these automated detection methods [57]. Although the present study recorded speech within laboratory settings, it is also possible to record speech data remotely [58]. Practitioners could use this information to refine test selection for differential diagnosis. This would be particularly useful for people living in rural communities with increased travel burdens or during situations where the risk of infection is heightened (e.g., pandemics). Speech markers can be used as a remote tool to initially detect signs of neurodegenerative disease, to expand our understanding of the clinical characteristics of these diseases and thereby improve our ability to develop targeted interventions, and to monitor disease progression or treatment response.
We identified several acoustic features that strongly contributed to distinguishing between groups. Spectral decrease, the average of all slopes between the peak amplitude at the fundamental frequency and the peak amplitude of the formants (i.e., harmonics), was the most useful variable in distinguishing our three groups. This finding is in line with previous results that suggest vocal fold dysfunction is associated with greater energy in the lower frequency range relative to higher frequencies (e.g., the soft phonation index [59]). Other spectral features associated with the distribution of vocal energy also contributed to classification accuracy, including summed and peak energy within low-frequency bands (1-75 Hz), peak energy within f0 (75-500 Hz), high (4000-8000 Hz), and broadband (1-8000 Hz) frequency ranges, and the spectral spread of peak frequencies. Therefore, the distribution of acoustic energy across the spectrum that reflects voice qualia culminates as a strong set of acoustic clinical markers for distinguishing neurodegenerative diseases.
Speech timing measures were also strong contributors to classification accuracy, specifically, the duration of syllables based on summed energy in low and broadband frequency ranges, and the rate of stressed syllable onsets based on peak energy across broadband frequencies [22]. These results corroborate previous findings that demonstrate slowed speech rate and decreased phonation time for a range of neurodegenerative diseases including Parkinson's disease [60], Huntington's disease [8], [61], multiple sclerosis [4], [5], and other diseases [3], [16], [61] and ataxias [62], [63], [64]. Speech rate and phonation time reflect both pneumo-articulatory capacity and oral-motor function, and could serve as clinical acoustic markers for monitoring the progression of neurodegenerative disease, distinguishing between diseases, and determining disease severity.
To the knowledge of the authors, this is the first paper to use machine learning to simultaneously differentiate three groups (i.e., healthy controls, people with multiple sclerosis, and people with Friedreich ataxia) using speech data. This novel application of machine learning and acoustic analysis paves the way for new pre-diagnostic methods that could leverage big data to discriminate between a range of neurodegenerative diseases and/or other conditions. Through initiatives that obtain and share speech data from various clinical populations, our approach could be applied to any population that is able to produce speech. The implications of this approach are substantial and provide new opportunities for healthcare, particularly for remote and rural areas where access to health providers might be limited.

A. Limitations and Considerations
We used the most common acoustic features (or similar proxies) based on an a priori analysis of the neurodegenerative disease literature and timbral features used in music information retrieval. There are, however, other acoustic features that may increase the accuracy and sensitivity of the machine learning algorithm that were not considered here. For example, voice onset time, the time between the burst of a stop consonant and the onset of the vowel, is an acoustic feature that differs significantly between controls and people with Parkinson's disease [65]. We opted not to use this measure because our data contained a high degree of coarticulation, and there is little agreement on the best way to extract the burst and vowel onset times and on which acoustic features should be considered (see [66]). Similarly, we did not include measures from other voice assessment tasks (e.g., sustained vowel) [67] that can more reliably measure certain features (e.g., jitter and shimmer) but preclude the measurement of speech timing. We chose to constrain the number of variables and tasks to avoid overfitting. Future studies could use feature selection and pruning methods (e.g., [68]) to find the best feature set and remove unreliable variables prior to analysis.
The inclusion of non-speech performance measurements could also increase discrimination accuracy, for instance, cognitive [69] and motor performance [70] measures. The primary aim of this experiment was to examine accuracy using speech features alone because speech data can easily be obtained in the absence of a clinician through websites and smartphone applications [71]. Other cognitive and motor tests often require scoring by a clinician or dedicated tools to measure gait and tremor, although some remote tests are available [72]. We show that neurodegenerative diseases can be delineated with high accuracy from speech data alone, but future applications could also consider other non-verbal features, for example, irregular gait patterns using smartphone accelerometers or irregular typing patterns. Whether these movement features or others would increase the accuracy of machine learning algorithms for neurodegenerative disease remains unknown.

B. Future Directions
The current study differentiated neurodegenerative diseases with high accuracy, but the approach did not aim to determine the severity or stage of the disease [35], [73], [74]. Future studies could employ an approach in which the severity of the disease is predicted or estimated following identification. A two-phased approach might be necessary because measures of disease severity tend to be idiosyncratic to the specific disease. Therefore, it remains a challenge to provide a measure of severity that can be applied to a range of diseases and conditions while capturing the relevant clinical markers.

V. CONCLUSION
We provide strong evidence that neurodegenerative diseases can be differentiated through acoustic clinical markers and machine learning, even when the speech phenotype is subtle or similar across groups. This model can be expanded and improved through the inclusion of additional diseases and phenotypes. Big data initiatives that bring together researchers and speech data from multiple laboratories are necessary to increase the scope of diseases that can be identified by acoustic clinical markers and machine learning. Moreover, a combination of remote testing tools for physical and cognitive assessment could be included in addition to speech to improve identification accuracy. These technologies promise to provide tools that can aid practitioners in reaching a diagnosis and relieve the physical and financial burden on patients.

CONFLICT OF INTEREST
APV is the CSO of Redenlab, a speech clinical marker company.

Fig. 1. Mean, dispersion, and range for classification accuracy (A = F1 Accuracy, B = Precision, C = Recall) for healthy controls, and groups with Friedreich ataxia and multiple sclerosis for the full model, and models using a subset of acoustic features using the maximum group-wise average SHAP value cut-offs of 2% and 5%.

Fig. 2. Receiver operating characteristic curves for healthy controls (HC), Friedreich ataxia (FA), and multiple sclerosis (MS) obtained from the model with average performance.

Fig. 3. Normalized (z-score) values of the top 21 acoustic features for identifying members of the healthy control (HC), Friedreich ataxia (FA), and multiple sclerosis (MS) groups. Note. See here for interactive figures for all acoustic features.

TABLE I
DEMOGRAPHIC INFORMATION FOR HEALTHY CONTROLS (HC), FRIEDREICH ATAXIA (FA), AND MULTIPLE SCLEROSIS (MS)

TABLE II
TOP 10 ACOUSTIC FEATURES FOR CATEGORIZING GROUPS BASED ON SHAP VALUES