Modeling the Progression of Speech Deficits in Cerebellar Ataxia Using a Mixture Mixed-Effect Machine Learning Framework

Background: Accurate and reliable prediction of changes in the severity of cerebellar ataxia (CA) will be necessary for trials of disease-modifying therapies. Cerebellar dysarthria (CD) is a common feature of CA. This study demonstrated that objective acoustic measures were more sensitive than perceptive analysis through the Scale for the Assessment and Rating of Ataxia (SARA) in assessing the progression of CD, within a time window of two years (mean). Method: Thirty-seven people with CA were tested at baseline (time point 1, TP1) and two years later (time point 2, TP2). A machine-learning framework with a robust three-step feature selection criterion and a Bayesian data-driven clustering technique based on the multivariate mixture extension of the generalized linear mixed model (GLMM) was used. The outcomes included two (time and cepstral-based) objective speech parameters recorded at TP1 and TP2. Subject testing involved dynamic prediction and was conducted using samples from the posterior distributions of parameter estimates and random effects. This study further employed the penalized expected deviance (PED) criterion for model comparison and the selection of the number of groups in the clustering procedure. Results: First, the selected objective speech metrics in the individual patients showed a significant worsening of the speech impairment (p<0.001, Kolmogorov–Smirnov test) between TP1 and TP2. Second, the cluster analysis divided the CA patients into two distinct subgroups showing a strong association between objective speech measures and disease duration, with ~96% of observed values falling within the 95% credible intervals. Third, for the training data, our multivariate model ( $PED_{Fea1+Fea2}=5175$ ; number of groups = 2) performed more reliably than the univariate models ( $PED_{Fea1}=4225$ , $PED_{Fea2}=3850$ ; number of groups = 2) in discriminating the CA patients. Fourth, the individual-level predictions of the change in profiles of the objective measures over time were performed for the testing data. Conclusion: Such a framework using objective speech metrics indeed holds promise to predict the rate of clinical progression of CD in individuals with CA.


I. INTRODUCTION
The cerebellum integrates information from a range of sensory afferent inputs to produce coordinated movement.
The associate editor coordinating the review of this manuscript and approving it for publication was Li He . Cerebellar Ataxia (CA) refers to the uncoordinated movement resulting from dysfunction of the cerebellum caused by many processes, including neurodegeneration, multiple sclerosis, stroke and trauma. Here ''CA'' will refer specifically to neurodegenerative cerebellar conditions. The cerebellum regulates many aspects of movements, including coordination VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ of the limbs, trunk, balance, gait, eyes and speech, the latter being the specific focus of this study. Ataxia of speech is sometimes referred to as ataxic dysarthria but here is described as Cerebellar Dysarthria (CD), where the tempo and articulation of speech is affected and the voice is unstable and impaired in quality [1], [2]. Progression of ataxia is reflected in increasing scores on clinical rating scales such as the Scale for the Assessment and Rating of Ataxia (SARA) [3]. The SARA is an eightitem clinical scale devised by Schmitz-Hubsch et al. [3] for measuring ataxia severity in different domains, including upright posture, gait, speech, upper and lower limbs. It ranges from a total score of 0 (no ataxia) to 40 (severe ataxia).
Neurodegenerative cerebellar disease generally progresses relatively slowly and consequently measurement instruments must be reasonably sensitive to detect these changes [4]. Generally, in multiple system atrophy (MSA) and the spinocerebellar ataxias (SCAs), the minimum detectable changes is 1-2 points per year [5]- [7] with an mean annual deterioration of 1.38/40 [standardized response mean (SRM) = 0.5] in the total SARA scores [8]. Thus, it is unlikely that the SARA speech item, which contribute no more than 6 of the total 40 points will change significantly in two years. This is not surprising because the SARA was designed to be a composite score for ataxia rather than an independent score of each domain. Furthermore, the SARA and other clinical scales are ordinal and the interval between each score most likely does not represent a similar increase in severity: that is, they do not linearly correspond to severity. The assumptions underpinning the present study were that the severity of CD will worsen over time and that the measures to detect this deterioration of speech performance with greater sensitivity and lower variance are the most sensitive measures. This study aimed to design an automated system with objective measures to detect the worsening of CD with greater sensitivity than clinical scales such as the SARA.
Most recent studies (Table 1) were designed to find objective acoustic features to diagnose CD and measure its severity [12], [13]. The term 'diagnosis' will be used here to mean identification of CD from the speech of non-ataxic subjects. Additionally, most recent studies were cross-sectional [12], [13] with only one longitudinal study [9]. The latter examined the changes over time in perceptual and acoustic features of the speech of individuals with SCAs. Previous studies were specifically focused on exploiting time [9], [11]- [15], spectral and cepstral [10] based speech characteristics for the diagnosis of CD. All the previous researchers [9]- [11] used either sustained phonations or connected speech in their studies. However, speech tasks involving syllable repetition have proved to be more useful than sentence utterances [16]- [18] for identifying differences between the speech of ataxic and non-ataxic individuals during perceptual analysis. Variations on the ''repeated Consonant-Vowel (C-V) syllable paradigm tasks'' have been most commonly used for syllable repetition.
While mixed-effect models have been commonly used to assess longitudinal progression [19], [20], [43], they are usually only univariate analyses. As the present study assessed different outcomes-that is, continuous objective speech parameters from multiple speech tasks [21], [22], a multivariate model (rather than a univariate model) was warranted. Multivariate generalized linear mixed effect models (GLMMs) can be adapted simultaneously for inference, integrating not only the association of repeated measurements for each outcome within a subject, but also the use of random effects to correlate multiple outcomes. Komarek and Komárková [23] demonstrated a Bayesian data-driven clustering technique to draw inferences based on a multivariate mixture extension of the classical GLMM [24], [25]. This approach proved to be a computationally efficient and reliable alternative to an existing method [26], and has relevance for predicting deterioration in CD over time. To our knowledge, our work is the first study to use optimally-integrated multivariate objective acoustic measures to predict the progression of CD over time.
The main contributions of this study can be summarized as follows: 1) Define a robust three-step objective speech feature selection criterion to select distinctive acoustic features as outcomes from different repeated C-V syllable paradigm speech tasks. 2) Build a multivariate mixture extension of a GLMM for prediction based on the repeated measurements of selected multivariate continuous outcomes. 3) Perform a cluster analysis based on this multivariate model to identify groups of patients with similar characteristics and draw meaningful inferences. 4) Evaluate the model's performance and further compare its performance with univariate analyses. 5) Predict, at a specific timepoint, the probability of a subject belonging to a particular patient group. The rest of this paper is organized as follows: section II introduces the proposed speech progression assessment framework, data collection strategy and feature selection scheme, and describes the multivariate mixture generalized linear mixed model which will serve as the basis for our clustering procedure; section III describes the results of the research. Section IV discusses the significance of the proposed approach; and section V concludes the paper and suggests the future scope.

A. SPEECH PROGRESSION ASSESSMENT FRAMEWORK (SPA CA )
The framework of our proposed assessment of CD progression (hereafter, referred to as SPA CA ) involved the following steps: 1) Speech Inputs generated by instrumented versions of the standard clinical test for assessing ataxic speech. 2) Speech recordings were captured by a condenser microphone clipped at an average distance of 10 cm from the subject's lips, in a quiet room with low ambient noise. Recording was conducted using the BioKinMobi TM [27] application on an Android phone, under the supervision of a trained investigator. 3) Recordings were wirelessly transmitted to a blockchain based distributed cloud network [28] where the proposed SPA CA algorithm was deployed. 4) Data analysis results were transformed into a clinically relevant format. A pictorial representation of the assessment platform is demonstrated in Figure 1.

B. DATA COLLECTION
Thirty-seven native speakers of Australian English with a bilateral neurodegenerative cerebellar disorder were followed for two years (median: 2, mean: 1.99, standard deviation: 0.02). The demographics and clinical characteristics of participants are summarized in Table 2. Subjects were assessed upon entry to the study and again after two years (median: 2, mean: 1.99, standard deviation: 0.02). None of the participants had received speech therapy prior to the investigation. The speech intelligibility was perceptually scored by an experienced clinician in accordance with the SARA, on a scale of 0 to 6 as follows: '0', normal speech; '1', disturbed speech; '2', distorted but simple to comprehend speech; '3', words sometimes difficult to understand; '4', several words difficult to understand; '5', only single words are comprehensible; and '6', unintelligible speech.
The assessment consisted of participants performing the following two speech tasks:

2) Speech Task 2: Utter the phrase British Constitution
(BC: a classical phrase for eliciting the features of ataxic speech) thrice. Individual's acoustic measures were taken as the average of the three recordings. These speech tasks resulted in 74 speech recordings (37 from each task) at each timepoint. This study was approved by the Human Research and Ethics Committee at the Royal Victorian Eye and Ear Hospital, Australia (HREC Reference Number: 11/994H/16). It complied with the NHMRC's guidelines for research using human participants. Written consent was sought from all participants prior to their enrolment. The subject in the Figure 1 provided informed consent to publish their image.

C. MACHINE-LEARNING FRAMEWORK
In this study, the data analysis constituted of developing the SPA CA algorithm ( Figure 1) to predict the progression in the severity of CD was developed. Its three stages were: 1) Extract and select distinctive acoustic features with a 3-step criterion. 2) Build a multivariate mixture generalized linear mixed model (GLMM) for prediction based on the joint exploitation of the selected features' change over two timepoints. 3) Classify the subjects into two groups based on the mixture extensions of the multivariate model.

D. DATA REPRESENTATION AND OUTCOME DEFINITION
Data tr and Data ts were used to denote a training dataset and a testing dataset with a sample size of N 1 and N 2 respectively. Data tr was used to build the prediction model and Data ts is used to assess the prediction for new subjects. Y irj denoted the objective measure for subject i at time t ij , extracted from a specific speech task. The index i = 1, 2, . . . , N represented the subject while the index j = 1, 2, . . . , n represented speech measurements from a subject at different timepoints (TP) for a specific speech task. In our designed Ataxic Speech progression study, we used two different speech tasks, and the measurement times followed a protocol with a common set of follow-up times, t ij = t j where subjects were measured at baseline (that is, TP1 or t 1 = 0) and after 2 years (that is, TP2 or t 2 = 2). Exploration of the observed changes in the profiles of objective speech features ( Figure 3) suggested that changes in the features for each subject, where the respective values of the features may differ across subjects, could be linearly modeled over time. Having been motivated by this inference, we considered Y iqj to be continuous.

E. EXTRACTION AND SELECTION OF SPEECH PARAMETERS
In this study, the time-based features were extracted from speech task 1 and the spectral and cepstral features were extracted from speech task 2. Initially, six topographic prominence-based time features and 23 cepstral features (12 Mel-frequency cepstral coefficients (MFCC) [29] and 11 modified group delay cepstral coefficients (MGDCC) were extracted, applying the same methods that were established in our previous studies for Ataxic Speech diagnosis and severity prediction [12], [13]. Feature descriptions are presented in Table 3 and the results of the three-step feature selection criterion is described via a pictorial representation in Figure 2.
A feature x was from a specific speech task was selected through the three-step feature selection criterion. In the first step, features (in the training data) that do not change significantly from TP1 to TP2 were eliminated using a massunivariate approach. The threshold that determined whether changes in the feature observations from TP1 to TP2 were significant was the p-value of 0.001 from a KS test (TH 1 ). All the significant features with a p-value < 0.001 were selected during this step. In the second step, the selected features were pruned using another exclusion mechanism. As the existence of multicollinearity is indicated by an absolute correlation coefficient of > 0.7 between two or more predictors, only features with a spearman correlation < 0.5 (TH 2 ) were selected. The correlation between features was compared and one of two features with a correlation > 0.5 was removed. In the third step, the selected features were further sorted based on the ANOVA ω 2 effect size (> 0.14) [30], [31], observed power (> 0.85) and F-statistic test (p = 0.001). The selected features from the final step were carried forward to design the model. The one-way repeated measures ANOVA [32] statistics of all the selected features from step 2 are tabulated in Table 4.

F. CONSTRUCTING MIXTURE MODEL-BASED CLUSTERING
The clustering procedure used in this study was based on a multivariate mixture GLMM (MMGLMM). The change in the profile of the r th selected feature (r = 1, 2, . . . , R) belonging to the i th subject (i = 1, . . . , N ) was denoted by the random vector, represented the random vector of the change in measurements of all selected features at the different timepoints for the i th subject and Y = (Y T 1 , . . . , Y T N ) T was a random vector representing the available outcomes of all subjects. This mixed mixture model was designed based on the changes in measurements of the selected speech parameters from TP1 to TP2. Two continuous features were selected through the three-step feature selection criterion, Y i1j and Y i2j with Gaussian distribution. Following the data interpretation in Section II-D, when the subjects i = 1, 2, . . . , N were each measured at timepoints j = 1, 2, . . . , n, the outcome at each timepoint j could be represented with the mean structure of the change in profiles of the two parameters as follows: Here, t irj is the time in years and Y irj is the corresponding outcome for r = 1, 2, that is the timepoints from the start (TP1) to the follow-up (TP2).
Next, to identify groups of patients with similar characteristics using the change in profiles of the measurements over time, they were classified into two groups (K = 2) in the distribution of the four-dimensional random effect vector, intercepts from each selected feature and b i12 , b i22 are their respective random slopes.
The estimates of the cluster specific (marginal) mean change in profiles over time is denoted as Here, r = 1, 2 over time and K = 1, 2, representing the two groups.
Following the clustering procedure, the optimal classification of the i th subject to a specific cluster or group was computed based on the subject-level marginalization over the posterior distribution [33] as A priori, the independence between the mixture related parameters and the GLMM related parameters were assumed as θ and ψ respectively.

G. EXTERNAL PREDICTION OF CA SUBJECTS USING THE TESTING DATASET
To classify a test subject into either Group 1 or Group 2 with 95% highest posterior density credible intervals (HPD CI), the patient-specific component probability was computed as: Here, i is a patient and K is the number of groups. Only if the lower limit of the corresponding credible interval approaches a certain threshold, such as 0.5 (considering the classification into K = 2 groups), the patient is categorized into one of the considered groups.

H. MODEL PERFORMANCE STATISTIC
In this study, the penalized expected deviance (PED) was selected as a criterion for choosing the number of clusters/ groups in the multivariate model [36]. Furthermore the PED was also exploited to compare the performance of the multivariate model with univariate models of similar group size. PED has been successfully employed in various applications [37], [38] and is defined as Here, D(ψ, θ) =˘2logL(ψ, θ) is the observed data deviance of the model. The expected deviance, represented as E {D(ψ, θ)|y} (posterior mean), can be calculated from the Markov chain Monte Carlo (MCMC) sample. The penalty term, optimism, is denoted in Equation 6 by p o , which can be further computed from importance sampling and two parallel MCMC chains [36].

A. CROSS-VALIDATION AND SELECTED MODEL PARAMETERS
Participants were sorted into a training set (N = 32) and a test set (N = 5). The time (years) since the baseline visit, and the two objective speech parameters resulting from the three-step parameter selection criterion were the only features included in this study to build the multivariate prediction model. Interestingly, the three-step feature selection criterion resulted in two heterogeneous features, each belonging to a different speech task. This suggested that each provided complementary information to aid in identifying the presence and severity of Ataxic Speech. The two outcomes were: 1) RT parameter (RT _Dr50) (hereafter, referred to as Fea1) 2) BC parameter (BC_MGDCC5) (hereafter, referred to as Fea2), as indicated in the Figure 2. The feature descriptions are listed in Table 2. The model was trained by running the prediction algorithm with a burn-in of 1000 iterations [39] and 100 subsequent iterations with 1:10 thinning to provide samples from the joint posterior distribution (two parallel Markov chain Monte Carlo (MCMC) sampled chains with different sets of initial values).

B. COMPARISON OF SPEECH PARAMETERS AT TP1 AND TP2 (APPROXIMATELY 2 YEARS LATER)
The SARA speech scores at TP2 were not significantly different from those at TP1 (p = 0.1, KS test). However, there were statistically significant differences VOLUME 9, 2021 (p ≤ 0.001, KS test) between the two speech features (Fea1 and Fea2, see Section III-A) extracted from each speech task after the three-step feature selection criterion. Descriptive statistics are recorded in Table 5 and the change in profiles are depicted in Figure 3.

C. GROUP-SPECIFIC MEAN CHANGE FROM TP1 TO TP2
The multivariate mixture GLMM (MMGLMM) for the clustering of ataxic subjects measurements at TP1 was compared with measurements at TP2 for both Fea1 (Y i1j ) and Fea2 (Y i2j ). Both features were logarithmically transformed because they were assumed to have Gaussian distributions. The changes in clusters from TP1 to TP2 for each subject are plotted in Figure 4 and the means are plotted in blue for Group 1 and red for Group 2. Group 1 was thus characterized by a lower Fea1 level at TP1 (baseline). Also, Fea1 increased at a slower rate in Group 1 than in Group 2 ( Figure 4(A)). In contrast, Fea2 for Group 1 was remarkably lower at TP1 than for Group 2 and seemed to drop quicker in Group 1 as compared to Group 2 (Figure 4(B)). Figure 5 shows the estimates of the posterior distributions of the subject-level component probabilities of the follow-up (TP2) outcomes for two patients in the testing cohort. Patient 49 had a high density of 0.990 (red bar) with a narrow 95% HPD CI (0.970, 1.000) and therefore could be confidently classified into Group 1; conversely, as the posterior probability of Patient 3 belonging to Group 1 was 0 (black vertical line), with a very narrow 95% HPD CI (1.000, 1.000),they could be classified into Group 2. Majority (∼96%) of the 95% credible intervals included the true observed values in the training cohort.

E. COMPARISON WITH UNIVARIATE MODELS AND MODEL PERFORMANCE
In this study, the optimal model type (univariate\ multivariate) and group size (K = 1\ 2\ 3\ 4) were selected using PED [36]. For both the univariate (Figure 6(A), (C)) and multivariate ( Figure 6(B)) models, the plots reveal a significant change in the model's deviance when moving from the model with group size K = 1 to the model with group size K = 2.   In the multivariate model ( Figure 6(B)), the variability of the posterior distribution of the deviance in a model with group size K = 2 was practically the same as with K = 1. Nonetheless, as opposed to K = 1, the K = 2 deviance posterior distribution was distinctly skewed to the left. When comparing models of group sizes K = 3 and K = 2, as well as K = 4 and K = 3, a similar observation was made.
Under the assumption that there were no associations between the outcomes, Fea1 and Fea2, related methods were used to compare the univariate models ( Figure 6(A, B)). The plots indicated that our multivariate model (Figure 6(C)) (PED Fea1+Fea2 = 5175; K = 2) performed better than the univariate models (PED Fea1 = 4225, PED Fea2 = 3850; K = 2). However, further confirmation would require a larger trial or cohort analysis.

IV. DISCUSSION
In this study, a machine-learning framework based on the multivariate extension of a mixture generalized linear mixed model (GLMM) was introduced for assessing whether an automated analysis of CD parameters could predict deterioration in speech ataxia. Our proposed framework yielded accurate and meaningful results. This led to the identification of a reliable feature profile, specific to CD obtained from a dataset of 74 speech recordings collected at baseline and approximately 2 years later.
The descriptions of CD in the literature mostly highlight variability in syllable and pause durations [16], [40], [41] and variations in loudness [42]. We initially selected a feature-set of 29 time and cepstral domain features from our previous objective assessments [23] to examine their contributions to assessing the change in CD from TP1 to TP2. A robust three-step feature selection criterion was adopted to select the objective measures that best detected the worsening of CD with greater sensitivity than clinical scales such as the SARA. Interestingly, the three-step feature selection resulted into two heterogeneous objective features, RT _Dr50 (Fea1) and BC_MGDCC5 (Fea2), one from each C-V speech task, indicating that features from different speech tasks capture complementary information on the ataxic symptoms coexisting in different speech dimensions.
In a routine clinical CA assessment, only the most recently available SARA score representing the current patient status is normally used to determine the severity of a patient's CD. Clearly, such a protocol disregards the existing data on CD evolution over time, which could be more relevant for an accurate classification than simply the last recorded state. To address this shortcoming, the present study proposed a clustering method to jointly consider the entire history of the changes in the profiles of the measurements of all considered parameters. In the framework design, the computational complexity arising from the multivariate extension of the mixture GLMM was handled using a Markov chain Monte Carlo (MCMC) simulation based Bayesian data-driven clustering approach [23]. Interestingly, at a prespecified time point from the start to follow-up, the changes in profiles of the selected speech parameters over time could be used to classify patients into groups with similar characteristics.
Cluster analysis using the objective measures at the two time points permitted the classification of patients into two groups based on the amplitude of change between the two time points. Group 1 was differentiated from Group 2 by lower values of Fea1 and Fea2 at TP1 and smaller changes in Fea1 and Fea2 between TP1 and TP2 ( Figure 4). Most (95%) Group 1 patients had a disease duration of <8 years at TP1 whereas in Group 2, the disease duration was >8 years. The rate of change (increase\ decrease) in the speech parameters (Fea1\ Fea2) indicated that the speech impairments progressed at a different rate in CA patients who were in their early years of the disease (Group 1) as compared to those in their later years of the disease (Group 2). In contrast, VOLUME 9, 2021 the respective SARA speech scores showed no association with the disease duration, with 11/ 37 CA patients indicating the same SARA speech scores at the two timepoints. This suggested that the proposed method is more sensitive than clinical rating in assessing change in CD over 2 years.
The proposed framework demonstrated promising results with most (∼96%) observed values falling within the 95% credible intervals in the training cohort. Multivariate outcomes showed improved performance in prediction compared to univariate models. Notably, to date, the estimation of multivariate outcomes, their changes in trajectory and their joint modeling have not been rigorously studied. Such studies will contribute to a deeper understanding of the heterogeneity of CA and its progression in each manifested domain (for example, speech, as in the present study). Additionally, this work could cross-examine questions that are clinically and translationally relevant; in particular, designing a patient-specific scheme to predict disease progression using multi-domain (for example, kinematic data extracted from upper-limbs, lower-limbs, gait and balance) longitudinal objective metrics.

V. CONCLUSION
In conclusion, a multivariate mixed-effect model-based framework was applied to predict the progression of CD by jointly incorporating the correlations among multiple outcomes and the correlations among repeated measurements. This is the first study to evaluate the changes in multiple clinically relevant quantitative acoustic parameters over time to predict the progression of Ataxic Speech. The predictive performance of joint multivariate modeling was analysed and compared to univariate modeling using the training cohort. The non-inferiority of the joint multivariate modeling in terms of bias and PED showed the potential of the proposed model for estimating the rate of progression of CD in a clinical environment. LAURA POWER is currently pursuing the Ph.D. degree in future approaches to the cerebellar examination with The University of Melbourne. She is currently with the Royal Victorian Eye and Ear Hospital and in private practice with Dizzy Day Clinics, Melbourne, Australia. She is a Physiotherapist with a particular interests include cerebellar and vestibular disorders. DAVID J. SZMULEWICZ is a Neurologist, a Neuro-otologist, and a Medical Researcher. He is currently the Head of Balance Disorders and Ataxia Service, Royal Victorian Eye and Ear Hospital, and the Cerebellar Ataxia Clinic, Alfred Hospital. He is also a Neurologist with the Monash Medical Centre, Friedreich's Ataxia Clinic. He is a Lead Investigator on research defining a novel ataxia-Cerebellar Ataxia With Neuropathy and Vestibular Areflexia Syndrome (CANVAS), a project to develop instrumented ataxia metrics and an objective bedside test of imbalance-the video VVOR. His clinical and research interests include coordination and balance disorders that affect the cerebellum, vestibular systems, and the combination of the two. VOLUME 9, 2021