Artificial intelligence for dysarthria assessment in children with ataxia: a hierarchical approach

Early onset ataxia represents a group of heterogeneous neurological conditions typically characterized by motor disability. Speech problems are one of the core features of ataxic syndromes; hence, the automatic characterization of speech impairment may provide biomarkers for early screening and stratification of patients. The main contribution of this paper is a novel hierarchical machine learning model (HMLM) to improve detection and assessment of dysarthria from a structured speech disturbance test. Performance is tested on a new audio dataset containing 10-second recordings of the standardized clinical PATA test for 55 subjects: 18 healthy subjects and 37 with ataxia. Results show that the proposed HMLM achieves an accuracy of about 90% at the first level (healthy vs patients) by selecting an optimal subset of conventional features. In cascade, at the second level, speech disturbance severity (Low vs High) is assessed using a deep learning feature extraction technique based on a pre-trained VGG network, with a maximum accuracy of about 80%. Both levels are processed through the majority voting ensemble technique, testing Support Vector Machine (SVM), k-Nearest Neighbors (kNN), Decision Tree (DT) and Naïve Bayes (NB) classifiers. In our results, the HMLM considerably outperforms a single machine learning or deep learning model. These outcomes indicate that the investigation of the PATA speech test through HMLM is very promising. We also observed that conventional feature extraction techniques with machine learning modeling seem to be a good solution for the diagnosis of patients with ataxia, while the deep learning approach is more appropriate for stratifying the severity of dysarthria.


I. INTRODUCTION
Early onset ataxia (EOA) represents a heterogeneous group of neurological disorders, with inherited or acquired aetiology and usually with onset before 25 years of age [1]. Depending on the clinical progression, they can be divided into progressive ataxias (PAs) and congenital non-progressive ataxias (CAs), which include entities with different aetiology, phenomenology, and prognosis [1]-[7]. Regardless of such clinical features, and although rare (estimated European prevalence 26/100,000) [8], EOAs are responsible for relevant disability and high costs, since no effective treatment is yet available [9]-[11]. Patients suffer from many neurological disturbances responsible for severe physical limitations, which negatively affect their wellbeing [9]. Generally, ataxia is characterized by coordination disturbances affecting walking, standing and voluntary movements of the upper limbs. Moreover, patients can manifest speech disturbances that can be responsible for communicative and social limitations, significantly decreasing patients' quality of life [9]. Indeed, dysarthria (motor difficulties in speech) and dysphagia (motor difficulties in swallowing) are frequent signs in ataxic syndromes. The number of experimental trials, covering both potential disease-modifying treatments [12] and symptomatic interventions (physical therapy or neuromodulation) [13], is significantly increasing in the ataxia field. Hence, there is an urgent need for specific and reliable biomarkers, both for early stratification of patients and for accurate monitoring and follow-up. Assessment of patients with ataxia currently relies on clinical scores, such as the Scale for the Assessment and Rating of Ataxia (SARA) [14]. Speech can be assessed by perceptual tests, where an expert listener rates 21 parameters of speech considering prosody, respiration, phonation, resonance, intelligibility, naturalness and articulation.
A complete evaluation of dysarthria in Friedreich's ataxia was reported by Folker et al. in 2010 [15]. Some limitations of clinically-based ataxia rating methods are rater variability, ceiling and floor effects [1], [16], [17] and loss of accuracy, particularly in the pediatric age [17], [18]. In recent years, technology has provided promising help, offering reliable, objective, accurate and continuous outcome measures in both conventional and "telemedicine" settings [19]-[26]. An objective assessment of speech could represent a potential source of biomarkers. Indeed, it has been shown in several neurodegenerative diseases that there is a relationship between oral motor deficits and CNS integrity [15], [27]-[29]. Accordingly, objective measures of speech have been suggested as meaningful information on the patient and on health-related quality of life in clinical trials [30], [31]. However, research still lacks natural history studies of speech disturbances in patients with ataxia [15], [31], [32]. As already suggested in [35], artificial intelligence has shown very promising results for assessing dysarthria in children with ataxia. A fundamental step is the extraction of all features that can be used as input parameters in disorder characterization systems [33], [34]. The aim is to identify the relevant information contained in the speech signal. Binary classifiers are commonly used to distinguish the pathological from the healthy condition [35], [36]. For instance, Rudzicz et al. [35] employed feed-forward artificial neural networks (ANNs), and SVMs with phonological features have been used to design discriminative models for dysarthric speech. A binary classifier based on Mahalanobis distance and discriminant analysis was developed for dysarthria severity classification [37], achieving 95% accuracy.
An automatic intelligibility assessment system that performs a binary classification by capturing atypical variation in dysarthric speech, using linear discriminant analysis (LDA), k-nearest neighbor (KNN) and SVM classifiers, was proposed in [36], with an accuracy of 68%, 66% and 70% respectively. In [33], four levels of intelligibility were recognized with an accuracy between 40% and 50%, using SVMs and testing different feature sets. Moreover, the combination of statistical Gaussian mixture models (GMMs) and ANNs was used in [38], achieving an accuracy of 86% over three severity levels. Speaker identification (97.2%) and severity level assessment (93.2%) showed the best performance using SVMs and hybrid GMM/SVM systems in [34]. Existing studies were carried out using the few available dysarthric speech databases, such as TORGO [39] and NEMOURS [40]. Both of these databases include few subjects (no more than 15) with different levels of dysarthria, due to various conditions such as cerebral palsy (CP), head trauma (HT) and amyotrophic lateral sclerosis (ALS). They are composed of short sentences and words, or of acoustic and articulatory features extracted from them. The lack of suitable and sufficient data is one of the biggest limitations in the analysis of speech and verbal communication disorders. Moreover, in our specific case of ataxia, the design and collection of a suitable database is a critical issue, since ataxia is a rare group of genetic disorders and there are constraints such as recording conditions, patient availability, and approval by health agencies.
Here we developed a tool aimed at automatically recognizing ataxic syndromes. For this purpose, we collected recordings of patients' speech disturbance assessment, made through the standardized clinical PATA speech test of the SARA scale. To our knowledge, this is the first study dealing with artificial intelligence for the assessment and stratification of severity of dysarthria in ataxia through a standardized clinical speech test. We developed a novel HMLM based on a fusion of conventional and deep learning features to automatically discriminate healthy subjects from patients and quantify the level of speech disturbance. Results demonstrate that the use of two binary artificial intelligence models in cascade outperforms a single machine learning or deep learning classifier. The results obtained are encouraging and highlight the validity of HMLM with mixed conventional and deep learning features to recognize ataxia and stratify the level of severity of dysarthria. However, an extensive validation phase on a greater number of subjects is needed. We plan to continue testing a larger number of subjects to validate the HMLM applied to the PATA speech test as a tool to support clinicians in optimizing screening, clinical tests and personalized treatments.

A. OVERALL ARCHITECTURE
The HMLM model is the main component developed and tested for the assessment of ataxia. Given pre-processed "PA-TA" speech data, the first level of machine learning (ML) processes and discriminates healthy vs patients. Once the speech disease is detected the second level of ML assesses the severity of dysarthria. The overall system is detailed in Fig. 1, which shows the data flow and indicates which features and ML are selected and tested for each level to achieve the best performance. All these elements are detailed in the following sections.
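The data flow of the two levels can be summarized with a minimal sketch. It assumes that each level is available as a callable returning a Boolean; the two toy rules, their thresholds and the feature names `PATAfreq` and `deep_0` used as dictionary keys are hypothetical placeholders for illustration, not the trained models of this work.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def hierarchical_predict(
    level1_features: Features,
    level2_features: Features,
    is_patient: Callable[[Features], bool],
    is_high_severity: Callable[[Features], bool],
) -> str:
    """Route a subject through two binary classifiers in cascade."""
    # Level 1: healthy vs patient, evaluated on the conventional features.
    if not is_patient(level1_features):
        return "healthy"
    # Level 2: reached only by subjects flagged as patients.
    return "high" if is_high_severity(level2_features) else "low"


# Hypothetical stand-ins for the trained level-1 and level-2 models.
def toy_level1(features: Features) -> bool:
    return features["PATAfreq"] < 3.0  # illustrative threshold only


def toy_level2(features: Features) -> bool:
    return features["deep_0"] > 0.5  # illustrative threshold only
```

The second classifier never sees subjects rejected at the first level, which is why it can be trained on patients only.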

1) PATIENTS ENROLLMENT
The study population was recruited in 2018 at the Movement Analysis and Robotics laboratory (MARlab) of the Intensive Neurorehabilitation and Robotics Departments of IRCCS Bambino Gesù Children's Hospital (Rome, Italy). Overall, it is composed of 55 subjects: 18 healthy (H), 21 with Progressive Ataxia (PA) and 16 with Congenital non-Progressive Ataxia (CA). The H group included sex/age-matched healthy volunteers without personal or family history of neurological diseases and with no signs at clinical examination (age 12 [7.6]; 12F/6M). All patients had a genetically confirmed diagnosis and a routine diagnostic workup, including general and neurological examination, brain MRI, sensory evoked potentials, nerve conduction study and visual acuity evaluation; moreover, they had been in follow-up at the MARlab for at least 2 years, to ensure a correct group classification. None of the enrolled subjects had relevant cognitive impairment or was taking psychoactive drugs (other usual medications, such as vitamins or antioxidants, were allowed). Patients with severe disability or moderate-to-severe cognitive impairment affecting test execution were excluded. Demographic data were collected for the three groups. The research conformed to the ethical standards laid down in the 1964 Declaration of Helsinki. All subjects participated on a voluntary basis, after they or their legal guardians signed the informed consent (the study was approved by the local ethical committee, Protocol NET-2013-02356160 WP3, nr. 1619-2018, received 03 July 2018).

2) EXPERIMENTAL SETUP
After receiving a clinical evaluation, all 55 subjects were asked to perform the "PATA" test in a quiet room. Each vocal task was recorded with SaraHome, a novel technology for the assessment at home of patients with ataxia symptoms [41], using the microphone array mounted on the Microsoft Kinect V2 for 10 seconds at a sampling frequency (Fs) of 16 kHz. Each subject was asked to repeat the word "PATA" as many times as possible in 10 seconds, as reported in [42], [43]. At the end of each task, speech disturbance was scored by expert personnel using a standardized clinical scale: SARA [14]. For each patient with CA and PA, the same test was repeated after 12 months (time t1) to monitor the possible evolution of disturbances. It was possible to repeat the test only for 21/34 patients (12 PA and 9 CA). For this reason, a total of 76 audio recordings were considered. All the data were analyzed using Matlab version 2020 (Mathworks, Natick MA).

C. SIGNAL PRE-PROCESSING AND "PA-TA" SEGMENTATION
Sometimes the collected data were affected by background noise such as external voices, door slamming sounds or environmental noises; therefore, a step of pre-processing and clean-up was necessary. Initially, we evaluated the average signal spectrum (Short Time Fourier periodogram) to detect the frequency range of interest (Fig. 2). Since the patients' voice repeating "PA-TA" was mostly below 1 kHz, in order to reduce the noise above this frequency we applied an eleventh-order low-pass Chebyshev filter with a cutoff frequency of 1 kHz and a Hanning window with length equal to 0.5% of Fs. Afterwards, we applied a method based on fine-tuned thresholds on short-term energy and spectral spread [44] to detect speech boundaries and remove the remaining noise. The envelope of each signal was extracted by first taking the modulus of the Hilbert Transform and then applying a zero-phase moving-average filter whose parameters were tuned according to the signal approximate entropy. If the signal approximate entropy was lower than the empirical threshold of 0.8, a single moving-average filter was applied to the Hilbert Transform; otherwise, two cascaded filters were employed, as reported in Table I. The main steps of signal preprocessing are shown in Fig. 3. After these steps, "PA" and "TA" peaks were detected from the envelope by selecting only maxima with a minimum prominence equal to 10% of the absolute value of the Hilbert Transform mean. Signal minima, instead, were identified on the signal energy, keeping only minima with a prominence of at least 0.01 and at least 10% of the sampling frequency apart.
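The envelope smoothing and peak-picking steps can be illustrated with a minimal pure-Python sketch. It assumes that the envelope (modulus of the Hilbert Transform of the filtered signal) has already been computed and is passed in as a list; the Chebyshev filtering and the Hilbert Transform are not reproduced here. The smoothing is a plain centered moving average, the window width is a free parameter (the values of Table I are not assumed), and the function names are illustrative.

```python
def moving_average_zero_phase(x, width):
    """Centered moving average (no phase shift); the window shrinks at the borders."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def prominence(x, i):
    """Height of the peak at index i above the higher of its two neighbouring valleys."""
    left_min = x[i]
    j = i - 1
    while j >= 0 and x[j] <= x[i]:  # walk left until a higher sample is met
        left_min = min(left_min, x[j])
        j -= 1
    right_min = x[i]
    j = i + 1
    while j < len(x) and x[j] <= x[i]:  # walk right until a higher sample is met
        right_min = min(right_min, x[j])
        j += 1
    return x[i] - max(left_min, right_min)


def pick_peaks(envelope, rel_prominence=0.10):
    """Indices of local maxima whose prominence is at least
    rel_prominence times the absolute mean of the envelope."""
    threshold = rel_prominence * abs(sum(envelope) / len(envelope))
    peaks = []
    for i in range(1, len(envelope) - 1):
        is_local_max = envelope[i - 1] < envelope[i] >= envelope[i + 1]
        if is_local_max and prominence(envelope, i) >= threshold:
            peaks.append(i)
    return peaks
```

On a toy envelope with two clear syllable peaks and a small ripple between them, only the two prominent maxima are retained, which is the behaviour the prominence criterion is meant to provide.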

D. FEATURE EXTRACTION AND SELECTION FOR MACHINE LEARNING
Audio signals were segmented in order to increase the statistical significance of the dataset in terms of inter-subject and inter-class variability [45]. By windowing the signals, it was possible to assume their quasi-stationarity within each frame, easing the subsequent analysis [46]. Since the performance of the system depends largely on noise reduction among peaks and on the selection of useful acoustic events only (Fig. 4), the segmentation was carried out using the "PA" and "TA" peaks as reference points and, for each PA-TA cycle, the samples between the closest preceding and following minima. After performing audio segmentation, we investigated the most relevant conventional features for our targeted application. In the literature, feature extraction in audio processing is considered quite challenging because of several factors, such as the simultaneous presence of different sound sources and background noises that may affect machine performance [46]. These characteristics are considered in order to identify the most reliable parameters. Fig. 1 lists all the extracted features, grouped by time domain (PATAfreq, Approximate Entropy), frequency domain (spectral values, mfcc and gtcc coefficients), chaotic domain (Lyapunov Exponent) and Age of children. All the features were extracted from each PA-TA cycle and then the average value for each subject was calculated. PATA frequency (PATAfreq) is a simple time domain physical feature, computed directly from the temporal envelope of the signals in order to assess a fundamental parameter of our specific task, according to the following equation: $$\mathrm{PATA}_{freq} = \frac{N_{peaks}}{l}$$ where $N_{peaks}$ is the total number of recognized peaks and $l$ is the length of the signal. Approximate Entropy is calculated to measure the complexity and possible fluctuations of the signals [47] and for its ability to discriminate human voice components from corrupted speech [48].
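The two time-domain features can be illustrated with a short sketch. It assumes that the signal length is expressed in seconds for PATAfreq and uses the standard definition of approximate entropy with self-matches counted; the embedding dimension `m = 2` and the absolute tolerance `r = 0.2` are assumed defaults, not values reported in this work.

```python
import math


def pata_frequency(n_peaks, duration_s):
    """Number of recognized peaks per unit of signal length (here, seconds)."""
    return n_peaks / duration_s


def approximate_entropy(x, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) = phi(m) - phi(m + 1).

    Templates of length m are compared with the Chebyshev (max) distance;
    a pair matches when the distance is <= r. Self-matches are counted,
    so the logarithm is always defined.
    """
    n = len(x)

    def phi(order):
        templates = [x[i:i + order] for i in range(n - order + 1)]
        total = 0.0
        for a in templates:
            matches = sum(
                1 for b in templates
                if max(abs(u - v) for u, v in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)
```

A perfectly regular sequence yields a value close to zero, while an irregular one yields a clearly larger value, which is the property exploited when the entropy is used to tune the envelope filter and as a feature. The quadratic cost of the template comparison is acceptable for short PA-TA cycles but would need vectorization for long recordings.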
Lyapunov Exponent is calculated to account for the non-linearity of speech [49]-[53]. Frequency domain features, conventionally used in many applications [54]-[55], are the most widely described in the literature. These variables are intended to describe the physical properties of the signal frequency content and they cover a large number of different categories. Among this wide range of possibilities, we computed the features listed below. In all the following equations, $f_k$ is the frequency in Hz and $s_k$ is the spectral value corresponding to bin $k$, while $b_1$ and $b_2$ are the band edges, in bins, over which each feature is calculated.
• Mel-Frequency Cepstral Coefficients (MFCCs): They are one of the most popular features employed in speech processing. They constitute the mel-frequency cepstrum (MFC), a compact representation of the short-term power spectrum of an audio signal, obtained through a linear cosine transform of the log power spectrum on the nonlinear mel frequency scale [56].
• Gammatone Cepstral Coefficients (GTCCs): They are a biologically inspired modification of MFCCs and are obtained by applying Gammatone filters with equivalent rectangular bandwidth bands [57].
• Spectral Centroid: It can be considered the barycenter of the spectrum and indicates where most of the signal energy is contained [55], [58]: $$\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k s_k}{\sum_{k=b_1}^{b_2} s_k}$$
• Spectral Spread: It is a measure of the spread of the spectrum around its mean value, where $\mu_1$ is the spectral centroid [56], [59]: $$\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^2 s_k}{\sum_{k=b_1}^{b_2} s_k}}$$
• Spectral Skewness: It is a measure of the asymmetry of the spectrum around its mean value and is computed from the 3rd order moment, where $\mu_1$ is the spectral centroid and $\mu_2$ is the spectral spread: $$\mu_3 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^3 s_k}{\mu_2^3 \sum_{k=b_1}^{b_2} s_k}$$ A skewness equal to 0 indicates a symmetric distribution, a skewness < 0 indicates more energy on the right and a skewness > 0 indicates more energy on the left [60].
• Spectral Kurtosis: It gives a measure of the flatness of the spectrum around its mean value and indicates a possible nonstationary or non-Gaussian behavior in the frequency domain. It is the 4th order moment and is computed starting from the short-time Fourier Transform of the signal S(t,f): $$\mu_4 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^4 s_k}{\mu_2^4 \sum_{k=b_1}^{b_2} s_k}$$
• Spectral Slope: It represents the amount of decrease of the spectral amplitude: $$\mathrm{slope} = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_f)(s_k - \mu_s)}{\sum_{k=b_1}^{b_2} (f_k - \mu_f)^2}$$ where $\mu_f$ is the mean frequency and $\mu_s$ is the mean spectral value [64].
• Spectral Decrease: It also represents the amount of decrease of the spectral amplitude, but it was defined from perceptual studies to be more correlated with human perception: $$\mathrm{decrease} = \frac{\sum_{k=b_1+1}^{b_2} \frac{s_k - s_{b_1}}{k - 1}}{\sum_{k=b_1+1}^{b_2} s_k}$$
• Spectral RolloffPoint: It is the frequency below which 95% of the signal energy is contained, that is, the frequency of the bin $i$ such that [65]: $$\sum_{k=b_1}^{i} s_k = 0.95 \sum_{k=b_1}^{b_2} s_k$$
• Spectral Flatness: It is a measure of the noisiness/sinusoidality of a spectrum and is computed as the ratio between the geometric mean and the arithmetic mean of the energy spectrum: $$\mathrm{flatness} = \frac{\left(\prod_{k=b_1}^{b_2} s_k\right)^{\frac{1}{b_2 - b_1}}}{\frac{1}{b_2 - b_1} \sum_{k=b_1}^{b_2} s_k}$$ For tonal signals it is close to 0, for noisy signals it is close to 1 [66].
• Spectral Crest: It is also a measure of the noisiness/sinusoidality of a spectrum, but it is computed as the ratio between the maximum value within the band and the arithmetic mean of the energy spectrum: $$\mathrm{crest} = \frac{\max_{k \in [b_1, b_2]} s_k}{\frac{1}{b_2 - b_1} \sum_{k=b_1}^{b_2} s_k}$$
• Spectral Entropy: It describes the complexity of the distribution [67]: $$\mathrm{entropy} = \frac{-\sum_{k=b_1}^{b_2} s_k \log(s_k)}{\log(b_2 - b_1)}$$
• Pitch: It is the fundamental frequency of the audio signal, such that its integer multiples best explain the content of the signal spectrum [68]-[71].
In the pitch computation, s is a single frame of audio data with N elements and M is the maximum lag [72], [73]. Once the features had been extracted, the next step was to eliminate redundant variables, preserving the amount of information while increasing computational speed and performance [74], [75]. Within each group of highly correlated variables (Spearman correlation ≥ 0.75 [74], [75]), the variables least correlated with the classification output were removed, as shown in Fig. 5, and min-max normalization of the features was applied. After this step, the features were ranked individually according to the predictor importance score obtained with chi-square tests [76]-[79]. The optimal subset of features was then defined by taking the largest difference between consecutive scores as the break-point. Finally, the best combination was obtained by selecting 6 main features (mfcc3, Age, PATAfreq, Spectral Centroid, Spectral Kurtosis, mfcc8), as shown in Fig. 6. We tested different feature selection techniques and obtained comparable results, so we chose the technique with the lowest computational cost.
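The redundancy filter and the break-point rule can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the Spearman coefficient is computed as the Pearson correlation of average ranks, redundant pairs are resolved greedily in feature order, and the chi-square importance scores are assumed to be already available (they are passed in, not computed). Function names and the example scores are hypothetical.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(a: List[float], b: List[float]) -> float:
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def spearman(a: List[float], b: List[float]) -> float:
    return pearson(ranks(a), ranks(b))


def drop_redundant(features: Dict[str, List[float]], target: List[float],
                   threshold: float = 0.75) -> Dict[str, List[float]]:
    """For each highly correlated pair, drop the feature less correlated with the target."""
    kept = dict(features)
    names = list(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(spearman(features[a], features[b])) >= threshold:
                weaker = min((a, b), key=lambda n: abs(spearman(features[n], target)))
                del kept[weaker]
    return kept


def select_by_largest_gap(scores: Dict[str, float]) -> List[str]:
    """Keep the features ranked above the largest drop between consecutive scores."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    gaps = [ordered[i][1] - ordered[i + 1][1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [name for name, _ in ordered[:cut]]
```

With made-up scores such as `{"mfcc3": 9.0, "Age": 8.5, "PATAfreq": 8.0, "x": 2.0, "y": 1.5}`, the largest drop lies between the third and fourth score, so the first three features are retained.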

E. FEATURE EXTRACTION AND SELECTION WITH DEEP LEARNING
Deep Learning Networks are complex architectures used to detect specific features directly from data. They can have hundreds of layers and a huge number of parameters, such as weights and biases, to be learned. Training a deep architecture from scratch to extract specific features while avoiding overfitting requires a large amount of data (hundreds, thousands or even millions of samples, depending on the application), resulting in high computational and time costs. Generally, the training time depends on many factors, such as the number of epochs, the dataset size and the computational power, and reaching a given accuracy may take even months. Usually, GPUs are employed to speed up the process. In many real applications, it is difficult and expensive to obtain training data that match the feature space and the predicted distribution characteristics of the test data. Therefore, in practice there is a need to create a high-performance learner for a target domain trained from a related source domain. This is the motivation for transfer learning [80]: leveraging a pre-trained network that has already learned many features on a large dataset and specializing the model on a new, similar task [45], [81], [82]. There are two main techniques for Transfer Learning: • Fine Tuning: the approach of fine-tuning the deeper layers of the pre-trained network on the new dataset is typically much faster and easier than training the model from scratch. Although it requires less data and fewer computational resources [83], the new dataset must be large enough and similar to the one used for pre-training. • Feature Extraction: a more specialized method in which the data of the new dataset are passed only once through the pre-trained network and features are then extracted from one of the pooling layers of the network. These features are then used to train a Machine Learning model, such as a Support Vector Machine.
This technique is the most suitable for small datasets. In this work, the Transfer Learning approach with Feature Extraction was used because of the limited size of the dataset. In particular, we chose the pre-trained VGGish Convolutional Neural Network (CNN) [84], [85], developed by Google and inspired by the well-known VGG networks used for image classification. Its structure consists of a series of convolution and activation layers, optionally followed by a max pooling layer. The VGGish CNN contains 17 layers in total and is designed for audio classification tasks. Originally, it was employed to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. In our method, signals were first segmented starting from the first "PA-TA" peak, considering windows with a length of 1 second and 50% overlap. Then, they were preprocessed to obtain the format required by the network. In particular, they were resampled to 16 kHz and a one-sided short-time Fourier transform was computed, keeping only the magnitude of the complex spectral values and discarding the phase. Finally, the Mel spectrogram was calculated and converted to a log scale. Overlapped segments of 96 spectra were given as input to the network. Activations of the pooling layer "pool 4" were extracted as features to train machine learning models. We selected "pool 4" since it was the most discriminative pooling layer of the pre-trained VGGish CNN [84], [85]. The choice of the pooling layer depends on the similarity between the dataset of the pre-trained model and the dataset of the new application. Since deeper layers extract higher level features while earlier layers extract lower level ones, the more similar the datasets are, the deeper the appropriate layer is [45], [81], [82]. The structure of the VGGish CNN is reported in detail in Table II.
The flowchart of the features extracted by the VGGish from each data frame of one subject is reported in Fig. 7. We extracted 12288 features from layer "pool 4". After the feature extraction step, we selected the best combination of 1444 deep features using the same approach described in Section D for ML. Two variables, PATAfreq and Age, were also added for their high predictive power.
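The windowing that precedes the network input can be sketched as follows: 1-second windows with 50% overlap, starting from the sample index of the first "PA-TA" peak. The function name and its parameters are illustrative; only complete windows are returned, which is an assumption, since the handling of the final partial window is not specified in the text.

```python
def window_bounds(n_samples, fs=16000, start=0, win_s=1.0, overlap=0.5):
    """Return (begin, end) sample indices of overlapping analysis windows.

    n_samples: total length of the recording in samples
    start:     index of the first "PA-TA" peak (0 if unknown)
    Only windows that fit entirely inside the signal are returned.
    """
    win = int(round(win_s * fs))             # window length in samples
    hop = int(round(win * (1.0 - overlap)))  # step between window starts
    return [(b, b + win) for b in range(start, n_samples - win + 1, hop)]
```

For a full 10-second recording at 16 kHz with the first peak at sample 0, this yields 19 windows of 16000 samples each, spaced 8000 samples apart.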

F. CLASSIFICATION
The classification task was conducted by processing audio signals as input to a hierarchical model which discerns healthy subjects, low severity patients and high severity patients, using the Speech Disturbance score of the SARA Scale as clinical output. As shown in Fig. 1, we defined binary labels for each level: the first level discriminates subjects with ataxia from healthy subjects, and the second level, trained only on patients, recognizes speech disturbance severity (Low [0-1] vs High [2-3]). Speech Disturbance is one of the eight items that compose the SARA scale. It has a score between 0 (normal) and 6 (anarthria), assigned by listening to word intelligibility [14]. In our dataset, since the enrolled subjects do not cover the full range of the score, we decided to define the labels considering the maximum observed value of 3, in order to obtain a balanced dataset. Detailed information about the dataset is summarized in Table III. The classification step was performed by testing four of the most conventional classifiers, Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Naïve Bayes (NB) and Decision Tree (DT), and adopting the majority voting ensemble technique [86]. In our approach, we used two binary levels of classifiers. The first level discriminates healthy subjects from patients and the second level assesses the speech disturbance severity (Low vs High). We tested the best combination of features extracted with machine learning and deep learning approaches for the HMLM, and compared it with a flat classification approach (a parallel multi-classifier with three classes: Healthy vs Low severity vs High severity). Cross-validation techniques such as 5-fold, 10-fold and leave-one-out were applied to check for overfitting and to avoid data selection bias. Finally, the majority voting ensemble technique was used to aggregate the outputs of the single audio frames into the related subjects.
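The two uses of majority voting described above can be sketched in a few lines. The sketch assumes that the classifier outputs are first combined per audio frame and that the resulting frame labels are then combined per subject; this ordering, the alphabetical tie-break and the function names are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label; ties are broken alphabetically (an arbitrary choice)."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, count in counts.items() if count == top)[0]


def subject_prediction(frame_votes):
    """frame_votes holds, for each audio frame of one subject, the list of
    labels produced by the individual classifiers (e.g. SVM, k-NN, NB, DT)."""
    frame_labels = [majority_vote(votes) for votes in frame_votes]
    return majority_vote(frame_labels)
```

With an even number of classifiers, ties at the frame level are possible, so an explicit tie-breaking rule is needed in any implementation.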

G. PERFORMANCE METRICS
Classification performance was assessed using Accuracy, Precision, Recall and F1-Score [87]. These metrics are summarized in Table IV. For the HMLM, the employed definitions of Precision, Recall and F1-Score discriminate and weight differently each type of misclassification error, taking into account the output of each level instead of just the final one [87], [88]. As regards accuracy, we report the result for each level and the overall one. Since cross-validation was carried out, we computed the correct and incorrect predictions of each class for each fold and summed them up at the end of all iterations before calculating the performance measures. In Table V, we report the performance of machine learning with canonical features, deep learning features and their combination for level 1 (healthy vs patients) and level 2 (low vs high severity). All the models were tested with ensemble majority voting and three different cross-validation techniques (5-fold, 10-fold and leave-one-out). No significant differences were found between the performance metrics of the three cross-validation methods. The combination of machine learning (level 1) and deep learning (level 2) approaches achieved the best results, discriminating patients with ataxia from healthy individuals with a mean accuracy of approximately 90% and identifying the severity of ataxic speech disorders with an accuracy of about 80%. With 5-fold cross-validation (see Table V for the details of 10-fold and leave-one-out), we achieved for level 1 and level 2, respectively, a precision of 93.44% and 78.55%, a recall of 98.28% and 79.50% and an F1-score of 95.80% and 79.02%. While the overall precision, recall and F1-score achieved by the model are 84.67%, the overall accuracy obtained is 76.32%. Detailed information about the collected dataset and the classification output is reported in Table VI and in Fig. 8, which shows the leave-one-out confusion matrix.
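The pooling of fold-wise counts before computing the metrics can be sketched as follows, using the standard binary definitions of precision, recall, F1-score and accuracy. The hierarchical variants of [87], [88] are not reproduced here, and the dictionary layout of the counts is an assumption of this sketch.

```python
def pooled_counts(folds):
    """Sum true/false positives and negatives over all cross-validation folds."""
    total = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for fold in folds:
        for key in total:
            total[key] += fold[key]
    return total


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics(counts):
    """Precision, recall, F1-score and accuracy from pooled counts."""
    precision = counts["tp"] / (counts["tp"] + counts["fp"])
    recall = counts["tp"] / (counts["tp"] + counts["fn"])
    accuracy = (counts["tp"] + counts["tn"]) / sum(counts.values())
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "accuracy": accuracy,
    }
```

As a consistency check on the reported per-level values, `f1_from(0.9344, 0.9828)` gives about 0.9580 and `f1_from(0.7855, 0.7950)` gives about 0.7902, in agreement with the F1-scores of 95.80% and 79.02% stated above.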
We observed that healthy subjects were never classified as patients with high severity, although they were sometimes confused with low severity patients. This is because many of these patients had a speech disturbance score of 0, like healthy subjects, so the two classes partially overlapped. For the same reason, the network made a few mistakes in distinguishing patients with low and high severity. Table VII reports the results achieved with the flat multi-class approach. In this case, the use of a single level with three classes reached a maximum overall accuracy of about 65% with the deep learning approach.

IV. DISCUSSION
The aim of this study was to exploit artificial intelligence methods, such as machine learning and the more recent deep learning approaches, to explore and identify new useful strategies from speech analysis and to develop innovative, reliable and accurate tools for supporting clinical practice in the field of ataxia assessment and treatment. We explored the possibility of training automatic predictive models able to identify, from audio recordings, the presence of ataxic syndromes and to classify their severity. To the best of our knowledge, this is the first time that an HMLM has been applied to the assessment of ataxic disorders with a particular focus on the standardized speech-based "PA-TA" test. The hierarchical approach was investigated in comparison with the flat multi-class approach. The "PA-TA" signal was preprocessed and segmented, and the HMLM was implemented and tested using two binary classifiers in cascade and adopting the ensemble majority voting technique. We investigated the performance of each level combining conventional and deep learning models. Three combinations of models (machine learning, deep learning, and machine learning + deep learning) were evaluated using three cross-validation approaches, as shown in Table V. In Table VII we reported the performance of the flat multi-class approach. The results of the HMLM showed that conventional features work better at the first level to classify healthy subjects vs patients, while at the second level the method based on transfer learning features was more suitable to assess the severity of dysarthria. Moreover, the performed experiments demonstrate that the HMLM outperforms the conventional flat classification approach, exhibiting a higher overall accuracy (76.32% vs 65.58%, 69.74% vs 65.89% and 71.05% vs 65.79% for 5-fold, 10-fold and leave-one-out respectively). Furthermore, the similarity of the results among the three cross-validation techniques indicates the robustness of our approach.
The employed dataset was affected by some limitations, such as a relatively small number of available subjects, due to the rarity of ataxic syndromes, and the limited variability of the speech disturbance score, which covered only the lower part of the [0-6] severity range, as shown in Table VI. In this scenario, we observed that the HMLM copes with these aspects, performing much better than the widely adopted flat multi-class approach. Despite these limitations, it is also important to note that the collected structured dataset is the first and the largest released to date, and there are few works in this field [33]. As evidenced by the cross-validation confusion matrix (Fig. 8), most of the errors occurred because the same clinical scores were sometimes shared by different classes, such as healthy and low severity or low and high severity of dysarthria (see also Table VI). This aspect emphasizes how difficult it is for clinicians to score subtle changes in dysarthric speech. These issues highlight the need for larger training datasets for AI-based automatic score annotation. Extensive enrollment of patients will increase the statistical variability of severity and the possibility of identifying more homogeneous etiopathogenic classes.

V. CONCLUSIONS
This study provided initial evidence on the reliability of digital biomarkers based on speech assessment in the field of ataxia. Specifically, we demonstrated that analysis of the "PA-TA" test can provide several variables able to accurately classify subjects according to their condition (patient or control). From a clinical perspective, these findings have several substantial implications. First, we introduced a panel of novel objective parameters for clinical evaluation in both observational and interventional contexts, which might prove useful as outcome measures. Second, the source of the biomarkers (namely, voice and speech) is such that it may cover patients with ataxia at every disease stage, from early subclinical to very advanced, overcoming some limitations of the current assessment systems and being particularly suitable for experimental trials. Finally, such biomarkers fit well with the need to implement telemedicine [89], since voice can now be recorded at a distance with commercial devices, allowing remote monitoring. These results encourage the spread of artificial intelligence to meet the need for quantitative assessment of disturbances in children with ataxia. An objective evaluation of what is clinically relevant in the disease will also contribute to obtaining reliable results in clinical trials. The association between home-based treatment and devices for the remote monitoring of patients could play a crucial role, in particular with regard to efficacy [90], while decreasing costs and stress for both patients and their families. Indeed, the HMLM trained and described in this work could be a powerful telemedicine tool for initial screening and monitoring in the field of ataxic syndromes, since it requires only an audio recording to assess the condition of the subject.
In her current research position at the CNR, she leads a research group developing and carrying out novel experimental paradigms to examine young children with autism and to track their development in all domains. In the last few years, she has started a new research line aimed at exploring the efficacy and feasibility of an early parent-mediated intervention based on ESDM, implemented through tech-enabled remote monitoring (telehealth).

MARTINA FAVETTA graduated in Developmental Therapy at the University of Rome "La Sapienza" in 2014. She received a master's degree in Rehabilitation Science at the University of Rome "La Sapienza" in 2017. She is currently a researcher at the Movement Analysis and Robotics Laboratory (MARlab), Department of intensive Neurorehabilitation and Robotics, Bambino Gesù Children's Hospital, IRCCS, Rome, Italy. Her current research interests include neuroscience, robotics and pediatric neurorehabilitation. She conducts assessments with a 3D gait analysis system and clinical assessments in several neuromotor diseases and child disabilities.

Detailed performance metrics of HMLM for each level (1-2) in cascade combining machine learning, transfer learning and machine learning + transfer learning respectively. Each parameter has been extracted using the ensemble majority voting technique with four classifiers (SVM, k-NN, Naïve-Bayes and Decision Tree).

Detailed performance metrics of flat multi-class approach testing machine learning and transfer learning. Each parameter has been extracted using the ensemble majority voting technique with four classifiers (SVM, k-NN, Naïve-Bayes and Decision Tree).