Detection of Speech Impairments Using Cepstrum, Auditory Spectrogram and Wavelet Time Scattering Domain Features

We adopt a Bidirectional Long Short-Term Memory (BiLSTM) neural network and a Wavelet Scattering Transform with Support Vector Machine (WST-SVM) classifier for detecting speech impairments in patients at the early stage of central nervous system disorders (CNSD). The study includes 339 voice samples collected from 15 subjects: 7 patients with early-stage CNSD (3 Huntington, 1 Parkinson, 1 cerebral palsy, 1 post-stroke, 1 early dementia) and 8 healthy subjects. Speech data were collected using the voice recorder of the Neural Impairment Test Suite (NITS) mobile app. Features are extracted from pitch contours, Mel-frequency cepstral coefficients (MFCC), Gammatone cepstral coefficients (GTCC), Gabor (analytic Morlet) wavelets and auditory spectrograms. Accuracies of 94.50% (BiLSTM) and 96.3% (WST-SVM) are achieved on the healthy vs. impaired classification problem. The developed method can be applied to automated CNSD patient health state monitoring and clinical decision support systems, as well as a part of the Internet of Medical Things (IoMT).

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.

Speech impairments have long been known to be among the most common symptoms in HD [2] and PD [3]. Although HD and PD have many different symptoms that are specific to each disease, they present a similar set of deficits expressed in speech, e.g. slow, weak, imprecise, uncoordinated speech (dysarthria) [4], swallowing difficulties (dysphagia) [5], trouble sequencing the sounds in syllables and words (apraxia) [6], and difficulty expressing thoughts orally (aphasia) [7]. Such circumstances (also combined with cognitive impairments) lead to the need for specialized assessment and speech treatment for people with HD or PD. Usually, this treatment is provided by a speech-language pathologist (SLP) who checks for speech dysfunctions. The SLP gives guidelines for maintaining safe swallowing and evaluates speech acceptance criteria, i.e. pitch (degree of voice highness or lowness), loudness (ability of the patient to project his own voice), articulation (ability to pronounce sounds), voice quality (ability to hold pitch properly), respiration (coordination of speech with breathing), resonance (quality of voice that is determined by the balance of sound vibration during speech), and prosody (rhythm, stress and intonation during speaking) [5]-[7].
Current research in the computer science field focuses on replicating the analysis of the SLP with assistive devices [8]-[11], adapting heuristic algorithms [12], [13] and deep learning [14]-[16] for monitoring change in speech patterns, speech recognition and classification [17]-[20]. In addition, wavelet transforms (discrete, continuous, tunable-Q) are successfully utilized for speech impairment monitoring based on voice signal analysis [21], [22]. Here we focus on adopting a bidirectional recurrent neural network (BiRNN) with long short-term memory (LSTM) [23], [24] and the wavelet scattering transform (Gabor) [25] for solving the healthy vs. impaired test subject classification problem based on speech signals. Our approach is based on digitized data collection using the extended self-administered gerocognitive examination (SAGE) [26] methodology via the non-invasive interface of a smart device (mobile phone or tablet) adapted for early-stage CNSD patients.
The structural organization of the paper is as follows. Section II analyses and compares the related work by showing limitations of existing solutions. Section III covers the materials and methods used, i.e. the implementation and working principle of the voice recorder task as part of the proposed neural impairment screening software, the audio file collection procedure, the test subjects involved, and the definition and formulation of feature extraction methods for analysis of the speech signal. Section IV describes two experiments for solving the binary classification problem (healthy vs. impaired). Section V contains the discussion, conclusions and future work.

II. RELATED WORK
Many studies address the detection of speech impairments in patients with central nervous system disorders (CNSD). Gillivan-Murphy et al. [27] use voice recordings (collected in a sound-treated laboratory with ambient noise measured at the 50 dB level, using an AKG-C420 head-mounted microphone) to detect speech tremors in PD. Acoustic analysis was performed with a Voice and Tremor Protocol (VTP), i.e., amplitude of voice, periodicity, rate, and magnitude of frequency signal features. Gaballah et al. [28] investigate subjective and objective assessment of PD speech quality. The analyzed features are derived from the speech recordings (collected with 7 amplification devices) based on cepstral, spectral, and/or temporal parametrization: mel-frequency cepstral coefficients (MFCC) [29], gammatone frequency cepstral coefficients (GTCC) [30], discrete cosine transform (DCT) [31], speech-to-reverberation masking ratio (SRMR) [32], modulation area (ModA) [33] and Low Complexity Quality Assessment (LCQA) [34]. Support vector regression (SVR), Gaussian process regression, machine learning methods and correlation analysis were used, achieving an accuracy of 0.85.
Wu et al. [40] target learning acoustic features (MFCC, spherical K-means, pooling method) to detect PD. All data were captured in a soundproof room and then resampled at a 16 kHz rate. Random Forest (RF) and SVM methods were used for the evaluation of detection accuracy (the best achieved result is 96.37% with the RF classifier). Perez et al. [41] differentiate between healthy controls and HD patients based on acoustic and lexical features (MFCC, GMM, pause, speech rate, Goodness of Pronunciation (GoP) [42]). The results were evaluated with k-Nearest Neighbours (k-NN) and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) algorithms (0.87 correlation).
Sakar et al. in [43] provide a comparative analysis of speech processing algorithms for PD recognition using detrended fluctuation analysis (DFA), pitch period entropy (PPE), recurrence period density entropy (RPDE), MFCCs, wavelet transform (WT) methods for feature extraction. The results were validated with a set of supervised classifiers (Logistic Regression, Multilayer Perceptron, Naive Bayes, Random Forest, SVMs with linear and RBF kernels, and k-NN algorithms) (0.86 best achieved correlation).
The classification of PD severity is introduced by Oung et al. in [44]. The speech signals were acquired by using a Sennheiser DW Pro2 headset positioned at a distance of 5 cm from the mouth of a subject. For speech feature extraction, the researchers adapted wavelet energy (WE), Shannon wavelet entropy (ShWE), Renyi wavelet entropy (ReWe), Tsallis wavelet entropy (TsWe), permutation entropy (Pe) and fuzzy entropy (Fe). The classifiers used were extreme learning machine (ELM), K-nearest neighbour (KNN) and probabilistic neural network (PNN) (best accuracy 91.11%).
Ali et al. [45] use the Parkinson speech-based dataset from the UCI repository to investigate classification for early diagnosis of PD. 15 acoustic features were considered in the research: jitter, number of pulses, number of periods, mean period, standard deviation of period, number of voice breaks, degree of voice breaks, mean pitch, standard deviation, minimum pitch, autocorrelation, noise-to-harmonic ratio and harmonic-to-noise ratio. Four classifiers were examined: Bayes Net, Random Forests, Decision Stump and SVM (95.6% best accuracy).
The fusion of wavelet packet transform (WPT) and MFCC methods was applied for the diagnosis of PD from recorded speech signals by using Hidden Markov Model (HMM) and SVM classifiers [46]. Burk et al. [47] analysed acoustic recordings (special software and hardware were used for the data collection) from PD patients based on cepstral peak prominence (CPP) and aerodynamic measures of transglottal airflow (TAF) in order to distinguish between speakers without and with tremor (correlation 0.96).
Jeancolas et al. [48] target vocal impairment detection for early prediction of PD. They applied MFCC and GMM for feature extraction. The data (96 kHz audio samples) were collected with a professional head-mounted omnidirectional condenser microphone that was placed 10 cm from the mouth of the PD patient. The classification results were validated with a bootstrap aggregation approach from the log-likelihood on each frame (83% best accuracy).
Refer to Table 1 for comparison of related work to track speech impairments in PD and HD.
To sum up, speech impairments have been analysed intensively by other researchers. The majority of related works involve PD patients. However, the approaches adopted for collecting voice recordings differ widely: most solutions require custom hardware (special microphones, amplification devices or headsets) and audio signal processing software (Matlab, Praat, Audacity, SPSS) for test supervision.
In addition, the evaluation metrics (features) that are used for detecting speech impairments cover a wide range of choices, i.e., from acoustic features (jitter, number of pulses, voice breaks etc.), Gaussian Mixture Model (GMM), Mel-Frequency Cepstral Coefficients (MFCC), spectrum (kurtosis, spread, entropy etc.), wavelet transforms (WT) to strategies for combining these features.
Statistically, the speech impairment detection models proposed in the related work were evaluated by using regression analysis (Spearman correlation coefficient = 0.0156) and classification methods (97.6% achieved with the Random Forest algorithm).

A. TASK: VOICE RECORDER
The task is a part of the Neural Impairment Test Suite (NITS) [49] system (mobile app) proposed by the authors of this paper. NITS is a framework for collecting various data features from tremor, cognitive, speech and energy expenditure tasks (e.g. finger motion tracking, duration, distance evaluation of geometrical shapes, graph similarity evaluation, image collection from the clock drawing (CDT) task, voice recordings, daily calorie balances, etc.). It is developed for Android OS with the core supported software development kit (SDK), and includes third-party libraries and custom algorithms for the required functionality.
The voice recorder task is named T14 in the NITS framework. The patient is instructed to read a short text of predefined poems into the mobile device microphone (Figure 1).
Predefined transcripts can be provided in English and Lithuanian if needed. The process is repeated twice: first, a poem is selected randomly; then, the remaining one is displayed. The recording begins when the patient is ready and presses the 'Start Recording' button. The recording of a single poem finishes when the 'Stop Recording' button is pressed (the patient can pause if needed before the second recording). T14 is executed twice as a precaution for more reliable test execution. If a test subject (a CNSD patient) did not understand or follow the T14 task properly the first time, a chance to repeat the procedure was given. This approach allows more voice recordings to be collected from each patient (damaged audio recordings were excluded), thus resulting in a larger dataset. After T14 is completed, two audio files (compressed .mpeg4 format, 44.1 kHz sample rate and AAC audio codec), together with the associated transcripts, are stored in the external storage of the mobile device. These audio parameters were chosen based on compatibility recommendations for the latest Android devices [50]. The MPEG-4 audio codec supports standard sampling rates from 8 to 48 kHz (mono or stereo channels). In addition, there is no significant effect on the quality of the collected audio files for the analysis, as all recordings were collected under direct supervision of T14 execution by an author of this paper. In this setup, the surrounding environment was isolated so that audio data were acquired without external interference, and the distance from the mobile device microphone to the speaker's mouth was adjusted accordingly.

B. TEST SUBJECTS, PROCEDURE, DATASET
A total of 15 test subjects were involved in the audio file collection process: 7 patients with neurological disorders (3 Huntington (one of them juvenile, 18 years old), 1 Parkinson, 1 cerebral palsy, 1 post-stroke, 1 early dementia) and 8 healthy subjects. The neurological patients were in the early stage of their disease, e.g., the HD patients had the early (I or II) clinical form of HD according to the Shoulson-Fahn functional capacity rating scale [51]. All participants were asked to perform the T14 task.
The dataset was collected during 5 rounds, i.e. face-to-face patient visitations. All tests were supervised by an author of this paper to explain the working principle of T14. Moreover, this approach was chosen to ensure the fair execution of the test, i.e., without any cheating or faking of the results. In some cases test subjects were asked to perform the T14 task multiple times, because CNSD patients tended to lose focus, which interrupted the audio recording process. In total, 339 samples (audio files), including healthy and impaired test subjects, were collected in the dataset. The collected data were labelled using a healthy vs. impaired (0 or 1) objective assessment criterion for the health status of each subject, where 0 indicates that the subject is healthy, whereas 1 means that the subject has a neurological disorder (e.g. Huntington's disease). During data collection, the class label is specified in the mobile application before starting the actual testing procedure.

C. AUDIO SIGNAL FEATURE EXTRACTION METHODS
Stored .mpeg4 files (Figure 1, audio1 and audio2) are used as inputs for further audio signal processing. The authors consider the following methods for speech feature extraction.
The PEF method models the signal $Y$ at time $t$ in the spectral domain with frequency $f$ as defined in formula (1), where $K$ is the number of peaks in the audio signal, $N_t(f)$ is the power spectral density of unwanted noise, and $a_{k,t}$ is the power of the $k$-th harmonic at time $t$.
In the LHS method, the signal is modelled by (2), where $n_c$ is the compression factor, $s = \log_2 f$, $h_{n_c} = 0.84^{n_c - 1}$ is a decreasing sequence implying that higher harmonics contribute less to the pitch than lower harmonics to the noise, $P(s) = W(s) \cdot A(s)$, $W(s)$ is the spectral window function, $A(s)$ is the logarithmic frequency abscissa, and $N = 15$ is the number of harmonics considered.
The SRH method tracks the pitch by using the calculation in formula (3), where $E(f)$ is the amplitude spectrum ($f$ is the frequency in the range $[F_{\min}, F_{\max}]$, computed for each Hanning-windowed frame covering several cycles of the resulting residual signal) evaluated at the $k$-th harmonic, and $N$ is the number of harmonics that are taken into account. Please consult the provided references for details of the NCF and CEP methods.
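Formulas (1)-(3) are not reproduced in this text. As a hedged reconstruction only, the forms below follow the commonly cited definitions of the pitch estimation filter, log-harmonic (subharmonic) summation and summation of residual harmonics. The fundamental frequency $f_0$, the Dirac delta $\delta$, the name $H(s)$ and the use of $n_c$ as the harmonic index are notational assumptions that do not appear in the text above, and the authors' exact notation may differ.

```latex
% (1) PEF: harmonic peaks at multiples of f_0 plus noise
Y_t(f) = \sum_{k=1}^{K} a_{k,t}\,\delta(f - k f_0) + N_t(f)

% (2) LHS: weighted sum of the log-frequency spectrum at harmonic positions
H(s) = \sum_{n_c=1}^{N} h_{n_c}\, P(s + \log_2 n_c),
\qquad h_{n_c} = 0.84^{\,n_c - 1}

% (3) SRH: harmonics are rewarded, inter-harmonic positions are penalised
\mathrm{SRH}(f) = E(f) + \sum_{k=2}^{N}
  \left[ E(k f) - E\!\left(\left(k - \tfrac{1}{2}\right) f\right) \right]
```

In each case the pitch estimate is taken as the frequency that maximises the corresponding function over the admissible range.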
An additional method considered is Mel-frequency cepstral coefficients (MFCC) [29]. MFCC returns the coefficients sampled at a frequency of fs as well as the change in coefficients (delta) and the change in delta values (deltaDelta). The default WindowLength and OverlapLength configuration is the same as in the Pitch method.
MFCC computes a frequency analysis based on a filter bank. A short-time Fourier analysis results in a discrete Fourier transform (DFT) $X_t[k]$ of the signal at time $t$. DFT values are grouped together in critical bands and weighted by a triangular function. Formulas (4), (5) and (6) are used for the MFCC calculations ($R = 22$, $m$-th signal sample, the number of MFCC coefficients is usually 13), where $MF_t[r]$ is the Mel-frequency spectrum at analysis time $t$ for $r = 1, 2, \ldots, R$; $V_r[k]$ is the triangular weighting function for the $r$-th filter, ranging from DFT index $L_r$ to $U_r$; and $A_r$ is a normalizing factor for the $r$-th Mel-filter.
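Formulas (4)-(6) are not reproduced in this text. A hedged reconstruction, assuming the standard textbook form of mel filter-bank analysis and using the symbols defined above, is given below; the name $\mathrm{mfcc}_t[m]$ is an assumption and the authors' exact notation may differ.

```latex
% (4) Mel-frequency spectrum: weighted DFT energy in the r-th band
MF_t[r] = \frac{1}{A_r} \sum_{k=L_r}^{U_r} \bigl| V_r[k]\, X_t[k] \bigr|^2

% (5) Normalizing factor of the r-th Mel-filter
A_r = \sum_{k=L_r}^{U_r} \bigl| V_r[k] \bigr|^2

% (6) Cepstral coefficients: cosine transform of the log band energies
\mathrm{mfcc}_t[m] = \frac{1}{R} \sum_{r=1}^{R}
  \log\bigl( MF_t[r] \bigr)\,
  \cos\!\left[ \frac{2\pi}{R} \left( r + \tfrac{1}{2} \right) m \right]
```

The normalization in (5) ensures that a perfectly flat input spectrum yields a flat Mel-frequency spectrum.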
Once the MFCC are calculated as defined in (6), the delta values are determined from a least-squares approximation of the local slope over a region around the current time sample: delta is obtained by passing the MFCC, and deltaDelta by passing delta. The same rule applies to GTCC.
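The least-squares local slope can be sketched as follows. This is a minimal illustration, not the toolbox implementation: the function name `delta`, the window half-width of 4 frames and the edge handling (repeating the first and last frame) are all assumptions.

```python
import numpy as np

def delta(coeffs, half_width=4):
    """Least-squares slope of each coefficient track.

    coeffs: array of shape (num_frames, num_coeffs).
    The slope at frame t is fitted over frames t-half_width .. t+half_width.
    Edge frames are handled by repeating the first/last frame (an assumption).
    """
    coeffs = np.asarray(coeffs, dtype=float)
    num_frames = coeffs.shape[0]
    offsets = np.arange(-half_width, half_width + 1)
    denom = np.sum(offsets ** 2)
    padded = np.pad(coeffs, ((half_width, half_width), (0, 0)), mode="edge")
    out = np.zeros_like(coeffs)
    for k in offsets:
        # padded[half_width + k + t] is the frame at time t + k
        start = half_width + k
        out += k * padded[start:start + num_frames]
    return out / denom

# delta is computed from the coefficients, deltaDelta from delta
track = (3.0 * np.arange(20) + 1.0).reshape(-1, 1)
d = delta(track)
dd = delta(d)
```

For a symmetric window the closed form sum(k * c[t+k]) / sum(k^2) is exactly the slope of the least-squares line through the samples, which is why no explicit regression is needed.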
Similarly to MFCC, Gammatone cepstral coefficients (GTCC), including delta and deltaDelta, can be used for audio signal feature extraction. GTCC is a bio-inspired adaptation of MFCC that employs Gammatone (GT) filters [30]. The GT filter with its properties is defined in formula (7), where $n$ is the filter order, $K$ is the amplitude factor, $B$ governs the duration of the impulse response, $f_c$ is the central frequency, and $\phi$ is the phase shift.
In GTCC extraction, the audio signal is first sliced into short frames, usually about 10-50 ms (same as in MFCC). Within such short frames the signal can be treated as stationary, which makes the analysis possible. Afterwards, the GT filter bank is applied to the signal's FFT, highlighting the perceptually meaningful voice frequencies. Lastly, the DCT is applied to model the human perception of sound and to decorrelate the filter outputs, therefore achieving better energy compaction. Here $R$ is the number of GT filters, $X_t[r]$ is the energy of the signal in the $r$-th spectral band, $1 \le m \le M$ (5) is the number of outputs.
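The gammatone filter of formula (7) can be illustrated with a short sketch, assuming the commonly used impulse response g(t) = K t^(n-1) exp(-2πBt) cos(2π f_c t + φ). The function name and the default values (order 4, B = 125 Hz) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gammatone_impulse_response(t, fc, order=4, bandwidth=125.0,
                               amplitude=1.0, phase=0.0):
    """Gammatone impulse response sampled at times t (seconds).

    fc        : central frequency f_c in Hz
    order     : filter order n (assumed default)
    bandwidth : B in Hz, controls how fast the response decays (assumed default)
    amplitude : amplitude factor K
    phase     : phase shift in radians
    """
    t = np.asarray(t, dtype=float)
    envelope = amplitude * t ** (order - 1) * np.exp(-2.0 * np.pi * bandwidth * t)
    return envelope * np.cos(2.0 * np.pi * fc * t + phase)

# 25 ms of the response of a 1 kHz channel sampled at 16 kHz
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
g = gammatone_impulse_response(t, fc=1000.0)
```

A larger `bandwidth` makes the envelope decay faster, so the channel responds to a wider band of frequencies around `fc`.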
Another method considered for speech feature extraction is the wavelet scattering transform (WST). It defines a representation that is resistant to time-warping deformations. WST extends MFCC by calculating modulation spectrum coefficients through wavelet convolutions and modulus operators [57]. In addition, WST outperforms MFCC as an audio representation for classification problems at time scales longer than 25 ms.
The scattering transform restores the information lost by Mel-frequency averaging by employing a cascade of wavelet decompositions and modulus operators. The wavelet transform is calculated by constant-Q filter banks. A wavelet $\phi(t)$ is a bandpass filter with $\hat{\phi}(0) = 0$ and is written in the centre-frequency $\omega$ form as defined in formula (9). Here the centre frequency of $\hat{\phi}$ is normalized to 1, $\omega = 2^{k/Q}$, $Q$ is the number of wavelets per octave, and $k \in \mathbb{Z}$. The bandwidth of $\hat{\phi}$ is of the order of $Q^{-1}$.
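The constant-Q spacing $\omega = 2^{k/Q}$ can be made concrete with a small sketch. The function name and the choice of Q = 8 over 3 octaves are illustrative assumptions; only the spacing rule itself comes from the text.

```python
import numpy as np

def constant_q_centre_frequencies(wavelets_per_octave, num_octaves):
    """Normalized centre frequencies 2**(k/Q) for k = -Q*num_octaves .. 0.

    The highest centre frequency (k = 0) equals 1, matching the
    normalization of the mother wavelet described above.
    """
    k = np.arange(-wavelets_per_octave * num_octaves, 1)
    return 2.0 ** (k / wavelets_per_octave)

freqs = constant_q_centre_frequencies(wavelets_per_octave=8, num_octaves=3)
ratios = freqs[1:] / freqs[:-1]  # constant ratio between neighbouring wavelets
```

Because neighbouring centre frequencies differ by the constant factor 2^(1/Q), every octave contains exactly Q wavelets and the relative bandwidth of each wavelet stays the same.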
SpectralSlope evaluates the slope of the spectral shape by using a linear approximation of the magnitude spectrum. A linear function is fitted to the magnitude spectrum as defined in (10):

$$\text{slope} = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_f)(s_k - \mu_s)}{\sum_{k=b_1}^{b_2} (f_k - \mu_f)^2} \tag{10}$$

here $f_k$ is the frequency in Hz corresponding to bin $k$, $\mu_f$ is the mean frequency, $s_k$ is the spectral value at bin $k$, $\mu_s$ is the mean spectral value, and $b_1$ and $b_2$ are the band edges, in bins, over which the spectral descriptor (e.g., slope) is calculated.

Spectral skewness evaluates the symmetry of the spectral magnitude distribution around its arithmetic mean (11):

$$\text{skewness} = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^3 s_k}{\mu_2^3 \sum_{k=b_1}^{b_2} s_k} \tag{11}$$

here $\mu_1$ is the spectral centroid and $\mu_2$ is the spectral spread.

Spectral spread measures the concentration of the power spectrum around the spectral centroid (Eq. 12):

$$\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^2 s_k}{\sum_{k=b_1}^{b_2} s_k}} \tag{12}$$

Spectral centroid represents the centre of gravity (COG) of spectral energy. It is defined as the frequency-weighted sum of the power spectrum normalized by its unweighted sum (13):

$$\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k s_k}{\sum_{k=b_1}^{b_2} s_k} \tag{13}$$
Spectral decrease assesses the steepness of the decrease of the spectral envelope. The result of the spectral decrease is a value less than 1. The spectral decrease is not defined for audio blocks with no spectral energy (silence) (Eq. 14).
Spectral kurtosis evaluates the shape of the spectral magnitude value distribution as compared to the Gaussian distribution (Eq. 15).
Spectral flux is the change of the spectral shape calculated as the mean difference between neighboring Short Time Fourier Transform (STFT) frames (Eq. 16).
Spectral rolloff is a measure of the bandwidth of the analyzed block n of audio samples and is specified as the bin of frequency below which the cumulative magnitudes of the STFT reach a certain percentage K of the overall sum of magnitudes (Eq. 17).
Spectral flatness is the ratio of geometric and arithmetic means of the magnitude spectrum (Eq. 18).
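Several of the spectral descriptors above can be computed from a single spectrum frame as in the following sketch. It assumes the usual moment-based definitions (centroid, spread, skewness, kurtosis), a least-squares slope and a geometric-to-arithmetic mean ratio for flatness; the function name, the dictionary keys and the small constant added before the logarithm are assumptions made for illustration.

```python
import numpy as np

def spectral_descriptors(freqs, spectrum):
    """Moment-based descriptors of one spectrum frame.

    freqs    : frequency of each bin in Hz
    spectrum : non-negative spectral value of each bin
    Raises ValueError for silent frames, where the descriptors are undefined.
    A frame whose energy sits in a single bin has zero spread, so its
    skewness and kurtosis are not meaningful.
    """
    f = np.asarray(freqs, dtype=float)
    s = np.asarray(spectrum, dtype=float)
    total = s.sum()
    if total <= 0.0:
        raise ValueError("descriptors are undefined for a frame with no energy")
    centroid = np.sum(f * s) / total
    spread = np.sqrt(np.sum((f - centroid) ** 2 * s) / total)
    skewness = np.sum((f - centroid) ** 3 * s) / (spread ** 3 * total)
    kurtosis = np.sum((f - centroid) ** 4 * s) / (spread ** 4 * total)
    slope = np.sum((f - f.mean()) * (s - s.mean())) / np.sum((f - f.mean()) ** 2)
    # Small constant avoids log(0) for empty bins (an assumption)
    flatness = np.exp(np.mean(np.log(s + 1e-12))) / np.mean(s)
    return {"centroid": centroid, "spread": spread, "skewness": skewness,
            "kurtosis": kurtosis, "slope": slope, "flatness": flatness}

# A perfectly flat spectrum over five bins
result = spectral_descriptors(np.arange(1, 6) * 100.0, np.ones(5))
```

For the flat example spectrum the centroid sits in the middle of the band, the skewness and slope vanish, and the flatness is 1, which matches the intuition behind each descriptor.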

IV. EXPERIMENTAL RESULTS
Two supervised learning approaches (wavelets with SVM and deep learning neural networks) are considered in the experimental research for classifying test subjects into healthy and impaired instances (2 target classes): 1) wavelet scattering transform (WST) with SVM; and 2) bidirectional recurrent neural network (RNN) with long short-term memory (BiLSTM). Both methods apply a percentage split of the original data, i.e. a 70% training set and a 30% testing set. For the collected dataset of voice recordings, this corresponds to 207 samples for training and 88 for testing, including 29 samples (healthy test subjects) and 15 samples (impaired test subjects) for predictions on new and unseen data.

A. SPEECH IMPAIRMENT DETECTION WITH WST AND SVM
The experiment is based on voice recordings collected from the T14 task. The WST method applies the Gabor (analytic Morlet) wavelet. Such wavelets use a low-pass scaling function to produce low-variance representations of voice.
The wavelet is designed as follows. The signal length is set to $2^{19}$ samples. For the WST configuration, only 3 parameters are provided: the duration of the time invariance, the number of wavelet filter banks (band-pass filters that separate voice data into multiple components, each one carrying a sub-band of the original data) and the number of wavelets per octave. Two wavelet filter banks are used: first (fb1) and second (fb2). The first filter bank has 8 wavelets per octave, and the second filter bank has 1 wavelet per octave. The time invariance scale is set to 0.5 seconds. For this setup, the invariance scale parameter that is plotted on the coarsest scale [61] (Figure 2) does not exceed the invariance scale of the wavelet scattering decomposition, i.e., it is an indicator of low variance.
The plot of the fb1 and fb2 filter banks using the Littlewood-Paley sums [62] representation is provided in Figure 3.
The audio materials are transferred to a single in-memory object (ads). Train (Ttrain) and test (Ttest) data are converted to tall arrays. Then, scattering train features (scatteringTrain) and scattering test features (scatteringTest) are created by applying a log transformation for each audio file and subsampling the number of scattering windows by 8. The scattering features are combined into a matrix by using the MATLAB Parallel Pool (Number of Workers = 4) on a single GPU, resulting in the training features and the testing features (each row of the matrix is 1 time window across the N = 341 paths in the scattering transform of each audio signal).
The training features and the testing features are used to fit a support vector machine (SVM) model with a polynomial kernel (order = 3). SVM tuning was applied by using the Majority Vote method, which achieved 96.3% accuracy on the supplied test data, as shown in the confusion matrix (Figure 4). The model build time is 369.83 seconds.
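One plausible reading of the Majority Vote step, assuming each scattering time window is classified separately and the file-level label is the most frequent window label, can be sketched as follows. The function name, the dictionary layout and the file names are hypothetical.

```python
from collections import Counter

def majority_vote(window_predictions):
    """Collapse per-window labels into one label per audio file.

    window_predictions: dict mapping a file identifier to the list of labels
    predicted for its time windows. Ties are resolved in favour of the label
    that appears first in the list (an assumption).
    """
    return {
        file_id: Counter(labels).most_common(1)[0][0]
        for file_id, labels in window_predictions.items()
    }

# Hypothetical window-level outputs for two recordings
votes = majority_vote({
    "rec_01": ["impaired", "healthy", "impaired"],
    "rec_02": ["healthy", "healthy", "healthy"],
})
```

Voting over windows makes the file-level decision robust to a few misclassified windows, at the cost of treating all windows as equally reliable.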

B. SPEECH IMPAIRMENT DETECTION WITH BiLSTM
Similarly to the WST approach, this deep learning experiment also analyses voice recordings collected from the T14 task. The first stage is pre-processing of the original audio signal, i.e. removing silence segments. To eliminate information that is not useful as an indicator of the speaker's health status, a speech segment isolation method is applied. This method uses a thresholding approach. First, 2 features (signalEnergy, centroid) are calculated over nonoverlapping frames of the audio data. Next, the energy and spectral centroid of each frame are evaluated; the centroid threshold (T_C = 5000 Hz) and energy threshold (T_E) are calculated afterwards. The regions where the feature values fall below or above their respective thresholds are disregarded (Figure 5). In contrast, a speech region is active in the cases given by (20). In the implementation, isSpeechRegion is further characterized by regionStartPos (indices of frames where a speech-to-silence or silence-to-speech transition occurs), regionLengths (length of all-silence or all-speech regions), and start and end indices (SI, EI) for each speech region. Once the active speech regions are detected, the intersecting speech segments are merged and fed to the feature extraction mechanism (segments).
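The thresholding step can be sketched as follows. Since formula (20) is not reproduced here, the decision rule is an assumption: a frame is treated as speech when its energy exceeds T_E and its spectral centroid does not exceed T_C. Setting T_E to half of the mean frame energy is likewise an assumption, as are the function names.

```python
import numpy as np

def detect_speech_frames(signal, fs, frame_len, centroid_threshold_hz=5000.0):
    """Boolean mask over non-overlapping frames, True where speech is assumed."""
    signal = np.asarray(signal, dtype=float)
    num_frames = len(signal) // frame_len
    frames = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    # Small constant keeps the centroid finite for silent frames
    centroid = (magnitude @ freqs) / (magnitude.sum(axis=1) + 1e-12)
    energy_threshold = energy.mean() / 2.0  # assumed choice of T_E
    return (energy > energy_threshold) & (centroid <= centroid_threshold_hz)

def frames_to_regions(mask):
    """Convert a frame mask into (start, end) frame index pairs, end exclusive."""
    padded = np.concatenate([[False], mask, [False]]).astype(int)
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)
    ends = np.flatnonzero(changes == -1)
    return list(zip(starts.tolist(), ends.tolist()))

# Synthetic check: silence, a 200 Hz tone, silence (10 frames of 30 ms each)
fs, frame_len = 16000, 480
tone = np.sin(2 * np.pi * 200 * np.arange(10 * frame_len) / fs)
silence = np.zeros(10 * frame_len)
mask = detect_speech_frames(np.concatenate([silence, tone, silence]), fs, frame_len)
regions = frames_to_regions(mask)
```

The region boundaries returned by `frames_to_regions` play the role of the start and end indices described above; multiplying them by `frame_len` gives sample positions.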
The speech signal changes over time, but is stationary on short time scales; thus, its processing is often done in windows of 20-40 ms. For each speech segment, a periodic Hamming window [63] with 80% overlap is used, and the resulting feature vectors are concatenated into sequences (each vector contains 92 features, each sequence 40 feature vectors). The features used are GTCC, MFCC, pitch, slope, skewness, spread, flux, rolloff, decrease, flatness, kurtosis and entropy. These 12 features are concatenated together and can be combined in numerous ways. A feature can be removed from the sequence or swapped with another if necessary.
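Grouping the per-window feature vectors into fixed-length sequences can be sketched as below. The sizes (92 features per vector, 40 vectors per sequence) come from the text; the function name and the `hop` parameter are assumptions, since the text does not state whether consecutive sequences overlap.

```python
import numpy as np

def to_sequences(features, vectors_per_sequence=40, hop=40):
    """Split a (num_vectors, num_features) matrix into fixed-length sequences.

    hop == vectors_per_sequence gives non-overlapping sequences; a smaller
    hop gives overlapping ones. Trailing vectors that do not fill a complete
    sequence are dropped.
    """
    features = np.asarray(features)
    last_start = features.shape[0] - vectors_per_sequence
    return [features[s:s + vectors_per_sequence]
            for s in range(0, last_start + 1, hop)]

# 100 feature vectors of 92 features each (placeholder values)
feature_matrix = np.zeros((100, 92))
sequences = to_sequences(feature_matrix)
overlapping = to_sequences(feature_matrix, hop=20)
```

Each returned sequence has shape (40, 92) and can be passed to a recurrent network as one training example.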
With this configuration, the next step is transferring the features to a tall array T on the GPU (a tall array provides a way to work with data backed by an audio datastore (audioDatastore) that can have millions or billions of rows). The feature sequences are re-evaluated (featureSequences) and normalized (the mean and standard deviation of each coefficient are computed). The normalized features are then ready to be supplied for training the Bidirectional Long Short-Term Memory (BiLSTM) deep learning neural network. A BiLSTM can learn long-term dependencies between time steps of sequence data in both forward and backward directions. The architecture of the proposed BiLSTM neural network used in the training process has 2 fully connected layers of 100 neurons, followed by a softmax layer and a classification (output) layer. Figure 6 illustrates the healthy vs. impaired classification results on the provided dataset obtained by applying the Majority Vote method (rule) for tuning classifier performance, i.e. an overall model accuracy of 94.50%.

C. SUMMARY OF EXPERIMENTS
Both experiments are implemented using MATLAB Audio Toolbox R2019a (Mathworks Inc., USA) software. Audio materials are transferred to MATLAB audioDatastore object.
For the WST-SVM experiment, the initial preparation steps include the creation of a root folder and two subfolders named 'healthy' and 'impaired'. The names of the subfolders should match the names of the output target classes. The audio files must be provided as .wav files with a 1411 kbps bit rate and a 22050 Hz sample rate.
For the BiLSTM experiment, the procedure starts by creating a root folder and two subfolders named 'train' and 'test'. The initial audio materials must be provided as 64 kbps bit rate files (.mp3 format). Two .csv files (one for the training set, another for the testing set) are prepared for storing summarized information about the collected files in this format: a link to the audio file stored on disk and the health status indicator (as a text string). In addition, the .csv file structure can be expanded with the transcript of the read poem and the recording duration. Table 2 shows the data format, software and hardware requirements for the proposed WST and BiLSTM based PD classification methods.

V. DISCUSSION AND CONCLUSIONS
In this paper, we presented an investigation into detecting speech impairments occurring in CNSD patients. A dataset of audio files (including early-stage CNSD patients and healthy subjects) was collected during a pilot study carried out in Lithuania using a smart noninvasive interface, i.e. the Neural Impairment Test Suite mobile app. For proper task execution, the test subject should be acquainted with the Lithuanian or English language (speech dialect is not important).
Three domains of feature extraction methods (and their combinations) from audio signals were considered in this research: cepstrum domain (pitch contours, MFCC, GTCC), auditory spectrograms (slope, skewness, spread, centroid, decrease, kurtosis, flux, rolloff, entropy, flatness) and WST (wavelet time scattering, analytic Gabor). BiLSTM and support vector machine (SVM) with polynomial kernel methods were adapted for classifying test subjects into healthy and impaired groups. WST-SVM achieved 96.3% accuracy and BiLSTM 94.50% accuracy on the test set, which is promising for decision support in speech impairment detection and for targeting related diseases (e.g., Alzheimer's) at various progression stages. WST-SVM excels over BiLSTM, considering the related research findings that the voice recordings collected from CNSD patients were significantly long, i.e., up to 47 sec (observed from the juvenile HD patient).
The proposed speech impairment detection models can be compared with the works of other researchers. Tsanas et al. targeted identification of PD based on vocal performance (SVM classifier, 90% accuracy) [64]. Caesarendra et al. analysed pattern recognition with voice features in PD stage classification (SVM, 79.17% accuracy) [65]. Hauptman et al. adopted SVM (77.20% accuracy) for identification of distinctive acoustic and spectral features in PD [35]. Moreover, an extreme learning machine (ELM, 91.11% accuracy) approach for the classification of PD severity was introduced by Oung et al. [44], and Jeancolas et al. [48] adapted a bootstrap aggregation classifier (83% accuracy) for sound classification of Parkinsonism.
Speech disorders in HD and PD tend to progress over time, so the proposed classification methods could function as a decision support system for monitoring the health state of CNSD patients and provide insight into disease status. The designed WST-SVM and BiLSTM models are integrated into the NITS mobile app for triggering a screening alert to a CNSD patient about the deterioration of speech before such symptoms become much worse. The developed models can also be used as a service in the context of the Internet of Health Things (IoHT) [66] ecosystem of services and devices.