Investigation of DNN-HMM and Lattice Free Maximum Mutual Information Approaches for Impaired Speech Recognition

Assistive tools that recognize speech impaired by neurological disorders are emerging, and building them is a fairly complex task. An intelligent Impaired Speech Recognition system helps persons with speech impairment to improve their interactions with the outside world. Impaired speakers have difficulty in pronouncing words, which results in partial or incomplete speech content. Existing Automatic Speech Recognition systems are not effective for Impaired Speech Recognition because of speaker-specific variations that depend on the severity of the neurological disorder. In this work, we investigate two important approaches, namely the Deep Neural Network-Hidden Markov Model approach and the Lattice Free Maximum Mutual Information approach, for effective recognition of impaired speech. The training and testing samples are collected from persons with different neurological disorders at varied intelligibility levels: high, medium, low and very low. The recognition accuracy is evaluated and compared using two datasets in Tamil, namely a set of 20 acoustically similar words and the 50-word Impaired Speech Corpus.


I. INTRODUCTION
Developing an assistive system for speech impaired by neurological disorders is a complex pattern recognition task. According to the Global Burden of Diseases (GBD) Injuries and Risk Factors report [1], neurological disorders are a major global cause of different types of disability. The speech production system is mainly affected by neurological conditions such as stroke, brain injury, tumors, Parkinson's disease and multiple sclerosis. Dysarthria [2] is a motor speech disorder in which the muscles involved in speech production are damaged or weakened. Cerebral palsy is a disability that affects speech articulation, and the affected people find it difficult to speak, write and move without assistance. Impaired speech is characterized by mispronunciation, low precision, poor articulation, omissions, distortions and substitutions of phonemes, slow speaking rate, hypernasality, hoarseness, mono-loudness, mono-pitch, slurred speech, and distorted vowels and consonants, all of which degrade the intelligibility of speech [4], [5]. People with speech impairment often feel depressed and isolate themselves from the outside world. Impaired speakers usually communicate with the help of a keyboard or other input devices. To improve their quality of life, there is a high demand for a robust assistive speech system that can recognize impaired speech.
Every impaired speaker produces their own phonetic patterns, which are incomplete and lead to many variations in speech utterances. Hence, existing Automatic Speech Recognition (ASR) techniques perform poorly when applied to impaired speech: they are ineffective in mapping the impaired speech signal to the correct phonemes. Impaired Speech Recognition (ISR) converts impaired speech to text [6], [7], [8]. This text is then synthesized into normal speech in a speech assistive system. Another important challenge is the limited amount of available training data. Collecting a large number of impaired speech samples from persons with neurological disorders is challenging, as the recording process makes the speakers feel stressed. Handling insufficient dysarthric speech data and issues in pronunciation modeling for impaired speech are addressed in [9], [10], [11], [12], [13].
Recently, deep model based approaches have outperformed traditional machine learning approaches for automatic speech recognition. DNN-HMMs have proved to be effective for automatic speech recognition [16], [17], [18]. A DNN-HMM combines the sequential modeling ability of the HMM with the representational ability of the deep neural network. The output units of the DNN are trained to estimate the posterior probabilities of the HMM states. Though the DNN-HMM has advantages over the traditional Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), it gives only moderate performance for impaired speech recognition. A bidirectional Deep Recurrent Neural Network (biRNN) based DNN-HMM is used for phoneme recognition in [15]. In a recent work [19], the authors used a phonetic posterior feature space for matching and verifying impaired speech against the control speakers' data. Several techniques such as Linear Discriminant Analysis (LDA), context-dependent states and Feature space Maximum Likelihood Linear Regression (FMLLR) are used with a Teacher-Student network [20] to increase the accuracy.
In [22], DNN pretraining with sequence discriminative training is performed and evaluated on 300 hours of Switchboard telephone conversation data. Different sets of features obtained with various transformations are studied, such as FMLLR, 40-dimensional LDA, LDA + semi-tied covariance (STC) + feature-space maximum likelihood linear transformation (FMLLT), and a single STC over LDA features. Another improvement over the DNN-HMM is aligning transcripts using a two-step alignment process [23]. The preprocessed input is aligned in the first step. The next step performs insertions, deletions and substitutions to identify the correct word with the help of the National Institute of Standards and Technology (NIST) sclite utility. A reduction in word error rate (WER) is achieved using sequence discriminative training with regularization techniques [24]. In another work [25], phone posteriors together with regularization techniques such as L2 regularization are used to differentiate among dysarthric severity levels. That work mainly handles the mismatch between normal and dysarthric speech.
To address the lack of sufficient training data, augmentation [26] is performed by perturbing the data with respect to time and tempo so that it resembles dysarthric speech. A DNN-HMM is then trained on the synthesized dysarthric speech. In [27], the authors proposed a two-step adaptation: an ASR model is first adapted to multiple dysarthric speakers and then further adapted to the target dysarthric speaker. The authors used a Connectionist Temporal Classification (CTC) [28] based recognition system and proposed a voice conversion system to synthesize a new set of speech samples from the existing samples.
In the recent literature, the DNN-HMM has proved to be effective in complex acoustic modeling, discriminative feature extraction, pronunciation error correction and knowledge transfer between normal speech and impaired speech. In this paper, we focus on investigating the DNN-HMM approach and the Lattice Free-Maximum Mutual Information (LF-MMI) approach for impaired speech recognition. Section II deals with DNN-HMM based ISR. Section III presents the LF-MMI approach. Experimental studies and performance analysis are discussed in Section IV.

II. DEEP NEURAL NETWORK-HIDDEN MARKOV MODEL (DNN-HMM) BASED IMPAIRED SPEECH RECOGNITION
In a generative model based HMM approach, the observation sequence is generated by a sequence of state transitions, where each state is modeled using a GMM. A DNN, in contrast, is capable of learning an arbitrary distribution. In a DNN-HMM, the temporal characteristics of impaired speech utterances are modeled using the HMM and the observation probabilities are estimated using the DNN; hence the DNN-HMM is termed a hybrid model. A DNN is a feed-forward artificial neural network that has more than one layer of hidden units between its input and output layers, as shown in Figure 1. At each hidden layer, a hidden unit typically maps the weighted sum of its inputs from the layer below to a deterministic value using a nonlinear activation function and passes it to the layer above.
A single DNN is used to model the posterior probabilities of all states, whereas in a GMM-HMM a separate GMM is used to model each state. Here the DNN estimates the posterior probabilities of the context-dependent tied triphone HMM states. The posterior probabilities output by the DNN are scaled using the class-wise prior probabilities: the likelihood of a triphone feature vector is estimated from the posterior probability given by the DNN and the prior probability of the state. The cross-entropy criterion is used during DNN training. Usually, the features extracted from a context window of 9 to 13 frames are fed as input to the DNN. For recognition, the sum of the log-likelihoods of the triphone feature vectors of the impaired speech utterance is used.
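Given a feature vector $x$, the output of the DNN specified by the model parameters $\{W, b\} = \{W^{\ell}, b^{\ell} \mid 0 < \ell \le N\}$ can be calculated by computing the activation vectors from layer 1 to layer $N-1$ [31]. The model parameters $\{W, b\}$ can be learned with the backpropagation algorithm. The model parameters can be improved based on the first-order gradient information as
$$W^{\ell}_{t+1} \leftarrow W^{\ell}_{t} - \varepsilon\, \Delta W^{\ell}_{t}, \qquad b^{\ell}_{t+1} \leftarrow b^{\ell}_{t} - \varepsilon\, \Delta b^{\ell}_{t},$$
where $W^{\ell}_{t}$ and $b^{\ell}_{t}$ are the weight matrix and the bias vector at layer $\ell$ after the $t$-th update,
$$\Delta W^{\ell}_{t} = \frac{1}{M_b} \sum_{m=1}^{M_b} \nabla_{W^{\ell}_{t}} J(W, b; x^{m}, y^{m}), \qquad \Delta b^{\ell}_{t} = \frac{1}{M_b} \sum_{m=1}^{M_b} \nabla_{b^{\ell}_{t}} J(W, b; x^{m}, y^{m})$$
are the average weight matrix gradient and the average bias vector gradient at iteration $t$ estimated from the training batch of $M_b$ samples, $J$ is the training criterion, $\varepsilon$ is the learning rate, $x$ is the feature vector and the corresponding output vector $y$ is the probability distribution.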
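The update rule above can be illustrated with a minimal NumPy sketch. It assumes a single softmax output layer trained with cross-entropy; the layer sizes, the learning rate, the random data and the function names are illustrative choices and are not taken from the experiments.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponent
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_step(W, b, x, y, lr):
    """One mini-batch update: W <- W - lr * dW, b <- b - lr * db."""
    m = x.shape[0]                          # M_b, the mini-batch size
    p = softmax(x @ W + b)                  # predicted state posteriors
    loss = -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))  # cross-entropy
    dW = x.T @ (p - y) / m                  # average weight gradient
    db = (p - y).mean(axis=0)               # average bias gradient
    return W - lr * dW, b - lr * db, loss

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 13))               # 32 frames, 13-dim features (assumed)
labels = rng.integers(0, 4, size=32)        # 4 hypothetical HMM states
y = np.eye(4)[labels]                       # one-hot targets
W, b = np.zeros((13, 4)), np.zeros(4)

losses = []
for _ in range(50):
    W, b, loss = sgd_step(W, b, x, y, lr=0.1)
    losses.append(loss)
```

The recorded losses should decrease over the iterations, which is the behaviour the first-order update is meant to produce on a fixed batch.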
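For an utterance with $T$ frames, the state sequence is given by $Q = (q_0, q_1, \ldots, q_T)$, where $q_0$ is the initial state. The probability of such a state sequence $Q$ can be written as
$$p(Q) = \pi(q_0) \prod_{t=1}^{T} a_{q_{t-1} q_t},$$
where $\pi(q_0)$ and $a_{q_{t-1} q_t}$ are the initial state probability and the state transition probability, respectively, determined by the HMM. The embedded Viterbi training algorithm minimizes the average cross-entropy, which is equivalent to the negative log-likelihood
$$J_{NLL}(W, b; x, Q) = -\sum_{t=1}^{T} \log p(q_t \mid x_t; W, b),$$
where $Q$ is the state sequence. If the new model $(W', b')$ improves the training criterion over the old model $(W, b)$, then
$$-\sum_{t=1}^{T} \log p(q_t \mid x_t; W', b') < -\sum_{t=1}^{T} \log p(q_t \mid x_t; W, b).$$
The score of the aligned utterance is
$$\log p(x \mid w; W', b') = \log \pi(q_0) + \sum_{t=1}^{T} \log a_{q_{t-1} q_t} + \sum_{t=1}^{T} \left[ \log p(q_t \mid x_t; W', b') - \log p(q_t) \right] + \sum_{t=1}^{T} \log p(x_t),$$
which is larger than the corresponding score under the old model $(W, b)$. The new model therefore improves the likelihood score of the utterance given the correct word sequence. During the decoding process, we convert the posterior probability to the likelihood
$$p(x_t \mid q_t) = \frac{p(q_t \mid x_t)\, p(x_t)}{p(q_t)},$$
where $p(q_t = s) = p(s) = T_s / T$ is the prior probability of a state estimated from the training samples, $T_s$ is the number of frames labeled as state $s$, and $T$ is the total number of frames. The term $p(x_t)$ is independent of the word sequence and hence can be ignored.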
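As a minimal sketch of this posterior-to-likelihood conversion, the following NumPy code estimates the state priors from a toy frame-level alignment and subtracts the log priors from the log posteriors. The alignment, the posterior values, the flooring constant and the function names are hypothetical and serve only as an illustration.

```python
import numpy as np

def state_priors(alignment, num_states):
    """p(s) = T_s / T, counted from frame-level state labels."""
    counts = np.bincount(alignment, minlength=num_states)
    return counts / counts.sum()

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """log p(x_t | s) up to the constant log p(x_t): log p(s | x_t) - log p(s)."""
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))

# Toy forced alignment of 10 frames to 3 states (hypothetical).
alignment = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
priors = state_priors(alignment, num_states=3)

# Toy DNN posteriors for two frames (each row sums to one).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
loglik = scaled_log_likelihoods(posteriors, priors)
```

Flooring the probabilities before taking logarithms is a practical safeguard against states that never occur in the alignment; it is an assumption of this sketch, not a detail given in the text.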
The decoded word sequence $\hat{w}$ is
$$\hat{w} = \arg\max_{w} p(w \mid x) = \arg\max_{w} \frac{p(x \mid w)\, p(w)}{p(x)} = \arg\max_{w} p(x \mid w)\, p(w),$$
where $p(w)$ is the probability given by the language model, and
$$p(x \mid w) = \sum_{q} p(x \mid q, w)\, p(q \mid w) \approx \max_{q}\; \pi(q_0) \prod_{t=1}^{T} a_{q_{t-1} q_t} \prod_{t=1}^{T} \frac{p(q_t \mid x_t)}{p(q_t)}$$
is the acoustic model probability, where $p(q_t \mid x_t)$ is computed from the DNN and $p(q_t)$ is the state prior. The final decoding path is determined by
$$\hat{w} = \arg\max_{w} \left[ \log p(x \mid w) + \lambda \log p(w) \right],$$
where $\lambda$ is the weight of the language model. DNNs are powerful in modeling an arbitrary mapping between inputs and outputs. However, it is difficult to train a DNN with many hidden layers. After the DNN weights are initialized, supervised fine-tuning is conducted using backpropagation to adjust the weights, which can lead to overfitting. To avoid overfitting, weight decay and dropout regularization are used. Weight decay is applied when the training set is small compared to the number of parameters in the DNN. Dropout randomly omits a certain percentage of the neurons in each hidden layer for each presentation of the samples during training. During training, each random combination of the remaining hidden neurons therefore needs to perform well in the absence of the omitted neurons.
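The two regularizers mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch only: it uses the common "inverted" dropout variant (surviving activations are rescaled during training) and an L2 form of weight decay, and the rates and array sizes are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return h
    keep = 1.0 - rate
    mask = rng.random(h.shape) < keep       # True for the units that survive
    return h * mask / keep

def weight_decay_step(W, grad, lr, decay):
    """Gradient step for a loss with the added penalty (decay / 2) * ||W||^2."""
    return W - lr * (grad + decay * W)

rng = np.random.default_rng(0)
h = np.ones((4, 1000))                      # toy hidden activations
h_drop = dropout(h, rate=0.5, rng=rng)      # about half the units are zeroed
W_new = weight_decay_step(np.ones((2, 2)), np.zeros((2, 2)), lr=0.1, decay=0.01)
```

With a zero data gradient, the weight decay step alone shrinks every weight towards zero, which is the effect that discourages large weights when the training set is small.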

III. LATTICE FREE-MAXIMUM MUTUAL INFORMATION (LF-MMI) APPROACH
Maximum mutual information (MMI) is used for discriminative training of sequences: it maximizes the probability of the reference phonetic transcription of a word sequence while minimizing the probability of the other transcriptions. In MMI training, the HMMs of all the word classes are considered simultaneously. The parameters of the correct word model are updated to maximize its contribution, while the parameters of the other word models are updated to minimize their contribution. The training thus provides high discriminative ability, leading to improved performance.
Hence, we explore the Lattice Free-Maximum Mutual Information (LF-MMI) approach for impaired speech recognition, where there is a need for better discrimination among incomplete utterances of different word classes. The diagrammatic representation of LF-MMI based Impaired Speech Recognition is shown in Figure 2. In LF-MMI, the outputs of the Deep Neural Network (DNN) correspond to tied biphone or triphone HMM states, where the state tying is done using a context-dependency tree. A biphone represents a monophone together with its left or right context monophone. The context-dependency tree is constructed using the GMM-HMM alignments.
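The objective function of Maximum Likelihood (ML) estimation [29] is given as
$$F_{ML}(\lambda) = \sum_{u=1}^{U} \log p_{\lambda}\left(x^{(u)} \mid M_{w^{(u)}}\right),$$
where $x^{(u)}$ is the $u$-th speech utterance with transcription $w^{(u)}$, $U$ is the total number of training utterances and $\lambda$ is the set of all HMM parameters. The composite HMM graph is denoted by $M_{w^{(u)}}$. The objective function of MMI is given as
$$F_{MMI}(\lambda) = \sum_{u=1}^{U} \log \frac{p_{\lambda}\left(x^{(u)} \mid M_{w^{(u)}}\right) P\left(w^{(u)}\right)}{\sum_{w} p_{\lambda}\left(x^{(u)} \mid M_{w}\right) P(w)},$$
where $P(w)$ is the language model probability of the word sequence $w$. The denominator can be estimated as
$$\sum_{w} p_{\lambda}\left(x^{(u)} \mid M_{w}\right) P(w) = p_{\lambda}\left(x^{(u)} \mid M_{den}\right),$$
where $M_{den}$ is the HMM denominator graph, which includes all possible sequences of words, and $M_{w}$ is the numerator graph. In lattice-based MMI, the denominator lattices are generated by a previously trained cross-entropy model or GMM. Each lattice compactly encodes a small set of likely alternative word sequences for a training utterance.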
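To make the difference between the two criteria concrete, the following sketch evaluates the ML term and the MMI term for a single toy utterance with three competing word models. The log-likelihood values, the uniform prior and the function names are invented for illustration; in a real system the denominator sum runs over the denominator graph rather than an explicit list of words.

```python
import numpy as np

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = np.max(values)
    return m + np.log(np.sum(np.exp(values - m)))

def mmi_term(acoustic_loglik, log_prior, reference):
    """Per-utterance MMI term: log of numerator over denominator.

    acoustic_loglik[w] = log p(x | M_w), log_prior[w] = log P(w).
    """
    joint = acoustic_loglik + log_prior
    return joint[reference] - log_sum_exp(joint)

# Hypothetical scores of one utterance under three word models; index 0 is the reference.
acoustic_loglik = np.array([-120.0, -121.0, -135.0])
log_prior = np.log(np.full(3, 1.0 / 3.0))

ml_value = acoustic_loglik[0]                       # ML only looks at the reference model
mmi_value = mmi_term(acoustic_loglik, log_prior, reference=0)

# Pushing the closest competitor down raises the MMI term but leaves the ML term unchanged.
separated = np.array([-120.0, -130.0, -135.0])
mmi_separated = mmi_term(separated, log_prior, reference=0)
```

The comparison shows why MMI suits acoustically similar word classes: the criterion rewards lowering the scores of competing words, not only raising the score of the correct one.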
The full denominator graph with a Deep Neural Network (DNN) based model is used in the Lattice Free MMI (LF-MMI) approach. It is similar to lattice-based MMI, except that LF-MMI uses a numerator graph that makes use of alignment information and a common denominator graph instead of utterance-specific lattices. The LF-MMI numerator graph is a special acyclic graph that uses the GMM-HMM alignments as time constraints on the phones. It is a finite state acceptor (FSA) in which each phone can occur some number of frames earlier or later than its actual occurrence in the corresponding alignment.
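The time constraint of the numerator graph can be pictured with a simplified sketch that lists, for every frame, the phones that remain admissible when each aligned phone is allowed to shift by a fixed number of frames. This is an assumption-laden illustration: the real numerator graph is an FSA that also enforces the phone order, and the phone labels, the tolerance value and the function name are hypothetical.

```python
def allowed_phones(alignment, tolerance):
    """Phones admissible at each frame when every aligned phone may occur
    up to `tolerance` frames earlier or later than in the alignment."""
    n = len(alignment)
    allowed = []
    for t in range(n):
        lo = max(0, t - tolerance)
        hi = min(n, t + tolerance + 1)
        allowed.append(set(alignment[lo:hi]))
    return allowed

# Hypothetical frame-level phone alignment for a short word.
alignment = ["v", "v", "v", "e", "e", "n", "n", "n"]
mask = allowed_phones(alignment, tolerance=1)
```

Frames far from a phone boundary admit a single phone, while frames near a boundary admit both neighbours, which is how the alignment acts as a soft rather than a hard constraint.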
Two forward-backward passes are used to calculate the derivatives of the LF-MMI objective function: one on the denominator graph and the other on the numerator graph. To make the forward-backward pass on the denominator graph efficient, all the utterances are split into fixed chunks of 1.5 seconds based on the alignment information, and training is carried out on mini-batches of these chunks. A pruned phone-level language model is trained on the previous GMM-HMM alignments. In this work, we have used LF-MMI training in the DNN-HMM model with the full denominator graph. The LF-MMI based discriminative training is expected to provide better performance than the DNN-HMM approach.
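The chunking step can be sketched as follows. The 1.5 second chunk length comes from the text; the 10 ms frame shift, the handling of the final partial chunk (shifted back so that it overlaps the previous one) and the treatment of utterances shorter than one chunk are assumptions of this sketch.

```python
def split_into_chunks(num_frames, chunk_seconds=1.5, frame_shift=0.01):
    """Return (start, end) frame ranges of fixed-length chunks for one utterance."""
    chunk = int(round(chunk_seconds / frame_shift))   # 150 frames at a 10 ms shift
    if num_frames < chunk:
        return []                                     # too short; would need padding
    starts = list(range(0, num_frames - chunk + 1, chunk))
    if starts[-1] + chunk < num_frames:
        starts.append(num_frames - chunk)             # last chunk overlaps the previous one
    return [(s, s + chunk) for s in starts]

chunks_400 = split_into_chunks(400)
chunks_300 = split_into_chunks(300)
chunks_100 = split_into_chunks(100)
```

Because every chunk has the same number of frames, the chunks can be stacked into mini-batches and the denominator forward-backward computation can be run on all of them in parallel.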

IV. EXPERIMENTAL STUDIES AND PERFORMANCE ANALYSIS
A. DATASETS
The Impaired Speech Corpus in Tamil is formed from 18 impaired speakers, both male and female, with various neurological conditions such as cerebral palsy, multiple sclerosis, mental retardation, brain and spinal cord injury, muscular dystrophy and stroke. Speakers of all intelligibility levels ("High", "Medium", "Low" and "Very Low") are involved in the impaired speech sample collection. The speech samples are recorded using a lavalier collar microphone in a laboratory environment. Each impaired speaker has uttered 50 unique isolated words and repeated those 50 words 5 times in different sessions. Thus, each speaker has contributed 50 × 5 = 250 utterances. The corpus also contains speech data collected from 6 healthy speakers. We have used two dysarthric speech datasets: the 50-word impaired speech corpus, and a 20-word dataset formed from the 50-word corpus by picking the utterances of word classes that are acoustically similar. The first dataset contains 6000 utterances and the second dataset contains 2400 utterances. The selected 20 words are listed in Table 1.
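The utterance counts quoted above can be reproduced with a short calculation. The sketch assumes that the 6 healthy speakers recorded the same words with the same number of repetitions as the 18 impaired speakers; the text does not state this explicitly, but it is the assumption under which the totals of 6000 and 2400 are obtained.

```python
impaired_speakers = 18
healthy_speakers = 6      # assumed to follow the same recording protocol
repetitions = 5

def corpus_size(num_words):
    """Total utterances when every speaker repeats every word `repetitions` times."""
    per_speaker = num_words * repetitions
    return per_speaker * (impaired_speakers + healthy_speakers)

per_speaker_50 = 50 * repetitions   # utterances per speaker in the 50-word corpus
total_50 = corpus_size(50)          # size of the 50-word dataset
total_20 = corpus_size(20)          # size of the 20-word dataset
```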

B. HMM FOR IMPAIRED SPEECH RECOGNITION
Because of co-articulation effects, context-dependent triphone units are used as the basic units to model impaired speech utterances. In conventional HMM training, the number of states and mixtures is fixed based on the lexicon, the silence and phoneme related files, and the text file. The text file contains the utterance IDs and the corresponding words. With the help of these files, the HMM topology is fixed and different alignments are performed by considering the phoneme as the basic unit. Triphones significantly increase the number of parameters to be estimated. The performance of the HMM, evaluated using monophones and four different triphone models (tri1a, tri2a, tri3a and tri4a) on the 20 acoustically similar words dataset and the 50-word impaired speech corpus in Tamil, is shown in Table 2. The HMM gives poor performance because of the challenges in impaired speech, such as missing vowels and consonants and overlaps between acoustically similar word classes.
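The following sketch illustrates why triphones increase the number of parameters. It expands a phone sequence into context-dependent units written in the common "left-centre+right" notation and counts the number of possible triphones for a given inventory size. The phone transcription, the use of "sil" as the boundary context, the inventory size of 40 and the function names are hypothetical.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-centre+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

def max_triphones(num_phones):
    """Upper bound on distinct triphones for an inventory of `num_phones` phones."""
    return num_phones ** 3

phones = ["v", "e", "n", "u", "m"]       # illustrative transcription of a short word
triphones = to_triphones(phones)
upper_bound = max_triphones(40)          # 40 phones is an assumed inventory size
```

The cubic growth of the upper bound is the reason why states are tied in practice: most triphones are rarely or never observed, especially in a small impaired speech corpus.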

C. DNN-HMM APPROACH
In the DNN, various parameters such as the number of hidden layers, the number of neurons in each hidden layer and the batch size are fixed based on the HMM-aligned data, the Weighted Finite State Transducer (WFST) and the lexicon file. The maximum number of states that can be active at one time is controlled by the max-active parameter, which is fixed during the decoding process. Different triphone alignments are studied to achieve better word recognition accuracy even with overlapped and missing phonemes. The DNN-HMM shows a slight improvement in performance over the HMM, by 1.01% for the 20 acoustically similar words dataset and by 11.2% for the 50-word impaired speech corpus in Tamil. The improvement for impaired speech recognition is only slight because of the limited amount of training data compared with the large datasets available for the Automatic Speech Recognition (ASR) task. The performance of the DNN-HMM on the two datasets is shown in Table 3.

D. CONVOLUTIONAL NEURAL NETWORK APPROACH
A Convolutional Neural Network (CNN) is used to learn high-level features from spectrograms. Spectrograms are generated by applying the FFT to the preprocessed impaired speech signal with a Hamming window. A Mel filter bank is then applied to convert the spectrum to the Mel spectrum. The dimension of the generated spectrogram is 1368x864 pixels. The generated spectrograms are discriminative even for acoustically similar word classes. These spectrograms are fed as input to the CNN, which outputs the word label of the impaired speech sample. The architecture of the CNN is as follows. The network is composed of four sets of convolutional layers and max pooling layers with a pool size of 2x2. The number of filters in the convolution operation starts at 16 and is gradually increased to 128. Batch normalization and dropout regularization are applied to avoid overfitting. The dropout rate is set to 50% in all the layers. The categorical cross-entropy loss and the Adam optimizer are used to train the 42,879,892 trainable parameters. The TensorFlow and Keras packages are used to implement the CNN architecture. The performance of the CNN on the two datasets is shown in Table 6.

E. LATTICE FREE MMI APPROACH
LF-MMI gives slightly better performance than DNN-HMM for impaired speech recognition, as shown in Table 6. For the 20 acoustically similar words dataset of the impaired speech corpus in Tamil, the LF-MMI approach improves on the HMM, DNN-HMM and CNN by 3.38%, 2.33% and 22.93% respectively. For the 50-word impaired speech corpus in Tamil, the LF-MMI approach shows a larger improvement of 25.5%, 11.18% and 35.15% over the HMM, DNN-HMM and CNN respectively.

F. FIXED DIMENSIONAL REPRESENTATION USING CEPSTRAL FEATURES
The raw impaired speech signal is fed as input to extract Mel Frequency Cepstral Coefficients (MFCC). MFCC is a dominant feature extraction technique that extracts speaker-specific parameters from the impaired speech. The steps involved in MFCC feature extraction are as follows: preprocessing, framing and windowing, Fast Fourier Transform (FFT), processing using the Mel filter bank, and Discrete Cosine Transform (DCT). Long windows are used to obtain better frequency resolution and short windows are used for better time resolution. The Support Vector Machine (SVM) is a discriminative classifier that has proved effective for complex recognition tasks even with a small amount of training data. The performance of this representation is reported in Table 4.
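The MFCC steps listed above can be sketched with NumPy alone. The sketch below is illustrative and is not the feature extractor used in the experiments: the sampling rate, the 25 ms window, the 10 ms hop, the pre-emphasis coefficient of 0.97, the 26 Mel filters, the 13 output coefficients and all function names are assumed, commonly used defaults, and delta features are not computed.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centres equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                 # rising slope
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                # falling slope
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Preprocessing: pre-emphasis boosts the high frequencies.
    emphasised = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and windowing with a Hamming window.
    n_frames = 1 + (len(emphasised) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasised[idx] * np.hamming(frame_len)
    # FFT power spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filter bank energies on a log scale.
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return energies @ basis.T

sr = 16000
t = np.arange(sr) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t), sr=sr)   # one second of a test tone
```

Changing `frame_len` in this sketch shows the trade-off mentioned in the text: a longer window gives finer frequency resolution per frame but fewer, coarser time steps.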

G. VISUAL REPRESENTATION USING GAMMATONEGRAM
We explored another fixed dimensional representation using gammatonegrams, which perform better than spectrograms [35]. A gammatonegram is a visual time versus frequency representation of the energy of the speech signal, obtained using the Short Time Fourier Transform (STFT) and a gammatone filterbank. Gammatonegram generation is quite simple and requires only matrix multiplication and the Discrete Fourier Transform (DFT). It is more robust than the traditional spectrogram, since the magnitude gain of the gammatone bandpass filters is proportional to the bins of the DFT.
The key difference between the spectrogram and the gammatonegram lies in the bandwidth. In the spectrogram, the input speech signal is processed by bandpass filters that all have the same bandwidth. In the gammatonegram representation, the bandwidth of the bandpass filter changes with the central frequency. This implies that frequency differences are resolved less sharply in the high-frequency region than in the low-frequency region. The input speech signal is divided into $n$ frames and the gammatonegram representation $y(t, f_c)$ is formed by concatenating the output responses of the frames $x(t)$ to the gammatone filterbank $g(t, f_c)$ [35]:
$$y(t, f_c) = x(t) * g(t, f_c), \qquad g(t, f_c) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t),$$
where each column of the gammatonegram is the filterbank response at time $t$, $f_c$ is the central frequency (in Hz), $a$ is the amplitude that controls the gain and is kept constant, and $n$ denotes the order of the filter. The bandwidth of the filter is determined by the impulse response duration and the decay factor $b$. The gammatonegrams generated for the words "Vendaam" and "Venum" uttered by a "Very Low" intelligibility speaker are depicted in Figure 3. In the spectrogram, by contrast, the interest points are localized in the low-frequency region. The generated gammatonegrams show better discrimination even for two acoustically similar words. The dimension of the generated gammatonegram is 875x656 pixels, and to reduce the computational complexity it is resized to 100x100 pixels. Robust features such as Scale Invariant Feature Transform (SIFT) and Binary Large Object (BLOB) features are extracted from the gammatonegrams. The dimension of the BLOB and SIFT feature vector is 30000. The performance of this representation using auditory image features on the 20 acoustically similar words dataset and the 50-word impaired speech corpus in Tamil is shown in Table 5.
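The gammatone filterbank response can be illustrated with a small time-domain sketch. Note that this is a swapped-in simplification: it convolves the signal directly with gammatone impulse responses instead of weighting STFT bins. The equivalent rectangular bandwidth formula, the factor 1.019, the fourth-order filter, the unit-energy normalization, the centre frequencies and the function names are assumptions drawn from common practice, not from the experiments.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, sr, n=4, a=1.0, duration=0.025):
    """Impulse response a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * sr)) / sr
    b = 1.019 * erb(fc)                       # decay factor grows with fc
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gammatonegram(signal, sr, centre_freqs, frame_len=400, hop=160):
    """Log frame energy of each gammatone channel, shape (channels, frames)."""
    rows = []
    for fc in centre_freqs:
        ir = gammatone_ir(fc, sr)
        ir = ir / np.sqrt(np.sum(ir ** 2))    # unit-energy filters (assumption)
        y = np.convolve(signal, ir, mode="same")
        n_frames = 1 + (len(y) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        rows.append(np.log(np.mean(y[idx] ** 2, axis=1) + 1e-10))
    return np.array(rows)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)         # 1 kHz test tone
freqs = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
gram = gammatonegram(tone, sr, freqs)
strongest = int(np.argmax(gram.mean(axis=1)))  # channel with the most energy
```

For the 1 kHz test tone, the channel centred at 1000 Hz carries the most energy, and the bandwidths returned by `erb` widen with the central frequency, which is the property that distinguishes the gammatonegram from a fixed-bandwidth spectrogram.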

H. MULTIVIEW REPRESENTATION USING CEPSTRAL FEATURES AND GAMMATONE IMAGE FEATURES
The cepstral features are combined with the auditory image features to form the multiview representation. The feature dimension of the multiview representation is 30,000 (BLOB feature dimension) + 3,900 (MFCC features with 100 windows) = 33,900. These combined features are fed as input to the discriminative SVM classifier, and an improved performance is obtained compared with the fixed dimensional MFCC representation and the gammatonegram representation. The word recognition accuracy of the multiview representation is shown in Table 6.
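The construction of the multiview vector amounts to a concatenation, as the following sketch shows. The split of the 3,900 MFCC values into 100 windows of 39 coefficients is an inference from the quoted dimensions, and the zero-filled arrays and the function name are placeholders for real features.

```python
import numpy as np

def multiview(image_features, cepstral_features):
    """Concatenate flattened image and cepstral features into one vector."""
    return np.concatenate([image_features.ravel(), cepstral_features.ravel()])

blob = np.zeros(30000)            # placeholder for the auditory image features
mfcc_feats = np.zeros((100, 39))  # 100 windows x 39 coefficients (assumed layout)
combined = multiview(blob, mfcc_feats)
```

The resulting fixed-length vector can be passed to any classifier that expects one feature vector per utterance, such as the SVM used here.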

1) Overall Comparison
We have compared the LF-MMI approach with the conventional HMM, the DNN-HMM, the fixed dimensional MFCC representation, the gammatonegram representation, the multiview representation and the CNN. The word recognition accuracy (WRA) is calculated for all the experiments to evaluate the performance, and the results are given in Table 6. Though the DNN-HMM performs better than the HMM, the fixed dimensional MFCC representation, the gammatonegram representation, the multiview representation and the CNN, the LF-MMI approach attains a still better recognition accuracy, even in the presence of highly overlapping word classes such as "Kaayam" and "Kashtam", "Paal" and "Paapa", "Saapadu" and "Saapdu", and "Venum" and "Vendaam".
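For isolated-word recognition, the word recognition accuracy can be computed as the percentage of test utterances whose recognized word matches the reference word. The text does not spell out the formula, so this definition, the function name and the toy labels below are assumptions of the sketch.

```python
def word_recognition_accuracy(reference, hypothesis):
    """Percentage of utterances whose recognized word equals the reference word."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    correct = sum(r == h for r, h in zip(reference, hypothesis))
    return 100.0 * correct / len(reference)

# Toy example with one confusion between two acoustically similar words.
reference = ["Venum", "Vendaam", "Paal", "Paapa"]
hypothesis = ["Venum", "Venum", "Paal", "Paapa"]
wra = word_recognition_accuracy(reference, hypothesis)
```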

2) Performance Analysis with varied Intelligibility levels
The performance of the LF-MMI approach for speakers belonging to the different intelligibility levels of the Impaired Speech Corpus in Tamil is shown in Table 7. The words uttered by speakers belonging to the "Very Low" and "Low" intelligibility levels are recognized more accurately by the LF-MMI approach than by the other conventional approaches. Even with a limited amount of training data, highly overlapping word classes and missing phonemes, the LF-MMI approach provides better performance than the HMM based approach, improving the performance by 2.56%, 3.68%, 20.54% and 7.35% for the "High", "Medium", "Low" and "Very Low" intelligibility levels respectively. Figures 5, 6, 7 and 8 show the speaker-wise word recognition accuracy for the "High", "Medium", "Low" and "Very Low" intelligibility levels respectively.

V. CONCLUSION
We have investigated the performance of the DNN-HMM approach and the LF-MMI approach for the Impaired Speech Recognition task. The LF-MMI approach provides improved discrimination among impaired speech utterances of acoustically similar word classes with missing vowels and consonants. The performance of the LF-MMI approach was evaluated using the 20 acoustically similar words dataset and the 50-word dataset of the Impaired Speech Corpus in Tamil. The LF-MMI approach shows better performance than the conventional HMM, the DNN-HMM, the CNN and the multiview representation. Though the DNN-HMM and LF-MMI approaches are promising for healthy speech recognition systems, our studies show that there is still a need for robust methodologies to improve the performance of the impaired speech recognition task.