Since human speech recognition capabilities significantly outperform state-of-the-art automatic speech recognition (ASR) systems, it is natural that many researchers have included biological inspiration in the design of ASR algorithms. In fact, the most popular ASR feature set, the Mel frequency cepstral coefficients (MFCC), makes use of the psychoacoustic properties of the auditory system, such as the non-uniform distribution of the auditory filters throughout the frequency range of hearing. However, these filters are merely a preprocessing step in the complex hierarchical structure of the auditory system, where most of the computation is carried out via action potentials that originate from the cochlea as spike trains traveling through the auditory nerve (AN) fibers.
There has been some limited research on spike train representations for speech recognition; however, these algorithms either use novel, yet biologically implausible, ways to generate spike trains from speech or use the spike trains generated in the AN fibers connected to the cochlea without any spike encoding scheme. We believe that a better understanding of the information encoding of AN fiber spike trains will lead to more noise-robust ASR systems. This research explores various spike encoding schemes as possible feature sets in order to arrive at a noise-robust, spike-based feature domain for speech recognition. Furthermore, a fully spike-based classification algorithm is developed to employ the new feature set for a comparison with state-of-the-art ASR on a noisy vowel dataset. Our results show that phase locking cues inherent in the AN spike trains could be used as a competitive feature set for ASR and might provide a possible explanation for the superb robustness of the auditory system.
Pattern recognition problems usually consist of two fundamental steps: the extraction of a feature set which forms a compact and robust representation of the input signal; and classification in the generated feature space. Section II will describe the standard and spike-based methods for feature extraction for speech. Rank order coding and liquid state machine as spike-based classifiers will be discussed in Section III. Section IV includes the performance tests and results. Section V concludes the paper.
A. Mel Frequency Cepstral Coefficients
By far, the most popular feature set for commercial ASR systems is the mel frequency cepstral coefficients (MFCC). The key reason behind the success of MFCC is the noise robustness of the features compared to alternatives. A typical MFCC feature extraction is performed by first dividing the speech utterance into fixed-length frames, within which quasi-stationary analysis is possible. Next, the FFT magnitude of each frame is used to obtain a spectral representation. Triangular overlapping filters are then applied, evenly spaced on a mel frequency scale so that they are distributed more densely at lower frequencies than at higher frequencies, approximating the known spacing in the cochlea and that discerned through psychoacoustic experiments. The logarithm of the energy found in each mel frequency band is used to construct a vector of energies, which is then mapped to the cepstral domain using a Discrete Cosine Transform (DCT) to obtain the feature space.
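The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact front end used for the baseline: the sample rate, frame length, hop size, FFT size, number of mel filters, and the Hamming window are all assumed values chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common mel-scale formula (one of several variants in use).
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular overlapping filters, evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II of a 1-D vector, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Frame -> FFT magnitude -> mel energies -> log -> DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)          # assumed window choice
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    feats = np.empty((n_frames, n_ceps))
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame, n_fft))
        energies = fbank @ (spectrum ** 2)
        feats[t] = dct_ii(np.log(energies + 1e-10), n_ceps)
    return feats
```

Calling `mfcc(x)` on a one-dimensional array of samples returns one row of 13 coefficients per frame; the small constant inside the logarithm only guards against empty bands.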
B. Auditory Nerve Spike Trains
Through a complex series of processing steps, the acoustic pressure wave is converted into a spatio-temporal array of action potentials at the AN fibers within the inner ear. The cochlea is a fluid-filled chamber in which sound pressure waves flow and are filtered. These waves modulate the basilar membrane at various points along the cochlea corresponding to the frequencies present in the sound signal. As the name implies, the basilar membrane acts as a base for the sensory cells and, in oversimplified terms, acts as a frequency analyzer. The frequency content of the incoming signal is mapped to a place code along the length of the cochlea.
Special structures called the inner hair cells are attached to the basilar membrane throughout and are responsible for converting mechanical vibrations to electrical signals. Meddis formulated their characteristics and operation in 1986. The speech-to-spike front end used in this paper accounts for the latest revisions of the original model, incorporating higher-level variables such as the adaptation of the neural firings to constant stimuli.
Fig. 1 shows a simplified block diagram of the speech-to-spike conversion used in this paper. For the sake of computational simplicity, the number of frequency bands is limited to 20, whereas the number of hair cells/nerve fibers located around each band is assumed to be 50. Each frequency band consists of a set of equivalent rectangular bandwidth (ERB) filters to simulate the behavior of cochlear banks filtering the incoming speech signal. The Meddis hair cell block is used to simulate the spike output of a nerve fiber connected to the hair cell with a characteristic frequency around the mean channel frequency. Finally, the distribution of channel frequencies mimics the logarithmic distribution of cochlear filter banks with higher frequency resolution at lower frequencies. The coding scheme block will be explained in further detail in the following section.
Fig. 1. Speech-to-spike conversion block which shows the transduction of sound waves into trains of action potentials at nerve fibers connected to inner hair cells.
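As an illustration of how 20 channel frequencies with finer resolution at low frequencies might be laid out, the sketch below spaces the centers evenly on the ERB-rate scale using the widely used Glasberg and Moore formulas. The frequency range (100 Hz to 4 kHz) and the function names are assumptions made for this example; the paper does not state the range used.

```python
import numpy as np

N_CHANNELS = 20          # frequency bands, as stated in the text
FIBERS_PER_CHANNEL = 50  # hair cells / nerve fibers per band, as stated in the text

def hz_to_erb_rate(f_hz):
    # ERB-rate scale (Glasberg and Moore form).
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def erb_rate_to_hz(erb_rate):
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437

def erb_bandwidth(f_hz):
    # Equivalent rectangular bandwidth in Hz at center frequency f_hz.
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def channel_center_frequencies(n_channels=N_CHANNELS, f_low=100.0, f_high=4000.0):
    """Centers evenly spaced on the ERB-rate scale (f_low, f_high are assumed)."""
    erb_points = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_channels)
    return erb_rate_to_hz(erb_points)

centers = channel_center_frequencies()
bandwidths = erb_bandwidth(centers)
```

With this spacing, neighboring centers are close together at the low end and progressively farther apart toward the high end, which is the qualitative behavior the block diagram relies on.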
C. Spike-Based Feature Extraction
Researchers have studied numerous methods to encode signals as spike trains, including rate coding, direct temporal coding, and synchrony coding. In earlier work, we investigated the most common coding schemes for extracting information from spike trains and concluded that, for conversational sound pressure levels (SPL), rate coding is not a viable way to carry information: during a regular conversation, the input SPLs fluctuate around 60 dB and most nerve fibers are simply saturated, firing as fast as they can. Similarly, our results showed that direct temporal coding is susceptible to noise due to the degrading effect of spike jitter at high noise levels.
This paper introduces a modified version of feature extraction based on the phase locking behavior of AN fibers, which results in a noise-robust representation of the original speech spectrum. Synchrony and phase locking in the auditory neurons have been well documented and used for practical problems such as pitch detection. Synchrony coding can be viewed as a special type of temporal coding where groups of neurons fire at similar times. Such an effect has been used to explain the group communications of neurons, especially on the sensory level. This type of coding is observed among AN fibers centered around the same characteristic frequencies when they phase lock to the input speech signal.
The inter-spike intervals (ISI) between successive spike firings on each neuron are used to extract the phase locking feature. Assuming that the spike output of the 50 neurons (from Fig. 1) is present, the degree of phase locking can be obtained via a Fourier analysis of the ISI distribution among the neurons. Fig. 2 shows the log-magnitude spectral envelope of a common vowel “iy” as in “beet”. The peak observed in the bottom plot, which shows the Fourier analysis of the ISI distribution, indicates a specific degree of phase locking to 300 Hz, which corresponds to the first formant frequency of this vowel.
Fig. 2. Log-magnitude spectral envelope for /iy/ and the corresponding degree of phase locking for a set of hair cells centered at Fc = 300 Hz (computed for a noisy utterance with 5 dB SNR).
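The Fourier analysis of the ISI distribution can be sketched as follows. This is only an illustration under stated assumptions: the spike trains are synthetic (50 fibers firing on random cycles of a 300 Hz tone with Gaussian jitter) rather than the output of the hair cell model, and the bin width, maximum interval, search band, and function names are hypothetical choices.

```python
import numpy as np

def simulate_phase_locked_trains(f0=300.0, n_neurons=50, duration=0.5,
                                 p_fire=0.3, jitter=2e-4, seed=0):
    """Toy stand-in for AN fibers: each neuron fires on a random subset of
    stimulus cycles, with small timing jitter around the locked phase."""
    rng = np.random.default_rng(seed)
    cycle_times = np.arange(0.0, duration, 1.0 / f0)
    trains = []
    for _ in range(n_neurons):
        fired = rng.random(cycle_times.size) < p_fire
        spikes = cycle_times[fired] + rng.normal(0.0, jitter, fired.sum())
        trains.append(np.sort(spikes))
    return trains

def isi_spectrum(spike_trains, bin_width=1e-4, max_isi=0.05):
    """Pool the ISIs of all neurons, histogram them, and take the FFT magnitude."""
    isis = np.concatenate([np.diff(t) for t in spike_trains if len(t) > 1])
    n_bins = int(round(max_isi / bin_width))
    hist, _ = np.histogram(isis, bins=n_bins, range=(0.0, max_isi))
    hist = hist - hist.mean()                    # drop the DC term
    mag = np.abs(np.fft.rfft(hist))
    freqs = np.fft.rfftfreq(n_bins, d=bin_width)
    return freqs, mag

def peak_frequency(freqs, mag, f_min=100.0):
    """Largest spectral peak above f_min (skips the slow histogram envelope)."""
    band = freqs >= f_min
    idx = np.argmax(mag[band])
    return freqs[band][idx], mag[band][idx]

trains = simulate_phase_locked_trains()
freqs, mag = isi_spectrum(trains)
f_peak, strength = peak_frequency(freqs, mag)
```

Because phase-locked fibers produce intervals clustered at integer multiples of the stimulus period, the ISI histogram is roughly periodic in the interval axis, and its spectrum peaks at the locked frequency. The height of that peak can serve as the degree of phase locking for the channel.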
As shown in Fig. 3, the degree of phase locking for a different set of 50 neurons centered at a different characteristic frequency shows that phase locking to the dominant frequency in the vowel signal still exists, due to the overlapping bandwidth of the channel filters. However, the peak observed in the FFT magnitude plot is smaller, indicating a weaker degree of phase locking. This observation forms the basis of our feature extraction using the phase locking cues existing within the spike trains. One possibility for a fully spike-based algorithm is to replace the above digital algorithm (FFT of the ISI histogram) with a bank of leaky integrate-and-fire neurons with adaptive firing thresholds and decay constants.
Fig. 3. Degree of phase locking within 2 sets of hair cells centered at Fc = 300 Hz and Fc = 250 Hz in response to noisy vowel signal with F1 = 300 Hz.
A. Rank Order Coding
Rank order coding (ROC) is a new temporal coding technique used more commonly for image processing applications. The idea is to ignore the precise timing information and use a code that depends only on the order in which spikes arrive at the post-synaptic neuron. Recent experimental evidence suggests that this type of coding is used in the human somatosensory system and in the auditory systems of cats and rats to code sensory information. In earlier work, we employed a spike-based architecture using ROC decoding neurons. Even though ROC is an extremely fast and effective classifier, further analysis showed that the performance of rank-based classification degrades as the number of classes to be recognized increases, which mandates the use of a more complex spike-based analyzer.
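A minimal sketch of rank order decoding is given below, using one common formulation in which each input contributes its synaptic weight attenuated by a modulation factor raised to the rank of its spike. The modulation value, the prototype latencies, and the function names are hypothetical and are not taken from the architecture used in our earlier work.

```python
def ranks_from_latencies(latencies):
    """Rank of each input by spike arrival order (0 = first to fire)."""
    order = sorted(range(len(latencies)), key=lambda i: latencies[i])
    ranks = [0] * len(latencies)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def weights_from_prototype(latencies, modulation=0.8):
    """Weights tuned to one firing order: earlier inputs get larger weights."""
    return [modulation ** r for r in ranks_from_latencies(latencies)]

def roc_activation(ranks, weights, modulation=0.8):
    """Decoding neuron activation: each weight is attenuated by the input's rank."""
    return sum(w * modulation ** r for w, r in zip(weights, ranks))

def classify(latencies, class_weights, modulation=0.8):
    """Pick the decoding neuron with the highest activation."""
    ranks = ranks_from_latencies(latencies)
    scores = {label: roc_activation(ranks, w, modulation)
              for label, w in class_weights.items()}
    return max(scores, key=scores.get)

# Two toy classes defined by opposite firing orders (assumed example data).
class_weights = {
    "a": weights_from_prototype([1.0, 2.0, 3.0, 4.0]),
    "b": weights_from_prototype([4.0, 3.0, 2.0, 1.0]),
}
label = classify([10.0, 20.0, 30.0, 40.0], class_weights)
```

Only the order of the latencies matters here: rescaling or shifting all spike times leaves the ranks, and therefore the decision, unchanged.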
B. Liquid State Machine
The Liquid State Machine (LSM) consists of a reservoir of recurrent, sparsely connected spiking neurons with continuous-time dynamics, followed by a simple feedforward, memoryless readout mechanism. Since the weights in the recurrent network are fixed to random values and not adapted, the reservoir acts like a set of fixed kernel or basis functions. The feedforward readout is trained to extract desired output signals from the reservoir. The LSM structure of this paper consists of a neural microcircuit (liquid) connected to the front-end system in Fig. 1 through a series of analog input neurons, which map the degree of phase locking (DoPL) feature space to the state of the liquid, defined as the firing activity of all of its neurons.
Since the DoPL feature set is an analog vector, it is supplied to the neural microcircuit using analog synapses modeling membrane potentials. The neural microcircuit is chosen to have 300 leaky integrate-and-fire neurons, 20% of which are chosen to be inhibitory. The local recurrent connections are modeled using a Gaussian distribution, resulting in denser connections amongst neighboring neurons. All connections used in the microcircuit are dynamic spiking synapses with spike timing-dependent plasticity.
During the training phase, the state vector is first obtained by low-pass filtering the spike outputs of the 300 neurons in the microcircuit with a 300 Hz cut-off frequency. The state vector is then sampled every 20 ms and associated with a class label corresponding to the input signal. This generates the input and desired output pairs used to train a single-hidden-layer feed-forward neural network with the well-known back-propagation algorithm. The neural network uses a tangential sigmoid output, which is quantized by the number of available input classes and is used to approximate the memoryless readout function mapping the state of the microcircuit to the desired class label.
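The connectivity of such a microcircuit can be sketched as follows. The example places the 300 neurons on a 15 x 5 x 4 grid and connects each pair with a probability that falls off as a Gaussian of their distance, a form often used in the LSM literature. The grid shape, the constants `c` and `lam`, the uniform weight range, and the function name are assumptions for illustration; dynamic synapses and spike timing-dependent plasticity are not modeled here.

```python
import numpy as np

def build_reservoir(grid=(15, 5, 4), frac_inhibitory=0.2, c=0.3, lam=2.0, seed=0):
    """Random recurrent connectivity with distance-dependent probability.

    Returns (weights, inhibitory, dist), where weights[i, j] is the synapse
    from presynaptic neuron i to postsynaptic neuron j (0 if unconnected).
    """
    rng = np.random.default_rng(seed)
    coords = np.array([(x, y, z)
                       for x in range(grid[0])
                       for y in range(grid[1])
                       for z in range(grid[2])], dtype=float)
    n = len(coords)

    # Pairwise Euclidean distances between grid positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # Gaussian fall-off: neighbors are far more likely to be connected.
    prob = c * np.exp(-(dist / lam) ** 2)
    np.fill_diagonal(prob, 0.0)                 # no self-connections
    connected = rng.random((n, n)) < prob

    # Mark 20% of the neurons as inhibitory; their outgoing weights are negative.
    inhibitory = np.zeros(n, dtype=bool)
    n_inh = int(round(frac_inhibitory * n))
    inhibitory[rng.choice(n, size=n_inh, replace=False)] = True
    sign = np.where(inhibitory, -1.0, 1.0)

    weights = connected * rng.uniform(0.0, 1.0, (n, n)) * sign[:, None]
    return weights, inhibitory, dist

weights, inhibitory, dist = build_reservoir()
```

Increasing `lam` spreads connections over longer distances, while `c` scales the overall connection density; both would need tuning for any real experiment.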
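The extraction of the state vector can be illustrated with a short sketch. It assumes the liquid's output is available as a binary spike raster sampled at 10 kHz and uses a first-order low-pass filter; the filter order, the raster time step, and the function name are assumptions, since only the 300 Hz cut-off and the 20 ms sampling period are specified above.

```python
import numpy as np

def liquid_state(spikes, dt=1e-4, cutoff_hz=300.0, sample_period=0.02):
    """Low-pass filter each neuron's spike train and sample the result.

    spikes: array of shape (n_steps, n_neurons) with 0/1 entries.
    Returns an array of shape (n_samples, n_neurons), one state vector per row.
    """
    rc = 1.0 / (2.0 * np.pi * cutoff_hz)     # time constant for the cut-off
    alpha = dt / (rc + dt)                   # first-order smoothing factor
    filtered = np.zeros(spikes.shape, dtype=float)
    state = np.zeros(spikes.shape[1])
    for t in range(spikes.shape[0]):
        state = state + alpha * (spikes[t] - state)
        filtered[t] = state
    step = int(round(sample_period / dt))    # raster steps per 20 ms
    return filtered[step - 1::step]

# Toy raster: 0.1 s of random spiking from 300 neurons (assumed data).
rng = np.random.default_rng(0)
raster = (rng.random((1000, 300)) < 0.01).astype(float)
states = liquid_state(raster)
```

Each row of `states` would be paired with the class label of the current input to form one training example for the readout network.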
Two different experiments were performed to quantify the performance of the spike-based algorithm for a simple ASR task and then to compare the robustness of the proposed feature set to a baseline state-of-the-art ASR system. The baseline system employs a vector of 39 elements consisting of 13 MFCC coefficients along with their first and second derivatives. A single-state Hidden Markov Model (HMM) of 64 Gaussians is used as its classifier. Even though a single-state HMM is nothing more than a Gaussian mixture model (GMM), more HMM states are not necessary for this simple stationary phoneme recognition problem, where left-to-right frame transition is non-existent. The dataset consists of the 10 most common English vowel classes. Training (200) and testing (200) vowels for each class are chosen from the popular TIMIT database to constitute a set of isolated vowels such that they are both multi-speaker and multi-gender. Finally, each utterance is corrupted by both street and car noise to measure the robustness of the proposed feature set/classifier against that of the baseline ASR system.
In the first experiment, the MFCC-HMM baseline algorithm is compared to the full spike-based structure employing the LSM and the degree of phase locking feature set (DoPL-LSM). In the second experiment, DoPL features are augmented with MFCC features and used with the HMM classifier. It is important to note that the time-capturing properties of neither the HMM nor the LSM come into play in any of the experiments, as we are testing single vowels. In future experiments using words with multiple phonemes, the HMM-LSM comparison will be further analyzed.
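Corrupting an utterance at a chosen SNR can be sketched as below. This is a generic illustration: the noise here is synthetic white noise standing in for the street and car recordings, and the looping of a short noise clip and the function name are assumptions rather than the procedure used to build the test set.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db."""
    noise = np.resize(noise, speech.shape)   # loop or truncate noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Toy example: a 300 Hz tone as the "vowel", white noise as the corruption.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300.0 * np.arange(8000) / 16000.0)
noise = rng.normal(size=3000)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

The gain is computed from average power over the whole utterance, so the stated SNR is a global figure and the local SNR will vary from frame to frame.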
A. Experiment I
Table I shows the percentage of vowels correctly classified for the spike-based algorithm and the MFCC-HMM baseline. At high SNR values, the MFCC-HMM baseline outperforms DoPL-LSM, which indicates that the generalization of the liquid state with the increasing number of vowel classes falls behind that of the hidden Markov model. However, the drop in performance with increasing levels of noise is close to 20% for MFCC-HMM, whereas it is only around 10% for DoPL-LSM, which shows a significant degree of robustness for the spike-based method. We did not expect the fully spike-based algorithm to outperform the state-of-the-art methods, but it is very promising to see how close the performance is, specifically in low-SNR situations.
TABLE I Percentage of Vowels (10 Classes) Correctly Classified for DoPL-LSM and MFCC-HMM Engine
B. Experiment II
In our final test, we augmented the standard 39-dimension MFCC feature set with the degree of phase locking parameters before classification with an HMM. Table II shows the percentage of correctly classified vowels for the augmented feature set compared to MFCC alone. As expected, the augmented feature set vastly outperforms MFCC at nearly every noise level, and the performance difference is as much as 20% at the lowest SNR values. The conclusion is that the augmented set preserves the noise robustness of the phase-synchrony feature set and combines it with the better generalization of the HMM. This augmented system is no longer claimed to be biologically plausible, but it is still strongly biologically inspired. Future work will include comparison as well as augmentation with more robust variants of MFCC features.
TABLE II Percentage of Vowels (10 Classes) Correctly Classified for the Augmented Features and MFCC With the HMM Engine
This paper shows that the performance of our spike-based vowel recognition system is comparable to a state-of-the-art ASR system for low-SNR vowel recognition. The common inspiration from biology seen in conventional ASR systems is taken a step further with a novel feature extraction technique based upon the computational units of neural communication. Finally, the augmentation of the state-of-the-art MFCC features with novel spike-based features derived from the phase locking cues led to a much improved recognition system, at least for simple vowel recognition tasks.
The obvious next step is the application of these principles for more challenging speech frames such as unvoiced phonemes where extracting phase locking cues will be more difficult, and ultimately to a large vocabulary continuous speech recognition (LVCSR) problem.
The authors would like to thank Dr. Harsha Sathyendra for his past contributions to the research discussed in this paper. This work was partially funded by National Science Foundation grant 05412410.