
IEEE Transactions on Audio, Speech, and Language Processing

Issue 4 • May 2009

Displaying Results 1 - 25 of 36
  • Table of contents

    Publication Year: 2009 , Page(s): C1 - C4
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2009 , Page(s): C2
    Freely Available from IEEE
  • Improved Noise Power Spectrum Density Estimation for Binaural Hearing Aids Operating in a Diffuse Noise Field Environment

    Publication Year: 2009 , Page(s): 521 - 533
    Cited by:  Papers (14)

    The current generation of digital hearing aids allows the implementation of advanced noise reduction schemes. However, most current noise reduction algorithms are monaural and are therefore intended only for bilateral hearing aids. Recently, binaural (as opposed to monaural) noise reduction schemes have been proposed, targeting future high-end binaural hearing aids. These new types of hearing aids would allow the sharing of the information/signals received by the left and right hearing aid microphones (via a wireless link) to generate an output for each ear. This paper presents a novel noise power spectral density estimator for binaural hearing aids operating in a diffuse noise field environment, which takes advantage of the left and right reference signals that will be accessible, as opposed to the single reference signal currently available in bilateral hearing aids. In contrast with some previously published noise estimation methods for hearing aids or speech enhancement, the proposed noise estimator does not assume stationary noise; it can work for colored noise in a diffuse noise field; it does not require a voice activity detector, so the noise power spectrum can be estimated during both speech activity and speech pauses; it does not experience noise tracking latency; and, most importantly, the target speaker need not be in front of the binaural hearing aid user, i.e., the direction of arrival of the source speech signal can be arbitrary. Finally, the proposed noise estimator can be combined with any hearing aid noise reduction technique where the accuracy of the noise estimate is critical to achieving satisfactory de-noising performance.
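
    The key ingredient above, the known coherence of an ideal diffuse field between two microphones, is easy to illustrate in code. The following is a minimal, hypothetical VAD-free estimator, not the paper's: time-frequency bins whose measured left/right coherence matches the theoretical diffuse-field model are treated as noise-dominated and averaged into a running noise PSD estimate. The spacing d, smoothing constant alpha, and tolerance tol are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft

def diffuse_coherence(freqs, d=0.16, c=343.0):
    """Theoretical coherence of an ideal diffuse field between two
    omnidirectional microphones spaced d metres apart."""
    return np.sinc(2.0 * freqs * d / c)   # np.sinc(x) = sin(pi x)/(pi x)

def estimate_noise_psd(left, right, fs, d=0.16, alpha=0.92, tol=0.15):
    """VAD-free noise PSD sketch: bins whose smoothed left/right
    coherence matches the diffuse-field model are treated as
    noise-dominated and averaged into the running noise estimate."""
    f, _, L = stft(left, fs=fs, nperseg=512)
    _, _, R = stft(right, fs=fs, nperseg=512)
    gamma = diffuse_coherence(f)
    pll, prr = np.abs(L[:, 0]) ** 2, np.abs(R[:, 0]) ** 2
    plr = L[:, 0] * np.conj(R[:, 0])
    noise = 0.5 * (pll + prr)
    for k in range(1, L.shape[1]):
        # Exponentially smoothed auto- and cross-spectra over frames.
        pll = alpha * pll + (1 - alpha) * np.abs(L[:, k]) ** 2
        prr = alpha * prr + (1 - alpha) * np.abs(R[:, k]) ** 2
        plr = alpha * plr + (1 - alpha) * L[:, k] * np.conj(R[:, k])
        coh = np.real(plr) / np.sqrt(pll * prr + 1e-12)
        is_noise = np.abs(coh - gamma) < tol    # matches diffuse model
        cur = 0.5 * (pll + prr)
        noise[is_noise] = (alpha * noise[is_noise]
                           + (1 - alpha) * cur[is_noise])
    return f, noise
```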

  • Suppression of Late Reverberation Effect on Speech Signal Using Long-Term Multiple-Step Linear Prediction

    Publication Year: 2009 , Page(s): 534 - 545
    Cited by:  Papers (34)

    A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades automatic speech recognition (ASR) performance. One way to solve this problem is to dereverberate the observed signal prior to ASR. In this paper, a room impulse response is assumed to consist of three parts: a direct-path response, early reflections, and late reverberation. Since late reverberation is known to be a major cause of ASR performance degradation, this paper focuses on dealing with its effect. The proposed method first estimates the late reverberation using long-term multiple-step linear prediction, and then reduces its effect by spectral subtraction. The algorithm provided good dereverberation with training data corresponding to the duration of one speech utterance, in our case less than 6 s. This paper describes the proposed framework for both single-channel and multichannel scenarios. Experimental results showed substantial improvements in ASR performance with real recordings under severe reverberant conditions.
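
    A single-channel sketch of the two-stage idea, assuming the paper's general recipe rather than its exact settings: delayed ("multiple-step") long-term linear prediction estimates the late reverberation, and spectral subtraction removes it. The delay, filter order, and spectral floor below are illustrative guesses.

```python
import numpy as np
from scipy.signal import stft, istft

def suppress_late_reverb(x, fs, delay_ms=50, order=400, floor=0.05):
    """Estimate late reverberation by delayed long-term linear
    prediction, then remove it by power spectral subtraction."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    D = int(fs * delay_ms / 1000)   # delay skips direct path + early part
    # W[m] = [x[m+order-1], ..., x[m]]; row m predicts time n = m+D+order-1.
    W = np.lib.stride_tricks.sliding_window_view(x, order)[:, ::-1]
    A = W[: N - D - order + 1]
    b = x[D + order - 1 :]
    g, *_ = np.linalg.lstsq(A, b, rcond=None)   # multi-step LP coefficients
    late = np.zeros(N)
    late[D + order - 1 :] = A @ g               # late-reverberation estimate
    # Power spectral subtraction with a spectral floor.
    _, _, X = stft(x, fs=fs, nperseg=512)
    _, _, Rl = stft(late, fs=fs, nperseg=512)
    p = np.maximum(np.abs(X) ** 2 - np.abs(Rl) ** 2, floor * np.abs(X) ** 2)
    _, y = istft(np.sqrt(p) * np.exp(1j * np.angle(X)), fs=fs, nperseg=512)
    return y[:N]
```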

  • Relative Transfer Function Identification Using Convolutive Transfer Function Approximation

    Publication Year: 2009 , Page(s): 546 - 555
    Cited by:  Papers (11)

    In this paper, we present a relative transfer function (RTF) identification method for speech sources in reverberant environments. The proposed method is based on the convolutive transfer function (CTF) approximation, which makes it possible to represent a linear convolution in the time domain as a linear convolution in the short-time Fourier transform (STFT) domain. Unlike the restrictive and commonly used multiplicative transfer function (MTF) approximation, which becomes more accurate as the length of a time frame increases relative to the length of the impulse response, the CTF approximation enables the representation of long impulse responses using short time frames. We develop an unbiased RTF estimator that exploits the nonstationarity and presence probability of the speech signal, and we derive an analytic expression for the estimator variance. Experimental results show that the proposed method is advantageous compared with common RTF identification methods in various acoustic environments, especially when identifying the long RTFs typical of real rooms.
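
    To make the CTF approximation concrete, the sketch below fits a short frame-domain filter per frequency bin by least squares, assuming (hypothetically) that both the source signal and its reverberant observation are available; the paper's RTF estimator builds on this representation but is not reproduced here. The n_taps and nperseg values are illustrative.

```python
import numpy as np
from scipy.signal import stft

def identify_ctf(x, y, fs, n_taps=8, nperseg=256):
    """Per-bin least-squares fit of a convolutive transfer function:
    Y[k, f] ~ sum_l H[l, f] * X[k - l, f]  (convolution over frames),
    as opposed to the MTF model Y[k, f] ~ H[f] * X[k, f]."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    _, _, Y = stft(y, fs=fs, nperseg=nperseg)
    n_bins, n_frames = X.shape
    H = np.zeros((n_taps, n_bins), dtype=complex)
    for f in range(n_bins):
        # Convolution (Toeplitz) matrix built from the input frames.
        A = np.zeros((n_frames, n_taps), dtype=complex)
        for l in range(n_taps):
            A[l:, l] = X[f, : n_frames - l]
        H[:, f] = np.linalg.lstsq(A, Y[f], rcond=None)[0]
    return H   # H[l, f]: frame-domain filter taps per frequency bin
```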

  • Vowel Onset Point Detection Using Source, Spectral Peaks, and Modulation Spectrum Energies

    Publication Year: 2009 , Page(s): 556 - 565
    Cited by:  Papers (24)

    The vowel onset point (VOP) is the instant at which the onset of a vowel occurs during speech production. Significant changes occur in the energies of the excitation source, the spectral peaks, and the modulation spectrum at the VOP. This paper demonstrates the independent use of each of these three energies for detecting VOPs. Since each of these energies represents a different aspect of speech production, they may contain complementary information about the VOP, and the individual evidences are therefore combined for detecting VOPs. The error rates, measured as the ratio of missed and spurious detections to the total number of VOPs and evaluated on sentences taken from the TIMIT database, are 6.92%, 8.8%, 6.13%, and 4.0% for the source, spectral-peak, modulation-spectrum, and combined information, respectively. The combined method thus improves VOP detection by 2.13% over the best-performing individual method.
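
    The combination step lends itself to a small sketch: z-normalize each per-frame evidence signal, sum them, smooth, and pick peaks as VOP hypotheses. The extraction of the three evidences themselves is not reproduced, and the frame rate, smoothing width, and peak constraints below are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def detect_vops(evidences, frame_rate=100.0, min_gap_ms=50.0):
    """Fuse several per-frame VOP evidence signals (e.g. excitation
    source, spectral-peak, and modulation-spectrum energies) and
    return hypothesized VOP times in seconds."""
    combined = np.zeros(len(evidences[0]))
    for e in evidences:
        e = np.asarray(e, dtype=float)
        combined += (e - e.mean()) / (e.std() + 1e-12)  # z-normalize
    combined = gaussian_filter1d(combined, sigma=5.0)   # smooth evidence
    min_gap = max(int(frame_rate * min_gap_ms / 1000.0), 1)
    peaks, _ = find_peaks(combined, distance=min_gap,
                          height=combined.std())        # crude threshold
    return peaks / frame_rate
```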

  • Convergence Analysis of Narrowband Active Noise Equalizer System Under Imperfect Secondary Path Estimation

    Publication Year: 2009 , Page(s): 566 - 571
    Cited by:  Papers (6)

    Active noise equalizer systems are used to adjust the noise level in an environment according to the user's preference for retaining noise information. Several studies have been carried out to determine the maximum step size bound of a narrowband active noise control system under perfect secondary path estimation and without consideration of the gain factor. In practical environments, however, secondary path estimation errors exist. In this paper, a stochastic analysis is applied to determine the maximum step size of the system under imperfect secondary path estimation. Simulations are conducted to verify the analysis. The results show that the gain factor, the sampling frequency, and the secondary path estimation errors are all major factors governing the maximum step size of the narrowband active noise equalizer system under imperfect secondary path estimation.

  • A Family of Robust Algorithms Exploiting Sparsity in Adaptive Filters

    Publication Year: 2009 , Page(s): 572 - 581
    Cited by:  Papers (6)

    We introduce a new family of algorithms to exploit sparsity in adaptive filters. It is based on a recently introduced framework for designing robust adaptive filters, and results from minimizing a certain cost function subject to a time-dependent constraint on the norm of the filter update. Although in general this problem does not have a closed-form solution, we propose an approximation that is very close to the optimal solution. We take a particular algorithm from this family and provide some theoretical results regarding its asymptotic behavior. Finally, we test it in different environments for system identification and acoustic echo cancellation applications.

  • Analysis of Emotionally Salient Aspects of Fundamental Frequency for Emotion Detection

    Publication Year: 2009 , Page(s): 582 - 596
    Cited by:  Papers (33)

    During expressive speech, the voice is enriched to convey not only the intended semantic message but also the emotional state of the speaker. The pitch contour is one of the important properties of speech that is affected by this emotional modulation. Although pitch features have been commonly used to recognize emotions, it is not clear which aspects of the pitch contour are the most emotionally salient. This paper presents an analysis of the statistics derived from the pitch contour. First, pitch features derived from emotional speech samples are compared with those derived from neutral speech using the symmetric Kullback-Leibler distance. Then, the emotionally discriminative power of the pitch features is quantified by comparing nested logistic regression models. The results indicate that gross pitch contour statistics such as mean, maximum, minimum, and range are more emotionally prominent than features describing the pitch shape. Also, analyzing the pitch statistics at the utterance level is found to be more accurate and robust than analyzing them over shorter speech regions (e.g., voiced segments). Finally, the best features are selected to build a binary emotion detection system for distinguishing between emotional and neutral speech. A new two-step approach is proposed. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral model. In the second step, a fitness measure is used to assess whether the input speech is similar to the reference models (in the case of neutral speech) or different from them (in the case of emotional speech). The proposed approach is tested with four acted emotional databases spanning different emotional categories, recording settings, speakers, and languages. The results show that the recognition accuracy of the system is over 77% with the pitch features alone (baseline 50%). When compared to conventional classification schemes, the proposed approach performs better in terms of both accuracy and robustness.
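
    Two of the paper's ingredients are simple enough to sketch: utterance-level gross pitch statistics, and the symmetric Kullback-Leibler distance used to compare feature distributions, shown here in its closed form for univariate Gaussian fits (an assumption of this sketch; the F0 values are placeholders, not data from the paper).

```python
import numpy as np

def gross_pitch_stats(f0):
    """Utterance-level gross pitch statistics of the kind the paper
    finds most salient (f0 in Hz, zeros marking unvoiced frames)."""
    f0 = np.asarray(f0, dtype=float)
    v = f0[f0 > 0]                       # keep voiced frames only
    return {"mean": v.mean(), "max": v.max(),
            "min": v.min(), "range": v.max() - v.min()}

def symmetric_kl_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form symmetric KL divergence between two univariate
    Gaussians: KL(p||q) + KL(q||p)."""
    kl_pq = 0.5 * (np.log(var_q / var_p)
                   + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    kl_qp = 0.5 * (np.log(var_p / var_q)
                   + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl_pq + kl_qp

# Placeholder data: per-utterance F0 means for neutral vs. emotional
# speech; the divergence quantifies how salient the feature is.
rng = np.random.default_rng(0)
neutral, emotional = rng.normal(120, 15, 500), rng.normal(160, 30, 500)
salience = symmetric_kl_gauss(neutral.mean(), neutral.var(),
                              emotional.mean(), emotional.var())
```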

  • Multichannel Room Impulse Response Generation With Coherence Control

    Publication Year: 2009 , Page(s): 597 - 606
    Cited by:  Papers (1)

    A method is proposed by which an arbitrary number of room impulse responses (RIRs) is generated from one input RIR. The method works by convolving the input RIR with a multitude of filters, one for each desired output RIR. The filters are designed in such a way that the coherence between every possible pair of output RIRs takes on assignable frequency-dependent values. An application example is given in which the target coherence values are given by the diffuse-field coherence functions for the microphones of a five-element microphone array. The limitations in achieving the desired coherence values exactly while avoiding spectral coloration are also discussed. The proposed method is most suitable for the part of the RIR that does not contain strong discrete reflections.
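
    A minimal sketch of the underlying mixing idea, not the paper's filter design: a second RIR is synthesized by combining the input RIR with a phase-randomized (decorrelated) copy, weighted so that the pair approximates a target frequency-dependent coherence, here the diffuse-field model for an assumed 5-cm spacing.

```python
import numpy as np

def coherent_copy(rir, gamma, rng):
    """Synthesize a second RIR whose coherence with `rir`
    approximates the target values `gamma` (one per rFFT bin):
    mix the original with a phase-randomized, decorrelated copy."""
    H = np.fft.rfft(rir)
    H_dec = np.abs(H) * np.exp(1j * rng.uniform(0, 2 * np.pi, H.shape))
    H2 = gamma * H + np.sqrt(np.clip(1.0 - gamma ** 2, 0.0, 1.0)) * H_dec
    return np.fft.irfft(H2, n=len(rir))

rng = np.random.default_rng(0)
fs, n = 16000, 4096
freqs = np.fft.rfftfreq(n, 1.0 / fs)
gamma = np.sinc(2.0 * freqs * 0.05 / 343.0)   # diffuse field, 5 cm spacing
rir = rng.standard_normal(n) * np.exp(-np.arange(n) / 800.0)  # toy input RIR
rir2 = coherent_copy(rir, gamma, rng)
```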

  • Performance Analysis of 3-D Direction Estimation Based on Head-Related Transfer Function

    Publication Year: 2009 , Page(s): 607 - 613
    Cited by:  Papers (3)

    We analyze the performance of a sensor system based on the head-related transfer function (HRTF) modeling of human hearing in finding the direction of a source in 3-D space. Like the human auditory system, the sensor system has only two sensors together with a few nearby reflectors. First, we present a parametric measurement model that incorporates the diffractions associated with the head shape and the multipath reflections related to the external ear (pinna). Then, we compute the asymptotic Cramér-Rao bound on the 3-D direction estimate for a stochastic source signal of unknown parametrized spectrum. We provide a few numerical examples to illustrate our analytical results.

  • Event-Based Instantaneous Fundamental Frequency Estimation From Speech Signals

    Publication Year: 2009 , Page(s): 614 - 624
    Cited by:  Papers (22)

    Exploiting the impulse-like nature of excitation in the sequence of glottal cycles, a method is proposed to derive the instantaneous fundamental frequency from speech signals. The method involves passing the speech signal through two ideal resonators located at zero frequency. A filtered signal is derived from the output of the resonators by subtracting the local mean computed over an interval corresponding to the average pitch period. The positive zero crossings in the filtered signal correspond to the locations of the strong impulses in each glottal cycle. The instantaneous fundamental frequency is then obtained by taking the reciprocal of the interval between successive positive zero crossings. Because of the filtering by the zero-frequency resonators, the effects of noise and vocal-tract variations are practically eliminated, and for the same reason the method is robust to degradation of speech by additive noise. The accuracy of the fundamental frequency estimates obtained by the proposed method is comparable to, or even better than, that of many existing methods. Moreover, the proposed method is robust against rapid variation of the pitch period and against vocal-tract changes, and it works well even when the glottal cycles are not periodic or when the speech signals in successive glottal cycles are not correlated.
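
    The description above maps almost line-for-line onto code. The sketch below follows it: two cascaded zero-frequency resonators, local-mean removal over roughly the average pitch period, and F0 from positive-zero-crossing intervals. The initial differencing and the repetition of the mean-subtraction step follow common zero-frequency-filtering practice rather than the abstract's wording, and the assumed average pitch period is illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def zff_f0(x, fs, avg_pitch_ms=8.0):
    """Instantaneous F0 by zero-frequency filtering."""
    x = np.asarray(x, dtype=float)
    x = np.diff(x, prepend=x[0])             # remove DC / low-freq bias
    # Each zero-frequency resonator: y[n] = x[n] + 2 y[n-1] - y[n-2],
    # i.e. a double pole at z = 1; two resonators in cascade.
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    w = max(int(fs * avg_pitch_ms / 1000) | 1, 3)   # odd window length
    kernel = np.ones(w) / w
    for _ in range(3):                        # repeated local-mean removal
        y = y - np.convolve(y, kernel, mode="same")
    zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # positive zero crossings
    f0 = fs / np.diff(zc)                     # reciprocal of the intervals
    times = zc[1:] / fs
    return times, f0
```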

  • A Supervised Learning Approach to Monaural Segregation of Reverberant Speech

    Publication Year: 2009 , Page(s): 625 - 638
    Cited by:  Papers (9)

    A major source of signal degradation in real environments is room reverberation. Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. This paper proposes a supervised learning approach to monaural segregation of reverberant voiced speech, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The models trained using this objective function yield significantly better T-F unit labeling. A segmentation and grouping framework is utilized to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces very promising results under various reverberant conditions and generalizes well to new utterances and new speakers.

  • Frequency-Domain Pearson Distribution Approach for Independent Component Analysis (FD-Pearson-ICA) in Blind Source Separation

    Publication Year: 2009 , Page(s): 639 - 649
    Cited by:  Papers (8)

    In frequency-domain blind source separation (BSS) of speech with independent component analysis (ICA), a practical parametric Pearson distribution system is used to model the distribution of the frequency-domain source signals. ICA adaptation rules have a score function determined by an approximated source distribution, and an approximation fitted to the data can produce better separation performance than conventional ICA. Previously, a conventional hyperbolic tangent (tanh) or generalized Gaussian distribution (GGD) was applied uniformly to the score function for all frequency bins, even though a wideband speech signal has different distributions at different frequencies. To deal with this, we propose modeling the signal distribution at each frequency by adopting a parametric Pearson distribution and employing it to optimize the separation matrix in the ICA learning process. The score function is estimated using appropriate Pearson distribution parameters for each frequency bin. We devised three methods for Pearson distribution parameter estimation and conducted separation experiments with real speech signals convolved with actual room impulse responses (T60 = 130 ms). Our experimental results show that the proposed frequency-domain Pearson-ICA (FD-Pearson-ICA) adapted well to the characteristics of frequency-domain source signals. The signal-to-interference ratio (SIR) improved significantly, by around 2-3 dB, compared with conventional nonlinear functions. Even where the SIR values of FD-Pearson-ICA were poor, a disparity measure between the true score function and the estimated parametric score function clearly showed the advantage of FD-Pearson-ICA. Furthermore, by combining individual distribution parameters directly estimated at low frequencies with appropriate parameters optimized at high frequencies, it was possible to improve the FD-Pearson-ICA performance without any significant increase in the computational burden compared with conventional nonlinear functions.

  • Blind Spatial Subtraction Array for Speech Enhancement in Noisy Environment

    Publication Year: 2009 , Page(s): 650 - 664
    Cited by:  Papers (27)  |  Patents (1)

    We propose a new blind spatial subtraction array (BSSA) consisting of a noise estimator based on independent component analysis (ICA) for efficient speech enhancement. First, we point out theoretically and experimentally that under a non-point-source noise condition ICA is proficient at noise estimation rather than at speech estimation. We therefore propose the BSSA, which utilizes ICA as a noise estimator. In the BSSA, speech extraction is achieved by subtracting the power spectrum of the noise signal estimated by ICA from the power spectrum of the target speech partly enhanced by a delay-and-sum beamformer. This "power-spectrum-domain subtraction" procedure enables better noise reduction than conventional ICA, with robustness against estimation errors. Another benefit of the BSSA architecture is permutation robustness: although the ICA part of the BSSA suffers from the source permutation problem, the architecture reduces its negative effect when permutation arises. The results of various speech enhancement tests reveal that the noise reduction and speech recognition performance of the proposed BSSA are superior to those of conventional methods.
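
    The power-spectrum-domain subtraction step is compact enough to sketch. Below, Y is the STFT of the delay-and-sum beamformer output and N_hat is the STFT of the ICA-estimated noise; the flooring constant beta and the delay-and-sum helper are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def delay_and_sum(X, freqs, taus):
    """STFT-domain delay-and-sum beamformer.  X: (mics, bins, frames)
    STFTs; taus: per-microphone steering delays in seconds."""
    steer = np.exp(2j * np.pi * np.asarray(taus)[:, None] * freqs[None, :])
    return np.mean(steer[:, :, None] * X, axis=0)

def power_spectrum_subtraction(Y, N_hat, beta=0.1):
    """BSSA-style combination step: subtract the power spectrum of
    the ICA-estimated noise N_hat from that of the beamformer output
    Y, floor the result, and reuse the beamformer phase."""
    p = np.abs(Y) ** 2 - np.abs(N_hat) ** 2
    p = np.maximum(p, beta * np.abs(Y) ** 2)   # spectral floor
    return np.sqrt(p) * np.exp(1j * np.angle(Y))
```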

  • Design of Robust Broadband Beamformers With Passband Shaping Characteristics Using Tikhonov Regularization

    Publication Year: 2009 , Page(s): 665 - 681
    Cited by:  Papers (16)

    Broadband beamformers are known to be highly sensitive to errors in the microphone array characteristics, especially for small-sized microphone arrays. This paper proposes an algorithm for the design of robust broadband beamformers with passband shaping characteristics using the Tikhonov regularization method, taking into account the statistics of the microphone characteristics. To facilitate the derivation, a weighted-least-squares broadband beamforming algorithm with passband shaping characteristics, which keeps a minimal stopband level, is also proposed. Unlike existing criteria for broadband beamforming with microphone arrays, the proposed approaches fully exploit the degrees of freedom in the weighting functions of the design criteria. By doing so, efficient and flexible broadband beamforming can be achieved with the desired passband shaping characteristics. In addition, this paper shows that the recently proposed robust broadband beamforming approach based on the statistics of the microphone characteristics (Doclo and Moonen, 2003 and 2007) belongs to the class of white-noise-gain-constraint-based techniques, which gives an insight into the nature of that state-of-the-art approach. The performance of the proposed beamforming algorithms is illustrated through design examples and compared with existing approaches for microphone arrays.
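
    A minimal sketch of a Tikhonov-regularized least-squares FIR beamformer design for a far-field linear array, under simplifying assumptions not taken from the paper (uniform weighting over a grid of frequency/angle points, a pure-delay desired response in the look direction, zeros elsewhere):

```python
import numpy as np

def design_beamformer(mic_x, fs, n_taps, freqs, angles, look_angle,
                      lam=1e-2, c=343.0):
    """Tikhonov-regularized LS design of real FIR filters (one per
    microphone): minimize |d - G w|^2 + lam * |w|^2 over a grid of
    (frequency, angle) points, far-field linear array."""
    mic_x = np.asarray(mic_x, dtype=float)       # mic positions (m)
    M, K = len(mic_x), n_taps
    rows, des = [], []
    for f in freqs:
        for th in angles:
            tau = mic_x * np.cos(th) / c         # far-field delays
            # Response of tap k of mic m at (f, theta).
            g = np.exp(-2j * np.pi * f
                       * (tau[:, None] + np.arange(K)[None, :] / fs))
            rows.append(g.ravel())
            # Desired: pure delay in the look direction, zero elsewhere.
            look = abs(th - look_angle) < 1e-6
            des.append(np.exp(-1j * np.pi * f * (K - 1) / fs) if look else 0.0)
    G, d = np.array(rows), np.array(des, dtype=complex)
    # Real-valued LS with the Tikhonov term as augmented rows.
    A = np.vstack([G.real, G.imag, np.sqrt(lam) * np.eye(M * K)])
    b = np.concatenate([d.real, d.imag, np.zeros(M * K)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w.reshape(M, K)                       # FIR filter per mic

w = design_beamformer(np.linspace(0, 0.08, 5), 16000, 20,
                      np.linspace(300, 3400, 32),
                      np.linspace(0, np.pi, 19), look_angle=np.pi / 2)
```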

  • Location Classification of Nonstationary Sound Sources Using Binaural Room Distribution Patterns

    Publication Year: 2009 , Page(s): 682 - 692
    Cited by:  Papers (2)

    This paper discusses the relationships between the nonstationarity of sound sources and the distribution patterns of interaural phase differences (IPDs) and interaural level differences (ILDs) obtained from short-term frequency analysis. The amplitude variation of a nonstationary sound source is modeled by the exponent of polynomials, following the concept of a moving-pole model. From the model, a sufficient condition for using the distribution patterns of IPDs and ILDs to localize a nonstationary sound source is suggested, and the phenomenon of multiple peaks in the distribution patterns is explained. A simulation is performed to interpret the relation between the IPD and ILD distribution patterns and the nonstationary sound source. Furthermore, a Gaussian-mixture binaural room distribution model (GMBRDM) is proposed to model the distribution patterns of IPDs and ILDs for nonstationary sound source location classification. The effectiveness and performance of the proposed GMBRDM are demonstrated by experimental results.

  • Quantitative Analysis of a Common Audio Similarity Measure

    Publication Year: 2009 , Page(s): 693 - 703
    Cited by:  Papers (8)

    For music information retrieval tasks, a nearest-neighbor classifier using the Kullback-Leibler divergence between Gaussian mixture models of songs' mel-frequency cepstral coefficients is commonly used to match songs by timbre. In this paper, we analyze this distance measure theoretically and experimentally using synthesized MIDI files, and we find that it is highly sensitive to different instrument realizations. Despite the lack of theoretical foundation, it handles the multipitch case quite well when all pitches originate from the same instrument, but it has some weaknesses when different instruments play simultaneously. As a proof of concept, we demonstrate that a source separation front end can improve performance. Furthermore, we evaluate the robustness of the measure to changes in key, sample rate, and bitrate.
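
    Since the KL divergence between two GMMs has no closed form, it is typically estimated by sampling, as in this sketch (model sizes and the MFCC front end are assumptions; sklearn's GaussianMixture stands in for whatever implementation the paper used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mc_symmetric_kl(gmm_a, gmm_b, n=5000):
    """Monte Carlo estimate of the symmetric KL divergence between
    two fitted GaussianMixture models (no closed form exists)."""
    xa, _ = gmm_a.sample(n)
    xb, _ = gmm_b.sample(n)
    kl_ab = np.mean(gmm_a.score_samples(xa) - gmm_b.score_samples(xa))
    kl_ba = np.mean(gmm_b.score_samples(xb) - gmm_a.score_samples(xb))
    return kl_ab + kl_ba

# Usage sketch: fit one GMM per song on its MFCC frames (front end
# not shown), then rank songs by divergence for nearest-neighbor
# timbre matching.
# gmm_a = GaussianMixture(8, covariance_type="diag").fit(mfcc_a)
# gmm_b = GaussianMixture(8, covariance_type="diag").fit(mfcc_b)
# dist = mc_symmetric_kl(gmm_a, gmm_b)
```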

  • Cepstrum-Domain Model Combination Based on Decomposition of Speech and Noise Using MMSE-LSA for ASR in Noisy Environments

    Publication Year: 2009 , Page(s): 704 - 713
    Cited by:  Papers (3)

    This paper presents an efficient method for combining models of speech and noise for robust speech recognition in noisy environments. The method decomposes the cepstrum-domain representation of noise-corrupted speech into clean speech cepstrum and background noise cepstrum components using a minimum mean squared error-log spectral amplitude (MMSE-LSA) criterion. Speech recognition is then performed on the noisy cepstrum-domain observations using a model formed by parallel combination of cepstrum-domain clean speech distributions and background noise distributions estimated by this MMSE-LSA-based noise decomposition. The method is far more efficient than other parallel model combination (PMC) procedures because model combination is performed directly in the cepstrum domain rather than in the linear spectral domain. Whereas background noise model estimation is addressed as a separate issue in existing PMC procedures, this method explicitly incorporates a mechanism to continually update the background noise models and signal-to-noise ratio (SNR) estimates over time. The performance of the proposed cepstrum-domain model combination method is compared with a well-known implementation of PMC, which uses a log-normal approximation when combining speech and background noise model means and variances, on a connected digit string recognition task subjected to mismatched channel and environment conditions. It is shown that the proposed model combination technique gives a word error rate comparable to PMC when background noise information and SNR are known prior to estimation. The paper also presents the results of experiments in which a combination of cepstrum-domain feature compensation and model combination is applied to this task.

  • Unsupervised Adaptation With Discriminative Mapping Transforms

    Publication Year: 2009 , Page(s): 714 - 723
    Cited by:  Papers (5)

    The most commonly used approaches to speaker adaptation are based on linear transforms, as these can be robustly estimated from limited adaptation data. Although significant gains can be obtained by using discriminative criteria to train acoustic models, maximum-likelihood (ML) estimated transforms are still used for unsupervised adaptation, because discriminatively trained transforms are highly sensitive to errors in the adaptation supervision hypothesis. This paper describes a new framework for estimating transforms that are discriminative in nature but less sensitive to this hypothesis issue. A speaker-independent discriminative mapping transformation (DMT) is estimated during training. This transform is obtained after a speaker-specific ML-estimated transform of each training speaker has been applied. During recognition, an ML speaker-specific transform is found for each test-set speaker, and the speaker-independent DMT is then applied. This allows a transform which is discriminative in nature to be estimated indirectly, while requiring only an ML speaker-specific transform to be found during recognition. The DMT technique is evaluated on an English conversational telephone speech task. Experiments showed that using the DMT in unsupervised adaptation led to significant gains over both standard ML and discriminatively trained transforms.

  • Importance of High-Order N-Gram Models in Morph-Based Speech Recognition

    Publication Year: 2009 , Page(s): 724 - 732
    Cited by:  Papers (15)

    Speech recognition systems trained for morphologically rich languages face the problem of vocabulary growth caused by prefixes, suffixes, inflections, and compound words. Solutions proposed in the literature include increasing the size of the vocabulary and segmenting words into morphs. In many cases, however, the methods have only been tested with low-order n-gram models or compared to word-based models that do not have very large vocabularies. In this paper, we study the importance of using high-order variable-length n-gram models when the language models are trained over morphs instead of whole words. Language models trained on a very large vocabulary are compared with models based on different morph segmentations. Speech recognition experiments are carried out on two highly inflecting and agglutinative languages, Finnish and Estonian. The results suggest that high-order models can be essential in morph-based speech recognition, even when lattices are generated for two-pass recognition. The analysis of recognition errors reveals that the high-order morph language models especially improve the recognition of previously unseen words.

  • The Hidden Agenda User Simulation Model

    Publication Year: 2009 , Page(s): 733 - 747
    Cited by:  Papers (7)

    A key advantage of taking a statistical approach to spoken dialogue systems is the ability to formalise dialogue policy design as a stochastic optimization problem. However, since dialogue policies are learnt by interactively exploring alternative dialogue paths, conventional static dialogue corpora cannot be used directly for training; instead, a user simulator is commonly used. This paper describes a novel statistical user model based on a compact stack-like state representation called a user agenda, which allows state transitions to be modeled as sequences of push and pop operations and elegantly encodes the dialogue history from the user's point of view. An expectation-maximisation based algorithm is presented which models the observable user output in terms of a sequence of hidden states, thereby allowing the model to be trained on a corpus of minimally annotated data. Experimental results with a real-world dialogue system demonstrate that the trained user model can be successfully used to optimise a dialogue policy which outperforms a hand-crafted baseline in terms of task completion rates and user satisfaction scores.
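
    The agenda itself is just a stack of pending dialogue acts. The toy class below illustrates the push/pop state transitions on a hypothetical act format; the statistical machinery (hidden-state modeling, EM training) is the paper's contribution and is not reproduced here.

```python
from collections import deque

class UserAgenda:
    """Toy stack-like user agenda: pending dialogue acts are pushed
    and popped as the dialogue evolves.  The act format is
    hypothetical; the paper's statistical model is not reproduced."""

    def __init__(self, constraints, requests):
        self.stack = deque()
        self.stack.append(("bye",))                   # last thing to say
        for slot in requests:                         # e.g. "phone"
            self.stack.append(("request", slot))
        for slot, value in constraints.items():       # e.g. food=Thai
            self.stack.append(("inform", slot, value))

    def pop_turn(self, n=1):
        """User turn: pop up to n pending acts off the top."""
        return [self.stack.pop() for _ in range(min(n, len(self.stack)))]

    def push(self, act):
        """A system action may push new items, e.g. a correction
        after a misunderstanding or an answer to a system request."""
        self.stack.append(act)

agenda = UserAgenda({"food": "Thai", "area": "centre"}, ["phone"])
print(agenda.pop_turn(2))
# [('inform', 'area', 'centre'), ('inform', 'food', 'Thai')]
```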

  • Combining Derivative and Parametric Kernels for Speaker Verification

    Publication Year: 2009 , Page(s): 748 - 757
    Cited by:  Papers (9)

    Support vector machine-based speaker verification (SV) has become a standard approach in recent years. These systems typically use dynamic kernels to handle the dynamic nature of speech utterances. This paper shows that many of these kernels fall into one of two general classes: derivative kernels and parametric kernels. The attributes of these classes are contrasted, and the conditions under which the two forms of kernel are identical are described. By avoiding these conditions, gains may be obtained by combining derivative and parametric kernels. One combination strategy is to combine at the kernel level, and this paper describes a maximum-margin-based scheme for learning kernel weights for the SV task. Various dynamic kernels and combinations were evaluated on the NIST 2002 SRE task, including derivative and parametric kernels based upon different model structures. The best overall performance, 7.78% EER, was achieved when combining five kernels.

  • Feature Compensation Techniques for ASR on Band-Limited Speech

    Publication Year: 2009 , Page(s): 758 - 774
    Cited by:  Papers (4)

    Band-limited speech (speech for which parts of the spectrum are completely lost) is a major cause of accuracy degradation in automatic speech recognition (ASR), particularly when the acoustic models have been trained on data with a different spectral range. In this paper, we present an extensive study of the problem of ASR of band-limited speech with full-bandwidth acoustic models. Our focus is mainly on band-limited feature compensation, covering even the case of time-varying band-limiting distortions, but we also compare this approach to the more common model-side techniques (adaptation and retraining) and explore the combination of feature-based and model-side approaches. The proposed feature compensation algorithms are organized in a unified framework supported by a novel mathematical model of the impact of such distortions on Mel-frequency cepstral coefficient (MFCC) features. A crucial and novel contribution is the analysis of the relative correlation of different elements in the MFCC feature vector for full-bandwidth and limited-bandwidth speech, which justifies an important modification of the feature compensation scheme. Furthermore, an intensive experimental analysis is provided. Experiments are conducted on real telephone channels, as well as on artificial low-pass and bandpass filters applied to TIMIT data, and results are given for different experimental constraints and variations of the feature compensation method. Results for other well-known robustness approaches, such as cepstral mean normalization (CMN), model retraining, and model adaptation, are also given for comparison. ASR performance with our approach is similar to or even better than that of model adaptation, and we argue that in particular cases, such as rapidly varying distortions or limited computational or memory resources, feature compensation is more convenient. Furthermore, we show that feature-side and model-side approaches may be combined, outperforming either approach alone.

  • Wrapped Gaussian Mixture Models for Modeling and High-Rate Quantization of Phase Data of Speech

    Publication Year: 2009 , Page(s): 775 - 786
    Cited by:  Papers (6)

    The harmonic representation of speech signals has found many applications in speech processing. This paper presents a novel statistical approach to modeling the behavior of harmonic phases. Phase information is decomposed into three parts: a minimum-phase part, a translation term, and a residual term referred to as the dispersion phase. Dispersion phases are modeled by wrapped Gaussian mixture models (WGMMs) using an expectation-maximization algorithm suitable for circular vector data. A multivariate WGMM-based phase quantizer is then proposed and constructed using novel scalar quantizers for circular random variables. The proposed phase modeling and quantization scheme is evaluated in the context of a narrowband harmonic representation of speech. Results indicate that it is possible to construct a variable-rate harmonic codec that is equivalent to iLBC at approximately 13 kbps.
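
    The wrapped Gaussian at the heart of a WGMM is an ordinary Gaussian summed over 2π translates. A minimal density sketch follows (truncating the wrap sum, an adequate approximation for moderate variances; the EM training and quantizer design are not reproduced):

```python
import numpy as np

def wrapped_gaussian_pdf(theta, mu, var, n_wrap=3):
    """Density of a wrapped Gaussian on [-pi, pi): a Gaussian summed
    over 2*pi translates, truncated to |k| <= n_wrap."""
    theta = np.asarray(theta, dtype=float)
    k = np.arange(-n_wrap, n_wrap + 1)
    d = theta[..., None] - mu + 2.0 * np.pi * k
    return np.sum(np.exp(-0.5 * d ** 2 / var), axis=-1) / np.sqrt(2 * np.pi * var)

def wgmm_pdf(theta, weights, means, variances):
    """A WGMM density is a convex combination of wrapped Gaussians."""
    return sum(w * wrapped_gaussian_pdf(theta, m, v)
               for w, m, v in zip(weights, means, variances))
```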


Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research