
IEEE Transactions on Audio, Speech, and Language Processing

Issue 3 • March 2011


  • Table of contents

    Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
  • A New Framework for Underdetermined Speech Extraction Using Mixture of Beamformers

    Page(s): 445 - 457

    This paper describes a frequency-domain nonlinear mixture of beamformers that can extract a speech source from a known direction when there are fewer microphones than sources (the underdetermined case). Our approach models the data in each frequency bin via Gaussian mixture distributions, which can be learned using the expectation-maximization algorithm. The model learning is performed using the observed mixture signals only, and no prior training is required. Nonlinear beamformers are then developed based on this model. The proposed estimators are a nonlinear weighted sum of linear minimum mean square error or minimum variance distortionless response beamformers. The resulting nonlinear beamformers do not need to know or estimate the number of sources, and can be applied to microphone arrays with two or more microphones. We test and evaluate the described methods on underdetermined speech mixtures.

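    As a rough illustration of the flavor of estimator described above (not the authors' exact formulation), the sketch below fits a Gaussian mixture to the observations of a single frequency bin with EM and combines per-component MVDR beamformers, w = R^-1 d / (d^H R^-1 d), using the component posteriors as weights. The array geometry, steering vector, mixture order, and the real/imaginary feature stacking are illustrative assumptions.

        # Illustrative sketch only: posterior-weighted sum of MVDR beamformers in one frequency bin.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        M, T = 2, 2000                                        # microphones, frames (fewer mics than sources)
        f, c, spacing, theta = 1000.0, 343.0, 0.05, np.deg2rad(30)   # assumed frequency and geometry
        d = np.exp(-2j * np.pi * f * spacing * np.arange(M) * np.sin(theta) / c)  # steering vector

        X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))        # stand-in bin data

        # Fit a GMM to the bin observations (EM on the mixtures only, no prior training).
        feats = np.hstack([X.real, X.imag])
        gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(feats)
        post = gmm.predict_proba(feats)                                            # T x K posterior weights

        y = np.zeros(T, dtype=complex)
        for k in range(gmm.n_components):
            mask = post[:, k] > 0.5
            Rk = (X[mask].conj().T @ X[mask]) / max(1, mask.sum())                 # per-component covariance
            w = np.linalg.solve(Rk + 1e-6 * np.eye(M), d)
            w /= d.conj() @ w                                                      # MVDR: R^-1 d / (d^H R^-1 d)
            y += post[:, k] * (w.conj() @ X.T)                                     # posterior-weighted outputs
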
  • On The Polynomial Approximation for Time-Variant Harmonic Signal Modeling

    Page(s): 458 - 467

    We present a novel approach to modeling time-variant harmonic content in monophonic audio signals. We show that both amplitude and fundamental-frequency time variations can be compactly captured in a single time polynomial that modulates the fundamental harmonic component. A correct estimate of the fundamental frequency is assured by the fully automated spectral analysis (ASA) method. Because the set of equations is linear in the parameters, the best fit is easily obtained by linear least squares. In contrast to existing methods, the proposed approach is designed to properly describe harmonic structures in monophonic audio signals under conditions of combined amplitude and frequency variation and low signal-to-noise ratios.

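    Because such models are linear in their parameters, the best fit reduces to ordinary least squares. The sketch below fits polynomial amplitude envelopes to a fixed set of harmonics of a known fundamental; this parameterization (a separate polynomial per harmonic, known f0) is an illustrative simplification, not the paper's exact single-polynomial model.

        # Least-squares fit of a harmonic model with polynomial amplitude envelopes.
        # Assumes the fundamental frequency f0 is already known (e.g., from spectral analysis).
        import numpy as np

        fs, f0, K, P = 16000, 220.0, 5, 3                    # sample rate, fundamental, harmonics, poly order
        t = np.arange(2048) / fs
        x = np.sin(2 * np.pi * f0 * t) * (1.0 + 0.5 * t)     # toy signal with a time-varying amplitude

        # Design matrix columns: t^p * cos(2*pi*k*f0*t) and t^p * sin(...), linear in the coefficients.
        cols = []
        for k in range(1, K + 1):
            for p in range(P + 1):
                cols.append((t ** p) * np.cos(2 * np.pi * k * f0 * t))
                cols.append((t ** p) * np.sin(2 * np.pi * k * f0 * t))
        A = np.column_stack(cols)
        coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
        x_hat = A @ coeffs                                   # reconstructed harmonic model
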
  • Data Balancing for Efficient Training of Hybrid ANN/HMM Automatic Speech Recognition Systems

    Page(s): 468 - 481

    Hybrid speech recognizers, in which the estimation of the emission pdf of the hidden Markov model (HMM) states, usually carried out with Gaussian mixture models (GMMs), is replaced by artificial neural networks (ANNs), have several advantages over classical systems. However, obtaining these performance improvements greatly increases the computational requirements because of the need to train the ANN. Starting from the observation that speech data are remarkably skewed, this paper proposes sifting the training set and balancing the number of samples per class. With this method, training time is reduced by a factor of 18 while achieving performance similar to, or even better than, that obtained with the whole database, especially in noisy environments. However, applying these reduced sets is not straightforward: to avoid the mismatch between training and testing conditions created by modifying the distribution of the training data, the a posteriori probabilities must be properly scaled and the context window resized, as demonstrated in this paper.

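    In hybrid systems the ANN outputs state posteriors, which are commonly converted to scaled likelihoods by dividing by the class priors before decoding; when the training set is balanced, the priors implicitly seen by the network no longer match the true ones, which is why the posteriors need rescaling. A minimal sketch of this standard conversion (not the paper's specific scaling or window resizing) follows; the toy numbers are illustrative.

        # Convert ANN state posteriors to scaled likelihoods for HMM decoding:
        # p(x | s) is proportional to P(s | x) / P(s). With a balanced training set, the priors
        # used in the division must stay consistent with how the network was trained.
        import numpy as np

        posteriors = np.array([[0.7, 0.2, 0.1],
                               [0.1, 0.6, 0.3]])             # frames x HMM states, from the ANN
        priors = np.array([0.5, 0.3, 0.2])                   # state priors counted on the training data

        scaled_log_likelihoods = np.log(posteriors + 1e-10) - np.log(priors)
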
  • Dirichlet Class Language Models for Speech Recognition

    Page(s): 482 - 495

    Latent Dirichlet allocation (LDA) was successfully developed for document modeling owing to its generalization to unseen documents through latent topic modeling. LDA calculates the probability of a document under the bag-of-words assumption, without considering word order, and so cannot be directly adopted to predict words in speech recognition systems. This work presents a new Dirichlet class language model (DCLM), which projects the sequence of history words onto a latent class space and calculates a marginal likelihood over class uncertainties, which are expressed by Dirichlet priors. A Bayesian class-based language model is established, and a variational Bayesian procedure is presented for estimating the DCLM parameters. Furthermore, long-distance class information is continuously updated from the large-span history words and dynamically incorporated into class mixtures for a cache DCLM. Different language models are evaluated experimentally on the Wall Street Journal (WSJ) corpus, varying the amount of training data and the vocabulary size. We find that the cache DCLM effectively characterizes unseen n-gram events and stores class information for long-distance language modeling. This approach outperforms other class-based and topic-based language models in terms of perplexity and recognition accuracy. The DCLM and cache DCLM achieve relative word error rate reductions of 3% to 5% over the LDA topic-based language model for different amounts of training data.

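    At its core, a class-based language model predicts the next word by marginalizing over latent classes, P(w | h) = sum_c P(w | c) P(c | h); the DCLM places Dirichlet priors on the class distributions and infers them variationally. The sketch below shows only the marginalization step with made-up probability tables, not the variational inference.

        # Class-based next-word probability: marginalize over latent classes.
        # Toy tables; in the DCLM, P(class | history) carries Dirichlet-distributed uncertainty.
        import numpy as np

        p_word_given_class = np.array([[0.6, 0.3, 0.1],      # classes x vocabulary
                                       [0.1, 0.2, 0.7]])
        p_class_given_history = np.array([0.8, 0.2])         # inferred from the history words

        p_word_given_history = p_class_given_history @ p_word_given_class
        print(p_word_given_history)                          # -> [0.5  0.28 0.22]
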
  • Manipulation of Consonants in Natural Speech

    Page(s): 496 - 504

    Natural speech often contains conflicting cues that are characteristic of confusable sounds. For example, a /k/, defined by a mid-frequency burst within 1-2 kHz, may also contain a high-frequency burst above 4 kHz that is indicative of /t/, or vice versa. Conflicting cues can cause listeners to confuse the two sounds in a noisy environment. An efficient way of reducing confusion and improving speech intelligibility in noise is to modify these speech cues. This paper describes a method for manipulating consonant sounds in natural speech, based on a priori knowledge of the perceptual cues of consonants. We demonstrate that 1) the percept of consonants in natural speech can be controlled through the manipulation of perceptual cues, and 2) speech sounds can be made much more robust to noise by removing the conflicting cue and enhancing the target cue.

  • Speaker Verification With Feature-Space MAPLR Parameters

    Page(s): 505 - 515

    This paper studies a new technique that characterizes a speaker by the difference between the speaker and a cohort of background speakers, in the form of feature-space maximum a posteriori linear regression (fMAPLR). fMAPLR is a linear regression function, also known as an affine transform, that projects speaker-dependent features onto speaker-independent ones. It consists of two sets of parameters: bias vectors and transform matrices. The former, representing first-order information, are more robust than the latter, which carry second-order information. We propose a flexible tying scheme that allows the bias vectors and the matrices to be associated with different regression classes, so that both sets of parameters receive sufficient statistics in a speaker verification task. We formulate a maximum a posteriori (MAP) algorithm for estimating the feature transform parameters, which further alleviates possible numerical problems. The fMAPLR parameters are then vectorized and compared via a support vector machine (SVM). We conduct experiments on the National Institute of Standards and Technology (NIST) 2006 and 2008 Speaker Recognition Evaluation databases. The experiments show that the proposed technique consistently outperforms the baseline Gaussian mixture model (GMM)-SVM speaker verification system.

  • Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment

    Page(s): 516 - 527

    This paper presents a blind source separation method for convolutive mixtures of speech/audio sources. The method can be applied even to the underdetermined case, where there are fewer microphones than sources. The separation is performed in the frequency domain and consists of two stages. In the first stage, frequency-domain mixture samples are clustered by source with an expectation-maximization (EM) algorithm. Since the clustering is performed in a frequency bin-wise manner, the permutation ambiguities of the bin-wise clustered samples must then be aligned. This is solved in the second stage by using the posterior probability that each sample belongs to its assigned class. This two-stage structure makes it possible to attain good separation even under reverberant conditions. Experimental results for separating four speech signals with three microphones under reverberant conditions show the superiority of the new method over existing methods. We also report separation results for a benchmark data set and for live recordings of speech mixtures.

  • A Phase Grating Approach to Modeling Surface Diffusion in FDTD Room Acoustics Simulations

    Page(s): 528 - 537

    In this paper, a method for modeling diffusive boundaries in finite-difference time-domain (FDTD) room acoustics simulations using impedance filters is presented. The proposed technique is based on the concept of phase grating diffusers and is realized by designing boundary impedance filters from normal-incidence reflection filters with added delays. These added delays, which correspond to the diffuser well depths, are varied across the boundary surface and implemented using Thiran allpass filters. The proposed method for simulating sound scattering is suitable for modeling high-frequency diffusion caused by small variations in surface roughness and, more generally, diffusers characterized by narrow wells with infinitely thin separators. The concept is also applicable to other wave-based modeling techniques. The approach is validated by comparing numerical results for Schroeder diffusers to measured data. In addition, it is proposed that irregular surfaces be modeled by shaping them with Brownian noise, giving good control over the sound scattering properties of the simulated boundary through two parameters: the spectral density exponent and the maximum well depth.

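    The well depths above are realized as delays using Thiran allpass filters. A standard Nth-order Thiran design (a generic fractional-delay design, not the paper's full boundary model) computes the denominator coefficients as a_k = (-1)^k C(N,k) * prod_{n=0..N} (D-N+n)/(D-N+k+n) for a total delay of D samples, and uses the reversed coefficients as the numerator:

        # Generic Thiran allpass design for a (possibly fractional) delay of D samples.
        # H(z) = (a_N + a_{N-1} z^-1 + ... + 1*z^-N) / (1 + a_1 z^-1 + ... + a_N z^-N),
        # maximally flat group delay around DC; choose D close to the order N for stability.
        from math import comb
        import numpy as np

        def thiran_allpass(D, N):
            a = np.ones(N + 1)
            for k in range(1, N + 1):
                prod = 1.0
                for n in range(N + 1):
                    prod *= (D - N + n) / (D - N + k + n)
                a[k] = (-1) ** k * comb(N, k) * prod
            b = a[::-1]                      # numerator is the mirrored denominator (allpass)
            return b, a

        b, a = thiran_allpass(D=3.3, N=3)    # e.g., approximate a 3.3-sample delay with order 3
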
  • A Supervised Framework for Keyword Extraction From Meeting Transcripts

    Page(s): 538 - 548

    This paper presents a supervised framework for extracting keywords from meeting transcripts, a genre that differs significantly from written text and from other speech domains such as broadcast news. In addition to traditional frequency- and position-based clues, we investigate a variety of novel features, including linguistically motivated term specificity features, decision-making sentence-related features, prosodic prominence scores, and a group of features derived from summary sentences. To generate better system summaries, we propose a feedback loop mechanism under the supervised framework to leverage the relationship between keywords and summary sentences. Experiments are performed on the ICSI meeting corpus using both human transcripts and automatic speech recognition (ASR) output. Results show that the proposed supervised framework outperforms both unsupervised term frequency-inverse document frequency (TF-IDF) weighting and a supervised keyphrase extraction system known for its good performance on written text. We conduct extensive analysis to demonstrate the effectiveness of the newly proposed features and of the feedback mechanism used to generate summaries. Furthermore, we show promising results using n-best recognition output to address the problem of recognition errors.

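    The unsupervised baseline in the comparison above is plain TF-IDF weighting. A minimal version over tokenized transcripts might look like the following; the tokenization, smoothing, and toy corpus are assumptions for illustration only.

        # TF-IDF weighting of candidate keywords in one transcript against a background corpus.
        import math
        from collections import Counter

        def tfidf(doc_tokens, corpus):
            """doc_tokens: list of words; corpus: list of token lists (one per document)."""
            tf = Counter(doc_tokens)
            n_docs = len(corpus)
            scores = {}
            for word, count in tf.items():
                df = sum(1 for d in corpus if word in d)
                idf = math.log((1 + n_docs) / (1 + df)) + 1          # smoothed inverse document frequency
                scores[word] = (count / len(doc_tokens)) * idf
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        corpus = [["budget", "review", "meeting"], ["action", "items", "budget"], ["lunch", "plans"]]
        print(tfidf(corpus[0], corpus)[:3])                          # top-scoring candidate keywords
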
  • A Region-Growing Permutation Alignment Approach in Frequency-Domain Blind Source Separation of Speech Mixtures

    Page(s): 549 - 557

    The convolutive blind source separation (BSS) problem can be solved efficiently in the frequency domain, where instantaneous BSS is performed separately in each frequency bin. However, the permutation ambiguity in each frequency bin must be resolved so that the separated frequency components from the same source are grouped together. To solve this permutation problem, this paper presents a new alignment method based on an inter-frequency dependence measure: the correlation between the powers of the separated signals. Bin-wise permutation alignment is applied first across all frequency bins, using the correlation of separated signal powers; the full frequency band is then partitioned into small regions based on the bin-wise alignment result, and region-wise permutation alignment is finally performed in a region-growing manner. The region-wise correction scheme keeps misalignments at isolated frequency bins from spreading to other bins, thereby improving the overall alignment. Experimental results in simulated and real environments verify the effectiveness of the proposed method, and analysis shows that the proposed frequency-domain BSS method is computationally efficient.

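    For two sources, bin-wise alignment amounts to testing, in each bin, whether the identity or the swapped permutation correlates better with a set of reference power envelopes. The sketch below shows only that bin-wise step with a simple running reference; the reference handling, region partitioning, and region growing in the paper are more elaborate than this.

        # Bin-wise permutation alignment for 2 separated sources using power-envelope correlation.
        # powers: array of shape (n_bins, 2, n_frames) holding |separated STFT components|^2.
        import numpy as np

        def align_two_sources(powers):
            ref = powers[0].copy()                               # reference envelopes (first bin)
            aligned = powers.copy()
            for f in range(1, powers.shape[0]):
                keep = np.corrcoef(ref[0], powers[f, 0])[0, 1] + np.corrcoef(ref[1], powers[f, 1])[0, 1]
                swap = np.corrcoef(ref[0], powers[f, 1])[0, 1] + np.corrcoef(ref[1], powers[f, 0])[0, 1]
                if swap > keep:
                    aligned[f] = powers[f, ::-1]                 # swap the two outputs in this bin
                ref = 0.9 * ref + 0.1 * aligned[f]               # slowly updated reference envelopes
            return aligned
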
  • Double-Ended Quality Assessment System for Super-Wideband Speech

    Page(s): 558 - 569

    This paper describes a double-ended quality assessment system for speech with a bandwidth of up to 14 kHz (so-called super-wideband speech). The system is based on a combination of local and global features, where the local features depend on a time alignment procedure and the global features do not. It is evaluated on a large set of subjectively scored narrowband, wideband, and super-wideband speech databases, and performs similarly to PESQ for narrowband speech and significantly better for wideband speech.

  • Emotional Audio-Visual Speech Synthesis Based on PAD

    Page(s): 570 - 582

    Audio-visual speech synthesis is a core function for realizing face-to-face human-computer communication. While considerable effort has been devoted to making computers talk like people, how to integrate emotional expression into audio-visual speech synthesis remains largely an open problem. In this paper, we adopt the Pleasure-Displeasure, Arousal-Nonarousal, and Dominance-Submissiveness (PAD) three-dimensional emotional space, in which emotions can be described and quantified along three different dimensions. Based on this definition, we propose a unified model for emotional speech conversion using a Boosting-Gaussian mixture model (Boosting-GMM), as well as a facial expression synthesis model, and we present an emotional audio-visual speech synthesis approach. Specifically, we take text and target PAD values as input and employ a text-to-speech (TTS) engine to first generate neutral speech. The Boosting-GMM then converts the neutral speech to emotional speech, and the facial expression is synthesized simultaneously. Finally, the acoustic features of the emotional speech are used to modulate the facial expression in the audio-visual output. We designed three objective and five subjective experiments to evaluate each model and the overall approach. Experimental results on audio-visual emotional speech datasets show that the proposed approach can effectively and efficiently synthesize natural and expressive emotional audio-visual speech. Analysis of the results also reveals a mutually reinforcing relationship between the audio and visual information.

  • Batch-Online Semi-Blind Source Separation Applied to Multi-Channel Acoustic Echo Cancellation

    Page(s): 583 - 599

    Semi-blind source separation (SBSS) is a special case of blind source separation (BSS) in which some partial knowledge of the source signals is available to the system. In particular, batch adaptation in the frequency domain based on independent component analysis (ICA) can be used to jointly perform source separation and multichannel acoustic echo cancellation (MCAEC) through SBSS without double-talk detection. This paper discusses many issues related to the implementation of an SBSS system. After a detailed analysis of the structure of the SBSS adaptation, we propose a constrained batch-online implementation that stabilizes convergence even in the worst-case scenario of a single far-end talker combined with the non-uniqueness condition on the far-end mixing system. Specifically, a matrix constraint is proposed to reduce the effect of the non-uniqueness problem caused by highly correlated far-end reference signals during MCAEC. Experimental results show that high echo cancellation can be achieved while the misalignment remains relatively low, without any preprocessing to decorrelate the far-end signals, even in the single far-end talker case.

  • Robust Voice Activity Detection Using Long-Term Signal Variability

    Page(s): 600 - 613

    We propose a novel long-term signal variability (LTSV) measure, which describes the degree of nonstationarity of a signal. We analyze the LTSV measure both analytically and empirically for speech and for various stationary and nonstationary noises. Based on this analysis, we find that the LTSV measure can be used to discriminate noise from noisy speech and, hence, can serve as a feature for voice activity detection (VAD). We describe an LTSV-based VAD scheme and evaluate its performance under eleven types of noise and five signal-to-noise ratio (SNR) conditions. Comparison with standard VAD schemes demonstrates that the accuracy of the LTSV-based scheme, averaged over all noises and all SNRs, is ~6% (absolute) better than that of the best of the considered VAD schemes, namely AMR-VAD2. We also find that, at -10 dB SNR, the VAD accuracies obtained by the proposed LTSV-based scheme and the best considered VAD scheme are 88.49% and 79.30%, respectively. This improvement indicates the robustness of the LTSV feature for VAD at low SNR for most of the noises considered.

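    The exact definition of LTSV is given in the paper; a common formulation of this kind of measure, assumed here for illustration, computes the entropy of each frequency bin's spectrogram normalized over the last R frames and then takes the variance of those entropies across bins, so that stationary noise yields a small value and speech a large one.

        # Assumed formulation of an LTSV-style nonstationarity measure; consult the paper for
        # the authors' exact definition, analysis window, and smoothing.
        import numpy as np

        def ltsv(spectrogram, R=30):
            """spectrogram: array (n_bins, n_frames) of short-time power spectra."""
            n_bins, n_frames = spectrogram.shape
            out = np.zeros(n_frames)
            for t in range(R, n_frames):
                block = spectrogram[:, t - R:t] + 1e-12
                p = block / block.sum(axis=1, keepdims=True)     # normalize each bin over the R frames
                entropy = -(p * np.log(p)).sum(axis=1)           # one entropy value per frequency bin
                out[t] = entropy.var()                           # variability across bins
            return out
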
  • Stereo Acoustic Echo Cancellation Employing Frequency-Domain Preprocessing and Adaptive Filter

    Page(s): 614 - 623

    This paper proposes a windowed frequency-domain adaptive filter and an upsampling block-transform preprocessing stage to address the stereo acoustic echo cancellation problem. The proposed adaptive filter uses windowing functions with a smooth cutoff to reduce spectral leakage during filter updating, so that the independent noise introduced by the preprocessing can be exploited more fully for stereo acoustic echo cancellation. The proposed preprocessing operates on short blocks with low processing delay and uses frequency-domain upsampling to meet the minimum block-length requirement imposed by the band limit of simultaneous masking; simultaneous masking can therefore be well exploited to preserve audio quality. Acoustic echo cancellation simulations and an audio quality evaluation show that the proposed windowed frequency-domain adaptive filter outperforms the conventional frequency-domain adaptive filter in both the mono and stereo cases, and that the upsampling block-transform preprocessing provides better audio quality and stereo acoustic echo cancellation performance than half-wave preprocessing at the same noise level.

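    The "conventional frequency-domain adaptive filter" used as the baseline above is typically an overlap-save block LMS in the frequency domain. A minimal, unnormalized sketch with the usual gradient constraint follows (block length equal to the filter length); it is the generic baseline algorithm, not the paper's windowed variant or preprocessing.

        # Minimal overlap-save frequency-domain block LMS (generic baseline sketch).
        import numpy as np

        def fdaf(x, d, N=64, mu=0.1):
            W = np.zeros(2 * N, dtype=complex)                       # frequency-domain filter weights
            y = np.zeros(len(x))
            for start in range(N, len(x) - N + 1, N):
                X = np.fft.fft(x[start - N:start + N])               # 2N-point input block
                yb = np.real(np.fft.ifft(X * W))[N:]                 # overlap-save output (last N samples)
                eb = d[start:start + N] - yb
                E = np.fft.fft(np.concatenate([np.zeros(N), eb]))
                grad = np.real(np.fft.ifft(np.conj(X) * E))[:N]      # gradient constraint: keep first N taps
                W += mu * np.fft.fft(np.concatenate([grad, np.zeros(N)]))
                y[start:start + N] = yb
            return y, W
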
  • Convolutive BSS of Short Mixtures by ICA Recursively Regularized Across Frequencies

    Page(s): 624 - 639

    This paper proposes a new method of frequency-domain blind source separation (FD-BSS) able to separate acoustic sources in challenging conditions. In frequency-domain BSS, the time-domain signals are transformed into time-frequency series and separation is generally performed by applying independent component analysis (ICA) in each frequency bin. When short signals are observed and long demixing filters are required, the number of time observations per frequency is limited and the variance of the ICA estimator increases due to the intrinsic statistical bias. Furthermore, common methods used to solve the permutation problem fail, especially with sources recorded under highly reverberant conditions. We propose a recursively regularized implementation of ICA (RR-ICA) that overcomes these problems by exploiting two types of deterministic knowledge: 1) continuity of the demixing matrix across frequencies, and 2) continuity of the time activity of the sources. The recursive regularization propagates the statistics of the sources across frequencies, reducing the effect of statistical bias and the occurrence of permutations. Experimental results on real data show that the algorithm can successfully perform fast separation of short signals (e.g., 0.5-1 s) while estimating long demixing filters to deal with highly reverberant environments (e.g., ms).

  • Feature Extraction Based on Pitch-Synchronous Averaging for Robust Speech Recognition

    Page(s): 640 - 651

    In this paper, we propose two estimators for the autocorrelation sequence of a periodic signal in additive noise. Both estimators are formulated using tables containing all possible products of sample pairs in a speech frame. The first estimator is based on pitch-synchronous averaging; we analyze it statistically and show that the signal-to-noise ratio (SNR) can be increased by up to a factor equal to the number of available periods. The second estimator is similar to the first but avoids the sample products most likely to be affected by noise, and we prove that, under certain conditions, it can remove the effect of additive noise in a statistical sense. Both estimators are employed to extract mel-frequency cepstral coefficients (MFCCs) as features for robust speech recognition. Although these estimators are initially conceived for voiced frames, we extend their application to unvoiced sounds in order to obtain a coherent feature extractor. The experimental results show the superiority of the proposed approach over other MFCC-based front-ends such as higher-lag autocorrelation spectrum estimation (HASE), which also avoids the autocorrelation coefficients most likely to be affected by noise.

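    The first estimator averages consecutive pitch periods before computing the autocorrelation: since zero-mean, independent noise is averaged over P periods, its variance drops by roughly a factor of P, which is where the up-to-P SNR gain comes from. The following simplified sketch assumes the pitch period (in samples) is already known and the frame spans several complete periods; it uses a circular autocorrelation of the averaged period, which is an illustrative simplification of the table-based estimators in the paper.

        # Pitch-synchronous averaging followed by autocorrelation (simplified illustration).
        import numpy as np

        def ps_averaged_autocorr(frame, T0, n_lags):
            P = len(frame) // T0                         # number of complete periods in the frame
            periods = frame[:P * T0].reshape(P, T0)
            avg = periods.mean(axis=0)                   # noise variance reduced by up to a factor P
            r = np.array([np.dot(avg, np.roll(avg, -k))  # circular autocorrelation of the averaged period
                          for k in range(n_lags)]) / T0
            return r
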
  • Transient Analysis of the Conventional Filtered-x Affine Projection Algorithm for Active Noise Control

    Page(s): 652 - 657

    Affine projection (AP) algorithms have been proposed in recent years for use in active noise control systems because of their potentially high convergence speed along with their robustness and moderate computational cost. However, these algorithms can exhibit an excessive computational cost for high projection orders (precisely when the higher convergence speed is achieved), so computationally efficient versions have been proposed. For AP algorithms applied to active noise control, using the conventional filtered-x structure instead of the commonly used modified filtered-x structure can be seen as an efficient strategy, since it needs fewer operations to update the adaptive filter coefficients. However, this structure changes the algorithm's behavior for two reasons: the signals used in the coefficient updates do not correspond exactly to those of the AP algorithm, and the structure introduces a delay between the update of the adaptive filter coefficients and its effect on the noise signal. In practice, this dual effect mainly influences the convergence of the algorithms in the transient regime. This correspondence presents a mathematical model with which the transient behavior of the conventional filtered-x AP algorithm can be predicted from the reference signal statistics and the algorithm parameters.

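    For orientation only, the sketch below shows the projection-order-one special case of the conventional filtered-x structure, i.e., plain filtered-x LMS: the reference is filtered through an estimate of the secondary path before it enters the coefficient update, and the update's effect reaches the error microphone only through the secondary path. The filter lengths, step size, and secondary-path estimate s_hat are illustrative assumptions; the AP generalization analyzed in the correspondence updates over several past filtered-reference vectors at once.

        # Filtered-x LMS (projection order 1) for single-channel active noise control.
        # s is the true secondary path (FIR), s_hat its estimate; x is the reference, d the
        # disturbance at the error microphone. Assumes len(s_hat) <= L.
        import numpy as np

        def fxlms(x, d, s, s_hat, L=64, mu=0.01):
            w = np.zeros(L)                                   # adaptive control filter
            xbuf = np.zeros(L)                                # reference buffer (newest sample first)
            fxbuf = np.zeros(L)                               # filtered-reference buffer
            ybuf = np.zeros(len(s))                           # buffer feeding the true secondary path
            e = np.zeros(len(x))
            for n in range(len(x)):
                xbuf = np.roll(xbuf, 1); xbuf[0] = x[n]
                y = w @ xbuf                                  # anti-noise sample
                ybuf = np.roll(ybuf, 1); ybuf[0] = y
                e[n] = d[n] - s @ ybuf                        # residual at the error microphone
                xf = s_hat @ xbuf[:len(s_hat)]                # reference filtered by the path estimate
                fxbuf = np.roll(fxbuf, 1); fxbuf[0] = xf
                w += mu * e[n] * fxbuf                        # LMS update with the filtered reference
            return w, e
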
  • IEEE Transactions on Audio, Speech, and Language Processing Edics

    Page(s): 657 - 658
  • IEEE Transactions on Audio, Speech, and Language Processing Information for authors

    Page(s): 659 - 660
  • IEEE Signal Processing Society Information

    Page(s): C3

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.


This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research