
IEEE Transactions on Audio, Speech, and Language Processing

Issue 4 • May 2007


Displaying Results 1 - 25 of 43
  • Table of contents

    Publication Year: 2007, Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2007, Page(s): C2
  • A Soft Voice Activity Detection Using GARCH Filter and Variance Gamma Distribution

    Publication Year: 2007, Page(s): 1129 - 1134
    Cited by: Papers (5)

    This paper presents a robust algorithm for a voice activity detector (VAD) based on a generalized autoregressive conditional heteroscedasticity (GARCH) filter, the variance gamma distribution (VGD), and an adaptive threshold function. GARCH models are statistical methods used especially for economic time series. There is a consensus that speech signals exhibit variances that change through time, and GARCH models are a popular choice for modeling these changing variances. The speech signal is assumed to follow a VGD because the VGD has heavier tails than the Gaussian distribution (GD), while the noise signal is assumed to be Gaussian. In the proposed method, heteroscedasticity is modeled by GARCH, and the parameters of the distributions are then estimated recursively. Finally, a hard detection results from comparing a multiple observation likelihood ratio test (MOLRT) with an adaptive threshold function. Simulation results show that the proposed VAD is able to operate down to -5 dB and in nonstationary environments.

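    As a rough illustration of the mechanics described above, here is a minimal Python sketch of a GARCH(1,1)-driven likelihood ratio VAD. It substitutes a Gaussian for the variance gamma speech distribution and a fixed threshold for the paper's adaptive one; all parameter values are hypothetical.

        import numpy as np

        def garch_vad(frames, omega=1e-4, alpha=0.1, beta=0.85,
                      noise_var=1e-3, threshold=1.0):
            """frames: (n_frames, frame_len) array of a zero-mean signal."""
            sigma2 = noise_var                # time-varying speech variance
            decisions = []
            for frame in frames:
                # GARCH(1,1) recursion on the frame energy (innovation proxy)
                sigma2 = omega + alpha * np.mean(frame ** 2) + beta * sigma2
                # frame log-likelihood ratio: speech (var sigma2) vs noise
                llr = 0.5 * np.sum(frame ** 2 * (1.0 / noise_var - 1.0 / sigma2)
                                   - np.log(sigma2 / noise_var))
                decisions.append(llr > threshold)
            return np.array(decisions)
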
  • Single and Multiple F0 Contour Estimation Through Parametric Spectrogram Modeling of Speech in Noisy Environments

    Publication Year: 2007, Page(s): 1135 - 1145
    Cited by: Papers (8) | Patents (1)

    This paper proposes a novel F0 contour estimation algorithm based on a precise parametric description of the voiced parts of speech derived from the power spectrum. The algorithm is able to perform in a wide variety of noisy environments as well as to estimate the F0s of cochannel concurrent speech. The speech spectrum is modeled as a sequence of spectral clusters governed by a common F0 contour expressed as a spline curve. These clusters are obtained by an unsupervised 2-D time-frequency clustering of the power density using a new formulation of the EM algorithm, and their common F0 contour is estimated at the same time. A smooth F0 contour is extracted for the whole utterance, linking together its voiced parts. A noise model is used to cope with nonharmonic background noise, which would otherwise interfere with the clustering of the harmonic portions of speech. We evaluate our algorithm in comparison with existing methods on several tasks and show 1) that it is competitive on clean single-speaker speech, 2) that it outperforms existing methods in the presence of noise, and 3) that it outperforms existing methods for the estimation of multiple F0 contours of cochannel concurrent speech.

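    The spline representation of the F0 contour is straightforward to reproduce. A minimal sketch, assuming knot times and F0 values already produced by the clustering stage (the values below are made up):

        import numpy as np
        from scipy.interpolate import CubicSpline

        knot_t = np.array([0.0, 0.2, 0.4, 0.6, 0.8])             # knot times (s)
        knot_f0 = np.array([180.0, 210.0, 195.0, 170.0, 160.0])  # F0 (Hz)
        contour = CubicSpline(knot_t, knot_f0)   # smooth contour linking voiced parts

        frame_times = np.arange(0.0, 0.8, 0.01)  # 10-ms frame hop
        f0_track = contour(frame_times)          # F0 estimate per frame
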
  • Memory-Based Vector Quantization of LSF Parameters by a Power Series Approximation

    Publication Year: 2007, Page(s): 1146 - 1155
    Cited by: Papers (2)

    In this paper, memory-based quantization is studied in detail. We propose a new framework, power series quantization (PSQ), for memory-based quantization. With line spectral frequency (LSF) quantization as the application, several common memory-based quantization methods (FSVQ, predictive VQ, VPQ, safety-net, etc.) are analyzed and compared with the proposed method, and the proposed method is shown to perform better than all other tested methods. The PSQ method is fully general, in that it can simulate any other memory-based quantizer if it is allowed unlimited complexity.

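    A minimal sketch of the memory-based idea: the current vector is predicted from the previous quantized output and only the residual is quantized. The paper's power series predictor is truncated here to its first (linear) term, and both `codebook` and the predictor matrix `A` are assumed to be trained offline.

        import numpy as np

        def psq_encode(x, codebook, A):
            """x: (n_frames, dim) LSF vectors; codebook: (K, dim) residuals."""
            prev = np.zeros(x.shape[1])
            indices = []
            for vec in x:
                pred = A @ prev                     # truncated power series
                residual = vec - pred
                k = int(np.argmin(np.sum((codebook - residual) ** 2, axis=1)))
                indices.append(k)
                prev = pred + codebook[k]           # quantized reconstruction
            return indices
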
  • Rate Allocation for Noncollaborative Multiuser Speech Communication Systems Based on Bargaining Theory

    Publication Year: 2007, Page(s): 1156 - 1166

    We propose a novel rate allocation algorithm for multiuser speech communication systems based on bargaining theory. Specifically, we apply the generalized Kalai-Smorodinsky bargaining solution, since it allows varying bargaining powers to match the dynamic nature of speech signals. We propose a novel method to derive bargaining powers based on the short-time energy of the input speech signals, and subsequently allocate rates to the users accordingly. An important merit of the proposed framework is that it is general and applicable to resource allocation across a variety of multirate speech coders, and it is robust to a variety of speech quality metrics. The proposed system also involves a quick, low-complexity training process. We generalize the algorithm to scenarios in which users have unequally weighted priorities; such scenarios might arise in emergencies, in which certain users are more important than others. The proposed rate allocation system is shown to increase the utility measures for both the Itakura and segmental signal-to-noise ratio (SNR) functions relative to a baseline system that performs uniform rate allocation. Additionally, although the instantaneous bit-rate resolution of the speech encoder is not changed, the proposed system is shown to increase the short-time average bit-rate resolution, and therefore provides a greater number of operational rate modes for the network.

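    A toy sketch of a generalized Kalai-Smorodinsky split of a total rate between two users, assuming hypothetical logarithmic rate-utility curves and treating the bargaining powers as weights on the normalized utility gains; the paper instead derives the powers w1, w2 from short-time speech energy.

        import numpy as np

        def ks_rate_split(total_rate, g1, g2, w1, w2, iters=60):
            u = lambda g, r: np.log1p(g * r)                # assumed utility curve
            i1, i2 = u(g1, total_rate), u(g2, total_rate)   # ideal points
            lo, hi = 0.0, total_rate
            for _ in range(iters):                  # bisection on user 1's rate
                r1 = 0.5 * (lo + hi)
                # generalized KS: equalize power-weighted normalized gains
                f = u(g1, r1) / (w1 * i1) - u(g2, total_rate - r1) / (w2 * i2)
                lo, hi = ((lo, r1) if f > 0 else (r1, hi))
            return r1, total_rate - r1
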
  • Wideband Speech Coding Advances in VMR-WB Standard

    Publication Year: 2007, Page(s): 1167 - 1179
    Cited by: Papers (8) | Patents (1)

    This paper presents novel techniques for source-controlled variable-rate wideband speech coding. These techniques have been used in the variable-rate multimode wideband (VMR-WB) speech codec recently selected by the Third-Generation Partnership Project 2 (3GPP2) for wideband (WB) speech telephony, streaming, and multimedia messaging services in the cdma2000 third-generation wireless system. The codec uses efficient coding modes optimized for different classes of speech signal, including generic coding based on AMR-WB for transients and onsets, voiced coding optimized for stable voiced signals, unvoiced coding optimized for unvoiced segments, and comfort noise generation for inactive segments. Several innovations enable very good performance at average bit rates below 8 kb/s for active speech coding. The article presents an overview of the codec and describes in detail some of its novel features: a robust pitch tracking algorithm, coding-mode-dependent prediction for linear prediction (LP) filter quantization, and novel frame erasure concealment techniques, including supplementary information for reconstructing lost onsets and improving decoder convergence. Selected results from the Selection and Characterization tests of the codec illustrate its performance.

  • A Spectral Conversion Approach to Single-Channel Speech Enhancement

    Publication Year: 2007, Page(s): 1180 - 1193
    Cited by: Papers (4) | Patents (2)

    In this paper, a novel method for single-channel speech enhancement is proposed, based on a spectral conversion feature denoising approach. Spectral conversion has been applied previously in the context of voice conversion, where it was shown to successfully transform spectral features with particular statistical properties into spectral features that best fit (under the constraint of a piecewise linear transformation) different target statistics. This spectral transformation is applied as an initialization step to two well-known single-channel enhancement methods, namely the iterative Wiener filter (IWF) and a particular iterative implementation of the Kalman filter. In both cases, spectral conversion is shown to provide a significant improvement compared with initializations that use the spectral features taken directly from the noisy speech. In essence, the proposed approach allows these two algorithms to be applied in a user-centric manner, when "clean" speech training data are available from a particular speaker. The extra step of spectral conversion is shown to offer a significant output signal-to-noise ratio (SNR) improvement over the conventional initializations, reaching 2 dB for the IWF and 6 dB for the Kalman filtering algorithm, for low input SNRs and for white and colored noise, respectively.

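    The conversion step can be pictured with a single global affine map fitted by least squares on paired noisy/clean training features; the paper's transform is piecewise linear (one such map per mixture component), so this is a deliberately simplified sketch.

        import numpy as np

        def fit_conversion(noisy, clean, ridge=1e-3):
            """Least-squares affine map: clean ~ [noisy, 1] @ W."""
            X = np.hstack([noisy, np.ones((len(noisy), 1))])
            return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]),
                                   X.T @ clean)

        def convert(noisy, W):
            X = np.hstack([noisy, np.ones((len(noisy), 1))])
            return X @ W    # denoised features used to initialize IWF/Kalman
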
  • Noisy Speech Enhancement Using Harmonic-Noise Model and Codebook-Based Post-Processing

    Publication Year: 2007, Page(s): 1194 - 1203
    Cited by: Papers (13)

    This paper presents a post-processing speech restoration module for enhancing the performance of conventional speech enhancement methods. The restoration module aims to retrieve parts of the speech spectrum that may be lost to noise or suppressed by conventional speech enhancement methods. The proposed restoration method uses a harmonic plus noise model (HNM) of speech to retrieve damaged speech structure. A modified HNM of speech is proposed in which, instead of the conventional binary labeling of the signal in each subband as voiced or unvoiced, the concept of harmonicity is introduced, which is more adaptable to the codebook mapping method used in the later stage of enhancement. To restore the lost or suppressed information, an HNM codebook mapping technique is proposed. The HNM codebook is trained on speaker-independent speech data, and a spectral energy normalization process is introduced to reduce the sensitivity of the codebook to speaker variability. The proposed post-processing method is tested as an add-on module with several popular noise reduction methods. Evaluations of the performance gain obtained from the proposed post-processing are presented and compared to standard speech enhancement systems, showing substantial gains in perceptual quality.

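    The harmonicity concept replaces a hard voiced/unvoiced flag with a continuous score per subband. A plausible sketch (the paper's exact definition may differ): the fraction of subband energy lying near multiples of F0.

        import numpy as np

        def subband_harmonicity(frame, fs, f0, band=(300.0, 1000.0), tol=20.0):
            spec = np.abs(np.fft.rfft(frame)) ** 2
            freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
            in_band = (freqs >= band[0]) & (freqs < band[1])
            # distance of each bin to the nearest harmonic of f0
            dist = np.min(np.abs(freqs[:, None]
                                 - f0 * np.arange(1, 40)[None, :]), axis=1)
            total = spec[in_band].sum()
            return spec[in_band & (dist < tol)].sum() / max(total, 1e-12)
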
  • Environmental Independent ASR Model Adaptation/Compensation by Bayesian Parametric Representation

    Publication Year: 2007, Page(s): 1204 - 1217
    Cited by: Papers (2)

    The mismatch between system training and operating conditions can seriously deteriorate the performance of automatic speech recognition (ASR) systems. Various techniques have been proposed to solve this problem in a specified speech environment, but employing them often involves modifying the structure of the ASR system. In this paper, we propose an environment-independent (EI) ASR model parameter adaptation approach based on Bayesian parametric representation (BPR), which is able to adapt ASR models to new environments without changing the structure of the ASR system. The parameter set of the BPR is optimized by a maximum joint likelihood criterion, consistent with that of the hidden Markov model (HMM)-based ASR model, through an independent expectation-maximization (EM) procedure. Variations of the proposed approach are investigated in experiments designed for two different speech environments: the noisy environment provided by the AURORA 2 database and the network environment provided by the NTIMIT database. The performance of the proposed EI ASR model compensation approach is compared to that of the cepstral mean normalization (CMN) approach, one of the standard techniques for additive noise compensation. The experimental results show that the performance of ASR models in different speech environments is significantly improved after adaptation by the proposed BPR model compensation approach.

  • Simulation of Losses Due to Turbulence in the Time-Varying Vocal System

    Publication Year: 2007, Page(s): 1218 - 1226
    Cited by: Papers (3)

    Flow separation in the vocal system at the outlet of a constriction causes turbulence and a fluid dynamic pressure loss. In articulatory synthesizers, the pressure drop associated with such a loss is usually assumed to be concentrated at one specific position near the constriction and is represented by a lumped nonlinear resistance to the flow. This paper highlights discontinuity problems of this simplified loss treatment when the constriction location changes during dynamic articulation. The discontinuities can manifest as undesirable acoustic artifacts in the synthetic speech signal that need to be avoided for high-quality articulatory synthesis. We present a solution to this problem based on a more realistic, distributed treatment of fluid dynamic pressure changes. The proposed method was implemented in an articulatory synthesizer, where it proved to prevent such acoustic artifacts.

  • Variable-Length Unit Selection in TTS Using Structural Syntactic Cost

    Publication Year: 2007, Page(s): 1227 - 1235
    Cited by: Papers (5)

    This paper presents a variable-length unit selection scheme based on a syntactic cost for selecting text-to-speech (TTS) synthesis units. The syntactic structure of a sentence is derived from a probabilistic context-free grammar (PCFG) and represented as a syntactic vector. The syntactic difference between target and candidate units (words or phrases) is estimated by the cosine measure, with the inside probability of the PCFG acting as a weight. Latent semantic analysis (LSA) is applied to reduce the dimensionality of the syntactic vectors. A dynamic programming algorithm is adopted to obtain a concatenated unit sequence with minimum cost. A syntactic-property-rich speech database is designed and collected as the unit inventory. Several experiments with statistical testing are conducted to assess the quality of the synthetic speech as perceived by human subjects. The proposed method outperforms a synthesizer that does not consider syntactic properties, and the structural syntax estimates the substitution cost better than acoustic features alone.

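    The substitution cost can be sketched as a weighted cosine distance between the LSA-reduced syntactic vectors of the target and candidate units; assigning one inside-probability-derived weight per retained dimension is an assumption of this sketch.

        import numpy as np

        def syntactic_cost(target_vec, cand_vec, weights):
            """1 - weighted cosine similarity; `weights` stand in for the
            PCFG inside probabilities described in the abstract."""
            w = np.asarray(weights)
            num = np.sum(w * target_vec * cand_vec)
            den = (np.sqrt(np.sum(w * target_vec ** 2))
                   * np.sqrt(np.sum(w * cand_vec ** 2)))
            return 1.0 - num / max(den, 1e-12)
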
  • Audio Signal Feature Extraction and Classification Using Local Discriminant Bases

    Publication Year: 2007, Page(s): 1236 - 1246
    Cited by: Papers (20) | Patents (2)

    Audio feature extraction plays an important role in analyzing and characterizing audio content. Auditory scene analysis, content-based retrieval, indexing, and fingerprinting of audio are a few of the applications that require efficient feature extraction. The key to extracting strong features that characterize the complex nature of audio signals is to identify their discriminatory subspaces. In this paper, we propose an audio feature extraction and multigroup classification scheme that focuses on identifying discriminatory time-frequency subspaces using the local discriminant bases (LDB) technique. Two dissimilarity measures were used in the process of selecting the LDB nodes and extracting features from them. The extracted features were then fed to a linear discriminant analysis-based classifier for a three-level hierarchical classification of audio signals into ten classes. In the first level, the audio signals were grouped into artificial and natural sounds. Each of the first-level groups was subdivided to form the second-level groups, viz. instrumental, automobile, human, and nonhuman sounds. The third level was formed by subdividing the four groups of the second level into the final ten groups (drums, flute, piano, aircraft, helicopter, male, female, animals, birds, and insects). A database of 213 audio signals was used in this study, and average classification accuracies of 83% for the first level (113 artificial and 100 natural sounds), 92% for the second level (73 instrumental and 40 automobile sounds; 40 human and 60 nonhuman sounds), and 89% for the third level (27 drums, 15 flute, and 31 piano sounds; 23 aircraft and 17 helicopter sounds; 20 male and 20 female speech; 20 animal, 20 bird, and 20 insect sounds) were achieved. In addition, a separate classification was performed combining the LDB features with the mel-frequency cepstral coefficients; the average classification accuracies achieved using the combined features were 91% for the first level, 99% for the second level, and 95% for the third level.

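    A common LDB dissimilarity measure is a symmetric relative entropy between per-class normalized energy maps at each wavelet-packet node; the nodes where the classes differ most form the feature subspace. A minimal sketch, assuming the class energy maps are precomputed:

        import numpy as np

        def node_dissimilarity(energy_a, energy_b):
            """Symmetric KL divergence between two class energy maps."""
            p = np.maximum(energy_a / energy_a.sum(), 1e-12)
            q = np.maximum(energy_b / energy_b.sum(), 1e-12)
            return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
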
  • Melody Transcription From Music Audio: Approaches and Evaluation

    Publication Year: 2007, Page(s): 1247 - 1256
    Cited by: Papers (30)

    Although the process of analyzing an audio recording of a music performance is complex and difficult even for a human listener, there are limited forms of information that may be tractably extracted and yet still enable interesting applications. We discuss melody (roughly, the part a listener might whistle or hum) as one such reduced descriptor of music audio, and consider how to define it and what use it might be. We go on to describe the results of full-scale evaluations of melody transcription systems conducted in 2004 and 2005, including an overview of the systems submitted, details of how the evaluations were conducted, and a discussion of the results. For our definition of melody, current systems can achieve around 70% correct transcription at the frame level, including distinguishing between the presence or absence of the melody. Melodies transcribed at this level are readily recognizable and show promise for practical applications.

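    A sketch of a frame-level scoring rule of the kind used in these evaluations (the exact rules varied between 2004 and 2005): a frame counts as correct if voicing matches and, when both tracks are voiced, the pitches agree within a quarter tone.

        import numpy as np

        def frame_accuracy(ref_f0, est_f0, tol_cents=50.0):
            """ref_f0, est_f0: per-frame F0 in Hz, 0 meaning unvoiced."""
            ref_v, est_v = ref_f0 > 0, est_f0 > 0
            both = ref_v & est_v
            cents = np.zeros_like(ref_f0, dtype=float)
            cents[both] = 1200.0 * np.abs(np.log2(est_f0[both] / ref_f0[both]))
            correct = (~ref_v & ~est_v) | (both & (cents <= tol_cents))
            return correct.mean()
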
  • Melody Extraction and Musical Onset Detection via Probabilistic Models of Framewise STFT Peak Data

    Publication Year: 2007, Page(s): 1257 - 1272
    Cited by: Papers (10)

    We propose a probabilistic method for the joint segmentation and melody extraction of musical audio signals that arise from a monophonic score. The method operates on framewise short-time Fourier transform (STFT) peaks, enabling a computationally efficient inference of note onset, duration, and pitch attributes while retaining sufficient information for pitch determination and spectral change detection. The system explicitly models note events in terms of transient and steady-state regions, as well as possible gaps between note events. In this way, the system readily distinguishes abrupt spectral changes associated with musical onsets from other abrupt change events. Additionally, the method may incorporate melodic context by modeling note-to-note dependences. The method is successfully applied to a variety of piano and violin recordings containing reverberation, effective polyphony due to legato playing style, expressive pitch variations, and background voices. While the method does not provide a sample-accurate segmentation, it facilitates the latter in subsequent processing by isolating musical onsets to frame neighborhoods and identifying possible pitch content before and after the true onset sample location.

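    The framewise peak data such a model operates on can be produced with a few lines of NumPy; this sketch simply keeps the largest local maxima of each windowed frame's magnitude spectrum.

        import numpy as np

        def frame_peaks(frame, fs, n_fft=2048, top=20):
            mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
            is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
            idx = np.where(is_peak)[0] + 1
            idx = idx[np.argsort(mag[idx])[::-1][:top]]   # strongest peaks
            return idx * fs / n_fft, mag[idx]             # (freqs in Hz, magnitudes)
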
  • Low Bit-Rate Object Coding of Musical Audio Using Bayesian Harmonic Models

    Publication Year: 2007, Page(s): 1273 - 1282
    Cited by: Papers (10)

    This paper deals with the decomposition of music signals into pitched sound objects made of harmonic sinusoidal partials for very low bit-rate coding purposes. After a brief review of existing methods, we recast this problem in the Bayesian framework. We propose a family of probabilistic signal models combining learned object priors and various perceptually motivated distortion measures. We design efficient algorithms to infer object parameters and build a coder based on the interpolation of frequency and amplitude parameters. Listening tests suggest that the loudness-based distortion measure outperforms the other distortion measures and that our coder achieves better sound quality than baseline transform and parametric coders at 8 and 2 kbit/s. This work constitutes a new step towards a fully object-based coding system, which would represent audio signals as collections of meaningful note-like sound objects.

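    The decoder side of "interpolation of frequency and amplitude parameters" can be sketched for a single partial: linearly interpolate the frame-wise parameters up to the sample rate, then integrate the frequency to obtain the phase. All variable names here are illustrative.

        import numpy as np

        def synth_partial(f0_frames, amp_frames, hop, fs):
            n = len(f0_frames) * hop
            t = np.arange(n) / hop                      # time in frame units
            f = np.interp(t, np.arange(len(f0_frames)), f0_frames)
            a = np.interp(t, np.arange(len(amp_frames)), amp_frames)
            phase = 2.0 * np.pi * np.cumsum(f) / fs     # integrate frequency
            return a * np.sin(phase)
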
  • Joint Detection and Tracking of Time-Varying Harmonic Components: A Flexible Bayesian Approach

    Publication Year: 2007, Page(s): 1283 - 1295
    Cited by: Papers (12)

    This paper addresses the joint estimation and detection of time-varying harmonic components in audio signals. We adopt a flexible viewpoint in which several frequency/amplitude trajectories are tracked in the spectrogram using particle filtering; the core idea is that each harmonic component (a fundamental partial together with several overtone partials) is considered a target. Tracking requires defining a state-space model with state transition and measurement equations. Particle filtering algorithms rely on a so-called sequential importance distribution, and we show that it can be built on previous multipitch estimation algorithms, yielding an even more efficient estimation procedure with established convergence properties. Moreover, as our model captures all the harmonic model information, it can actually separate the harmonic sources. Simulations on synthetic and real music data demonstrate the merit of our approach.

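    A bootstrap particle filter for a single F0 trajectory gives the flavor of the approach, although the paper tracks several harmonic components jointly and builds its importance distribution from multipitch estimators. Random-walk dynamics and a harmonic-comb likelihood are assumptions of this sketch.

        import numpy as np

        def track_f0(spectrogram, freqs, n_particles=500, drift=2.0, seed=0):
            """spectrogram: (n_frames, n_bins) power; freqs: bin centers (Hz)."""
            rng = np.random.default_rng(seed)
            p = rng.uniform(80.0, 400.0, n_particles)   # F0 particles (Hz)
            track = []
            for frame in spectrogram:
                p = np.clip(p + rng.normal(0.0, drift, n_particles), 60.0, 500.0)
                w = np.zeros(n_particles)
                for h in range(1, 6):                   # comb likelihood
                    idx = np.clip(np.searchsorted(freqs, h * p), 0, len(freqs) - 1)
                    w += frame[idx]
                w = np.maximum(w, 1e-12)
                w /= w.sum()
                track.append(float(np.sum(w * p)))      # posterior-mean F0
                p = p[rng.choice(n_particles, n_particles, p=w)]  # resample
            return np.array(track)
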
  • Robust Data Hiding in Audio Using Allpass Filters

    Publication Year: 2007, Page(s): 1296 - 1304
    Cited by: Papers (7)

    A novel technique is proposed for data hiding in digital audio that exploits the low sensitivity of the human auditory system to phase distortion. Inaudible but controlled phase changes are introduced into the host audio using a set of allpass filters (APFs) with distinct parameters, i.e., pole-zero locations, and the APF parameters are chosen to encode the embedded information. During detection, the power spectrum of the audio data is estimated in the z-plane away from the unit circle, and this power spectrum is used to estimate the APF pole locations and decode the information. Experimental results show that the proposed data hiding scheme can effectively withstand standard data manipulation attacks. Moreover, the proposed scheme is shown to embed 5-8 times more data than existing audio data hiding schemes while providing comparable perceptual performance and robustness.

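    Embedding via a first-order allpass section is easy to sketch: the filter has unit magnitude response everywhere, so only the phase changes, and the chosen pole encodes the payload bit. The paper uses sets of (possibly higher-order) APFs and recovers the poles from a power spectrum estimated off the unit circle; this shows the embedding side only, with made-up pole values.

        import numpy as np
        from scipy.signal import lfilter

        def embed_bit(x, bit, poles=(0.5, 0.8)):
            """First-order allpass H(z) = (z^-1 - a) / (1 - a z^-1)."""
            a = poles[int(bit)]                 # pole location encodes the bit
            return lfilter([-a, 1.0], [1.0, -a], x)
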
  • System Identification in the Short-Time Fourier Transform Domain With Crossband Filtering

    Publication Year: 2007, Page(s): 1305 - 1319
    Cited by: Papers (30)

    In this paper, we investigate the influence of crossband filters on a system identifier implemented in the short-time Fourier transform (STFT) domain. We derive analytical relations between the number of crossband filters that are useful for system identification in the STFT domain and the power and length of the input signal. We show that increasing the number of crossband filters does not necessarily imply a lower steady-state mean-square error (MSE) in the subbands. The number of useful crossband filters depends on the power ratio between the input signal and the additive noise signal. Furthermore, it depends on the effective length of the input signal employed for system identification, which is restricted so that the algorithm can track time variations in the system. As the power of the input signal increases, or as the time variations in the system become slower, a larger number of crossband filters may be utilized. The proposed subband approach is compared to the conventional fullband approach and to the commonly used subband approach that relies on the multiplicative transfer function (MTF) approximation, in terms of MSE performance and computational complexity. Experimental results verify the theoretical derivations and demonstrate the relations between the number of useful crossband filters and the power and length of the input signal.

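    The crossband model itself is compact: the output in band k is a sum of convolutions (along the frame axis) of neighboring input bands with band-to-band filters. A sketch, keeping `width` crossband filters on each side of the diagonal; the array layout is an assumption of this sketch.

        import numpy as np

        def stft_filter(X, H, width=1):
            """X: (n_frames, n_bins) STFT; H: (n_bins, 2*width+1, filt_len),
            where H[k, j] maps input band k + j - width into output band k."""
            n_frames, n_bins = X.shape
            Y = np.zeros_like(X, dtype=complex)
            for k in range(n_bins):
                for j in range(2 * width + 1):
                    src = k + j - width
                    if 0 <= src < n_bins:
                        Y[:, k] += np.convolve(X[:, src], H[k, j])[:n_frames]
            return Y
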
  • An Improvement of the Two-Path Algorithm Transfer Logic for Acoustic Echo Cancellation

    Publication Year: 2007, Page(s): 1320 - 1326
    Cited by: Papers (5)

    Adaptive filters for echo cancellation generally need update control schemes to avoid divergence in case of significant disturbances. The two-path algorithm avoids unnecessary halting of the adaptive filter when the control scheme gives an erroneous output, and versions of this algorithm have previously been presented for echo cancellation. This paper presents a transfer logic that improves the convergence speed of the two-path algorithm for acoustic echo cancellation while retaining its robustness. Results from simulations show improved performance, and a fixed-point DSP implementation verifies the performance in real time.

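    The two-path structure is easy to sketch: a background filter adapts continuously, and its coefficients are copied to the fixed foreground filter only when it demonstrably cancels the echo better. The single-threshold transfer test below is a crude stand-in for the paper's improved transfer logic.

        import numpy as np

        def two_path_step(x_buf, d, w_bg, w_fg, mu=0.5, eps=1e-6):
            """One NLMS step; x_buf holds the latest far-end samples."""
            e_bg = d - w_bg @ x_buf
            e_fg = d - w_fg @ x_buf
            w_bg = w_bg + mu * e_bg * x_buf / (x_buf @ x_buf + eps)
            if abs(e_bg) < 0.5 * abs(e_fg):     # simplistic transfer condition
                w_fg = w_bg.copy()              # background -> foreground copy
            return w_bg, w_fg, e_fg
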
  • Direction of Arrival Estimation Using the Parameterized Spatial Correlation Matrix

    Publication Year: 2007, Page(s): 1327 - 1339
    Cited by: Papers (13)

    The estimation of the direction of arrival (DOA) of one or more acoustic sources is an area that has generated much interest in recent years, with applications such as automatic video camera steering and multiparty stereophonic teleconferencing entering the market. DOA estimation algorithms are hindered by the effects of background noise and reverberation. Methods based on time differences of arrival (TDOAs) are commonly used to determine the azimuth angle of arrival of an acoustic source, but TDOA-based methods compute each relative delay using only two microphones, even though additional microphones are usually available. This paper deals with DOA estimation based on spatial spectral estimation and establishes the parameterized spatial correlation matrix as the framework for this class of DOA estimators. This matrix jointly takes all pairs of microphones into account and is at the heart of several broadband spatial spectral estimators, including steered-response power (SRP) algorithms. The paper reviews and evaluates these broadband spatial spectral estimators, comparing their performance to TDOA-based locators. In addition, an eigenanalysis of the parameterized spatial correlation matrix is performed and reveals that such analysis allows one to estimate the channel attenuation resulting from factors such as uncalibrated microphones. This estimate generalizes the broadband minimum variance spatial spectral estimator to more general signal models. A DOA estimator based on the multichannel cross-correlation coefficient (MCCC) is also proposed, and the performance of all proposed algorithms is included in the evaluation. It is shown that adding extra microphones helps combat the effects of background noise and reverberation. Furthermore, the link between accurate spatial spectral estimation and corresponding DOA estimation is investigated. Applying the minimum variance and MCCC methods to the spatial spectral estimation problem leads to better resolution than that of the commonly used fixed-weighted SRP spectrum; however, this increased spatial spectral resolution does not always translate into more accurate DOA estimation.

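    The fixed-weighted SRP estimator mentioned above amounts to delay-and-sum steering followed by a power measurement. A sketch for a linear array (far-field, single frame, assumed geometry):

        import numpy as np

        def srp_doa(frames, mic_x, fs, c=343.0):
            """frames: (n_mics, n_samples); mic_x: mic positions along the array (m)."""
            F = np.fft.rfft(frames, axis=1)
            freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / fs)
            angles = np.deg2rad(np.arange(0, 181, 2))
            power = []
            for theta in angles:
                tau = mic_x * np.cos(theta) / c          # per-mic delay (s)
                steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
                power.append(np.sum(np.abs(np.sum(F * steer, axis=0)) ** 2))
            return np.degrees(angles[int(np.argmax(power))])
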
  • Multichannel Bin-Wise Robust Frequency-Domain Adaptive Filtering and Its Application to Adaptive Beamforming

    Publication Year: 2007, Page(s): 1340 - 1351
    Cited by: Papers (8)

    Least-squares error (LSE) or mean-squared error (MSE) optimization criteria lead to adaptive filters that are highly sensitive to impulsive noise. The sensitivity to noise bursts increases with the convergence speed of the adaptation algorithm and limits the performance of signal processing algorithms, especially when fast convergence is required, as, for example, in adaptive beamforming for speech and audio signal acquisition or in acoustic echo cancellation. In these applications, noise bursts are frequently due to undetected double-talk. In this paper, we present impulsive-noise-robust multichannel frequency-domain adaptive filters (MC-FDAFs) based on outlier-robust M-estimation using a Newton algorithm and a discrete Newton algorithm, which are especially designed for frequency bin-wise adaptation control. Bin-wise adaptation and control in the frequency domain enables the application of the outlier-robust MC-FDAFs to a generalized sidelobe canceler (GSC) using an adaptive blocking matrix for speech and audio signal acquisition. It is shown that the improved robustness leads to faster convergence and higher interference suppression relative to nonrobust adaptation algorithms, especially during periods of strong interference.

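    The essence of outlier-robust bin-wise adaptation can be sketched with a Huber-type clipping of the error in a per-bin NLMS update; the paper derives proper Newton and discrete Newton algorithms instead, so this is only an illustration of the M-estimation idea.

        import numpy as np

        def robust_bin_update(w, x, d, scale, mu=0.5, clip=1.345, eps=1e-8):
            """One frequency bin: w, x complex vectors; d complex desired sample."""
            e = d - np.vdot(w, x)                     # a-priori error
            if np.abs(e) > clip * scale:              # clip impulsive bursts
                e *= clip * scale / np.abs(e)         # (double-talk protection)
            w = w + mu * np.conj(e) * x / (np.vdot(x, x).real + eps)
            return w, e
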
  • Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition

    Publication Year: 2007, Page(s): 1352 - 1365
    Cited by: Papers (29) | Patents (1)

    This paper proposes a novel one-pass search algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large-vocabulary continuous-speech recognition. In the standard search method with on-the-fly composition, two or more WFSTs are composed during decoding, and a Viterbi search is performed over the composed search space. With the new method, a Viterbi search is performed based on the first of the two WFSTs, and the second WFST is used only to rescore the hypotheses generated during the search. Since this rescoring is very efficient, the total amount of computation required by the new method is almost the same as when using only the first WFST. In a 65k-word-vocabulary spontaneous lecture speech transcription task, the proposed method significantly outperformed the standard search method. Furthermore, it was faster than decoding with a single fully composed and optimized WFST, while using only 38% of the memory required for decoding with the single WFST. Finally, we achieved high-accuracy one-pass real-time speech recognition with an extremely large vocabulary of 1.8 million words.

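    The rescoring idea can be caricatured with language models standing in for the two WFSTs: search proceeds with a cheap model, and each hypothesis is corrected on the fly by the score difference to the expensive model, so the fully composed search space is never built. `cheap_lm` and `big_lm` are hypothetical log-probability functions of (history, word).

        def viterbi_extend(hyp, word, cheap_lm):
            """hyp: (word_list, log_score); extend using only the cheap model."""
            words, score = hyp
            return words + [word], score + cheap_lm(words, word)

        def rescore_last_word(hyp, cheap_lm, big_lm):
            """On-the-fly rescoring: swap the newest cheap score for the
            big-model score without composing the two models up front."""
            words, score = hyp
            hist, w = words[:-1], words[-1]
            return words, score - cheap_lm(hist, w) + big_lm(hist, w)
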
  • A Study of Variable-Parameter Gaussian Mixture Hidden Markov Modeling for Noisy Speech Recognition

    Publication Year: 2007, Page(s): 1366 - 1376
    Cited by: Papers (14) | Patents (2)

    To improve recognition performance in noisy environments, multicondition training is usually applied, in which speech signals corrupted by a variety of noises are used in acoustic model training. Conventional hidden Markov modeling of speech then uses multiple Gaussian distributions to cover the spread of the speech distribution caused by noise, which distracts from the modeling of the speech events themselves and may sacrifice performance on clean speech. In this paper, we propose a novel approach that extends the conventional Gaussian mixture hidden Markov model (GMHMM) by modeling the state emission parameters (means and variances) as a polynomial function of a continuous environment-dependent variable. At recognition time, a set of HMMs specific to the given value of the environment variable is instantiated and used for recognition. The maximum-likelihood (ML) estimation of the polynomial functions of the proposed variable-parameter GMHMM is given within the expectation-maximization (EM) framework. Experiments on the Aurora 2 database show significant improvements of the variable-parameter Gaussian mixture HMMs over the conventional GMHMMs.

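    Instantiating the variable-parameter model for a given environment value is just polynomial evaluation; a sketch with hypothetical array shapes:

        import numpy as np

        def instantiate_means(coeffs, env):
            """coeffs: (order+1, n_states, dim) EM-trained polynomial
            coefficients; env: scalar environment variable (e.g., SNR in dB).
            Returns the (n_states, dim) Gaussian means for this environment."""
            powers = env ** np.arange(coeffs.shape[0])     # 1, e, e^2, ...
            return np.tensordot(powers, coeffs, axes=1)
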
  • Template-Based Continuous Speech Recognition

    Publication Year: 2007, Page(s): 1377 - 1390
    Cited by: Papers (27)

    Despite their known weaknesses, hidden Markov models (HMMs) have been the dominant technique for acoustic modeling in speech recognition for over two decades. Still, the advances in the HMM framework have not solved its key problems: it discards information about time dependencies and is prone to overgeneralization. In this paper, we attempt to overcome these problems by relying on straightforward template matching. The basis for the recognizer is the well-known DTW algorithm; however, classical DTW continuous speech recognition results in an explosion of the search space. The traditional top-down search is therefore complemented with a data-driven selection of candidates for DTW alignment. We also extend the DTW framework with a flexible subword unit mechanism and a class-sensitive distance measure, two components suggested by state-of-the-art HMM systems. The added flexibility of the unit selection in the template-based framework leads to new approaches to speaker and environment adaptation. The template matching system reaches a performance somewhat worse than the best published HMM results for the Resource Management benchmark, but thanks to the complementarity of errors between the HMM and DTW systems, the combination of the two reduces the word error rate by 17% compared to the HMM results.

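    For reference, the DTW alignment at the core of such a recognizer, in its textbook form; the paper adds data-driven candidate selection, flexible subword units, and a class-sensitive local distance on top of this.

        import numpy as np

        def dtw_distance(a, b):
            """a, b: (n_frames, dim) feature sequences."""
            D = np.full((len(a) + 1, len(b) + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[-1, -1] / (len(a) + len(b))   # length-normalized score
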

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research