IEEE Transactions on Audio, Speech, and Language Processing

Issue 8 • Nov. 2011

Displaying Results 1 - 25 of 41
  • Table of contents

    Publication Year: 2011 , Page(s): C1 - C4
    PDF (102 KB)
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2011 , Page(s): C2
    PDF (38 KB)
    Freely Available from IEEE
  • Probabilistic Template-Based Chord Recognition

    Publication Year: 2011 , Page(s): 2249 - 2259
    Cited by:  Papers (2)
    PDF (776 KB) | HTML

    This paper describes a probabilistic approach to template-based chord recognition in music signals. The algorithm takes only chromagram data and a user-defined dictionary of chord templates as input; no training or musical information such as key, rhythm, or chord transition models is required. Chord occurrences are treated as probabilistic events whose probabilities are learned from the song using an expectation-maximization (EM) algorithm. The adaptive estimation of these probabilities (together with an ad hoc postprocessing filter) has the desirable effect of smoothing out spurious chords that would occur in our previous baseline work. Our algorithm is compared to various methods that entered the Music Information Retrieval Evaluation eXchange (MIREX) in 2008 and 2009, using a diverse set of evaluation metrics, some of which are new. The systems are tested on two evaluation corpora: the first is composed of the Beatles catalog (180 pop-rock songs), and the second comprises 20 songs from various artists and music genres. Results show that our method outperforms state-of-the-art chord recognition systems.
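The template-matching core that the EM procedure builds on can be sketched as follows; this is a minimal illustration, not the paper's algorithm, and the two binary chroma templates are hypothetical examples:

```python
import numpy as np

# Hypothetical binary 12-bin chroma templates for two chords.
templates = {
    "C:maj": np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], dtype=float),
    "A:min": np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0], dtype=float),
}

def chord_posteriors(chroma):
    """Normalized template-chroma correlations, read as chord probabilities."""
    c = chroma / (np.linalg.norm(chroma) + 1e-12)
    scores = {}
    for name, t in templates.items():
        scores[name] = max(float((t / np.linalg.norm(t)) @ c), 1e-12)
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

frame = np.array([1, 0, 0, 0, 0.9, 0, 0, 0.8, 0, 0.1, 0, 0])
post = chord_posteriors(frame)
```

In the paper's setting these scores become chord probabilities that are then re-estimated with EM rather than used directly.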

  • Binaural Noise Reduction in the Time Domain With a Stereo Setup

    Publication Year: 2011 , Page(s): 2260 - 2272
    Cited by:  Papers (4)
    PDF (1916 KB) | HTML

    Binaural noise reduction with a stereophonic (or simply stereo) setup has become an important problem as stereo sound systems and devices are increasingly deployed in modern voice communications. The problem is challenging because it requires not only reducing the noise at the stereo inputs, but also preserving the spatial information embodied in the two channels, so that after noise reduction the listener can still localize the sound source from the binaural outputs. As a result, simply applying a traditional single-channel noise reduction technique to each channel individually may not work, as the spatial cues may be destroyed. In this paper, we present a new formulation of the binaural noise reduction problem in stereo systems. We first form a complex signal from the stereo inputs, with one channel as its real part and the other as its imaginary part. The binaural noise reduction problem can then be handled by a single-channel widely linear filter. Widely linear estimation theory is used to derive optimal noise reduction filters that fully exploit the noncircularity of the complex speech signal to achieve noise reduction while preserving the desired speech signal and spatial information. With this formulation, the Wiener, minimum variance distortionless response (MVDR), maximum signal-to-noise ratio (SNR), and tradeoff filters are derived. Experiments are provided to justify the effectiveness of these filters.
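The channel-pairing idea can be illustrated with a toy widely linear Wiener estimate; the signal model below is invented for illustration and is not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
# Toy noncircular "stereo speech": the two channels (real/imag parts) are correlated.
a = rng.standard_normal(n)
s = a + 1j * (0.8 * a + 0.2 * rng.standard_normal(n))
v = 0.5 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))  # circular noise
z = s + v                                       # complex input: left + j*right

# Widely linear estimate s_hat = c1*z + c2*conj(z), fitted in the least-squares sense.
A = np.vstack([z, z.conj()]).T                  # n x 2 augmented regressor
c, *_ = np.linalg.lstsq(A, s, rcond=None)
s_hat = A @ c

# Strictly linear baseline (ignores noncircularity): s_hat_lin = c0*z.
c0 = (z.conj() @ s) / (z.conj() @ z)
mse_wl = np.mean(np.abs(s - s_hat) ** 2)
mse_lin = np.mean(np.abs(s - c0 * z) ** 2)
mse_in = np.mean(np.abs(s - z) ** 2)
```

Because the simulated speech is noncircular, the augmented (widely linear) estimate beats the strictly linear one, which in turn beats the unprocessed input.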

  • Proportionate Affine Projection Sign Algorithms for Network Echo Cancellation

    Publication Year: 2011 , Page(s): 2273 - 2284
    Cited by:  Papers (6)
    PDF (2042 KB) | HTML

    Two proportionate affine projection sign algorithms (APSAs) are proposed for network echo cancellation (NEC) applications, where the impulse response is often real-valued with sparse coefficients and a long filter length. The proposed proportionate-type algorithms achieve fast convergence and low steady-state misalignment by applying a proportionate regularization matrix to the APSA. Benefiting from the characteristics of l1-norm optimization, affine projection, and the proportionate matrix, the new algorithms are more robust to impulsive interference and colored input than the proportionate normalized least mean squares (PNLMS) algorithm and the robust proportionate affine projection algorithm (robust PAPA). The new algorithms also achieve a much faster convergence rate for sparse impulse responses than the original APSA and the normalized sign algorithm (NSA). They are robust across NEC impulse responses of differing sparseness, without the need to change parameters or estimate the sparseness of the impulse response. The computational complexity of the new algorithms is lower than that of the affine projection algorithm (APA) family because the matrix inversion is eliminated.
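A simplified sketch of a proportionate sign-type update (projection order 1, with a PNLMS-style gain rule; the sparse echo path and all constants are illustrative, not the paper's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 64
h = np.zeros(L); h[[5, 20, 40]] = [1.0, -0.5, 0.3]   # sparse echo path (illustrative)
N = 8000
x = rng.standard_normal(N)

w = np.zeros(L)
mu, eps, delta = 0.01, 1e-3, 0.01
for n in range(L, N):
    xv = x[n - L:n][::-1]
    e = float((h - w) @ xv)                  # a priori error (noise-free for brevity)
    # Proportionate gains: larger steps for larger coefficients (PNLMS-style rule).
    g = np.maximum(np.abs(w), delta * max(np.max(np.abs(w)), delta))
    g /= g.sum()
    # Sign update: only the sign of the error is used, hence robustness to impulses.
    w = w + mu * np.sign(e) * (g * xv) / np.sqrt(xv @ (g * xv) + eps)

misalignment = np.linalg.norm(h - w) / np.linalg.norm(h)
```

The sign of the error bounds each update, which is what makes this family robust to impulsive interference, while the proportionate gains concentrate adaptation on the few active taps.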

  • A Feature Compensation Approach Using High-Order Vector Taylor Series Approximation of an Explicit Distortion Model for Noisy Speech Recognition

    Publication Year: 2011 , Page(s): 2285 - 2293
    Cited by:  Papers (11)
    PDF (450 KB) | HTML

    This paper presents a new feature compensation approach to noisy speech recognition that uses a high-order vector Taylor series (HOVTS) approximation of an explicit model of environmental distortions. Formulations for maximum-likelihood (ML) estimation of both additive noise and convolutional distortion, and minimum mean squared error (MMSE) estimation of clean speech, are derived. Experimental results on the Aurora2 and Aurora4 benchmark databases, where the modeling assumption of the distortion model is more accurate, demonstrate that the HOVTS-based feature compensation approaches achieve consistent and significant improvements in recognition accuracy over the traditional first-order VTS-based approach. For a real-world in-vehicle connected-digits recognition task on the Aurora3 benchmark database, where the modeling assumption of the distortion model is less accurate, modifications are necessary to make VTS-based feature compensation work. In this case, the second-order VTS-based approach performs only slightly better than the first-order approach.
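The benefit of higher-order terms can be seen on the scalar log-Mel mismatch function commonly assumed in VTS work (a simplified model with no channel term; the expansion point and perturbation below are arbitrary):

```python
import numpy as np

def g(x, n):
    """Log-Mel-domain mismatch: noisy = clean + log(1 + exp(noise - clean))."""
    return x + np.log1p(np.exp(n - x))

x0, n0 = 10.0, 8.0                       # expansion point (e.g., clean and noise means)
sig = 1.0 / (1.0 + np.exp(-(n0 - x0)))   # dg/dn at the expansion point
gx, gn = 1.0 - sig, sig                  # first-order partials
gxx = gnn = sig * (1.0 - sig)            # second-order partials
gxn = -sig * (1.0 - sig)

dx, dn = -1.0, 1.0                       # perturbation away from the expansion point
true = g(x0 + dx, n0 + dn)
order1 = g(x0, n0) + gx * dx + gn * dn
order2 = order1 + 0.5 * (gxx * dx**2 + 2 * gxn * dx * dn + gnn * dn**2)
err1, err2 = abs(true - order1), abs(true - order2)
```

For this moderate perturbation the second-order expansion tracks the true mismatch noticeably better than the first-order one, which is the effect HOVTS exploits.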

  • Preference Music Ratings Prediction Using Tokenization and Minimum Classification Error Training

    Publication Year: 2011 , Page(s): 2294 - 2303
    Cited by:  Papers (1)
    PDF (493 KB) | HTML

    To address two main limitations of current content-based music recommendation approaches, an ordinal regression algorithm for music recommendation that incorporates dynamic information is presented. Instead of assuming that local spectral features within a song are identically and independently distributed samples of an underlying probability density, music is characterized by a vocabulary of acoustic segment models (ASMs), which are found with an unsupervised process. Further, instead of classifying music into subjective classes such as genre, or trying to find a universal notion of similarity, songs are classified based on personal preference ratings. The ordinal regression approach to ratings prediction is based on the discriminative training algorithm known as minimum classification error (MCE) training. Experimental results indicate that improved temporal modeling leads to superior performance over standard spectral-based music representations. Further, the MCE-based preference ratings algorithm is shown to be superior to two other systems. Analysis demonstrates that the superior performance is due to MCE being a non-conservative algorithm with immunity to outliers.

  • Decimation-Whitening Filter in Spectral Band Replication

    Publication Year: 2011 , Page(s): 2304 - 2313
    PDF (1671 KB) | HTML

    MPEG-4 High-Efficiency Advanced Audio Coding (HE-AAC) has adopted spectral band replication (SBR) to efficiently compress the high-frequency part of the audio signal. In SBR, linear prediction is applied to low-frequency subbands to suppress tonal components and smooth the associated spectra before replication to high-frequency bands. This tone-suppressing process is referred to as whitening filtering. To avoid the aliasing artifacts incurred by spectral adjustment, SBR adopts a complex filterbank instead of a real one. For QMF subbands, this paper demonstrates that the linear prediction defined in the SBR standard results in a predictive bias. A new whitening filter, called the decimation-whitening filter, is proposed to eliminate the predictive bias and provide advantages in terms of the noise-to-signal ratio measure, frequency resolution, energy leakage, and computational complexity for SBR.
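The whitening-filter core, order-2 linear prediction from autocorrelations, can be sketched on a synthetic tonal signal (illustrative only; the SBR standard operates on complex QMF subband samples, which this real-valued toy example does not model):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 4096
e = rng.standard_normal(N)
x = np.zeros(N)
for n in range(2, N):                          # strongly tonal AR(2) source
    x[n] = 1.6 * x[n - 1] - 0.81 * x[n - 2] + e[n]

# Order-2 linear prediction from autocorrelations (Yule-Walker equations).
r = np.array([x[:N - k] @ x[k:] for k in range(3)]) / N
A = np.array([[r[0], r[1]], [r[1], r[0]]])
a = np.linalg.solve(A, r[1:])                  # predictor coefficients
resid = x[2:] - a[0] * x[1:-1] - a[1] * x[:-2] # whitened (prediction) residual
```

The residual has a much flatter spectrum and far smaller variance than the tonal input, which is exactly the suppression effect the whitening filter is after.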

  • Convex Combination of Multiple Statistical Models With Application to VAD

    Publication Year: 2011 , Page(s): 2314 - 2327
    Cited by:  Papers (3)
    PDF (3594 KB) | HTML

    This paper proposes a robust voice activity detector (VAD) based on the observation that the distribution of speech captured with far-field microphones is highly varying, depending on the noise and reverberation conditions. The proposed VAD employs a convex combination scheme comprising three statistical distributions - a Gaussian, a Laplacian, and a two-sided Gamma - to effectively model captured speech. This scheme shows increased ability to adapt to dynamic acoustic environments. The contribution of each distribution to this convex combination is automatically adjusted based on the statistical characteristics of the instantaneous audio input. To further improve the performance of the system, an adaptive threshold is introduced, while a decision-smoothing scheme caters to the intra-frame correlation of speech signals. Extensive experiments under realistic scenarios support the proposed approach of combining several models for increased adaptation and performance.
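The convex-combination idea can be sketched as a three-component mixture with an EM-style weight update; the two-sided Gamma parameterization below (shape 1/2, variance-matched) is an assumption for illustration, not necessarily the paper's:

```python
import math

def gauss(x, s):
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def laplace(x, s):
    b = s / math.sqrt(2)                       # variance-matched Laplacian scale
    return math.exp(-abs(x) / b) / (2 * b)

def gamma2(x, s):
    # Symmetric two-sided Gamma, shape k = 1/2, variance-matched: Var = k(k+1)*th^2.
    k = 0.5
    th = s / math.sqrt(k * (k + 1))
    ax = max(abs(x), 1e-12)
    return 0.5 * ax ** (k - 1) * math.exp(-ax / th) / (math.gamma(k) * th ** k)

def update_weights(frame, s, w):
    """One EM-style step: each model's responsibility share over a frame of samples."""
    r = [0.0, 0.0, 0.0]
    for x in frame:
        ps = [w[0] * gauss(x, s), w[1] * laplace(x, s), w[2] * gamma2(x, s)]
        z = sum(ps)
        r = [ri + pi / z for ri, pi in zip(r, ps)]
    tot = sum(r)
    return [ri / tot for ri in r]

w = update_weights([0.1, -0.2, 3.0, 0.05, -0.1], 1.0, [1/3, 1/3, 1/3])
```

The update keeps the weights convex (nonnegative, summing to one), so the combined density remains a proper mixture as it adapts to the incoming frame statistics.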

  • Reverberant Speech Segregation Based on Multipitch Tracking and Classification

    Publication Year: 2011 , Page(s): 2328 - 2337
    Cited by:  Papers (3)
    PDF (1079 KB) | HTML

    Room reverberation creates a major challenge to speech segregation. We propose a computational auditory scene analysis approach to monaural segregation of reverberant voiced speech, which performs multipitch tracking of reverberant mixtures and supervised classification. Speech and nonspeech models are separately trained, and each learns to map from a set of pitch-based features to a grouping cue which encodes the posterior probability of a time-frequency (T-F) unit being dominated by the source with the given pitch estimate. Because interference may be either speech or nonspeech, a likelihood ratio test selects the correct model for labeling corresponding T-F units. Experimental results show that the proposed system performs robustly in different types of interference and various reverberant conditions, and has a significant advantage over existing systems.

  • Lattice Indexing for Spoken Term Detection

    Publication Year: 2011 , Page(s): 2338 - 2347
    Cited by:  Papers (12)
    PDF (438 KB) | HTML

    This paper considers the problem of constructing an efficient inverted index for the spoken term detection (STD) task. More specifically, we construct a deterministic weighted finite-state transducer storing soft-hits in the form of (utterance ID, start time, end time, posterior score) quadruplets. We propose a generalized factor transducer structure which retains the time information necessary for performing STD. The required information is embedded into the path weights of the factor transducer without disrupting the inherent optimality. We also describe how to index all substrings seen in a collection of raw automatic speech recognition lattices using the proposed structure. Our STD indexing/search implementation is built upon the OpenFst Library and is designed to scale well to large problems. Experiments on Turkish and English data sets corroborate our claims.
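A plain hash-based sketch of the soft-hit quadruplet index (not the weighted factor-transducer construction the paper describes; the toy lattice is invented):

```python
from collections import defaultdict

# Toy "lattice" output: per utterance, word hypotheses with times and posteriors.
hyps = {
    "utt1": [("hello", 0.10, 0.45, 0.92), ("yellow", 0.10, 0.45, 0.08),
             ("world", 0.45, 0.90, 0.99)],
    "utt2": [("world", 1.20, 1.60, 0.71)],
}

index = defaultdict(list)
for utt, words in hyps.items():
    for word, t0, t1, p in words:
        index[word].append((utt, t0, t1, p))   # (utterance ID, start, end, score)

def search(term, threshold=0.5):
    """Return soft hits for a query term, best-scoring first."""
    return sorted((h for h in index[term] if h[3] >= threshold),
                  key=lambda h: -h[3])

hits = search("world")
```

The factor-transducer version additionally indexes all substrings and keeps the times in path weights, but the query-time contract is the same: a list of time-stamped, scored hits.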

  • Improved Modeling of Cross-Decoder Phone Co-Occurrences in SVM-Based Phonotactic Language Recognition

    Publication Year: 2011 , Page(s): 2348 - 2363
    Cited by:  Papers (3)
    PDF (1448 KB) | HTML

    Most common approaches to phonotactic language recognition deal with several independent phone decodings. These decodings are processed and scored in a fully uncoupled way; their time alignment (and the information that may be extracted from it) is completely lost. Recently, we presented two new approaches to phonotactic language recognition which take time alignment information into account by considering time-synchronous cross-decoder phone co-occurrences. Experiments on the 2007 NIST LRE database demonstrated that using phone co-occurrence statistics could improve the performance of baseline phonotactic recognizers. In this paper, approaches based on time-synchronous cross-decoder phone co-occurrences are further developed and evaluated against a baseline SVM-based phonotactic system, using: 1) counts of n-grams (up to 4-grams) of phone co-occurrences; and 2) the degree of co-occurrence of phone n-grams (up to 4-grams). To evaluate these approaches, a set of open-source tools (the Brno University of Technology phone decoders, LIBLINEAR, and FoCal) was used, and experiments were carried out on the 2007 NIST LRE database. The two approaches presented in this paper outperformed the baseline phonotactic system, yielding around 7% relative improvement in terms of CLLR. The fusion of the baseline system with the two proposed approaches yielded 1.83% EER and CLLR=0.270 (an 18% relative improvement), matching the performance (on the same task) of state-of-the-art phonotactic systems that apply more complex models and techniques, thus supporting the use of cross-decoder dependencies for language recognition.
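Frame-synchronous phone co-occurrence labeling and n-gram counting can be sketched as follows (toy decoder outputs; the paper's systems operate on real phone decodings):

```python
from collections import Counter
from itertools import islice

# Toy time-aligned outputs of two phone decoders (one symbol per frame).
dec1 = "a a b b c c".split()
dec2 = "x x x y y z".split()

# Frame-synchronous co-occurrence labels, collapsing consecutive repeats.
co = []
for p, q in zip(dec1, dec2):
    lab = p + "+" + q
    if not co or co[-1] != lab:
        co.append(lab)

def ngram_counts(seq, n):
    """n-gram counts over the co-occurrence sequence, usable as SVM features."""
    return Counter(zip(*(islice(seq, i, None) for i in range(n))))

bigrams = ngram_counts(co, 2)
```

These counts play the role of the usual phone n-gram statistics, except that each symbol jointly encodes what two decoders said at the same instant.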

  • A Generalized Proportionate Subband Adaptive Second-Order Volterra Filter for Acoustic Echo Cancellation in Changing Environments

    Publication Year: 2011 , Page(s): 2364 - 2373
    Cited by:  Papers (4)
    PDF (1238 KB) | HTML

    The performance of a typical linear echo canceller is limited by changes that occur in the acoustic echo path along with nonlinear echoes arising from electrodynamic loudspeaker distortion. This paper presents a subband structure based on proportionately adapted second-order Volterra filters for acoustic echo cancellation under these adverse conditions. The proposed structure allows the time and frequency domain nature of echo path changes to be exploited, resulting in faster convergence and tracking compared to an equivalent fullband structure while requiring significantly less computational complexity. Experimental results, based on measured data from a practical hands-free environment under changing conditions, demonstrate the improved echo cancellation performance of the proposed structure.

  • Estimating Direct-to-Reverberant Energy Ratio Using D/R Spatial Correlation Matrix Model

    Publication Year: 2011 , Page(s): 2374 - 2384
    Cited by:  Papers (5)
    PDF (1897 KB) | HTML

    We present a method for estimating the direct-to-reverberant energy ratio (DRR) that uses a direct and reverberant sound spatial correlation matrix model (hereafter referred to as the spatial correlation model). This model expresses the spatial correlation matrix of an array input signal as the sum of two spatial correlation matrices, one for the direct sound and one for the reverberation. The direct sound propagates from the direction of the sound source, while the reverberation arrives uniformly from every direction. The DRR is calculated from the power spectra of the direct sound and reverberation, which are estimated from the spatial correlation matrix of the measured signal using the spatial correlation model. Experimental and simulation results confirm that the proposed method gives mostly correct DRR estimates unless the sound source is so far from the microphone array that the direct sound picked up by the array is very small. The method was also evaluated on various scales in simulated and actual acoustical environments, and its limitations were revealed. As an example application of the proposed DRR estimation method, we estimated the sound source distance using a small microphone array.
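The model's core, expressing the measured spatial correlation matrix as a direct-path term plus a diffuse term and solving for the two powers, can be sketched for a two-microphone free-field case (the geometry and the sinc-shaped diffuse coherence are textbook assumptions, not the paper's exact setup):

```python
import numpy as np

c, d, f = 343.0, 0.05, 1000.0                 # speed of sound, mic spacing (m), frequency (Hz)
w = 2 * np.pi * f
tau = d * np.sin(np.deg2rad(30)) / c          # direct-path delay for a 30-degree source
a = np.array([1.0, np.exp(-1j * w * tau)])    # direct-sound steering vector

kd = w * d / c
coh = np.sin(kd) / kd                         # diffuse-field coherence sin(kd)/kd
Gamma = np.array([[1.0, coh], [coh, 1.0]])

Pd_true, Pr_true = 2.0, 0.5
R = Pd_true * np.outer(a, a.conj()) + Pr_true * Gamma   # synthetic measured matrix

# Least-squares fit of R = Pd * a a^H + Pr * Gamma for the two powers.
A = np.stack([np.outer(a, a.conj()).ravel(), Gamma.ravel()]).T
p, *_ = np.linalg.lstsq(A, R.ravel(), rcond=None)
Pd, Pr = p.real
drr_db = 10 * np.log10(Pd / Pr)
```

With a consistent model the two powers are recovered exactly; in practice R is estimated from data and the fit is only approximate, which is where the distance-dependent limitations arise.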

  • A Conditional Random Field Framework for Robust and Scalable Audio-to-Score Matching

    Publication Year: 2011 , Page(s): 2385 - 2397
    Cited by:  Papers (8)
    PDF (1060 KB) | HTML

    In this paper, we introduce the use of conditional random fields (CRFs) for the audio-to-score alignment task. This framework encompasses the statistical models which are used in the literature and allows for more flexible dependency structures. In particular, it allows observation functions to be computed from several analysis frames. Three different CRF models are proposed for our task, for different choices of tradeoff between accuracy and complexity. Three types of features are used, characterizing the local harmony, note attacks and tempo. We also propose a novel hierarchical approach, which takes advantage of the score structure for an approximate decoding of the statistical model. This strategy reduces the complexity, yielding a better overall efficiency than the classic beam search method used in HMM-based models. Experiments run on a large database of classical piano and popular music exhibit very accurate alignments. Indeed, with the best performing system, more than 95% of the note onsets are detected with a precision finer than 100 ms. We additionally show how the proposed framework can be modified in order to be robust to possible structural differences between the score and the musical performance.

  • Demodulation as Probabilistic Inference

    Publication Year: 2011 , Page(s): 2398 - 2411
    Cited by:  Papers (4)
    PDF (1722 KB) | HTML

    Demodulation is an ill-posed problem whenever both carrier and envelope signals are broadband and unknown. Here, we approach this problem using the methods of probabilistic inference. The new approach, called Probabilistic Amplitude Demodulation (PAD), is computationally challenging but improves on existing methods in a number of ways. By contrast to previous approaches to demodulation, it satisfies five key desiderata: PAD has soft constraints because it is probabilistic; PAD is able to automatically adjust to the signal because it learns parameters; PAD is user-steerable because the solution can be shaped by user-specific prior information; PAD is robust to broadband noise because this is modeled explicitly; and PAD's solution is self-consistent, empirically satisfying a Carrier Identity property. Furthermore, the probabilistic view naturally encompasses noise and uncertainty, allowing PAD to cope with missing data and return error bars on carrier and envelope estimates. Finally, we show that when PAD is applied to a bandpass-filtered signal, the stop-band energy of the inferred carrier is minimal, making PAD well-suited to sub-band demodulation.

  • A Generalized FLANN Filter for Nonlinear Active Noise Control

    Publication Year: 2011 , Page(s): 2412 - 2417
    Cited by:  Papers (14)
    PDF (440 KB) | HTML

    In this paper, we propose an extension of the well-known FLANN filter using trigonometric expansions to include suitable cross-terms, i.e., products of input samples with different time shifts. It is shown that in some applications of nonlinear active noise control, the generalized FLANN filter can offer performance as good as that of high-order Volterra filters with a reduced complexity.
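The feature expansion of such a generalized FLANN filter, trigonometric terms plus cross-products of differently delayed samples, can be sketched as follows (the expansion order and scaling are illustrative choices):

```python
import numpy as np

def flann_features(x_taps, P=2):
    """Generalized FLANN expansion: linear terms, trigonometric terms up to order P,
    and cross-terms (products of input samples with different time shifts)."""
    feats = list(x_taps)                               # linear terms
    for p in range(1, P + 1):                          # trigonometric expansion
        feats += list(np.sin(p * np.pi * x_taps))
        feats += list(np.cos(p * np.pi * x_taps))
    n = len(x_taps)
    for i in range(n):                                 # cross-terms x[n-i]*x[n-j], i < j
        for j in range(i + 1, n):
            feats.append(x_taps[i] * x_taps[j])
    return np.array(feats)

f = flann_features(np.zeros(3))
```

The expanded vector is then weighted by an adaptive linear combiner, exactly as in a standard FLANN filter; the cross-terms are what let it mimic second-order Volterra behavior at far lower cost.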

  • Enhanced Sparse Imputation Techniques for a Robust Speech Recognition Front-End

    Publication Year: 2011 , Page(s): 2418 - 2429
    Cited by:  Papers (6)
    PDF (878 KB) | HTML

    Missing data techniques (MDTs) have been widely employed and shown to improve speech recognition results under noisy conditions. This paper presents a new technique which improves upon previously proposed sparse imputation techniques relying on the least absolute shrinkage and selection operator (LASSO). LASSO is widely employed in compressive sensing problems. However, the problem with LASSO is that it does not satisfy oracle properties in the event of a highly collinear dictionary, which happens with features extracted from most speech corpora. When we say that a variable selection procedure satisfies the oracle properties, we mean that it enjoys the same performance as though the underlying true model is known. Through experiments on the Aurora 2.0 noisy spoken digits database, we demonstrate that the Least Angle Regression implementation of the Elastic Net (LARS-EN) algorithm is able to better exploit the properties of a collinear dictionary, and thus is significantly more robust in terms of basis selection when compared to LASSO on the continuous digit recognition task with estimated mask. In addition, we investigate the effects and benefits of a good measure of sparsity on speech recognition rates. In particular, we demonstrate that a good measure of sparsity greatly improves speech recognition rates, and that the LARS modification of LASSO and LARS-EN can be terminated early to achieve improved recognition results, even though the estimation error is increased.
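An elastic-net solve via proximal gradient (ISTA) illustrates the collinear-dictionary setting; this generic solver and the toy dictionary are illustrative, not the LARS-EN implementation used in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 20, 30
D = rng.standard_normal((m, k))
D[:, 1] = D[:, 0] + 0.01 * rng.standard_normal(m)    # two highly collinear atoms
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(k); x_true[0], x_true[5] = 1.0, -0.7
y = D @ x_true                                       # observed (reliable) features

def elastic_net_ista(D, y, lam1=0.01, lam2=0.05, iters=500):
    """Proximal gradient (ISTA) for 0.5*||y-Dx||^2 + lam1*||x||_1 + 0.5*lam2*||x||^2."""
    t = 1.0 / np.linalg.norm(D, 2) ** 2              # step size from the Lipschitz constant
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        z = x - t * D.T @ (D @ x - y)                # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - t * lam1, 0) / (1 + t * lam2)
    return x

x_hat = elastic_net_ista(D, y)
```

The extra l2 term is what distinguishes the elastic net from plain LASSO: it keeps the solve stable when atoms are nearly identical, instead of arbitrarily picking one of the collinear pair.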

  • Bessel Nulls Recovery in Spherical Microphone Arrays for Time-Limited Signals

    Publication Year: 2011 , Page(s): 2430 - 2438
    PDF (1979 KB) | HTML

    Spherical microphone arrays have recently been developed for a wide range of applications. In particular, scanning arrays in open-sphere configuration have been employed for room acoustics analysis, based on room impulse response measurements. However, it has been shown that the simple single-sphere configuration suffers from ill-conditioning around the zeros of the spherical Bessel function, and so alternative configurations, typically more complex, such as dual-sphere and single-sphere with cardioid microphones, have been proposed to overcome the effect of the Bessel nulls. This paper shows that in the particular case of time-limited signals, for example in room impulse response measurements, the ill-conditioning due to the spherical Bessel functions can be reduced using nonuniform sampling in the frequency domain. Following a presentation of the theory and a simulation study, an experimental example is presented for use of the method in a single-sphere measurement in an auditorium.

  • Transcribing Mandarin Broadcast Speech Using Multi-Layer Perceptron Acoustic Features

    Publication Year: 2011 , Page(s): 2439 - 2450
    PDF (716 KB) | HTML

    Recently, several multi-layer perceptron (MLP)-based front-ends have been developed and used for Mandarin speech recognition, often showing significant complementary properties to conventional spectral features. Although these front-ends are widely used in multiple Mandarin systems, no systematic comparison of the different approaches and their scalability has been presented. The novelty of this correspondence is mainly experimental. In this work, all the MLP front-ends recently developed at multiple sites are described and compared in a systematic manner on a 100-hour setup. The study covers the two main directions along which MLP features have evolved: the use of different input representations to the MLP and the use of more complex MLP architectures beyond the three-layer perceptron. The results are analyzed in terms of confusion matrices, and the paper discusses a number of novel findings that the comparison reveals. Furthermore, the two best front-ends used in the GALE 2008 evaluation, referred to as MLP1 and MLP2, are studied in a more complex LVCSR system in order to investigate their scalability in terms of the amount of training data (from 100 hours to 1600 hours) and the parametric system complexity (maximum likelihood versus discriminative training, speaker adaptive training, lattice-level combination). Results on 5 hours of evaluation data from the GALE project reveal that the MLP features consistently produce improvements in the range of 15%-23% relative at the different steps of a multipass system when compared to mel-frequency cepstral coefficient (MFCC) and PLP features, suggesting that the improvements scale with the amount of data and the complexity of the system. The integration of these features into the GALE 2008 evaluation system provides very competitive performance compared to other Mandarin systems.

  • MCE Training Techniques for Topic Identification of Spoken Audio Documents

    Publication Year: 2011 , Page(s): 2451 - 2460
    Cited by:  Papers (2)
    PDF (604 KB) | HTML

    In this paper, we discuss the use of minimum classification error (MCE) training as a means for improving traditional approaches to topic identification such as naive Bayes classifiers and support vector machines. A key element of our new MCE training techniques is their ability to efficiently apply jackknifing or leave-one-out training to yield improved models which generalize better to unseen data. Experiments were conducted using recorded human-human telephone conversations from the Fisher Corpus using feature vector representations from word-based automatic speech recognition lattices. Sizeable improvements in topic identification accuracy using the new MCE training techniques were observed.

  • Calibration of Confidence Measures in Speech Recognition

    Publication Year: 2011 , Page(s): 2461 - 2473
    Cited by:  Papers (3)
    PDF (1682 KB) | HTML

    Most speech recognition applications in use today rely heavily on confidence measures for making optimal decisions. In this paper, we aim to answer the question: what can be done to improve the quality of confidence measures if we cannot modify the speech recognition engine? The answer provided in this paper is a post-processing step called confidence calibration, which can be viewed as a special adaptation technique applied to confidence measures. Three confidence calibration methods have been developed in this work: the maximum entropy model with distribution constraints, the artificial neural network, and the deep belief network. We compare these approaches and demonstrate the importance of the key features exploited: the generic confidence score, the application-dependent word distribution, and the rule coverage ratio. We demonstrate the effectiveness of confidence calibration on a variety of tasks with significant normalized cross entropy increases and equal error rate reductions.
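A single-feature maximum-entropy (logistic) calibrator illustrates the post-processing idea; the simulated scores and the miscalibration model below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
raw = rng.uniform(0.05, 0.95, n)               # confidence scores from a hypothetical engine
# Simulated ground truth: the engine is miscalibrated; true accuracy follows a
# steeper curve than the reported score.
truth = (rng.uniform(size=n) < 1 / (1 + np.exp(-(4 * raw - 2)))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Fit p(correct | score) = sigmoid(a*score + b) by gradient descent on log-loss,
# i.e., a single-feature maximum-entropy calibration model.
a, b, lr = 1.0, 0.0, 0.5
for _ in range(2000):
    p = sigmoid(a * raw + b)
    a -= lr * np.mean((p - truth) * raw)
    b -= lr * np.mean(p - truth)

def log_loss(p, t):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

loss_raw = log_loss(raw, truth)                # treat the raw score as a probability
loss_cal = log_loss(sigmoid(a * raw + b), truth)
```

The recognizer itself is untouched; only the mapping from raw score to probability is re-learned, which is the essence of calibration as a post-processing step.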

  • Trust Region-Based Optimization for Maximum Mutual Information Estimation of HMMs in Speech Recognition

    Publication Year: 2011 , Page(s): 2474 - 2485
    Cited by:  Papers (2)
    PDF (1087 KB) | HTML

    In this paper, we propose two novel optimization methods for discriminative training (DT) of hidden Markov models (HMMs) in speech recognition, based on an efficient global optimization algorithm for the so-called trust region (TR) problem, in which a quadratic function is minimized under a spherical constraint. In the first method, maximum mutual information estimation (MMIE) of Gaussian mixture HMMs is formulated as a standard TR problem, so that the efficient global optimization method can be used in each iteration to maximize the auxiliary function of discriminative training. In the second method, we construct a new auxiliary function for DT of HMMs by adding a quadratic penalty term. The new auxiliary function serves as a first-order approximation as well as a lower bound of the original discriminative objective function within a locality constraint. Due to the lower-bound property, the optimum of the new auxiliary function is guaranteed to improve the original discriminative objective function until it converges to a local optimum or stationary point. Both TR-based optimization methods have been investigated on two standard large-vocabulary continuous speech recognition tasks, using the WSJ0 and Switchboard databases. Experimental results show that the proposed TR methods outperform the conventional EBW method in terms of both convergence behavior and recognition performance.
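The TR subproblem, minimizing a quadratic under a spherical constraint, has a global solution that can be found by bisection on the Lagrange multiplier; a minimal sketch (not the paper's HMM formulation, and the so-called hard case where the gradient is orthogonal to the most negative eigendirection is not handled):

```python
import numpy as np

def trust_region_step(H, g, delta):
    """Global minimizer of 0.5*x'Hx + g'x subject to ||x|| <= delta,
    found by bisection on the Lagrange multiplier of the spherical constraint."""
    lam_min = np.linalg.eigvalsh(H)[0]
    if lam_min > 0:
        x = np.linalg.solve(H, -g)
        if np.linalg.norm(x) <= delta:
            return x                              # unconstrained optimum is interior
    lo = max(0.0, -lam_min) + 1e-12               # multiplier must exceed -lam_min
    hi = lo + 1.0
    while np.linalg.norm(np.linalg.solve(H + hi * np.eye(len(g)), -g)) > delta:
        hi *= 2.0                                 # grow until the step fits
    for _ in range(200):                          # bisect on ||x(lam)|| = delta
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(np.linalg.solve(H + mid * np.eye(len(g)), -g)) > delta:
            lo = mid
        else:
            hi = mid
    return np.linalg.solve(H + hi * np.eye(len(g)), -g)

H = np.array([[2.0, 0.0], [0.0, -1.0]])           # indefinite "Hessian"
g = np.array([1.0, 1.0])
x = trust_region_step(H, g, delta=1.0)
```

Note that the solver returns a boundary step even for an indefinite quadratic, which is precisely why the TR formulation yields a globally solvable inner problem for each discriminative-training iteration.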

  • Accurate Discretization of Analog Audio Filters With Application to Parametric Equalizer Design

    Publication Year: 2011 , Page(s): 2486 - 2493
    Cited by:  Papers (4)

    This paper is concerned with accurate discretization of linear analog filters, such that the frequency response of the discrete-time filter closely matches that of the continuous-time filter. The approach is based on formal reconstruction of the continuous-time signal using Shannon's interpolation theorem and numerical solution of the differential equation corresponding to the analog filter. When the formal continuous-time system is sampled, the resulting filter reduces to a discrete linear filter, which can be realized either as a state-space model or as an infinite impulse response (IIR) filter. The proposed methodology is applied to the design of filters for parametric equalizers.
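
    The paper's Shannon-interpolation approach is not reproduced here; for context, the conventional baseline against which such methods are judged, the bilinear transform with frequency prewarping, can be sketched for a one-pole analog lowpass:

```python
import numpy as np

def bilinear_first_order(wc, fs):
    """Discretize the analog one-pole lowpass H(s) = wc / (s + wc)
    with the bilinear transform s -> 2*fs*(1 - z^-1)/(1 + z^-1),
    prewarping wc so the digital cutoff lands exactly at wc/fs
    rad/sample.  Returns (b, a) coefficients with a[0] = 1."""
    wc_pre = 2 * fs * np.tan(wc / (2 * fs))   # prewarped analog cutoff
    k = wc_pre / (2 * fs)
    b0 = b1 = k / (1 + k)
    a1 = (k - 1) / (1 + k)
    return np.array([b0, b1]), np.array([1.0, a1])

def freq_response(b, a, w):
    """Evaluate H(e^{jw}) of the first-order filter at digital
    frequency w in rad/sample."""
    z1 = np.exp(-1j * w)                      # z^-1 on the unit circle
    return (b[0] + b[1] * z1) / (a[0] + a[1] * z1)
```

    Prewarping pins the response at one frequency only; the mismatch elsewhere, especially near the Nyquist limit, is the kind of error more accurate discretization schemes aim to reduce.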

  • A Maximum-Entropy Segmentation Model for Statistical Machine Translation

    Publication Year: 2011 , Page(s): 2494 - 2505
    Cited by:  Papers (1)

    Segmentation is of great importance to statistical machine translation (SMT): it splits a source sentence into sequences of translatable segments. We propose a maximum-entropy segmentation model to capture desirable phrasal and hierarchical segmentations for SMT. We present an approach to automatically learning the beginning and ending boundaries of cohesive segments from word-aligned bilingual data, without using any additional resources. The learned boundaries are then used to define cohesive segments in both phrasal and hierarchical segmentations. We integrate the segmentation model into phrase-based SMT and conduct experiments on the newswire and broadcast news domains to investigate its effectiveness with large-scale training data. Our experimental results show that the maximum-entropy segmentation model significantly improves translation quality in terms of BLEU. We further validate that 1) the proposed segmentation model significantly outperforms the syntactic constraints used in previous work to constrain segmentations, and 2) it is necessary to capture hierarchical segmentations in addition to phrasal segmentations.
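
    One common way to use learned boundary probabilities, sketched here with toy numbers rather than the paper's trained maximum-entropy model, is a dynamic program that places cut points where the boundary probability is high, subject to a maximum segment length:

```python
import math

def best_segmentation(p_boundary, max_len=4):
    """Choose cut points 0 < c1 < ... < n maximizing the summed
    log-probability of the interior cuts, with no segment longer than
    max_len words.  p_boundary[j] is a toy probability of a cohesive
    segment boundary falling just before word j (index 0 is unused:
    the sentence edges are always boundaries)."""
    n = len(p_boundary)
    NEG = float("-inf")
    best = [0.0] + [NEG] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        cut_score = 0.0 if j == n else math.log(p_boundary[j])
        for i in range(max(0, j - max_len), j):
            if best[i] + cut_score > best[j]:
                best[j] = best[i] + cut_score
                back[j] = i
    cuts, j = [], n                    # walk back pointers to recover
    while j > 0:                       # the chosen cut positions
        cuts.append(j)
        j = back[j]
    return sorted(cuts)
```

    For example, with boundary probabilities peaking before words 1 and 4 of a six-word sentence and `max_len=3`, the program cuts at those two positions rather than inside a cohesive segment.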


Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research