
IEEE Transactions on Speech and Audio Processing

Issue 6 • November 1998

9 articles in this issue
  • An improved (Auto:I, LSP:T) constrained iterative speech enhancement for colored noise environments

    Page(s): 573 - 579

    We illustrate how the (Auto:I, LSP:T) constrained iterative speech enhancement algorithm can be extended to provide improved performance in colored noise environments. The modified algorithm, referred to as noise adaptive (Auto:I, LSP:T), operates on subband signal components in which the terminating iteration is adjusted based on the a posteriori estimate of the signal-to-noise ratio (SNR) in each signal subband. The enhanced speech is formulated as a combined estimate from individual signal subband estimators. The algorithm is shown to improve objective speech quality in additive noise environments over the traditional constrained iterative (Auto:I, LSP:T) enhancement formulation.

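The abstract's central device, adjusting the terminating iteration per subband from an a posteriori SNR estimate, can be sketched roughly as below. The linear SNR-to-iteration mapping and its constants are illustrative assumptions, not the paper's actual rule.

```python
import numpy as np

def subband_iterations(noisy_power, noise_power, max_iters=8):
    """Choose a terminating iteration count per subband from an a posteriori
    SNR estimate: low-SNR bands receive more enhancement iterations.
    The linear SNR-to-iteration mapping here is an illustrative assumption."""
    snr_db = 10 * np.log10(np.maximum(noisy_power / noise_power, 1e-10))
    return np.clip(np.round(max_iters - snr_db / 5), 1, max_iters).astype(int)
```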
  • Speech trajectory discrimination using the minimum classification error learning

    Page(s): 505 - 515

    In this paper, we extend the maximum likelihood (ML) training algorithm to the minimum classification error (MCE) training algorithm for discriminatively estimating the state-dependent polynomial coefficients in the stochastic trajectory model or the trended hidden Markov model (HMM) originally proposed in Deng (1992). The main motivation of this extension is the new model space for smoothness-constrained, state-bound speech trajectories associated with the trended HMM, contrasting the conventional, stationary-state HMM, which describes only the piecewise-constant “degraded trajectories” in the observation data. The discriminative training implemented for the trended HMM has the potential to utilize this new, constrained model space, thereby providing stronger power to disambiguate the observational trajectories generated from nonstationary sources corresponding to different speech classes. Phonetic classification results are reported which demonstrate consistent performance improvements with use of the MCE-trained trended HMM both over the regular ML-trained trended HMM and over the MCE-trained stationary-state HMM.

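As a rough illustration of the generic MCE criterion the paper builds on (not its trended-HMM specifics), the smoothed misclassification measure and sigmoid loss for a single training token might look like this; the smoothing constants `eta` and `gamma` are illustrative.

```python
import numpy as np

def mce_loss(score_correct, scores_competing, eta=2.0, gamma=1.0):
    """Smoothed MCE loss for one token: a soft-max over competing class
    scores defines the misclassification measure d, which a sigmoid maps
    to a differentiable 0-1 error count."""
    s = np.asarray(scores_competing, dtype=float)
    d = -score_correct + np.log(np.mean(np.exp(eta * s))) / eta
    return 1.0 / (1.0 + np.exp(-gamma * d))
```

Gradient descent on this loss with respect to the model parameters (here, the trended HMM's polynomial coefficients) yields discriminative training.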
  • LPC interpolation by approximation of the sample autocorrelation function

    Page(s): 569 - 573

    Conventionally, the energy of analysis frames is not taken into account for linear prediction (LPC) interpolation. Incorporating the frame energy improves the subjective quality of interpolation, but increases the spectral distortion (SD). The main reason for this discrepancy is that outliers increase in the low-energy parts of segments with rapid changes in energy. The energy is most naturally combined with a normalized autocorrelation representation.

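A minimal sketch of the abstract's closing suggestion, interpolating frame energy jointly with a normalized autocorrelation, assuming simple linear weighting between two frames:

```python
import numpy as np

def interpolate_lpc_frames(r_prev, r_next, weight):
    """Interpolate between two frames' sample autocorrelation sequences,
    treating the energy (r[0]) and the normalized shape (r[k]/r[0])
    separately, then recombining. Linear weights are illustrative."""
    e = (1 - weight) * r_prev[0] + weight * r_next[0]
    shape = (1 - weight) * (r_prev / r_prev[0]) + weight * (r_next / r_next[0])
    return e * shape
```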
  • A general joint additive and convolutive bias compensation approach applied to noisy Lombard speech recognition

    Page(s): 524 - 538

    A unified approach to the acoustic mismatch problem is proposed. A maximum likelihood state-based additive bias compensation algorithm is developed for the continuous density hidden Markov model (CDHMM). Based on this technique, specific bias models in the mel cepstral and the linear spectral domains are presented. Among these models, a new polynomial trend bias model in the mel cepstral domain is derived, which proved effective for Lombard speech compensation. In addition, a joint estimation algorithm for additive and convolutive bias compensation is proposed. This algorithm is based on applying the expectation maximization (EM) technique in both above-mentioned domains, in conjunction with a parallel model combination (PMC) based transformation. The compensation of the dynamic (difference) coefficients in the proposed framework is also studied. The evaluation database consists of a vocabulary of 21 confusable words uttered by 24 speakers. Three mismatched versions of the database are considered, i.e., Lombard speech, 15 dB noisy Lombard speech, and 5 dB noisy Lombard speech. The proposed techniques result in 50.9%, 74.6%, and 67.3% reduction in the performance difference between matched and uncompensated word error rates for the three mismatch conditions, respectively. When dynamic coefficients are considered, the corresponding reductions are 46.8%, 72.4%, and 70.9%.

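The PMC-based transformation mentioned above combines clean-speech and noise models in the linear power domain. A bare-bones sketch for static log-spectral means follows; it ignores PMC's log-normal variance handling, and the channel gain term standing in for the convolutive bias is an assumption.

```python
import numpy as np

def pmc_static_mean(clean_log_mean, noise_log_mean, gain=1.0):
    """Combine clean-speech and noise model means in the log power domain:
    map to the linear domain, add (additive noise), and map back. `gain`
    is an illustrative stand-in for a convolutive (channel) bias."""
    return np.log(gain * np.exp(clean_log_mean) + np.exp(noise_log_mean))
```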
  • A novel feature transformation for vocal tract length normalization in automatic speech recognition

    Page(s): 549 - 557

    This paper proposes a method to transform acoustic models that have been trained with a certain group of speakers for use on different speech in hidden Markov model based (HMM-based) automatic speech recognition. Features are transformed on the basis of assumptions regarding the difference in vocal tract length between the groups of speakers. First, the vocal tract length (VTL) of these groups has been estimated based on the average third formant F3. Second, the linear acoustic theory of speech production has been applied to warp the spectral characteristics of the existing models so as to match the incoming speech. The mapping is composed of subsequent nonlinear submappings. By locally linearizing it and comparing results in the output, a linear approximation for the exact mapping was obtained which is accurate as long as the warping is reasonably small. The feature vector, which is computed from a speech frame, consists of the mel scale cepstral coefficients (MFCC) along with delta and delta2-cepstra as well as delta and delta2 energy. The method has been tested on the TI digits database, containing adult and child speech, consisting of isolated digits and digit strings of different lengths. The word error rate when trained on adults and tested on children with transformed adult models is decreased by more than a factor of two compared to the nontransformed case.

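A crude spectral-domain analogue of the VTL warping described above (the paper itself derives a linear transform directly on the features; the linear warp and its direction here are simplifying assumptions):

```python
import numpy as np

def warp_spectrum(spectrum, alpha):
    """Resample a magnitude spectrum along a linearly warped frequency
    axis. With alpha > 1, spectral features move toward lower frequencies,
    roughly emulating a longer vocal tract."""
    n = len(spectrum)
    src = np.clip(np.arange(n) * alpha, 0, n - 1)
    return np.interp(src, np.arange(n), spectrum)
```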
  • Flexible speech understanding based on combined key-phrase detection and verification

    Page(s): 558 - 568

    We propose a novel speech understanding strategy based on combined detection and verification of semantically tagged key-phrases in spontaneous spoken utterances. Key-phrases are defined in a top-down manner so as to constitute semantic slots. Their detection directly leads to robust understanding. A phrase network realizes both a wide coverage and a reasonable constraint for detection. A subword-based verifier is then incorporated to reduce false alarms in detection and attach confidence measures to the detected phrases. This set of phrase confidence measures, when incorporated in a spoken dialogue system, forms a basis for designing intelligent speech interfaces that accept only verified key-phrases and reprompt users to clarify unspecified or unrecognized portions. Several forms of confidence measures based on subword-level tests are investigated. The proposed approach was tested on field data collected from real-world trial applications. The combined detection and verification strategy drastically improves the accuracy in handling out-of-grammar utterances over the conventional decoding approaches while maintaining the performance for in-grammar utterances.

  • Improving performance of spectral subtraction in speech recognition using a model for additive noise

    Page(s): 579 - 582

    This paper addresses the problem of speech recognition with signals corrupted by additive noise at moderate signal-to-noise ratio (SNR). A model for additive noise is presented and used to compute the uncertainty about the hidden clean signal so as to weight the estimation provided by spectral subtraction. Weighted dynamic time warping (DTW) and Viterbi (HMM) algorithms are tested, and the results show that weighting the information along the signal can substantially increase the performance of spectral subtraction, an easily implemented technique, even with a poor estimation for noise and without using any information about the speaker. It is also shown that the weighting procedure can reduce the error rate when cepstral mean normalization is also used to cancel the convolutional noise.

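The spectral subtraction baseline that the paper weights can be sketched as plain power subtraction with a floor; the over-subtraction factor and floor value below are illustrative, not the paper's settings.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=1.0, floor=0.01):
    """Power spectral subtraction: remove the estimated noise power from
    each bin, flooring the result at a fraction of the noisy power to
    avoid negative estimates."""
    est = noisy_power - alpha * noise_power
    return np.maximum(est, floor * noisy_power)
```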
  • Efficient training algorithms for HMMs using incremental estimation

    Page(s): 539 - 548

    Typically, parameter estimation for a hidden Markov model (HMM) is performed using an expectation-maximization (EM) algorithm with the maximum-likelihood (ML) criterion. The EM algorithm is an iterative scheme that is well-defined and numerically stable, but convergence may require a large number of iterations. For speech recognition systems utilizing large amounts of training material, this results in long training times. This paper presents an incremental estimation approach to speed up the training of HMMs without any loss of recognition performance. The algorithm selects a subset of data from the training set, updates the model parameters based on the subset, and then iterates the process until convergence of the parameters. The advantage of this approach is a substantial increase in the number of iterations of the EM algorithm per training token, which leads to faster training. In order to achieve reliable estimation from a small fraction of the complete data set at each iteration, two training criteria are studied: ML and maximum a posteriori (MAP) estimation. Experimental results show that the training of the incremental algorithms is substantially faster than the conventional (batch) method and suffers no loss of recognition performance. Furthermore, the incremental MAP-based training algorithm improves performance over the batch version.

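The subset-update scheme described above can be sketched on a toy model. The fragment below applies incremental EM with cached per-batch sufficient statistics (in the style of Neal and Hinton) to a 1-D Gaussian mixture rather than a full HMM; the batch size, pass count, and initialization are illustrative choices, and the data is assumed to be shuffled.

```python
import numpy as np

def incremental_em_gmm(data, n_components=2, batch_size=50, n_passes=5):
    """Incremental EM for a 1-D Gaussian mixture (a toy stand-in for HMM
    training): each mini-batch's sufficient statistics are cached and
    swapped out when the batch is revisited, so every M-step uses
    statistics covering the whole data set."""
    w = np.full(n_components, 1.0 / n_components)
    mu = np.linspace(data.min(), data.max(), n_components)
    var = np.full(n_components, np.var(data))

    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    stats = [None] * len(batches)              # cached (S0, S1, S2) per batch
    S0 = np.zeros(n_components)
    S1 = np.zeros(n_components)
    S2 = np.zeros(n_components)

    for _ in range(n_passes):
        for b, x in enumerate(batches):
            # E-step on this batch only: posterior responsibilities
            logp = (-0.5 * (x[:, None] - mu) ** 2 / var
                    - 0.5 * np.log(2 * np.pi * var))
            r = w * np.exp(logp)
            r /= r.sum(axis=1, keepdims=True)
            new = (r.sum(axis=0), r.T @ x, r.T @ (x ** 2))
            # replace this batch's old contribution to the global stats
            if stats[b] is not None:
                S0, S1, S2 = S0 - stats[b][0], S1 - stats[b][1], S2 - stats[b][2]
            stats[b] = new
            S0, S1, S2 = S0 + new[0], S1 + new[1], S2 + new[2]
            # M-step from the accumulated (full-data) statistics
            w = S0 / S0.sum()
            mu = S1 / S0
            var = np.maximum(S2 / S0 - mu ** 2, 1e-6)
    return w, mu, var
```

Because the M-step runs after every mini-batch, the model parameters are updated many times per pass over the data, which is the source of the speed-up the abstract reports.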
  • Unidirectional and parallel Baum-Welch algorithms

    Page(s): 516 - 523

    Hidden Markov models (HMMs) are popular in many applications, such as automatic speech recognition, control theory, biology, communication theory over channels with bursts of errors, queueing theory, and many others. Therefore, it is important to have robust and fast methods for fitting HMMs to experimental data (training). Standard statistical methods of maximum likelihood parameter estimation (such as Newton-Raphson, conjugate gradients, etc.) are not robust and are difficult to use for fitting HMMs with many parameters. On the other hand, the Baum-Welch algorithm is robust, but slow. In this paper, we present a parallel version of the Baum-Welch algorithm. We consider also unidirectional procedures which, in contrast with the well-known forward-backward algorithm, use an amount of memory that is independent of the observation sequence length.

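For context on the memory point made above: the plain forward pass already needs only O(N) storage when run with per-step scaling, as sketched below; it is the backward pass of standard Baum-Welch that forces O(T·N) storage, which unidirectional variants avoid. Here `pi`, `A`, and `B` are assumed initial, transition, and emission matrices of a discrete-output HMM.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward recursion for a discrete-output HMM. Only the
    current N-vector of state probabilities is kept, so memory use is
    independent of the observation sequence length."""
    alpha = pi * B[:, obs[0]]
    s = alpha.sum()
    loglik = np.log(s)
    alpha /= s
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s
    return loglik
```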

Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.


This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
