
IEEE Transactions on Speech and Audio Processing

Issue 6 • Nov. 1999


  • Abstracts of manuscripts in review

    Publication Year: 1999 , Page(s): 725 - 726
    PDF (21 KB)
    Freely Available from IEEE
  • List of reviewers

    Publication Year: 1999 , Page(s): 727 - 728
    PDF (11 KB)
    Freely Available from IEEE
  • Comparing models for audiovisual fusion in a noisy-vowel recognition task

    Publication Year: 1999 , Page(s): 629 - 642
    Cited by:  Papers (18)  |  Patents (4)
    PDF (288 KB)

    Audiovisual speech recognition involves fusion of the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams for making a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition: direct identification in the first case, separate identification in the second, and two variants of the third (early integration) case, namely dominant recoding and motor recoding. However, no systematic comparison of these models is available in the literature. We propose an implementation of all four models and submit them to a benchmark test. To this end, we use a noisy-vowel corpus tested under two recognition paradigms in which the systems are tested at noise levels higher than those used for learning. In one of these paradigms the signal-to-noise ratio (SNR) value is provided to the recognition systems; in the other it is not. We also introduce a new criterion for evaluating performance, based on the information transmitted about individual phonetic features. In light of the compared performances of the four models under the two recognition paradigms, we discuss their advantages and drawbacks, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.
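
    To make the fusion taxonomy concrete, here is a minimal sketch (Python/NumPy) contrasting two of the four models: direct identification, which fuses at the feature level, and separate identification, which fuses per-modality posteriors with a reliability weight that could be driven by an SNR estimate. The function and parameter names are illustrative assumptions, and the classifier callables stand in for whatever recognizer is used; this is not the paper's implementation.

        import numpy as np

        def direct_identification(audio_feat, video_feat, joint_classifier):
            """Early integration: concatenate the audio and video features and
            classify the joint vector once."""
            fused = np.concatenate([audio_feat, video_feat])
            return joint_classifier(fused)  # posterior over phoneme classes

        def separate_identification(audio_feat, video_feat, audio_classifier,
                                    video_classifier, audio_weight):
            """Late integration: classify each stream separately, then combine
            the two posteriors. audio_weight in [0, 1] encodes audio
            reliability, e.g. derived from an SNR estimate."""
            pa = audio_classifier(audio_feat)
            pv = video_classifier(video_feat)
            # Weighted geometric combination; a small weight discounts a noisy
            # audio stream in favor of the video stream.
            fused = (pa ** audio_weight) * (pv ** (1.0 - audio_weight))
            return fused / fused.sum()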

  • Robustness of group-delay-based method for extraction of significant instants of excitation from speech signals

    Publication Year: 1999 , Page(s): 609 - 619
    Cited by:  Papers (23)
    PDF (336 KB)

    We study the robustness of a group-delay-based method for determining the instants of significant excitation in speech signals. These instants correspond to the instants of glottal closure for voiced speech. The method uses the properties of the global phase characteristics of minimum-phase signals. The method's robustness against noise and distortion is due to the fact that the average phase characteristic of a signal is determined mainly by the strength of the excitation impulse. The strength of excitation is determined by the energy of the residual error signal around the instant of excitation. We propose a measure for the strength of excitation based on the Frobenius norm of the differenced signal. The robustness of the group-delay-based method is illustrated for speech under different types of degradation and for speech from different speakers.
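
    The following rough sketch (Python/NumPy) illustrates the group-delay idea under stated assumptions: the frame-wise average group delay (phase slope) is computed with the standard n*x[n] FFT identity rather than phase unwrapping, its positive-to-negative zero crossings are taken as excitation instants, and the strength measure is reduced to a plain 2-norm of the differenced signal around each instant. Names, frame sizes, and thresholds are illustrative, not the paper's.

        import numpy as np

        def average_group_delay(frame):
            """Average group delay of one frame: tau(w) = Re{FFT(n*x)/FFT(x)},
            averaged over frequency bins with non-negligible magnitude."""
            n = np.arange(len(frame))
            X = np.fft.rfft(frame)
            Y = np.fft.rfft(n * frame)
            good = np.abs(X) > 1e-8 * np.abs(X).max()  # avoid dividing by ~0
            return np.mean(np.real(Y[good] / X[good]))

        def excitation_instants(x, frame_len=160, hop=20):
            """Slide a window along the signal; the phase-slope function dips
            through zero near instants of significant excitation."""
            centers, slope = [], []
            for start in range(0, len(x) - frame_len, hop):
                frame = x[start:start + frame_len] * np.hanning(frame_len)
                centers.append(start + frame_len // 2)
                slope.append(average_group_delay(frame))
            slope = np.asarray(slope)
            zc = np.where((slope[:-1] > 0) & (slope[1:] <= 0))[0]
            return [centers[i] for i in zc]

        def excitation_strength(x, instant, half_width=40):
            """Strength of excitation as the norm of the differenced signal in
            a window around the instant."""
            seg = x[max(0, instant - half_width):instant + half_width]
            return np.linalg.norm(np.diff(seg))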

  • Online hierarchical transformation of hidden Markov models for speech recognition

    Publication Year: 1999 , Page(s): 656 - 667
    Cited by:  Papers (23)  |  Patents (1)
    PDF (260 KB)

    This paper proposes a novel framework of online hierarchical transformation of hidden Markov model (HMM) parameters for adaptive speech recognition. Our goal is to incrementally transform (or adapt) all the HMM parameters to a new acoustical environment even though most of the HMM units are unseen in the observed adaptation data. We establish a hierarchical tree of HMM units and apply the tree to dynamically search the transformation parameters for individual HMM mixture components. The transformation framework is formulated according to the approximate Bayesian estimate, where the prior statistics and the transformation parameters can be jointly and incrementally refreshed after each consecutive block of adaptation data. With this formulation, only the refreshed prior statistics and the current block of data are needed for online transformation. In a series of speaker adaptation experiments on the recognition of 408 Mandarin syllables, we examine the effects of constructing various types of hierarchical trees. The efficiency and effectiveness of the proposed method for incremental adaptation of all HMM units are also confirmed. In addition, we demonstrate the superiority of the proposed online transformation over Huo's online adaptation (see ibid., vol. 5, p. 161-72, 1997) for a wide range of adaptation data.
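
    As a schematic sketch of the hierarchical-tree idea only (the paper's Bayesian updating of priors and transforms is not reproduced here), each mixture component hangs off a leaf of the tree, and a node's transform is used only if that node has accumulated enough adaptation data; otherwise the search backs off toward the root, so even unseen units receive some transformation. The class layout and the min_frames threshold are illustrative assumptions.

        class Node:
            """One node of the hierarchical tree of HMM units."""
            def __init__(self, parent=None):
                self.parent = parent
                self.frame_count = 0   # adaptation frames assigned to this node
                self.transform = None  # e.g. an affine transform of the means

        def effective_transform(leaf, min_frames=100):
            """Walk up from a mixture component's leaf until a reliably
            estimated transform is found; unseen units inherit an ancestor's."""
            node = leaf
            while node is not None:
                if node.transform is not None and node.frame_count >= min_frames:
                    return node.transform
                node = node.parent
            return None  # no adaptation data anywhere: keep the original model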

  • An objective technique for evaluating doubletalk detectors in acoustic echo cancelers

    Publication Year: 1999 , Page(s): 718 - 724
    Cited by:  Papers (44)  |  Patents (6)
    PDF (184 KB)

    Echo cancelers commonly employ a doubletalk detector (DTD), which is essential to keep the adaptive filter from diverging in the presence of near-end speech and other disruptive noise. Numerous algorithms have been proposed to detect doubletalk in an acoustic echo canceler (AEC). In those applications, the threshold is typically chosen by some heuristic method and the performance evaluation is very subjective. In this study, we develop a way to objectively evaluate DTD algorithms based on the standard statistical methods of detection theory. A receiver operating characteristic (ROC) is derived to characterize DTD performance. Several DTD algorithms are examined and simulated under typical real-world operating conditions using measured room responses and signals taken from a digital speech database. The DTD methods are then evaluated and compared using the ROC metric.
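
    The evaluation procedure can be sketched as follows (Python/NumPy): run a DTD over synchronized far-end and microphone signals for which the true doubletalk regions are known, and sweep its threshold to trace an ROC of detection probability versus false-alarm probability. The simple Geigel detector below is a stand-in for any DTD under test; the abstract does not say which algorithms were compared.

        import numpy as np

        def geigel_dtd(far, mic, threshold, lookback=256):
            """Flag doubletalk when the microphone magnitude exceeds
            threshold times the recent far-end peak."""
            flags = np.zeros(len(mic), dtype=bool)
            for n in range(len(mic)):
                recent_peak = np.max(np.abs(far[max(0, n - lookback):n + 1]))
                flags[n] = np.abs(mic[n]) > threshold * recent_peak
            return flags

        def roc_points(far, mic, doubletalk_mask, thresholds):
            """One (P_false_alarm, P_detection) point per threshold value:
            P_d is measured on true doubletalk samples, P_f on the rest."""
            points = []
            for t in thresholds:
                flags = geigel_dtd(far, mic, t)
                p_d = np.mean(flags[doubletalk_mask])
                p_f = np.mean(flags[~doubletalk_mask])
                points.append((p_f, p_d))
            return points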

  • Nonlinear compensation for stochastic matching

    Publication Year: 1999 , Page(s): 643 - 655
    Cited by:  Papers (12)
    PDF (296 KB)

    The performance of an automatic speech recognizer degrades when there is an acoustic mismatch between the training and testing conditions. Though it is certain that the mismatch is nonlinear, its exact form is unknown. Tackling the problem of nonlinear mismatches is a difficult task that has not been adequately addressed before. We develop an approach that uses nonlinear transformations in the stochastic matching framework to compensate for acoustic mismatches. The functional form of the nonlinear transformation is modeled by neural networks. We develop a new technique to train the neural networks using the generalized EM algorithm. This technique eliminates the need for stereo databases, which are difficult to obtain in practical applications. The new technique is data-driven and hence can be used under a wide variety of conditions without a priori knowledge of the environment. Using this technique, we show improvement under various types of acoustic mismatch; in some cases a 72% reduction in word error rate is achieved.
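
    A deliberately toy sketch of the stochastic-matching idea (Python/NumPy): warp the test features through a small nonlinear map and choose the map's parameters to maximize likelihood under the fixed clean-speech model. Here the "recognizer" is a single diagonal Gaussian, the map x -> x + w*tanh(x) replaces the paper's neural network, and finite-difference ascent replaces the generalized EM training, so everything except the overall objective is an assumption.

        import numpy as np

        def loglik(feats, mean, var):
            """Log-likelihood of feature rows under one diagonal Gaussian."""
            return -0.5 * np.sum((feats - mean) ** 2 / var + np.log(2 * np.pi * var))

        def warp(feats, w):
            """Elementwise nonlinear compensation of the features."""
            return feats + np.tanh(feats) * w

        def fit_compensation(feats, mean, var, steps=200, lr=1e-3, eps=1e-4):
            """Maximize the log-likelihood of the warped features by
            finite-difference gradient ascent on the warp parameters."""
            w = np.zeros(feats.shape[1])
            for _ in range(steps):
                grad = np.empty_like(w)
                for i in range(len(w)):
                    d = np.zeros_like(w)
                    d[i] = eps
                    grad[i] = (loglik(warp(feats, w + d), mean, var)
                               - loglik(warp(feats, w - d), mean, var)) / (2 * eps)
                w += lr * grad
            return w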

  • Bark and ERB bilinear transforms

    Publication Year: 1999 , Page(s): 697 - 708
    Cited by:  Papers (88)  |  Patents (11)
    PDF (352 KB)

    Use of a bilinear conformal map to achieve a frequency warping nearly identical to that of the Bark frequency scale is described. Because the map takes the unit circle to itself, its form is that of the transfer function of a first-order allpass filter. Since it is a first-order map, it preserves the model order of rational systems, making it a valuable frequency warping technique for use in audio filter design. A closed-form weighted-equation-error method is derived that computes the optimal mapping coefficient as a function of sampling rate, and the solution is shown to be generally indistinguishable from the optimal least-squares solution. The optimal Chebyshev mapping is also found to be essentially identical to the optimal least-squares solution. The expression 0.8517[arctan(0.06583 f_s)]^(1/2) - 0.1916 is shown to accurately approximate the optimal allpass coefficient as a function of the sampling rate f_s in kHz for sampling rates greater than 1 kHz. A filter design example is included that illustrates improvements due to carrying out the design over a Bark scale. Corresponding results are also given and compared for approximating the related “equivalent rectangular bandwidth (ERB) scale” of Moore and Glasberg (Acta Acustica, vol. 82, p. 335-45, 1996) using a first-order allpass transformation. Due to the higher frequency resolution called for by the ERB scale, particularly at low frequencies, the first-order conformal map is less able to follow the desired mapping, and the error is two to three times greater than in the Bark-scale case, depending on the sampling rate.
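
    The closed-form approximation quoted above translates directly into code. The sketch below (Python/NumPy) computes the allpass coefficient for a given sampling rate and warps a normalized frequency through the corresponding first-order allpass map; the warping formula is the standard phase function of a first-order allpass, and the function names are illustrative.

        import numpy as np

        def bark_warp_coef(fs_hz):
            """Optimal allpass coefficient per the abstract's closed-form
            approximation (valid for sampling rates above 1 kHz)."""
            fs_khz = fs_hz / 1000.0
            return 0.8517 * np.sqrt(np.arctan(0.06583 * fs_khz)) - 0.1916

        def warp_frequency(omega, rho):
            """Frequency mapping of the allpass (z^-1 - rho)/(1 - rho*z^-1);
            omega and the result are in radians per sample."""
            return omega + 2.0 * np.arctan(
                rho * np.sin(omega) / (1.0 - rho * np.cos(omega)))

        rho = bark_warp_coef(44100.0)       # roughly 0.756 at 44.1 kHz
        w = np.linspace(0.0, np.pi, 8)
        print(rho, warp_frequency(w, rho))  # low frequencies are stretched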

  • A novel approach to isolated word recognition

    Publication Year: 1999 , Page(s): 620 - 628
    Cited by:  Papers (32)
    PDF (256 KB)

    A voice signal contains the psychological and physiological properties of the speaker as well as dialect differences, acoustical environment effects, and phase differences. For these reasons, the same word uttered by different speakers can be very different. In this paper, two theories are developed by considering two optimization criteria applied to both the training set and the test set. The first theory is well known, uses what is called Criterion 1 here, and ends up with the average of all vectors belonging to the words in the training set. The second theory is a novel approach that uses what is called Criterion 2 here, and it extracts the common properties of all vectors belonging to the words in the training set. It is shown that Criterion 2 is superior to Criterion 1 as far as the training set is concerned. In Criterion 2, the individual differences are obtained by subtracting a reference vector from the other vectors, and these individual difference vectors are used to obtain an orthogonal vector basis through Gram-Schmidt orthogonalization. The common vector is obtained by subtracting from any vector of the training set its projections onto the orthogonal basis vectors. It is proved that this common vector is unique for any word class in the training set and independent of the chosen reference vector. This common vector is used in isolated word recognition, and it is also shown that Criterion 2 is superior to Criterion 1 for the test set. From the theoretical and experimental study, it is seen that the recognition rates increase as the number of speakers in the training set increases. This means that the common vector obtained from Criterion 2 represents the common properties of a spoken word better than the common or average vector obtained from Criterion 1.
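
    The common-vector construction described above is compact enough to state in code. In this sketch (Python/NumPy), the Gram-Schmidt step is carried out with a QR factorization, which spans the same difference subspace; the final check illustrates the claimed independence from the chosen reference vector.

        import numpy as np

        def common_vector(vectors):
            """vectors: (m, n) array of m feature vectors from one word class
            (m <= n). Returns the reference vector minus its projection onto
            the span of the difference vectors."""
            a = np.asarray(vectors, dtype=float)
            diffs = a[1:] - a[0]          # differences w.r.t. the reference
            q, _ = np.linalg.qr(diffs.T)  # orthonormal basis of their span
            proj = q @ (q.T @ a[0])       # projection onto that subspace
            return a[0] - proj

        # The result should not depend on which vector serves as reference.
        rng = np.random.default_rng(0)
        vecs = rng.normal(size=(4, 10))
        print(np.allclose(common_vector(vecs), common_vector(vecs[::-1])))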

  • Common-acoustical-pole and residue model and its application to spatial interpolation and extrapolation of a room transfer function

    Publication Year: 1999 , Page(s): 709 - 717
    Cited by:  Papers (11)  |  Patents (1)
    PDF (236 KB)

    A method is proposed for modeling a room transfer function (RTF) by using common acoustical poles and their residues. The common acoustical poles correspond to the resonance frequencies (eigenfrequencies) of the room, so they are independent of the source and receiver positions. The residues correspond to the eigenfunctions of the room. Therefore, the residues, which are functions of the source and receiver positions, can be expressed using simple analytical functions for rooms with a simple geometry, such as a rectangular room. That is, the proposed model can describe RTF variations using simple residue functions. Based on the proposed common-acoustical-pole and residue model, methods are also proposed for spatially interpolating and extrapolating RTFs. Because the common acoustical poles are invariant in a given room, the interpolation or extrapolation of RTFs is reformulated as a problem of interpolating or extrapolating residue values. Experimental results for a rectangular room, in which the residue values are interpolated or extrapolated using a cosine function or a linear prediction method, demonstrate that unknown RTFs can be well estimated at low frequencies from known (measured) RTFs by the proposed methods.
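
    A schematic sketch of the model (Python/NumPy): the poles are shared across all source-receiver positions, while each position carries its own residues, so estimating an RTF at an unmeasured position reduces to interpolating residues. The linear interpolation below is one of the options mentioned above (the paper also uses a cosine function); the names and the pole parameterization are illustrative.

        import numpy as np

        def rtf_from_poles(freqs, poles, residues):
            """H(w) = sum_k residues[k] / (1 - poles[k] * exp(-jw)) evaluated
            at normalized angular frequencies freqs (complex arrays)."""
            z_inv = np.exp(-1j * np.asarray(freqs))[:, None]
            return np.sum(residues / (1.0 - poles * z_inv), axis=1)

        def interpolated_rtf(freqs, poles, residues_a, residues_b, alpha):
            """RTF at an intermediate position: the common poles stay fixed and
            only the residues are interpolated (alpha in [0, 1])."""
            residues = (1.0 - alpha) * residues_a + alpha * residues_b
            return rtf_from_poles(freqs, poles, residues)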

  • Kanji-to-Hiragana conversion based on a length-constrained n-gram analysis

    Publication Year: 1999 , Page(s): 685 - 696
    Cited by:  Papers (1)  |  Patents (1)
    PDF (256 KB)

    A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: Kanji, Hiragana, and Katakana. We describe an algorithm for converting conventional Japanese orthography to a Hiragana-like symbol set that closely approximates the most common pronunciation of the text. The algorithm is based on two hypotheses: (1) the correct reading of a Kanji character can be determined by examining a small number of adjacent characters, and (2) the number of such combinations required in a dictionary is manageable. The algorithm converts the input text by selecting the most probable sequence of orthographic units (n-grams) that can be concatenated to form it. In closed-set testing, the n-gram algorithm provided better performance than several public-domain algorithms, achieving a sentence error rate of 3% on a wide range of text material. Though the focus of this paper is written Japanese, the pattern matching algorithm described here has applications to similar problems in other languages.
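
    The selection of the most probable n-gram sequence is a textbook dynamic program, sketched below in Python. The lexicon maps orthographic chunks (length-constrained n-grams) to candidate readings with probabilities; everything here, including the max_len bound, is an invented illustration rather than the paper's dictionary format.

        import math

        def best_reading(text, lexicon, max_len=3):
            """lexicon: {orthographic chunk: [(reading, probability), ...]}.
            Returns the maximum-probability concatenation of chunk readings
            that exactly covers the input text."""
            n = len(text)
            best = [(-math.inf, "")] * (n + 1)  # (log-prob, reading) per prefix
            best[0] = (0.0, "")
            for i in range(n):
                if best[i][0] == -math.inf:
                    continue  # prefix of length i cannot be segmented
                for j in range(i + 1, min(i + max_len, n) + 1):
                    for reading, p in lexicon.get(text[i:j], []):
                        cand = (best[i][0] + math.log(p), best[i][1] + reading)
                        if cand[0] > best[j][0]:
                            best[j] = cand
            return best[n][1]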

  • A Bayesian approach for building triphone models for continuous speech recognition

    Publication Year: 1999 , Page(s): 678 - 684
    Cited by:  Papers (4)
    PDF (152 KB)

    This paper introduces a new statistical framework for constructing triphone models from models of lesser context dependency. This composition reduces the number of models to be estimated by more than an order of magnitude and is therefore of great significance in relieving the data sparsity problem in triphone-based continuous speech recognition. The new framework is derived from Bayesian statistics and represents an alternative to other triphone-by-composition techniques, particularly the model-interpolation and quasitriphone approaches. The potential power of this new framework is explored through an implementation based on the hidden Markov modeling technique. It is shown that the new model structure includes the quasitriphone model as a special case and leads to more efficient parameter estimation than the model-interpolation method. Phone recognition experiments show an increase in accuracy over that obtained by comparable models.
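
    One concrete way to realize "triphone by composition" is the factorization p(x | l, c, r) ≈ p(x | l, c) p(x | c, r) / p(x | c), which builds a triphone density from two biphones and a monophone; whether this is exactly the paper's form is an assumption here. For diagonal Gaussians the product and quotient combine in the precision domain, as the sketch below (Python/NumPy) shows.

        import numpy as np

        def compose_triphone(mean_lc, var_lc, mean_cr, var_cr, mean_c, var_c):
            """Combine left-biphone, right-biphone, and monophone diagonal
            Gaussians into a composed triphone Gaussian."""
            prec = 1.0 / var_lc + 1.0 / var_cr - 1.0 / var_c
            prec = np.maximum(prec, 1e-6)  # guard against negative precision
            var = 1.0 / prec
            mean = var * (mean_lc / var_lc + mean_cr / var_cr - mean_c / var_c)
            return mean, var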

  • N-channel hidden Markov models for combined stressed speech classification and recognition

    Publication Year: 1999 , Page(s): 668 - 677
    Cited by:  Papers (12)  |  Patents (1)
    PDF (260 KB)

    Robust speech recognition systems must address variations due to perceptually induced stress in order to maintain acceptable levels of performance in adverse conditions. One approach to addressing these variations is to use front-end stress classification to direct a stress-dependent recognition algorithm that separately models each speech production domain. This study proposes a new approach that combines the stress classification and speech recognition functions into one algorithm. This is accomplished by generalizing the one-dimensional (1-D) hidden Markov model to an N-channel hidden Markov model (N-channel HMM). Here, each stressed speech production style under consideration is allocated a dimension in the N-channel HMM to model each perceptually induced stress condition. It is shown that this formulation better integrates perceptually induced stress effects for stress-independent recognition. This is due to the sub-phoneme (state-level) stress classification that is implicitly performed by the algorithm. The proposed N-channel stress-independent HMM method is compared to a previously established one-channel stress-dependent isolated word recognition system, yielding a 73.8% reduction in error rate. In addition, an 82.7% reduction in error rate is observed compared to the common one-channel neutral-trained recognition approach.
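
    A minimal sketch of the N-channel emission idea (Python/NumPy): each state keeps one output density per stress style, and scoring a frame against the best channel classifies the stress condition implicitly at the state level. Using diagonal Gaussians and a max over channels are assumptions made for illustration; the abstract does not specify the per-channel densities or the combination rule.

        import numpy as np

        def channel_logliks(x, means, vars_):
            """Per-channel log-likelihoods of frame x; means and vars_ have
            shape (n_channels, dim), one row per stress style."""
            return -0.5 * np.sum((x - means) ** 2 / vars_
                                 + np.log(2 * np.pi * vars_), axis=1)

        def state_score(x, means, vars_):
            """Best channel's score plus the stress label it implies."""
            ll = channel_logliks(x, means, vars_)
            return ll.max(), int(ll.argmax())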


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
