Speech and Audio Processing, IEEE Transactions on

Issue 4 • Date July 1999

  • Abstracts of manuscripts in review

    Publication Year: 1999, Page(s): 478
    Freely Available from IEEE
  • Robust speech recognition based on a Bayesian prediction approach

    Publication Year: 1999, Page(s): 426 - 440
    Cited by: Papers (26)

    We study a category of robust speech recognition problems in which mismatches exist between training and testing conditions, and no accurate knowledge of the mismatch mechanism is available. The only available information is the test data along with a set of pretrained Gaussian mixture continuous density hidden Markov models (CDHMMs). We investigate the problem from the viewpoint of Bayesian prediction. A simple prior distribution, namely a constrained uniform distribution, is adopted to characterize the uncertainty of the mean vectors of the CDHMMs. Two methods, namely a model compensation technique based on Bayesian predictive density and a robust decision strategy called Viterbi Bayesian predictive classification, are studied. The proposed methods are compared with the conventional Viterbi decoding algorithm in speaker-independent recognition experiments on isolated digits and TI connected digit strings (TIDIGITS), where the mismatches between training and testing conditions are caused by: (1) additive Gaussian white noise, (2) each of 25 types of actual additive ambient noises, and (3) gender difference. The experimental results show that the adopted prior distribution and the proposed techniques help to improve the performance robustness under the examined mismatch conditions.

  • Objective estimation of perceived speech quality. I. Development of the measuring normalizing block technique

    Publication Year: 1999, Page(s): 371 - 382
    Cited by: Papers (41) | Patents (2)

    Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts have found limited success, primarily in analog and higher-rate, error-free digital environments where speech waveforms are preserved or nearly preserved. The objective estimation of the perceived quality of highly compressed digital speech, possibly with bit errors or frame erasures, has remained an open question. We report our findings regarding two essential components of objective estimators of perceived speech quality: perceptual transformations and distance measures. A perceptual transformation modifies a representation of an audio signal in a way that is approximately equivalent to the human hearing process. A distance measure reflects the magnitude of a perceived distance between two perceptually transformed signals. We then describe a new objective estimation approach that uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizing blocks. Each measuring normalizing block integrates two perceptually transformed signals over some time or frequency interval to determine the average difference across that interval. This difference is then normalized out of one signal, and is further processed to generate one or more measurements.
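
    As a rough illustration of the measuring normalizing block described in this abstract, the Python sketch below averages the difference between two already-transformed signals over an interval, removes that average from one of them, and reports measurements. It is not the paper's implementation: the function name, the list-based signal representation, and the choice to report the positive and negative parts of the average difference as the measurements are all assumptions made for illustration, since the abstract does not specify them.

```python
def measuring_normalizing_block(reference, test, start, stop):
    """One illustrative block over the index interval [start, stop).

    `reference` and `test` are equal-length lists standing in for two
    perceptually transformed signals (an assumption of this sketch).
    """
    if len(reference) != len(test) or not 0 <= start < stop <= len(test):
        raise ValueError("interval must lie inside two equal-length signals")

    # Integrate over the interval to get the average difference.
    span = stop - start
    diff = sum(test[i] - reference[i] for i in range(start, stop)) / span

    # Normalize the difference out of one signal (here: the test signal).
    normalized = [
        value - diff if start <= i < stop else value
        for i, value in enumerate(test)
    ]

    # Hypothetical further processing: split the difference into its
    # positive and negative parts and report both as measurements.
    measurements = (max(diff, 0.0), max(-diff, 0.0))
    return normalized, measurements
```

    A hierarchy, as the abstract describes it, would apply such blocks repeatedly over different time or frequency intervals, feeding each block's normalized output to the next.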

  • A dynamic system approach to speech enhancement using the H∞ filtering algorithm

    Publication Year: 1999, Page(s): 391 - 399
    Cited by: Papers (35)

    This paper presents a new approach to speech enhancement based on H∞ filtering. This approach differs from the traditional modified Wiener/Kalman filtering approach in the following two aspects: (1) no a priori knowledge of the noise source statistics is required, the only assumption being that noise signals have finite energy; (2) the estimation criterion for the filter design is to minimize the worst possible amplification of the estimation error signals in terms of the modeling errors and additive noise. Since most additive noise in speech is non-Gaussian, this estimation approach is highly robust and more appropriate in practical speech enhancement. The proposed approach is straightforward to implement, as detailed in this paper. Experimental results show consistently superior enhancement performance of the H∞ filtering algorithm over the Kalman filtering counterpart, measured by the global signal-to-noise ratio (SNR). Examination of the spectrogram displays for the enhanced speech shows that the H∞ filtering approach tends to be more effective where the assumptions on the noise statistics are less valid.

  • Cantonese syllable recognition using neural networks

    Publication Year: 1999, Page(s): 466 - 472
    Cited by: Papers (4)

    This work describes a novel neural-network-based speech recognition system for isolated Cantonese syllables. Since Cantonese is a monosyllabic and tonal language, the recognition system is composed of two major components, namely the tone recognizer and the base syllable recognizer. The tone recognizer adopts the architecture of a multilayer perceptron in which each output neuron represents a particular tone. The base syllable recognizer consists of a large number of independently trained recurrent networks, each representing a designated Cantonese syllable. An integrated recognition algorithm is developed to give the ultimate recognition results based on N-best outputs of the two subrecognizers. To demonstrate the effectiveness of the proposed methods, a speaker-dependent recognition system has been built with the vocabulary expanding progressively from 10 syllables to 200 syllables. In the case of 200 syllables, a top-1 recognition accuracy of 81.8% has been attained whilst the top-3 accuracy is 95.2%.

  • An odd-DFT based approach to time-scale expansion of audio signals

    Publication Year: 1999, Page(s): 441 - 453
    Cited by: Papers (5) | Patents (3)

    A new time-scale expansion algorithm based on a frequency-scale modification approach combined with time interpolation is presented. The algorithm is noniterative and is constrained to a blind modification of the magnitudes and phases of the relevant spectral components of the signal, on a frame-by-frame basis. The resulting advantages and limitations are discussed. A few simplified models for signal analysis/synthesis are developed, the most critical of which concern phase and frequency estimation beyond the frequency resolution of the filterbank. The structure of the algorithm is described and its performance is illustrated with both synthetic and natural audio signals.

  • Evaluation of multiple reference active noise control algorithms on Dornier 328 aircraft data

    Publication Year: 1999, Page(s): 473 - 477
    Cited by: Papers (10)

    This article presents an evaluation of multiple reference adaptive algorithms. Two least-mean-squares-type (LMS-type) algorithms and a Newton-type algorithm are considered. The special structure of the adaptive filtering problem implies that the Newton-type algorithm can be implemented with the same numerical complexity as LMS-type algorithms. The concept of a first filtered-x Newton algorithm is thus introduced.

  • A partitioned neural network approach for vowel classification using smoothed time/frequency features

    Publication Year: 1999, Page(s): 414 - 425
    Cited by: Papers (9) | Patents (1)

    A novel pattern classification technique and a new feature extraction method are described and tested for vowel classification. The pattern classification technique partitions an N-way classification task into N*(N-1)/2 two-way classification tasks. Each two-way classification task is performed using a neural network classifier that is trained to discriminate the two members of one pair of categories. Multiple two-way classification decisions are then combined to form an N-way decision. Some of the advantages of the new classification approach include the partitioning of the task allowing independent feature and classifier optimization for each pair of categories, lowered sensitivity of classification performance to network parameters, a reduction in the amount of training data required, and potential for superior performance relative to a single large network. The features described in this paper, closely related to the cepstral coefficients and delta cepstra commonly used in speech analysis, are developed using a unified mathematical framework which allows arbitrary nonlinear frequency, amplitude, and time scales to compactly represent the spectral/temporal characteristics of speech. This classification approach, combined with a feature ranking algorithm which selected the 35 most discriminative spectral/temporal features for each vowel pair, resulted in 71.5% accuracy for classification of 16 vowels extracted from the TIMIT database. These results, significantly higher than other published results for the same task, illustrate the potential for the methods presented in this paper.
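
    The partitioning of an N-way task into N*(N-1)/2 two-way tasks can be sketched in Python as follows. This is only an illustration under stated assumptions: the abstract does not say how the two-way decisions are combined, so simple majority voting is used here as a stand-in, and `decide` is a hypothetical placeholder for a trained two-way neural network classifier.

```python
from collections import Counter
from itertools import combinations


def pairwise_tasks(labels):
    """Return every unordered pair of categories: N*(N-1)/2 pairs."""
    return list(combinations(labels, 2))


def combine_votes(labels, decide):
    """Combine two-way decisions into one N-way decision by majority vote.

    `decide(a, b)` is a hypothetical stand-in for a trained two-way
    classifier; it must return either `a` or `b`.
    """
    votes = Counter()
    for a, b in pairwise_tasks(labels):
        votes[decide(a, b)] += 1
    # Ties are broken by the order of `labels` (an arbitrary choice
    # made for this sketch, not taken from the paper).
    return max(labels, key=lambda label: votes[label])
```

    For the 16 vowels mentioned in the abstract, `pairwise_tasks` yields 16*15/2 = 120 pairs, so 120 two-way classifiers would be needed.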

  • An EM algorithm for linear distortion channel estimation based on observations from a mixture of Gaussian sources

    Publication Year: 1999, Page(s): 400 - 413
    Cited by: Papers (12) | Patents (1)

    In this work, an expectation maximization (EM) algorithm is derived for maximum likelihood estimation of the autocorrelation function of a linear distortion channel as well as the level of additive noise, under the assumption that the source signal comes from a mixture of Gaussian sources. To facilitate parameter initialization in the EM algorithm, a correlation-matching based estimation algorithm is developed for the channel autocorrelation function. The proposed EM algorithm was evaluated on speech-derived simulated data of multiple autoregressive Gaussian sources and real speech of isolated digits under signal-to-noise ratios (SNRs) of 20 dB down to 0 dB. The algorithm is shown to produce convergent estimation results as well as estimates of signal statistics that lead to significantly improved classification accuracy under additive and convolutive noise conditions.

  • Fast implementations of the filtered-X LMS and LMS algorithms for multichannel active noise control

    Publication Year: 1999, Page(s): 454 - 465
    Cited by: Papers (19) | Patents (13)

    In some situations where active noise control could be used, the well-known multichannel version of the filtered-X least mean square (LMS) adaptive filter is too computationally complex to implement. We develop a fast, exact implementation of this adaptive filter for which the system's complexity scales according to the number of filter coefficients within the system. In addition, we extend computationally efficient methods for effectively removing the delays of the secondary paths within the coefficient updates to the multichannel case, thus yielding fast implementations of the LMS adaptive algorithm for multichannel active noise control. Examples illustrate both the equivalence of the algorithms to their original counterparts and the computational gains provided by the new algorithms.

  • Objective estimation of perceived speech quality. II. Evaluation of the measuring normalizing block technique

    Publication Year: 1999, Page(s): 383 - 390
    Cited by: Papers (8) | Patents (1)

    For Part I, see ibid., vol. 7, no. 4, pp. 371-382. Part I of this paper describes a new approach to the objective estimation of perceived speech quality. This new approach uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizing blocks. Each measuring normalizing block integrates two perceptually transformed signals over some time or frequency interval to determine the average difference across that interval. This difference is then normalized out of one signal, and is further processed to generate one or more measurements. In this part, the resulting estimates of the perceived speech quality are correlated with the results of nine subjective listening tests. Together, these tests include 219 4 kHz bandwidth speech codecs, transmission systems, and reference conditions, with bit rates ranging from 2.4 to 61 kb/s. When compared with six other estimators, significant improvements are seen in many cases, particularly at lower bit rates, and when bit errors or frame erasures are present. These hierarchical structures of measuring normalizing blocks, or other structures of measuring normalizing blocks, may also address open issues in perceived audio quality estimation, layered speech or audio coding, automatic speech or speaker recognition, audio signal enhancement, and other areas.


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
