
IEEE Transactions on Speech and Audio Processing

Issue 5 • Sept. 2003


Displaying Results 1 - 11 of 11
  • A new approach to utterance verification based on neighborhood information in model space

    Publication Year: 2003 , Page(s): 425 - 434
    Cited by:  Papers (5)

    We propose to use neighborhood information in model space to perform utterance verification (UV). We first present a nested-neighborhood structure for each underlying model in model space and assume that the underlying model's competing models sit in one of these neighborhoods, which is used to model the alternative hypothesis in UV. Bayes factors (BF) are then introduced to UV and used as the main tool to calculate confidence measures based on this idea. Experimental results on the Bell Labs communicator system show that the new method dramatically improves verification performance when verifying correct words against misrecognized words in the recognizer's output, with a relative reduction of more than 20% in equal error rate (EER) compared with the standard approach based on likelihood ratio testing and anti-models.
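    The core of the confidence computation described above, contrasting the target model against a set of competing neighborhood models via a Bayes factor, can be sketched as follows. This is an illustrative reduction, not the paper's estimator: the function name, the uniform averaging over neighborhood models, and the `prior` parameter are all assumptions made for the example.

    ```python
    import math

    def bayes_factor_confidence(target_loglik, neighbor_logliks, prior=0.5):
        """Confidence that a recognized word is correct, contrasting the target
        model's log-likelihood against an average over its neighborhood
        (competing) models. Illustrative sketch only."""
        # Alternative-hypothesis likelihood: average over the neighborhood
        # models, computed with log-sum-exp for numerical stability.
        m = max(neighbor_logliks)
        alt_loglik = m + math.log(
            sum(math.exp(l - m) for l in neighbor_logliks) / len(neighbor_logliks))
        log_bf = target_loglik - alt_loglik          # log Bayes factor
        # Posterior probability of the "correct word" hypothesis
        odds = (prior / (1 - prior)) * math.exp(log_bf)
        return odds / (1 + odds)
    ```

    A word whose target model scores well above its neighborhood yields a confidence near 1; a word scoring below its competitors is flagged for rejection.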

  • Cepstrum-domain acoustic feature compensation based on decomposition of speech and noise for ASR in noisy environments

    Publication Year: 2003 , Page(s): 435 - 446
    Cited by:  Papers (22)  |  Patents (1)

    This paper presents a set of acoustic feature pre-processing techniques that are applied to improve automatic speech recognition (ASR) performance on noisy speech recognition tasks. The principal contribution of the paper is an approach to cepstrum-domain feature compensation in ASR motivated by techniques for decomposing speech and noise that were originally developed for noisy speech enhancement. This approach is applied, in combination with other feature compensation algorithms, to compensate ASR features obtained from a mel-filterbank cepstrum coefficient front-end. Performance comparisons are made with respect to applying the minimum mean squared error log spectral amplitude (MMSE-LSA) estimator based speech enhancement algorithm prior to feature analysis. An experimental study shows that the feature compensation approaches described in the paper greatly reduce ASR word error rate compared to uncompensated features under environmental and channel mismatched conditions.

  • Matching pursuits sinusoidal speech coding

    Publication Year: 2003 , Page(s): 413 - 424
    Cited by:  Papers (6)  |  Patents (1)

    This paper introduces a sinusoidal modeling technique for low bit rate speech coding in which the parameters for each sinusoidal component are sequentially extracted by a closed-loop analysis. The sinusoidal modeling of the speech linear prediction (LP) residual is performed within the general framework of matching pursuits with a dictionary of sinusoids. The frequency space of sinusoids is restricted to sets of frequency intervals, or bins, which in conjunction with the closed-loop analysis allow the frequencies of the sinusoids to be mapped into a frequency vector that is efficiently quantized. In voiced frames, two sets of frequency vectors are generated: one represents the harmonically related components of the voiced segment and the other the nonharmonically related components. This approach eliminates the need for a voicing-dependent cutoff frequency, which is difficult to estimate correctly and to quantize at low bit rates. In transition frames, to efficiently extract and quantize the set of frequencies needed for the sinusoidal representation of the LP residual, we introduce frequency bin vector quantization (FBVQ). FBVQ selects a vector of nonuniformly spaced frequencies from a frequency codebook to represent the frequency domain information in transition regions. The use of FBVQ with closed-loop searching contributes to improved speech quality in transition frames. The effectiveness of the coding scheme is enhanced by exploiting the critical band concept of auditory perception in defining the frequency bins. To demonstrate the viability and advantages of the new models, we designed a 4 kbps matching pursuits sinusoidal speech coder. Subjective results indicate that the proposed coder at 4 kbps exceeds the quality of the 6.3 kbps G.723.1 coder.
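    The closed-loop extraction the abstract describes, greedily fitting one sinusoid at a time from a restricted frequency grid and subtracting it from the residual, can be sketched in a few lines. This is a bare-bones matching pursuit over a sinusoidal dictionary, not the paper's full coder (no bins, no quantization); the function name and the flat `freqs` grid are assumptions for the example.

    ```python
    import numpy as np

    def matching_pursuit_sinusoids(residual, freqs, n_components):
        """Greedily extract sinusoidal components from an LP residual frame.
        At each step, the candidate frequency whose least-squares cos/sin fit
        leaves the smallest residual energy is selected and subtracted."""
        n = len(residual)
        t = np.arange(n)
        r = residual.astype(float).copy()
        params = []
        for _ in range(n_components):
            best = None
            for f in freqs:
                # Least-squares fit of a*cos + b*sin at this candidate frequency
                c, s = np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)
                A = np.column_stack([c, s])
                coef, *_ = np.linalg.lstsq(A, r, rcond=None)
                approx = A @ coef
                err = np.sum((r - approx) ** 2)
                if best is None or err < best[0]:
                    best = (err, f, coef, approx)
            err, f, coef, approx = best
            # Store (frequency, amplitude, phase) of the extracted component
            params.append((f, float(np.hypot(*coef)),
                           float(np.arctan2(-coef[1], coef[0]))))
            r -= approx  # sequential (closed-loop) residual update
        return params, r
    ```

    Restricting `freqs` to a coarse grid is what makes the extracted frequency vector cheap to index and quantize, which is the role the bins play in the paper.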

  • A soft voice activity detector based on a Laplacian-Gaussian model

    Publication Year: 2003 , Page(s): 498 - 505
    Cited by:  Papers (55)  |  Patents (5)

    A new voice activity detector (VAD) is developed in this paper. The VAD is derived by applying a Bayesian hypothesis test to decorrelated speech samples. The signal is first decorrelated using an orthogonal transformation, e.g., the discrete cosine transform (DCT) or the adaptive Karhunen-Loeve transform (KLT). The distributions of clean speech and noise signals are assumed to be Laplacian and Gaussian, respectively, as investigated recently. In addition, a hidden Markov model (HMM) with two states representing silence and speech is employed. The proposed soft VAD recursively estimates the probability of voice being active (VBA). To this end, the a priori probability of VBA is first predicted from feedback information at the previous time instant; the predicted probability is then updated with the newly observed signal to obtain the probability of VBA at the current time instant. The required parameters of both the speech and noise signals are estimated adaptively by the maximum likelihood (ML) approach. Simulation results show that the proposed soft VAD, which uses a Laplacian distribution model for speech signals, outperforms a previous VAD based on a Gaussian model.
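    The predict-then-update recursion described above can be sketched as follows. This is a simplified illustration under fixed, known model parameters (the paper estimates them adaptively by ML); the function name, the transition probabilities `a01`/`a10`, and the treatment of speech-frame coefficients as Laplacian with scale `b_s` are assumptions for the example.

    ```python
    import numpy as np

    def soft_vad_prob(frames, sigma_n, b_s, a01=0.1, a10=0.1, p0=0.5):
        """Recursive soft VAD: per-frame probability of voice being active,
        assuming Gaussian noise (std sigma_n) and Laplacian speech-frame
        coefficients (scale b_s), smoothed by a two-state HMM."""
        probs = []
        p = p0                                   # P(speech) at previous frame
        for x in frames:                         # x: decorrelated coefficients
            # Predict: a priori speech probability from the HMM transitions
            prior = p * (1 - a10) + (1 - p) * a01
            # Frame log-likelihoods under each hypothesis
            ll_s = np.sum(-np.abs(x) / b_s - np.log(2 * b_s))         # Laplacian
            ll_n = np.sum(-0.5 * (x / sigma_n) ** 2
                          - 0.5 * np.log(2 * np.pi * sigma_n ** 2))   # Gaussian
            # Update: posterior P(speech | x) via the likelihood ratio
            lr = np.exp(np.clip(ll_s - ll_n, -50, 50))
            p = prior * lr / (prior * lr + (1 - prior))
            probs.append(p)
        return probs
    ```

    The HMM prior is what makes the detector "soft": a frame's decision leans toward the previous frame's state, so isolated likelihood glitches do not flip the output.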

  • Audio source separation of convolutive mixtures

    Publication Year: 2003 , Page(s): 489 - 497
    Cited by:  Papers (35)  |  Patents (3)

    The problem of separating audio sources recorded in a real-world situation is well established in modern literature. One method to solve this problem is blind source separation (BSS) using independent component analysis (ICA). The recording environment is usually modeled as convolutive. Previous research on ICA of instantaneous mixtures provides a solid background for the separation of convolved mixtures. The authors review current approaches on the subject and propose a fast frequency-domain ICA framework, providing a solution for the permutation problem that arises in these methods.

  • Quantization of LSF parameters using a trellis modeling

    Publication Year: 2003 , Page(s): 400 - 412
    Cited by:  Papers (6)

    An efficient block-based trellis quantization (BTQ) scheme is proposed for the quantization of line spectral frequencies (LSF) in speech coding applications. The scheme is based on modeling the LSF intraframe dependencies with a trellis structure. The ordering property and the fact that LSF parameters are bounded within a range are explicitly incorporated in the trellis model. BTQ search and design algorithms are discussed, and an efficient algorithm for index generation (finding the index of a path in the trellis) is presented. A sequential vector decorrelation technique is also presented to effectively exploit the intraframe correlation of LSF parameters within the trellis. Based on the proposed block-based trellis quantizer, two intraframe schemes and one interframe scheme are proposed. Comparisons are provided to the split-VQ, the trellis coded quantization of LSF parameters, and the multi-stage VQ, as well as to the interframe scheme used in the IS-641 EFRC and the GSM AMR codec. The results demonstrate that the proposed BTQ schemes outperform these systems.

  • Nonuniform oversampled filter banks for audio signal processing

    Publication Year: 2003 , Page(s): 393 - 399
    Cited by:  Papers (15)

    In emerging audio technology applications, there is a need for decompositions of audio signals into oversampled subband components with time-frequency resolution that mimics that of the cochlear filter bank, and with high aliasing attenuation in each subband independently rather than reliance on aliasing cancellation. We present a design of nearly perfect reconstruction nonuniform oversampled filter banks that implement signal decompositions of this kind.

  • Blind single channel deconvolution using nonstationary signal processing

    Publication Year: 2003 , Page(s): 476 - 488
    Cited by:  Papers (19)  |  Patents (1)

    Blind deconvolution is fundamental in signal processing applications and, in particular, the single channel case remains a challenging and formidable problem. This paper considers single channel blind deconvolution in the case where the degraded observed signal may be modeled as the convolution of a nonstationary source signal with a stationary distortion operator. The key feature that the source is nonstationary while the channel is stationary facilitates the unambiguous identification of either the source or the channel, and deconvolution is possible, whereas if the source and channel are both stationary, identification is ambiguous. The channel parameters are estimated by modeling the source as a time-varying AR process and the distortion as an all-pole filter, using a Bayesian framework for parameter estimation. This estimate can then be used to deconvolve the observed signal. In contrast to the classical histogram approach for estimating the channel poles, which merely relies on the fact that the channel is actually stationary rather than modeling it as such, the proposed Bayesian method does account for the channel's stationarity in the model and is consequently more robust. The properties of this model are investigated, and the advantage of exploiting the nonstationarity of a system, rather than treating it as a curse, is discussed.
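    The time-varying AR source model at the heart of this approach amounts to fitting a separate AR model on each block of the signal. The sketch below shows only that per-block fit via least squares; the paper's actual method estimates source and channel jointly in a Bayesian framework, and the function name and parameters here are invented for illustration.

    ```python
    import numpy as np

    def blockwise_ar_fit(x, order, block):
        """Fit an independent AR model to each consecutive block of x via
        least squares, illustrating a time-varying AR source model: the
        coefficients are free to change from block to block."""
        coefs = []
        for start in range(0, len(x) - block + 1, block):
            seg = x[start:start + block]
            # Linear prediction system: seg[n] ~ sum_k a_k * seg[n-k]
            X = np.column_stack(
                [seg[order - k - 1:block - k - 1] for k in range(order)])
            y = seg[order:]
            a, *_ = np.linalg.lstsq(X, y, rcond=None)
            coefs.append(a)
        return coefs
    ```

    It is precisely because these block-wise coefficients vary while the channel's all-pole parameters stay fixed that the two can be told apart, which is the identifiability argument the abstract makes.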

  • Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging

    Publication Year: 2003 , Page(s): 466 - 475
    Cited by:  Papers (196)  |  Patents (7)

    Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. We present an improved minima controlled recursive averaging (IMCRA) approach for noise estimation in adverse environments involving nonstationary noise, weak speech components, and low input signal-to-noise ratio (SNR). The noise estimate is obtained by averaging past spectral power values, using a time-varying frequency-dependent smoothing parameter that is adjusted by the signal presence probability. The speech presence probability is controlled by the minima of a smoothed periodogram. The proposed procedure comprises two iterations of smoothing and minimum tracking. The first iteration provides a rough voice activity detection in each frequency band. Smoothing in the second iteration then excludes relatively strong speech components, which makes the minimum tracking robust during speech activity. We show that the IMCRA approach is very effective in nonstationary noise environments and under low SNR conditions. In particular, compared to a competitive method, it obtains a lower estimation error, and when integrated into a speech enhancement system it achieves improved speech quality and lower residual noise.
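    The basic mechanism, recursively averaging the noisy power spectrum but gating the update by a minima-based speech presence decision, can be sketched for a single frequency bin. This is a rough sketch of plain minima-controlled recursive averaging, not IMCRA itself (no second smoothing iteration, no probabilistic smoothing parameter); the function name, thresholds, and window length are assumptions for the example.

    ```python
    def mcra_noise_estimate(power_frames, alpha=0.9, win=8):
        """Simplified minima-controlled recursive averaging over one frequency
        bin's power trajectory. The noise estimate is updated toward the
        current power only when speech is judged absent, based on the ratio of
        smoothed power to the tracked minimum."""
        noise = power_frames[0]
        smoothed = power_frames[0]
        history = [smoothed]
        estimates = []
        for p in power_frames:
            smoothed = 0.7 * smoothed + 0.3 * p     # short-time smoothing
            history.append(smoothed)
            pmin = min(history[-win:])              # minimum tracking
            speech_present = smoothed / max(pmin, 1e-12) > 4.0
            if not speech_present:                  # update only in speech gaps
                noise = alpha * noise + (1 - alpha) * p
            estimates.append(noise)
        return estimates
    ```

    Because the update is frozen whenever the smoothed power rises well above the running minimum, a burst of speech energy leaves the noise estimate essentially unchanged instead of inflating it.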

  • A perceptually motivated approach for speech enhancement

    Publication Year: 2003 , Page(s): 457 - 465
    Cited by:  Papers (29)

    A new perceptually motivated approach is proposed for the enhancement of speech corrupted by colored noise. The approach takes into account the frequency masking properties of the human auditory system and reduces the perceptual effect of the residual noise. The perceptual method is incorporated into both a frequency-domain speech enhancement method and a subspace-based speech enhancement method. An improved power spectrum/autocorrelation function estimator was also developed to improve the performance of the proposed algorithms. Objective measures and informal listening tests demonstrate significant improvements over other methods when tested with TIMIT sentences corrupted by various types of colored noise.

  • Efficient text-independent speaker verification with structural Gaussian mixture models and neural network

    Publication Year: 2003 , Page(s): 447 - 456
    Cited by:  Papers (32)  |  Patents (3)

    We present an integrated system using structural Gaussian mixture models (SGMMs) and a neural network to achieve both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is first constructed by hierarchically clustering all Gaussian mixture components in a universal background model (UBM); in this way the acoustic space is partitioned into multiple regions at different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During testing, only a small subset of the Gaussian mixture components is scored for each feature vector, which significantly reduces the computational cost. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for the final decision. Different configurations are compared in experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that a computational reduction by a factor of 17 can be achieved with a 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows advantages over the recently proposed hash GMM, including higher speed and better verification performance.
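    The scoring shortcut described above, using a coarse layer of the tree to decide which fine Gaussians are worth evaluating, can be sketched as follows. This is a toy two-level version with unit variances and uniform weights, not the paper's SGMM; all names and the `top_c` pruning parameter are assumptions for the example.

    ```python
    import numpy as np

    def fast_gmm_score(x, coarse_means, fine_means, fine_assign, top_c=2):
        """Score a feature vector against only the fine Gaussians whose coarse
        cluster is among the top_c best-scoring clusters, as in tree-structured
        background models. Unit covariances and uniform weights assumed."""
        def log_gauss(x, mu):
            d = x - mu
            return -0.5 * np.dot(d, d) - 0.5 * len(x) * np.log(2 * np.pi)
        # Coarse pass: rank the cluster centroids
        coarse = np.array([log_gauss(x, m) for m in coarse_means])
        keep = set(np.argsort(coarse)[-top_c:])
        # Fine pass: score only components assigned to the selected clusters
        scores = [log_gauss(x, m)
                  for m, a in zip(fine_means, fine_assign) if a in keep]
        # Log-sum-exp over the selected components (uniform weights omitted)
        m = max(scores)
        return m + np.log(np.sum(np.exp(np.array(scores) - m)))
    ```

    Pruning is nearly lossless because the components under distant coarse clusters contribute negligibly to the mixture likelihood, which is why the paper can cut computation by a large factor without hurting accuracy.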


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
