IEEE Transactions on Speech and Audio Processing

Issue 6 • Nov. 2005

  • Table of contents

    Page(s): c1 - c4
  • IEEE Transactions on Speech and Audio Processing publication information

    Page(s): c2
  • Comparative analysis of linear and nonlinear speech signals predictors

    Page(s): 1093 - 1097

    The paper presents a new approach to speech production modeling based on nonlinear signal predictors. The coefficients of the latter are found by solving a system of linear algebraic equations using the least-squares method. Comparative experiments were carried out to demonstrate the superiority of nonlinear models over linear ones in terms of normalized mean-square error.
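
    A minimal sketch of the setup, assuming a second-order (quadratic) predictor; the abstract does not fix the form of the nonlinearity, so the feature set below is illustrative. The coefficients are the least-squares solution of a linear system, exactly as in linear prediction, only with an enlarged regressor:

        import numpy as np

        def predictor_nmse(x, p=4, nonlinear=False):
            # Regress x[n] on its p past samples; the nonlinear predictor
            # adds quadratic cross-terms (an illustrative choice).
            rows, y = [], x[p:]
            for n in range(p, len(x)):
                past = list(x[n - p:n][::-1])
                if nonlinear:
                    past += [past[i] * past[j] for i in range(p) for j in range(i, p)]
                rows.append(past)
            A = np.array(rows)
            w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
            e = y - A @ w
            return np.sum(e ** 2) / np.sum(y ** 2)      # normalized mean-square error

        x = np.sin(0.3 * np.arange(400)) + 0.05 * np.random.randn(400)
        print(predictor_nmse(x), predictor_nmse(x, nonlinear=True))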

  • Nonlinear speech analysis using models for chaotic systems

    Page(s): 1098 - 1109

    In this paper, we use concepts and methods from chaotic systems to model and analyze nonlinear dynamics in speech signals. The modeling is done not on the scalar speech signal, but on its reconstructed multidimensional attractor, obtained by embedding the scalar signal into a phase space. We have analyzed and compared a variety of nonlinear models for approximating the dynamics of complex systems using a small record of their observed output. These models include approximations based on global or local polynomials as well as approximations inspired by machine learning, such as radial basis function networks, fuzzy-logic systems, and support vector machines. Our focus has been on facilitating the application of the methods of chaotic signal analysis even when only a short time series is available, such as the phonemes in speech utterances. This introduces an increased degree of difficulty that we deal with by resorting to sophisticated function approximation models appropriate for short data sets. Using these models enabled us to compute, for short time series of speech sounds, useful features such as Lyapunov exponents that assist in the characterization of chaotic systems. Several experimental insights are reported on the possible applications of such nonlinear models and features.
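
    The reconstruction step described above is time-delay embedding. A minimal sketch (the embedding dimension and delay below are illustrative; in practice they are typically chosen with criteria such as false nearest neighbors and average mutual information):

        import numpy as np

        def delay_embed(x, dim=3, tau=5):
            # Reconstruct a multidimensional attractor from a scalar signal:
            # row n of the result is [x(n), x(n+tau), ..., x(n+(dim-1)*tau)].
            N = len(x) - (dim - 1) * tau
            return np.stack([x[i * tau:i * tau + N] for i in range(dim)], axis=1)

        segment = np.random.randn(800)          # stand-in for a phoneme-length segment
        attractor = delay_embed(segment, dim=3, tau=5)
        print(attractor.shape)                  # (790, 3)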

  • Processing of reverberant speech for time-delay estimation

    Page(s): 1110 - 1118

    In this paper, we present a method for extracting the time delay between speech signals collected at two microphone locations. Time-delay estimation from microphone outputs is the first step in many sound localization algorithms, and also in speech enhancement. For time-delay estimation, speech signals are normally processed using short-time spectral information (magnitude, phase, or both). The spectral features are affected by degradations in speech caused by noise and reverberation. Features corresponding to the excitation source of the speech production mechanism are robust to such degradations. We show that these source features can be extracted reliably from the speech signal. The time-delay estimate can be obtained using features extracted even from short segments (50-100 ms) of speech from a pair of microphones. The proposed method for time-delay estimation is found to perform better than the generalized cross-correlation (GCC) approach. A method for enhancement of speech is also proposed using knowledge of the time delay and the information of the excitation source.
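
    For reference, a sketch of the GCC baseline that the proposed method is compared against, in its common PHAT-weighted form (the specific weighting is an assumption; the abstract does not name one):

        import numpy as np

        def gcc_phat_delay(x1, x2, fs):
            # Whiten the cross-spectrum (PHAT) and pick the lag that
            # maximizes the generalized cross-correlation.
            n = len(x1) + len(x2)
            X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
            G = X1 * np.conj(X2)
            cc = np.fft.irfft(G / (np.abs(G) + 1e-12), n)
            cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))
            return (np.argmax(np.abs(cc)) - n // 2) / fs

        fs, true_delay = 16000, 9               # delay in samples
        x = np.random.randn(4096)
        est = gcc_phat_delay(np.roll(x, true_delay), x, fs)
        print(round(est * fs))                  # approximately 9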

  • An effective subband OSF-based VAD with noise reduction for robust speech recognition

    Page(s): 1119 - 1129

    An effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The approach is based on determining the speech/nonspeech divergence by means of specialized order statistics filters (OSFs) working on the subband log-energies. The algorithm differs from many others in the way the decision rule is formulated: instead of making the decision based on the current frame alone, it applies OSFs to the subband log-energies, which significantly reduces the error probability when discriminating speech from nonspeech in a noisy signal. Clear improvements in speech/nonspeech discrimination accuracy demonstrate the effectiveness of the proposed VAD. It is shown that increasing the OSF order leads to a better separation of the speech and noise distributions, allowing more effective discrimination and a tradeoff between complexity and performance. The algorithm also incorporates a noise reduction block that works in tandem with the VAD and is shown to further improve speech/nonspeech detection accuracy. The experimental analysis carried out on the AURORA databases and tasks provides an extensive performance evaluation together with an exhaustive comparison to standard VADs such as ITU-T G.729, GSM AMR, and ETSI AFE for distributed speech recognition (DSR), and to other recently reported VADs.
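
    A minimal sketch of the core decision path, assuming a sliding median (one instance of an OSF) and a noise floor estimated from the first frames; both choices, and the threshold, are illustrative rather than the paper's exact design:

        import numpy as np

        def osf_vad(x, frame=256, n_bands=4, order=5, k=30, thr=2.0):
            # Subband log-energies, one row per frame.
            n_frames = len(x) // frame
            spec = np.abs(np.fft.rfft(x[:n_frames * frame].reshape(n_frames, frame), axis=1)) ** 2
            bands = np.array_split(spec, n_bands, axis=1)
            logE = np.stack([np.log(b.sum(axis=1) + 1e-12) for b in bands], axis=1)
            # Order statistics filter over time: a sliding median per band.
            pad = order // 2
            padded = np.pad(logE, ((pad, pad), (0, 0)), mode="edge")
            smooth = np.stack([np.median(padded[i:i + order], axis=0) for i in range(n_frames)])
            # Decision: average subband divergence from an initial noise floor
            # (first k frames assumed nonspeech, an assumption of this sketch).
            noise_floor = smooth[:k].mean(axis=0)
            return (smooth - noise_floor).mean(axis=1) > thr   # one boolean per frame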

  • Fast QRD-lattice-based unconstrained optimal filtering for acoustic noise reduction

    Page(s): 1130 - 1143

    We derive a fast QRD-least-squares lattice (QRD-LSL) based unconstrained optimal filtering algorithm for multichannel acoustic noise reduction. As is known from the literature, unconstrained optimal filtering is an alternative to popular GSC beamforming; it does not rely on a priori information and hence possesses improved robustness. The optimal filtering problem involved is special in that the desired response signal is not known explicitly. The derivation of the QRD-LSL algorithm is based on a significantly reorganized version of a QRD-RLS-based unconstrained optimal filtering scheme. Overall, an order-of-magnitude complexity reduction is obtained without any performance penalty, which makes the new approach affordable for real-time implementation.
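
    For orientation, the slow reference computation that the fast lattice recursion replaces: a QR-decomposition least-squares solve of the filtering problem min_w ||d - X w|| per data block. This is only the non-fast baseline; the paper's contribution is reaching the same solution recursively, sample by sample, at far lower cost:

        import numpy as np

        def qrd_ls_filter(X, d):
            # X: (n_samples, n_taps) stacked multichannel regressors,
            # d: (n_samples,) desired response. Solve min_w ||d - X w||
            # via QR factorization (a numerically robust alternative to
            # the normal equations).
            Q, R = np.linalg.qr(X)
            return np.linalg.solve(R, Q.T @ d)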

  • Subspace constrained Gaussian mixture models for speech recognition

    Page(s): 1144 - 1160

    A standard approach to automatic speech recognition uses hidden Markov models whose state-dependent distributions are Gaussian mixture models. Each Gaussian can be viewed as an exponential model whose features are linear and quadratic monomials in the acoustic vector. We consider here models in which the weight vectors of these exponential models are constrained to lie in an affine subspace shared by all the Gaussians. This class of models includes Gaussian models with linear constraints placed on the precision (inverse covariance) matrices (such as diagonal covariance, maximum likelihood linear transformation, or extended maximum likelihood linear transformation), as well as the LDA/HLDA models used for feature selection, which tie the parameters of the Gaussians in the directions not used for discrimination. In this paper, we present algorithms for training these models using a maximum likelihood criterion. We present experiments on both small-vocabulary, resource-constrained, grammar-based tasks and large-vocabulary, resource-unconstrained tasks to explore the rather large parameter space of models that fit within our framework. In particular, we demonstrate that significant improvements can be obtained in both word error rate and computational complexity.
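
    The exponential-model view is easy to verify numerically: the Gaussian log-density is affine in the features (x, x x^T), with weights determined by the precision matrix P and mean mu, so constraining the weight vectors constrains (P mu, P). A small check of that identity:

        import numpy as np

        d = 3
        P = 2.0 * np.eye(d)                       # precision (inverse covariance)
        mu = np.array([0.5, -1.0, 0.2])
        x = np.random.randn(d)

        quad = -0.5 * (x - mu) @ P @ (x - mu)     # usual quadratic form
        affine = (P @ mu) @ x - 0.5 * x @ P @ x - 0.5 * mu @ P @ mu
        print(np.allclose(quad, affine))          # True: affine in (x, x x^T)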

  • Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR

    Page(s): 1161 - 1172

    A feature compensation (FC) algorithm based on polynomial regression of utterance signal-to-noise ratio (SNR) for noise-robust automatic speech recognition (ASR) is proposed. In this algorithm, the bias between clean and noisy speech features is approximated by a set of polynomials, which are estimated from adaptation data from the new environment by the expectation-maximization (EM) algorithm under the maximum likelihood (ML) criterion. In ASR, the utterance SNR of the speech signal is first estimated, and the noisy speech features are then compensated by the regression polynomials. The compensated speech features are decoded with acoustic HMMs trained on clean data. Comparative experiments between FC and maximum likelihood linear regression (MLLR) are performed on the Aurora 2 (English) database and the German part of the Aurora 3 database. In the Aurora 2 experiments, two MLLR implementations are used: pooling adaptation data across all SNRs, and using three distinct SNR clusters. For each type of noise, FC achieves, on average, a word error rate reduction of 16.7% and 16.5% for Set A, and 20.5% and 14.6% for Set B, compared to the first and second MLLR implementations, respectively. For each SNR condition, FC achieves, on average, a word error rate reduction of 33.1% and 34.5% for Set A, and 23.6% and 21.4% for Set B. Results on the Aurora 3 database show that the best FC performance outperforms MLLR by 15.9%, 3.0%, and 14.6% for the well-matched, medium-mismatched, and high-mismatched conditions, respectively.
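
    A sketch of the compensation shape, assuming per-dimension polynomials fitted on stereo (clean/noisy) adaptation data with ordinary least squares; the paper instead estimates the polynomials with EM under an ML criterion, so this simplifies the training step only:

        import numpy as np

        def fit_bias_polys(utt_snr, clean, noisy, deg=2):
            # utt_snr: (n_utts,); clean/noisy: (n_utts, n_dims) mean features.
            # Fit, per feature dimension, a polynomial mapping SNR -> bias.
            bias = noisy - clean
            return [np.polyfit(utt_snr, bias[:, j], deg) for j in range(bias.shape[1])]

        def compensate(feats, utt_snr, polys):
            # Subtract the SNR-predicted bias before decoding with clean HMMs.
            pred = np.array([np.polyval(p, utt_snr) for p in polys])
            return feats - pred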

  • Automatic transcription of conversational telephone speech

    Page(s): 1173 - 1185

    This paper discusses the Cambridge University HTK (CU-HTK) system for the automatic transcription of conversational telephone speech. A detailed discussion of the most important techniques in front-end processing, acoustic modeling and model training, and language and pronunciation modeling is presented. These include the use of conversation-side-based cepstral normalization, vocal tract length normalization, heteroscedastic linear discriminant analysis for feature projection, minimum phone error training and speaker adaptive training, lattice-based model adaptation, confusion-network-based decoding and confidence score estimation, pronunciation selection, language model interpolation, and class-based language models. The transcription system developed for participation in the 2002 NIST Rich Transcription evaluations of English conversational telephone speech data is presented in detail. In this evaluation the CU-HTK system gave an overall word error rate of 23.9%, which was the best performance by a statistically significant margin. Further details on the derivation of faster systems with moderate performance degradation are discussed in the context of the 2002 CU-HTK 10 × RT conversational speech transcription system.
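
    Of the front-end steps listed, conversation-side cepstral normalization is the simplest to illustrate: statistics are computed over all frames of one speaker's side of the call. A minimal sketch (whether variance is normalized in addition to the mean is an assumption here):

        import numpy as np

        def side_based_cmvn(cepstra, eps=1e-8):
            # cepstra: (n_frames, n_coeffs) for one conversation side.
            mean = cepstra.mean(axis=0, keepdims=True)
            std = cepstra.std(axis=0, keepdims=True)
            return (cepstra - mean) / (std + eps)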

  • Recognizing GSM digital speech

    Page(s): 1186 - 1205

    The Global System for Mobile (GSM) environment poses three main problems for automatic speech recognition (ASR) systems: noisy scenarios, source coding distortion, and transmission errors. The first has already received much attention; however, source coding distortion and transmission errors must also be explicitly addressed. In this paper, we propose an alternative front-end for speech recognition over GSM networks, specially conceived to be effective against source coding distortion and transmission errors. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bitstream) instead of decoding it and subsequently extracting the feature vectors. This approach offers two significant advantages. First, the recognition system is only affected by the quantization distortion of the spectral envelope, avoiding the influence of the other sources of distortion introduced by the encoding-decoding process. Second, when transmission errors occur, our front-end is more effective since it is not affected by errors in the bits allocated to the excitation signal. We have considered the half-rate and full-rate standard codecs and compared the proposed front-end with the conventional approach in two ASR tasks, namely speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure for a variety of simulated channel conditions, and the disparity increases as the network conditions worsen.

  • An efficient method of Huffman decoding for MPEG-2 AAC and its performance analysis

    Page(s): 1206 - 1209

    This paper presents a new method for Huffman decoding specially designed for MPEG-2 AAC audio. The method significantly improves the processing efficiency of conventional Huffman decoding realized with the ordinary binary tree search method. A new one-dimensional array data structure is designed, based on a numerical interpretation of the incoming bit stream and its use for offset-oriented node allocation. The Huffman tree implemented with the proposed data structure allows direct computation of the branching location, eliminating the need for the pipeline-violating "compare and jump" instructions. The experimental results show average performance improvements of 67% and 285% over the conventional binary tree search method and the sequential search method, respectively. The proposed method also shows slightly better processing efficiency, while requiring much less memory, than up-to-date efficient search methods such as Hashemian's and its variants.
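
    A minimal sketch of the flat-array idea: the tree lives in one array, the child slot is computed by index arithmetic on the incoming bit rather than by walking pointers, and negative entries encode decoded symbols. The exact offset scheme of the paper differs; this shows only the general shape:

        def build_flat_tree(codes):
            # codes: {symbol: prefix-free bit string}. Two slots per node;
            # an entry > 0 is a child node index, an entry < 0 encodes a
            # symbol as -(symbol + 1), 0 means unused (no edge points to root).
            nxt = [0, 0]
            for sym, bits in codes.items():
                node = 0
                for i, b in enumerate(bits):
                    slot = 2 * node + int(b)
                    if i == len(bits) - 1:
                        nxt[slot] = -(sym + 1)
                    else:
                        if nxt[slot] == 0:
                            nxt[slot] = len(nxt) // 2
                            nxt.extend([0, 0])
                        node = nxt[slot]
            return nxt

        def decode(nxt, bits):
            out, node = [], 0
            for b in bits:
                t = nxt[2 * node + int(b)]    # branch location by arithmetic
                node = t if t > 0 else 0
                if t < 0:
                    out.append(-t - 1)        # leaf reached: emit symbol
            return out

        codes = {0: "0", 1: "10", 2: "110", 3: "111"}
        print(decode(build_flat_tree(codes), "0101101110"))   # [0, 1, 2, 3, 0]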

  • Improved noise reduction in audio signals using spectral resolution enhancement with time-domain signal extrapolation

    Page(s): 1210 - 1216

    In this paper, we present a significant improvement to frame-by-frame noise reduction methods based on spectral-domain processing. The work focuses mainly on the analysis stage of the noise reduction process. Better performance is known to be obtained by increasing the spectral resolution; however, the effective spectral resolution depends directly on the number of samples in the processing frame, which leads to the well-known tradeoff between spectral and temporal resolution. Our approach is to use a novel discrete signal extrapolation method to prolong the given signal frame, resulting in increased spectral resolution without losing temporal resolution. The improvement in noise reduction is tested with objective and subjective evaluations. The objective measures take the human auditory system into account, and subjective tests are performed to determine the quality improvement in a perceptual sense.
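
    The basic idea can be sketched with plain autoregressive continuation (the paper's extrapolation method is more elaborate; AR extrapolation is shown only to contrast with zero-padding, which interpolates the spectrum without adding resolution):

        import numpy as np

        def ar_extrapolate(frame, order=16, extra=256):
            # Fit an AR model to the frame by least squares, then continue
            # the signal so the FFT sees a longer, still-consistent frame.
            N = len(frame)
            A = np.array([frame[n - order:n][::-1] for n in range(order, N)])
            a, *_ = np.linalg.lstsq(A, frame[order:], rcond=None)
            out = list(frame)
            for _ in range(extra):
                out.append(np.dot(a, out[-1:-order - 1:-1]))
            return np.array(out)

        # Two sinusoids closer than the 256-sample resolution limit: the
        # extrapolated frame separates them, the raw frame does not.
        frame = np.sin(0.19 * np.arange(256)) + np.sin(0.21 * np.arange(256))
        spectrum = np.abs(np.fft.rfft(ar_extrapolate(frame)))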

  • Leaky-FXLMS algorithm: stochastic analysis for Gaussian data and secondary path modeling error

    Page(s): 1217 - 1230

    This paper presents a stochastic analysis of the leaky filtered-X least-mean-square (LFXLMS) algorithm. The leaky version of the adaptive algorithm is used in practical implementations to reduce undesirable effects such as those caused by numerical errors in finite-precision machines and overload of the secondary source. Based on new analysis assumptions, instead of the ordinary independence theory frequently used in classical LMS analysis, an analytical model for the first and second moments of the adaptive filter weights is derived. In addition, the proposed theoretical models cover the situation in which the secondary path is imperfectly modeled. Experimental results demonstrate the accuracy of the proposed model as compared with the classical analysis.
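
    The algorithm under analysis, in sketch form. The leakage factor gamma shrinks the weights at every update; s_hat is the (possibly mismatched) secondary-path estimate used to filter the reference. For brevity this sketch forms the error directly from the desired signal and the filter output, omitting the physical secondary path acting on the output:

        import numpy as np

        def leaky_fxlms(x, d, s_hat, L=64, mu=1e-3, gamma=1e-4):
            # w <- (1 - mu*gamma) * w + mu * e[n] * xf_vec, with xf the
            # reference filtered through the secondary-path estimate.
            xf = np.convolve(x, s_hat)[:len(x)]
            w, xbuf, fbuf = np.zeros(L), np.zeros(L), np.zeros(L)
            e = np.zeros(len(x))
            for n in range(len(x)):
                xbuf = np.roll(xbuf, 1); xbuf[0] = x[n]
                fbuf = np.roll(fbuf, 1); fbuf[0] = xf[n]
                e[n] = d[n] - w @ xbuf                     # residual (simplified path)
                w = (1 - mu * gamma) * w + mu * e[n] * fbuf
            return w, e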

  • Acoustic echo cancellation and doubletalk detection using estimated loudspeaker impulse responses

    Page(s): 1231 - 1237

    In this paper, we present a new approach to acoustic echo cancellation and doubletalk detection for a teleconferencing system whose loudspeaker impulse response can be estimated. The approach is general in the sense that it may be applied to most existing acoustic echo cancellation and doubletalk detection algorithms. We show that the new approach reduces the computational complexity of both the echo cancellation and the doubletalk detection algorithms. Furthermore, numerical examples show that the new approach may also improve echo cancellation and doubletalk detection performance.
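
    One way to read the idea, sketched with NLMS standing in for the "most existing algorithms" the approach applies to: prefilter the far-end signal through the known loudspeaker response, so the adaptive filter only has to model the remaining (room) part of the echo path and can be shorter:

        import numpy as np

        def echo_canceller(far, mic, g_ls, L=128, mu=0.5, eps=1e-6):
            # g_ls: estimated loudspeaker impulse response (given, not adapted).
            x = np.convolve(far, g_ls)[:len(far)]   # far-end as emitted by the loudspeaker
            w, buf = np.zeros(L), np.zeros(L)       # short adaptive filter for the room path
            e = np.zeros(len(far))
            for n in range(len(far)):
                buf = np.roll(buf, 1); buf[0] = x[n]
                e[n] = mic[n] - w @ buf
                w += mu * e[n] * buf / (buf @ buf + eps)   # NLMS update
            return w, e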

  • List of reviewers

    Page(s): 1238 - 1240
  • IEEE Transactions on Speech and Audio Processing Edics

    Page(s): 1241
  • IEEE Transactions on Speech and Audio Processing Information for authors

    Page(s): 1242 - 1243
  • Special issue on objective quality assessment of speech and audio

    Page(s): 1244
  • Special issue on blind signal processing for speech and audio applications

    Page(s): 1245
  • IEEE Odyssey 2006: The Speaker and Language Recognition Workshop

    Page(s): 1246
  • 2005 Index

    Page(s): 1247 - 1260
  • IEEE Signal Processing Society Information

    Page(s): c3

Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
