Speech and Audio Processing, IEEE Transactions on

Issue 5 • Date Jul 2001

  • A real-time implementation of a stereophonic acoustic echo canceler

    Publication Year: 2001, Page(s): 513-523
    Cited by: Papers (17) | Patents (5)

    Teleconferencing systems employ acoustic echo cancelers to reduce echoes that result from the coupling between loudspeaker and microphone. To enhance the sound realism, two-channel audio is necessary. However, stereophonic acoustic echo cancellation (SAEC) is more difficult to solve because of the necessity to uniquely identify two acoustic paths, which becomes problematic since the two excitation signals are highly correlated. In this paper, a wideband stereophonic acoustic echo canceler is presented. The fundamental difficulty of stereophonic acoustic echo cancellation is described and an echo canceler based on a fast recursive least squares (FRLS) algorithm in a subband structure, with equidistant frequency bands, is proposed. The structure has been used in a real-time implementation, with which experiments have been performed. In this paper, simulation results of this implementation on real-life recordings, with 8 kHz bandwidth, are studied. The results clearly verify that the theoretical fundamental problem of SAEC also applies in real-life situations. They also show that more sophisticated adaptive algorithms are needed in the lower frequency regions than in the higher regions.

  • Delayless frequency domain acoustic echo cancellation

    Publication Year: 2001, Page(s): 589-597
    Cited by: Papers (10) | Patents (1)

    The computational complexity of classical time domain gradient-based echo cancellation algorithms might be prohibitively high, due to the very long response of the acoustic transfer functions involved. A reduction in computational complexity can be achieved by using frequency domain or subband algorithms. However, these algorithms introduce an inherent delay in the signal path. The delayed echo has an annoying psychoacoustic effect. Additionally, the delay prevents natural, full-duplex conversation. Moreover, when operated in practical scenarios, using speech signals in actual room acoustic environments, the convergence and tracking properties of the frequency domain algorithms do not compare favorably with those of the NLMS algorithm. This is because the range of values of the convergence constant that support a stable filter is more restrictive for the frequency domain algorithms. In this study we introduce a new algorithm termed delayless frequency domain (DLFD). The DLFD exhibits performance comparable to that of the NLMS algorithm with a computational complexity comparable to that of standard frequency domain algorithms and without the processing delay.
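
The abstract above uses the time-domain NLMS algorithm as its reference point. As a rough illustration of that baseline only (this is not the DLFD algorithm, whose details are not given in the listing), a minimal NLMS echo canceler might look like the sketch below. The function names, the four-tap echo path, and the step size `mu=0.5` are assumptions chosen for the example.

```python
import random


def nlms_echo_cancel(far_end, mic, num_taps=8, mu=0.5, eps=1e-8):
    """Time-domain NLMS sketch: returns (residual_signal, final_weights)."""
    w = [0.0] * num_taps        # adaptive estimate of the echo path
    x_buf = [0.0] * num_taps    # most recent far-end samples, newest first
    errors = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x_buf))   # estimated echo
        e = d - y                                      # residual after cancellation
        norm = sum(xi * xi for xi in x_buf) + eps      # input power (regularized)
        step = mu * e / norm
        w = [wi + step * xi for wi, xi in zip(w, x_buf)]
        errors.append(e)
    return errors, w


def simulate_echo(far_end, echo_path):
    """Hypothetical room: microphone signal is far_end convolved with echo_path."""
    mic = []
    for n in range(len(far_end)):
        mic.append(sum(h * far_end[n - k]
                       for k, h in enumerate(echo_path) if n - k >= 0))
    return mic


# Assumed test signal: white noise through a short, noise-free echo path.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, -0.3, 0.2, 0.1]
mic = simulate_echo(far_end, echo_path)
errors, weights = nlms_echo_cancel(far_end, mic, num_taps=4, mu=0.5)
```

Each update costs on the order of `num_taps` multiplications per sample, which is the cost that grows prohibitive for the very long acoustic responses the abstract mentions.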

  • Estimation of the spectral envelope of voiced sounds using a penalized likelihood approach

    Publication Year: 2001, Page(s): 469-481
    Cited by: Papers (8)

    Estimation of the spectral envelope (magnitude of the transfer function) of a filter driven by a periodic signal is a long-standing problem in speech and audio processing. Recently, there has been a renewed interest in this issue in connection with the rapid developments of processing techniques based on sinusoidal modeling. In this paper, we introduce a new performance criterion for spectral envelope fitting which is based on the statistical analysis of the behavior of the empirical sinusoidal magnitude estimates. We further show that penalization is an efficient approach to control the smoothness of the estimation envelope. In low-noise situations, the proposed method can be approximated by a two-step weighted least-squares procedure which also provides an interesting insight into the limitations of the previously proposed “discrete cepstrum” approach. A systematic simulation study confirms that the proposed methods perform significantly better than existing ones for high pitched and noisy signals.

  • Recursive coding of spectrum parameters

    Publication Year: 2001, Page(s): 492-503
    Cited by: Papers (28)

    A theoretical analysis of recursive speech spectrum coding, where predictive and finite state schemes are special cases, is presented. We evaluate the spectral distortion (SD) theoretically and design coders that minimize the SD. The analysis rests on three cornerstones: high-rate theory, PDF modeling, and an approximation of SD. A derivation of the mean L2-norm distortion of a recursive quantizer operating at high rate is provided. Also, the distortion distribution is supplied. The evaluation of the distortion expressions requires a model of the joint PDF of two consecutive spectrum vectors. The LPC spectrum source considered here has outcomes in a bounded region, and this is taken into account in the choice of model and modeling algorithm. It is further shown how to approximate the SD with an L2-norm measure. Combining the results, we show theoretically that 16 bits are needed to achieve an average SD of 1 dB when quantizing ten-dimensional (10-D) spectrum vectors using a first-order recursive scheme. A gain of six bits per frame is noted compared to memoryless quantization. These results rely on high-rate assumptions which are validated in experiments. There, actual high-rate optimal coders are designed and evaluated.
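
For readers unfamiliar with the SD figure quoted above, a commonly used definition of log spectral distortion is given below. The abstract does not state which exact form the paper adopts, so the integration limits and the symbols $P(\omega)$ (original LPC power spectrum) and $\hat{P}(\omega)$ (quantized LPC power spectrum) are assumptions; the paper's own L2-norm approximation is not reproduced here.

```latex
\mathrm{SD} = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}
  \left[\,10\log_{10} P(\omega) - 10\log_{10} \hat{P}(\omega)\right]^{2} d\omega}
  \quad \text{[dB]}
```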

  • A bitstream-based front-end for wireless speech recognition on IS-136 communications system

    Publication Year: 2001, Page(s): 558-568
    Cited by: Papers (21) | Patents (31)

    We propose a feature extraction method for a speech recognizer that operates in digital communication networks. The feature parameters are basically extracted by converting the quantized spectral information of a speech coder into a cepstrum. We also include the voiced/unvoiced information obtained from the bitstream of the speech coder in the recognition feature set. We performed speaker-independent connected digit HMM recognition experiments under clean, background noise, and channel impairment conditions. From these results, we found that the speech recognition system employing the proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals. Its performance is comparable to that of a wireline recognition system that uses the cepstrum as a feature set. Next, we extended the evaluation of the proposed bitstream-based front-end to large vocabulary speech recognition with a name database. The recognition results proved that the proposed bitstream-based front-end also gives a comparable performance to the conventional wireline front-end.

  • Computer-aided analysis and design for spoken dialogue systems based on quantitative simulations

    Publication Year: 2001, Page(s): 534-548
    Cited by: Papers (7)

    In this paper, a complete development of computer-aided analysis and design approaches for spoken dialogue systems based on quantitative simulations is presented. With this approach the various performance metrics of a dialogue system can be flexibly defined and numerically evaluated, such that the behavior and performance of the dialogue system can be well predicted and efficiently analyzed before the implementation of the real spoken dialogue system is completed. How the different dialogue performance measures vary with respect to each of the many very complicated factors, regardless of whether the variation is caused by an individual component, by the overall system design, or by users' response patterns, can be separately identified, because all such factors can be precisely controlled in the simulation. Several analysis examples are presented to show how the approach can be used, including selection and tuning of the speech understanding front end, system strategy design considering query factors and confirmation factors, and objective estimates of the user's degree of satisfaction. This approach is therefore very useful for the analysis and design of spoken dialogue systems, although online tests, corpus-based analysis and user surveys can always follow after the system is online.

  • A very low bit rate speech coder based on a recognition/synthesis paradigm

    Publication Year: 2001, Page(s): 482-491
    Cited by: Papers (11) | Patents (7)

    Previous studies have shown that a concatenative speech synthesis system with a large database produces more natural sounding speech. We apply this paradigm to the design of improved very low bit rate speech coders (sub 1000 b/s). The proposed speech coder consists of unit selection, prosody coding, prosody modification and waveform concatenation. The encoder selects the best unit sequence from a large database and compresses the prosody information. The transmitted parameters include unit indices and the prosody information. To increase naturalness as well as intelligibility, two costs are considered in the unit selection process: an acoustic target cost and a concatenation cost. A rate-distortion-based piecewise linear approximation is proposed to compress the pitch contour. The decoder concatenates the set of units, and then synthesizes the resultant sequence of speech frames using the harmonic+noise model (HNM) scheme. Before concatenating units, prosody modification, which includes pitch shifting and gain modification, is applied to match the prosody of the input speech. With single speaker stimuli, a comparison category rating (CCR) test shows that the performance of the proposed coder is close to that of the 2400 b/s MELP coder at an average bit rate of about 800 b/s during talk spurts.

  • Blind signal separation using overcomplete subband representation

    Publication Year: 2001, Page(s): 524-533
    Cited by: Papers (19) | Patents (1)

    This paper discusses a multirate filterbank-based extended infomax algorithm for real-world signal separation, i.e., separation of convolved mixtures. Since convolution in the time domain corresponds to instantaneous mixing in the frequency domain, polyphase subband projection naturally becomes an efficient alternative to the Fourier transform based frequency domain approach. The proposed online implementation features simultaneous inverse channel identification in the frequency domain and signal filtering in the time domain. It is shown that an over-representation structure reduces aliasing between different bands and results in more accurate inverse channel estimates. Therefore, it provides better performance than the Fourier transform based structure in the measures of both separation and distortion. The performance limitation of the method is also evaluated in terms of the Wiener solution.

  • Noise power spectral density estimation based on optimal smoothing and minimum statistics

    Publication Year: 2001, Page(s): 504-512
    Cited by: Papers (362) | Patents (32)

    We describe a method to estimate the power spectral density of nonstationary noise when a noisy speech signal is given. The method can be combined with any speech enhancement algorithm which requires a noise power spectral density estimate. In contrast to other methods, our approach does not use a voice activity detector. Instead it tracks spectral minima in each frequency band without any distinction between speech activity and speech pause. By minimizing a conditional mean square estimation error criterion in each time step we derive the optimal smoothing parameter for recursive smoothing of the power spectral density of the noisy speech signal. Based on the optimally smoothed power spectral density estimate and the analysis of the statistics of spectral minima an unbiased noise estimator is developed. The estimator is well suited for real time implementations. Furthermore, to improve the performance in nonstationary noise we introduce a method to speed up the tracking of the spectral minima. Finally, we evaluate the proposed method in the context of speech enhancement and low bit rate speech coding with various noise types.
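
The core idea in the abstract above, smoothing the noisy power spectrum recursively and then tracking its minimum over a sliding window, can be sketched for a single frequency bin as follows. This is a deliberately simplified illustration: it uses a fixed smoothing constant `alpha` and applies no bias compensation, whereas the paper derives a time-varying optimal smoothing parameter and an unbiased estimator. The function name, `alpha=0.85`, the 50-frame window, and the synthetic input are all assumptions.

```python
from collections import deque


def track_noise_floor(periodogram, alpha=0.85, window=50):
    """Per-frame noise power estimates for one frequency bin (simplified sketch)."""
    smoothed = periodogram[0]
    recent = deque(maxlen=window)   # last `window` smoothed power values
    estimates = []
    for p in periodogram:
        # First-order recursive smoothing with a FIXED alpha (assumption).
        smoothed = alpha * smoothed + (1.0 - alpha) * p
        recent.append(smoothed)
        # The minimum over the window is taken as the noise floor;
        # no voice activity detector is involved.
        estimates.append(min(recent))
    return estimates


# Synthetic input: constant noise power 1.0 with a loud burst in frames 100..119.
frames = [1.0] * 300
for k in range(100, 120):
    frames[k] = 25.0
noise = track_noise_floor(frames, alpha=0.85, window=50)
```

Because the burst is shorter than the window, the minimum stays near the noise level while the burst is active. With real, randomly fluctuating noise the minimum of the smoothed power underestimates the mean noise power, which is the bias the paper's estimator is designed to correct.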

  • Cross-updated active noise control system with online secondary path modeling

    Publication Year: 2001, Page(s): 598-602
    Cited by: Papers (56)

    A good active noise control (ANC) system with online secondary path modeling should have the property that the operation of the ANC controller and the modeling of the secondary path are mutually independent. A new ANC system with online secondary path modeling is presented. Three cross-updated least mean square (LMS) adaptive filters are used to reduce mutual disturbances between the operation of the ANC controller and the modeling of the secondary path. Computer simulations have been conducted and the results show that the proposed method is able to produce superior performance compared to existing methods.

  • A maximum a posteriori approach to speaker adaptation using the trended hidden Markov model

    Publication Year: 2001, Page(s): 549-557
    Cited by: Papers (5)

    A formulation of the maximum a posteriori (MAP) approach to speaker adaptation is presented with use of the trended or nonstationary-state hidden Markov model (HMM), where the Gaussian means in each HMM state are characterized by time-varying polynomial trend functions of the state sojourn time. Assuming uncorrelatedness among the polynomial coefficients in the trend functions, we have obtained analytical results for the MAP estimates of the parameters including time-varying means and time-invariant precisions. We have implemented a speech recognizer based on these results in speaker adaptation experiments using the TI46 corpora. The experimental evaluation demonstrates that the trended HMM, with use of either the linear or the quadratic polynomial trend function, consistently outperforms the conventional, stationary-state HMM. The evaluation also shows that the unadapted, speaker-independent models are outperformed by the models adapted by the MAP procedure under supervision with as few as a single adaptation token. Further, adaptation of polynomial coefficients alone is shown to be better than adapting both polynomial coefficients and precision matrices when fewer than four adaptation tokens are used, while the reverse is found with a greater number of adaptation tokens.

  • A comparison of warped and conventional linear predictive coding

    Publication Year: 2001, Page(s): 579-588
    Cited by: Papers (21) | Patents (10)

    Frequency-warped signal processing techniques are attractive to many wideband speech and audio applications since they have a clear connection to the frequency resolution of human hearing. A warped version of linear predictive coding (LPC) is studied. The performance of conventional and warped LPC algorithms is compared in a simulated coding system using listening tests and conventional technical measures. The results indicate that the use of warped techniques is beneficial especially in wideband coding and may result in savings of one bit per sample compared to the conventional algorithm while retaining the same subjective quality.

  • A detection approach to search-space reduction for HMM state alignment in speaker verification

    Publication Year: 2001, Page(s): 569-578
    Cited by: Papers (6) | Patents (1)

    To support speaker verification (SV) in portable devices and in telephone servers with millions of users, a fast algorithm for hidden Markov model (HMM) alignment is necessary. Currently, the most popular algorithm is the Viterbi (1967) algorithm with beam search to reduce the search space; however, it is difficult to determine a suitable beam width beforehand. A small beam width may miss the optimal path while a large one may slow down the alignment. To address the problem, we propose a nonheuristic approach to reduce the search space. Following the definition of the left-to-right HMM, we first detect the possible change-points between HMM states in a forward-and-backward scheme, then use the change-points to enclose a subspace for searching. The Viterbi algorithm or any other search algorithm can then be applied to the subspace to find the optimal state alignment. Compared to a full-search algorithm, the proposed algorithm is about four times faster, while the accuracy is still slightly better in an SV task; compared to the beam search algorithm, the proposed algorithm can provide better accuracy with even lower complexity. In short, for an HMM with S states, the computational complexity can be reduced by up to a factor of S/3 with slightly better accuracy than in a full-search approach. This paper also discusses how to extend the change-point detection approach to large-vocabulary continuous speech recognition.
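
To make the idea of searching only an enclosed subspace concrete, the sketch below runs Viterbi alignment for a left-to-right HMM, optionally restricted to a per-frame band of states. The band used here is written by hand; the paper's forward-and-backward change-point detector, which would produce such a band, is not described in the abstract and is not reproduced. The function name, the `allowed` argument, and the toy log-likelihoods are assumptions.

```python
import math


def viterbi_left_to_right(log_lik, stay_logp, move_logp, allowed=None):
    """Best state path for a left-to-right HMM, from state 0 to the last state.

    log_lik[t][s] is the log-likelihood of frame t in state s.
    allowed[t] = (lo, hi) is the inclusive state range searched at frame t;
    None means a full search over all states.
    """
    T, S = len(log_lik), len(log_lik[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_lik[0][0]                 # alignment must start in state 0
    for t in range(1, T):
        lo, hi = allowed[t] if allowed else (0, S - 1)
        for s in range(lo, hi + 1):             # states outside the band are skipped
            stay = score[t - 1][s] + stay_logp
            move = score[t - 1][s - 1] + move_logp if s > 0 else neg_inf
            if move > stay:
                score[t][s], back[t][s] = move + log_lik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + log_lik[t][s], s
    path = [S - 1]                              # alignment must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy data: 6 frames, 3 states; each frame clearly prefers one state.
truth = [0, 0, 1, 1, 2, 2]
log_lik = [[0.0 if s == truth[t] else -5.0 for s in range(3)] for t in range(6)]
log_half = math.log(0.5)

full_path = viterbi_left_to_right(log_lik, log_half, log_half)

# Hand-written band around assumed change-points near frames 2 and 4.
band = [(0, 0), (0, 1), (0, 1), (1, 2), (1, 2), (2, 2)]
banded_path = viterbi_left_to_right(log_lik, log_half, log_half, allowed=band)
```

In this toy case the full search evaluates 18 frame-state cells and the banded search evaluates 10, and both return the same alignment because the band contains the optimal path. If the band excluded that path, the banded result would no longer be optimal.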

Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.