
Speech and Audio Processing, IEEE Transactions on

Issue 5 • Date Sep 1995

  • Robust speech recognition training via duration and spectral-based stress token generation

    Publication Year: 1995 , Page(s): 415 - 421
    Cited by:  Papers (11)  |  Patents (1)

    It is known that speech recognition performance degrades if systems are not trained and tested under similar speaking conditions. This is particularly true if a speaker is exposed to demanding workload stress or noise. For recognition systems to be successful in applications susceptible to stress, speech recognizers should address the adverse conditions experienced by the user. The authors consider the problem of improved training for speech recognition under various stressed speaking conditions (e.g., slow, loud, and Lombard effect speaking styles). The main objective is to devise a training procedure that produces a hidden Markov model recognizer that better characterizes a given stressed speaking style, without the need for directly collecting such stressed data. The novel approach is to construct a word production model using a previously suggested source generator framework [Hansen 1994], by employing knowledge of the statistical nature of duration and spectral variation of speech under stress. This model is used in turn to produce simulated stressed speech training tokens from neutral speech tokens. The token generation training method is shown to improve isolated word recognition by 24% for Lombard speech when compared to a neutral trained isolated word recognizer. Further results are reported for isolated and keyword recognition scenarios.

  • Theoretical analysis of the high-rate vector quantization of LPC parameters

    Publication Year: 1995 , Page(s): 367 - 381
    Cited by:  Papers (86)  |  Patents (4)

    The paper presents a theoretical analysis of high-rate vector quantization (VQ) systems that use suboptimal, mismatched distortion measures, and describes the application of the analysis to the problem of quantizing the linear predictive coding (LPC) parameters in speech coding systems. First, it is shown that in many high-rate VQ systems the quantization distortion approaches a simple quadratically weighted error measure, where the weighting matrix is a “sensitivity matrix” that is an extension of the concept of the scalar sensitivity. The approximate performance of VQ systems that train and quantize using mismatched distortion measures is derived, and is used to construct better distortion measures. Second, these results are used to determine the performance of LPC vector quantizers, as measured by the log spectral distortion (LSD) measure, which have been trained using other error measures, such as mean-squared (MSE) or weighted mean-squared error (WMSE) measures of LPC parameters, reflection coefficients and transforms thereof, and line spectral pair (LSP) frequencies. Computationally efficient algorithms for computing the sensitivity matrices of these parameters are described. In particular, it is shown that the sensitivity matrix for the LSP frequencies is diagonal, implying that a WMSE measure of LSP frequencies converges to the LSD measure in high-rate VQ systems. Experimental results to support the theoretical performance estimates are provided.

  • Source generator equalization and enhancement of spectral properties for robust speech recognition in noise and stress

    Publication Year: 1995 , Page(s): 407 - 415
    Cited by:  Papers (8)

    Studies have shown that depending on speaker task and environmental conditions, recognizers are sensitive to noisy stressful environments. The focus of the study is to achieve robust recognition in diverse environmental conditions through the formulation of feature enhancement and stress equalization algorithms under the framework of source generator theory. The generator framework is considered as a means of modeling production variation under stressful speaking conditions. A multi-dimensional stress equalization procedure is formulated that produces recognition features less sensitive to varying factors caused by stress. A feature enhancement algorithm is employed based on iterative techniques previously derived for enhancement of speech in varying background noise environments. Combined stress equalization and feature enhancement reduces average word error rates by 38.7% across 10 noisy stressful conditions (e.g., noisy loud, angry, and Lombard effect stress conditions). The results suggest that the combination of a flexible source generator framework to address stressed speaking conditions, and a feature enhancement algorithm that adapts based on speech-specific constraints, can be effective in reducing the consequences of stress and noise for robust automatic recognition.

  • Determination of instants of significant excitation in speech using group delay function

    Publication Year: 1995 , Page(s): 325 - 333
    Cited by:  Papers (42)  |  Patents (1)

    A new method for determining the instants of significant excitation in speech signals is proposed. In the paper, significant excitation refers primarily to the instant of glottal closure within a pitch period in voiced speech. The method is based on the global phase characteristics of minimum phase signals. The average slope of the unwrapped phase of the short-time Fourier transform of the linear prediction residual is calculated as a function of time. Instants where the phase slope function makes a positive zero-crossing are identified as significant excitations. The method is discussed in a source-filter context of speech production. The method is not sensitive to the characteristics of the filter. The influence of the type, length, and position of the analysis window is discussed. The method works well for all types of voiced speech in male as well as female speech but, in all cases, under noise-free conditions only.

  • Spectral shape analysis in the central auditory system

    Publication Year: 1995 , Page(s): 382 - 395
    Cited by:  Papers (27)  |  Patents (1)

    A model of spectral shape analysis in the central auditory system is developed based on neurophysiological mappings in the primary auditory cortex and on results from psychoacoustical experiments in human subjects. The model suggests that the auditory system analyzes an input spectral pattern along three independent dimensions: a logarithmic frequency axis, a local symmetry axis, and a local spectral bandwidth axis. It is shown that this representation is equivalent to performing an affine wavelet transform of the spectral pattern and preserving both the magnitude (a measure of the scale or local bandwidth of the spectrum) and phase (a measure of the local symmetry of the spectrum). Such an analysis is in the spirit of the cepstral analysis commonly used in speech recognition systems, the major difference being that the double Fourier-like transformation that the auditory system employs is carried out in a local fashion. Examples of such a representation for various speech and synthetic signals are discussed, together with its potential significance and applications for speech and audio processing.

  • Two-tone suppression in a cochlear model

    Publication Year: 1995 , Page(s): 396 - 406
    Cited by:  Papers (1)

    Two-tone rate suppression is a nonlinear property of the cochlea in which the total neural firing rate in the region most sensitive to a probe tone is reduced by the addition of a second (suppressor) tone at a different frequency. A cochlear model featuring a cascade of adaptive-Q filter sections was investigated to determine its ability to reproduce two-tone suppression. It was found that the adaptive-Q filter sections are adequate to reproduce two-tone suppression when the suppressor is located higher in frequency than the probe tone but are not adequate to reproduce suppression for a suppressor lower in frequency than the probe. A modification to the cochlear model, in which the output signal level is adjusted in response to the estimated peak signal levels in the regions of the cochlear filter tip and tail, was much more accurate in reproducing two-tone suppression behavior. A speech example is also presented, and the implications of two-tone suppression in applications of the cochlear model are discussed.

  • Dependence of opinion scores on listening sets used in degradation category rating assessments

    Publication Year: 1995 , Page(s): 421 - 424

    The effect of using headphones versus telephone handsets is characterized for degradation mean opinion score (DMOS) assessments. From this, it is derived that differences in opinion scores are dependent on the type of listening instrument employed. Further, in the middle of the quality range, opinion scores derived from tests employing telephone handsets are higher than those derived from tests employing headphones.

  • Weighted RLS adaptive beamformer with initial directivity

    Publication Year: 1995 , Page(s): 424 - 428

    For the effective reduction of noise that is suddenly switched on or transient, attenuation is required from its onset. A weighted recursive least squares approach, where an extra term providing an initial directivity is combined with the power minimization term, is proposed. A noise reduction system is also proposed.

  • Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition

    Publication Year: 1995 , Page(s): 334 - 345
    Cited by:  Papers (23)

    A theoretical framework for Bayesian adaptive training of the parameters of a discrete hidden Markov model (DHMM) and of a semi-continuous HMM (SCHMM) with Gaussian mixture state observation densities is presented. In addition to formulating the forward-backward MAP (maximum a posteriori) and the segmental MAP algorithms for estimating the above HMM parameters, a computationally efficient segmental quasi-Bayes algorithm for estimating the state-specific mixture coefficients in SCHMM is developed. For estimating the parameters of the prior densities, a new empirical Bayes method based on the moment estimates is also proposed. The MAP algorithms and the prior parameter specification are directly applicable to training speaker adaptive HMMs. Practical issues related to the use of the proposed techniques for HMM-based speaker adaptation are studied. The proposed MAP algorithms are shown to be effective especially in the cases in which the training or adaptation data are limited.

  • Speaker adaptation using constrained estimation of Gaussian mixtures

    Publication Year: 1995 , Page(s): 357 - 366
    Cited by:  Papers (118)  |  Patents (5)

    A trend in automatic speech recognition systems is the use of continuous mixture-density hidden Markov models (HMMs). Despite the good recognition performance that these systems achieve on average in large vocabulary applications, there is a large variability in performance across speakers. Performance degrades dramatically when the user is radically different from the training population. A popular technique that can improve the performance and robustness of a speech recognition system is adapting speech models to the speaker, and more generally to the channel and the task. In continuous mixture-density HMMs the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, the authors propose a constrained estimation technique for Gaussian mixture densities. The algorithm is evaluated on the large-vocabulary Wall Street Journal corpus for both native and nonnative speakers of American English. For nonnative speakers, the recognition error rate is approximately halved with only a small amount of adaptation data, and it approaches the speaker-independent accuracy achieved for native speakers. For native speakers, the recognition performance after adaptation improves to the accuracy of speaker-dependent systems that use six times as much training data.

  • Automatic word recognition in cars

    Publication Year: 1995 , Page(s): 346 - 356
    Cited by:  Papers (17)  |  Patents (2)

    The paper compares, on a database recorded in a car, a number of signal analysis and speech enhancement techniques as well as some approaches to adapt speech recognition systems. It is shown that a new nonlinear spectral subtraction associated with Mel frequency cepstral coefficients (MFCC) is an adequate compromise for low-cost integration. The Lombard effect is analyzed and simulated. Such a simulation is used to derive realistic training utterances from noise-free utterances. Adapting a continuous-density hidden Markov model (CDHMM) to these artificially generated training samples yields a very high performance with respect to that achieved within the ESPRIT adverse environment recognition of speech (ARS) project, i.e., an average of 1% error for all driving conditions. Finally, the paper shows, both theoretically and experimentally, that whatever the noise estimation technique is, it is better to add this noise estimate to the reference clean models than to subtract it from the noisy data.


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
