IEEE Transactions on Speech and Audio Processing

Issue 2 • March 2005

  • Table of contents

    Page(s): c1
  • IEEE Transactions on Speech and Audio Processing publication information

    Page(s): c2
  • Perceptual segmentation and component selection for sinusoidal representations of audio

    Page(s): 149 - 162

    This paper presents two fundamental enhancements to a hybrid audio signal model consisting of sinusoidal, transient, and noise (STN) components. The first enhancement involves a novel application of a perceptual metric for optimal time segmentation for the analysis of transients. In particular, Moore and Glasberg's model of partial loudness is modified for use with general signals and then integrated into a novel time segmentation scheme. The second, and perhaps more significant, STN enhancement concerns a new methodology for ranking and selecting the most perceptually relevant sinusoids. A systematic procedure is developed for the selection of a compact set of sinusoids, and comparative results are given to demonstrate the merit of this method.

  • Companded quantization of speech MDCT coefficients

    Page(s): 163 - 173

    We propose speech-coding procedures that achieve high subjective quality while avoiding speech-specific processing and the exploitation of interframe dependencies. The scheme is therefore tractable for packet-based voice communication and capable of coding generic audio. The architecture is based on a modified discrete cosine transform (MDCT) representation of the signal, and combines efficient vector quantization (VQ) techniques with psychoacoustic principles. Weighted quantization of MDCT coefficients is performed using a codebook based on a statistical model of the multidimensional MDCT pdf. The weighting and the codebook are adapted for each frame to account for masking thresholds given by a psychoacoustic analysis. Actual quantization is performed using lattices, thereby achieving close to rate-independent complexity. The result is a coding scheme operational over a range of rates. A particular instance at 16 kbit/s, using a sampling frequency of 8 kHz, is shown to perform better than an LD-CELP coder operating at the same rate, even though no interframe memory is exploited.
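
    As a rough illustration of the perceptually weighted quantization idea, the sketch below quantizes MDCT coefficients with step sizes scaled by a masking threshold. It uses a plain scalar quantizer and a placeholder masking curve as stand-ins for the paper's lattice VQ and psychoacoustic analysis; all names and constants are illustrative.

```python
import numpy as np

def mdct(frame, window):
    """Naive MDCT of one windowed frame of length 2N (illustrative, not optimized)."""
    N = len(frame) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ (window * frame)

def perceptually_weighted_quantize(coeffs, mask_threshold, base_step=0.5):
    """Quantize MDCT coefficients with step sizes scaled by a masking threshold:
    coefficients under strong masking get coarser steps (i.e., fewer bits)."""
    steps = base_step * np.sqrt(mask_threshold)     # coarser where masking is high
    indices = np.round(coeffs / steps).astype(int)  # scalar quantizer as a stand-in for lattice VQ
    return indices, indices * steps                 # indices to entropy-code, reconstructed values

# toy usage with a placeholder masking curve
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
window = np.sin(np.pi * (np.arange(512) + 0.5) / 512)   # sine window
X = mdct(frame, window)
mask = 0.1 * np.abs(X) ** 2 + 1e-3                       # placeholder, not a real psychoacoustic model
idx, X_hat = perceptually_weighted_quantize(X, mask)
```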

  • Boosting with prior knowledge for call classification

    Page(s): 174 - 181

    This paper describes the use of boosting for call classification in spoken language understanding. An extension to the AdaBoost algorithm is presented that permits the incorporation of prior knowledge of the application, to compensate for the heavy dependence on training data. We give a convergence result for the algorithm, and we describe experiments on four datasets showing that prior knowledge can substantially improve classification performance.
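
    The sketch below shows one plausible way to fold a prior model into boosting: the ensemble score is initialized from the prior model's log-odds before standard AdaBoost rounds begin. This is a hypothetical simplification for illustration, not the paper's exact formulation; `prior_proba` and `eta` are assumed names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_with_prior(X, y, prior_proba, n_rounds=50, eta=1.0):
    """AdaBoost sketch that folds in a prior model by initializing the ensemble score
    from the prior's log-odds (a hypothetical simplification of the paper's method).
    y must be in {-1, +1}; prior_proba is the prior model's P(y = +1 | x)."""
    p = np.clip(prior_proba, 1e-6, 1 - 1e-6)
    f = eta * 0.5 * np.log(p / (1 - p))          # start from the prior model's score
    stumps, alphas = [], []
    for _ in range(n_rounds):
        w = np.exp(-y * f)                       # exponential-loss weights given current score
        w /= w.sum()
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        h = stump.predict(X)
        err = np.clip(w[h != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # usual AdaBoost weak-learner weight
        f += alpha * h
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas
```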

  • Decision tree state tying using cluster validity criteria

    Page(s): 182 - 193

    Decision tree state tying performs divisive clustering that combines the phonetics and acoustics of the speech signal for large vocabulary continuous speech recognition. A tree is built by successively splitting the observation frames of a phonetic unit according to the best phonetic questions. To prevent building over-large tree models, a stopping criterion is required to limit tree growth. Accordingly, it is crucial to choose goodness-of-split criteria for selecting the best questions for node splitting and for testing whether splitting should be terminated. In this paper, we apply Hubert's Γ statistic as the node splitting criterion and the T² statistic as the stopping criterion. Hubert's Γ statistic characterizes the clustering structure in the given data; this cluster validity criterion is adopted to select the best questions for splitting tree nodes. Further, we examine the population closeness of two split nodes at a given significance level. The T² statistic, which follows an F distribution, is used to verify whether the mean vectors of the two nodes are close together; splitting is stopped when this is verified. In experiments on Mandarin speech recognition, the proposed methods achieve better syllable recognition rates with smaller tree models compared to the conventional maximum likelihood and minimum description length criteria.
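
    For concreteness, the following sketch implements a standard two-sample Hotelling T² test (via its F-distributed form) as a stopping rule: if the mean vectors of the two candidate child nodes are not significantly different, splitting stops. The exact statistic and significance handling in the paper may differ.

```python
import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2_stop(X1, X2, alpha=0.05):
    """Two-sample Hotelling T^2 test on the frames assigned to two candidate child nodes.
    Returns True when the node means are NOT significantly different, i.e. stop splitting.
    (A standard formulation; the paper's exact variant may differ.)"""
    n1, p = X1.shape
    n2, _ = X2.shape
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
                (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S_pooled, diff)
    # T^2 maps to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = 1.0 - f_dist.cdf(f_stat, p, n1 + n2 - p - 1)
    return p_value > alpha   # means are "close together": terminate splitting
```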

  • Rapid online adaptation based on transformation space model evolution

    Page(s): 194 - 202

    This paper presents a new approach to online linear regression adaptation of continuous density hidden Markov models based on transformation space model (TSM) evolution. The TSM, which characterizes the a priori knowledge of the training speakers associated with maximum likelihood linear regression matrix parameters, is effectively described in terms of latent variable models such as factor analysis or probabilistic principal component analysis. The TSM provides various sources of information, such as the correlation information, the prior distribution, and the prior knowledge of the regression parameters, that are very useful for rapid adaptation. A quasi-Bayes estimation algorithm is formulated to incrementally update the hyperparameters of the TSM and the regression matrices simultaneously. The proposed TSM evolution is a general framework with batch TSM adaptation as a special case. Experiments on supervised speaker adaptation demonstrate that the proposed approach is more effective than the conventional quasi-Bayes linear regression technique when only a small amount of adaptation data is available.

  • Speaker verification using sequence discriminant support vector machines

    Page(s): 203 - 210

    This paper presents a text-independent speaker verification system using support vector machines (SVMs) with score-space kernels. Score-space kernels generalize Fisher kernels and are based on underlying generative models such as Gaussian mixture models (GMMs). This approach provides direct discrimination between whole sequences, in contrast with the frame-level approaches at the heart of most current systems. The resultant SVMs have a very high dimensionality, since the score space is related to the number of parameters in the underlying generative model. To address the problems that arise in the resultant optimization, we introduce a technique called spherical normalization that preconditions the Hessian matrix. We have performed speaker verification experiments using the PolyVar database. The SVM system presented here reduces the relative error rate by 34% compared to a GMM likelihood ratio system.
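
    A minimal sketch of the score-space idea, using the Fisher score with respect to the GMM means (diagonal covariances assumed): each variable-length utterance is mapped to a fixed-length gradient vector that a linear SVM can then classify. Function and variable names are illustrative, and spherical normalization is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_score_vector(gmm: GaussianMixture, frames: np.ndarray) -> np.ndarray:
    """Map a variable-length sequence of feature frames to one fixed-length vector:
    the gradient of the average log-likelihood w.r.t. the GMM means (one instance of a
    score-space feature; diagonal covariances assumed, spherical normalization omitted)."""
    resp = gmm.predict_proba(frames)                  # (T, K) posteriors gamma_k(x_t)
    diffs = frames[:, None, :] - gmm.means_[None]     # (T, K, D)
    grads = resp[:, :, None] * diffs / gmm.covariances_[None]   # d log p / d mu_k per frame
    return grads.mean(axis=0).ravel()                 # average over frames, flatten to K*D

# hypothetical usage: a background GMM trained on many speakers, linear SVM on score vectors
# ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(background_frames)
# phi = np.vstack([fisher_score_vector(ubm, utt) for utt in utterances])
# svm = sklearn.svm.SVC(kernel="linear").fit(phi, labels)
```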

  • Speaker classification using composite hypothesis testing and list decoding

    Page(s): 211 - 219

    Speaker classification is cast as a hypothesis testing problem with J simple hypotheses and one composite hypothesis. The simple hypotheses represent target speakers, while the composite hypothesis represents nontarget speakers. The simple hypotheses have well-defined distributions that are estimated from training signals. The distribution of the signal under the composite hypothesis is assumed to belong to a given family whose parameter is random, with a prior distribution estimated from a large set of speakers. This formulation converts the problem to one of testing J+1 simple hypotheses. Signals corresponding to target and nontarget speakers are assumed to be Gaussian mixture processes. Once the system has been trained, list decoding is applied, in which a test signal is associated with a list of possible speakers. The probability that the correct speaker is on the list is maximized for a given average number of incorrect speakers on the list. Results from speaker identification and speaker verification experiments are reported. In speaker identification using a National Institute of Standards and Technology (NIST) database with 174 target speakers, over 77% correct identification was achieved with an average of fewer than two erroneous speakers on the list. Speaker verification experiments on a similar database yielded equal-error rates of 6.7% and 10.1% using two decision rules.
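
    The list-decoding step can be sketched as below: each target model is scored against the test signal, a nontarget (composite) model supplies the normalization, and every speaker whose likelihood ratio exceeds a threshold is placed on the list. The threshold, which controls the expected number of incorrect speakers on the list, is a free parameter here rather than the optimized value from the paper.

```python
def list_decode(frames, target_gmms, nontarget_gmm, threshold=0.0):
    """List-decoding sketch: return every target speaker whose average log-likelihood
    ratio against the nontarget (composite) model exceeds a threshold. Models are assumed
    to expose score_samples(), e.g. sklearn GaussianMixture; the threshold trades the
    chance that the true speaker is listed against the expected list length."""
    background = nontarget_gmm.score_samples(frames).mean()
    ratios = {spk: gmm.score_samples(frames).mean() - background
              for spk, gmm in target_gmms.items()}
    return sorted((s for s, r in ratios.items() if r > threshold),
                  key=lambda s: -ratios[s])
```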

  • Entropy-constrained polar quantization and its application to audio coding

    Page(s): 220 - 232

    In this work, we present a new method for quantization of sinusoidal amplitudes and phases, and apply the method to sinusoidal coding of speech and audio signals. The method is based on unrestricted polar quantization, where phase quantization accuracy depends on amplitude. Amplitude and phase quantizers are derived under an entropy (average rate) constraint using high-rate assumptions. First, we derive optimal quantizers for one sinusoid and a mean-squared error distortion measure. We provide a detailed analysis of entropy-constrained unrestricted polar quantization, showing its high performance and practicality even at low rates. Second, we find optimal quantizers for a set of sinusoids that model a short segment of an audio signal. The optimization is performed using a weighted error measure that can account for the masking effect in the human auditory system. We find the optimal rate distribution between sinusoids, as well as the corresponding optimal amplitude and phase quantizers, based on the perceptual importance of sinusoids defined by masking. The new method is used in an audio-coding application and is shown to significantly outperform a conventional sinusoidal quantization method where phase quantization accuracy is identical for all sinusoids.
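
    The core idea, amplitude-dependent phase accuracy, can be illustrated with the toy quantizer below: amplitudes fall on a uniform grid, and the number of phase cells grows with the quantized amplitude. The step-size and resolution constants are illustrative, not the entropy-constrained optima derived in the paper.

```python
import numpy as np

def polar_quantize(amplitude, phase, amp_step=0.05, phase_resolution=64.0):
    """Unrestricted polar quantization sketch: amplitudes on a uniform grid, and a phase
    step that shrinks as the quantized amplitude grows (larger components get finer phase).
    Step-size choices are illustrative, not the paper's entropy-constrained optima."""
    a_idx = np.round(np.asarray(amplitude) / amp_step).astype(int)
    a_hat = a_idx * amp_step
    n_phase = np.maximum(1, np.round(phase_resolution * a_hat)).astype(int)   # phase cells per ring
    p_idx = np.round(np.asarray(phase) * n_phase / (2 * np.pi)).astype(int) % n_phase
    p_hat = 2 * np.pi * p_idx / n_phase
    return (a_idx, p_idx), (a_hat, p_hat)
```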

  • Observer-based feedback linearization of dynamic loudspeakers with ac amplifiers

    Page(s): 233 - 242

    A variety of approaches exist for reducing the nonlinear distortion produced by a dynamic loudspeaker, one of them using the known principle of exact input-output linearization. In combination with a discrete state-space observer, this approach can successfully be applied as long as the amplifier driving the loudspeaker is dc-coupled. This paper discusses how ac amplifiers can be used in this context. It is shown that an exact linearization is in general not possible with ac amplifiers due to insufficient stability of the feedback system. Two approaches are presented that abandon the pursuit of an exact linearization in favor of stability and yield satisfactory results when implemented on a floating-point digital signal processor.

  • A bio-inspired companding strategy for spectral enhancement

    Page(s): 243 - 253

    This work presents a compressing-and-expanding, i.e., companding, strategy for spectral enhancement inspired by the operation of the auditory system. The companding strategy simulates the two-tone suppression phenomena of the auditory system and implements a soft local winner-take-all-like enhancement of the input spectrum. It performs multichannel syllabic compression without degrading spectral contrast. The companding strategy works in an analog fashion without explicit decision making, without the use of the fast Fourier transform, and without any cross-coupling between spectral channels. The strategy is useful in cochlear-implant processors, hearing aids, and speech recognition for enhancing spectral contrast. It is well suited to low-power analog circuit implementations.
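
    A single channel of such a compress-then-expand scheme might look like the sketch below: the channel compresses using the envelope of a broad bandpass filter and then expands using the envelope of a narrower filter at the same center frequency, which is what yields the two-tone-suppression-like behavior. Filter orders, bandwidths, and the compression exponent are illustrative, not the paper's values.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def companding_channel(x, fs, fc, n=0.3, bw_broad=0.35, bw_narrow=0.12):
    """One channel of a compress-then-expand sketch: compress with the envelope of a broad
    bandpass filter, then expand with the envelope of a narrower filter at the same center
    frequency. Filter orders, bandwidths and the exponent n are illustrative values."""
    def bandpass(sig, rel_bw):
        lo, hi = fc * (1 - rel_bw / 2), fc * (1 + rel_bw / 2)
        sos = butter(2, [lo, hi], btype="band", fs=fs, output="sos")
        return sosfilt(sos, sig)

    def envelope(sig, tau=0.01):
        a = np.exp(-1.0 / (tau * fs))        # one-pole rectified envelope follower
        env, e = np.empty_like(sig), 0.0
        for i, v in enumerate(np.abs(sig)):
            e = a * e + (1 - a) * v
            env[i] = e
        return np.maximum(env, 1e-6)

    broad = bandpass(x, bw_broad)
    compressed = broad * envelope(broad) ** (n - 1)       # compressive: gain falls as level rises
    narrow = bandpass(compressed, bw_narrow)
    return narrow * envelope(narrow) ** ((1 - n) / n)     # expand back toward an overall unity slope
```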

  • Multiresolution sinusoidal model with dynamic segmentation for timescale modification of polyphonic audio signals

    Page(s): 254 - 262

    In this paper, we propose an efficient sinusoidal model of polyphonic audio signals that is especially well suited to timescale modification. One of the critical problems of sinusoidal modeling is that the signal is smeared over the synthesis frame, a highly undesirable effect for transient parts. We solve this problem by introducing multiresolution analysis-synthesis and dynamic segmentation methods. A signal is modeled with a sinusoidal component and a noise component. A multiresolution filter bank splits the input signal into octave-spaced subbands without causing aliasing, and sinusoidal analysis is then applied to each subband signal. To alleviate smearing of transients during synthesis, a dynamic segmentation method is applied to the subband signals that adaptively determines the optimal analysis-synthesis frame size to fit their time-frequency characteristics. To extract sinusoidal components and calculate their parameters, a matching pursuit algorithm is applied to each analysis frame of the subband signal. A psychoacoustic model implementing frequency masking is incorporated with matching pursuit to provide a reasonable stopping condition for the iteration and to reduce the number of sinusoids. The noise component, obtained by subtracting the synthesized sinusoidal signal from the original signal, is modeled by a line-segment model of the short-time spectral envelope. For various polyphonic audio signals, simulation results show that the proposed sinusoidal modeling can synthesize the original signals without loss of perceptual quality and performs more robust, high-quality timescale modification for large scale factors.
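
    A bare-bones version of the per-frame matching pursuit might look like the following: the residual is repeatedly correlated with a grid of candidate frequencies, the best sinusoid is subtracted, and iteration stops on a residual-energy threshold. That threshold stands in for the masking-based stopping condition described in the paper, and the frequency grid and projection are simplified.

```python
import numpy as np

def sinusoidal_matching_pursuit(frame, fs, n_max=30, stop_db=-40.0):
    """Matching-pursuit sketch over a grid of candidate frequencies: repeatedly pick the
    frequency best correlated with the residual and subtract that sinusoid. The residual
    energy threshold stands in for the paper's masking-based stopping condition."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    t = np.arange(n) / fs
    freqs = np.linspace(0.0, fs / 2, 2 * n, endpoint=False)[1:]   # candidate frequency grid
    residual = frame.copy()
    e0 = np.sum(residual ** 2) + 1e-12
    sinusoids = []
    for _ in range(n_max):
        spec = np.abs(np.exp(-2j * np.pi * np.outer(freqs, t)) @ residual)
        f = freqs[np.argmax(spec)]
        c, s = np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)
        a, b = 2 * residual @ c / n, 2 * residual @ s / n          # approximate projection
        residual -= a * c + b * s
        sinusoids.append((f, np.hypot(a, b), np.arctan2(-b, a)))   # (freq, amplitude, phase)
        if 10 * np.log10(np.sum(residual ** 2) / e0) < stop_db:
            break
    return sinusoids, residual
```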

  • Multichannel audio synthesis by subband-based spectral conversion and parameter adaptation

    Page(s): 263 - 274

    Multichannel audio can immerse a group of listeners in a seamless aural environment. Previously, we proposed a system capable of synthesizing the multiple channels of a virtual multichannel recording from a smaller set of reference recordings. This problem was termed multichannel audio resynthesis and the application was to reduce the excessive transmission requirements of multichannel audio. In this paper, we address the more general problem of multichannel audio synthesis, i.e., how to completely synthesize a multichannel audio recording from a specific stereophonic or monophonic recording, which would significantly enhance the recording's acoustic impression. We approach this problem by extending the model employed for the resynthesis problem. This is accomplished by adapting the resynthesis conversion parameters to the statistical properties of the recording that we wish to enhance. This parameter adaptation is similar to the task adaptation employed in speech recognition, when a specific model is applied to a different environment (speaker, language or channel). One particular approach to this problem is shown here to be quite advantageous toward solving the multichannel audio synthesis problem as well.

  • Beat tracking of musical performances using low-level audio features

    Page(s): 275 - 285

    This paper presents and compares two methods of tracking the beat in musical performances, one based on a Bayesian decision framework and the other a gradient strategy. The techniques can be applied directly to a digitized performance (i.e., a soundfile) and do not require a musical score or a MIDI transcription. In both cases, the raw audio is first processed into a collection of "rhythm tracks" which represent the time evolution of various low-level features. The Bayesian approach chooses a set of parameters that represent the beat by modeling the rhythm tracks as a concatenation of random variables with a patterned structure of variances. The output of the estimator is a trio of parameters that represent the interval between beats, its change (derivative), and the position of the starting beat. Recursive (and potentially real time) approximations to the method are derived using particle filters, and their behavior is investigated via simulation on a variety of musical sources. The simpler method, which performs a gradient descent over a window of beats, tends to converge more slowly and to undulate about the desired answer. Several examples are presented that highlight both the strengths and weaknesses of the approaches.
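
    A toy version of the particle-filter idea is sketched below: the state is a (beat period, phase) pair, and particles are weighted by how much onset energy in a rhythm track falls on their predicted beat grid. This is a stand-in for the paper's Bayesian estimator, not its exact model; all parameter values are illustrative.

```python
import numpy as np

def particle_filter_beat_track(onset_env, fs_env, n_particles=500, n_steps=20, seed=0):
    """Minimal particle-filter sketch for beat tracking on an onset-strength rhythm track
    sampled at fs_env. The state is (beat period, phase); particles are scored by the mean
    onset strength at their predicted beat times, then resampled and jittered."""
    rng = np.random.default_rng(seed)
    dur = len(onset_env) / fs_env

    def score(period, phase):
        beats = np.arange(phase, dur, period)
        idx = np.minimum((beats * fs_env).astype(int), len(onset_env) - 1)
        return onset_env[idx].mean() if len(idx) else 0.0

    periods = rng.uniform(0.3, 1.0, n_particles)          # seconds per beat (60-200 BPM)
    phases = rng.uniform(0.0, 1.0, n_particles) * periods
    for _ in range(n_steps):
        w = np.array([score(T, p) for T, p in zip(periods, phases)]) + 1e-12
        keep = rng.choice(n_particles, n_particles, p=w / w.sum())          # resample
        periods = np.abs(periods[keep] + rng.normal(0, 0.005, n_particles)) # diffuse period
        phases = np.abs(phases[keep] + rng.normal(0, 0.005, n_particles))   # diffuse phase
    w = np.array([score(T, p) for T, p in zip(periods, phases)])
    best = int(np.argmax(w))
    return periods[best], phases[best]
```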

  • Convergence analysis of a complex LMS algorithm with tonal reference signals

    Page(s): 286 - 292

    Tonal noise is encountered in many active noise control applications. Such noise, usually generated by periodic noise sources like rotating machines, is cancelled by synthesizing the so-called antinoise with a set of adaptive filters which are trained to model the noise generation mechanism. Performance of such noise cancellation schemes depends on, among other things, the convergence characteristics of the adaptive algorithm deployed. In this paper, we consider a multireference complex least mean square (LMS) algorithm that can be used to train a set of adaptive filters to counter an arbitrary number of periodic noise sources. A deterministic convergence analysis of the multireference algorithm is carried out, and necessary as well as sufficient conditions for convergence are derived by exploiting the properties of the input correlation matrix and a related product matrix. It is also shown that, under the convergence conditions, the energy of each error sequence is independent of the tonal frequencies. An optimal step size for fastest convergence is then derived by minimizing the error energy.
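
    A single-sensor sketch of a multireference complex LMS update is given below: each reference tone carries one complex weight, and the weights are adapted from the residual error. This omits the secondary-path (filtered-x) modeling a real active noise control system needs; variable names and the step size are illustrative.

```python
import numpy as np

def multireference_complex_lms(d, fs, ref_freqs, mu=0.01):
    """Multireference complex-LMS sketch: one complex weight per reference tone; each weight
    scales and rotates its complex exponential reference, and the summed (real) output is
    subtracted from the disturbance d to form the error that drives adaptation."""
    n = len(d)
    t = np.arange(n) / fs
    refs = np.exp(2j * np.pi * np.outer(ref_freqs, t))   # (K, n) complex tonal references
    w = np.zeros(len(ref_freqs), dtype=complex)
    e = np.empty(n)
    for i in range(n):
        y = np.real(w @ refs[:, i])           # synthesized antinoise sample
        e[i] = d[i] - y                       # residual error at the sensor
        w += mu * e[i] * np.conj(refs[:, i])  # complex LMS weight update
        # note: a practical ANC system would filter the references through a
        # secondary-path model before the update (filtered-x LMS)
    return e, w
```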

  • Toward detecting emotions in spoken dialogs

    Page(s): 293 - 303

    The importance of automatically recognizing emotions from human speech has grown with the increasing role of spoken language interfaces in human-computer interaction applications. This paper explores the detection of domain-specific emotions using language and discourse information in conjunction with acoustic correlates of emotion in speech signals. The specific focus is on a case study of detecting negative and non-negative emotions using spoken language data obtained from a call center application. Most previous studies in emotion recognition have used only the acoustic information contained in speech. In this paper, a combination of three sources of information (acoustic, lexical, and discourse) is used for emotion recognition. To capture emotion information at the language level, an information-theoretic notion of emotional salience is introduced. Optimization of the acoustic correlates of emotion with respect to classification error was accomplished by investigating different feature sets obtained from feature selection, followed by principal component analysis. Experimental results on our call center data show that the best results are obtained when acoustic and language information are combined. Results show that combining all the information, rather than using only acoustic information, improves emotion classification by 40.7% for males and 36.4% for females (linear discriminant classifier used for acoustic information).
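
    One plausible reading of an information-theoretic emotional salience measure is sketched below: for each word, the posterior emotion distribution given the word is compared with the prior emotion distribution via a KL-style sum. Smoothing, n-gram context, and the paper's exact definition are omitted; the function name is hypothetical.

```python
import numpy as np
from collections import defaultdict

def emotional_salience(utterances, labels):
    """Sketch of a word-level emotional salience score:
    sal(w) = sum_k P(e_k | w) * log( P(e_k | w) / P(e_k) ),
    i.e. how much observing word w shifts the emotion-class distribution.
    utterances: list of word lists; labels: list of emotion labels (one per utterance)."""
    classes = sorted(set(labels))
    prior = np.array([labels.count(c) for c in classes], dtype=float)
    prior /= prior.sum()
    counts = defaultdict(lambda: np.zeros(len(classes)))
    for words, lab in zip(utterances, labels):
        k = classes.index(lab)
        for w in set(words):                 # count each word once per utterance
            counts[w][k] += 1
    salience = {}
    for w, c in counts.items():
        post = c / c.sum()
        mask = post > 0
        salience[w] = float(np.sum(post[mask] * np.log(post[mask] / prior[mask])))
    return salience

# toy usage
# sal = emotional_salience([["no", "thanks"], ["this", "is", "ridiculous"]], ["neu", "neg"])
```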

  • IEEE Transactions on Speech and Audio Processing Edics

    Page(s): 304
  • IEEE Transactions on Speech and Audio Processing Information for authors

    Page(s): 305 - 306
  • Have you visited lately? www.ieee.org

    Page(s): 307
  • Quality without compromise [advertisement]

    Page(s): 308
  • IEEE Signal Processing Society Information

    Page(s): c3
  • Blank page [back cover]

    Page(s): c4

Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
