
IEEE Transactions on Audio, Speech, and Language Processing

Issue 1 • Jan. 2006

Displaying Results 1 - 25 of 42
  • Table of contents

    Page(s): c1 - c4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): c2
  • Editorial

    Page(s): 1
  • Automatic mood detection and tracking of music audio signals

    Page(s): 5 - 18

    Music mood describes the inherent emotional expression of a music clip. It is helpful in music understanding, music retrieval, and some other music-related applications. In this paper, a hierarchical framework is presented to automate the task of mood detection from acoustic music data, following music-psychological theories from Western cultures. The hierarchical framework has the advantage of emphasizing the most suitable features in different detection tasks. Three feature sets (intensity, timbre, and rhythm) are extracted to represent the characteristics of a music clip. The intensity feature set is represented by the energy in each subband, the timbre feature set is composed of spectral shape and spectral contrast features, and the rhythm feature set captures three aspects that are closely related to an individual's mood response: rhythm strength, rhythm regularity, and tempo. Furthermore, since mood usually changes over the course of an entire piece of classical music, the approach is extended from mood detection to mood tracking, by dividing a piece into several independent segments, each of which contains a homogeneous emotional expression. Preliminary evaluations indicate that the proposed algorithms produce satisfactory results. On a test database of 800 representative music clips, the average accuracy of mood detection reaches 86.3%, and on average 84.1% of the mood boundaries in nine test music pieces are recalled.
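
    As a rough illustration of the intensity feature set (per-frame subband energies), the following sketch derives band energies from an STFT. The band edges, frame length, and function names are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np
from scipy.signal import stft

def subband_intensity(x, sr, band_edges_hz=None, nperseg=1024):
    """Per-frame subband energies as a rough 'intensity' feature set.

    The octave-like band edges and frame length are illustrative, not the
    paper's exact configuration.
    """
    if band_edges_hz is None:
        band_edges_hz = [0, 200, 400, 800, 1600, 3200, 6400, sr / 2]
    f, _, X = stft(x, fs=sr, nperseg=nperseg)
    power = np.abs(X) ** 2                          # (freq_bins, frames)
    feats = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = (f >= lo) & (f < hi)
        feats.append(power[band].sum(axis=0))       # energy per frame in band
    return np.vstack(feats).T                        # (frames, n_bands)
```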

  • Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations

    Page(s): 19 - 32

    A masking threshold constrained Kalman filter for speech enhancement is derived in this paper. A key step in a traditional Kalman filter is minimizing the estimation error variance between a clean signal and its estimate. The new method minimizes the estimation error variance under the constraint that the energy of the estimation error is smaller than a masking threshold, computed from both the time-domain forward masking and the frequency-domain simultaneous masking properties of the human auditory system. The new Kalman filter provides a theoretical basis for the application of masking properties in Kalman filtering for speech enhancement. Because of the high computational cost of the proposed perceptually constrained Kalman filter, a perceptual post-filter concatenated with a standard Kalman filter is also proposed as a heuristic alternative for real-time implementation. The post-filter is constructed to make the estimation error obtained from the Kalman filter lower than the masking threshold. A wavelet Kalman filter with post-filtering is introduced to further reduce the computational load. Experimental results with colored noise show that the new constrained Kalman filter produces the best performance when compared with other recent methods, and that the proposed heuristics with post-filtering also produce a significant performance gain over other recent methods.
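
    The post-filtering heuristic can be pictured as a per-frequency gain that attenuates only where the estimated residual error exceeds the masking threshold. The sketch below assumes the residual error PSD and the masking threshold are already available per frame; the variable names and the attenuation floor are illustrative, not the paper's formulation.

```python
import numpy as np

def perceptual_postfilter_gain(err_psd, mask_threshold, gain_floor=0.1):
    """Per-frequency gain pushing the estimated residual error PSD below the
    masking threshold; bins whose error is already masked are left untouched.

    err_psd and mask_threshold are same-length arrays for one frame; how they
    are obtained from the Kalman filter is not shown here.
    """
    gain = np.ones_like(err_psd, dtype=float)
    audible = err_psd > mask_threshold
    gain[audible] = np.sqrt(mask_threshold[audible] / err_psd[audible])
    return np.maximum(gain, gain_floor)        # cap attenuation to limit artifacts
```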

  • On the perceptual performance limitations of echo cancellers in wideband telephony

    Page(s): 33 - 42

    In this paper, standard echo canceller performance measures are evaluated in terms of psychoacoustic aspects of human hearing. The focus is on wideband speech communication systems with long round-trip delays of 200 ms or more in the transmission path. The results of a simple acoustic echo cancellation experiment are analyzed with a standard psychoacoustic model, revealing that steady-state echo return loss enhancement and mean square error cannot be used to determine whether residual echo is perceivable in the presence of background noise. In addition, a simple modification to the normalized least mean square (NLMS) algorithm is introduced by adding a perceptual preemphasis filter. Simulation results and listening tests show that it is possible to improve the perceived performance of an echo canceller during convergence by placing greater emphasis on frequencies at which the human auditory system is most sensitive.
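
    A minimal sketch of the modification's general shape: an NLMS echo canceller whose adaptation runs on pre-emphasized signals. The simple first-order pre-emphasis used here is only a placeholder for the perceptually motivated filter in the paper, and all parameter values are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def nlms_echo_canceller(far_end, mic, n_taps=256, mu=0.5, eps=1e-8,
                        preemph=(1.0, -0.7)):
    """NLMS adaptive echo canceller with an (illustrative) pre-emphasis stage."""
    x = lfilter(preemph, [1.0], far_end)       # emphasized far-end (reference)
    d = lfilter(preemph, [1.0], mic)           # emphasized microphone signal
    w = np.zeros(n_taps)
    e = np.zeros(len(x))
    for n in range(n_taps - 1, len(x)):
        x_vec = x[n - n_taps + 1:n + 1][::-1]  # newest sample first
        e[n] = d[n] - w @ x_vec                # error = mic minus echo estimate
        w += (mu / (eps + x_vec @ x_vec)) * e[n] * x_vec
    return w, e
```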

  • Automatic speech recognition with an adaptation model motivated by auditory processing

    Page(s): 43 - 49

    The mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) feature extraction schemes typically used for automatic speech recognition (ASR) employ several principles that have known counterparts in the cochlea and auditory nerve: frequency decomposition, mel- or bark-warping of the frequency axis, and compression of amplitudes. It seems natural to ask whether one can profitably employ a counterpart of the next physiological processing step, synaptic adaptation. We therefore incorporated a simplified model of short-term adaptation into MFCC feature extraction. We evaluated the resulting ASR performance on the AURORA 2 and AURORA 3 tasks, in comparison to ordinary MFCCs, MFCCs processed by RASTA, and MFCCs processed by cepstral mean subtraction (CMS), and both in comparison to and in combination with Wiener filtering. The results suggest that our approach offers a simple, causal robustness strategy which is competitive with RASTA, CMS, and Wiener filtering and performs well in combination with Wiener filtering. Compared to the structurally related RASTA, our adaptation model provides superior performance on AURORA 2 and, if Wiener filtering is used prior to both approaches, on AURORA 3 as well.
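
    One very simplified way to picture short-term adaptation is a divisive feedback loop applied per mel channel before the cepstral transform, which emphasizes onsets and suppresses slowly varying energy. The sketch below is an assumption-laden illustration of that idea, not the auditory-motivated model evaluated in the paper.

```python
import numpy as np

def short_term_adaptation(mel_energies, tau_frames=20, floor=1e-3):
    """Crude divisive short-term adaptation applied per mel channel.

    mel_energies: (frames, channels) non-negative filterbank outputs.
    The divisive form and tau_frames are illustrative assumptions.
    """
    alpha = 1.0 / tau_frames                       # smoothing constant
    state = np.maximum(mel_energies[0], floor)     # running adaptation level
    out = np.empty_like(mel_energies, dtype=float)
    for t in range(mel_energies.shape[0]):
        out[t] = mel_energies[t] / state           # onsets are emphasized
        state = (1 - alpha) * state + alpha * np.maximum(mel_energies[t], floor)
    return out
```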

  • Sparse and shift-invariant representations of music

    Page(s): 50 - 57

    Redundancy reduction has been proposed as the main computational process in the primary sensory pathways of the mammalian brain. This idea has led to the development of sparse coding techniques, which are exploited in this article to extract salient structure from musical signals. In particular, we use a sparse coding formulation within a generative model that explicitly enforces shift-invariance. Previous work has applied these methods to relatively small problem sizes. In this paper, we present a subset selection step to reduce the computational complexity of these methods, which then enables us to use the sparse coding approach for many real-world applications. We demonstrate the algorithm's potential on two tasks in music analysis: the extraction of individual notes from polyphonic piano music and single-channel blind source separation.
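
    A much simpler relative of shift-invariant sparse coding, greedy matching pursuit over time shifts, illustrates what a shift-invariant decomposition produces: a list of (atom, shift, amplitude) events. The function below is a hedged sketch with illustrative names; it is not the generative model or the subset selection step of the paper.

```python
import numpy as np

def shift_invariant_mp(signal, atoms, n_iters=50):
    """Greedy shift-invariant decomposition (matching pursuit).

    atoms: list of unit-norm 1-D templates allowed at any time shift.
    Returns a list of (atom_index, shift, amplitude) events and the residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    events = []
    for _ in range(n_iters):
        best = None
        for k, atom in enumerate(atoms):
            corr = np.correlate(residual, atom, mode="valid")  # all shifts
            shift = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[shift]) > abs(best[2]):
                best = (k, shift, corr[shift])
        k, shift, amp = best
        residual[shift:shift + len(atoms[k])] -= amp * atoms[k]
        events.append((k, shift, amp))
    return events, residual
```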

  • Mask estimation for missing data speech recognition based on statistics of binaural interaction

    Page(s): 58 - 67

    This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with the "missing data" approach for robust speech recognition in noise. Missing data time-frequency masks are created using probability distributions based on estimates of interaural time and level differences (ITD and ILD) for mixed utterances in reverberated conditions; these masks indicate which regions of the spectrum constitute reliable evidence of the target speech signal. A number of experiments compare the relative efficacy of the binaural cues when used individually and in combination. We also investigate the ability of the system to generalize to acoustic conditions not encountered during training. Performance on a continuous digit recognition task using this method is found to be good, even in a particularly challenging environment with three concurrent male talkers.
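
    The raw binaural cues can be computed per time-frequency cell from left and right STFTs as a level ratio (ILD) and a phase-derived time difference (ITD), as sketched below. The probabilistic mask estimation trained on these cues is not reproduced; parameters and names are illustrative.

```python
import numpy as np
from scipy.signal import stft

def binaural_cues(left, right, sr, nperseg=512):
    """ILD (dB) and phase-derived ITD (s) for every time-frequency cell."""
    f, _, L = stft(left, fs=sr, nperseg=nperseg)
    _, _, R = stft(right, fs=sr, nperseg=nperseg)
    eps = 1e-12
    ild_db = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))                 # interaural phase difference
    freqs = np.maximum(f, 1.0)[:, None]            # avoid division by zero at DC
    itd_sec = ipd / (2.0 * np.pi * freqs)          # unambiguous only at low freqs
    return ild_db, itd_sec
```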

  • Instrument recognition in polyphonic music based on automatic taxonomies

    Page(s): 68 - 80

    We propose a new approach to instrument recognition in the context of real music orchestrations ranging from solos to quartets. The strength of our approach is that it does not require prior musical source separation. Thanks to a hierarchical clustering algorithm exploiting robust probabilistic distances, we obtain a taxonomy of musical ensembles that is used to efficiently classify possible combinations of instruments played simultaneously. Moreover, a wide set of acoustic features is studied, including some new proposals. In particular, signal-to-mask ratios are found to be useful features for audio classification. This study focuses on a single music genre (jazz) but covers a variety of instruments, including percussion and singing voice. Using a varied database of sound excerpts from commercial recordings, we show that the segmentation of music with respect to the instruments played can be achieved with an average accuracy of 53%.

  • Modeling timbre distance with temporal statistics from polyphonic music

    Page(s): 81 - 90

    Timbre distance and similarity express the phenomenon that some music sounds similar to us while other songs sound very different. The notion of genre is often used to categorize music, but songs from a single genre do not necessarily sound similar, and vice versa. In this work, we analyze and compare a large number of different audio features, and psychoacoustic variants thereof, for the purpose of modeling timbre distance. The sound of polyphonic music is commonly described by extracting audio features on short time windows during which the sound is assumed to be stationary. The resulting down-sampled time series are aggregated to form a high-level feature vector describing the music. We generated high-level features by systematically applying static and temporal statistics for aggregation; the temporal structure of features in particular has previously been largely neglected. A novel supervised feature selection method is applied to the resulting very large set of candidate features. Distances computed from the selected features correspond to timbre differences in music. The features show few redundancies and have high potential for explaining possible clusters. They outperform seven other previously proposed feature sets on several datasets with respect to the separation of known groups of timbrally different music.
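
    The static-versus-temporal aggregation idea can be illustrated as follows: song-level descriptors built from per-frame features using both static statistics and statistics of the features' temporal evolution. The particular statistics and names below are assumptions for illustration, not the selected feature set of the paper.

```python
import numpy as np

def aggregate_features(frames):
    """Collapse a (n_frames, n_features) short-time feature matrix into one
    song-level vector using static plus simple temporal statistics.
    """
    static = [frames.mean(axis=0), frames.std(axis=0)]
    deltas = np.diff(frames, axis=0)               # frame-to-frame changes
    temporal = [np.abs(deltas).mean(axis=0), deltas.std(axis=0)]
    # crude modulation descriptor: dominant fluctuation bin per feature
    spec = np.abs(np.fft.rfft(frames - frames.mean(axis=0), axis=0))
    temporal.append(spec[1:].argmax(axis=0).astype(float))
    return np.concatenate(static + temporal)
```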

  • Musical source separation using time-frequency source priors

    Page(s): 91 - 98

    This article deals with the source separation problem for stereo musical mixtures using prior information about the sources (instrument names and localization). After a brief review of existing methods, we design a family of probabilistic mixture generative models combining modified positive independent subspace analysis (ISA), localization models, and segmental models (SM). We express source separation as a Bayesian estimation problem and we propose efficient resolution algorithms. The resulting separation methods rely on a variable number of cues including harmonicity, spectral envelope, azimuth, note duration, and monophony. We compare these methods on two synthetic mixtures with long reverberation. We show that they outperform methods exploiting spatial diversity only and that they are robust against approximate localization of the sources.

  • On perceptual distortion minimization and nonlinear least-squares frequency estimation

    Page(s): 99 - 109

    In this paper, we present a framework for perceptual error minimization and sinusoidal frequency estimation based on a new perceptual distortion measure, and we state its optimal solution. Using this framework, we relate a number of well-known practical methods for perceptual sinusoidal parameter estimation such as the prefiltering method, the weighted matching pursuit, and the perceptual matching pursuit. In particular, we derive and compare the sinusoidal estimation criteria used in these methods. We show that for the sinusoidal estimation problem, the prefiltering method and the weighted matching pursuit are equivalent to the perceptual matching pursuit under certain conditions.

  • Multichannel active noise equalization of interior noise

    Page(s): 110 - 122

    An algorithmic variant of the conventional active noise equalizer (ANE), which independently controls some given frequencies of the primary signal, has been developed and extended to the multichannel case. The modified version of the ANE is named the common-error multiple-frequency ANE. A detailed analysis of both multichannel equalizers has been carried out. From a convergence analysis in the frequency domain, the influence of transducer locations on the behavior of a practical system can be predicted through the matrix of secondary-path responses at each frequency. The ANEs' steady-state transfer functions from the primary input signal to the noise output have also been developed and compared for different parameter settings and for accurate and inaccurate secondary-path estimation. Furthermore, the multichannel extension of both equalizers has been implemented in a real-time active system inside a listening room for multifrequency noise. Zones of equalization of useful size have been measured binaurally using a head and torso simulator. The common-error multiple-frequency ANE was found to perform better than the conventional equalizer because it saves computational complexity and has smaller overshoot. It can also be implemented in a real controller more easily than the conventional ANE, without showing meaningful differences in the practical results.

  • Analysis of the filtered-X LMS algorithm and a related new algorithm for active control of multitonal noise

    Page(s): 123 - 130

    The filtered-X LMS (FXLMS) algorithm is widely used for active control of tonal noise generated by periodic sources such as rotating machines. However, the algorithm is derived under a slow-adaptation assumption, and exact analysis in the literature is restricted to the case of one real sinusoid. In this paper, for the general case of an arbitrary number of sources, the characteristic polynomial of the equivalent linear system describing the FXLMS algorithm is derived, and a method for calculating the stability limit is presented. In addition, a related new algorithm that is free from the above assumption, and which is nonlinear with respect to the tap weights, is proposed. Simulation results show that in the early stage of adaptation the new algorithm gives faster decay of errors.
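
    For readers unfamiliar with it, a single-channel FXLMS loop has the generic structure sketched below: the reference is filtered through a model of the secondary path before being used in the weight update. This is a textbook-style sketch with illustrative parameters, not the analysis or the new algorithm proposed in the paper.

```python
import numpy as np
from scipy.signal import lfilter

def fxlms(reference, disturbance, sec_path, sec_path_est, n_taps=32, mu=1e-3):
    """Single-channel filtered-X LMS sketch for active noise control.

    reference:    reference signal x(n), e.g., a tonal reference
    disturbance:  noise to be cancelled at the error sensor d(n)
    sec_path:     true secondary-path impulse response (used only to simulate)
    sec_path_est: model of the secondary path used to filter the reference
    """
    w = np.zeros(n_taps)                       # control filter weights
    x_buf = np.zeros(n_taps)                   # recent reference samples
    fx_buf = np.zeros(n_taps)                  # recent filtered-reference samples
    y_buf = np.zeros(len(sec_path))            # recent control-signal samples
    x_filt = lfilter(sec_path_est, [1.0], reference)
    e = np.zeros(len(reference))
    for n in range(len(reference)):
        x_buf = np.concatenate(([reference[n]], x_buf[:-1]))
        fx_buf = np.concatenate(([x_filt[n]], fx_buf[:-1]))
        y = w @ x_buf                          # anti-noise sent to the actuator
        y_buf = np.concatenate(([y], y_buf[:-1]))
        e[n] = disturbance[n] - sec_path @ y_buf   # residual at the error mic
        w += mu * e[n] * fx_buf                # FXLMS weight update
    return w, e
```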

  • Note segmentation and quantization for music information retrieval

    Page(s): 131 - 141

    Much research in music information retrieval has focused on query-by-humming systems, which search melodic databases using sung queries. The database retrieval aspect of such systems has received considerable attention, but query processing and the melodic representation have not been examined as carefully. Common methods for query processing are based on musical intuition and historical momentum rather than specific performance criteria; existing systems often employ rudimentary note segmentation or coarse quantization of note estimates. In this work, we examine several alternative query processing methods as well as quantized melodic representations. One common difficulty with designing query-by-humming systems is the coupling between system components. We address this issue by measuring the performance of the query processing system both in isolation and coupled with a retrieval system. We first measure the segmentation performance of several note estimators. We then compute the retrieval accuracy of an experimental query-by-humming system that uses the various note estimators along with varying degrees of pitch and duration quantization. The results show that more advanced query processing can improve both segmentation performance and retrieval performance, although the best segmentation performance does not necessarily yield the best retrieval performance. Further, coarsely quantizing the melodic representation generally degrades retrieval accuracy.
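
    Coarse quantization of a melodic representation, whose effect on retrieval the paper measures, can be sketched as rounding pitch to a semitone grid and duration to a fixed time grid; the grid sizes and function names below are illustrative assumptions.

```python
import numpy as np

def quantize_notes(f0_hz, dur_sec, pitch_step=1.0, dur_grid=0.125):
    """Quantize note estimates: pitch to (fractions of) semitones on the MIDI
    scale and duration to a fixed time grid.

    pitch_step=1.0 means whole semitones; dur_grid is an illustrative 125-ms grid.
    """
    midi = 69.0 + 12.0 * np.log2(np.asarray(f0_hz) / 440.0)     # Hz -> MIDI
    midi_q = np.round(midi / pitch_step) * pitch_step
    dur_q = np.maximum(np.round(np.asarray(dur_sec) / dur_grid), 1) * dur_grid
    return midi_q, dur_q
```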

  • Evaluation of the affective valence of speech using pitch substructure

    Page(s): 142 - 151

    In order to study the relationship between emotion and intonation, a new technique is introduced for the extraction of the dominant pitches within speech utterances and the quasi-musical analysis of the multipitch structure. After the distribution of fundamental frequencies over the entire utterance has been obtained, the underlying pitch structure is determined using an unsupervised clustering (Gaussian mixture) algorithm. The technique normally results in 3-6 pitch clusters per utterance, which can then be evaluated in terms of their inherent dissonance, harmonic "tension", and major or minor modality. Stronger dissonance and tension were found in utterances with negative affect, relative to utterances with positive affect. Most importantly, utterances that were evaluated as having positive or negative affect had significantly different modality values. Factor analysis showed that the measures involving multiple pitches were distinct from other acoustical measures, indicating that the pitch substructure is an independent factor contributing to the affective valence of speech prosody.
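
    The clustering step can be pictured as fitting Gaussian mixtures to the utterance's F0 distribution and keeping a small number of components. The sketch below selects the component count by BIC, which is an assumption; the dissonance, tension, and modality measures computed from the clusters in the paper are not reproduced.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dominant_pitch_clusters(f0_hz, candidate_k=(3, 4, 5, 6)):
    """Fit Gaussian mixtures to an utterance's F0 distribution (in semitone
    units) and keep the model with the best BIC; returns sorted cluster means.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    f0 = f0[f0 > 0]                                  # drop unvoiced frames
    semitones = 12.0 * np.log2(f0 / 440.0)           # log-frequency scale
    X = semitones.reshape(-1, 1)
    best = min((GaussianMixture(n_components=k, random_state=0).fit(X)
                for k in candidate_k), key=lambda g: g.bic(X))
    return np.sort(best.means_.ravel())
```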

  • Iterative joint source-channel decoding of speech spectrum parameters over an additive white Gaussian noise channel

    Page(s): 152 - 162

    In this paper, we show how the Gaussian mixture modeling framework used to develop efficient source encoding schemes can be further exploited to model source statistics during channel decoding in an iterative framework, yielding an effective joint source-channel decoding scheme. The joint probability density function (PDF) of successive source frames is modeled as a Gaussian mixture model (GMM). Based on previous work, the marginal source statistics provided by the GMM are used at the encoder to design a low-complexity memoryless source encoding scheme. The source encoding scheme has the specific advantage of providing good estimates of the probability of occurrence of a given source code-point based on the GMM. The proposed iterative decoding procedure works with any channel code whose decoder can implement the soft-output Viterbi algorithm that uses a priori information (APRI-SOVA) or the BCJR algorithm to provide extrinsic information on each source encoded bit. The source decoder uses the GMM model and the channel decoder output to provide a priori information back to the channel decoder. Decoding is done in an iterative manner by trading extrinsic information between the source and channel decoders. Experimental results showing improved decoding performance are provided for the application of speech spectrum parameter compression and communication.

  • Codebook driven short-term predictor parameter estimation for speech enhancement

    Page(s): 163 - 176

    In this paper, we present a new technique for the estimation of short-term linear predictive parameters of speech and noise from noisy data and their subsequent use in waveform enhancement schemes. The method exploits a priori information about speech and noise spectral shapes stored in trained codebooks, parameterized as linear predictive coefficients. The method also uses information about noise statistics estimated from the noisy observation. Maximum-likelihood estimates of the speech and noise short-term predictor parameters are obtained by searching for the combination of codebook entries that optimizes the likelihood. The estimation involves the computation of the excitation variances of the speech and noise auto-regressive models on a frame-by-frame basis, using the a priori information and the noisy observation. The high computational complexity resulting from a full search of the joint speech and noise codebooks is avoided through an iterative optimization procedure. We introduce a classified noise codebook scheme that uses different noise codebooks for different noise types. Experimental results show that the use of a priori information and the calculation of the instantaneous speech and noise excitation variances on a frame-by-frame basis result in good performance in both stationary and nonstationary noise conditions.

  • Speech enhancement based on auto gain control

    Page(s): 177 - 190

    We propose a new method of speech enhancement based on auto gain control (AGC) that uses two-channel input to deal with transient noise. Auto gain control is considered relatively ineffective for reducing noise that is superimposed on speech; nevertheless, it offers advantages in addressing problems posed by musical noise and spectral distortion. The method combines two operations for obtaining an accurate gain: spectral subtraction for two-channel input (2chSS) and self-offset of the noise with pre-whitening. This study also introduces a coherence-based post-filter to reduce noise components that are uncorrelated among channels. The proposed method is evaluated in experiments across three noise conditions in which (i) impulsive noises, (ii) stationary car noise, and (iii) speech noise are present, respectively. Objective measures and spectrograms demonstrate marked improvements over other two-microphone based methods, but subjective preference tests reveal that, in the stationary car noise case (ii), the proposed method is preferred less than the unprocessed signal. The proposed method and the conventional 2chSS performed comparably in the speech-noise case (iii). These subjective results reflect some disadvantages of AGC processing, namely degradation of noise consistency in stationary noise conditions and residual noise in desired speech segments. Nevertheless, subjective tests for noise condition (i) demonstrate that the proposed method is the most preferred among the methods compared, and its effectiveness is confirmed particularly for this noise condition.

  • Audio source separation with a single sensor

    Page(s): 191 - 199

    In this paper, we address the problem of audio source separation with a single sensor, using a statistical model of the sources. The approach is based on a learning step performed on samples of each source separately, during which we train Gaussian scaled mixture models (GSMM). During the separation step, we derive maximum a posteriori (MAP) and/or posterior mean (PM) estimates of the sources, given the observed audio mixture (Bayesian framework). From the experimental point of view, we test and evaluate the method on real audio examples.

  • Multichannel blind deconvolution for source separation in convolutive mixtures of speech

    Page(s): 200 - 212

    This paper addresses the blind separation of convolutive and temporally correlated mixtures of speech, through the use of a multichannel blind deconvolution (MBD) method. In the proposed framework (LP-NGA), spatio-temporal separation is carried out by entropy maximization using the well-known natural gradient algorithm (NGA), while a temporal pre-whitening stage based on linear prediction (LP) fully preserves the original spectral characteristics of each source contribution. On synthetic convolutive mixtures, we show that the LP-NGA, an unconstrained natural extension to the multichannel BSS problem, benefits not only from fewer model constraints, but also from an overall increase in separation performance, spectral preservation efficiency, and speed of convergence.

  • The AT&T spoken language understanding system

    Page(s): 213 - 222

    Spoken language understanding (SLU) aims at extracting meaning from natural language speech. Over the past decade, a variety of practical goal-oriented spoken dialog systems have been built for limited domains. SLU in these systems ranges from understanding predetermined phrases through fixed grammars, extracting some predefined named entities, and extracting users' intents for call classification, to combinations of users' intents and named entities. In this paper, we present the SLU system of VoiceTone® (a service provided by AT&T where AT&T develops, deploys, and hosts spoken dialog applications for enterprise customers). The SLU system extracts both intents and named entities from the users' utterances. For intent determination, we use statistical classifiers trained from labeled data, and for named entity extraction we use rule-based fixed grammars. The focus of our work is to exploit data and to use machine learning techniques to create scalable SLU systems that can be quickly deployed for new domains with minimal human intervention. These objectives are achieved by 1) using the predicate-argument representation of the semantic content of an utterance; 2) extending statistical classifiers to seamlessly integrate hand-crafted classification rules with the rules learned from data; and 3) developing an active learning framework to minimize the human labeling effort for quickly building the classifier models and adapting them to changes. We present an evaluation of this system using two deployed applications of VoiceTone®.

  • Robust speech recognition over mobile and IP networks in burst-like packet loss

    Page(s): 223 - 231

    This paper addresses the problem of achieving robust distributed speech recognition in the presence of burst-like packet loss. To compensate for packet loss, a number of techniques that provide estimates of lost vectors are investigated. Experimental results on both a connected digits task and a large vocabulary continuous speech recognition task show that simple methods, such as repetition, are not as effective as interpolation methods, which are better able to preserve the dynamics of the feature vector stream. The best performance is given by maximum a posteriori (MAP) estimation of lost vectors, which utilizes statistics of the feature vector stream. At longer burst lengths the performance of these compensation techniques deteriorates as the temporal correlation in the received feature vector stream decreases. To compensate for this, interleaving is proposed, which aims to disperse bursts of loss into a series of unconnected smaller bursts. Results show substantial gains in accuracy, to almost that of the no-loss condition, when interleaving is combined with estimation techniques, although at the expense of introducing delay. This leads to the proposal that, for a distributed speech recognition application, it is more beneficial to trade delay for accuracy than to trade bit-rate for accuracy as in forward error correction schemes.
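
    A simple block interleaver illustrates how a burst of consecutive packet losses is dispersed into isolated losses that the estimation techniques can then bridge; the interleaving depth and function names are illustrative assumptions.

```python
import numpy as np

def block_interleave(vectors, depth=8):
    """Block interleaver: write frame indices row-wise into a depth x width
    matrix and read them column-wise, so a burst of lost consecutive packets
    maps back to isolated losses after de-interleaving. depth=8 is illustrative.
    """
    n = len(vectors)
    width = int(np.ceil(n / depth))
    order = np.arange(depth * width).reshape(depth, width).T.ravel()
    order = order[order < n]                      # drop padding positions
    return [vectors[i] for i in order], order

def block_deinterleave(received, order):
    """Invert the permutation produced by block_interleave."""
    restored = [None] * len(order)                # None marks a lost frame
    for pos, idx in enumerate(order):
        restored[idx] = received[pos]
    return restored
```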


Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research