
IEEE Transactions on Audio, Speech, and Language Processing

Issue 1 • Jan. 2008


Displaying Results 1 - 25 of 28
  • Table of contents

    Publication Year: 2008, Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2008, Page(s): C2
  • Cascade Prediction Filters With Adaptive Zeros to Track the Time-Varying Resonances of the Vocal Tract

    Publication Year: 2008, Page(s): 1 - 7
    Cited by: Papers (4)

    In this paper, a simple and reliable technique is proposed to track vocal tract resonances in continuous speech. The approach is based on the use of predictor filters with adaptive zeros whose constrained trajectories guarantee the successful tracking of the frequency and the damping of each resonance. The zeros are adapted using a gradient-based algorithm to minimize an instantaneous prediction residual according to the principle of minimal disturbance, yielding an adaptive structure capable of tracking fast-changing resonance parameters.

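A minimal Python illustration of the gradient-based adaptation idea: a plain direct-form linear predictor adapted by instantaneous gradient descent (LMS) on the squared residual. This is a generic sketch, not the paper's cascade structure with constrained adaptive zeros; the order, step size `mu`, and test signal are illustrative assumptions.

```python
import math

def lms_predictor(x, order=2, mu=0.05):
    """Adapt linear-predictor coefficients by instantaneous gradient
    descent on the squared prediction residual e^2 (LMS)."""
    w = [0.0] * order
    sq_errs = []
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]          # x[n-1], x[n-2], ...
        e = x[n] - sum(wi * pi for wi, pi in zip(w, past))
        w = [wi + 2.0 * mu * e * pi for wi, pi in zip(w, past)]
        sq_errs.append(e * e)
    return w, sq_errs

# A single resonance behaves locally like a sinusoid; the residual
# shrinks as the predictor locks onto it.
x = [math.cos(0.3 * n) for n in range(400)]
w, sq_errs = lms_predictor(x)
```

Because the update uses only the current residual, the predictor can follow slowly drifting resonance parameters, which is the property the paper exploits.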
  • Using Articulatory Representations to Detect Segmental Errors in Nonnative Pronunciation

    Publication Year: 2008, Page(s): 8 - 22
    Cited by: Papers (7)

    Motivated by potential applications in second-language pedagogy, we present a novel approach to using articulatory information to improve automatic detection of typical phone-level errors made by nonnative speakers of English, a difficult task that involves discrimination between close pronunciations. We describe a reformulation of the hidden-articulator Markov model (HAMM) framework that is appropriate for the pronunciation evaluation domain. Model training requires no direct articulatory measurement, but rather involves a constrained and interpolated mapping from phone-level transcriptions to a set of physically and numerically meaningful articulatory representations. Here, we define two new methods of deriving articulatory-based features for classification: one, by concatenating articulatory recognition results over eight streams representative of the vocal tract's constituents; the other, by calculating multidimensional articulatory confidence scores within these representations based on general linguistic knowledge of articulatory variants. After adding these articulatory features to traditional phone-level confidence scores, our results demonstrate absolute reductions in combined error rates for verification of segment-level pronunciations produced by nonnative speakers in the ISLE corpus by as much as 16%-17% for some target segments, and a 3%-4% absolute improvement overall.

  • Subjective Intelligibility Testing of Chinese Speech

    Publication Year: 2008, Page(s): 23 - 33
    Cited by: Papers (4)

    This paper presents a complete methodology and rationale for the subjective intelligibility testing of Chinese speech. It replaces the combination of several previously published Chinese intelligibility tests, which have been in use for almost a decade, with a single composite test procedure constructed from a foundation of subjective trials and auditory evidence. Since publication of the first elements of the Chinese intelligibility test, several factors have come to light that prompted this overhaul. First, international testing has highlighted words used in the original test that are unsuitable for speakers of particular regional dialects. Second, recent evidence indicates that the assumptions of tonal confusion made during the definition of the original tonal intelligibility tests are not borne out by subjective evidence. Finally, words published in the original test disadvantaged speakers from Mainland China due to the use of full-form Chinese characters rather than the more ubiquitous simplified form characters. This paper presents experimental evidence of tone confusion in Chinese speech, and uses this data to create a replacement tone test. Word choice has been adjusted to find more neutral alternatives for particular regional dialect speakers. The basic speech and tone extension tests are now presented with simplified form characters to ensure accessibility by the greatest number of test subjects. Finally, this paper includes a description of the full intelligibility test.

  • Spectral Representations of Nonmodal Phonation

    Publication Year: 2008, Page(s): 34 - 46
    Cited by: Papers (1)

    Regions of nonmodal phonation, which exhibit deviations from uniform glottal-pulse periods and amplitudes, occur often in speech and convey information about linguistic content, speaker identity, and vocal health. Some aspects of these deviations are random, including small perturbations, known as jitter and shimmer, as well as more significant aperiodicities. Other aspects are deterministic, including repeating patterns of fluctuations such as diplophonia and triplophonia. These deviations are often the source of misinterpretation of the spectrum. In this paper, we introduce a general signal-processing framework for interpreting the effects of both stochastic and deterministic aspects of nonmodality on the short-time spectrum. As an example, we show that the spectrum is sensitive to even small perturbations in the timing and amplitudes of glottal pulses. In addition, we illustrate important characteristics that can arise in the spectrum, including apparent shifting of the harmonics and the appearance of multiple pitches. For stochastic perturbations, we arrive at a formulation of the power-spectral density as the sum of a low-pass line spectrum and a high-pass noise floor. Our findings are relevant to a number of speech-processing areas including linear-prediction analysis, sinusoidal analysis-synthesis, spectrally derived features, and the analysis of disordered voices.

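The sensitivity of the short-time spectrum to small glottal-pulse timing perturbations can be checked numerically. The sketch below is a toy impulse-train model, not the paper's framework: it compares the DFT power spectrum of a periodic pulse train with one whose pulse onsets are jittered by a single sample (the fixed jitter pattern is an illustrative assumption).

```python
import cmath

def power_spectrum(x):
    """Magnitude-squared DFT (first half), normalized by length."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) ** 2 / N
            for k in range(N // 2)]

N, period = 64, 8
periodic = [0.0] * N
jittered = [0.0] * N
shifts = [0, 1, 0, -1, 0, 1, 0, -1]      # fixed +/- 1-sample timing jitter
for i, p in enumerate(range(0, N, period)):
    periodic[p] = 1.0
    jittered[p + shifts[i]] = 1.0

S_per = power_spectrum(periodic)
S_jit = power_spectrum(jittered)
```

The periodic train carries energy only at harmonic bins (multiples of `N // period`), while the jittered train leaks energy between harmonics, consistent with the line-spectrum-plus-noise-floor description in the abstract.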
  • A Method for Automatic Detection of Vocal Fry

    Publication Year: 2008, Page(s): 47 - 56
    Cited by: Papers (7)

    Vocal fry (also called creak, creaky voice, and pulse register phonation) is a voice quality that carries important linguistic or paralinguistic information, depending on the language. We propose a set of acoustic measures and a method for automatically detecting vocal fry segments in speech utterances. A glottal pulse-synchronized method is proposed to deal with the very low fundamental frequency properties of vocal fry segments, which cause problems in the classic short-term analysis methods. The proposed acoustic measures characterize power, aperiodicity, and similarity properties of vocal fry signals. The basic idea of the proposed method is to scan for local power peaks in a “very short-term” power contour for obtaining glottal pulse candidates, check for periodicity properties, and evaluate a similarity measure between neighboring glottal pulse candidates for deciding the possibility of being vocal fry pulses. In the periodicity analysis, autocorrelation peak properties are taken into account for avoiding misdetection of periodicity in vocal fry segments. Evaluation of the proposed acoustic measures in the automatic detection resulted in 74% correct detection, with an insertion error rate of 13%.

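The pulse-candidate scan described in the abstract can be sketched as follows. The window lengths, the power threshold, and the synthetic damped-pulse signal are illustrative assumptions, not the paper's settings.

```python
def power_contour(x, win=3):
    """'Very short-term' power: mean square over a short sliding window."""
    half = win // 2
    return [sum(v * v for v in x[max(0, n - half):n + half + 1]) / win
            for n in range(len(x))]

def local_peaks(p, thresh=0.1):
    """Indices where the power contour has a local maximum above thresh:
    these are the glottal-pulse candidates."""
    return [n for n in range(1, len(p) - 1)
            if p[n] > thresh and p[n] >= p[n - 1] and p[n] > p[n + 1]]

def pulse_similarity(x, a, b, w=3):
    """Normalized cross-correlation of two pulse-centered windows."""
    u, v = x[a - w:a + w + 1], x[b - w:b + w + 1]
    num = sum(p * q for p, q in zip(u, v))
    den = (sum(p * p for p in u) * sum(q * q for q in v)) ** 0.5
    return num / den if den else 0.0

# Synthetic fry-like signal: damped pulses with long, fry-like spacing.
x = [0.0] * 100
for p0 in (10, 40, 70):
    x[p0], x[p0 + 1], x[p0 + 2] = 1.0, 0.6, 0.3

peaks = local_peaks(power_contour(x))
```

High similarity between neighboring candidates, combined with the periodicity check the paper describes, supports the decision that the candidates are vocal fry pulses.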
  • Generalized Postfilter for Speech Quality Enhancement

    Publication Year: 2008, Page(s): 57 - 64
    Cited by: Papers (4) | Patents (1)

    Postfilters are commonly used in speech coding for the attenuation of quantization noise. In the presence of acoustic background noise or distortion due to tandeming operations, the postfilter parameters are not adjusted and the performance is, therefore, not optimal. We propose a modification that consists of replacing the nonadaptive postfilter parameters with parameters that adapt to variations in spectral flatness, obtained from the noisy speech. This generalization of the postfiltering concept can handle a larger range of noise conditions, but has the same computational complexity and memory requirements as the conventional postfilter. Test results indicate that the presented algorithm improves on the standard postfilter, as well as on the combination of a noise attenuation preprocessor and the conventional postfilter.

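Spectral flatness, the quantity the adaptive parameters track, has a standard definition: the geometric mean of the power spectrum over its arithmetic mean. A minimal sketch (the mapping from flatness to postfilter parameters is not reproduced here):

```python
import math

def spectral_flatness(power):
    """Geometric over arithmetic mean of the power spectrum.
    Returns a value in (0, 1]: near 1 for noise-like (flat) spectra,
    near 0 for strongly peaked (voiced) spectra."""
    power = [max(p, 1e-12) for p in power]   # guard against log(0)
    n = len(power)
    gm = math.exp(sum(math.log(p) for p in power) / n)
    am = sum(power) / n
    return gm / am
```

A flat spectrum scores 1.0; a single dominant peak scores near 0, so the measure cleanly separates the noise conditions the postfilter must adapt to.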
  • Regularized Linear Prediction of Speech

    Publication Year: 2008, Page(s): 65 - 73
    Cited by: Papers (11)

    All-pole spectral envelope estimates based on linear prediction (LP) for speech signals often exhibit unnaturally sharp peaks, especially for high-pitch speakers. In this paper, regularization is used to penalize rapid changes in the spectral envelope, which improves the spectral envelope estimate. Based on extensive experimental evidence, we conclude that regularized linear prediction outperforms bandwidth-expanded linear prediction. The regularization approach gives lower spectral distortion on average, and fewer outliers, while maintaining a very low computational complexity.

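A sketch of regularized linear prediction using the autocorrelation method with simple Tikhonov (diagonal) loading. The paper's regularizer specifically penalizes rapid spectral-envelope change, so this only illustrates the mechanism; the signal and `lam` value are illustrative assumptions.

```python
def autocorr(x, order):
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    out = [0.0] * n
    for r in range(n - 1, -1, -1):
        out[r] = (M[r][n] - sum(M[r][c] * out[c]
                                for c in range(r + 1, n))) / M[r][r]
    return out

def lp_coeffs(x, order, lam=0.0):
    """Autocorrelation-method LP; lam adds a Tikhonov penalty to the
    normal equations (R + lam*I) a = r."""
    r = autocorr(x, order)
    A = [[r[abs(i - j)] + (lam if i == j else 0.0) for j in range(order)]
         for i in range(order)]
    return solve(A, r[1:])

x = [0.9 ** n for n in range(200)]   # AR(1)-like test signal
```

Increasing `lam` shrinks the predictor coefficients, which in this toy setting broadens (de-sharpens) the estimated spectral peak, the qualitative effect the paper targets.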
  • Unit-Centric Feature Mapping for Inventory Pruning in Unit Selection Text-to-Speech Synthesis

    Publication Year: 2008, Page(s): 74 - 82

    The level of quality that can be attained in concatenative text-to-speech (TTS) synthesis is primarily governed by the inventory of units used in unit selection. This has led to the collection of ever larger corpora in the quest for ever more natural synthetic speech. As operational considerations limit the size of the unit inventory, however, pruning is critical to removing any instances that prove either spurious or superfluous. This paper proposes a novel pruning strategy based on a data-driven feature extraction framework separately optimized for each unit type in the inventory. A single distinctiveness/redundancy measure can then address, in a consistent manner, the two different problems of outliers and redundant units. Detailed analysis of an illustrative case study exemplifies the typical behavior of the resulting unit pruning procedure, and listening evidence suggests that both moderate and aggressive inventory pruning can be achieved with minimal degradation in perceived TTS quality. These experiments underscore the benefits of unit-centric feature mapping for database optimization in concatenative synthesis.

  • A Backward-Compatible Multichannel Audio Codec

    Publication Year: 2008, Page(s): 83 - 93
    Cited by: Papers (8)

    We propose in this paper a backward-compatible multichannel audio codec. This codec represents a multichannel audio input signal by a down-mix and parametric data. To enable backward compatibility, it is necessary to be able to exert control over the down-mixing procedure. At the same time, to achieve high coding efficiency, both signal and perceptual redundancies should be exploited. In this paper, we describe a codec that unifies these conditions: backward compatibility and the exploitation of both signal and perceptual redundancies. The codec combines high audio quality with a low parameter bit rate. Moreover, its design is flexible: for example, the audio quality is scalable to (in principle) transparency, and the correlation structure of the original input signals can be preserved by using synthetic signals. A stereo backward-compatible version of the proposed codec is used as a component of the recently standardized MPEG Surround multichannel audio codec.

  • Frequency Region-Based Prioritized Bit-Plane Coding for Scalable Audio

    Publication Year: 2008, Page(s): 94 - 105
    Cited by: Papers (7)

    A perceptually enhanced prioritized bit-plane audio coding algorithm is presented in this paper. According to the energy distribution in different frequency regions, the bit-planes are prioritized with optimized parameters. Based on statistical modeling of the frequency spectrum, a much simplified implementation of prioritized bit-plane coding is integrated with the recent release of the MPEG-4 scalable lossless (SLS) audio coding structure by replacing the sequential bit-plane coding in the enhancement layer. Extensive experimental results show that, with zero extra side information, trivial added complexity, and minimal modification to the original SLS structure, the perceptual quality of SLS at noncore and very low core bit rates is improved significantly over a wide range of bit-rate combinations. Fully scalable audio coding up to lossless with much enhanced perceptual quality is thus achieved.

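Bit-plane coding, the substrate being prioritized here, can be sketched in a few lines: quantized spectral lines are transmitted most-significant-bit first, so truncating low-order planes still yields a valid, coarser reconstruction. The region-based prioritization itself (reordering planes by frequency-region energy) is omitted from this sketch.

```python
def bit_planes(values, n_bits):
    """MSB-first bit-planes of nonnegative integer spectral lines."""
    return [[(v >> b) & 1 for v in values] for b in range(n_bits - 1, -1, -1)]

def reconstruct(planes):
    """Rebuild values from however many planes were received."""
    vals = [0] * len(planes[0])
    for plane in planes:
        vals = [(v << 1) | bit for v, bit in zip(vals, plane)]
    return vals
```

Dropping the least-significant plane halves the resolution of every line (`[5, 3, 7]` becomes `[2, 1, 3]`), which is exactly the graceful-degradation property scalable coding relies on.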
  • Time-Scale Modification of Audio Signals Using Enhanced WSOLA With Management of Transients

    Publication Year: 2008, Page(s): 106 - 115
    Cited by: Papers (6)

    In this paper, we present an algorithm for time-scale modification of music signals, based on the waveform similarity overlap-and-add technique (WSOLA). A well-known disadvantage of the standard WSOLA is the uniform time-scaling of the entire signal, including the perceptually significant transient sections (PSTs), where temporal envelope changes as well as significant spectral transitions occur. Time-scaling of PSTs can severely degrade the music quality. We address this problem by detecting the PSTs and leaving them intact, while time-scaling the remainder of the signal, which is relatively steady-state. In the proposed algorithm, the PSTs are detected using a Mel frequency cepstrum nonstationarity measure and the normalized cross-correlation, with time-varying threshold functions. Our study shows that the accurate detection of PSTs within the WSOLA framework makes it possible to achieve a higher quality of time-scaled music, as confirmed by subjective listening tests.

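A toy transient detector in the same spirit: the paper uses a Mel-cepstral nonstationarity measure with time-varying thresholds, whereas this sketch uses plain spectral flux over magnitude-spectrum frames with a fixed threshold (both simplifications are assumptions for illustration).

```python
def spectral_flux(frames):
    """Euclidean distance between consecutive magnitude spectra;
    large values flag candidate transient (PST) frames."""
    flux = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        flux.append(sum((c - p) ** 2 for c, p in zip(cur, prev)) ** 0.5)
    return flux

def transient_frames(frames, thresh):
    return [i for i, f in enumerate(spectral_flux(frames)) if f > thresh]

# Steady tone for three frames, then an attack moves energy to a new band.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 2
```

Frames flagged this way would be excluded from time-scaling and copied through unmodified, as the paper proposes for PSTs.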
  • Instrument-Specific Harmonic Atoms for Mid-Level Music Representation

    Publication Year: 2008, Page(s): 116 - 128
    Cited by: Papers (27)

    Several studies have pointed out the need for accurate mid-level representations of music signals for information retrieval and signal processing purposes. In this paper, we propose a new mid-level representation based on the decomposition of a signal into a small number of sound atoms or molecules bearing explicit musical instrument labels. Each atom is a sum of windowed harmonic sinusoidal partials whose relative amplitudes are specific to one instrument, and each molecule consists of several atoms from the same instrument spanning successive time windows. We design efficient algorithms to extract the most prominent atoms or molecules and investigate several applications of this representation, including polyphonic instrument recognition and music visualization.

  • An Objective Metric of Human Subjective Audio Quality Optimized for a Wide Range of Audio Fidelities

    Publication Year: 2008, Page(s): 129 - 136
    Cited by: Papers (11)

    The goal of this paper is to develop an audio quality metric that can accurately quantify subjective quality over audio fidelities ranging from highly impaired to perceptually lossless. As one example of its utility, such a metric would allow scalable audio coding algorithms to be easily optimized over their entire operating ranges. We have found that the ITU-recommended objective quality metric, ITU-R BS.1387, does not accurately predict subjective audio quality over the wide range of fidelity levels of interest to us. In developing the desired universal metric, we use as a starting point the model output variables (MOVs) that make up BS.1387, as well as the energy equalization truncation threshold, which has been found to be particularly useful for highly impaired audio. To combine these MOVs into a single quality measure that is both accurate and robust, we have developed a hybrid least-squares/minimax optimization procedure. Our test results show that the minimax-optimized metric is up to 36% lower in maximum absolute error compared to a similar metric designed using the conventional least-squares procedure.

  • A Noise-Robust FFT-Based Auditory Spectrum With Application in Audio Classification

    Publication Year: 2008, Page(s): 137 - 150
    Cited by: Papers (7)

    In this paper, we investigate the noise robustness of Wang and Shamma's early auditory (EA) model for the calculation of an auditory spectrum in audio classification applications. First, a stochastic analysis is conducted wherein an approximate expression of the auditory spectrum is derived to justify the noise-suppression property of the EA model. Second, we present an efficient fast Fourier transform (FFT)-based implementation for the calculation of a noise-robust auditory spectrum, which allows flexibility in the extraction of audio features. To evaluate the performance of the proposed FFT-based auditory spectrum, a set of speech/music/noise classification tasks is carried out wherein a support vector machine (SVM) algorithm and a decision tree learning algorithm (C4.5) are used as the classifiers. Features used for classification include conventional Mel-frequency cepstral coefficients (MFCCs), MFCC-like features obtained from the original auditory spectrum (i.e., based on the EA model) and the proposed FFT-based auditory spectrum, as well as spectral features (spectral centroid, bandwidth, etc.) computed from the latter. Compared to the conventional MFCC features, both the MFCC-like and spectral features derived from the proposed FFT-based auditory spectrum show more robust performance in noisy test cases. Test results also indicate that, using the new MFCC-like features, the performance of the proposed FFT-based auditory spectrum is slightly better than that of the original auditory spectrum, while its computational complexity is reduced by an order of magnitude.

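MFCC-like features rest on the mel warping of the frequency axis. A minimal sketch of the standard (HTK-style) conversion formulas and filter-center placement; the EA-model-based auditory spectrum itself is not reproduced here, and the filter count and band edges are illustrative assumptions.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(fmin, fmax, n_filters):
    """Center frequencies (Hz) of triangular filters spaced uniformly
    on the mel scale between fmin and fmax."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (n_filters + 1))
            for i in range(1, n_filters + 1)]
```

Uniform spacing in mel yields centers that crowd the low frequencies and spread out at high frequencies, mimicking auditory frequency resolution.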
  • A Cascaded Broadcast News Highlighter

    Publication Year: 2008, Page(s): 151 - 161
    Cited by: Papers (4)

    This paper presents a fully automatic news skimming system which takes a broadcast news audio stream and provides the user with the segmented, structured, and highlighted transcript. This constitutes a system with three different, cascading stages: converting the audio stream to text using an automatic speech recognizer, segmenting into utterances and stories, and finally determining which utterance should be highlighted using a saliency score. Each stage must operate on the erroneous output from the previous stage in the system, an effect which is naturally amplified as the data progresses through the processing stages. We present a large corpus of transcribed broadcast news data, enabling us to investigate to what degree information worth highlighting survives this cascade of processes. Both extrinsic and intrinsic experimental results indicate that mistakes in story boundary detection have a strong impact on the quality of highlights, whereas erroneous utterance boundaries cause only minor problems. Further, the difference in transcription quality does not affect the overall performance greatly.

  • Adaptive System Identification in the Short-Time Fourier Transform Domain Using Cross-Multiplicative Transfer Function Approximation

    Publication Year: 2008, Page(s): 162 - 173
    Cited by: Papers (11)

    In this paper, we introduce cross-multiplicative transfer function (CMTF) approximation for modeling linear systems in the short-time Fourier transform (STFT) domain. We assume that the transfer function can be represented by cross-multiplicative terms between distinct subbands. We investigate the influence of cross-terms on a system identifier implemented in the STFT domain and derive analytical relations between the noise level, data length, and number of cross-multiplicative terms, which are useful for system identification. As more data becomes available or as the noise level decreases, additional cross-terms should be considered and estimated to attain the minimal mean-square error (mse). A substantial improvement in performance is then achieved over the conventional multiplicative transfer function (MTF) approximation. Furthermore, we derive explicit expressions for the transient and steady-state mse performances obtained by adaptively estimating the cross-terms. As more cross-terms are estimated, a lower steady-state mse is achieved, but the algorithm then suffers from slower convergence. Experimental results validate the theoretical derivations and demonstrate the effectiveness of the proposed approach as applied to acoustic echo cancellation.

  • Sparse Linear Regression With Structured Priors and Application to Denoising of Musical Audio

    Publication Year: 2008, Page(s): 174 - 185
    Cited by: Papers (15)

    We describe in this paper an audio denoising technique based on sparse linear regression with structured priors. The noisy signal is decomposed as a linear combination of atoms belonging to two modified discrete cosine transform (MDCT) bases, plus a residual part containing the noise. One MDCT basis has a long time resolution, and thus high frequency resolution, and is aimed at modeling tonal parts of the signal, while the other MDCT basis has short time resolution and is aimed at modeling transient parts (such as attacks of notes). The problem is formulated within a Bayesian setting. Conditional upon an indicator variable which is either 0 or 1, one expansion coefficient is set to zero or given a hierarchical prior. Structured priors are employed for the indicator variables; using two types of Markov chains, persistency along the time axis is favored for expansion coefficients of the tonal layer, while persistency along the frequency axis is favored for the expansion coefficients of the transient layer. Inference about the denoised signal and model parameters is performed using a Gibbs sampler, a standard Markov chain Monte Carlo (MCMC) sampling technique. We present results for denoising of a short glockenspiel excerpt and a long polyphonic music excerpt. Our approach is compared with unstructured sparse regression and with structured sparse regression in a single resolution MDCT basis (no transient layer). The results show that better denoising is obtained, both from signal-to-noise ratio measurements and from subjective criteria, when both a transient and tonal layer are used, in conjunction with our proposed structured prior framework.

  • Unsupervised Pattern Discovery in Speech

    Publication Year: 2008, Page(s): 186 - 197
    Cited by: Papers (33) | Patents (1)

    We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.

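The widely used dynamic programming technique referred to is dynamic time warping (DTW); the paper applies a segmental variant of it. The classic global DTW recurrence it builds on can be sketched as:

```python
def dtw(a, b, dist=lambda u, v: abs(u - v)):
    """Global DTW distance between sequences a and b: the minimum
    cumulative frame distance over all monotonic alignments."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

In the scalar case an elongated repetition of the same pattern aligns at zero cost, which is the property that lets matching acoustic patterns be found across utterances spoken at different rates; a segmental variant relaxes the requirement that the alignment span both sequences end to end.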
  • Adaptive Bayesian Latent Semantic Analysis

    Publication Year: 2008, Page(s): 198 - 207
    Cited by: Papers (16) | Patents (1)

    Due to the vast growth of data collections, statistical document modeling has become increasingly important in language processing. Probabilistic latent semantic analysis (PLSA) is a popular approach whereby the semantics and statistics can be effectively captured for modeling. However, PLSA is highly sensitive to task domain, which is continuously changing in real-world documents. In this paper, a novel Bayesian PLSA framework is presented. We focus on exploiting an incremental learning algorithm to solve the problem of updating on new domain articles. This algorithm is developed to improve document modeling by incrementally extracting up-to-date latent semantic information to match the changing domains at run time. By adequately representing the priors of PLSA parameters using Dirichlet densities, the posterior densities belong to the same family of distributions, so that a reproducible prior/posterior mechanism is activated for incremental learning from constantly accumulated documents. An incremental PLSA algorithm is constructed to accomplish the parameter estimation as well as the hyperparameter updating. Compared to standard PLSA using the maximum likelihood estimate, the proposed approach is capable of performing dynamic document indexing and modeling. We also present the maximum a posteriori PLSA for corrective training. Experiments on information retrieval and document categorization demonstrate the superiority of using Bayesian PLSA methods.

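The reproducible prior/posterior mechanism rests on Dirichlet-multinomial conjugacy. A minimal sketch on a single multinomial (the full incremental PLSA EM updates are omitted; the counts and hyperparameters below are illustrative):

```python
def map_multinomial(counts, alphas):
    """MAP estimate of multinomial parameters under a Dirichlet(alphas)
    prior. With all alphas equal to 1 this reduces to the
    maximum-likelihood estimate counts[k] / sum(counts)."""
    K = len(counts)
    denom = sum(counts) + sum(alphas) - K
    return [(c + a - 1.0) / denom for c, a in zip(counts, alphas)]

def posterior_alphas(counts, alphas):
    """The Dirichlet is conjugate to the multinomial: the posterior is
    again a Dirichlet, so newly observed counts can be folded in
    incrementally as documents accumulate."""
    return [a + c for a, c in zip(alphas, counts)]
```

Because the posterior has the same form as the prior, today's posterior hyperparameters serve directly as tomorrow's prior, which is the incremental-learning loop the paper builds on.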
  • Constrained Minimization and Discriminative Training for Natural Language Call Routing

    Publication Year: 2008, Page(s): 208 - 215
    Cited by: Papers (1)

    This paper presents a combination strategy of multiple individual routing classifiers to improve classification accuracy in natural language call routing applications. Since errors of individual classifiers in the ensemble should somehow be uncorrelated, we propose a combination strategy where the combined classifier accuracy is a function of the accuracy of individual classifiers and also the correlation between their classification errors. We show theoretically and empirically that our combination strategy, named the constrained minimization technique, has good potential for improving the classification accuracy of single classifiers. We also show how discriminative training, more specifically the generalized probabilistic descent (GPD) algorithm, can be of benefit to further boost the performance of routing classifiers. The GPD algorithm has the potential to consider both positive and negative examples during training to minimize the classification error and increase the score separation of the correct from competing hypotheses. When using the GPD algorithm, some parameters become negative as a result of suppressive learning that is not traditionally possible; important antifeatures are thus obtained. Experimental evaluation is carried out on a banking call routing task and on Switchboard databases, with sets of 23 and 67 destinations, respectively. Results show that either the GPD or the constrained minimization technique, applied separately, improves on the accuracy of the baseline classifiers by 44%. When the constrained minimization technique is added on top of GPD, we show an additional 15% reduction in the classification error rate.

  • Automatic Prosodic Event Detection Using Acoustic, Lexical, and Syntactic Evidence

    Publication Year: 2008 , Page(s): 216 - 228
    Cited by:  Papers (19)

    With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech, because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. Corpora annotated with prosodic events are, in turn, useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level. Our experiments achieved 86.75% agreement on the accent detection task and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.
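    At its core, syllable-level accent detection of this kind is a binary classification over pooled acoustic, lexical, and syntactic evidence. The sketch below illustrates that framing on synthetic features with a plain logistic-regression classifier; the feature set, classifier, and data are illustrative assumptions, not the paper's actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy syllable features: [mean_f0, energy, lexical_cue, syntactic_position].
# Accented syllables get higher pitch and energy by construction.
n = 200
accent = rng.integers(0, 2, n)
X = np.column_stack([
    1.0 + 0.8 * accent + 0.1 * rng.standard_normal(n),  # acoustic: pitch
    1.0 + 0.6 * accent + 0.1 * rng.standard_normal(n),  # acoustic: energy
    accent * rng.integers(0, 2, n),                     # lexical evidence
    rng.random(n),                                      # syntactic evidence
])

# Logistic regression trained by plain gradient descent (no external deps).
Xb = np.column_stack([X, np.ones(n)])   # append a bias column
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    w -= 0.1 * Xb.T @ (p - accent) / n

pred = (Xb @ w > 0).astype(int)
accuracy = (pred == accent).mean()
```

    A real system would replace the synthetic columns with measured pitch/energy contours, word identities, and parse-derived features, but the decision structure is the same.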

  • Evaluation of Objective Quality Measures for Speech Enhancement

    Publication Year: 2008 , Page(s): 229 - 238
    Cited by:  Papers (194)  |  Patents (2)

    In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The objective measures were evaluated on a wide range of distortions introduced by four classes of speech enhancement algorithms (spectral subtractive, subspace, statistical-model based, and Wiener algorithms) operating on four types of real-world noise at two signal-to-noise ratio levels. The subjective quality ratings were obtained using the ITU-T P.835 methodology, which is designed to evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports the correlations of several objective measures with these three subjective rating scales. Several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.
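    The parametric-regression route to a composite measure can be sketched as an ordinary least-squares fit of subjective scores on the individual objective measures. The data below are synthetic and the measure names are placeholders; only the fitting procedure is the point:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: three objective measures scored on 50 enhanced files,
# plus a subjective MOS that depends linearly on them with small noise.
n = 50
measures = rng.random((n, 3))
true_w = np.array([2.0, -1.0, 0.5])
mos = measures @ true_w + 3.0 + 0.05 * rng.standard_normal(n)

# Composite measure: least-squares regression of MOS on the measures.
A = np.column_stack([measures, np.ones(n)])       # add intercept column
w, *_ = np.linalg.lstsq(A, mos, rcond=None)
composite = A @ w

# Compare correlation with MOS: composite vs. best single measure.
corr_composite = np.corrcoef(composite, mos)[0, 1]
corr_single = max(abs(np.corrcoef(measures[:, j], mos)[0, 1])
                  for j in range(3))
```

    In-sample, the fitted composite's correlation with the subjective score (the multiple correlation coefficient) can never be worse than that of any single measure, which is the motivation for composite measures.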

  • Factor Analyzed Subspace Modeling and Selection

    Publication Year: 2008 , Page(s): 239 - 248
    Cited by:  Papers (2)

    We present a novel subspace modeling and selection approach for noisy speech recognition. In subspace modeling, we develop a factor analysis (FA) representation of noisy speech, which is a generalization of the signal subspace (SS) representation. Using FA, noisy speech is represented by the extracted common factors, the factor loading matrix, and the specific factors. The observation space of noisy speech is accordingly partitioned into a principal subspace, containing speech and noise, and a minor subspace, containing residual speech and residual noise. We minimize the energy of speech distortion in the principal subspace as well as in the minor subspace so as to estimate clean speech with residual information. Importantly, we explore optimal subspace selection by solving hypothesis testing problems. We test the equality of the eigenvalues in the minor subspace to select the subspace dimension, and, in keeping with the FA formulation, we also test the hypothesis that the specific factors (residual speech) are uncorrelated. The subspace can then be partitioned at a consistent confidence level for rejecting the null hypothesis. Optimal solutions are realized through likelihood ratio tests whose test statistics approximately follow chi-square distributions. In experiments on the Aurora2 database, the FA model significantly outperforms the SS model for speech enhancement and recognition. Subspace selection by testing the correlation of residual speech achieves higher recognition accuracies than testing the equality of the eigenvalues in the minor subspace.
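    The eigenvalue-equality test used for dimension selection can be sketched with the classical sphericity likelihood-ratio statistic, which is approximately chi-square distributed under the null hypothesis. The spectrum, frame count, and critical value below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sphericity_stat(eigvals, n):
    """-2 log likelihood ratio for H0: the given eigenvalues are equal.

    eigvals: candidate minor-subspace eigenvalues of the sample covariance.
    n: number of observation frames used to estimate the covariance.
    By the AM-GM inequality the statistic is always >= 0.
    """
    q = len(eigvals)
    abar = np.mean(eigvals)
    return n * (q * np.log(abar) - np.sum(np.log(eigvals)))

# Toy spectrum: two large "principal" eigenvalues and four nearly equal
# "minor" ones; test the minor group with n = 400 frames.
eigvals = np.array([9.0, 4.0, 0.52, 0.50, 0.49, 0.51])
T = sphericity_stat(eigvals[2:], n=400)
df = 4 * 5 // 2 - 1        # q(q+1)/2 - 1 = 9 degrees of freedom
chi2_95 = 16.92            # 95th percentile of chi-square with df = 9
reject = T > chi2_95       # False here: the minor eigenvalues look equal
```

    Sweeping the split point between principal and minor groups and keeping the largest minor group for which the test does not reject gives a data-driven subspace dimension.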


Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.


This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research