
Audio, Speech, and Language Processing, IEEE Transactions on

Issue 1 • Date Jan. 2009

  • Table of contents

    Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
  • Editorial

    Page(s): 1
  • Automatic Detection of Disfluency Boundaries in Spontaneous Speech of Children Using Audio–Visual Information

    Page(s): 2 - 12

    The presence of disfluencies in spontaneous speech, while posing a challenge for robust automatic recognition, also offers a means for gaining additional insight into a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech, in the context of spoken-dialog-based computer game play, and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of the disfluency detection system. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech and shows how these information sources can be integrated to increase the overall information available for disfluency detection. The experimental results on our children's multimodal dialog corpus indicate that disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, results showed that the addition of visual information to prosody and language features yields relative improvements in disfluency detection error rates of 3.6% and 6.3%, respectively, for information fusion at the feature level and decision level.

  • An Iterative Relative Entropy Minimization-Based Data Selection Approach for n-Gram Model Adaptation

    Page(s): 13 - 23

    Performance of statistical n-gram language models depends heavily on the amount of training text material and the degree to which the training text matches the domain of interest. The language modeling community is showing a growing interest in using large collections of text (obtainable, for example, from a diverse set of resources on the Internet) to supplement sparse in-domain resources. However, in most cases the style and content of the text harvested from the web differ significantly from the specific nature of these domains. In this paper, we present a relative-entropy-based method to select subsets of sentences whose n-gram distribution matches the domain of interest. We present results on language model adaptation using two speech recognition tasks: a medium-vocabulary medical-domain doctor-patient dialog system and a large-vocabulary transcription system for European parliamentary plenary speeches (EPPS). We show that the proposed subset selection scheme leads to performance improvements over state-of-the-art speech recognition systems in terms of both speech recognition word error rate (WER) and language model perplexity (PPL).

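The selection idea described in the abstract above can be illustrated with a minimal sketch: keep a candidate sentence only if adding it reduces the relative entropy between the in-domain distribution and the selected set's distribution. This uses smoothed unigram distributions and a single greedy pass, a simplification of the paper's iterative n-gram criterion; all data and names here are hypothetical.

```python
from collections import Counter
import math

def unigram_dist(tokens, vocab, alpha=0.1):
    # Smoothed unigram distribution over a fixed vocabulary.
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl(p, q):
    # Relative entropy D(p || q); both are smoothed, so q[w] > 0.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def select_sentences(in_domain_tokens, pool):
    # Greedy pass: keep a pool sentence only if adding it moves the
    # selected set's unigram distribution closer to the in-domain one.
    vocab = set(in_domain_tokens) | {w for sent in pool for w in sent}
    p_in = unigram_dist(in_domain_tokens, vocab)
    selected, sel_tokens = [], []
    for sent in pool:
        before = kl(p_in, unigram_dist(sel_tokens, vocab))
        after = kl(p_in, unigram_dist(sel_tokens + sent, vocab))
        if after < before:
            selected.append(sent)
            sel_tokens += sent
    return selected

in_domain = ("the patient has a fever " * 3).split()
pool = [["the", "patient", "has", "a", "fever"],
        ["stock", "prices", "fell", "sharply"]]
print(select_sentences(in_domain, pool))  # keeps only the medical sentence
```

Higher-order n-grams and repeated randomized passes over the pool would bring this closer to the published method.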
  • Speech Enhancement, Gain, and Noise Spectrum Adaptation Using Approximate Bayesian Estimation

    Page(s): 24 - 37

    This paper presents a new approximate Bayesian estimator for enhancing a noisy speech signal. The speech model is assumed to be a Gaussian mixture model (GMM) in the log-spectral domain, in contrast to most current models, which operate in the frequency domain. Exact signal estimation is a computationally intractable problem. We derive three approximations to enhance the efficiency of signal estimation. The Gaussian approximation transforms the log-spectral domain GMM into the frequency domain using a minimum Kullback-Leibler (KL) divergence criterion. The frequency-domain Laplace method computes the maximum a posteriori (MAP) estimator for the spectral amplitude. Correspondingly, the log-spectral domain Laplace method computes the MAP estimator for the log-spectral amplitude. Further, gain and noise spectrum adaptation are implemented using the expectation-maximization (EM) algorithm within the GMM under the Gaussian approximation. The proposed algorithms are evaluated by applying them to enhance speech corrupted by speech-shaped noise (SSN). The experimental results demonstrate that the proposed algorithms offer improved signal-to-noise ratio, a lower word recognition error rate, and less spectral distortion.

  • Reduced-Bandwidth and Distributed MWF-Based Noise Reduction Algorithms for Binaural Hearing Aids

    Page(s): 38 - 51

    In a binaural hearing aid system, output signals need to be generated for the left and the right ear. Using the binaural multichannel Wiener filter (MWF), which exploits all microphone signals from both hearing aids, a significant reduction of background noise can be achieved. However, due to power and bandwidth limitations of the binaural link, it is typically not possible to transmit all microphone signals between the hearing aids. To limit the amount of transmitted information, this paper presents reduced-bandwidth MWF-based noise reduction algorithms, in which a filtered combination of the contralateral microphone signals is transmitted. A first scheme uses a signal-independent beamformer, a second scheme uses the output of a monaural MWF on the contralateral microphone signals, and a third scheme involves an iterative distributed MWF (DB-MWF) procedure. It is shown that in the case of a rank-1 speech correlation matrix, corresponding to a single speech source, the DB-MWF procedure converges to the binaural MWF solution. Experimental results compare the noise reduction performance of the reduced-bandwidth algorithms with the benchmark binaural MWF. The best performance among the reduced-bandwidth algorithms is obtained by the DB-MWF procedure, whose performance closely approaches the optimal performance of the binaural MWF.

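The multichannel Wiener filter that these binaural schemes build on can be illustrated with a single-frequency-bin simulation using a rank-1 speech correlation matrix (one source), as in the abstract's convergence condition. This is a generic MWF sketch under simulated data, not the reduced-bandwidth or distributed procedures of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated M-microphone scene in one frequency bin: a single speech
# source (rank-1 speech correlation matrix) plus uncorrelated noise.
M, N = 4, 20000
a = rng.normal(size=M) + 1j * rng.normal(size=M)     # acoustic transfer function
s = rng.normal(size=N) + 1j * rng.normal(size=N)     # source signal
noise = 0.5 * (rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N)))
y = np.outer(a, s) + noise                           # microphone signals

# MWF estimating the speech component at reference microphone 0:
# w = Ryy^{-1} Ryd, with Ryd approximated by (Ryy - Rnn) e0.
Ryy = y @ y.conj().T / N
Rnn = noise @ noise.conj().T / N
e0 = np.zeros(M)
e0[0] = 1.0
w = np.linalg.solve(Ryy, (Ryy - Rnn) @ e0)

d = a[0] * s                                         # desired speech at mic 0
out = w.conj() @ y
err_mwf = np.mean(np.abs(out - d) ** 2)
err_ref = np.mean(np.abs(y[0] - d) ** 2)
print(f"MWF error {err_mwf:.3f} vs. unprocessed error {err_ref:.3f}")
```

With all microphone signals available this is the benchmark binaural MWF of the paper; the reduced-bandwidth schemes replace the contralateral signals with a single filtered combination.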
  • Binaural Localization Based on Weighted Wiener Gain Improved by Incremental Source Attenuation

    Page(s): 52 - 65

    This paper addresses the problem of direction-of-arrival (DOA) estimation, in both azimuth and elevation angle, from binaural sound that is processed with a head-related transfer function (HRTF). Previously, we proposed a weighted Wiener gain (WWG) method for two-dimensional DOA estimation with two directional microphones. However, for signals processed with HRTFs, peaks in the spatial spectra of WWG indicating true sources can mingle with spurious peaks. To resolve this situation, we propose to apply incremental source attenuation (ISA) in combination with WWG. ISA reduces spectral components originating from specified sound sources and thereby improves the localization accuracy of the next targeted source in the proposed incremental estimation procedure. We conduct computer simulations using directional microphones and four HRTF sets corresponding to four individuals. The proposed method is compared to two DOA estimation methods equivalent to generalized cross-correlation functions and to two high-resolution methods, multiple signal classification (MUSIC) and the minimum variance method. For comparison purposes, we introduce binary coherence detection (BCD) to the high-resolution methods for emphasizing valid spectral components for localization in multiple-source conditions. Evaluation results demonstrate that, although MUSIC with BCD yields performance comparable to that of WWG when a single speech source exists, WWG with ISA surpasses the other methods in conditions with two or three speech sources.

  • Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm

    Page(s): 66 - 83

    In this paper, we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here, we investigate six major aspects of the speaker adaptation: initial models; the amount of the training data for the initial models; the transform functions, estimation criteria, and sensitivity of several linear regression adaptation algorithms; and combination algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of the gender-dependent models to single use of the gender-dependent models. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and take up methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.

  • Exploring the Use of Speech Features and Their Corresponding Distribution Characteristics for Robust Speech Recognition

    Page(s): 84 - 94

    The performance of current automatic speech recognition (ASR) systems often deteriorates radically when the input speech is corrupted by various kinds of noise sources. Several methods have been proposed to improve ASR robustness over the last few decades. The related literature can be generally classified into two categories according to whether the methods are directly based on the feature domain or consider some specific statistical feature characteristics. In this paper, we present a polynomial regression approach that has the merit of directly characterizing the relationship between speech features and their corresponding distribution characteristics to compensate for noise interference. The proposed approach and a variant were thoroughly investigated and compared with a few existing noise robustness approaches. All experiments were conducted using the Aurora-2 database and task. The results show that our approaches achieve considerable word error rate reductions over the baseline system and are comparable to most of the conventional robustness approaches discussed in this paper.

  • A Probabilistic Generative Framework for Extractive Broadcast News Speech Summarization

    Page(s): 95 - 106

    In this paper, we consider extractive summarization of broadcast news speech and propose a unified probabilistic generative framework that combines the sentence generative probability and the sentence prior probability for sentence ranking. Each sentence of a spoken document to be summarized is treated as a probabilistic generative model for predicting the document. Two matching strategies, namely literal term matching and concept matching, are thoroughly investigated. We explore the use of the language model (LM) and the relevance model (RM) for literal term matching, while the sentence topical mixture model (STMM) and the word topical mixture model (WTMM) are used for concept matching. In addition, the lexical and prosodic features, as well as the relevance information of spoken sentences, are properly incorporated for the estimation of the sentence prior probability. An elegant feature of our proposed framework is that both the sentence generative probability and the sentence prior probability can be estimated in an unsupervised manner, without the need for handcrafted document-summary pairs. The experiments were performed on Chinese broadcast news collected in Taiwan, and very encouraging results were obtained.

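The core ranking idea above, treating each sentence as a generative model for its document, can be sketched with smoothed unigram models. This is a toy stand-in for the paper's LM/RM and topical-mixture variants; the tokens and smoothing weight are hypothetical.

```python
from collections import Counter
import math

def sentence_score(sentence, doc_words, lam=0.7):
    # Log-probability of the whole document under the sentence's unigram
    # model, linearly smoothed with the document's own unigram model.
    sc, dc = Counter(sentence), Counter(doc_words)
    n_s, n_d = len(sentence), len(doc_words)
    return sum(math.log(lam * sc[w] / n_s + (1 - lam) * dc[w] / n_d)
               for w in doc_words)

def rank_sentences(doc):
    # doc: list of tokenized sentences; the flattened document serves
    # as both the prediction target and the smoothing background.
    words = [w for sent in doc for w in sent]
    return sorted(doc, key=lambda s: sentence_score(s, words), reverse=True)

doc = [["economy", "grows", "fast"],
       ["economy", "economy", "grows"],
       ["cat", "dog", "bird"]]
ranked = rank_sentences(doc)
print(ranked[0])  # the sentence that best "generates" the document
```

A summary would then be the top-ranked sentences, optionally reweighted by a separately estimated sentence prior, as the framework describes.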
  • Theory and Design of Soundfield Reproduction Using Continuous Loudspeaker Concept

    Page(s): 107 - 116

    Reproduction of a soundfield is a fundamental problem in acoustic signal processing. A common approach is to use an array of loudspeakers to reproduce the desired field, with the least-squares method used to calculate the loudspeaker weights. However, the least-squares method involves matrix inversion, which may lead to errors if the matrix is poorly conditioned. In this paper, we use the concept of a theoretical continuous loudspeaker on a circle to derive the discrete loudspeaker aperture functions while avoiding matrix inversion. In addition, the aperture function obtained through the continuous-loudspeaker method reveals the underlying structure of the solution as a function of the desired soundfield, the loudspeaker positions, and the frequency. The concept can also be applied to 3-D soundfield reproduction using spherical harmonics analysis with a spherical array. Results are verified through computer simulations.

  • An Approach for Solving the Permutation Problem of Convolutive Blind Source Separation Based on Statistical Signal Models

    Page(s): 117 - 126

    In this paper, we present a new algorithm for solving the permutation ambiguity in convolutive blind source separation. Transformed to the frequency domain, the convolutive separation problem reduces to independent instantaneous separation problems in each frequency bin, which existing algorithms can solve efficiently. However, this independence leaves the problem of correctly aligning the individual bins. The new algorithm models the frequency-domain separated signals by means of the generalized Gaussian distribution and exploits the small deviation of the distribution parameters between neighboring bins to detect correct permutations. The performance of the algorithm is demonstrated on synthetic and real-world data.

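The bin-alignment step can be sketched as follows: given a per-bin, per-source distribution parameter (e.g., an estimated generalized-Gaussian shape), choose in each bin the permutation that keeps the parameter trajectory smooth across neighboring bins. This is a simplified illustration of the paper's neighboring-bin criterion; the parameter values are made up.

```python
import numpy as np
from itertools import permutations

def align_permutations(stats):
    """stats: array (bins, sources) of a per-bin distribution parameter
    for each separated signal. Returns, for each bin, the source ordering
    that minimizes the parameter deviation from the previous bin."""
    n_bins, n_src = stats.shape
    perms = list(permutations(range(n_src)))
    order = [tuple(range(n_src))]          # anchor the first bin
    for k in range(1, n_bins):
        prev = stats[k - 1, list(order[-1])]
        best = min(perms,
                   key=lambda p: np.sum((stats[k, list(p)] - prev) ** 2))
        order.append(best)
    return order

# Two sources whose parameters drift slowly, with bin 2 permuted.
stats = np.array([[1.0, 2.0],
                  [1.1, 2.1],
                  [2.2, 1.2],   # outputs swapped in this bin
                  [1.3, 2.3]])
print(align_permutations(stats))  # the third bin is flipped back
```

A full system would estimate the shape parameter from the separated signals in each bin before running the alignment.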
  • Idiolect Extraction and Generation for Personalized Speaking Style Modeling

    Page(s): 127 - 137

    A person's speaking style, consisting of such attributes as voice, choice of vocabulary, and the physical motions employed, not only expresses the speaker's identity but also emphasizes the content of an utterance. Speech combining these aspects of speaking style becomes more vivid and expressive to listeners. Recent research on speaking style modeling has concentrated mainly on speech signal processing; this study instead focuses on text processing for idiolect extraction and generation to model a specific person's speaking style for the application of text-to-speech (TTS) conversion. The first stage of this study adopts a statistical method to automatically detect candidate idiolects from a personalized, transcribed speech corpus. Based on the categorization of the detected candidate idiolects, superfluous idiolects are extracted using the fluency measure, while the remaining candidates are regarded as nonsuperfluous idiolects. In idiolect generation, the input text is converted into a target text with a particular speaker's speaking style via the insertion of superfluous idiolects or synonym substitution of nonsuperfluous idiolects. To evaluate the performance of the proposed methods, experiments were conducted on a Chinese corpus collected and transcribed from the speech files of three Taiwanese politicians. The results show that the proposed method can effectively convert a source text into a target text with a personalized speaking style.

  • Unsupervised Adaptation of Categorical Prosody Models for Prosody Labeling and Speech Recognition

    Page(s): 138 - 149

    Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher-level suprasegmental cues that are characteristic of human speech. Recent experiments have shown that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. However, categorical prosody models are severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce the binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively (relative to the seed models) on the Boston University Radio News Corpus, while the prosody-enriched ASR exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.

  • Theoretical Analysis of a First-Order Azimuth-Steerable Superdirective Microphone Array

    Page(s): 150 - 162

    A first-order azimuth-steerable superdirectional microphone response can be constructed by means of a linear combination of three eigenbeams (a monopole and two orthogonal dipoles). Via this method, we can construct any first-order directivity pattern (monopole, cardioid, hypercardioid, etc.) that can be electronically steered to a certain angle on the 2-D plane to capture the desired signal. In this paper, the superdirectional responses are generated via a planar microphone array with a square geometry. We analyze the influence of spatial aliasing on the captured desired signal and the directivity index. Furthermore, we investigate the sensitivity to uncorrelated sensor noise and to phase and magnitude errors on the individual sensors. Finally, two rules of thumb are derived for choosing the size of the microphone array.

  • Lattice Extension and Vocabulary Adaptation for Turkish LVCSR

    Page(s): 163 - 173

    This paper presents two-pass speech recognition techniques to handle the out-of-vocabulary (OOV) problem in Turkish newspaper content transcription. OOV words are assumed to be replaced by acoustically "similar" in-vocabulary (IV) words during decoding. Therefore, the first-pass recognition lattice is used as the prior knowledge to adapt the vocabulary and the search space for the second pass. Vocabulary adaptation and lattice extension are performed with words similar to the hypothesis lattice words. These words are selected from a fallback vocabulary using distance functions that take the agglutinative language characteristics of Turkish into account. Morphology-based and phonetic-distance-based similarity functions yield 1.9% and 4.6% absolute accuracy improvements, respectively. Statistical sub-word units are also utilized to handle the OOV problem encountered in the word-based system. Using sub-words alleviates the OOV problem and improves the recognition accuracy: OOV accuracy improved from 0% to 60.2%. However, this introduces ungrammatical items to the recognition output. Since automatically derived sub-word units do not provide explicit morphological features, the lattice extension strategy is modified to correct these ungrammatical items. Lattice extension for sub-words reduces the word error rate to 32.3% from 33.9%. This improvement is statistically significant at p=0.002 as measured by the NIST MAPSSWE significance test.

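The vocabulary-adaptation idea above, adding fallback-vocabulary words "similar" to hypothesis-lattice words, can be sketched with plain orthographic edit distance. The paper uses morphology-based and phonetic-distance functions; the edit-distance stand-in, the threshold, and the Turkish words below are illustrative only.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming (one row).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def extend_vocabulary(lattice_words, fallback_vocab, max_dist=2):
    # Add fallback words close to some hypothesis-lattice word; these are
    # candidate corrections for OOV words the first pass replaced.
    return {fw for fw in fallback_vocab
            if any(edit_distance(fw, lw) <= max_dist for lw in lattice_words)}

print(extend_vocabulary({"gazete"}, {"gazeteci", "televizyon"}))
```

The selected words would then be added to the second-pass vocabulary and spliced into the lattice at the corresponding arcs.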
  • Temporal Integration for Audio Classification With Application to Musical Instrument Classification

    Page(s): 174 - 186

    Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe musical audio content. There is in fact a growing interest in music information retrieval (MIR) applications, among which the most popular are related to music similarity retrieval, artist identification, and musical genre or instrument recognition. Current MIR-related classification systems usually do not take into account the mid-term temporal properties of the signal (over several frames) and rely on the assumption that the observations of the features in different frames are statistically independent. The aim of this paper is to demonstrate the usefulness of the information carried by the evolution of these characteristics over time. To that end, we propose a number of methods for early and late temporal integration and provide an in-depth experimental study of their interest for the task of musical instrument recognition on solo musical phrases. In particular, the impact of the time horizon over which the temporal integration is performed is assessed for both fixed- and variable-frame-length analysis. In addition, a number of proposed alignment kernels are used for late temporal integration. For all experiments, the results are compared to a state-of-the-art musical instrument recognition system.

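Early temporal integration, one of the strategies studied above, can be sketched as summarizing short-term feature frames over longer texture windows, e.g., by stacking the per-window mean and standard deviation. The window and hop sizes below are arbitrary choices, not values from the paper.

```python
import numpy as np

def early_integration(frames, win=8, hop=4):
    """Early temporal integration: summarize short-term feature frames
    (shape: n_frames x n_features) over sliding texture windows by
    stacking each window's per-feature mean and standard deviation."""
    out = []
    for start in range(0, len(frames) - win + 1, hop):
        seg = frames[start:start + win]
        out.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.array(out)

# 100 frames of 13 MFCC-like coefficients -> integrated feature vectors
frames = np.random.default_rng(1).normal(size=(100, 13))
feats = early_integration(frames)
print(feats.shape)  # (24, 26): 24 windows, mean+std of 13 coefficients
```

Late integration, by contrast, would classify individual frames first and combine the per-frame decisions or scores, e.g., with the alignment kernels the abstract mentions.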
  • Discriminatively Trained GMMs for Language Classification Using Boosting Methods

    Page(s): 187 - 197

    In language identification and other speech applications, discriminatively trained models often outperform nondiscriminative models trained with the maximum-likelihood criterion. For instance, discriminative Gaussian mixture models (GMMs) are typically trained by optimizing discriminative criteria that can be computationally expensive and complex to implement. In this paper, we explore a novel approach to discriminative GMM training by using a variant of the boosting framework (R. Schapire, "The boosting approach to machine learning: an overview," Proc. MSRI Workshop on Nonlinear Estimation and Classification, 2002) from machine learning, in which an ensemble of GMMs is trained sequentially. We have extended the purview of boosting to class-conditional models (as opposed to discriminative models such as classification trees). The effectiveness of our boosting variation comes from the emphasis on working with the misclassified data to achieve discriminatively trained models. Our variant of boosting also utilizes low-confidence data classifications, as well as misclassified examples, in classifier generation. We further apply our boosting approach to anti-models to achieve additional performance gains. We have applied our discriminative training approach to a variety of language identification experiments using the 12-language NIST 2003 language identification task and show the significant performance improvements that can be obtained. The experiments include both acoustic and token-based speech models. Our best-performing boosted GMM-based system on the 12-language verification task has a 2.3% EER.

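The boosting variant described above, re-emphasizing misclassified data when training class-conditional models sequentially, can be sketched in one dimension with single Gaussians standing in for GMMs. The weight-doubling update is a deliberate simplification, not the authors' confidence-based rule.

```python
import math

def weighted_gaussian(xs, ws):
    # Weighted ML estimate of a 1-D Gaussian (a stand-in for a GMM).
    tot = sum(ws)
    mu = sum(w * x for w, x in zip(ws, xs)) / tot
    var = sum(w * (x - mu) ** 2 for w, x in zip(ws, xs)) / tot + 1e-6
    return mu, var

def loglik(x, model):
    mu, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def boost(xs, ys, rounds=5):
    # Train class-conditional models sequentially, re-weighting the data
    # each round so misclassified examples get more emphasis.
    ws = [1.0] * len(xs)
    ensemble = []
    for _ in range(rounds):
        models = {c: weighted_gaussian([x for x, y in zip(xs, ys) if y == c],
                                       [w for w, y in zip(ws, ys) if y == c])
                  for c in set(ys)}
        ensemble.append(models)
        for i, (x, y) in enumerate(zip(xs, ys)):
            pred = max(models, key=lambda c: loglik(x, models[c]))
            if pred != y:
                ws[i] *= 2.0  # emphasize mistakes in the next round
    return ensemble

def classify(x, ensemble):
    # Combine the ensemble by summing per-class log-likelihoods.
    scores = {}
    for models in ensemble:
        for c, m in models.items():
            scores[c] = scores.get(c, 0.0) + loglik(x, m)
    return max(scores, key=scores.get)

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
ensemble = boost(xs, ys)
print(classify(-1.8, ensemble), classify(1.8, ensemble))  # 0 1
```

Replacing the single Gaussians with per-class GMMs, and the weight update with one driven by classification confidence, would move this toward the system in the paper.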
  • Correction to "Word Boundary Detection With Mel-Scale Frequency Bank in Noisy Environment" [Sep 08 541-554]

    Page(s): 198

    In the above-titled article (ibid., vol. 8, no. 5, pp. 541-554, Sep. 00), equation (10) contained a typographical error. The corrected equation is presented here.

  • IEEE Transactions on Audio, Speech, and Language Processing Edics

    Page(s): 199 - 200
  • IEEE Transactions on Audio, Speech, and Language Processing Information for authors

    Page(s): 201 - 202
  • 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro

    Page(s): 203
  • Call for papers on Applications of Signal Processing to Audio and Acoustics

    Page(s): 204
  • IEEE Signal Processing Society Information

    Page(s): C3

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research