Audio, Speech, and Language Processing, IEEE Transactions on

Issue 3 • Date March 2013

Displaying Results 1 - 25 of 31
  • Front Cover

    Publication Year: 2013, Page(s): C1
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2013, Page(s): C2
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2013, Page(s): 461 - 462
    Freely Available from IEEE
  • Time Difference of Arrival Estimation Exploiting Multichannel Spatio-Temporal Prediction

    Publication Year: 2013, Page(s): 463 - 475
    Cited by: Papers (1)

    To localize sound sources in room acoustic environments, time differences of arrival (TDOA) between two or more microphone signals must be determined. This problem is often referred to as time delay estimation (TDE). The multichannel cross-correlation-coefficient (MCCC) algorithm, which is an extension of the traditional cross-correlation method from two- to multiple-channel cases, exploits spatia...
  • A New Bayesian Method Incorporating With Local Correlation for IBM Estimation

    Publication Year: 2013, Page(s): 476 - 487

    Much effort has gone into Ideal Binary Mask (IBM) estimation via statistical learning methods, of which the Bayesian method is a common one. However, one drawback is that the mask is estimated for each time-frequency (T-F) unit independently, so the correlation between units is not fully taken into account. In this paper, we attempt to consider the local correlation information from two asp...
  • Band-Limited Impulse Train Generation Using Sampled Infinite Impulse Responses of Analog Filters

    Publication Year: 2013, Page(s): 488 - 497
    Cited by: Papers (1)

    The oscillator, or waveform generator, is at the heart of musical sound synthesizer technology. A digital oscillator is the discrete-time counterpart of the analog voltage-controlled oscillator. A band-limited oscillator (BLO) is a digital oscillator that explicitly limits the power of the aliasing artifacts. It aims at reproducing on a Digital Signal Processor (DSP) the popular waveforms such as ...
  • Building Acoustic Model Ensembles by Data Sampling With Enhanced Trainings and Features

    Publication Year: 2013, Page(s): 498 - 507
    Cited by: Papers (3)

    We propose a novel approach that uses data sampling based on Cross Validation (CV) and Speaker Clustering (SC) to construct an ensemble of acoustic models for speech recognition. We also investigate the effects of the existing techniques of Cross Validation Expectation Maximization (CVEM), Discriminative Training (DT), and Multiple Layer Perceptron (MLP) features on the quality of the proposed ensemb...
  • On the Relation Between Pinna Reflection Patterns and Head-Related Transfer Function Features

    Publication Year: 2013, Page(s): 508 - 519
    Cited by: Papers (5)

    This paper studies the relationship between head-related transfer functions (HRTFs) and pinna reflection patterns in the frontal hemispace. A pre-processed database of HRTFs allows extraction of up to three spectral notches from each response taken in the median sagittal plane. Ray-tracing analysis performed on the obtained notches' central frequencies is compared with a set of possible reflection...
  • On-Line Melody Extraction From Polyphonic Audio Using Harmonic Cluster Tracking

    Publication Year: 2013, Page(s): 520 - 530
    Cited by: Papers (3)

    Extracting the predominant melody from musical performances containing various instruments is one of the most challenging tasks in the field of music information retrieval and computational musicology. This paper presents a novel framework which estimates the predominant vocal melody in real time by tracking various sources with the help of harmonic clusters (combs) and then determining the predomin...
  • A Robust Fitness Measure for Capturing Repetitions in Music Recordings With Applications to Audio Thumbnailing

    Publication Year: 2013, Page(s): 531 - 543
    Cited by: Papers (2)

    The automatic extraction of structural information from music recordings constitutes a central research topic. In this paper, we deal with a subproblem of audio structure analysis called audio thumbnailing, whose goal is to determine the audio segment that best represents a given music recording. Typically, such a segment has many (approximate) repetitions covering large parts of the recording. As ...
  • Structured SVMs for Automatic Speech Recognition

    Publication Year: 2013, Page(s): 544 - 555
    Cited by: Papers (3)

    Structured discriminative models are a flexible sequence classification approach that enables a wide variety of features to be used. This paper describes a particular model in this framework, structured support vector machines (SSVM), and how it can be applied to medium to large vocabulary speech recognition tasks. An important aspect of SSVMs is the form of the joint feature spaces. Here, context-...
  • Parametric Voice Conversion Based on Bilinear Frequency Warping Plus Amplitude Scaling

    Publication Year: 2013, Page(s): 556 - 566
    Cited by: Papers (4)

    Voice conversion methods based on frequency warping followed by amplitude scaling have recently been proposed. These methods modify the frequency axis of the source spectrum in such a manner that some significant parts of it, usually the formants, are moved towards their image in the target speaker's spectrum. Amplitude scaling is then applied to compensate for the differences between warped source ...
  • Fast and Accurate Direct MDCT to DFT Conversion With Arbitrary Window Functions

    Publication Year: 2013, Page(s): 567 - 578
    Cited by: Papers (1)

    In this paper, we propose a method for direct conversion of MDCT coefficients to DFT coefficients, without passing through time signal reconstruction. In contrast to previous works, this method is valid for any pair of MDCT and DFT window functions. It is based on the decomposition of the MDCT-to-DFT conversion matrices into a Toeplitz part plus a Hankel part. The latter is split, then mirrored an...
  • The Inherent Temporal Precision of Phoneme Transitions

    Publication Year: 2013, Page(s): 579 - 586

    In natural speech, some phoneme transitions correspond to abrupt changes in the acoustic signal. Others are less clear-cut because the acoustic transition from one phoneme to the next is gradual. In this paper, we determine the naturally occurring groups of phonemes (regardless of conventional phonetic categories) which show similar characteristics in such behavior. These data-driven groupings coul...
  • Autoregressive Models for Statistical Parametric Speech Synthesis

    Publication Year: 2013, Page(s): 587 - 597
    Cited by: Papers (6)

    We propose using the autoregressive hidden Markov model (HMM) for speech synthesis. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard approach to statistical parametric speech synthesis. It supports easy and efficient parameter estimation using expectation maximization, in contrast to the trajectory HMM. At the same t...
  • Default Bayesian Estimation of the Fundamental Frequency

    Publication Year: 2013, Page(s): 598 - 610
    Cited by: Papers (2)

    Joint fundamental frequency and model order estimation is an important problem in several applications. In this paper, a default estimation algorithm based on a minimum of prior information is presented. The algorithm is developed in a Bayesian framework, and it can be applied to both real- and complex-valued discrete-time signals which may have missing samples or may have been sampled at a non-un...
  • Identification of Objectionable Audio Segments Based on Pseudo and Heterogeneous Mixture Models

    Publication Year: 2013, Page(s): 611 - 623

    In this paper, we generalize the Gaussian Mixture Model (GMM) in two ways: a) by introducing novel distance measures between two vectors based on nonlinear maps to give more general mixture models; b) by building mixture models based on multiple different kinds of distributions. These two generalizations cope with different problems arising in feature modeling. Mixture model obtained by first metho...
  • MMSE-Based Missing-Feature Reconstruction With Temporal Modeling for Robust Speech Recognition

    Publication Year: 2013, Page(s): 624 - 635
    Cited by: Papers (3)

    This paper addresses the problem of feature compensation in the log-spectral domain by using the missing-data (MD) approach to noise robust speech recognition, that is, the log-spectral features can be either almost unaffected by noise or completely masked by it. First, a general MD framework based on minimum mean square error (MMSE) estimation is introduced which exploits the correlation across f...
  • Comparison and Combination of Lightly Supervised Approaches for Language Portability of a Spoken Language Understanding System

    Publication Year: 2013, Page(s): 636 - 648
    Cited by: Papers (3)

    Portability of a spoken dialogue system (SDS) to a new domain or a new language is an active topic, as it may imply gains in time and cost for building new SDSs. In particular, in this paper we investigate several fast and efficient approaches for language portability of the spoken language understanding (SLU) module of a dialogue system. We show that the use of statistical machine translation (SMT) can...
  • Automatic Twitter Topic Summarization With Speech Acts

    Publication Year: 2013, Page(s): 649 - 658
    Cited by: Papers (2)

    With the growth of the social media service Twitter, automatic summarization of Twitter messages (tweets) is urgently needed for efficient processing of the massive volume of tweeted information. Unlike multi-document summarization in general, Twitter topic summarization must handle the numerous, short, dissimilar, and noisy nature of tweets. To address this challenge, we propose a novel speech act-guided...
  • Sparse Inverse Covariance Matrices for Low Resource Speech Recognition

    Publication Year: 2013, Page(s): 659 - 668
    Cited by: Papers (3)

    We propose to use sparse inverse covariance matrices for acoustic model training when there is insufficient training data. Acoustic models trained with inadequate training data tend to overfit, generalizing poorly to unseen test data, especially when full covariance matrices are used. We address this problem by adding an L1 regularization term to the traditional objective function for maximum lik...
  • Design of IIR Filters With Bayesian Model Selection and Parameter Estimation

    Publication Year: 2013, Page(s): 669 - 674

    Bayesian model selection and parameter estimation are used to address the problem of choosing the most concise filter order for a given application while simultaneously determining the associated filter coefficients. This approach is validated against simulated data and used to generate pole-zero representations of head-related transfer functions.
  • Audio Watermarking Via EMD

    Publication Year: 2013, Page(s): 675 - 680
    Cited by: Papers (6)

    In this paper, a new adaptive audio watermarking algorithm based on Empirical Mode Decomposition (EMD) is introduced. The audio signal is divided into frames and each one is decomposed adaptively, by EMD, into intrinsic oscillatory components called Intrinsic Mode Functions (IMFs). The watermark and the synchronization codes are embedded into the extrema of the last IMF, a low frequency mode stable...
  • [Blank page]

    Publication Year: 2013, Page(s): 681 - 682
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing EDICS

    Publication Year: 2013, Page(s): 683 - 684
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research