IEEE Transactions on Audio, Speech, and Language Processing

Issue 3 • March 2008

  • Table of contents

    Publication Year: 2008, Page(s): C1 - C4
    PDF (139 KB)
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2008, Page(s): C2
    PDF (36 KB)
    Freely Available from IEEE
  • A Minimum Distortion Noise Reduction Algorithm With Multiple Microphones

    Publication Year: 2008, Page(s): 481 - 493
    Cited by: Papers (19)
    PDF (867 KB) | HTML

    The problem of noise reduction using multiple microphones has long been an active area of research. Over the past few decades, most efforts have been devoted to beamforming techniques, which aim at recovering the desired source signal from the outputs of an array of microphones. In order to work reasonably well in reverberant environments, this approach often requires such knowledge as the direction of arrival (DOA) or even the room impulse responses, which are difficult to acquire reliably in practice. In addition, beamforming has to compromise its noise reduction performance in order to achieve speech dereverberation at the same time. This paper presents a new multichannel algorithm for noise reduction, which formulates the problem as one of estimating the speech component observed at one microphone using the observations from all the available microphones. This new approach explicitly uses the idea of spatial-temporal prediction and achieves noise reduction in two steps. The first step is to determine a set of inter-sensor optimal spatial-temporal prediction transformations. These transformations are then exploited in the second step to form an optimal noise-reduction filter. In comparison with traditional beamforming techniques, this new method has many appealing properties: it does not require DOA information or any knowledge of either the reverberation condition or the channel impulse responses; the multiple microphones do not have to be arranged into a specific array geometry; it works the same for both the far-field and near-field cases; and, most importantly, it can produce very good and robust noise reduction with minimum speech distortion in practical environments. Furthermore, with this new approach, it is possible to apply postprocessing filtering for additional noise reduction when a specified level of speech distortion is allowed.
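
    The two-step structure lends itself to a compact numerical illustration. Below is a minimal least-squares sketch of the inter-sensor spatial-temporal prediction idea on a synthetic two-microphone noise segment; it is not the authors' algorithm (no speech component, a single auxiliary channel, an arbitrary filter order), only the predict-then-subtract mechanic.

    ```python
    import numpy as np

    def fit_prediction_filter(ref, aux, order=8):
        # Least-squares filter predicting the reference channel from a
        # time-lagged (spatial-temporal) stack of the auxiliary channel.
        X = np.stack([np.roll(aux, k) for k in range(order)], axis=1)
        X[:order] = 0.0                       # drop wrap-around samples
        h, *_ = np.linalg.lstsq(X, ref, rcond=None)
        return h, X

    # Synthetic noise-only segment seen by two microphones.
    rng = np.random.default_rng(0)
    src = rng.standard_normal(4000)
    mic1 = np.convolve(src, [1.0, 0.4], mode="same")
    mic2 = np.convolve(src, [0.7, 0.2, 0.1], mode="same")

    h, X = fit_prediction_filter(mic1, mic2)
    residual = mic1 - X @ h                   # noise left after prediction
    print(mic1.var(), "->", residual.var())   # variance drops sharply
    ```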

  • HMM Word and Phrase Alignment for Statistical Machine Translation

    Publication Year: 2008, Page(s): 494 - 507
    Cited by: Papers (5)
    PDF (625 KB) | HTML

    Estimation and alignment procedures for word and phrase alignment hidden Markov models (HMMs) are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model 4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.
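
    As an orientation to the HMM alignment formalism, here is a toy Viterbi decoder in which states are source positions, emissions are translation probabilities, and transitions penalize large jumps. The translation table, the unnormalized jump penalty, and the uniform start are hand-set assumptions, not Model 4 or the paper's estimation procedure.

    ```python
    import numpy as np

    def viterbi_hmm_alignment(log_t):
        # States are source positions; log_t[j, i] = log p(f_j | e_i).
        # Transitions use an unnormalized jump penalty -|i - i'|.
        J, I = log_t.shape
        log_jump = -np.abs(np.arange(I)[:, None] - np.arange(I)[None, :])
        delta = log_t[0] - np.log(I)              # uniform start
        back = np.zeros((J, I), dtype=int)
        for j in range(1, J):
            cand = delta[:, None] + log_jump      # from i' (rows) to i (cols)
            back[j] = cand.argmax(axis=0)
            delta = cand[back[j], np.arange(I)] + log_t[j]
        a = [int(delta.argmax())]
        for j in range(J - 1, 0, -1):
            a.append(int(back[j][a[-1]]))
        return a[::-1]                            # alignment a_1 .. a_J

    # Toy 3-word sentence pair.
    log_t = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.2, 0.1, 0.7]]))
    print(viterbi_hmm_alignment(log_t))           # -> [0, 1, 2]
    ```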

  • Combining Spectral Representations for Large-Vocabulary Continuous Speech Recognition

    Publication Year: 2008, Page(s): 508 - 518
    Cited by: Papers (12)
    PDF (740 KB) | HTML

    In this paper, we investigate the combination of complementary acoustic feature streams in large-vocabulary continuous speech recognition (LVCSR). We have explored the use of acoustic features obtained using a pitch-synchronous analysis, STRAIGHT, in combination with conventional features such as Mel frequency cepstral coefficients. Pitch-synchronous acoustic features are of particular interest when used with vocal tract length normalization (VTLN), which is known to be affected by the fundamental frequency. We have combined these spectral representations directly at the acoustic feature level using heteroscedastic linear discriminant analysis (HLDA) and at the system level using ROVER. We evaluated this approach on three LVCSR tasks: dictated newspaper text (WSJCAM0), conversational telephone speech (CTS), and multiparty meeting transcription. The CTS and meeting transcription experiments were both evaluated using standard NIST test sets and evaluation protocols. Our results indicate that combining conventional and pitch-synchronous acoustic feature sets using HLDA results in a consistent, significant decrease in word error rate across all three tasks. Combining at the system level using ROVER resulted in a further significant decrease in word error rate.
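
    A rough sketch of the feature-level combination: concatenate two frame-synchronous streams and project with plain Fisher LDA, standing in for HLDA (which additionally models class covariances). The stream contents, labels, and dimensions below are synthetic assumptions.

    ```python
    import numpy as np

    def lda_projection(X, y, dim):
        # Fisher LDA: maximize between-class over within-class scatter.
        classes = np.unique(y)
        mu = X.mean(axis=0)
        Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in classes)
        Sb = sum((y == c).sum() * np.outer(X[y == c].mean(0) - mu,
                                           X[y == c].mean(0) - mu) for c in classes)
        evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
        order = np.argsort(evals.real)[::-1][:dim]
        return evecs.real[:, order]

    # Two synthetic 13-dim streams, concatenated at the frame level.
    rng = np.random.default_rng(1)
    mfcc, pitch_sync = rng.standard_normal((2, 2000, 13))
    y = rng.integers(0, 40, 2000)          # frame-level class labels
    X = np.hstack([mfcc, pitch_sync])      # 26-dim combined features
    W = lda_projection(X, y, dim=13)
    print((X @ W).shape)                   # (2000, 13)
    ```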

  • Random Forests of Phonetic Decision Trees for Acoustic Modeling in Conversational Speech Recognition

    Publication Year: 2008, Page(s): 519 - 528
    Cited by: Papers (7)
    PDF (837 KB) | HTML

    In this paper, we present a novel technique of constructing phonetic decision trees (PDTs) for acoustic modeling in conversational speech recognition. We use random forests (RFs) to train a set of PDTs for each phone state unit and obtain multiple acoustic models accordingly. We investigate several methods of combining acoustic scores from the multiple models, including maximum-likelihood estimation of the weights of different acoustic models from training data, as well as using confidence scores or relative entropy to obtain the weights dynamically from online data. Since computing acoustic scores from the multiple models slows down decoding search, we propose clustering methods to compact the RF-generated acoustic models. The conventional concept of PDT-based state tying is extended to RF-based state tying. On each RF tied state, we cluster the Gaussian density functions (GDFs) from multiple acoustic models into classes and compute a prototype for each class to represent the original GDFs. In this way, the number of GDFs in each RF tied state is decreased greatly, which significantly reduces the time for computing acoustic scores. Experimental results on a telemedicine automatic captioning task demonstrate that the proposed RF-PDT technique leads to significant improvements in word recognition accuracy.
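
    The model-compaction step can be sketched with ordinary k-means over the pooled Gaussian mean vectors. The paper clusters full density functions per RF tied state, so treat this as the flavor of the idea only; all sizes are assumptions.

    ```python
    import numpy as np

    def compact_gdfs(means, n_proto, iters=20, seed=0):
        # k-means over pooled Gaussian means; each cluster mean becomes
        # a prototype density, shrinking the models to be evaluated.
        rng = np.random.default_rng(seed)
        protos = means[rng.choice(len(means), n_proto, replace=False)]
        for _ in range(iters):
            d = ((means[:, None, :] - protos[None]) ** 2).sum(-1)
            assign = d.argmin(axis=1)
            for k in range(n_proto):
                if (assign == k).any():
                    protos[k] = means[assign == k].mean(axis=0)
        return protos, assign

    # 4 random-forest models x 32 Gaussians in one tied state, 39-dim.
    means = np.random.default_rng(2).standard_normal((4 * 32, 39))
    protos, assign = compact_gdfs(means, n_proto=32)
    print(protos.shape)                    # (32, 39): 4x fewer densities
    ```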

  • Transcription and Separation of Drum Signals From Polyphonic Music

    Publication Year: 2008, Page(s): 529 - 540
    Cited by: Papers (15)
    PDF (915 KB) | HTML

    The purpose of this article is to present new advances in music transcription and source separation with a focus on drum signals. A complete drum transcription system is described, which combines information from the original music signal and a drum track enhanced version obtained by source separation. In addition to efficient fusion strategies to take into account these two complementary sources of information, the transcription system integrates a large set of features, optimally selected by feature selection. Concurrently, the problem of drum track extraction from polyphonic music is tackled both by proposing a novel approach based on harmonic/noise decomposition and time/frequency masking and by improving an existing Wiener filtering-based separation method. The separation and transcription techniques presented are thoroughly evaluated on a large public database of music signals. A transcription accuracy between 64.5% and 80.3% is obtained, depending on the drum instrument, for well-balanced mixes, and the efficiency of our drum separation algorithms is illustrated in a comprehensive benchmark.
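
    For a feel of time/frequency masking applied to drum extraction, here is a median-filtering harmonic/percussive split (Fitzgerald-style). It is not the paper's harmonic/noise decomposition; the STFT parameters and the soft mask are assumptions.

    ```python
    import numpy as np
    from scipy.ndimage import median_filter
    from scipy.signal import istft, stft

    def percussive_mask_separation(x, fs, nfft=1024):
        # Harmonic energy is smooth along time; percussive energy is
        # smooth along frequency. Median filters + a soft mask split them.
        _, _, X = stft(x, fs, nperseg=nfft)
        S = np.abs(X)
        H = median_filter(S, size=(1, 17))     # smooth across frames
        P = median_filter(S, size=(17, 1))     # smooth across bins
        mask = P / (H + P + 1e-12)             # soft mask favoring drums
        _, drums = istft(mask * X, fs)
        return drums

    fs = 22050
    x = np.random.default_rng(3).standard_normal(2 * fs)   # stand-in audio
    drums = percussive_mask_separation(x, fs)
    ```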

  • Noise Tracking Using DFT Domain Subspace Decompositions

    Publication Year: 2008, Page(s): 541 - 553
    Cited by: Papers (16) | Patents (1)
    PDF (900 KB) | HTML

    All discrete Fourier transform (DFT) domain-based speech enhancement gain functions rely on knowledge of the noise power spectral density (PSD). Since the noise PSD is unknown in advance, estimation from the noisy speech signal is necessary. An overestimation of the noise PSD will lead to a loss in speech quality, while an underestimation will lead to an unnecessarily high level of residual noise. We present a novel approach for noise tracking, which updates the noise PSD for each DFT coefficient in the presence of both speech and noise. This method is based on the eigenvalue decomposition of correlation matrices that are constructed from time series of noisy DFT coefficients. The presented method is very well capable of tracking gradually changing noise types. In comparison to state-of-the-art noise tracking algorithms, the proposed method reduces the estimation error between the estimated and the true noise PSD. In combination with an enhancement system, the proposed method improves the segmental SNR by several decibels for gradually changing noise types. Listening experiments show that the proposed system is preferred over the state-of-the-art noise tracking algorithm.
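
    The subspace step can be illustrated per DFT bin: build a correlation matrix from a time series of noisy DFT coefficients, eigendecompose it, and read the noise PSD off the smallest eigenvalues. The embedding size, the noise-subspace size, and the synthetic low-rank "speech" are assumptions; the paper's estimator has more machinery.

    ```python
    import numpy as np

    def noise_psd_estimate(dft_bin, m=8, n_noise=4):
        # Correlation matrix of m consecutive noisy DFT coefficients for
        # one bin; the smallest eigenvalues track the noise PSD even
        # while (low-rank) speech is present.
        L = len(dft_bin) - m + 1
        Y = np.stack([dft_bin[i:i + m] for i in range(L)], axis=1)
        R = (Y @ Y.conj().T) / L
        return np.linalg.eigvalsh(R)[:n_noise].mean()

    rng = np.random.default_rng(4)
    t = np.arange(400)
    speech = 3 * np.exp(1j * 0.2 * t) * ((t > 100) & (t < 300))
    noise = (rng.standard_normal(400) + 1j * rng.standard_normal(400)) * np.sqrt(0.5)
    print(noise_psd_estimate(speech + noise))   # close to the true PSD of 1.0
    ```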

  • Cascaded RLS–LMS Prediction in MPEG-4 Lossless Audio Coding

    Publication Year: 2008, Page(s): 554 - 562
    Cited by: Papers (9)
    PDF (722 KB) | HTML

    This paper describes the cascaded recursive least square-least mean square (RLS-LMS) prediction, which is part of the recently published MPEG-4 Audio Lossless Coding international standard. The predictor consists of cascaded stages of simple linear predictors, with the prediction error at the output of one stage passed to the next stage as the input signal. A linear combiner adds up the intermediate estimates at the output of each prediction stage to give a final estimate of the RLS-LMS predictor. In the RLS-LMS predictor, the first prediction stage is a simple first-order predictor with a fixed coefficient value 1. The second prediction stage uses the recursive least square algorithm to adaptively update the predictor coefficients. The subsequent prediction stages use the normalized least mean square algorithm to update the predictor coefficients. The coefficients of the linear combiner are then updated using the sign-sign least mean square algorithm. For stereo audio signals, the RLS-LMS predictor uses both intrachannel prediction and interchannel prediction, which results in a 3% improvement in compression ratio over using only the intrachannel prediction. Through extensive tests, the MPEG-4 Audio Lossless coder using the RLS-LMS predictor has demonstrated a compression ratio that is on par with the best lossless audio coders in the field. In this paper, the structure of the RLS-LMS predictor is described in detail, and the optimal predictor configuration is studied through various experiments.
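
    A stripped-down cascade in the spirit of this predictor: a fixed first-order stage, one NLMS stage on its residual, and a sign-sign LMS combiner. The standard's RLS stage and additional NLMS stages are omitted, and the step sizes and order are arbitrary assumptions.

    ```python
    import numpy as np

    def cascade_predict(x, order=16, mu=0.5, mu_c=1e-4):
        # Stage 1: fixed first-order predictor (coefficient 1).
        # Stage 2: NLMS predictor on stage 1's residual.
        # Combiner: sign-sign LMS over the two stage estimates.
        w = np.zeros(order)
        c = np.array([1.0, 0.0])
        hist = np.zeros(order)                 # past stage-1 residuals
        err = np.zeros_like(x)
        for n in range(1, len(x)):
            est1 = x[n - 1]
            e1 = x[n] - est1
            est2 = w @ hist                    # predict the residual
            ests = np.array([est1, est1 + est2])
            err[n] = x[n] - c @ ests           # residual to entropy-code
            w += mu * (e1 - est2) * hist / (hist @ hist + 1e-6)
            c += mu_c * np.sign(err[n]) * np.sign(ests)
            hist = np.roll(hist, 1)
            hist[0] = e1
        return err

    x = np.convolve(np.random.default_rng(5).standard_normal(5000),
                    np.ones(8) / 8)[:5000]     # toy correlated "audio"
    e = cascade_predict(x)
    print(x.var(), "->", e.var())              # prediction gain
    ```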

  • Constructing Modulation Frequency Domain-Based Features for Robust Speech Recognition

    Publication Year: 2008, Page(s): 563 - 577
    Cited by: Papers (4)
    PDF (1297 KB) | HTML

    Data-driven temporal filtering approaches based on a specific optimization technique have been shown to be capable of enhancing the discrimination and robustness of speech features in speech recognition. The filters in these approaches are often obtained with the statistics of the features in the temporal domain. In this paper, we derive new data-driven temporal filters that employ the statistics of the modulation spectra of the speech features. Three new temporal filtering approaches are proposed and based on constrained versions of linear discriminant analysis (LDA), principal component analysis (PCA), and minimum class distance (MCD), respectively. It is shown that these proposed temporal filters can effectively improve the speech recognition accuracy in various noise-corrupted environments. In experiments conducted on Test Set A of the Aurora-2 noisy digits database, these new temporal filters, together with cepstral mean and variance normalization (CMVN), provide average relative error reduction rates of over 40% and 27% when compared with baseline Mel frequency cepstral coefficient (MFCC) processing and CMVN alone, respectively.
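
    A sketch of the data-driven-filter idea, using plain PCA on temporal segments rather than the paper's constrained, modulation-spectrum-based variants; the trajectory, window length, and normalization are assumptions.

    ```python
    import numpy as np

    def pca_temporal_filter(traj, length=11):
        # The filter is the leading principal axis of windowed segments
        # of one feature coefficient's trajectory.
        segs = np.stack([traj[i:i + length] for i in range(len(traj) - length)])
        segs = segs - segs.mean(axis=0)
        _, _, vt = np.linalg.svd(segs, full_matrices=False)
        h = vt[0]
        return h / np.abs(h).sum()

    rng = np.random.default_rng(6)
    c1 = np.cumsum(rng.standard_normal(500)) * 0.1   # toy c1 trajectory
    h = pca_temporal_filter(c1)
    c1_filtered = np.convolve(c1, h, mode="same")    # filtered feature stream
    ```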

  • Capturing Local Variability for Speaker Normalization in Speech Recognition

    Publication Year: 2008, Page(s): 578 - 593
    Cited by: Papers (1)
    PDF (1973 KB) | HTML

    The new model reduces the impact of local spectral and temporal variability by estimating a finite set of spectral and temporal warping factors which are applied to speech at the frame level. Optimum warping factors are obtained while decoding in a locally constrained search. The model involves augmenting the states of a standard hidden Markov model (HMM), providing an additional degree of freedom. It is argued in this paper that this represents an efficient and effective method for compensating local variability in speech which may have potential application to a broader array of speech transformations. The technique is presented in the context of existing methods for frequency warping-based speaker normalization for ASR. The new model is evaluated in clean and noisy task domains using subsets of the Aurora 2, the Spanish Speech-Dat-Car, and the TIDIGITS corpora. In addition, some experiments are performed on a Spanish language corpus collected from a population of speakers with a range of speech disorders. It has been found that, under clean or not severely degraded conditions, the new model provides improvements over the standard HMM baseline. It is argued that the framework of local warping is an effective general approach to providing more flexible models of speaker variability.
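
    The frame-level warping ingredient can be illustrated with a bare-bones spectral warp; how the factors are selected inside the augmented-HMM search is not shown, and the warp shape and candidate factors are assumptions.

    ```python
    import numpy as np

    def warp_spectrum(mag, alpha):
        # Resample one frame's magnitude spectrum on a linearly warped
        # frequency axis (alpha < 1 stretches, alpha > 1 compresses).
        n = len(mag)
        src = np.clip(np.arange(n) * alpha, 0, n - 1)
        return np.interp(src, np.arange(n), mag)

    frame = np.abs(np.fft.rfft(np.random.default_rng(7).standard_normal(512)))
    candidates = {a: warp_spectrum(frame, a) for a in (0.9, 1.0, 1.1)}
    # A decoder would score each warped frame and keep the best factor,
    # here chosen per frame rather than per utterance.
    ```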

  • Incorporating Model-Specific Score Distribution in Speaker Verification Systems

    Publication Year: 2008, Page(s): 594 - 606
    Cited by: Papers (11)
    PDF (944 KB) | HTML

    It has been shown that the authentication performance of a biometric system is dependent on the models/templates specific to a user. As a result, some users may be more easily recognized or impersonated than others. The various categories of users have been characterized by Doddington et al. (1998). We refer to this unbalanced performance across users as the Doddington's zoo effect. In the context of fusion, we argue that this effect is system-dependent, i.e., a user model that is easily impersonated (a lamb) in one system may be easily recognized in another system (a sheep). While in principle, a fusion system could be trained to cope with the changing animal behavior of users from system to system, the lack of training data makes it impossible. We believe that one major cause of the Doddington's zoo effect is the variation of class conditional scores from one speaker model to another. We propose a two-level fusion framework that effectively realizes a fusion classifier adapted to each user. First, one applies a client-specific (or model-specific) score normalization procedure to each of the system outputs to be combined. Then, one feeds the resulting normalized outputs to a fusion classifier (common to all users) as input to obtain a final combined score. Two existing model-specific score normalization procedures are considered in this framework, i.e., F- and Z-norms. In addition to them, a novel score normalization method called model-specific log-likelihood ratio (MS-LLR) is also proposed. While Z-norm is impostor-centric, i.e., it makes use of only the impostor score statistics, F-norm and the proposed MS-LLR are client-impostor centric, i.e., they consider both the client and impostor score statistics simultaneously. Our findings based on the XM2VTS and the NIST2005 databases show that when client-impostor centric normalization procedures are used to implement the proposed two-level fusion framework, the resulting fusion classifier outperforms the conventional fusion classifier (without applying any user-specific score normalization) in the majority of experiments.
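
    Z-norm, the impostor-centric baseline discussed above, is simple enough to state in code; F-norm and the proposed MS-LLR additionally use client statistics and are not reproduced here. The toy impostor distribution is an assumption.

    ```python
    import numpy as np

    def znorm(score, impostor_scores):
        # Center and scale by the impostor statistics of the same model.
        mu, sigma = impostor_scores.mean(), impostor_scores.std()
        return (score - mu) / (sigma + 1e-12)

    # Per-model normalization before fusion: Z-norm each system's output
    # with that model's impostor scores, then feed the normalized scores
    # to a fusion classifier shared by all users.
    imp = np.random.default_rng(8).normal(-2.0, 1.5, 1000)
    print(znorm(1.0, imp))                 # roughly (1 + 2) / 1.5
    ```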

  • Rapid Speaker Adaptation Using Clustered Maximum-Likelihood Linear Basis With Sparse Training Data

    Publication Year: 2008, Page(s): 607 - 616
    Cited by: Papers (3)
    PDF (443 KB) | HTML

    Speaker space-based adaptation methods for automatic speech recognition have been shown to provide significant performance improvements for tasks where only a few seconds of adaptation speech is available. However, these techniques are not widely used in practical applications because they require large amounts of speaker-dependent training data and large amounts of computer memory. The authors propose a robust, low-complexity technique within this general class that has been shown to reduce word error rate, reduce the large storage requirements associated with speaker space approaches, and eliminate the need for large numbers of utterances per speaker in training. The technique is based on representing speakers as a linear combination of clustered linear basis vectors and a procedure is presented for maximum-likelihood estimation of these vectors from training data. Significant word error rate reduction was obtained using these methods relative to speaker independent performance for the Resource Management and Wall Street Journal task domains.
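
    A toy version of the speaker-space idea: represent a speaker as a linear combination of basis vectors and estimate the weights from sparse adaptation data. Ordinary least squares stands in for the paper's maximum-likelihood estimation over clustered bases; all dimensions and the 5% observation rate are assumptions.

    ```python
    import numpy as np

    def adapt_speaker(basis, s_obs, mask):
        # Weights over basis supervectors from the observed entries only.
        A = basis[:, mask].T
        w, *_ = np.linalg.lstsq(A, s_obs, rcond=None)
        return basis.T @ w                     # full reconstructed supervector

    rng = np.random.default_rng(9)
    basis = rng.standard_normal((10, 2000))    # 10 clustered basis vectors
    true_w = rng.standard_normal(10)
    mask = rng.random(2000) < 0.05             # few seconds => 5% observed
    s_obs = basis[:, mask].T @ true_w + 0.01 * rng.standard_normal(mask.sum())
    s_full = adapt_speaker(basis, s_obs, mask)
    ```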

  • Conditional Random Fields for Integrating Local Discriminative Classifiers

    Publication Year: 2008, Page(s): 617 - 628
    Cited by: Papers (16)
    PDF (464 KB) | HTML

    Conditional random fields (CRFs) are a statistical framework that has recently gained in popularity in both the automatic speech recognition (ASR) and natural language processing communities because of the different nature of assumptions that are made in predicting sequences of labels compared to the more traditional hidden Markov model (HMM). In the ASR community, CRFs have been employed in a method similar to that of HMMs, using the sufficient statistics of input data to compute the probability of label sequences given acoustic input. In this paper, we explore the application of CRFs to combine local posterior estimates provided by multilayer perceptrons (MLPs) corresponding to the frame-level prediction of phone classes and phonological attribute classes. We compare phonetic recognition using CRFs to an HMM system trained on the same input features and show that the monophone label CRF is able to achieve superior performance to a monophone-based HMM and performance comparable to a 16 Gaussian mixture triphone-based HMM; in both of these cases, the CRF obtains these results with far fewer free parameters. The CRF is also able to better combine these posterior estimators, achieving a substantial increase in performance over an HMM-based triphone system by mixing the two highly correlated sets of phone class and phonetic attribute class posteriors.
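
    A toy decoder for the combination idea: treat weighted MLP log-posteriors as CRF unary potentials and Viterbi-decode with label-transition weights. The posteriors, feature weights, and transition matrix below are random stand-ins, and CRF training is not shown.

    ```python
    import numpy as np

    def crf_viterbi(unary, trans):
        # Linear-chain decoding: unary potentials per frame and label,
        # plus label-transition weights (training not shown).
        T, K = unary.shape
        delta = unary[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + trans      # trans[prev, next]
            back[t] = cand.argmax(axis=0)
            delta = cand[back[t], np.arange(K)] + unary[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    # Log-linear fusion of two posterior streams as unary features.
    rng = np.random.default_rng(10)
    phone_logp, attr_logp = rng.standard_normal((2, 50, 40))
    unary = 0.6 * phone_logp + 0.4 * attr_logp   # toy feature weights
    labels = crf_viterbi(unary, 0.1 * rng.standard_normal((40, 40)))
    ```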

  • Highly Robust, Secure, and Perceptual-Quality Echo Hiding Scheme

    Publication Year: 2008, Page(s): 629 - 638
    Cited by: Papers (19) | Patents (2)
    PDF (630 KB) | HTML

    Audio watermarking using echo hiding has fairly good perceptual quality. However, security and the tradeoff between robustness and imperceptibility are still relevant issues. This paper presents an echo hiding scheme in which the analysis-by-synthesis approach, interlaced kernels, and frequency hopping are adopted to achieve high robustness, security, and perceptual quality. The amplitudes of the embedded echoes are adequately adapted during the embedding process by considering not only the characteristics of the host signals, but also cases in which the watermarked audio signals have suffered various attacks. Additionally, the interlaced kernels are introduced such that the echo positions of the interlaced kernels for embedding "zero" and "one" are interchanged alternately to minimize the influence of host signals and various attacks on the watermarked data. Frequency hopping is employed to increase the robustness and security of the proposed echo hiding scheme in which each audio segment for watermarking is established by combining the fractions selected from all frequency bands based on a pseudonoise sequence as a secret key. Experimental results indicate that the proposed analysis-by-synthesis echo hiding scheme is superior to the conventional schemes in terms of robustness, security, and perceptual quality.
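
    Basic echo hiding, the substrate this scheme builds on (analysis-by-synthesis, interlaced kernels, and frequency hopping are not shown): embed a bit as a faint echo at one of two delays and decode from the real cepstrum. The delays and echo amplitude are arbitrary assumptions.

    ```python
    import numpy as np

    def embed_echo(seg, bit, d0=100, d1=150, alpha=0.3):
        # One watermark bit = a faint echo at delay d0 ("0") or d1 ("1").
        d = d1 if bit else d0
        out = seg.copy()
        out[d:] += alpha * seg[:-d]
        return out

    def detect_echo(seg, d0=100, d1=150):
        # The echo appears as a peak in the real cepstrum at its delay.
        ceps = np.fft.ifft(np.log(np.abs(np.fft.fft(seg)) + 1e-12)).real
        return int(ceps[d1] > ceps[d0])

    host = np.random.default_rng(11).standard_normal(4096)
    print(detect_echo(embed_echo(host, 1)))    # -> 1
    ```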

  • Specmurt Analysis of Polyphonic Music Signals

    Publication Year: 2008, Page(s): 639 - 650
    Cited by: Papers (14)
    PDF (1395 KB) | HTML

    This paper introduces a new music signal processing method to extract multiple fundamental frequencies, which we call specmurt analysis. In contrast with the cepstrum, which is the inverse Fourier transform of the log-scaled power spectrum with linear frequency, specmurt is defined as the inverse Fourier transform of the linear power spectrum with log-scaled frequency. Assuming that all tones in a polyphonic sound have a common harmonic pattern, the sound spectrum can be regarded as a sum of linearly stretched common harmonic structures along frequency. In the log-frequency domain, it is formulated as the convolution of a common harmonic structure and the distribution density of the fundamental frequencies of multiple tones. The fundamental frequency distribution can be found by deconvolving the observed spectrum with the assumed common harmonic structure, where the common harmonic structure is given heuristically or quasi-optimized with an iterative algorithm. The efficiency of specmurt analysis is experimentally demonstrated through generation of a piano-roll-like display from a polyphonic music signal and automatic sound-to-MIDI conversion. Multipitch estimation accuracy is evaluated over several polyphonic music signals and compared with manually annotated MIDI data.
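
    The deconvolution at the heart of specmurt analysis can be sketched as a regularized Fourier-domain division on a toy log-frequency axis (12 bins per octave assumed; the paper's iterative quasi-optimization of the harmonic pattern is not shown).

    ```python
    import numpy as np

    def specmurt_deconvolve(v, h, eps=1e-2):
        # Regularized deconvolution of the log-frequency spectrum v by
        # the common harmonic pattern h; the division happens in the
        # Fourier transform of the log-frequency axis (specmurt domain).
        V, H = np.fft.fft(v), np.fft.fft(h)
        u = np.fft.ifft(V * H.conj() / (np.abs(H) ** 2 + eps)).real
        return np.maximum(u, 0.0)              # keep the physical part

    # Log-frequency axis at 12 bins/octave: harmonics 1..4 of a common
    # pattern sit at offsets 0, 12, 19, 24; two tones share the pattern.
    n = 512
    h = np.zeros(n); h[[0, 12, 19, 24]] = [1.0, 0.6, 0.4, 0.3]
    u_true = np.zeros(n); u_true[[60, 79]] = [1.0, 0.8]
    v = np.fft.ifft(np.fft.fft(u_true) * np.fft.fft(h)).real   # v = h * u
    print(np.sort(specmurt_deconvolve(v, h).argsort()[-2:]))   # ~ [60 79]
    ```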

  • The Modeling of Diffuse Boundaries in the 2-D Digital Waveguide Mesh

    Publication Year: 2008, Page(s): 651 - 665
    Cited by: Papers (8)
    PDF (2457 KB) | HTML

    The digital waveguide mesh can be used to simulate the propagation of sound waves in an acoustic system. The accurate simulation of the acoustic characteristics of boundaries within such a system is an important part of the model. One significant property of an acoustic boundary is its diffusivity. Previous approaches to simulating diffuse boundaries in a digital waveguide mesh are effective but exhibit limitations and have not been analyzed in detail. An improved technique is presented here that simulates diffusion at boundaries and offers a high degree of control and consistency. This technique works by rotating wavefronts as they pass through a special diffusing layer adjacent to the boundary. The waves are rotated randomly according to a chosen probability function and the model is lossless. This diffusion model is analyzed in detail, and its diffusivity is quantified in the form of frequency-dependent diffusion coefficients. The approach used to measure boundary diffusion is described here in detail for the 2-D digital waveguide mesh and can readily be extended to the 3-D case.
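
    For orientation, here is an interior update for a rectilinear 2-D digital waveguide mesh (finite-difference form). Boundary handling, including the paper's randomly rotating diffusing layer, is not shown; the mesh size and impulse excitation are assumptions.

    ```python
    import numpy as np

    def dwm_step(p, p_prev):
        # Rectilinear-mesh junction update: half the sum of the four
        # neighbors one step back, minus the junction two steps back.
        p_new = np.zeros_like(p)
        p_new[1:-1, 1:-1] = 0.5 * (p[2:, 1:-1] + p[:-2, 1:-1] +
                                   p[1:-1, 2:] + p[1:-1, :-2]) \
                            - p_prev[1:-1, 1:-1]
        return p_new

    # Propagate an impulse on a 64 x 64 mesh for 100 time steps.
    p = np.zeros((64, 64)); p[32, 32] = 1.0
    p_prev = np.zeros_like(p)
    for _ in range(100):
        p, p_prev = dwm_step(p, p_prev), p
    ```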

  • Microphone Array Shape Calibration in Diffuse Noise Fields

    Publication Year: 2008, Page(s): 666 - 670
    Cited by: Papers (24)
    PDF (358 KB) | HTML

    This correspondence presents a microphone array shape calibration procedure for diffuse noise environments. The procedure estimates intermicrophone distances by fitting the measured noise coherence with its theoretical model and then estimates the array geometry using classical multidimensional scaling. The technique is validated on noise recordings from two office environments.
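
    The second stage, classical multidimensional scaling, is compact enough to sketch. The first stage, fitting the measured coherence to the diffuse-field model gamma(f) = sinc(2*pi*f*d/c), is assumed already done, so the example starts from pairwise distances (the three-mic geometry is an assumption).

    ```python
    import numpy as np

    def classical_mds(D, dim=2):
        # Coordinates (up to rotation/translation) from pairwise distances.
        n = len(D)
        J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
        B = -0.5 * J @ (D ** 2) @ J
        evals, evecs = np.linalg.eigh(B)
        idx = np.argsort(evals)[::-1][:dim]
        return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

    # Distances (meters) for three mics in a line, as would come out of
    # the coherence-fitting stage.
    D = np.array([[0.00, 0.05, 0.10],
                  [0.05, 0.00, 0.05],
                  [0.10, 0.05, 0.00]])
    print(classical_mds(D))
    ```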

  • Dark Energy in Sparse Atomic Estimations

    Publication Year: 2008, Page(s): 671 - 676
    Cited by: Papers (6)
    PDF (864 KB) | HTML

    Sparse overcomplete methods, such as matching pursuit, attempt to find an efficient estimation of a signal using terms (atoms) selected from an overcomplete dictionary. In some cases, atoms can be selected that have energy in regions of the signal that have no energy. Other atoms are then used to destructively interfere with these terms in order to preserve the original waveform. Because some terms may even "disappear" in the reconstruction, we refer to the destructive and constructive interference between the atoms of a sparse atomic estimation as "dark energy." In this paper, we formally define dark energy for matching pursuit, explore its properties, and present empirical results for decompositions of audio signals. This paper demonstrates that dark energy is a useful measure of the interference between the terms of a sparse atomic estimation and might provide information for the decomposition process.
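
    A plain matching pursuit with a simple between-atom interference measure, in the spirit of (but not identical to) the paper's dark-energy quantity; the dictionary and test signal are synthetic assumptions.

    ```python
    import numpy as np

    def matching_pursuit(x, D, n_atoms):
        # Greedy selection over a unit-norm dictionary, plus a simple
        # interference measure between the selected terms.
        r = x.copy()
        coefs, idx = [], []
        for _ in range(n_atoms):
            c = D.T @ r
            k = int(np.abs(c).argmax())
            coefs.append(c[k]); idx.append(k)
            r = r - c[k] * D[:, k]
        terms = np.array([a * D[:, k] for a, k in zip(coefs, idx)])
        # Energy of the sum minus the summed term energies: negative
        # values indicate net destructive interference between atoms.
        interference = (terms.sum(axis=0) ** 2).sum() - (terms ** 2).sum()
        return idx, coefs, interference

    rng = np.random.default_rng(12)
    D = rng.standard_normal((256, 1024))
    D /= np.linalg.norm(D, axis=0)
    x = D[:, 3] + 0.5 * D[:, 700]
    print(matching_pursuit(x, D, 5)[2])
    ```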

  • IEEE Transactions on Audio, Speech, and Language Processing Edics

    Publication Year: 2008, Page(s): 677 - 678
    PDF (30 KB)
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing Information for authors

    Publication Year: 2008, Page(s): 679 - 680
    PDF (45 KB)
    Freely Available from IEEE
  • IEEE Signal Processing Society Information

    Publication Year: 2008, Page(s): C3
    PDF (31 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research