IEEE Transactions on Audio, Speech, and Language Processing

Issue 1 • Jan. 2012

  • Table of contents

    Publication Year: 2012, Page(s): C1 - C4
    PDF (184 KB)
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2012, Page(s): C2
    PDF (38 KB)
    Freely Available from IEEE
  • Farewell Editorial

    Publication Year: 2012, Page(s): 1
    PDF (24 KB) | HTML
    Freely Available from IEEE
  • Inaugural Editorial: Riding the Tidal Wave of Human-Centric Information Processing — Innovate, Outreach, Collaborate, Connect, Expand, and Win

    Publication Year: 2012, Page(s): 2 - 3
    Cited by: Papers (2)
    PDF (279 KB) | HTML
    Freely Available from IEEE
  • Introduction to the Special Section on Deep Learning for Speech and Language Processing

    Publication Year: 2012, Page(s): 4 - 6
    Cited by: Papers (1)
    PDF (752 KB) | HTML

    Current speech recognition systems, for example, typically use Gaussian mixture models (GMMs) to estimate the observation (or emission) probabilities of hidden Markov models (HMMs), and GMMs are generative models that have only one layer of latent variables. Instead of developing more powerful models, most of the research effort has gone into finding better ways of estimating the GMM parameters so that error rates are decreased or the margin between different classes is increased. The same observation holds for natural language processing (NLP), in which maximum entropy (MaxEnt) models and conditional random fields (CRFs) have been popular for the last decade. Both of these approaches use shallow models whose success largely depends on the use of carefully handcrafted features.
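
    As a concrete illustration of the shallow-model baseline described above, the following toy sketch evaluates a GMM emission log-likelihood for one HMM state; the 2-D features and parameter values are invented for illustration and come from none of the papers in this section:

        # Toy GMM emission score log p(x | state), as used in GMM-HMM systems.
        import numpy as np
        from scipy.stats import multivariate_normal

        def gmm_log_likelihood(x, weights, means, covs):
            components = [w * multivariate_normal.pdf(x, mean=m, cov=c)
                          for w, m, c in zip(weights, means, covs)]
            return np.log(sum(components))

        weights = [0.6, 0.4]                      # mixture weights of one state
        means = [np.zeros(2), np.ones(2)]         # component means
        covs = [np.eye(2), 2.0 * np.eye(2)]       # component covariances
        print(gmm_log_likelihood(np.array([0.5, 0.5]), weights, means, covs))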

  • Deep and Wide: Multiple Layers in Automatic Speech Recognition

    Publication Year: 2012, Page(s): 7 - 13
    Cited by: Papers (15)
    PDF (541 KB) | HTML

    This paper reviews a line of research carried out over the last decade on speech recognition assisted by discriminatively trained feedforward networks. The particular focus is on the use of multiple layers of processing preceding the hidden Markov model-based decoding of word sequences. Emphasis is placed on the use of multiple streams of highly dimensioned layers, which have proven useful for this purpose. The paper ultimately concludes that while deep processing structures can provide improvements for this genre, the choice of features and the structure with which they are incorporated, including layer width, can also be significant factors.

  • Acoustic Modeling Using Deep Belief Networks

    Publication Year: 2012, Page(s): 14 - 22
    Cited by: Papers (138)
    PDF (467 KB) | HTML

    Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models with deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly so that they are better at predicting a probability distribution over the states of monophone hidden Markov models.
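
    The two-stage recipe in this abstract can be sketched as follows. This is a minimal, hedged illustration of one contrastive-divergence (CD-1) pre-training step for a single layer, omitting the biases and hidden-unit sampling a full RBM implementation would include; all shapes and hyperparameters are assumptions:

        # Greedy layer-wise generative pre-training (simplified CD-1), after
        # which the learned features would be fine-tuned with backpropagation.
        import numpy as np

        rng = np.random.default_rng(0)
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

        def cd1_step(v0, W, lr=0.01):
            h0 = sigmoid(v0 @ W)          # hidden activations from data
            v1 = sigmoid(h0 @ W.T)        # reconstruction of the visible layer
            h1 = sigmoid(v1 @ W)          # hidden activations from reconstruction
            return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

        X = rng.normal(size=(32, 39))     # a batch of spectral feature frames
        W1 = rng.normal(scale=0.01, size=(39, 128))
        for _ in range(10):
            W1 = cd1_step(X, W1)          # unsupervised pre-training of layer 1
        H1 = sigmoid(X @ W1)              # inputs for pre-training the next layer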

  • Sparse Multilayer Perceptron for Phoneme Recognition

    Publication Year: 2012, Page(s): 23 - 29
    Cited by: Papers (11)
    PDF (379 KB) | HTML

    This paper introduces the sparse multilayer perceptron (SMLP), which jointly learns a sparse feature representation and nonlinear classifier boundaries to optimally discriminate multiple output classes. SMLP learns the transformation from the inputs to the targets as in a multilayer perceptron (MLP), while the outputs of one of the internal hidden layers are forced to be sparse. This is achieved by adding a sparse regularization term to the cross-entropy cost and updating the parameters of the network to minimize the joint cost. On the TIMIT phoneme recognition task, SMLP-based systems trained on individual speech recognition feature streams perform significantly better than the corresponding MLP-based systems. A phoneme error rate of 19.6% is achieved using the combination of SMLP-based systems, a relative improvement of 3.0% over the combination of MLP-based systems.
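
    The joint cost described above can be sketched in a few lines; the L1 penalty and the weighting constant lam are illustrative assumptions, not necessarily the paper's exact regularizer:

        # Cross-entropy plus a sparsity penalty on one hidden layer's outputs.
        import numpy as np

        def smlp_cost(class_probs, labels, hidden_out, lam=1e-3):
            ce = -np.mean(np.log(class_probs[np.arange(len(labels)), labels] + 1e-12))
            sparsity = lam * np.abs(hidden_out).mean()   # drives activations toward zero
            return ce + sparsity                         # joint cost to minimize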

  • Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

    Publication Year: 2012, Page(s): 30 - 42
    Cited by: Papers (172) | Patents (2)
    PDF (677 KB) | HTML

    We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively, which can aid optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
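
    In hybrid DNN-HMM decoding of this kind, the DNN's senone posteriors are conventionally converted to scaled likelihoods by dividing out the senone priors; a minimal sketch of that standard step follows, with toy numbers that are not the paper's data:

        # Scaled likelihood: log p(x|senone) = log p(senone|x) - log p(senone) + const.
        import numpy as np

        def scaled_log_likelihoods(log_posteriors, log_priors):
            return log_posteriors - log_priors

        post = np.log(np.array([0.7, 0.2, 0.1]))    # DNN softmax outputs for one frame
        prior = np.log(np.array([0.5, 0.3, 0.2]))   # senone priors from training alignments
        print(scaled_log_likelihoods(post, prior))  # scores fed to the HMM decoder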

  • Bayesian Sensing Hidden Markov Models

    Publication Year: 2012, Page(s): 43 - 54
    Cited by: Papers (8)
    PDF (435 KB) | HTML

    In this paper, we introduce Bayesian sensing hidden Markov models (BS-HMMs) to represent sequential data based on a set of state-dependent basis vectors. The goal of this work is to perform Bayesian sensing and model regularization for heterogeneous training data. By incorporating a prior density on sensing weights, the relevance of different bases to a feature vector is determined by the corresponding precision parameters. The BS-HMM parameters, consisting of the basis vectors, the precision matrices of the sensing weights, and the precision matrices of the reconstruction errors, are jointly estimated by maximizing the likelihood function, which is marginalized over the weight priors. We derive recursive solutions for the three parameters, which are expressed via maximum a posteriori estimates of the sensing weights. We specifically optimize BS-HMMs for large-vocabulary continuous speech recognition (LVCSR) by introducing a mixture model of BS-HMMs and by adapting the basis vectors to different speakers. Discriminative training of BS-HMMs in the model domain and the feature domain is also proposed. Experimental results on an LVCSR task show consistent improvements due to the three sets of BS-HMM parameters and demonstrate how the extensions of mixture models, speaker adaptation, and discriminative training achieve better recognition results than conventional HMMs based on Gaussian mixture models.
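
    A hedged sketch of the kind of MAP sensing-weight estimate the abstract refers to, written as standard Bayesian linear regression; the symbols (basis matrix Phi, weight precisions alpha, error precision beta) are our own naming, not the paper's:

        # MAP weights for one state: Gaussian prior N(0, diag(alpha)^-1) on w,
        # observation model x = Phi @ w + noise with precision beta.
        import numpy as np

        def map_sensing_weights(Phi, x, alpha, beta):
            S = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))  # posterior covariance
            return beta * S @ Phi.T @ x                             # posterior mean = MAP estimate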

  • Topic-Based Hierarchical Segmentation

    Publication Year: 2012, Page(s): 55 - 66
    Cited by: Papers (4)
    PDF (1104 KB) | HTML

    Latent Dirichlet allocation (LDA) is a topic-model paradigm that is powerful for capturing latent topic information from natural language. However, the topic information in text streams, e.g., meeting recordings, lecture transcriptions, and conversational dialogues, is inherently heterogeneous and nonstationary, with no explicit boundaries. It is difficult to train a precise topic model from such observed text streams. Furthermore, the usage of words in different paragraphs within a document varies with composition style. In this paper, we present a new hierarchical segmentation model (HSM) that characterizes both the heterogeneous topic information at the stream level and the word variations at the document level. We incorporate contextual topic information into stream-level segmentation. The topic similarity between sentences is used to form a beta distribution reflecting the prior knowledge of document boundaries in a text stream. The distribution of the segmentation variable is adaptively updated to achieve flexible segmentation and is used to group coherent sentences into a topic-specific document. For each pseudo-document, we further use a Markov chain to detect the stylistic segments within a document. The words in a segment are accordingly generated by the same composition style, which differs from the style of the next segment. Each segment is represented by a Markov state, and so the word variations within a document are compensated. The whole model is trained by a variational Bayesian EM procedure and is evaluated on the TDT2 corpus. Experimental results show the benefits of the proposed HSM in terms of perplexity, segmentation error, detection accuracy, and F measure.
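
    The stream-level cue described above (topic similarity between sentences shaping a boundary prior) can be sketched as follows; mapping low similarity directly to a boundary score is our simplifying assumption:

        # Cosine topic similarity between adjacent sentences; low similarity
        # suggests a document boundary in the text stream.
        import numpy as np

        def boundary_scores(topic_vectors):
            v = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
            similarity = (v[:-1] * v[1:]).sum(axis=1)
            return 1.0 - similarity        # higher score = likelier boundary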

  • On Improving Dynamic State Space Approaches to Articulatory Inversion With MAP-Based Parameter Estimation

    Publication Year: 2012, Page(s): 67 - 81
    PDF (1095 KB) | HTML

    This paper presents a complete framework for articulatory inversion based on jump Markov linear systems (JMLS). In the model, the acoustic measurements and the position of each articulator are considered the observable measurement and the continuous-valued hidden state of the system, respectively, and discrete regimes of the system are represented by a discrete-valued hidden modal state. Articulatory inversion based on JMLS involves learning the model parameter set of the system and making inferences about the state (the position of each articulator) of the system from acoustic measurements. Iterative learning algorithms based on maximum-likelihood (ML) and maximum a posteriori (MAP) criteria are proposed to learn the model parameter set of the JMLS. It is shown that the learning procedure of the JMLS is a generalized version of hidden Markov model (HMM) training when both acoustic and articulatory data are given. It is further shown that the MAP-based learning algorithm improves the modeling performance of the system and gives significantly better results than ML. The inference stage of the proposed algorithm is based on an interacting multiple models (IMM) approach and is done online (filtering) and/or offline (smoothing). Formulas are provided for IMM-based JMLS smoothing. It is shown that smoothing significantly improves the performance of articulatory inversion compared to filtering. Several experiments are conducted with the MOCHA database to show the performance of the proposed method. Comparison with results given in the literature shows that the proposed method improves the performance of state space approaches, making them comparable to the best published results.
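
    The core inference building block behind JMLS/IMM filtering is the per-mode Kalman measurement update, sketched below with placeholder matrices; the mode-mixing step of the full IMM algorithm is omitted:

        # One Kalman measurement update: state estimate (mean, cov), observation
        # y, observation matrix H, measurement noise covariance R.
        import numpy as np

        def kalman_update(mean, cov, y, H, R):
            S = H @ cov @ H.T + R                   # innovation covariance
            K = cov @ H.T @ np.linalg.inv(S)        # Kalman gain
            mean = mean + K @ (y - H @ mean)
            cov = (np.eye(len(mean)) - K @ H) @ cov
            return mean, cov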

  • Estimation of Glottal Closing and Opening Instants in Voiced Speech Using the YAGA Algorithm

    Publication Year: 2012, Page(s): 82 - 91
    Cited by: Papers (16)
    PDF (1039 KB) | HTML

    Accurate estimation of glottal closing instants (GCIs) and opening instants (GOIs) is important for speech processing applications that benefit from glottal-synchronous processing, including pitch tracking, prosodic speech modification, speech dereverberation, synthesis, and the study of pathological voice. We propose the Yet Another GCI/GOI Algorithm (YAGA) to detect GCIs from speech signals by employing multiscale analysis, the group delay function, and N-best dynamic programming. A novel GOI detector based upon the consistency of the candidates' closed quotients relative to the estimated GCIs is also presented. Particular attention is paid to the precise definition of the glottal closed phase, which we define as the analysis interval that produces minimum deviation from an all-pole model of the speech signal under closed-phase linear prediction (LP). A reference algorithm analyzing both electroglottograph (EGG) and speech signals is described for evaluation of the proposed speech-based algorithm. In addition to the development of a GCI/GOI detector, an important outcome of this work is the demonstration that GOIs derived from the EGG signal are not necessarily well-suited to closed-phase LP analysis. Evaluation of YAGA against the APLAWD and SAM databases shows that GCI identification rates of up to 99.3% can be achieved with an accuracy of 0.3 ms, and that GOI detection can be achieved equally reliably with an accuracy of 0.5 ms.
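
    One of the cues named above, the group delay function, can be sketched per analysis frame using the standard DFT identity tau(w) = Re(X_n(w) / X(w)), where X_n is the DFT of n·x[n]; the energy weighting below is an illustrative choice, not necessarily YAGA's:

        # Energy-weighted average group delay of a frame; its zero crossings
        # over successive frames yield GCI candidates.
        import numpy as np

        def avg_group_delay(frame):
            n = np.arange(len(frame))
            X = np.fft.rfft(frame)
            Xn = np.fft.rfft(n * frame)
            tau = (X.real * Xn.real + X.imag * Xn.imag) / np.maximum(np.abs(X) ** 2, 1e-12)
            w = np.abs(X) ** 2
            return float((tau * w).sum() / w.sum())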

  • Spectral Magnitude Minimum Mean-Square Error Estimation Using Binary and Continuous Gain Functions

    Publication Year: 2012, Page(s): 92 - 102
    Cited by: Papers (6)
    PDF (1020 KB) | HTML

    Recently, binary mask techniques have been proposed as a tool for retrieving a target speech signal from a noisy observation. A binary gain function is applied to time-frequency tiles of the noisy observation in order to suppress noise-dominated and retain target-dominated time-frequency regions. When implemented using discrete Fourier transform (DFT) techniques, binary mask techniques can be seen as a special case of the broader class of DFT-based speech enhancement algorithms, for which the applied gain function is not constrained to be binary. In this context, we develop and compare binary mask techniques to state-of-the-art continuous gain techniques. We derive spectral magnitude minimum mean-square error binary gain estimators; the binary gain estimators turn out to be simple functions of the continuous gain estimators. We show that the optimal binary estimators are closely related to a range of existing, heuristically developed binary gain estimators. The derived binary gain estimators perform better than existing binary gain estimators in simulation experiments with speech signals contaminated by several different noise sources, as measured by speech quality and intelligibility measures. However, even the best binary mask method is significantly outperformed by state-of-the-art continuous gain estimators. The instrumental intelligibility results are confirmed in an intelligibility listening test.
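
    The relationship noted above between continuous and binary gains can be sketched as follows; the Wiener gain and the 0.5 threshold are illustrative stand-ins for the paper's MMSE-optimal estimators:

        # A continuous DFT-domain gain and a binary mask obtained from it.
        import numpy as np

        def wiener_gain(noisy_psd, noise_psd):
            snr = np.maximum(noisy_psd / noise_psd - 1.0, 1e-6)  # crude a priori SNR estimate
            return snr / (1.0 + snr)

        def binary_gain(continuous_gain, threshold=0.5):
            return (continuous_gain > threshold).astype(float)   # keep or discard each tile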

  • Fixed-Point Implementation of Cascaded Forward–Backward Adaptive Predictors

    Publication Year: 2012, Page(s): 103 - 107
    PDF (649 KB) | HTML

    Adaptive least mean square (LMS) predictors with independent low-order cascaded structures, such as the cascaded forward LMS (CFLMS) and cascaded forward-backward LMS (CFBLMS), have proven effective in combating the misadjustment and eigenvalue-spread effects of linear predictors. Further developing this cascade structure, we study the fixed-point implementation of CFBLMS with applications to speech signals. Two groups of predictors with a total of six cases are compared: group 1 employs the transversal structure for the LMS, CFLMS, and CFBLMS algorithms, while group 2 employs the lattice structure for the same three algorithms. Experimental results show that, in group 1, the performance degradation of the CFBLMS and CFLMS predictors becomes significant when the number of bits is reduced to 8, while that of the LMS predictor becomes significant at 9 bits. In group 2, the degradation of CFBLMS and CFLMS becomes significant at 5 bits, and that of LMS at 6 bits. In both groups, CFBLMS and CFLMS perform significantly better than LMS, and CFBLMS is superior to CFLMS, in terms of rate of convergence, misadjustment, and mean-square error (MSE).
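
    A minimal sketch of the kind of fixed-point experiment described above: a transversal LMS predictor whose coefficients are quantized to a given word length after every update. The Q-format, order, and step size are illustrative assumptions:

        import numpy as np

        def quantize(x, bits, full_scale=1.0):
            step = full_scale / 2 ** (bits - 1)          # signed fixed-point grid
            return np.clip(np.round(x / step) * step, -full_scale, full_scale - step)

        def lms_predict(signal, order=4, mu=0.05, bits=8):
            w = np.zeros(order)
            err = np.zeros(len(signal))
            for n in range(order, len(signal)):
                x = signal[n - order:n][::-1]            # most recent samples first
                err[n] = signal[n] - w @ x               # prediction error
                w = quantize(w + mu * err[n] * x, bits)  # fixed-point coefficient update
            return err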

  • Noise-Robust Speaker Recognition Combining Missing Data Techniques and Universal Background Modeling

    Publication Year: 2012, Page(s): 108 - 121
    Cited by: Papers (8)
    PDF (1189 KB) | HTML

    Although the field of automatic speaker recognition (ASR) has been the subject of extensive research over the past decades, the lack of robustness against background noise has remained a major challenge. This paper describes a noise-robust speaker recognition system that combines missing data (MD) recognition with the adaptation of speaker models using a universal background model (UBM). MD recognition requires the identification of reliable and unreliable feature components. For this purpose, the signal-to-noise ratio (SNR) based mask estimation performance of various state-of-the-art noise estimation techniques and noise reduction schemes is compared. Speaker recognition experiments show that the use of a UBM in combination with missing data recognition yields substantial improvements in recognition performance, especially in the presence of highly non-stationary background noise at low SNRs.
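
    The SNR-based mask estimation compared in the paper can be sketched as a simple thresholding rule; the 0 dB criterion below is an assumed illustrative choice:

        # Mark a time-frequency feature component reliable when its local SNR
        # estimate clears a threshold; unreliable components are treated as missing.
        import numpy as np

        def reliability_mask(noisy_power, noise_power_est, thresh_db=0.0):
            snr_db = 10 * np.log10(noisy_power / np.maximum(noise_power_est, 1e-12))
            return snr_db > thresh_db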

  • Multiple Position Room Response Equalization in Frequency Domain

    Publication Year: 2012, Page(s): 122 - 135
    Cited by: Papers (3)
    PDF (2302 KB) | HTML

    This paper deals with methods for multiple position room response equalization. Unlike a well-known technique that works in the time domain and is based on fuzzy c-means clustering, the proposed approach performs most of its operations in the frequency domain; in particular, fuzzy c-means clustering is applied to the room magnitude responses at different positions. It is shown that working in the frequency domain yields equalization performance at least comparable to that of the time domain approach with strongly reduced computational complexity. In addition, different techniques that can replace the fuzzy c-means clustering algorithm in the derivation of the prototype room response equalizer, with a further reduction in the number of operations, are discussed. Finally, the results of three sets of experiments illustrate the performance, robustness, and quality of the proposed room response equalization method using alternative prototype design strategies applied to different environments.
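
    A hedged sketch of the frequency-domain prototype step: magnitude responses measured at several positions are combined into a prototype whose regularized inverse gives the equalizer magnitude. A plain average stands in here for the fuzzy c-means prototype of the paper:

        import numpy as np

        def prototype_equalizer(mag_responses, eps=1e-3):
            prototype = mag_responses.mean(axis=0)   # simplified multi-position prototype
            return 1.0 / (prototype + eps)           # regularized magnitude inverse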

  • A Hierarchical Bayesian Approach to Modeling Heterogeneity in Speech Quality Assessment

    Publication Year: 2012, Page(s): 136 - 146
    Cited by: Papers (1)
    PDF (577 KB) | HTML

    The development of objective speech quality measures generally involves fitting a model to subjective rating data. A typical data set comprises ratings generated by listening tests performed in different languages and across different laboratories. These factors, as well as others such as the sex and age of the talker, influence the subjective ratings and result in data heterogeneity. We use a linear hierarchical Bayes (HB) structure to account for this heterogeneity. To make the structure effective, we develop a variational Bayesian inference for the linear HB structure that approximates not only the posterior over the model parameters but also the model evidence. Using the approximate model evidence, we are able to study and exploit the heterogeneity-inducing factors in the Bayesian framework. The new approach yields a simple linear predictor with state-of-the-art predictive performance. Our experiments show that the new method compares favorably with systems based on more complex predictor structures such as ITU-T recommendation P.563, Bayesian MARS, and Gaussian processes.

  • Perceptual Confusions Among Consonants, Revisited—Cross-Spectral Integration of Phonetic-Feature Information and Consonant Recognition

    Publication Year: 2012, Page(s): 147 - 161
    PDF (911 KB) | HTML

    The perceptual basis of consonant recognition was experimentally investigated through a study of how information associated with phonetic features (Voicing, Manner, and Place of Articulation) combines across the acoustic-frequency spectrum. The speech signals, 11 Danish consonants embedded in Consonant + Vowel + Liquid syllables, were partitioned into 3/4-octave bands (“slits”) centered at 750 Hz, 1500 Hz, and 3000 Hz, and presented individually and in two- or three-slit combinations. The amount of information transmitted (IT) was calculated from consonant-confusion matrices for each feature and slit combination. The growth of IT was measured as a function of the number of slits presented and their center frequency for the phonetic features and consonants. The IT associated with Voicing, Manner, and Consonants sums nearly linearly for two-band stimuli irrespective of their center frequency. Adding a third band increases the IT by an amount somewhat less than predicted by linear cross-spectral integration (i.e., a compressive function). In contrast, for Place of Articulation, the IT gained through the addition of a second or third slit is far more than predicted by linear cross-spectral summation. This difference is mirrored in a measure of error-pattern similarity across bands, Symmetric Redundancy. Consonants, as well as Voicing and Manner, share a moderate degree of redundancy between bands. In contrast, the cross-spectral redundancy associated with Place is close to zero, which means the bands are essentially independent in terms of decoding this feature. Because consonant recognition and Place decoding are highly correlated (correlation coefficient r² = 0.99), these results imply that the auditory processes underlying consonant recognition are not strictly linear. This may account for why conventional cross-spectral integration speech models, such as the Articulation Index, Speech Intelligibility Index, and Speech Transmission Index, do not predict intelligibility and segment recognition well under certain conditions (e.g., discontiguous frequency bands and audio-visual speech).
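
    The information transmitted (IT) measure used above is, at its core, the mutual information of a stimulus-response confusion matrix; a sketch with a toy two-consonant matrix of counts:

        # Mutual information (bits) computed from a confusion matrix of counts.
        import numpy as np

        def information_transmitted(confusions):
            P = confusions / confusions.sum()          # joint p(stimulus, response)
            px = P.sum(axis=1, keepdims=True)
            py = P.sum(axis=0, keepdims=True)
            nz = P > 0
            return float((P[nz] * np.log2(P[nz] / (px @ py)[nz])).sum())

        toy = np.array([[8.0, 2.0], [1.0, 9.0]])       # toy confusion counts
        print(information_transmitted(toy))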

  • On the Design and Implementation of Higher Order Differential Microphones

    Publication Year: 2012, Page(s): 162 - 174
    Cited by: Papers (5)
    PDF (1205 KB) | HTML

    A novel systematic approach to the design of directivity patterns of higher order differential microphones is proposed. The directivity patterns are obtained by optimizing a cost function that is a convex combination of a front-back energy ratio and uniformity within a frontal sector of interest. Most of the standard directivity patterns (omnidirectional, cardioid, subcardioid, hypercardioid, and supercardioid) are particular solutions of this optimization problem for specific values of two free parameters: the angular width of the frontal sector and the convex combination factor. More general solutions of practical use are obtained by varying these two parameters. Many of these optimal directivity patterns are trigonometric polynomials with complex roots. A new differential array structure that enables the implementation of general higher order directivity patterns, with complex or real roots, is then proposed. The effectiveness of the proposed design framework and the implementation structure is illustrated by design examples, simulations, and measurements.
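
    A hedged sketch of a cost of the form described above, evaluated on a sampled directivity pattern; the exact definitions of the front-back energy ratio and the uniformity term are our assumptions, with nu the convex combination factor and sector the frontal sector width:

        import numpy as np

        def pattern_cost(B, theta, sector=np.pi / 3, nu=0.5):
            front = np.abs(B[np.abs(theta) <= np.pi / 2]) ** 2
            back = np.abs(B[np.abs(theta) > np.pi / 2]) ** 2
            fb_ratio = front.mean() / max(back.mean(), 1e-12)  # front-back energy ratio
            in_sector = np.abs(B[np.abs(theta) <= sector])
            uniformity = -np.var(in_sector)                    # flatter sector scores higher
            return nu * fb_ratio + (1 - nu) * uniformity       # objective to maximize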

  • Enhancement of Residual Echo for Robust Acoustic Echo Cancellation

    Publication Year: 2012, Page(s): 175 - 189
    Cited by: Papers (10)
    PDF (1948 KB) | HTML

    This paper examines the technique of using a noise-suppressing nonlinearity in the adaptive filter error feedback loop of an acoustic echo canceler (AEC) based on the least mean square (LMS) algorithm when there is interference at the near end. The source of distortion may be linear, such as local speech or background noise, or nonlinear, due to speech coding used in telecommunication networks. A detailed derivation is provided of the error recovery nonlinearity (ERN), which “enhances” the filter estimation error prior to adaptation in order to assist the linear adaptation process. Connections to other existing AEC and signal enhancement techniques are revealed. In particular, the error enhancement technique is well-founded in the information-theoretic sense and has strong ties to independent component analysis (ICA), which is the basis for blind source separation (BSS) that permits unsupervised adaptation in the presence of multiple interfering signals. The single-channel AEC problem can be viewed as a special case of semi-blind source separation (SBSS) where one of the source signals is partially known, i.e., the far-end microphone signal that generates the near-end acoustic echo. A system approach to robust AEC is motivated, where a proper integration of the LMS algorithm with the ERN into the AEC “system” allows for continuous and stable adaptation even during double talk, without precise estimation of the signal statistics. The error enhancement paradigm encompasses many traditional signal enhancement techniques and opens up an entirely new avenue for solving the AEC problem in a real-world setting.
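
    The error-enhancement idea can be sketched as an NLMS-style update whose error passes through a saturating nonlinearity before adaptation; tanh is one illustrative choice here and is not necessarily the ERN derived in the paper:

        # One adaptive-filter update of a toy echo canceler with error saturation,
        # so near-end bursts (double talk) do not destabilize adaptation.
        import numpy as np

        def aec_step(w, x_buf, mic_sample, mu=0.5, scale=0.01, eps=1e-6):
            err = mic_sample - w @ x_buf            # residual echo + near-end signal
            g = scale * np.tanh(err / scale)        # "enhanced" (saturated) error
            w = w + mu * g * x_buf / (x_buf @ x_buf + eps)
            return w, err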

  • Performance Following: Real-Time Prediction of Musical Sequences Without a Score

    Publication Year: 2012, Page(s): 190 - 199
    PDF (539 KB) | HTML

    This paper introduces a technique for predicting harmonic sequences in a musical performance for which no score is available, using real-time audio signals. Recent short-term information is aligned with longer term information, contextualizing the present within the past and allowing predictions about the future of the performance to be made. Using a mid-level representation in the form of beat-synchronous harmonic sequences, we reduce the amount of information needed to represent the performance. This allows the implementation of real-time performance following in live performance situations. We conduct an objective evaluation on a database of rock, pop, and folk music. Our results show that we are able to predict a large majority of repeated harmonic content with no prior knowledge in the form of a score.

  • Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment

    Publication Year: 2012, Page(s): 200 - 210
    Cited by: Papers (2)
    PDF (1192 KB) | HTML

    Aligning lyrics to audio has a wide range of applications, such as the automatic generation of karaoke scores, song browsing by lyrics, and the generation of audio thumbnails. Existing methods are restricted to using only the lyrics, matching them to phoneme features extracted from the audio (usually mel-frequency cepstral coefficients). Our novel idea is to integrate the textual chord information provided in the paired chords-lyrics format known from song books and Internet sites into the inference procedure. We propose two novel methods that implement this idea. First, assuming that all chords of a song are known, we extend a hidden Markov model (HMM) framework by including chord changes in the Markov chain and an additional audio feature (chroma) in the emission vector. Second, for the more realistic case in which some chord information is missing, we present a method that recovers the missing chord information by exploiting repetition in the song. We conducted experiments with five changing parameters and show that, with accuracies of 87.5% and 76.7%, respectively, both methods perform better than the baseline with statistical significance. We also introduce the new accompaniment interface Song Prompter, which uses the automatically aligned lyrics to guide musicians through a song. It demonstrates that the automatic alignment is accurate enough to be used in a musical performance.

  • A Risk-Aware Modeling Framework for Speech Summarization

    Publication Year: 2012, Page(s): 211 - 222
    Cited by: Papers (4)
    PDF (417 KB) | HTML

    Extractive speech summarization attempts to select a representative set of sentences from a spoken document so as to succinctly describe the main theme of the original document. In this paper, we adapt the notion of risk minimization for extractive speech summarization by formulating the selection of summary sentences as a decision-making problem. To this end, we develop several selection strategies and modeling paradigms that can leverage supervised and unsupervised summarization models, inheriting their individual merits while overcoming their inherent limitations. On top of that, various component models are introduced, providing a principled way to render the redundancy relationships among sentences and the coherence relationships between sentences and the whole document. A series of experiments on speech summarization suggests that the methods deduced from our summarization framework are very competitive with existing summarization methods.

  • Noise Correlation Matrix Estimation for Multi-Microphone Speech Enhancement

    Publication Year: 2012, Page(s): 223 - 233
    Cited by: Papers (9)
    PDF (1065 KB) | HTML

    For multi-channel noise reduction algorithms such as the minimum variance distortionless response (MVDR) beamformer or the multi-channel Wiener filter, an estimate of the noise correlation matrix is needed. For its estimation, the literature often proposes using a voice activity detector (VAD). However, with a VAD the estimated matrix can only be updated during speech absence. As a result, during speech presence the noise correlation matrix estimate does not follow changing noise fields with appropriate accuracy. This effect is compounded by the fact that voice activity detection in nonstationary noise is a rather difficult task, and false alarms are likely to occur. In this paper, we present and analyze an algorithm that estimates the noise correlation matrix without using a VAD. The algorithm is based on measuring the correlation of the noisy input and a noise reference, which can be obtained, e.g., by steering a null towards the target source. When applied in combination with an MVDR beamformer, the proposed noise correlation matrix estimate is shown to result in a more accurate beamformer response, a larger signal-to-noise ratio improvement, and a larger instrumentally predicted speech intelligibility than competing algorithms such as the generalized sidelobe canceler, a VAD-based MVDR beamformer, and an MVDR beamformer based on the noisy correlation matrix.
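
    The downstream use of the estimate can be sketched as follows; the recursive update from a noise reference is a simplified assumption, while the MVDR weight formula is the standard one:

        import numpy as np

        def update_noise_corr(R, noise_ref_snapshot, alpha=0.95):
            z = noise_ref_snapshot[:, None]          # per-frame noise reference vector
            return alpha * R + (1 - alpha) * (z @ z.conj().T)

        def mvdr_weights(R, d):
            Rinv_d = np.linalg.solve(R, d)           # R^{-1} d without explicit inverse
            return Rinv_d / (d.conj() @ Rinv_d)      # distortionless toward steering vector d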


Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research