IEEE/ACM Transactions on Audio, Speech, and Language Processing

Issue 4 • April 2014

  • [Front cover]

    Page(s): C1
    PDF (321 KB)
    Freely Available from IEEE
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
    PDF (133 KB)
    Freely Available from IEEE
  • Table of contents

    Page(s): 741 - 742
    PDF (242 KB)
    Freely Available from IEEE
  • Table of contents

    Page(s): 743 - 744
    PDF (250 KB)
    Freely Available from IEEE
  • An Overview of Noise-Robust Automatic Speech Recognition

    Page(s): 745 - 777
    PDF (3915 KB) | HTML

    New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustically distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that have proven successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed. (An illustrative sketch of a classic feature-domain technique follows this entry.)

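    As an illustration of the survey's first criterion (feature-domain processing), the following is a minimal sketch of classic magnitude spectral subtraction; the leading-frame noise estimate, flooring factor, and array shapes are illustrative choices, not taken from the article.

        import numpy as np

        def spectral_subtraction(noisy_mag, n_noise_frames=6, floor=0.1):
            """noisy_mag: (frames, bins) magnitude spectrogram of one utterance."""
            # Estimate a stationary noise spectrum from the leading frames,
            # assumed to contain no speech.
            noise_est = noisy_mag[:n_noise_frames].mean(axis=0)
            clean_est = noisy_mag - noise_est          # subtract per frequency bin
            # A spectral floor prevents negative magnitudes.
            return np.maximum(clean_est, floor * noisy_mag)

        # Toy usage: 50 frames x 129 frequency bins.
        rng = np.random.default_rng(0)
        spec = np.abs(rng.standard_normal((50, 129))) + 1.0
        enhanced = spectral_subtraction(spec)
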
  • Application of Deep Belief Networks for Natural Language Understanding

    Page(s): 778 - 784
    PDF (900 KB) | HTML

    Applications of Deep Belief Nets (DBNs) to various problems have been the subject of a number of recent studies ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of a greedy layer-wise pretraining method that uses an efficient learning algorithm called Contrastive Divergence (CD). CD allows DBNs to learn a multi-layer generative model from unlabeled data, and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVMs), Boosting, and Maximum Entropy (MaxEnt). The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models. However, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which in turn perform better than both MaxEnt and Boosting. (A minimal sketch of the CD-1 pretraining step follows this entry.)

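    A minimal sketch of the greedy layer-wise pretraining step the abstract describes: one contrastive-divergence (CD-1) update for a single RBM layer, in numpy. The layer sizes, learning rate, and binary toy data are illustrative, not the paper's configuration.

        import numpy as np

        rng = np.random.default_rng(0)

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def cd1_update(v0, W, b, c, lr=0.05):
            """One CD-1 step on a batch of visible vectors v0 (batch x n_visible)."""
            # Positive phase: infer hidden activations from the data, then sample.
            ph0 = sigmoid(v0 @ W + c)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            # Negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens).
            pv1 = sigmoid(h0 @ W.T + b)
            ph1 = sigmoid(pv1 @ W + c)
            # Approximate log-likelihood gradient from the two phases.
            W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
            b += lr * (v0 - pv1).mean(axis=0)
            c += lr * (ph0 - ph1).mean(axis=0)
            return W, b, c

        # Toy usage: 20 binary feature vectors, 64 visible and 32 hidden units.
        n_vis, n_hid = 64, 32
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        b, c = np.zeros(n_vis), np.zeros(n_hid)
        data = (rng.random((20, n_vis)) < 0.1).astype(float)
        for _ in range(10):
            W, b, c = cd1_update(data, W, b, c)

    Stacking such layers and then fine-tuning the resulting feed-forward network with backpropagation is the DBN recipe the paper builds on.
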
  • Low-rank Approximation Based Multichannel Wiener Filter Algorithms for Noise Reduction with Application in Cochlear Implants

    Page(s): 785 - 799
    PDF (2959 KB) | HTML

    This paper presents low-rank approximation based multichannel Wiener filter algorithms for noise reduction in speech plus noise scenarios, with application in cochlear implants. In a single speech source scenario, the frequency-domain autocorrelation matrix of the speech signal is often assumed to be a rank-1 matrix, which then allows one to derive different rank-1 approximation based noise reduction filters. In practice, however, the rank of the autocorrelation matrix of the speech signal is usually greater than one. Firstly, the link between the different rank-1 approximation based noise reduction filters and the original speech distortion weighted multichannel Wiener filter is investigated when the rank of the autocorrelation matrix of the speech signal is indeed greater than one. Secondly, in low input signal-to-noise-ratio scenarios, due to noise non-stationarity, the estimation of the autocorrelation matrix of the speech signal can be problematic and the noise reduction filters can deliver unpredictable noise reduction performance. An eigenvalue decomposition based filter and a generalized eigenvalue decomposition based filter are introduced that include a more robust rank-1, or more generally rank-R, approximation of the autocorrelation matrix of the speech signal. These noise reduction filters are demonstrated to deliver better noise reduction performance, especially in low input signal-to-noise-ratio scenarios. The filters are especially useful in cochlear implants, where more speech distortion, and hence more aggressive noise reduction, can be tolerated. (An illustrative sketch of the GEVD-based construction follows this entry.)

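    A minimal numpy/scipy sketch of the GEVD-based construction the abstract describes: a rank-R speech-distortion-weighted multichannel Wiener filter at a single frequency bin, assuming known noisy and noise correlation matrices. The trade-off parameter mu, the reference microphone, and the toy correlation matrices are illustrative.

        import numpy as np
        from scipy.linalg import eigh

        def gevd_rank_r_mwf(Ryy, Rnn, R=1, mu=1.0, ref=0):
            """Rank-R SDW-MWF; Ryy, Rnn are M x M Hermitian correlation matrices."""
            # Generalized eigenpairs Ryy q = lam * Rnn q (eigh sorts lam ascending,
            # with Q^H Rnn Q = I and Q^H Ryy Q = diag(lam)).
            lam, Q = eigh(Ryy, Rnn)
            d = np.maximum(lam - 1.0, 0.0)   # speech part of the eigenvalues
            d[:-R] = 0.0                     # keep only the R largest: rank-R model
            Qinv = np.linalg.inv(Q)
            Rss_r = Qinv.conj().T @ np.diag(d) @ Qinv   # rank-R speech correlation
            # SDW-MWF estimating the speech component at the reference microphone.
            return np.linalg.solve(Rss_r + mu * Rnn, Rss_r[:, ref])

        # Toy usage: M = 4 microphones, a rank-1 speech source plus diffuse noise.
        rng = np.random.default_rng(1)
        M = 4
        a = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # steering vector
        Rnn = np.eye(M) + 0.1 * np.diag(rng.random(M))
        Ryy = Rnn + np.outer(a, a.conj())
        w = gevd_rank_r_mwf(Ryy, Rnn, R=1, mu=1.0)

    Larger mu trades more speech distortion for more noise reduction, which is the knob the cochlear-implant application exploits.
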
  • Design of Superdirective Planar Arrays With Sparse Aperiodic Layouts for Processing Broadband Signals via 3-D Beamforming

    Page(s): 800 - 815
    PDF (3658 KB) | HTML

    Planar arrays are used jointly with filter-and-sum beamforming to achieve 3-D spatial discrimination in processing broadband signals. In these systems, the beams are steered in various directions to investigate a given portion of space. The band can be so wide as to require both superdirective performance (to increase directivity at low frequencies) and sparse aperiodic layouts (to avoid grating lobes at high frequencies). We propose an original method to simultaneously optimize the transducer positions and the coefficients of the finite impulse response (FIR) filters, providing a solution that remains valid for any steering direction inside a predefined region of interest. A hybrid strategy, analytical for the coefficients and stochastic for the positions, is devised to minimize the beam pattern (BP) energy while maintaining an unaltered signal from the steering direction and controlling the side lobes. The robustness of the superdirectivity is achieved by taking into account the probability density functions for the characteristics of realistic transducers. A distinctive feature of our method is its ability to maintain the computational tractability of the optimization problem by drastically reducing the burden of evaluating the cost function. The obtained results, addressing tens of transducers and several octaves of band, demonstrate the effectiveness of the proposed method in terms of directivity, contrast, and robustness. (A sketch of the beam-pattern computation follows this entry.)

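    A minimal sketch of the quantity being optimized: the beam pattern of a filter-and-sum beamformer over a sparse planar layout, whose energy the paper's hybrid method minimizes. The geometry, tap count, sample rate, and random filters below are toy values, not the paper's design.

        import numpy as np

        C = 343.0      # speed of sound (m/s)
        FS = 16000.0   # sample rate (Hz)

        def beam_pattern(pos, fir, f, directions):
            """pos: (M, 2) transducer x-y positions; fir: (M, L) FIR taps;
            directions: (K, 3) unit vectors; returns the complex pattern at f."""
            M, L = fir.shape
            # Frequency response of each channel's FIR filter at frequency f.
            H = fir @ np.exp(-2j * np.pi * f * np.arange(L) / FS)
            # Plane-wave delays across the array for each look direction (z = 0).
            delays = directions[:, :2] @ pos.T / C          # (K, M), in seconds
            return np.exp(2j * np.pi * f * delays) @ H      # (K,) beam pattern

        # Toy usage: 8 randomly placed transducers, 32-tap filters, azimuth scan.
        rng = np.random.default_rng(2)
        pos = rng.uniform(-0.1, 0.1, size=(8, 2))
        fir = rng.standard_normal((8, 32)) / 32
        az = np.linspace(0.0, 2.0 * np.pi, 64)
        dirs = np.stack([np.cos(az), np.sin(az), np.zeros_like(az)], axis=1)
        bp = beam_pattern(pos, fir, f=2000.0, directions=dirs)
        energy = np.mean(np.abs(bp) ** 2)   # cost term the optimization drives down
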
  • Multi-Feature Beat Tracking

    Page(s): 816 - 825
    PDF (1085 KB) | HTML

    A recent trend in the field of beat tracking for musical audio signals has been to explore techniques for measuring the level of agreement and disagreement between a committee of beat tracking algorithms. By using beat tracking evaluation methods to compare all pairwise combinations of beat tracker outputs, it has been shown that selecting the beat tracker which most agrees with the remainder of the committee, on a song-by-song basis, leads to improved performance which surpasses the accuracy of any individual beat tracker used on its own. In this paper we extend this idea to present a single, standalone beat tracking solution which can exploit the benefit of mutual agreement without the need to run multiple separate beat tracking algorithms. In contrast to existing work, we re-cast the problem as one of selecting between the beat outputs resulting from a single beat tracking model with multiple, diverse input features. Through extended evaluation on a large annotated database, we show that our multi-feature beat tracker can outperform the state of the art, and thereby demonstrate that there is sufficient diversity in input features for beat tracking without the need for multiple tracking models. (A sketch of the selection-by-agreement step follows this entry.)

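    A minimal sketch of the selection-by-agreement idea: score every pair of candidate beat sequences (here, one per input feature) with an agreement measure and keep the sequence that agrees most with the rest. The simplified F-measure with a fixed tolerance window below is an illustrative stand-in for the beat tracking evaluation methods used in the paper.

        import numpy as np

        def f_measure(beats_a, beats_b, tol=0.07):
            """Simplified agreement between two beat-time lists (in seconds)."""
            if len(beats_a) == 0 or len(beats_b) == 0:
                return 0.0
            hits = sum(np.min(np.abs(beats_b - t)) <= tol for t in beats_a)
            if hits == 0:
                return 0.0
            precision, recall = hits / len(beats_a), hits / len(beats_b)
            return 2 * precision * recall / (precision + recall)

        def select_by_agreement(candidates):
            """candidates: list of beat-time arrays, one per input feature."""
            scores = [np.mean([f_measure(a, b)
                               for j, b in enumerate(candidates) if j != i])
                      for i, a in enumerate(candidates)]
            return candidates[int(np.argmax(scores))]

        # Toy usage: two agreeing candidates and one off-beat outlier.
        cands = [np.array([0.50, 1.00, 1.50, 2.00]),
                 np.array([0.52, 1.01, 1.49, 2.02]),
                 np.array([0.25, 0.75, 1.25, 1.75])]
        best = select_by_agreement(cands)   # one of the two agreeing sequences
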
  • Investigation of Speech Separation as a Front-End for Noise Robust Speech Recognition

    Page(s): 826 - 835
    PDF (1050 KB) | HTML

    Recently, supervised classification has been shown to work well for the task of speech separation. We perform an in-depth evaluation of such techniques as a front-end for noise-robust automatic speech recognition (ASR). The proposed separation front-end consists of two stages. The first stage removes additive noise via time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage: a non-linear function is learned that maps the masked spectral features to their clean counterparts. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks and HMMs. Results show that dFDLR consistently improves performance in all test conditions. Surprisingly, the best average results are obtained when dFDLR is applied to models trained using noisy log-Mel spectral features from the multi-condition training set. With no channel mismatch, the best results are obtained when the proposed speech separation front-end is used along with multi-condition training on log-Mel features followed by dFDLR adaptation. Both results are among the best reported on the Aurora-4 dataset. (A sketch of the two-stage front-end follows this entry.)

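    A minimal sketch of the two-stage shape the abstract describes: stage 1 applies an estimated time-frequency mask to the noisy spectrogram; stage 2 maps the masked features toward their clean counterparts with a learned non-linear function. The untrained one-hidden-layer mapping, the random "mask," and the crude log-Mel stand-in below are placeholders, not the paper's models.

        import numpy as np

        def stage1_mask(noisy_mag, mask):
            """Element-wise T-F masking; mask entries in [0, 1]."""
            return mask * noisy_mag

        def stage2_map(feats, W1, b1, W2, b2):
            """One-hidden-layer non-linear map to clean-feature estimates."""
            h = np.tanh(feats @ W1 + b1)
            return h @ W2 + b2

        # Toy shapes: 100 frames, 257 FFT bins, 40 feature channels.
        rng = np.random.default_rng(3)
        noisy = np.abs(rng.standard_normal((100, 257)))
        mask = rng.random((100, 257))            # stands in for a classifier output
        masked = stage1_mask(noisy, mask)
        feats = np.log(masked[:, :40] + 1e-6)    # crude stand-in for log-Mel features
        W1, b1 = 0.1 * rng.standard_normal((40, 64)), np.zeros(64)
        W2, b2 = 0.1 * rng.standard_normal((64, 40)), np.zeros(40)
        clean_est = stage2_map(feats, W1, b1, W2, b2)
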
  • Robust Speaker Identification in Noisy and Reverberant Conditions

    Page(s): 836 - 845
    PDF (1255 KB) | HTML

    Robustness of speaker recognition systems is crucial for real-world applications, which typically contain both additive noise and room reverberation. However, the combined effects of additive noise and convolutive reverberation have rarely been studied in speaker identification (SID). This paper addresses this issue in two phases. We first remove background noise through binary masking using a deep neural network classifier. Then we perform robust SID with speaker models trained in selected reverberant conditions, on the basis of bounded marginalization and direct masking. Evaluation results show that the proposed system substantially improves SID performance over related systems across a wide range of reverberation times and signal-to-noise ratios. (A sketch of bounded marginalization follows this entry.)

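    A minimal sketch of bounded marginalization for GMM-based scoring, one of the two strategies the abstract names: reliable feature dimensions use the usual Gaussian density, while masked (noise-dominated) dimensions contribute the probability mass below the observed value, which bounds the clean speech energy from above. Diagonal covariances and the toy model sizes are illustrative.

        import numpy as np
        from scipy.stats import norm

        def bounded_marginal_loglik(x, reliable, means, variances, weights):
            """x: (D,) features; reliable: (D,) bool mask; diagonal-cov GMM."""
            sd = np.sqrt(variances)
            ll_density = norm.logpdf(x, means, sd)   # (K, D), reliable dims
            ll_bounded = norm.logcdf(x, means, sd)   # mass below the observed bound
            per_dim = np.where(reliable, ll_density, ll_bounded)
            comp = np.log(weights) + per_dim.sum(axis=1)
            return np.logaddexp.reduce(comp)         # log-sum over components

        # Toy usage: a 4-component GMM over 13-dimensional features.
        rng = np.random.default_rng(4)
        K, D = 4, 13
        means = rng.standard_normal((K, D))
        variances = np.ones((K, D))
        weights = np.full(K, 1.0 / K)
        x = rng.standard_normal(D)
        reliable = rng.random(D) > 0.3   # stands in for the DNN's binary mask
        score = bounded_marginal_loglik(x, reliable, means, variances, weights)
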
  • On the use of i-vector posterior distributions in Probabilistic Linear Discriminant Analysis

    Page(s): 846 - 857
    PDF (2594 KB) | HTML

    The i-vector extraction process is affected by several factors, such as the noise level, the acoustic content of the observed features, the channel mismatch between the training conditions and the test data, and the duration of the analyzed speech segment. These factors influence both the i-vector estimate and its uncertainty, represented by the i-vector posterior covariance. This paper presents a new PLDA model that, unlike the standard one, exploits the intrinsic i-vector uncertainty. Since recognition accuracy is known to decrease for short speech segments, and their length is one of the main factors affecting the i-vector covariance, we designed a set of experiments comparing the standard and the new PLDA models on short speech cuts of variable duration, randomly extracted from the conversations in the NIST SRE 2010 extended dataset, from both interviews and telephone conversations. Our results on NIST SRE 2010 evaluation data show that in different conditions the new model outperforms standard PLDA by more than 10% relative when tested on short segments with duration mismatches, and retains the accuracy of the standard model for sufficiently long speech segments. The technique has also been successfully tested in the NIST SRE 2012 evaluation. (A sketch of the modelling idea follows this entry.)

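    A hedged sketch of the modelling idea in generic simplified-PLDA notation, not necessarily the paper's exact formulation: the standard model treats each i-vector phi_i as a point estimate, while the uncertainty-aware variant adds the i-vector posterior covariance to the residual term, so short, uncertain segments contribute less sharply to the likelihood.

        % Standard simplified PLDA: speaker factor y, residual covariance Sigma.
        \phi_i = m + V y + \varepsilon_i, \qquad
          \varepsilon_i \sim \mathcal{N}(0, \Sigma)
        % With uncertainty propagation: \Sigma_{\phi_i} is the posterior covariance
        % returned by the i-vector extractor for segment i; it grows as the
        % segment shortens, down-weighting unreliable i-vectors.
        \phi_i = m + V y + \varepsilon_i, \qquad
          \varepsilon_i \sim \mathcal{N}(0, \Sigma + \Sigma_{\phi_i})
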
  • Chinese-English Phone Set Construction for Code-Switching ASR Using Acoustic and DNN-Extracted Articulatory Features

    Page(s): 858 - 862
    PDF (936 KB)

    This study proposes a data-driven approach to phone set construction for code-switching automatic speech recognition (ASR). Acoustic and context-dependent cross-lingual articulatory features (AFs) are incorporated into the estimation of the distance between triphone units for constructing a Chinese-English phone set. The acoustic features of each triphone in the training corpus are extracted to construct an acoustic triphone HMM. Furthermore, the articulatory features of the “last/first” state of the corresponding preceding/succeeding triphone in the training corpus are used to construct an AF-based GMM. The AFs, extracted using a deep neural network (DNN), are used for code-switching articulation modeling to alleviate the data sparseness problem caused by the diverse context-dependent phone combinations in intra-sentential code-switching. The triphones are then clustered to obtain a Chinese-English phone set based on the acoustic HMMs and the AF-based GMMs using a hierarchical triphone clustering algorithm. Experimental results on code-switching ASR show that the proposed method for phone set construction outperforms traditional methods. (A sketch of the clustering step follows this entry.)

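    A minimal sketch of the final clustering step's shape: given a symmetric distance matrix between triphone units (which the paper derives from the acoustic HMMs and AF-based GMMs), greedily merge the closest clusters until a target phone-set size is reached. Complete linkage and the random toy distances are illustrative choices, not the paper's algorithm in detail.

        import numpy as np

        def agglomerate(dist, n_clusters):
            """dist: (N, N) symmetric distances; returns clusters of indices."""
            clusters = [[i] for i in range(len(dist))]
            while len(clusters) > n_clusters:
                best, pair = np.inf, None
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        # Complete linkage: distance of the farthest member pair.
                        d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                        if d < best:
                            best, pair = d, (i, j)
                i, j = pair
                clusters[i] += clusters.pop(j)   # merge the closest pair
            return clusters

        # Toy usage: 6 "triphones" with random distances, merged into 3 sets.
        rng = np.random.default_rng(5)
        D = rng.random((6, 6))
        D = (D + D.T) / 2.0
        np.fill_diagonal(D, 0.0)
        print(agglomerate(D, 3))
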
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing Edics

    Page(s): 863 - 864
    PDF (108 KB)
    Freely Available from IEEE
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing Information for Authors

    Page(s): 865 - 866
    PDF (147 KB)
    Freely Available from IEEE
  • Open Access

    Page(s): 867
    PDF (1157 KB)
    Freely Available from IEEE
  • Publish your article in IEEE Access

    Page(s): 868
    PDF (1156 KB)
    Freely Available from IEEE
  • IEEE Signal Processing Society Information

    Page(s): C3
    PDF (121 KB)
    Freely Available from IEEE
  • [Blank page - back cover]

    Page(s): C4
    PDF (5 KB)
    Freely Available from IEEE

Aims & Scope

IEEE/ACM Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.


Meet Our Editors

Editor-in-Chief

Li Deng
Microsoft Research