
IEEE Transactions on Audio, Speech, and Language Processing

Issue 7 • Sept. 2011

Displaying Results 1 - 25 of 40
  • Table of contents

    Page(s): C1 - C4
    PDF (102 KB)
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
    PDF (38 KB)
    Freely Available from IEEE
  • Performance of an Event-Based Instantaneous Fundamental Frequency Estimator for Distant Speech Signals

    Page(s): 1853 - 1864
    PDF (726 KB) | HTML

    This paper proposes a method for extracting the fundamental frequency of voiced speech from distant speech signals. The method is based on the impulse-like nature of excitation in voiced speech. The characteristics of impulse-like excitation are extracted by filtering the speech signal through a cascade of resonators located at zero frequency. The resulting filtered signal preserves information specific to the fundamental frequency in the sequence of positive-to-negative zero crossings. Also, the filtered signal is free from the effects of resonances of the vocal tract. An estimate of the fundamental frequency is derived from the short-time spectrum of the filtered signal. This estimate is used to remove spurious zero crossings in the filtered signal. The proposed method depends only on the strengths of impulse-like excitations in the direct component of distant speech signals, and not on the similarity of the speech signal in successive glottal cycles. Hence, the method is robust to the effects of reverberation and noise. Performance of the method is evaluated using a database of close-speaking and distant speech signals. Experiments show that the accuracy of the proposed method is significantly higher than that of existing methods based on time-domain and frequency-domain processing.

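    For readers who want to prototype the core idea, here is a minimal NumPy/SciPy sketch of zero-frequency filtering as described in the abstract: a cascade of two resonators at 0 Hz followed by trend removal. The initial differencing step and the roughly 10-ms trend-removal window are assumptions for illustration, not the authors' exact settings.

        import numpy as np
        from scipy.signal import lfilter

        def zero_frequency_filtered(x, fs, trend_win_ms=10.0):
            # Difference the signal to remove any DC offset (assumed step).
            x = np.diff(x, prepend=x[:1])
            # Cascade of two resonators at 0 Hz: H(z) = 1 / (1 - 2 z^-1 + z^-2).
            y = lfilter([1.0], [1.0, -2.0, 1.0], x)
            y = lfilter([1.0], [1.0, -2.0, 1.0], y)
            # The output grows polynomially; subtract a local mean twice,
            # using a window of roughly one pitch period (assumed 10 ms).
            w = max(3, int(fs * trend_win_ms / 1000.0) | 1)
            kern = np.ones(w) / w
            for _ in range(2):
                y = y - np.convolve(y, kern, mode="same")
            return y

        def pos_to_neg_crossings(y):
            # The abstract locates fundamental-frequency information in the
            # positive-to-negative zero crossings of the filtered signal.
            return np.flatnonzero((y[:-1] > 0) & (y[1:] <= 0))
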
  • A New Efficient Narrowband Active Noise Control System and Its Performance Analysis

    Page(s): 1865 - 1874
    PDF (421 KB) | HTML

    A new narrowband active noise control (ANC) system structure is proposed which requires only two reference signal filtering (x-filtering) blocks regardless of the number of targeted frequencies. The reference cosine or sine waves are combined, respectively, to form an input to an x-filtering block. The output of each x-filtering block is decomposed into filtered-x cosine or sine waves by a special bandpass filter bank. In this way, the computational cost of the system may be significantly reduced. Analysis of the new system is then provided and discussed in some detail. Analytical results reveal that the proposed system performs essentially the same as its counterpart while requiring considerably fewer multiplications. The proposed structure is also modified to cope with frequency mismatch (FM) in real-life applications. Extensive simulations are conducted to demonstrate the effectiveness of the proposed system and its modified version, as well as to confirm the validity of the analysis.

  • Semantic Analysis and Organization of Spoken Documents Based on Parameters Derived From Latent Topics

    Page(s): 1875 - 1889
    PDF (963 KB) | HTML

    Spoken documents are audio signals and thus cannot be easily displayed on-screen or easily scanned and browsed by the user. It is therefore highly desirable to automatically construct summaries, titles, latent topic trees, and key-term-based topic labels for these spoken documents to aid the user in browsing. We refer to this as semantic analysis and organization. Also, as network content is both copious and dynamic, with topics and domains changing every day, the approaches here must be primarily unsupervised. We propose a framework for unsupervised semantic analysis and organization of spoken documents and, for this purpose, introduce two measures derived from latent topic analysis: latent topic significance and latent topic entropy. We show that these can be integrated into an application system with which the user can more easily navigate archives of spoken documents. Probabilistic latent semantic analysis is used as a typical example approach for unsupervised topic analysis in most experiments, although latent Dirichlet allocation is also used in some experiments to show that the proposed measures are equally applicable to different analysis approaches. All of the experiments were performed on Mandarin Chinese broadcast news.

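    As a concrete illustration of one of the two proposed measures, the latent topic entropy of a term follows directly from its topic posterior. This is a minimal sketch assuming the posterior P(T_k | t) has already been obtained from a trained PLSA (or LDA) model; latent topic significance additionally needs per-document term counts and is not reproduced here.

        import numpy as np

        def latent_topic_entropy(p_topic_given_term):
            # p_topic_given_term: length-K probability vector P(T_k | t).
            p = np.asarray(p_topic_given_term, dtype=float)
            p = p / p.sum()          # defensive renormalization
            nz = p[p > 0]            # avoid log(0)
            return -np.sum(nz * np.log(nz))

    A term concentrated in a few topics has low entropy and is therefore more topic-discriminative; generic terms spread over many topics score high.
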
  • A Study on Universal Background Model Training in Speaker Verification

    Page(s): 1890 - 1899
    PDF (1021 KB) | HTML

    State-of-the-art Gaussian mixture model (GMM)-based speaker recognition/verification systems utilize a universal background model (UBM), which typically requires extensive resources, especially if multiple channel and microphone categories are considered. In this study, a systematic analysis of speaker verification system performance is carried out for UBM data that is selected and purposefully altered in different ways, including variation in the amount of data, the sub-sampling structure of the feature frames, and variation in the number of speakers. An objective measure is formulated from the UBM covariance matrix which is found to be highly correlated with system performance when the amount of data is varied while keeping the UBM data set constant, and when the number of UBM speakers is increased while keeping the amount of data constant. The advantages of feature sub-sampling for improving UBM training speed are also discussed, and a novel and effective phonetic distance-based frame selection method is developed. The sub-sampling methods presented are shown to retain baseline equal error rate (EER) performance using only 1% of the original UBM data, resulting in a drastic reduction in UBM training computation time. This, in theory, dispels the myth of “there's no data like more data” for the purpose of UBM construction. With respect to the UBM speakers, the effect of systematically controlling the number of training (UBM) speakers on overall system performance is analyzed. It is shown experimentally that increasing the inter-speaker variability in the UBM data while keeping the overall data size constant gradually improves system performance. Finally, two alternative speaker selection methods based on different speaker diversity measures are presented. Using the proposed schemes, it is shown that by selecting a diverse set of UBM speakers, the baseline system performance can be retained using less than 30% of the original UBM speakers.

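    A minimal sketch of the sub-sampling experiment, with scikit-learn's GaussianMixture standing in as the UBM trainer. Uniform random frame selection is used here in place of the paper's phonetic distance-based selection, and the 1% keep fraction mirrors the figure quoted in the abstract.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_ubm(frames, n_components=512, keep_fraction=0.01, seed=0):
            # frames: (num_frames, feat_dim) pooled feature frames.
            rng = np.random.default_rng(seed)
            n_keep = max(n_components, int(len(frames) * keep_fraction))
            idx = rng.choice(len(frames), size=n_keep, replace=False)
            ubm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag", reg_covar=1e-4,
                                  max_iter=50, random_state=seed)
            ubm.fit(frames[idx])
            return ubm
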
  • A Versatile Framework for Speaker Separation Using a Model-Based Speaker Localization Approach

    Page(s): 1900 - 1912
    PDF (4999 KB) | HTML

    We build upon our speaker localization framework developed in previous work (N. Madhu and R. Martin, “A scalable framework for multiple speaker localization and tracking,” in Proc. Int. Workshop Acoustic Echo Noise Control (IWAENC), Sep. 2008) to perform source separation. The proposed approach, exploiting the supplementary information from the mixture-of-Gaussians-based localization model, allows for the incorporation of a wide class of separation algorithms, from nonlinear time-frequency mask-based approaches to a fully adaptive beamformer in the generalized sidelobe canceller (GSC) structure. We propose, in addition, a generalized estimation of the blocking matrix based on subspace projectors. The adaptive beamformer realized as proposed is insensitive to gain mismatches among the sensors, obviating the need for magnitude calibration of the microphones. It is also demonstrated that the proposed linear approach has performance comparable to that of an optimal (oracle) GSC implementation. In comparison to ICA-based approaches, another advantage of the separation framework described herein is its robustness to ambient noise and to scenarios with an unknown number of sources.

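    The projector-based view of the blocking matrix can be illustrated per frequency bin. This sketch simply projects the multichannel observation onto the orthogonal complement of an assumed steering vector d, cancelling the target so that only interference reaches the adaptive canceller; it is a simple special case, not the paper's generalized estimator.

        import numpy as np

        def blocking_projector(d):
            # d: length-M steering vector for one frequency bin.
            d = d.reshape(-1, 1)
            M = d.shape[0]
            # P = I - d d^H / (d^H d) spans the orthogonal complement of d.
            return np.eye(M, dtype=complex) - (d @ d.conj().T) / (d.conj().T @ d)
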
  • Articulatory Information for Noise Robust Speech Recognition

    Page(s): 1913 - 1924
    PDF (1723 KB) | HTML

    Prior research has shown that articulatory information, if extracted properly from the speech signal, can improve the performance of automatic speech recognition systems. However, such information is not readily available in the signal. The challenge posed by the estimation of articulatory information from speech acoustics has led to a new line of research known as “acoustic-to-articulatory inversion” or “speech-inversion.” While most of the research in this area has focused on estimating articulatory information more accurately, few have explored ways to apply this information in speech recognition tasks. In this paper, we first estimated articulatory information in the form of vocal tract constriction variables (abbreviated as TVs) from the Aurora-2 speech corpus using a neural network based speech-inversion model. Word recognition tasks were then performed for both noisy and clean speech using articulatory information in conjunction with traditional acoustic features. Our results indicate that incorporating TVs can significantly improve word recognition rates when used in conjunction with traditional acoustic features.

  • Inferring the Structure of a Tennis Game Using Audio Information

    Page(s): 1925 - 1937
    PDF (1103 KB) | HTML

    We describe a novel framework for inferring the low-level structure of a sports game (tennis) using only the information available on the audio track of a video recording of the game. Our goal is to segment the game into a sequence of points, the natural unit for describing a tennis match. The framework is hierarchical, consisting of, at the lowest level, identification of audio events, followed by “match” (i.e., semantic) events, and at the highest level, game points. Different techniques that are appropriate to the characteristics of each of these events are used to detect them, and these techniques are coupled in a probabilistic framework. The techniques consist of Gaussian mixture models and a hierarchical language model to detect sequences of audio events, a maximum entropy Markov model to infer “match” events from these audio events, and multigrams to infer the segmentation of a sequence of match events into sequences of points in a tennis game. Our results are promising, giving an F-score for the final detection of points above 0.7.

  • Voice Pathology Detection and Discrimination Based on Modulation Spectral Features

    Page(s): 1938 - 1948
    PDF (917 KB) | HTML

    In this paper, we explore the information provided by a joint acoustic and modulation frequency representation, referred to as the modulation spectrum, for detection and discrimination of voice disorders. The initial representation is first transformed to a lower dimensional domain using higher order singular value decomposition (HOSVD). From this dimension-reduced representation a feature selection process is suggested using an information-theoretic criterion based on the mutual information between voice classes (i.e., normophonic/dysphonic) and features. To evaluate the suggested approach and representation, we conducted cross-validation experiments on a database of sustained vowel recordings from healthy and pathological voices, using support vector machines (SVMs) for classification. For voice pathology detection, the suggested approach achieved a classification accuracy of 94.1±0.28% (95% confidence interval), which is comparable to the accuracy achieved using cepstral-based features. However, for voice pathology classification the suggested approach significantly outperformed cepstral-based features.

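    A rough sketch of the initial joint acoustic/modulation-frequency representation, before the HOSVD dimension reduction and mutual information-based feature selection stages: STFT magnitude envelopes followed by a second Fourier transform along the frame axis. The window and block lengths are illustrative guesses, not the paper's settings.

        import numpy as np
        from scipy.signal import stft

        def modulation_spectrum(x, fs, nperseg=512, mod_frames=128):
            # First transform: acoustic-frequency envelopes over time.
            f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
            env = np.abs(Z)[:, :mod_frames]
            # Second transform along time gives the modulation-frequency axis.
            return np.abs(np.fft.rfft(env, axis=1))   # (acoustic bins, mod bins)
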
  • Speaker Distance Detection Using a Single Microphone

    Page(s): 1949 - 1961
    PDF (1288 KB) | HTML

    A method to detect the distance of a speaker from a single microphone in a room environment is proposed. Several features, related to statistical parameters of speech source excitation signals, are introduced and are shown to depend on the distance between source and receiver. These features are used to train a pattern recognizer for distance detection. The method is tested using a database of speech recordings in four rooms with different acoustical properties. Performance is shown to be independent of the signal gain and level, but to depend on the reverberation time and the characteristics of the room. Overall, the system performs well, especially for close distances and for rooms with low reverberation time, and it appears to be robust to small distance mismatches. Finally, a listening test is conducted in order to compare the results of the proposed method to the performance of human listeners.

  • Virtual Sound Rendering in a Stereophonic Loudspeaker Setup

    Page(s): 1962 - 1974
    PDF (1122 KB) | HTML

    This paper presents a mathematical analysis of the effects of interchannel amplitude and time differences in two-channel (stereophonic) sound systems. The analysis is developed by computing the acoustic conditions at the listener's ears as a function of the stereophonic signal features. We also present separate approximations of head-related transfer function (HRTF)-based panning according to predefined frequency bands. We attempt to create simple models of the stereophonic listening mechanism with approximations in both the time and frequency domains. The models are based upon psychoacoustical theories that describe the frequency-dependent relative importance of acoustical cues. Based on the model, we propose new panning methods that can enhance the localization accuracy of conventional panning methods, such as amplitude panning and HRTF-based panning. The localization performance of the new panning techniques is evaluated and compared by means of auditory model simulations and listening tests. Through the simulation and listening test results, it is shown that the proposed panning method makes substantial improvements in the localization of virtual sources.

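    For reference, the conventional amplitude panning that the proposed methods aim to improve is compact enough to show in full. This sketch implements the standard constant-power (sine/cosine) law, i.e., the baseline, not the paper's psychoacoustically motivated panning.

        import numpy as np

        def amplitude_pan(mono, pan):
            # pan in [-1, 1]: -1 = hard left, +1 = hard right.
            theta = (pan + 1.0) * np.pi / 4.0   # map to [0, pi/2]
            return np.stack([np.cos(theta) * mono,    # left channel
                             np.sin(theta) * mono])   # right channel
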
  • Supervector Dimension Reduction for Efficient Speaker Age Estimation Based on the Acoustic Speech Signal

    Page(s): 1975 - 1985
    PDF (2531 KB) | HTML

    This paper presents a novel dimension reduction method which aims to improve the accuracy and efficiency of speaker age estimation systems based on the speech signal. Two different age estimation approaches were studied and implemented: the first, age-group classification; the second, precise age estimation using regression. Both approaches use Gaussian mixture model (GMM) supervectors as features for a support vector machine (SVM) model. When a radial basis function (RBF) kernel is used, accuracy improves compared to a linear kernel; however, the computational complexity is more sensitive to the feature dimension. Classic dimension reduction methods like principal component analysis (PCA) and linear discriminant analysis (LDA) tend to eliminate relevant feature information and cannot always be applied without damaging the model's accuracy. In our study, a novel dimension reduction method was developed: weighted-pairwise principal component analysis (WPPCA), based on the nuisance attribute projection (NAP) technique. This method projects the supervectors to a reduced space where the redundant within-class pairwise variability is eliminated. The method was applied and compared to a baseline system in which no dimensionality reduction is performed on the supervectors. The conducted experiments showed a dramatic speed-up in SVM training and testing time using the reduced feature vectors. System accuracy was improved by 5% for the classification system and by 10% for the regression system using the proposed dimension reduction method.

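    A minimal sketch of the NAP-style projection underlying the proposed WPPCA: estimate within-class variability directions of the supervectors and remove them. Plain within-class scatter is used below as a stand-in for the paper's weighted pairwise formulation, and for realistic supervector dimensions one would avoid forming the full scatter matrix explicitly.

        import numpy as np

        def nap_project(supervectors, labels, k):
            X = np.asarray(supervectors, dtype=float)
            labels = np.asarray(labels)
            S = np.zeros((X.shape[1], X.shape[1]))
            for c in np.unique(labels):
                Xc = X[labels == c]
                Xc = Xc - Xc.mean(axis=0)      # within-class deviations
                S += Xc.T @ Xc
            # Nuisance subspace: top-k eigenvectors of within-class scatter.
            _, vecs = np.linalg.eigh(S)
            W = vecs[:, -k:]
            # Project the nuisance directions out; keeping the complement
            # coordinates instead yields an explicit dimension reduction.
            return X - (X @ W) @ W.T
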
  • Bayesian Interpolation and Parameter Estimation in a Dynamic Sinusoidal Model

    Page(s): 1986 - 1998
    PDF (1447 KB) | HTML

    In this paper, we propose a method for restoring the missing or corrupted observations of nonstationary sinusoidal signals which are often encountered in music and speech applications. To model nonstationary signals, we use a time-varying sinusoidal model which is obtained by extending the static sinusoidal model into a dynamic sinusoidal model. In this model, the in-phase and quadrature components of the sinusoids are modeled as first-order Gauss-Markov processes. The inference scheme for the model parameters and missing observations is formulated in a Bayesian framework and is based on a Markov chain Monte Carlo method known as the Gibbs sampler. We focus on the parameter estimation in the dynamic sinusoidal model since this constitutes the core of model-based interpolation. In the simulations, we first investigate the applicability of the model and then demonstrate the inference scheme by applying it to the restoration of lost audio packets on a packet-based network. The results show that the proposed method is a reasonable inference scheme for estimating unknown signal parameters and interpolating gaps consisting of missing/corrupted signal segments.

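    The generative model is straightforward to simulate. This sketch draws one component of the dynamic sinusoidal model, with in-phase and quadrature amplitudes following first-order Gauss-Markov (AR(1)) processes; the coefficient and noise level are arbitrary illustrative values.

        import numpy as np

        def dynamic_sinusoid(n, f0, fs, rho=0.999, sigma=0.01, seed=0):
            rng = np.random.default_rng(seed)
            a = np.zeros(n)
            b = np.zeros(n)
            for t in range(1, n):   # AR(1) in-phase/quadrature amplitudes
                a[t] = rho * a[t - 1] + sigma * rng.standard_normal()
                b[t] = rho * b[t - 1] + sigma * rng.standard_normal()
            w = 2.0 * np.pi * f0 / fs * np.arange(n)
            return a * np.cos(w) + b * np.sin(w)
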
  • Large Margin Discriminative Semi-Markov Model for Phonetic Recognition

    Page(s): 1999 - 2012
    PDF (675 KB) | HTML

    This paper considers a large margin discriminative semi-Markov model (LMSMM) for phonetic recognition. The hidden Markov model (HMM) framework that is often used for phonetic recognition assumes only local statistical dependencies between adjacent observations, and it is used to predict a label for each observation without explicit phone segmentation. On the other hand, the semi-Markov model (SMM) framework allows simultaneous segmentation and labeling of sequential data based on a segment-based Markovian structure that assumes statistical dependencies among all the observations within a phone segment. For phonetic recognition, which is inherently a joint segmentation and labeling problem, the SMM framework has the potential to perform better than the HMM framework at the expense of a slight increase in computational complexity. The SMM framework considered in this paper is based on a non-probabilistic discriminant function that is linear in the joint feature map, which attempts to capture long-range statistical dependencies among observations. The parameters of the discriminant function are estimated by a large margin learning framework for structured prediction. The parameter estimation problem at hand leads to an optimization problem with many margin constraints, and this constrained optimization problem is solved using a stochastic gradient descent algorithm. The proposed LMSMM outperformed the large margin discriminative HMM in the TIMIT phonetic recognition task.

  • Measuring Structural Similarity in Music

    Page(s): 2013 - 2025
    PDF (1203 KB) | HTML

    This paper presents a novel method for measuring the structural similarity between music recordings. It uses recurrence plot analysis to characterize patterns of repetition in the feature sequence, and the normalized compression distance, a practical approximation of the joint Kolmogorov complexity, to measure the pairwise similarity between the plots. By measuring the distance between intermediate representations of signal structure, the proposed method departs from common approaches to music structure analysis, which assume a block-based model of music and thus concentrate on segmenting and clustering sections. The approach ensures that global structure is consistently and robustly characterized in the presence of tempo, instrumentation, and key changes, while the metric used provides a simple-to-compute, versatile, and robust alternative to common approaches in music similarity research. Finally, experimental results demonstrate success at characterizing similarity, while contributing an optimal parameterization of the proposed approach.

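    The normalized compression distance at the heart of the method is compact enough to show in full. Here zlib serves as the practical compressor and the inputs are assumed to be serialized recurrence plots; the paper's choice of compressor and serialization may differ.

        import zlib

        def ncd(x: bytes, y: bytes) -> float:
            # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
            # where C(.) is the compressed length.
            cx = len(zlib.compress(x))
            cy = len(zlib.compress(y))
            cxy = len(zlib.compress(x + y))
            return (cxy - min(cx, cy)) / max(cx, cy)
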
  • The Delta-Phase Spectrum With Application to Voice Activity Detection and Speaker Recognition

    Page(s): 2026 - 2038
    PDF (1412 KB) | HTML

    For several reasons, the Fourier phase domain is less favored than the magnitude domain in signal processing and modeling of speech. To correctly analyze the phase, several factors must be considered and compensated, including the effect of the step size, the windowing function, and other processing parameters. Building on a review of these factors, this paper investigates a spectral representation based on the instantaneous frequency deviation, but in which the step size between processing frames is used in calculating phase changes, rather than the traditional single-sample interval. Reflecting these longer intervals, the term delta-phase spectrum is used to distinguish this from instantaneous derivatives. Experiments show that mel-frequency cepstral coefficient features derived from the delta-phase spectrum (termed mel-frequency delta-phase features) can produce broadly similar performance to equivalent magnitude-domain features for both voice activity detection and speaker recognition tasks. Further, it is shown that the fusion of the magnitude and phase representations yields performance benefits over either in isolation.

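    A sketch of the delta-phase computation as described: the phase change between analysis frames one hop apart, with each bin's deterministic linear phase advance removed and the result wrapped back to (-pi, pi]. Window-related corrections discussed in the paper are omitted.

        import numpy as np

        def delta_phase_spectrum(frame_phases, hop, nfft):
            # frame_phases: (num_frames, nfft//2 + 1) STFT phases.
            k = np.arange(frame_phases.shape[1])
            expected = 2.0 * np.pi * hop * k / nfft   # linear advance per hop
            dphi = np.diff(frame_phases, axis=0) - expected
            return np.angle(np.exp(1j * dphi))        # wrap to (-pi, pi]
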
  • Ideal Binary Mask Ratio: A Novel Metric for Assessing Binary-Mask-Based Sound Source Separation Algorithms

    Page(s): 2039 - 2045
    PDF (475 KB) | HTML

    A number of metrics have been proposed in the literature to assess sound source separation algorithms. The addition of convolutional distortion raises further questions about the assessment of source separation algorithms in reverberant conditions, as reverberation is shown to undermine the optimality of the ideal binary mask (IBM) in terms of signal-to-noise ratio (SNR). Furthermore, with a range of mixture parameters common across numerous acoustic conditions, SNR-based metrics demonstrate an inconsistency that can only be attributed to the convolutional distortion. This suggests the need for an alternative metric in the presence of convolutional distortion, such as reverberation. Consequently, a novel metric, dubbed the IBM ratio (IBMR), is proposed for assessing source separation algorithms that aim to calculate the IBM. The metric is robust to many of the effects of convolutional distortion on the output of the system and may provide a more representative insight into the performance of a given algorithm.

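    The IBMR itself is not defined in the abstract, but the ideal binary mask it is built around is standard and shown below: a time-frequency unit is retained when the local target-to-interferer ratio exceeds a local criterion (a 0-dB threshold is assumed here).

        import numpy as np

        def ideal_binary_mask(target_spec, interf_spec, lc_db=0.0):
            # Local SNR per time-frequency unit, in dB.
            snr_db = 10.0 * np.log10((np.abs(target_spec) ** 2) /
                                     (np.abs(interf_spec) ** 2 + 1e-12))
            return (snr_db > lc_db).astype(float)
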
  • Subjective and Objective Quality Assessment of Audio Source Separation

    Page(s): 2046 - 2057
    PDF (1110 KB) | HTML

    We aim to assess the perceived quality of estimated source signals in the context of audio source separation. These signals may involve one or more kinds of distortions, including distortion of the target source, interference from the other sources, or musical noise artifacts. We propose a subjective test protocol to assess the perceived quality with respect to each kind of distortion and collect the scores of 20 subjects over 80 sounds. We then propose a family of objective measures aiming to predict these subjective scores, based on the decomposition of the estimation error into several distortion components and on the use of the PEMO-Q perceptual salience measure to provide multiple features that are then combined. These measures increase the correlation with subjective scores by up to 0.5 compared to nonlinear mappings of individual state-of-the-art source separation measures. Finally, we release the data and code presented in this paper in a freely available toolkit called PEASS.

  • Improving Performance of Hybrid Active Noise Control Systems for Uncorrelated Narrowband Disturbances

    Page(s): 2058 - 2066
    PDF (1173 KB) | HTML

    In filtered-x LMS (FxLMS) single-channel feedforward active noise control (ANC) systems, a reference signal is available that is correlated with the primary disturbance at the error microphone. In some practical situations, there may also be a disturbance uncorrelated with the primary disturbance at the error microphone, for which a correlated reference signal is not available. This disturbance, being uncorrelated with the primary noise, cannot be controlled by the standard FxLMS algorithm and increases the residual noise. In this paper, we propose an improved hybrid ANC system that can simultaneously control both the correlated and uncorrelated noise signals. The proposed method comprises three adaptive filters: 1) an FxLMS-based ANC filter to cancel the primary noise; 2) a separate FxLMS-based ANC filter to cancel the uncorrelated disturbance; and 3) an LMS-based supporting adaptive filter to generate appropriate signals for the two ANC filters. Computer simulations demonstrate that the proposed method can effectively mitigate the correlated and uncorrelated primary disturbances. This improved performance is achieved at only a small increase in computational complexity.

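    For context, here is a minimal single-channel FxLMS loop of the kind both of the paper's ANC filters build on; it is the baseline algorithm, not the proposed three-filter hybrid, and the error signal is formed with the secondary-path estimate itself as an idealization.

        import numpy as np

        def fxlms(x, d, s_hat, L=64, mu=1e-3):
            # x: reference signal; d: disturbance at the error microphone;
            # s_hat: estimated secondary-path impulse response.
            w = np.zeros(L)               # adaptive control filter
            xbuf = np.zeros(L)            # reference history
            fxbuf = np.zeros(L)           # filtered-reference history
            ybuf = np.zeros(len(s_hat))   # anti-noise history
            sxbuf = np.zeros(len(s_hat))  # reference history for x-filtering
            e = np.zeros(len(x))
            for n in range(len(x)):
                xbuf = np.roll(xbuf, 1); xbuf[0] = x[n]
                y = w @ xbuf                          # anti-noise output
                ybuf = np.roll(ybuf, 1); ybuf[0] = y
                e[n] = d[n] + s_hat @ ybuf            # residual at error mic
                sxbuf = np.roll(sxbuf, 1); sxbuf[0] = x[n]
                fxbuf = np.roll(fxbuf, 1); fxbuf[0] = s_hat @ sxbuf
                w = w - mu * e[n] * fxbuf             # FxLMS update
            return e
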
  • Exemplar-Based Sparse Representations for Noise Robust Automatic Speech Recognition

    Page(s): 2067 - 2080
    PDF (862 KB) | HTML

    This paper proposes to use exemplar-based sparse representations for noise robust automatic speech recognition. First, we describe how speech can be modeled as a linear combination of a small number of exemplars from a large speech exemplar dictionary. The exemplars are time-frequency patches of real speech, each spanning multiple time frames. We then propose to model speech corrupted by additive noise as a linear combination of noise and speech exemplars, and we derive an algorithm for recovering this sparse linear combination of exemplars from the observed noisy speech. We describe how the framework can be used for doing hybrid exemplar-based/HMM recognition by using the exemplar activations together with the phonetic information associated with the exemplars. As an alternative to hybrid recognition, the framework also allows us to take a source separation approach which enables exemplar-based feature enhancement as well as missing data mask estimation. We evaluate the performance of these exemplar-based methods in connected digit recognition on the AURORA-2 database. Our results show that the hybrid system performed substantially better than source separation or missing data mask estimation at lower signal-to-noise ratios (SNRs), achieving up to 57.1% accuracy at SNR = -5 dB. Although not as effective as two baseline recognizers at higher SNRs, the novel approach offers a promising direction of future research on exemplar-based ASR.

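    The recovery step, finding a sparse nonnegative combination of stacked speech and noise exemplars, can be sketched with multiplicative updates for a sparsity-penalized KL divergence, one common choice for this kind of model; the paper's exact update rule may differ.

        import numpy as np

        def sparse_activations(y, A, lam=1.0, n_iter=200):
            # y: nonnegative observation (reshaped noisy spectrogram patch);
            # A: dictionary whose columns stack speech and noise exemplars.
            x = np.ones(A.shape[1])
            col_sums = A.sum(axis=0)
            for _ in range(n_iter):
                x *= (A.T @ (y / (A @ x + 1e-12))) / (col_sums + lam + 1e-12)
            return x   # the speech part of A @ x gives the enhanced estimate
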
  • Spoken Language Derived Measures for Detecting Mild Cognitive Impairment

    Page(s): 2081 - 2090
    PDF (352 KB) | HTML

    Spoken responses produced by subjects during neuropsychological exams can provide diagnostic markers beyond exam performance. In particular, characteristics of the spoken language itself can discriminate between subject groups. We present results on the utility of such markers in discriminating between healthy elderly subjects and subjects with mild cognitive impairment (MCI). Given the audio and transcript of a spoken narrative recall task, a range of markers are automatically derived. These markers include speech features such as pause frequency and duration, and many linguistic complexity measures. We examine measures calculated from manually annotated time alignments (of the transcript with the audio) and syntactic parse trees, as well as the same measures calculated from automatic (forced) time alignments and automatic parses. We show statistically significant differences between clinical subject groups for a number of measures. These differences are largely preserved with automation. We then present classification results, and demonstrate a statistically significant improvement in the area under the ROC curve (AUC) when using automatic spoken language derived features in addition to the neuropsychological test scores. Our results indicate that using multiple, complementary measures can aid in automatic detection of MCI.

  • Boosted Mixture Learning of Gaussian Mixture Hidden Markov Models Based on Maximum Likelihood for Speech Recognition

    Page(s): 2091 - 2100
    PDF (437 KB) | HTML

    In this paper, we apply the well-known boosted mixture learning (BML) method to learn Gaussian mixture HMMs in speech recognition. BML is an incremental method to learn mixture models for classification problems. In each step of BML, one new mixture component is estimated according to the functional gradient of an objective function to ensure that it is added along the direction that maximizes the objective function. Several techniques have been proposed to extend BML from simple mixture models like the Gaussian mixture model (GMM) to the Gaussian mixture hidden Markov model (HMM), including Viterbi approximation for state segmentation, weight decay and sampling boosting to initialize sample weights to avoid overfitting, combination between partial updating and global updating to refine model parameters in each BML iteration, and use of the Bayesian Information Criterion (BIC) for parsimonious modeling. Experimental results on two large-vocabulary continuous speech recognition tasks, namely the WSJ-5k and Switchboard tasks, have shown that the proposed BML yields significant performance gain over the conventional training procedure, especially for small model sizes.

  • Diffuse Noise Suppression Using Crystal-Shaped Microphone Arrays

    Page(s): 2101 - 2110
    PDF (627 KB) | HTML

    This paper describes novel methods for diffuse noise suppression using crystal-shaped microphone arrays. Two-stage processing of the observed signals by the minimum variance distortionless response (MVDR) beamformer and a subsequent Wiener post-filter is effective for diffuse noise suppression and gives the linear minimum mean square error (LMMSE) estimator of the target signal. It is essential in this framework to accurately estimate the short-time power spectrum and the steering vectors of the target signal from the noisy observations. Our methods diagonalize the spatial noise covariance matrix and utilize the denoised off-diagonal entries of the spatial covariance matrix to accurately estimate the short-time power spectrum and the steering vectors of the target signal. We employ crystal arrays, certain classes of crystal-shaped array geometries, which make it possible to diagonalize the unknown noise covariance matrix by a constant unitary matrix regardless of its value, as long as the noise meets an isotropy condition. It is shown through experiments with simulated and real environmental noise that the proposed methods outperform previous methods substantially for real-world noise and in the presence of reverberation.

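    The first stage of the pipeline is the textbook MVDR beamformer, whose per-bin weights follow directly from the noise spatial covariance R and the steering vector d; the paper's contribution lies in estimating these quantities accurately, which this sketch takes as given.

        import numpy as np

        def mvdr_weights(R, d):
            # w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.
            Rinv_d = np.linalg.solve(R, d)
            return Rinv_d / (d.conj() @ Rinv_d)
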
  • An Auditory Motivated Asymmetric Compression Technique for Speech Recognition

    Page(s): 2111 - 2124
    PDF (1675 KB) | HTML

    The mel-frequency cepstral coefficient (MFCC) parameterization for automatic speech recognition (ASR) utilizes several perceptual features of the human auditory system, one of which is static compression. Motivated by the human auditory system, the conventional static logarithmic compression applied in the MFCC is analyzed using psychophysical loudness perception curves. Following the property of the auditory system that dynamic range compression is higher in the basal regions than in the apical regions of the basilar membrane, we propose a method of unequal (asymmetric) compression, i.e., higher compression applied in the higher frequency regions than in the lower frequency regions. The method is applied and tested in the MFCC and PLP parameterizations in the spectral domain, and in the ZCPA auditory model used as an ASR front-end in the temporal domain. The extent of the asymmetric compression is applied as a multiplicative gain to the existing static compression, and is determined from the gradient of the piece-wise linear segment of the perceptual compression curve. The proposed method has the advantage of adjusting compression parametrically for improved ASR performance and audibility in noise conditions through low-frequency spectral enhancement, particularly of vowels with lower F1 and F2 formants. Continuous-density HMM recognition experiments on the Aurora 2 corpus and TIdigits show performance improvements in additive noise conditions.

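    A toy illustration of the unequal-compression idea: scale the static log compression by a channel-dependent gain. Since g * log(x) = log(x ** g), a gain below 1 corresponds to stronger compression, so the gain here ramps down with center frequency to compress high-frequency channels more; the linear ramp and gain range are arbitrary assumptions, whereas the paper derives its gains from the gradients of psychophysical loudness curves.

        import numpy as np

        def asymmetric_compress(mel_energies, center_hz, g_lo=1.0, g_hi=0.5):
            # mel_energies: (frames, channels); center_hz: (channels,).
            f = np.asarray(center_hz, dtype=float)
            gain = g_lo + (g_hi - g_lo) * (f - f.min()) / (f.max() - f.min())
            return gain * np.log(mel_energies + 1e-12)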

Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.


This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research