IEEE Transactions on Audio, Speech, and Language Processing

Issue 3 • March 2012

Displaying Results 1–25 of 36
  • Table of contents

    Publication Year: 2012, Page(s): C1–C4
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2012, Page(s): C2
    Freely Available from IEEE
  • A Nonparametric Bayesian Multipitch Analyzer Based on Infinite Latent Harmonic Allocation

    Publication Year: 2012, Page(s): 717–730
    Cited by: Papers (8)

    The statistical multipitch analyzer described in this paper estimates multiple fundamental frequencies (F0s) in polyphonic music audio signals produced by pitched instruments. It is based on hierarchical nonparametric Bayesian models that can deal with the uncertainty of unknown random variables such as model complexities (e.g., the number of F0s and the number of harmonic partials), model parameters (e.g., the values of F0s and the relative weights of harmonic partials), and hyperparameters (i.e., prior knowledge on complexities and parameters). Using these models, we propose a statistical method called infinite latent harmonic allocation (iLHA). To avoid model-complexity control, we allow the observed spectra to contain an unbounded number of sound sources (F0s), each of which is allowed to contain an unbounded number of harmonic partials. More specifically, to model a set of time-sliced spectra, we formulated nested infinite Gaussian mixture models based on hierarchical and generalized Dirichlet processes. To avoid manual tuning of influential hyperparameters, we put noninformative hyperprior distributions on them in a hierarchical manner. For efficient Bayesian inference, we used a modern technique called collapsed variational Bayes. In comparative experiments using audio recordings of piano and guitar solo performances, iLHA yielded promising results, and we found room for improvement through modeling of temporal continuity and spectral smoothness.
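
    As a loose illustration of the "unbounded sources, each with unbounded partials" idea (a toy generative sketch, not the authors' collapsed variational inference; all parameter values, names, and the Gaussian partial shapes are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)

        def stick_breaking(alpha, truncation):
            """Truncated stick-breaking weights approximating a Dirichlet process."""
            betas = rng.beta(1.0, alpha, size=truncation)
            remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
            return betas * remaining

        freqs = np.linspace(0.0, 4000.0, 1024)        # frequency axis in Hz
        source_weights = stick_breaking(alpha=2.0, truncation=20)
        f0s = rng.uniform(80.0, 800.0, size=20)       # candidate fundamentals

        spectrum = np.zeros_like(freqs)
        for w, f0 in zip(source_weights, f0s):
            partial_weights = stick_breaking(alpha=1.0, truncation=10)
            for m, pw in enumerate(partial_weights, start=1):
                # Each partial is a Gaussian bump at the m-th harmonic of f0.
                spectrum += w * pw * np.exp(-0.5 * ((freqs - m * f0) / 15.0) ** 2)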

    Open Access
  • Performance Analysis and Improvement of Turkish Broadcast News Retrieval

    Publication Year: 2012, Page(s): 731–741
    Cited by: Papers (5)

    This paper presents our work on the retrieval of spoken information in Turkish. Traditional speech retrieval systems perform indexing and retrieval over automatic speech recognition (ASR) transcripts, which contain errors caused either by out-of-vocabulary (OOV) words or by ASR inaccuracy. We use subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors. The performance of these methods is evaluated on our Turkish Broadcast News Corpus with two types of speech retrieval systems: a spoken term detection (STD) system and a spoken document retrieval (SDR) system. To evaluate the SDR system, we also build a spoken information retrieval (IR) collection, the first of its kind for Turkish. Experiments showed that word segmentation algorithms are quite useful for both tasks. SDR performance is observed to be less dependent on the ASR component, whereas any performance change in ASR directly affects STD. We also present an extensive analysis of retrieval performance as a function of query length, and propose length-based index combination and thresholding strategies for the STD task. Finally, a new approach, which depends on the detection of stems instead of complete terms, is tried for STD and observed to give promising results. Although evaluations were performed in Turkish, we expect the proposed methods to be effective for similar languages as well.
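
    A minimal sketch of the subword-indexing idea under stated assumptions: the naive fixed-length "stem" segmenter below is only a placeholder for the statistical word segmentation algorithms the paper evaluates, and the final lookup illustrates stem-based detection:

        from collections import defaultdict

        def segment(word):
            # Placeholder: split off a fixed-length prefix as the "stem";
            # the paper uses proper statistical word segmentation instead.
            return [word[:4], word[4:]] if len(word) > 4 else [word]

        def build_index(transcripts):
            """Inverted index mapping each subword unit to the documents containing it."""
            index = defaultdict(set)
            for doc_id, text in transcripts.items():
                for word in text.lower().split():
                    for unit in segment(word):
                        index[unit].add(doc_id)
            return index

        index = build_index({1: "ankara haberleri", 2: "istanbul haberlerinde"})
        print(index["habe"])   # both inflected forms of "haber..." match the stem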

  • Optimal Higher Order Ambisonics Encoding With Predefined Constraints

    Publication Year: 2012, Page(s): 742–754
    Cited by: Papers (2)

    In this paper, we propose a design method for 3-D higher order ambisonics (3-D HOA) encoding matrices that makes it possible to impose spatial stop-bands in the directivity patterns of all the spherical-harmonic audio channels while keeping the transformed audio channels compatible with the 3-D HOA reproduction sound format. This can be useful as an encoding technique that suppresses interfering signals from specific directions in a 3-D HOA recording, or in other situations where certain spatial areas should be suppressed. The design method is adapted from recent work on the optimization of spherical microphone array beamforming. Using the proposed optimization method and the spherical harmonics mathematical framework, the relationships among several design factors (e.g., distortions in the desired response and the dynamic range of the matrix coefficients) can be analyzed and illustrated as a function of frequency. Based on the proposed optimization formulation, additional constraints can also be easily included and the resulting problem solved. In some of the formulations, the processing can be applied as a matrix multiplication to recorded spherical harmonics coefficients, that is, to already encoded 3-D HOA format signals. The modified signals can be of the same or a lower spherical harmonics order. For a full optimization that gives a globally optimal solution, on the other hand, the processing must be applied to the microphone signals themselves. Numerical and experimental results validate the proposed method.
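
    The spherical-harmonics formulation itself is not reproduced here, but designs of this kind reduce to a constrained least-squares step; a generic sketch under assumed matrix names (A maps encoding weights to responses at sampled directions, d is the desired response, and the columns of C are steering vectors of the stop-band directions to be nulled):

        import numpy as np

        def constrained_least_squares(A, d, C):
            """Minimize ||A w - d||^2 subject to C.T @ w = 0 via the KKT system."""
            n, k = A.shape[1], C.shape[1]
            kkt = np.block([[A.T @ A, C], [C.T, np.zeros((k, k))]])
            rhs = np.concatenate([A.T @ d, np.zeros(k)])
            return np.linalg.solve(kkt, rhs)[:n]   # drop the Lagrange multipliers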

  • Source-Normalized LDA for Robust Speaker Recognition Using i-Vectors From Multiple Speech Sources

    Publication Year: 2012, Page(s): 755–766
    Cited by: Papers (5)

    The recent development of the i-vector framework for speaker recognition has set a new performance standard in the research field. An i-vector is a compact representation of a speaker's utterance extracted from a total variability subspace. Prior to classification using a cosine kernel, i-vectors are projected into a linear discriminant analysis (LDA) space in order to reduce intersession variability and enhance speaker discrimination. The accurate estimation of this LDA space from a training dataset is crucial to detection performance. A typical training dataset, however, does not consist of utterances acquired through all sources of interest for each speaker. This has the effect of introducing systematic variation related to the speech source into the between-speaker covariance matrix and results in an incomplete representation of the within-speaker scatter matrix used for LDA. The recently proposed source-normalized (SN) LDA algorithm improves the robustness of i-vector-based speaker recognition under both mismatched evaluation conditions and conditions for which inadequate speech resources are available for suitable system development. When evaluated on the recent NIST 2008 and 2010 Speaker Recognition Evaluations (SRE), SN-LDA demonstrated relative improvements of up to 38% in equal error rate (EER) and 44% in minimum DCF over LDA under mismatched and sparsely resourced evaluation conditions, while also providing improvements in the common telephone-only conditions. Extending these initial developments, this study provides a thorough analysis of how SN-LDA transforms the i-vector space to reduce source variation and of its robustness to varying evaluation and LDA training conditions. The concept of source normalization is further extended to within-class covariance normalization (WCCN) and data-driven source detection.
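
    A sketch of the source-normalization idea as described, with assumed array layout: the between-speaker scatter is accumulated within each source against a source-specific mean, so systematic source offsets cannot inflate it:

        import numpy as np

        def sn_between_speaker_scatter(ivectors, speakers, sources):
            """Source-normalized between-speaker scatter.

            ivectors: (N, D) array; speakers, sources: length-N label arrays.
            """
            Sb = np.zeros((ivectors.shape[1],) * 2)
            for src in np.unique(sources):
                X, spk = ivectors[sources == src], speakers[sources == src]
                mu_src = X.mean(axis=0)            # source-specific mean
                for s in np.unique(spk):
                    Xs = X[spk == s]
                    d = (Xs.mean(axis=0) - mu_src)[:, None]
                    Sb += len(Xs) * d @ d.T
            return Sb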

  • A Wiener Filter Approach to Microphone Leakage Reduction in Close-Microphone Applications

    Publication Year: 2012, Page(s): 767–779

    Microphone leakage is one of the most prevalent problems in audio applications involving multiple instruments and multiple microphones, and sound engineers currently have limited solutions available to them. In this paper, the applicability of two widely used signal enhancement methods to this problem is discussed, namely blind source separation and noise suppression. By extending previous work, it is shown that the noise suppression framework is a valid choice and can effectively address the problem of microphone leakage. Here, an extended form of the single-channel Wiener filter is used that takes into account the individual audio sources to derive a multichannel noise term. A novel power spectral density (PSD) estimation method is also proposed, based on the identification of dominant frequency bins by examining the microphone and output signal PSDs. The performance of the method is examined in simulated environments with various source-microphone setups, and it is shown that the proposed approach efficiently suppresses leakage.
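
    A minimal sketch of the gain computation this implies, assuming PSD estimates for each source are already in hand (the paper's dominant-bin PSD estimation step is not shown):

        import numpy as np

        def leakage_wiener_gain(target_psd, leakage_psds, gain_floor=1e-3):
            """Wiener gain for one close microphone: the summed PSDs of the
            other sources act as the multichannel 'noise' term."""
            noise_psd = np.sum(leakage_psds, axis=0)
            gain = target_psd / (target_psd + noise_psd + 1e-12)
            return np.maximum(gain, gain_floor)    # floor limits musical noise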

  • Automatic Speech Recognition Based on Non-Uniform Error Criteria

    Publication Year: 2012, Page(s): 780–793
    Cited by: Papers (3)

    Bayes decision theory is the foundation of the classical statistical pattern recognition approach, with the expected error as the performance objective. For most pattern recognition problems, the “error” is conventionally assumed to be binary, i.e., 0 or 1, equivalent to error counting, independent of the specifics of the error made by the system. The term “error rate” has thus long been considered the prevalent system performance measure. This performance measure, nonetheless, may not be satisfactory in many practical applications. In automatic speech recognition, for example, it is well known that some errors are more detrimental (e.g., more likely to lead to misunderstanding of the spoken sentence) than others. In this paper, we propose an extended framework for the speech recognition problem with a non-uniform classification/recognition error cost that can be controlled by the system designer. In particular, we address the issue of system model optimization when the cost of a recognition error is class dependent. We formulate the problem in the framework of the minimum classification error (MCE) method, after appropriate generalization to integrate the class-dependent error cost into one consistent objective function for optimization. We present a variety of training scenarios for automatic speech recognition under this extended framework. Experimental results for continuous speech recognition are provided to demonstrate the effectiveness of the new approach.
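
    A sketch of one plausible form of the cost-weighted loss (the smoothed misclassification measure is standard MCE; scaling the sigmoid loss by a class-dependent cost is the assumed generalization, not necessarily the paper's exact objective):

        import numpy as np

        def weighted_mce_loss(scores, label, costs, eta=1.0, gamma=1.0):
            """scores: discriminant values g_j(x) per class; label: correct
            class index; costs[label]: cost of misrecognizing that class."""
            others = np.delete(scores, label)
            # Smoothed max over competitors, then the misclassification measure.
            d = -scores[label] + np.log(np.mean(np.exp(eta * others))) / eta
            return costs[label] / (1.0 + np.exp(-gamma * d))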

  • Product of Experts for Statistical Parametric Speech Synthesis

    Publication Year: 2012, Page(s): 794–805
    Cited by: Papers (7)

    Multiple acoustic models are often combined in statistical parametric speech synthesis. Both linear and non-linear functions of an observation sequence are used as features to be modeled. This paper shows that this combination of multiple acoustic models can be expressed as a product of experts (PoE); the likelihoods from the models are scaled, multiplied together, and then normalized. Normally these models are individually trained and only combined at the synthesis stage. This paper discusses a more consistent PoE framework where the models are jointly trained. A training algorithm for PoEs based on linear feature functions and Gaussian experts is derived by generalizing the training algorithm for trajectory HMMs. However for non-linear feature functions or non-Gaussian experts this is not possible, so a scheme based on contrastive divergence learning is described. Experimental results show that the PoE framework provides both a mathematically elegant way to train multiple acoustic models jointly and significant improvements in the quality of the synthesized speech.
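
    For Gaussian experts, the scaled, multiplied, and renormalized combination described above has a closed form: precisions add and means combine precision-weighted. A one-dimensional sketch of just that combination step (not the joint training itself):

        import numpy as np

        def product_of_gaussian_experts(means, variances, scales):
            """Normalized product of N(mean_k, var_k) experts raised to scale_k."""
            precisions = np.asarray(scales) / np.asarray(variances)
            var = 1.0 / precisions.sum()
            mean = var * np.sum(precisions * np.asarray(means))
            return mean, var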

  • Voice Conversion Using Dynamic Kernel Partial Least Squares Regression

    Publication Year: 2012, Page(s): 806–817
    Cited by: Papers (10)

    A drawback of many voice conversion algorithms is that they rely on linear models and/or require extensive tuning. In addition, many of them ignore the inherent time-dependency between speech features. To address these issues, we propose the dynamic kernel partial least squares (DKPLS) technique to model nonlinearities and to capture the dynamics in the data. The method is based on a kernel transformation of the source features to allow non-linear modeling, and on concatenation of previous and next frames to model the dynamics. Partial least squares regression is used to find a conversion function that does not overfit the data. The resulting algorithm is simple and efficient and does not require extensive tuning. Existing statistical methods proposed for voice conversion are able to produce good similarity between the original and the converted target voices, but the quality is usually degraded. The experiments conducted on a variety of conversion pairs show that DKPLS, being a statistical method, enables successful identity conversion while achieving a major improvement in quality scores compared to the state-of-the-art Gaussian mixture-based model. In addition to enabling better spectral feature transformation, quality is further improved when aperiodicity and binary voicing values are converted using DKPLS with auxiliary information from spectral features.
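
    A sketch of the pipeline as described, using assumed shapes and hyperparameters (source and target are time-aligned frame matrices, refs is a set of reference frames for the kernel transform; the wrap-around frame handling at utterance edges is a simplification):

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.metrics.pairwise import rbf_kernel

        def dkpls_fit(source, target, refs, n_components=20, gamma=0.01):
            K = rbf_kernel(source, refs, gamma=gamma)       # kernelized features
            # Concatenate previous, current, and next kernel frames (dynamics).
            dyn = np.hstack([np.roll(K, 1, axis=0), K, np.roll(K, -1, axis=0)])
            pls = PLSRegression(n_components=n_components)
            pls.fit(dyn, target)
            return pls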

  • Combining Speech Fragment Decoding and Adaptive Noise Floor Modeling

    Publication Year: 2012, Page(s): 818–827
    Cited by: Papers (3)

    This paper presents a novel noise-robust automatic speech recognition (ASR) system that combines aspects of the noise modeling and source separation approaches to the problem. The combined approach is motivated by the observation that the noise backgrounds encountered in everyday listening situations can be roughly characterized as a slowly varying noise floor in which a mixture of energetic but unpredictable acoustic events is embedded. Our solution combines two complementary techniques. First, an adaptive noise floor model estimates the degree to which high-energy acoustic events are masked by the noise floor (represented by a soft missing-data mask). Second, a fragment decoding system attempts to interpret the high-energy regions that are not accounted for by the noise floor model. This component uses models of the target speech to decide whether fragments should be included in the target speech stream or not. Our experiments on the CHiME corpus task show that the combined approach performs significantly better than systems using either the noise model or the fragment decoding approach alone, and substantially outperforms multicondition training.
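
    A minimal sketch of the first component under assumed parameter values: a slowly adapting noise-floor track per frequency channel, converted to a soft reliability mask (the paper's adaptive model and the fragment decoder are considerably more elaborate):

        import numpy as np

        def soft_missing_data_mask(power_spec, alpha=0.98, slope=3.0, offset_db=3.0):
            """power_spec: (channels, frames) power spectrogram; returns a soft
            mask near 1 where energy stands well above the noise floor."""
            floor = power_spec[:, 0].copy()
            mask = np.zeros_like(power_spec)
            for t in range(power_spec.shape[1]):
                smoothed = alpha * floor + (1.0 - alpha) * power_spec[:, t]
                floor = np.minimum(smoothed, power_spec[:, t])   # track minima
                snr_db = 10.0 * np.log10((power_spec[:, t] + 1e-12) / (floor + 1e-12))
                mask[:, t] = 1.0 / (1.0 + np.exp(-slope * (snr_db - offset_db)))
            return mask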

  • Modulation Spectrum Equalization for Improved Robust Speech Recognition

    Publication Year: 2012, Page(s): 828–843
    Cited by: Papers (5)

    We propose novel approaches for equalizing the modulation spectrum for robust feature extraction in speech recognition. Common to all approaches is that the temporal trajectories of the feature parameters are first transformed into the magnitude modulation spectrum. In spectral histogram equalization (SHE) and two-band spectral histogram equalization (2B-SHE), we equalize the histogram of the modulation spectrum for each utterance to a reference histogram obtained from clean training data, or perform the equalization with two sub-bands on the modulation spectrum. In magnitude ratio equalization (MRE), we define the magnitude ratio of lower to higher modulation frequency components for each utterance, and equalize it to a reference value obtained from clean training data. These approaches can be viewed as temporal filters that are adapted to each testing utterance. Experiments performed on the Aurora 2 and 4 corpora for small- and large-vocabulary tasks indicate that significant performance improvements are achievable for all noise conditions. We also show that additional improvements can be obtained when these approaches are integrated with cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), higher order cepstral moment normalization (HOCMN), or the advanced front-end (AFE). We analyze and discuss the reasons for these improvements from different viewpoints with different sets of data, including adaptive temporal filtering, noise behavior on the modulation spectrum, phoneme types, and modulation spectrum distance measures.
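
    A sketch of the MRE variant, which is the simplest to state (the cutoff bin and the exact ratio definition are assumptions):

        import numpy as np

        def magnitude_ratio_equalize(trajectory, ref_ratio, cutoff_bin):
            """Scale the high modulation-frequency magnitudes of one feature
            trajectory so its low/high magnitude ratio matches ref_ratio."""
            spec = np.fft.rfft(trajectory)
            mag, phase = np.abs(spec), np.angle(spec)
            ratio = mag[:cutoff_bin].sum() / (mag[cutoff_bin:].sum() + 1e-12)
            mag[cutoff_bin:] *= ratio / ref_ratio    # new ratio equals ref_ratio
            return np.fft.irfft(mag * np.exp(1j * phase), n=len(trajectory))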

  • Automatic Transcription of Bell Chiming Recordings

    Publication Year: 2012, Page(s): 844–853
    Cited by: Papers (1)

    Bell chiming is a folk music tradition that involves performers playing rhythmic patterns on church bells. The paper presents a method for automatic transcription of bell chiming recordings, where the goal is to detect the bells that were played and their onset times. We first present an algorithm that estimates the number of bells in a recording and their approximate spectra. The algorithm uses a modified version of the intelligent k-means algorithm, as well as some prior knowledge of church bell acoustics, to find clusters of partials with synchronous onsets in the time-frequency representation of a recording. Cluster centers are used to initialize non-negative matrix factorization that factorizes the time-frequency representation into a set of basis vectors (bell spectra) and their activations. To transcribe a recording, we propose a probabilistic framework that integrates factorization and onset detection data with prior knowledge of bell chiming performance rules. Both parts of the algorithm are evaluated on a set of bell chiming field recordings.
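
    A sketch of the factorization stage as described, with the cluster-derived spectra used to initialize NMF (matrix shapes and the activation initialization are assumptions):

        import numpy as np
        from sklearn.decomposition import NMF

        def factorize_bells(V, bell_spectra):
            """V: (freq, time) magnitude spectrogram; bell_spectra: (freq, bells)
            cluster centers. Returns refined spectra and their activations."""
            n_bells = bell_spectra.shape[1]
            W0 = np.maximum(bell_spectra, 1e-6)      # NMF needs nonnegative init
            H0 = np.full((n_bells, V.shape[1]), 0.1)
            model = NMF(n_components=n_bells, init="custom", max_iter=500)
            W = model.fit_transform(V, W=W0, H=H0)
            return W, model.components_              # spectra, activations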

  • A Speech Distortion and Interference Rejection Constraint Beamformer

    Publication Year: 2012, Page(s): 854–867
    Cited by: Papers (4)

    Signals captured by a set of microphones in a speech communication system are mixtures of desired and undesired signals and ambient noise. Existing beamformers can be divided into those that preserve and those that distort the desired signal. Beamformers that preserve the desired signal include the linearly constrained minimum variance (LCMV) beamformer, which ideally rejects the undesired signal and reduces the ambient noise power, and the minimum variance distortionless response (MVDR) beamformer, which reduces the interference-plus-noise power. The multichannel Wiener filter, on the other hand, reduces the interference-plus-noise power without preserving the desired signal. In this paper, a speech distortion and interference rejection constraint (SDIRC) beamformer is derived that minimizes the ambient noise power subject to specific constraints that allow a tradeoff between speech distortion and interference-plus-noise reduction on the one hand, and undesired signal and ambient noise reductions on the other. Closed-form expressions for the performance measures of the SDIRC beamformer are derived, and its relations to the aforementioned beamformers are established. The performance evaluation demonstrates the tradeoffs that can be made using the SDIRC beamformer.
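
    For reference, the LCMV beamformer named above has the familiar closed form sketched below; the SDIRC design can be read as modifying the constraint values f to trade distortion against rejection (a generic sketch, not the paper's exact parameterization):

        import numpy as np

        def lcmv_weights(R, C, f):
            """Minimize w^H R w subject to C^H w = f:
            w = R^{-1} C (C^H R^{-1} C)^{-1} f."""
            Rinv_C = np.linalg.solve(R, C)
            return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)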

  • A Normalized Beamforming Algorithm for Broadband Speech Using a Continuous Interleaved Sampling Strategy

    Publication Year: 2012, Page(s): 868–874

    The delay beamforming method, one of the most effective microphone array techniques, requires minimal computational complexity and enjoys wide usage in narrowband signal applications. However, the method suffers from high losses of low-frequency sound and is generally not suitable for broadband signals. In this paper, a novel normalized beamforming algorithm is introduced for broadband speech applications in multichannel cochlear implants (CIs), and it can be adjusted to the application conditions. The proposed algorithm can, with minimal computational complexity, accurately and effectively compensate for changes in the signal beam pattern, in accordance with a preselected benchmark. In addition, the method can be combined with the widely used continuous interleaved sampling (CIS) strategy, adding to its practical value.

  • Experiments on Cross-Language Attribute Detection and Phone Recognition With Minimal Target-Specific Training Data

    Publication Year: 2012, Page(s): 875–887
    Cited by: Papers (15)

    A state-of-the-art automatic speech recognition (ASR) system can often achieve high accuracy for most spoken languages of interest if a large amount of speech material can be collected and used to train a set of language-specific acoustic phone models. However, designing good ASR systems with little or no language-specific speech data for resource-limited languages is still a challenging research topic. As a consequence, there has been an increasing interest in exploring knowledge sharing among a large number of languages so that a universal set of acoustic phone units can be defined to work for multiple or even for all languages. This work aims at demonstrating that a recently proposed automatic speech attribute transcription framework can play a key role in designing language-universal acoustic models by sharing speech units among all target languages at the acoustic phonetic attribute level. The language-universal acoustic models are evaluated through phone recognition. It will be shown that good cross-language attribute detection and continuous phone recognition performance can be accomplished for “unseen” languages using minimal training data from the target languages to be recognized. Furthermore, a phone-based background model (PBM) approach will be presented to improve attribute detection accuracies.

  • A Forced Spectral Diversity Algorithm for Speech Dereverberation in the Presence of Near-Common Zeros

    Publication Year: 2012, Page(s): 888–899
    Cited by: Papers (6)

    Blind identification of single-input multiple-output (SIMO) systems is not normally possible if common zeros exist in the channels. Studies of measured acoustic SIMO systems show that near-common zeros occur in such systems as encountered in the speech dereverberation task. We therefore introduce a method, termed forced spectral diversity (FSD), that introduces additional diversity into the SIMO system to be identified, and we show that its use leads to an identification-equalization approach that gives improved dereverberation. As part of this work, we show the link between channel diversity and the effect of common zeros. We also define and discuss in more detail the concept and impact of near-common zeros. The proposed algorithm is presented specifically for a two-channel system where such near-common zeros exist.
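
    For context, two-channel blind identification commonly exploits the cross-relation x1 * h2 = x2 * h1; a sketch under an assumed channel length L, where near-common zeros manifest as the stacked matrix becoming nearly rank-deficient in more than one direction:

        import numpy as np
        from scipy.linalg import toeplitz

        def conv_matrix(x, L):
            """(len(x)+L-1, L) convolution matrix of signal x."""
            col = np.concatenate([x, np.zeros(L - 1)])
            row = np.concatenate([[x[0]], np.zeros(L - 1)])
            return toeplitz(col, row)

        def cross_relation_identify(x1, x2, L):
            """Null-space (smallest singular vector) estimate of [h1; h2]."""
            A = np.hstack([conv_matrix(x2, L), -conv_matrix(x1, L)])
            h = np.linalg.svd(A)[2][-1]
            return h[:L], h[L:]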

  • Learning-Based Auditory Encoding for Robust Speech Recognition

    Publication Year: 2012, Page(s): 900–914
    Cited by: Papers (1)

    This paper describes an approach to the optimization of the nonlinear component of a physiologically motivated feature extraction system for automatic speech recognition. Most computational models of the peripheral auditory system include a sigmoidal nonlinear function that relates the log of signal intensity to output level, which we represent by a set of frequency-dependent logistic functions. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in training data. The performance of this approach was verified by the results of a series of experiments conducted with the CMU Sphinx-III speech recognition system on the DARPA Resource Management and Wall Street Journal databases and on the AURORA 2 database. In general, it was shown that feature extraction incorporating the learned rate-level nonlinearity, combined with a complementary loudness compensation function, results in better recognition accuracy in the presence of background noise than traditional MFCC feature extraction without the optimized nonlinearity, when the system is trained on clean speech and tested in noise. We also describe the use of a lattice structure that constrains the training process, enabling training with much more complicated acoustic models.
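
    The sigmoidal rate-level shape itself is simple; a sketch of one frequency channel, where w and b stand for the per-channel parameters the paper learns discriminatively (names and the output range are assumptions):

        import numpy as np

        def rate_level(log_intensity, w, b, y_min=0.0, y_max=1.0):
            """Logistic function mapping log signal intensity to output level."""
            return y_min + (y_max - y_min) / (1.0 + np.exp(-(w * log_intensity + b)))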

  • Automatic Transcription of Guitar Chords and Fingering From Audio

    Publication Year: 2012, Page(s): 915–921
    Cited by: Papers (4)

    This paper proposes a method for extracting fingering configurations automatically from a recorded guitar performance. A total of 330 fingering configurations are considered, corresponding to different versions of the major, minor, major 7th, and minor 7th chords played on the guitar fretboard. The method is formulated as a hidden Markov model, where the hidden states correspond to the different fingering configurations and the observed acoustic features are obtained from a multiple fundamental frequency estimator that measures the salience of a range of candidate note pitches within individual time frames. Transitions between consecutive fingerings are constrained by a musical model trained on a database of chord sequences and by a heuristic cost function that measures the physical difficulty of moving from one configuration of finger positions to another. The method was evaluated on recordings of acoustic, electric, and Spanish guitars and clearly outperformed a non-guitar-specific reference chord transcription method, despite the fact that the number of chords considered here is significantly larger.
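
    Decoding such a model is a standard Viterbi pass over fingering states; a generic sketch, where log_trans is assumed to already combine the chord-sequence model with the hand-movement cost:

        import numpy as np

        def viterbi(log_obs, log_trans, log_prior):
            """log_obs: (T, S) salience-derived log-likelihoods; returns the
            most probable fingering-state path."""
            T, S = log_obs.shape
            delta = log_prior + log_obs[0]
            back = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_trans
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_obs[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):
                path.append(int(back[t, path[-1]]))
            return path[::-1]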

  • Audio Inpainting

    Publication Year: 2012, Page(s): 922–932
    Cited by: Papers (15)

    We propose an audio inpainting framework that recovers portions of audio data distorted by impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted data are treated as missing and their location is assumed to be known. The signal is decomposed into overlapping time-domain frames and the restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The Signal-to-Noise Ratio performance of this algorithm is shown to be comparable to or better than that of state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high-quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio.
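
    A per-frame sketch of the plain (non-declipping) case, assuming the missing-sample positions are known and using a DCT dictionary (frame length and sparsity level are illustrative):

        import numpy as np
        from scipy.fft import idct
        from sklearn.linear_model import orthogonal_mp

        def inpaint_frame(frame, observed_idx, n_nonzero=32):
            """Fit a sparse DCT code to the observed samples with OMP, then
            synthesize the missing ones from the same code."""
            N = len(frame)
            D = idct(np.eye(N), axis=0, norm="ortho")   # DCT synthesis dictionary
            code = orthogonal_mp(D[observed_idx], frame[observed_idx],
                                 n_nonzero_coefs=n_nonzero)
            restored = frame.copy()
            missing = np.setdiff1d(np.arange(N), observed_idx)
            restored[missing] = (D @ code)[missing]
            return restored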

  • SAFE: A Statistical Approach to F0 Estimation Under Clean and Noisy Conditions

    Publication Year: 2012, Page(s): 933–944
    Cited by: Papers (4)

    A novel Statistical Algorithm for F0 Estimation (SAFE) is proposed to improve the accuracy of F0 estimation under both clean and noisy conditions. Prominent signal-to-noise ratio (SNR) peaks in speech spectra constitute a robust information source from which F0 can be inferred. A probabilistic framework is proposed to model the effect of noise on voiced speech spectra. Prominent SNR peaks in the low-frequency band (0–1000 Hz) are important to F0 estimation, and prominent SNR peaks in the middle- and high-frequency bands (1000–3000 Hz) provide useful supplemental information for F0 estimation under noisy conditions, especially the babble noise condition. Experiments show that the SAFE algorithm has the lowest gross pitch errors (GPEs) compared to prevailing F0 trackers in white and babble noise conditions at low SNRs. Experimental results also show that SAFE is robust in maintaining a low mean and standard deviation of the fine pitch errors (MFPE and SDFPE) in noise. The code of SAFE is available at http://www.ee.ucla.edu/~weichu/safe.

  • Psychoacoustic Model Compensation for Robust Speaker Verification in Environmental Noise

    Publication Year: 2012, Page(s): 945–953
    Cited by: Papers (1)

    We investigate the problem of speaker verification in noisy conditions in this paper. Our work is motivated by the fact that environmental noise severely degrades the performance of speaker verification systems. We present a model compensation scheme based on psychoacoustic principles that adapts the model parameters in order to reduce the mismatch between training and verification. To deal with scenarios where accurate noise estimation is difficult, a modified multiconditioning scheme is proposed. The new algorithm was tested on two speech databases. The first is the TIMIT database corrupted with white and pink noise, where noise estimation is fairly easy. The second is the MIT Mobile Device Speaker Verification Corpus (MITMDSVC), which contains realistic noisy speech data that make noise estimation difficult. The proposed scheme achieves significant performance gains over the baseline system in both cases.

  • A Perspective on Frequency-Domain Beamformers in Room Acoustics

    Publication Year: 2012, Page(s): 947–960
    Cited by: Papers (3)

    Signals captured by a set of microphones in a speech communication system are mixtures of desired signals and noise. In this paper, a different perspective on frequency-domain beamformers in room acoustics is provided. Specifically, the observed noise signals are divided into coherent and incoherent components, while no assumptions are made regarding the number of coherent noise sources or the noise sound field. From this perspective, performance measures are defined and existing beamformers are deduced. In addition, a new and general tradeoff beamformer is proposed that enables a compromise between noise reduction and speech distortion on the one hand, and coherent versus incoherent noise reduction on the other. The presented performance evaluation shows how existing beamformers and the tradeoff beamformer perform in different scenarios.

  • The Deterministic Plus Stochastic Model of the Residual Signal and Its Applications

    Publication Year: 2012, Page(s): 968–981
    Cited by: Papers (14)

    The modeling of speech production often relies on a source-filter approach. Although methods for parameterizing the filter have nowadays reached a certain maturity, there is still much to be gained in several speech processing applications from finding an appropriate excitation model. This paper presents a Deterministic plus Stochastic Model (DSM) of the residual signal. The DSM consists of two contributions acting in two distinct spectral bands delimited by a maximum voiced frequency. Both components are extracted from an analysis performed on a speaker-dependent dataset of pitch-synchronous residual frames. The deterministic part models the low-frequency content and arises from an orthonormal decomposition of these frames. The stochastic component is a high-frequency noise modulated both in time and frequency. Some interesting phonetic and computational properties of the DSM are also highlighted. The applicability of the DSM in two fields of speech processing is then studied. First, it is shown that incorporating the DSM vocoder in HMM-based speech synthesis enhances the delivered quality; the proposed approach significantly outperforms the traditional pulse excitation and provides quality equivalent to STRAIGHT. In a second application, the potential of glottal signatures derived from the proposed DSM is investigated for speaker identification purposes. Interestingly, these signatures are shown to lead to better recognition rates than other glottal-based methods.
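
    A sketch of how the deterministic part could be obtained as described: an orthonormal basis of the pitch-synchronous residual frames via PCA (frame alignment and length-normalization steps are omitted):

        import numpy as np

        def deterministic_basis(residual_frames, n_eigen=1):
            """residual_frames: (frames, samples) pitch-synchronous residuals;
            returns the first orthonormal eigen-residual(s)."""
            X = residual_frames - residual_frames.mean(axis=0)
            return np.linalg.svd(X, full_matrices=False)[2][:n_eigen]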

  • Performance Analysis and Design of FxLMS Algorithm in Broadband ANC System With Online Secondary-Path Modeling

    Publication Year: 2012, Page(s): 982–993
    Cited by: Papers (6)

    The filtered-x LMS (FxLMS) algorithm has been widely used in active noise control (ANC) systems, where the secondary path is usually estimated online by injecting auxiliary noise. In such an ANC system, the ANC controller and the secondary-path estimator are coupled with each other, which makes it difficult to analyze the performance of the entire system; to the best of our knowledge, a comprehensive performance analysis of broadband ANC systems is not currently available. In this paper, the convergence behavior of the FxLMS algorithm in broadband ANC systems with online secondary-path modeling is studied. Difference equations that describe the mean and mean-square convergence behaviors of the adaptive algorithms are derived. Using these difference equations, the stability of the system is analyzed. Finally, the coupled equations at steady state are solved to obtain the steady-state excess mean square errors (EMSEs) for the ANC controller and the secondary-path estimator. Computer simulations are conducted to verify the agreement between the simulated and theoretically predicted results. Moreover, using the proposed theoretical analysis, a systematic and simple design procedure for ANC systems is proposed. The usefulness of the theoretical results and design procedure is demonstrated by means of a design example.
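
    For orientation, one FxLMS controller update is sketched below (sign conventions and buffer layout are assumptions; the online secondary-path estimator that the paper couples to this loop is not shown):

        import numpy as np

        def fxlms_update(w, x_buf, fx_buf, s_hat, e, mu):
            """w: controller taps; x_buf: recent reference samples, newest first;
            fx_buf: filtered-x regressor, same length as w; s_hat: secondary-path
            estimate; e: residual error sample; mu: step size."""
            fx_buf[1:] = fx_buf[:-1]
            fx_buf[0] = np.dot(s_hat, x_buf[:len(s_hat)])  # filter x through s_hat
            w -= mu * e * fx_buf                           # LMS step on controller
            return w, fx_buf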


Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research