
IEEE Transactions on Audio, Speech, and Language Processing

Issue 8 • Aug. 2013


Displaying Results 1 - 25 of 31
  • [Front cover]

    Publication Year: 2013 , Page(s): C1
    PDF (296 KB)
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2013 , Page(s): 1535 - 1536
    PDF (228 KB)
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2013 , Page(s): 1537 - 1538
    PDF (229 KB)
    Freely Available from IEEE
  • Study of the General Kalman Filter for Echo Cancellation

    Publication Year: 2013 , Page(s): 1539 - 1549
    Cited by:  Papers (5)
    PDF (3063 KB)

    The Kalman filter is a very interesting signal processing tool, which is widely used in many practical applications. In this paper, we study the Kalman filter in the context of echo cancellation. The contribution of this work is threefold. First, we derive a different form of the Kalman filter by considering, at each iteration, a block of time samples instead of one time sample, as is the case in the conventional approach. Second, we show how this general Kalman filter (GKF) is connected with some of the most popular adaptive filters for echo cancellation, i.e., the normalized least-mean-square (NLMS) algorithm, the affine projection algorithm (APA), and its proportionate version (PAPA). Third, a simplified Kalman filter is developed in order to reduce the computational load of the GKF; this algorithm behaves like a variable step-size adaptive filter. Simulation results indicate the good performance of the proposed algorithms, which can be attractive choices for echo cancellation.
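A minimal sketch of the NLMS algorithm that the abstract relates to the GKF may help fix ideas; the echo path h, the input signal, and the step size mu below are invented for illustration, not taken from the paper.

```python
import random

def nlms(x, d, L=4, mu=0.5, eps=1e-8):
    """Adapt an L-tap filter w so that its output tracks the desired signal d."""
    w = [0.0] * L
    errors = []
    for n in range(L - 1, len(x)):
        u = x[n - L + 1:n + 1][::-1]       # [x[n], x[n-1], ..., x[n-L+1]]
        y = sum(wi * ui for wi, ui in zip(w, u))
        e = d[n] - y                       # a-priori error
        norm = sum(ui * ui for ui in u) + eps
        w = [wi + mu * e * ui / norm for wi, ui in zip(w, u)]
        errors.append(e)
    return w, errors

# Identify a known 4-tap "echo path" (illustrative) from random input.
random.seed(0)
h = [0.5, -0.3, 0.2, 0.1]
x = [random.gauss(0, 1) for _ in range(2000)]
d = [sum(h[k] * x[n - k] for k in range(4)) if n >= 3 else 0.0
     for n in range(len(x))]
w, errors = nlms(x, d)                     # w converges towards h
```

The normalization by the input energy is what distinguishes NLMS from plain LMS and is what the paper's simplified Kalman filter generalizes into a variable step size.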

  • A Class of Algorithms for Time-Frequency Multiplier Estimation

    Publication Year: 2013 , Page(s): 1550 - 1559
    PDF (2055 KB)

    We propose here a new approach, together with a corresponding class of algorithms, for offline estimation of linear operators mapping input to output signals. The operators are modeled as multipliers, i.e., linear and diagonal operators in a frame or Bessel representation of signals (such as Gabor or wavelet representations), characterized by a transfer function. The estimation problem is formulated as a regularized inverse problem and solved using iterative algorithms based on gradient descent schemes. Various estimation problems, which differ in the choice of regularization function, are studied in the case of Gabor multipliers. The transfer function provides a meaningful interpretation of the differences between the two signals or signal classes under consideration, and examples are discussed. Furthermore, examples of signal transformations with such Gabor transfer functions are also given.
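The "regularized inverse problem solved by gradient descent" recipe can be illustrated on a toy diagonal-multiplier problem; the ridge penalty here merely stands in for the paper's regularizers, the frame analysis step is omitted, and all coefficients are invented.

```python
def estimate_multiplier(x, y, lam=0.1, step=0.1, iters=500):
    """Minimize sum_k (m_k * x_k - y_k)^2 + lam * m_k^2 by gradient descent."""
    m = [0.0] * len(x)
    for _ in range(iters):
        g = [2 * (mk * xk - yk) * xk + 2 * lam * mk     # objective gradient
             for mk, xk, yk in zip(m, x, y)]
        m = [mk - step * gk for mk, gk in zip(m, g)]
    return m

x = [1.0, 2.0, 0.5, 1.5]          # "source" coefficients (illustrative)
y = [2.0, 2.0, 1.0, 0.0]          # "target" coefficients (illustrative)
m = estimate_multiplier(x, y)
# coefficient-wise closed form for comparison: m_k = x_k * y_k / (x_k^2 + lam)
```

Because the multiplier is diagonal, the problem separates per coefficient; the paper's interest lies in regularizers that couple coefficients and favor smooth transfer functions.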

  • Joint Source-Filter Optimization for Accurate Vocal Tract Estimation Using Differential Evolution

    Publication Year: 2013 , Page(s): 1560 - 1572
    PDF (2258 KB)

    In this work, we present a joint source-filter optimization approach for separating voiced speech into vocal tract (VT) and voice source components. The presented method is pitch-synchronous and thereby exhibits a high robustness against vocal jitter, shimmer and other glottal variations while covering various voice qualities. The voice source is modeled using the Liljencrants-Fant (LF) model, which is integrated into a time-varying auto-regressive speech production model with exogenous input (ARX). The non-convex optimization problem of finding the optimal model parameters is addressed by a heuristic, evolutionary optimization method called differential evolution. The optimization method is first validated in a series of experiments with synthetic speech. Estimated glottal source and VT parameters are the criteria used for comparison with the iterative adaptive inverse filter (IAIF) method and the linear prediction (LP) method under varying conditions such as jitter, fundamental frequency (f0) as well as environmental and glottal noise. The results show that the proposed method largely reduces the bias and standard deviation of estimated VT coefficients and glottal source parameters. Furthermore, the performance of the source-filter separation is evaluated in experiments using speech generated with a physical model of speech production. The proposed method reliably estimates glottal flow waveforms and lower formant frequencies. Results obtained for higher formant frequencies indicate that research on more accurate voice source models and their interaction with the VT is necessary to improve the source-filter separation. The proposed optimization approach promises to be a useful tool for future research addressing this topic.
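Differential evolution itself is compact enough to sketch. The DE/rand/1/bin variant below minimizes a toy sphere function rather than the paper's joint LF/ARX objective; the population size, F, and CR are generic textbook settings, not the authors'.

```python
import random

def differential_evolution(f, bounds, pop_size=20, F=0.7, CR=0.9, gens=200, seed=1):
    """DE/rand/1/bin: mutate with scaled difference vectors, crossover, greedy select."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [f(ind) for ind in pop]
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            jrand = rng.randrange(dim)     # force at least one mutated component
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == jrand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip mutant to the bounds
                else:
                    v = pop[i][j]
                trial.append(v)
            fc = f(trial)
            if fc <= cost[i]:              # greedy selection
                pop[i], cost[i] = trial, fc
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]

def sphere(v):
    return sum(x * x for x in v)

best, best_cost = differential_evolution(sphere, [(-5, 5)] * 3)
```

The same loop applies to the paper's problem once f is replaced by the ARX synthesis error as a function of LF and filter parameters; DE needs no gradients, which is what makes it attractive for that non-convex objective.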

  • Broadband DOA Estimation Using Sensor Arrays on Complex-Shaped Rigid Bodies

    Publication Year: 2013 , Page(s): 1573 - 1585
    PDF (2450 KB)

    Sensor arrays mounted on complex-shaped rigid bodies are a common feature in many practical broadband direction of arrival (DOA) estimation applications. The scattering and reflections caused by these rigid bodies introduce complexity and diversity in the frequency domain of the channel transfer function, which presents several challenges to existing broadband DOA estimators. This paper presents a novel high resolution broadband DOA estimation technique based on signal subspace decomposition. We describe how broadband signals can be decomposed into narrow subband components, and combined such that the frequency domain diversity is retained. The DOA estimation performance is compared with existing techniques using a uniform circular array and a sensor array on a hypothetical rigid body. An improvement in closely spaced source resolution of up to 6 dB is observed for the sensor array on the hypothetical rigid body, in comparison to the uniform circular array. The results suggest that frequency domain diversity, introduced by complex-shaped rigid bodies, can provide higher resolution and clearer separation of closely spaced broadband sound sources.
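Not the paper's subspace method, but a bare-bones illustration of broadband DOA estimation: a two-sensor cross-correlation (TDOA) sketch in which the geometry, sampling rate, and signal are all invented.

```python
import math, random

def estimate_doa(x1, x2, d=0.5, fs=8000, c=343.0):
    """DOA (degrees) from the lag that maximizes the sensors' cross-correlation."""
    max_lag = int(d / c * fs) + 1          # largest physically possible lag
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(x1[n] * x2[n + lag]
                  for n in range(max_lag, len(x1) - max_lag))
        if val > best_val:
            best_lag, best_val = lag, val
    # delay = d * sin(theta) / c  ->  theta = asin(lag / fs * c / d)
    s = max(-1.0, min(1.0, best_lag / fs * c / d))
    return math.degrees(math.asin(s))

random.seed(2)
noise = [random.gauss(0, 1) for _ in range(4000)]   # broadband source
true_lag = 6                                        # integer delay for simplicity
x1 = noise
x2 = [0.0] * true_lag + noise[:-true_lag]           # sensor 2 hears it later
theta = estimate_doa(x1, x2)                        # about 31 degrees here
```

Time-domain correlation implicitly sums evidence over the whole band; the paper's contribution is a subspace combination that preserves, rather than averages away, the frequency-domain diversity a scattering body creates.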

  • Syntax-Based Translation With Bilingually Lexicalized Synchronous Tree Substitution Grammars

    Publication Year: 2013 , Page(s): 1586 - 1597
    PDF (1605 KB)

    Syntax-based models can significantly improve translation performance due to their grammatical modeling of one or both language sides. However, translation rules such as the non-lexical rule “VP→(x0x1,VP:x1PP:x0)” in string-to-tree models do not consider any lexicalized information on the source or target side. The rule is so generalized that any subtree rooted at VP can substitute for the nonterminal VP:x1. Because rules containing nonterminals are frequently used when generating the target-side tree structures, there is a risk that rules of this type will be severely misused in decoding due to a lack of lexicalization guidance. In this article, inspired by lexicalized PCFG, which is widely used in monolingual parsing, we propose to upgrade the STSG (synchronous tree substitution grammar)-based syntax translation model with bilingually lexicalized STSG. Using the string-to-tree translation model as a case study, we present generative and discriminative models to integrate lexicalized STSG into the translation model. Both small- and large-scale experiments on Chinese-to-English translation demonstrate that the proposed lexicalized STSG provides superior rule selection in decoding and substantially improves translation quality.

  • Reverberation and Noise Robust Feature Compensation Based on IMM

    Publication Year: 2013 , Page(s): 1598 - 1611
    Cited by:  Papers (3)
    PDF (2603 KB)

    In this paper, we propose a novel feature compensation approach based on the interacting multiple model (IMM) algorithm, specially designed for joint processing of background noise and acoustic reverberation. Our approach to coping with the time-varying environmental parameters is to establish a switching linear dynamic model for the additive and convolutive distortions, i.e., the background noise and acoustic reverberation, in the log-spectral domain. We construct multiple state space models of the speech corruption process in which the log spectra of clean speech and the log frequency response of the acoustic reverberation are jointly handled as the state of interest. The proposed approach shows significant improvements on the Aurora-5 automatic speech recognition (ASR) task, which was developed to investigate the influence of hands-free speech input in noisy room environments on ASR performance.

  • Joint Discriminative Decoding of Words and Semantic Tags for Spoken Language Understanding

    Publication Year: 2013 , Page(s): 1612 - 1621
    PDF (1791 KB)

    Most Spoken Language Understanding (SLU) systems today employ a cascade approach, where the best hypothesis from the Automatic Speech Recognizer (ASR) is fed into understanding modules such as slot sequence classifiers and intent detectors. The output of these modules is then fed into downstream components such as an interpreter and/or knowledge broker. These statistical models are usually trained individually to optimize the error rate of their respective outputs. In such approaches, errors from one module irreversibly propagate into other modules, causing a serious degradation in the overall performance of the SLU system. It is therefore desirable to jointly optimize all the statistical models together. As a first step towards this, we propose a joint decoding framework in which we predict the optimal word sequence as well as the slot sequence (semantic tag sequence) jointly, given the input acoustic stream. The improved recognition output is then used for an utterance classification task; specifically, we focus on intent detection. On an SLU task, we show a 1.5% absolute reduction (7.6% relative reduction) in word error rate (WER) and a 1.2% absolute improvement in F measure for slot prediction when compared to a very strong cascade baseline comprising a state-of-the-art large-vocabulary ASR system followed by a conditional random field (CRF) based slot sequence tagger. Similarly, for intent detection, we show a 1.2% absolute reduction (12% relative reduction) in classification error rate.

  • Sparse Classifier Fusion for Speaker Verification

    Publication Year: 2013 , Page(s): 1622 - 1631
    Cited by:  Papers (4)
    PDF (1879 KB)

    State-of-the-art speaker verification systems take advantage of a number of complementary base classifiers by fusing them to arrive at reliable verification decisions. In speaker verification, fusion is typically implemented as a weighted linear combination of the base classifier scores, where the combination weights are estimated using a logistic regression model. An alternative is classifier ensemble selection, which can be seen as sparse regularization applied to logistic regression. Even though score fusion has been extensively studied in speaker verification, classifier ensemble selection is much less studied. In this study, we extensively examine sparse classifier fusion on a collection of twelve I4U spectral subsystems on the NIST 2008 and 2010 speaker recognition evaluation (SRE) corpora.
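The "sparse regularization applied to logistic regression" idea can be sketched as L1-penalized score fusion trained by proximal gradient descent (ISTA); the three synthetic "subsystems" below (two informative, one pure noise) are invented, and the L1 penalty is what drives the noise subsystem's weight toward zero.

```python
import math, random

def fuse(scores, labels, lam=0.05, step=0.1, iters=2000):
    """L1-regularized logistic regression over base-classifier scores (ISTA)."""
    n, m = len(scores), len(scores[0])
    w = [0.0] * m
    for _ in range(iters):
        g = [0.0] * m
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
            for j in range(m):
                g[j] += (p - y) * x[j] / n          # logistic-loss gradient
        w = [wj - step * gj for wj, gj in zip(w, g)]
        # soft-threshold: the proximal operator of the L1 penalty
        w = [math.copysign(max(abs(wj) - step * lam, 0.0), wj) for wj in w]
    return w

random.seed(3)
labels = [random.randint(0, 1) for _ in range(400)]
scores = [[y * 2 - 1 + random.gauss(0, 0.8),    # informative subsystem
           y * 2 - 1 + random.gauss(0, 1.0),    # informative, noisier
           random.gauss(0, 1.0)]                # uninformative subsystem
          for y in labels]
w = fuse(scores, labels)
```

Subsystems whose weight lands exactly at zero are effectively deselected, which is the ensemble-selection behavior the abstract describes.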

  • A Free-Source Method (FrSM) for Calibrating a Large-Aperture Microphone Array

    Publication Year: 2013 , Page(s): 1632 - 1639
    Cited by:  Papers (2)
    PDF (2377 KB)

    Large-aperture microphone arrays can be used to capture and enhance speech from individual talkers in noisy, multi-talker, and reverberant environments. However, they must be calibrated, often more than once, to obtain accurate 3-dimensional coordinates for all microphones. Direct-measurement techniques, such as using a measuring tape or a laser-based tool, are cumbersome and time-consuming. Some previous methods that used acoustic signals for array calibration required bulky hardware and/or fixed, known source locations. Others, which allowed more flexible source placement, often have issues with real data, have reported results for 2D only, or work only for small arrays. This paper describes a complete and robust method for automatic calibration using acoustic signals which is simple, repeatable, accurate, and has been shown to work for a real system. The method requires only a single transducer (speaker) with a microphone attached above its center. The unit is moved freely around the focal volume of the microphone array, generating a single long recording from all the microphones. After that, the system is completely automatic. We describe the free-source method (FrSM), validate its effectiveness, and present accuracy results against measured ground truth. The performance of FrSM is compared to that of several other methods on a real 128-microphone array.

  • Bayesian Feature Enhancement for Reverberation and Noise Robust Speech Recognition

    Publication Year: 2013 , Page(s): 1640 - 1652
    Cited by:  Papers (1)
    PDF (2207 KB)

    In this contribution we extend a previously proposed Bayesian approach for the enhancement of reverberant logarithmic mel power spectral coefficients for robust automatic speech recognition to the additional compensation of background noise. A recently proposed observation model is employed, whose time-variant observation error statistics are obtained as a side product of the inference of the a posteriori probability density function of the clean speech feature vectors. Furthermore, a reduction of the computational effort and the memory requirements is achieved by using a recursive formulation of the observation model. The performance of the proposed algorithms is first studied experimentally on a connected-digits recognition task with artificially created noisy reverberant data. It is shown that the use of the time-variant observation error model leads to a significant error rate reduction at low signal-to-noise ratios compared to a time-invariant model. Further experiments were conducted on a 5000-word task recorded in a reverberant and noisy environment. A significant word error rate reduction was obtained, demonstrating the effectiveness of the approach on real-world data.

  • Analysis and Design of Multichannel Systems for Perceptual Sound Field Reconstruction

    Publication Year: 2013 , Page(s): 1653 - 1665
    Cited by:  Papers (1)
    PDF (2268 KB)

    This paper presents a systematic framework for the analysis and design of circular multichannel surround sound systems. Objective analysis based on the concept of active intensity fields shows that for stable rendition of monochromatic plane waves it is beneficial to render each such wave by no more than two channels. Based on that finding, we propose a methodology for the design of circular microphone arrays, in the same configuration as the corresponding loudspeaker system, which aims to capture inter-channel time and intensity differences that ensure accurate rendition of the auditory perspective. The methodology is applicable to regular and irregular microphone/speaker layouts, and a wide range of microphone array radii, including the special case of coincident arrays, which corresponds to intensity-based systems. Several design examples, involving first- and higher-order microphones, are presented. Results of formal listening tests suggest that the proposed design methodology achieves a performance comparable to prior art in the center of the loudspeaker array and a more graceful degradation away from the center.

  • Musical Instrument Sound Morphing Guided by Perceptually Motivated Features

    Publication Year: 2013 , Page(s): 1666 - 1675
    PDF (1190 KB)

    Sound morphing is a transformation that gradually blurs the distinction between the source and target sounds. For musical instrument sounds, the morph must operate across timbre dimensions to create the auditory illusion of hybrid musical instruments. The ultimate goal of sound morphing is to perform perceptually linear transitions, which requires an appropriate model to represent the sounds being morphed and an interpolation function to obtain intermediate sounds. Typically, morphing techniques directly interpolate the parameters of the sound model without considering the perceptual impact or evaluating the results. Perceptual evaluations are cumbersome and not always conclusive. In this work, we seek parameters of a sound model that favor linear variation of perceptually motivated temporal and spectral features, used to guide the morph towards more perceptually linear results. The requirement of linear variation of feature values gives rise to objective evaluation criteria for sound morphing. We investigate several spectral envelope morphing techniques to determine which spectral representation renders the most linear transformation in the spectral shape feature domain, and find that interpolation of line spectral frequencies gives the most linear spectral envelope morphs. Analogously, we study temporal envelope morphing techniques and conclude that interpolation of cepstral coefficients results in the most linear temporal envelope morph.
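The cepstral-interpolation step can be illustrated on toy envelopes. Because the real cepstrum is a linear transform of the log-magnitude envelope, interpolating cepstra interpolates log magnitudes, so the alpha = 0.5 morph is the bin-wise geometric mean of the two envelopes; the symmetric envelopes below are invented (magnitude spectra of real signals are symmetric).

```python
import cmath, math

def cepstrum(log_mag_env):
    """Real cepstrum: inverse DFT of a (symmetric) log-magnitude envelope."""
    N = len(log_mag_env)
    return [sum(log_mag_env[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N for n in range(N)]

def log_mag(cep):
    """Forward DFT back to the log-magnitude domain."""
    N = len(cep)
    return [sum(cep[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N)).real for k in range(N)]

def morph(env_a, env_b, alpha):
    """Morph two spectral envelopes by interpolating their cepstra."""
    ca = cepstrum([math.log(v) for v in env_a])
    cb = cepstrum([math.log(v) for v in env_b])
    c = [(1 - alpha) * a + alpha * b for a, b in zip(ca, cb)]
    return [math.exp(v) for v in log_mag(c)]

# symmetric toy envelopes (invented)
env_a = [4.0, 2.0, 1.0, 0.5, 0.25, 0.5, 1.0, 2.0]
env_b = [0.25, 0.5, 1.0, 2.0, 4.0, 2.0, 1.0, 0.5]
mid = morph(env_a, env_b, 0.5)
# at alpha = 0.5, each bin equals the geometric mean sqrt(env_a[k] * env_b[k])
```

Interpolating raw magnitudes instead would take arithmetic means, which tends to produce "double formant" intermediates; the log-domain (cepstral) path avoids that.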

  • A General Compression Approach to Multi-Channel Three-Dimensional Audio

    Publication Year: 2013 , Page(s): 1676 - 1688
    PDF (2127 KB)

    This paper presents a technique for low bit rate compression of three-dimensional (3D) audio produced by multiple loudspeaker channels. The approach is based on the time-frequency analysis of the localization of spatial sound sources within the 3D space as rendered by a multi-channel audio signal (in this case 16 channels). This analysis results in the derivation of a stereo downmix signal representing the original 16 channels. Alternatively, a mono-downmix signal with side information representing the location of sound sources within the 3D spatial scene can also be derived. The resulting downmix signals are then compressed with a traditional audio coder, resulting in a representation of the 3D soundfield at bit rates comparable with existing stereo audio coders while maintaining the perceptual quality produced from separate encoding of each channel.

  • Robust Log-Energy Estimation and its Dynamic Change Enhancement for In-car Speech Recognition

    Publication Year: 2013 , Page(s): 1689 - 1698
    PDF (1594 KB)

    The log-energy parameter, typically derived from a full-band spectrum, is a critical feature commonly used in automatic speech recognition (ASR) systems. However, log-energy is difficult to estimate reliably in the presence of background noise. In this paper, we theoretically show that background noise affects the trajectories of not only the “conventional” log-energy, but also its delta parameters. This results in a poor estimation of the actual log-energy and its delta parameters, which no longer describe the speech signal. We thus propose a new method to estimate log-energy from a sub-band spectrum, followed by dynamic change enhancement and mean smoothing. We demonstrate the effectiveness of the proposed log-energy estimation and its post-processing steps through speech recognition experiments conducted on the in-car CENSREC-2 database. The proposed log-energy (together with its corresponding delta parameters) yields an average improvement of 32.8% compared with the baseline front-ends. Moreover, it is also shown that further improvement can be achieved by incorporating the new Mel-Frequency Cepstral Coefficients (MFCCs) obtained by non-linear spectral contrast stretching.
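A toy contrast between full-band and sub-band log-energy for a single frame; the band split and the flat "noise" spectrum below are invented for illustration.

```python
import math

def log_energy(frame_power):
    """Full-band log-energy of one frame's power spectrum."""
    return math.log(sum(frame_power) + 1e-10)

def subband_log_energy(frame_power, lo, hi):
    """Log-energy restricted to spectral bins [lo, hi)."""
    return math.log(sum(frame_power[lo:hi]) + 1e-10)

# toy power spectrum: speech energy in the low bins, broadband noise everywhere
speech = [10.0, 8.0, 5.0, 2.0] + [0.1] * 12
noise = [1.0] * 16
noisy = [s + n for s, n in zip(speech, noise)]

e_full_clean, e_full_noisy = log_energy(speech), log_energy(noisy)
e_sub_clean = subband_log_energy(speech, 0, 4)
e_sub_noisy = subband_log_energy(noisy, 0, 4)
# the sub-band estimate shifts less under noise than the full-band one
```

Restricting the sum to bins where speech dominates is what makes the estimate more robust; the paper then further processes this trajectory with dynamic change enhancement and mean smoothing.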

  • Coding-Based Informed Source Separation: Nonnegative Tensor Factorization Approach

    Publication Year: 2013 , Page(s): 1699 - 1712
    PDF (2862 KB)

    Informed source separation (ISS) aims at reliably recovering sources from a mixture. To this purpose, it relies on the assumption that the original sources are available during an encoding stage. Given both sources and mixture, side information may be computed and transmitted along with the mixture, after which the original sources are no longer available. During a decoding stage, both mixture and side information are processed to recover the sources. ISS is motivated by a number of specific applications including active listening and remixing of music, karaoke, audio gaming, etc. Most ISS techniques proposed so far rely on a source separation strategy and cannot achieve better results than oracle estimators. In this study, we introduce coding-based ISS (CISS) and draw the connection between ISS and source coding. CISS amounts to encoding the sources using not only a model, as in source coding, but also the observation of the mixture. This strategy has several advantages over conventional ISS methods. First, it can reach any quality, provided sufficient bandwidth is available, as in source coding. Second, it makes use of the mixture in order to reduce the bitrate required to transmit the sources, as in classical ISS. Furthermore, we introduce nonnegative tensor factorization as a very efficient model for CISS and report rate-distortion results that strongly outperform the state of the art.
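The tensor factorization itself is involved, but its two-way cousin, NMF with multiplicative updates minimizing the Frobenius reconstruction error, conveys the flavor of the nonnegative models used as source models here; the small matrix V is invented.

```python
import random

def nmf(V, r, iters=500, seed=4):
    """Factor V (nonnegative) as W @ H with Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def T(A):
        return [list(row) for row in zip(*A)]

    for _ in range(iters):
        WH = matmul(W, H)
        WtV, WtWH = matmul(T(W), V), matmul(T(W), WH)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + 1e-12)
              for j in range(m)] for i in range(r)]      # update H, W fixed
        WH = matmul(W, H)
        VHt, WHHt = matmul(V, T(H)), matmul(WH, T(H))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + 1e-12)
              for j in range(r)] for i in range(n)]      # update W, H fixed
    return W, H

V = [[1, 2, 3], [2, 4, 6], [1, 1, 1]]   # rank-2 nonnegative data (invented)
W, H = nmf(V, 2)
```

The multiplicative form keeps every entry nonnegative by construction, which is why NMF/NTF models are natural priors for spectrogram-like data.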

  • Applying Multi- and Cross-Lingual Stochastic Phone Space Transformations to Non-Native Speech Recognition

    Publication Year: 2013 , Page(s): 1713 - 1726
    PDF (2032 KB)

    In the context of hybrid HMM/MLP Automatic Speech Recognition (ASR), this paper describes an investigation into a new type of stochastic phone space transformation, which maps “source” phone (or phone HMM state) posterior probabilities (as obtained at the output of a Multilayer Perceptron, MLP) into “destination” phone (phone HMM state) posterior probabilities. The resulting stochastic matrix transformation can be used within the same language to automatically adapt to different phone formats (e.g., IPA) or across languages. Additionally, as shown here, it can also be applied successfully to non-native speech recognition. In the same spirit as MLLR or MLP adaptation, the approach proposed here directly maps posterior distributions, and is trained by optimizing a Kullback-Leibler based cost function on a small amount of adaptation data, along with a modified version of an iterative EM algorithm. On a non-native English database (HIWIRE), comparing against multiple setups (monophone and triphone mapping, MLLR adaptation), we show that the resulting posterior mapping yields state-of-the-art results using very limited amounts of adaptation data in mono-, cross-, and multi-lingual setups. We also show that “universal” phone posteriors, trained on a large amount of multilingual data, can be transformed to English phone posteriors, resulting in an ASR system that significantly outperforms a system trained on English data only. Finally, we demonstrate that the proposed approach outperforms alternative data-driven as well as knowledge-based mapping techniques.
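The mapping itself is just a row-stochastic matrix applied to a posterior vector; the matrix T below is invented, whereas the paper learns it from adaptation data with a KL-based cost and an EM-like procedure.

```python
def map_posteriors(p_src, T):
    """p_dst[j] = sum_i p_src[i] * T[i][j]; output stays a valid distribution."""
    n_dst = len(T[0])
    return [sum(p_src[i] * T[i][j] for i in range(len(p_src)))
            for j in range(n_dst)]

# 3 "source" phones mapped to 2 "destination" phones; rows of T sum to 1
T = [[0.9, 0.1],
     [0.2, 0.8],
     [0.5, 0.5]]
p_src = [0.6, 0.3, 0.1]
p_dst = map_posteriors(p_src, T)   # a valid 2-phone posterior
```

Because each row of T is itself a probability distribution, the transform can only redistribute probability mass, never create or destroy it, so the output can be fed directly back into the hybrid decoder.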

  • Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals

    Publication Year: 2013 , Page(s): 1727 - 1741
    PDF (2188 KB)

    A novel method for the estimation of the distance of a sound source from binaural speech signals is proposed. The method relies on several statistical features extracted from such signals and their binaural cues. Firstly, the standard deviation of the difference of the magnitude spectra of the left and right binaural signals is used as a feature for this method. In addition, an extended set of additional statistical features that can improve distance detection is extracted from an auditory front-end which models the peripheral processing of the human auditory system. The method incorporates the above features into two classification frameworks based on Gaussian mixture models and Support Vector Machines and the relative merits of those frameworks are evaluated. The proposed method achieves distance detection when tested in various acoustical environments and performs well in unknown environments. Its performance is also compared to an existing binaural distance detection method.
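The paper's first feature is easy to sketch: the standard deviation, across frequency bins, of the left/right magnitude-spectrum difference; the "near" and "far" toy spectra below are invented to show the expected ordering (a nearby source produces strong, frequency-dependent level differences, a distant diffuse one does not).

```python
import math

def spectral_diff_std(mag_left, mag_right):
    """Std, over frequency bins, of the left-right magnitude difference."""
    d = [l - r for l, r in zip(mag_left, mag_right)]
    mean = sum(d) / len(d)
    return math.sqrt(sum((v - mean) ** 2 for v in d) / len(d))

# invented magnitude spectra for a near and a far (diffuse) source
near_left = [10.0, 8.0, 9.0, 4.0, 7.0, 3.0]
near_right = [4.0, 6.0, 2.0, 3.0, 1.0, 2.5]
far_left = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8]
far_right = [5.1, 5.0, 5.0, 4.9, 5.1, 5.0]

f_near = spectral_diff_std(near_left, near_right)
f_far = spectral_diff_std(far_left, far_right)   # smaller than f_near
```

In the paper this scalar is one of several features fed to GMM and SVM classifiers, rather than thresholded directly.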

  • Model-Based Multiple Pitch Tracking Using Factorial HMMs: Model Adaptation and Inference

    Publication Year: 2013 , Page(s): 1742 - 1754
    Cited by:  Papers (1)
    PDF (3528 KB)

    Robustness against noise and interfering audio signals is one of the challenges in speech recognition and audio analysis technology. One avenue to approach this challenge is single-channel multiple-source modeling. Factorial hidden Markov models (FHMMs) are capable of modeling acoustic scenes with multiple sources interacting over time. While these models reach good performance on specific tasks, serious limitations still restrict their applicability in many domains. In this paper, we generalize these models and enhance their applicability. In particular, we develop an EM-like iterative adaptation framework which is capable of adapting the model parameters to the specific situation (e.g. actual speakers, gain, acoustic channel, etc.) using only speech mixture data, whereas currently source-specific data is required to learn the model. Inference in FHMMs is an essential ingredient for adaptation. We develop efficient approaches based on observation likelihood pruning. Both adaptation and efficient inference are empirically evaluated on the task of multipitch tracking using the GRID corpus.

  • Effective Model Representation by Information Bottleneck Principle

    Publication Year: 2013 , Page(s): 1755 - 1759
    PDF (895 KB)

    The common approaches to feature extraction in speech processing are generative and parametric, although they are highly sensitive to violations of their model assumptions. Here, we advocate the non-parametric Information Bottleneck (IB). IB is an information-theoretic approach that extends minimal sufficient statistics. However, unlike minimal sufficient statistics, which do not allow any relevant data loss, the IB method enables a principled tradeoff between compactness and the amount of target-related information. IB's ability to improve a broad range of recognition tasks is illustrated for model dimension reduction in speaker recognition and model clustering for age-group verification.

  • The Spectral Nature of Maximum Likelihood Noise Compensated Linear Prediction

    Publication Year: 2013 , Page(s): 1760 - 1765
    Cited by:  Papers (1)
    PDF (1603 KB)

    The analysis of the effects of noise in autoregressive (AR) analysis (or linear prediction) and their compensation (NCAR) has commonly been carried out in the time domain under the least squares (LS) criterion. This paper studies the adequacy of such an approach by means of a comparative analysis with selected frequency-based NCAR methods. In particular, the maximization of the spectral likelihood (ML) results in a proper optimization problem that is easy to solve and brings useful insights into the rationale of the NCAR problem. On the contrary, popular time-based NCAR methods are shown in the paper to be designed, in the ML context, around ill-conditioned criteria, requiring constraints to guarantee stable solutions. A statistical analysis on a realistic scenario as well as an experiment on speech enhancement complement this study.
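A sketch of the classical time-domain compensation the paper examines: subtract the (assumed known) noise variance from the zero-lag autocorrelation, then run Levinson-Durbin as usual; the AR(1) toy process and noise level below are invented.

```python
import math, random

def autocorr(x, order):
    """Biased sample autocorrelations R[0..order]."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) / len(x)
            for k in range(order + 1)]

def levinson(R):
    """Levinson-Durbin: AR coefficients a_k with x[n] ~ sum_k a_k x[n-k]."""
    order = len(R) - 1
    a, e = [], R[0]
    for i in range(order):
        acc = R[i + 1] - sum(a[j] * R[i - j] for j in range(i))
        k = acc / e
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        e *= (1 - k * k)
    return a

random.seed(5)
true_a = 0.8                                  # AR(1) pole (illustrative)
x = [0.0]
for _ in range(20000):
    x.append(true_a * x[-1] + random.gauss(0, 1))
noise_var = 0.5                               # assumed known, as in classic NCAR
y = [v + random.gauss(0, math.sqrt(noise_var)) for v in x]

R = autocorr(y, 1)
a_naive = levinson(R)                         # biased towards zero by the noise
a_comp = levinson([R[0] - noise_var] + R[1:]) # lag-0 noise compensation
```

White noise inflates only R[0] in expectation, so uncompensated LP underestimates the pole; the paper's point is that such time-domain fixes, viewed through the spectral likelihood, are built on ill-conditioned criteria.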

  • Correction to ‘The ICSI RT-09 Speaker Diarization System’ [Feb 12 371-381]

    Publication Year: 2013 , Page(s): 1766
    PDF (67 KB)

    In the above-named article appearing in the February 2012 issue [ibid., vol. 20, no. 2, pp. 371-381, Feb. 2012], the author name "Xavier Anguera Miro" was listed incorrectly. It should have been "Xavier Anguera." The author prefers that researchers use this name when referencing this paper.

  • IEEE Transactions on Audio, Speech, and Language Processing Edics

    Publication Year: 2013 , Page(s): 1767 - 1768
    PDF (108 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research