IEEE Transactions on Audio, Speech, and Language Processing

Issue 3 • March 2007

  • Table of contents

    Page(s): C1 - C4
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
    Freely Available from IEEE
  • Multiple-Description Predictive-Vector Quantization With Applications to Low Bit-Rate Speech Coding Over Networks

    Page(s): 749 - 755

    An algorithm for designing linear prediction-based two-channel multiple-description predictive-vector quantizers (MD-PVQs) for packet-loss channels is presented. This algorithm iteratively improves the encoder partition, the set of multiple-description codebooks, and the linear predictor for a given channel loss probability, based on a training set of source data. The effectiveness of the resulting designs is demonstrated using a waveform coding example involving a Markov source, as well as vector quantization of speech line spectral pairs.

  • High-Rate Optimized Recursive Vector Quantization Structures Using Hidden Markov Models

    Page(s): 756 - 769

    This paper examines the design of recursive vector quantization systems built around Gaussian mixture vector quantizers. The problem of designing such systems for minimum high-rate distortion, under input-weighted squared error, is discussed. It is shown that, in high dimensions, the design problem becomes equivalent to a weighted maximum-likelihood problem. A variety of recursive coding schemes based on hidden Markov models are presented. The proposed systems are applied to the problem of wideband speech line spectral frequency (LSF) quantization under the log spectral distortion (LSD) measure. By combining recursive quantization and random coding techniques, the systems attain transparent quality at rates as low as 36 bits per frame.

  • A High-Rate Optimal Transform Coder With Gaussian Mixture Companders

    Page(s): 770 - 783

    This paper examines the problem of designing fixed-rate transform coders for sources whose distributions are unknown and presumably non-Gaussian, under input-weighted squared error distortion measures. As a component of this system, a flexible scalar compander based on Gaussian mixtures is proposed. The high-rate analysis of transform coders is reviewed and extended to the case of input-weighted squared error. An algorithm is developed to set the parameters of the system using a data-driven technique that automatically balances the source statistics, distortion measure, and structure of the transform coder to minimize the high-rate distortion. The implementation of Gaussian mixture companders is explored, resulting in a flexible, low-complexity scalar quantizer. Additionally, modifications to the system for operation at moderate rates, using unstructured scalar quantizers, are presented. The operation of the system is illustrated for wideband speech line spectral frequency (LSF) quantization with log spectral distortion, and shown to provide good performance with very low complexity.

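    A minimal sketch of the companding idea only (not the paper's full design algorithm): map a sample through the CDF of a Gaussian mixture, quantize uniformly on (0, 1), and invert numerically on expansion. The mixture parameters below are made up for illustration; the paper fits them to data and adds input weighting.

        import numpy as np
        from scipy.optimize import brentq
        from scipy.stats import norm

        # Gaussian-mixture parameters (illustrative; the paper estimates them from data).
        w  = np.array([0.6, 0.4])      # mixture weights
        mu = np.array([-1.0, 2.0])     # component means
        sd = np.array([0.5, 1.0])      # component standard deviations

        def gmm_cdf(x):
            """Compressor: CDF of the Gaussian mixture, mapping R onto (0, 1)."""
            return float(np.sum(w * norm.cdf((x - mu) / sd)))

        def compand_quantize(x, levels=256):
            """Compress with the mixture CDF, then quantize uniformly on (0, 1)."""
            return int(np.clip(np.floor(gmm_cdf(x) * levels), 0, levels - 1))

        def expand(index, levels=256):
            """Expander: invert the CDF at the cell midpoint by root finding."""
            u = (index + 0.5) / levels
            return brentq(lambda x: gmm_cdf(x) - u, -50.0, 50.0)

        x = 0.7
        idx = compand_quantize(x)
        print(x, "->", idx, "->", round(expand(idx), 4))
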
  • Kernel Eigenspace-Based MLLR Adaptation

    Page(s): 784 - 795

    In this paper, we propose an application of kernel methods for fast speaker adaptation based on kernelizing the eigenspace-based maximum-likelihood linear regression adaptation method. We call our new method "kernel eigenspace-based maximum-likelihood linear regression adaptation" (KEMLLR). In KEMLLR, speaker-dependent (SD) models are estimated from a common speaker-independent (SI) model using MLLR adaptation, and the MLLR transformation matrices are mapped to a kernel-induced high-dimensional feature space, wherein kernel principal component analysis is used to derive a set of eigenmatrices. In addition, a composite kernel is used to preserve row information in the transformation matrices. A new speaker's MLLR transformation matrix is then represented as a linear combination of the leading kernel eigenmatrices, which, though it exists only in the feature space, still allows the speaker's mean vectors to be found explicitly. As a result, at the end of KEMLLR adaptation, a regular hidden Markov model (HMM) is obtained for the new speaker, and subsequent speech recognition is as fast as normal HMM decoding. KEMLLR adaptation was tested and compared with other adaptation methods on the Resource Management and Wall Street Journal tasks using 5 or 10 s of adaptation speech. In both cases, KEMLLR adaptation gives the greatest improvement over the SI model, with an 11%-20% reduction in word error rate.

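    A hedged sketch of the kernel step only, using scikit-learn's KernelPCA on vectorized transformation matrices. The data, dimensions, and plain RBF kernel are stand-ins: the paper uses a composite kernel that preserves row information and recovers the speaker's mean vectors explicitly, which this sketch does not.

        import numpy as np
        from sklearn.decomposition import KernelPCA

        # Stand-in data: one vectorized MLLR transform per training speaker.
        rng = np.random.default_rng(0)
        n_speakers, dim = 50, 39 * 40          # illustrative matrix dimensions
        W = rng.standard_normal((n_speakers, dim))

        # Kernel PCA in a kernel-induced feature space gives the eigenmatrices.
        kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1.0 / dim)
        kpca.fit(W)

        # A new speaker's transform is represented by its leading kernel components.
        w_new = rng.standard_normal((1, dim))
        print(kpca.transform(w_new).shape)     # (1, 10) combination coefficients
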
  • Log-Rayleigh Distribution: A Simple and Efficient Statistical Representation of Log-Spectral Coefficients

    Page(s): 796 - 802

    In this paper, we study the distribution of the log-modulus of a Gaussian complex random variable. In the circular case, it is a Log-Rayleigh (LR) variable, whose probability density function (pdf) depends on only one parameter. In the noncircular case, the pdf is more complicated, although we show that it can be adequately modeled by an LR pdf, for which the optimal fitting parameter is derived. These results can be used in any application involving the log-modulus of discrete Fourier transform coefficients, e.g., for speech/audio signals, and suggest that a mixture of LR pdf kernels is preferable to more classical models such as mixtures of Gaussian kernels, which are more costly and less efficient.

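    The circular case is easy to verify numerically: if Z is circular complex Gaussian, |Z| is Rayleigh, and a change of variables gives the LR pdf of ln|Z|. A quick check with an assumed parameter value:

        import numpy as np

        sigma2 = 2.0              # per-component variance of the circular Gaussian
        rng = np.random.default_rng(1)
        z = rng.normal(0, np.sqrt(sigma2), 200_000) \
            + 1j * rng.normal(0, np.sqrt(sigma2), 200_000)

        def log_rayleigh_pdf(l, s2):
            """pdf of L = ln|Z|, by change of variables from the Rayleigh pdf of |Z|."""
            e2l = np.exp(2.0 * l)
            return (e2l / s2) * np.exp(-e2l / (2.0 * s2))

        hist, edges = np.histogram(np.log(np.abs(z)), bins=100, density=True)
        centers = 0.5 * (edges[:-1] + edges[1:])
        err = np.max(np.abs(hist - log_rayleigh_pdf(centers, sigma2)))
        print("max pdf error:", round(float(err), 3))
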
  • Using Broad Phonetic Group Experts for Improved Speech Recognition

    Page(s): 803 - 812

    In phoneme recognition experiments, it was found that approximately 75% of misclassified frames were assigned labels within the same broad phonetic group (BPG). While the phoneme can be described as the smallest distinguishable unit of speech, phonemes within a BPG share very similar characteristics and are easily confused. However, different BPGs, such as vowels and stops, possess very different spectral and temporal characteristics. In order to accommodate the full range of phonemes, acoustic models of speech recognition systems calculate input features from all frequencies over a large temporal context window. A new phoneme classifier is proposed consisting of a modular arrangement of experts, with one expert assigned to each BPG and focused on discriminating between phonemes within that BPG. Because each BPG has a different temporal and spectral structure, novel feature sets are extracted using mutual information to select a relevant time-frequency (TF) feature set for each expert. To construct a phone recognition system, the output of each expert is combined with a baseline classifier under the guidance of a separate BPG detector. In phoneme recognition experiments on the TIMIT continuous speech corpus, the proposed architecture afforded significant error rate reductions of up to 5% relative.

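    The gating of experts can be sketched as follows: a BPG detector supplies P(g | x), each expert supplies phoneme posteriors within its group, and the products assemble into a full phoneme posterior. The group layout is hypothetical, and the paper's combination with a baseline classifier is omitted.

        import numpy as np

        bpg_of_phone = np.array([0, 0, 0, 1, 1, 2, 2, 2])   # phone -> BPG (toy layout)

        def combine(bpg_post, expert_post):
            """P(phone | x) = P(g | x) * P(phone | x, g) for each phone's group."""
            p = np.zeros(len(bpg_of_phone))
            for g, post in enumerate(expert_post):
                p[bpg_of_phone == g] = bpg_post[g] * post
            return p

        bpg_post = np.array([0.7, 0.2, 0.1])                # BPG detector output
        experts = [np.array([0.5, 0.3, 0.2]),               # one expert per BPG
                   np.array([0.6, 0.4]),
                   np.array([0.1, 0.3, 0.6])]
        print(combine(bpg_post, experts))                   # sums to 1 by construction
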
  • Estimation of the Instantaneous Pitch of Speech

    Page(s): 813 - 822

    An accurate estimate of the pitch is essential for many speech processing applications, such as speech synthesis, speech coding, and speech enhancement. A widely used assumption in most common pitch estimation methods is that the pitch is constant over a segment of short duration. This assumption rarely holds in practice and leads to inaccurate pitch estimates. In this paper, we present a method for continuous pitch estimation that is able to track fast changes. In the presented framework, the pitch is modeled by a B-spline expansion and optimized in a multistage procedure for increased robustness. The performance of the continuous optimization procedure is compared to state-of-the-art pitch estimation methods, and is evaluated both on artificial speech-like signals with known pitch and on real speech signals. The experiments show that our method estimates the pitch more accurately than state-of-the-art methods.

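    The B-spline expansion itself is straightforward to reproduce with SciPy: fit a cubic spline to raw per-frame pitch estimates by least squares. The synthetic contour and knot count are assumptions; the paper adds a multistage optimization for robustness.

        import numpy as np
        from scipy.interpolate import make_lsq_spline

        rng = np.random.default_rng(2)
        t = np.linspace(0.0, 0.5, 250)                    # frame times (s)
        f0_true = 120 + 30 * np.sin(2 * np.pi * 4 * t)    # fast-varying pitch (Hz)
        f0_meas = f0_true + rng.normal(0, 3.0, t.size)    # noisy raw estimates

        k = 3                                             # cubic B-splines
        interior = np.linspace(t[0], t[-1], 12)[1:-1]
        knots = np.r_[(t[0],) * (k + 1), interior, (t[-1],) * (k + 1)]
        spline = make_lsq_spline(t, f0_meas, knots, k=k)  # least-squares fit

        rmse = np.sqrt(np.mean((spline(t) - f0_true) ** 2))
        print("RMS error vs. true contour:", round(float(rmse), 2), "Hz")
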
  • Multicomponent AM–FM Representations: An Asymptotically Exact Approach

    Page(s): 823 - 837

    This paper presents, on the basis of a rigorous mathematical formulation, a multicomponent sinusoidal model that allows an asymptotically exact reconstruction of nonstationary speech signals, regardless of their duration and without any limitation in the modeling of voiced, unvoiced, and transitional segments. The proposed approach applies the Hilbert transform to obtain an amplitude signal, from which an AM component is extracted by filtering, so that the residue can then be iteratively processed in the same way. This technique permits a multicomponent AM-FM model to be derived in which the number of components (iterations) may be arbitrarily chosen. Additionally, the instantaneous frequencies of these components can be calculated to a given accuracy by segmentation of the phase signals. The validity of the proposed approach is demonstrated by applications to both synthetic signals and natural speech. Several comparisons show that this approach almost always outperforms current best practices, and does not need the complex filter optimizations required by other techniques.

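    One simplified reading of the iteration: take the amplitude of the analytic signal (via the Hilbert transform), low-pass filter it to obtain an AM component, and process the residue in the same way. The cutoff, filter order, and component count below are assumptions.

        import numpy as np
        from scipy.signal import hilbert, butter, filtfilt

        def am_components(x, fs, n_components=3, cutoff_hz=40.0):
            """Iterative AM extraction from the analytic-signal amplitude."""
            b, a = butter(4, cutoff_hz / (fs / 2))
            components, residue = [], x
            for _ in range(n_components):
                amp = np.abs(hilbert(residue))    # amplitude signal
                am = filtfilt(b, a, amp)          # extract AM component by filtering
                components.append(am)
                residue = amp - am                # recurse on the residue
            return components, residue

        fs = 8000
        t = np.arange(fs) / fs
        x = (1 + 0.5 * np.sin(2 * np.pi * 5 * t)) * np.sin(2 * np.pi * 200 * t)
        comps, res = am_components(x, fs)
        print(len(comps), "AM components, residual RMS:", round(float(np.std(res)), 4))
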
  • An Effective Algorithm for Automatic Detection and Exact Demarcation of Breath Sounds in Speech and Song Signals

    Page(s): 838 - 850

    Automatic detection of predefined events in speech and audio signals is a challenging and promising subject in signal processing. One important application of such detection is removal or suppression of unwanted sounds in audio recordings, for instance in the professional music industry, where the demand for quality is very high. Breath sounds, which are present in most song recordings and often degrade the aesthetic quality of the voice, are an example of such unwanted sounds. Another example is bad pronunciation of certain phonemes. In this paper, we present an automatic algorithm for accurate detection of breaths in speech or song signals. The algorithm is based on a template-matching approach and consists of three phases. In the first phase, a template is constructed from mel-frequency cepstral coefficient (MFCC) matrices of several breath examples and their singular value decompositions, to capture the characteristics of a typical breath event. Next, in the initial processing phase, each short-time frame is compared to the breath template and marked as breathy or nonbreathy according to predefined thresholds. Finally, an edge detection algorithm, based on various time-domain and frequency-domain parameters, is applied to demarcate the exact boundaries of each breath event and to eliminate possible false detections. Evaluation of the algorithm on a database of speech and songs containing several hundred breath sounds yielded a correct identification rate of 98% with a specificity of 96%.

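    The template-matching phase can be sketched with librosa and NumPy: build a breath subspace from the SVD of stacked breath MFCC matrices, then score each frame by how much of its energy the subspace captures. The clip lengths, subspace rank, and random stand-in audio are assumptions; the paper's thresholds and edge-detection phase are omitted.

        import numpy as np
        import librosa

        def breath_template(examples, sr, n_mfcc=13, rank=3):
            """Leading left singular vectors of stacked breath MFCC matrices."""
            feats = [librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc) for x in examples]
            U, _, _ = np.linalg.svd(np.hstack(feats), full_matrices=False)
            return U[:, :rank]

        def breath_score(frame, basis):
            """Fraction of a frame's MFCC energy inside the template subspace."""
            proj = basis @ (basis.T @ frame)
            return float(proj @ proj) / float(frame @ frame)

        sr = 16000
        rng = np.random.default_rng(3)
        examples = [rng.standard_normal(sr // 2) for _ in range(4)]   # stand-in clips
        basis = breath_template(examples, sr)
        mfcc = librosa.feature.mfcc(y=rng.standard_normal(sr), sr=sr, n_mfcc=13)
        scores = [breath_score(mfcc[:, i], basis) for i in range(mfcc.shape[1])]
        print("max frame score:", round(max(scores), 3))
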
  • Perceptual Long-Term Variable-Rate Sinusoidal Modeling of Speech

    Page(s): 851 - 861

    In this paper, the problem of modeling the time-trajectories of the sinusoidal components of voiced speech signals is addressed. A new global approach is presented: a single so-called long-term (LT) model, based on discrete cosine functions, is used to model the overall trajectories of the amplitude and phase parameters for each entire voiced section of speech, in contrast to the usual (short-term) models defined on a frame-by-frame basis. The complete analysis-modeling-synthesis process is presented, including an iterative algorithm for optimal fitting between the LT model and the measurements. A central issue of this paper is the use of perceptual criteria in the LT model fitting process (for both amplitude and phase modeling); perceptual criteria usually defined for the short-term and/or stationary cases are adapted to long-term processing. Experiments on the first ten harmonics of voiced signals show that the proposed approach provides an efficient variable-rate representation of voiced speech signals. Promising results are given in terms of modeling accuracy, synthesis quality, and data compression. The potential of the presented approach for speech coding and speech watermarking is discussed.

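    The long-term idea in miniature: represent an entire voiced section's amplitude trajectory with a few discrete cosine coefficients rather than frame-by-frame values. Plain least-squares truncation stands in for the paper's perceptually weighted iterative fitting, and the trajectory is synthetic.

        import numpy as np
        from scipy.fft import dct, idct

        rng = np.random.default_rng(4)
        n = 400                                     # frames in one voiced section
        traj = np.hanning(n) * 0.8 + rng.normal(0, 0.02, n)   # amplitude trajectory

        K = 12                                      # long-term model order
        c = dct(traj, norm="ortho")
        c[K:] = 0.0                                 # keep only the first K coefficients
        traj_model = idct(c, norm="ortho")

        rmse = np.sqrt(np.mean((traj - traj_model) ** 2))
        print("RMS modeling error with", K, "coefficients:", round(float(rmse), 4))
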
  • Improved Subspace-Based Single-Channel Speech Enhancement Using Generalized Super-Gaussian Priors

    Page(s): 862 - 872

    Traditional single-channel subspace-based schemes for speech enhancement rely mostly on linear minimum mean-square error estimators, which are globally optimal only if the Karhunen-Loève transform (KLT) coefficients of the noise and speech processes are Gaussian distributed. In this paper, we derive subspace-based nonlinear estimators assuming that the speech KLT coefficients are distributed according to a generalized super-Gaussian distribution, which has the Laplacian and the two-sided Gamma distribution as special cases. As with the traditional linear estimators, the derived estimators are functions of the a priori signal-to-noise ratio (SNR) in the subspaces spanned by the KLT transform vectors. We propose a scheme for estimating these a priori SNRs, which is in fact a generalization of the "decision-directed" approach well known from short-time Fourier transform (STFT)-based enhancement schemes. We show that the proposed a priori SNR estimation scheme leads to a significant reduction of the residual noise level, a conclusion confirmed in extensive objective speech quality evaluations as well as subjective tests. We also show that the derived estimators based on the super-Gaussian KLT coefficient distribution yield improvements across different noise sources and levels compared to estimators based on a Gaussian assumption.

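    A sketch of the subspace machinery with the linear (Gaussian) gain as a stand-in: eigendecompose the sample covariance to get the KLT, form an a priori SNR per transform direction, and apply a Wiener-type gain. The paper replaces this gain with nonlinear super-Gaussian estimators and a generalized decision-directed SNR estimator, neither of which appears here.

        import numpy as np

        def klt_enhance(frames, noise_var):
            """Per-KLT-direction Wiener gains xi / (1 + xi) (Gaussian special case)."""
            R = frames.T @ frames / frames.shape[0]             # sample covariance
            lam, V = np.linalg.eigh(R)                          # KLT basis
            xi = np.maximum(lam - noise_var, 0.0) / noise_var   # a priori SNR
            return (frames @ V) * (xi / (1.0 + xi)) @ V.T

        rng = np.random.default_rng(5)
        clean = rng.standard_normal((1000, 16)) @ np.diag(np.linspace(2.0, 0.1, 16))
        noisy = clean + rng.standard_normal((1000, 16))
        out = klt_enhance(noisy, noise_var=1.0)
        print("noisy MSE:", round(float(np.mean((noisy - clean) ** 2)), 3),
              "enhanced MSE:", round(float(np.mean((out - clean) ** 2)), 3))
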
  • Neural Network-Based Artificial Bandwidth Expansion of Speech

    Page(s): 873 - 881

    The limited bandwidth of 0.3-3.4 kHz in current telephone systems reduces both the quality and the intelligibility of speech. Artificial bandwidth expansion is a method that expands the bandwidth of the narrowband speech signal at the receiving end of the transmission link by adding new frequency components at higher frequencies, i.e., up to 8 kHz. In this paper, a new method for artificial bandwidth expansion, termed Neuroevolution Artificial Bandwidth Expansion (NEABE), is proposed. The method uses spectral folding to create the initial spectral components above the telephone band. The spectral envelope is then shaped in the frequency domain, based on a set of parameters given by a neural network. Subjective listening tests were used to evaluate the performance of the proposed algorithm, and the results showed that NEABE speech was preferred over narrowband speech in about 80% of the test cases.

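    Spectral folding itself is nearly a one-liner: inserting zeros between samples doubles the sampling rate and mirrors the narrowband spectrum about the old Nyquist frequency. The per-band gains below stand in for the envelope shaping that the paper's neural network provides.

        import numpy as np

        def spectral_fold(x_nb, gains):
            """Zero-insertion upsampling plus crude per-band envelope shaping."""
            y = np.zeros(2 * x_nb.size)
            y[::2] = x_nb                         # images appear above the old Nyquist
            Y = np.fft.rfft(y)
            edges = np.linspace(0, Y.size, gains.size + 1).astype(int)
            for g, a, b in zip(gains, edges[:-1], edges[1:]):
                Y[a:b] *= g                       # shape the (folded) envelope
            return np.fft.irfft(Y, n=y.size)

        fs_nb = 8000
        t = np.arange(fs_nb) / fs_nb
        nb = np.sin(2 * np.pi * 1000 * t)         # 1-kHz narrowband tone
        wb = spectral_fold(nb, np.ones(8))
        spec = np.abs(np.fft.rfft(wb))
        # The folded image of the 1-kHz tone sits at 7 kHz after expansion to 16 kHz.
        print("strongest bins (Hz):", np.sort(np.argsort(spec)[-2:]) * 16000 // wb.size)
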
  • HMM-Based Gain Modeling for Enhancement of Speech in Noise

    Page(s): 882 - 892

    Accurate modeling and estimation of speech and noise gains facilitate good performance of speech enhancement methods using data-driven prior models. In this paper, we propose a hidden Markov model (HMM)-based speech enhancement method using explicit gain modeling. Through the introduction of stochastic gain variables, energy variation in both speech and noise is explicitly modeled in a unified framework. The speech gain models the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain helps to improve the tracking of the time-varying energy of nonstationary noise. The expectation-maximization (EM) algorithm is used to perform offline estimation of the time-invariant model parameters, while the time-varying model parameters are estimated online using the recursive EM algorithm. The proposed gain modeling techniques are applied to a novel Bayesian speech estimator, and the performance of the proposed enhancement method is evaluated through objective and subjective tests. The experimental results confirm the advantage of explicit gain modeling, particularly for nonstationary noise sources.

  • Single-Mixture Audio Source Separation by Subspace Decomposition of Hilbert Spectrum

    Page(s): 893 - 900

    A novel technique is developed to separate audio sources from a single mixture. The method is based on decomposing the Hilbert spectrum (HS) of the mixed signal into independent source subspaces. The Hilbert transform, combined with empirical mode decomposition (EMD), yields the HS, a fine-resolution time-frequency representation of a nonstationary signal. The EMD represents any time-domain signal as the sum of a finite set of oscillatory components called intrinsic mode functions (IMFs). After computing the spectral projections between the mixed signal and the individual IMF components, the projection vectors are used to derive a set of spectrally independent bases by applying principal component analysis (PCA) and independent component analysis (ICA). A k-means clustering algorithm based on the Kullback-Leibler divergence (KLD) is introduced to group the independent basis vectors into as many groups as there are component sources in the mixture. The HS of the mixed signal is projected onto the space spanned by each group of basis vectors, yielding the independent source subspaces. The time-domain source signals are reconstructed by applying the inverse transformation. Experimental results show that the proposed algorithm separates speech and interfering sounds from a single mixture.

  • Rayleigh Mixture Model-Based Hidden Markov Modeling and Estimation of Noise in Noisy Speech Signals

    Page(s): 901 - 917

    In this paper, we propose a new statistical model for noise periodogram modeling and estimation. The proposed model is a hidden Markov model (HMM) with a Rayleigh mixture model (RMM) in each state. For this new model, we derive an expectation-maximization (EM) training algorithm and a minimum mean-square error (MMSE) noise periodogram estimator. It is shown that, compared to the Gaussian mixture model (GMM)-based HMM, the RMM-based HMM has less computationally complex EM iterations and gives a better fit of the noise periodograms when the mixture models have a small number of components. Furthermore, we propose a specialization of the proposed model, which is shown to provide better MMSE noise periodogram estimates than any of the other tested HMM initializations for cyclostationary noise types.

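    The per-state emission model admits closed-form EM updates, since the Rayleigh maximum-likelihood estimate is sigma^2 = E[x^2]/2. A sketch for a single Rayleigh mixture (the paper ties such mixtures into HMM states and adds the MMSE periodogram estimator):

        import numpy as np

        def rayleigh_pdf(x, s2):
            return (x / s2) * np.exp(-x ** 2 / (2.0 * s2))

        def rmm_em(x, n_comp=2, n_iter=50, seed=6):
            """EM for a Rayleigh mixture: closed-form M-step per component."""
            rng = np.random.default_rng(seed)
            w = np.full(n_comp, 1.0 / n_comp)
            s2 = rng.uniform(0.5, 2.0, n_comp) * np.mean(x ** 2) / 2.0
            for _ in range(n_iter):
                p = w * rayleigh_pdf(x[:, None], s2)        # E-step: responsibilities
                gamma = p / p.sum(axis=1, keepdims=True)
                nk = gamma.sum(axis=0)                      # M-step: weights and scales
                w = nk / x.size
                s2 = (gamma * x[:, None] ** 2).sum(axis=0) / (2.0 * nk)
            return w, s2

        rng = np.random.default_rng(7)
        x = np.concatenate([rng.rayleigh(1.0, 4000), rng.rayleigh(3.0, 2000)])
        print(rmm_em(x))    # recovers weights ~(2/3, 1/3) and sigma^2 ~(1, 9)
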
  • MAP Estimators for Speech Enhancement Under Normal and Rayleigh Inverse Gaussian Distributions

    Page(s): 918 - 927

    This paper presents a new class of estimators for speech enhancement in the discrete Fourier transform (DFT) domain, where we consider a multidimensional normal inverse Gaussian (MNIG) distribution for the speech DFT coefficients. The MNIG distribution can model a wide range of processes, from heavy-tailed to less heavy-tailed. Under the MNIG distribution, complex DFT and amplitude estimators are derived. In contrast to other estimators, the suppression characteristics of the MNIG-based estimators can be adapted online to the underlying distribution of the speech DFT coefficients. Compared to noise suppression algorithms based on preselected super-Gaussian distributions, the MNIG-based complex DFT and amplitude estimators improve the segmental signal-to-noise ratio (SNR) on the order of 0.3 to 0.6 dB and 0.2 to 0.6 dB, respectively.

  • Gaussian Mixture Clustering and Language Adaptation for the Development of a New Language Speech Recognition System

    Page(s): 928 - 938

    Porting a speech recognition system to a new language is usually a time-consuming and expensive process, since it requires collecting, transcribing, and processing a large amount of language-specific training sentences. This work presents techniques for improved cross-language transfer of speech recognition systems to new target languages. Such techniques are particularly useful for target languages where minimal amounts of training data are available. We describe a novel method to produce a language-independent system by combining acoustic models from a number of source languages. This intermediate language-independent acoustic model is used to bootstrap a target-language system by applying language adaptation. In our experiments, we use acoustic models of seven source languages to develop a target Greek acoustic model. We show that our technique significantly outperforms a system trained from scratch when less than 8 h of read speech is available.

  • An Implementation of Rational Wavelets and Filter Design for Phonetic Classification

    Page(s): 939 - 948

    Although wavelet analysis has been proposed for speech processing as an alternative to Fourier analysis, most approaches make use of off-the-shelf wavelets and dyadic tree-structured filter banks. In this paper, we extend previous wavelet-based frameworks in two ways. First, we increase the flexibility in wavelet selection by taking advantage of the relationship between wavelets and filter banks and by designing new wavelets using filter design methods. We adopt two filter design techniques that we refer to as filter matching and attenuation minimization. Second, we improve the flexibility in frequency partitioning by implementing rational as well as dyadic filter banks. Rational filter banks naturally incorporate the critical-band effect of the human auditory system. To test our extensions, we implement an energy-based measurement, which we compare in performance to mel-frequency cepstral coefficients (MFCCs) in a phonetic classification task. We show that the designed wavelets outperform off-the-shelf wavelets as well as an MFCC baseline.

  • The Contribution of Various Sources of Spectral Mismatch to Audible Discontinuities in a Diphone Database

    Page(s): 949 - 956

    One of the major problems in concatenative synthesis is the occurrence of audible discontinuities between two successive concatenative units. Several studies have attempted to discover objective distance measures that predict the audibility of these discontinuities. In this paper, we investigate mid-vowel joins for three vowels with a range of post-vocalic consonant contexts typical of diphone databases. A first perceptual experiment uses a pairwise comparison procedure to find two subsets of unit combinations: those with and those without audible discontinuities. A second perceptual experiment uses these two subsets in a procedure where formant resynthesis is used to manipulate three sources of discontinuity separately: formant frequencies, formant bandwidths, and overall energy. Results show that mismatch in formant frequencies provides the largest contribution to audible discontinuity, followed by mismatch in overall energy.

  • Globally Optimal Training of Unit Boundaries in Unit Selection Text-to-Speech Synthesis

    Page(s): 957 - 965

    The level of quality that can be achieved by modern concatenative text-to-speech synthesis depends heavily on a judicious composition of the unit inventory used in the unit selection process. Unit boundary optimization, in particular, can make a large difference in the users' perception of the concatenated acoustic waveform. This paper considers the iterative refinement of unit boundaries based on a data-driven feature extraction framework separately optimized for each boundary region. This guarantees a globally optimal cut point between any two matching units in the underlying inventory. The associated boundary training procedure is objectively characterized, first in terms of convergence behavior, and then by comparing the distributions of inter-unit discontinuity obtained before and after training. Experimental results underscore the viability of this approach to unit boundary optimization. Listening evidence also qualitatively exemplifies a noticeable reduction in the perception of discontinuity between concatenated acoustic units.

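    The cut-point search can be caricatured in a few lines: slide the boundary over a small region and keep the offset that minimizes the feature-space jump between the two units. The random MFCC stand-ins, region size, and plain Euclidean join cost are all assumptions; the paper first learns a boundary-specific feature transform, which this sketch omits.

        import numpy as np

        def best_cut(feat_a, feat_b, region=10):
            """Boundary offset minimizing the inter-unit discontinuity."""
            mid_a, mid_b = feat_a.shape[0] // 2, feat_b.shape[0] // 2
            costs = {k: float(np.linalg.norm(feat_a[mid_a + k] - feat_b[mid_b + k]))
                     for k in range(-region, region + 1)}
            k = min(costs, key=costs.get)
            return k, costs[k]

        rng = np.random.default_rng(8)
        feat_a = rng.standard_normal((40, 13))   # e.g., MFCC frames around boundary A
        feat_b = rng.standard_normal((40, 13))   # e.g., MFCC frames around boundary B
        offset, cost = best_cut(feat_a, feat_b)
        print("best offset:", offset, "join cost:", round(cost, 3))
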
  • High-Resolution Spherical Quantization of Sinusoidal Parameters

    Page(s): 966 - 981

    Sinusoidal coding is an often-employed technique in low bit-rate audio coding; methods for efficient quantization of sinusoidal parameters are therefore of great importance. In this paper, we use high-resolution assumptions to derive analytical expressions for the optimal entropy-constrained unrestricted spherical quantizers for the amplitude, phase, and frequency parameters of the sinusoidal model. This is done both for the case of a single sinusoid and for the more practically relevant case of multiple sinusoids distributed across multiple segments. To account for psychoacoustical effects of the auditory system, a perceptual distortion measure is used. The optimal quantizers minimize a high-resolution approximation of the expected perceptual distortion, while the corresponding quantization indices satisfy an entropy constraint. The quantizers turn out to be flexible and of low complexity, in the sense that they can be determined easily for varying bit-rate requirements, without any retraining or iterative procedures. An objective comparison shows that, for the squared error distortion measure, the rate-distortion performance of the proposed method is very close to that of theoretically optimal entropy-constrained vector quantization. Furthermore, for the perceptual distortion measure, the proposed scheme objectively outperforms an existing sinusoidal quantization scheme in which frequency quantization is done independently. Finally, a subjective listening test, in which the proposed scheme is compared to an existing state-of-the-art sinusoidal quantization scheme with fixed quantizers for all input signals, indicates that the proposed scheme leads to an average bit-rate reduction of 20% at the same subjective quality level as the existing scheme.

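    As a point of reference only, a fixed-step polar quantizer for one sinusoid's (amplitude, phase, frequency) triple is sketched below; the paper instead derives jointly optimal, entropy-constrained point densities under a perceptual distortion measure. All step sizes here are assumptions.

        import numpy as np

        def quantize_sinusoid(amp, phase, freq,
                              d_a=0.05, d_p=np.pi / 64, d_f=2 * np.pi / 4096):
            """Uniformly quantize log-amplitude, phase, and (radian) frequency."""
            idx = (int(round(np.log(amp) / d_a)),
                   int(round(phase / d_p)),
                   int(round(freq / d_f)))
            rec = (float(np.exp(idx[0] * d_a)), idx[1] * d_p, idx[2] * d_f)
            return idx, rec

        idx, rec = quantize_sinusoid(0.33, 1.234, 0.30)
        print(idx, tuple(round(v, 4) for v in rec))
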
  • A Multipitch Analyzer Based on Harmonic Temporal Structured Clustering

    Page(s): 982 - 994

    This paper proposes a multipitch analyzer called the harmonic temporal structured clustering (HTC) method, which jointly estimates the pitch, intensity, onset, duration, etc., of each underlying source in a multipitch audio signal. HTC decomposes the energy patterns diffused in time-frequency space, i.e., the power spectrum time series, into distinct clusters such that each originates from a single source. The problem is equivalent to approximating the observed power spectrum time series by superimposed HTC source models, whose parameters are associated with the acoustic features that we wish to extract. The update equations of the HTC are explicitly derived by formulating the HTC source model with a Gaussian kernel representation. Experiments verify the potential of the HTC method.

  • Combined Estimation of Spectral Envelopes and Sound Source Direction of Concurrent Voices by Multidimensional Statistical Filtering

    Page(s): 995 - 1008

    A key question for speech enhancement and for simulations of auditory scene analysis in high levels of nonstationary noise is how to combine principles of auditory grouping and integrate several noise-perturbed acoustical cues in a robust way. We present an application of recent online, nonlinear, non-Gaussian multidimensional statistical filtering methods that integrates tracking of sound-source direction with the spectro-temporal dynamics of two mixed voices. The framework is in agreement with the notion of evaluating competing hypotheses. To limit the number of hypotheses that need to be evaluated, the approach uses a detailed statistical description of the high-dimensional spectro-temporal dynamics of speech, measured from a large speech database. The results show that the algorithm tracks sound-source directions very precisely, separates the voice envelopes with algorithmic convergence times down to 50 ms, and enhances the signal-to-noise ratio in adverse conditions, albeit at high computational cost. The approach has high potential for efficiency improvements and could be applied to voice separation and reduction of nonstationary noise.

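    A generic bootstrap particle filter conveys the flavor of the online, non-Gaussian filtering involved, here tracking a single source azimuth from noisy direction observations. The random-walk dynamics, Gaussian likelihood, and all constants are assumptions; the paper's state additionally carries the spectro-temporal envelope dynamics of two voices.

        import numpy as np

        def track_direction(obs, n_particles=500, drift=2.0, obs_std=8.0, seed=9):
            """Bootstrap particle filter: predict, weight, resample, estimate."""
            rng = np.random.default_rng(seed)
            particles = rng.uniform(-90, 90, n_particles)   # azimuth hypotheses (deg)
            estimates = []
            for z in obs:
                particles = particles + rng.normal(0, drift, n_particles)  # predict
                w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)        # weight
                w /= w.sum()
                particles = particles[rng.choice(n_particles, n_particles, p=w)]
                estimates.append(particles.mean())
            return np.array(estimates)

        rng = np.random.default_rng(10)
        true_az = np.linspace(-30, 45, 200)                 # slowly moving source
        est = track_direction(true_az + rng.normal(0, 8.0, 200))
        rmse = np.sqrt(np.mean((est - true_az) ** 2))
        print("RMS tracking error (deg):", round(float(rmse), 2))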

Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.


This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research