
IEEE Transactions on Audio, Speech, and Language Processing

Issue 3 • May 2006


Displaying Results 1 - 25 of 39
  • Table of contents

    Page(s): c1 - c4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): c2
  • Sinusoidal model-based analysis and classification of stressed speech

    Page(s): 737 - 746

    In this paper, a sinusoidal model is proposed for the characterization and classification of different stress classes (emotions) in a speech signal. Frequency, amplitude, and phase features of the sinusoidal model are analyzed and used as input features to a stressed speech recognition system. The performance of the sinusoidal model features is evaluated for recognition of different stress classes with a vector-quantization classifier and a hidden Markov model classifier. To assess the effectiveness of these features for recognizing different emotions in different languages, speech signals are recorded and tested in two languages, Telugu (an Indian language) and English. Average stressed speech index values are proposed for comparing differences between stress classes in a speech signal. Results show that sinusoidal model features are successful in characterizing different stress classes in a speech signal, and that they outperform linear prediction and cepstral features in recognizing the emotions in a speech signal.
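
    As a rough illustration of the frame-level analysis described in the abstract above, the numpy sketch below picks the strongest spectral peaks of a windowed frame and returns their frequencies, amplitudes, and phases. The window choice, peak-picking rule, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinusoidal_features(frame, fs, n_peaks=10):
    """Frequencies, amplitudes, and phases of the strongest spectral peaks
    in one windowed speech frame (illustrative sketch only)."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hamming(n))
    mag = np.abs(spectrum)
    # simple local-maximum peak picking on the magnitude spectrum
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:n_peaks]
    freqs = np.array(peaks) * fs / n
    return freqs, mag[peaks], np.angle(spectrum)[peaks]

# example: one 25-ms frame at 8 kHz containing two sinusoids
fs = 8000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
freqs, amps, phases = sinusoidal_features(frame, fs)
```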

  • A new structural approach in system identification with generalized analysis-by-synthesis for robust speech coding

    Page(s): 747 - 751

    In this paper, we apply a new structural approach to generalized analysis-by-synthesis (GAbS) for system identification as a preprocessor of a low-bit-rate speech coder. In our approach, the coder-decoder (CODEC) system is separately estimated and then applied to modify the current input signal. This differs from the original formulation, in which the CODEC system is sequentially estimated and then applied to the next input signal. The proposed estimation scheme is compared to the conventional method on the signal modification task under various noise types and SNR conditions, and shows better performance.

  • Rate-distortion optimal time-segmentation and redundancy selection for VoIP

    Page(s): 752 - 763

    In this paper, novel techniques for packet loss robust speech coding are proposed. By exploiting knowledge of the receiving-end packet loss concealment algorithm, an existing rate-distortion optimal time-segmentation algorithm is extended to take packet losses into account. To increase robustness in highly nonstationary signals, the technique is complemented by a redundancy selection scheme. A jointly optimal approach ensures that the complementarity between time-segmentation and redundancies is fully exploited. The performance of the methods is investigated through Monte Carlo simulations under various conditions, such as rate, packet loss probability, and algorithmic delay. Finally, subjective listening tests demonstrate perceptual improvements compared to conventional adaptive time-segmentation that does not take packet losses into account.

  • On causal algorithms for speech enhancement

    Page(s): 764 - 773

    Kalman filtering is a powerful technique for the estimation of a signal observed in noise that can be used to enhance speech observed in the presence of acoustic background noise. In a speech communication system, the speech signal is typically buffered for a period of 10-40 ms and, therefore, the use of either a causal or a noncausal filter is possible. We show that the causal Kalman algorithm is in conflict with the basic properties of human perception and address the problem of improving its perceptual quality. We discuss two approaches to improve perceptual performance. The first is based on a new method that combines the causal Kalman algorithm with pre- and postfiltering to introduce perceptual shaping of the residual noise. The second is based on the conventional Kalman smoother. We show that a short lag removes the conflict resulting from the causality constraint and we quantify the minimum lag required for this purpose. The results of our objective and subjective evaluations confirm that both approaches significantly outperform the conventional causal implementation. Of the two approaches, the Kalman smoother performs better if the signal statistics are precisely known; if this is not the case, the perceptually weighted Kalman filter performs better.
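
    A minimal scalar sketch of the causal Kalman recursion such enhancers build on, assuming a first-order AR signal model with known parameters; the paper's higher-order formulation, perceptual pre- and postfiltering, and fixed-lag smoothing are not reproduced here.

```python
import numpy as np

def kalman_denoise(y, a=0.95, q=1.0, r=4.0):
    """Causal scalar Kalman filter for a first-order AR signal model
    x[n] = a*x[n-1] + w[n] observed as y[n] = x[n] + v[n];
    q and r are the process and measurement noise variances."""
    x_hat, p = 0.0, q / (1.0 - a * a)        # stationary prior variance
    out = np.empty(len(y))
    for n, obs in enumerate(y):
        x_pred = a * x_hat                    # predict
        p_pred = a * a * p + q
        k = p_pred / (p_pred + r)             # Kalman gain
        x_hat = x_pred + k * (obs - x_pred)   # update with the innovation
        p = (1.0 - k) * p_pred
        out[n] = x_hat
    return out
```

    A fixed-lag smoother would additionally buffer a short block of future observations before committing to each estimate, which is the lag whose minimum value the authors quantify.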

  • A two-stage algorithm for one-microphone reverberant speech enhancement

    Page(s): 774 - 784

    Under noise-free conditions, the quality of reverberant speech is dependent on two distinct perceptual components: coloration and long-term reverberation. They correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. Inspired by this observation, we propose a two-stage reverberant speech enhancement algorithm using one microphone. In the first stage, an inverse filter is estimated to reduce coloration effects or increase SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that our algorithm performs substantially better.
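
    The second stage is a form of spectral subtraction; a generic magnitude-domain sketch is shown below, assuming an STFT matrix and some per-bin estimate of the long-term reverberation energy. How the paper derives that estimate, and the first-stage inverse filter, are not shown.

```python
import numpy as np

def spectral_subtract(stft, late_reverb_psd, alpha=2.0, floor=0.05):
    """Magnitude-domain spectral subtraction on an STFT matrix
    (frequency bins x frames).  late_reverb_psd is a per-bin estimate of
    the long-term reverberation energy (illustrative assumption)."""
    mag, phase = np.abs(stft), np.angle(stft)
    power = mag ** 2 - alpha * late_reverb_psd[:, None]   # subtract estimate
    power = np.maximum(power, (floor * mag) ** 2)         # spectral floor
    return np.sqrt(power) * np.exp(1j * phase)            # keep noisy phase
```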

  • Stereophonic acoustic echo cancellation employing selective-tap adaptive algorithms

    Page(s): 785 - 796

    Stereophonic acoustic echo cancellation has generated much interest in recent years due to the nonuniqueness and misalignment problems that are caused by the strong interchannel signal coherence. In this paper, we introduce a novel adaptive filtering approach, based on a selective-tap updating procedure, to reduce interchannel coherence. This tap-selection technique is then applied to the normalized least-mean-square, affine projection, and recursive least squares algorithms for stereophonic acoustic echo cancellation. Simulation results for the proposed algorithms show a significant improvement in convergence rate compared with existing techniques.
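
    The tap-selection idea can be sketched with a single-channel M-max NLMS update that adapts only the taps whose input samples are currently largest in magnitude; the paper's exclusive-maximum selection operates jointly on the two stereo channels, so the snippet below, including its parameter values, is only an illustrative assumption.

```python
import numpy as np

def mmax_nlms(x, d, L=256, M=64, mu=0.5, eps=1e-6):
    """NLMS echo canceller that updates only the M filter taps whose input
    samples are largest in magnitude at each iteration (M-max selection)."""
    w = np.zeros(L)                        # adaptive filter taps
    e = np.zeros(len(x))                   # error / echo-cancelled signal
    for n in range(L - 1, len(x)):
        u = x[n - L + 1:n + 1][::-1]       # x[n], x[n-1], ..., x[n-L+1]
        e[n] = d[n] - w @ u                # a priori error
        sel = np.argsort(np.abs(u))[-M:]   # indices of the M largest inputs
        w[sel] += mu * e[n] * u[sel] / (u @ u + eps)
    return w, e
```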

  • Aggregate a posteriori linear regression adaptation

    Page(s): 797 - 807

    We present a new discriminative linear regression adaptation algorithm for hidden Markov model (HMM) based speech recognition. The cluster-dependent regression matrices are estimated from speaker-specific adaptation data by maximizing the aggregate a posteriori probability, which can be expressed in the form of a classification error function that adopts the logarithm of the posterior distribution as the discriminant function. Accordingly, aggregate a posteriori linear regression (AAPLR) is developed for discriminative adaptation in which the classification errors of the adaptation data are minimized. Because the prior distribution of the regression matrix is involved, AAPLR is equipped with Bayesian learning capability. We demonstrate that the difference between AAPLR discriminative adaptation and maximum a posteriori linear regression (MAPLR) adaptation lies in the treatment of the evidence. Unlike minimum classification error linear regression (MCELR), AAPLR has a closed-form solution, enabling rapid adaptation. Experimental results reveal that AAPLR speaker adaptation does improve speech recognition performance with moderate computational cost compared to maximum likelihood linear regression (MLLR), MAPLR, MCELR, and conditional maximum likelihood linear regression (CMLLR). These results are verified for both supervised and unsupervised adaptation with different amounts of adaptation data.
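
    For reference, the linear regression adaptation shared by MLLR, MAPLR, and AAPLR transforms each Gaussian mean with a cluster-dependent matrix; a sketch of this common form (not the AAPLR criterion itself) is

```latex
\hat{\mu}_{m} \;=\; A_{c}\,\mu_{m} + b_{c} \;=\; W_{c}\,\xi_{m},
\qquad
\xi_{m} = \begin{bmatrix} 1 \\ \mu_{m} \end{bmatrix},
\qquad
W_{c} = \bigl[\, b_{c} \;\; A_{c} \,\bigr],
```

    where m indexes a Gaussian component and c its regression class. AAPLR differs in how W_c is estimated: by maximizing the aggregate a posteriori probability of the adaptation data rather than its likelihood.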

  • Optimization of temporal filters for constructing robust features in speech recognition

    Page(s): 808 - 832

    Linear discriminant analysis (LDA) has long been used to derive data-driven temporal filters in order to improve the robustness of speech features used in speech recognition. In this paper, we propose two new optimization criteria, principal component analysis (PCA) and minimum classification error (MCE), for constructing the temporal filters. A detailed comparative performance analysis of the features obtained using the three optimization criteria, LDA, PCA, and MCE, with various types of noise and a wide range of SNR values is presented. The new criteria lead to performance superior to that of the original MFCC features, just as LDA-derived filters do. In addition, the newly proposed MCE-derived filters often do better than the LDA-derived filters. It is also shown that further performance improvements are achievable if any of these LDA/PCA/MCE-derived filters are integrated with the conventional approach of cepstral mean and variance normalization (CMVN). The performance improvements obtained in recognition experiments are further supported by analyses conducted using two different distance measures.
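
    A minimal sketch of one of the criteria, deriving PCA-based temporal filters from the time trajectory of a single log filter-bank energy; the context length and number of filters are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def pca_temporal_filters(trajectory, context=9, n_filters=3):
    """Derive temporal FIR filters as the leading principal components of
    overlapping time windows of one log filter-bank energy trajectory."""
    T = len(trajectory) - context + 1
    X = np.stack([trajectory[t:t + context] for t in range(T)])  # T x context
    X = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(X.T @ X / T)    # ascending eigenvalues
    return eigvec[:, ::-1][:, :n_filters].T         # rows are FIR filters

# applying the first derived filter along time:
# smoothed = np.convolve(trajectory, filters[0], mode='same')
```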

  • Noise compensation for speech recognition with arbitrary additive noise

    Page(s): 833 - 844

    This paper investigates speech recognition involving additive background noise, assuming no knowledge about the noise characteristics. A new method, namely universal compensation (UC), is proposed as a solution to the problem. The UC method is an extension of the missing-feature method, i.e., recognition based only on reliable data but robust to any corruption type, including full corruption in which the noise affects all time-frequency components of the speech representation. The UC technique achieves robustness to unknown, full noise corruption through a novel combination of the multicondition training method and the missing-feature method. Multicondition training is employed to convert fullband spectral corruption into partial-band spectral corruption, which is achieved by training the model using data involving simulated wide-band noise at different signal-to-noise ratios. The missing-feature principle is employed to reduce the effect of the remaining partial-band corruption on recognition by basing the recognition only on the matched or compensated spectral components from the multicondition training. The combination of these two strategies makes the new method potentially capable of dealing with arbitrary additive noise, with arbitrary temporal-spectral characteristics, based only on clean speech training data and simulated noise data, without requiring knowledge of the actual noise. Two databases, Aurora 2 and an E-set word database, have been used to evaluate the UC method. Experiments on Aurora 2 indicate that the new model has the potential to achieve a recognition performance close to the performance obtained by a multicondition baseline model trained using data involving the test environments. Further experiments for noise conditions unseen in Aurora 2 show significant performance improvement for the new model over the multicondition model. The experimental results on the E-set database demonstrate the ability of the UC model to deal with acoustically confusing recognition tasks.

  • Quantile based histogram equalization for noise robust large vocabulary speech recognition

    Page(s): 845 - 854

    The noise robustness of automatic speech recognition systems can be improved by reducing any mismatch between the training and test data distributions during feature extraction. Based on the quantiles of these distributions, the parameters of the transformation functions can be reliably estimated from small amounts of data. This paper gives a detailed review of quantile equalization applied to the Mel-scaled filter bank, including considerations for application in online systems and improvements through a second transformation step that combines neighboring filter channels. The recognition tests show that previous experimental observations on small-vocabulary recognition tasks can be confirmed on the larger vocabulary Aurora 4 noisy Wall Street Journal database. The word error rate could be reduced from 45.7% to 25.5% (clean training) and from 19.5% to 17.0% (multicondition training).
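
    The core idea can be sketched as mapping the quantiles of one test-data filter-bank channel onto the corresponding training quantiles. Note that the paper fits a parametric power-function transformation from the quantiles, so the piecewise-linear mapping below is only an approximation of the idea, with illustrative quantile points.

```python
import numpy as np

def quantile_equalize(test_channel, train_quantiles, probs=(0.25, 0.5, 0.75)):
    """Map the test-data quantiles of one filter-bank channel onto the
    corresponding training-data quantiles with a piecewise-linear function
    (sketch only; the paper uses a parametric power-function transform)."""
    test_q = np.quantile(test_channel, probs)
    return np.interp(test_channel, test_q, train_quantiles)
```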

  • Automatic determination of acoustic model topology using variational Bayesian estimation and clustering for large vocabulary continuous speech recognition

    Page(s): 855 - 872

    We describe the automatic determination of a large and complicated acoustic model topology by using variational Bayesian estimation and clustering (VBEC) for speech recognition. We propose an efficient method for decision tree clustering based on a Gaussian mixture model (GMM) and an efficient model search algorithm for finding an appropriate acoustic model topology within the VBEC framework. GMM-based decision tree clustering for triphone HMM states features a novel approach designed to reduce the overly large number of computations to a practical level by utilizing the statistics of monophone hidden Markov model states. The model search algorithm also reduces the search space by utilizing the characteristics of the acoustic model. The experimental results confirmed that VBEC automatically and rapidly yielded an optimum model topology with the highest performance.

  • Maximum entropy direct models for speech recognition

    Page(s): 873 - 881

    Traditional statistical models for speech recognition have mostly been based on a Bayesian framework using generative models such as hidden Markov models (HMMs). This paper focuses on a new framework for speech recognition using maximum entropy direct modeling, where the probability of a state or word sequence given an observation sequence is computed directly from the model. In contrast to HMMs, features can be asynchronous and overlapping. This model therefore allows for the potential combination of many different types of features, which need not be statistically independent of each other. In this paper, a specific kind of direct model, the maximum entropy Markov model (MEMM), is studied. Even with conventional acoustic features, the approach already shows promising results for phone-level decoding. The MEMM significantly outperforms traditional HMMs in word error rate when used as stand-alone acoustic models. Preliminary results combining the MEMM scores with HMM and language model scores show modest improvements over the best HMM speech recognizer.

  • Minimum phone error training of precision matrix models

    Page(s): 882 - 889

    Gaussian mixture models (GMMs) are commonly used as the output density function for large-vocabulary continuous speech recognition (LVCSR) systems. A standard problem when using multivariate GMMs to classify data is how to accurately represent the correlations in the feature vector. Full covariance matrices yield a good model, but dramatically increase the number of model parameters. Hence, diagonal covariance matrices are commonly used. Structured precision matrix approximations provide an alternative, flexible, and compact representation. Schemes in this category include the extended maximum likelihood linear transform and subspace for precision and mean models. This paper examines how these precision matrix models can be discriminatively trained and used on state-of-the-art speech recognition tasks. In particular, the use of the minimum phone error criterion is investigated. Implementation issues associated with building LVCSR systems are also addressed. These models are evaluated and compared using large vocabulary continuous telephone speech and broadcast news English tasks.
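
    As a reminder of what such structured precision matrices look like, the extended MLLT family represents each Gaussian's precision as a weighted sum of shared rank-one bases (a sketch of the general form, not of the paper's training procedure):

```latex
\Sigma_{m}^{-1} \;=\; \sum_{i=1}^{n} \lambda_{m,i}\,\mathbf{a}_{i}\mathbf{a}_{i}^{\mathsf{T}},
\qquad d \;\le\; n \;\le\; \tfrac{d(d+1)}{2},
```

    with the basis vectors a_i shared across all Gaussians and only the weights λ_{m,i} stored per component; n = d reduces to a single shared transform with per-Gaussian diagonal variances, while larger n approaches full covariance modeling.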

  • Average divergence distance as a statistical discrimination measure for hidden Markov models

    Page(s): 890 - 906

    This paper proposes and evaluates a new statistical discrimination measure for hidden Markov models (HMMs) extending the notion of divergence, a measure of average discrimination information originally defined for two probability density functions. Similar distance measures have been proposed for the case of HMMs, but those have focused primarily on the stationary behavior of the models. However, in speech recognition applications, the transient aspects of the models have a principal role in the discrimination process and, consequently, capturing this information is crucial in the formulation of any discrimination indicator. This paper proposes the notion of average divergence distance (ADD) as a statistical discrimination measure between two HMMs, considering the transient behavior of these models. This paper provides an analytical formulation of the proposed discrimination measure, a justification of its definition based on the Viterbi decoding approach, and a formal proof that this quantity is well defined for a left-to-right HMM topology with a final nonemitting state, a standard model for basic acoustic units in automatic speech recognition (ASR) systems. Using experiments based on this discrimination measure, it is shown that ADD provides a coherent way to evaluate the discrimination dissimilarity between acoustic models.
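
    The classical divergence that the paper extends is Kullback's symmetric measure between two densities,

```latex
J(p, q) \;=\; \int \bigl(p(x) - q(x)\bigr)\,\ln\frac{p(x)}{q(x)}\,dx
\;=\; D_{\mathrm{KL}}(p \,\|\, q) \;+\; D_{\mathrm{KL}}(q \,\|\, p).
```

    Roughly speaking, the ADD averages such discrimination terms along the state sequences of the two HMMs, so that their transient (left-to-right) behavior, and not only their stationary statistics, contributes to the measure.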

  • Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora

    Page(s): 907 - 919

    The problem of unsupervised audio classification and segmentation continues to be a challenging research problem which significantly impacts automatic speech recognition (ASR) and spoken document retrieval (SDR) performance. This paper addresses novel advances in 1) audio classification for speech recognition and 2) audio segmentation for unsupervised multispeaker change detection. A new algorithm is proposed for audio classification, which is based on weighted GMM networks (WGN). Two new extended-time features, variance of the spectrum flux (VSF) and variance of the zero-crossing rate (VZCR), are used to preclassify the audio and supply weights to the output probabilities of the GMM networks. The classification is then implemented using weighted GMM networks. Since historically there have been no features specifically designed for audio segmentation, we evaluate 16 potential features, including three new proposed features: perceptual minimum variance distortionless response (PMVDR), smoothed zero-crossing rate (SZCR), and filterbank log energy coefficients (FBLC), in 14 noisy environments to determine the features that are most robust on average across these conditions. Next, a new distance metric, T2-mean, is proposed, which is intended to improve segmentation for short segment turns (i.e., 1-5 s). A new false alarm compensation procedure is implemented, which can reduce the false alarm rate significantly with little cost to the miss rate. Evaluations on a standard data set, the Defense Advanced Research Projects Agency (DARPA) Hub4 Broadcast News 1997 evaluation data, show that the WGN classification algorithm achieves over a 50% improvement versus the GMM network baseline algorithm, and the proposed compound segmentation algorithm achieves a 23%-10% improvement in all metrics versus the baseline Mel-frequency cepstral coefficients (MFCC) and traditional Bayesian information criterion (BIC) algorithm. The new classification and segmentation algorithms also obtain very satisfactory results on the more diverse and challenging National Gallery of the Spoken Word (NGSW) corpus.
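
    The two extended-time preclassification features can be sketched directly from framed audio; the exact window sizes and normalizations used in the paper may differ from this illustrative version.

```python
import numpy as np

def vzcr_vsf(frames, eps=1e-12):
    """Variance of the zero-crossing rate (VZCR) and variance of the
    spectrum flux (VSF) over a block of frames (n_frames x frame_len)."""
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    spec = spec / (spec.sum(axis=1, keepdims=True) + eps)   # per-frame norm
    flux = np.sum(np.diff(spec, axis=0) ** 2, axis=1)       # frame-to-frame
    return np.var(zcr), np.var(flux)
```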

  • Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations

    Page(s): 920 - 930

    We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a support vector machine (SVM). Generalization of the system to signals in high levels of additive noise and reverberation is evaluated and compared to two existing approaches (Scheirer and Slaney, 2002, and Kingsbury et al., 2002). The results demonstrate the advantages of the auditory model over the other two systems, especially at low signal-to-noise ratios (SNRs) and high reverberation.

  • Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model

    Page(s): 931 - 940

    In this paper, a text-independent automatic speaker recognition (ASkR) system, SRHurst, is proposed, which employs a new speech feature and a new classifier. The statistical feature pH is a vector of Hurst (H) parameters obtained by applying a wavelet-based multidimensional estimator (M_dim_wavelets) to windowed short-time segments of speech. The proposed classifier for the speaker identification and verification tasks is based on the multidimensional fBm (fractional Brownian motion) model, denoted M_dim_fBm. For a given sequence of input speech features, the speaker model is obtained from the sequence of vectors of H parameters, means, and variances of these features. The performance of SRHurst was compared to that achieved with Gaussian mixture model (GMM), autoregressive vector (AR), and Bhattacharyya distance (dB) classifiers. The speech database, recorded over fixed and cellular phone channels, was uttered by 75 different speakers. The results show the superior performance of the M_dim_fBm classifier and that the pH feature aggregates new information on the speaker identity. In addition, the proposed classifier employs a much simpler modeling structure than the GMM.
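
    The pH feature is a vector of Hurst exponents estimated per short-time segment. The paper uses a wavelet-based multidimensional estimator; the simpler aggregated-variance estimator below is only meant to illustrate what a single H estimate involves and is not the authors' method.

```python
import numpy as np

def hurst_aggvar(x, block_sizes=(4, 8, 16, 32, 64)):
    """Estimate the Hurst parameter H of one signal segment with the
    aggregated-variance method: for the block-averaged series X^(m),
    Var(X^(m)) ~ m^(2H - 2).  Sketch only; not the wavelet estimator."""
    x = np.asarray(x, dtype=float)
    log_m, log_v = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        if n_blocks < 2:
            continue
        blocks = x[:n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_v.append(np.log(np.var(blocks)))
    slope = np.polyfit(log_m, log_v, 1)[0]   # slope = 2H - 2
    return 1.0 + slope / 2.0
```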

  • Computer-assisted translation using speech recognition

    Page(s): 941 - 951

    Current machine translation systems are far from being perfect. However, such systems can be used in computer-assisted translation to increase the productivity of the (human) translation process. The idea is to use a text-to-text translation system to produce portions of target language text that can be accepted or amended by a human translator using text or speech. These user-validated portions are then used by the text-to-text translation system to produce further, hopefully improved suggestions. There are different alternatives for using speech in a computer-assisted translation system: from pure dictated translation to simple determination of acceptable partial translations by reading parts of the suggestions made by the system. In all cases, information from the text to be translated can be used to constrain the speech decoding search space. While pure dictation seems to be among the most attractive settings, perfect speech decoding unfortunately does not seem possible with current speech processing technology, and human error correction would still be required. Therefore, approaches that allow for higher speech recognition accuracy by using increasingly constrained models in the speech recognition process are explored here. All these approaches are presented within the statistical framework. Empirical results support the potential usefulness of using speech within the computer-assisted translation paradigm.

  • Nonparallel training for voice conversion based on a parameter adaptation approach

    Page(s): 952 - 963

    The objective of voice conversion algorithms is to modify the speech of a particular source speaker so that it sounds as if spoken by a different target speaker. Current conversion algorithms employ a training procedure, during which the same utterances spoken by both the source and target speakers are needed for deriving the desired conversion parameters. Such a (parallel) corpus is often difficult or impossible to collect. Here, we propose an algorithm that relaxes this constraint, i.e., the training corpus does not necessarily contain the same utterances from both speakers. The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair for which only a nonparallel corpus is available. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by as much as 30%. A speaker identification measure is also employed that more insightfully portrays the importance of adaptation, while listening tests confirm the success of our method. Both the objective and subjective tests employed demonstrate that the proposed algorithm achieves results comparable to the ideal case in which a parallel corpus is available.

  • Waveguide physical modeling of vocal tract acoustics: flexible formant bandwidth control from increased model dimensionality

    Page(s): 964 - 971

    Digital waveguide physical modeling is often used as an efficient representation of acoustical resonators such as the human vocal tract. Building on the basic one-dimensional (1-D) Kelly-Lochbaum tract model, various speech synthesis techniques demonstrate improvements to the wave scattering mechanisms in order to better approximate wave propagation in the complex vocal system. Some of these techniques are discussed in this paper, with particular reference to an alternative approach in the form of a two-dimensional waveguide mesh model. Emphasis is placed on its ability to produce vowel spectra similar to those present in natural speech, and on how it improves upon the 1-D model. The tract area function is accommodated as model width, rather than translated into acoustic impedance, and as such offers extra control as an additional bounding limit to the model. Results show that the two-dimensional (2-D) model introduces approximately linear control over formant bandwidths, leading to attainable realistic values across a range of vowels. Similarly, the 2-D model allows for the application of theoretical reflection values within the tract, which, when applied to the 1-D model, result in small formant bandwidths and, hence, unnatural-sounding synthesized vowels.
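
    In the 1-D Kelly-Lochbaum model referred to above, the area function enters through per-junction reflection coefficients and a scattering equation at each junction; a minimal sketch (pressure-wave convention, boundary terminations and the 2-D mesh omitted) is:

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions between adjacent tube
    sections of a 1-D Kelly-Lochbaum vocal-tract model (pressure waves)."""
    A = np.asarray(areas, dtype=float)
    return (A[:-1] - A[1:]) / (A[:-1] + A[1:])

def scatter(p_fwd, p_bwd, r):
    """Scattering at one junction: the forward wave from the previous
    section meets the backward wave from the next section."""
    out_fwd = (1.0 + r) * p_fwd - r * p_bwd   # transmitted into next section
    out_bwd = r * p_fwd + (1.0 - r) * p_bwd   # reflected back
    return out_fwd, out_bwd
```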

  • Prosody modification using instants of significant excitation

    Page(s): 972 - 980

    Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness. This paper proposes a method for prosody (pitch and duration) modification using the instants of significant excitation of the vocal tract system during the production of speech. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, such as the onset of a burst, in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of speech signals by using the property of average group delay of minimum phase signals. The modification of pitch and duration is achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified residual is used to excite the time-varying filter, whose parameters are derived from the original speech signal. The perceptual quality of the synthesized speech is good and is without any significant distortion. The proposed method is evaluated using waveforms, spectrograms, and listening tests. The performance of the method is compared with the linear prediction pitch-synchronous overlap-add (LP-PSOLA) method, which is another method for prosody manipulation based on modification of the LP residual. The original and the synthesized speech signals obtained by the proposed method and by the LP-PSOLA method are available for listening at http://speech.cs.iitm.ernet.in/Main/result/prosody.html.
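
    A minimal sketch of the LP-residual computation the method manipulates (autocorrelation method, single frame); the group-delay-based epoch detection and the residual modification itself are not shown, and the window and predictor order are illustrative choices.

```python
import numpy as np

def lp_residual(frame, order=10):
    """LP residual of one speech frame (autocorrelation method); the
    group-delay-based epoch detection itself is not shown."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode='full')[len(x) - 1:]        # autocorrelation
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                  # LP coefficients
    pred = np.convolve(x, np.concatenate(([0.0], a)))[:len(x)]
    return x - pred                                         # excitation estimate
```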

  • MLP-based phone boundary refining for a TTS database

    Page(s): 981 - 989

    The automatic labeling of a large speech corpus plays an important role in the development of a high-quality text-to-speech (TTS) synthesis system. This paper describes a method for the automatic labeling of speech signals, aimed mainly at the construction of a large database for a TTS synthesis system. The main objective of the work is the refinement of an initial estimation of phone boundaries provided by an alignment based on a hidden Markov model. A multilayer perceptron (MLP) was employed to refine the phone boundaries. To increase the accuracy of phoneme segmentation, several specialized MLPs were individually trained based on phonetic transition. The optimum partitioning of the entire phonetic transition space and the corresponding MLPs were constructed from the standpoint of minimizing the overall deviation from the hand-labeled position. The experimental results showed that more than 93% of all phone boundaries have a boundary deviation from the reference position smaller than 20 ms. We also confirmed, based on subjective listening tests, that the database constructed using the proposed method produced results that were perceptually comparable to those of a hand-labeled database.

  • A global, boundary-centric framework for unit selection text-to-speech synthesis

    Page(s): 990 - 997

    The level of quality that can be achieved by modern concatenative text-to-speech synthesis heavily depends on the optimization criteria used in the unit selection process. While effective cost functions arise naturally for prosody assessment, the criteria typically selected to quantify discontinuities in the speech signal do not closely reflect users' perception of the resulting acoustic waveform. This paper introduces an alternative feature extraction paradigm, which eschews general-purpose Fourier analysis in favor of a modal decomposition separately optimized for each boundary region. The ensuing transform framework preserves, by construction, those properties of the waveform which are globally relevant to each concatenation considered. In addition, it leads to a novel discontinuity measure which jointly, albeit implicitly, accounts for both interframe incoherence and discrepancies in formant frequencies/bandwidths. Experimental evaluations are conducted to characterize the behavior of this new metric, first on a contiguity prediction task, and then via a systematic listening comparison using a conventional metric as baseline. The results underscore the viability of the proposed framework in quantifying the perception of discontinuity between acoustic units.


Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research