
IEEE Transactions on Audio, Speech, and Language Processing

Issue 5 • July 2011

Displaying results 1–25 of 40
  • Table of contents

    Publication Year: 2011, Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2011, Page(s): C2
  • A Framework for Automatic Human Emotion Classification Using Emotion Profiles

    Publication Year: 2011, Page(s): 1057 - 1070
    Cited by: Papers (18)

    Automatic recognition of emotion is becoming an increasingly important component in the design process for affect-sensitive human-machine interaction (HMI) systems. Well-designed emotion recognition systems have the potential to augment HMI systems by providing additional user state details and by informing the design of emotionally relevant and emotionally targeted synthetic behavior. This paper describes an emotion classification paradigm based on emotion profiles (EPs). This paradigm is an approach to interpreting the emotional content of naturalistic human expression by providing multiple probabilistic class labels, rather than a single hard label. EPs provide an assessment of the emotion content of an utterance in terms of a set of simple categorical emotions: anger, happiness, neutrality, and sadness. This method can accurately capture the general emotional label (attaining an accuracy of 68.2% in our experiment on the IEMOCAP data) in addition to identifying underlying emotional properties of highly emotionally ambiguous utterances. This capability is beneficial when dealing with naturalistic human emotional expressions, which are often not well described by a single semantic label.

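    A minimal illustrative sketch of the profile idea in Python (assumptions: the four category names from the abstract, hypothetical per-class scores, and a toy normalization; this is not the authors' implementation):

      import numpy as np

      CLASSES = ["anger", "happiness", "neutrality", "sadness"]

      def emotion_profile(scores):
          # Normalize non-negative per-class confidence scores into a probabilistic profile.
          scores = np.asarray(scores, dtype=float)
          return scores / scores.sum()

      def hard_label(profile):
          # Collapsing to a single label discards the ambiguity the profile captures.
          return CLASSES[int(np.argmax(profile))]

      ep = emotion_profile([0.2, 0.1, 0.4, 0.3])   # hypothetical scores for an ambiguous utterance
      print(dict(zip(CLASSES, np.round(ep, 2))), hard_label(ep))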
  • Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis

    Publication Year: 2011, Page(s): 1071 - 1079
    Cited by: Papers (15)

    The modeling of fundamental frequency, or F0, in HMM-based speech synthesis is a critical factor in delivering speech that is natural and accurately conveys all of the many nuances of the message. However, F0 modeling is difficult because F0 values are normally considered to depend on a binary voicing decision such that they are continuous in voiced regions and undefined in unvoiced regions. F0 is therefore a discontinuous function of time. The multi-space probability distribution HMM (MSDHMM) is a widely used solution to this problem. The MSDHMM essentially uses a joint distribution of discrete voicing labels and the discontinuous F0 observations. However, due to the discontinuity assumption, the MSDHMM provides a rather weak F0 trajectory model. In this paper, F0 is viewed as a continuous function of time; this is achieved by assuming that F0 can be observed within unvoiced regions as well as voiced regions. This provides a continuous F0 data stream which can be modeled by standard HMMs. Voicing labels are modeled either implicitly or explicitly in order to perform voicing classification, and a globally tied distribution (GTD) technique is used to achieve robust F0 estimation. Both objective measures and subjective listening tests demonstrate that continuous F0 modeling yields better synthesized F0 trajectories and significant improvements to the naturalness of synthesized speech compared to the MSDHMM.

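    A minimal sketch of the continuity idea (assuming simple linear interpolation across unvoiced frames; the paper's actual observation model and GTD technique are not reproduced here):

      import numpy as np

      def continuous_f0(f0, voiced):
          # f0: per-frame F0 in Hz, undefined (here 0) where voiced is False.
          f0 = np.asarray(f0, dtype=float)
          voiced = np.asarray(voiced, dtype=bool)
          frames = np.arange(len(f0))
          # Interpolating across unvoiced gaps yields a continuous F0 stream;
          # the voicing labels are kept and modeled separately.
          return np.interp(frames, frames[voiced], f0[voiced]), voiced

      f0 = np.array([120.0, 118.0, 0.0, 0.0, 110.0, 112.0])
      print(continuous_f0(f0, f0 > 0)[0])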
  • Phase Minimization for Glottal Model Estimation

    Publication Year: 2011, Page(s): 1080 - 1090
    Cited by: Papers (10)

    In glottal source analysis, the phase minimization criterion has already been proposed to detect excitation instants. As shown in this paper, this criterion can also be used to estimate the shape parameter of a glottal model (e.g., the Liljencrants-Fant model), and not only its time position. Additionally, we show that the shape parameter can be estimated independently of the glottal model position. The reliability of the proposed methods is evaluated on synthetic signals and compared to that of the IAIF and minimum/maximum-phase decomposition methods. The results of the methods are evaluated according to the influence of the fundamental frequency and noise. The estimation of a glottal model is useful for the separation of the glottal source and the vocal-tract filter and can therefore be applied in voice transformation and synthesis, as well as in clinical contexts or in the study of voice production.

  • HMM-Based Multipitch Tracking for Noisy and Reverberant Speech

    Publication Year: 2011, Page(s): 1091 - 1102
    Cited by: Papers (18)

    Multipitch tracking in real environments is critical for speech signal processing. Determining pitch in reverberant and noisy speech is a particularly challenging task. In this paper, we propose a robust algorithm for multipitch tracking in the presence of both background noise and room reverberation. An auditory front-end and a new channel selection method are utilized to extract periodicity features. We derive pitch scores for each pitch state, which estimate the likelihoods of the observed periodicity features given pitch candidates. A hidden Markov model integrates these pitch scores and searches for the best pitch state sequence. Our algorithm can reliably detect single and double pitch contours in noisy and reverberant conditions. Quantitative evaluations show that our approach outperforms existing ones, particularly in reverberant conditions.

  • A Data-Driven Affective Analysis Framework Toward Naturally Expressive Speech Synthesis

    Publication Year: 2011, Page(s): 1113 - 1122
    Cited by: Papers (2)

    An essential step in the generation of expressive speech synthesis is the automatic detection and classification of emotions most likely to be present in textual input. Though increasingly data-driven, emotion analysis still relies on critical expert knowledge in order to isolate the emotional keywords or keysets necessary to the construction of affective categories. This makes it vulnerable to any discrepancy between the ensuing taxonomy of affective states and the underlying domain of discourse. This paper proposes a more general framework, latent affective mapping, which exploits two separate levels of semantic information: the first one encapsulates the foundations of the domain considered, while the second one specifically accounts for the overall affective fabric of the language. Exposing the emergent relationship between these two levels advantageously steers the emotion classification process. Empirical evidence suggests that this approach is effective for automatic emotion analysis in text. This bodes well for its deployability toward naturally expressive speech synthesis.

  • On the Relationship Between Bayes Risk and Word Error Rate in ASR

    Publication Year: 2011, Page(s): 1103 - 1112
    Cited by: Papers (2)

    Recently, a number of (approximate) approaches have emerged in speech processing that try to overcome the known mismatch between symbol-level evaluation measures (e.g., word error rate) and the standard string-level (symbol sequence) cost (e.g., sentence error) underlying the Bayes decision rule, by using symbol-level cost functions in the Bayes decision rule. Nevertheless, experiments show that for a majority of test samples both decision rules still give equal decisions, especially at lower error rates. In this paper, analytic evidence for these observations is provided. A set of conditions is presented for which the Bayes decision rule with symbol-level and string-level cost functions leads to the same decisions. Furthermore, the case of word error cost represented by the Levenshtein (edit) distance is investigated, which among others covers the important case of speech recognition. A Hamming distance-based upper bound to the Levenshtein cost function is discussed. This cost function relates to earlier word-posterior-based decision rules, and the corresponding efficient decision rule is shown to be strongly related to the Bayes decision rule with the Levenshtein cost. The analytic results are verified experimentally, and their quantitative effect is studied by experiments on four different well-known large-vocabulary automatic speech recognition tasks.

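    As a small illustration of the Hamming-based bound mentioned above (a standard property, not taken from the paper): for two word sequences of equal length, the Levenshtein distance never exceeds the Hamming distance, since position-wise substitutions already form one valid edit script.

      # Illustration: Hamming distance as an upper bound on Levenshtein distance
      # for equal-length word sequences.
      def levenshtein(a, b):
          d = list(range(len(b) + 1))
          for i, x in enumerate(a, 1):
              prev, d[0] = d[0], i
              for j, y in enumerate(b, 1):
                  prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
          return d[-1]

      def hamming(a, b):
          return sum(x != y for x, y in zip(a, b))

      ref = "the cat sat on the mat".split()
      hyp = "the cat sat on a mat".split()
      print(levenshtein(ref, hyp), hamming(ref, hyp))    # 1 1
      hyp2 = "cat sat on the mat the".split()            # cyclic shift of ref
      print(levenshtein(ref, hyp2), hamming(ref, hyp2))  # 2 6  (bound is loose here)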
  • Estimators of the Magnitude-Squared Spectrum and Methods for Incorporating SNR Uncertainty

    Publication Year: 2011, Page(s): 1123 - 1137
    Cited by: Papers (5)

    Statistical estimators of the magnitude-squared spectrum are derived based on the assumption that the magnitude-squared spectrum of the noisy speech signal can be computed as the sum of the (clean) signal and noise magnitude-squared spectra. Maximum a posteriori (MAP) and minimum mean square error (MMSE) estimators are derived based on a Gaussian statistical model. The gain function of the MAP estimator was found to be identical to the gain function used in the ideal binary mask (IdBM) that is widely used in computational auditory scene analysis (CASA). As such, it was binary and assumed the value of 1 if the local signal-to-noise ratio (SNR) exceeded 0 dB, and the value of 0 otherwise. By modeling the local instantaneous SNR as an F-distributed random variable, soft masking methods incorporating SNR uncertainty were derived. In particular, the soft masking method that weighted the noisy magnitude-squared spectrum by the a priori probability that the local SNR exceeds 0 dB was shown to be identical to the Wiener gain function. Results indicated that the proposed estimators yielded significantly better speech quality than the conventional minimum mean square error spectral power estimators, in terms of lower residual noise and lower speech distortion.

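    A small sketch of the two gain rules referred to above, in their textbook forms (the estimators actually derived in the paper are not reproduced; snr denotes the local SNR on a linear scale):

      import numpy as np

      def binary_mask_gain(snr):
          # Pass the noisy spectrum only where the local SNR exceeds 0 dB (i.e., snr > 1).
          return (snr > 1.0).astype(float)

      def wiener_gain(snr):
          # Wiener-type soft gain: snr / (1 + snr).
          return snr / (1.0 + snr)

      snr = np.array([0.25, 1.0, 4.0])   # about -6 dB, 0 dB, +6 dB
      print(binary_mask_gain(snr), wiener_gain(snr))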
  • Equivalence of Generative and Log-Linear Models

    Publication Year: 2011, Page(s): 1138 - 1148
    Cited by: Papers (8)

    Conventional speech recognition systems are based on hidden Markov models (HMMs) with Gaussian mixture models (GHMMs). Discriminative log-linear models are an alternative modeling approach and have been investigated recently in speech recognition. GHMMs are directed models with constraints, e.g., positivity of variances and normalization of conditional probabilities, while log-linear models do not use such constraints. This paper compares the posterior form of typical generative models related to speech recognition with their log-linear model counterparts. The key result is the derivation of the equivalence of these two different approaches under weak assumptions. In particular, we study Gaussian mixture models, part-of-speech bigram tagging models, and eventually the GHMMs. This result unifies two important but fundamentally different modeling paradigms in speech recognition on the functional level. Furthermore, this paper presents comparative experimental results for various speech tasks of different complexity, including a digit string task and large-vocabulary continuous speech recognition tasks.

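    As an illustration of the kind of functional equivalence meant here (a standard identity, assumed rather than quoted from the paper): for class-conditional Gaussians with a shared covariance \(\Sigma\), the class posterior already has log-linear (softmax) form,

      \[
      p(c \mid x)
      = \frac{p(c)\,\mathcal{N}(x;\mu_c,\Sigma)}{\sum_{c'} p(c')\,\mathcal{N}(x;\mu_{c'},\Sigma)}
      = \frac{\exp\bigl(\lambda_c^{\top} x + \lambda_{c,0}\bigr)}{\sum_{c'} \exp\bigl(\lambda_{c'}^{\top} x + \lambda_{c',0}\bigr)},
      \qquad
      \lambda_c = \Sigma^{-1}\mu_c,\quad
      \lambda_{c,0} = -\tfrac{1}{2}\mu_c^{\top}\Sigma^{-1}\mu_c + \log p(c).
      \]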
  • Three-Dimensional Sound Field Reproduction Using Multiple Circular Loudspeaker Arrays

    Publication Year: 2011, Page(s): 1149 - 1159
    Cited by: Papers (11)

    Three-dimensional spatial sound field reproduction enables an enhanced immersive acoustic experience for a listener. Recreating an arbitrary 3-D spatial sound field using a practically realizable array of loudspeakers is a challenging problem in acoustic signal processing. This paper exploits the underlying characteristics of wavefield propagation to devise a strategy for accurate 3-D sound field reproduction inside a 3-D region of interest with practical array geometries. Specifically, we use the properties of the associated Legendre functions and the spherical Hankel functions, which are part of the solution to the wave equation in spherical coordinates, for loudspeaker placement on a set of multiple circular arrays and provide a technique for spherical harmonic mode-selection to control the reproduced sound field. We also analyze the artifacts of spatial aliasing due to the use of discrete loudspeaker arrays in the region of interest. As an illustration, we design a third-order reproduction system to operate at a frequency of 500 Hz with 18 loudspeakers arranged in a practically realizable configuration.

  • Using N-Best Lists and Confusion Networks for Meeting Summarization

    Publication Year: 2011, Page(s): 1160 - 1169
    Cited by: Papers (1)

    Incorrect speech recognition results usually have a negative impact on the speech summarization task, especially in the meeting domain, where the word error rate is often higher than in other speech genres. In this paper, we investigate using rich speech recognition results to improve meeting summarization performance. Two kinds of structures are considered: n-best hypotheses and confusion networks. We develop methods to utilize multiple word and sentence candidates and their recognition confidence for summarization under an unsupervised framework. Our experimental results on the ICSI meeting corpus show that our proposed method can significantly improve summarization performance over using 1-best recognition output, evaluated by both ROUGE-1 and ROUGE-2 scores. We also find that if the task is to generate speech summaries or identify salient segments, using rich speech recognition output is just as effective as using human transcripts. In addition, we discuss the difference between n-best lists and confusion networks, and analyze the word error rate in the extracted summary sentences.

  • The Effect of Spectral Estimation on Speech Enhancement Performance

    Publication Year: 2011, Page(s): 1170 - 1179
    Cited by: Papers (3)

    It has long been observed that accuracy in spectral estimation greatly affects the quality of enhanced speech. A small decrease in the bias and variance of the estimator can greatly reduce the amount of residual noise and distortion in the recovered speech. To date, however, there has been little interest in a rigorous analysis quantifying such observations. In this paper, we analyze the effect of spectral estimate variance on enhanced speech as measured by quantitative and qualitative means. The performance analysis is derived for the signal subspace and the minimum mean square error short-time spectral amplitude estimators. Error is defined as the random function of frequency given by the difference between the estimated and the true power spectral density (PSD) functions. It is measured by its variance as a fraction of the squared clean-speech PSD, a norm called the variance quality factor (VQF). The error VQF is derived in terms of the VQF of measurable quantities such as noisy speech and noise alone. It is shown that reducing the PSD estimate variance significantly reduces the VQF of the enhancement error. We provide analytical derivations to establish the results and accompanying simulations to confirm the theoretical analysis. The simulations test the periodogram, Blackman-Tukey, Bartlett-Welch, and multitaper spectral estimation methods.

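    In the notation suggested by the abstract (symbols assumed here), the estimation error is \(\Delta P(\omega) = \hat{P}(\omega) - P(\omega)\), and the VQF normalizes its variance by the squared clean-speech PSD:

      \[
      \mathrm{VQF}(\omega) = \frac{\operatorname{Var}\{\hat{P}(\omega) - P(\omega)\}}{P^{2}(\omega)}.
      \]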
  • Estimation of Articulatory Trajectories Based on Gaussian Mixture Model (GMM) With Audio-Visual Information Fusion and Dynamic Kalman Smoothing

    Publication Year: 2011, Page(s): 1180 - 1195
    Cited by: Papers (1)

    This paper presents a detailed framework for Gaussian mixture model (GMM)-based articulatory inversion equipped with special postprocessing smoothers, and with the capability to perform audio-visual information fusion. The effects of different acoustic features on the GMM inversion performance are investigated, and it is shown that the integration of various types of acoustic (and visual) features improves the performance of the articulatory inversion process. Dynamic Kalman smoothers are proposed to adapt the cutoff frequency of the smoother to data and noise characteristics; Kalman smoothers also enable the incorporation of auxiliary information such as phonetic transcriptions to improve articulatory estimation. Two types of dynamic Kalman smoothers are introduced: global Kalman (GK) and phoneme-based Kalman (PBK). The same dynamic model is used for all phonemes in the GK smoother; it is shown that GK improves the performance of articulatory inversion more than the conventional low-pass (LP) smoother. However, the PBK smoother, which uses one dynamic model for each phoneme, gives significantly better results than the GK smoother. Different methodologies to fuse the audio and visual information are examined. A novel modified late fusion algorithm, designed to consider the observability degree of the articulators, is shown to give better results than either the early or the late fusion methods. Extensive experimental studies are conducted with the MOCHA database to illustrate the performance gains obtained by the proposed algorithms. The average RMS error and correlation coefficient between the true (measured) and the estimated articulatory trajectories are 1.227 mm and 0.868 using audiovisual information fusion and GK smoothing, and 1.199 mm and 0.876 using audiovisual information fusion together with PBK smoothing based on a phonetic transcription of the utterance.

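    The core GMM mapping step underlying such inversion is the standard GMM conditional-expectation (regression) estimator, stated here with assumed notation (the Kalman smoothers and the fusion scheme operate on top of this):

      \[
      \hat{y}(x) = \sum_{k=1}^{K} P(k \mid x)\,\Bigl(\mu_{y}^{(k)} + \Sigma_{yx}^{(k)}\bigl(\Sigma_{xx}^{(k)}\bigr)^{-1}\bigl(x - \mu_{x}^{(k)}\bigr)\Bigr),
      \]

    where x is the acoustic (and visual) feature vector, y the articulatory vector, and P(k | x) the posterior probability of mixture component k.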
  • Background Music Removal Based on Cepstrum Transformation for Popular Singer Identification

    Publication Year: 2011, Page(s): 1196 - 1205
    Cited by: Papers (6)

    One major challenge in identifying singers in popular music recordings lies in how to reduce the interference of the background accompaniment when trying to characterize the singer's voice. Although a number of studies on automatic singer identification (SID) from acoustic features have been reported, most systems to date do not explicitly deal with the background accompaniment. This study proposes a background accompaniment removal approach for SID by exploiting the underlying relationships between solo singing voices and their accompanied versions in the cepstrum. The relationships are characterized by a transformation estimated using a large set of accompanied singing generated by manually mixing solo singing with accompaniments extracted from karaoke VCDs. Such a transformation reflects the cepstrum variations of a singing voice before and after accompaniment is added. When an unknown accompanied voice is presented to our system, the transformation is applied to convert the cepstrum of the accompanied voice into a solo-voice-like one. Our experiments show that such a background removal approach improves SID accuracy significantly, even when a test music recording involves a sung language not covered in the data used for estimating the transformation.

  • Efficient MMSE Estimation and Uncertainty Processing for Multienvironment Robust Speech Recognition

    Publication Year: 2011, Page(s): 1206 - 1220
    Cited by: Papers (5)

    This paper presents a feature compensation framework based on minimum mean square error (MMSE) estimation and stereo training data for robust speech recognition. In our proposal, we model the clean and noisy feature spaces in order to obtain clean feature estimates. However, unlike other well-known MMSE compensation methods such as SPLICE or MEMLIN, which model those spaces with Gaussian mixture models (GMMs), in our case every feature space is characterized by a set of prototype vectors which can alternatively be considered a vector quantization (VQ) codebook. The discrete nature of this feature space characterization introduces two significant advantages. First, it allows the implementation of a very efficient MMSE estimator in terms of accuracy and computational cost. Second, time correlations can be exploited by means of hidden Markov modeling (HMM). In addition, a novel subregion-based modeling is applied in order to accurately represent the transformation between the clean and noisy domains. In order to deal with unknown environments, a multiple-model approach is also explored. Since this approach has been shown to be quite sensitive to incorrect environment classification, we adapt two uncertainty processing techniques, soft-data decoding and exponential weighting, to our estimation framework. As a result, environment misclassifications are concealed, allowing better performance under unknown environments. The experimental results on noisy digit recognition show a relative improvement of 87.93% in word accuracy over the baseline when clean acoustic models are used, while 4.54% is achieved with multi-style trained models.

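    A minimal sketch of the prototype-based MMSE idea (the Gaussian-kernel posterior, the variable names, and the toy stereo codebooks below are assumptions, not the paper's exact estimator):

      import numpy as np

      def mmse_estimate(y, noisy_protos, clean_protos, sigma=1.0):
          # y: noisy feature vector; *_protos: (K, D) codebooks trained on stereo data.
          d2 = np.sum((noisy_protos - y) ** 2, axis=1)
          w = np.exp(-0.5 * d2 / sigma**2)
          w /= w.sum()                    # posterior weight of each prototype given y
          return w @ clean_protos         # posterior-weighted clean estimate E[x | y]

      rng = np.random.default_rng(0)
      noisy = rng.normal(size=(8, 3))
      clean = noisy - 0.5                 # toy paired (stereo) codebooks
      print(mmse_estimate(noisy[2] + 0.1, noisy, clean))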
  • On Properties, Relations, and Simplified Implementation of Filter Banks in the Dolby Digital (Plus) AC-3 Audio Coding Standards

    Publication Year: 2011, Page(s): 1231 - 1241
    Cited by: Papers (2)

    The Dolby Digital (Plus) AC-3 audio coding standards are currently key enabling technologies for high-quality compression and decompression of digital audio signals. These standards have adopted the modified discrete cosine transform (MDCT) for the time/frequency transformation of an audio data block. Besides the long transform, which is the MDCT, AC-3 defines two additional variants of cosine-modulated filter banks, called the first and second short transforms. Based on the matrix representation of the AC-3 filter banks and a systematic investigation of their properties and the relations among their transform matrices, a relation between the frequency coefficients of the long (MDCT) transform and those of the two short transforms is derived. The frequency coefficients of the two short transforms in AC-3 can be obtained simply from the given frequency coefficients of the long (MDCT) transform via a conversion matrix. Since the conversion matrix, after proper scaling, is an orthonormal matrix with a very regular structure, the frequency coefficients of the short transforms can be converted to the frequency coefficients of the long (MDCT) transform via the transposed conversion matrix. Consequently, the current implementation of the AC-3 filter banks for the time/frequency transformation of an audio data block can be simplified in the encoder, and only the forward long (MDCT) transform computation is required. Further, it is shown that a simple relation exists between the time-domain aliased data sequence recovered by the backward long (MDCT) transform and those of the two short transforms. Consequently, the current implementation of the AC-3 filter banks can also be simplified in the decoder, and only the backward long (MDCT) transform computation is required. Thus, the computation of the two short transforms in both the AC-3 encoder and decoder can be completely eliminated. Moreover, due to the existence of both relations between transform coefficients and time-domain aliased data sequences, the conversion matrix in the AC-3 encoder and decoder may not be needed at all.

  • Time-Domain Implementation of Broadband Beamformer in Spherical Harmonics Domain

    Publication Year: 2011, Page(s): 1221 - 1230
    Cited by: Papers (4)

    Most of the existing spherical array modal beamformers are implemented in the frequency domain, where a block of snapshots is required to perform the discrete Fourier transform. In this paper, an approach to real-valued time-domain implementation of the modal beamformer for broadband spherical microphone arrays is presented. The microphone array data are converted to the spherical harmonics domain by using the discrete spherical Fourier transform, and then steered to the look direction followed by the pattern generation unit implemented using the filter-and-sum structure. We derive the expression for the array response, the beamformer output power against both isotropic noise and spatially white noise, and the mainlobe spatial response variation in terms of the finite impulse response (FIR) filters' tap weights. A multiple-constraint problem is then formulated to find the filters' tap weights with the aim of providing a suitable tradeoff among multiple conflicting performance measures such as directivity index, robustness, sidelobe level, and mainlobe response variation. Simulation and experimental results show good performance of the proposed time-domain broadband modal beamforming approach.

  • Spectral and Temporal Periodicity Representations of Rhythm for the Automatic Classification of Music Audio Signal

    Publication Year: 2011, Page(s): 1242 - 1252
    Cited by: Papers (4)

    In this paper, we study spectral and temporal periodicity representations that can be used to describe the characteristics of the rhythm of a music audio signal. A continuous-valued energy function representing the onset positions over time is first extracted from the audio signal. From this function, we compute at each time a vector that represents the characteristics of the local rhythm. Four feature sets are studied for this vector. They are derived from the amplitude of the discrete Fourier transform (DFT), the auto-correlation function (ACF), the product of the DFT and the ACF interpolated on a hybrid lag/frequency axis, and the concatenated DFT and ACF coefficients. The vectors are then sampled at specific frequencies, which represent various ratios of the local tempo. The ability of these periodicity representations to describe the rhythm characteristics of an audio item is evaluated through a classification task, in which we test the periodicity representations alone, combined with tempo information, and combined with a proposed set of rhythm features. The evaluation is performed using both annotated and estimated tempo. We show that such simple periodicity representations achieve high recognition rates, at least comparable to previously published results.

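    A small sketch of two of the periodicity representations named above, computed from a toy onset-energy function (all signal details here are assumptions made for illustration):

      import numpy as np

      fps = 100                                   # frames per second of the onset function
      t = np.arange(0, 8, 1 / fps)
      onset_energy = np.maximum(0, np.sin(2 * np.pi * 2.0 * t)) ** 4   # ~120 BPM pulse train

      dft_mag = np.abs(np.fft.rfft(onset_energy))                  # spectral periodicity
      acf = np.correlate(onset_energy, onset_energy, mode="full")  # temporal periodicity
      acf = acf[acf.size // 2:] / acf.max()

      lag = int(np.argmax(acf[20:])) + 20         # strongest lag away from zero
      print("dominant period ~", lag / fps, "s (about 120 BPM)")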
  • Burst Onset Landmark Detection and Its Application to Speech Recognition

    Publication Year: 2011, Page(s): 1253 - 1264

    The reliable detection of salient acoustic-phonetic cues in the speech signal plays an important role in speech recognition based on speech landmarks. Once speech landmarks are located, not only can phone recognition be performed, but other useful information can also be derived. This paper focuses on the detection of burst onset landmarks, which are crucial to the recognition of stop and affricate consonants. The proposed detector is based purely on a random forest, an ensemble of tree-structured classifiers. Using a special asymmetric bootstrapping method, a series of experiments conducted on the TIMIT database demonstrates that the proposed detector is an efficient and accurate method for detecting burst onsets. When the detection results are appended to mel-frequency cepstral coefficient vectors, the augmented feature vectors enhance the recognition correctness of hidden Markov models in recognizing stop and affricate consonants in continuous speech.

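    A generic sketch of a random-forest frame detector (scikit-learn is assumed here, and class weighting merely stands in for, and is not, the paper's asymmetric bootstrapping):

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 13))             # toy per-frame acoustic features
      y = (rng.random(2000) < 0.05).astype(int)   # burst onsets are rare (imbalanced classes)

      clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
      clf.fit(X, y)
      onset_prob = clf.predict_proba(X[:5])[:, 1]  # per-frame burst-onset probability
      print(onset_prob)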
  • New Results on Single-Channel Speech Separation Using Sinusoidal Modeling

    Publication Year: 2011, Page(s): 1265 - 1277
    Cited by: Papers (3)

    We present new results on single-channel speech separation and suggest a new separation approach to improve the speech quality of separated signals from an observed mixture. The key idea is to derive a mixture estimator based on sinusoidal parameters. The proposed estimator is aimed at finding sinusoidal parameters in the form of codevectors from vector quantization (VQ) codebooks pre-trained for speakers that, when combined, best fit the observed mixed signal. The selected codevectors are then used to reconstruct the recovered signals for the speakers in the mixture. Compared to the log-max mixture estimator used in binary masks and the Wiener filtering approach, it is observed that the proposed method achieves an acceptable perceptual speech quality with less cross-talk at different signal-to-signal ratios. Moreover, the method is independent of pitch estimates and reduces the computational complexity of the separation by replacing the short-time Fourier transform (STFT) feature vectors of high dimensionality with sinusoidal feature vectors. We report separation results for the proposed method and compare them with other benchmark methods. The improvements made by applying the proposed method over other methods are confirmed by employing perceptual evaluation of speech quality (PESQ) as an objective measure and a MUSHRA listening test as a subjective evaluation for both speaker-dependent and gender-dependent scenarios.

  • A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units

    Publication Year: 2011, Page(s): 1278 - 1288
    Cited by: Papers (2)

    Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech feature segments selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces lower-quality speech than CTTS in terms of naturalness, as it often sounds muffled. The muffling effect is due to over-smoothing of model-generated speech features. In order to gain from the advantages of each of the two approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interleaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.

  • Speaker Clustering Using Decision Tree-Based Phone Cluster Models With Multi-Space Probability Distributions

    Publication Year: 2011, Page(s): 1289 - 1300
    Cited by: Papers (8)

    This paper presents an approach to speaker clustering using decision tree-based phone cluster models (DT-PCMs). In this approach, phone clustering is first applied to construct universal phone cluster models that accommodate acoustic characteristics from different speakers. Since pitch features are highly speaker-related and beneficial for speaker identification, decision trees based on multi-space probability distributions (MSDs), which can model both pitch and cepstral features for voiced and unvoiced speech simultaneously, are constructed. In speaker clustering based on DT-PCMs, the contextual, phonetic, and prosodic features of each input speech segment are used to select the speaker-related MSDs from the MSD decision trees to construct the initial phone cluster models. The maximum-likelihood linear regression (MLLR) method is then employed to adapt the initial models to speaker-adapted phone cluster models according to the input speech segment. Finally, the agglomerative clustering algorithm is applied to all speaker-adapted phone cluster models, each representing one input speech segment, for speaker clustering. In addition, an efficient estimation method for phone model merging is proposed for model parameter combination. Experimental results show that the MSD-based DT-PCMs outperform the conventional GMM- and HMM-based approaches for speaker clustering on the RT09 tasks.

  • Towards a New Reference Impairment System in the Subjective Evaluation of Speech Codecs

    Publication Year: 2011, Page(s): 1301 - 1315

    For the subjective assessment of speech quality in codecs, a reference impairment system is required to introduce controlled degradations to calibrate subjective evaluations. A reference system provides a convenient means of making meaningful comparisons between subjective test results across laboratories and can be viewed as a scale onto which mean opinion scores are projected, this scale being supposed to cover the whole range of quality. However, the standardized anchor systems no longer match the degradations introduced by present-day codecs. This paper aims at offering new reference signals simulating the defects of codecs currently used on telecommunication networks. Twenty wideband codecs are compared through dissimilarity tests. A multidimensional scaling technique allows us to define a four-dimensional perceptual space that appears stable across male and female talkers. A verbalization task suggests qualifying the degradations perceived by the listeners with the following attributes: muffling, background noise, noise on speech, and hiss, each conveyed by one dimension. These dimensions are correlated with objective measures such as the spectral centroid, the energy in the silent parts of the high-frequency sub-band, the ratio of brightness between the deterministic and residual parts of the signal, and the spectral correlation coefficient. New reference signals are produced, and a validation phase yields a perceptual space quite consistent with the original one.

  • A Regularized Maximum Figure-of-Merit (rMFoM) Approach to Supervised and Semi-Supervised Learning

    Publication Year: 2011, Page(s): 1316 - 1327
    Cited by: Papers (1)

    We propose a regularized extension to supervised maximum figure-of-merit learning to improve its generalization capability and successfully extend it to semi-supervised learning. The proposed method can be used to approximate any objective function consisting of the commonly used performance metrics. We first derive detailed learning algorithms for supervised learning problems and then extend them to more general semi-supervised scenarios, where only a small part of the training data is labeled. The effectiveness of the proposed approach is demonstrated by several text categorization experiments on different datasets. The novelty of this paper lies in several aspects: 1) Tikhonov regularization is used to alleviate potential overfitting of the maximum figure-of-merit criteria; 2) the regularized maximum figure-of-merit algorithm is successfully extended to semi-supervised learning tasks; and 3) the proposed approach has good scalability to large-scale applications.

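    The general shape of such a regularized criterion, with notation assumed here rather than taken from the paper: a Tikhonov (L2) penalty is added to the figure-of-merit loss to discourage overfitting,

      \[
      \min_{\Lambda}\;\mathcal{L}_{\mathrm{FoM}}(\Lambda;\mathcal{D}) + \lambda\,\lVert\Lambda\rVert_{2}^{2},
      \]

    where \Lambda are the classifier parameters, \mathcal{D} the training data (only partially labeled in the semi-supervised case), and \lambda the regularization weight.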

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research