
IEEE Transactions on Speech and Audio Processing

Issue 4 • July 2005

  • Table of contents

    Publication Year: 2005 , Page(s): c1 - c4
    PDF (42 KB)
    Freely Available from IEEE
  • IEEE Transactions on Speech and Audio Processing publication information

    Publication Year: 2005 , Page(s): c2
    PDF (34 KB)
    Freely Available from IEEE
  • Editorial

    Publication Year: 2005 , Page(s): 457
    PDF (25 KB)
  • Time-domain isolated phoneme classification using reconstructed phase spaces

    Publication Year: 2005 , Page(s): 458 - 466
    Cited by:  Papers (11)
    PDF (648 KB) | HTML

    This paper introduces a novel time-domain approach to modeling and classifying speech phoneme waveforms. The approach is based on statistical models of reconstructed phase spaces, which offer significant theoretical benefits as representations that are known to be topologically equivalent to the state dynamics of the underlying production system. The lag and dimension parameters of the reconstruction process for speech are examined in detail, comparing common estimation heuristics for these parameters with corresponding maximum likelihood recognition accuracy over the TIMIT data set. Overall accuracies are compared with a Mel-frequency cepstral baseline system across five different phonetic classes within TIMIT, and a composite classifier using both cepstral and phase space features is developed. Results indicate that although the accuracy of the phase space approach by itself currently remains below that of baseline cepstral methods, a combined approach is capable of increasing speaker-independent phoneme accuracy.
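    The reconstruction the abstract relies on is a standard time-delay embedding; a minimal NumPy sketch (the function name and the dim/lag defaults are illustrative, not the paper's estimated values):

```python
import numpy as np

def reconstruct_phase_space(x, dim=3, lag=6):
    """Time-delay embedding: map a 1-D signal to points
    (x[n], x[n+lag], ..., x[n+(dim-1)*lag]) in R^dim."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * lag
    if n <= 0:
        raise ValueError("signal too short for this dim/lag")
    # Row i of the result is one point of the reconstructed trajectory.
    return np.stack([x[i * lag : i * lag + n] for i in range(dim)], axis=1)

pts = reconstruct_phase_space(np.sin(np.linspace(0, 20, 200)), dim=3, lag=6)
print(pts.shape)  # (188, 3)
```

    Commonly used heuristics for choosing these parameters include the first zero of the autocorrelation (for the lag) and false nearest neighbors (for the dimension); the abstract compares such heuristics against recognition accuracy directly.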

  • Efficient audio stream segmentation via the combined T2 statistic and Bayesian information criterion

    Publication Year: 2005 , Page(s): 467 - 474
    Cited by:  Papers (40)
    PDF (504 KB) | HTML

    In many speech and audio applications, it is first necessary to partition and classify acoustic events prior to voice coding for communication or speech recognition for spoken document retrieval. In this paper, we propose an efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC). The proposed method extends an earlier formulation by Chen and Gopalakrishnan. In our formulation, Hotelling's T2 statistic is used to pre-select candidate segmentation boundaries, followed by BIC to make the final segmentation decision. The proposed algorithm also incorporates a variable-size increasing window scheme and a skip-frame test. Our experiments show that the final algorithm is faster than Chen and Gopalakrishnan's original method by a factor of 100, while achieving a 6.7% reduction in the acoustic boundary miss rate at the expense of a 5.7% increase in false alarm rate on DARPA Hub4 1997 evaluation data. The approach is particularly successful for short segment turns of less than 2 s in duration. The results suggest that the proposed algorithm is sufficiently effective and efficient for audio stream segmentation applications.
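    The BIC split decision at the core of this family of methods can be sketched as follows, assuming full-covariance Gaussian segment models; the T2 pre-selection stage and the windowing details are omitted, and the code is an illustration rather than the authors' implementation:

```python
import numpy as np

def delta_bic(X, b, lam=1.0):
    """Delta-BIC for a candidate boundary at frame b of feature matrix X
    (frames x dims): positive values favor splitting into two Gaussians."""
    n, d = X.shape
    def logdet_cov(Z):
        sign, ld = np.linalg.slogdet(np.atleast_2d(np.cov(Z, rowvar=False)))
        return ld
    # Likelihood gain from modeling the two halves separately.
    gain = 0.5 * (n * logdet_cov(X)
                  - b * logdet_cov(X[:b])
                  - (n - b) * logdet_cov(X[b:]))
    # BIC penalty for the extra mean and covariance parameters.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, (200, 2))   # segment from one acoustic condition
B = rng.normal(5.0, 1.0, (200, 2))   # clearly different segment
print(delta_bic(np.vstack([A, B]), 200) > 0)  # True: boundary accepted
print(delta_bic(np.vstack([A, A]), 200) > 0)  # False: no change point
```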

  • β-order MMSE spectral amplitude estimation for speech enhancement

    Publication Year: 2005 , Page(s): 475 - 486
    Cited by:  Papers (17)
    PDF (1520 KB) | HTML

    This paper proposes a β-order minimum mean-square error (MMSE) speech enhancement approach for estimating the short-time spectral amplitude (STSA) of a speech signal. We analyze the characteristics of the β-order STSA MMSE estimator and the relation between the value of β and the spectral amplitude gain function of the MMSE method. We further investigate the effectiveness of a range of fixed β values in estimating the STSA based on the MMSE criterion, and discuss how the β value can be adapted using the frame signal-to-noise ratio (SNR). The performance of the proposed speech enhancement approach is then evaluated through spectrogram inspection, objective speech distortion measures, and subjective listening tests using several types of noise sources from the NOISEX-92 database. Evaluation results show that our approach achieves more significant noise reduction and better spectral estimation of weak speech spectral components from a noisy signal than many existing speech enhancement algorithms.

  • Robustness analysis of multichannel Wiener filtering and generalized sidelobe cancellation for multimicrophone noise reduction in hearing aid applications

    Publication Year: 2005 , Page(s): 487 - 503
    Cited by:  Papers (16)
    PDF (784 KB)

    For small-sized arrays such as hearing aids, noise reduction is obtained at the expense of an increased sensitivity to errors in the assumed signal model, such as microphone mismatch, variations in speaker and microphone positions, and reverberation. However, the noise reduction algorithm should still be robust, i.e., insensitive to small signal model errors. In this paper, we evaluate the robustness of the Generalized Sidelobe Canceller (GSC) and a recently developed Multichannel Wiener Filtering (MWF) technique for hearing aid applications, both analytically and experimentally. The analysis reveals that robustness of the GSC is especially crucial in complicated noise scenarios and that microphone mismatch is particularly harmful to the GSC, even when the adaptive noise canceller is adapted during noise only. Hence, a constraint on the noise sensitivity of the GSC is essential, at the expense of less noise reduction. The MWF, on the other hand, is not affected by microphone mismatch and has a potential benefit over the robust GSC with a noise sensitivity constraint. However, the MWF is sensitive to the estimation accuracy of the second-order statistics of speech and noise, so its benefit may be lost in nonstationary noise scenarios.

  • Active learning: theory and applications to automatic speech recognition

    Publication Year: 2005 , Page(s): 504 - 511
    Cited by:  Papers (31)  |  Patents (3)
    PDF (432 KB) | HTML

    We are interested in the problem of adaptive learning in the context of automatic speech recognition (ASR). In this paper, we propose an active learning algorithm for ASR. Automatic speech recognition systems are trained using human supervision to provide transcriptions of speech utterances. The goal of active learning is to minimize the human supervision for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples and then selecting the most informative ones, with respect to a given cost function, for a human to label. In this paper we describe how to estimate the confidence score for each utterance through an on-line algorithm using the lattice output of a speech recognizer. The utterance scores are filtered through the informativeness function, and an optimal subset of training samples is selected. The active learning algorithm has been applied to both batch and on-line learning schemes, and we have experimented with different selective sampling algorithms. Our experiments show that by using active learning the amount of labeled data needed for a given word accuracy can be reduced by more than 60% with respect to random sampling.
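    The selection step reduces to uncertainty sampling: rank utterances by the recognizer's confidence and have a human label only the least confident ones. A minimal sketch (names are illustrative; the paper derives its confidence scores from word lattices):

```python
def select_for_labeling(utterances, confidences, budget):
    """Pick the `budget` utterances the recognizer is least confident
    about; only these are sent to a human transcriber."""
    ranked = sorted(zip(utterances, confidences), key=lambda p: p[1])
    return [utt for utt, _ in ranked[:budget]]

# "b" and "d" have the lowest confidence, so they are labeled first.
print(select_for_labeling(["a", "b", "c", "d"], [0.9, 0.2, 0.7, 0.4], 2))  # ['b', 'd']
```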

  • Maximizing information content in feature extraction

    Publication Year: 2005 , Page(s): 512 - 519
    Cited by:  Papers (11)
    PDF (336 KB) | HTML

    In this paper, we consider the problem of quantifying the amount of information contained in a set of features for discriminating between various classes. We explore these ideas in the context of a speech recognition system, where an important classification sub-problem is to predict the phonetic class given an observed acoustic feature vector. The connection between information content and speech recognition system performance is first explored in the context of various feature extraction schemes used in speech recognition applications. Subsequently, the idea of optimizing the information content to improve recognition accuracy is generalized to a linear projection of the underlying features. We show that several prior methods to compute linear transformations (such as linear/heteroscedastic discriminant analysis) can be interpreted in this general framework of maximizing the information content. We then extend this reasoning and propose a new objective function to maximize a penalized mutual information (pMI) measure. This objective function is seen to be very well correlated with the word error rate of the final system. Finally, experimental results show that the proposed pMI projection consistently outperforms other methods for a variety of cases, leading to relative improvements in the word error rate of 5%-16% over earlier methods.
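    For discrete distributions, the information content in question is the mutual information between class and feature, computable directly from the joint distribution. A small sketch (illustrative only; it is not the paper's penalized pMI objective):

```python
import numpy as np

def mutual_information(joint):
    """MI in nats of a discrete joint distribution p(class, feature)."""
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over classes
    py = p.sum(axis=0, keepdims=True)   # marginal over feature values
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

# A feature that identifies the class perfectly carries the full class
# entropy: log 2 ≈ 0.6931 nats for two equiprobable classes.
print(round(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])), 4))  # 0.6931
```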

  • The use of virtual hypothesis copies in decoding of large-vocabulary continuous speech

    Publication Year: 2005 , Page(s): 520 - 533
    PDF (704 KB) | HTML

    High computational effort hinders widespread deployment of large-vocabulary continuous-speech recognition (LVCSR), for example in home or mobile devices. To this end, we developed a novel approach to LVCSR Viterbi decoding with significantly reduced effort. By a novel search-space organization called virtual hypothesis copies, we eliminate search-space copies that are approximately redundant: 1) Word-lattice generation and (M+1)-gram lattice rescoring are integrated into a single-pass time-synchronous beam search. Hypothesis copying becomes independent of the language-model order. 2) The word-pair approximation is replaced by the novel phone-history approximation (PHA). Tree copies are shared among multiple linguistic histories that end in the same phone(s). 3) Copies of individual tree arcs are shared by recombining within-word hypotheses at phone boundaries according to the PHA. At no loss of accuracy, we achieve a search-space reduction of 60-80% for Mandarin LVCSR, and of 40-50% for English (NAB 64 K). The method is exact under certain model assumptions, and a formal specification is derived. In addition, we propose an extremely effective syllable lookahead for Mandarin. Together with the methods above, the search space was reduced 12-15 times and state likelihood evaluations 4-9 times without significant error increase.

  • Semantic confidence measurement for spoken dialog systems

    Publication Year: 2005 , Page(s): 534 - 545
    Cited by:  Papers (6)  |  Patents (2)
    PDF (640 KB) | HTML

    This paper proposes two methods to incorporate semantic information into word and concept level confidence measurement. The first method uses tag and extension probabilities obtained from a statistical classer and parser. The second method uses a maximum entropy based semantic structured language model to assign probabilities to each word. Incorporation of semantic features into a lattice posterior probability based confidence measure provides significant improvements over the posterior probability alone in an air travel reservation task. At a 5% False Alarm (FA) rate, relative improvements of 28% and 61% in Correct Acceptance (CA) rate are achieved for word level and concept level confidence measurements, respectively.

  • A new verification-based fast-match for large vocabulary continuous speech recognition

    Publication Year: 2005 , Page(s): 546 - 553
    Cited by:  Papers (3)
    PDF (280 KB) | HTML

    Acoustic fast-match is a popular way to accelerate the search in large-vocabulary continuous-speech recognition, where an efficient method is used to identify poorly scoring phonemes and discard them from detailed evaluation. In this paper we view acoustic fast-match as a verification problem and hence develop an efficient likelihood ratio test, similar to other verification scenarios, to perform the fast match. Various aspects of the test, such as the design of alternate hypothesis models and the setting of phoneme look-ahead durations and decision thresholds, are studied, resulting in an efficient implementation. The proposed fast-match is tested in a large-vocabulary speech recognition task, and it is demonstrated that, depending on the decision threshold, it leads to a 20-30% improvement in speed without any loss in recognition accuracy. In addition, it significantly outperforms a similar test based on likelihoods only, which fails, in our setting, to bring any improvement in the speed-accuracy trade-off. In a larger set of experiments with varying acoustic and task conditions, similar improvements are observed for the fast-match with the same model and settings. This indicates the robustness of the proposed technique. The gains due to the proposed method are obtained within a highly efficient two-pass search strategy, and similar or even higher gains are expected in other search architectures.

  • Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation

    Publication Year: 2005 , Page(s): 554 - 564
    Cited by:  Papers (10)  |  Patents (1)
    PDF (408 KB) | HTML

    It is widely believed that strong correlations exist across an utterance as a consequence of time-invariant characteristics of speaker and acoustic environments. It is verified in this paper that the first primary eigendirections of the utterance covariance matrix are speaker dependent. Based on this observation, a novel family of fast speaker adaptation algorithms entitled Eigenspace Mapping (EigMap) is proposed. The proposed algorithms are applied to continuous density Hidden Markov Model (HMM) based speech recognition. The EigMap algorithm rapidly constructs discriminative acoustic models in the test speaker's eigenspace by preserving discriminative information learned from baseline models in the directions of the test speaker's eigenspace. Moreover, the adapted models are compressed by discarding model parameters that are assumed to contain no discrimination information. The core idea of EigMap can be extended in many ways, and a family of algorithms based on EigMap is described in this paper. Unsupervised adaptation experiments show that EigMap is effective in improving baseline models using very limited amounts of adaptation data with superior performance to conventional adaptation techniques such as MLLR and block diagonal MLLR. A relative improvement of 18.4% over a baseline recognizer is achieved using EigMap with only about 4.5 s of adaptation data. Furthermore, it is also demonstrated that EigMap is additive to MLLR by encompassing important speaker dependent discriminative information. A significant relative improvement of 24.6% over baseline is observed using 4.5 s of adaptation data by combining MLLR and EigMap techniques.

  • Combination of autocorrelation-based features and projection measure technique for speaker identification

    Publication Year: 2005 , Page(s): 565 - 574
    Cited by:  Papers (4)
    PDF (704 KB) | HTML

    This paper presents a robust approach for speaker identification when the speech signal is corrupted by additive noise and channel distortion. Robust features are derived by assuming that the corrupting noise is stationary and the channel effect is fixed during an utterance. A two-step temporal filtering procedure on the autocorrelation sequence is proposed to minimize the effect of additive and convolutional noises. The first step applies a temporal filter in the autocorrelation domain to remove the additive noise, and the second step performs mean subtraction on the filtered autocorrelation sequence in the logarithmic spectrum domain to remove the channel effect. No prior knowledge of the noise characteristics is necessary, and the additive noise can be colored. The proposed robust feature is then combined with the projection measure technique to gain further improvement in recognition accuracy. Experimental results show that the proposed method can significantly improve the performance of the speaker identification task in noisy environments.

  • Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system

    Publication Year: 2005 , Page(s): 575 - 582
    Cited by:  Papers (26)
    PDF (464 KB) | HTML

    This paper proposes a text-dependent (fixed-text) speaker verification system which uses different types of information for making a decision regarding the identity claim of a speaker. The baseline system uses the dynamic time warping (DTW) technique for matching. Detection of the end-points of an utterance is crucial for the performance of DTW-based template matching. A method based on the vowel onset point (VOP) is proposed for locating the end-points of an utterance. The proposed method for speaker verification uses suprasegmental and source features, besides spectral features. The suprasegmental features, such as pitch and duration, are extracted using the warping path information in the DTW algorithm. Features of the excitation source, extracted using neural network models, are also used in the text-dependent speaker verification system. Although the suprasegmental and source features individually may not yield good performance, combining the evidence from these features seems to improve the performance of the system significantly. Neural network models are used to combine the evidence from multiple sources of information.
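    The template matching underlying the baseline system is classic dynamic time warping; a minimal scalar-sequence sketch (real systems compare vector sequences of spectral features, and the warping path from the same recursion supplies the pitch/duration alignment):

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences using the usual three-way
    (match / insert / delete) recursion on cumulative cost."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-stretched copy aligns perfectly, so its warped distance is zero.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```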

  • Speaker model selection based on the Bayesian information criterion applied to unsupervised speaker indexing

    Publication Year: 2005 , Page(s): 583 - 592
    Cited by:  Papers (11)  |  Patents (4)
    PDF (480 KB) | HTML

    In conventional speaker recognition tasks, the amount of training data is almost the same for each speaker, and the speaker model structure is uniform and specified manually according to the nature of the task and the available size of the training data. In real-world speech data such as telephone conversations and meetings, however, serious problems arise in applying a uniform model because variations in the utterance durations of speakers are large, with numerous short utterances. We therefore propose a flexible framework in which an optimal speaker model (GMM or VQ) is automatically selected based on the Bayesian Information Criterion (BIC) according to the amount of training data available. The framework makes it possible to use a discrete model when the data is sparse and to seamlessly switch to a continuous model after a large amount of data is obtained. The proposed framework was implemented for unsupervised speaker indexing of discussion audio. For a real discussion archive with a total duration of 10 hours, we demonstrate that the proposed method has higher indexing performance than conventional methods. The speaker index is also used to adapt a speaker-independent acoustic model to each participant for automatic transcription of the discussion. We demonstrate that speaker indexing with our method is sufficiently accurate for adaptation of the acoustic model.
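    The selection rule itself is compact: score each candidate speaker model by penalized log-likelihood and keep the winner, so sparse data automatically favors the smaller model. A sketch assuming each trained model reports its log-likelihood and parameter count (names and numbers are illustrative):

```python
import math

def bic_score(loglik, n_params, n_frames, lam=1.0):
    """BIC with the usual tunable penalty weight lam; higher is better."""
    return loglik - 0.5 * lam * n_params * math.log(n_frames)

def select_model(candidates, n_frames, lam=1.0):
    """candidates: {name: (loglik, n_params)}. Returns the BIC-best name."""
    return max(candidates, key=lambda k: bic_score(*candidates[k], n_frames, lam))

# With little data, the richer GMM's small likelihood gain cannot pay
# its parameter penalty, so the discrete (VQ-style) model is chosen.
print(select_model({"VQ": (-1000.0, 50), "GMM": (-980.0, 200)}, n_frames=100))  # VQ
```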

  • Performance of real-time source-location estimators for a large-aperture microphone array

    Publication Year: 2005 , Page(s): 593 - 606
    Cited by:  Papers (21)
    PDF (1664 KB) | HTML

    A large array of microphones is being studied as a possible means of acquiring data in offices, conference rooms, and auditoria without requiring close-talking microphones. An array that surrounds all possible sources has a large aperture, and such arrays have attractive properties for accurate spatial resolution and significant signal-to-noise enhancement. For the first time, this paper presents all the details of a real-time source-location algorithm (LEMSalg) based on time-of-arrival delays derived from a phase transform applied to the generalized cross-power spectrum. It is being used successfully in a representative environment where microphone SNRs are below 0 dB. We have found that many small features are required to make a useful location-estimating algorithm work, and work well, in real time. We present an experimental evaluation of the current algorithm's performance using data taken with the Huge Microphone Array (HMA) system, which has 448 microphones, in a noisy, reverberant environment. Using off-line computation, we also compared the LEMSalg to two alternative methods. The first adds local beamforming to the preprocessing of the base algorithm, increasing performance significantly at modest additional computational cost. The second maximizes the total steered-response power under the same phase transform. While able to derive good position estimates from shorter data runs, this method is two orders of magnitude more computationally expensive and is not yet suitable for real-time use.
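    The time-of-arrival delays mentioned here are classic GCC-PHAT estimates: whiten the generalized cross-power spectrum down to its phase, then peak-pick the resulting cross-correlation. A single-microphone-pair sketch (the real system aggregates many pairwise delays into a 3-D position estimate):

```python
import numpy as np

def gcc_phat_delay(x, y, fs=1.0):
    """Delay of y relative to x (in samples / fs) via the phase transform."""
    n = len(x) + len(y)
    R = np.fft.rfft(y, n=n) * np.conj(np.fft.rfft(x, n=n))
    R /= np.abs(R) + 1e-12            # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    shift = int(np.argmax(np.abs(cc)))
    if shift > n // 2:                # map the upper half to negative lags
        shift -= n
    return shift / fs

rng = np.random.default_rng(1)
x = rng.normal(size=256)
y = np.concatenate([np.zeros(5), x])[:256]   # y is x delayed by 5 samples
print(gcc_phat_delay(x, y))  # 5.0
```

    The PHAT whitening is what makes the peak sharp in reverberant rooms: all frequencies vote with equal weight, so the estimate depends only on phase alignment.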

  • A robust hybrid feedback active noise cancellation headset

    Publication Year: 2005 , Page(s): 607 - 617
    Cited by:  Papers (21)  |  Patents (3)
    PDF (464 KB) | HTML

    This paper investigates the robustness of a hybrid analog/digital feedback active noise cancellation (ANC) headset system. Digital ANC systems with the filtered-x least-mean-square (FXLMS) algorithm require accurate estimation of the secondary path for the stability and convergence of the algorithm. This poses a great challenge for ANC headset design because the secondary path may fluctuate dramatically, for example when the user adjusts the position of the ear-cup. In this paper, we analytically show that adding an analog feedback loop to the digital ANC system can effectively reduce the plant fluctuation, thus achieving a more robust system. The method for designing the analog controller is highlighted. A practical hybrid analog/digital feedback ANC headset has been built and used to conduct experiments, and the experimental results show that the hybrid headset system is more robust under large plant fluctuation and achieves satisfactory noise cancellation for both narrowband and broadband noises.
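    The digital side of such a system is the FXLMS update: the reference signal is filtered through an estimate of the secondary path before driving the weight update. A single-channel sketch with an assumed-known secondary-path estimate (all parameters are illustrative, not the paper's design):

```python
import numpy as np

def fxlms(x, d, s_hat, n_taps=16, mu=0.005):
    """x: reference noise, d: disturbance at the error mic, s_hat: estimated
    secondary-path impulse response. Returns the residual error signal."""
    w = np.zeros(n_taps)                     # adaptive control filter
    y_hist = np.zeros(len(s_hat))            # recent anti-noise samples
    xf = np.convolve(x, s_hat)[:len(x)]      # filtered reference
    e = np.zeros(len(x))
    for n in range(len(x)):
        xv = x[max(0, n - n_taps + 1):n + 1][::-1]
        xv = np.pad(xv, (0, n_taps - len(xv)))
        y_hist = np.roll(y_hist, 1)
        y_hist[0] = w @ xv                   # anti-noise sample
        e[n] = d[n] - s_hat @ y_hist         # residual after cancellation
        xfv = xf[max(0, n - n_taps + 1):n + 1][::-1]
        xfv = np.pad(xfv, (0, n_taps - len(xfv)))
        w += mu * e[n] * xfv                 # filtered-x LMS update
    return e

t = np.arange(4000)
x = np.sin(0.05 * 2 * np.pi * t)             # tonal reference noise
d = np.convolve(x, [0.9, 0.3])[:len(x)]      # primary path to the ear
e = fxlms(x, d, s_hat=np.array([1.0, 0.2]))
print(np.mean(e[-500:] ** 2) < 0.01 * np.mean(d[-500:] ** 2))  # True
```

    Note the point the abstract makes: here s_hat matches the true secondary path exactly, which is precisely the assumption that breaks when the ear-cup moves, motivating the hybrid analog loop.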

  • On comparison of online secondary path modeling methods with auxiliary noise

    Publication Year: 2005 , Page(s): 618 - 628
    Cited by:  Papers (14)
    PDF (400 KB) | HTML

    In the past few years, many methods have been proposed to model the so-called secondary path online for an active noise control (ANC) system. In these methods, an auxiliary noise is employed to excite the modeling process. Among the many factors affecting the performance of these methods, some play important roles: 1) the power of the auxiliary noise; 2) the ability to detect secondary-path changes; 3) the optimal step sizes for the adaptive filters used in an ANC system; and 4) the independence between the ANC and secondary-path modeling processes. In the literature, however, there is a lack of analysis showing how these factors affect the performance of the ANC system. Moreover, the aforementioned online secondary-path modeling methods have not been systematically evaluated. In this paper, the performance of these methods is evaluated by investigating the above-mentioned factors through both statistical analysis and computer simulations.

  • IEEE Transactions on Speech and Audio Processing Edics

    Publication Year: 2005 , Page(s): 629
    PDF (23 KB)
    Freely Available from IEEE
  • IEEE Transactions on Speech and Audio Processing Information for authors

    Publication Year: 2005 , Page(s): 630 - 631
    PDF (50 KB)
    Freely Available from IEEE
  • Special issue on objective quality assessment of speech and audio

    Publication Year: 2005 , Page(s): 632
    PDF (120 KB)
    Freely Available from IEEE
  • IEEE Signal Processing Society Information

    Publication Year: 2005 , Page(s): c3
    PDF (34 KB)
    Freely Available from IEEE

Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
