
IEEE Transactions on Audio, Speech, and Language Processing

Issue 6 • Nov. 2006


Displaying Results 1 - 25 of 46
  • Table of contents

    Page(s): c1 - c4
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): c2
    Freely Available from IEEE
  • Introduction to the Special Section on Objective Quality Assessment of Speech and Audio

    Page(s): 1889
    Freely Available from IEEE
  • Objective Assessment of Speech and Audio Quality—Technology and Applications

    Page(s): 1890 - 1901

    In the past few years, objective quality assessment models have been used increasingly for assessing or monitoring speech and audio quality. By measuring perceived quality on an easily understood subjective scale, such as listening quality (excellent, good, fair, poor, bad), these methods provide a quick and repeatable way to estimate customer experience. Typical applications include audio quality evaluation, selection of codecs or other equipment, and measuring the quality of telephone networks. To introduce this special issue, this paper provides an overview of the field, outlining the main approaches to intrusive, nonintrusive, and parametric models and discussing some of their limitations and areas of future work.

  • PEMO-Q—A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception

    Page(s): 1902 - 1911

    A new method for the objective assessment and prediction of perceived audio quality is introduced. It represents an expansion of the speech quality measure q_C introduced by Hansen and Kollmeier, and is based on a psychoacoustically validated, quantitative model of the "effective" peripheral auditory processing by Dau et al. To evaluate the audio quality of a given distorted signal relative to a corresponding high-quality reference signal, the auditory model is employed to compute "internal representations" of the signals, which are partly assimilated in order to account for assumed cognitive aspects. The linear cross-correlation coefficient of the assimilated internal representations constitutes the perceptual similarity measure (PSM). PSM correlates well with subjective quality ratings if different types of audio signals are considered separately, whereas better accuracy in signal-independent quality prediction is achieved by a second quality measure, PSM_t, defined as the fifth percentile of the sequence of instantaneous audio quality PSM(t). The new measures were evaluated using a large database of subjective listening tests that were originally carried out on behalf of the International Telecommunication Union (ITU) and the Moving Picture Experts Group (MPEG) for the evaluation of various low-bit-rate audio codecs. Additional tests with data unknown in the development phase of the model were carried out. Except for linear distortions, the new method shows higher prediction accuracy than ITU-R Recommendation BS.1387 ("PEAQ") on the tested data.
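
The two quality measures described above reduce to simple statistics once the auditory model has produced its internal representations. A minimal numerical sketch (the auditory model and the assimilation step are assumed given; the function names are illustrative):

```python
import numpy as np

def psm(internal_ref, internal_test):
    """PSM: linear cross-correlation coefficient of the reference and
    test internal representations (assimilation step omitted here)."""
    return float(np.corrcoef(internal_ref.ravel(), internal_test.ravel())[0, 1])

def psm_t(psm_sequence):
    """PSM_t: fifth percentile of the sequence of instantaneous PSM(t),
    emphasizing the worst-quality moments of the signal."""
    return float(np.percentile(psm_sequence, 5))
```

The fifth percentile makes PSM_t sensitive to short, severe distortions that a plain average over PSM(t) would wash out.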

  • A Multiresolution Model of Auditory Excitation Pattern and Its Application to Objective Evaluation of Perceived Speech Quality

    Page(s): 1912 - 1923

    This paper proposes a multiresolution model of auditory excitation pattern and applies it to the problem of objective evaluation of subjective wideband speech quality. The model uses the wavelet packet transform for time-frequency decomposition of the input signal. The selection of the wavelet packet tree is based on an optimality criterion formulated to minimize a cost function based on the critical band structure. The models of the different auditory phenomena are reformulated for the multiresolution framework. This includes the proposition of duration-dependent outer and middle ear weighting, multiresolution spectral spreading, and multiresolution temporal smearing. As an application, the excitation pattern is used to define an objective measure of auditory distortion of a distorted speech signal compared to the undistorted one. The performance of this objective measure in predicting the subjective mean opinion score (MOS) is evaluated with a database of various kinds of NOISEX-92 degraded wideband speech signals and is compared with the fast Fourier transform (FFT)-based ITU-T PESQ P.862.2 algorithm. The proposed measure is found to achieve correlation between subjective and objective MOS comparable to that of PESQ P.862.2, with a trend suggesting better correlation for nonstationary degradations than for stationary ones. Further refinement of the measure for distortion types other than additive noise is anticipated.

  • P.563—The ITU-T Standard for Single-Ended Speech Quality Assessment

    Page(s): 1924 - 1934

    Objective voice quality assessment has been the subject of research for many years. Until very recently, objective models required a copy of the unprocessed signal to estimate the quality of a signal transmitted across a telecommunication network, making live call monitoring impossible. This paper introduces a method for nonintrusive assessment of speech quality for narrow-band telephony, approved by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) in May 2004. Based essentially on models of voice production and perception, the algorithm demonstrates good performance on more than 48 subjective experiments representing most distortions that occur on voice networks.

  • Single-Ended Speech Quality Measurement Using Machine Learning Methods

    Page(s): 1935 - 1947

    We describe a novel single-ended algorithm constructed from models of speech signals, including clean and degraded speech, and speech corrupted by multiplicative noise and temporal discontinuities. Machine learning methods are used to design the models, including Gaussian mixture models, support vector machines, and random forest classifiers. Estimates of the subjective mean opinion score (MOS) generated by the models are combined using hard or soft decisions generated by a classifier which has learned to match the input signal with the models. Test results show the algorithm outperforming ITU-T P.563, the current "state-of-the-art" standard single-ended algorithm. Employed in a distributed double-ended measurement configuration, the proposed algorithm is found to be more effective than P.563 in assessing the quality of noise reduction systems and can provide a functionality not available with P.862 PESQ, the current double-ended standard algorithm.
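
The hard/soft combination step can be sketched as follows, assuming a classifier has already produced per-model probabilities (all names are hypothetical, and the per-model MOS estimators themselves are taken as given):

```python
import numpy as np

def combine_mos(class_probs, model_mos, soft=True):
    """Combine per-model MOS estimates using a classifier's output.

    soft=True : probability-weighted average of the model estimates
    soft=False: hard decision, take the estimate of the most likely model
    """
    class_probs = np.asarray(class_probs, float)
    model_mos = np.asarray(model_mos, float)
    if soft:
        return float(class_probs @ model_mos)
    return float(model_mos[np.argmax(class_probs)])
```

The soft variant degrades gracefully when the classifier is unsure which signal model applies, while the hard variant commits fully to the winning model.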

  • Low-Complexity, Nonintrusive Speech Quality Assessment

    Page(s): 1948 - 1956

    Monitoring of speech quality in emerging heterogeneous networks is of great interest to network operators. The most efficient way to satisfy such a need is through nonintrusive, objective speech quality assessment. In this paper, we describe a low-complexity algorithm for monitoring the speech quality over a network. The features used in the proposed algorithm can be computed from commonly used speech-coding parameters. Reconstruction and perceptual transformation of the signal are not performed. The critical advantage of the approach lies in generating quality assessment ratings without explicit distortion modeling. The results from the performed experiments indicate that the proposed nonintrusive objective quality measure performs better than the ITU-T P.563 standard.

  • Short- and Long-Term Packet Loss Behavior: Towards Speech Quality Prediction for Arbitrary Loss Distributions

    Page(s): 1957 - 1968

    A speech-quality-oriented classification of packet loss distributions is proposed according to both the short- and long-term loss behavior. While the short-term behavior (microscopic loss behavior) relates to the effect of packet loss on the coder and packet loss concealment performance, the long-term loss behavior (macroscopic loss behavior) is defined so that it reflects the loss behavior that ultimately leads to speech quality that perceptively changes over time. Based on this classification, different parametric (objective) modeling approaches for predicting speech quality are discussed. To this end, a packet loss averaging approach is presented for modeling speech quality under short-term loss. Starting from this model, two different ways of predicting speech quality under long-term-dependent packet loss are analyzed and compared to auditory (subjective) test results: quality prediction based on averaging at the packet-trace level, as provided, for example, by the E-model (2005), and prediction based on time-averaging of estimated instantaneous quality profiles, as suggested, for example, by L. Gros and N. Chateau (1998, 2001). From this comparison, the suitability of the different approaches for network planning is discussed, and their limitations in the case of particular loss distributions are pointed out.
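
The difference between trace-level averaging and time-averaging of instantaneous quality profiles can be illustrated with a toy loss-to-quality mapping (the mapping below is hypothetical, not the E-model's formula; it only illustrates the concave shape of such mappings):

```python
import numpy as np

# Hypothetical concave loss-to-quality mapping (NOT the E-model's own
# formula); p is the packet loss fraction, output is on a MOS-like scale.
q_of_loss = lambda p: 4.5 - 3.5 * p ** 0.5

def quality_trace_average(loss_trace, q=q_of_loss):
    """Approach 1: average loss over the whole trace, then map to quality."""
    return q(float(np.mean(loss_trace)))

def quality_profile_average(loss_trace, q=q_of_loss, win=100):
    """Approach 2: map each window's loss to instantaneous quality,
    then time-average the resulting quality profile."""
    qs = [q(float(np.mean(loss_trace[i:i + win])))
          for i in range(0, len(loss_trace), win)]
    return float(np.mean(qs))

# A bursty trace: loss concentrated in the second half.
bursty = np.concatenate([np.zeros(100), np.full(100, 0.2)])
```

For a concave mapping the two approaches disagree on bursty traces, which is precisely the macroscopic (long-term) effect the classification above is meant to capture.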

  • Impairment Factor Framework for Wide-Band Speech Codecs

    Page(s): 1969 - 1976

    A new method is described for quantifying the quality degradation introduced by wide-band speech codecs via a one-dimensional impairment factor. The method is based on auditory listening-only tests, but the resulting impairment factors may be used for predicting speech quality in an instrumental way, e.g., for network planning purposes. Following the method, auditory test results are first transformed to an overall quality rating scale and then adjusted to rule out test-specific effects. The derived impairment factors fit into the common framework which is defined by the E-model for narrow-band telephone networks, and which is hereby extended towards wide-band speech transmission. This paper presents the necessary auditory test data, describes the derivation and adjustment methodology, and provides numerical values for a range of wide-band speech codecs. The values are tested for their robustness in the case of codec tandems and adjusted to represent the effects of packet loss.

  • VoIP Quality Assessment: Taking Account of the Edge-Device

    Page(s): 1977 - 1983

    Voice-over-IP networks (VoIP) have yet to be universally accepted as a replacement for public switched telephone network services. One of the barriers to this step is the ability to manage the network to ensure the user receives a suitable quality of service during a call. IP networks are nondeterministic, and the impact of network degradations, such as loss and jitter, on the quality perceived by the user is difficult to measure. This paper investigates the role played by the edge-device in the impact of network degradations on user-perceived quality. It shows that not all edge-devices are equal, and proposes calibration as a method of accounting for different devices when monitoring VoIP streams. Finally, it presents results that show the prediction accuracy that can be obtained by using such a method.

  • Objective Assessment Methodology for Estimating Conversational Quality in VoIP

    Page(s): 1984 - 1993

    Voice over IP (VoIP) is becoming one of the key technologies for telecommunications. Since IP networks generally do not guarantee transmission quality, it is extremely important to design and manage the quality of service (QoS) properly. To do this, it is desirable to develop an objective quality assessment method that estimates subjective quality based on the physical characteristics of the VoIP system. This paper first proposes a framework of objective models that can be applied not only to quality planning, which is the intended application of the existing standard methodology known as International Telecommunication Union - Telecommunication Standardization Sector (ITU-T) Recommendation G.107, "the E-model," but also to quality benchmarking and management. It then proposes a model that complies with the proposed framework. Experimental results show that the proposed model has sufficient accuracy in the evaluation of practical VoIP systems. In addition, we attempt to integrate the opinion model with other objective quality measures, such as perceptual evaluation of speech quality (PESQ), standardized as ITU-T Recommendation P.862. Finally, we examine the task dependence of the performance of the proposed model.
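
For reference, the E-model of ITU-T Rec. G.107 maps its transmission rating R to an estimated MOS with a fixed cubic formula; a direct transcription:

```python
def r_to_mos(r):
    """ITU-T G.107 (E-model) mapping from transmission rating R to MOS:
    MOS = 1 + 0.035*R + R*(R - 60)*(100 - R)*7e-6 for 0 <= R <= 100,
    clamped to 1.0 below R = 0 and to 4.5 above R = 100."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
```

The default narrow-band rating of R = 93.2 maps to a MOS of about 4.41, the familiar ceiling of E-model quality predictions.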

  • Feature Extraction for the Prediction of Multichannel Spatial Audio Fidelity

    Page(s): 1994 - 2005

    This paper presents an algorithm for the prediction of frontal spatial fidelity and surround spatial fidelity of multichannel audio, two attributes of the subjective parameter called basic audio quality. A number of features chosen to represent spectral and spatial changes were extracted from a set of recordings and used in a regression model as independent variables for the prediction of spatial fidelities. The calibration of the model was done by ridge regression using a database of scores obtained from a series of formal listening tests. The statistically significant features, based on interaural cross-correlation and spectral features, found from an initial model were employed to build a simplified model, and these selected features were validated. The results obtained from the validation experiment were highly correlated with the listening test scores and had a low standard error, comparable to that encountered in typical listening tests. The applicability of the developed algorithm is limited to predicting the basic audio quality of low-pass filtered and down-mixed recordings (as obtained in listening tests based on a multistimulus test paradigm with a reference and two anchors: a 3.5-kHz low-pass filtered signal and a mono signal).
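
The ridge-regression calibration mentioned above has a closed form; a minimal sketch with synthetic features standing in for the interaural cross-correlation and spectral features (the data here is illustrative only):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y.
    The penalty lam shrinks the weights, stabilizing the fit when the
    extracted features are correlated, as spatial features often are."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With a small penalty and well-conditioned data this recovers the ordinary least-squares solution; larger penalties trade a little bias for lower variance of the fidelity predictions.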

  • Performance Estimation of Speech Recognition System Under Noise Conditions Using Objective Quality Measures and Artificial Voice

    Page(s): 2006 - 2013

    It is essential to ensure quality of service (QoS) when offering a speech recognition service for use in noisy environments. This means that the recognition performance in the target noise environment must be investigated. One approach is to estimate the recognition performance from a distortion value, which represents the difference between noisy speech and its original clean version. Previously, estimation methods using the segmental signal-to-noise ratio (SNRseg), the cepstral distance (CD), and the perceptual evaluation of speech quality (PESQ) have been proposed. However, their estimation accuracy has not been verified for the case when a noise reduction algorithm is adopted as a preprocessing stage in speech recognition. We therefore evaluated the effectiveness of these distortion measures in experiments using the AURORA-2J connected digit recognition task and four different noise reduction algorithms. The results showed that in each case the distortion measure correlates well with the word accuracy when the estimators used are optimized for each individual noise reduction algorithm. In addition, it was confirmed that when a single estimator, optimized for all the noise reduction algorithms, is used, the PESQ method gives a more accurate estimate than SNRseg and CD. Furthermore, we have proposed the use of artificial voice of several seconds' duration instead of a large amount of real speech, and confirmed that a relatively accurate estimate can be obtained by using the artificial voice.
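
Of the distortion measures compared above, the segmental SNR is the simplest to state: a frame-wise log energy ratio between the clean signal and the estimation error, averaged over frames. A sketch (the conventional clipping of per-frame values to [-10, 35] dB is assumed):

```python
import numpy as np

def snr_seg(clean, degraded, frame=256):
    """Segmental SNR: per-frame 10*log10(signal energy / error energy),
    clipped to [-10, 35] dB as is conventional, then averaged."""
    snrs = []
    for i in range(0, len(clean) - frame + 1, frame):
        s = clean[i:i + frame]
        e = s - degraded[i:i + frame]
        snr = 10 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12))
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))
```

Unlike PESQ, this measure applies no perceptual weighting, which is one reason it tends to track word accuracy less closely across different noise reduction algorithms.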

  • Monaural Speech Separation Based on Computational Auditory Scene Analysis and Objective Quality Assessment of Speech

    Page(s): 2014 - 2023

    Monaural speech separation is a very challenging problem in speech signal processing. It has been studied extensively, and many separation systems based on computational auditory scene analysis (CASA) have been proposed in the last two decades. Although research on CASA has tended to introduce high-level knowledge into separation processes using primitive data-driven methods, knowledge of speech quality has still not been combined with it. As a result, performance evaluation of CASA has focused mainly on the signal-to-noise ratio (SNR) improvement. In fact, the quality of the separated speech is not directly related to its SNR. To address this problem, we propose a new method which combines CASA with objective quality assessment of speech (OQAS). In the grouping process of CASA, we use OQAS as a guide to instruct the CASA system. With this combination, the performance of the speech separation can be improved not only in SNR but also in mean opinion score (MOS). Our system is systematically evaluated and compared with previous systems, and it yields substantially better performance, especially for the subjective perceptual quality of separated speech.

  • Multiband Modulation Energy Tracking for Noisy Speech Detection

    Page(s): 2024 - 2038

    The ability to accurately locate the boundaries of speech activity is an important attribute of any modern speech recognition, processing, or transmission system. The effort in this paper is the development of efficient, sophisticated features for speech detection in noisy environments, using ideas and techniques from recent advances in speech modeling and analysis, such as the presence of modulations in speech formants, energy separation, and multiband filtering. First, we present a method, conceptually based on a classic speech-silence discrimination procedure, that uses some newly developed short-time signal analysis tools, and provide for it a detection-theoretic motivation. The new energy and spectral content representations are derived by filtering the signal in various frequency bands, estimating the Teager-Kaiser energy for each, and demodulating the most active one in order to derive the signal's dominant AM-FM components. This modulation approach demonstrated improved robustness in noise over the classic algorithm, reaching an average error reduction of 33.5% under 5-30-dB noise. Second, by incorporating alternative modulation energy features in voice activity detection, improvement in the overall misclassification error of a high hit rate detector reached 7.5% and 9.5% on different benchmarks.
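
The Teager-Kaiser energy at the heart of these modulation features has a well-known three-sample discrete form; for a pure cosine A*cos(Omega*n) it evaluates exactly to A^2*sin^2(Omega), jointly reflecting amplitude and frequency:

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1)*x(n+1), valid for interior samples."""
    x = np.asarray(x, float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

Because the operator needs only three adjacent samples, it tracks rapid energy changes far faster than a windowed mean-square energy, which is what makes it attractive for detecting speech activity boundaries.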

  • A Novel Audio Coding Scheme Using Warped Linear Prediction Model and the Discrete Wavelet Transform

    Page(s): 2039 - 2048

    In this paper, we present a novel audio coder using the discrete wavelet transform (DWT) and warped linear prediction (WLP). In contrast to conventional LP, WLP allows for control of the frequency resolution to closely match the response of the human auditory system. The structure of the system is similar to the transform coded excitation techniques used in wideband speech coding, where LP has been replaced with WLP, and the residual is analyzed by a wavelet filterbank designed to approximate the critical bands. The inherent shaping of the WLP synthesis filter and a controlled bit allocation to the wavelet coefficients help minimize the perceptually significant noise due to the quantization error in the residual. For monophonic signals sampled at 44.1 kHz, the coder achieves near-transparent to transparent quality for a variety of speech and music signals at an average bitrate of about 64 kb/s. Tests also show that the coder (in its initial implementation) delivers superior quality to the MPEG Layer III codec and comparable quality to the MPEG-2 AAC codec when operating at the same bitrate.

  • Speech Enhancement Based on Generalized Minimum Mean Square Error Estimators and Masking Properties of the Auditory System

    Page(s): 2049 - 2063

    In this paper, the family of conditional minimum mean square error (MMSE) spectral estimators, which take the form (E[Xp^alpha | |Xp + Dp|])^(1/alpha), where Xp is the clean speech spectrum and Dp is the noise spectrum, is studied, resulting in a generalized MMSE estimator (GMMSE). The degree of noise suppression versus musical tone artifacts of these estimators is examined. The tradeoffs in the selection of alpha, across noise spectral structure and signal-to-noise ratio (SNR) level, are also considered. Members of this family of estimators include the Ephraim-Malah (EM) amplitude estimator and, for high SNRs, the Wiener filter. It is shown that the colorless residual noise observed in the EM estimator is a characteristic of this general family of estimators. An application of these estimators in an auditory enhancement scheme using the masking threshold of the human auditory system is formulated, resulting in the GMMSE-auditory masking threshold (AMT) enhancement method. Finally, a detailed evaluation of the proposed algorithms is performed over the phonetically balanced TIMIT database and the National Gallery of the Spoken Word (NGSW) audio archive using subjective and objective speech quality measures. Results show that the proposed GMMSE-AMT outperforms MMSE and log-MMSE enhancement methods in a detailed phoneme-based objective quality analysis.

  • Adaptive Time Segmentation for Improved Speech Enhancement

    Page(s): 2064 - 2074

    Single-channel enhancement algorithms are widely used to overcome the degradation of noisy speech signals. Speech enhancement gain functions are typically computed from two quantities, namely, an estimate of the noise power spectrum and of the noisy speech power spectrum. The variance of these power spectral estimates degrades the quality of the enhanced signal, and smoothing techniques are therefore often used to decrease the variance. In this paper, we present a method to determine the noisy speech power spectrum based on an adaptive time segmentation. More specifically, the proposed algorithm determines for each noisy frame which of the surrounding frames should contribute to the corresponding noisy power spectral estimate. Further, we demonstrate the potential of our adaptive segmentation in both maximum-likelihood and decision-directed speech enhancement methods by making a better estimate of the a priori signal-to-noise ratio (SNR) ξ. Objective and subjective experiments show that an adaptive time segmentation leads to significant performance improvements in comparison to the conventionally used fixed segmentations, particularly in transitional regions, where we observe local SNR improvements on the order of 5 dB.
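
The frame-selection idea can be sketched as follows; the log-spectral-distance criterion and threshold below are hypothetical stand-ins for the paper's actual selection rule, chosen only to show the averaging structure:

```python
import numpy as np

def adaptive_psd(frames_psd, m, thresh=1.0):
    """Noisy-speech PSD estimate for frame m: average only those
    surrounding frames whose spectra are close to frame m
    (hypothetical mean squared log-spectral distance criterion)."""
    ref = np.log(frames_psd[m] + 1e-12)
    sel = [p for p in frames_psd
           if np.mean((np.log(p + 1e-12) - ref) ** 2) < thresh]
    return np.mean(sel, axis=0)
```

Averaging only over spectrally similar neighbors reduces estimator variance in stationary stretches while excluding transitional frames, which is where fixed segmentations smear speech onsets.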

  • Blind Source Separation Based on Time-Domain Optimization of a Frequency-Domain Independence Criterion

    Page(s): 2075 - 2085

    A new technique for the blind separation of convolutive mixtures is proposed in this paper. Inspired by the works of Amari, Sabala, and Rahbar, we first start from the application of the Kullback-Leibler divergence in the frequency domain, and then integrate the Kullback-Leibler divergence over the whole frequency range of interest to yield a new objective function which depends on time-domain variables. In other words, the objective function is derived in the frequency domain but can be optimized with respect to time-domain variables. The proposed technique has the advantages of frequency-domain approaches and is suitable for very long mixing channels, but does not suffer from the local permutation problem, as the separation is achieved in the time domain.

  • Fast Implementation of KLT-Based Speech Enhancement Using Vector Quantization

    Page(s): 2086 - 2097

    We propose a new method for implementing Karhunen-Loeve transform (KLT)-based speech enhancement that exploits vector quantization (VQ). The method is suitable for real-time processing. The proposed method consists of a VQ learning stage and a filtering stage. In the VQ learning stage, autocorrelation vectors comprising the first K elements of the autocorrelation function are extracted from learning data. The autocorrelation vectors are used as codewords in the VQ codebook. Next, the KLT bases that correspond to all the codeword vectors are estimated through eigendecomposition (ED) of the empirical Toeplitz covariance matrices constructed from the codeword vectors. In the filtering stage, the autocorrelation vectors estimated from the input signal are compared to the codewords, and the nearest one is chosen in each frame. The precomputed KLT bases corresponding to the chosen codeword are used for filtering instead of performing ED, which is computationally intensive. Speech quality evaluation using objective measures shows that the proposed method is comparable to a conventional KLT-based method that performs ED in the filtering process. Results of subjective tests also support this finding. In addition, processing time is reduced to about 1/66 that of the conventional method in the case where a frame length of 120 points is used. This complexity reduction comes at the cost of computation in the learning stage and a corresponding increase in the associated memory requirement. Nevertheless, these results demonstrate that the proposed method reduces computational complexity while maintaining the speech quality of KLT-based speech enhancement.
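
The offline/online split described above can be sketched with NumPy: eigendecomposition of a Toeplitz covariance built from each autocorrelation codeword happens once in the learning stage, and the filtering stage performs only a nearest-codeword search (codebook training itself is omitted):

```python
import numpy as np

def klt_basis_from_autocorr(r):
    """Learning stage (offline): build the Toeplitz covariance from an
    autocorrelation codeword r and eigendecompose it once."""
    k = len(r)
    idx = np.abs(np.arange(k)[:, None] - np.arange(k)[None, :])
    toeplitz_cov = np.asarray(r)[idx]          # R[i, j] = r[|i - j|]
    eigvals, basis = np.linalg.eigh(toeplitz_cov)
    return eigvals, basis                      # columns of basis = KLT axes

def nearest_codeword(r_in, codebook):
    """Filtering stage (online): nearest-codeword search replaces the
    per-frame eigendecomposition."""
    dists = [float(np.sum((r_in - c) ** 2)) for c in codebook]
    return int(np.argmin(dists))
```

The per-frame cost drops from an O(K^3) eigendecomposition to an O(K * codebook size) distance search, which is the source of the roughly 66x speedup reported above.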

  • Improved Signal-to-Noise Ratio Estimation for Speech Enhancement

    Page(s): 2098 - 2108

    This paper addresses the problem of single-microphone speech enhancement in noisy environments. State-of-the-art short-time noise reduction techniques are most often expressed as a spectral gain depending on the signal-to-noise ratio (SNR). The well-known decision-directed (DD) approach drastically limits the level of musical noise, but the estimated a priori SNR is biased since it depends on the speech spectrum estimate of the previous frame. Therefore, the gain function matches the previous frame rather than the current one, which degrades the noise reduction performance. The consequence of this bias is an annoying reverberation effect. We propose a method called the two-step noise reduction (TSNR) technique, which solves this problem while maintaining the benefits of the decision-directed approach. The estimation of the a priori SNR is refined by a second step to remove the bias of the DD approach, thus removing the reverberation effect. However, classic short-time noise reduction techniques, including TSNR, introduce harmonic distortion in enhanced speech because of the unreliability of estimators at small signal-to-noise ratios. This is mainly due to the difficult task of noise power spectral density (PSD) estimation in single-microphone schemes. To overcome this problem, we propose a method called harmonic regeneration noise reduction (HRNR). A nonlinearity is used to regenerate the degraded harmonics of the distorted signal in an efficient way. The resulting artificial signal is produced in order to refine the a priori SNR used to compute a spectral gain able to preserve the speech harmonics. These methods are analyzed, and objective and formal subjective test results comparing the HRNR and TSNR techniques are provided. HRNR brings a significant improvement over TSNR thanks to the preservation of harmonics.
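
A sketch of the decision-directed recursion and a TSNR-style second step (the Wiener gain and the unit initialization are illustrative choices, not necessarily those of the paper):

```python
import numpy as np

def decision_directed(gamma, alpha=0.98):
    """Decision-directed a priori SNR track over a sequence of
    a posteriori SNRs gamma, plus a TSNR-style second step."""
    wiener = lambda xi: xi / (1.0 + xi)       # illustrative spectral gain
    xi_dd = np.empty_like(gamma)
    prev = 1.0                                # feedback from previous frame
    for m, g in enumerate(gamma):
        # DD estimate: weighted mix of last frame's result and max(gamma-1, 0)
        xi_dd[m] = alpha * prev + (1.0 - alpha) * max(g - 1.0, 0.0)
        prev = wiener(xi_dd[m]) ** 2 * g
    # Second step: re-estimate with the *current* frame's gain, removing
    # the one-frame bias of the decision-directed estimate.
    xi_tsnr = wiener(xi_dd) ** 2 * gamma
    return xi_dd, xi_tsnr
```

The second line of estimates reacts to the current frame rather than lagging one frame behind, which is the bias removal that suppresses the reverberation effect described above.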

  • Subband Likelihood-Maximizing Beamforming for Speech Recognition in Reverberant Environments

    Page(s): 2109 - 2121

    In this paper, we introduce subband likelihood-maximizing beamforming (S-LIMABEAM), a new microphone-array processing algorithm specifically designed for speech recognition applications. The proposed algorithm is an extension of the previously developed LIMABEAM array processing algorithm. Unlike most array processing algorithms, which operate according to some waveform-level objective function, the goal of LIMABEAM is to find the set of array parameters that maximizes the likelihood of the correct recognition hypothesis. Optimizing the array parameters in this manner yields significant improvements in recognition accuracy over conventional array processing methods when speech is corrupted by additive noise and moderate levels of reverberation. Despite the success of the LIMABEAM algorithm in such environments, little improvement was achieved in highly reverberant environments. In such situations, where the noise is highly correlated with the speech signal and the number of filter parameters to estimate is large, subband processing has been used to improve the performance of LMS-type adaptive filtering algorithms. We use subband processing principles to design a novel array processing architecture in which selected groups of subbands are processed jointly to maximize the likelihood of the resulting speech recognition features, as measured by the recognizer itself. By creating a subband filtering architecture that explicitly accounts for the manner in which recognition features are computed, we can effectively apply the LIMABEAM framework to highly reverberant environments. In doing so, we achieve improvements in word error rate of over 20% compared to conventional methods in highly reverberant environments.
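The key architectural idea in the abstract, grouping subbands by the recognition feature they feed, can be illustrated with standard triangular mel filters: each mel feature is computed from a contiguous band of FFT bins, so those bins form one jointly-optimized group. This toy sketch uses the common mel-scale formulas; the filterbank parameters are assumptions, not the paper's configuration:

```python
import numpy as np

def mel_subband_groups(n_fft=512, sr=16000, n_mels=6, fmin=0.0):
    """Toy illustration of grouping STFT subbands by the mel filter they
    feed, so each group of filter parameters could be optimized jointly
    against one recognition feature. Triangular mel filters; all
    parameters here are illustrative assumptions."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(sr / 2.0),
                                  n_mels + 2))
    bin_freqs = np.arange(n_fft // 2 + 1) * sr / n_fft
    groups = []
    for i in range(n_mels):
        lo, hi = edges[i], edges[i + 2]   # support of the i-th triangle
        groups.append(np.where((bin_freqs > lo) & (bin_freqs < hi))[0])
    return groups
```

Note that adjacent groups overlap (neighboring triangular filters share bins), which is exactly why the bins feeding one feature must be treated jointly rather than independently.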

  • A Robust Viterbi Algorithm Against Impulsive Noise With Application to Speech Recognition

    Page(s): 2122 - 2133

    The Viterbi algorithm has been successfully applied to various pattern recognition and communication tasks. However, if some observations are corrupted by unknown impulsive noise that is not accounted for by the distortion measures, recognition performance can degrade significantly. In this paper, we propose a robust Viterbi algorithm that handles short impulsive noise with unknown characteristics by means of joint decoding and detection during the Viterbi search. To make the algorithm applicable to different noisy conditions with varying amounts of impulsive noise, we further propose an approach to efficiently estimate the number of corruptions. We demonstrate the effectiveness of the proposed robust algorithms in spoken digit recognition experiments under two different impulsive noise environments. Under random Gaussian replacement noise, the proposed algorithm reduced digit error by more than 65%. Under the GSM network environment, in which lost frames are replaced by interpolated neighboring frames, the robust algorithm reduced digit error by 20%. Furthermore, the proposed algorithm does not degrade performance when impulsive noise is absent.
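One way to picture the joint decoding and detection described above is a Viterbi search whose state is augmented with a corruption count: at each frame, a path may either trust the emission score or flag the frame as corrupted and pay a flat penalty, up to a budget. This is a hedged sketch of that idea; the penalty value and the fixed corruption budget are assumptions, not the paper's exact formulation:

```python
import numpy as np

def robust_viterbi_score(log_b, log_A, log_pi, max_outliers, outlier_logp=-2.0):
    """Viterbi search augmented with a corruption counter (illustrative
    sketch). A path may flag up to max_outliers frames as corrupted,
    replacing their emission log-likelihood with outlier_logp.
    log_b: (T, N) emission log-likelihoods; log_A: (N, N) transition
    log-probabilities; log_pi: (N,) initial log-probabilities."""
    T, N = log_b.shape
    K = max_outliers + 1                       # 0..max_outliers corruptions
    delta = np.full((T, N, K), -np.inf)
    delta[0, :, 0] = log_pi + log_b[0]         # trust frame 0
    if K > 1:
        delta[0, :, 1] = log_pi + outlier_logp  # flag frame 0 as corrupted
    for t in range(1, T):
        # Best predecessor for each (next state, corruption count)
        best_prev = np.max(delta[t - 1][:, None, :] + log_A[:, :, None], axis=0)
        delta[t] = best_prev + log_b[t][:, None]        # trust frame t
        flagged = np.full((N, K), -np.inf)
        flagged[:, 1:] = best_prev[:, :-1] + outlier_logp  # flag frame t
        delta[t] = np.maximum(delta[t], flagged)
    return delta[-1].max()
```

Allowing corruptions can only raise the best achievable path score, and with `max_outliers=0` the search reduces to the standard Viterbi algorithm, consistent with the abstract's claim that performance does not degrade when impulsive noise is absent.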


Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research