
IEEE Transactions on Audio, Speech, and Language Processing

Issue 6 • Aug. 2010


Displaying results 1-25 of 57
  • Table of contents

    Publication Year: 2010, Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2010, Page(s): C2
  • Enhanced Phone Posteriors for Improving Speech Recognition Systems

    Publication Year: 2010, Page(s): 1094 - 1106
    Cited by: Papers (11)

    Using phone posterior probabilities has been increasingly explored for improving automatic speech recognition (ASR) systems. In this paper, we propose two approaches for hierarchically enhancing these phone posteriors by integrating long acoustic context as well as phonetic and lexical knowledge. In the first approach, phone posteriors estimated with a multilayer perceptron (MLP) are used as emission probabilities in hidden Markov model (HMM) forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge) and context. In the second approach, temporal contexts of the regular MLP posteriors are postprocessed by a secondary MLP in order to learn inter- and intra-dependencies between the phone posteriors. These dependencies represent phonetic knowledge. The learned knowledge is integrated into the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced phone posteriors. We investigate the use of the enhanced posteriors in hybrid HMM/artificial neural network (ANN) and Tandem configurations. We propose using the enhanced posteriors as a replacement for, or as complementary evidence to, the regular MLP posteriors. The proposed methods have been tested on different small- and large-vocabulary databases, always resulting in consistent improvements in frame, phone, and word recognition rates.
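
As a concrete illustration of the first approach, here is a minimal numpy sketch of forward-backward smoothing applied to a matrix of frame-level MLP posteriors. The transition matrix A and prior pi, which would carry the phonetic and lexical constraints, are assumed inputs, and the raw posteriors are used directly as scaled emission scores, a simplification relative to a full hybrid system.

```python
import numpy as np

def enhance_posteriors(post, A, pi):
    """Forward-backward smoothing of frame-level phone posteriors.

    post: (T, K) MLP phone posteriors, used here as scaled emission scores
    A:    (K, K) phone transition matrix (rows sum to 1)
    pi:   (K,)  initial phone distribution
    Returns the (T, K) enhanced posteriors gamma[t, k].
    """
    T, K = post.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * post[0]
    alpha[0] /= alpha[0].sum()                    # scale to avoid underflow
    for t in range(1, T):
        alpha[t] = post[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (post[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```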

  • Restoration of Audio Documents by Means of Extended Kalman Filter

    Publication Year: 2010, Page(s): 1107 - 1115
    Cited by: Papers (5)

    We present some results on audio restoration obtained with an algorithm that solves the problems of broadband noise filtering, signal parameter tracking, and impulsive noise removal using extended Kalman filter (EKF) theory. We show that, to achieve maximum performance, it is essential to optimize the EKF implementation. To this end, having to cope with the nonstationarity of the audio signal, we use two properly combined EKF filters (forward and backward) and introduce a bootstrapping procedure for model tracking. The careful combination of the proposed techniques and an accurate choice of some critical parameters allow the performance of the EKF algorithm to be improved. The presented procedure is validated by listening tests.
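
The paper's combined forward/backward EKF with bootstrapping is too involved for an excerpt, but the forward recursion it builds on is compact. A sketch for an AR(1) signal observed in broadband noise; the model parameters a, q, and r below are assumptions, not the paper's values.

```python
import numpy as np

def kalman_denoise(y, a=0.95, q=0.1, r=1.0):
    """Forward Kalman recursion for an AR(1) signal x[t] = a*x[t-1] + w
    observed in broadband noise y[t] = x[t] + v, with w~N(0,q), v~N(0,r)."""
    y = np.asarray(y, dtype=float)
    x_hat, p = 0.0, 1.0
    out = np.empty_like(y)
    for t, yt in enumerate(y):
        # predict
        x_pred = a * x_hat
        p_pred = a * a * p + q
        # update
        k = p_pred / (p_pred + r)          # Kalman gain
        x_hat = x_pred + k * (yt - x_pred)
        p = (1.0 - k) * p_pred
        out[t] = x_hat
    return out
```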

  • Multiple Fundamental Frequency Estimation and Polyphony Inference of Polyphonic Music Signals

    Publication Year: 2010, Page(s): 1116 - 1126
    Cited by: Papers (14)

    This paper presents a frame-based system for estimating multiple fundamental frequencies (F0s) of polyphonic music signals based on the short-time Fourier transform (STFT) representation. To estimate the number of sources along with their F0s, we propose estimating the noise level beforehand and then jointly evaluating all the possible combinations among pre-selected F0 candidates. Given a set of F0 hypotheses, their hypothetical partial sequences are derived, taking into account where partial overlap may occur. A score function is used to select the plausible sets of F0 hypotheses. To infer the best combination, hypothetical sources are progressively combined and iteratively verified. A hypothetical source is considered valid if it either explains more energy than the noise or significantly improves the envelope smoothness once the overlapping partials are treated. The proposed system was submitted to the Music Information Retrieval Evaluation eXchange (MIREX) 2007 and 2008 contests, where the accuracy was evaluated with respect to the number of sources inferred and the precision of the F0s estimated. The encouraging results demonstrate its competitive performance among state-of-the-art methods.
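
A toy version of the joint-evaluation idea can make the overlap handling concrete: score a combination of F0 hypotheses against STFT magnitudes while crediting each spectral bin at most once. The partial count and frequency tolerance below are arbitrary illustrative choices, not the paper's score function.

```python
import numpy as np

def joint_score(spec, freqs, f0_set, n_partials=10, tol=0.03):
    """Score a set of F0 hypotheses against STFT magnitudes; each bin
    is credited once, so partials overlapping between hypothetical
    sources are not double-counted."""
    used = np.zeros(len(freqs), dtype=bool)
    total = 0.0
    for f0 in sorted(f0_set):
        for h in range(1, n_partials + 1):
            idx = int(np.argmin(np.abs(freqs - h * f0)))
            if not used[idx] and abs(freqs[idx] - h * f0) <= tol * h * f0:
                total += spec[idx]
                used[idx] = True
    return total
```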

  • Speech Enhancement Using Gaussian Scale Mixture Models

    Publication Year: 2010, Page(s): 1127 - 1136
    Cited by: Papers (4)

    This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose covariance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM, and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided a higher signal-to-noise ratio (SNR), and those reconstructed from the estimated log-spectra produced a lower word recognition error rate because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress.
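
The generative model is easy to state in code. A sketch of drawing one (scalar, for brevity) log-spectrum and frequency coefficient from a GSMM with assumed component parameters:

```python
import numpy as np

def sample_gsmm(weights, means, variances, rng=None):
    """Draw one (log-spectrum, frequency-coefficient) pair from a GSMM:
    a GMM component generates the log-spectrum, and the coefficient is
    zero-mean Gaussian with variance exp(log-spectrum)."""
    rng = rng or np.random.default_rng()
    k = rng.choice(len(weights), p=weights)           # pick a component
    log_spec = rng.normal(means[k], np.sqrt(variances[k]))
    coeff = rng.normal(0.0, np.exp(0.5 * log_spec))   # std = sqrt(exp(ls))
    return log_spec, coeff
```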

  • Integrated Active Noise Control and Noise Reduction in Hearing Aids

    Publication Year: 2010, Page(s): 1137 - 1146
    Cited by: Papers (8)

    This paper presents combined active noise control and noise reduction schemes for hearing aids to tackle secondary path effects and effects of noise leakage through an open fitting. While such leakage contributions and the secondary acoustic path from the loudspeaker to the tympanic membrane are usually not taken into account in standard noise reduction systems, they appear to have a non-negligible impact on the final signal-to-noise ratio. Using a noise-reduction algorithm and an active noise control system in cascade may be efficient as long as the causality margin of the system is large enough. Putting the two functional blocks in parallel and then integrating them is found to lead to a more robust algorithm. A Filtered-x Multichannel Wiener Filter is presented and applied to integrate noise reduction and active noise control. The cascaded scheme and the integrated scheme are compared experimentally with a Multichannel Wiener Filter in a classic noise reduction framework without active noise control, where the integrated scheme is found to provide the best performance.
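
For orientation, the filtered-x structure underlying such schemes can be written in a few lines. This is a plain single-channel filtered-x LMS loop, not the paper's Filtered-x Multichannel Wiener Filter; s_hat is an assumed secondary-path estimate, and the error-microphone signal is taken as given rather than synthesized acoustically.

```python
import numpy as np

def fxlms(x, e, s_hat, M=32, mu=0.01):
    """Single-channel filtered-x LMS: adapt the ANC filter w using the
    reference x pre-filtered through the secondary-path estimate s_hat,
    so the gradient accounts for the loudspeaker-to-eardrum path."""
    w = np.zeros(M)
    xf = np.convolve(x, s_hat)[:len(x)]   # filtered reference
    for n in range(M, len(x)):
        u = xf[n - M:n][::-1]             # most recent filtered samples
        w += mu * e[n] * u                # LMS update driven by error mic
    return w
```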

  • Extractive Speech Summarization Using Shallow Rhetorical Structure Modeling

    Publication Year: 2010, Page(s): 1147 - 1157
    Cited by: Papers (6)

    We propose an extractive summarization approach with a novel shallow rhetorical structure learning framework for speech summarization. One of the most under-utilized features in extractive summarization is hierarchical structure information: semantically cohesive units that are hidden in spoken documents. We first present empirical evidence that rhetorical structure is the underlying semantic information, which is rendered in linguistic and acoustic/prosodic forms in lecture speech. A segmental summarization method, where the document is partitioned into rhetorical units by K-means clustering, is first proposed to test this hypothesis. We show that this system produces summaries at 67.36% ROUGE-L F-measure, a 4.29% absolute increase in performance compared with the baseline system. We then propose Rhetorical-State Hidden Markov Models (RSHMMs) to automatically decode the underlying hierarchical rhetorical structure in speech. Tenfold cross-validation experiments are carried out on conference speeches. We show that the system based on RSHMMs gives a 71.31% ROUGE-L F-measure, an 8.24% absolute increase in lecture speech summarization performance compared with the baseline system without RSHMMs. Our method also outperforms a baseline with a conventional discourse feature. We further present a thorough investigation of the relative contribution of different features and show that, for lecture speech, speaker-normalized acoustic features contribute the most, at 68.5% ROUGE-L F-measure, compared to 62.9% for linguistic features and 59.2% for un-normalized acoustic features. This shows that the individual speaking style of each speaker is highly relevant to summarization.
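
ROUGE-L, the figure of merit quoted throughout, is an F-measure over the longest common subsequence (LCS) between a candidate and a reference summary. A small reference implementation over token lists:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between a candidate and a reference summary."""
    l = lcs_len(candidate, reference)
    if l == 0:
        return 0.0
    p, r = l / len(candidate), l / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```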

  • A Study on the Generalization Capability of Acoustic Models for Robust Speech Recognition

    Publication Year: 2010, Page(s): 1158 - 1169
    Cited by: Papers (12)

    In this paper, we explore the generalization capability of acoustic models for improving speech recognition robustness against noise distortions. While generalization in statistical learning theory originally refers to a model's ability to generalize well on unseen testing data drawn from the same distribution as the training data, we show that good generalization capability is also desirable for mismatched cases. One way to obtain such general models is to use a margin-based model training method, e.g., soft-margin estimation (SME), to enable some tolerance to acoustic mismatches without detailed knowledge of the distortion mechanisms, by enhancing the margins between competing models. Experimental results on the Aurora-2 and Aurora-3 connected digit string recognition tasks demonstrate that, by improving the model's generalization capability through SME training, speech recognition performance can be significantly improved in both matched and low-to-medium mismatched testing cases with no language model constraints. Recognition results show that SME performs better with than without mean and variance normalization, and therefore provides a complementary benefit to conventional feature normalization techniques, such that they can be combined to further improve system performance. Although this study focuses on noisy speech recognition, we believe the proposed margin-based learning framework can be extended to deal with different types of distortions and robustness issues in other machine learning applications.

  • Sentence Correction Incorporating Relative Position and Parse Template Language Models

    Publication Year: 2010, Page(s): 1170 - 1181
    Cited by: Papers (2)

    Sentence correction has been an important emerging issue in computer-assisted language learning. However, existing techniques based on grammar rules or statistical machine translation are still not robust enough to tackle the common errors in sentences produced by second language learners. In this paper, a relative position language model and a parse template language model are proposed to complement traditional language modeling techniques in addressing this problem. A corpus of erroneous English-Chinese language transfer sentences along with their corrected counterparts is created and manually judged by human annotators. Experimental results show that, compared to a state-of-the-art phrase-based statistical machine translation system, the error correction performance of the proposed approach achieves a significant improvement under human evaluation.

  • Making Confident Speaker Verification Decisions With Minimal Speech

    Publication Year: 2010, Page(s): 1182 - 1192
    Cited by: Papers (4)

    We propose an approach to estimating confidence measures on the verification score produced by a Gaussian mixture model (GMM)-based automatic speaker verification system, with application to drastically reducing the typical data requirements for producing a confident verification decision. The confidence measures are based on estimating the distribution of the observed frame scores. The confidence estimation procedure is also extended to produce robust results with very limited and highly correlated frame scores, as well as in the presence of score normalization. The proposed early verification decision method utilizes the developed confidence measures in a sequential hypothesis testing framework, demonstrating that as little as 2-10 s of speech on average can produce verification results approaching those obtained using an average of over 100 s of speech on the 2005 NIST SRE protocol.
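
The sequential flavor of the method can be illustrated with a simple stopping rule: accumulate frame scores and decide as soon as a confidence interval on their mean clears the verification threshold. This sketch assumes i.i.d. frame scores; handling their correlation and score normalization robustly is where the paper's actual contribution lies.

```python
import math

def early_decision(frame_scores, threshold, z=2.58, min_frames=50):
    """Accept or reject as soon as a confidence interval on the mean
    frame score clears the decision threshold; otherwise keep listening."""
    s = s2 = 0.0
    n = 0
    for x in frame_scores:
        n += 1
        s += x
        s2 += x * x
        if n < min_frames:
            continue
        mean = s / n
        var = max(s2 / n - mean * mean, 1e-12)
        half = z * math.sqrt(var / n)     # CI half-width (i.i.d. assumption)
        if mean - half > threshold:
            return "accept", n
        if mean + half < threshold:
            return "reject", n
    return "undecided", n
```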

  • Batch and Adaptive PARAFAC-Based Blind Separation of Convolutive Speech Mixtures

    Publication Year: 2010, Page(s): 1193 - 1207
    Cited by: Papers (7)

    We present a frequency-domain technique based on PARAllel FACtor (PARAFAC) analysis that performs multichannel blind source separation (BSS) of convolutive speech mixtures. PARAFAC algorithms are combined with a dimensionality reduction step to significantly reduce computational complexity. The identifiability potential of PARAFAC is exploited to derive a BSS algorithm for the under-determined case (more speakers than microphones), combining PARAFAC analysis with time-varying Capon beamforming. Finally, a low-complexity adaptive version of the BSS algorithm is proposed that can track changes in the mixing environment. Extensive experiments with realistic and measured data corroborate our claims, including the under-determined case. Signal-to-interference ratio improvements of up to 6 dB are shown compared to state-of-the-art BSS algorithms, at an order of magnitude lower computational complexity.
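
PARAFAC itself is compact to implement. Below is a minimal rank-R alternating-least-squares fit of a three-way array, omitting the paper's dimensionality-reduction step and beamforming; it is only meant to show the decomposition the method builds on.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product: (I, R) and (J, R) -> (I*J, R)."""
    return np.einsum('ir,jr->ijr', a, b).reshape(-1, a.shape[1])

def parafac_als(X, R, n_iter=100, seed=0):
    """Rank-R PARAFAC of a 3-way array X via alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    X0 = X.reshape(I, J * K)                    # mode-1 unfolding
    X1 = np.moveaxis(X, 1, 0).reshape(J, I * K) # mode-2 unfolding
    X2 = np.moveaxis(X, 2, 0).reshape(K, I * J) # mode-3 unfolding
    for _ in range(n_iter):
        A = X0 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X1 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X2 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C
```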

  • Glottal-Shape Codebook to Improve Robustness of CELP Codecs

    Publication Year: 2010, Page(s): 1208 - 1217
    Cited by: Papers (4)

    This paper presents a new technique for the class of code-excited linear prediction speech codecs, designed to reduce error propagation after lost frames. Its principle consists of replacing the interframe long-term prediction with a glottal-shape codebook in the subframe containing the first glottal impulse in a given frame. This technique, independent of previous frames, is of particular interest in voiced speech frames following transitions, as these frames are the most sensitive to frame erasures. It is the basis of a structured coding scheme called transition coding (TC). TC greatly improves codec performance over noisy channels while maintaining clean-channel performance. It is part of the new embedded speech and audio codec recently standardized as Recommendation G.718 by the ITU-T.

  • Reduction of the Impact of Distortion Outliers and Source Mismatch in Resolution-Constrained Quantization

    Publication Year: 2010, Page(s): 1218 - 1227
    Cited by: Papers (2)

    The rate-distortion performance of conventional resolution-constrained quantization (RCQ) based on the mean-squared error criterion (MSE-RCQ) is generally compromised by the impact of distortion outliers and source mismatch. Not only the mean distortion but also the number of distortion outliers should be considered in quantizer design. Thus, we propose the use of a design criterion that gives more importance to the tail of the source distribution, which leads to RCQ based on the second moment of distortion (SMD-RCQ). A continuous range of alternatives between MSE-RCQ and SMD-RCQ is also defined and implemented based on the weighted arithmetic-mean measure (WAM-RCQ). It can be used to control the centroid density in the tail of the source distribution. Experimental results with a Gaussian source and line spectral frequencies (LSFs) show that the proposed WAM-RCQ not only produces a mean distortion similar to that of conventional MSE-RCQ, but also has a lower percentage of distortion outliers and a significantly reduced sensitivity to source mismatch.
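
The idea of reweighting the distortion criterion drops neatly into a Lloyd-style iteration: replace each cell's squared-error centroid (the mean) with the minimizer of a higher power of squared error, which pulls codewords toward the tail of the distribution. A scalar sketch with an assumed numerical centroid search, not the paper's WAM design; p = 1 recovers the usual MSE design, larger p approaches the second-moment criterion.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def train_rcq(x, n_levels, p=1.0, n_iter=30):
    """Lloyd-style scalar quantizer designed under E[((x - c)^2)^p]:
    larger p penalizes the distortion tail more heavily, trading a
    little mean distortion for fewer outliers."""
    cb = np.quantile(x, np.linspace(0.05, 0.95, n_levels))
    for _ in range(n_iter):
        assign = np.abs(x[:, None] - cb[None, :]).argmin(axis=1)
        for i in range(n_levels):
            cell = x[assign == i]
            if len(cell) == 0:
                continue
            if cell.min() == cell.max():
                cb[i] = cell[0]
            else:
                # generalized centroid: no closed form for p != 1
                cb[i] = minimize_scalar(
                    lambda c, v=cell: np.mean(((v - c) ** 2) ** p),
                    bounds=(cell.min(), cell.max()), method='bounded').x
    return np.sort(cb)
```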

  • Acoustic Source Localization and Tracking Using Track Before Detect

    Publication Year: 2010, Page(s): 1228 - 1242
    Cited by: Papers (9)

    Particle filter-based acoustic source localization algorithms attempt to track the position of a sound source (one or more people speaking in a room) based on the current data from a microphone array as well as all previous data up to that point. This paper first discusses some of the inherent behavioral traits of the steered beamformer localization function. Using conclusions drawn from that study, a multitarget methodology for acoustic source tracking based on the Track Before Detect (TBD) framework is introduced. The algorithm also implicitly evaluates source activity using a variable appended to the state vector. Using the TBD methodology avoids the need to identify a set of source measurements and also allows for a vast increase in the number of particles for a comparable computational load, which results in increased tracking stability in challenging recording environments. An evaluation of tracking performance is given using a set of real speech recordings with two simultaneously active speech sources.
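
A heavily simplified single step of such a filter is sketched below: a bootstrap particle filter whose state carries position, velocity, and a binary activity variable, weighted by an assumed steered-beamformer response loc_map. Everything here (noise levels, switching probability, the flat inactive likelihood) is an illustrative placeholder.

```python
import numpy as np

def tbd_particle_step(particles, weights, loc_map, dt=0.05,
                      q_pos=0.05, p_switch=0.02, rng=None):
    """One predict/update step of a bootstrap TBD particle filter.
    particles: (N, 5) array of [x, y, vx, vy, active]; loc_map(x, y)
    returns the steered-beamformer response at (x, y)."""
    rng = rng or np.random.default_rng()
    N = len(particles)
    # predict: constant-velocity motion plus process noise
    particles[:, 0:2] += dt * particles[:, 2:4] + q_pos * rng.standard_normal((N, 2))
    particles[:, 2:4] += 0.1 * rng.standard_normal((N, 2))
    # the activity variable flips with small probability
    flip = rng.random(N) < p_switch
    particles[flip, 4] = 1.0 - particles[flip, 4]
    # update: active particles weighted by the localization function,
    # inactive ones by a flat "noise floor" likelihood
    lik = np.where(particles[:, 4] > 0.5,
                   np.array([loc_map(x, y) for x, y in particles[:, :2]]),
                   0.1)
    weights = weights * lik
    weights /= weights.sum()
    # systematic resampling when the effective sample size drops
    if 1.0 / np.sum(weights ** 2) < N / 2:
        idx = rng.choice(N, N, p=weights)
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    return particles, weights
```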

  • Speech Enhancement With Inventory Style Speech Resynthesis

    Publication Year: 2010, Page(s): 1243 - 1257
    Cited by: Papers (7)

    We present a new method for the enhancement of speech. The method is designed for scenarios in which targeted speaker enrollment as well as system training within the typical noise environment are feasible. The proposed procedure is fundamentally different from most conventional and state-of-the-art denoising approaches: instead of filtering a distorted signal, we resynthesize a new “clean” signal based on its likely characteristics, which are estimated from the distorted signal. A successful implementation of the proposed method is presented. Experiments were performed in a scenario with roughly one hour of clean speech training data. Our results show that the proposed method compares very favorably to other state-of-the-art systems in both objective and subjective speech quality assessments. Potential applications for the proposed method include jet cockpit communication systems and offline methods for the restoration of audio recordings.

  • A Multipulse-Based Forward Error Correction Technique for Robust CELP-Coded Speech Transmission Over Erasure Channels

    Publication Year: 2010, Page(s): 1258 - 1268
    Cited by: Papers (3)

    The widely used code-excited linear prediction (CELP) paradigm relies on a strong interframe dependency which renders CELP-based codecs vulnerable to packet loss. The use of long-term prediction (LTP) or adaptive codebooks (ACB) is the main source of interframe dependency in these codecs, since they employ the excitation from previous frames. After a frame erasure, the previous excitation is unavailable and a desynchronization between the encoder and the decoder appears, causing an additional distortion which propagates to subsequent frames. In this paper, we propose a novel media-specific forward error correction (FEC) technique which restores LTP synchronization with no additional delay, at the cost of a very small overhead. In particular, the proposed FEC code contains a multipulse signal which replaces the excitation of the previous frame (i.e., the ACB memory) when it has been lost. This multipulse description of the previous excitation is optimized to minimize the perceptual error between the synthesized speech signal and the original one. To this end, we develop a multipulse formulation which includes the additional CELP processing and can cope with the presence of advanced LTP filters and the usual subframe segmentation applied in modern codecs. Finally, a quantization scheme is proposed to encode the pulse parameters. Objective and subjective quality tests applied to our proposal show that the propagation error due to the LTP filter can practically be removed with a very small bandwidth increase.

  • Error Approximation and Minimum Phone Error Acoustic Model Estimation

    Publication Year: 2010, Page(s): 1269 - 1279
    Cited by: Papers (6)

    Minimum phone error (MPE) acoustic parameter estimation involves calculation of edit distances (errors) between correct and incorrect hypotheses. In the context of large-vocabulary continuous-speech recognition, this error calculation becomes prohibitively expensive and so errors are approximated. This paper introduces a novel error approximation technique. Analysis shows that this approximation yields a higher correlation to the Levenshtein error metric than a previously used approximation. Experimental evaluations on a large-vocabulary recognition task demonstrate that the novel approximation also delivers significant performance improvements over the previously used approximation when applied to MPE acoustic model estimation.
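
The exact Levenshtein metric against which the approximations are correlated is a classic dynamic program; a compact two-row implementation over phone label sequences:

```python
def levenshtein(ref, hyp):
    """Exact edit distance between a reference and a hypothesis
    (lists of phone labels), the metric MPE approximations target."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]
```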

  • Simultaneous Estimation of Chords and Musical Context From Audio

    Publication Year: 2010, Page(s): 1280 - 1289
    Cited by: Papers (25)

    Chord labels provide a concise description of musical harmony. In pop and jazz music, a sequence of chord labels is often the only written record of a song, and forms the basis of so-called lead sheets. We devise a fully automatic method to simultaneously estimate from an audio waveform the chord sequence (including bass notes), the metric positions of chords, and the key. The core of the method is a six-layered dynamic Bayesian network, in which the four hidden source layers jointly model metric position, key, chord, and bass pitch class, while the two observed layers model low-level audio features corresponding to bass and treble tonal content. Using 109 different chords, our method provides substantially more harmonic detail than previous approaches while maintaining a high level of accuracy. We show that, with 71% correctly classified chords, our method significantly exceeds the state of the art when tested against manually annotated ground truth transcriptions on the 176 audio tracks from the MIREX 2008 Chord Detection Task. We introduce a measure of segmentation quality and show that bass and meter modeling are especially beneficial for obtaining the correct level of granularity.

  • A Variable Step-Size Matrix Normalized Subband Adaptive Filter

    Publication Year: 2010, Page(s): 1290 - 1299
    Cited by: Papers (9)

    The normalized subband adaptive filter (NSAF) presented by Lee and Gan can obtain a faster convergence rate than the normalized least-mean-square (NLMS) algorithm with colored input signals. However, similar to other fixed step-size adaptive filtering algorithms, the NSAF requires a tradeoff between fast convergence rate and low misadjustment. Recently, a set-membership NSAF (SM-NSAF) was developed to address this problem. Nevertheless, in order to determine the error bound of the SM-NSAF, the power of the system noise must be known. In this paper, we propose a variable step-size matrix NSAF (VSSM-NSAF) from another point of view, i.e., recovering the powers of the subband system noises from those of the subband error signals of the adaptive filter, to further improve the performance of the NSAF. The VSSM-NSAF uses an effective system noise power estimation method, which can also be applied to the under-modeling scenario, and therefore does not need to know the powers of the subband system noises in advance. In addition, the steady-state mean-square behavior of the proposed algorithm is analyzed, which theoretically proves that the VSSM-NSAF can obtain low misadjustment. Simulation results show good performance of the new algorithm compared to other members of the NSAF family.
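
For reference, the fullband NLMS baseline that the NSAF family generalizes applies a power-normalized step to each update; the subband variants do the same per subband. A minimal system-identification sketch (x: input, d: desired signal):

```python
import numpy as np

def nlms(x, d, M=64, mu=0.5, eps=1e-6):
    """Fullband NLMS: adapt an M-tap filter w so that w @ u tracks d."""
    w = np.zeros(M)
    e = np.zeros(len(x))
    for n in range(M, len(x)):
        u = x[n - M:n][::-1]                    # most recent M samples
        e[n] = d[n] - w @ u
        w += (mu / (eps + u @ u)) * e[n] * u    # power-normalized step
    return w, e
```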

  • GMM-SVM Kernel With a Bhattacharyya-Based Distance for Speaker Recognition

    Publication Year: 2010, Page(s): 1300 - 1312
    Cited by: Papers (3)

    Among conventional methods for text-independent speaker recognition, the Gaussian mixture model (GMM) is known for its effectiveness and scalability in modeling the spectral distribution of speech. A GMM supervector characterizes a speaker's voice by the GMM parameters, such as the mean vectors, covariance matrices, and mixture weights. Besides the first-order statistics, it is generally believed that a speaker's cues are partly conveyed by the second-order statistics. In this paper, we introduce a Bhattacharyya-based GMM distance to measure the distance between two GMM distributions. Subsequently, the GMM-UBM mean interval (GUMI) concept is introduced to derive a GUMI kernel which can be used in conjunction with a support vector machine (SVM) for speaker recognition. The GUMI kernel allows us to exploit the speaker's information not only from the mean vectors of the GMM but also from the covariance matrices. Moreover, by analyzing the Bhattacharyya-based GMM distance measure, we extend the Bhattacharyya-based kernel to involve both the mean and covariance statistical dissimilarities. We demonstrate the effectiveness of the new kernel on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2006 dataset.
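
The per-component building block is the closed-form Bhattacharyya distance between two Gaussians. For the diagonal covariances typical of speaker-recognition GMMs it reduces to a few array operations:

```python
import numpy as np

def bhattacharyya_gaussian(m1, v1, m2, v2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians
    with means m1, m2 and per-dimension variances v1, v2."""
    v = 0.5 * (v1 + v2)                               # average covariance
    term_mean = 0.125 * np.sum((m1 - m2) ** 2 / v)    # mean separation
    term_cov = 0.5 * np.sum(np.log(v / np.sqrt(v1 * v2)))  # shape mismatch
    return term_mean + term_cov
```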

  • Exploiting Morphology and Local Word Reordering in English-to-Turkish Phrase-Based Statistical Machine Translation

    Publication Year: 2010, Page(s): 1313 - 1322
    Cited by: Papers (2)

    In this paper, we present the results of our work on the development of a phrase-based statistical machine translation prototype from English to Turkish, an agglutinative language with very productive inflectional and derivational morphology. We experiment with different morpheme-level representations for English-Turkish parallel texts. Additionally, to help with word alignment, we experiment with local word reordering on the English side, to bring the word order of specific English prepositional phrases and auxiliary verb complexes in line with the morpheme order of the corresponding case-marked nouns and complex verbs on the Turkish side. To alleviate the dearth of parallel data available, we also augment the training data with sentences containing just content word roots, obtained from the original training data to bias root word alignment, and with highly reliable phrase pairs from an earlier corpus alignment. We use a morpheme-based language model in decoding and a word-based language model in re-ranking the n-best lists generated by the decoder. Lastly, we present a scheme for repairing the decoder output by correcting words which have incorrect morphological structure or which are out-of-vocabulary with respect to the training data and language model, to further improve the translations. We improve from 15.53 BLEU points for our word-based baseline model to 25.17 BLEU points, an improvement of 9.64 points, or about 62% relative.
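
A quick check of the headline numbers:

```python
baseline, improved = 15.53, 25.17
gain = improved - baseline                      # 9.64 BLEU points absolute
print(f"{gain:.2f} points, {100 * gain / baseline:.1f}% relative")  # ~62.1%
```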

  • Active Learning With Sampling by Uncertainty and Density for Data Annotations

    Publication Year: 2010, Page(s): 1323 - 1331
    Cited by: Papers (4)

    To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses a classifier to identify the unlabeled examples on which it has the least confidence. Uncertainty sampling often presents problems when outliers are selected. To solve the outlier problem, this paper presents two techniques: sampling by uncertainty and density (SUD), and density-based re-ranking. Both techniques prefer not only the most informative example in terms of an uncertainty criterion, but also the most representative example in terms of a density criterion. Experimental results on active learning for word sense disambiguation and text classification tasks, using six real-world evaluation data sets, demonstrate the effectiveness of the proposed methods.
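
A minimal sketch of a SUD-style score: prediction entropy as the uncertainty term, multiplied by a density term so that high-uncertainty outliers in sparse regions are downweighted. The density here is a crude proxy (mean cosine similarity to all other pool examples) standing in for a proper neighbourhood-based estimate.

```python
import numpy as np

def sud_select(probs, X, k=10):
    """Rank pool examples by uncertainty x density and return the top k.

    probs: (N, C) classifier posteriors for the unlabeled pool
    X:     (N, D) feature vectors for the same examples
    """
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # uncertainty
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    density = (sim.sum(axis=1) - 1.0) / (len(X) - 1)      # mean sim to others
    return np.argsort(-(ent * density))[:k]               # top-k to annotate
```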

  • Selecting Feature Frames for Automatic Speaker Recognition Using Mutual Information

    Publication Year: 2010, Page(s): 1332 - 1340
    Cited by: Papers (3)

    In this paper, an information-theoretic approach to selecting feature frames for speaker recognition systems is proposed. A conventional approach, in which the frame shift is fixed to around half the frame length, may not be the best choice, because the characteristics of the speech signal may change rapidly, especially at phonetic boundaries. Experimental results show that recognition accuracy increases if the frame interval is directly controlled using phonetic information. Building on these results and the well-known fact that recognition accuracy is directly correlated with the amount of mutual information, this paper suggests a novel feature frame selection method for speaker recognition. Specifically, feature frames are chosen to have minimum redundancy among the selected frames but maximum relevance to the speaker models. Experiments verify that the proposed method produces consistent improvement, especially in a speaker verification system. It is also robust against variations in the acoustic environment.
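
The minimum-redundancy, maximum-relevance criterion suggests a greedy selection loop. A sketch assuming a per-frame relevance score (e.g., mutual information with the speaker models) and a pairwise frame-similarity matrix are given as inputs:

```python
import numpy as np

def select_frames(relevance, similarity, k):
    """Greedy max-relevance, min-redundancy frame selection: pick frames
    highly relevant to the speaker model, penalizing similarity to
    frames already selected."""
    chosen = [int(np.argmax(relevance))]
    rest = set(range(len(relevance))) - set(chosen)
    while len(chosen) < k and rest:
        best = max(rest, key=lambda i: relevance[i]
                   - np.mean([similarity[i, j] for j in chosen]))
        chosen.append(best)
        rest.remove(best)
    return chosen
```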

  • MMSE-Based Packet Loss Concealment for CELP-Coded Speech Recognition

    Publication Year: 2010, Page(s): 1341 - 1353
    Cited by: Papers (2)

    In this paper, we analyze the performance of network speech recognition (NSR) over IP networks, adapting and proposing new solutions to the packet loss problem for code-excited linear prediction (CELP) codecs. NSR has a client-server architecture which places the recognizer at the server side, using a standard speech codec for speech transmission. Its main advantage is that no changes are required to existing client devices and networks. However, the use of speech codecs degrades its performance, mainly in the presence of packet losses. First, we study the degradations introduced by CELP codecs in lossy packet networks. We then propose a reconstruction technique based on minimum mean square error (MMSE) estimation using hidden Markov models. This approach also allows us to obtain reliability measures associated with each estimate. We show how to use this information to improve recognition performance by means of soft-data decoding and a weighted Viterbi algorithm. Experimental results are obtained for two well-known CELP codecs, G.729 and AMR 12.2 kbps, carrying out recognition from decoded speech. Finally, we analyze an efficient and improved implementation of the proposed techniques using an NSR system which extracts speech recognition features directly from the bit-stream parameters. The experimental results show that the different proposed NSR systems achieve performance comparable to distributed speech recognition (DSR).
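
Of the pieces described, the weighted Viterbi algorithm is the most self-contained: each frame's acoustic log-likelihood is scaled by a per-frame reliability weight (1 for clean frames, smaller for concealed ones), so unreliable frames influence the decoded path less. A sketch with assumed log-domain HMM parameters:

```python
import numpy as np

def weighted_viterbi(logB, logA, log_pi, gamma):
    """Viterbi decoding where each frame's acoustic log-likelihood
    logB[t] is scaled by a reliability weight gamma[t] in [0, 1]."""
    T, N = logB.shape
    delta = log_pi + gamma[0] * logB[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA        # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)        # best predecessor of j
        delta = scores.max(axis=0) + gamma[t] * logB[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```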


Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.


This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research