
IEEE Transactions on Audio, Speech, and Language Processing

Issue 5 • Sept. 2006

  • Table of contents

    Page(s): c1 - c4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): c2
  • From the Editor-in-Chief

    Page(s): 1489
  • Structured speech modeling

    Page(s): 1492 - 1504

    Modeling dynamic structure of speech is a novel paradigm in speech recognition research within the generative modeling framework, and it offers the potential to overcome limitations of the current hidden Markov modeling approach. Analogous to structured language models, where syntactic structure is exploited to represent long-distance relationships among words, the structured speech model described in this paper makes use of the dynamic structure in the hidden vocal tract resonance space to characterize long-span contextual influence among phonetic units. A general overview is provided first on hierarchically classified types of dynamic speech models in the literature. A detailed account is then given for a specific model type called the hidden trajectory model, and we describe detailed steps of model construction and the parameter estimation algorithms. We show how the use of resonance target parameters and their temporal filtering enables joint modeling of long-span coarticulation and phonetic reduction effects. Experiments on phonetic recognition evaluation demonstrate superior recognizer performance over a modern hidden Markov model-based system. Error analysis shows that the greatest performance gain occurs within the sonorant speech class.
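
    The central mechanism, smoothing per-phone resonance targets with a temporal filter so that each frame reflects its neighbors, can be illustrated compactly. The following is a toy Python sketch with assumed targets, durations, and filter shape, not the paper's actual model:

        import numpy as np

        def vtr_trajectory(targets, durations, gamma=0.6, span=7):
            # Frame-level target sequence: repeat each phone's resonance
            # target (e.g., an F1 value in Hz) over the phone's duration.
            t = np.repeat(targets, durations).astype(float)
            # Symmetric, exponentially decaying FIR kernel, normalized to
            # sum to one; its span carries context across phone boundaries.
            k = np.arange(-(span // 2), span // 2 + 1)
            h = gamma ** np.abs(k)
            h /= h.sum()
            # The filtered trajectory undershoots the targets of short
            # phones, jointly modeling coarticulation and reduction.
            return np.convolve(t, h, mode="same")

        # Three phones; the short middle phone never reaches its target.
        traj = vtr_trajectory(targets=[500, 300, 650], durations=[12, 4, 12])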

  • Multistage speaker diarization of broadcast news

    Page(s): 1505 - 1512

    This paper describes recent advances in speaker diarization with a multistage segmentation and clustering system, which incorporates a speaker identification step. This system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but has a tendency to split data from speakers with a large quantity of data into several segment clusters. Several improvements to the baseline system have been made. First, the iterative Gaussian mixture model (GMM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the National Institute of Standards and Technology (NIST) RT-04F and ESTER evaluation data, the multistage system reduces the speaker error by over 70% relative to the baseline system, and gives between 40% and 50% reduction relative to a single-stage BIC clustering system.
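
    The BIC agglomerative step can be summarized in one rule: merge two clusters when modeling them with a single Gaussian is cheaper, in BIC terms, than keeping two. A minimal numpy sketch of the standard delta-BIC criterion follows, assuming full-covariance Gaussians and a tunable penalty weight lam; it illustrates the criterion, not LIMSI's exact implementation:

        import numpy as np

        def delta_bic(x, y, lam=1.0):
            # x, y: (frames, dim) feature matrices for two clusters.
            # A negative delta-BIC favors merging x and y.
            def logdet_cov(z):
                return np.linalg.slogdet(np.cov(z, rowvar=False))[1]
            n1, n2, d = len(x), len(y), x.shape[1]
            n = n1 + n2
            penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
            return (0.5 * n * logdet_cov(np.vstack([x, y]))
                    - 0.5 * n1 * logdet_cov(x)
                    - 0.5 * n2 * logdet_cov(y)
                    - penalty)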

  • Progress in the CU-HTK broadcast news transcription system

    Page(s): 1513 - 1525

    Broadcast news (BN) transcription has been a challenging research area for many years. In the last couple of years, the availability of large amounts of roughly transcribed acoustic training data and advanced model training techniques has offered the opportunity to greatly reduce the error rate on this task. This paper describes the design and performance of BN transcription systems which make use of these developments. First, the effects of using lightly supervised training data and advanced acoustic modeling techniques are discussed. The design of a real-time broadcast news recognition system using these new models is then detailed. As system combination has been found to yield large gains in performance, a range of frameworks that allow multiple recognition outputs to be combined are next described. These include the use of multiple types of acoustic models and multiple segmentations. As a contrast, a system developed by multiple sites to allow cross-site combination, the "SuperEARS" system, is also described. The various models and recognition configurations are evaluated using several recent BN development and evaluation test sets. These new BN transcription systems can give gains of over 25% relative to the CU-HTK 2003 BN system.

  • Enriching speech recognition with automatic detection of sentence boundaries and disfluencies

    Page(s): 1526 - 1540

    Effective human and automatic processing of speech requires recovery of more than just the words. It also involves recovering phenomena such as sentence boundaries, filler words, and disfluencies, referred to as structural metadata. We describe a metadata detection system that combines information from different types of textual knowledge sources with information from a prosodic classifier. We investigate maximum entropy and conditional random field models, as well as the predominant hidden Markov model (HMM) approach, and find that discriminative models generally outperform generative models. We report system performance on both broadcast news and conversational telephone speech tasks, illustrating significant performance differences across tasks and as a function of recognizer performance. The results represent the state of the art, as assessed in the NIST RT-04F evaluation.

  • Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system

    Page(s): 1541 - 1556

    This paper describes the progress made in the transcription of broadcast news (BN) and conversational telephone speech (CTS) within the combined BBN/LIMSI system from May 2002 to September 2004. During that period, BBN and LIMSI collaborated in an effort to produce significant reductions in the word error rate (WER), as directed by the aggressive goals of the Effective, Affordable, Reusable Speech-to-Text [Defense Advanced Research Projects Agency (DARPA) EARS] program. The paper focuses on general modeling techniques that led to recognition accuracy improvements, as well as engineering approaches that enabled efficient use of large amounts of training data and fast decoding architectures. Special attention is given to efforts to integrate components of the BBN and LIMSI systems, discussing the tradeoff between speed and accuracy for various system combination strategies. Results on the EARS progress test sets show that the combined BBN/LIMSI system achieved relative WER reductions of 47% and 51% on the BN and CTS domains, respectively.

  • An overview of automatic speaker diarization systems

    Page(s): 1557 - 1565

    Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources, and other signal source/channel characteristics. Diarization can be used for helping speech recognition, facilitating the searching and indexing of audio archives, and increasing the richness of automatic transcriptions, making them more readable. In this paper, we provide an overview of the approaches currently used in a key area of audio diarization, namely speaker diarization, and discuss their relative merits and limitations. Performances using the different techniques are compared within the framework of the speaker diarization task in the DARPA EARS Rich Transcription evaluations. We also look at how the techniques are being introduced into real broadcast news systems and their portability to other domains and tasks such as meetings and speaker verification.

  • Recognizing disfluencies in conversational speech

    Page(s): 1566 - 1573

    We present a system for modeling disfluency in conversational speech: repairs, fillers, and self-interruption points (IPs). For each sentence, candidate repair analyses are generated by a stochastic tree adjoining grammar (TAG) noisy-channel model. A probabilistic syntactic language model scores the fluency of each analysis, and a maximum-entropy model selects the most likely analysis given the language model score and other features. Fillers are detected independently via a small set of deterministic rules, and IPs are detected by combining the output of repair and filler detection modules. In the recent Rich Transcription Fall 2004 (RT-04F) blind evaluation, systems competed to detect these three forms of disfluency under two input conditions: a best-case scenario of manually transcribed words and a fully automatic case of automatic speech recognition (ASR) output. For all three tasks and on both types of input, our system was the top performer in the evaluation.
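
    The filler module is the simplest of the three: a handful of deterministic rules over the word stream. The rules below are hypothetical stand-ins (the abstract does not list the actual rule set), meant only to show the shape of such a detector:

        FILLED_PAUSES = {"uh", "um", "er", "ah"}          # assumed lexicon
        DISCOURSE_MARKERS = [("you", "know"), ("i", "mean"), ("well",)]

        def detect_fillers(tokens):
            # Return the indices of tokens flagged as fillers.
            low = [t.lower() for t in tokens]
            hits = {i for i, t in enumerate(low) if t in FILLED_PAUSES}
            for i in range(len(low)):
                for m in DISCOURSE_MARKERS:
                    if tuple(low[i:i + len(m)]) == m:
                        hits.update(range(i, i + len(m)))
            return sorted(hits)

        print(detect_fillers("well I um I mean it works".split()))

    A real rule set would also condition on context (a sentence-initial "well" is not always a filler); the sketch flags every match.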

  • Edit disfluency detection and correction using a cleanup language model and an alignment model

    Page(s): 1574 - 1583

    This investigation presents a novel approach to detecting and correcting edit disfluencies in spontaneous speech. Hypothesis testing using acoustic features is first adopted to detect potential interruption points (IPs) in the input speech. The word order of the utterance is then cleaned up around the potential IPs using a class-based cleanup language model, and the deletable region and its correction are aligned using an alignment model. Finally, log-linear weighting is applied to optimize the performance. Using the acoustic features, the IP detection rate is significantly improved, especially in terms of recall. Based on the positions of the potential IPs, the cleanup language model and the alignment model are able to detect and correct edit disfluencies efficiently. Experimental results demonstrate that the proposed approach achieves error rates of 0.33 and 0.21 for IP detection and edit word deletion, respectively.

  • Large margin hidden Markov models for speech recognition

    Page(s): 1584 - 1595

    In this paper, motivated by large margin classifiers in machine learning, we propose a novel method to estimate continuous-density hidden Markov models (CDHMMs) for speech recognition according to the principle of maximizing the minimum multiclass separation margin. The approach is named large margin HMM. First, we show that this type of large margin HMM estimation problem can be formulated as a constrained minimax optimization problem. Second, we propose to solve this constrained minimax optimization problem by using a penalized gradient descent algorithm, where the original objective function, i.e., the minimum margin, is approximated by a differentiable function and the constraints are cast as penalty terms in the objective function. The new training method is evaluated on the speaker-independent isolated E-set recognition and the TIDIGITS connected digit string recognition tasks. Experimental results clearly show that the large margin HMMs consistently outperform conventional HMM training methods. It has been consistently observed that the large margin training method yields significant recognition error rate reduction even on top of some popular discriminative training methods.
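
    The estimation idea, maximize the minimum margin with the constraints folded into penalties, can be mimicked at toy scale. The sketch below replaces CDHMM likelihoods with linear class scores, smooths the minimum with a soft-min, uses a norm penalty, and climbs a numerical gradient; it illustrates the penalized optimization only, not the paper's analytic CDHMM updates:

        import numpy as np

        def soft_min(v, tau=0.1):
            # Differentiable approximation of min(v).
            return -tau * np.log(np.exp(-v / tau).sum())

        def objective(W, X, y, rho=1.0):
            scores = X @ W.T                         # (samples, classes)
            idx = np.arange(len(y))
            correct = scores[idx, y]
            rival = scores.copy()
            rival[idx, y] = -np.inf
            margins = correct - rival.max(axis=1)    # separation margins
            return soft_min(margins) - rho * (W ** 2).sum()

        def train(X, y, n_classes, steps=200, lr=0.05, eps=1e-5):
            rng = np.random.default_rng(0)
            W = rng.normal(scale=0.1, size=(n_classes, X.shape[1]))
            for _ in range(steps):
                base = objective(W, X, y)
                g = np.zeros_like(W)
                for i in np.ndindex(*W.shape):
                    Wp = W.copy()
                    Wp[i] += eps
                    g[i] = (objective(Wp, X, y) - base) / eps
                W += lr * g                          # grow the margin
            return W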

  • Advances in speech transcription at IBM under the DARPA EARS program

    Page(s): 1596 - 1608

    This paper describes the technical and system building advances made in IBM's speech recognition technology over the course of the Defense Advanced Research Projects Agency (DARPA) Effective Affordable Reusable Speech-to-Text (EARS) program. At a technical level, these advances include the development of a new form of feature-based minimum phone error training (fMPE), the use of large-scale discriminatively trained full-covariance Gaussian models, the use of septaphone acoustic context in static decoding graphs, and improvements in basic decoding algorithms. At a system building level, the advances include a system architecture based on cross-adaptation and the incorporation of 2100 h of training data in every system component. We present results on English conversational telephony test data from the 2003 and 2004 NIST evaluations. The combination of technical advances and an order of magnitude more training data in 2004 reduced the error rate on the 2003 test set by approximately 21% relative (from 20.4% to 16.1%) over the most accurate system in the 2003 evaluation and produced the most accurate results on the 2004 test sets in every speed category.

  • Hidden Markov model-based packet loss concealment for voice over IP

    Page(s): 1609 - 1623

    As voice over IP proliferates, packet loss concealment (PLC) at the receiver has emerged as an important factor in determining voice quality of service. Through the use of heuristic variations of signal and parameter repetition and overlap-add interpolation to handle packet loss, conventional PLC systems largely ignore the dynamics of the statistical evolution of the speech signal, possibly leading to perceptually annoying artifacts. To address this problem, we propose the use of hidden Markov models for PLC. With a hidden Markov model (HMM) tracking the evolution of speech signal parameters, we demonstrate how PLC is performed within a statistical signal processing framework. Moreover, we show how the HMM is used to index a specially designed PLC module for the particular signal context, leading to signal-contingent PLC. Simulation examples, objective tests, and subjective listening tests are provided showing the ability of an HMM-based PLC built with a sinusoidal analysis/synthesis model to provide better loss concealment than a conventional PLC based on the same sinusoidal model for all types of speech signals, including onsets and signal transitions.
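
    The tracking-and-prediction step admits a compact sketch: keep a posterior over HMM states while packets arrive; when one is lost, propagate the posterior through the transition matrix and synthesize from the expected parameters. All quantities below (A, means, belief) are assumed inputs, and the paper's system indexes a per-state concealment module rather than outputting raw means:

        import numpy as np

        def conceal(A, means, belief, n_lost):
            # A: (S, S) state transition matrix; means: (S, D) expected
            # speech parameters per state (e.g., sinusoidal model params);
            # belief: (S,) posterior over states at the last good frame.
            out = []
            for _ in range(n_lost):
                belief = belief @ A          # predict the next state
                out.append(belief @ means)   # expected parameter vector
            return np.array(out)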

  • Single and double frame coding of speech LPC parameters using a lattice-based quantization scheme

    Page(s): 1624 - 1632

    A lattice-based scheme for the single-frame and the double-frame quantization of the speech line spectral frequency parameters is proposed. The lattice structure provides a low-complexity vector quantization framework, which is implemented using a trellis structure. In the single-frame scheme, the intraframe dependencies are exploited using a linear predictor. In the double-frame scheme, the parameters of two consecutive frames are jointly quantized, and hence the interframe dependencies are also exploited. A switched scheme is also considered, in which both lattice-based double-frame and single-frame quantization are performed for each pair of frames and the one resulting in lower distortion is chosen. Comparisons to the Split-VQ, the Multi-Stage VQ, the Trellis Coded Quantization, the interframe Block-Based Trellis Quantizer, and the interframe schemes used in the IS-641 EFRC and the GSM AMR codec are provided. These results demonstrate the effectiveness of the proposed lattice-based quantization schemes, which maintain a very low complexity. Finally, the issue of robustness to channel errors is investigated.
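
    The switching logic itself is straightforward: code each frame pair both ways and keep the lower-distortion result. In the sketch below, q_single and q_double are hypothetical callables standing in for the lattice/trellis quantizers; only the selection rule is illustrated:

        import numpy as np

        def switched_quantize(f1, f2, q_single, q_double):
            ref = np.concatenate([f1, f2])
            s = np.concatenate([q_single(f1), q_single(f2)])
            d = q_double(ref)                  # joint two-frame coding
            err_s = np.sum((ref - s) ** 2)     # illustrative distortion
            err_d = np.sum((ref - d) ** 2)
            return ("single", s) if err_s <= err_d else ("double", d)

    In a real codec the chosen mode would also be signaled to the decoder, at a cost of one extra bit per frame pair.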

  • Robust extended multidelay filter and double-talk detector for acoustic echo cancellation

    Page(s): 1633 - 1644

    We propose an integrated acoustic echo cancellation solution based on a novel class of efficient and robust adaptive algorithms in the frequency domain, the extended multidelay filter (EMDF). The approach is tailored to very long adaptive filters and highly autocorrelated input signals as they arise in wideband full-duplex audio applications. The EMDF algorithm allows an attractive tradeoff between the well-known multidelay filter and the recursive least-squares algorithm. It exhibits fast convergence, superior tracking of the signal statistics, and very low delay. The low computational complexity of conventional frequency-domain adaptive algorithms can be maintained thanks to efficient fast realizations. We also show how this approach can be combined efficiently with a suitable double-talk detector (DTD). We consider a corresponding extension of a recently proposed DTD based on a normalized cross-correlation vector, whose performance was shown to be superior to other DTDs based on the cross-correlation coefficient. Since the resulting DTD also has an EMDF structure, it is easy to implement, and the fast realization also carries over to the DTD scheme. Moreover, as the robustness issue during double talk is particularly crucial for fast-converging algorithms, we apply concepts from robust statistics to our extended frequency-domain approach. Due to the robust generalization of the cost function, leading to a so-called M-estimator, the algorithms become inherently less sensitive to outliers, i.e., short bursts that may be caused by inevitable detection failures of a DTD. The proposed structure is also well suited for an efficient generalization to the multichannel case.

  • Estimation of the short-term predictor parameters of speech under noisy conditions

    Page(s): 1645 - 1655

    Speech coding algorithms that have been developed for clean speech are often used in a noisy environment. We describe maximum a posteriori (MAP) and minimum mean square error (MMSE) techniques to estimate the clean-speech short-term predictor (STP) parameters from noisy speech. The MAP and MMSE estimates are obtained using a likelihood function computed by means of the DFT or Kalman filtering and empirical probability distributions based on multidimensional histograms. The method is assessed in terms of the resulting root mean spectral distortion between the "clean" speech STP parameters and the STP parameters computed with the proposed method from noisy speech. The estimated parameters are also applied to obtain clean speech estimates by means of a Kalman filter. The quality of the estimated speech as compared to the "clean" speech is assessed by means of subjective tests, signal-to-noise ratio improvement, and the perceptual speech quality measurement method.
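
    The difference between the two estimators is easy to show over a discrete candidate grid, which is also how histogram-based priors are naturally evaluated. The sketch assumes precomputed log-likelihoods and log-priors per candidate STP parameter vector; the paper's DFT/Kalman likelihood computation is not reproduced:

        import numpy as np

        def map_estimate(loglik, logprior, grid):
            # Pick the candidate maximizing likelihood x prior.
            return grid[np.argmax(loglik + logprior)]

        def mmse_estimate(loglik, logprior, grid):
            # Posterior-weighted mean over the candidate grid.
            logpost = loglik + logprior
            w = np.exp(logpost - logpost.max())
            w /= w.sum()
            return w @ grid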

  • An approach to automatic acquisition of translation templates based on phrase structure extraction and alignment

    Page(s): 1656 - 1663

    In this paper, we propose a new approach for automatically acquiring translation templates from unannotated bilingual spoken language corpora. Two basic algorithms are adopted: a grammar induction algorithm, and an alignment algorithm using bracketing transduction grammar. The approach is unsupervised, statistical, and data-driven, and employs no parsing procedure. The acquisition procedure consists of two steps. First, semantic groups and phrase structure groups are extracted from both the source language and the target language. Second, an alignment algorithm based on bracketing transduction grammar aligns the phrase structure groups. The aligned phrase structure groups are post-processed, yielding translation templates. Preliminary experimental results show that the algorithm is effective.

  • The NESPOLE! System for multilingual speech communication over the Internet

    Page(s): 1664 - 1673

    The NESPOLE! System is a speech communication system designed to support multilingual interaction between common users and providers of e-commerce services over the Internet. The core of the system is a distributed interlingua-based speech-to-speech translation system, which is supported by multimodal capabilities that allow the two parties participating in the communication to share Web pages and graphical content that can be annotated using gestures. We describe the unique features and considerations behind the design and implementation of this system, and evaluate them in the context of a full prototype developed for the domain of travel planning.

  • Comparative study on corpora for speech translation

    Page(s): 1674 - 1682

    This paper investigates issues in preparing corpora for developing speech-to-speech translation (S2ST). It is impractical to create a broad-coverage parallel corpus only from dialog speech. An alternative approach is to have bilingual experts write conversational-style texts in the target domain, with translations. There is, however, a risk of losing fidelity to the actual utterances. This paper focuses on balancing a tradeoff between these two kinds of corpora through the analysis of two newly developed corpora in the travel domain: a bilingual parallel corpus with 420 K utterances and a collection of in-domain dialogs using actual S2ST systems. We found that the first corpus is effective for covering utterances in the second corpus if complemented with a small number of utterances taken from monolingual dialogs. We also found that characteristics of in-domain utterances become closer to those of the first corpus when more restrictive conditions and instructions to speakers are given. These results suggest the possibility of a bootstrap-style development of corpora and S2ST systems, where an initial S2ST system is developed with parallel texts and is then gradually improved with in-domain utterances collected by the system as restrictions are relaxed.

  • A high-speed, low-resource ASR back-end based on custom arithmetic

    Page(s): 1683 - 1693

    With the skyrocketing popularity of mobile devices, new processing methods tailored to a specific application have become necessary for low-resource systems. This work presents a high-speed, low-resource speech recognition system using custom arithmetic units, where all system variables are represented by integer indices and all arithmetic operations are replaced by hardware-based table lookups. To this end, several reordering and rescaling techniques, including two accumulation structures for Gaussian evaluation and a novel method for the normalization of Viterbi search scores, are proposed to ensure low entropy for all variables. Furthermore, a discriminatively inspired distortion measure is investigated for scalar quantization of forward probabilities to maximize the recognition rate. Finally, heuristic algorithms are explored to optimize system-wide resource allocation. Our best bit-width allocation scheme only requires 59 kB of ROMs to hold the lookup tables, and its recognition performance with various vocabulary sizes in both clean and noisy conditions is nearly as good as that of a system using a 32-bit floating-point unit. Simulations on various architectures show that, on most modern processor designs, we can expect a cycle-count speedup of at least three times over systems with floating-point units. Additionally, the memory bandwidth is reduced by over 70% and the offline storage for model parameters is reduced by 80%.
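
    The core idea, replacing floating-point arithmetic with table lookups over small integer indices, is easy to demonstrate for the log-add operation that dominates Gaussian and Viterbi computation. The quantizer range and bit width below are illustrative, not the paper's allocation; note a full 8-bit pair table is 64 kB, the same order as the 59 kB reported for all tables combined:

        import numpy as np

        LO, HI, BITS = -30.0, 0.0, 8
        LEVELS = np.linspace(LO, HI, 2 ** BITS)   # codebook for log-probs

        def encode(x):
            return int(np.abs(LEVELS - x).argmin())

        # "ROM": log-add results for every index pair, stored as indices.
        ROM = np.array([[encode(np.logaddexp(a, b)) for b in LEVELS]
                        for a in LEVELS], dtype=np.uint8)

        def logadd_idx(i, j):
            return ROM[i, j]                      # one lookup, no floats

        i, j = encode(-2.0), encode(-2.5)
        print(LEVELS[logadd_idx(i, j)], np.logaddexp(-2.0, -2.5))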

  • Discriminative cluster adaptive training

    Page(s): 1694 - 1703

    Multiple-cluster schemes, such as cluster adaptive training (CAT) or eigenvoice systems, are a popular approach for rapid speaker and environment adaptation. Interpolation weights are used to transform a multiple-cluster canonical model to a standard hidden Markov model (HMM) set representative of an individual speaker or acoustic environment. Maximum likelihood training for CAT has previously been investigated. However, in state-of-the-art large vocabulary continuous speech recognition systems, discriminative training is commonly employed. This paper investigates applying discriminative training to multiple-cluster systems. In particular, minimum phone error (MPE) update formulae for CAT systems are derived. In order to use MPE in this case, modifications to the standard MPE smoothing function and the prior distribution associated with MPE training are required. A more complex adaptive training scheme combining both interpolation weights and linear transforms, a structured transform (ST), is also discussed within the MPE training framework. Discriminatively trained CAT and ST systems were evaluated on a state-of-the-art conversational telephone speech task. These multiple-cluster systems were found to outperform both standard and adaptively trained systems.
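
    The adaptation mechanics reduce to one line: a speaker-specific Gaussian mean is an interpolation of the canonical cluster means. A minimal sketch with made-up dimensions and weights:

        import numpy as np

        def speaker_mean(cluster_means, weights):
            # mu(speaker) = sum_c lambda_c * mu_c: the interpolation
            # weights transform the canonical multiple-cluster model
            # into a speaker-specific HMM mean.
            return weights @ cluster_means

        M = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [2.0, 2.0]])          # 3 clusters, feature dim 2
        lam = np.array([0.5, 0.3, 0.2])     # estimated per speaker
        print(speaker_mean(M, lam))         # -> [0.9, 0.7]

    What the paper adds is estimating the canonical parameters and these weights discriminatively with MPE rather than by maximum likelihood.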

  • Recursive likelihood evaluation and fast search algorithm for polynomial segment model with application to speech recognition

    Page(s): 1704 - 1718

    Polynomial segment models (PSMs), which are a generalization of hidden Markov models (HMMs), have opened an alternative research direction for speech recognition. However, they have been limited by their computational complexity. Traditionally, any change in a PSM segment boundary requires likelihood recomputation over all the frames within the segment. This makes the PSM's segment likelihood evaluation an order of magnitude more expensive than the HMM's. Furthermore, because recognition using segment models needs to search over all possible segment boundaries, PSM recognition is computationally infeasible beyond N-best rescoring. By exploiting the properties of the time normalization in PSMs, and by decomposing the PSM segment likelihood into a simple function of "sufficient statistics", in this paper we show that segment likelihood can be evaluated efficiently with a computational complexity similar to the HMM's. In addition, by reformulating PSM recognition as a search for the optimal path through a graph, this paper introduces a fast PSM search algorithm that intelligently prunes the number of hypothesized segment boundaries, such that PSM recognition can be performed with a complexity similar to the HMM's. We demonstrate the effectiveness of the proposed algorithms with experiments using a PSM-based recognition system on two different recognition tasks: TIDIGITS digit recognition and the Wall Street Journal dictation task. In both tasks, PSM recognition is feasible and outperformed a traditional HMM by more than 14%.
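
    The "sufficient statistics" decomposition can be sketched with prefix sums: precompute cumulative moments of the observations once per utterance, then obtain any segment's normalized-time polynomial moments in O(1) via a binomial change of basis, so moving a boundary never forces recomputation over frames. The sketch below covers one feature dimension and the mean-trajectory moments only, not the full likelihood:

        import numpy as np
        from math import comb

        def prefix_stats(x, order=2):
            # S[k][t] = sum_{u < t} u**k * x[u], one pass per order.
            u = np.arange(len(x), dtype=float)
            return [np.concatenate([[0.0], np.cumsum((u ** k) * x)])
                    for k in range(order + 1)]

        def segment_moments(S, a, b, order=2):
            # m_k = sum_{t=a}^{b-1} tau**k * x[t], tau = (t - a)/(b - a),
            # obtained from absolute-time sums via the expansion of
            # (t - a)**k; only O(order**2) work per boundary hypothesis.
            L = float(b - a)
            m = []
            for k in range(order + 1):
                s = sum(comb(k, j) * (-a) ** (k - j) * (S[j][b] - S[j][a])
                        for j in range(k + 1))
                m.append(s / L ** k)
            return m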

  • Association pattern language modeling

    Page(s): 1719 - 1728

    Statistical n-gram language modeling is popular for speech recognition and many other applications. The conventional n-gram suffers from an inability to model long-distance language dependencies. This paper presents a novel approach focusing on mining long-distance word associations and incorporating these features into language models based on linear interpolation and maximum entropy (ME) principles. We highlight the discovery of associations of multiple distant words from the training corpus. A mining algorithm is exploited to recursively merge the frequent word subsets and efficiently construct the set of association patterns. By combining the features of association patterns into n-gram models, the association pattern n-grams are estimated, with a special realization, the trigger-pair n-gram, in which only the associations of two distant words are considered. In experiments on Chinese language modeling, we find that the incorporation of association patterns significantly reduces the perplexities of n-gram models. The incorporation using ME outperforms that using linear interpolation. The association pattern n-gram is superior to the trigger-pair n-gram. The perplexities are further reduced using more association steps. Further, the proposed association pattern n-grams not only elevate document classification accuracy but also improve speech recognition rates.
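
    The trigger-pair special case is simple to prototype: count word pairs that co-occur within a long window, rank them by an association score, and interpolate the triggered estimate with the n-gram. Window size and thresholds below are illustrative, and the score is proportional to pointwise mutual information (constant terms dropped since they do not affect the ranking):

        import math
        from collections import Counter

        def mine_trigger_pairs(sentences, window=10, min_count=5, top_n=1000):
            uni, pair, total = Counter(), Counter(), 0
            for sent in sentences:
                total += len(sent)
                uni.update(sent)
                for i, w in enumerate(sent):
                    for v in sent[i + 1:i + 1 + window]:
                        pair[(w, v)] += 1
            def score(p):                      # PMI up to a constant
                w, v = p
                return math.log(pair[p] * total / (uni[w] * uni[v]))
            cands = [p for p, c in pair.items() if c >= min_count]
            return sorted(cands, key=score, reverse=True)[:top_n]

        def interpolated_prob(p_ngram, p_trigger, lam=0.8):
            # Simple linear interpolation; the paper reports that the
            # maximum entropy combination outperforms this mixture.
            return lam * p_ngram + (1 - lam) * p_trigger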


Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research