
IEEE Transactions on Audio, Speech, and Language Processing

Issue 8 • October 2012

  • Table of contents

    Page(s): C1 - C4
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
    Freely Available from IEEE
  • A Parametric Objective Quality Assessment Tool for Speech Signals Degraded by Acoustic Echo

    Page(s): 2181 - 2190

    This paper discusses the automatic quality assessment of echo-degraded speech in the context of teleconference systems. Subjective listening tests conducted over a carefully designed database of signals degraded by acoustic echo have been used to assess how this impairment is perceived and to determine which parameters have a significant impact on speech quality. The results have shown that, similarly to electric transmission line echo, acoustic echo is mainly influenced by echo delay and echo gain. Based on this observation, a mapping between these two parameters and the mean subjective score is devised. Moreover, a signal-based algorithm for the estimation of these parameters is described, and its performance is evaluated. The complete system comprising both the parameter estimators and the mapping function achieves a correlation of 94% between predicted and actual subjective scores, and can be employed as a non-intrusive monitoring tool for in-service quality evaluation of teleconference systems. Further validation indicates the operating range of the proposed quality assessment tool can be extended by proper retraining.

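    The core idea above, mapping two estimated parameters (echo delay and echo gain) to a predicted subjective score, can be illustrated with a generic parametric surface. A minimal sketch follows; the logistic form, the coefficient values, and the function name are illustrative assumptions, not the mapping published in the paper.

        import numpy as np

        def predict_mos(delay_ms, gain_db, coeffs=(4.2, -0.01, -0.08, 1.0)):
            """Toy mapping from echo delay (ms) and echo gain (dB) to a MOS-like
            score in [1, 5].  The logistic shape and coefficients are placeholders;
            a real tool would fit them to subjective listening-test scores."""
            a, b, c, d = coeffs
            x = a + b * delay_ms + c * gain_db         # linear combination of the two parameters
            return 1.0 + 4.0 / (1.0 + np.exp(-d * x))  # squash into the MOS range 1..5

        # Example: a long delay with a strong echo should map to a lower predicted score.
        print(predict_mos(50.0, -30.0), predict_mos(400.0, -5.0))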
  • Nonlinear Compensation Using the Gauss–Newton Method for Noise-Robust Speech Recognition

    Page(s): 2191 - 2206

    In this paper, we present the Gauss-Newton method as a unified approach to estimating noise parameters of the prevalent nonlinear compensation models, such as vector Taylor series (VTS), data-driven parallel model combination (DPMC), and unscented transform (UT), for noise-robust speech recognition. While iterative estimation of noise means in a generalized EM framework has been widely known, we demonstrate that such approaches are variants of the Gauss-Newton method. Furthermore, we propose a novel noise variance estimation algorithm that is consistent with the Gauss-Newton principle. The formulation of the Gauss-Newton method reduces the noise estimation problem to determining the Jacobians of the corrupted speech parameters. For sampling-based compensations, we present two methods, sample Jacobian average (SJA) and cross-covariance (XCOV), to evaluate these Jacobians. The proposed noise estimation algorithm is evaluated for various compensation models on two tasks. The first is to fit a Gaussian mixture model (GMM) to artificially corrupted samples, and the second is to perform speech recognition on the Aurora 2 database. The significant performance improvements confirm the efficacy of the Gauss-Newton method in estimating the noise parameters of the nonlinear compensation models.

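    The unifying tool named in this abstract is the Gauss-Newton iteration for nonlinear least-squares problems. Below is a minimal, generic Gauss-Newton solver, not the VTS/DPMC/UT-specific derivation in the paper; the toy exponential-fit residual is an assumption chosen only to show the update theta <- theta - (J^T J)^{-1} J^T r.

        import numpy as np

        def gauss_newton(residual, jacobian, theta0, iters=20):
            """Generic Gauss-Newton: minimise ||residual(theta)||^2 by iterating
            theta <- theta - (J^T J)^{-1} J^T r, with J = jacobian(theta)."""
            theta = np.asarray(theta0, dtype=float)
            for _ in range(iters):
                r = residual(theta)
                J = jacobian(theta)
                theta = theta - np.linalg.solve(J.T @ J, J.T @ r)
            return theta

        # Toy problem: fit y = a * exp(b * t) to noisy samples (a stand-in for the
        # nonlinear environment function relating clean speech, noise and corrupted speech).
        rng = np.random.default_rng(0)
        t = np.linspace(0.0, 1.0, 50)
        y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.normal(size=50)
        res = lambda th: th[0] * np.exp(th[1] * t) - y
        jac = lambda th: np.column_stack([np.exp(th[1] * t),
                                          th[0] * t * np.exp(th[1] * t)])
        print(gauss_newton(res, jac, [1.0, -1.0]))  # should approach [2.0, -1.5]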
  • Learning Content Similarity for Music Recommendation

    Page(s): 2207 - 2218

    Many tasks in music information retrieval, such as recommendation and playlist generation for online radio, fall naturally into the query-by-example setting, wherein a user queries the system by providing a song, and the system responds with a list of relevant or similar song recommendations. Such applications ultimately depend on the notion of similarity between items to produce high-quality results. Current state-of-the-art systems employ collaborative filter methods to represent musical items, effectively comparing items in terms of their constituent users. While collaborative filter techniques perform well when historical data is available for each item, their reliance on historical data impedes performance on novel or unpopular items. To combat this problem, practitioners rely on content-based similarity, which naturally extends to novel items, but is typically outperformed by collaborative filter methods. In this paper, we propose a method for optimizing content-based similarity by learning from a sample of collaborative filter data. The optimized content-based similarity metric can then be applied to answer queries on novel and unpopular items, while still maintaining high recommendation accuracy. The proposed system yields accurate and efficient representations of audio content, and experimental results show significant improvements in accuracy over competing content-based recommendation techniques.

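    The central idea, optimising a content-based similarity metric so that it agrees with collaborative-filter data, can be sketched with a simple pairwise metric-learning loop. The hinge loss, learning rate, and random features below are illustrative assumptions; the paper's actual formulation and its audio features are more elaborate.

        import numpy as np

        def learn_metric(X, sim_pairs, dis_pairs, margin=1.0, lr=0.01, epochs=50):
            """Learn a Mahalanobis matrix W (kept positive semidefinite) over content
            features X so that pairs judged similar by collaborative-filter data end up
            closer than dissimilar pairs by at least `margin` (hinge loss, projected gradient)."""
            d = X.shape[1]
            W = np.eye(d)
            for _ in range(epochs):
                for (i, j), (k, l) in zip(sim_pairs, dis_pairs):
                    ds = X[i] - X[j]
                    dd = X[k] - X[l]
                    # squared Mahalanobis distances under the current metric
                    if ds @ W @ ds + margin > dd @ W @ dd:       # margin violated
                        W -= lr * (np.outer(ds, ds) - np.outer(dd, dd))
                        vals, vecs = np.linalg.eigh(W)           # project back onto the PSD cone
                        W = (vecs * np.clip(vals, 0.0, None)) @ vecs.T
            return W

        # Tiny synthetic example: 10 "songs" with 5-dimensional content features.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(10, 5))
        W = learn_metric(X, sim_pairs=[(0, 1), (2, 3)], dis_pairs=[(0, 5), (2, 7)])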
  • Bandwidth Extension of Telephone Speech to Low Frequencies Using Sinusoidal Synthesis and a Gaussian Mixture Model

    Page(s): 2219 - 2231

    The quality of narrowband telephone speech is degraded by the limited audio bandwidth. This paper describes a method that extends the bandwidth of telephone speech to the frequency range 0-300 Hz. The method generates the lowest harmonics of voiced speech using sinusoidal synthesis. The energy in the extension band is estimated from spectral features using a Gaussian mixture model. The amplitudes and phases of the synthesized sinusoidal components are adjusted based on the amplitudes and phases of the narrowband input speech, which provides adaptivity to varying input bandwidth characteristics. The proposed method was evaluated with listening tests in combination with another bandwidth extension method for the frequency range 4-8 kHz. While the low-frequency bandwidth extension was not found to improve perceived quality, the method reduced dissimilarity with wideband speech.

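    The synthesis step, generating the lowest harmonics of voiced speech below 300 Hz, reduces to summing a few sinusoids whose level matches an estimated extension-band energy. The sketch below assumes the fundamental frequency, the target energy, and the harmonic phases are already available (in the paper the energy comes from a GMM and the phases from the narrowband input); it is not the authors' implementation.

        import numpy as np

        def synthesize_low_band(f0_hz, target_energy, phases, n_samples, fs=8000, fmax=300.0):
            """Generate the harmonics of f0 that fall below `fmax` and scale them so the
            frame has the requested energy.  phases[k] is the phase of the (k+1)-th harmonic."""
            t = np.arange(n_samples) / fs
            n_harm = int(fmax // f0_hz)
            x = np.zeros(n_samples)
            for k in range(1, n_harm + 1):
                x += np.cos(2 * np.pi * k * f0_hz * t + phases[k - 1])
            energy = np.sum(x ** 2)
            if energy > 0:
                x *= np.sqrt(target_energy / energy)   # match the estimated extension-band energy
            return x

        frame = synthesize_low_band(f0_hz=120.0, target_energy=1e-2,
                                    phases=np.zeros(4), n_samples=160)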
  • Robust Patchwork-Based Embedding and Decoding Scheme for Digital Audio Watermarking

    Page(s): 2232 - 2239

    This paper presents a novel patchwork-based embedding and decoding scheme for digital audio watermarking. At the embedding stage, an audio segment is divided into two subsegments and the discrete cosine transform (DCT) coefficients of the subsegments are computed. The DCT coefficients related to a specified frequency region are then partitioned into a number of frame pairs. The DCT frame pairs suitable for watermark embedding are chosen by a selection criterion and watermarks are embedded into the selected DCT frame pairs by modifying their coefficients, controlled by a secret key. The modifications are conducted in such a way that the selection criterion used at the embedding stage can be applied at the decoding stage to identify the watermarked DCT frame pairs. At the decoding stage, the secret key is utilized to extract watermarks from the watermarked DCT frame pairs. Compared with existing patchwork watermarking methods, the proposed scheme does not require information about which frame pairs of the watermarked audio signal enclose watermarks and is more robust to conventional attacks.

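    A classic patchwork embedder, which this paper refines, hides one bit in a pair of key-selected DCT coefficient sets by nudging their means apart. The sketch below is the textbook patchwork idea in the DCT domain under assumed parameter values, not the paper's frame-pair selection criterion.

        import numpy as np
        from scipy.fft import dct, idct

        def embed_bit(segment, bit, key, strength=0.05, band=(200, 600)):
            """Embed one bit into an audio segment (1-D float array) by raising one
            key-selected set of DCT coefficients and lowering another."""
            C = dct(segment, norm='ortho')
            idx = np.random.default_rng(key).permutation(np.arange(*band))  # key-driven selection
            half = len(idx) // 2
            sign = 1.0 if bit else -1.0
            C[idx[:half]] += sign * strength
            C[idx[half:2 * half]] -= sign * strength
            return idct(C, norm='ortho')

        def decode_bit(segment, key, band=(200, 600)):
            """Recover the bit by comparing the means of the two key-selected sets."""
            C = dct(segment, norm='ortho')
            idx = np.random.default_rng(key).permutation(np.arange(*band))
            half = len(idx) // 2
            return bool(np.mean(C[idx[:half]]) - np.mean(C[idx[half:2 * half]]) > 0)

        x = np.random.default_rng(0).normal(size=2048) * 0.1
        assert decode_bit(embed_bit(x, True, key=42), key=42)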
  • Structural Classification Methods Based on Weighted Finite-State Transducers for Automatic Speech Recognition

    Page(s): 2240 - 2251

    The potential of structural classification methods for automatic speech recognition (ASR) has attracted the speech community because such methods can unify the modeling of the acoustic and linguistic aspects of recognizers. However, the structural classification approaches involve well-known tradeoffs between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, features considered to calculate the likelihood of each hypothesis must be restricted to the same form as the conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible since most decoding techniques only require that their likelihood functions are factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved the ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems are already trained with discriminative training techniques (e.g., MPE).

  • Hidden Markov Acoustic Modeling With Bootstrap and Restructuring for Low-Resourced Languages

    Page(s): 2252 - 2264

    This paper proposes an acoustic modeling approach based on bootstrap and restructuring to deal with data sparsity for low-resourced languages. The goal of the approach is to improve the statistical reliability of acoustic modeling for automatic speech recognition (ASR) in the context of speed, memory and response latency requirements for real-world applications. In this approach, randomized hidden Markov models (HMMs) estimated from the bootstrapped training data are aggregated for reliable sequence prediction. The aggregation leads to an HMM with superior prediction capability at the cost of a substantially larger size. For practical usage the aggregated HMM is restructured by Gaussian clustering followed by model refinement. The restructuring aims at reducing the aggregated HMM to a desirable model size while maintaining its performance close to the original aggregated HMM. To that end, various Gaussian clustering criteria and model refinement algorithms have been investigated in the full covariance model space before the conversion to the diagonal covariance model space in the last stage of the restructuring. Large vocabulary continuous speech recognition (LVCSR) experiments on Pashto and Dari have shown that acoustic models obtained by the proposed approach can yield superior performance over the conventional training procedure with almost the same run-time memory consumption and decoding speed.

  • Phoneme Selective Speech Enhancement Using Parametric Estimators and the Mixture Maximum Model: A Unifying Approach

    Page(s): 2265 - 2279

    This study presents a ROVER speech enhancement algorithm that employs a series of prior enhanced utterances, each customized for a specific broad-level phoneme class, to generate a single composite utterance which provides overall improved objective quality across all classes. The noisy utterance is first partitioned into speech and non-speech regions using a voice activity detector, followed by a mixture maximum (MIXMAX) model which is used to make probabilistic decisions in the speech regions to determine phoneme class weights. The prior enhanced utterances are weighted by these decisions and combined to form the final composite utterance. The enhancement system that generates the prior enhanced utterances comprises a family of parametric gain functions whose parameters are flexible and can be varied to achieve high enhancement levels per phoneme class. These parametric gain functions are derived using 1) a weighted Euclidean distortion cost function, and 2) by modeling clean speech spectral magnitudes or discrete Fourier transform coefficients by Chi or two-sided Gamma priors, respectively. The special case estimators of these gain functions are the generalized spectral subtraction (GSS), minimum mean square error (MMSE), two-sided Gamma, or joint maximum a posteriori (MAP) estimators. Performance evaluations over two noise types and signal-to-noise ratios (SNRs) ranging from -5 dB to 10 dB suggest that the proposed ROVER algorithm not only outperforms the special case estimators but also the family of parametric estimators when all phoneme classes are jointly considered.

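    One member of the family of gain functions mentioned above, generalized spectral subtraction, is simple enough to sketch directly. The gain below is the standard GSS form with an oversubtraction factor and spectral floor; the parameter values are illustrative, and the paper's phoneme-class-specific tuning and MIXMAX weighting are not reproduced.

        import numpy as np

        def gss_gain(noisy_mag, noise_mag, alpha=2.0, gamma=2.0, floor=0.1):
            """Generalized spectral subtraction gain applied per frequency bin:
            G = max((1 - alpha * (N/|Y|)^gamma)^(1/gamma), floor),
            where alpha is the oversubtraction factor and gamma the spectral exponent."""
            ratio = (noise_mag / np.maximum(noisy_mag, 1e-12)) ** gamma
            gain = np.maximum(1.0 - alpha * ratio, 0.0) ** (1.0 / gamma)
            return np.maximum(gain, floor)

        # Per-frame usage on STFT magnitudes (noise_mag would come from a noise tracker or VAD).
        noisy_mag = np.abs(np.random.default_rng(0).normal(size=257)) + 1.0
        noise_mag = 0.3 * np.ones(257)
        enhanced_mag = gss_gain(noisy_mag, noise_mag) * noisy_mag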
  • Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech

    Page(s): 2280 - 2290

    In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.

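    The GMM-UBM verifier attacked in this study scores a trial by the log-likelihood ratio between a speaker model and a universal background model. A minimal stand-in is sketched below: it fits independent maximum-likelihood models rather than MAP-adapted ones and uses random features instead of real cepstra, so it only illustrates the scoring rule, not the evaluated systems.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        background = rng.normal(size=(2000, 13))          # stand-in for pooled MFCC frames
        target = rng.normal(loc=0.3, size=(400, 13))      # stand-in for the claimed speaker
        trial = rng.normal(loc=0.3, size=(200, 13))       # test utterance frames

        ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(background)
        spk = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(target)

        # Average per-frame log-likelihood ratio; accept if it exceeds a tuned threshold.
        llr = np.mean(spk.score_samples(trial) - ubm.score_samples(trial))
        print('accept' if llr > 0.0 else 'reject')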
  • Singer Identification Based on Spoken Data in Voice Characterization

    Page(s): 2291 - 2300

    Existing singer identification (SID) methods follow the framework of speaker identification (SPID), which requires that singing data be collected beforehand to establish each singer's voice characteristics. This framework, however, is unsuitable for many SID applications, because acquiring solo a cappella recordings from each singer is usually not as feasible as collecting spoken data in SPID applications. Since a cappella data are difficult to acquire, many studies have tried to improve SID accuracies when only accompanied singing data are available for training, but the improvements are not always satisfactory. Recognizing that spoken data are usually easy to obtain, this work investigates the possibility of characterizing singers' voices using spoken data instead of their singing data. Unfortunately, our experiment found it difficult to fully replace singing data with spoken data in singer voice characterization, due to the significant difference between the singing and speaking voices of most people. Thus, we propose two alternative solutions based on the use of a small amount of singing data. The first solution aims at adapting a speech-derived model to cover singing voice characteristics. The second solution attempts to establish the relationships between speech and singing using a transformation, so that an unknown test singing clip can be converted into its speech counterpart and then identified using speech-derived models; or alternatively, training data can be converted from speech into singing to generate a singer model capable of matching test singing clips. Our experiments conducted using a 20-singer database validate the proposed solutions.

  • Foreign Accent Conversion Through Concatenative Synthesis in the Articulatory Domain

    Page(s): 2301 - 2312

    We propose a concatenative synthesis approach to the problem of foreign accent conversion. The approach consists of replacing the most accented portions of nonnative speech with alternative segments from a corpus of the speaker's own speech based on their similarity to those from a reference native speaker. We propose and compare two approaches for selecting units, one based on acoustic similarity [e.g., mel frequency cepstral coefficients (MFCCs)] and a second one based on articulatory similarity, as measured through electromagnetic articulography (EMA). Our hypothesis is that articulatory features provide a better metric for linguistic similarity across speakers than acoustic features. To test this hypothesis, we recorded an articulatory-acoustic corpus from a native and a nonnative speaker, and evaluated the two speech representations (acoustic versus articulatory) through a series of perceptual experiments. Formal listening tests indicate that the approach can achieve a 20% reduction in perceived accent, but also reveal a strong coupling between accent and speaker identity. To address this issue, we disguised original and resynthesized utterances by altering their average pitch and normalizing vocal tract length. An additional listening experiment supports the hypothesis that articulatory features are less speaker dependent than acoustic features.

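    The unit-selection step, replacing each accented segment with the acoustically or articulatorily closest unit from the speaker's own corpus, is essentially a nearest-neighbour search over per-segment feature vectors. The sketch below is that search in its simplest form (Euclidean distance, no concatenation cost); the extraction of MFCC or EMA features per segment is assumed to happen elsewhere, and this is not the paper's full pipeline.

        import numpy as np

        def select_units(reference_feats, candidate_feats):
            """For each reference-native segment, pick the index of the candidate unit
            (from the nonnative speaker's own corpus) with the smallest Euclidean
            distance in the chosen feature space (acoustic MFCCs or articulatory EMA)."""
            picks = []
            for ref in reference_feats:
                d = np.linalg.norm(candidate_feats - ref, axis=1)
                picks.append(int(np.argmin(d)))
            return picks

        rng = np.random.default_rng(1)
        reference = rng.normal(size=(20, 12))    # segments from the native reference speaker
        candidates = rng.normal(size=(500, 12))  # units from the nonnative speaker's corpus
        print(select_units(reference, candidates)[:5])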
  • Automatic Transcription of Polyphonic Piano Music Using Genetic Algorithms, Adaptive Spectral Envelope Modeling, and Dynamic Noise Level Estimation

    Page(s): 2313 - 2328

    This paper presents a new method for multiple fundamental frequency (F0) estimation on piano recordings. We propose a framework based on a genetic algorithm in order to analyze the overlapping overtones and search for the most likely F0 combination. The search process is aided by adaptive spectral envelope modeling and dynamic noise level estimation: while the noise is dynamically estimated, the spectral envelope of previously recorded piano samples (internal database) is adapted in order to best match the piano played on the input signals and aid the search process for the most likely combination of F0s. For comparison, several state-of-the-art algorithms were run across various musical pieces played by different pianos and then compared using three different metrics. The proposed algorithm ranked first on the Hybrid Decay/Sustain Score metric, which correlates better with human auditory perception, and second on both the onset-only and onset-offset metrics. A previous genetic algorithm approach is also included in the comparison to show how the proposed system brings significant improvements in both the quality of the results and the computing time.

  • Generating Human-Like Behaviors Using Joint, Speech-Driven Models for Conversational Agents

    Page(s): 2329 - 2340

    During human communication, every spoken message is intrinsically modulated within different verbal and nonverbal cues that are externalized through various aspects of speech and facial gestures. These communication channels are strongly interrelated, which suggests that generating human-like behavior requires a careful study of their relationship. Neglecting the mutual influence of different communicative channels in the modeling of natural behavior for a conversational agent may result in unrealistic behaviors that can affect the intended visual perception of the animation. This relationship exists both between audiovisual information and within different visual aspects. This paper explores the idea of using joint models to preserve the coupling not only between speech and facial expression, but also within facial gestures. As a case study, the paper focuses on building a speech-driven facial animation framework to generate natural head and eyebrow motions. We propose three dynamic Bayesian networks (DBNs), which make different assumptions about the coupling between speech, eyebrow and head motion. Synthesized animations are produced based on the MPEG-4 facial animation standard, using the audiovisual IEMOCAP database. The experimental results based on perceptual evaluations reveal that the proposed joint models (speech/eyebrow/head) outperform audiovisual models that are separately trained (speech/head and speech/eyebrow).

  • Morpholexical and Discriminative Language Models for Turkish Automatic Speech Recognition

    Page(s): 2341 - 2351

    This paper introduces two complementary language modeling approaches for morphologically rich languages aiming to alleviate the out-of-vocabulary (OOV) word problem and to exploit morphology as a knowledge source. The first model, the morpholexical language model, is a generative n-gram model, where modeling units are lexical-grammatical morphemes instead of commonly used words or statistical sub-words. This paper also proposes a novel approach for integrating the morphology into an automatic speech recognition (ASR) system in the finite-state transducer framework as a knowledge source. We accomplish that by building a morpholexical search network obtained by the composition of the lexical transducer of a computational lexicon with a morpholexical language model. The second model is a linear reranking model trained discriminatively with a variant of the perceptron algorithm using morpholexical features. This variant of the perceptron algorithm, the WER-sensitive perceptron, is shown to perform better for reranking n-best candidates obtained with the generative model. We apply the proposed models to a Turkish broadcast news transcription task and give experimental results. The morpholexical model leads to an elegant morphology-integrated search network with unlimited vocabulary. Thus, it is highly effective in alleviating the OOV problem and improves the word error rate (WER) over word and statistical sub-word models by 1.8% and 0.4% absolute, respectively. The discriminatively trained morpholexical model further improves the WER of the system by 0.8% absolute.

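    The discriminative half of the paper reranks n-best lists with a perceptron whose update is sensitive to word error rate. A minimal reranker in that spirit is sketched below: features are sparse dicts, and the update moves the weights toward the lowest-WER ("oracle") hypothesis and away from the current top-scoring one, scaled by their WER difference. The exact update rule and features in the paper may differ; the toy data are assumptions.

        from collections import defaultdict

        def rerank_score(weights, feats):
            return sum(weights.get(f, 0.0) * v for f, v in feats.items())

        def train_wer_sensitive_perceptron(nbest_lists, epochs=5, lr=1.0):
            """Each n-best list is a sequence of (features: dict, wer: float) pairs.
            Weights are pushed toward the oracle (lowest-WER) hypothesis and away
            from the current 1-best, scaled by how much worse the 1-best is."""
            w = defaultdict(float)
            for _ in range(epochs):
                for nbest in nbest_lists:
                    oracle = min(nbest, key=lambda h: h[1])
                    best = max(nbest, key=lambda h: rerank_score(w, h[0]))
                    scale = lr * (best[1] - oracle[1])     # WER-sensitive step size
                    if scale > 0.0:
                        for f, v in oracle[0].items():
                            w[f] += scale * v
                        for f, v in best[0].items():
                            w[f] -= scale * v
            return w

        # Toy example: one utterance, two hypotheses with morpholexical-style features.
        nbest = [[({'lm': 1.2, 'suffix=dan': 1.0}, 0.25),
                  ({'lm': 1.0, 'suffix=den': 1.0}, 0.10)]]
        weights = train_wer_sensitive_perceptron(nbest)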
  • Adaptation of Hidden Markov Models Using Model-as-Matrix Representation

    Page(s): 2352 - 2364

    In this paper, we describe basis-based speaker adaptation techniques using the matrix representation of training models. Bases are obtained from training models by decomposition techniques for matrix-variate objects: two-dimensional principal component analysis (2DPCA) and generalized low rank approximations of matrices (GLRAM). The motivation for using matrix representation is that the sample covariance matrix of training models can be more accurately computed and the speaker weight becomes a matrix. Speaker adaptation equations are derived in the maximum-likelihood (ML) framework, and the adaptation equations can be solved using the maximum-likelihood linear regression technique. Additionally, novel applications of probabilistic 2DPCA and GLRAM to speaker adaptation are presented. From the probabilistic 2DPCA/GLRAM of training models, speaker adaptation equations are formulated in the maximum a posteriori (MAP) framework. The adaptation equations can be solved using the MAP linear regression technique. In the isolated-word experiments, the matrix representation-based methods in the ML and MAP frameworks outperformed maximum-likelihood linear regression adaptation, MAP adaptation, eigenvoice, and probabilistic PCA-based model for adaptation data longer than 20 s. Furthermore, the adaptation methods using probabilistic 2DPCA/GLRAM showed additional performance improvement over the adaptation methods using 2DPCA/GLRAM for small amounts of adaptation data.

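    The decomposition at the heart of the adaptation scheme, 2DPCA over matrix-shaped training models, keeps each model as a matrix and eigendecomposes a single model covariance. Below is plain 2DPCA on a stack of matrices; the "model-as-matrix" objects and their dimensions are assumptions, and the ML/MAP adaptation equations built on top of the bases are not reproduced.

        import numpy as np

        def two_d_pca(models, n_components):
            """2DPCA: `models` has shape (N, r, c), one matrix per training model.
            Returns the mean matrix and the top right-projection vectors V (c x k),
            obtained from the covariance G = sum_i (A_i - mean)^T (A_i - mean)."""
            mean = models.mean(axis=0)
            centered = models - mean
            G = sum(A.T @ A for A in centered)       # c x c covariance of matrix objects
            vals, vecs = np.linalg.eigh(G)
            V = vecs[:, ::-1][:, :n_components]      # eigenvectors of the largest eigenvalues
            return mean, V

        rng = np.random.default_rng(0)
        models = rng.normal(size=(30, 40, 20))       # 30 training "model matrices"
        mean, V = two_d_pca(models, n_components=5)
        weights = (models[0] - mean) @ V             # matrix-valued speaker weight (40 x 5)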
  • Efficient Frequency Domain Implementation of Noncausal Multichannel Blind Deconvolution for Convolutive Mixtures of Speech

    Page(s): 2365 - 2377

    Multichannel blind deconvolution (MCBD) algorithms are known to suffer from an extensive computational complexity problem, which makes them impractical for blind source separation (BSS) of speech and audio signals. This problem is even more serious with noncausal MCBD algorithms that must be used in many frequently occurring BSS setups. In this paper, we propose a novel frequency domain algorithm for the efficient implementation of noncausal multichannel blind deconvolution. A block-wise formulation is first developed for filtering and adaptation of filter coefficients. Based on this formulation, we present a modified overlap-save procedure for noncausal filtering in the frequency domain. We also derive update equations for training both causal and anti-causal filters in the frequency domain. Our evaluations indicate that the proposed frequency domain implementation reduces the computational requirements of the algorithm by a factor of more than 100 for typical filter lengths used in blind speech separation. The algorithm is employed successfully for the separation of speech mixtures in a reverberant room. Simulation results demonstrate the superior performance of the proposed algorithm over causal MCBD algorithms in many potential source and microphone positions. It is shown that in BSS problems, causal MCBD algorithms with center-spike initialization do not always converge to a delayed form of the desired noncausal solution, further revealing the need for an efficient noncausal MCBD algorithm.

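    The building block of the proposed frequency-domain implementation is overlap-save filtering. The routine below is standard overlap-save for a causal FIR filter using numpy FFTs only; the paper's modified noncausal procedure and the adaptive filter updates are not shown, and the block length is an assumption.

        import numpy as np

        def overlap_save(x, h, block=256):
            """Filter signal x with FIR h via overlap-save FFT convolution.
            Each block of length `block` yields block - len(h) + 1 valid samples."""
            M = len(h)
            step = block - M + 1
            H = np.fft.rfft(h, block)
            y = np.zeros(len(x))
            padded = np.concatenate([np.zeros(M - 1), x, np.zeros(block)])
            for start in range(0, len(x), step):
                seg = padded[start:start + block]
                out = np.fft.irfft(np.fft.rfft(seg, block) * H, block)
                n = min(step, len(x) - start)
                y[start:start + n] = out[M - 1:M - 1 + n]   # keep only the non-aliased samples
            return y

        # Check against direct convolution.
        rng = np.random.default_rng(0)
        x = rng.normal(size=1000)
        h = rng.normal(size=32)
        assert np.allclose(overlap_save(x, h), np.convolve(x, h)[:len(x)])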
  • Relating Objective and Subjective Performance Measures for AAM-Based Visual Speech Synthesis

    Page(s): 2378 - 2387

    We compare two approaches for synthesizing visual speech using active appearance models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and the performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g., a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.

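    The objective measure the authors end up recommending is the cost of a dynamic time warp between synthesized and ground-truth parameter trajectories. A compact DTW with Euclidean frame distances is sketched below; it is the generic algorithm on random stand-in trajectories, not the exact configuration used in the paper.

        import numpy as np

        def dtw_cost(A, B):
            """Dynamic time warp cost between two parameter trajectories
            A (m x d) and B (n x d) using Euclidean frame distances."""
            m, n = len(A), len(B)
            D = np.full((m + 1, n + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = np.linalg.norm(A[i - 1] - B[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[m, n]

        rng = np.random.default_rng(0)
        synthesized = rng.normal(size=(120, 30))   # e.g., AAM parameter frames
        ground_truth = rng.normal(size=(100, 30))
        print(dtw_cost(synthesized, ground_truth))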
  • Sound Field Reproduction With Energy Constraint on Loudspeaker Weights

    Page(s): 2388 - 2392

    Audio rendering problems are not always well-posed. An approach is devised for solving ill-posed sound field reproduction problems using regularization, where the Tikhonov parameter is chosen by upper bounding the summed square of the loudspeaker weights. The method ensures that the sound in the room remains at reasonable levels.

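    The regularization strategy, choosing the Tikhonov parameter so that the summed squared loudspeaker weights stay below a bound, can be realized with a bisection search on the parameter. The sketch below solves the regularized least-squares problem for a random plant matrix; the matrix names, the random data, and the bisection tolerances are assumptions, not the paper's exact procedure.

        import numpy as np

        def constrained_tikhonov(G, p, energy_limit, iters=60):
            """Solve min ||G w - p||^2 + lam ||w||^2, picking lam by bisection so that
            ||w||^2 <= energy_limit (a larger lam gives smaller loudspeaker weights)."""
            def solve(lam):
                A = G.conj().T @ G + lam * np.eye(G.shape[1])
                return np.linalg.solve(A, G.conj().T @ p)

            w = solve(0.0)
            if np.vdot(w, w).real <= energy_limit:
                return w                   # unregularized solution already satisfies the bound
            lo, hi = 0.0, 1.0
            while np.vdot(solve(hi), solve(hi)).real > energy_limit:
                hi *= 10.0                 # grow lam until the bound is met
            for _ in range(iters):
                lam = 0.5 * (lo + hi)
                if np.vdot(solve(lam), solve(lam)).real > energy_limit:
                    lo = lam
                else:
                    hi = lam
            return solve(hi)

        rng = np.random.default_rng(0)
        G = rng.normal(size=(64, 16)) + 1j * rng.normal(size=(64, 16))  # control points x loudspeakers
        p = rng.normal(size=64) + 1j * rng.normal(size=64)              # desired sound field samples
        w = constrained_tikhonov(G, p, energy_limit=0.5)
        print(np.vdot(w, w).real)   # stays at or below 0.5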
  • IEEE Transactions on Audio, Speech, and Language Processing EDICS

    Page(s): 2393 - 2394
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing Information for Authors

    Page(s): 2395 - 2396
    Freely Available from IEEE
  • IEEE Signal Processing Society Information

    Page(s): C3
    Freely Available from IEEE

Aims & Scope

IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.


This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research