IEEE Transactions on Audio, Speech, and Language Processing

Issue 4 • May 2012

  • Table of contents

    Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
  • Speaker Identification and Verification by Combining MFCC and Phase Information

    Page(s): 1085 - 1095

    In conventional speaker recognition methods based on Mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. In this paper, we propose a phase information extraction method that normalizes the phase variation according to the frame position of the input speech, and we combine the phase information with MFCCs in text-independent speaker identification and verification methods. The original phase information extraction method has a problem when comparing two phase values. For example, the difference between the two values $\pi - \tilde{\theta}_1$ and $\tilde{\theta}_2 = -\pi + \tilde{\theta}_1$ is $2\pi - 2\tilde{\theta}_1$; if $\tilde{\theta}_1 \approx 0$, the difference is $\approx 2\pi$, despite the two phases being very similar to one another. To address this problem, we map the phase into coordinates on a unit circle. Speaker identification and verification experiments are performed using the NTT database, which consists of sentences uttered by 35 (22 male and 13 female) Japanese speakers in normal, fast, and slow speaking modes over five sessions. Although the phase information-based method performs worse than the MFCC-based method, it complements the MFCCs, and the combination is useful for speaker recognition. The proposed modified phase information is more robust than the original phase information for all speaking modes. By integrating the modified phase information with the MFCCs, the speaker identification rate was improved from 97.4% (MFCC) to 98.8%, and the equal error rate for speaker verification was reduced from 0.72% (MFCC) to 0.45%. We also conducted speaker identification and verification experiments on the large-scale Japanese Newspaper Article Sentences (JNAS) database and observed a similar trend to that on the NTT database.
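
    The unit-circle mapping is simple to make concrete. Below is a minimal numpy sketch (an illustration of the idea, not the authors' implementation) showing why raw phase differences mislead near the ±π wrap-around and how comparing (cos θ, sin θ) coordinates fixes it:

    ```python
    import numpy as np

    def raw_phase_distance(theta1, theta2):
        # Naive comparison: can report ~2*pi for nearly identical phases.
        return abs(theta1 - theta2)

    def unit_circle_distance(theta1, theta2):
        # Map each phase onto the unit circle and compare coordinates, so
        # phases just below +pi and just above -pi come out close together.
        p1 = np.array([np.cos(theta1), np.sin(theta1)])
        p2 = np.array([np.cos(theta2), np.sin(theta2)])
        return np.linalg.norm(p1 - p2)

    theta1 = 0.01
    a, b = np.pi - theta1, -np.pi + theta1   # the example from the abstract
    print(raw_phase_distance(a, b))    # ~2*pi - 2*theta1: looks very different
    print(unit_circle_distance(a, b))  # ~0.02: correctly close
    ```
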
  • A Generative Context Model for Semantic Music Annotation and Retrieval

    Page(s): 1096 - 1108

    While a listener may derive semantic associations for audio clips from direct auditory cues (e.g., hearing “bass guitar”) as well as from “context” (e.g., inferring “bass guitar” in the context of a “rock” song), most state-of-the-art systems for automatic music annotation ignore this context. Indeed, although contextual relationships correlate tags, many auto-taggers model tags independently. This paper presents a novel, generative approach to improve automatic music annotation by modeling contextual relationships between tags. A Dirichlet mixture model (DMM) is proposed as a second, additional stage in the modeling process, to supplement any auto-tagging system that generates a semantic multinomial (SMN) over a vocabulary of tags when annotating a song. For each tag in the vocabulary, a DMM captures the broader context the tag defines by modeling tag co-occurrence patterns in the SMNs of songs associated with the tag. When annotating songs, the DMMs refine SMN annotations by leveraging contextual evidence. Experimental results demonstrate the benefits of combining a variety of auto-taggers with this generative context model. It generally outperforms other approaches to modeling context as well.
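
    As a rough sketch of the second-stage idea, the snippet below rescores a semantic multinomial with per-tag Dirichlet models of tag co-occurrence (a single Dirichlet per tag for brevity, where the paper uses a Dirichlet mixture; the vocabulary and alpha values are fabricated):

    ```python
    import numpy as np
    from scipy.stats import dirichlet

    # Hypothetical vocabulary and per-tag Dirichlet parameters, standing in
    # for stage-two context models fitted to the SMNs of songs with each tag.
    vocab = ["rock", "bass guitar", "jazz"]
    alphas = {
        "rock":        np.array([8.0, 5.0, 1.0]),  # rock co-occurs with bass
        "bass guitar": np.array([6.0, 7.0, 2.0]),
        "jazz":        np.array([1.0, 3.0, 9.0]),
    }

    def refine_smn(smn):
        """Rescore tags: p(tag | SMN) proportional to Dir(SMN; alpha_tag)."""
        smn = np.clip(smn, 1e-6, None)
        smn = smn / smn.sum()                      # keep on the open simplex
        scores = np.array([dirichlet.pdf(smn, alphas[t]) for t in vocab])
        return scores / scores.sum()

    smn = np.array([0.55, 0.25, 0.20])             # first-stage auto-tagger SMN
    print(dict(zip(vocab, refine_smn(smn))))
    ```
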
  • A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction

    Page(s): 1109 - 1117

    Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character's pronunciation in a target dialect based on the character's features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist across multiple dialects. The proposed model can fill in missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate prediction accuracy in terms of phonological features such as tone, initial phoneme, and final phoneme. For each character, the features are also evaluated as a whole, in terms of overall pronunciation feature accuracy (OPFA). Our first experiment shows that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA with the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects; the results show that using features from closely related dialects yields higher accuracy. In the third experiment, we show that using the proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%.
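
    The supervised-prediction step translates directly into common tooling. A schematic sklearn sketch (the feature rows, rime-book classes, and tone labels are fabricated; the paper's generative model would first fill in missing dialect cells):

    ```python
    from sklearn.svm import SVC
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.pipeline import make_pipeline

    # Toy rows: (rime-book class, dialect-A tone, dialect-B tone) -> target tone.
    # Real tables also cover initials/finals and contain missing cells that the
    # paper's generative model fills in before this supervised step.
    X = [["ping", "1", "1"], ["ping", "2", "2"], ["shang", "3", "3"],
         ["qu", "4", "4"], ["ping", "1", "2"], ["shang", "3", "2"]]
    y = ["1", "2", "3", "4", "1", "3"]

    model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                          SVC(kernel="rbf"))
    model.fit(X, y)
    print(model.predict([["ping", "1", "1"]]))  # predicted target-dialect tone
    ```
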
  • A General Flexible Framework for the Handling of Prior Information in Audio Source Separation

    Page(s): 1118 - 1133

    Most audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper, we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also makes it possible to devise and implement new, efficient methods not yet reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation-maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named Flexible Audio Source Separation Toolbox (FASST) implementing a baseline version of the framework in Matlab.
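
    The framework itself is broad, but its core statistical operation is easy to illustrate: given nonnegative variance (PSD) models for the sources, each source is recovered from the mixture STFT by Wiener filtering. A numpy sketch that assumes the variance models have already been fitted (e.g., by the generalized EM iterations the abstract mentions):

    ```python
    import numpy as np

    def wiener_separate(X, source_psds, eps=1e-12):
        """X: mixture STFT (freq x frames); source_psds: same-shape nonnegative
        variance models, one per source (assumed already estimated).
        Returns one STFT estimate per source."""
        total = sum(source_psds) + eps
        return [(v / total) * X for v in source_psds]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(513, 100)) + 1j * rng.normal(size=(513, 100))
    v1 = rng.uniform(size=(513, 100))        # placeholder variance models
    v2 = rng.uniform(size=(513, 100))
    S1, S2 = wiener_separate(X, [v1, v2])
    print(S1.shape, np.allclose(S1 + S2, X)) # estimates sum back to the mixture
    ```
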
  • Discovering Time-Constrained Sequential Patterns for Music Genre Classification

    Page(s): 1134 - 1144

    A music piece can be considered as a sequence of sound events which represent both short-term and long-term temporal information. However, in the task of automatic music genre classification, most text-categorization-based approaches capture only local temporal dependencies (e.g., unigram- and bigram-based occurrence statistics) to represent music content. In this paper, we propose the use of time-constrained sequential patterns (TSPs) as effective features for music genre classification. First, an automatic language identification technique is performed to tokenize each music piece into a sequence of hidden Markov model indices. Then TSP mining is applied to discover genre-specific TSPs, followed by the computation of occurrence frequencies of TSPs in each music piece. Finally, support vector machine classifiers are employed based on these occurrence frequencies to perform the classification task. Experiments conducted on two widely used datasets for music genre classification, GTZAN and ISMIR2004Genre, show that the proposed method can discover more discriminative temporal structures and achieve better recognition accuracy than the unigram- and bigram-based statistical approach.
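
    Checking whether a piece contains a given TSP reduces to an in-order search with a bounded gap between consecutive matches; the resulting per-piece occurrence counts are what feed the SVM. A small self-contained sketch (token streams stand in for the HMM-index sequences; this illustrates the matching constraint, not the paper's mining algorithm):

    ```python
    def matches_tsp(sequence, pattern, max_gap, start=0, anchored=False):
        """True if `pattern` occurs in order in `sequence`, with at most
        `max_gap` intervening symbols between consecutive matches."""
        if not pattern:
            return True
        end = min(len(sequence), start + max_gap + 1) if anchored else len(sequence)
        for j in range(start, end):
            if sequence[j] == pattern[0] and matches_tsp(
                    sequence, pattern[1:], max_gap, j + 1, True):
                return True
        return False

    piece = [3, 7, 7, 2, 9, 3, 2, 9]                 # toy HMM-index stream
    print(matches_tsp(piece, [3, 2, 9], max_gap=2))  # True (3@0, 2@3, 9@4)
    print(matches_tsp(piece, [7, 9], max_gap=0))     # False: no 9 right after a 7
    ```
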
  • On Dynamic Stream Weighting for Audio-Visual Speech Recognition

    Page(s): 1145 - 1157

    The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper, we propose a new confidence measure, compare it to existing ones, and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance larger than any difference among the proposed confidence measures.
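
    The combination rule at the heart of dynamic stream weighting is compact: per-frame audio and visual log-likelihoods are mixed with a time-varying weight derived from reliability measures. A sketch with fabricated inputs (the fixed silence policy below is only a placeholder for the VAD behavior the abstract describes):

    ```python
    import numpy as np

    def combined_loglik(ll_audio, ll_video, reliability, is_silence):
        """Per-frame weighted combination: lambda*ll_a + (1 - lambda)*ll_v.
        `reliability` in [0, 1] plays the role of an audio confidence measure;
        silence frames get a fixed weighting (assumed policy)."""
        lam = np.where(is_silence, 0.5, reliability)
        return lam * ll_audio + (1.0 - lam) * ll_video

    ll_a = np.array([-4.0, -9.0, -3.5])   # audio HMM state log-likelihoods
    ll_v = np.array([-5.0, -4.0, -6.0])   # visual stream log-likelihoods
    rel  = np.array([0.9, 0.2, 0.8])      # audio degraded in frame 2
    sil  = np.array([False, False, True])
    print(combined_loglik(ll_a, ll_v, rel, sil))
    ```
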
  • EMD-Based Filtering (EMDF) of Low-Frequency Noise for Speech Enhancement

    Page(s): 1158 - 1166

    An empirical mode decomposition-based filtering (EMDF) approach is presented as a postprocessing stage for speech enhancement. This method is particularly effective in low-frequency noise environments. Unlike previous EMD-based denoising methods, this approach does not assume that the contaminating noise signal is fractional Gaussian noise. An adaptive method is developed to select the intrinsic mode function (IMF) index for separating the noise components from the speech, based on second-order IMF statistics. The low-frequency noise components are then separated by a partial reconstruction from the IMFs. It is shown that the proposed EMDF technique is able to suppress residual noise from speech signals that were enhanced by the conventional optimally modified log-spectral amplitude approach, which uses a minimum statistics-based noise estimate. A comparative performance study is included that demonstrates the effectiveness of the EMDF system in various noise environments, such as car interior noise, military vehicle noise, and babble noise. In particular, improvements of up to 10 dB are obtained in car noise environments. Listening tests were performed that confirm the results.
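
    The partial-reconstruction step is easy to sketch. Assuming the PyEMD package (pip name EMD-signal; API assumed), the snippet decomposes a rumble-contaminated signal into IMFs and rebuilds it without the slowest ones; the split index is hand-picked here, whereas the paper selects it adaptively from second-order IMF statistics:

    ```python
    import numpy as np
    from PyEMD import EMD   # pip install EMD-signal

    fs = 8000
    t = np.arange(0, 1.0, 1.0 / fs)
    speech_like = np.sin(2 * np.pi * 440 * t)   # stand-in for enhanced speech
    rumble = 0.8 * np.sin(2 * np.pi * 35 * t)   # residual low-frequency noise
    noisy = speech_like + rumble

    imfs = EMD().emd(noisy)            # IMFs ordered fast to slow
    k = len(imfs) - 2                  # hand-picked split; the paper chooses it
                                       # from second-order IMF statistics
    enhanced = np.sum(imfs[:k], axis=0)   # drop the slow (low-frequency) IMFs
    print(imfs.shape,
          np.std(noisy - speech_like),     # error before EMDF
          np.std(enhanced - speech_like))  # error after partial reconstruction
    ```
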
  • Analysis of Bit-Plane Probability for Generalized Gaussian Distribution and its Application in Audio Coding

    Page(s): 1167 - 1176

    Bit-plane probability is useful information for many applications, such as entropy coding and rate estimation, particularly in scalable coding systems. Typically, the bit-plane probabilities vary according to the distribution of the data, and due to inter-plane correlation it is hard to obtain an analytic expression for the bit-plane probabilities of data with a generalized Gaussian distribution. In this paper, the bit-plane probability is analyzed when a generalized Gaussian distribution is used to model the input data. Based on a study of the bit-plane probability for the Laplace distribution and the relationship between different bit-planes, an approximate bit-plane probability for the generalized Gaussian distribution is presented. This closed-form expression has low computational cost. Furthermore, a more practical form is derived with reduced complexity for implementation in the state-of-the-art MPEG-4 scalable-to-lossless audio coding system. At the same computational cost, the proposed algorithm achieves higher compression efficiency than MPEG-4 scalable-to-lossless audio coding, which considers the Laplace distribution only.
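
    The quantity at stake can be pinned down empirically in a few lines: the probability that bit-plane b of the integerized magnitudes is set, for a Laplace source (a generalized Gaussian with shape parameter 1). The Monte Carlo check below is illustrative only; the paper's contribution is a closed-form approximation of these probabilities for general GGD shapes:

    ```python
    import numpy as np

    def empirical_bitplane_probs(samples, num_planes=8):
        """P(bit b of the integerized magnitude == 1), MSB plane first."""
        mags = np.abs(samples)
        ints = (mags / mags.max() * (2 ** num_planes - 1)).astype(int)
        return [((ints >> b) & 1).mean() for b in range(num_planes - 1, -1, -1)]

    rng = np.random.default_rng(1)
    laplace = rng.laplace(size=200000)   # GGD with shape parameter 1
    print(np.round(empirical_bitplane_probs(laplace), 3))
    # High (MSB) planes are almost never set for peaky sources, while low
    # planes approach 0.5 -- the inter-plane structure that the closed-form
    # approximation has to capture.
    ```
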
  • Improving Robustness of Codebook-Based Noise Estimation Approaches With Delta Codebooks

    Page(s): 1177 - 1188

    We present a new codebook-based speech enhancement approach that increases the robustness of conventional codebook-based approaches against model mismatch and unknown noise types. This is achieved by training, in the cepstral domain, only the difference between the actual noise and a robust estimate (e.g., obtained by minimum statistics or recursive minimum tracking) instead of the noise itself. The noise codebook is then generated by shifting the delta codebook obtained in this way by the cepstral representation of a robust noise estimate. We use the recursive minimum tracking approach as the robust estimate. It is thus guaranteed that the robust estimate is also a valid estimate within the codebook-based algorithm; consequently, the codebook-based algorithm inherits the robustness of the recursive minimum tracking approach. Objective and subjective experiments show that the proposed method yields a consistent quality improvement over the basic codebook-based approach and recursive minimum tracking.
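
    The shifting operation itself is a vector addition in the cepstral domain, and a zero delta codeword is what guarantees the robust estimate stays in the codebook. A schematic numpy sketch (codewords and estimate are fabricated):

    ```python
    import numpy as np

    # Delta codebook: trained offline on (true noise - robust estimate) in the
    # cepstral domain, instead of on the noise itself.
    delta_codebook = np.array([[0.0, 0.0, 0.0],     # zero delta: the robust
                               [0.3, -0.1, 0.05],   # estimate itself always
                               [-0.2, 0.4, -0.1]])  # remains a codebook entry

    def runtime_noise_codebook(robust_estimate_cepstrum):
        """Shift every delta codeword by the current robust noise estimate
        (e.g., from recursive minimum tracking), yielding the noise codebook."""
        return delta_codebook + robust_estimate_cepstrum

    est = np.array([1.5, -0.7, 0.2])   # toy cepstral robust noise estimate
    print(runtime_noise_codebook(est))
    ```
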
  • Transformation Between Uniform Linear and Spherical Microphone Arrays With Symmetric Responses

    Page(s): 1189 - 1195

    Spherical microphone arrays are suited for phase mode processing, which complements spatio-temporal processing and often simplifies both the understanding and development of different beamforming techniques. Since a spherical geometry cannot benefit directly from methods specially developed for uniform linear arrays, weights and algorithms for spherical arrays are mostly designed and optimized numerically. One of the exceptions is a recent study that develops the well-known Dolph-Chebyshev weights in the phase mode framework. We show that for the case of a symmetric response, the spherical and linear array geometries are related one-to-one through a linear transformation developed herein. With this transformation, processing techniques specific to uniform linear arrays become readily available in closed form for spherical array processing. Any uniform linear array weight design technique can be directly applied to spherical arrays. We also show how this transformation can be used to generate virtual uniform linear array data from spherical array data, enabling us to apply adaptive algorithms specific to uniform linear arrays to spherical arrays.

  • Prosodic Realization of Rhetorical Structure in Chinese Discourse

    Page(s): 1196 - 1206

    The research reported in this paper is an acoustic experiment attempting to elucidate the relationship between prosodic variation and rhetorical structure in discourse. Based on Rhetorical Structure Theory, facets of discourse structure such as hierarchy, relation, and the relative importance of discourse segments were identified. Five speakers of standard Chinese were recorded reading ten paragraphs, each with two repetitions. Boundary pause duration, f0 maximum, f0 minimum, and pitch range of the segments were measured. It was found that speakers realized longer pauses at higher-level boundaries. Furthermore, compared with segments linked by a nucleus-satellite relation, segments linked by a multinuclear relation were found to have a wider pre-boundary pitch range. Additionally, important segments were articulated with a wider pitch range than unimportant segments. These results suggest that rhetorical structure is reliably conveyed by prosodic parameters in standard Chinese.

  • Automated Physical Modeling of Nonlinear Audio Circuits for Real-Time Audio Effects—Part II: BJT and Vacuum Tube Examples

    Page(s): 1207 - 1216

    This is the second part of a two-part paper that presents a procedural approach to derive nonlinear filters from schematics of audio circuits for the purpose of digitally emulating musical effects circuits in real time. This work presents the results of applying this physics-based technique to two audio preamplifier circuits. The approach extends a thread of research that uses variable transformation and offline solution of the global nonlinear system. The solution is approximated with multidimensional linear interpolation during runtime to avoid uncertainties in convergence. The methods are evaluated here experimentally against a reference SPICE circuit simulation. The circuits studied here are the bipolar junction transistor (BJT) common emitter amplifier and the triode preamplifier. The results suggest the use of function approximation to represent the solved system nonlinearity of the K-method and invite future work along these lines.
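
    The runtime strategy described here (solve the nonlinearity offline, interpolate a table at audio rate) can be sketched with a one-dimensional stand-in; tanh plays the role of the solved BJT/triode nonlinearity, and the real tables would be multidimensional and filled by an offline Newton-type solver:

    ```python
    import numpy as np

    # Offline: tabulate the solved nonlinearity on a grid. Here tanh stands in
    # for the K-method solution of a BJT or triode stage.
    grid = np.linspace(-2.0, 2.0, 4096)
    table = np.tanh(grid)

    def process_block(x):
        """Runtime: replace the implicit nonlinear solve with table lookup and
        linear interpolation, avoiding per-sample convergence uncertainty."""
        return np.interp(np.clip(x, grid[0], grid[-1]), grid, table)

    fs = 48000
    t = np.arange(0, 0.01, 1.0 / fs)
    drive = 1.5 * np.sin(2 * np.pi * 220 * t)         # overdriven input block
    y = process_block(drive)
    print(float(np.max(np.abs(y - np.tanh(drive)))))  # small interpolation error
    ```
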
  • Nonintrusive Quality Assessment of Noise Suppressed Speech With Mel-Filtered Energies and Support Vector Regression

    Page(s): 1217 - 1232

    Objective speech quality assessment is a challenging task that aims to emulate the human judgment obtained in the complex and time-consuming task of subjective assessment. It is difficult to perform in line with human perception due to the complex and nonlinear nature of the human auditory system. The challenge lies in representing speech signals using appropriate features and subsequently mapping these features into a quality score. This paper proposes a nonintrusive metric for the quality assessment of noise-suppressed speech. The originality of the proposed approach lies primarily in the use of Mel filter bank energies (FBEs) as features and the use of support vector regression (SVR) for feature mapping. We utilize the sensitivity of FBEs to noise to obtain an effective representation of speech for quality assessment. In addition, the use of SVR exploits the advantages of kernels, which allow the regression algorithm to learn complex data patterns via nonlinear transformation for an effective and generalized mapping of features into the quality score. Extensive experiments conducted using two third-party databases with different noise-suppressed speech signals show the effectiveness of the proposed approach.
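
    The pipeline maps naturally onto common tooling. A schematic version assuming librosa for the Mel filter bank energies and scikit-learn for the SVR (the clips, pooling scheme, and quality labels are fabricated; the real system trains on subjectively rated noise-suppressed speech):

    ```python
    import numpy as np
    import librosa
    from sklearn.svm import SVR

    def fbe_features(y, sr):
        """Mel filter bank energies, log-compressed and pooled over time."""
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=24)
        logmel = np.log(mel + 1e-10)
        return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

    rng = np.random.default_rng(0)
    sr = 8000
    # Toy corpus: "noise-suppressed" utterances with subjective quality scores.
    clips = [rng.normal(scale=s, size=sr) for s in (0.1, 0.3, 0.5, 0.8, 1.0)]
    mos = np.array([4.5, 3.8, 3.0, 2.2, 1.5])

    X = np.array([fbe_features(y, sr) for y in clips])
    model = SVR(kernel="rbf", C=10.0).fit(X, mos)  # kernelized feature mapping
    print(model.predict(X[:2]))                    # nonintrusive quality scores
    ```
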
  • Automatic Evaluation of Karaoke Singing Based on Pitch, Volume, and Rhythm Features

    Page(s): 1233 - 1243

    This study aims to develop an automatic singing evaluation system for Karaoke performances. Many Karaoke systems on the market today come with a scoring function, which enhances the entertainment appeal of the system given the competitive nature of humans. Karaoke scoring mechanisms to date, however, are still rudimentary, often giving results inconsistent with scoring by human raters. One source of error is that often only the singing volume is used as the evaluation criterion. To improve the singing evaluation capabilities of Karaoke machines, this study exploits various acoustic features, including pitch, volume, and rhythm, to assess a singing performance. We invited a number of singers with different levels of singing ability to record solo Karaoke vocal samples. The performances were rated independently by four musicians and then used, in conjunction with additional Karaoke Video Compact Disc music, for training the proposed system. Our experiments show that the results of automatic singing evaluation are close to the human ratings, with a Pearson product-moment correlation coefficient of 0.82 between them.
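
    In outline, such a scorer extracts pitch, volume, and rhythm features from each performance, regresses them onto musician ratings, and is validated by correlation with human scores, as in the 0.82 figure quoted. A compact illustration with fabricated features (a real system would compute them from the vocal recordings):

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from scipy.stats import pearsonr

    # Fabricated per-performance features: pitch accuracy, volume stability,
    # rhythm deviation.
    X = np.array([[0.95, 0.9, 0.05], [0.80, 0.7, 0.15], [0.60, 0.6, 0.30],
                  [0.90, 0.8, 0.10], [0.40, 0.5, 0.40], [0.70, 0.75, 0.20]])
    human = np.array([92, 78, 60, 85, 45, 70])   # averaged musician ratings

    model = LinearRegression().fit(X[:4], human[:4])   # train on rated samples
    machine = model.predict(X)
    r, _ = pearsonr(machine, human)   # agreement with the human raters
    print(np.round(machine, 1), round(r, 2))
    ```
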
  • Round-Robin Duel Discriminative Language Models

    Page(s): 1244 - 1255

    Discriminative training has received a lot of attention from both the machine learning and speech recognition communities. The idea behind the discriminative approach is to construct a model that distinguishes correct samples from incorrect samples, whereas the conventional generative approach estimates the distributions of correct samples. We propose a novel discriminative training method and apply it to a language model for reranking speech recognition hypotheses. Our proposed method uses a round-robin duel discrimination (R2D2) criterion, in which all pairs of sentence hypotheses, including pairs of incorrect sentences, are distinguished from each other, taking their error rates into account. Since the objective function is convex, the global optimum can be found with a standard parameter estimation method such as the quasi-Newton method. Furthermore, the proposed method is an expansion of the global conditional log-linear model, whose objective function corresponds to that of conditional random fields. Our experimental results show that R2D2 outperforms conventional methods in many situations, covering different languages, different feature constructions, and different levels of difficulty.
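
    For concreteness, here is a guess at the general shape of such a criterion: a convex pairwise logistic loss over all hypothesis pairs in an n-best list, weighted by their error-count gap, on top of a log-linear scoring model. This sketches the family the abstract describes, not the paper's exact objective:

    ```python
    import numpy as np

    def pairwise_duel_loss(w, feats, errors):
        """Convex round-robin loss: every hypothesis pair (i, j) with
        errors[i] < errors[j] should satisfy score_i > score_j; violations
        are penalized by a logistic loss weighted by the error gap."""
        scores = feats @ w                     # log-linear hypothesis scores
        loss = 0.0
        for i in range(len(errors)):
            for j in range(len(errors)):
                if errors[i] < errors[j]:      # includes incorrect-vs-incorrect
                    gap = errors[j] - errors[i]
                    loss += gap * np.log1p(np.exp(scores[j] - scores[i]))
        return loss

    feats = np.array([[1.0, 0.2], [0.8, 0.9], [0.1, 1.0]])  # n-best features
    errors = np.array([0, 2, 5])               # word errors per hypothesis
    print(pairwise_duel_loss(np.array([1.0, -0.5]), feats, errors))
    ```
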
  • A Multi-Frame Approach to the Frequency-Domain Single-Channel Noise Reduction Problem

    Page(s): 1256 - 1269

    This paper focuses on the class of single-channel noise reduction methods that are performed in the frequency domain via the short-time Fourier transform (STFT). The simplicity and relative effectiveness of this class of approaches make them the dominant choice in practical systems. Over the years, many popular algorithms have been proposed. These algorithms, no matter how they are developed, have one feature in common: the solution is eventually formulated as a gain function applied to the STFT of the noisy signal in the current frame only, implying that the interframe correlation is ignored. This assumption is not accurate for speech enhancement, since speech is a highly self-correlated signal. In this paper, by taking the interframe correlation into account, a new linear model for speech spectral estimation and some optimal filters are proposed. These include the multi-frame Wiener and minimum variance distortionless response (MVDR) filters. With these filters, both the narrowband and fullband signal-to-noise ratios (SNRs) can be improved. Furthermore, with the MVDR filter, speech distortion at the output can be zero. Simulations yield promising results that support the merits established by the theoretical analysis.
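
    The resulting filters take the familiar MVDR closed form, applied per frequency bin across a stack of consecutive frames. A numpy sketch with fabricated statistics (the paper estimates the interframe speech correlation and noise statistics from the signal):

    ```python
    import numpy as np

    def multiframe_mvdr(R_noise, gamma):
        """h = R^{-1} g / (g^H R^{-1} g): passes the speech component described
        by the interframe correlation vector `gamma` without distortion while
        minimizing residual noise across the stacked frames."""
        Ri_g = np.linalg.solve(R_noise, gamma)
        return Ri_g / (gamma.conj() @ Ri_g)

    L = 4                                   # number of consecutive STFT frames
    rng = np.random.default_rng(0)
    A = rng.normal(size=(L, L)) + 1j * rng.normal(size=(L, L))
    R_noise = A @ A.conj().T + L * np.eye(L)     # toy noise correlation matrix
    gamma = np.array([1.0, 0.6, 0.3, 0.1]) + 0j  # toy interframe speech corr.

    h = multiframe_mvdr(R_noise, gamma)
    print(np.round(h, 3), abs(h.conj() @ gamma))  # distortionless: ~1.0
    ```
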
  • Single and Piecewise Polynomials for Modeling of Pitched Sounds

    Page(s): 1270 - 1281

    We present a compact approach to simultaneous modeling of non-stationary harmonic and transient components in pitched sound sources. The harmonic and transient components are described by separate models, built from a common sinusoidal basis modified by single and piecewise linear time polynomials, respectively. A single polynomial accounts for slow, continuous signal variation over time, while piecewise polynomials can capture fast signal changes on smaller subintervals within the analysis window. The resulting model is linear in its parameters, and the solution of the corresponding linear system of equations provides correct model parameter estimates according to the signal content in the analysis window. The model is extended to deal with mixtures of sounds, where harmonics clustered within a small bandwidth are jointly modeled as a single harmonic. The comparative results suggest that the proposed model outperforms two reference modeling methods in terms of modeling error and number of parameters.
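
    Because the model is linear in its parameters, fitting is a single least-squares solve over a design matrix whose columns are sinusoids multiplied by polynomial terms. A single-harmonic sketch with a known pitch and one global polynomial (the paper adds piecewise terms and handles harmonic clusters):

    ```python
    import numpy as np

    fs, f0, P = 8000, 200.0, 2          # sample rate, pitch, polynomial order
    t = np.arange(512) / fs

    # Design matrix: cos/sin at f0 times t^p, letting the amplitude drift over
    # the window. Piecewise polynomial columns would be added per subinterval.
    cols = [np.cos(2 * np.pi * f0 * t) * t**p for p in range(P + 1)]
    cols += [np.sin(2 * np.pi * f0 * t) * t**p for p in range(P + 1)]
    Phi = np.stack(cols, axis=1)

    # Synthetic target: a 200-Hz tone with linearly decaying amplitude.
    x = (1.0 - 8.0 * t) * np.cos(2 * np.pi * f0 * t)

    theta, *_ = np.linalg.lstsq(Phi, x, rcond=None)
    print(np.round(theta, 3))                      # recovers [1, -8, 0, ...]
    print(float(np.linalg.norm(x - Phi @ theta)))  # near-zero modeling error
    ```
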
  • Bayesian Focusing for Coherent Wideband Beamforming

    Page(s): 1282 - 1296

    In this paper, we present and study a Bayesian focusing transformation (BFT) for coherent wideband array processing, which takes into account the uncertainty in the directions of arrival (DOAs). The Bayesian focusing method minimizes the mean-square error of the transformation over the probability density functions (pdfs) of the DOAs, thus achieving improved focusing accuracy over the entire bandwidth. In order to solve the Bayesian focusing problem, we derive and utilize a weighted extension of the wavefield interpolated narrowband generated subspace (WINGS) focusing transformation. We provide a closed-form expression for the optimal BFT and extend it to the case of directional sensors. We then consider a numerical computation scheme for the BFT in the angular domain. We show that if an angular sampling condition is satisfied, then the angle-domain approximation yields the optimal BFT. We also treat the important issue of robust focused minimum variance distortionless response (MVDR) beamforming. We analyze the sensitivity of the focused MVDR to focusing errors and show that the array gain (AG) is inversely proportional to the square of the signal-to-noise ratio (SNR) for large values of the SNR, and highly sensitive to the focusing errors. In order to reduce this sensitivity, we generalize the popular narrowband diagonally loaded MVDR to the focused wideband case, referred to as the Q-loaded focused MVDR wideband beamformer. We derive a closed-form analytic expression for the AG of the Q-loaded focused MVDR beamformer which depends on the focusing transformations. A numerical performance evaluation and simulations demonstrate the advantage of the BFT over other focusing transformations for multiple-source scenarios.
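
    Diagonal loading itself is a one-line change to the MVDR solution, trading a little optimality for robustness to steering and focusing errors. A generic narrowband sketch with toy quantities (the paper's Q-loaded beamformer operates on focused wideband data):

    ```python
    import numpy as np

    def loaded_mvdr(R, d, loading):
        """MVDR weights with diagonal loading: solve (R + loading*I) w = d,
        then normalize for a distortionless response toward `d`."""
        Rl = (R + loading * np.eye(R.shape[0])).astype(complex)
        w = np.linalg.solve(Rl, d)
        return w / (d.conj() @ w)

    rng = np.random.default_rng(2)
    M = 6
    A = rng.normal(size=(M, M))
    R = A @ A.T + 0.1 * np.eye(M)                # toy array covariance
    d = np.exp(1j * np.pi * np.arange(M) * 0.3)  # toy steering vector
    for q in (0.0, 1.0, 10.0):       # heavier loading: smaller weight norm,
        w = loaded_mvdr(R, d, q)     # hence lower sensitivity to errors
        print(q, float(np.linalg.norm(w)))
    ```
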
  • Local Key Estimation From an Audio Signal Relying on Harmonic and Metrical Structures

    Page(s): 1297 - 1312

    In this paper, we present a method for estimating the progression of musical key from an audio signal. We address the problem of local key finding by investigating the possible combination and extension of different previously proposed approaches for global key estimation. In this work, key progression is estimated from the chord progression. Specifically, we introduce key dependency on the harmonic and metrical structures. A contribution of our work is that we adapt the analysis window length for local key estimation to the intrinsic musical content of the analyzed piece by introducing information related to the metrical structure into our model. Key estimation is thus not performed on empirically chosen segments but on segments expressed in relation to the tempo period. We evaluate and analyze our results on two databases of different styles. We systematically analyze the influence of various parameters to determine factors important to our model, we study the relationships between the various musical attributes taken into account in our work, and we provide case-study examples.
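
    To make the chord-to-key step concrete, here is a deliberately simplified voter, not the paper's model: each window of chord roots, whose length would be tied to the tempo period rather than fixed, scores candidate major keys by diatonic membership:

    ```python
    # Simplified illustration: score each major key by how many chords in the
    # window are diatonic to it. The paper instead models key, chords, and
    # metrical structure jointly, with window length tied to the tempo period.
    NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    MAJOR_DEGREES = [0, 2, 4, 5, 7, 9, 11]          # diatonic scale degrees

    def key_scores(chord_roots):
        scores = {}
        for tonic in range(12):
            diatonic = {(tonic + deg) % 12 for deg in MAJOR_DEGREES}
            scores[NOTES[tonic]] = sum(NOTES.index(c) in diatonic
                                       for c in chord_roots)
        return max(scores, key=scores.get), scores

    window = ["G", "C", "D", "E", "C", "G"]   # chord roots in one window
    best, _ = key_scores(window)
    print(best)   # 'C' wins here (ties with G major broken by insertion order)
    ```
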
  • Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora

    Page(s): 1313 - 1323

    In voice conversion (VC), the speech of a source speaker is modified to resemble that of a particular target speaker. Currently, standard VC approaches use Gaussian mixture model (GMM)-based transformations that do not generate high-quality converted speech due to “over-smoothing” resulting from weak links between individual source and target frame parameters. Dynamic frequency warping (DFW) offers an appealing alternative to GMM-based methods, as more spectral detail is maintained in the transformation; however, speaker timbre is less successfully converted because spectral power is not adjusted explicitly. Previous work combines separate GMM- and DFW-transformed spectral envelopes for each frame. This paper proposes a more effective DFW-based approach that (1) does not rely on the baseline GMM methods, and (2) operates at the acoustic-class level. To adjust spectral power, an amplitude scaling function is used that compares the average target and warped source log spectra for each acoustic class. The proposed DFW with amplitude scaling (DFWA) outperforms standard GMM and hybrid GMM-DFW methods for VC in terms of both speech quality and timbre conversion, as confirmed in extensive objective and subjective testing. Furthermore, because it does not require time alignment of source and target speech, DFWA performs equally well using parallel or nonparallel corpora, as is demonstrated explicitly.
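
    The two ingredients, a frequency-warping path and an additive log-spectral correction, can be sketched directly. A toy version that warps a source log envelope along a hypothetical monotone frequency mapping and then adds the class-average target-minus-source difference (the paper learns both per acoustic class):

    ```python
    import numpy as np

    def apply_dfw(log_env_src, warp):
        """Resample the source log envelope at warped frequency positions.
        `warp` maps target bin -> (fractional) source bin, monotonically."""
        bins = np.arange(len(log_env_src))
        return np.interp(warp, bins, log_env_src)

    def amplitude_scaling(warped_srcs, targets):
        """Additive log-domain correction: average target minus average
        warped-source log spectrum for one acoustic class."""
        return np.mean(targets, axis=0) - np.mean(warped_srcs, axis=0)

    n = 64
    src = -np.linspace(0, 4, n) + np.sin(np.linspace(0, 6, n))  # toy envelope
    warp = np.linspace(0, n - 1, n) ** 1.05 / (n - 1) ** 0.05   # toy DFW path
    warped = apply_dfw(src, warp)
    tgt_examples = warped[None, :] + 1.5                        # toy class data
    corr = amplitude_scaling(warped[None, :], tgt_examples)
    converted = warped + corr                                   # DFWA output
    print(float(np.max(np.abs(converted - (warped + 1.5)))))    # ~0
    ```
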
  • Model-Based Speech Enhancement With Improved Spectral Envelope Estimation via Dynamics Tracking

    Page(s): 1324 - 1336

    In this work, we present a model-based approach to enhancing noisy speech using an analysis-synthesis framework. Target speech is reconstructed with model parameters estimated from noisy observations. In particular, the spectral envelope is estimated by tracking its temporal trajectories in order to improve the noise-distorted short-time spectral amplitude. We first propose an analysis-synthesis framework for speech enhancement based on the harmonic noise model (HNM). Acoustic parameters such as pitch, spectral envelope, and spectral gain are extracted by HNM analysis. Spectral envelope estimation is improved by tracking line spectrum frequency trajectories with Kalman filtering. System identification for the Kalman filter is achieved via a combined design of a codebook mapping scheme and a maximum-likelihood estimator with parallel training data. The complete system design and experimental validation are given in detail. Through performance evaluations based on spectrograms, objective measures, and a subjective listening test, it is demonstrated that the proposed approach achieves significant improvements over conventional methods in various conditions. A distinct advantage of the proposed method is that it successfully tackles the “musical tones” problem.
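
    The trajectory-tracking component reduces to a standard Kalman recursion per line spectrum frequency track; the paper's contribution lies in identifying the filter via codebook mapping and maximum-likelihood training. A generic scalar sketch with assumed (untrained) system parameters:

    ```python
    import numpy as np

    def kalman_track(observations, q=1e-4, r=2.5e-3):
        """Scalar Kalman filter with a random-walk state model: x_t = x_{t-1}
        + w (process noise q), y_t = x_t + v (measurement noise r)."""
        x, p = observations[0], 1.0
        track = [x]
        for y in observations[1:]:
            x_pred, p_pred = x, p + q          # predict
            k = p_pred / (p_pred + r)          # Kalman gain
            x = x_pred + k * (y - x_pred)      # update with the noisy LSF
            p = (1.0 - k) * p_pred
            track.append(x)
        return np.array(track)

    rng = np.random.default_rng(3)
    clean = 0.8 + 0.05 * np.sin(np.linspace(0, 3, 100))  # toy LSF trajectory
    noisy = clean + rng.normal(scale=0.05, size=100)     # noise-distorted frames
    smoothed = kalman_track(noisy)
    print(float(np.std(noisy - clean)), float(np.std(smoothed - clean)))
    ```
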
  • Novel Variations of Group Sparse Regularization Techniques With Applications to Noise Robust Automatic Speech Recognition

    Page(s): 1337 - 1346

    This paper presents novel variations of group sparse regularization techniques. We expand upon the Sparse Group LASSO formulation to incorporate different learning techniques for better sparsity enforcement within a group, and we demonstrate the effectiveness of the algorithms for spectral denoising with applications to robust automatic speech recognition (ASR). In particular, we show that, with a strategic selection of groupings, greater robustness in noisy speech recognition can be achieved compared to state-of-the-art techniques such as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) implementation of the Sparse Group LASSO. Moreover, we demonstrate that group sparse regularization techniques can offer significant gains over efficient techniques like the Elastic Net. We also show that the proposed algorithms are effective in exploiting collinear dictionaries to deal with the inherently highly coherent nature of speech spectral segments. Experiments on the Aurora 2.0 continuous-digit database and the realistic noisy Aurora 3.0 database demonstrate the performance improvement of the proposed methods and show that their execution time is comparable to FISTA, making the algorithms practical for a wide range of regularization problems.
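
    Under the hood, Sparse Group LASSO-style denoising revolves around a proximal operator that combines per-coefficient soft-thresholding with per-group shrinkage; FISTA alternates it with gradient steps on the reconstruction term. A sketch of the proximal step alone (the grouping of dictionary atoms is assumed):

    ```python
    import numpy as np

    def sparse_group_prox(x, groups, lam_l1, lam_group):
        """Prox of lam_l1*||x||_1 + lam_group*sum_g ||x_g||_2 (the Sparse Group
        LASSO penalty): elementwise soft-threshold, then shrink each group."""
        z = np.sign(x) * np.maximum(np.abs(x) - lam_l1, 0.0)  # within-group
        out = np.zeros_like(z)
        for g in groups:                                      # group level
            norm = np.linalg.norm(z[g])
            if norm > lam_group:
                out[g] = (1.0 - lam_group / norm) * z[g]
        return out

    x = np.array([2.0, -0.1, 0.05, 1.5, 1.2, -0.02])
    groups = [np.arange(0, 3), np.arange(3, 6)]    # assumed atom grouping
    print(sparse_group_prox(x, groups, lam_l1=0.1, lam_group=0.5))
    ```
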
  • Real-Time Robust Automatic Speech Recognition Using Compact Support Vector Machines

    Page(s): 1347 - 1361

    In recent years, support vector machines (SVMs) have shown excellent performance in many applications, especially in the presence of noise. In particular, SVMs offer several advantages over artificial neural networks (ANNs) that have attracted the attention of the speech processing community. Nevertheless, their high computational requirements prevent them from being used in practice in automatic speech recognition (ASR), where ANNs have proven to be successful. The high complexity of SVMs in this context arises from the use of huge speech training databases with millions of samples and highly overlapped classes. This paper suggests the use of a weighted least squares (WLS) training procedure that makes it possible to impose a compact semiparametric model on the SVM, resulting in a dramatic complexity reduction. Such a complexity reduction with respect to conventional SVMs, of between two and three orders of magnitude, allows the proposed hybrid WLS-SVC/HMM system to perform real-time speech decoding on a connected-digit recognition task (SpeechDat Spanish database). The experimental evaluation of the proposed system shows encouraging performance levels in clean and noisy conditions, although further improvements are required to reach the maturity level of current context-dependent HMM-based recognizers.
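
    The complexity reduction comes from restricting the model to a small fixed set of kernel centroids and fitting their coefficients by (weighted) least squares instead of full QP training. A schematic numpy version; the centroid selection and sample weights are placeholders for the paper's procedure:

    ```python
    import numpy as np

    def rbf(X, C, gamma=1.0):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 2))
    y = np.sign(X[:, 0] + X[:, 1])        # toy two-class labels in {-1, +1}

    C = X[rng.choice(len(X), 10, replace=False)]  # compact model: 10 centroids
    Phi = rbf(X, C)
    W = np.ones(len(X))                   # placeholder WLS sample weights
    lam = 1e-2
    A = Phi.T @ (W[:, None] * Phi) + lam * np.eye(Phi.shape[1])
    alpha = np.linalg.solve(A, Phi.T @ (W * y))   # weighted least-squares fit

    pred = np.sign(rbf(X, C) @ alpha)     # decoding cost: 10 kernels per frame
    print(float((pred == y).mean()))      # training accuracy of the sketch
    ```
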

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies, and applications relating to the analysis, coding, enhancement, recognition, and synthesis of audio, music, speech, and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research