
Speech and Audio Processing, IEEE Transactions on

Issue 5 • Date Sep 2000


Displaying Results 1 - 14 of 14
  • Automatic verbal information verification for user authentication

    Publication Year: 2000 , Page(s): 585 - 596
    Cited by:  Papers (13)  |  Patents (5)

    Traditional speaker authentication focuses on speaker verification (SV) and speaker identification, which are accomplished by matching the speaker's voice with his or her registered speech patterns. In this paper, we propose a new technique, verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. Using the proposed sequential procedure involving three question-response turns, we achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. We further propose a speaker authentication system combining VIV with SV. In this system, a user is verified by VIV during the first four to five accesses, usually from different acoustic environments. During these uses, one of the key questions pertains to a pass-phrase for SV. The VIV system collects and verifies the pass-phrase utterance for use as training data for speaker model construction. After a speaker-dependent model is constructed, the system migrates to SV. This approach avoids the inconvenience of a formal enrollment procedure, ensures the quality of the training data for SV, and mitigates the mismatch caused by different acoustic environments between training and testing. Experiments showed that the proposed system improved SV performance by over 40% in equal-error rate compared to a conventional SV system.

  • Robust decision tree state tying for continuous speech recognition

    Publication Year: 2000 , Page(s): 555 - 566
    Cited by:  Papers (16)  |  Patents (1)

    Methods of improving the robustness and accuracy of acoustic modeling using decision-tree-based state tying are described. A new two-level segmental clustering approach is devised which combines decision-tree-based state tying with agglomerative clustering of rare acoustic-phonetic events. In addition, a unified maximum-likelihood framework for incorporating both phonetic and nonphonetic features in decision-tree-based state tying is presented. In contrast to other heuristic data-separation methods, which often lead to training-data depletion, a tagging scheme is used to attach various features of interest, and the selection of these features in the decision tree is data driven. Finally, two methods of using multiple-mixture parameterization to improve the quality of the evaluation function in decision-tree state tying are described. One method is based on k-means fitting and the other on a novel use of a local multilevel optimal subtree. Both methods provide more accurate likelihood evaluation in decision-tree clustering and are consistent with the structure of the decision tree. Experimental results on Wall Street Journal corpora demonstrate that the proposed approaches lead to a significant improvement in model quality and recognition performance.

  • Content-based audio classification and retrieval using the nearest feature line method

    Publication Year: 2000 , Page(s): 619 - 625
    Cited by:  Papers (74)  |  Patents (1)

    A method is presented for content-based audio classification and retrieval. It is based on a new pattern classification method called the nearest feature line (NFL). In the NFL, information provided by multiple prototypes per class is explored; this contrasts with nearest neighbor (NN) classification, in which the query is compared to each prototype individually. Regarding audio representation, perceptual and cepstral features and their combinations are considered. Extensive experiments are performed to compare various classification methods and feature sets. The results show that the NFL-based method produces consistently better results than the NN-based and other methods. A system resulting from this work achieved an error rate of 9.78%, compared to 18.34% for a competing existing system, as tested on a common audio database.
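
    The NFL rule summarized above can be sketched briefly: each unordered pair of prototypes of a class spans a "feature line," and the query is assigned to the class whose line passes closest to it. The following is a minimal illustration; the function names and toy data are ours, not from the paper:

    ```python
    import numpy as np

    def nfl_distance(q, x1, x2):
        """Distance from query q to the feature line through prototypes x1, x2."""
        d = x2 - x1
        t = np.dot(q - x1, d) / np.dot(d, d)   # projection parameter along the line
        p = x1 + t * d                          # projection point on the line
        return np.linalg.norm(q - p)

    def nfl_classify(q, prototypes):
        """prototypes: dict mapping class label -> array of shape (n, dim), n >= 2.
        Returns the class whose feature line lies nearest to q."""
        best_cls, best_d = None, np.inf
        for cls, X in prototypes.items():
            n = len(X)
            for i in range(n):
                for j in range(i + 1, n):
                    dist = nfl_distance(q, X[i], X[j])
                    if dist < best_d:
                        best_cls, best_d = cls, dist
        return best_cls
    ```

    Note that the line is extrapolated beyond the two prototypes, which is what lets multiple prototypes per class generalize beyond the NN rule.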

  • Speech enhancement based on the subspace method

    Publication Year: 2000 , Page(s): 497 - 507
    Cited by:  Papers (33)  |  Patents (4)

    A method of speech enhancement using microphone-array signal processing based on the subspace method is proposed and evaluated. The method consists of two stages corresponding to different types of noise. In the first stage, less-directional ambient noise is reduced by eliminating the noise-dominant subspace. This is realized by weighting the eigenvalues of the spatial correlation matrix, based on the fact that the energy of less-directional noise spreads over all eigenvalues while that of directional components is concentrated on a few dominant eigenvalues. In the second stage, the spectrum of the target source is extracted from the mixture of spectra of the multiple directional components remaining in the modified spatial correlation matrix by using a minimum variance beamformer. Finally, the proposed method is evaluated in both a simulated model environment and a real environment.
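
    The first stage described above, suppressing the noise-dominant subspace by reweighting eigenvalues of the spatial correlation matrix, can be sketched roughly as follows. The hard 0/1 weighting and the function name are illustrative assumptions; the paper's actual weighting function may be softer:

    ```python
    import numpy as np

    def suppress_ambient_noise(R, n_sources):
        """Sketch of first-stage subspace filtering: keep only the dominant
        (directional) subspace of the spatial correlation matrix R and zero
        out the noise-dominant eigenvalues."""
        w, V = np.linalg.eigh(R)            # eigenvalues in ascending order
        weights = np.zeros_like(w)
        weights[-n_sources:] = 1.0          # retain the n_sources largest eigenvalues
        return (V * (w * weights)) @ V.conj().T   # rebuild the modified matrix
    ```

    The second stage would then apply a minimum variance beamformer to the modified matrix to extract the target source from the remaining directional components.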

  • Maximum entropy language modeling and the smoothing problem

    Publication Year: 2000 , Page(s): 626 - 632
    Cited by:  Papers (3)  |  Patents (1)

    This paper discusses various aspects of smoothing techniques in maximum entropy language modeling, a topic typically not addressed in the literature. The results can be summarized in four statements: 1) straightforward maximum entropy models with nested features, e.g., tri-, bi-, and uni-grams, result in unsmoothed relative-frequency models; 2) maximum entropy models with nested features and discounted feature counts approximate backing-off smoothed relative-frequency models with Kneser's advanced marginal back-off distribution, which explains some of the reported success of maximum entropy models in the past; 3) we give perplexity results for nested and nonnested features, e.g., trigrams and distance-trigrams, on a 4-million-word subset of the Wall Street Journal corpus, from which we conclude that the smoothing method has more effect on perplexity than the method of combining the different types of features; and 4) we show perplexity results for nonnested features using log-linear interpolation of conventionally smoothed language models, giving evidence that this approach may be a first step toward overcoming the smoothing problem in the context of maximum entropy.
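
    The log-linear interpolation of statement 4 can be illustrated with a toy sketch: P(w|h) is proportional to the product of the component model probabilities raised to their interpolation weights, renormalized over the vocabulary. The interface below (models as callables) is our assumption, not the paper's:

    ```python
    import math

    def loglinear_interpolate(models, lambdas, history, vocab):
        """Combine smoothed LM probabilities log-linearly:
        P(w|h) proportional to prod_i P_i(w|h)**lambda_i, renormalized over vocab.
        models: list of callables p(word, history); lambdas: interpolation weights."""
        def score(w):
            return math.exp(sum(lam * math.log(m(w, history))
                                for m, lam in zip(models, lambdas)))
        Z = sum(score(w) for w in vocab)   # normalization constant
        return {w: score(w) / Z for w in vocab}
    ```

    Unlike linear interpolation, the weights here act on log-probabilities, so each already-smoothed component model keeps its own smoothing intact.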

  • VERBMOBIL: the use of prosody in the linguistic components of a speech understanding system

    Publication Year: 2000 , Page(s): 519 - 532
    Cited by:  Papers (15)  |  Patents (6)

    We show how prosody can be used in speech understanding systems. This is demonstrated with the VERBMOBIL speech-to-speech translation system which, to our knowledge, is the first complete system that successfully uses prosodic information in the linguistic analysis. Prosody is used by computing probabilities for clause boundaries, accentuation, and different types of sentence mood for each of the word hypotheses computed by the word recognizer. These probabilities guide the search of the linguistic analysis. Disambiguation is achieved during the analysis itself and not by a prosodic verification of different linguistic hypotheses. So far, the most useful prosodic information is provided by clause boundaries, which are detected with a recognition rate of 94%. For the parsing of word hypothesis graphs, the use of clause boundary probabilities yields a speed-up of 92% and a 96% reduction in alternative readings.

  • Estimation of handset nonlinearity with application to speaker recognition

    Publication Year: 2000 , Page(s): 567 - 584
    Cited by:  Papers (23)

    A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model driven by an undistorted reference. This “magnitude-only” representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless nonlinearity sandwiched between two finite-length linear filters. Nonlinearities considered include arbitrary finite-order polynomials and parametric sigmoidal functionals derived from a carbon-button handset model. Minimization of a mean-squared spectral magnitude distance with respect to model parameters relies on iterative estimation via a gradient descent technique. Initial work has demonstrated the importance of addressing handset nonlinearity, in addition to linear distortion, in speaker recognition over telephone channels. A nonlinear handset “mapping,” applied to training or testing data to reduce mismatch between different types of handset microphone outputs, improves speaker verification performance relative to linear compensation only. Finally, a method is proposed to merge the mapper strategy with a method of likelihood score normalization (hnorm) for further mismatch reduction and speaker verification performance improvement.
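
    The distortion model described above, a memoryless polynomial nonlinearity sandwiched between two finite-length linear filters (a Wiener-Hammerstein-style cascade), might be sketched as follows. The function name and the polynomial parameterization are illustrative; the paper also considers sigmoidal nonlinearities:

    ```python
    import numpy as np

    def sandwich_distortion(x, h_pre, poly_coeffs, h_post):
        """Forward pass of the distortion model: FIR filter -> memoryless
        polynomial nonlinearity -> FIR filter.
        h_pre, h_post: FIR taps; poly_coeffs: (c1, c2, ...) for c1*y + c2*y**2 + ..."""
        y = np.convolve(x, h_pre, mode='full')              # first linear filter
        y = sum(c * y**(k + 1) for k, c in enumerate(poly_coeffs))  # nonlinearity
        return np.convolve(y, h_post, mode='full')          # second linear filter
    ```

    In the paper's setup, the model parameters would be fit by gradient descent on a mean-squared spectral magnitude distance between this output and the observed distorted signal.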

  • Joint position and amplitude search of algebraic multipulses

    Publication Year: 2000 , Page(s): 633 - 637
    Cited by:  Papers (5)

    A joint position and amplitude search algorithm is proposed for algebraic multipulse codebooks to be used in code-excited linear predictive (CELP) coders. The joint search complexity is below one quarter that of the focused search and ranks below those of the G.729A and IS-641-A coders. Listening tests indicate an equivalence in perceived quality.

  • Mixture IMM for speech enhancement under nonstationary noise

    Publication Year: 2000 , Page(s): 637 - 641

    A mixture interacting multiple model (MIMM) algorithm is proposed to enhance speech contaminated by additive nonstationary noise. In this approach, a mixture hidden filter model (HFM) is used for clean speech modeling and a single hidden filter is used for noise process modeling. The MIMM algorithm gives better enhancement results than the IMM algorithm. The results show that the proposed method offers a performance gain over previously reported results, with slightly increased complexity.

  • Proportionate normalized least-mean-squares adaptation in echo cancelers

    Publication Year: 2000 , Page(s): 508 - 518
    Cited by:  Papers (200)  |  Patents (9)

    On typical echo paths, the proportionate normalized least-mean-squares (PNLMS) adaptation algorithm converges significantly faster than the normalized least-mean-squares (NLMS) algorithm generally used in echo cancelers to date. In PNLMS adaptation, the adaptation gain varies from tap position to tap position and is roughly proportional at each position to the absolute value of the current tap weight estimate. The total adaptation gain distributed over the taps is carefully monitored and controlled so as to hold the adaptation quality (misadjustment noise) constant. PNLMS adaptation entails only a modest increase in computational complexity.
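
    A minimal sketch of one PNLMS adaptation step follows, with the gain at each tap roughly proportional to the magnitude of the current weight estimate, as described above. The step size, floor parameters, and normalization below are illustrative choices, not the paper's exact rule:

    ```python
    import numpy as np

    def pnlms_update(w, x, d, mu=0.5, delta=1e-2, rho=1e-2):
        """One PNLMS step for filter weights w given input vector x and
        desired sample d. A small floor (rho, delta) keeps inactive taps
        adapting; gains are normalized to average 1 to control total gain."""
        e = d - w @ x                                           # a-priori error
        g = np.maximum(np.abs(w), rho * max(np.abs(w).max(), delta))
        g = g / g.mean()                                        # per-tap gains, mean 1
        return w + mu * e * (g * x) / (x @ (g * x) + delta)     # proportionate update
    ```

    On a sparse echo path, the large taps receive proportionally large gains and converge quickly, which is the source of the speed-up over NLMS.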

  • Multichannel recursive-least-square algorithms and fast-transversal-filter algorithms for active noise control and sound reproduction systems

    Publication Year: 2000 , Page(s): 606 - 618
    Cited by:  Papers (42)

    There has been much research on active noise control (ANC) systems and transaural sound reproduction (TSR) systems. In those fields, multichannel FIR adaptive filters are extensively used. For the learning of FIR adaptive filters, recursive-least-squares (RLS) algorithms are known to produce faster convergence than stochastic gradient descent techniques, such as the basic least-mean-squares (LMS) algorithm or even the fast-convergence Newton-LMS, gradient-adaptive-lattice (GAL) LMS, and discrete-cosine-transform (DCT) LMS algorithms. In this paper, multichannel RLS algorithms and multichannel fast-transversal-filter (FTF) algorithms are introduced, using the structures of some stochastic gradient descent algorithms used in ANC: the filtered-x LMS, the modified filtered-x LMS, and the adjoint LMS. The new algorithms can be used in ANC systems or for the deconvolution of sounds in TSR systems. Simulation results comparing convergence speed, numerical stability, and performance with noisy plant models for the different multichannel algorithms are presented, showing the large gain in convergence speed that can be achieved by some of the introduced algorithms.

  • Elimination of delay-free loops in discrete-time models of nonlinear acoustic systems

    Publication Year: 2000 , Page(s): 597 - 605
    Cited by:  Papers (19)

    Nonlinear acoustic systems are often described by means of nonlinear maps acting as instantaneous constraints on the solutions of a system of linear differential equations. This description leads to discrete-time models exhibiting noncomputable loops. We present a solution to this computability problem by means of a geometrical transformation of the nonlinearities and an algebraic transformation of the time-dependent equations. The proposed solution leads to stable and accurate simulations even at relatively low sampling rates.

  • Word boundary detection with mel-scale frequency bank in noisy environment

    Publication Year: 2000 , Page(s): 541 - 554
    Cited by:  Papers (22)

    This paper addresses the problem of automatic word boundary detection in the presence of noise. We first propose an adaptive time-frequency (ATF) parameter for extracting both the time and frequency features of noisy speech signals. The ATF parameter extends the TF parameter proposed by Junqua et al. (1994) from single-band to multiband spectrum analysis, where the frequency bands help to make the distinction between speech and noise signals clear. The ATF parameter can extract useful frequency information by adaptively choosing proper bands of the mel-scale frequency bank. The ATF parameter increased the recognition rate of a TF-based robust algorithm, which has been shown to outperform several commonly used algorithms for word boundary detection in the presence of noise, by about 3%. It also reduced the recognition error rate due to endpoint detection to about 20%. Based on the ATF parameter, we further propose a new word boundary detection algorithm using a neural fuzzy network (called SONFIN) for identifying islands of word signals in a noisy environment. Owing to the self-learning ability of SONFIN, the proposed algorithm avoids the need to empirically determine the thresholds and ambiguous rules found in normal word boundary detection algorithms. Compared to normal neural networks, SONFIN can always find an economical network size with high learning speed. Our results also showed that SONFIN's performance is not significantly affected by the size of the training set. The ATF-based SONFIN achieved a recognition rate higher than the TF-based robust algorithm by about 5%. It also reduced the recognition error rate due to endpoint detection to about 10%, compared to an average of approximately 30% obtained with the TF-based robust algorithm and 50% obtained with the modified version of the Lamel et al. (1981) algorithm.

  • Noise-compensated hidden Markov models

    Publication Year: 2000 , Page(s): 533 - 540
    Cited by:  Papers (7)  |  Patents (1)

    The technique of hidden Markov models has been established as one of the most successful methods applied to the problem of speech recognition. However, its performance is considerably degraded when the speech signal is contaminated by noise. This work presents a technique which improves the performance of hidden Markov models when these models are used under different noise conditions during the speech recognition process. The input speech signal enters the recognition process unchanged, while the models used by the recognition system are compensated according to the characteristics of the affecting noise, namely its power and spectral shape. Hence, the compensation stage is independent of the recognition stage, allowing the models to be continually adjusted. The models used in this work are from a continuous-density hidden Markov algorithm, having cepstral coefficients derived from linear predictive analysis as state parameters. Only static features are used in the models, in order to show that, when properly compensated for the noise, these static features contribute significantly to improving noisy speech recognition. It is observed from the results that the parameters kept their capability to discriminate among different classes of signals, indicating that, in the context of speech recognition, the use of autoregressive-derived parameters with noisy signals does not represent an impediment. A matrix method of converting from autoregressive coefficients to normalized autocorrelation coefficients is presented. The affecting noise is assumed to be additive and statistically independent of the speech signal. Although the noise dealt with should also be stationary, good performance was achieved for nonstationary noise, such as operations-room noise and factory-environment noise. The concept of intra-word signal-to-noise ratio is presented and successfully applied. The resulting compensated models are revealed to be less dependent on the training data set when compared to the trained hidden Markov models. Owing to its computational simplicity, the time required to adjust a model is significantly shorter than the time to train it.


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
