IEEE Transactions on Speech and Audio Processing

Issue 3 • May 1998

Displaying Results 1 - 11 of 11
  • An RNN-based prosodic information synthesizer for Mandarin text-to-speech

    Publication Year: 1998, Page(s): 226 - 239
    Cited by:  Papers (34)  |  Patents (8)

    A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, and intersyllable pause durations. The input layer and first hidden layer operate on a word-synchronized clock to represent current-word phonologic states within the prosodic structure of the text to be synthesized. The second hidden layer and output layer operate on a syllable-synchronized clock and use the outputs of the preceding layers, together with additional syllable-level inputs fed directly to the second hidden layer, to generate the desired prosodic parameters. The RNN was trained on a large set of actual utterances accompanied by their associated texts and can automatically learn many phonologic rules of human prosody, including the well-known Sandhi Tone 3 F0-change rule. Experimental results show that all synthesized prosodic parameter sequences matched their original counterparts quite well, and a pitch-synchronous-overlap-add-based (PSOLA-based) Mandarin TTS system was used to test the approach. While formal subjective tests remain to be performed, informal listening tests with a significant number of native Chinese speakers confirmed that the synthesized speech sounds quite natural.
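
    As a rough illustration of the two-clock architecture sketched in the abstract, the following NumPy snippet runs a word-rate recurrent layer whose state is fed, together with syllable-level inputs, into a syllable-rate recurrent layer that emits prosodic parameters. The layer sizes, plain tanh recurrences, and random weights are assumptions made for the example, not the paper's actual network or training procedure.

      # Minimal sketch of a two-clock recurrent prosody generator (illustrative only).
      import numpy as np

      rng = np.random.default_rng(0)
      # assumed sizes: word features, syllable features, two hidden layers, prosodic outputs
      D_WORD, D_SYL, H1, H2, D_OUT = 8, 6, 16, 16, 4

      Ww_in  = rng.normal(scale=0.1, size=(H1, D_WORD))
      Ww_rec = rng.normal(scale=0.1, size=(H1, H1))
      Ws_in  = rng.normal(scale=0.1, size=(H2, D_SYL + H1))
      Ws_rec = rng.normal(scale=0.1, size=(H2, H2))
      W_out  = rng.normal(scale=0.1, size=(D_OUT, H2))

      def synthesize_prosody(words):
          """words: list of (word_features, [syllable_features, ...]) pairs."""
          h_word = np.zeros(H1)   # word-synchronized hidden state
          h_syl = np.zeros(H2)    # syllable-synchronized hidden state
          outputs = []
          for word_feat, syllables in words:
              # word clock: one update per word (phonologic context of the current word)
              h_word = np.tanh(Ww_in @ word_feat + Ww_rec @ h_word)
              for syl_feat in syllables:
                  # syllable clock: combine the word-level state with syllable-level inputs
                  x = np.concatenate([syl_feat, h_word])
                  h_syl = np.tanh(Ws_in @ x + Ws_rec @ h_syl)
                  outputs.append(W_out @ h_syl)   # e.g. pitch, energy, duration, pause parameters
          return np.array(outputs)

      # toy usage: two words carrying two syllables and one syllable respectively
      demo = [(rng.normal(size=D_WORD), [rng.normal(size=D_SYL) for _ in range(2)]),
              (rng.normal(size=D_WORD), [rng.normal(size=D_SYL)])]
      print(synthesize_prosody(demo).shape)   # (3, 4)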

  • Minimization of the maximum error signal in active control

    Publication Year: 1998, Page(s): 268 - 281
    Cited by:  Papers (4)

    This paper presents a new multichannel adaptive filtering algorithm for the active control of single frequency noise in acoustic systems. Most active control systems with multiple error sensors minimize the sum of the modulus squared outputs of these sensors. An adaptive algorithm is presented that minimizes an alternative cost function which, in the limit, is equal to the maximum of the modulus squared values of all the error sensors. The physical consequences of minimizing the maximum modulus squared output to achieve noise reduction are investigated. An analytical framework is developed that covers the steady-state performance as well as the convergence properties. By means of simulations, the proposed algorithm has been applied to a linear model of a one-dimensional (1-D) acoustic system and compared with the classical least squares solutions. The proposed algorithm is also used in simulations of the control of the pressure measured at 32 microphone positions in a room using 16 loudspeakers, when the room is excited with an 88 Hz pure tone. The results of these simulations show that the observed convergence and steady-state properties agree well with the theoretical predictions. A comparison with the least squares solutions leads to the conclusion that the proposed algorithm produces a more uniform acoustic field in the enclosure than the classical least squares algorithm.
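
    As a generic illustration of the cost being minimized (not the adaptive algorithm derived in the paper), the following NumPy sketch applies a normalized subgradient step to the currently worst sensor of a single-frequency phasor model e = d + Cw; the plant matrix, disturbance, and step size are random or assumed stand-ins.

      # Illustrative minimization of max_k |e_k|^2 for complex phasors at one frequency.
      import numpy as np

      rng = np.random.default_rng(1)
      L, M = 32, 16                    # 32 error sensors, 16 secondary sources, as in the simulations
      C = rng.normal(size=(L, M)) + 1j * rng.normal(size=(L, M))   # stand-in plant responses at the tone
      d = rng.normal(size=L) + 1j * rng.normal(size=L)             # stand-in primary disturbance

      w = np.zeros(M, dtype=complex)   # complex secondary-source strengths
      mu = 0.2
      for _ in range(20000):
          e = d + C @ w                          # complex error phasors
          k = int(np.argmax(np.abs(e) ** 2))     # worst-case sensor
          # normalized subgradient step on the maximum modulus squared error
          w -= mu * np.conj(C[k]) * e[k] / np.linalg.norm(C[k]) ** 2

      e = d + C @ w
      print("max |e_k|^2:", np.max(np.abs(e) ** 2), "initial:", np.max(np.abs(d) ** 2))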

  • HMM-based stressed speech modeling with application to improved synthesis and recognition of isolated speech under stress

    Publication Year: 1998, Page(s): 201 - 216
    Cited by:  Papers (15)  |  Patents (3)

    A novel approach is proposed for modeling speech parameter variations between neutral and stressed conditions and is employed in a technique for stressed speech synthesis and recognition. The proposed method models the variations in pitch contour, voiced speech duration, and average spectral structure using hidden Markov models (HMMs). While HMMs have traditionally been used for recognition applications, here they are employed to statistically model the characteristics needed for generating pitch contour and spectral perturbation contour patterns to modify the speaking style of isolated neutral words. The proposed HMM models are both speaker- and word-independent, but unique to each speaking style. While the modeling scheme is applicable to a variety of stress and emotional speaking styles, the evaluations presented focus on angry speech, Lombard (1911) effect speech, and loud spoken speech, in three areas. First, formal subjective listener evaluations of the modified speech confirm the HMMs' ability to capture the parameter variations under stressed conditions. Second, an objective evaluation using a separately formulated stress classifier is employed to assess the presence of stress imparted on the synthetic speech. Finally, the stressed speech is also used for training and is shown to measurably improve the performance of an HMM-based stressed speech recognizer.

  • Computational complexity of a fast Viterbi decoding algorithm for stochastic letter-phoneme transduction

    Publication Year: 1998, Page(s): 217 - 225
    Cited by:  Papers (3)

    This paper describes a modification to, and a fast implementation of, the Viterbi algorithm for use in stochastic letter-to-phoneme conversion. A straightforward (but unrealistic) implementation of the Viterbi (1967) algorithm has linear time complexity with respect to the length of the letter string, but quadratic complexity if the number of letter-to-phoneme correspondences is additionally considered a variable determining the problem size. Since the number of correspondences can be large, processing time is long. If the correspondences are precompiled into a deterministic finite-state automaton to simplify the matching process that determines state survivors, execution time is reduced by a large multiplicative factor. The speed-up is inferred indirectly, since the straightforward implementation of Viterbi decoding is too slow for practical comparison; it ranges between about 200 and 4000, depending on the number of letters processed and the particular correspondences employed in the transduction. Space complexity increases linearly with the number of states of the automaton. This work has implications for fast, efficient implementation of a variety of speech and language engineering systems.
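
    For concreteness, a toy Viterbi-style dynamic program over letter-to-phoneme correspondences is sketched below; the correspondence table and probabilities are invented, and the large speed-up described in the paper comes from precompiling such a table into a deterministic finite-state automaton, which this sketch does not do.

      # Most probable segmentation of a letter string into chunks, each mapped to a phoneme.
      import math

      CORRESPONDENCES = {     # letter chunk -> [(phoneme, probability), ...]  (toy values)
          "ph": [("f", 0.9)],
          "p":  [("p", 0.95)],
          "h":  [("h", 0.6), ("", 0.4)],
          "o":  [("oh", 0.7), ("ah", 0.3)],
          "n":  [("n", 0.95)],
          "e":  [("iy", 0.4), ("", 0.5), ("eh", 0.1)],
      }

      def decode(word, max_chunk=2):
          n = len(word)
          best = [(-math.inf, None)] * (n + 1)   # best[i] = (log prob, backpointer) for word[:i]
          best[0] = (0.0, None)
          for i in range(1, n + 1):
              for k in range(1, min(max_chunk, i) + 1):
                  chunk = word[i - k:i]
                  prev_score = best[i - k][0]
                  for phoneme, p in CORRESPONDENCES.get(chunk, []):
                      score = prev_score + math.log(p)
                      if score > best[i][0]:
                          best[i] = (score, (i - k, phoneme))
          phonemes, i = [], n                    # backtrack through the surviving choices
          while i > 0 and best[i][1] is not None:
              j, ph = best[i][1]
              if ph:
                  phonemes.append(ph)
              i = j
          return list(reversed(phonemes)), best[n][0]

      print(decode("phone"))   # (['f', 'oh', 'n'], log probability of the best segmentation)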

  • Isolated Mandarin base-syllable recognition based upon the segmental probability model

    Publication Year: 1998, Page(s): 293 - 299
    Cited by:  Papers (1)

    A segmental probability model (SPM) is proposed for fast and accurate recognition of the highly confusable isolated Mandarin base-syllables. The SPM is obtained from continuous density hidden Markov models (CHMMs) by deleting the state transition probabilities, abandoning the dynamic programming process, and letting the states segment each base-syllable equally and deterministically; several special approaches that exploit the characteristics of the target vocabulary are then used to further improve accuracy and speed.
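
    The scoring idea can be sketched as follows; single diagonal Gaussians per state and the toy model sizes are assumptions, not the exact SPM configuration used in the paper.

      # SPM scoring: equal, deterministic segmentation of the frames among the states,
      # no transition probabilities and no dynamic programming.
      import numpy as np

      def spm_log_likelihood(frames, means, variances):
          """frames: (T, D) features; means/variances: (N_states, D) diagonal Gaussians."""
          T, n_states = len(frames), len(means)
          bounds = np.linspace(0, T, n_states + 1).astype(int)
          total = 0.0
          for s in range(n_states):
              seg = frames[bounds[s]:bounds[s + 1]]
              diff = seg - means[s]
              total += np.sum(-0.5 * (np.log(2 * np.pi * variances[s]) + diff ** 2 / variances[s]))
          return total

      # recognition: pick the base-syllable model with the largest score
      rng = np.random.default_rng(0)
      frames = rng.normal(size=(40, 12))                                        # 40 frames, 12-dim features
      models = [(rng.normal(size=(5, 12)), np.ones((5, 12))) for _ in range(3)] # 3 toy syllable models
      print(int(np.argmax([spm_log_likelihood(frames, m, v) for m, v in models])))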

  • Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering

    Publication Year: 1998, Page(s): 240 - 259
    Cited by:  Papers (71)  |  Patents (16)

    In teleconferencing systems, the use of hands-free sound pick-up reduces speech quality. This is due to ambient noise, acoustic echo, and the reverberation produced by the acoustic environment. This paper presents a theoretical analysis of noise reduction and dereverberation algorithms based on a microphone array combined with a Wiener postfilter. It is shown that the transfer function of the postfilter depends on the input signal-to-noise ratio (SNR) and on the noise reduction yielded by the array. The use of a directivity-controlled array instead of a conventional beamformer is proposed to improve the performance of the whole system. Examples in real room environments are provided which confirm the theoretical results. It is observed that the postfilter gives only a limited reduction of the reverberation. By contrast, an appreciable reduction of acoustic echo and localized noise is obtained, which makes the whole system highly attractive for hands-free communication systems.
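
    As context for the stated dependence, a Wiener-type postfilter can be written in the generic form below, where \gamma(f) is the input SNR and G(f) is the noise-reduction factor provided by the array; this is an illustrative expression, not necessarily the exact transfer function derived in the paper.

      W(f) = \frac{\gamma(f)\, G(f)}{1 + \gamma(f)\, G(f)}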

  • Speaker identification based on the use of robust cepstral features obtained from pole-zero transfer functions

    Publication Year: 1998, Page(s): 260 - 267
    Cited by:  Papers (13)

    A common problem in speaker identification systems is that a mismatch between training and testing conditions degrades performance considerably. We attempt to alleviate this problem by proposing new features that show less variation when speech is corrupted by convolutional noise (channel effects) and/or additive noise. The conventional feature is the linear predictive (LP) cepstrum, derived from an all-pole transfer function that closely approximates the spectral envelope of the speech. A different cepstral feature based on a pole-zero transfer function (the adaptive component weighted, or ACW, cepstrum) was previously introduced. We propose four additional cepstral features based on pole-zero transfer functions. One is an alternative way of doing adaptive component weighting, called the ACW2 cepstrum. Two others (the PFL1 and PFL2 cepstra) are based on a pole-zero postfilter used in speech enhancement. Finally, an autoregressive moving-average (ARMA) analysis of speech yields a pole-zero transfer function describing the spectral envelope; the cepstrum of this transfer function serves as the fourth feature. Experiments with a closed-set, text-independent, vector-quantizer-based speaker identification system are carried out to compare the various features, using the TIMIT and King databases. The ACW and PFL1 features are preferred, since they do as well as or better than the LP cepstrum under all test conditions. The corresponding spectra show a clear emphasis of the formants and no spectral tilt.
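
    The link between a pole-zero transfer function and its cepstrum can be sketched directly: for a minimum-phase model H(z) = G \prod_i (1 - q_i z^{-1}) / \prod_i (1 - p_i z^{-1}), the cepstral coefficients for n >= 1 are c_n = (\sum_i p_i^n - \sum_i q_i^n)/n. The snippet below evaluates this standard expansion for arbitrary example poles and zeros; the particular ACW, PFL, and ARMA constructions of the paper are not reproduced.

      # Cepstrum of a minimum-phase pole-zero transfer function from its poles and zeros.
      import numpy as np

      def pole_zero_cepstrum(poles, zeros, n_coeffs=12):
          poles = np.asarray(poles, dtype=complex)
          zeros = np.asarray(zeros, dtype=complex)
          n = np.arange(1, n_coeffs + 1)[:, None]
          c = (np.sum(poles[None, :] ** n, axis=1) - np.sum(zeros[None, :] ** n, axis=1)) / n[:, 0]
          return c.real   # real-valued for conjugate-symmetric pole/zero sets

      # toy usage: two formant-like conjugate pole pairs and one zero pair
      poles = [0.95 * np.exp(1j * 0.3), 0.95 * np.exp(-1j * 0.3),
               0.90 * np.exp(1j * 1.1), 0.90 * np.exp(-1j * 1.1)]
      zeros = [0.50 * np.exp(1j * 0.7), 0.50 * np.exp(-1j * 0.7)]
      print(pole_zero_cepstrum(poles, zeros)[:4])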

  • Improvement of speech spectrogram accuracy by the method of reassignment

    Publication Year: 1998, Page(s): 282 - 287
    Cited by:  Papers (5)  |  Patents (3)

    An improvement of the speech spectrogram based on the method of reassignment is presented. This method consists of moving each point of the spectrogram to a new point that represents the distribution of the energy in the time-frequency window more accurately. Examples of natural speech show an improvement of the energy localization in both the time and frequency domains. This allows a better description of speech features.
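
    For reference, the commonly used reassignment operators move each spectrogram value from (t, \omega) to the local centre of gravity of the signal energy, computed from auxiliary short-time Fourier transforms taken with a time-weighted window t·h(t) and a differentiated window dh/dt (sign conventions depend on the STFT definition, and the paper's exact formulation may differ):

      \hat{t}(t,\omega) = t + \operatorname{Re}\!\left\{ \frac{X_{th}(t,\omega)\,\overline{X_{h}(t,\omega)}}{|X_{h}(t,\omega)|^{2}} \right\},
      \qquad
      \hat{\omega}(t,\omega) = \omega - \operatorname{Im}\!\left\{ \frac{X_{dh}(t,\omega)\,\overline{X_{h}(t,\omega)}}{|X_{h}(t,\omega)|^{2}} \right\}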

  • A comparison of constrained trajectory segment models for large vocabulary speech recognition

    Publication Year: 1998, Page(s): 303 - 306
    Cited by:  Papers (1)

    This paper compares parametric and nonparametric constrained-mean trajectory segment models for large vocabulary speech recognition, extending distribution clustering techniques to handle polynomial mean trajectory models for robust parameter estimation. The parametric model has fewer free parameters and gives recognition performance similar to that of the nonparametric model, but has higher recognition costs.
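
    A generic constrained-mean trajectory segment model of the kind compared here assumes, for a segment of T frames, observations scattered around a polynomial mean trajectory; the order K and the exact parametrization and clustering used in the paper are not specified in the abstract:

      \mathbf{o}_t \sim \mathcal{N}(\boldsymbol{\mu}_t, \boldsymbol{\Sigma}),
      \qquad
      \boldsymbol{\mu}_t = \sum_{k=0}^{K} \mathbf{b}_k \left(\tfrac{t}{T}\right)^{k}, \quad t = 1, \ldots, T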

  • Deleted strategy for MMI-based HMM training

    Publication Year: 1998, Page(s): 299 - 303
    Cited by:  Papers (5)

    We apply the maximum mutual information (MMI) criterion to discriminative training of hidden Markov model (HMM) parameters. In contrast to the conventional MMI training approach, we adopt a cross-validatory strategy in which the parameters are estimated on one part of the training data and assessed on the other parts. For this purpose, we propose the deleted MMI training method, which performs cross-validatory parameter updating while maintaining the converging behavior of the conventional MMI-based algorithm. The proposed method is compared to the conventional MMI approach on the classification of artificial data and on speaker-independent continuous speech recognition, where it shows better performance.
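
    For reference, the MMI criterion maximized in discriminative HMM training has the standard form below, for training utterances O_r with transcriptions w_r and model parameters \lambda; the deleted (cross-validatory) update scheme proposed in the paper operates on top of this objective and is not reproduced here:

      F_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{p_{\lambda}(O_r \mid w_r)\, P(w_r)}{\sum_{w} p_{\lambda}(O_r \mid w)\, P(w)}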

  • Postprocessing method for suppressing musical noise generated by spectral subtraction

    Publication Year: 1998, Page(s): 287 - 292
    Cited by:  Papers (25)  |  Patents (2)

    We investigate whether musical noise, which often exists in speech enhanced using spectral subtraction, can be suppressed. By exploiting specific characteristics of human speech, we propose a method that can effectively suppress musical noise without a noticeable effect on speech intelligibility. Performance assessments confirm that our method is effective.
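
    For context, a generic (over-)subtraction rule of the form below is what produces the isolated residual spectral peaks perceived as musical noise, with Y the noisy spectrum, \hat{N} the noise estimate, \alpha an over-subtraction factor, and \beta a spectral floor; the postprocessing rule proposed in the paper is not specified in the abstract and is not reproduced here:

      |\hat{S}(k,m)|^{2} = \max\!\left( |Y(k,m)|^{2} - \alpha\, |\hat{N}(k)|^{2},\; \beta\, |Y(k,m)|^{2} \right)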


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
