IEEE Transactions on Speech and Audio Processing

Issue 1 • January 1997

  • Circulant and elliptic feedback delay networks for artificial reverberation

    Page(s): 51 - 63

    The feedback delay network (FDN) has been proposed for digital reverberation. The digital waveguide network (DWN) has also been proposed, with similar advantages. This paper notes that the commonly used FDN with an N×N orthogonal feedback matrix is isomorphic to a normalized digital waveguide network consisting of one scattering junction joining N reflectively terminated branches. Generalizations of FDNs and DWNs are discussed. The general case of a lossless FDN feedback matrix is shown to be any matrix having unit-modulus eigenvalues and linearly independent eigenvectors. A special class of FDNs using circulant matrices is proposed; these structures can be implemented efficiently and allow control of the time and frequency behavior. Applications of circulant feedback delay networks in audio signal processing are discussed.
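The lossless-matrix condition stated in the abstract is easy to check numerically. The sketch below is my own illustration (the phases are arbitrary, not from the paper): circulant matrices are diagonalized by the DFT, whose columns are orthonormal, so placing the eigenvalues on the unit circle yields a real, energy-preserving feedback matrix.

```python
import numpy as np

# Build a real circulant feedback matrix with unit-modulus eigenvalues.
N = 4
phases = np.array([0.0, 0.7, np.pi, -0.7])   # conjugate-symmetric -> real matrix
eigs = np.exp(1j * phases)                   # unit-modulus eigenvalues
c = np.fft.ifft(eigs).real                   # first column of the circulant matrix
A = np.array([[c[(i - j) % N] for j in range(N)] for i in range(N)])

x = np.arange(1.0, N + 1)                    # any delay-line state vector
print(np.allclose(A @ A.T, np.eye(N)))       # True: orthogonal, hence lossless
print(np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x)))  # True
```

Because the eigenvectors are fixed by the DFT, the designer is free to shape the time and frequency behavior purely through the eigenvalue phases, which is the efficiency the abstract alludes to.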

  • Isolated Mandarin syllable recognition with limited training data specially considering the effect of tones

    Page(s): 75 - 80

    A set of new approaches is proposed for modeling Mandarin syllables for accurate recognition with limited training data, with special attention to the effect of tones. The approaches include improved initial values and state-transition topologies, and they make use of the durational cue. The results show that these approaches are very useful in practice.

  • Noise adaptation of HMM speech recognition systems using tied-mixtures in the spectral domain

    Page(s): 72 - 74

    We compare two different approaches to the problem of additive noise in a hidden Markov model (HMM) filterbank-based speech recognition system: (i) preprocessing by estimation and (ii) adaptation of the HMM output probability distributions. The adaptation method, previously formulated only for static spectral features, is generalized in this paper to the time derivative of the spectrum. Estimation and adaptation are formulated with a common statistical model (MIXMAX) and are compared using the same recognition system. We find that under low and medium signal-to-noise ratio (SNR) conditions, parameter adaptation is superior to preprocessing by estimation.
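The MIXMAX model mentioned above rests on a standard log-domain approximation: the log filterbank energy of noisy speech is close to the elementwise maximum of the speech and noise log energies, with error at most log 2. A quick numerical illustration (the values are my own, not from the paper):

```python
import numpy as np

# MIXMAX approximation: log(e^s + e^n) ~ max(s, n); worst case is s == n.
s = np.array([4.0, -1.0, 0.5, 2.0])   # speech log filterbank energies (toy)
n = np.array([0.0, 1.0, 0.4, 2.0])    # noise log filterbank energies (toy)
exact = np.log(np.exp(s) + np.exp(n))
approx = np.maximum(s, n)
err = exact - approx                  # equals log(1 + exp(-|s - n|))
print(np.all((err >= 0) & (err <= np.log(2) + 1e-12)))   # True
```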

  • Automatic word recognition based on second-order hidden Markov models

    Page(s): 22 - 25

    We propose an extension of the Viterbi algorithm that makes second-order hidden Markov models computationally efficient. A comparative study between first-order (HMM1s) and second-order (HMM2s) hidden Markov models is carried out. Experimental results show that HMM2s provide better state-occupancy modeling and, on their own, perform comparably to HMM1s combined with postprocessing.
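A common way to make second-order decoding tractable, sketched below, is to run Viterbi over state pairs: delta[i, j] holds the best log-score of a path ending with states (i, j), so the second-order transition P(k | i, j) becomes an ordinary first-order update. All model parameters here are toy assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 6
a2 = rng.random((N, N, N)); a2 /= a2.sum(axis=2, keepdims=True)  # P(k | i, j)
a1 = rng.random((N, N));    a1 /= a1.sum(axis=1, keepdims=True)  # P(j | i)
pi = np.full(N, 1.0 / N)                                         # initial P(i)
logb = np.log(rng.random((T, N)))            # toy log P(obs_t | state)

# Initialize over the first two observations, then update pair states.
delta = np.empty((N, N))
for i in range(N):
    for j in range(N):
        delta[i, j] = (np.log(pi[i]) + logb[0, i]
                       + np.log(a1[i, j]) + logb[1, j])
for t in range(2, T):
    new = np.empty((N, N))
    for j in range(N):
        for k in range(N):
            new[j, k] = np.max(delta[:, j] + np.log(a2[:, j, k])) + logb[t, k]
    delta = new
best = np.max(delta)    # log-probability of the best state sequence
```

The cost is O(T·N³) instead of the naive exponential search, which is the kind of efficiency gain the abstract refers to.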

  • Noncausal all-pole modeling of voiced speech

    Page(s): 1 - 10

    This paper introduces noncausal all-pole models that efficiently capture both the magnitude and phase information of voiced speech. It is shown that noncausal all-pole filter models are better able to match both magnitude and phase information and are particularly appropriate for voiced speech because of the nature of the glottal excitation. By modeling speech in the frequency domain, the standard difficulties that arise when using noncausal all-pole filters are avoided. Several algorithms for determining the model parameters based on frequency-domain information and the masking effects of the ear are described. Our work suggests that high-quality voiced speech can be produced using a 14th-order noncausal all-pole model.

  • Linear prediction of the one-sided autocorrelation sequence for noisy speech recognition

    Page(s): 80 - 84

    The article presents a robust representation of speech based on AR modeling of the causal part of the autocorrelation sequence. In noisy speech recognition, this new representation achieves better results than several other related techniques.
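The underlying idea can be sketched as follows (the function names, model order, and toy signal are my assumptions, not the paper's): fit the AR model to the causal, one-sided part of the frame's autocorrelation rather than to the frame itself.

```python
import numpy as np

def autocorr(x, maxlag):
    """Biased autocorrelation estimate r[0..maxlag]."""
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(maxlag + 1)])

def ar_coeffs(r, p):
    """Yule-Walker equations: solve R a = r[1..p] for the AR(p) predictor."""
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

# Toy voiced-like frame: an AR(2) process driven by white noise.
rng = np.random.default_rng(0)
e = rng.standard_normal(300)
x = np.zeros(300)
for t in range(2, 300):
    x[t] = 1.5 * x[t - 1] - 0.7 * x[t - 2] + e[t]
frame = x[44:]                                   # drop the transient

one_sided = autocorr(frame, 64)                  # causal part of the autocorrelation
a = ar_coeffs(autocorr(one_sided, 10), 10)       # AR model of that sequence
```

The intuition for noise robustness is that additive white noise perturbs mainly the zero-lag term of the autocorrelation, so a model fitted to the autocorrelation sequence degrades more gracefully than one fitted to the waveform.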

  • Global frequency modulation laws extraction from the Gabor transform of a signal: a first study of the interacting components case

    Page(s): 64 - 71

    Under certain asymptotic conditions, modulated trajectories in the time-frequency plane can be extracted directly from the coefficients of the Gabor (1946) transform (or wavelet transform) of a signal. For simple modulated signals, it has been shown that these trajectories allow the frequency-modulation (FM) laws of the analyzed signals to be characterized accurately. For more strongly modulated signals (for instance, musical sounds), interactions between the trajectories may occur, and complex time-frequency structures appear. These interactions and the resulting time-frequency diagrams are described mathematically and qualitatively. In particular, connections with the phase diagrams of the transform and with acoustical interaction phenomena (such as beats) are discussed.

  • Noise compensation methods for hidden Markov model speech recognition in adverse environments

    Page(s): 11 - 21

    Several noise compensation schemes for speech recognition in impulsive and nonimpulsive noise are considered. The schemes are spectral subtraction, HMM-based Wiener (1949) filters, noise-adaptive HMMs, and front-end impulsive noise removal. The use of the cepstral-time matrix as an improved speech feature set is explored, and the noise compensation methods are extended for use with cepstral-time features. Experimental evaluations on a spoken digit database, in the presence of car noise, helicopter noise, and impulsive noise, demonstrate that the noise compensation methods achieve substantial improvements in recognition across a wide range of signal-to-noise ratios. The results also show that the cepstral-time matrix is more robust than a vector of identical size composed of a combination of cepstral and differential cepstral features.
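Of the schemes listed, spectral subtraction is the most self-contained to illustrate. The toy version below is my own minimal sketch, not the authors' implementation; in practice the noise magnitude is estimated from speech-free frames, whereas here an independent noise realization stands in for that estimate.

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, floor=0.02):
    """Subtract a noise magnitude estimate, keep the noisy phase."""
    spec = np.fft.rfft(noisy)
    mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(noisy))

rng = np.random.default_rng(0)
t = np.arange(512)
clean = np.sin(2 * np.pi * 25 * t / 512)         # tone on an exact FFT bin
noisy = clean + 0.3 * rng.standard_normal(512)
noise_mag = np.abs(np.fft.rfft(0.3 * rng.standard_normal(512)))
enhanced = spectral_subtract(noisy, noise_mag)
improved = np.linalg.norm(enhanced - clean) < np.linalg.norm(noisy - clean)
print(improved)
```

The spectral floor prevents negative magnitudes; tuning it trades residual noise against the "musical noise" artifacts that motivate the more elaborate compensation schemes in the paper.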

  • A closed-form location estimator for use with room environment microphone arrays

    Page(s): 45 - 50

    The linear intersection (LI) estimator, a closed-form method for localizing source positions from sensor-array time-delay estimates, is presented. The LI estimator is shown to be robust and accurate, to closely match the search-based maximum-likelihood (ML) estimator, and to outperform a benchmark algorithm. The computational complexity of the LI estimator makes it suitable for real-time microphone-array applications where search-based localization algorithms may be infeasible.

  • A fast algorithm for finding the adaptive component weighted cepstrum for speaker recognition

    Page(s): 84 - 86

    In speaker recognition systems, the adaptive component weighted (ACW) cepstrum has been shown to be more robust than the conventional linear predictive (LP) cepstrum. The ACW cepstrum is derived from a pole-zero transfer function whose denominator is the pth-order LP polynomial A(z). The numerator is a (p-1)th-order polynomial that has until now been found as follows: the roots of A(z) are computed, and the residues obtained by a partial fraction expansion of 1/A(z) are set to unity, so the numerator is the sum of all the (p-1)th-order cofactors of A(z). We show that the numerator polynomial is simply the derivative of the denominator polynomial A(z). This greatly speeds up the computation of the numerator coefficients, since it involves only a simple scaling of the denominator coefficients; root finding is eliminated entirely. Since the denominator is guaranteed to be minimum phase and the numerator can be proven to be minimum phase, two separate recursions involving the polynomial coefficients establish the ACW cepstrum. The new method reduces computation time significantly and imposes negligible overhead compared with finding the LP cepstrum.
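The paper's key identity, that the sum of the (p-1)th-order cofactors equals the derivative A'(z), can be checked numerically. The example polynomial below is my own illustration, not from the paper.

```python
import numpy as np

# Identity check: sum over roots r_i of prod_{j != i} (z - r_j) equals
# d/dz A(z), so the ACW numerator needs only a coefficient scaling.
a = np.array([1.0, -1.2, 0.8, -0.3])          # A(z), descending powers (toy)
roots = np.roots(a)
num_via_roots = np.zeros(len(a) - 1, dtype=complex)
for i in range(len(roots)):
    num_via_roots += np.poly(np.delete(roots, i))   # one (p-1)th-order cofactor
num_via_derivative = np.polyder(a)            # simple scaling of coefficients
print(np.allclose(num_via_roots.real, num_via_derivative))  # True
```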

  • Stochastic trajectory modeling and sentence searching for continuous speech recognition

    Page(s): 33 - 44

    The paper first points out a defect in hidden Markov modeling (HMM) of continuous speech, referred to as the trajectory folding phenomenon. A new approach to modeling phoneme-based speech units is then proposed, which represents the acoustic observations of a phoneme as clusters of trajectories in a parameter space. The trajectories are modeled by a mixture of probability density functions of a random sequence of states, where each state is associated with a multivariate Gaussian density function optimized at the state-sequence level. Conditional trajectory duration probability is integrated into the modeling. An efficient sentence-search procedure based on trajectory modeling is also formulated. Experiments on a speaker-dependent, 2010-word continuous speech recognition task with a word-pair perplexity of 50, using vocabulary-independent acoustic training and monophone models trained with 80 sentences per speaker, yielded a word error rate of about 1%. The new models were compared experimentally with continuous-density mixture HMMs (CDHMMs) on the same recognition task and gave significantly smaller word error rates. These results suggest that the stochastic trajectory model provides a more in-depth modeling of continuous speech signals.

  • Phoneme-based vector quantization in a discrete HMM speech recognizer

    Page(s): 26 - 32

    The quantization distortion of vector quantization (VQ) is a key factor affecting the performance of a discrete hidden Markov model (DHMM) system. Many researchers have recognized this problem and have used integrated features or multiple codebooks in their systems to offset the disadvantages of conventional VQ; however, the computational complexity of those systems is then increased. Investigations have shown that the speech signal space consists of finite clusters that represent phoneme data sets from male and female speakers and follow Gaussian distributions. We propose an alternative VQ method in which each phoneme is treated as a cluster in the speech space and a Gaussian model is estimated for it. A Gaussian mixture model (GMM) is generated by the expectation-maximization (EM) algorithm for the whole speech space and used as a codebook in which each code word is a Gaussian model representing a certain cluster. An input utterance is classified as a certain phoneme or set of phonemes only when that phoneme or those phonemes give the highest likelihood. A typical discrete HMM system was used for both phoneme and isolated-word recognition. The results show that phoneme-based Gaussian-model vector quantization classifies the speech space more effectively, and significant improvements in the performance of the DHMM system were achieved.
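The codebook idea can be sketched as follows (the means and variance below are illustrative assumptions, not trained parameters): each code word is a Gaussian, and a feature vector is quantized to the index of the Gaussian with the highest log-likelihood rather than to the nearest centroid.

```python
import numpy as np

means = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # toy phoneme clusters
var = 0.5                                                 # shared spherical variance

def quantize(x, means, var):
    # log N(x; mu, var * I) up to an additive constant shared by all code words
    loglik = -np.sum((x - means) ** 2, axis=1) / (2 * var)
    return int(np.argmax(loglik))

print(quantize(np.array([2.8, 3.2]), means, var))   # 1: nearest Gaussian cluster
```

With per-cluster (or full) covariances the decision regions become likelihood-weighted rather than purely distance-based, which is how the soft cluster structure enters the discrete HMM front end.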


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.


This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.
