
IEEE Transactions on Speech and Audio Processing

Issue 1 • Jan. 1994


Displaying Results 1 - 25 of 25
  • Estimation of noise-corrupted speech DFT-spectrum using the pitch period

    Publication Year: 1994 , Page(s): 1 - 8
    Cited by:  Papers (1)  |  Patents (1)

    This paper describes a method for utilizing the quasi-periodicity of speech in a minimum mean-square error (MMSE) estimation of the discrete Fourier transform (DFT) log-amplitude, either for speech enhancement or for noise-robust speech recognition. The estimator takes into account the periodicity by conditioning the estimate of voiced speech on the distance between the frequency of any given DFT coefficient and the nearest harmonic: if the DFT coefficient lies in the vicinity of a harmonic, the a priori probability distribution (PD) of its amplitude centers around higher values than if it lies halfway between two harmonics. Thus, knowing the pitch narrows down the a priori PD, improving the estimate. The DFT estimator is combined with a mixture model for the broadband spectral PD, so that correlations between distant frequencies are partially taken into account. The algorithm has been tested with computer-room noise using an MSE criterion for the spectral envelope, defined by Mel-scale filter-bank log energies, and in recognition experiments. The incorporation of correlations in the broadband spectrum improves recognition accuracy significantly; the periodicity conditioning reduces the MSE for voiced speech, but recognition accuracy is not improved because the overwhelming majority of errors occur in unvoiced speech.
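    The conditioning idea can be sketched as follows. This is a toy illustration, not the paper's MMSE estimator; the Gaussian-shaped emphasis and the 20 Hz bandwidth are assumptions chosen for the example:

```python
import numpy as np

def harmonic_distance(bin_freqs, f0):
    """Distance (Hz) from each DFT bin frequency to the nearest pitch harmonic."""
    k = np.round(np.asarray(bin_freqs, dtype=float) / f0)  # nearest harmonic index
    return np.abs(bin_freqs - k * f0)

def prior_emphasis(bin_freqs, f0, bandwidth_hz=20.0):
    """Toy prior weight: near 1 at a harmonic, near 0 halfway between harmonics."""
    d = harmonic_distance(bin_freqs, f0)
    return np.exp(-0.5 * (d / bandwidth_hz) ** 2)
```

    A bin at an exact harmonic of a 125 Hz pitch gets weight near 1, while a bin halfway between two harmonics gets weight near 0, mirroring how knowledge of the pitch narrows the a priori distribution.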

  • A breakpoint analysis procedure based on temporal decomposition

    Publication Year: 1994 , Page(s): 9 - 17
    Cited by:  Papers (4)

    Temporal decomposition (TD), an analysis procedure based on a linear model of the effects of coarticulation, yields a linear approximation of a time sequence of speech parameters in terms of a series of time-overlapping interpolation functions and an associated series of data vectors. The number and positions of these interpolation functions show a high correspondence with phonetic events present in the speech signal. A new, more robust interpolation scheme for TD is described that gives it a geometric interpretation as a breakpoint analysis procedure in a multidimensional parameter space, where breakpoints are connected by straight line segments. The interpolation scheme can be viewed as a generalization of segmentation, allowing for a gradual transition from one segment towards the next.

  • Encoder reverberations in adaptive predictive coding

    Publication Year: 1994 , Page(s): 18 - 23

    In the past, the nonlinear effects of quantization in differential speech coders either were modeled as additive white Gaussian noise or were ignored in analysis and quantified experimentally. Describing functions are used to model the nonlinear quantizer effects of coarsely quantized difference signals found primarily in adaptive predictive coding (APC) systems. The analysis predicts marginal instabilities in APC encoders. The marginal instabilities result in ringing distortions in the reconstructed signal which, in the past, may have been attributed to pitch and spectral predictor mismatches. Adaptive order prediction is introduced, analyzed, and offered as a method for increasing the robustness of APC encoders.

  • A new tandem source-channel trellis coding scheme

    Publication Year: 1994 , Page(s): 24 - 28
    Cited by:  Papers (1)

    The author presents a new tandem source-channel coding scheme consisting of a trellis source coder and a trellis-coded modulation (TCM) channel coder. The motivation for the use of TCM, instead of conventional channel coding schemes such as convolutional codes, arises from the fact that by using TCM, improvement in the signal-to-quantization-noise ratio (SQNR) is achieved without bandwidth expansion. Criteria for the choice of TCM codes are discussed, and TCM schemes suitable for this application are presented. Simulation results for both a Gauss-Markov source and speech samples indicate that the present scheme results in considerable improvement not only over uncoded transmission but also in comparison with joint source-channel trellis coding.

  • Selection of excitation vectors for the CELP coders

    Publication Year: 1994 , Page(s): 29 - 41
    Cited by:  Papers (1)  |  Patents (3)

    The authors investigate several algorithms that construct the input for the synthesis filter in the CELP coder, present them under the same formalism, and compare their performance. They model the excitation vector as a linear combination of K signals, issued from K codebooks and multiplied by K associated gains. They demonstrate that this generalized form incorporates several particular coders, such as code-excited linear predictive coders, multipulse coders, and self-excited vocoders. The least-squares minimization problem is presented afterwards. In the case of orthogonal codebooks, they show that the optimal solution of this least-squares problem is equivalent to orthogonal transform coding. They use the Karhunen-Loeve transform to design the corresponding orthogonal codebooks. In the case of nonorthogonal codebooks, they are restricted to suboptimal iterative algorithms for index selection and gain computation. They present some new algorithms based on orthogonalization procedures and QR factorizations that attempt to reduce this suboptimality. In a particular case, when the excitation is modeled using one gain coefficient (for example, ternary excitation or concatenation of short codebook vectors), an iterative angle-minimization algorithm is proposed for index selection. The different extraction algorithms are compared with regard to the resulting coder complexity and synthetic speech quality. They find a particularly attractive method that consists of modeling the excitation with one unique gain.
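    The generalized excitation model (K codebook vectors scaled by K jointly optimized gains) can be illustrated with a brute-force least-squares search. This is a sketch under assumed toy sizes, not the authors' coder:

```python
import numpy as np

rng = np.random.default_rng(0)
L, K = 40, 2
codebooks = [rng.standard_normal((8, L)) for _ in range(K)]  # 8 entries per codebook
target = rng.standard_normal(L)          # perceptually weighted target excitation

best_err, best_choice = np.inf, None
for i, ci in enumerate(codebooks[0]):
    for j, cj in enumerate(codebooks[1]):
        A = np.stack([ci, cj], axis=1)                   # L x K basis of candidates
        g, *_ = np.linalg.lstsq(A, target, rcond=None)   # jointly optimal gains
        err = float(np.sum((target - A @ g) ** 2))
        if err < best_err:
            best_err, best_choice = err, (i, j, g)
```

    The exhaustive joint search grows exponentially with K, which is why suboptimal iterative index-selection and orthogonalization schemes like those studied in the paper become attractive.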

  • Interpolation of the pitch-predictor parameters in analysis-by-synthesis speech coders

    Publication Year: 1994 , Page(s): 42 - 54
    Cited by:  Papers (12)  |  Patents (23)

    The pitch predictor contributes greatly to the efficiency of current analysis-by-synthesis speech coders by mapping the past reconstructed signal into the present. However, for good performance, its parameters must be updated often (once every 2.5-7.5 ms). A slower update rate of the pitch-predictor delay results in time misalignment between the original signal and the pitch-predictor contribution to the reconstructed signal. The authors introduce a new procedure that allows a slow update rate of the pitch-predictor parameters without this problem. In this method, the original signal is modified in a closed-loop fashion such that the parameter values obtained by interpolation of open-loop estimates form the optimal encoding of the modified signal. This new paradigm is a generalization of the familiar analysis-by-synthesis principle and can be used for interpolation of both the pitch-predictor delay and gain. The authors compare, by means of a subjective test, speech signals encoded with different versions of the code-excited linear prediction (CELP) coder. The comparison shows that a pitch predictor exploiting the present interpolation strategy, with an update rate of 50 Hz, provides subjective speech quality similar to a conventional pitch predictor whose parameters are updated every pitch cycle.

  • Nonlinear noise filtering and beamforming using the perceptron and its Volterra approximation

    Publication Year: 1994 , Page(s): 55 - 62
    Cited by:  Papers (9)

    The multilayer perceptron, an artificial neural network, is applied to the problem of interference reduction in single- and multiple-sensor systems. The filter is able to operate approximately as a linear tapped delay line if nonlinear processing cannot further reduce the mean-squared error of the output. Replacing the activation function of the perceptron with a polynomial leads to the finite-order Volterra filter, for which optimum weights can be calculated. Preliminary examples using the perceptron in single-sensor noise filtering show output signal-to-noise ratio (SNR) improvements of up to 2.2 dB compared to the optimum linear filter. Experiments with a nonlinear two-microphone beamformer show a 2.7 dB SNR enhancement for a sinusoidal target and an off-axis white noise jammer. For speech inputs under anechoic conditions, the Volterra beamformer achieved an average intelligibility improvement of 5.7%.
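    The structure of a finite-order Volterra filter can be sketched as a window-wise polynomial expansion (the kernel values below are placeholders, not trained weights); zeroing the quadratic kernel recovers a plain linear tapped delay line:

```python
import numpy as np

def volterra2(x, h1, h2):
    """Second-order Volterra filter over a sliding window of length len(h1):
    a linear FIR term plus quadratic cross-products of the window samples."""
    N = len(h1)
    y = np.empty(len(x) - N + 1)
    for n in range(len(y)):
        w = x[n:n + N]
        y[n] = h1 @ w + w @ h2 @ w   # h2 holds the quadratic kernel
    return y

x = np.arange(6.0)
linear_only = volterra2(x, np.array([1.0, 0.0, 0.0]), np.zeros((3, 3)))
```

    With the quadratic kernel set to zero, the output is exactly the linear FIR response, which is the fallback behavior the paper describes when nonlinear processing cannot reduce the error further.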

  • Speech recognition using speaker adaptation by system parameter transformation

    Publication Year: 1994 , Page(s): 63 - 68
    Cited by:  Papers (5)

    Presents a speaker adaptation scheme that transforms the prototype speaker's hidden Markov word models into those of a new speaker. Transformations are applied to both the state transition matrix and the probability distribution functions of a hidden Markov word model. These transformations are optimized by maximizing the joint probability of a set of input pronunciations of the new speaker. Details of these parameter transformations and experimental verification are presented. The test uses a 210-word vocabulary, each word having a four-state hidden Markov word model. The test speakers are three males and two females, one male heavily accented. By retraining the system with up to four minutes of adaptation speech, a subset of the 210-word vocabulary, recognition accuracy improves from 22.5% to 92.1%.

  • Speech recognition using weighted HMM and subspace projection approaches

    Publication Year: 1994 , Page(s): 69 - 79
    Cited by:  Papers (19)  |  Patents (4)

    A weighted hidden Markov model (HMM) algorithm and a subspace projection algorithm are proposed to address the discrimination and robustness issues for HMM-based speech recognition. A robust two-stage classifier is also proposed to incorporate these two approaches to further improve the performance. The weighted HMM enhances its discrimination power by first jointly considering the state likelihoods of different word models, then assigning a weight to the likelihood of each state, according to its contribution in discriminating words. The robustness of this model is then improved by increasing the likelihood difference between the top and the second candidates. The subspace projection approach discards unreliable observations on the basis of maximizing the divergence between different word pairs. To improve robustness, the mean of each cluster is then adjusted to obtain maximum separation between different clusters. The performance was evaluated with a highly confusable vocabulary consisting of the nine English E-set words. The test was conducted in a multispeaker (100 talkers), isolated-word mode. The 61.7% word accuracy for the original HMM-based system was improved to 74.9% and 76.6%, respectively, by using the weighted HMM and the subspace projection methods. By incorporating the weighted HMM in the first stage and the subspace projection in the second stage, the two-stage classifier achieved a word accuracy of 79.4%.
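    The state-weighting idea can be sketched as a weighted sum of per-state log-likelihoods. The numbers below are made-up values for illustration; in the paper the weights are learned from each state's contribution to discriminating words:

```python
import numpy as np

def weighted_log_likelihood(state_loglikes, weights):
    """Score a word model by weighting each state's log-likelihood."""
    return float(np.dot(weights, state_loglikes))

# Toy example: two 3-state word models scored on the same utterance
ll_word_a = np.array([-1.0, -2.0, -0.5])
ll_word_b = np.array([-1.2, -1.0, -3.0])
w = np.array([0.2, 1.5, 0.3])   # hypothetical: state 2 is most discriminative
score_a = weighted_log_likelihood(ll_word_a, w)
score_b = weighted_log_likelihood(ll_word_b, w)
```

    Under equal weights word A wins, but emphasizing the discriminative middle state flips the decision to word B, which is the kind of re-ranking the weighted HMM aims for.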

  • Waveform-based speech recognition using hidden filter models: parameter selection and sensitivity to power normalization

    Publication Year: 1994 , Page(s): 80 - 89
    Cited by:  Papers (21)

    The authors describe a novel approach to speech recognition by directly modeling the statistical characteristics of the speech waveforms. This approach allows them to remove the need for speech preprocessors, which conventionally serve to convert speech waveforms into frame-based speech data subject to a subsequent modeling process. Central to their method is the representation of the speech waveforms as the output of a time-varying filter excited by a Gaussian source time-varying in its power. In order to formulate a speech recognition algorithm based on this representation, the time variation in the characteristics of the filter and of the excitation source is described in a compact and parametric form of the Markov chain. They analyze in detail the comparative roles played by the filter modeling and by the source modeling in speech recognition performance. Based on the result of the analysis, they propose and evaluate a normalization procedure intended to remove the sensitivity of speech recognition accuracy to often uncontrollable speech power variations. The effectiveness of the proposed speech-waveform modeling approach is demonstrated in a speaker-dependent, discrete-utterance speech recognition task involving 18 highly confusable stop consonant-vowel syllables. The high accuracy obtained shows the promising potential of the proposed time-domain waveform modeling technique for speech recognition.

  • A coupled approach to ADPCM adaptation

    Publication Year: 1994 , Page(s): 90 - 93
    Cited by:  Papers (3)

    The algorithms for adaptive quantization and adaptive prediction in existing adaptive differential pulse code modulation (ADPCM) are distinct and decoupled. The designs of the two algorithms are based on the assumption that they act independently whereas, due to the feedback configuration, they must interact in some way. The authors present a preliminary investigation of a new design approach where they perform joint adaptation based on a common cost function, considering the overall system. A “backward” adaptive algorithm is presented and applied to a simple example to demonstrate the feasibility of the general approach.

  • Speaker adaptation via VQ prototype modification

    Publication Year: 1994 , Page(s): 94 - 97
    Cited by:  Papers (1)  |  Patents (2)

    A statistical technique for vector quantizer (VQ) prototype adaptation, based on tied-mixture continuous-parameter HMMs, is derived and evaluated on the basis of experimental evidence. Performance on difficult adaptation tasks indicates that VQ-prototype adaptation via tied-mixture HMMs constitutes a useful mechanism for speaker adaptation, particularly when there are substantial channel differences or when there is a large mismatch between reference and target speaker characteristics.

  • A projection-based likelihood measure for speech recognition in noise

    Publication Year: 1994 , Page(s): 97 - 102
    Cited by:  Papers (13)  |  Patents (1)

    Investigates a projection-based likelihood measure that significantly improves automatic speech recognition performance in the presence of additive broadband noise. The measure was developed by modifying likelihood scores in continuous Gaussian density hidden Markov models (HMMs), resulting in the weighted projection measure (WPM). Experimental results using the proposed measure are reported for several performance factors: different cepstral-based parameters, normal and multistyle speech, and various noise signals, including white, jittering white, and broadband colored noise. In all cases, significant improvements in speaker-dependent, isolated word recognition were achieved using the WPM instead of the standard Gaussian likelihood measure (weighted Euclidean distance (WED)). As an example, at an SNR of 5 dB, the WPM resulted in improvement in recognition accuracy from 19.4 to 80.6% compared with the standard WED for the DFT mel-cepstral representation.

  • Stochastic modeling of temporal information in speech for hidden Markov models

    Publication Year: 1994 , Page(s): 102 - 104
    Cited by:  Papers (2)

    A Markov chain, namely, the temporal Markov model, is used to model the time-ordering information of the feature vectors of a spoken word. An empirical method is suggested to combine the temporal Markov model (TMM) with the hidden Markov model (HMM) for word recognition. Experiments on speaker-independent isolated English alphabet recognition showed that this method is effective in terms of improved recognition.

  • Introduction to the special issue on neural networks for speech processing

    Publication Year: 1994 , Page(s): 113 - 114

    The goal of this special issue is to present a representative set of current research papers that address the topic of neural networks in speech processing. The application areas addressed include speech analysis, synthesis, recognition and understanding. Due to the large volume of current research in these areas, the authors make the usual disclaimer and apologize for any work that may not be represented in the limited space allocated here. However, the included papers provide a fairly broad overview of each area, as well as citations to the related literature.

  • Techniques for estimating vocal-tract shapes from the speech signal

    Publication Year: 1994 , Page(s): 133 - 150
    Cited by:  Papers (36)  |  Patents (10)

    This paper reviews methods for mapping from the acoustical properties of a speech signal to the geometry of the vocal tract that generated the signal. Such mapping techniques are studied for their potential application in speech synthesis, coding, and recognition. Mathematically, the estimation of the vocal tract shape from its output speech is a so-called inverse problem, where the direct problem is the synthesis of speech from a given time-varying geometry of the vocal tract and glottis. Different mappings are discussed: mapping via articulatory codebooks, mapping by nonlinear regression, mapping by basis functions, and mapping by neural networks. Besides being nonlinear, the acoustic-to-geometry mapping is also nonunique, i.e., more than one tract geometry might produce the same speech spectrum. The authors show how this nonuniqueness can be alleviated by imposing continuity constraints.
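    Mapping via an articulatory codebook, the simplest of the reviewed approaches, can be sketched as a nearest-neighbour lookup with a continuity penalty. The data, sizes, and the penalty weight alpha below are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Pair each stored acoustic feature vector with the vocal-tract area vector
# that produced it, then invert by nearest-neighbour search; a continuity
# penalty softens the nonuniqueness of the acoustic-to-geometry map.
rng = np.random.default_rng(1)
acoustic_cb = rng.standard_normal((50, 12))   # e.g. cepstral features
area_cb = rng.random((50, 8))                 # matching area functions

def invert(frame, prev_area=None, alpha=0.5):
    d = np.sum((acoustic_cb - frame) ** 2, axis=1)
    if prev_area is not None:                 # continuity constraint
        d = d + alpha * np.sum((area_cb - prev_area) ** 2, axis=1)
    return area_cb[np.argmin(d)]
```

    Without the penalty, two very different tract shapes with similar spectra are equally good matches; the penalty prefers the one closest to the previous frame's estimate.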

  • Auditory models and human performance in tasks related to speech coding and speech recognition

    Publication Year: 1994 , Page(s): 115 - 132
    Cited by:  Papers (55)  |  Patents (4)

    Auditory models that are capable of achieving human performance in tasks related to speech perception would provide a basis for realizing effective speech processing systems. Saving bits in speech coders, for example, relies on a perceptual tolerance to acoustic deviations from the original speech. Perceptual invariance to adverse signal conditions (noise, microphone and channel distortions, room reverberations) and to phonemic variability (due to nonuniqueness of articulatory gestures) may provide a basis for robust speech recognition. A state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level is described. Speech information is extracted from the simulated auditory nerve firings, and used in place of the conventional input to several speech coding and recognition systems. The performance of these systems improves as a result of this replacement, but is still short of achieving human performance. The shortcomings occur, in particular, in tasks related to low bit-rate coding and to speech recognition. Since schemes for low bit-rate coding rely on signal manipulations that spread over durations of several tens of ms, and since schemes for speech recognition rely on phonemic/articulatory information that extends over similar time intervals, it is concluded that the shortcomings are due mainly to perceptually related rules over durations of 50-100 ms. These observations suggest a need for a study aimed at understanding how auditory nerve activity is integrated over time intervals of that duration. The author discusses preliminary experimental results that confirm human usage of such integration, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.

  • A hybrid segmental neural net/hidden Markov model system for continuous speech recognition

    Publication Year: 1994 , Page(s): 151 - 160
    Cited by:  Papers (25)  |  Patents (1)

    The current state-of-the-art in large-vocabulary, continuous speech recognition is based on the use of hidden Markov models (HMM). In an attempt to improve over HMM performance, the authors developed a hybrid system that combines the advantages of neural networks and HMM using a multiple hypothesis (or N-best) paradigm. The connectionist component of the system, the segmental neural net (SNN), models all the frames of a phonetic segment simultaneously, thus overcoming the well-known conditional-independence limitation of the HMM. They describe the hybrid system and discuss various aspects of SNN modeling, including network architectures, training algorithms and context modeling. Finally, they evaluate the hybrid system by performing several speaker-independent experiments with the DARPA Resource Management (RM) corpus, and demonstrate that the hybrid system shows a consistent improvement in performance over the baseline HMM system.

  • Connectionist probability estimators in HMM speech recognition

    Publication Year: 1994 , Page(s): 161 - 174
    Cited by:  Papers (59)  |  Patents (2)

    The authors are concerned with integrating connectionist networks into a hidden Markov model (HMM) speech recognition system. This is achieved through a statistical interpretation of connectionist networks as probability estimators. They review the basis of HMM speech recognition and point out the possible benefits of incorporating connectionist networks. Issues necessary to the construction of a connectionist HMM recognition system are discussed, including choice of connectionist probability estimator. They describe the performance of such a system using a multilayer perceptron probability estimator evaluated on the speaker-independent DARPA Resource Management database. In conclusion, they show that a connectionist component improves a state-of-the-art HMM system.

  • Maximum mutual information neural networks for hybrid connectionist-HMM speech recognition systems

    Publication Year: 1994 , Page(s): 175 - 184
    Cited by:  Papers (17)

    This paper proposes a novel approach for a hybrid connectionist-hidden Markov model (HMM) speech recognition system based on the use of a neural network as vector quantizer. The neural network is trained with a new learning algorithm offering the following innovations. (1) It is an unsupervised learning algorithm for perceptron-like neural networks that are usually trained in the supervised mode. (2) Information theory principles are used as learning criteria, making the network especially suitable for combination with an HMM-based speech recognition system. (3) The neural network is not trained using the standard error-backpropagation algorithm but using instead a newly developed self-organizing learning approach. The use of the hybrid system with the neural vector quantizer results in a 25% error reduction compared with the same HMM system using a standard k-means vector quantizer. The training algorithm can be further refined by using a combination of unsupervised and supervised learning algorithms. Finally, it is demonstrated how the new learning approach can be applied to multiple-feature hybrid speech recognition systems, using a joint information theory-based optimization procedure for the multiple neural codebooks, resulting in a 30% error reduction.

  • Multilayer perceptrons as labelers for hidden Markov models

    Publication Year: 1994 , Page(s): 185 - 193
    Cited by:  Papers (10)

    A novel combination of multilayer perceptrons (MLPs) and hidden Markov models (HMMs) is presented. Instead of using MLPs as probability generators for HMMs, the authors propose to use MLPs as labelers for discrete-parameter HMMs. Compared with the probabilistic interpretation of MLPs, this gives them the advantage of flexibility in system design (e.g., the use of word models instead of phonetic models while using the same MLPs). Moreover, since they do not need to reach a global minimum, they can use MLPs with fewer hidden nodes, which can be trained faster. In addition, they do not need to retrain the MLPs with segmentations generated by a Viterbi alignment. Compared with Euclidean labeling, their method has the advantages of needing fewer HMM parameters per state and obtaining a higher recognition accuracy. Several improvements of the baseline MLP labeling are investigated. When using one MLP, the best results are obtained when giving the labels a fuzzy interpretation. It is also possible to use parallel MLPs where each is based on a different parameter set (e.g., basic parameters, their time derivatives, and their second-order time derivatives). This strategy increases the recognition results considerably. A final improvement is the training of MLPs for subphoneme classification.

  • Speaker recognition using neural networks and conventional classifiers

    Publication Year: 1994 , Page(s): 194 - 205
    Cited by:  Papers (57)  |  Patents (9)

    An evaluation of various classifiers for text-independent speaker recognition is presented. In addition, a new classifier is examined for this application. The new classifier is called the modified neural tree network (MNTN). The MNTN is a hierarchical classifier that combines the properties of decision trees and feedforward neural networks. The MNTN differs from the standard NTN in both the new learning rule used and the pruning criteria. The MNTN is evaluated for several speaker recognition experiments. These include closed- and open-set speaker identification and speaker verification. The database used is a subset of the TIMIT database consisting of 38 speakers from the same dialect region. The MNTN is compared with nearest neighbor classifiers, full-search and tree-structured vector quantization (VQ) classifiers, multilayer perceptrons (MLPs), and decision trees. For closed-set speaker identification experiments, the full-search VQ classifier and MNTN demonstrate comparable performance. Both methods perform significantly better than the other classifiers for this task. The MNTN and full-search VQ classifiers are also compared for several speaker verification and open-set speaker-identification experiments. The MNTN is found to perform better than full-search VQ classifiers for both of these applications. In addition to matching or exceeding the performance of the VQ classifier for these applications, the MNTN also provides a logarithmic saving for retrieval.

  • An N-best candidates-based discriminative training for speech recognition applications

    Publication Year: 1994 , Page(s): 206 - 216
    Cited by:  Papers (26)  |  Patents (8)

    The authors propose an N-best candidates-based discriminative training procedure for constructing high-performance HMM speech recognizers. The algorithm has two distinct features: N-best hypotheses are used for training discriminative models, and a new frame-level loss function is minimized to improve the separation between correct and incorrect hypotheses. The N-best candidates are decoded with the authors' previously proposed tree-trellis fast search algorithm. The new frame-level loss function, defined as a halfwave-rectified log-likelihood difference between the correct and competing hypotheses, is minimized over all training tokens; the minimization is carried out by adjusting the HMM parameters along a gradient-descent direction. Two speech recognition applications were tested: speaker-independent, small-vocabulary (ten Mandarin Chinese digits) continuous speech recognition, and speaker-trained, large-vocabulary (5000 commonly used Chinese words) isolated word recognition. Significant performance improvement over traditionally maximum-likelihood-trained HMMs was obtained. In the connected Chinese digit recognition experiment, the string error rate is reduced from 17.0 to 10.8% for unknown-length decoding and from 8.2 to 5.2% for known-length decoding. In the large-vocabulary isolated word recognition experiment, the recognition error rate is reduced from 7.2 to 3.8%. Additionally, the authors found that using more relaxed decoding constraints in preparing the N-best hypotheses yields better recognition results.
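    The frame-level loss described in the abstract, a halfwave-rectified log-likelihood difference between the correct and a competing hypothesis, can be sketched as follows. The function name and the per-frame log-likelihood representation are illustrative assumptions, not the authors' code.

```python
# Sketch of a halfwave-rectified frame-level discriminative loss:
# for each frame, penalize only the positive part of
# (competing log-likelihood - correct log-likelihood), then sum.
# Names and the list-of-floats representation are assumptions.

def frame_level_loss(loglik_correct, loglik_competing):
    """Sum over frames of max(0, competing - correct).

    The loss is zero on frames where the correct hypothesis already
    scores higher, so gradient-descent training concentrates on frames
    where the separation between hypotheses is inadequate.
    """
    total = 0.0
    for lc, li in zip(loglik_correct, loglik_competing):
        diff = li - lc           # positive when the competitor wins
        total += max(0.0, diff)  # halfwave rectification
    return total
```

    Minimizing this quantity over all training tokens pushes the correct hypothesis's per-frame likelihood above that of each competing N-best hypothesis.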

  • Combining TDNN and HMM in a hybrid system for improved continuous-speech recognition

    Publication Year: 1994 , Page(s): 217 - 223
    Cited by:  Papers (4)

    The paper presents a hybrid continuous-speech recognition system that improves results on the speaker-dependent DARPA Resource Management task. This hybrid system, called the combined system, is based on a combination of normalized neural network output scores with hidden Markov model (HMM) emission probabilities. The neural network is trained under a mean-square-error criterion, and the HMM is trained under maximum likelihood estimation. In theory, whichever criterion is used, the same word error rate should be reached if enough training data is available. As this is never the case, combining two different criteria, each extracting complementary characteristics of the features, is attractive. A state-of-the-art HMM system is combined with a time-delay neural network (TDNN) integrated in a Viterbi framework. A hierarchical TDNN structure is described that splits training into subtasks corresponding to subsets of phonemes; this structure makes training TDNNs on large-vocabulary tasks manageable on workstations. Despite the low accuracy of the hierarchical TDNN, the combined system achieves a word error rate reduction of 15% with respect to the authors' state-of-the-art HMM system, and this reduction is obtained with only a 10% increase in the number of parameters.
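    One common way to combine a normalized network score with an HMM emission probability inside a Viterbi framework is a weighted sum in the log domain. The sketch below uses that log-linear form purely for illustration; the weighting scheme, function name, and parameter are assumptions, and the paper's exact combination rule may differ.

```python
import math

# Illustrative log-linear combination of an HMM emission probability
# with a normalized neural-network output score, as might be used when
# rescoring states in a Viterbi search. The interpolation weight `lam`
# is a hypothetical tuning parameter, not a value from the paper.

def combined_log_score(hmm_emission_prob: float,
                       nn_score: float,
                       lam: float = 0.5) -> float:
    """Weighted sum of log emission probability and log NN score."""
    return ((1.0 - lam) * math.log(hmm_emission_prob)
            + lam * math.log(nn_score))
```

    With `lam = 0`, the search reduces to the plain HMM system; with `lam = 1`, it relies entirely on the network, so the weight controls how much each criterion contributes to the combined decision.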

  • An experiment in spoken language acquisition

    Publication Year: 1994 , Page(s): 224 - 240
    Cited by:  Papers (6)  |  Patents (15)

    The paper continues the authors' investigation of machines that adaptively acquire language through interaction with a complex environment. In particular, the present work focuses on the problem of spoken word acquisition, using the authors' proposed principles to motivate a method that governs the emergence of word symbols from the speech signal. The mechanism involves a connectionist network embedded in a feedback control system. The resulting system has two unique characteristics. First, no text is utilized by the device, in contrast to all other speech understanding systems. Second, the vocabulary and grammar are unconstrained, being acquired by the device during the course of performing its task; this, too, contrasts with all other systems, in which the salient vocabulary and grammar are preprogrammed. A rudimentary baseline experiment is described, involving 1105 natural-language utterances in an automated call-routing application scenario.


Aims & Scope

Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Full Aims & Scope