Issue 5 • July 2002
New insights into the stereophonic acoustic echo cancellation problem and an adaptive nonlinearity solution
Page(s): 257 - 267
We extend the understanding of two-channel (stereophonic) acoustic echo cancellation. The major difference between the two-channel and single-channel cases is the problem of nonunique solutions in the two-channel case. Previous work has linked this nonuniqueness problem to the coherence between the two incoming audio channels, and one proven remedy is to distort the signals with a nonlinear device. In this work, we present new theory that gives insight into the links between (i) coherence and level of distortion, and (ii) coherence and the achievable misalignment of the stereophonic echo canceler. Furthermore, we present an adaptive nonlinear device that exploits this theory so that a pre-specified maximum misalignment is maintained while the perceived quality is improved by minimizing the introduced distortion. Moreover, all the ideas presented generalize to the multichannel (>2) case.
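A minimal sketch of the kind of nonlinear preprocessing the abstract refers to: a widely used decorrelating device adds a small, oppositely signed half-wave rectified component to each channel, reducing inter-channel coherence so the adaptive filter's solution becomes unique. The function name and the fixed distortion level alpha are illustrative assumptions; the paper's contribution is precisely an adaptive device that chooses this level online.

```python
def decorrelate(x_left, x_right, alpha=0.2):
    """Half-wave rectifier nonlinearity for stereo echo cancellation.

    Adds alpha times the positive half-wave to the left channel and
    alpha times the negative half-wave to the right channel, so the
    two channels are no longer linearly related. alpha trades off
    decorrelation (lower misalignment) against audible distortion.
    """
    pos = lambda v: 0.5 * (v + abs(v))   # positive half-wave of a sample
    neg = lambda v: 0.5 * (v - abs(v))   # negative half-wave of a sample
    y_left = [v + alpha * pos(v) for v in x_left]
    y_right = [v + alpha * neg(v) for v in x_right]
    return y_left, y_right
```

Because the two nonlinearities act on disjoint half-waves, the added distortion never reinforces itself across channels, which is what keeps it perceptually mild at small alpha.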
Previous work by Merhav and Lee (1993), as well as others, has emphasized that the conditions required to make the maximum a posteriori (MAP) decision rule optimal for speech recognition do not hold in practice, and has proposed techniques based on the adjustment of model parameters to improve recognition. In this article, we consider the problem of text-independent speaker recognition and present a new model called the integral decode. Like previous work in this area, the integral decode attempts to compensate for the lack of conditions necessary to ensure optimality of the MAP decision rule in environments with corrupted observations and imperfect models. The integral decode is a smoothing operation in the feature-space domain: a region of uncertainty is established about each noisy observation and an approximation of the integral over that region is computed. The MAP decision rule is then applied to the smoothed likelihood estimates. In all tested conditions, the integral decode performs as well as or better than equivalent HMMs without it.
A new auditory-based speech processing system based on the biologically rooted property of average localized synchrony detection (ALSD) is proposed. The system detects periodicity in the speech signal at Bark-scaled frequencies while reducing the response's spurious peaks and its sensitivity to implementation mismatches, and hence presents a consistent and robust representation of the formants. The system is evaluated for its ability to extract formants while suppressing spurious peaks, and it is compared with other auditory-based and traditional systems on vowel and consonant recognition, both on clean speech from the TIMIT database and in the presence of noise. The results illustrate the advantage of the ALSD system in extracting the formants and reducing spurious peaks. They also indicate the superiority of the synchrony measures over the mean-rate measure in the presence of noise.
This paper presents and compares algorithms for combined acoustic echo cancellation and noise reduction for hands-free telephones. A structure is proposed consisting of a conventional acoustic echo canceler and a frequency-domain postfilter in the sending path of the hands-free system. The postfilter applies the spectral weighting technique and attenuates both the background noise and the residual echo which remains after imperfect echo cancellation. Two weighting rules for the postfilter are discussed. The first is a conventional one, known from noise reduction, which is extended to attenuate residual echo as well as noise. The second is a psychoacoustically motivated weighting rule. Both rules are evaluated and compared by instrumental and auditory tests. They succeed about equally well in attenuating the noise and the residual echo. In listening tests, however, the psychoacoustically motivated weighting rule is generally preferred, since it leads to more natural near-end speech and to less annoying residual noise.
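The spectral-weighting idea described above can be sketched as a Wiener-like per-bin gain in which the estimated residual echo power is simply treated as additional noise. The function name, the power-spectral-density inputs, and the spectral floor are illustrative assumptions; the paper's two actual weighting rules differ in detail.

```python
def postfilter_gains(input_psd, noise_psd, resid_echo_psd, floor=0.1):
    """Wiener-like spectral weighting for combined noise and echo suppression.

    Each argument is a list of per-frequency-bin power estimates. The gain
    in each bin removes the fraction of power attributed to noise plus
    residual echo, clamped at a spectral floor to limit musical noise.
    """
    gains = []
    for s, n, e in zip(input_psd, noise_psd, resid_echo_psd):
        g = 1.0 - (n + e) / max(s, 1e-12)  # residual echo counted as noise
        gains.append(max(floor, g))        # floor avoids over-suppression
    return gains
```

The floor parameter is one of the knobs a psychoacoustically motivated rule would set differently, e.g. by allowing more residual noise where it is masked.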
A new audio coding scheme using a forward masking model and perceptually weighted vector quantization
Page(s): 325 - 335
This paper presents a new audio coder that includes two techniques to improve the sound quality of the audio coding system. First, a forward masking model is proposed. This model exploits adaptation of the peripheral sensory and neural elements in the auditory system, which is often deemed the cause of forward masking. In the proposed audio coder, forward masking is first modeled by a nonlinear analog circuit, and difference equations for finding the solution of this circuit are then formulated. The parameters of the circuit are derived from several factors, including the time difference between masker and maskee, masker level, masker frequency, and masker duration. Including this model in the coding process removes more redundancy inaudible to humans and thus improves coding efficiency. Second, we propose a new vector quantization technique, whose codebooks are generated by a perceptually weighted binary-tree self-organizing feature map (PW-BTSOFM) algorithm. This technique adopts a perceptually weighted error criterion to train and select codewords so that the quantization error is kept below the just-noticeable distortion (JND) while using the smallest possible codebook, again reducing the required coded bit rate. Objective and subjective sound quality measurements show that the proposed audio coding scheme requires about 30% fewer bits than the MPEG layer III audio coding standard.
This paper presents an online/sequential linear regression adaptation framework for hidden Markov model (HMM) based speech recognition. We aim to sequentially improve a speaker-independent speech recognition system so that it can handle nonstationary environments via linear regression adaptation of the HMMs. A quasi-Bayes linear regression (QBLR) algorithm is developed to perform the sequential adaptation, where the regression matrix is estimated using quasi-Bayes theory. In the estimation, we specify the prior density of the regression matrix as a matrix variate normal distribution and derive a pooled posterior density belonging to the same distribution family, so the optimal regression matrix can be calculated easily. The reproducible prior/posterior pair also provides a meaningful mechanism for sequential learning of the prior statistics: at each sequential epoch, only the updated prior statistics and the currently observed data are required for adaptation. The proposed QBLR is a general framework with maximum likelihood linear regression (MLLR) and maximum a posteriori linear regression (MAPLR) as special cases. Experiments on supervised and unsupervised speaker adaptation demonstrate that sequential adaptation using QBLR is efficient and asymptotically approaches batch learning using MLLR and MAPLR in recognition performance.
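The regression step shared by MLLR, MAPLR, and QBLR transforms each Gaussian mean with a d x (d+1) regression matrix applied to the mean extended by a leading bias term; the frameworks differ in how the matrix is estimated, not in how it is applied. A plain-Python sketch of the transform itself (the function name is illustrative):

```python
def mllr_adapt_mean(W, mu):
    """Apply a linear-regression adaptation matrix to a Gaussian mean.

    W has shape d x (d+1); the mean mu (length d) is extended with a
    leading 1 for the bias column, giving the adapted mean W @ [1, mu].
    """
    xi = [1.0] + list(mu)  # extended mean vector with bias term
    return [sum(w * x for w, x in zip(row, xi)) for row in W]
```

Because one matrix is shared by all Gaussians in a regression class, even a small amount of adaptation data updates every mean in the class at once.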
Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese
Page(s): 303 - 314
With the rapidly growing use of audio and multimedia information over the Internet, technology for retrieving speech information using voice queries is becoming more and more important. In this paper, considering the monosyllabic structure of the Chinese language, a whole class of syllable-based indexing features, including overlapping segments of syllables and syllable pairs separated by a few syllables, is extensively investigated on a Mandarin broadcast news database. The strong discriminating capabilities of such syllable-based features were verified by comparison with word- and character-based features. Approaches for better utilizing these capabilities, including fusion with word- and character-level information and improved methods for obtaining better syllable-based features and query expressions, were also investigated, with very encouraging experimental results.
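The two feature families named in the abstract can be sketched by simple enumeration over a recognized syllable sequence: overlapping syllable segments up to a maximum length, plus syllable pairs with a bounded number of intervening syllables. Function and parameter names are illustrative, not the paper's notation.

```python
def syllable_features(syllables, max_seg=3, max_skip=2):
    """Enumerate syllable-based indexing features.

    Produces (1) all overlapping syllable segments of length 1..max_seg
    and (2) all syllable pairs separated by 1..max_skip intervening
    syllables, each represented as a tuple of syllables.
    """
    feats = []
    n = len(syllables)
    for length in range(1, max_seg + 1):      # overlapping segments
        for i in range(n - length + 1):
            feats.append(tuple(syllables[i:i + length]))
    for skip in range(1, max_skip + 1):       # separated syllable pairs
        for i in range(n - skip - 1):
            feats.append((syllables[i], syllables[i + skip + 1]))
    return feats
```

The separated pairs are what let a query still match when recognition errors or word-segmentation differences corrupt one syllable in between.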
Musical genres are categorical labels created by humans to characterize pieces of music. A musical genre is characterized by the common characteristics shared by its members, typically related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently, musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, it provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into a hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content, and pitch content are proposed. The performance and relative importance of the proposed features are investigated by training statistical pattern recognition classifiers on real-world audio collections. Both whole-file and real-time frame-based classification schemes are described. Using the proposed feature sets, a classification accuracy of 61% for ten musical genres is achieved, comparable to results reported for human musical genre classification.
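One representative timbral-texture feature of the kind the abstract mentions is the spectral centroid, the magnitude-weighted mean frequency of a spectrum frame. A minimal sketch (the paper's feature sets are much richer, combining many such statistics with rhythmic and pitch features):

```python
def spectral_centroid(magnitudes, freqs):
    """Spectral centroid: magnitude-weighted mean of the bin frequencies.

    A brighter, noisier timbre (e.g. distorted guitar) pushes the
    centroid higher; a darker timbre pulls it lower, which is why it is
    useful as a timbral-texture feature for genre classification.
    """
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: define the centroid as 0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total
```

In practice such frame-level values are summarized over a texture window (e.g. by mean and variance) before being fed to the classifier.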
Aims & Scope
Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.
This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.