
Speech Communication: Proceedings of the 10th ITG Symposium

Date: 26-28 Sept. 2012


Displaying Results 1 - 25 of 71
  • Sparse, Hierarchical and Semi-Supervised Base Learning for Monaural Enhancement of Conversational Speech

    Page(s): 1 - 4

    We address the learning of noise bases in a monaural speaker-independent speech enhancement framework based on non-negative matrix factorization. Bases are estimated from training data in batch processing by means of hierarchical and non-hierarchical sparse coding, or determined during the speech enhancement process based on the divergence between the observed noisy speech signal and the speech base. In extensive test runs on the Buckeye corpus of highly spontaneous speech and the CHiME corpus of non-stationary real-life noise, we observe that semi-supervised learning of noise bases leads to overall best results, while a priori learning of noise bases is useful to speed up computation.
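
    The semi-supervised scheme described above, where pre-trained speech bases stay fixed while noise bases are learned from the noisy observation, can be sketched with standard KL-divergence multiplicative updates. This is an illustrative NumPy sketch, not the authors' implementation; all function and parameter names are assumptions.

```python
import numpy as np

def semi_supervised_nmf(V, W_speech, n_noise_bases=8, n_iter=100, eps=1e-10):
    """Estimate noise bases from a noisy magnitude spectrogram V while
    keeping pre-trained speech bases W_speech fixed, using standard
    KL-divergence multiplicative updates."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    ks = W_speech.shape[1]                    # number of (fixed) speech bases
    W = np.hstack([W_speech, rng.random((F, n_noise_bases)) + eps])
    H = rng.random((W.shape[1], T)) + eps     # activations of all bases
    for _ in range(n_iter):
        R = W @ H + eps
        H *= (W.T @ (V / R)) / (W.T @ np.ones_like(V) + eps)
        R = W @ H + eps
        # update only the noise part of the dictionary
        W[:, ks:] *= ((V / R) @ H[ks:].T) / (np.ones_like(V) @ H[ks:].T + eps)
    # Wiener-like mask built from the speech part of the reconstruction
    S_hat = (W[:, :ks] @ H[:ks]) / (W @ H + eps) * V
    return S_hat, W[:, ks:], H
```

    Because the mask is a ratio of non-negative reconstructions, the enhanced spectrogram is bounded by the noisy one.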

  • On the Speech Distortion Weighted Multichannel Wiener Filter for Diffuse Noise

    Page(s): 1 - 4

    In this paper the multichannel Wiener filter is analyzed under the constraint of a diffuse noise field. An equivalent filter function is derived that partly omits the computationally demanding and error-prone matrix inversion, yielding a reduction of the computational complexity of the algorithm. The equivalence of the two filter functions is shown both by a theoretical analysis and by simulation results. The simulations were performed with real noise and speech data recorded in a car, where a diffuse noise field can be assumed.
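
    For orientation, the textbook speech-distortion-weighted MWF per frequency bin, together with the ideal diffuse-field coherence model assumed above, can be sketched as follows. This is the generic formulation, not the equivalent inversion-free filter derived by the authors.

```python
import numpy as np

def diffuse_noise_coherence(freqs, d, c=343.0):
    """Spatial coherence of an ideal diffuse noise field between two
    microphones spaced d metres apart; np.sinc(x) = sin(pi*x)/(pi*x)."""
    return np.sinc(2.0 * freqs * d / c)

def sdw_mwf(phi_ss, phi_nn, mu=1.0, ref=0):
    """Speech-distortion-weighted MWF for one frequency bin:
    w = (Phi_ss + mu * Phi_nn)^{-1} Phi_ss e_ref."""
    e = np.zeros(phi_ss.shape[0])
    e[ref] = 1.0
    return np.linalg.solve(phi_ss + mu * phi_nn, phi_ss @ e)
```

    The trade-off parameter mu weights speech distortion against noise reduction; mu = 1 gives the classical MWF.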

  • Energy Efficiency of Network-Based Acoustic Echo Control in Mobile Radio

    Page(s): 1 - 4

    In this contribution we investigate the energy efficiency of network-based acoustic echo control (AEC). Usually, AEC is implemented in the mobile terminal. Due to the limited resources of the mobile terminal in terms of battery, computational capacity, and memory, the complexity of the AEC algorithm has to be limited. In this paper we investigate the idea of moving the AEC processing from the mobile terminal to a network-based processing unit, which allows the use of more sophisticated algorithms. As this might require the use of improved speech codecs with higher bit rate, we evaluate the trade-off between the increased power for radio transmission at the additional bit rate and the reduced power for signal processing at the mobile terminal. This implies taking into consideration the attenuation of the radio channel according to, e.g., the Okumura-Hata path loss model.

  • From Acoustic Nonlinearity to Adaptive Nonlinear System Identification

    Page(s): 1 - 4

    Audio and acoustic signal processing predominantly relies on the theory of linear systems. The existence of nonlinearity in real systems has nonetheless long been recognized, and numerous criteria are available to assess the degree of nonlinearity. Explicit nonlinear signal processing with tangible structure and relevant performance, however, still appears rarely. In this paper, we focus on the memoryless Hammerstein type of nonlinearity, which is useful to characterize applications of speech and audio reproduction such as, e.g., hands-free systems with loudspeaker distortion. The goal of the paper is to point to some recent research on adaptive nonlinear system identification and to put it into perspective with established concepts and criteria in the nonlinear domain.

  • Low-Order Volterra Long-Term Predictors

    Page(s): 1 - 4

    Models based on linear prediction have been used for several decades in different areas of speech signal processing. While the linear approach has led to great advances in the last 40 years, it neglects nonlinearities present in the speech production mechanism. This paper compares the results of long-term nonlinear prediction based on second-order and third-order Volterra filters. Additional improvement can be obtained using fractional-delay long-term prediction. Experimental results reveal that the proposed method outperforms linear long-term prediction techniques in terms of prediction gain.
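
    A much-simplified version of such a nonlinear long-term predictor, using only the diagonal (power) terms of the Volterra kernel at a single integer pitch lag, could look like this. The paper's full Volterra filters include cross-terms and fractional delays; this sketch and its names are illustrative only.

```python
import numpy as np

def volterra_ltp(x, lag, order=2):
    """Least-squares long-term predictor using powers of the delayed
    sample x[n-lag] up to `order` (diagonal Volterra terms only)."""
    X = np.column_stack([x[:-lag] ** p for p in range(1, order + 1)])
    y = x[lag:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coeffs
    gain_db = 10.0 * np.log10(np.sum(y ** 2) / np.sum(residual ** 2))
    return coeffs, gain_db
```

    The prediction gain in dB compares signal power against residual power, the same figure of merit the abstract refers to.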

  • Reference Microphone Selection for MWF-based Noise Reduction Using Distributed Microphone Arrays

    Page(s): 1 - 4

    Using an acoustic sensor network consisting of spatially distributed microphones, a significant noise reduction can be achieved with the centralized multi-channel Wiener filter (MWF), which aims to estimate the desired speech component in one of the microphones, referred to as the reference microphone. However, since the distributed microphones are typically placed at different locations, the selection of the reference microphone has a significant impact on the performance of the MWF, largely depending on the position of the desired source with respect to the microphones. In this paper, different optimal and suboptimal reference selection procedures are presented, both broadband and frequency-dependent. Experimental results show that the proposed procedures yield better performance than an arbitrarily selected reference microphone.

  • Improved Gain Estimation for Codebook-Based Speech Enhancement

    Page(s): 1 - 4

    Codebook-based speech enhancement approaches have been proven to be able to reduce highly non-stationary noise. The key to this ability is the separation of the speech and noise spectra into gain-normalized prototype spectra, which are stored in codebooks, and gain factors, which are estimated online. Since the gain factors are estimated on a short frame basis, even rapid changes of the signal levels can be tracked accurately. There is no closed-form solution for optimal speech and noise gain factor estimation, and therefore approximate solutions are required. In this paper, we use Newton's method for estimating the speech and noise gain factors. Moreover, it is shown how information from previous estimates or information from separate noise estimators can be utilized to improve estimation performance.
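
    As a toy illustration of such an iterative gain estimate, a scalar Newton recursion for a single gain factor under the Itakura-Saito divergence is sketched below. The cost function is a hypothetical stand-in; the paper's joint speech/noise gain estimation is more involved.

```python
import numpy as np

def newton_gain(y_psd, proto_psd, g0=1.0, n_iter=10):
    """Newton iteration for a single gain g minimizing the Itakura-Saito
    divergence sum(r/g - log(r/g) - 1) with r = y_psd / proto_psd."""
    r = y_psd / proto_psd
    g = g0
    for _ in range(n_iter):
        d1 = np.sum(-r / g ** 2 + 1.0 / g)            # first derivative
        d2 = np.sum(2.0 * r / g ** 3 - 1.0 / g ** 2)  # second derivative
        g -= d1 / d2
    return g
```

    For this particular cost the stationary point is the mean of r, which makes the recursion easy to sanity-check.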

  • The Generalized Frequency-Domain Adaptive Filtering Algorithm Implemented on a GPU for Large-Scale Multichannel Acoustic Echo Cancellation

    Page(s): 1 - 4

    For multichannel reproduction systems, a trend towards increasing numbers of reproduction channels can be observed. For the use of such systems in communication scenarios, acoustic echo cancellation (AEC) is necessary, whose computational complexity scales at least linearly with the number of reproduction channels and thus implies a tremendous computational effort. Powerful graphics processing units (GPUs) appear to be a suitable implementation platform for such a scenario. However, inexpensive consumer-market GPUs are not designed for low-latency real-time adaptive audio processing. In this paper we present an implementation of the Generalized Frequency-Domain Adaptive Filtering (GFDAF) algorithm on popular GPUs. Details on the parallelization of the algorithm are given, and the performance of the implementation is evaluated for single-precision and double-precision computations.

  • Comparison and Signal-Component-Wise Instrumental Evaluation of MMSE Log-Spectral Amplitude Estimation Under Speech Presence Uncertainty

    Page(s): 1 - 4

    This paper presents an overview of MMSE log-spectral amplitude (LSA) estimators under speech presence uncertainty (SPU): the nonlinear MMSE LSA estimator, the multiplicatively modified MMSE LSA estimator, and the optimally modified MMSE LSA estimator. It turns out that, due to the nonlinear nature of the nonlinear MMSE LSA estimator, the instrumental evaluation of its speech and noise signal components needs to be carried out by a black-box approach. For a fair comparison, all other estimators were evaluated by that black-box approach as well. This is not typical in speech enhancement, although there are further estimators that are not a linear function of the noisy speech signal (amplitude); advantageously, the black-box approach allows a signal-component-wise instrumental evaluation of these as well. It is worthwhile to mention that the nonlinear MMSE LSA estimator was believed not to achieve substantial improvements compared to the MMSE LSA estimator without SPU estimation. This, however, could not be confirmed by our simulations, which are supported by a signal-component-wise subjective and instrumental evaluation.

  • Survey of Speech Enhancement Supported by a Bone Conduction Microphone

    Page(s): 1 - 4

    This paper gives an overview of speech enhancement algorithms using a bone-conducted (BC) microphone. Unlike conventional air-conducted (AC) microphones picking up particle movement in the air, the transmission channel of the BC microphone is related only to the human body. Therefore, voice communication with the BC microphone is not affected by ambient noise. However, since the high-frequency components of the BC microphone signal are attenuated significantly due to transmission loss, additional signal processing techniques are needed to provide comfortable communication. This paper summarizes the ideas behind BC-based speech enhancement algorithms and provides some pros and cons of the approaches, which might be helpful for determining the direction of further research in the field of BC microphone-based speech enhancement.

  • Quality Analysis and Optimization of the MAP-based Noise Power Spectral Density Tracker

    Page(s): 1 - 4

    It has lately been shown that noise tracking and speech denoising can be improved by a postprocessor based on a maximum a-posteriori-based (MAP-B) noise power spectral density (PSD) estimation algorithm. In the current contribution we investigate the MAP-B estimator by carrying out a quality analysis comprising the following three steps: first, we analyse the estimator with respect to unbiasedness and consistency; second, the tracking ability in non-stationary noise is investigated; and finally, the sensitivity of the MAP-B noise tracker with respect to estimation errors in the preprocessing stage is considered. The findings are used to develop an optimized MAP-B postprocessor. The performance comparison with the original MAP-B tracker indeed reveals improved performance at high signal-to-noise ratios.

  • On Iterative Exchange of Soft State Information in Two-Channel Automatic Speech Recognition

    Page(s): 1 - 4

    The robustness of automatic speech recognition systems can be improved by exploiting further information sources such as additional acoustic channels or modalities. Since the arising problem of information fusion exhibits striking parallels to problems in digital communications, where the turbo principle [1] was a groundbreaking innovation, Shivappa et al. showed that a similar iterative scheme can be applied to multimodal speech recognition [2]. We provide new interpretations and propose significant modifications of their approach: First, we show that no modification of the forward-backward recognition algorithm is required; second, we dispense with their proposed heuristic model; third, we deliver our own interpretation and formulation of the extrinsic information passed between the recognizers. Our proposed method is successfully applied to a synthetic unimodal two-channel speech recognition task.

  • Combining Different Recognition Schemes by Analyzing the Noise Condition

    Page(s): 1 - 4

    When comparing the recognition of noisy versus clean speech, the degradation of human performance is still considerably lower than the corresponding deterioration of automatic recognition systems. It can be observed that the degradation of the recognition rate depends on the applied recognition technique and the specific noise condition. We present an approach to select the appropriate recognition scheme by estimating the noise scenario for each speech input. Two different recognition schemes are applied: one is based on the extraction of robust features, whereas the other contains an adaptation of HMMs (Hidden Markov Models). In the case of extracting robust features, we investigate the usage of multi-condition HMMs that have been trained on noisy speech signals. We verify that the process of selecting the appropriate scheme and the appropriate set of HMMs can be applied such that the lowest error rate is achieved for each acoustic condition.

  • Audio-Visual Speech Recognition for Uncertain Acoustical Observations

    Page(s): 1 - 4

    Speech recognition is still a challenging problem. To address this issue, so-called uncertainty-of-observation techniques can be used, either for audio-only or for audiovisual speech recognition. There are many established uncertainty-of-observation strategies; among them, two of the computationally least expensive are uncertainty decoding and modified imputation. In contrast to these two standard approaches, improvements are possible by using a new technique which combines model-based speech estimation with dynamic variance compensation and carries little computational overhead. This new approach - significance decoding - has previously been applied only in unimodal speech recognition. In this paper, it is applied to coupled-HMM-based audio-visual speech recognition and is shown to clearly outperform the two standard approaches of modified imputation and uncertainty decoding in handling acoustic uncertainty.

  • Image Transformation Based Features for the Visual Discrimination of Prominent and Non-Prominent Words

    Page(s): 1 - 4

    This paper investigates how visual information extracted from a speaker's mouth region can be used to discriminate prominent from non-prominent words. The analysis relies on a database where users interacted with a computer in a small game in a Wizard of Oz experiment. Users were instructed to correct recognition errors of the system; this was expected to render the corrected word highly prominent. Audio-visual recordings were made with a distant microphone and without visual markers. As acoustic features, relative energy and fundamental frequency were calculated. From the visual channel, image-transformation-based features were extracted from the mouth region. As image transformations, FFT, DCT, and PCA with a varying number of coefficients are compared in this paper, and the performance of the visual features by themselves or in combination with the acoustic features is investigated. The comparison is based on classification with a Support Vector Machine (SVM). The results show that all three image transformations yield a performance of approx. 65% in this binary classification task. Furthermore, the information extracted from the visual channel is complementary to the acoustic information: the combination of both modalities significantly improves performance up to approx. 80%.
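
    As an example of one of the image transformations compared above, a 2-D DCT feature extractor over a mouth-region patch can be sketched as follows. The simple top-left coefficient selection is an illustrative assumption; the paper does not specify the exact coefficient ordering.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II transform matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def mouth_dct_features(patch, n_coeffs=16):
    """2-D DCT of a grayscale mouth-region patch, keeping a low-frequency
    top-left block of coefficients as the feature vector."""
    h, w = patch.shape
    D = dct_matrix(h) @ patch @ dct_matrix(w).T
    k = int(np.ceil(np.sqrt(n_coeffs)))
    return D[:k, :k].ravel()[:n_coeffs]
```

    The resulting low-dimensional vector is what would be fed, alone or concatenated with the acoustic features, into the SVM.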

  • Fully Automatic Audiovisual Emotion Recognition: Voice, Words, and the Face

    Page(s): 1 - 4

    The recognition of human emotions from spontaneous and non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This contribution presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. We propose a fully automatic multimodal recognition approach based on context-sensitive modeling of audio and video features. Evaluations on the Audiovisual Sub-Challenge of the 2011 Audio/Visual Emotion Challenge show how accurately different affective dimensions can be recognized. Our experiments reveal that the proposed multimodal recognition system outperforms previously introduced techniques evaluated on the same task.

  • 2D Audio-Visual Localization in Home Environments using a Particle Filter

    Page(s): 1 - 4

    Multimodal algorithms benefit from the fact that they can mutually compensate the weaknesses of the individual modalities. We therefore propose a system to localize concurrent speakers in a two-dimensional (2D) space using a combined audio-visual localization algorithm. The acoustic source localization is calculated by the multichannel cross-correlation coefficient (MCCC) algorithm, and the visual localization is accomplished by the SHORE (Sophisticated High-speed Object Recognition Engine, a trademark of Fraunhofer IIS, 91058 Erlangen, Germany) video localization system. The multimodal fusion is performed by a particle filter with adaptations to the particle weighting. An evaluation of the proposed algorithm in a home-environment living lab is performed, focusing on possible gains obtained from the complementary localization modalities.
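
    A minimal bootstrap particle filter step for this kind of audio-visual fusion, combining the two modality likelihoods multiplicatively, could be sketched as below. The random-walk motion model, the noise level, and the resampling threshold are assumptions for this sketch, not details from the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, audio_lh, video_lh, rng):
    """One predict/update cycle of a 2D bootstrap particle filter fusing
    audio and video localization likelihoods multiplicatively."""
    # predict: random-walk motion model (an assumption for this sketch)
    particles = particles + rng.normal(0.0, 0.05, particles.shape)
    # update: combine the two modality likelihoods
    weights = weights * audio_lh(particles) * video_lh(particles)
    weights = weights / weights.sum()
    # multinomial resampling when the effective sample size collapses
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```

    The speaker position estimate is then the weighted mean of the particle cloud.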

  • A Beamformer Post-Filter with Hybrid Noise Coherence Functions Instrumentally Optimized Using a Figure of Merit

    Page(s): 1 - 4

    Multichannel speech enhancement employing a microphone array beamformer with a post-filter allows for noise attenuation along with a good preservation of the useful speech component. However, the noise coherence function plays a critical role within the post-filter estimation. Current state-of-the-art approaches for post-filter estimation utilize the a priori knowledge of a diffuse noise field in an automobile noise environment. However, the automobile noise environment varies with different driving conditions. In this paper, we focus on optimizing the post-filter by using a hybrid coherence function, which is a mixture of the diffuse and the measured noise coherence function for a specific driving condition. The idea is that the driving condition information can be taken from the controller area network (CAN)-bus data in modern cars in real time, selecting a stored hybrid coherence function. The optimization of parameters is carried out in a data-driven approach, using a figure of merit constructed from three independent instrumental quality measures. The optimized post-filter with the individually optimized hybrid coherence functions for each driving condition shows an improved performance compared to just relying on a diffuse noise coherence assumption throughout.
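
    The hybrid coherence idea can be sketched as a convex mixture of the model-based diffuse coherence and a measured coherence, selected per driving condition. The mixing parameter and the CAN-bus lookup scheme below are illustrative assumptions.

```python
import numpy as np

def diffuse_coherence(freqs, d, c=343.0):
    """Ideal diffuse-field coherence between mics spaced d metres apart."""
    return np.sinc(2.0 * freqs * d / c)

def hybrid_coherence(freqs, d, measured, alpha):
    """Convex mixture of the diffuse model and a measured coherence;
    alpha would be tuned per driving condition via the figure of merit."""
    return alpha * diffuse_coherence(freqs, d) + (1.0 - alpha) * measured

def select_coherence(speed_kmh, table):
    """Pick the stored coherence whose speed key is closest to the current
    CAN-bus speed reading (a hypothetical lookup scheme)."""
    return table[min(table, key=lambda s: abs(s - speed_kmh))]
```

    With alpha = 1 the post-filter falls back to the pure diffuse assumption the paper uses as its baseline.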

  • Suppression of Engine Noise Harmonics Using Cascaded LMS Filters

    Page(s): 1 - 4

    The background noise recorded by microphones in a car environment is mainly caused by the engine, airflow and the tires. The engine noise can be suppressed by adaptive filtering if an engine speed reference is available. This reference information is usually available from the data stream of the vehicle bus. This work presents a new filter structure for the suppression of engine harmonics. The filter structure consists of cascaded time-domain least mean squares (LMS) filters. Known LMS filter structures use a common error signal to update the filter weights for all harmonics. However, the power of the engine noise is typically dominated by the low order harmonics. Therefore, we propose a filter structure where the engine noise harmonics are processed in succession. Cascading the LMS filters improves the filter adaptation of the higher order harmonics.
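
    The cascade structure, where each harmonic's LMS canceller adapts on the residual of the previous stage, could be sketched with sin/cos references derived from the engine speed. Filter lengths and step sizes here are illustrative; the paper's exact configuration is not given.

```python
import numpy as np

def cascaded_harmonic_lms(x, f0, n_harmonics, fs, mu=0.01):
    """Cascade of two-tap (sin/cos reference) LMS cancellers, one per
    engine harmonic, each adapting on the previous stage's residual."""
    n = np.arange(len(x))
    e = x.astype(float).copy()
    for k in range(1, n_harmonics + 1):
        refs = np.stack([np.cos(2 * np.pi * k * f0 * n / fs),
                         np.sin(2 * np.pi * k * f0 * n / fs)], axis=1)
        w = np.zeros(2)
        out = np.empty_like(e)
        for i in range(len(e)):
            y = w @ refs[i]            # current harmonic estimate
            out[i] = e[i] - y          # residual after this stage
            w += 2.0 * mu * out[i] * refs[i]
        e = out                        # next stage sees this residual
    return e
```

    Because stage k+1 adapts against a signal from which harmonic k has already been removed, the low-order harmonics no longer dominate the error driving the higher-order stages.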

  • Suppression of Instationary Distortions in Automotive Environments

    Page(s): 1 - 4

    Most speech enhancement systems contain some sort of background noise suppression as a basic building block, and many different methods have been proposed for this purpose. However, many approaches yield reliable results only for stationary or slowly varying background noise and fail on highly instationary noise components. This contribution proposes a method for detecting instationary noise components when multiple microphones are available. The detection algorithm outputs an estimate of the instationary noise power spectral density itself as well as a short-term and a long-term decision about the presence of instationary noise components. It is further shown how these detectors can be used to enhance the performance of subsequent signal enhancement algorithms such as noise suppression and microphone combination.

  • A Measurement Methodology for Automotive Teleconferencing

    Page(s): 1 - 4

    A novel measurement methodology for in-car communication and automotive teleconferencing systems is presented. The procedure consists of several discrete signal acquisition steps and is based on the idea of separating any microphone signal into its three basic additive components: speech, noise, and acoustic system echo. In a simulation stage, previously recorded signal components are mixed with current microphone signals in real time to obtain virtual microphone signals. These are presented to the system under test, which may even be a black box and of non-linear characteristic. Arbitrary combinations of different noise conditions, dialog situations, and systems can be tested. Instrumental measurements are supported as well as third-person listening tests.

  • On the Plausibility of Personalized Acoustics in Automobiles

    Page(s): 1 - 4

    With the emergence of algorithms able to automate or semi-automate the acoustical tuning process of sound systems in automobiles, it is nowadays possible to objectify the otherwise almost exclusively subjectively handled topic of acoustical sound tuning. With a fully determined process, investigations such as the plausibility of personalized acoustics in automotive environments, as proposed in this paper, become feasible. Taking a certain demo car as test object, special tuning sets were generated for several single seat positions as well as for different zones within its interior, all based on the same RIR measurements done once at the beginning, and the question of whether it is plausible to personalize the acoustics in automobile environments was followed up. By utilizing two different kinds of measurements, one coping with the timbre and the other covering the localization, an attempt was made to answer this question in an impartial manner.

  • Towards Automatic Intoxication Detection from Speech in Real-Life Acoustic Environments

    Page(s): 1 - 4

    In-car intoxication detection from speech is a highly promising non-intrusive method to reduce the accident risk associated with drunk driving. However, in-car noise significantly influences the recognition performance and needs to be addressed in practical applications. In this paper, we investigate how seriously the intrinsic in-car noise and background music affect the accuracy of intoxication recognition. In extensive test runs using the official speech corpus of the INTERSPEECH 2011 Intoxication Challenge, realistic car noise, and original popular music, we conclude that stationary driving noise as well as music introduce a significant performance degradation when acoustic models are trained on clean speech only, which can partly be alleviated by multi-condition training. Besides, exploiting cumulative evidence over time by late decision fusion appears to be a promising way to further enhance performance in noisy conditions.

  • Voice Activity Detection within the Nearfield of an Array of Distributed Microphones

    Page(s): 1 - 4

    In this paper a voice activity detector (VAD) for sources within the near-field of an array of distributed microphones is presented. We develop this detector with exact knowledge of neither the desired source nor the microphone arrangement. The potential of the presented method is evaluated on a setup with belt- and classical hands-free microphones and is compared to a well-known detection method based on the generalized cross-correlation.
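
    The generalized cross-correlation baseline mentioned above, in its common PHAT weighting, can be sketched as follows. This is the standard textbook formulation, not the specific configuration used in the paper.

```python
import numpy as np

def gcc_phat(x1, x2, fs, eps=1e-12):
    """GCC-PHAT time-delay estimate; positive tau means x2 lags x1."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    G = X2 * np.conj(X1)
    G = G / (np.abs(G) + eps)          # PHAT weighting: keep phase only
    cc = np.fft.irfft(G, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

    The PHAT normalization discards magnitude information and keeps only phase, which sharpens the correlation peak in reverberant rooms.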

  • Automotive Hands-free Telephony Speech Quality Evaluation - A Subjective Testing Approach

    Page(s): 1 - 4

    We propose and introduce a testing method for subjective speech quality evaluation of hands-free telephony systems in car environments. In evaluating car hands-free speech quality, it is desirable to have a procedure available which is repeatable and reproducible, significant and convincing, yet fast and especially resource-effective. The proposed method comprises a series of subjective test cases in different scenarios and driving conditions. The goal is to gain insight into possible problems and weaknesses of systems under test and to reliably judge the overall quality during the product development and product release process. We describe the approach, the conditions and set-ups we have chosen, but also the limitations and restrictions we face. Finally, first results from real-world driving tests are presented and discussed.
