2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '09)

Date: 18-21 Oct. 2009

Displaying Results 1 - 25 of 91
  • Acoustic echo cancellation based on independent component analysis and integrated residual echo enhancement

    Page(s): 205 - 208

    This paper examines the technique of using a memoryless noise-suppressing nonlinearity in the adaptive filter error feedback loop of an acoustic echo canceler (AEC) based on normalized least-mean square (NLMS) adaptation when there is additive noise at the near end. It is shown that introducing the nonlinearity to "enhance" the filter estimation error is well founded in the information-theoretic sense and has a deep connection to independent component analysis (ICA). The paradigm of AEC as a problem that can be approached by ICA leads to new algorithmic possibilities beyond the conventional LMS family of techniques. In particular, the right combination of the error enhancement procedure and a properly implemented regularization procedure enables the AEC to be performed recursively and continuously in the frequency domain in the presence of both ambient noise and double-talk, even without a double-talk detection (DTD) or voice activity detection (VAD) procedure.

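
A minimal numpy sketch of the underlying mechanism (not the paper's frequency-domain, ICA-derived algorithm): a time-domain NLMS canceller whose estimation error passes through a memoryless nonlinearity, here a hard clipper, before the coefficient update. Names and parameter values are illustrative.

```python
import numpy as np

def nlms_aec(x, d, L=64, mu=0.5, eps=1e-8, clip=1.0):
    """Time-domain NLMS echo canceller with a memoryless error
    nonlinearity (a hard clipper here) applied before adaptation.

    x : far-end reference signal
    d : microphone signal (echo, possibly plus near-end noise)
    """
    w = np.zeros(L)                         # adaptive filter estimate
    e = np.zeros(len(d))                    # estimation error (output)
    for n in range(L - 1, len(d)):
        xv = x[n - L + 1:n + 1][::-1]       # tap-delay line [x[n], ..., x[n-L+1]]
        e[n] = d[n] - w @ xv
        g = np.clip(e[n], -clip, clip)      # "enhanced" (noise-suppressed) error
        w += mu * g * xv / (xv @ xv + eps)  # normalized LMS step
    return e, w
```

With `clip=np.inf` this reduces to plain NLMS; the paper derives its nonlinearity information-theoretically rather than fixing it ad hoc.
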
  • Applications of signal analysis using autoregressive models for amplitude modulation

    Page(s): 341 - 344

    Frequency domain linear prediction (FDLP) is an efficient technique for representing the long-term amplitude modulations (AM) of speech/audio signals using autoregressive models. In the proposed analysis technique, relatively long temporal segments (1000 ms) of the input signal are decomposed into a set of sub-bands. FDLP is applied to each sub-band to model the temporal envelopes. The residual of the linear prediction represents the frequency modulations (FM) in the sub-band signal. In this paper, we present several applications of the proposed AM-FM decomposition technique for a variety of tasks such as wide-band audio coding, speech recognition in reverberant environments, and robust feature extraction for phoneme recognition.

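
The FDLP step itself can be sketched in a few lines of numpy (full-band and single-segment here, whereas the paper applies it per sub-band): fit an AR model to the DCT of the segment; the AR envelope, read along the transform axis, approximates the temporal (Hilbert) envelope.

```python
import numpy as np

def fdlp_envelope(x, order=20):
    """Sketch of frequency-domain linear prediction: AR modelling of the
    DCT of a segment approximates the segment's temporal envelope."""
    N = len(x)
    n = np.arange(N)
    C = np.cos(np.pi * np.outer(n, 2 * n + 1) / (2 * N))   # DCT-II matrix
    y = C @ x
    # Autocorrelation (Yule-Walker) fit of an AR model to the DCT sequence
    r = np.correlate(y, y, mode="full")[N - 1:N + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * r[0] * np.eye(order), r[1:order + 1])
    # Evaluating the AR envelope over [0, pi] maps back to time indices 0..N-1
    A = 1 - np.exp(-1j * np.outer(np.pi * n / N, np.arange(1, order + 1))) @ a
    return 1.0 / np.abs(A) ** 2
```

The residual of this prediction (not computed above) carries the FM component described in the abstract.
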
  • Coherent signals direction-of-arrival estimation using a spherical microphone array: Frequency smoothing approach

    Page(s): 221 - 224

    Direction-of-arrival (DOA) estimation of coherent signals is considered of great importance in signal processing. To estimate both azimuth and elevation angle with the same accuracy, a 3-dimensional (3-D) array must be used. Spherical arrays have the advantage of spherical symmetry, facilitating 3-D DOA estimation. To apply high-resolution subspace DOA estimation algorithms, such as MUSIC, in a coherent environment, a smoothing technique is required. This paper presents the development of a smoothing technique in the frequency domain for spherical microphone arrays. We show that frequency smoothing can be performed efficiently with spherical arrays due to the decoupling of the frequency and angular components. An experimental comparison of DOA estimation between beamforming and MUSIC with frequency smoothing is performed with data measured in a real auditorium.

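
The decoupling property can be illustrated with a toy numpy example (hypothetical frequency-independent steering vectors stand in for the mode-strength-compensated spherical-harmonic domain): a single-frequency covariance of two coherent sources is rank one, while averaging over frequency restores rank two, so a MUSIC-style noise subspace again annihilates both steering vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 6                                      # channels (e.g. SH coefficients)
a1 = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # steering, source 1
a2 = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # steering, source 2
freqs = np.linspace(1000.0, 4000.0, 64)
tau = 1e-3                                 # inter-source delay
R_smooth = np.zeros((M, M), complex)
for k, f in enumerate(freqs):
    s = rng.standard_normal() + 1j * rng.standard_normal()
    # fully coherent sources: same waveform up to a frequency-dependent phase
    x = a1 * s + a2 * np.exp(-2j * np.pi * f * tau) * s
    R = np.outer(x, x.conj())
    if k == 0:
        R_single = R                       # narrowband covariance: rank 1
    R_smooth += R / len(freqs)             # frequency-smoothed covariance
# MUSIC-style noise subspace of the smoothed covariance
evals, evecs = np.linalg.eigh(R_smooth)
En = evecs[:, :-2]                         # all but the two largest eigenvectors
```

For a real spherical array the steering vectors would be spherical harmonics evaluated at candidate directions, and a MUSIC spectrum would be scanned over that grid.
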
  • Unifying semantic and content-based approaches for retrieval of environmental sounds

    Page(s): 13 - 16

    Creating a database of user-contributed recordings allows sounds to be linked not only by the semantic tags and labels applied to them, but also to other sounds with similar acoustic characteristics. Of paramount importance in navigating these databases are the problems of retrieving similar sounds using text- or sound-based queries, and of automatically annotating unlabeled sounds. We propose an integrated system which can be used for text-based retrieval of unlabeled audio, content-based query-by-example, and automatic annotation. To this end, we introduce an ontological framework where sounds are connected to each other based on a measure of perceptual similarity, while words and sounds are connected by optimizing link weights given user preference data. Results on a freely available database of environmental sounds contributed and labeled by non-expert users demonstrate effective average precision scores for both the text-based retrieval and annotation tasks.

  • Dolph-Chebyshev radial filter for the near-field spherical microphone array

    Page(s): 169 - 172

    When sources are close enough to a microphone array, the spherical nature of the radiated wavefronts allows for sound field processing in terms of distance as well as direction. As part of an on-going study on beamforming given sources close to a spherical microphone array, sound field processing is examined through the use of radial filters. In this paper, a radial Dolph-Chebyshev design is presented for attenuating far-field interference given sources close to the array surface. The proposed radial filter facilitates an analytical formulation of the design technique, and may be practical for sources very close to the array surface.

  • Source enumeration of speech mixtures using pitch harmonics

    Page(s): 89 - 92

    This paper proposes a method to simultaneously estimate the number, pitches, and relative locations of individual speech sources within instantaneous and non-instantaneous linear mixtures containing additive white Gaussian noise. The algorithm makes no assumptions about the number of sources or the number of sensors, and is therefore applicable to over-, under-, and precisely-determined scenarios. The method is hypothesis-based and employs a power-spectrum-based FIR filter derived from probability distributions of speech pitch harmonics. This harmonic windowing function (HWF) dramatically improves time-difference of arrival (TDOA) estimates over standard cross-correlation for low SNR. The pitch estimation component of the algorithm implicitly performs voiced-region detection and does not require prior knowledge about voicing. Cumulative pitch and TDOA estimates from the HWF form the basis for robust source enumeration across a wide range of SNR.

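
The standard cross-correlation baseline that the HWF improves on can be sketched as follows (the harmonic weighting itself is not reproduced here):

```python
import numpy as np

def tdoa_xcorr(x1, x2, max_lag):
    """Baseline TDOA estimate: peak of the cross-correlation within
    +/- max_lag samples. A positive lag means x1 lags (arrives after) x2."""
    c = np.correlate(x1, x2, mode="full")
    lags = np.arange(-len(x2) + 1, len(x1))
    keep = np.abs(lags) <= max_lag
    return lags[keep][np.argmax(c[keep])]
```

The paper's contribution is, in effect, a pitch-harmonic-derived spectral weighting applied before this correlation, which keeps the peak reliable at low SNR.
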
  • On robustness of multi-channel minimum mean-squared error estimators under super-Gaussian priors

    Page(s): 157 - 160

    The use of microphone arrays in speech enhancement applications offers additional features, such as directivity, over classical single-channel speech enhancement algorithms. An often-used strategy for multi-microphone noise reduction is to apply the multi-channel Wiener filter, which is often claimed to be mean-squared error optimal. However, this is only true if the estimator is constrained to be linear or if the speech and noise processes are assumed to be Gaussian. Based on histograms of speech DFT coefficients, it can be argued that optimal multi-channel minimum mean-squared error (MMSE) estimators should instead be derived under super-Gaussian speech priors. In this paper we investigate the robustness of these estimators when the steering vector is affected by estimation errors. Further, we discuss the sensitivity of the estimators when the true underlying distribution of speech DFT coefficients deviates from the assumed distribution.

  • Statistical models for speech dereverberation

    Page(s): 145 - 148

    This paper discusses a statistical-model-based approach to speech dereverberation. With this approach, we first define parametric statistical models of probability density functions (pdfs) for a clean speech signal and a room transmission channel, then estimate the model parameters, and finally recover the clean speech signal by using the pdfs with the estimated parameter values. The key to the success of this approach lies in the definition of the models of the clean speech signal and room transmission channel pdfs. This paper presents several statistical models (including newly proposed ones) and compares them in a large-scale experiment. As regards the room transmission channel pdf, an autoregressive (AR) model, an autoregressive power spectral density (ARPSD) model, and a moving-average power spectral density (MAPSD) model are considered. A clean speech signal pdf model is selected according to the room transmission channel pdf model. The AR model exhibited the highest dereverberation accuracy when a reverberant speech signal of 2 sec or longer was available while the other two models outperformed the AR model when only a 1-sec reverberant speech signal was available.

  • Acoustic coupling in multi-dimensional finite difference schemes for physically modeled voice synthesis

    Page(s): 5 - 8

    Finite-difference time domain approximation of the wave equation has been shown to provide a good approximation to acoustic wave propagation in multiple dimensions. Two dimensional models are often assumed to be an adequate compromise between the poor geometrical representation afforded by one-dimensional models and the additional computational loading incurred by three-dimensional modeling. This paper demonstrates the validity of multi-dimensional finite-difference schemes for obtaining accurate frequency responses of uniform cylinders, then compares simulations of coupled cylindrical resonators carried out in different dimensionalities. It is found that two-dimensional models exhibit erroneous low-frequency resonant behaviour where large discontinuities in acoustic admittance are presented by the geometry under simulation.

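
For reference, the basic multi-dimensional update being validated is the standard leapfrog FDTD scheme for the wave equation; a 2-D sketch on a periodic grid (the boundary conditions and excitation here are illustrative, not the paper's vocal-tract setup):

```python
import numpy as np

def fdtd_2d(nx, ny, steps, courant=0.9):
    """Leapfrog FDTD update for the 2-D wave equation on a periodic grid,
    run below the 2-D stability limit (Courant number 1/sqrt(2))."""
    lam2 = (courant / np.sqrt(2.0)) ** 2    # (c*dt/dx)^2
    p_prev = np.zeros((nx, ny))
    p = np.zeros((nx, ny))
    p[nx // 2, ny // 2] = 1.0               # impulse excitation at the centre
    for _ in range(steps):
        lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
               np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
        p_next = 2.0 * p - p_prev + lam2 * lap
        p_prev, p = p, p_next
    return p
```

Coupling two resonators, as in the paper, amounts to joining such grids (or a 1-D line and a 2-D sheet) with an admittance condition at the interface; it is there that the dimensionality differences show up.
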
  • ITU-T G.719: A new low-complexity full-band (20 kHz) audio coding standard for high-quality conversational applications

    Page(s): 265 - 268

    This paper describes a new low-complexity full-band (20 kHz) audio coding algorithm which has been recently standardized by ITU-T as Recommendation G.719. The algorithm is designed to provide 20 Hz - 20 kHz audio bandwidth using a 48 kHz sample rate, operating at 32 - 128 kbps. This codec features very high audio quality and low computational complexity and is suitable for use in applications such as videoconferencing, teleconferencing, and streaming audio over the Internet. Subjective test results from the optimization/characterization phase of G.719 are also presented in the paper.

  • Soundfield rendering with loudspeaker arrays through multiple beam shaping

    Page(s): 313 - 316

    This paper proposes a method for the acoustic rendering of a virtual environment based on a geometric decomposition of the wavefield into multiple elementary acoustic beams, all reconstructed with a loudspeaker array. The point of origin, the orientation, and the aperture of each beam are computed according to the geometry of the virtual environment that we want to render and to the location of the sources. Space-time filters are computed with a least-squares approach to render the desired beam. Experimental results show the feasibility as well as the critical issues of the proposed algorithm.

  • A Wiener-based implementation of equalization-cancellation pre-processing for binaural speech intelligibility prediction

    Page(s): 233 - 236

    This paper presents a precursor to an objective measure to predict speech intelligibility in binaural listening conditions. Such measures typically consist of a binaural pre-processing stage followed by intelligibility prediction using a monaural measure such as the Speech Intelligibility Index. In this work, an implementation of the equalization-cancellation process using Wiener filters is presented as a binaural pre-processing stage. The model is tested in simulated sound-field listening. Preliminary assessment is performed by comparison with recent work in this field. Speech intelligibility measurements from the literature are also used to qualitatively assess the improvements in signal-to-noise ratio obtained. This work will form the basis for a complete binaural intelligibility prediction system involving nonlinear input signals common in hearing aids.

  • HRTF interpolation in the wavelet transform domain

    Page(s): 293 - 296

    This paper presents a new HRTF (Head Related Transfer Function) interpolation technique for three-dimensional sound generation. The proposed approach is based on the determination of the optimal weights to be applied to the HRTF coefficients of neighboring positions in the wavelet domain in order to obtain the HRTF at a given point. The proposed method is compared to conventional interpolation methods through the error analysis of the HRTFs in the frequency and time domains. It is verified that the proposed method presents smaller interpolation errors for all HRIRs (Head Related Impulse Responses) of an available database.

  • Acoustic topic model for audio information retrieval

    Page(s): 37 - 40

    A new algorithm for content-based audio information retrieval is introduced in this work. Assuming that there exist hidden acoustic topics and that each audio clip is a mixture of those acoustic topics, we propose a topic model that learns a probability distribution over a set of hidden topics of a given audio clip in an unsupervised manner. We use the Latent Dirichlet Allocation (LDA) method for the topic model, and introduce the notion of acoustic words to support modeling within this framework. In audio description classification tasks using a Support Vector Machine (SVM) on the BBC database, the proposed acoustic topic model shows promising results, outperforming the Latent Perceptual Indexing (LPI) method in classifying onomatopoeia descriptions and semantic descriptions.

  • A perceptually enhanced Scalable-to-Lossless audio coding scheme and a trellis-based approach for its optimization

    Page(s): 329 - 332

    Scalable-to-lossless (SLS) audio compression, as standardized by MPEG, provides a lossy base layer compatible with the advanced audio coding (AAC) format, ensuring state-of-the-art quality in the base layer, and additional fine-grained enhancements that eventually provide a lossless compressed version of the signal. While SLS offers highly efficient lossless compression, the perceptual quality of its intermediate lossy layers has been observed to be suboptimal. This paper proposes a modified SLS audio coding scheme that provides enhanced perceptual quality at an intermediate bit-rate, at the expense of an additional parameter per frequency band as side-information. This scheme, when coupled with a trellis-based optimization algorithm, is demonstrated to outperform, in terms of quality at the intermediate bit-rate, both standard SLS and a recent perceptually enhanced variant, with minimal degradation in lossless coding performance.

  • Generalized State Coherence Transform for multidimensional localization of multiple sources

    Page(s): 237 - 240

    In our recent work, an effective method for multiple source localization has been proposed under the name of cumulative state coherence transform (cSCT). Exploiting the physical meaning of frequency-domain blind source separation and the sparse time-frequency dominance of acoustic sources, multiple reliable TDOAs can be estimated with only two microphones, regardless of the permutation problem and of the microphone spacing. In this paper we present a multidimensional generalization of the cSCT which allows one to localize several sources in P-dimensional space. An important approximation is made in order to perform a disjoint TDOA estimation over each dimension, which reduces the localization problem to linear complexity. Furthermore, the approach is invariant to the array geometry and to the assumed acoustic propagation model. Experimental results on simulated data show precise 2-D localization of 7 sources using an array of only three elements.

  • Improved a priori SNR estimation with application in Log-MMSE speech estimation

    Page(s): 189 - 192

    A speech enhancement method utilizing the harmonic structure of speech is presented. The method is an extension of the well-known minimum mean square error log-spectral amplitude estimator (Log-MMSE) method for speech enhancement. The improvement lies specifically in the a priori SNR estimation, which exploits the harmonic structure of speech. The method is based on a conditional averaging operation over adjacent frequency bands for each processed data block. The actual frequency bands used in the conditional averaging are determined by a pitch detector. Thus, voiced segments are averaged over frequency according to the pitch and the corresponding harmonic structure of voiced speech. Non-voiced segments are averaged over frequency according to a random number depending on the pitch value. The result is overall better SNR and SNRSeg values in white noise compared with the standard Log-MMSE reference method. In babble noise, the estimator rendered similar SNR and SNRSeg values to the Log-MMSE reference method. Subjectively, the residual background noise sounded more natural when using the suggested method.

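
For context, the decision-directed a priori SNR recursion that such methods refine can be sketched as follows (a textbook Ephraim-Malah-style update, not the paper's harmonic-averaging rule; the Wiener-gain clean-speech update is one common choice):

```python
import numpy as np

def decision_directed_xi(noisy_psd, noise_psd, alpha=0.98):
    """Decision-directed a priori SNR tracking.

    noisy_psd : (frames, bins) periodogram of the noisy speech
    noise_psd : (frames, bins) noise PSD estimate
    """
    gamma = noisy_psd / noise_psd                       # a posteriori SNR
    xi = np.empty_like(noisy_psd)
    prev_clean = np.maximum(noisy_psd[0] - noise_psd[0], 0.0)
    for t in range(noisy_psd.shape[0]):
        xi[t] = (alpha * prev_clean / noise_psd[t]
                 + (1.0 - alpha) * np.maximum(gamma[t] - 1.0, 0.0))
        gain = xi[t] / (1.0 + xi[t])                    # Wiener gain
        prev_clean = gain ** 2 * noisy_psd[t]           # clean-speech PSD estimate
    return xi
```

The paper's contribution would replace the per-bin quantities above with pitch-conditioned averages over neighbouring frequency bands.
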
  • Spectral HRTF enhancement for improved vertical-polar auditory localization

    Page(s): 305 - 308

    Head-related transfer functions (HRTFs) can be a valuable tool for adding realistic spatial attributes to arbitrary sounds presented over stereo headphones. However, in practice, HRTF-based virtual audio displays are rarely able to approach the same level of localization accuracy that would be expected for listeners attending to real sound sources in the free field. In this paper, we present a novel HRTF enhancement technique that systematically increases the salience of the direction-dependent spectral cues that listeners use to determine the elevations of sound sources. The technique is shown to produce substantial improvements in localization accuracy in the vertical-polar dimension for individualized and non-individualized HRTFs, without negatively impacting performance in the left-right localization dimension.

  • On the application of the LCMV beamformer to speech enhancement

    Page(s): 141 - 144

    In theory, the linearly constrained minimum variance (LCMV) beamformer can achieve perfect dereverberation and noise cancellation when the acoustic transfer functions (ATFs) between all sources (including interferences) and the microphones are known. However, blind estimation of the ATFs remains a difficult task. In this paper the noise reduction of the LCMV beamformer is analyzed and compared with the noise reduction of the minimum variance distortionless response (MVDR) beamformer. In addition, it is shown that the constraint of the LCMV can be modified such that only relative transfer functions, rather than ATFs, are required to achieve perfect cancellation of coherent interferences. Finally, we evaluate the noise reduction performance achieved by the LCMV and MVDR beamformers for two coherent sources: one desired and one undesired.

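
Both beamformers being compared have closed forms; a numpy sketch (the steering vectors and covariance below are illustrative, not from the paper):

```python
import numpy as np

def mvdr_weights(R, d):
    """MVDR: minimise w^H R w subject to the distortionless
    constraint w^H d = 1."""
    Ri_d = np.linalg.solve(R, d)
    return Ri_d / (d.conj() @ Ri_d)

def lcmv_weights(R, C, f):
    """LCMV: minimise w^H R w subject to C^H w = f.
    With C = [d] and f = [1] this reduces to the MVDR beamformer."""
    RiC = np.linalg.solve(R, C)
    return RiC @ np.linalg.solve(C.conj().T @ RiC, f)
```

The paper's modified constraint amounts to populating the columns of `C` with relative transfer functions instead of full ATFs.
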
  • Panoramic recording and reproduction of multichannel audio using a circular microphone array

    Page(s): 117 - 120

    Multichannel audio reproduction generally suffers from one or both of the following problems: i) the recorded audio has to be artificially manipulated to provide the necessary spatial cues, which reduces the consistency of the reproduced sound field with the actual one, and ii) reproduction is not panoramic, which degrades realism when the listener is not seated in a desired ideal position facing the center channel. A recording method using a circularly symmetric array of differential microphones, and a reproduction method using a corresponding array of loudspeakers is presented in this paper. Design of microphone directivity patterns to achieve a panoramic auditory scene is discussed. Objective results in the form of active intensity diagrams are presented.

  • A zone of quiet based approach to integrated active noise control and noise reduction in hearing aids

    Page(s): 229 - 232

    This paper presents an integrated approach to active noise control and noise reduction in hearing aids which is based on an optimization over a zone of quiet generated by the active noise control. A basic integrated scheme has been introduced previously to tackle secondary path effects and effects of noise leakage through an open fitting. This scheme, however, only takes the sound pressure at the ear canal microphone into account. In practice, it is desired to achieve noise control in a zone not limited to a single point. A scheme based on an average mean squared error criterion over the desired zone of quiet is presented here and compared experimentally with the original scheme.

  • Towards a musical beat emphasis function

    Page(s): 61 - 64

    We present a new method for generating input features for musical audio beat tracking systems. To emphasise periodic structure, we derive a weighted linear combination of sub-band onset detection functions driven by a measure of sub-band beat strength. Results demonstrate improved performance over existing state-of-the-art models, in particular for musical excerpts with a steady tempo.

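
One plausible reading of the construction, sketched in numpy with an autocorrelation-peak stand-in for the beat-strength measure (the paper's actual weighting is not reproduced here):

```python
import numpy as np

def beat_emphasis(odfs, lag_range=(20, 200)):
    """Weight sub-band onset detection functions by a simple periodicity
    ("beat strength") proxy -- the peak of each band's normalised
    autocorrelation over plausible beat lags -- then sum.

    odfs : (bands, frames) array of sub-band onset detection functions
    """
    weights = np.empty(len(odfs))
    for b, o in enumerate(odfs):
        o = o - o.mean()
        ac = np.correlate(o, o, mode="full")[len(o) - 1:]
        ac = ac / (ac[0] + 1e-12)
        weights[b] = ac[lag_range[0]:lag_range[1]].max()
    weights = np.maximum(weights, 0.0)
    weights /= weights.sum() + 1e-12
    return weights @ odfs        # weighted linear combination of the bands
```

Bands with strongly periodic onsets dominate the combined feature, which is the stated aim of the beat emphasis function.
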
  • A spatio-temporal power method for time-domain multi-channel speech enhancement

    Page(s): 137 - 140

    We present a new multi-stage iterative technique for enhancing noisy speech in low signal-to-noise ratio (SNR) environments. In the present paper, the speech is enhanced in two stages: in the first stage, the noise component of the observed signal is whitened, and in the second stage, a spatio-temporal power method is used to extract the desired speech component. In both stages, the coefficient adaptation is performed using the multi-channel spatio-temporal correlation sequences of the observed data. The technique is mathematically equivalent to, and computationally simpler than, the existing generalized eigenvalue decomposition (GEVD) and generalized singular value decomposition (GSVD) based techniques. Simulation results under low-SNR diffuse noise scenarios indicate significant gains in SNR without introducing musical noise artifacts.

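
The claimed equivalence can be illustrated in its plain spatial form (the paper's method additionally uses spatio-temporal stacking and works directly from correlation sequences): whiten by the noise covariance, then run the power method; the result is the dominant generalised eigenvector that GEVD would produce.

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    v = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def dominant_gev(Rs, Rn):
    """Dominant generalised eigenvector of (Rs, Rn): noise whitening
    followed by the power method, avoiding an explicit GEVD."""
    L = np.linalg.cholesky(Rn)          # Rn = L L^T
    Li = np.linalg.inv(L)
    u = power_method(Li @ Rs @ Li.T)    # stage 2 on the whitened problem
    v = Li.T @ u                        # map back to the original coordinates
    return v / np.linalg.norm(v)
```

Here `Rs` and `Rn` stand for speech-plus-noise and noise covariance estimates; only matrix-vector products are needed once the whitening is in place, which is the source of the computational saving.
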
  • Guided harmonic sinusoid estimation in a multi-pitch environment

    Page(s): 41 - 44

    We describe an algorithm to accurately estimate the fundamental frequency of harmonic sinusoids in a mixed voice recording environment using an aligned electronic score as a guide. Taking the pitch tracking results on individual voices prior to mixing as ground truth, we are able to estimate the pitch of individual voices in a 4-part piece to within 50 cents of the correct pitch more than 90% of the time.

  • Domain decomposition method for the digital waveguide mesh

    Page(s): 21 - 24

    The digital waveguide mesh (DWM) is a discrete-time numerical method for modeling the propagation of traveling waves in multidimensional mechanical and acoustic systems. Despite the fact that the DWM is not as computationally efficient as a 1-D digital waveguide, it is still widely used for sound synthesis of musical instruments and for acoustical modeling of rooms because of the simplicity of the implementation. However, large-scale realization of the digital waveguide mesh is not adequate for many simulations because of its relatively high and direction-dependent dispersion error. The influence of dispersion error can be reduced by using a denser mesh structure though with extra computational costs. This paper presents a method for efficiently interconnecting rectangular DWM sub-domains of different mesh density. The method requires only small overlapped buffer regions to be added for each sub-domain. This allows the selective use of higher and lower density grids in a single simulation based on spatially-dependent dispersion error criteria.
