
IEEE Transactions on Audio, Speech, and Language Processing

Issue 3 • March 2010


Displaying Results 1 - 25 of 30
  • Table of contents

    Publication Year: 2010, Page(s): C1 - C4
    PDF (102 KB)
    Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Publication Year: 2010, Page(s): C2
    PDF (39 KB)
    Freely Available from IEEE
  • Editorial for the Special Issue on Signal Models and Representations of Musical and Environmental Sounds

    Publication Year: 2010, Page(s): 417 - 419
    PDF (689 KB) | HTML
    Freely Available from IEEE
  • Time-Scale Atoms Chains for Transients Detection in Audio Signals

    Publication Year: 2010, Page(s): 420 - 433
    Cited by: Papers (2)
    PDF (2333 KB) | HTML

    This paper presents a novel approach for extracting the transient content of audio signals, which are usually represented as a superposition of stationary, transient, and stochastic components. The proposed model exploits the predictable and distinctive time-scale behavior of transients by modeling them as a superposition of suitable wavelet atoms. These atoms make it possible to predict transient information even at scales where the tonal component is dominant, so that pre-analysis of the tonal component can be avoided if required. Extensive experimental results show that the proposed model achieves good performance with moderate computational effort and without user intervention.

  • Audio Signal Representations for Indexing in the Transform Domain

    Publication Year: 2010, Page(s): 434 - 446
    Cited by: Papers (10)
    PDF (1461 KB) | HTML

    Indexing audio signals directly in the transform domain can potentially save a significant amount of computation when working on a large database of signals stored in a lossy compression format, without having to fully decode the signals. Here, we show that the representations used in standard transform-based audio codecs (e.g., MDCT for AAC, or hybrid PQF/MDCT for MP3) have a sufficient time resolution for some rhythmic features, but a poor frequency resolution, which prevents their use in tonality-related applications. Alternatively, a recently developed audio codec based on a sparse multi-scale MDCT transform has a good resolution both for time- and frequency-domain features. We show that this new audio codec allows efficient transform-domain audio indexing for three different applications, namely beat tracking, chord recognition, and musical genre classification. We compare results obtained with this new audio codec and the two standard MP3 and AAC codecs, in terms of performance and computation time.

  • Adaptive Signal Modeling Based on Sparse Approximations for Scalable Parametric Audio Coding

    Publication Year: 2010, Page(s): 447 - 460
    Cited by: Papers (5)
    PDF (565 KB) | HTML

    This paper deals with the application of adaptive signal models to parametric audio coding. A fully parametric audio coder that decomposes the audio signal into sinusoids, transients, and noise is proposed. Adaptive signal models for sinusoidal, transient, and noise modeling are included in the parametric scheme in order to achieve high-quality, low-bit-rate audio coding. A new sinusoidal modeling method based on a perceptual distortion measure is proposed. For transient modeling, a fast and effective method based on matching pursuit with a mixed dictionary is chosen, and the residual of the previous models is analyzed as a noise-like signal. The proposed parametric audio coder allows high-quality coding of one-channel audio signals at 16 kbit/s (average bit rate). A bit-rate-scalable version of the coder, intended for audio streaming applications, is also proposed. The performance of the proposed parametric audio coders (nonscalable and scalable) is assessed in comparison to widely used audio coders operating at similar bit rates.

  • Sparse Approximation and the Pursuit of Meaningful Signal Models With Interference Adaptation

    Publication Year: 2010, Page(s): 461 - 472
    Cited by: Papers (4)
    PDF (2063 KB) | HTML

    In the pursuit of a sparse signal model, mismatches between the signal and the dictionary, as well as atoms poorly selected by the decomposition process, can diminish the efficiency and meaningfulness of the resulting representation. These problems increase the number of atoms needed to model a signal for a given error, and they obscure the relationships between signal content and the elements of the model. To increase the efficiency and meaningfulness of a signal model built by an iterative descent pursuit, such as matching pursuit (MP), we propose integrating into its atom selection criterion a measure of interference between an atom and the model. We define interference and illustrate how it describes the contribution of an atom to modeling a signal. We show that for any nontrivial signal, the convergent model created by MP must have as much destructive as constructive interference, i.e., MP cannot avoid correction in the signal model. This is not necessarily a shortcoming of orthogonal variants of MP, such as orthogonal MP (OMP). We derive interference-adaptive iterative descent pursuits and show how these can build signal models that better fit the signal locally, and reduce the corrections made in a signal model. Compared with MP and its orthogonal variants, our experimental results not only show an increase in model efficiency, but also a clearer correspondence between the signal and the atoms of a representation. (A minimal sketch of the baseline matching pursuit algorithm appears after this listing.)

  • Music Scene-Adaptive Harmonic Dictionary for Unsupervised Note-Event Detection

    Publication Year: 2010, Page(s): 473 - 486
    Cited by: Papers (2)
    PDF (832 KB) | HTML

    Harmonic decompositions are a powerful tool for dealing with polyphonic music signals in potential applications such as music visualization, music transcription, and instrument recognition. The usefulness of a harmonic decomposition relies on the design of a proper harmonic dictionary; music scene-adaptive harmonic atoms have been used for this purpose. These atoms are adapted to the musical instruments and to the music scene, including aspects related to the venue, the musicians, and other relevant acoustic properties. In this paper, an unsupervised process to obtain music scene-adaptive spectral patterns for each MIDI note is proposed. The obtained harmonic dictionary is then applied to note-event detection with matching pursuit. For a music database consisting only of single-instrument signals, promising results (high accuracy and low error rate) have been achieved for note-event detection.

  • A Robust and Computationally Efficient Subspace-Based Fundamental Frequency Estimator

    Publication Year: 2010, Page(s): 487 - 497
    Cited by: Papers (6)
    PDF (1033 KB) | HTML

    This paper presents a method for high-resolution fundamental frequency (F0) estimation based on subspaces decomposed from a frequency-selective data model, effectively splitting the signal into a number of subbands. The resulting estimator is termed frequency-selective harmonic MUSIC (F-HMUSIC). The subband-based approach is expected to ensure computational savings and robustness. Additionally, a method for automatic subband signal activity detection is proposed, based on an information-theoretic criterion that requires no subjective judgment. The F-HMUSIC algorithm exhibits good statistical performance when evaluated on synthetic signals in both white and colored noise, while its evaluation on real-life audio signals shows the algorithm to be competitive with other estimators. Finally, F-HMUSIC is found to be computationally more efficient and more robust than other subspace-based F0 estimators, and is also robust to recorded data with inharmonicities. (A generic harmonic-MUSIC sketch, without the frequency-selective subband structure, appears after this listing.)

  • A Comparative Evaluation of Techniques for Single-Frame Discrimination of Nonstationary Sinusoids

    Publication Year: 2010, Page(s): 498 - 508
    Cited by: Papers (1)
    PDF (761 KB) | HTML

    Many spectral analysis and modification techniques require the separation of sinusoidal from nonsinusoidal signal components of a Fourier spectrum. Techniques exist for the estimation of the parameters of nonstationary sinusoids, and for discriminating these from other components, within a single Fourier frame. We present a comparative study of five methods for sinusoidal discrimination, considering their effectiveness and their computational cost.

  • Analysis/Synthesis of Sounds Generated by Sustained Contact Between Rigid Objects

    Publication Year: 2010, Page(s): 509 - 518
    Cited by: Papers (1)
    PDF (726 KB) | HTML

    This paper introduces an analysis/synthesis scheme for the reproduction of sounds generated by sustained contact between rigid bodies. The scheme is rooted in a Source/Filter decomposition of the sound, where the filter is described as a set of poles and the source as a set of impulses representing the energy transfer between the interacting objects. Compared to single impacts, sustained-contact interactions like rolling and sliding make the estimation of the Source/Filter model parameters challenging for two reasons: first, the objects are almost continuously interacting; second, the source is generally unknown and therefore has to be modeled in a generic way. To tackle these issues, the proposed analysis/synthesis scheme combines advanced analysis techniques for estimating the filter parameters with a flexible model of the source, allowing a wide range of sounds to be modeled. Examples are presented for objects of various shapes and sizes rolling or sliding over plates of different materials. To demonstrate the versatility of the approach, the system is also used to model sounds produced by percussive musical instruments.

  • Generative Spectrogram Factorization Models for Polyphonic Piano Transcription

    Publication Year: 2010, Page(s): 519 - 527
    Cited by: Papers (2)
    PDF (1201 KB) | HTML

    We introduce a framework for probabilistic generative models of time-frequency coefficients of audio signals, using a matrix factorization parametrization to jointly model spectral characteristics such as harmonicity and temporal activations and excitations. The models represent the observed data as the superposition of statistically independent sources, and we consider variance-based models used in source separation and intensity-based models for non-negative matrix factorization. We derive a generalized expectation-maximization algorithm for inferring the parameters of the model and then adapt this algorithm for the task of polyphonic transcription of music using labeled training data. The performance of the system is compared to that of existing discriminative and model-based approaches on a dataset of solo piano music.

  • Adaptive Harmonic Spectral Decomposition for Multiple Pitch Estimation

    Publication Year: 2010, Page(s): 528 - 537
    Cited by: Papers (40)
    PDF (990 KB) | HTML

    Multiple pitch estimation consists of estimating the fundamental frequencies and saliences of pitched sounds over short time frames of an audio signal. This task forms the basis of several applications in the particular context of musical audio. One approach is to decompose the short-term magnitude spectrum of the signal into a sum of basis spectra representing individual pitches scaled by time-varying amplitudes, using algorithms such as nonnegative matrix factorization (NMF). Prior training of the basis spectra is often infeasible due to the wide range of possible musical instruments. Appropriate spectra must then be adaptively estimated from the data, which may result in limited performance due to overfitting issues. In this paper, we model each basis spectrum as a weighted sum of narrowband spectra representing a few adjacent harmonic partials, thus enforcing harmonicity and spectral smoothness while adapting the spectral envelope to each instrument. We derive an NMF-like algorithm to estimate the model parameters and evaluate it on a database of piano recordings, considering several choices for the narrowband spectra. The proposed algorithm performs similarly to supervised NMF using pre-trained piano spectra but improves pitch estimation performance by 6% to 10% compared to alternative unsupervised NMF algorithms.

  • Enforcing Harmonicity and Smoothness in Bayesian Non-Negative Matrix Factorization Applied to Polyphonic Music Transcription

    Publication Year: 2010, Page(s): 538 - 549
    Cited by: Papers (26)
    PDF (786 KB) | HTML

    This paper presents theoretical and experimental results about constrained non-negative matrix factorization (NMF) in a Bayesian framework. A model of superimposed Gaussian components including harmonicity is proposed, while temporal continuity is enforced through an inverse-Gamma Markov chain prior. We then exhibit a space-alternating generalized expectation-maximization (SAGE) algorithm to estimate the parameters. Computational time is reduced by initializing the system with an original variant of multiplicative harmonic NMF, which is described as well. The algorithm is then applied to polyphonic piano music transcription and compared to other state-of-the-art algorithms, especially NMF-based ones. Convergence issues are also discussed from both a theoretical and an experimental point of view. Bayesian NMF with harmonicity and temporal-continuity constraints is shown to outperform other standard NMF-based transcription systems, providing a meaningful mid-level representation of the data. However, temporal smoothness has its drawbacks, in particular where transients are concerned, and can be detrimental to transcription performance when it is the only constraint used. Possible improvements of the temporal prior are discussed.

  • Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation

    Publication Year: 2010, Page(s): 550 - 563
    Cited by: Papers (66) | Patents (2)
    PDF (955 KB) | HTML

    We consider inference in a general data-driven object-based model of multichannel audio data, assumed generated as a possibly underdetermined convolutive mixture of source signals. We work in the short-time Fourier transform (STFT) domain, where convolution is routinely approximated as linear instantaneous mixing in each frequency band. Each source STFT is given a model inspired by nonnegative matrix factorization (NMF) with the Itakura-Saito divergence, which underlies a statistical model of superimposed Gaussian components. We address estimation of the mixing and source parameters using two methods. The first consists of maximizing the exact joint likelihood of the multichannel data using an expectation-maximization (EM) algorithm. The second consists of maximizing the sum of individual likelihoods of all channels using a multiplicative update algorithm inspired by NMF methodology. Our decomposition algorithms are applied to stereo audio source separation in various settings, covering blind and supervised separation, music and speech sources, synthetic instantaneous and convolutive mixtures, as well as professionally produced music recordings. Our EM method produces results competitive with the state of the art, as illustrated on two tasks from the international Signal Separation Evaluation Campaign (SiSEC 2008). (A generic single-channel Itakura-Saito NMF sketch appears after this listing.)

  • Source/Filter Model for Unsupervised Main Melody Extraction From Polyphonic Audio Signals

    Publication Year: 2010, Page(s): 564 - 575
    Cited by: Papers (26)
    PDF (965 KB) | HTML

    Extracting the main melody from a polyphonic music recording seems natural even to untrained human listeners. To a certain extent it is related to the concept of source separation, with the human ability to focus on a specific source in order to extract relevant information. In this paper, we propose a new approach for the estimation and extraction of the main melody (and in particular the leading vocal part) from polyphonic audio signals. To that aim, we propose a new signal model in which the leading vocal part is explicitly represented by a specific source/filter model. The proposed representation is investigated in the framework of two statistical models: a Gaussian Scaled Mixture Model (GSMM) and an extended Instantaneous Mixture Model (IMM). For both models, the estimation of the different parameters is done within a maximum-likelihood framework adapted from single-channel source separation techniques. The desired sequence of fundamental frequencies is then inferred from the estimated parameters. The results obtained in a recent evaluation campaign (MIREX08) show that the proposed approaches are very promising and reach state-of-the-art performance on all test sets.

  • Non-Negative Multilinear Principal Component Analysis of Auditory Temporal Modulations for Music Genre Classification

    Publication Year: 2010, Page(s): 576 - 588
    Cited by: Papers (11)
    PDF (1048 KB) | HTML

    Motivated by psychophysiological investigations of the human auditory system, a bio-inspired two-dimensional auditory representation of music signals is exploited which captures the slow temporal modulations. Although each recording is represented by a second-order tensor (i.e., a matrix), a third-order tensor is needed to represent a music corpus. Non-negative multilinear principal component analysis (NMPCA) is proposed for the unsupervised dimensionality reduction of the third-order tensors. NMPCA maximizes the total tensor scatter while preserving the non-negativity of the auditory representations. An algorithm for NMPCA is derived by exploiting the structure of the Grassmann manifold. NMPCA is compared against three multilinear subspace analysis techniques, namely non-negative tensor factorization, high-order singular value decomposition, and multilinear principal component analysis, as well as their linear counterparts, i.e., non-negative matrix factorization, singular value decomposition, and principal component analysis, in extracting features that are subsequently classified by either support vector machine or nearest neighbor classifiers. Three sets of experiments conducted on the GTZAN and ISMIR2004 Genre datasets demonstrate the superiority of NMPCA over the aforementioned subspace analysis techniques in extracting more discriminating features, especially when the training set has small cardinality. The best classification accuracies reported in the paper exceed those obtained by state-of-the-art music genre classification algorithms applied to both datasets.

  • Gamma Markov Random Fields for Audio Source Modeling

    Publication Year: 2010, Page(s): 589 - 601
    Cited by: Papers (10)
    PDF (1438 KB) | HTML

    In many audio processing tasks, such as source separation, denoising, or compression, it is crucial to construct realistic and flexible models to capture the physical properties of audio signals. This can be accomplished in the Bayesian framework through the use of appropriate prior distributions. In this paper, we describe a class of prior models called Gamma Markov random fields (GMRFs) to model the sparsity and the local dependency of the energies (i.e., variances) of time-frequency expansion coefficients. A GMRF model describes a non-normalized joint distribution over unobserved variance variables, where, given the field, the actual source coefficients are independent. Our construction ensures a positive coupling between the variance variables, so that signal energy changes smoothly over both axes to capture temporal and spectral continuity. The coupling strength is controlled by a set of hyperparameters. Inference on the overall model is convenient because of the conditional conjugacy of all of the variables in the model, but automatic optimization of the hyperparameters is crucial to obtain better fits. The marginal likelihood of the model is not available because of the intractable normalizing constant of GMRFs. In this paper, we optimize the hyperparameters of our GMRF-based audio model using contrastive divergence and compare this method to alternatives such as score matching and pseudolikelihood maximization where applicable. We present the performance of the GMRF models in denoising and single-channel source separation problems in completely blind scenarios, where all the hyperparameters are jointly estimated given only the audio data.

  • Modeling Music as a Dynamic Texture

    Publication Year: 2010, Page(s): 602 - 612
    Cited by: Papers (16)
    PDF (1125 KB) | HTML

    We consider representing a short temporal fragment of musical audio as a dynamic texture, a model of both the timbral and rhythmical qualities of sound, two of the important aspects required for automatic music analysis. The dynamic texture model treats a sequence of audio feature vectors as a sample from a linear dynamical system. We apply this new representation to the task of automatic song segmentation. In particular, we cluster audio fragments, extracted from a song, as samples from a dynamic texture mixture (DTM) model. We show that the DTM model can both accurately cluster coherent segments in music and detect transition boundaries. Moreover, the generative character of the proposed model makes it amenable to a wide range of applications besides segmentation. As examples, we use DTM models of songs to suggest possible improvements in other music information retrieval applications such as music annotation and similarity. (A minimal sketch of fitting a single dynamic texture appears after this listing.)

  • Representing Musical Sounds With an Interpolating State Model

    Publication Year: 2010, Page(s): 613 - 624
    Cited by: Papers (3)
    PDF (767 KB) | HTML

    A computationally efficient algorithm is proposed for modeling and representing time-varying musical sounds. The aim is to encode individual sounds, not the statistical properties of several sounds representing a certain class. A given sequence of acoustic feature vectors is modeled by finding a set of "states" (anchor points in the feature space) such that the input data can be efficiently represented by interpolating between them. The proposed interpolating state model is generic and can be used to represent any multidimensional data sequence. In this paper, it is applied to represent musical instrument sounds in a compact and accurate form. Simulation experiments show that the proposed method clearly outperforms the conventional vector quantization approach, in which the acoustic feature data are k-means clustered and the feature vectors are replaced by the corresponding cluster centroids. The computational complexity of the proposed algorithm as a function of the input sequence length T is O(T log T).

  • Incorporating Cultural Representations of Features Into Audio Music Similarity Estimation

    Publication Year: 2010, Page(s): 625 - 637
    PDF (885 KB) | HTML

    We address the problem of estimating automatically from audio signals the similarity between two pieces of music, a technology that has many applications in the online digital music industry. Conventional methods of audio music search use distance measures between features derived from the audio for this task. We describe three techniques that make use of music classifiers to derive representations of audio features that are based on culturally motivated information learned by the classifier. When these representations are used for similarity estimation, they produce very significant reductions in computational complexity over existing techniques (such as those based on the KL-divergence), and also produce metric similarity spaces, which facilitate the use of technologies for the sub-linear scaling of search times. We have evaluated each system using both pseudo-objective techniques and human listeners, and we demonstrate that this efficiency gain is obtained while providing a comparable level of performance when compared with existing techniques.

  • A Modeling of Singing Voice Robust to Accompaniment Sounds and Its Application to Singer Identification and Vocal-Timbre-Similarity-Based Music Information Retrieval

    Publication Year: 2010, Page(s): 638 - 648
    Cited by: Papers (18)
    PDF (2613 KB) | HTML

    This paper describes a method of modeling the characteristics of a singing voice from polyphonic musical audio signals that include the sounds of various musical instruments. Because singing voices play an important role in musical pieces with vocals, such a representation is useful for music information retrieval systems. The main problem in modeling the characteristics of a singing voice is the negative influence of accompaniment sounds. To solve this problem, we developed two methods, accompaniment sound reduction and reliable frame selection. The former makes it possible to calculate feature vectors that represent the spectral envelope of a singing voice after reducing accompaniment sounds. It first extracts the harmonic components of the predominant melody from sound mixtures and then resynthesizes the melody by using a sinusoidal model driven by these components. The latter method then estimates the reliability of each frame of the obtained melody (i.e., the influence of accompaniment sounds) by using two Gaussian mixture models (GMMs) for vocal and nonvocal frames to select the reliable vocal portions of musical pieces. Finally, each song is represented by its GMM consisting of the reliable frames. This new representation of the singing voice is demonstrated to improve the performance of an automatic singer identification system and to enable an MIR system based on vocal timbre similarity.

  • Towards Timbre-Invariant Audio Features for Harmony-Based Music

    Publication Year: 2010, Page(s): 649 - 662
    Cited by: Papers (13)
    PDF (1164 KB) | HTML

    Chroma-based audio features are a well-established tool for analyzing and comparing harmony-based Western music that is based on the equal-tempered scale. By identifying spectral components that differ by a musical octave, chroma features possess a considerable amount of robustness to changes in timbre and instrumentation. In this paper, we describe a novel procedure that further enhances chroma features by significantly boosting the degree of timbre invariance without degrading the features' discriminative power. Our idea is based on the generally accepted observation that the lower mel-frequency cepstral coefficients (MFCCs) are closely related to timbre. Instead of keeping the lower coefficients, we discard them and keep only the upper coefficients. Furthermore, using a pitch scale instead of a mel scale allows us to project the remaining coefficients onto the 12 chroma bins. We present a series of experiments to demonstrate that the resulting chroma features outperform various state-of-the-art features in the context of music matching and retrieval applications. As a final contribution, we give a detailed analysis of our enhancement procedure, revealing the musical meaning of certain pitch-frequency cepstral coefficients. (A simplified sketch of this cepstral-discarding idea appears after this listing.)

  • Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds

    Publication Year: 2010, Page(s): 663 - 674
    Cited by: Papers (8)
    PDF (735 KB) | HTML

    We present a computational model of musical instrument sounds that focuses on capturing the dynamic behavior of the spectral envelope. A set of spectro-temporal envelopes belonging to different notes of each instrument are extracted by means of sinusoidal modeling and subsequent frequency interpolation, before being subjected to principal component analysis. The prototypical evolution of the envelopes in the obtained reduced-dimensional space is modeled as a nonstationary Gaussian Process. This results in a compact representation in the form of a set of prototype curves in feature space, or equivalently of prototype spectro-temporal envelopes in the time-frequency domain. Finally, the obtained models are successfully evaluated in the context of two music content analysis tasks: classification of instrument samples and detection of instruments in monaural polyphonic mixtures.

  • Sound Indexing Using Morphological Description

    Publication Year: 2010, Page(s): 675 - 687
    Cited by: Papers (3)
    PDF (1903 KB) | HTML

    Sound sample indexing usually deals with the recognition of the source/cause that has produced the sound. For abstract sounds, sound effects, and unnatural or synthetic sounds, this cause is usually unknown or unrecognizable. An efficient description of these sounds has been proposed by Schaeffer under the name of morphological description. Part of this description consists of describing a sound by identifying the temporal evolution of its acoustic properties with a set of profiles. In this paper, we consider three morphological descriptions: dynamic profiles (ascending, descending, ascending/descending, stable, impulsive), melodic profiles (up, down, stable, up/down, down/up), and complex-iterative sound description (non-iterative, iterative, grain, repetition). We study the automatic indexing of a sound into these profiles. Because this indexing is difficult using standard audio features, we propose new audio features to perform the task. The dynamic profiles are estimated by modeling the loudness of a sound over time with a second-order B-spline model and deriving features from this model. The melodic profiles are estimated by tracking over time the perceptual filter with the maximum excitation; a function derived from this track is then modeled with a second-order B-spline model, and the features are again derived from the B-spline model. The description of complex-iterative sounds is obtained by estimating the amount of repetition and the period of the repetition, which are computed from an audio similarity function derived from a Mel-frequency cepstral coefficient (MFCC) similarity matrix. The proposed audio features are then tested for automatic classification. We consider three classification tasks corresponding to the three profiles, and in each case the results are compared with those obtained using standard audio features.

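Several of the articles above (on transient extraction, parametric coding, interference-adaptive pursuits, and note-event detection) build sparse signal models greedily with matching pursuit. The following minimal Python/NumPy sketch shows plain matching pursuit over an arbitrary unit-norm dictionary; it is a generic textbook illustration, not any author's reference implementation, and the random Gaussian dictionary in the toy example stands in for the Gabor, wavelet, or harmonic dictionaries used in the papers.

    import numpy as np

    def matching_pursuit(x, D, n_atoms=50, tol=1e-6):
        """Greedy MP: approximate x as sum_k c_k * D[:, k]. Columns of D must be unit-norm."""
        residual = x.astype(float).copy()
        coeffs, indices = [], []
        for _ in range(n_atoms):
            corr = D.T @ residual              # correlation of the residual with every atom
            k = int(np.argmax(np.abs(corr)))   # best-matching atom
            c = corr[k]
            if abs(c) < tol:
                break
            residual -= c * D[:, k]            # peel off that atom's contribution
            coeffs.append(c)
            indices.append(k)
        return np.array(coeffs), np.array(indices), residual

    # Toy example: a random unit-norm dictionary and a two-atom signal.
    rng = np.random.default_rng(0)
    D = rng.standard_normal((256, 1024))
    D /= np.linalg.norm(D, axis=0)
    x = 2.0 * D[:, 3] - 0.5 * D[:, 71]
    coeffs, indices, residual = matching_pursuit(x, D, n_atoms=10)
    print(indices[:2], np.linalg.norm(residual))   # the two planted atoms should be found first
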
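The F-HMUSIC article estimates the fundamental frequency from the projection of harmonic steering vectors onto a noise subspace. The sketch below shows that underlying harmonic-MUSIC idea in a generic, non-frequency-selective form (no subband splitting or activity detection, which are the paper's contributions); the snapshot length, candidate grid, and harmonic count are arbitrary choices for the toy example.

    import numpy as np

    def harmonic_music_f0(x, fs, f0_grid, n_harm, m=60):
        """Pseudospectrum over candidate F0s for one real-valued frame x."""
        # Sample covariance from overlapping length-m snapshots of the frame
        X = np.array([x[i:i + m] for i in range(len(x) - m)])
        R = (X.T @ X) / X.shape[0]
        # Noise subspace: eigenvectors outside the assumed signal order (two per real sinusoid)
        eigvals, eigvecs = np.linalg.eigh(R)            # eigenvalues in ascending order
        noise_basis = eigvecs[:, : m - 2 * n_harm]
        n = np.arange(m)
        spec = np.empty(len(f0_grid))
        for i, f0 in enumerate(f0_grid):
            # Complex-exponential steering vectors at the harmonics of the candidate F0
            A = np.exp(2j * np.pi * np.outer(n, f0 * np.arange(1, n_harm + 1)) / fs)
            proj = noise_basis.conj().T @ A
            spec[i] = 1.0 / np.real(np.trace(proj.conj().T @ proj))
        return spec

    # Toy example: a three-harmonic tone at 220 Hz in light noise.
    fs = 8000
    t = np.arange(2048) / fs
    x = sum(np.cos(2 * np.pi * 220 * h * t) / h for h in range(1, 4))
    x = x + 0.05 * np.random.default_rng(0).standard_normal(t.size)
    f0_grid = np.linspace(100.0, 400.0, 301)
    print(f0_grid[np.argmax(harmonic_music_f0(x, fs, f0_grid, n_harm=3))])   # near 220 Hz
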
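Several of the transcription and separation articles factor a nonnegative power spectrogram with NMF under the Itakura-Saito divergence. The sketch below shows the standard single-channel multiplicative-update rules for that divergence; it is a generic illustration, not the multichannel EM or SAGE algorithms described in the papers.

    import numpy as np

    def is_nmf(V, n_components=8, n_iter=200, eps=1e-12):
        """Factorize a nonnegative power spectrogram V (freq x time) as V ~ W @ H
        under the Itakura-Saito divergence, using multiplicative updates."""
        rng = np.random.default_rng(0)
        n_freq, n_frames = V.shape
        W = rng.random((n_freq, n_components)) + eps
        H = rng.random((n_components, n_frames)) + eps
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)   # update activations
            WH = W @ H + eps
            W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)   # update basis spectra
        return W, H

    # Toy example on a random "power spectrogram".
    V = np.abs(np.random.default_rng(1).standard_normal((129, 64))) ** 2
    W, H = is_nmf(V, n_components=4)
    print(W.shape, H.shape)
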
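The dynamic-texture article treats a sequence of audio feature vectors as a sample from a linear dynamical system. The sketch below uses the common SVD-based (suboptimal) procedure to fit a single dynamic texture; the mixture model and EM clustering used for segmentation in the paper are not shown.

    import numpy as np

    def fit_dynamic_texture(Y, n_states=5):
        """Fit a linear dynamical system y_t = C x_t + noise, x_{t+1} = A x_t + noise,
        to a feature sequence Y of shape (dim, T), using the usual SVD-based method."""
        mean = Y.mean(axis=1, keepdims=True)
        U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
        C = U[:, :n_states]                              # observation matrix
        X = np.diag(s[:n_states]) @ Vt[:n_states]        # latent state trajectory
        A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])         # least-squares state dynamics
        return mean, C, A, X

    # Toy example: a slowly rotating 2-D state observed in 20 dimensions.
    rng = np.random.default_rng(0)
    theta = 0.1
    A_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    states = np.empty((2, 300))
    states[:, 0] = [1.0, 0.0]
    for t in range(1, 300):
        states[:, t] = A_true @ states[:, t - 1]
    Y = rng.standard_normal((20, 2)) @ states + 0.01 * rng.standard_normal((20, 300))
    mean, C, A, X = fit_dynamic_texture(Y, n_states=2)
    print(np.sort(np.abs(np.linalg.eigvals(A))))   # eigenvalue magnitudes close to 1
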
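The timbre-invariant features article enhances chroma by discarding the lower, timbre-related cepstral coefficients computed on a pitch scale and folding the result onto 12 chroma bins. The sketch below is a loose simplification of that idea: the per-MIDI-pitch energy front end and several post-processing steps of the published feature are omitted, and the cutoff index is an arbitrary assumption.

    import numpy as np

    def dct_matrix(n):
        """Orthonormal DCT-II matrix, so the inverse transform is the transpose."""
        k = np.arange(n)[:, None]
        t = np.arange(n)[None, :]
        C = np.cos(np.pi / n * (t + 0.5) * k)
        C[0] /= np.sqrt(2.0)
        return C * np.sqrt(2.0 / n)

    def timbre_reduced_chroma(pitch_energy, discard_below=15, eps=1e-9):
        """pitch_energy: (128, T) energies per MIDI pitch. Returns a (12, T) chroma-like
        feature obtained by zeroing the lower pitch-frequency cepstral coefficients."""
        n_pitch, n_frames = pitch_energy.shape
        C = dct_matrix(n_pitch)
        cepstrum = C @ np.log(pitch_energy + eps)   # pitch-frequency cepstrum
        cepstrum[:discard_below] = 0.0              # drop the timbre-related lower coefficients
        smoothed = C.T @ cepstrum                   # back to the log-pitch domain
        chroma = np.zeros((12, n_frames))
        for p in range(n_pitch):                    # fold the 128 pitches onto 12 chroma bins
            chroma[p % 12] += smoothed[p]
        norm = np.linalg.norm(chroma, axis=0, keepdims=True) + eps
        return chroma / norm

    # Toy example: random per-pitch energies for 10 frames.
    energies = np.random.default_rng(0).random((128, 10))
    print(timbre_reduced_chroma(energies).shape)   # (12, 10)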

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research