IEEE Transactions on Audio, Speech, and Language Processing

Issue 2 • February 2008

  • Table of contents

    Page(s): C1 - C4
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
  • Introduction to the Special Issue on Music Information Retrieval

    Page(s): 253 - 254
  • Multipitch Analysis of Polyphonic Music and Speech Signals Using an Auditory Model

    Page(s): 255 - 266

    A method is described for estimating the fundamental frequencies of several concurrent sounds in polyphonic music and multiple-speaker speech signals. The method consists of a computational model of the human auditory periphery, followed by a periodicity analysis mechanism where fundamental frequencies are iteratively detected and canceled from the mixture signal. The auditory model needs to be computed only once, and a computationally efficient strategy is proposed for implementing it. Simulation experiments were made using mixtures of musical sounds and mixed speech utterances. The proposed method outperformed two reference methods in the evaluations and showed a high level of robustness in processing signals where important parts of the audible spectrum were deleted to simulate bandlimited interference. Different system configurations were studied to identify the conditions where pitch analysis using an auditory model is advantageous over conventional time or frequency domain approaches.
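
    The iterative detect-and-cancel idea can be illustrated with a toy harmonic-summation sketch; the paper's auditory-periphery front end and cancellation rules are far more elaborate, and all parameters below (candidate grid, cancellation bandwidth, attenuation factor) are illustrative assumptions, not the authors' settings.

    ```python
    # Minimal sketch of iterative F0 detection and cancellation (illustrative only;
    # the paper's auditory-model front end is replaced here by a plain FFT spectrum).
    import numpy as np

    def estimate_f0s(signal, sr, n_sources=2, fmin=60.0, fmax=600.0, n_harm=10):
        """Greedy multi-F0 estimation: pick the F0 whose harmonic sum is largest,
        then cancel its harmonics from the spectrum and repeat."""
        spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
        candidates = np.arange(fmin, fmax, 1.0)          # 1-Hz grid of F0 candidates
        f0s = []
        for _ in range(n_sources):
            scores = []
            for f0 in candidates:
                bins = [np.argmin(np.abs(freqs - h * f0)) for h in range(1, n_harm + 1)]
                scores.append(spectrum[bins].sum())       # harmonic summation score
            best = candidates[int(np.argmax(scores))]
            f0s.append(best)
            # Cancel the detected source: attenuate bins near its harmonics.
            for h in range(1, n_harm + 1):
                mask = np.abs(freqs - h * best) < 10.0    # 10-Hz cancellation band
                spectrum[mask] *= 0.1
        return f0s

    # Usage idea: estimate_f0s(audio_frame, sr=16000, n_sources=3)
    ```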

  • Discriminating Between Pitched Sources in Music Audio

    Page(s): 267 - 277

    Though humans find it relatively easy to identify and/or isolate different sources within polyphonic music, the emulation of this ability by a computer is a challenging task, and one that has direct relevance to music content description and information retrieval applications. For an automated system without any prior knowledge of a recording, a possible solution is to perform an initial segmentation of the recording into notes or regions with some time-frequency contiguity, and then collect into groups those units that are acoustically similar and hence have a high likelihood of arising from a common source. This article addresses the second subtask and provides two main contributions: (1) a derivation of a suboptimal subset, out of a wide range of common audio features, that maximizes the potential to discriminate between pitched sources in polyphonic music, and (2) an estimation of the improvement in accuracy that can be achieved by using features other than pitch in the grouping process. In addition, the hypothesis was tested that more discriminatory features can be obtained by applying source separation techniques prior to feature computation. Machine learning techniques were applied to an annotated database of polyphonic recordings (containing 3181 labeled audio segments) spanning a wide range of musical genres. Average source-labeling accuracies of 68% and 76% were obtained with a 10-dimensional feature subset when the number of sources per recording was unknown and known a priori, respectively.
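
    A minimal sketch of how a discriminative feature subset might be chosen, assuming a precomputed feature matrix and segment labels; the greedy forward selection and nearest-centroid scoring below are generic stand-ins, not the paper's actual derivation, features, or classifier.

    ```python
    # Illustrative greedy forward feature selection for source discrimination
    # (hypothetical data and scoring; not the paper's exact procedure).
    import numpy as np

    def nearest_centroid_accuracy(X, y):
        """Score a feature subset by nearest-centroid classification accuracy."""
        classes = np.unique(y)
        centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return float((classes[d.argmin(axis=1)] == y).mean())

    def forward_select(X, y, k=10):
        """Greedily add the feature that most improves accuracy, up to k features."""
        chosen, remaining = [], list(range(X.shape[1]))
        while remaining and len(chosen) < k:
            scores = [nearest_centroid_accuracy(X[:, chosen + [j]], y) for j in remaining]
            best = remaining[int(np.argmax(scores))]
            chosen.append(best)
            remaining.remove(best)
        return chosen

    # Toy usage: 200 segments, 40 candidate features, 3 pitched sources.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 40))
    y = rng.integers(0, 3, size=200)
    print(forward_select(X, y, k=10))
    ```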

  • Normalized Cuts for Predominant Melodic Source Separation

    Page(s): 278 - 290

    The predominant melodic source, frequently the singing voice, is an important component of musical signals. In this paper, we describe a method for extracting the predominant source and corresponding melody from “real-world” polyphonic music. The proposed method is inspired by ideas from computational auditory scene analysis. We formulate predominant melodic source tracking and formation as a graph partitioning problem and solve it using the normalized cut, a global criterion for segmenting graphs that has been used in computer vision. Sinusoidal modeling is used as the underlying representation. A novel harmonicity cue, which we term harmonically wrapped peak similarity, is introduced. Experimental results supporting the use of this cue are presented. In addition, we show results for automatic melody extraction using the proposed approach.
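
    For readers unfamiliar with the normalized cut, the following sketch bipartitions an affinity matrix via the standard spectral relaxation of Shi and Malik; how the paper builds the graph from sinusoidal peaks, and the harmonically wrapped peak similarity cue itself, are not reproduced here.

    ```python
    # Minimal normalized-cut bipartition of a similarity graph (illustrative only).
    import numpy as np

    def normalized_cut(W):
        """Bipartition nodes of affinity matrix W via the second-smallest eigenvector
        of the symmetrically normalized Laplacian (Shi & Malik relaxation)."""
        d = W.sum(axis=1)
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L_sym = np.eye(len(W)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
        eigvals, eigvecs = np.linalg.eigh(L_sym)
        fiedler = d_inv_sqrt * eigvecs[:, 1]   # generalized eigenvector of (D - W)x = lambda*Dx
        return fiedler >= 0                    # boolean partition: one group vs. the rest

    # Toy usage: two loosely connected clusters of "peaks".
    rng = np.random.default_rng(1)
    W = rng.uniform(0.0, 0.1, size=(10, 10))
    W[:5, :5] += 0.9
    W[5:, 5:] += 0.9
    W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    print(normalized_cut(W))
    ```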

  • Acoustic Chord Transcription and Key Extraction From Audio Using Key-Dependent HMMs Trained on Synthesized Audio

    Page(s): 291 - 301

    We describe an acoustic chord transcription system that uses symbolic data to train hidden Markov models and gives best-of-class frame-level recognition results. We avoid the extremely laborious task of human annotation of chord names and boundaries, which must be done to provide machine learning models with ground truth, by performing automatic harmony analysis on symbolic music files. In parallel, we synthesize audio from the same symbolic files and extract acoustic feature vectors which are in perfect alignment with the labels. We therefore generate a large set of labeled training data with a minimal amount of human labor. This allows for richer models. Thus, we build 24 key-dependent HMMs, one for each key, using the key information derived from the symbolic data. Each key model defines a unique state-transition characteristic and helps avoid confusions seen in the observation vector. Given acoustic input, we identify a musical key by choosing the key model with the maximum likelihood, and we obtain the chord sequence from the optimal state path of the corresponding key model, both of which are returned by a Viterbi decoder. This not only increases chord recognition accuracy, but also gives key information. Experimental results show that the models trained on synthesized data perform very well on real recordings, even though the labels automatically generated from symbolic data are not 100% accurate. We also demonstrate the robustness of the tonal centroid feature, which outperforms the conventional chroma feature.
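
    The decoding step can be sketched as follows: run a Viterbi decoder under each key-dependent model and keep the key with the highest path likelihood. The observation likelihoods, transition matrices, and shapes below are placeholders, not the paper's trained 24-key models over tonal-centroid features.

    ```python
    # Sketch of key selection and chord-path decoding with Viterbi (illustrative).
    import numpy as np

    def viterbi(log_obs, log_trans, log_init):
        """log_obs: (T, S) frame log-likelihoods; returns (best state path, path log-likelihood)."""
        T, S = log_obs.shape
        delta = log_init + log_obs[0]
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans      # (from_state, to_state)
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_obs[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1], float(delta.max())

    def decode_key_and_chords(log_obs, key_models):
        """key_models: list of (log_trans, log_init), one per key; pick the most likely key."""
        results = [viterbi(log_obs, lt, li) for lt, li in key_models]
        best_key = int(np.argmax([ll for _, ll in results]))
        return best_key, results[best_key][0]

    # Toy usage: 3 chord states, 2 hypothetical key models, random observations.
    rng = np.random.default_rng(4)
    log_obs = np.log(rng.dirichlet(np.ones(3), size=50))
    models = [(np.log(rng.dirichlet(np.ones(3), size=3)), np.log(np.ones(3) / 3)) for _ in range(2)]
    key, chords = decode_key_and_chords(log_obs, models)
    ```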

  • Distortion Estimation in Compressed Music Using Only Audio Fingerprints

    Page(s): 302 - 317

    An audio fingerprint is a compact yet very robust representation of the perceptually relevant parts of an audio signal. It can be used for content-based audio identification, even when the audio is severely distorted. Audio compression changes the fingerprint slightly. We show that these small fingerprint differences due to compression can be used to estimate the signal-to-noise ratio (SNR) of the compressed audio file relative to the original. This is a useful content-based distortion estimate when the original, uncompressed audio file is unavailable. The method uses the audio fingerprints only. For stochastic signals distorted by additive noise, an analytical expression is obtained for the average fingerprint difference as a function of the SNR level. This model is based on an analysis of the Philips robust hash (PRH) algorithm. We show that for uncorrelated signals, the bit error rate (BER) is approximately inversely proportional to the square root of the SNR of the signal. The model is then extended to correlated signals and music. For an experimental verification of the proposed model, we divide the field of audio fingerprinting algorithms into three categories and select a representative algorithm from each. Experiments show that the behavior predicted by the stochastic model for the PRH also holds for the two other algorithms.
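
    The fingerprint-difference idea can be illustrated with a toy, PRH-style fingerprint (signs of band-energy differences over frequency and time) measured against noisy copies at several SNRs. The filter bank, frame sizes, and signal are simplifications and the derived proportionality constant is not reproduced; this is not the paper's exact setup.

    ```python
    # Toy illustration: simplified PRH-like fingerprint and its BER under additive noise.
    import numpy as np

    def prh_like_fingerprint(x, n_bands=33, frame=2048, hop=1024):
        """Toy binary fingerprint: signs of band-energy differences over frequency and time."""
        frames = np.array([x[i:i + frame] for i in range(0, len(x) - frame, hop)])
        spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
        edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
        E = np.stack([spec[:, edges[b]:edges[b + 1]].sum(axis=1) for b in range(n_bands)], axis=1)
        return np.diff(np.diff(E, axis=1), axis=0) > 0   # differentiate over bands, then time

    rng = np.random.default_rng(2)
    x = rng.normal(size=8000 * 5)                        # stand-in "audio" signal
    for snr_db in (30, 20, 10):
        noise = rng.normal(size=x.shape)
        noise *= np.sqrt(x.var() / (10 ** (snr_db / 10) * noise.var()))
        ber = np.mean(prh_like_fingerprint(x) != prh_like_fingerprint(x + noise))
        print(snr_db, "dB SNR -> BER =", round(float(ber), 3), "(model: BER ~ 1/sqrt(SNR))")
    ```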

  • Structural Segmentation of Musical Audio by Constrained Clustering

    Page(s): 318 - 326

    We describe a method of segmenting musical audio into structural sections based on a hierarchical labeling of spectral features. Frames of audio are first labeled as belonging to one of a number of discrete states using a hidden Markov model trained on the features. Histograms of neighboring frames are then clustered into segment-types representing distinct distributions of states, using a clustering algorithm in which temporal continuity is expressed as a set of constraints modeled by a hidden Markov random field. We give experimental results which show that in many cases the resulting segmentations correspond well to conventional notions of musical form. We show further how the constrained clustering approach can easily be extended to include prior musical knowledge, input from other machine approaches, or semi-supervision.
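
    A rough sketch of the pipeline, assuming frame-level HMM state labels are already available: build state histograms over a sliding window, cluster them, and smooth the result over time. Plain k-means with majority-vote smoothing is a crude stand-in for the paper's hidden Markov random field constraints.

    ```python
    # Illustrative stand-in for histogram clustering of HMM state labels (not the HMRF method).
    import numpy as np

    def segment(states, n_states, n_segment_types=4, win=40, iters=50, seed=0):
        """states: 1-D array of per-frame HMM state indices -> per-frame segment-type labels."""
        rng = np.random.default_rng(seed)
        # Histogram of state labels in a sliding window around each frame.
        H = np.stack([np.bincount(states[max(0, t - win):t + win], minlength=n_states)
                      for t in range(len(states))]).astype(float)
        H /= H.sum(axis=1, keepdims=True)
        # Plain k-means over the histograms.
        centers = H[rng.choice(len(H), n_segment_types, replace=False)]
        for _ in range(iters):
            labels = ((H[:, None, :] - centers[None]) ** 2).sum(axis=2).argmin(axis=1)
            centers = np.stack([H[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(n_segment_types)])
        # Crude temporal smoothing in place of the HMRF constraints: local majority vote.
        return np.array([np.bincount(labels[max(0, t - 10):t + 10]).argmax()
                         for t in range(len(labels))])
    ```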

  • Unified View of Prediction and Repetition Structure in Audio Signals With Application to Interest Point Detection

    Page(s): 327 - 337

    In this paper, we present a new method for the analysis of musical structure that captures local prediction and global repetition properties of audio signals in one information-processing framework. The method is motivated by recent work in music perception where machine features were shown to correspond to human judgments of familiarity and emotional force when listening to music. Using a notion of information rate in a model-based framework, we develop a measure of mutual information between past and present in a time signal and show that it consists of two factors: a prediction property related to data statistics within an individual block of signal features, and a repetition property based on differences in model likelihood across blocks. The first factor, when applied to a spectral representation of audio signals, is known as spectral anticipation, and the second factor is known as recurrence analysis. We present algorithms for estimating these measures and create a visualization that displays their temporal structure in musical recordings. Considering these features as a measure of the amount of information processing that a listening system performs on a signal, information rate is used to detect interest points in music. Several musical works with different performances are analyzed in this paper, and their structure and interest points are displayed and discussed. Extensions of this approach towards a general framework for characterizing the machine listening experience are suggested.

  • LyricAlly: Automatic Synchronization of Textual Lyrics to Acoustic Music Signals

    Page(s): 338 - 349

    We present LyricAlly, a prototype that automatically aligns acoustic musical signals with their corresponding textual lyrics, in a manner similar to manually aligned karaoke. We tackle this problem with a multimodal approach, using an appropriate pairing of audio and text processing to create the resulting prototype. LyricAlly's acoustic signal processing uses standard audio features, constrained and informed by the musical nature of the signal. The resulting detected hierarchical rhythm structure is utilized in singing voice detection and chorus detection to produce results of higher accuracy and lower computational cost than their respective baselines. Text processing is employed to approximate the length of the sung passages from the lyrics. Results show an average error of less than one bar for per-line alignment of the lyrics on a test bed of 20 songs (sampled from CD audio and carefully selected for variety). We perform a comprehensive set of system-wide and per-component tests and discuss their results. We conclude by outlining steps for further development.

  • A General Framework of Progressive Filtering and Its Application to Query by Singing/Humming

    Page(s): 350 - 358

    This paper presents the mathematical formulation and design methodology of progressive filtering (PF) for multimedia information retrieval, and discusses its application to the so-called query by singing/humming (QBSH), or more formally, melody recognition. The concept of PF and the corresponding dynamic programming-based design method are applicable to large multimedia retrieval systems for striking a balance between efficiency (in terms of response time) and effectiveness (in terms of recognition rate). The application of the proposed PF to a five-stage QBSH system is reported, and the experimental results demonstrate the feasibility of the proposed approach. View full abstract»
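
    The core idea of progressive filtering can be sketched as a cascade of matchers of increasing cost, each pruning the candidate list before the next stage runs; the stage functions and survival rates below are hypothetical, not the paper's dynamic-programming-optimized five-stage design.

    ```python
    # Conceptual sketch of a progressive-filtering cascade (hypothetical stages).
    from typing import Callable, List, Tuple

    def progressive_filter(query, candidates: List[str],
                           stages: List[Tuple[Callable, float]]):
        """stages: list of (score_fn(query, candidate) -> float, keep_fraction)."""
        survivors = list(candidates)
        for score_fn, keep_fraction in stages:
            ranked = sorted(survivors, key=lambda c: score_fn(query, c), reverse=True)
            survivors = ranked[:max(1, int(len(ranked) * keep_fraction))]
        return survivors

    # Usage idea (matchers are placeholders):
    # stages = [(coarse_pitch_histogram_match, 0.2),
    #           (linear_scaling_match, 0.2),
    #           (full_dtw_match, 1.0)]
    ```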

  • Challenging Uncertainty in Query by Humming Systems: A Fingerprinting Approach

    Page(s): 359 - 371

    Robust data retrieval in the presence of uncertainty is a challenging problem in multimedia information retrieval. In query-by-humming (QBH) systems, uncertainty can arise in query formulation due to user-dependent variability, such as incorrectly hummed notes, and in query transcription due to machine-based errors, such as insertions and deletions. We propose a fingerprinting (FP) algorithm for representing salient melodic information so as to better compare potentially noisy voice queries with target melodies in a database. The FP technique is employed in the QBH system back end; a hidden Markov model (HMM) front end segments and transcribes the hummed audio input into a symbolic representation. The performance of the FP search algorithm is compared to the conventional edit distance (ED) technique. Our retrieval database is built on 1500 MIDI files and evaluated using 400 hummed samples from 80 people with different musical backgrounds. A melody retrieval accuracy of 88% is demonstrated for humming samples from musically trained subjects, and 70% for samples from untrained subjects, for the FP algorithm. In contrast, the widely used ED method achieves 86% and 62% accuracy rates, respectively, for the same samples, thus suggesting that the proposed FP technique is more robust under uncertainty, particularly for queries by musically untrained users.
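
    For reference, the conventional edit-distance baseline that the fingerprinting approach is compared against can be sketched as a standard Levenshtein dynamic program over transcribed melody symbols; the pitch-interval sequences below are hypothetical.

    ```python
    # Edit-distance (Levenshtein) baseline over symbolic melody sequences.
    import numpy as np

    def edit_distance(query, target):
        m, n = len(query), len(target)
        D = np.zeros((m + 1, n + 1))
        D[:, 0] = np.arange(m + 1)
        D[0, :] = np.arange(n + 1)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if query[i - 1] == target[j - 1] else 1
                D[i, j] = min(D[i - 1, j] + 1,        # deletion
                              D[i, j - 1] + 1,        # insertion
                              D[i - 1, j - 1] + cost) # substitution / match
        return D[m, n]

    # Example: pitch-interval sequences from a hummed query and a MIDI target.
    print(edit_distance([2, 2, -4, 5], [2, 2, 1, -4, 5]))  # -> 1.0 (one insertion)
    ```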

  • Searching Musical Audio Using Symbolic Queries

    Page(s): 372 - 381

    Finding a piece of music based on its content is a key problem in music information retrieval. For example, a user may be interested in finding music based on knowledge of only a small fragment of the overall tune. In this paper, we consider searching musical audio using symbolic queries. We first propose a relative-pitch approach for representing queries and pieces. Experiments show that this technique, while effective, works best when the whole tune is used as a query. We then present an algorithm for matching based on a pitch-class approach, using the longest common subsequence between a query and target. Experimental evaluation shows that our technique is highly effective, with a mean average precision of 0.77 on a collection of 1808 recordings. Significantly, our technique is robust for truncated queries, being able to maintain effectiveness and retrieve correct answers whether the query fragment is taken from the beginning, middle, or end of a piece. This represents a significant reduction in the burden placed on users when formulating queries.
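
    The matching core of the pitch-class approach is the longest common subsequence; a minimal sketch follows, assuming the audio has already been transcribed into pitch-class sequences (the paper's query representation and ranking details are omitted).

    ```python
    # Longest-common-subsequence matching over pitch-class sequences (illustrative).
    import numpy as np

    def lcs_length(query, target):
        m, n = len(query), len(target)
        L = np.zeros((m + 1, n + 1), dtype=int)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if query[i - 1] == target[j - 1]:
                    L[i, j] = L[i - 1, j - 1] + 1
                else:
                    L[i, j] = max(L[i - 1, j], L[i, j - 1])
        return int(L[m, n])

    def rank_recordings(query_pcs, recordings):
        """Rank transcribed recordings (lists of pitch classes) by LCS with the query."""
        return sorted(recordings, key=lambda pcs: lcs_length(query_pcs, pcs), reverse=True)
    ```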

  • Efficient Index-Based Audio Matching

    Page(s): 382 - 395

    Given a large audio database of music recordings, the goal of classical audio identification is to identify a particular audio recording by means of a short audio fragment. Even though recent identification algorithms show a significant degree of robustness towards noise, MP3 compression artifacts, and uniform temporal distortions, the notion of similarity is rather close to the identity. In this paper, we address a higher level retrieval problem, which we refer to as audio matching: given a short query audio clip, the goal is to automatically retrieve all excerpts from all recordings within the database that musically correspond to the query. In our matching scenario, as opposed to classical audio identification, we allow semantically motivated variations as they typically occur in different interpretations of a piece of music. To this end, this paper presents an efficient and robust audio matching procedure that works even in the presence of significant variations, such as nonlinear temporal, dynamical, and spectral deviations, where existing algorithms for audio identification would fail. Furthermore, the combination of various deformation- and fault-tolerance mechanisms allows us to employ standard indexing techniques to obtain an efficient, index-based matching procedure, thus providing an important step towards semantically searching large-scale real-world music collections.
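
    How indexing can enter such a matching procedure may be sketched generically: quantize frame features to codewords, keep an inverted list per codeword, and let matches vote for (recording, offset) pairs. This is a simplification under assumed inputs; the paper's specific features and deformation- and fault-tolerance mechanisms are not shown.

    ```python
    # Generic inverted-index matching over quantized feature frames (illustrative).
    import numpy as np
    from collections import defaultdict

    def quantize(frames, codebook):
        """Map each feature frame to the index of its nearest codebook vector."""
        d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    def build_index(recordings, codebook):
        index = defaultdict(list)                 # codeword -> [(rec_id, position), ...]
        for rec_id, frames in recordings.items():
            for pos, cw in enumerate(quantize(frames, codebook)):
                index[cw].append((rec_id, pos))
        return index

    def match(query_frames, index, codebook):
        votes = defaultdict(int)                  # (rec_id, offset) -> number of matched frames
        for q_pos, cw in enumerate(quantize(query_frames, codebook)):
            for rec_id, pos in index[cw]:
                votes[(rec_id, pos - q_pos)] += 1
        return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    ```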

  • A Quick Search Method for Audio Signals Based on a Piecewise Linear Representation of Feature Trajectories

    Page(s): 396 - 407

    This paper presents a new method for a quick similarity-based search through long unlabeled audio streams to detect and locate audio clips provided by users. The method involves feature-dimension reduction based on a piecewise linear representation of a sequential feature trajectory extracted from a long audio stream. Two techniques enable us to obtain a piecewise linear representation: dynamic segmentation of feature trajectories and a segment-based Karhunen-Loève (KL) transform. In principle, the proposed search method guarantees the same search results as a search without the proposed feature-dimension reduction. Experimental results indicate significant improvements in search speed. For example, the proposed method reduced the total search time to approximately 1/12 that of previous methods and detected queries in approximately 0.3 s from a 200-h audio database.
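
    A minimal sketch of segment-wise KL (PCA) dimension reduction of a feature trajectory, assuming fixed-length segments for brevity; the paper's dynamic segmentation and its exact-search guarantees are not reproduced here.

    ```python
    # Segment-wise Karhunen-Loeve (PCA) reduction of a feature trajectory (illustrative).
    import numpy as np

    def segmentwise_kl(features, seg_len=100, n_dims=4):
        """features: (T, D) trajectory -> list of (mean, basis, coeffs) per segment."""
        pieces = []
        for start in range(0, len(features) - seg_len + 1, seg_len):
            seg = features[start:start + seg_len]
            mean = seg.mean(axis=0)
            # Eigenvectors of the segment covariance = KL basis for this segment.
            cov = np.cov((seg - mean).T)
            eigvals, eigvecs = np.linalg.eigh(cov)
            basis = eigvecs[:, ::-1][:, :n_dims]          # top-variance directions
            coeffs = (seg - mean) @ basis                  # reduced-dimension trajectory
            pieces.append((mean, basis, coeffs))
        return pieces
    ```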

  • Computational Models of Similarity for Drum Samples

    Page(s): 408 - 423

    In this paper, we optimize and evaluate computational models of similarity for sounds from the same instrument class. We investigate four instrument classes: bass drums, snare drums, high-pitched toms, and low-pitched toms. We evaluate two similarity models: one is defined in the ISO/IEC MPEG-7 standard, and the other is based on auditory images. For the second model, we study the impact of various parameters. We use data from listening tests and instrument class labels to evaluate the models. Our results show that the model based on auditory images yields a very high average correlation with human similarity ratings and clearly outperforms the MPEG-7 recommendation. The average correlations range from 0.89 to 0.96 depending on the instrument class. Furthermore, our results indicate that instrument class data can be used as an alternative to data from listening tests to evaluate sound similarity models.

  • Musical Genre Classification Using Nonnegative Matrix Factorization-Based Features

    Page(s): 424 - 434

    Nonnegative matrix factorization (NMF) is used to derive a novel description of the timbre of musical sounds. Using NMF, a spectrogram is factorized to provide a characteristic spectral basis. Given a set of spectrograms for a musical genre, the space spanned by the vectors of the obtained spectral bases is modeled statistically using mixtures of Gaussians, resulting in a description of the spectral basis for that musical genre. This description is shown to improve classification results by up to 23.3% compared to MFCC-based models, while the compression performed by the factorization decreases training time significantly. Using a distance-based stability measure, this compression is shown to reduce the noise present in the data set, resulting in more stable classification models. In addition, we compare the mean squared errors of the approximation to a spectrogram using independent component analysis and nonnegative matrix factorization, showing the superiority of the latter approach.
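
    A small illustration of the feature-extraction stage: factorize a magnitude spectrogram with multiplicative-update NMF and treat the basis vectors as timbre features. The Gaussian-mixture modeling per genre and the classifier are omitted, and the rank and iteration count below are arbitrary assumptions.

    ```python
    # Euclidean NMF with Lee-Seung multiplicative updates (illustrative feature extraction).
    import numpy as np

    def nmf(V, rank=8, iters=200, seed=0, eps=1e-9):
        """Factorize V (freq x time, nonnegative) as W @ H, minimizing Euclidean error."""
        rng = np.random.default_rng(seed)
        F, T = V.shape
        W = rng.random((F, rank)) + eps
        H = rng.random((rank, T)) + eps
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
        return W, H

    # Usage idea: V = |STFT| of a clip; the columns of W summarize its spectral basis (timbre).
    ```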

  • An Efficient Hybrid Music Recommender System Using an Incrementally Trainable Probabilistic Generative Model

    Page(s): 435 - 447

    This paper presents a hybrid music recommender system that ranks musical pieces while efficiently maintaining collaborative and content-based data, i.e., rating scores given by users and acoustic features of audio signals. This hybrid approach overcomes the conventional tradeoff between recommendation accuracy and variety of recommended artists. Collaborative filtering, which is used on e-commerce sites, cannot recommend nonrated pieces and provides a narrow variety of artists. Content-based filtering does not have satisfactory accuracy because it is based on the heuristic that a user's favorite pieces will have similar musical content, despite there being exceptions. To attain higher recommendation accuracy along with a wider variety of artists, we use a probabilistic generative model that unifies the collaborative and content-based data in a principled way. This model explains the generative mechanism of the observed data in terms of probability theory. The probability distribution over users, pieces, and features is decomposed into three conditionally independent distributions by introducing latent variables. This decomposition enables us to efficiently and incrementally adapt the model to increasing numbers of users and rating scores. We evaluated our system using audio signals of commercial CDs and their corresponding rating scores obtained from an e-commerce site. The results revealed that our system accurately recommended pieces, including nonrated ones, from a wide variety of artists and maintained a high degree of accuracy even when new users and rating scores were added.

  • A Regression Approach to Music Emotion Recognition

    Page(s): 448 - 457

    Content-based retrieval has emerged in the face of content explosion as a promising approach to information access. In this paper, we focus on the challenging issue of recognizing the emotional content of music signals, or music emotion recognition (MER). Specifically, we formulate MER as a regression problem to predict the arousal and valence values (AV values) of each music sample directly. Associated with its AV values, each music sample becomes a point in the arousal-valence plane, so users can efficiently retrieve a music sample by specifying a desired point in the emotion plane. Because no categorical taxonomy is used, the regression approach is free of the ambiguity inherent to conventional categorical approaches. To improve performance, we apply principal component analysis to reduce the correlation between arousal and valence, and RReliefF to select important features. An extensive performance study is conducted to evaluate the accuracy of the regression approach for predicting AV values. The best performance, evaluated in terms of the R² statistic, reaches 58.3% for arousal and 28.1% for valence when a support vector machine is employed as the regressor. We also apply the regression approach to detect the emotion variation within a music selection and find the prediction accuracy superior to that of existing works. A group-wise MER scheme is also developed to address the subjectivity of emotion perception.
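
    The regression formulation itself is easy to sketch: train one regressor per emotion dimension and report R². The toy features and targets below are random stand-ins for the paper's extracted audio features and its PCA/RReliefF preprocessing.

    ```python
    # Toy arousal/valence regression with support vector regression and R^2 evaluation.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 20))                     # per-clip audio features (toy)
    arousal = X[:, 0] * 0.8 + rng.normal(scale=0.5, size=300)
    valence = X[:, 1] * 0.4 + rng.normal(scale=0.9, size=300)

    for name, y in [("arousal", arousal), ("valence", valence)]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        pred = SVR(kernel="rbf", C=1.0).fit(X_tr, y_tr).predict(X_te)
        print(name, "R^2 =", round(r2_score(y_te, pred), 3))
    ```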

  • Score-Independent Audio Features for Description of Music Expression

    Page(s): 458 - 466

    During a music performance, the musician adds expressiveness to the musical message by changing the timing, dynamics, and timbre of the musical events to communicate an expressive intention. Traditionally, the analysis of music expression is based on measurements of the deviations of the acoustic parameters with respect to the written score. In this paper, we employ machine learning techniques to understand expressive communication and to derive audio features at an intermediate level, between music intended as a structured language and notes intended as sound at a more physical level. We start by extracting audio features from expressive performances that were recorded by asking the musicians to perform so as to convey different expressive intentions. We use a sequential forward selection procedure to rank and select a set of features for a general description of the expressions, and a second set specific to each instrument. We show that higher recognition ratings are achieved by using a set of four features which can be specifically related to qualitative descriptions of the sound by physical metaphors. These audio features can be used to retrieve expressive content from audio data, and to design the next generation of search engines for music information retrieval.

  • Semantic Annotation and Retrieval of Music and Sound Effects

    Page(s): 467 - 476

    We present a computer audition system that can both annotate novel audio tracks with semantically meaningful words and retrieve relevant tracks from a database of unlabeled audio content given a text-based query. We consider the related tasks of content-based audio annotation and retrieval as one supervised multiclass, multilabel problem in which we model the joint probability of acoustic features and words. We collect a data set of 1700 human-generated annotations that describe 500 Western popular music tracks. For each word in a vocabulary, we use this data to train a Gaussian mixture model (GMM) over an audio feature space. We estimate the parameters of the model using the weighted mixture hierarchies expectation maximization algorithm. This algorithm is more scalable to large data sets and produces better density estimates than standard parameter estimation techniques. The quality of the music annotations produced by our system is comparable with the performance of humans on the same task. Our “query-by-text” system can retrieve appropriate songs for a large number of musically relevant words. We also show that our audition system is general by learning a model that can annotate and retrieve sound effects.
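
    A compact sketch of the word-level modeling idea, assuming per-track feature frames and word annotations are given: fit one Gaussian mixture per vocabulary word and rank words by average log-likelihood at annotation time. Plain EM is used here instead of the paper's weighted mixture-hierarchies EM.

    ```python
    # One GMM per vocabulary word over audio features (illustrative; plain EM).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_word_models(tracks, annotations, vocabulary, n_components=4):
        """tracks: {id: (n_frames, dim) features}; annotations: {id: set of words}."""
        models = {}
        for word in vocabulary:
            frames = np.vstack([tracks[t] for t in tracks if word in annotations[t]])
            models[word] = GaussianMixture(n_components, covariance_type="diag").fit(frames)
        return models

    def annotate(track_features, models, top_k=5):
        """Return the top_k words whose models best explain the track's frames."""
        scores = {w: m.score_samples(track_features).mean() for w, m in models.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_k]
    ```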

  • IEEE Transactions on Audio, Speech, and Language Processing Edics

    Page(s): 477 - 478
  • IEEE Transactions on Audio, Speech, and Language Processing Information for authors

    Page(s): 479 - 480
  • IEEE Signal Processing Society Information

    Page(s): C3

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research