
IEEE Transactions on Audio, Speech, and Language Processing

Issue 10 • October 2013

  • Front Cover

    Page(s): C1
    PDF (401 KB) | Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing publication information

    Page(s): C2
    PDF (133 KB) | Freely Available from IEEE
  • Table of contents

    Page(s): 1989 - 1990
    PDF (225 KB) | Freely Available from IEEE
  • Table of contents

    Page(s): 1991 - 1992
    PDF (226 KB) | Freely Available from IEEE
  • Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach

    Page(s): 2015 - 2028
    PDF (1623 KB) | HTML

    In speaker diarization, standard approaches typically perform speaker clustering on some initial segmentation before refining the segment boundaries in a re-segmentation step to obtain a final diarization hypothesis. In this paper, we integrate an improved clustering method with an existing re-segmentation algorithm and, in an iterative fashion, jointly optimize both speaker cluster assignments and segmentation boundaries. For clustering, we extend our previous research using factor analysis for speaker modeling. In continuing to take advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features (i.e., i-vectors), we develop a probabilistic approach to speaker clustering by applying a Bayesian Gaussian Mixture Model (GMM) to principal component analysis (PCA)-processed i-vectors. We then utilize information at different temporal resolutions to arrive at an iterative optimization scheme that, in alternating between clustering and re-segmentation steps, demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner. Our proposed methods attain results comparable to those of a state-of-the-art benchmark set on the multi-speaker CallHome telephone corpus. We further compare our system with a Bayesian nonparametric approach to diarization and attempt to reconcile their differences in both methodology and performance.
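
    The clustering stage described above can be illustrated with off-the-shelf tools. Below is a minimal, hedged sketch in Python that assumes an i-vector matrix is already available (the random ivectors array, the dimensionalities, and the component count are placeholders, not the paper's configuration); it pairs PCA projection with a Bayesian GMM, as the abstract describes, so that superfluous components can be pruned away.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
ivectors = rng.standard_normal((200, 100))   # placeholder: 200 segments, 100-dim i-vectors

# Project the i-vectors onto a low-dimensional PCA subspace before clustering.
reduced = PCA(n_components=10).fit_transform(ivectors)

# A Bayesian GMM with a Dirichlet-process prior can leave superfluous
# components empty, giving an estimate of the speaker count as well as
# the per-segment cluster assignments.
bgmm = BayesianGaussianMixture(
    n_components=20,            # upper bound on the number of speakers
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
).fit(reduced)

labels = bgmm.predict(reduced)               # cluster (speaker) label per segment
print("estimated speakers:", len(np.unique(labels)))
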
  • A Direct Masking Approach to Robust ASR

    Page(s): 1993 - 2005
    PDF (1892 KB) | HTML

    Recently, much work has been devoted to the computation of binary masks for speech segregation. Conventional wisdom in the field of ASR holds that these binary masks cannot be used directly; the missing energy significantly affects the calculation of the cepstral features commonly used in ASR. We show that this commonly held belief may be a misconception; we demonstrate the effectiveness of directly using the masked data on both small and large vocabulary datasets. In fact, this approach, which we term the direct masking approach, performs comparably to two previously proposed missing feature techniques. We also investigate the reasons why other researchers may not have come to this conclusion; variance normalization of the features is a significant factor in performance. This work suggests a much better baseline than unenhanced speech for future work in missing feature ASR.
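
    A minimal sketch of the direct-masking idea follows, assuming a placeholder file name and a deliberately crude mask-estimation rule (the paper's masks come from a speech segregation front-end): the binary mask is applied to the noisy spectrogram, the masked signal is resynthesized, and MFCCs with variance normalization are computed directly from it.

import numpy as np
import librosa

y, sr = librosa.load("noisy_utterance.wav", sr=16000)      # placeholder input file
S = librosa.stft(y, n_fft=512, hop_length=160)

# Placeholder mask: keep time-frequency cells well above a rough noise floor.
mag = np.abs(S)
noise_floor = np.percentile(mag, 20, axis=1, keepdims=True)
mask = (mag > 2.0 * noise_floor).astype(float)

# Direct masking: zero the rejected cells and do not reconstruct the missing energy.
y_masked = librosa.istft(S * mask, hop_length=160)

# Cepstral features computed on the masked signal, with per-utterance
# mean/variance normalization, which the abstract flags as important.
mfcc = librosa.feature.mfcc(y=y_masked, sr=sr, n_mfcc=13)
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
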
  • Semi-Blind Noise Extraction Using Partially Known Position of the Target Source

    Page(s): 2029 - 2041
    PDF (2978 KB) | HTML

    An extracted noise signal provides important information for subsequent enhancement of a target signal. When the target's position is fixed, the noise extractor can be a target-cancellation filter derived in a noise-free situation. In this paper, we consider a situation in which such cancellation filters are prepared in advance for a set of several possible positions of the target. The set of filters is interpreted as prior information available for noise extraction when the target's exact position is unknown. Our novel method looks for a linear combination of the prepared filters via Independent Component Analysis. The method yields a filter with better cancellation performance than the individual filters or filters based on a minimum variance principle. The method is tested in a highly noisy and reverberant real-world environment with a moving target source and interferers. Post-processing with a Wiener filter using the noise signal extracted by the method improves the signal-to-noise ratio of the target by up to 8 dB.
  • An Experimental Analysis on Integrating Multi-Stream Spectro-Temporal, Cepstral and Pitch Information for Mandarin Speech Recognition

    Page(s): 2006 - 2014
    PDF (979 KB) | HTML

    Gabor features have been proposed for extracting spectro-temporal modulation information from speech signals, and have been shown to yield large improvements in recognition accuracy. We use a flexible Tandem system framework that integrates multi-stream information, including Gabor, MFCC, and pitch features, in various ways, modeling either or both of the tone and phoneme variations in Mandarin speech recognition. We use either phonemes or tonal phonemes (tonemes) as the target classes of MLP posterior estimation and/or as the acoustic units of HMM recognition. The experiments yield a comprehensive analysis of the contributions to recognition accuracy made by each of the feature sets. We discuss their complementarities in tone, phoneme, and toneme classification. We show that Gabor features are better for recognition of vowels and unvoiced consonants, while MFCCs are better for voiced consonants. Also, Gabor features are capable of capturing changes in signals across time and frequency bands caused by Mandarin tone patterns, while pitch features offer further tonal information. This explains why the integration of Gabor, MFCC, and pitch features offers such significant improvements.
  • Accurate Estimation of Low Fundamental Frequencies From Real-Valued Measurements

    Page(s): 2042 - 2056
    PDF (2599 KB) | HTML

    In this paper, the difficult problem of estimating low fundamental frequencies from real-valued measurements is addressed. The methods commonly employed do not take the phenomena encountered in this scenario into account and thus fail to deliver accurate estimates. The reason is that they employ asymptotic approximations that are violated when the harmonics are not well separated in frequency, which happens when the observed signal is real-valued and the fundamental frequency is low. To mitigate this, we analyze the problem and present exact fundamental frequency estimators aimed at solving it. These estimators are based on the principles of nonlinear least squares, harmonic fitting, optimal filtering, subspace orthogonality, and shift invariance, and they all reduce to already published methods for a large number of observations. In experiments, the methods are compared and the increased accuracy obtained by avoiding asymptotic approximations is demonstrated.
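
    One of the principles named above, nonlinear least squares, can be sketched compactly: for each candidate fundamental, project the real-valued frame onto the full harmonic cosine/sine basis and keep the candidate that explains the most energy. This avoids treating the harmonics as orthogonal; the grid, harmonic count, and toy signal below are illustrative choices, not the paper's estimators.

import numpy as np

def nls_f0(x, fs, f0_grid, n_harmonics=5):
    n = np.arange(len(x))
    best_f0, best_cost = None, -np.inf
    for f0 in f0_grid:
        # Real-valued harmonic basis: cos/sin at k*f0 for k = 1..L.
        Z = np.column_stack(
            [np.cos(2 * np.pi * k * f0 * n / fs) for k in range(1, n_harmonics + 1)]
            + [np.sin(2 * np.pi * k * f0 * n / fs) for k in range(1, n_harmonics + 1)]
        )
        # Exact least-squares fit; no approximation of the basis as orthogonal.
        amp, *_ = np.linalg.lstsq(Z, x, rcond=None)
        cost = x @ (Z @ amp)        # energy explained by the harmonic model
        if cost > best_cost:
            best_f0, best_cost = f0, cost
    return best_f0

# Toy usage: a 60 Hz fundamental with three harmonics, 0.1 s at 8 kHz.
fs = 8000
t = np.arange(int(0.1 * fs)) / fs
x = sum(np.sin(2 * np.pi * 60 * (k + 1) * t) / (k + 1) for k in range(3))
print(nls_f0(x, fs, np.arange(40.0, 100.0, 0.5)))   # close to 60.0
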
  • Multiobjective Time Series Matching for Audio Classification and Retrieval

    Page(s): 2057 - 2072
    PDF (2707 KB) | HTML

    Seeking sound samples in a massive database can be a tedious and time-consuming task. Even when metadata are available, query results may remain far from the timbre expected by users. This problem stems from the nature of query specification, which does not account for the underlying complexity of audio data. The Query By Example (QBE) paradigm tries to tackle this shortcoming by finding audio clips similar to a given sound example. However, it requires users to have a well-formed soundfile of what they seek, which is not always a valid assumption. Furthermore, most audio-retrieval systems rely on a single measure of similarity, which is unlikely to convey the perceptual similarity of audio signals. In this paper, we address an innovative way of querying generic audio databases by simultaneously optimizing the temporal evolution of multiple spectral properties. We show how this problem can be cast into a new approach merging multiobjective optimization and time series matching, called MultiObjective Time Series (MOTS) matching. We formally state this problem and report an efficient implementation. This approach introduces a multidimensional assessment of similarity in audio matching, which allows us to cope with the multidimensional nature of timbre perception and to obtain a set of efficient candidate solutions rather than a single best match. To demonstrate the performance of our approach, we show its effectiveness in audio classification tasks. By introducing a selection criterion based on the hypervolume dominated by a class, we show that our approach outperforms state-of-the-art methods in audio classification even with a small number of features. We demonstrate its robustness to several classes of audio distortions. Finally, we introduce two innovative applications of our method for sound querying.
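
    The multiobjective step can be made concrete with a small Pareto-front routine: each database item receives one dissimilarity per spectral descriptor, and instead of a single ranking the query returns the set of non-dominated items. The distance values below are illustrative placeholders rather than actual descriptor matches.

import numpy as np

def dominates(a, b):
    """True if cost vector a is no worse than b in every objective and better in at least one."""
    return np.all(a <= b) and np.any(a < b)

def pareto_front(costs):
    """Indices of the non-dominated rows of an (items x objectives) cost matrix."""
    return [i for i, ci in enumerate(costs)
            if not any(dominates(cj, ci) for j, cj in enumerate(costs) if j != i)]

# Toy usage: six candidate sounds scored on two descriptor distances.
costs = np.array([[0.2, 0.90], [0.4, 0.40], [0.9, 0.10],
                  [0.5, 0.50], [0.3, 0.95], [0.6, 0.20]])
print(pareto_front(costs))   # -> [0, 1, 2, 5], the set of efficient matches
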
  • Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition

    Page(s): 2073 - 2084
    PDF (2172 KB) | HTML

    In this paper, we propose a discriminative dynamic Gaussian mixture selection (DGMS) strategy to generate reliable accent-specific units (ASUs) for multi-accent speech recognition. Time-aligned phone recognition is used to generate the ASUs, which model accent variations explicitly and accurately. DGMS reconstructs and adjusts a pre-trained set of hidden Markov model (HMM) state densities to build dynamic observation densities for each input speech frame. A discriminative minimum classification error criterion is adopted to optimize the sizes of the HMM state observation densities with a genetic algorithm (GA). To the authors' knowledge, this is the first time discriminative training has been applied to such discrete variables. We find that the proposed framework covers more multi-accent variation, and thus reduces the performance loss incurred by pruned beam search, without increasing the model size of the original acoustic model set. Evaluation on three typical Chinese accents, Chuan, Yue and Wu, shows that our approach outperforms traditional acoustic model reconstruction techniques with syllable error rate reductions of 8.0%, 5.5% and 5.0%, respectively, while maintaining good performance on standard Putonghua speech.
  • Analysis and Synthesis of Speech Using an Adaptive Full-Band Harmonic Model

    Page(s): 2085 - 2095
    PDF (1825 KB) | HTML

    Voice models often use frequency limits to split the speech spectrum into two or more voiced/unvoiced frequency bands. However, from a voice-production point of view, the amplitude spectrum of the voiced source decreases smoothly without any abrupt frequency limit. Accordingly, multiband models struggle to estimate these limits and, as a consequence, artifacts can degrade the perceived quality. Using a linear frequency basis adapted to the non-stationarities of the speech signal, the Fan Chirp Transformation (FChT) has demonstrated harmonicity at frequencies higher than usually observed from the DFT, which motivates a full-band model. The previously proposed Adaptive Quasi-Harmonic model (aQHM) offers even more flexibility than the FChT by using a non-linear frequency basis. In the current paper, exploiting the properties of aQHM, we describe a full-band Adaptive Harmonic Model (aHM) along with detailed descriptions of its corresponding algorithms for the estimation of harmonics up to the Nyquist frequency. Formal listening tests show that speech reconstructed using aHM is nearly indistinguishable from the original. Experiments with synthetic signals also show that the proposed aHM globally outperforms previous sinusoidal and harmonic models in terms of precision in estimating the sinusoidal parameters. Such precision is of interest for building higher-level models upon the sinusoidal parameters, such as spectral envelopes for speech synthesis.
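
    The notion of a full-band harmonic model can be illustrated by resynthesizing all harmonics of a time-varying fundamental up to the Nyquist frequency from a shared phase track. The f0 trajectory and 1/k amplitude roll-off below are synthetic placeholders; a real aHM analysis would estimate per-harmonic amplitudes and phases adaptively from speech.

import numpy as np

fs = 16000
t = np.arange(int(0.5 * fs)) / fs

f0 = 120.0 + 20.0 * np.sin(2 * np.pi * 2.0 * t)   # slowly varying fundamental (Hz)
phase = 2 * np.pi * np.cumsum(f0) / fs            # running phase of the first harmonic

n_harm = int((fs / 2) / f0.max())                 # every harmonic stays below Nyquist
x = np.zeros_like(t)
for k in range(1, n_harm + 1):
    x += np.cos(k * phase) / k                    # k-th harmonic follows the same f0 track
x /= np.max(np.abs(x))                            # normalized full-band harmonic signal
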
  • Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation

    Page(s): 2096 - 2107
    PDF (1808 KB) | HTML

    Separating singing voice from music accompaniment can be of interest for many applications such as melody extraction, singer identification, lyrics alignment and recognition, and content-based music retrieval. In this paper, a novel algorithm for singing voice separation in monaural mixtures is proposed. The algorithm consists of two stages, in which non-negative matrix factorization (NMF) is applied to decompose the mixture spectrograms with long and short windows, respectively. A spectral discontinuity thresholding method is devised for the long-window NMF to select NMF components originating from pitched instrumental sounds, and a temporal discontinuity thresholding method is designed for the short-window NMF to pick out NMF components that come from percussive sounds. By eliminating the selected components, most pitched and percussive elements of the music accompaniment are filtered out of the input sound mixture, with little effect on the singing voice. Extensive testing on the MIR-1K public dataset of 1000 short audio clips and the Beach-Boys dataset of 14 full-track real-world songs shows that the proposed algorithm is both effective and efficient.
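
    A single stage of the pipeline can be sketched with a stock NMF: factor a long-window magnitude spectrogram, flag components whose activations are temporally smooth as pitched accompaniment, and leave the rest to the voice. The smoothness score and threshold below are crude stand-ins for the spectral-discontinuity test described in the abstract, and the input file is a placeholder.

import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("mixture.wav", sr=None)               # placeholder song mixture
S = np.abs(librosa.stft(y, n_fft=4096, hop_length=1024))   # long analysis window

model = NMF(n_components=30, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(S)        # spectral bases (freq x K)
H = model.components_             # activations (K x time)

# Stand-in discontinuity measure: pitched accompaniment components tend to have
# temporally smooth activations; large frame-to-frame jumps suggest the voice.
jump = np.mean(np.abs(np.diff(H, axis=1)), axis=1) / (np.mean(H, axis=1) + 1e-8)
accomp_idx = np.where(jump < 0.5)[0]                       # placeholder threshold

accomp_mag = W[:, accomp_idx] @ H[accomp_idx, :]           # estimated accompaniment magnitude
voice_mag = np.maximum(S - accomp_mag, 0.0)                # residual attributed to the voice
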
  • Non-Negative Temporal Decomposition of Speech Parameters by Multiplicative Update Rules

    Page(s): 2108 - 2117
    PDF (1984 KB) | HTML

    I propose a non-negative temporal decomposition method for line spectral pairs and articulatory parameters based on multiplicative update rules. These parameters are decomposed into a set of temporally overlapped unimodal event functions restricted to the range [0,1] and corresponding event vectors. When line spectral pairs are used, the event vectors preserve their ordering property. With the proposed method, the RMS error between the measured and reconstructed articulatory parameters is 0.21 mm, and the spectral distance between the measured and reconstructed line spectral pair parameters is 2.0 dB. The RMS error and spectral distance of the proposed method are smaller than those of conventional methods. This technique will be useful for many applications in speech coding and speech modification.
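
    The flavor of the multiplicative-update decomposition can be conveyed with a small routine: parameter trajectories Y (parameters x frames) are approximated by event vectors A times event functions F, with F kept in [0, 1]. The updates below are the standard Frobenius-norm NMF rules plus a clip on F; they are an illustrative stand-in, not the paper's exact algorithm (which additionally enforces unimodal, temporally overlapped event functions).

import numpy as np

def nn_temporal_decomposition(Y, n_events=8, n_iter=500, eps=1e-9):
    rng = np.random.default_rng(0)
    P, T = Y.shape
    A = rng.random((P, n_events))        # event vectors
    F = rng.random((n_events, T))        # event functions, kept in [0, 1]
    for _ in range(n_iter):
        A *= (Y @ F.T) / (A @ F @ F.T + eps)
        F *= (A.T @ Y) / (A.T @ A @ F + eps)
        F = np.clip(F, 0.0, 1.0)
    return A, F

# Toy usage with random non-negative "parameter" trajectories.
Y = np.abs(np.random.default_rng(1).standard_normal((12, 300)))
A, F = nn_temporal_decomposition(Y)
print(np.sqrt(np.mean((Y - A @ F) ** 2)))    # RMS reconstruction error
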
  • Learning Optimal Features for Polyphonic Audio-to-Score Alignment

    Page(s): 2118 - 2128
    PDF (1582 KB) | HTML

    This paper addresses the design of feature functions for matching a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first uses a best-fit criterion (minimum divergence), while the second exploits a discriminative framework based on a Conditional Random Field model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that, with several types of models using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions, are compared in this alignment task. Our experiments favor the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.
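
    The feature-function idea can be sketched in a few lines: a binary note vector from the score is mapped into the audio domain by a (learned) linear transformation to form a template, and the frame/template dissimilarity is a symmetric Kullback-Leibler divergence, the measure the experiments above favor. The mapping matrix and spectra below are random placeholders standing in for learned and observed quantities.

import numpy as np

def sym_kl(p, q, eps=1e-10):
    p = p / (p.sum() + eps) + eps
    q = q / (q.sum() + eps) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

n_pitches, n_bins = 88, 256
M = np.abs(np.random.default_rng(0).standard_normal((n_bins, n_pitches)))  # placeholder learned mapping

score_vector = np.zeros(n_pitches)
score_vector[[39, 43, 46]] = 1.0          # a C major triad (0-based piano-key indices)
template = M @ score_vector               # template spectrum for this score event

audio_frame = np.abs(np.random.default_rng(1).standard_normal(n_bins))     # placeholder observed spectrum
print(sym_kl(audio_frame, template))      # dissimilarity used by the aligner
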
  • Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

    Page(s): 2129 - 2139
    PDF (1679 KB) | HTML

    This paper presents a new spectral modeling method for statistical parametric speech synthesis. In the conventional methods, high-level spectral parameters, such as mel-cepstra or line spectral pairs, are adopted as the features for hidden Markov model (HMM)-based parametric speech synthesis. Our proposed method improves the conventional method in two ways. First, distributions of low-level, un-transformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of using a single Gaussian distribution, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or the DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion with the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMMs and DBN-HMMs are able to generate spectral envelope parameter sequences better than the conventional Gaussian-HMM, with superior generalization capabilities, and that DBN-HMMs and RBM-HMMs perform similarly, possibly due to the use of the Gaussian approximation. As a result, our proposed method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
  • Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization

    Page(s): 2140 - 2151
    PDF (1964 KB) | HTML

    Reducing the interference noise in a monaural noisy speech signal has been a challenging task for many years. Compared to traditional unsupervised speech enhancement methods, e.g., Wiener filtering, supervised approaches, such as algorithms based on hidden Markov models (HMMs), lead to higher-quality enhanced speech signals. However, the main practical difficulty of these approaches is that a model must be trained a priori for each noise type. In this paper, we investigate a new class of supervised speech denoising algorithms using nonnegative matrix factorization (NMF). We propose a novel speech enhancement method based on a Bayesian formulation of NMF (BNMF). To circumvent the mismatch problem between the training and testing stages, we propose two solutions. First, we use an HMM in combination with BNMF (BNMF-HMM) to derive a minimum mean square error (MMSE) estimator for the speech signal with no information about the underlying noise type. Second, we suggest a scheme to learn the required noise BNMF model online, which is then used to develop an unsupervised speech enhancement system. Extensive experiments are carried out to investigate the performance of the proposed methods under different conditions. Moreover, we compare the performance of the developed algorithms with state-of-the-art speech enhancement schemes using various objective measures. Our simulations show that the proposed BNMF-based methods outperform the competing algorithms substantially.
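
    The supervised NMF recipe can be sketched with plain (non-Bayesian) NMF as a stand-in for BNMF: learn speech and noise spectral bases offline, estimate only the activations on the noisy spectrogram with the dictionary held fixed, then apply a Wiener-like gain. File names, dictionary sizes, and iteration counts are placeholders.

import numpy as np
import librosa
from sklearn.decomposition import NMF

def learn_basis(wav, n_comp):
    y, _ = librosa.load(wav, sr=16000)
    S = np.abs(librosa.stft(y, n_fft=512, hop_length=128))
    return NMF(n_components=n_comp, max_iter=300, random_state=0).fit(S.T).components_.T  # (freq x n_comp)

W = np.hstack([learn_basis("clean_speech.wav", 32),    # speech dictionary (placeholder file)
               learn_basis("noise_sample.wav", 16)])   # noise dictionary (placeholder file)

y, sr = librosa.load("noisy.wav", sr=16000)
Y = librosa.stft(y, n_fft=512, hop_length=128)
V = np.abs(Y)

# Multiplicative updates for the activations H with the dictionary W held fixed.
H = np.full((W.shape[1], V.shape[1]), 0.1, dtype=V.dtype)
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)

speech_mag = W[:, :32] @ H[:32]
noise_mag = W[:, 32:] @ H[32:]
mask = speech_mag / (speech_mag + noise_mag + 1e-9)     # Wiener-like gain
enhanced = librosa.istft(Y * mask, hop_length=128)
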
  • Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems

    Page(s): 2152 - 2161
    PDF (1354 KB) | HTML

    Model adaptation techniques are an efficient way to reduce the mismatch that typically occurs between the training and test conditions of any automatic speech recognition (ASR) system. This work addresses the problem of increased degradation in performance when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist (or hybrid) hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR). Adapting hybrid HMM/ANN systems on a small amount of adaptation data has proven to be a difficult task and has been a limiting factor in the widespread deployment of hybrid techniques in operational ASR systems. Addressing the crucial issue of speaker adaptation (SA) for hybrid HMM/ANN systems can thereby have a great impact on the connectionist paradigm, which will play a major role in the design of next-generation LVCSR systems, considering the great success reported by deep neural networks (ANNs with many hidden layers that adopt a pre-training technique) on many speech tasks. Current adaptation techniques for ANNs, based on injecting an adaptable linear transformation network connected to either the input or the output layer, are not effective, especially with a small amount of adaptation data, e.g., a single adaptation utterance. In this paper, a novel solution is proposed to overcome those limits and make adaptation robust to scarce adaptation resources. The key idea is to adapt the hidden activation functions rather than the network weights. The adoption of Hermitian activation functions makes this possible. Experimental results on an LVCSR task demonstrate the effectiveness of the proposed approach.
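
    The central idea, adapting activation functions instead of weights, can be sketched by writing each hidden unit's activation as a short Hermite series whose coefficients are the only speaker-dependent parameters. The basis order and shapes below are illustrative assumptions, not the paper's network configuration.

import numpy as np
from numpy.polynomial.hermite_e import hermeval

def hermite_activation(z, coeffs):
    """h(z) = sum_k c_k He_k(z), with one coefficient vector per hidden unit.

    z      : (batch, units) pre-activations
    coeffs : (order+1, units) probabilists' Hermite coefficients
    """
    out = np.empty_like(z)
    for u in range(z.shape[1]):
        out[:, u] = hermeval(z[:, u], coeffs[:, u])
    return out

# Toy usage: start from the identity-like He_1(z) = z; speaker adaptation would
# re-estimate `coeffs` on a single utterance while the weight matrices stay frozen.
rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))          # 4 frames, 8 hidden units
coeffs = np.zeros((4, 8))                # Hermite order 3, one column per unit
coeffs[1, :] = 1.0
print(hermite_activation(z, coeffs).shape)   # (4, 8)
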
  • Blind Channel Magnitude Response Estimation in Speech Using Spectrum Classification

    Page(s): 2162 - 2171
    PDF (2514 KB) | HTML

    We present an algorithm for blind estimation of the magnitude response of an acoustic channel from single-microphone observations of a speech signal. The algorithm employs channel-robust RASTA-filtered Mel-frequency cepstral coefficients as features to train a Gaussian mixture model based classifier, and average clean speech spectra are associated with each mixture component; these are then used to blindly estimate the acoustic channel magnitude response from speech that has undergone spectral modification due to the channel. Experimental results using a variety of simulated and measured acoustic channels and additive babble noise, car noise and white Gaussian noise are presented. The results demonstrate that the proposed method is able to estimate a variety of channel magnitude responses to within an Itakura distance of dI ≤ 0.5 for SNR ≥ 10 dB.
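
    The estimation principle can be summarized in a few lines: classify each frame using the channel-robust features, look up the average clean log-spectrum associated with each mixture component, and attribute the average difference between observed and predicted-clean log-spectra to the channel. The per-component clean spectra, posteriors, and dimensions below are random placeholders; in the actual system they would come from the trained GMM classifier.

import numpy as np

K, n_bins = 16, 129
clean_avg_logspec = np.random.default_rng(0).standard_normal((K, n_bins))   # placeholder lookup table

def estimate_channel(frame_logspec, frame_posteriors):
    """frame_logspec: (frames x bins) observed log-spectra;
    frame_posteriors: (frames x K) GMM posteriors computed from RASTA-MFCC features."""
    expected_clean = frame_posteriors @ clean_avg_logspec     # per-frame clean-speech estimate
    # Channel log-magnitude response = average observed-minus-clean difference.
    return np.mean(frame_logspec - expected_clean, axis=0)

# Toy usage with random stand-ins for the observations and posteriors.
obs = np.random.default_rng(1).standard_normal((100, n_bins))
post = np.random.default_rng(2).random((100, K))
post /= post.sum(axis=1, keepdims=True)
print(estimate_channel(obs, post).shape)     # (129,) channel log-magnitude estimate
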
  • Feature Enhancement With Joint Use of Consecutive Corrupted and Noise Feature Vectors With Discriminative Region Weighting

    Page(s): 2172 - 2181
    PDF (2370 KB) | HTML

    This paper proposes a feature enhancement method that can achieve high speech recognition performance in a variety of noise environments with feasible computational cost. Like the well-known Stereo-based Piecewise Linear Compensation for Environments (SPLICE) algorithm, the proposed method learns a piecewise linear transformation to map corrupted feature vectors to the corresponding clean features, which enables efficient operation. To make the feature enhancement process adaptive to changes in noise, the piecewise linear transformation is performed by using a subspace of the joint space of corrupted and noise feature vectors, where the subspace is chosen such that classes (i.e., Gaussian mixture components) of the underlying clean feature vectors can be best predicted. In addition, we propose utilizing temporally adjacent frames of corrupted and noise features in order to leverage the dynamic characteristics of the feature vectors. To prevent overfitting caused by the high dimensionality of the extended feature vectors covering the neighboring frames, we introduce a regularized weighted minimum mean square error criterion. The proposed method achieved relative improvements of 34.2% and 22.2% over SPLICE under the clean and multi-style conditions, respectively, on the Aurora 2 task.
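
    The piecewise linear mapping can be sketched in the classic SPLICE form: a GMM partitions the corrupted-feature space, each mixture component carries its own correction vector, and the enhanced frame is the posterior-weighted sum of corrections. The corrections below are random placeholders; in practice they are learned from stereo clean/corrupted data, and the paper's method additionally conditions on noise features and neighboring frames.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
corrupted = rng.standard_normal((5000, 13))      # placeholder corrupted MFCC frames
K = 8

gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0).fit(corrupted)

# Per-component bias vectors r_k (learned from stereo data in SPLICE; random here).
r = rng.standard_normal((K, 13)) * 0.1

def enhance(x):
    """x: (frames x dims) corrupted features -> posterior-weighted corrected features."""
    post = gmm.predict_proba(x)                  # (frames x K) component posteriors
    return x + post @ r                          # x_hat = x + sum_k p(k|x) r_k

print(enhance(corrupted[:3]).shape)              # (3, 13)
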
  • Noise Model Transfer: Novel Approach to Robustness Against Nonstationary Noise

    Page(s): 2182 - 2192
    PDF (1802 KB) | HTML

    This paper proposes an approach, called noise model transfer (NMT), for estimating the rapidly changing parameter values of a feature-domain noise model, which can be used to enhance feature vectors corrupted by highly nonstationary noise. Unlike conventional methods, the proposed approach can exploit both the observed feature vectors, representing spectral envelopes, and other signal properties that are usually discarded during feature extraction but that are useful for separating nonstationary noise from speech. Specifically, we assume the availability of a noise power spectrum estimator that can capture rapid changes in noise characteristics by leveraging such signal properties. NMT determines the transformation from the estimated noise power spectra to the feature-domain noise model parameter values that is optimal in the maximum likelihood sense. NMT is successfully applied to meeting speech recognition, where the main noise sources are competing talkers, and to reverberant speech recognition, where the late reverberation is regarded as highly nonstationary additive noise.
  • Real-Time Multiple Sound Source Localization and Counting Using a Circular Microphone Array

    Page(s): 2193 - 2206
    PDF (2490 KB) | HTML

    In this work, a multiple sound source localization and counting method is presented that imposes relaxed sparsity constraints on the source signals. A uniform circular microphone array is used to overcome the ambiguities of linear arrays; however, the underlying concepts (sparse component analysis and matching pursuit-based operation on the histogram of estimates) are applicable to any microphone array topology. Our method is based on detecting time-frequency (TF) zones where one source is dominant over the others. Using appropriately selected TF components in these “single-source” zones, the proposed method jointly estimates the number of active sources and their corresponding directions of arrival (DOAs) by applying a matching pursuit-based approach to the histogram of DOA estimates. The method is shown to have excellent performance for DOA estimation and source counting, and to be highly suitable for real-time applications due to its low complexity. Through simulations (in various signal-to-noise ratio conditions and reverberant environments) and real-environment experiments, we show that our method outperforms other state-of-the-art DOA estimation and source counting methods in terms of accuracy, while being significantly more efficient in terms of computational complexity.
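
    The histogram stage lends itself to a compact sketch: given per-TF-point DOA estimates from the single-source zones, build a smoothed circular histogram and greedily extract peaks in a matching-pursuit-like loop, which yields the source count and the DOAs together. The kernel width and stopping ratio are illustrative assumptions, not the paper's settings.

import numpy as np

def count_and_localize(doa_deg, kernel_width=8.0, stop_ratio=0.2, max_sources=6):
    bins = np.arange(360)
    hist, _ = np.histogram(np.mod(doa_deg, 360), bins=np.arange(361))
    # Circular Gaussian kernel used both for smoothing and for peak removal.
    kernel = np.exp(-0.5 * (np.minimum(bins, 360 - bins) / kernel_width) ** 2)
    residual = np.real(np.fft.ifft(np.fft.fft(hist.astype(float)) * np.fft.fft(kernel)))
    first_peak, doas = None, []
    for _ in range(max_sources):
        peak = int(np.argmax(residual))
        if first_peak is None:
            first_peak = residual[peak]
        elif residual[peak] < stop_ratio * first_peak:
            break                                 # remaining peaks too weak: stop counting
        doas.append(peak)
        residual -= residual[peak] * np.roll(kernel, peak)   # subtract the matched peak
        residual = np.maximum(residual, 0.0)
    return doas

# Toy usage: noisy DOA estimates clustered around 40 and 200 degrees plus outliers.
rng = np.random.default_rng(0)
est = np.concatenate([rng.normal(40, 5, 400), rng.normal(200, 5, 300), rng.uniform(0, 360, 100)])
print(count_and_localize(est))                    # roughly [40, 200]
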
  • Automatic Ontology Generation for Musical Instruments Based on Audio Analysis

    Page(s): 2207 - 2220
    PDF (2431 KB) | HTML

    In this paper we present a novel hybrid system that involves a formal method of automatic ontology generation for web-based audio signal processing applications. An ontology is seen as a knowledge management structure that represents domain knowledge in a machine-interpretable format. It describes concepts and relationships within a particular domain, in our case, the domain of musical instruments. However, the different tasks of ontology engineering, including manual annotation, hierarchical structuring and organization of data, can be laborious and challenging. For these reasons, we investigate how the process of creating ontologies can be made less dependent on human supervision by exploring concept analysis techniques in a Semantic Web environment. In this study, various musical instruments, from wind to string families, are classified using timbre features extracted from audio. To obtain models of the analysed instrument recordings, we use K-means clustering to determine an optimised codebook of Line Spectral Frequencies (LSFs) or Mel-frequency Cepstral Coefficients (MFCCs). Two classification techniques, based on a Multi-Layer Perceptron (MLP) neural network and Support Vector Machines (SVM), were tested. Then, Formal Concept Analysis (FCA) is used to automatically build the hierarchical structure of musical instrument ontologies. Finally, the generated ontologies are expressed using the Web Ontology Language (OWL). System performance was evaluated under natural recording conditions using databases of isolated notes and melodic phrases. Analyses of variance (ANOVA) were conducted with the feature and classifier attributes as independent variables and the musical instrument recognition F-measure as the dependent variable. Based on these statistical analyses, a detailed comparison between musical instrument recognition models is made to investigate their effects on the automatic ontology generation system. The proposed system is general and also applicable to other research fields that are related to ontologies and the Semantic Web.
  • IEEE Transactions on Audio, Speech, and Language Processing Edics

    Page(s): 2221 - 2222
    PDF (108 KB) | Freely Available from IEEE
  • IEEE Transactions on Audio, Speech, and Language Processing Information for Authors

    Page(s): 2223 - 2224
    PDF (146 KB) | Freely Available from IEEE

Aims & Scope

IEEE Transactions on Audio, Speech, and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.

 

This Transactions ceased publication in 2013. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Meet Our Editors

Editor-in-Chief
Li Deng
Microsoft Research