<?xml version="1.0" ?>
<rss version="2.0">
	<channel>
		<title><![CDATA[ Audio, Speech, and Language Processing, IEEE Transactions on - new TOC ]]></title>
		<link>http://ieeexplore.ieee.org</link>
		<description>TOC Alert for Publication# 10376 </description>
		<year>2012</year>
		<month>February </month>
		<day>10</day>
		<item>
			<title><![CDATA[Speaker Identification and Verification by Combining MFCC and Phase Information]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047571]]></link>
			<description><![CDATA[In conventional speaker recognition methods based on Mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. In this paper, we propose a phase information extraction method that normalizes the variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods. There is a problem with the original phase information extraction method when comparing two phase values. For example, the difference between the two values <formula formulatype="inline"><tex Notation="TeX">$\pi - \tilde{\theta}_{1}$</tex></formula> and <formula formulatype="inline"><tex Notation="TeX">$\tilde{\theta}_{2} = -\pi + \tilde{\theta}_{1}$</tex></formula> is <formula formulatype="inline"><tex Notation="TeX">$2\pi - 2\tilde{\theta}_{1}$</tex></formula>. If <formula formulatype="inline"><tex Notation="TeX">$\tilde{\theta}_{1} \approx 0$</tex></formula>, then the difference is <formula formulatype="inline"><tex Notation="TeX">$\approx 2\pi$</tex></formula>, despite the two phases being very similar to one another. To address this problem, we map the phase into coordinates on a unit circle. Speaker identification and verification experiments are performed using the NTT database, which consists of sentences uttered by 35 (22 male and 13 female) Japanese speakers with normal, fast and slow speaking modes during five sessions. Although the phase information-based method performs worse than the MFCC-based method, it augments the MFCC and the combination is useful for speaker recognition. The proposed modified phase information is more robust than the original phase information for all speaking modes. By integrating the modified phase information with the MFCCs, the speaker identification rate was improved to 98.8% from 97.4% (MFCC), and the equal error rate for speaker verification was reduced to 0.45% from 0.72% (MFCC). We also conducted the speaker identification and verification experiments on the large-scale Japanese Newspaper Article Sentences (JNAS) database and obtained a trend similar to that on the NTT database.]]></description>
			<pubDate><![CDATA[May  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047571]]></guid>
			<volume>20</volume>
			<issue>4</issue>
			<startPage>1085</startPage>
			<endPage>1095</endPage>
			<fileSize>1226</fileSize>
			<authors><![CDATA[Nakagawa, S.;Wang, L.;Ohtsuka, S.;]]></authors>
		</item>
		<item>
			<title><![CDATA[A Generative Context Model for Semantic Music Annotation and Retrieval]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047567]]></link>
			<description><![CDATA[While a listener may derive semantic associations for audio clips from direct auditory cues (e.g., hearing “bass guitar”) as well as from “context” (e.g., inferring “bass guitar” in the context of a “rock” song), most state-of-the-art systems for automatic music annotation ignore this context. Indeed, although contextual relationships correlate tags, many auto-taggers model tags independently. This paper presents a novel, generative approach to improve automatic music annotation by modeling contextual relationships between tags. A Dirichlet mixture model (DMM) is proposed as a second, additional stage in the modeling process, to supplement any auto-tagging system that generates a semantic multinomial (SMN) over a vocabulary of tags when annotating a song. For each tag in the vocabulary, a DMM captures the broader context the tag defines by modeling tag co-occurrence patterns in the SMNs of songs associated with the tag. When annotating songs, the DMMs refine SMN annotations by leveraging contextual evidence. Experimental results demonstrate the benefits of combining a variety of auto-taggers with this generative context model. It generally outperforms other approaches to modeling context as well.]]></description>
			<pubDate><![CDATA[May  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047567]]></guid>
			<volume>20</volume>
			<issue>4</issue>
			<startPage>1096</startPage>
			<endPage>1108</endPage>
			<fileSize>1109</fileSize>
			<authors><![CDATA[Miotto, R.;Lanckriet, G.;]]></authors>
		</item>
		<item>
			<title><![CDATA[A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047570]]></link>
			<description><![CDATA[Most spoken Chinese dialects lack comprehensive digital pronunciation databases, which are crucial for speech processing tasks. Given complete pronunciation databases for related dialects, one can use supervised learning techniques to predict a Chinese character's pronunciation in a target dialect based on the character's features and its pronunciation in other related dialects. Unfortunately, Chinese dialect pronunciation databases are far from complete. We propose a novel generative model that makes use of both existing dialect pronunciation data and medieval rime books to discover patterns that exist in multiple dialects. The proposed model can augment missing dialectal pronunciations based on existing dialect pronunciation tables (even if incomplete) and the pronunciation data in rime books. The augmented pronunciation database can then be used in supervised learning settings. We evaluate the prediction accuracy in terms of phonological features, such as tone, initial phoneme, and final phoneme. For each character, the features are also evaluated as a whole, giving the overall pronunciation feature accuracy (OPFA). The results of our first experiment show that adding features from dialectal pronunciation data to our baseline rime-book model dramatically improves OPFA using the support vector machine (SVM) model. In the second experiment, we compare the performance of the SVM model using phonological features from closely related dialects with that of the model using phonological features from non-closely related dialects. The experimental results show that using features from closely related dialects results in higher accuracy. In the third experiment, we show that using our proposed data augmentation model to fill in missing data can increase the SVM model's OPFA by up to 7.6%.]]></description>
			<pubDate><![CDATA[May  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047570]]></guid>
			<volume>20</volume>
			<issue>4</issue>
			<startPage>1109</startPage>
			<endPage>1117</endPage>
			<fileSize>1601</fileSize>
			<authors><![CDATA[Lin, C.-C.;Tsai, R. T.-H.;]]></authors>
		</item>
		<item>
			<title><![CDATA[A General Flexible Framework for the Handling of Prior Information in Audio Source Separation]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047568]]></link>
			<description><![CDATA[Most audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper, we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also makes it possible to devise and implement new efficient methods that have not yet been reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation–maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named Flexible Audio Source Separation Toolbox (FASST) implementing a baseline version of the framework in Matlab.]]></description>
			<pubDate><![CDATA[May  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047568]]></guid>
			<volume>20</volume>
			<issue>4</issue>
			<startPage>1118</startPage>
			<endPage>1133</endPage>
			<fileSize>1011</fileSize>
			<authors><![CDATA[Ozerov, A.;Vincent, E.;Bimbot, F.;]]></authors>
		</item>
		<item>
			<title><![CDATA[Discovering Time-Constrained Sequential Patterns for Music Genre Classification]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047569]]></link>
			<description><![CDATA[A music piece can be considered as a sequence of sound events which represent both short-term and long-term temporal information. However, in the task of automatic music genre classification, most text-categorization-based approaches can capture only local temporal dependencies (e.g., unigram and bigram-based occurrence statistics) to represent music contents. In this paper, we propose the use of time-constrained sequential patterns (TSPs) as effective features for music genre classification. First, an automatic language identification technique is applied to tokenize each music piece into a sequence of hidden Markov model indices. Then TSP mining is applied to discover genre-specific TSPs, followed by the computation of occurrence frequencies of TSPs in each music piece. Finally, support vector machine classifiers are employed based on these occurrence frequencies to perform the classification task. Experiments conducted on two widely used datasets for music genre classification, GTZAN and ISMIR2004Genre, show that the proposed method can discover more discriminative temporal structures and achieve better recognition accuracy than the unigram and bigram-based statistical approach.]]></description>
			<pubDate><![CDATA[May  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047569]]></guid>
			<volume>20</volume>
			<issue>4</issue>
			<startPage>1134</startPage>
			<endPage>1144</endPage>
			<fileSize>1455</fileSize>
			<authors><![CDATA[Ren, J.-M.;Jang, J.-S. R.;]]></authors>
		</item>
		<item>
			<title><![CDATA[On Dynamic Stream Weighting for Audio-Visual Speech Recognition]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047566]]></link>
			<description><![CDATA[The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of the stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper, we propose a new confidence measure, compare it to existing ones, and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more significant than any difference between the proposed confidence measures.]]></description>
			<pubDate><![CDATA[May  2012]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=6145765&arnumber=6047566]]></guid>
			<volume>20</volume>
			<issue>4</issue>
			<startPage>1145</startPage>
			<endPage>1157</endPage>
			<fileSize>1188</fileSize>
			<authors><![CDATA[Estellers, V.;Gurban, M.;Thiran, J.-P.;]]></authors>
		</item>
	</channel>
</rss>
