
Analysis and Recognition of NAM Speech Using HMM Distances and Visual Information

4 Author(s)
Heracleous, P. ; Speech & Cognition Dept., Stendhal Univ., St. Martin d'Heres, France ; Tran, V.-A. ; Nagai, T. ; Shikano, K.

Non-audible murmur (NAM) is an unvoiced speech signal that can be received through body tissue using special acoustic sensors (i.e., NAM microphones) attached behind the talker's ear. The authors previously reported experimental results for NAM recognition using a stethoscopic and a silicon NAM microphone. Using a small amount of training data from a single speaker together with adaptation approaches, a word accuracy of 93.9% was achieved on a 20k-word Japanese dictation task. In this paper, NAM speech is analyzed further using distance measures between hidden Markov models (HMMs). It is shown that, owing to the reduced spectral space of NAM speech, the HMM distances are also reduced compared with those of normal speech. For Japanese vowels and fricatives, the distance measures in NAM speech follow the same relative inter-phoneme relationship as in normal speech, without significant differences. For Japanese plosives, however, significant differences were found. More specifically, in NAM speech the distances between voiced/unvoiced consonant pairs articulated in the same place decreased drastically. As a result, the inter-phoneme relationship changed significantly compared with normal speech, causing a substantial decrease in recognition accuracy. A speaker-dependent phoneme recognition experiment was conducted, yielding 81.5% phoneme correct for NAM and showing a relationship between HMM distance measures and phoneme accuracy. In a NAM microphone, body transmission and the loss of lip radiation act as a low-pass filter, so the higher frequency components of a NAM signal are attenuated. Because of this spectral reduction, the unvoiced nature of NAM, and the type of articulation, NAM sounds become similar to one another, causing a larger number of confusions than in normal speech.
Yet many of those sounds are visually distinct on the face, mouth, and lips, and integrating visual information makes them easier to discriminate. As a result, recognition accuracy increases as well. In this article, the visual information extracted from the talkers' facial movements is fused with NAM speech. The experimental results reveal a relative improvement of 10.5% on average when NAM speech fused with facial information was used, compared with using NAM speech alone.
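The low-pass effect mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' model of body transmission: it applies a plain moving-average filter (a stand-in low-pass filter chosen only for simplicity) to two pure tones and compares how much of each survives. The sample rate, tone frequencies, and filter width are all assumed values.

```python
import math

SAMPLE_RATE = 16000   # Hz (assumed for illustration)
NUM_SAMPLES = 1600    # 0.1 s of signal
WIDTH = 8             # moving-average window length in samples (assumed)


def tone(freq_hz):
    """Unit-amplitude sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(NUM_SAMPLES)]


def moving_average(signal, width):
    """Causal moving average: a very simple low-pass filter."""
    out = []
    acc = 0.0
    for i, x in enumerate(signal):
        acc += x
        if i >= width:
            acc -= signal[i - width]  # drop the sample leaving the window
        out.append(acc / width)
    return out


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


low = tone(200)    # low-frequency component
high = tone(3000)  # high-frequency component

# Ratio of output level to input level for each tone.
low_gain = rms(moving_average(low, WIDTH)) / rms(low)
high_gain = rms(moving_average(high, WIDTH)) / rms(high)

print(f"gain at 200 Hz:  {low_gain:.2f}")
print(f"gain at 3000 Hz: {high_gain:.2f}")
```

The low tone passes almost unchanged while the high tone is strongly attenuated, which is the kind of spectral reduction the abstract attributes to NAM capture. A real NAM channel would have a different frequency response.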

Published in:

Audio, Speech, and Language Processing, IEEE Transactions on (Volume: 18, Issue: 6)
Biometrics Compendium, IEEE