Perceptual similarity measurement of speech by combination of acoustic features


Abstract:

Future Cast System is a new entertainment system in which a participant’s face is captured and rendered into a movie, making the participant an instant computer graphics (CG) movie star. It was first exhibited at the 2005 World Exposition in Aichi, Japan. We are working to add functionality that maps not only faces but also speech individuality onto the cast. Our approach is to find the speaker with the closest speech individuality and then apply voice conversion. This paper investigates acoustic features for estimating the perceptual similarity of speech individuality. We propose a method that linearly combines eight acoustic features related to the perception of speech individuality, with the weights of the features optimized against perceptual similarity. We evaluated the method using Spearman’s rank correlation coefficient between its output and perceptual similarity. In the experiments, the proposed method achieved a correlation coefficient of 0.66.
Date of Conference: 31 March 2008 - 04 April 2008
Date Added to IEEE Xplore: 12 May 2008
Conference Location: Las Vegas, NV, USA

1. INTRODUCTION

Future Cast System (FCS) [1] is the world's first entertainment system that enables anyone to easily participate in a prerecorded movie as an instant CG movie star. FCS automatically performs every step of the process, from capturing a participant's facial characteristics with a 3D range scanner to rendering them into the movie. The system also allocates a suitable role in the story to each participant, and each character speaks and performs in a fully CG-based movie as vividly as a real actor. However, the prerecorded voice of an actor or actress is used as a substitute for each participant's voice. The substitute voice is selected using only the participant's gender, which is estimated from the scanned face shape; other information, such as age and voice quality, is not considered. This causes a mismatch when viewers perceive the character's voice as different from their own voice or from the voices of people they know. We therefore focus on selecting a similar speaker from a speech database to reduce this mismatch. We propose a method for measuring the perceptual similarity of speech, because it is impossible to record all of a participant's speech in advance, and present voice conversion technology cannot convert voice quality sufficiently.
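
As a rough illustration of the idea outlined in the abstract, the following Python sketch scores candidate speakers by a weighted linear combination of per-feature acoustic distances, selects the closest speaker, and computes Spearman's rank correlation between predicted and perceptual scores. This is only a minimal sketch under stated assumptions, not the authors' implementation. The function names, the three-feature toy data, the weights, and the listener scores are all hypothetical. The paper's eight acoustic features and its weight optimization procedure are not described in this excerpt, so the weights are simply taken as given here.

```python
from typing import Dict, List, Sequence


def combined_distance(distances: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted linear combination of per-feature distances (smaller = more similar)."""
    if len(distances) != len(weights):
        raise ValueError("need exactly one weight per acoustic feature")
    return sum(w * d for w, d in zip(weights, distances))


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def closest_speaker(candidates: Dict[str, Sequence[float]],
                    weights: Sequence[float]) -> str:
    """Return the database speaker with the smallest combined distance."""
    return min(candidates, key=lambda name: combined_distance(candidates[name], weights))


# Hypothetical toy data: three features instead of eight, made-up values.
weights = [0.5, 0.3, 0.2]
candidates = {
    "spk_a": [0.2, 0.9, 0.4],
    "spk_b": [0.6, 0.1, 0.3],
    "spk_c": [0.8, 0.7, 0.9],
}
# Hypothetical listener dissimilarity scores for spk_a, spk_b, spk_c.
perceptual = [2.0, 1.0, 3.0]

predicted = [combined_distance(candidates[name], weights) for name in candidates]
best = closest_speaker(candidates, weights)
rho = spearman(predicted, perceptual)
```

In practice, a library routine such as `scipy.stats.spearmanr` could replace the hand-written `spearman` function. The weights would be fitted to listener judgments rather than fixed by hand.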
