We propose a concatenative synthesis approach to the problem of foreign accent conversion. The approach consists of replacing the most accented portions of nonnative speech with alternative segments drawn from a corpus of the speaker's own speech, selected for their similarity to the corresponding segments from a reference native speaker. We compare two approaches for selecting units, one based on acoustic similarity [e.g., mel-frequency cepstral coefficients (MFCCs)] and a second one based on articulatory similarity, as measured through electromagnetic articulography (EMA). Our hypothesis is that articulatory features provide a better metric for linguistic similarity across speakers than acoustic features. To test this hypothesis, we recorded an articulatory-acoustic corpus from a native and a nonnative speaker, and evaluated the two speech representations (acoustic versus articulatory) through a series of perceptual experiments. Formal listening tests indicate that the approach can achieve a 20% reduction in perceived accent, but also reveal a strong coupling between accent and speaker identity. To address this issue, we disguised original and resynthesized utterances by altering their average pitch and normalizing vocal tract length. An additional listening experiment supports the hypothesis that articulatory features are less speaker dependent than acoustic features.
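The selection step described above can be illustrated with a minimal sketch. It is not the implementation evaluated here; it assumes that each segment has already been reduced to a fixed-length, speaker-normalized feature vector (acoustic or articulatory), and it uses plain Euclidean distance as a stand-in for whatever similarity measure is actually applied. The names `euclidean`, `select_unit`, and `select_units` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_unit(reference: Vector, candidates: Sequence[Vector]) -> int:
    """Return the index of the candidate segment closest to the reference.

    `reference` stands for a segment from the native speaker; `candidates`
    stands for segments from the nonnative speaker's own corpus.
    Assumption: both are expressed in the same normalized feature space.
    """
    if not candidates:
        raise ValueError("candidate corpus is empty")
    return min(range(len(candidates)),
               key=lambda i: euclidean(reference, candidates[i]))


def select_units(references: Sequence[Vector],
                 candidates: Sequence[Vector]) -> List[int]:
    """Pick one replacement segment (by index) for each reference segment."""
    return [select_unit(ref, candidates) for ref in references]


if __name__ == "__main__":
    # Toy two-dimensional features, for illustration only.
    corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    native_segments = [[4.0, 4.0], [0.1, -0.1]]
    print(select_units(native_segments, corpus))
```

Under this sketch, switching between the acoustic and the articulatory condition amounts to changing which feature vectors are passed in; the selection logic itself is unchanged.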