Abstract:
Dysarthric speech reconstruction (DSR) is a challenging task due to difficulties in repairing unstable prosody and correcting imprecise articulation. Inspired by the success of sequence-to-sequence (seq2seq) based text-to-speech (TTS) synthesis and knowledge distillation (KD) techniques, this paper proposes a novel end-to-end voice conversion (VC) method to tackle the reconstruction task. The proposed approach contains three components. First, a seq2seq based TTS system is trained with transcribed normal speech. Second, with the text-encoder of this trained TTS system as the "teacher", a teacher-student framework is proposed for cross-modal KD, in which a speech-encoder is trained to extract appropriate linguistic representations from transcribed dysarthric speech. Third, the speech-encoder of the second component is concatenated with the attention and decoder of the first component (TTS) to perform the DSR task by directly mapping dysarthric speech to its normal version. Experiments demonstrate that the proposed method can generate speech with high naturalness and intelligibility. In comparisons of human speech recognition between the reconstructed speech and the original dysarthric speech, absolute word error rate (WER) reductions of 35.4% and 48.7% are achieved for dysarthric speakers with low and very low speech intelligibility, respectively.
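To make the reported metric concrete, the following is a minimal sketch of how an absolute WER reduction can be computed from listener transcripts. It assumes the standard word-level edit-distance definition of WER; the abstract does not specify the exact scoring procedure, and the function names and example sentences are hypothetical, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: standard word-level Levenshtein distance; the paper's exact
    scoring procedure is not given in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Absolute reduction in percentage points (not a relative improvement)."""
    return (wer_before - wer_after) * 100.0


# Hypothetical listener transcripts, for illustration only.
reference = "please open the front door"
heard_from_dysarthric = "please oh the from"
heard_from_reconstructed = "please open the front floor"

wer_before = word_error_rate(reference, heard_from_dysarthric)
wer_after = word_error_rate(reference, heard_from_reconstructed)
reduction = absolute_wer_reduction(wer_before, wer_after)
print(f"WER before: {wer_before:.1%}, after: {wer_after:.1%}, "
      f"absolute reduction: {reduction:.1f} points")
```

Note that an absolute reduction is a difference in percentage points: a drop from 60% to 20% WER is a 40-point absolute reduction, whereas the relative reduction in that case would be about 67%.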
Published in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-08 May 2020
Date Added to IEEE Xplore: 09 April 2020