Abstract:
Automatic speech recognition (ASR) for rare words is difficult because few relevant text-audio data pairs are available to train an ASR model. To obtain more text-audio pairs, text-only data are fed to Text-To-Speech (TTS) systems to generate synthetic audio. Previous works use a single TTS system conditioned on multiple speakers to produce different voices and thereby improve the speaker diversity of the output data; they show that training an ASR model on this more diverse data avoids overfitting and improves the model's robustness. As an alternative way to improve diversity, we study the speaker embedding distributions of audio synthesized by different TTS systems and find that different TTS systems produce different speaker distributions even when conditioned on the same speaker. Inspired by this, this paper proposes conditioning multiple TTS systems repeatedly on a single speaker to synthesize more diverse speaker data, so that ASR models can be trained more robustly. We apply our method to a rare-word dataset partitioned from National Speech Corpus SG, whose text transcripts consist mostly of road names and addresses. Experiments show that a pretrained ASR model adapted to our multi-TTS-same-SPK data achieves a relatively 9.8% lower word error rate (WER) than ASR models adapted to same-TTS-multi-SPK data of the same size, and our overall adaptation improves the model's WER from 57.6% to 16.5% without using any real audio as training data.
Published in: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Date of Conference: 31 October 2023 - 03 November 2023
Date Added to IEEE Xplore: 20 November 2023