Abstract:
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the elaborate attention calculation process is not robust against long-form speech utterances. Moreover, the sequence-level training objective and time-restricted streaming encoder cause a nonnegligible delay in token emission during inference. To address these problems, we propose CTC synchronous training (CTC-ST), in which CTC alignments are leveraged as a reference for token boundaries to enable a MoChA model to learn optimal monotonic input-output alignments. We formulate a purely end-to-end training objective to synchronize the boundaries of MoChA to those of CTC. The CTC model shares an encoder with the MoChA model to enhance the encoder representation. Moreover, the proposed method provides alignment information learned in the CTC branch to the attention-based decoder. Therefore, CTC-ST can be regarded as self-distillation of alignment knowledge from CTC to MoChA. Experimental evaluations on a variety of benchmark datasets show that the proposed method significantly reduces recognition errors and emission latency simultaneously. The robustness to long-form and noisy speech is also demonstrated. We compare CTC-ST with several methods that distill alignment knowledge from a hybrid ASR system and show that CTC-ST can achieve a comparable tradeoff of accuracy and latency without relying on external alignment information.
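The abstract does not give the form of the synchronization objective, so the following is only an illustrative sketch of what "synchronizing the boundaries of MoChA to those of CTC" could look like. It assumes (hypothetically) that the CTC boundary of a token is the first frame where it appears in a frame-level CTC alignment, that the MoChA boundary is the expected frame index under the token's monotonic attention distribution, and that the two are compared with a mean absolute difference. The function names `ctc_boundaries`, `expected_boundaries`, and `sync_loss` are made up for this example.

```python
from typing import List, Sequence


def ctc_boundaries(frame_labels: Sequence[int], blank: int = 0) -> List[int]:
    """Frame index where each token first appears in a frame-level CTC path.

    Repeated labels are collapsed and blanks are skipped, following the
    usual CTC convention. Using the *first* frame is an assumption here.
    """
    boundaries = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            boundaries.append(t)
        prev = label
    return boundaries


def expected_boundaries(alpha: Sequence[Sequence[float]]) -> List[float]:
    """Expected frame index per token, given attention weights alpha[i][j]
    (token i, frame j). Assumed stand-in for a MoChA token boundary."""
    return [sum(j * p for j, p in enumerate(row)) for row in alpha]


def sync_loss(ctc_b: Sequence[float], mocha_b: Sequence[float]) -> float:
    """Mean absolute difference between the two boundary sequences."""
    if not ctc_b or len(ctc_b) != len(mocha_b):
        raise ValueError("boundary sequences must be non-empty and equal in length")
    return sum(abs(c - m) for c, m in zip(ctc_b, mocha_b)) / len(ctc_b)


# Toy example: 8 frames, 3 tokens (label ids 3, 5, 7; 0 is blank).
alignment = [0, 3, 3, 0, 0, 5, 0, 7]
alpha = [
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],  # token 1 attends to frames 1-2
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # token 2 fires one frame late
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # token 3 is already in sync
]
loss = sync_loss(ctc_boundaries(alignment), expected_boundaries(alpha))
```

In a real system such a term would be computed on differentiable attention weights and added to the recognition losses with some weight; both that weighting and the way the CTC alignment is obtained are outside what the abstract specifies.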
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume: 31)
![Author image of Hirofumi Inaguma](/mediastore/IEEE/content/freeimages/6570655/9970249/9640576/inagu-3133217-small.gif)
Graduate School of Informatics, Kyoto University, Kyoto, Japan
Hirofumi Inaguma (Member, IEEE) received the B.E. degree in engineering, and the M.S. and Ph.D. degrees in informatics from Kyoto University, Kyoto, Japan, in 2016, 2018, and 2021, respectively. His research interests include automatic speech recognition and speech translation. He is a Member of ISCA, ASJ, and IPSJ.
![Author image of Tatsuya Kawahara](/mediastore/IEEE/content/freeimages/6570655/9970249/9640576/kawah-3133217-small.gif)
Graduate School of Informatics, Kyoto University, Kyoto, Japan
Tatsuya Kawahara (Fellow, IEEE) received the B.E., M.E., and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1987, 1989, and 1995, respectively. From 1995 to 1996, he was a Visiting Researcher with Bell Laboratories, Murray Hill, NJ, USA. He is currently a Professor and the Dean of the School of Informatics, Kyoto University. He was also an Invited Researcher with ATR and NICT. He has authored or coauthored more than 400 academic papers on speech recognition, spoken language processing, and spoken dialogue systems. He is conducting several projects including speech recognition software Julius, the automatic transcription system deployed in the Japanese Parliament (Diet), and the autonomous android ERICA. Dr. Kawahara received the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (MEXT) in 2012. From 2003 to 2006, he was a Member of IEEE SPS Speech Technical Committee. He was the General Chair of IEEE ASRU 2007. He was the Tutorial Chair of INTERSPEECH 2010, the Local Arrangement Chair of ICASSP 2012, and the General Chair of APSIPA ASC 2020. He was an Editorial Board Member of the Elsevier Journal of Computer Speech and Language and IEEE/ACM Transactions on Audio, Speech, and Language Processing. He is the Editor-in-Chief of the APSIPA Transactions on Signal and Information Processing. Dr. Kawahara is a Board Member of the APSIPA and ISCA.