Abstract:
Some End-to-End (E2E) Automatic Speech Recognition (ASR) models, such as Attention-based Encoder-Decoder (AED) and Recurrent Neural Network Transducer (RNN-T) models, are known to have components that effectively act as internal language models (ILMs), implicitly modelling the prior probability of the output sequence. However, the existence of an ILM in pure Connectionist Temporal Classification (CTC) ASR systems remains debated. In this paper, we investigate the existence and strength of an ILM in CTC systems. Since CTC posterior probabilities cannot be analytically factorised, we propose a novel empirical method to probe the ILM. After validating our method on a hybrid DNN model with various external language models, we apply it to CTC models trained under different conditions, examining the effects of training data, modelling units, and training or pre-training methods. Our results show no strong evidence of an ILM in CTC-based ASR systems, even with the largest training dataset in our experiments. However, we make the surprising finding that when a CTC encoder is jointly trained with an AED loss, an ILM emerges, even when only the CTC component is used in decoding.
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Index Terms:
- Language Model
- Internal Model
- Internal Language
- E2E Automatic Speech Recognition
- CTC-based Automatic Speech Recognition
- Training Data
- Training Dataset
- Training Methods
- Largest Dataset
- Probability Of Sequence
- Joint Training
- Pre-training Method
- Automatic Speech Recognition System
- Large Datasets
- Convolutional Layers
- Utterances
- Small Datasets
- Hybrid Model
- Transformer Model
- Linear Layer
- Word Error Rate
- Acoustic Model
- Substitution Errors
- Hourly Data
- Deletion Errors
- Masking Strategy
- Conditional Independence Assumption
- Word Level