Abstract:
The Mixture-of-Experts (MoE) structure has been effectively utilized in multilingual ASR tasks. However, the potential of external language information remains underutilized. In this paper, we introduce the Mixture of MoE (M-MoE) structure, which features multiple language-specific MoEs and a language-unknown MoE; the language-unknown MoE reuses the experts of the language-specific MoEs. Inputs with language IDs are routed to the corresponding language-specific MoE, while those without IDs are routed to the language-unknown MoE. We propose a two-stage training method for the M-MoE-based model. Our unified model structure is suitable for streaming ASR tasks in both language-known and language-unknown scenarios. Experiments on a three-language dataset show that, compared to the Conformer baseline, our model achieves average relative improvements of 12% and 9% in the language-known and language-unknown scenarios, respectively. Compared to a strong MoE baseline, it achieves an average relative improvement of 5% in the language-known scenario.
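The abstract only outlines the routing idea (language-specific expert pools, plus a language-unknown router that reuses those same experts), so the following is a minimal sketch of how such an M-MoE layer could be wired up, not the authors' implementation. All names (FeedForwardExpert, MMoELayer, experts_per_language, top_k) and the top-k gating choice are assumptions; the actual paper may use different expert types, routing, and training details.

```python
import torch
import torch.nn as nn


class FeedForwardExpert(nn.Module):
    """Hypothetical single expert; stands in for e.g. a Conformer feed-forward block."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MMoELayer(nn.Module):
    """Sketch of an M-MoE layer: one expert pool and router per language, plus a
    language-unknown router that scores the union of all language-specific experts."""
    def __init__(self, dim: int, hidden_dim: int, num_languages: int,
                 experts_per_language: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Language-specific expert pools and their routers.
        self.experts = nn.ModuleList([
            nn.ModuleList([FeedForwardExpert(dim, hidden_dim)
                           for _ in range(experts_per_language)])
            for _ in range(num_languages)
        ])
        self.lang_routers = nn.ModuleList([
            nn.Linear(dim, experts_per_language) for _ in range(num_languages)
        ])
        # Language-unknown router: reuses every expert from every language pool.
        self.unknown_router = nn.Linear(dim, num_languages * experts_per_language)

    def _mix(self, x, experts, logits):
        # Top-k gating over the given expert list; x has shape [tokens, dim].
        weights, idx = torch.topk(torch.softmax(logits, dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

    def forward(self, x: torch.Tensor, lang_id: int | None = None) -> torch.Tensor:
        # Language-known scenario: route to that language's own MoE.
        if lang_id is not None:
            return self._mix(x, self.experts[lang_id], self.lang_routers[lang_id](x))
        # Language-unknown scenario: route over the shared pool of all experts.
        all_experts = [e for pool in self.experts for e in pool]
        return self._mix(x, all_experts, self.unknown_router(x))
```

Under these assumptions, a language-known utterance (lang_id given) only ever activates experts from its own pool, while a language-unknown utterance lets the unknown router pick among all experts, which is one plausible reading of "the language-unknown MoE reuses experts from language-specific MoEs."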
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025