I. Introduction
In recent years, building ASR systems on top of large-scale pre-trained models has become mainstream. Models such as XLSR [1] and Whisper [2], trained on tens of thousands to hundreds of thousands of hours of multilingual speech, have driven rapid advances in multilingual speech processing. However, for low-resource languages and dialects that are absent from the pre-training data, or present only in small quantities, recognition accuracy is often too low for practical use.