Towards Bringing Parity in Pretraining Datasets for Low-resource Indian Languages


Abstract:

The lack of large-scale pretraining data for low-resource languages from the Indian subcontinent leads to their underrepresentation in existing massively multilingual models. In this work, we address this gap by proposing a framework to create large raw audio datasets for such under-represented languages by collating publicly accessible audio content. Leveraging this framework, we present MahaDhwani, a corpus comprising 279K hours of raw audio across 22 Indian languages. To test the utility of MahaDhwani, we pretrain a conformer-style model and then finetune it to build a multilingual ASR model supporting the 22 languages. Using a hybrid multi-softmax decoder, we balance the benefit of shared parameters, which enable cross-lingual transfer, with the benefit of dedicated capacity for each language. Our evaluations on the IndicVoices benchmark show the benefits of pretraining, particularly in low-resource settings. We will open-source our framework, code, and scripts to reproduce the dataset.
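To make the multi-softmax idea concrete, the following is a minimal illustrative sketch (not the authors' implementation): a shared encoder provides the cross-lingual representation, while each language gets its own softmax output head for dedicated capacity. The class name, feature dimensions, language codes, and the simple linear encoder stand in for the paper's conformer encoder and are assumptions for illustration only.

```python
# Hypothetical sketch of a hybrid multi-softmax decoder head in PyTorch.
# Shared parameters (encoder) enable cross-lingual transfer; per-language
# softmax heads give each language dedicated output capacity.
import torch
import torch.nn as nn

class MultiSoftmaxASRHead(nn.Module):
    def __init__(self, languages, vocab_sizes, encoder_dim=256):
        super().__init__()
        # Shared encoder (placeholder for a pretrained conformer encoder).
        self.encoder = nn.Sequential(
            nn.Linear(80, encoder_dim),   # 80-dim log-mel features (assumed)
            nn.ReLU(),
            nn.Linear(encoder_dim, encoder_dim),
        )
        # One dedicated output head (softmax) per language.
        self.heads = nn.ModuleDict({
            lang: nn.Linear(encoder_dim, vocab_sizes[lang]) for lang in languages
        })

    def forward(self, features, lang):
        # features: (batch, time, 80); lang selects the language-specific head.
        shared = self.encoder(features)            # cross-lingual representation
        logits = self.heads[lang](shared)          # language-specific vocabulary
        return torch.log_softmax(logits, dim=-1)   # e.g. for a CTC-style loss

# Example usage with two hypothetical languages and vocabulary sizes.
model = MultiSoftmaxASRHead(languages=["hi", "ta"],
                            vocab_sizes={"hi": 128, "ta": 96})
log_probs = model(torch.randn(4, 200, 80), lang="hi")
print(log_probs.shape)  # torch.Size([4, 200, 128])
```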
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
