Towards Bringing Parity in Pretraining Datasets for Low-resource Indian Languages


Abstract:

The lack of large-scale pretraining data for low-resource languages from the Indian subcontinent leads to their underrepresentation in existing massively multilingual models. In this work, we address this gap by proposing a framework to create large raw audio datasets for such under-represented languages by collating publicly accessible audio content. Leveraging this framework, we present MahaDhwani, a corpus comprising 279K hours of raw audio across 22 Indian languages. To test the utility of MahaDhwani, we pretrain a conformer-style model and then finetune it to build a multilingual ASR model supporting the 22 languages. Using a hybrid multi-softmax decoder, we balance the benefit of shared parameters, which enable cross-lingual transfer, with the benefit of dedicated capacity for each language. Our evaluations on the IndicVoices benchmark show the benefits of pretraining, particularly in low-resource settings. We will open-source our framework, code, and scripts to reproduce the dataset.
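To make the multi-softmax idea concrete, the following is a minimal illustrative sketch (not the authors' implementation): a shared encoder provides the cross-lingual representation, while each language gets its own softmax output head for dedicated capacity. The class name, feature dimensions, language codes, and the simple linear encoder stand in for the paper's conformer encoder and are assumptions for illustration only.

```python
# Hypothetical sketch of a hybrid multi-softmax decoder head in PyTorch.
# Shared parameters (encoder) enable cross-lingual transfer; per-language
# softmax heads give each language dedicated output capacity.
import torch
import torch.nn as nn

class MultiSoftmaxASRHead(nn.Module):
    def __init__(self, languages, vocab_sizes, encoder_dim=256):
        super().__init__()
        # Shared encoder (placeholder for a pretrained conformer encoder).
        self.encoder = nn.Sequential(
            nn.Linear(80, encoder_dim),   # 80-dim log-mel features (assumed)
            nn.ReLU(),
            nn.Linear(encoder_dim, encoder_dim),
        )
        # One dedicated output head (softmax) per language.
        self.heads = nn.ModuleDict({
            lang: nn.Linear(encoder_dim, vocab_sizes[lang]) for lang in languages
        })

    def forward(self, features, lang):
        # features: (batch, time, 80); lang selects the language-specific head.
        shared = self.encoder(features)            # cross-lingual representation
        logits = self.heads[lang](shared)          # language-specific vocabulary
        return torch.log_softmax(logits, dim=-1)   # e.g. for a CTC-style loss

# Example usage with two hypothetical languages and vocabulary sizes.
model = MultiSoftmaxASRHead(languages=["hi", "ta"],
                            vocab_sizes={"hi": 128, "ta": 96})
log_probs = model(torch.randn(4, 200, 80), lang="hi")
print(log_probs.shape)  # torch.Size([4, 200, 128])
```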
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025
Conference Location: Hyderabad, India
