
Prompting Large Language Models with Speech Recognition Abilities

Abstract:

Large language models (LLMs) have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By directly prepending a sequence of audio embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system and used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open-source LLaMA-7B allows it to outperform monolingual baselines with an 18% relative reduction in WER and to perform multilingual speech recognition, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies investigating whether the LLM can be kept completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder stride to generate fewer embeddings. The results of these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility for LLMs to operate on long-form audio.
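The core mechanism described above (encode audio, shorten the sequence by striding, then prepend the resulting embeddings to the text token embeddings before they enter the LLM) can be sketched in plain Python. This is a minimal, hypothetical illustration, not the authors' implementation: frame stacking stands in for the conformer's strided layers, `project` stands in for an assumed learned mapping to the LLM's embedding width, and the 10 ms frame shift and stride values are illustrative assumptions. All function names are hypothetical.

```python
from typing import List

Vector = List[float]


def stack_frames(frames: List[Vector], stride: int) -> List[Vector]:
    """Shorten a feature sequence by concatenating `stride` consecutive frames.

    Hypothetical stand-in for the encoder's strided layers; trailing frames
    that do not fill a complete group are dropped.
    """
    stacked = []
    for start in range(0, len(frames) - stride + 1, stride):
        group = frames[start:start + stride]
        stacked.append([x for frame in group for x in frame])
    return stacked


def project(vectors: List[Vector], weight: List[List[float]]) -> List[Vector]:
    """Bias-free linear map from encoder width to LLM width.

    Hypothetical: `weight` is indexed [in_dim][out_dim] and would be learned.
    """
    out_dim = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


def build_llm_input(audio_embs: List[Vector], text_embs: List[Vector]) -> List[Vector]:
    """Prepend audio embeddings to text token embeddings along the time axis."""
    if audio_embs and text_embs and len(audio_embs[0]) != len(text_embs[0]):
        raise ValueError("audio and text embeddings must have the same width")
    return audio_embs + text_embs


def embedding_duration_ms(frame_shift_ms: int, total_stride: int) -> int:
    """Audio duration summarized by one embedding after striding."""
    return frame_shift_ms * total_stride


# Toy example: 8 frames of 2-dim features; stride 4 yields 2 vectors of width 8.
frames = [[float(t), float(t) + 0.5] for t in range(8)]
audio = stack_frames(frames, stride=4)
# Toy 8x3 weight that keeps the first three features, giving an LLM width of 3.
weight = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(8)]
audio = project(audio, weight)
text = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # stand-in text token embeddings
llm_input = build_llm_input(audio, text)
print(len(llm_input))  # 4: two audio embeddings followed by two text embeddings
print(embedding_duration_ms(10, 96))  # 960 (assuming a 10 ms frame shift)
```

Because the LLM receives one embedding per stride step, a larger total stride shortens the prepended audio sequence proportionally. Under the assumed 10 ms frame shift, a total stride of 96 means each embedding summarizes 960 ms of audio, illustrating the scale of stride the abstract describes as almost 1 second.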

Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 14-19 April 2024

Date Added to IEEE Xplore: 18 March 2024

DOI: 10.1109/ICASSP48485.2024.10447605

Conference Location: Seoul, Korea, Republic of