Compressing Transformer-Based ASR Model by Task-Driven Loss and Attention-Based Multi-Level Feature Distillation


Abstract:

Current popular knowledge distillation (KD) methods effectively compress transformer-based end-to-end speech recognition models. However, existing methods fail to exploit the complete information of the teacher model, distilling only a limited number of its blocks. In this study, we first integrate a task-driven loss function into the decoder's intermediate blocks to generate task-related feature representations. We then propose an attention-based multi-level feature distillation scheme that automatically learns the feature representation summarized by all blocks of the teacher model. With a 1.1M-parameter model, experimental results on the Wall Street Journal dataset show that our approach achieves a 12.1% WER reduction compared with the baseline system.
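As a rough illustration of the two ideas described in the abstract, the sketch below (our own PyTorch-style code, not the authors' implementation) shows one way an attention-based multi-level feature distillation loss and a task-driven intermediate loss could be formulated. The module names, the shared feature dimension `d`, the dot-product attention form, and the use of cross-entropy as the intermediate task loss are assumptions for the sake of the example.

```python
# Hypothetical sketch of attention-based multi-level feature distillation.
# Assumes teacher and student block outputs are already projected to a
# common dimension `d`, and that the task-driven loss on intermediate
# decoder blocks is plain cross-entropy over the output tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelFeatureDistillation(nn.Module):
    """For each student block, learn attention weights over all teacher
    blocks and distill toward the attention-weighted teacher summary."""

    def __init__(self, d: int):
        super().__init__()
        self.query_proj = nn.Linear(d, d)   # projects student features
        self.key_proj = nn.Linear(d, d)     # projects teacher features
        self.scale = d ** -0.5

    def forward(self, student_feats, teacher_feats):
        # student_feats: list of S tensors, each (batch, time, d)
        # teacher_feats: list of T tensors, each (batch, time, d)
        teacher = torch.stack(teacher_feats, dim=2)          # (B, time, T, d)
        loss = 0.0
        for s in student_feats:
            q = self.query_proj(s).unsqueeze(2)              # (B, time, 1, d)
            k = self.key_proj(teacher)                       # (B, time, T, d)
            attn = torch.softmax(
                (q * k).sum(-1) * self.scale, dim=-1)        # (B, time, T)
            summary = (attn.unsqueeze(-1) * teacher).sum(2)  # (B, time, d)
            loss = loss + F.mse_loss(s, summary)
        return loss / len(student_feats)


def task_driven_intermediate_loss(block_outputs, output_layer, targets, pad_id=0):
    """Apply the task (cross-entropy) loss to each intermediate decoder
    block so its features stay task-related (an assumed formulation)."""
    losses = [
        F.cross_entropy(output_layer(h).transpose(1, 2), targets,
                        ignore_index=pad_id)
        for h in block_outputs
    ]
    return sum(losses) / len(losses)
```

In this reading, the attention weights let the student decide how much each teacher block contributes to its distillation target, rather than pairing student blocks with a fixed subset of teacher blocks; the total training objective would combine the ASR loss, the intermediate task loss, and the distillation loss with tunable weights.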
Date of Conference: 23-27 May 2022
Date Added to IEEE Xplore: 27 April 2022
Conference Location: Singapore, Singapore
