I. Introduction
In recent years, self-supervised learning (SSL) based speech foundation models such as wav2vec 2.0 [1], HuBERT [2], and WavLM [3] have demonstrated strong performance across a range of applications, including automatic speech recognition (ASR). However, the practical deployment of current speech foundation models in on-device and resource-constrained scenarios is hindered by their large memory footprint and high computational cost.