Abstract:
Audio super-resolution (ASR), also known as bandwidth extension (BWE), aims to enhance the quality of low-resolution audio by recovering high-frequency components. However, existing methods often struggle to model harmonic relationships accurately and to balance inference speed against computational complexity. In this paper, we propose VM-ASR, a novel lightweight ASR model that leverages the Visual State Space (VSS) block to effectively capture global and local contextual information within audio spectrograms. This enables VM-ASR to model harmonic relationships more accurately, improving audio quality. Our experiments on the VCTK dataset demonstrate that VM-ASR consistently outperforms state-of-the-art methods in spectral reconstruction across various input-output sample rate pairs, achieving significantly lower Log-Spectral Distance (LSD) while maintaining a smaller model size (3.01 M parameters) and lower computational complexity (2.98 GFLOPS). This makes VM-ASR a promising solution for real-time applications and resource-constrained environments, and opens up possibilities in telecommunications, speech synthesis, and audio restoration.
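The abstract reports results in Log-Spectral Distance (LSD) without defining it. As a rough illustration, the following is a minimal sketch of one commonly used LSD formulation: the root-mean-square difference of log power spectra over frequency, averaged over time frames. The function names, the frame length `n_fft`, the hop size `hop`, and the stabilizer `eps` are assumptions chosen for illustration. This is not the paper's evaluation code, and its exact settings may differ.

```python
import numpy as np


def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram from a Hann-windowed STFT (frames x frequency bins).

    Assumes len(x) >= n_fft; n_fft and hop are illustrative defaults.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, n_fft=2048, hop=512, eps=1e-10):
    """LSD between two waveforms: lower means closer spectra.

    eps is an assumed small constant that keeps the logarithm finite.
    """
    n = min(len(reference), len(estimate))  # compare equal-length signals
    ref_log = np.log10(stft_power(reference[:n], n_fft, hop) + eps)
    est_log = np.log10(stft_power(estimate[:n], n_fft, hop) + eps)
    # RMS over frequency bins within each frame, then mean over frames.
    per_frame = np.sqrt(np.mean((ref_log - est_log) ** 2, axis=1))
    return float(np.mean(per_frame))
```

Under this formulation, identical signals give an LSD of zero, and a reconstruction that is missing high-frequency energy gives a larger value because its log power spectrum falls well below the reference in the upper bins.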
Published in: IEEE Transactions on Audio, Speech and Language Processing (Volume: 33)