Abstract:
Speech enhancement (SE) models based on deep neural networks (DNNs) have shown excellent denoising performance. However, mainstream SE models often have high structural complexity and large parameter counts, requiring substantial computational resources, which limits their practical application. In this paper, a high-efficiency encoder-decoder structure, inspired by the top-down attention mechanism in human brain perception and named the human-like perception attention network (HPANet), is proposed for monaural speech enhancement; it emulates the brain's perceptual attention in noisy environments. In HPANet, the raw waveform is first encoded by an attention encoder to capture shallow global features. These features are then downsampled, and multi-scale information is aggregated through a top attention module to prevent the loss of crucial information. Next, a down attention module integrates features from neighboring layers to reconstruct the signal in a top-down manner. Finally, the decoder reconstructs the denoised clean signal. Experiments show that the proposed method effectively reduces model complexity while maintaining competitive performance.
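The abstract describes a four-stage pipeline: encode the raw waveform into shallow features, build a multi-scale pyramid by downsampling, aggregate the scales (top attention), fuse neighboring layers top-down (down attention), and decode back to a waveform. The sketch below illustrates only the data flow with NumPy stand-ins; every function here (`encode`, `downsample`, `top_attention`, `down_attention`, `decode`) is a hypothetical simplification for shape bookkeeping, not the paper's actual HPANet modules.

```python
import numpy as np

def encode(x, dim=16, rng=None):
    # Hypothetical stand-in for the attention encoder: project frames of
    # the raw waveform into shallow feature vectors.
    rng = rng or np.random.default_rng(0)
    frames = x.reshape(-1, 8)                     # (T/8, 8) frames
    W = rng.standard_normal((8, dim)) * 0.1
    return frames @ W                             # (T/8, dim)

def downsample(f, factor=2):
    # Strided average pooling along time to build the multi-scale pyramid.
    T = (f.shape[0] // factor) * factor
    return f[:T].reshape(-1, factor, f.shape[1]).mean(axis=1)

def top_attention(scales):
    # Crude stand-in for the top attention module: upsample every scale
    # to the finest resolution and average, so no scale is discarded.
    T = scales[0].shape[0]
    up = [np.repeat(s, T // s.shape[0], axis=0)[:T] for s in scales]
    return np.mean(up, axis=0)

def down_attention(coarse, fine):
    # Stand-in for the down attention module: integrate neighboring
    # layers top-down by upsampling the coarse layer and adding it.
    up = np.repeat(coarse, fine.shape[0] // coarse.shape[0], axis=0)
    return fine + up[:fine.shape[0]]

def decode(f, out_len, rng=None):
    # Hypothetical decoder: project features back to waveform samples.
    rng = rng or np.random.default_rng(1)
    W = rng.standard_normal((f.shape[1], 8)) * 0.1
    return (f @ W).reshape(-1)[:out_len]

# Toy end-to-end pass on a 1024-sample waveform.
x = np.random.default_rng(2).standard_normal(1024)
feats = encode(x)                   # shallow global features
s1 = downsample(feats)              # multi-scale pyramid
s2 = downsample(s1)
agg = top_attention([feats, s1, s2])
fused = down_attention(downsample(agg), agg)
y = decode(fused, len(x))
print(y.shape)                      # (1024,)
```

The point of the sketch is that the enhanced output has the same length as the input waveform, with all intermediate scales contributing to the reconstruction.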
Published in: IEEE Signal Processing Letters ( Volume: 32)