Abstract:
This work introduces Cleanformer, a streaming multichannel neural enhancement frontend for automatic speech recognition (ASR). The model has a Conformer-based architecture that takes as inputs a single channel each of raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner. The time-frequency mask is applied to the noisy input to produce enhanced features for ASR. Detailed evaluations with speech- and non-speech-based noise show a significant reduction in word error rate (WER), about 80% at -6 dB SNR, compared with a state-of-the-art ASR model alone. Cleanformer also significantly outperforms enhancement using a beamformer with ideal steering. The enhancement model can be used with different microphone arrays without retraining.
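As a rough illustration of the masking step described in the abstract, the sketch below applies a time-frequency mask element-wise to a noisy magnitude spectrogram. It is not the authors' implementation: the function name `apply_tf_mask`, the array shapes, the use of magnitude spectrograms, and the random stand-in for the mask (which in Cleanformer would come from the Conformer-based model) are all assumptions made for illustration.

```python
import numpy as np


def apply_tf_mask(noisy_mag: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a time-frequency mask to a noisy magnitude spectrogram.

    Both arrays are assumed to have shape (frames, frequency_bins).
    The mask is clipped to [0, 1] so that it can only attenuate.
    """
    if noisy_mag.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return noisy_mag * np.clip(mask, 0.0, 1.0)


# Toy data: 100 frames by 257 frequency bins (hypothetical sizes).
rng = np.random.default_rng(0)
noisy_mag = np.abs(rng.standard_normal((100, 257)))

# Random stand-in for the mask a trained model would predict.
mask = rng.uniform(0.0, 1.0, size=noisy_mag.shape)

enhanced = apply_tf_mask(noisy_mag, mask)
```

Because the mask is limited to values between 0 and 1 here, each time-frequency bin of the output is never larger than the corresponding noisy bin; whether the paper constrains its mask the same way is not stated in the abstract.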
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023