Abstract:
Acoustic modeling with deep learning has drastically improved the performance of automatic speech recognition (ASR), yet log-Mel filterbank features remain the mainstream acoustic features. Although log-Mel filterbank features discard harmonic-structure information, that information is still useful for ASR, and several attempts have been made to integrate such higher-resolution information into the network. To improve ASR accuracy in noisy conditions, we propose new features, integrated into acoustic modeling, that represent which parts of the time-frequency domain have a distinct harmonic structure, since the harmonic structure is only partially observed in noisy environments. The new features are combined with the standard acoustic features, and the network is trained on this combination using various noisy data. In this way, it learns the acoustic features together with a kind of quality tag describing which parts are clean and which are degraded. On an Aurora-4 task, our model reduced the word error rate by 10.3% with a DNN compared with a strong baseline, while retaining high accuracy on clean test cases.
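The abstract does not specify how the harmonic-structure features are computed, so the following is only a minimal sketch of the general idea: derive a per-frame, per-band "harmonicness" map from the spectrogram (here approximated with librosa's harmonic-percussive source separation, which is an assumption, not the paper's method) and concatenate it with the standard log-Mel filterbank features as the acoustic-model input.

    # Sketch: append a harmonic-structure "quality" map to log-Mel features.
    # The HPSS-based harmonic ratio below is an illustrative stand-in for the
    # paper's (unspecified) harmonic-structure features.
    import numpy as np
    import librosa

    def logmel_with_harmonicness(y, sr, n_fft=512, hop_length=160, n_mels=40):
        # Power spectrogram
        spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2

        # Harmonic/percussive split; ratio ~1 where a clear harmonic structure exists
        harm, perc = librosa.decompose.hpss(spec)
        harm_ratio = harm / (harm + perc + 1e-10)

        # Project both onto the Mel scale so the two feature streams are aligned
        mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        logmel = np.log(mel_basis @ spec + 1e-10)            # standard log-Mel features
        harm_mel = mel_basis @ harm_ratio                     # harmonic-structure map

        # Concatenate along the feature axis: shape (2 * n_mels, n_frames)
        return np.concatenate([logmel, harm_mel], axis=0)

    # Usage: features = logmel_with_harmonicness(waveform, 16000)
    # The stacked features would then be fed to the DNN acoustic model.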
Published in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 05-09 March 2017
Date Added to IEEE Xplore: 19 June 2017
Electronic ISSN: 2379-190X