Abstract:
The incorporation of visual information has been shown to be an effective approach to making automatic speech recognition (ASR) robust against environmental interference. However, the optimal stage of information integration in multimodal speech processing remains an open question. Although results from multiple-in-single-out (MISO) mobile communications suggest early integration levels, multimodal ASR may suffer from early integration due to the inherent asynchrony of audio and video features. In this paper, we investigate whether early or middle integration strategies perform best in multimodal ASR by comparing feature concatenation and turbo decoding approaches. On an audio-visual speech recognition task using a large database, we show the significant benefit of turbo ASR approaches (middle integration) over early integration by feature vector concatenation, which they outperform by about 13% absolute at a signal-to-noise ratio (SNR) of 0 dB.
Published in: Speech Communication; 11. ITG Symposium
Date of Conference: 24-26 September 2014
Date Added to IEEE Xplore: 17 October 2014
Print ISBN: 978-3-8007-3640-9
Conference Location: Erlangen, Germany