Multimodal ASR by Turbo Decoding vs. Feature Concatenation: Where to Perform Information Integration? | VDE Conference Publication | IEEE Xplore


Abstract:

Incorporating visual information has been shown to be an effective way to make automatic speech recognition (ASR) robust against environmental interference. However, the optimal stage of information integration in multimodal speech processing remains an open question. Although results from multiple-in-single-out (MISO) mobile communications favor early integration, multimodal ASR may suffer from early integration because of the inherent asynchrony of audio and video features. In this paper we investigate whether early or middle integration performs better in multimodal ASR by comparing feature concatenation and turbo decoding approaches. On an audio-visual speech recognition task using a large database, we show a significant benefit of turbo ASR approaches (middle integration) over early-integration feature vector concatenation: the turbo approaches outperform concatenation by about 13% absolute at a signal-to-noise ratio (SNR) of 0 dB.
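
To illustrate what early integration by feature concatenation means in practice, the following minimal sketch joins an audio and a video feature stream frame by frame. It is not the paper's implementation: the frame counts, feature dimensions, the repeat-based upsampling of the video stream, and the names `upsample_repeat` and `concatenate_features` are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]


def upsample_repeat(frames: List[Frame], target_len: int) -> List[Frame]:
    """Stretch a feature stream to target_len frames by repeating frames.

    Assumption: the video stream has fewer frames than the audio stream,
    so each video frame is reused for several audio frames.
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    return [frames[(i * n) // target_len] for i in range(target_len)]


def concatenate_features(audio: List[Frame], video: List[Frame]) -> List[Frame]:
    """Early integration: build one joint vector per audio frame."""
    video_aligned = upsample_repeat(video, len(audio))
    # List addition appends the video features to the audio features.
    return [a + v for a, v in zip(audio, video_aligned)]


# Hypothetical toy streams: 4 audio frames (2-dim), 2 video frames (1-dim).
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
video = [[1.0], [2.0]]
joint = concatenate_features(audio, video)
```

Because every joint vector pairs an audio frame with a fixed video frame, this scheme imposes a rigid frame-level alignment on the two streams, which is where the audio-video asynchrony mentioned in the abstract could become a problem.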
Date of Conference: 24-26 September 2014
Date Added to IEEE Xplore: 17 October 2014
Print ISBN: 978-3-8007-3640-9
Conference Location: Erlangen, Germany
