
Cross-Modal Mutual Learning for Cued Speech Recognition


Abstract:

Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communication, where the Cued Speech (CS) system uses lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often rely on direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities in CS (i.e., lip movements, hand shape, and hand position) can interfere with such concatenation. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to promote multi-modal interaction. Unlike vanilla self-attention, our model forces the modality-specific information of each modality to pass through a modality-invariant codebook and concatenates the resulting linguistic representations with the tokens of each modality. The shared linguistic knowledge is then used to re-synchronize the multi-modal sequences. Moreover, we establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese; to our knowledge, this is the first work on ACSR for Mandarin Chinese. Extensive experiments on different languages (i.e., Chinese, French, and British English) demonstrate that our model outperforms the state-of-the-art by a large margin.
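
The interaction described in the abstract can be illustrated with a minimal sketch: tokens from each modality query a shared, learnable, modality-invariant codebook via cross-attention, and the retrieved linguistic representation is concatenated back onto that modality's tokens before the streams are re-synchronized. This is not the authors' implementation; the module name, dimensions, and PyTorch realization below are hypothetical.

# Minimal sketch (assumed, not the paper's code): each modality's tokens
# attend to a shared modality-invariant codebook, and the retrieved
# linguistic features are concatenated back onto the tokens.
import torch
import torch.nn as nn

class CodebookInteraction(nn.Module):
    def __init__(self, dim=256, codebook_size=128, heads=4):
        super().__init__()
        # Shared, learnable modality-invariant codebook (hypothetical size).
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # fuse token + retrieved code

    def forward(self, tokens):  # tokens: (batch, seq_len, dim) for one modality
        codes = self.codebook.unsqueeze(0).expand(tokens.size(0), -1, -1)
        # Modality tokens attend to the shared codebook (cross-attention).
        retrieved, _ = self.attn(query=tokens, key=codes, value=codes)
        # Concatenate the shared linguistic representation with the tokens.
        return self.proj(torch.cat([tokens, retrieved], dim=-1))

# Usage: apply the same module to the asynchronous lip / hand-shape /
# hand-position streams so they share one linguistic space before fusion.
interact = CodebookInteraction()
lip = torch.randn(2, 100, 256)        # e.g. 100 lip frames
hand_shape = torch.randn(2, 80, 256)  # e.g. 80 hand frames (asynchronous)
lip_out, hand_out = interact(lip), interact(hand_shape)
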
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023
Conference Location: Rhodes Island, Greece
