Collaborative Viseme Subword and End-to-End Modeling for Word-Level Lip Reading


Abstract:

We propose a viseme subword modeling (VSM) approach to improve the generalizability and interpretability of deep neural network based lip reading. A comprehensive analysis of preliminary experimental results reveals that the conventional end-to-end (E2E) and proposed VSM frameworks are complementary, especially with respect to speaker head movements. To increase lip reading accuracy, we propose hybrid viseme subwords and end-to-end modeling (HVSEM), which exploits the strengths of both approaches through multitask learning. As an extension of HVSEM, we also propose collaborative viseme subword and end-to-end modeling (CVSEM), which further explores the synergy between the VSM and E2E frameworks by integrating a state-mapped temporal mask (SMTM) into joint modeling. Experimental evaluations using different model backbones on both the LRW and LRW-1000 datasets confirm the superior performance and generalizability of the proposed frameworks. Specifically, VSM outperforms the baseline E2E framework, and HVSEM, which combines VSM and E2E modeling, outperforms VSM alone. Building on HVSEM, CVSEM achieves accuracies of 90.75% and 58.89% on LRW and LRW-1000, respectively, setting new benchmarks for both datasets.
Published in: IEEE Transactions on Multimedia (Volume: 26)
Page(s): 9358 - 9371
Date of Publication: 17 April 2024


Cites in Papers - IEEE (1)

1. Chen-Yue Zhang, Hang Chen, Jun Du, Sabato Marco Siniscalchi, Ya Jiang, Chin-Hui Lee, "Summary on the Chat-Scenario Chinese Lipreading (ChatCLR) Challenge", 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp.1-6, 2024.

Cites in Papers - Other Publishers (1)

1. Yinuo Ma, Xiao Sun, "Spatiotemporal Feature Enhancement for Lip-Reading: A Survey", Applied Sciences, vol.15, no.8, pp.4142, 2025.
