CoSLR: Contrastive Chinese Sign Language Recognition with Prior Knowledge and Multi-Task Joint Learning


Abstract:


Based on computer vision, Sign Language Recognition (SLR) has the advantage of transforming a posture video directly into a sentence, compared with methods that collect signals from wearable sensors. However, learning representative features from a multimodal perspective remains challenging. To this end, this study proposes a multi-task joint learning framework termed Contrastive Learning-based Sign Language Recognition Network (CoSLR) for Chinese sign language, which embeds text representation into the general video-based SLR framework. By virtue of the strong representational ability of pre-trained multimodal encoders, they are employed as processing modules to extract features from the original input video and text. Then, contrastive learning between the video representation and the corresponding token embeddings is applied to train the feature extractor. Finally, a linear combination of the contrastive and cross-entropy loss functions drives the end-to-end network to convergence. Experiments show that CoSLR achieves a 1.27% WER, outperforming state-of-the-art methods in the comparison.
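The joint objective described above combines a contrastive loss between paired video and text embeddings with a standard cross-entropy recognition loss. A minimal NumPy sketch of such a combination is given below; the function names, the symmetric InfoNCE form of the contrastive term, the temperature value, and the weighting factor `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of
    matched video/text embedding pairs (illustrative assumption)."""
    # L2-normalize both modalities so logits are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix

    def diag_ce(l):
        # cross-entropy with the matched pair (diagonal) as the target
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the video-to-text and text-to-video directions
    return 0.5 * (diag_ce(logits) + diag_ce(logits.T))

def cross_entropy(logits, labels):
    """Standard cross-entropy over gloss-class logits."""
    l = logits - logits.max(axis=1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
    return -np.mean(logp[np.arange(len(labels)), labels])

def joint_loss(class_logits, labels, video_emb, text_emb, lam=0.5):
    """Linear combination of recognition and contrastive terms;
    lam is a hypothetical weighting hyperparameter."""
    return cross_entropy(class_logits, labels) + lam * info_nce(video_emb, text_emb)
```

In this sketch, well-aligned video/text pairs concentrate probability mass on the diagonal of the similarity matrix, driving the contrastive term toward zero while the cross-entropy term supervises gloss recognition.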
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 18 March 2024
Conference Location: Seoul, Korea, Republic of


1. INTRODUCTION

In contrast to spoken communication in daily life, sign language plays an almost irreplaceable role in the deaf community, conveying meaning through visual signals composed of coherent gestural movements and facial expressions. Computer vision-based continuous Sign Language Recognition (SLR) extracts visual features from the original input and recognizes the corresponding sign language glosses [1], [2]. As the de facto tool in this area, deep learning allows multi-layer networks to be fed with preprocessed vectors and to automatically extract patterns, which has proven effective in recognizing images, video, speech, and audio [3]. It is therefore unsurprising that Deep Neural Networks (DNNs) have brought about dramatic breakthroughs in continuous SLR [4].
