Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space

Continuous Sign Language Recognition (CSLR) refers to the challenging problem of recognizing sign language glosses and their temporal boundaries from weakly annotated video sequences. Previous methods focus mostly on visual feature extraction, neglecting text information and failing to effectively model the intra-gloss dependencies. In this work, a cross-modal learning approach that leverages text information to improve vision-based CSLR is proposed. To this end, two powerful encoding networks are initially used to produce video and text embeddings prior to their mapping and alignment into a joint latent representation. The purpose of the proposed cross-modal alignment is the modelling of intra-gloss dependencies and the creation of more descriptive video-based latent representations for CSLR. The proposed method is trained jointly with video and text latent representations. Finally, the aligned video latent representations are classified using a jointly trained decoder. Extensive experiments on three well-known sign language recognition datasets and comparison with state-of-the-art approaches demonstrate the great potential of the proposed approach.


I. INTRODUCTION
Sign language (SL) is the primary communication tool for deaf-mute people, making use of gestures produced with the body and perceived with the eyes. SLs have independent vocabularies and grammatical structures just like spoken ones [1]. Signs, which have an internal structure similar to spoken words, are characterized by a combination of hand shapes, positions and motion trajectories, orientations of palm and fingers and facial expressions. The closest meaning of a visual sign is a gloss, which is the fundamental building block of SLs.
Sign Language Recognition (SLR) is the task of recognizing glosses from video captures of sign language. SLR is of great significance to the deaf community, as it enables the communication of deaf people with the world, removing accessibility barriers and improving their social inclusion. The SLR tasks can be divided into two categories: isolated sign language recognition (ISLR) [2]-[5] and continuous sign language recognition (CSLR) [6], [7]. In ISLR, the annotation boundaries of the signs in videos are predefined in a similar way to gesture and action classification. On the other hand, CSLR is more challenging than ISLR since only the temporal order of the gloss sequence is given without any prior segmentation information.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
Early machine learning techniques focused mainly on isolated gloss classification or gesture spotting. Such techniques often combined handcrafted features with temporal modelling methods, such as hidden Markov models (HMM) [8], [9] and conditional random fields [10]. Recently, deep learning methods have shown their potential to outperform conventional machine learning approaches in several computer vision tasks, such as gesture recognition and human action recognition [11]-[13]. As a result, current CSLR approaches take advantage of deep learning concepts, such as convolutional neural networks (CNNs) to capture powerful image and video representations and recurrent neural networks (RNNs) to accurately model temporal dependencies, thus improving recognition accuracy [14]-[17]. However, most CSLR methods either extract frame-level representations that are inadequate for modelling the temporal dynamics (present in a gloss) or they fail to effectively model intra-gloss dependencies. Additionally, there is limited research on ways to exploit the relationship between visual content and text information for improving recognition accuracy in CSLR.
To overcome the aforementioned shortcomings, in this work, a novel unified deep learning framework for CSLR is proposed. The proposed approach consists of two encoders that learn the individual video and text embeddings, which are then projected into a joint latent space through linear transformations. A common loss function is used to align the latent representations and minimize their distance, while the final classification of the aligned video latent representations is performed by a jointly trained decoder. During training, both video and text information is employed, while during inference, only video information is used as input.
More specifically, the main contributions of this work are summarized as follows:
• A novel cross-modal learning approach for video-based CSLR is introduced. The proposed method leverages text information to model intra-gloss dependencies and create more descriptive video-based latent representations that improve the recognition accuracy.
• A new approach for the alignment of video and text embeddings using a joint loss function is proposed. The joint loss function aims to minimize the distance between the corresponding embeddings of the two modalities enabling the creation of a common latent representation. The inference of the aligned video latent representations is finally performed by a jointly trained decoder network.
• The proposed approach is evaluated on three challenging sign language recognition datasets and compared with several state-of-the-art CSLR methods, showing promising results.
The remainder of this paper is organized as follows. In Section II related work in SLR is described. The components of the proposed CSLR method and the optimization process are described in Section III and Section IV, respectively. Finally, the implementation details and the experimental results are discussed in Section V, while conclusions are drawn in Section VI.

II. RELATED WORK
Early SLR works relied on handcrafted features such as hand shape, appearance and motion trajectories [2], [18], while recent approaches extract features automatically with deep neural networks. Most CSLR methods consist of a feature extractor followed by a sequence modelling module. In [9], [19], the authors employed a CNN feature extractor, whose output is fed into an HMM for temporal modelling. They used frame-state alignments generated from the HMM to train the CNN. Later, they extended their work by incorporating a long short-term memory (LSTM) unit on top of a CNN [20]. In their most recent work [16], the authors introduced two additional streams of cropped hands and mouth modalities. The full architecture was trained iteratively by frame-state alignments provided by the HMM. However, frame-state alignments can be noisy due to the lack of frame-level ground truth annotations and such methods are forced to make strong initial assumptions on gloss boundaries in order to overcome HMM's limited learning capacity [14].
Other methods make use of connectionist temporal classification (CTC) [21], which is designed for sequence labelling problems, such as speech recognition and handwriting recognition. CTC can effectively deal with weakly labelled data, making it appropriate for continuous SLR. Camgoz et al. [22] were among the first ones who proposed a shallow CNN-LSTM architecture trained end-to-end with CTC. In [23], the authors employed a 2D-CNN-LSTM architecture with CTC loss in parallel with a gloss-detection network to refine predictions. Later, they extended their work using temporal convolutions and a new iterative training scheme, achieving superior performance in CSLR datasets [14]. However, the major weakness of CTC is the conditional independence assumption and therefore it fails to model intra-gloss dependencies.
On the other hand, a crucial issue for SLR is video representation, i.e., the extraction of video embeddings. 3D-CNNs have strong video representation capabilities since, unlike 2D-CNNs, they extract motion features, and they have also been adopted in the CSLR task. Huang et al. [24] proposed a 3D-CNN network along with a hierarchical attention network for recognition. Yang et al. [25] proposed a shallow hybrid CNN with 2D and 3D convolutions followed by two LSTM networks for sequence modelling at gloss and sentence level respectively, which can be trained end-to-end with CTC loss. Pu et al. [26] adopted a 3D-ResNet to extract video representations with stacked dilated temporal convolutions instead of an LSTM to alleviate the problem of backpropagation through a recurrent network. In [17], the authors proposed a framework with a 3D-ResNet integrating an encoder-decoder network with a CTC decoder, jointly trained and aligned with soft-DTW (Dynamic Time Warping) [27]. Pseudo-labels were inferred from the decoders' alignment to train the 3D-CNN. In [28], the authors adopted the I3D architecture from the action recognition field [12] with a gated recurrent unit (GRU) for sequence modelling. The whole architecture was trained iteratively with CTC and a new dynamic pseudo-labelling method. However, training 3D architectures with limited data in a weakly supervised setting is challenging. In [29], the authors used deep temporal convolution layers instead of an RNN to model the short- and long-term dependencies simultaneously. They utilized several classifiers in each temporal convolution layer and fused the predictions in a CTC decoder for increased performance. Guo et al. [30] proposed a hierarchical adaptive recurrent network with temporal pooling and attention-aware weighting mechanisms.
In [31], the authors fused 2D and 3D-CNN features to learn short-term temporal dependencies and introduced a new decoding algorithm, which learns a temporal mapping among features, sign labels and the generated gloss sequence.
Cross-modal methods have been successfully applied to various fields, such as action recognition and video captioning. In [32], the authors employed transfer learning from the image domain to enhance video action recognition, while in [33], the authors proposed a Generative Adversarial Network that learns a common feature space of images and videos to improve recognition accuracy. Finally, in [34], the authors integrated images and videos into a common representation using cross-modal similarity metrics to enhance the action recognition accuracy. In this work, a cross-modal method for CSLR is proposed, which takes advantage of the ability of CTC to handle weakly labeled data, while simultaneously leveraging text information to model intra-gloss dependencies through the cross-modal alignment of video and text embeddings.

III. PROPOSED METHOD
In this work, a video encoder is proposed that consists of a CNN for spatial feature extraction, stacked 1D temporal convolution layers (TCL) for short-term temporal modelling and a bidirectional long short-term memory (BLSTM) layer for global context learning. Furthermore, a text encoder is implemented using a unidirectional LSTM to model the sequences of sign language glosses. The outputs of both encoders are projected into a joint latent space through linear transformations. In addition, alignment is achieved by using a common loss function for minimizing the distance of video and text embeddings. An overview of the proposed approach is depicted in Figure 1. In the remainder of this section, the encoding of each modality is initially formulated and then the joint latent space representation is described.

A. VIDEO ENCODER
The proposed video encoder adopts a 2D-CNN followed by temporal convolution layers to extract spatiotemporal features from the input video. The extracted features are then processed through a BLSTM layer to learn long-term dependencies over all timesteps.

1) FEATURE EXTRACTION
Let $x = (x_1, \cdots, x_T)$ be the input frame sequence of length $T$, where $x_\tau$ is the $\tau$-th frame of the video sequence. The CNN, represented as function $F_{CNN}$, extracts a spatial representation $f_\tau = F_{CNN}(x_\tau) \in \mathbb{R}^{D_{CNN}}$ from each frame, where $D_{CNN}$ is the feature dimension of the CNN. Therefore, all features are represented as follows:
$$f = (F_{CNN}(x_1), \cdots, F_{CNN}(x_T))$$
The feature sequence $f \in \mathbb{R}^{T \times D_{CNN}}$ is processed by the TCL module, represented by the function $F_{TCL}$, modelling the temporal dependencies between adjacent frames. The TCL module consists of stacked 1D convolutions and pooling layers that learn short-term temporal dependencies between frames. The receptive field of the TCL module depends on the layers' filter size $k$, pooling size $p$, stride $s$ and dilation factor $d$. The TCL module extracts a spatiotemporal feature sequence represented as:
$$r = F_{TCL}(f)$$
where $r \in \mathbb{R}^{T' \times D_{TCL}}$, $D_{TCL}$ is the feature dimension of the TCL module and $T' = T/\sigma$ is the length of the extracted spatiotemporal sequence, with $\sigma$ depending on the receptive field of the TCL module.

2) SEQUENCE MODELLING
Recurrent neural networks have been successfully applied to many sequence-to-sequence problems, such as speech recognition and neural machine translation. In contrast to traditional RNNs, LSTMs are able to learn long-term temporal dependencies while avoiding the vanishing gradients caused by backpropagation through all timesteps. However, unidirectional LSTMs compute the current output based only on previous timesteps. In CSLR, the signed video is mapped to a sentence with grammatical rules, meaning that each sign depends on the previous and succeeding context. To this end, a BLSTM layer is chosen instead of a unidirectional LSTM layer to learn the complete sequential information over all timesteps. Using $R$ to represent the BLSTM layer with $H$ hidden units, the outputs are computed as:
$$h^{\nu} = R(r)$$
where $h^{\nu} \in \mathbb{R}^{T' \times H}$ are the concatenated forward and backward hidden state sequences. The concatenated hidden state sequence is passed through a fully connected layer (FC) and a softmax layer that produce the gloss label probabilities from a given vocabulary of $C$ classes:
$$g^{\nu} = \mathrm{softmax}(\mathrm{FC}(h^{\nu}))$$
where $g^{\nu} \in [0, 1]^{T' \times C}$ is the output probability distribution among $C$ classes.

B. TEXT ENCODER
The proposed text encoder is an RNN Language Model (RNNLM) [35], [36] that models the probability of a word occurrence under the condition of its previous words in a sentence, i.e., it aims to learn the structure and syntax of sign language. The model maximizes the log-likelihood of the target sentence given the hidden states and the previous words. The text encoder employs a word embedding layer and an LSTM layer with $H_{text}$ hidden units. Each gloss $y_k$ of the input sentence is passed through a word embedding layer, which is a fully connected layer that learns a linear projection from discrete gloss categories to a denser vector denoted as $we_k$. In other words, the gloss $y_k$, which is represented by a unique one-hot vector, is transformed into a continuous vector with smaller dimension compared to the gloss vocabulary size. The hidden state of the LSTM layer $h^{y}_{k}$ encapsulates the history of the sentence up to gloss $y_k$, i.e., all previous words. The hidden states are generated as follows:
$$h^{y}_{k} = \mathrm{LSTM}(we_k, h^{y}_{k-1})$$
where $h^{y} \in \mathbb{R}^{K \times H_{text}}$ and $K$ is the length of the sentence. Then, the hidden states of the LSTM layer are passed through a fully connected and a softmax layer, denoted as $LM$, to output the gloss label probabilities as:
$$g^{y} = LM(h^{y})$$
where $g^{y} \in [0, 1]^{K \times C}$ is the output probability distribution among $C$ classes.

C. JOINT LATENT SPACE
The hidden states $h^{\nu} = \{h^{\nu}_{t}\}_{t=1}^{T'}$ of the video encoder and the hidden states $h^{y} = \{h^{y}_{k}\}_{k=1}^{K}$ of the text encoder are mapped into the joint latent space through two mapping networks, V2E and T2E, respectively. Each mapping network consists of a fully connected layer that computes the following latent representations:
$$e^{\nu} = \mathrm{V2E}(h^{\nu}), \qquad e^{y} = \mathrm{T2E}(h^{y})$$
where $Z$ is the latent space dimension, $e^{\nu} \in \mathbb{R}^{T' \times Z}$ and $e^{y} \in \mathbb{R}^{K \times Z}$ are the video and text representations in the joint latent space, respectively.
The above latent representations are passed through a joint decoder that consists of a fully connected layer ($\mathrm{FC}_{joint}$) and a softmax layer to obtain gloss probabilities. Both modalities share the same joint decoder weights to enforce a common representation between them:
$$g^{\nu}_{joint} = \mathrm{softmax}(\mathrm{FC}_{joint}(e^{\nu})), \qquad g^{y}_{joint} = \mathrm{softmax}(\mathrm{FC}_{joint}(e^{y}))$$
where $g^{\nu}_{joint} \in [0, 1]^{T' \times C}$ and $g^{y}_{joint} \in [0, 1]^{K \times C}$ are the output probability distributions computed from the video and text latent representations, respectively.
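The data flow through the two mapping networks and the shared joint decoder can be illustrated with a minimal sketch. It uses plain Python lists instead of a deep learning framework, and the helper names (`make_linear`, `linear`, `softmax`), the random initialisation and the toy sizes are assumptions made only for illustration; they are not the configuration used in the experiments.

```python
import math
import random

random.seed(0)

def make_linear(n_in, n_out):
    """Randomly initialised fully connected layer, stored as (weight, bias)."""
    weight = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def linear(layer, x):
    """Apply a fully connected layer to a single vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes (hypothetical): video hidden size, text hidden size, latent size, classes.
H, H_TEXT, Z, C = 6, 4, 5, 3
v2e = make_linear(H, Z)            # video-to-embedding mapping network
t2e = make_linear(H_TEXT, Z)       # text-to-embedding mapping network
joint_decoder = make_linear(Z, C)  # a single set of weights shared by both modalities

h_video = [[random.random() for _ in range(H)] for _ in range(7)]      # 7 video time-steps
h_text = [[random.random() for _ in range(H_TEXT)] for _ in range(2)]  # 2 glosses

e_video = [linear(v2e, h) for h in h_video]   # 7 x Z latent video representations
e_text = [linear(t2e, h) for h in h_text]     # 2 x Z latent text representations
g_video_joint = [softmax(linear(joint_decoder, e)) for e in e_video]  # 7 x C
g_text_joint = [softmax(linear(joint_decoder, e)) for e in e_text]    # 2 x C
```

Because both modalities pass through the same `joint_decoder`, any gradient that improves the decoding of text embeddings also shapes how video embeddings are decoded, which is the motivation for sharing the weights.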

IV. OPTIMIZATION

A. LEARNING EMBEDDINGS
The proposed framework employs a CTC loss function to train the video encoder given the frame sequence and the joint decoder given the video latent space representations, respectively. The objective of using the CTC loss is to maximize the sum of probabilities of all possible mappings between input and target sequences. CTC extends the vocabulary $C$ with a blank label '−', representing the silence or transition between two consecutive labels. The extended vocabulary can be defined as $V = C \cup \{blank\}$. Given a frame sequence $x = \{x_\tau\}_{\tau=1}^{T}$ of length $T$, the proposed framework outputs two gloss probability distributions $g^{\nu}$ and $g^{\nu}_{joint}$ of length $T'$ to predict the corresponding sequence of target glosses $y = \{y_k\}_{k=1}^{K}$ of length $K$. The emission probability $p(j, t|x)$ of label $j$ at time-step $t$ is denoted as $g_{j,t}$ and can be modelled either from the video encoder or the joint decoder. An alignment path is defined as $\pi = \{\pi_t\}_{t=1}^{T'}$, where label $\pi_t \in V$. The posterior probability of a CTC alignment path $\pi$ is defined as:
$$p(\pi|x) = \prod_{t=1}^{T'} g_{\pi_t,t}$$
The alignment path $\pi$ is mapped to the target sequence $y$ with a many-to-one mapping operation $\mathcal{B}$ that removes repeated labels and blanks from the given path. Subsequently, an inverse operation $\mathcal{B}^{-1}(y) = \{\pi \,|\, \mathcal{B}(\pi) = y\}$ is used that represents all the possible alignments corresponding to target labels $y$. The conditional probability of $y$ is defined as the sum of the probabilities of all corresponding paths $\pi$:
$$p(y|x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi|x)$$
Furthermore, to allow for blank labels in the computed alignment paths, a modified label sequence $y'$ of length $K' = 2K + 1$ is defined and used as target sequence in the proposed method by adding blanks before and after each label in $y$. Since a single labelling can be derived from a huge number of paths, a method is required to efficiently calculate $p(y|x)$. CTC employs dynamic programming to compute the sum over different paths for a single labelling iteratively, using forward and backward variables $\alpha \in \mathbb{R}^{T' \times K'}$ and $\beta \in \mathbb{R}^{T' \times K'}$, respectively.
The total probability $\alpha_{t,s}$ of $y'_{1:s}$ (i.e., the first $s$ symbols of the modified label sequence $y'$) at time-step $t$ is defined as:
$$\alpha_{t,s} = \sum_{\substack{\pi_{1:t}:\ \mathcal{B}(\pi_{1:t}) = \mathcal{B}(y'_{1:s}) \\ \pi_t = y'_s}} \ \prod_{t'=1}^{t} g_{\pi_{t'},t'}$$
and correspondingly the total probability $\beta_{t,s}$ of $y'_{s:K'}$ at time-step $t$ is equal to:
$$\beta_{t,s} = \sum_{\substack{\pi_{t:T'}:\ \mathcal{B}(\pi_{t:T'}) = \mathcal{B}(y'_{s:K'}) \\ \pi_t = y'_s}} \ \prod_{t'=t}^{T'} g_{\pi_{t'},t'}$$
Therefore, $p(y|x)$ for any $t$ is calculated as follows:
$$p(y|x) = \sum_{s=1}^{K'} \frac{\alpha_{t,s}\,\beta_{t,s}}{g_{y'_s,t}}$$
The objective function $L_{CTC}$ that guides the training process is derived from the principle of maximum likelihood [37] and is used to optimize the video encoder. The loss function of the video encoder is formulated as:
$$L_{CTC} = -\log p(y|x)$$
The text encoder is used as a language model. The objective is to maximize the probability of the current word given the previous hidden states and it is trained using the cross-entropy criterion denoted as $L_{LM}$.
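The dynamic programming idea can be made concrete with a small, self-contained sketch. It implements the standard CTC forward recursion over the blank-augmented label sequence and checks it against a brute-force sum over every path that the many-to-one mapping collapses to the target. The function names, the choice of index 0 for the blank and the toy emission matrix are assumptions for illustration; a practical implementation would work in log space for numerical stability.

```python
from itertools import product

BLANK = 0  # assumed index of the blank label in the extended vocabulary

def collapse(path):
    """Many-to-one mapping: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

def ctc_forward(probs, target):
    """p(y|x) from the forward variables; probs[t][j] is the emission of label j at step t."""
    ext = [BLANK]
    for label in target:
        ext += [label, BLANK]  # modified sequence of length 2K + 1
    steps, size = len(probs), len(ext)
    alpha = [[0.0] * size for _ in range(steps)]
    alpha[0][0] = probs[0][ext[0]]
    if size > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, steps):
        for s in range(size):
            total = alpha[t - 1][s]
            if s >= 1:
                total += alpha[t - 1][s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]
    # Valid paths end in the last label or in the trailing blank.
    return alpha[-1][-1] + (alpha[-1][-2] if size > 1 else 0.0)

def ctc_brute_force(probs, target):
    """Reference value: sum the path probability over every path that collapses to target."""
    steps, vocab = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(vocab), repeat=steps):
        if collapse(path) == list(target):
            p = 1.0
            for t, label in enumerate(path):
                p *= probs[t][label]
            total += p
    return total

# Toy emissions for 4 time-steps over {blank, gloss 1, gloss 2}.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]
print(ctc_forward(probs, [1, 2]), ctc_brute_force(probs, [1, 2]))
```

The brute-force version enumerates all paths and is therefore exponential in the sequence length, whereas the forward recursion needs only one pass over the time-steps and the modified label sequence.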

B. LATENT SPACE ALIGNMENT
The cross-modal alignment aims to jointly encode and project video and text information into a common latent space by minimizing the distance between video and text embeddings. Due to the different length of video and text embedding sequences, the alignment paths are calculated from the non-blank probabilities $\hat{\alpha}$ and $\hat{\beta} \in \mathbb{R}^{T' \times K}$ of the CTC forward-backward algorithm [38]. The non-blank probabilities $\hat{\alpha}$ and $\hat{\beta}$ are calculated from $\alpha$ and $\beta$, respectively, by removing the probabilities that correspond to blank labels of the modified label sequence $y'$. Then, the soft alignments $w \in \mathbb{R}^{T' \times K}$ are defined from $\hat{\alpha}$ and $\hat{\beta}$. Intuitively, $w_{t,k}$ is the probability of gloss $k$ in target sequence $y$ occurring at time-step $t$ and is used as a weighting factor between the possible alignments. To minimize the distance of the video and text latent representations, a mapping loss is defined as:
$$L_{map} = \sum_{t=1}^{T'} \sum_{k=1}^{K} w_{t,k}\, d(e^{\nu}_{t}, e^{y}_{k})$$
with $d(e^{\nu}_{t}, e^{y}_{k}) = \| e^{\nu}_{t} - e^{y}_{k} \|_2$ being the Euclidean distance between two vectors. The $L_{map}$ function is illustrated in Figure 2. The purpose of the $L_{map}$ loss function is to drive the video and text embeddings closer to one another using the weighting factor $w$ computed by the CTC alignment. The factor $w_{t,k}$ is a probability in the range of $[0, 1]$ that expresses the degree to which the predicted gloss at time $t$ matches the ground truth gloss $y_k$. When $w_{t,k}$ is high, the corresponding video segment at time $t$ matches the target gloss $y_k$ and $L_{map}$ is significantly affected by the Euclidean distance between the video and text embeddings in an attempt to bring them closer. When $w_{t,k}$ is close to 0, the corresponding video segment at time $t$ is not matched with the target gloss $y_k$ and $L_{map}$ is only slightly affected by the Euclidean distance between the video and text embeddings. In this way, $L_{map}$ aligns the video and text embeddings only when the alignment is meaningful (i.e., the predicted sequence is close to the ground truth and $w_{t,k}$ is close to 1).
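The effect of the weighting factor can be seen in a short sketch of the mapping loss. It assumes the loss is a plain weighted sum of Euclidean distances with no additional normalisation, and it takes the soft alignments `w` as a given input rather than deriving them from the forward-backward variables; the function name and the toy embeddings are hypothetical.

```python
import math

def mapping_loss(video_emb, text_emb, w):
    """Sum of Euclidean distances between video and text embeddings, weighted by w[t][k]."""
    loss = 0.0
    for t, e_v in enumerate(video_emb):
        for k, e_y in enumerate(text_emb):
            loss += w[t][k] * math.dist(e_v, e_y)
    return loss

# Two video time-steps and two glosses in a 2-dimensional latent space (toy values).
video_emb = [[0.0, 0.0], [3.0, 4.0]]
text_emb = [[0.0, 0.0], [3.0, 0.0]]

aligned = [[1.0, 0.0], [0.0, 1.0]]  # step 0 matched to gloss 0, step 1 to gloss 1
swapped = [[0.0, 1.0], [1.0, 0.0]]  # the opposite assignment

print(mapping_loss(video_emb, text_emb, aligned))  # 4.0
print(mapping_loss(video_emb, text_emb, swapped))  # 8.0
```

Only pairs with a non-zero weight contribute, so the gradient pulls a video embedding towards a text embedding in proportion to how strongly the CTC alignment associates that time-step with that gloss.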
The joint decoder is trained using $L_{CTC}$ and $L_{LM}$ for the video and text latent representations, respectively. The video and text encoders, the latent space mapping networks and the joint decoder are jointly trained with the following objective function:
$$L_{joint} = L_{map} + a L_{CTC} + b L_{LM} \qquad (22)$$
where $a$, $b$ are tunable hyperparameters to balance the effect of each latent representation in the training procedure.

C. OPTIMIZATION STRATEGY
In this work, a two-stage optimization process is followed. It has been shown that training the video encoder end-to-end only with $L_{CTC}$ has limited contribution to the parameters of the CNN, as the gradients vanish after backpropagation through the BLSTM layer due to the chain rule of backpropagation [17], [28]. At the first stage, the proposed video encoder (i.e., 2D-CNN, TCL, BLSTM and Decoder-Classifier modules) is optimized with $L_{CTC}$. At the second stage, the feature extractor (i.e., 2D-CNN and TCL modules) of the video encoder is optimized using pseudo-labels generated from the soft alignments $w$ with cross-entropy loss as a stronger supervision. Then, the video encoder learns a better video representation and generates more accurate pseudo-labels. The two stages are performed iteratively until no further improvement in recognition error is observed. Both video and text encoders are trained until convergence. Then, the latent space mapping modules are optimized with $L_{map}$ to align and learn the embeddings. After removing the two decoders-classifiers from the video and text encoders, the full architecture (including the latent space and the joint decoder) is trained with the $L_{joint}$ loss to fine-tune the proposed CSLR method.
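One simple way to turn soft alignments into pseudo-labels for the second stage is sketched below. The text does not specify how the pseudo-labels are derived, so taking the most probable gloss at each time-step (a hard argmax) is an assumption of this sketch, as are the function name and the toy values; it also ignores blank handling.

```python
def pseudo_labels(w, target):
    """Assign to each time-step the target gloss with the largest soft-alignment weight."""
    labels = []
    for row in w:
        best_k = max(range(len(row)), key=row.__getitem__)
        labels.append(target[best_k])
    return labels

# Three time-steps aligned against a two-gloss target (toy values).
w = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
print(pseudo_labels(w, ["A", "B"]))  # ['A', 'A', 'B']
```

Labels obtained this way give every time-step its own target, which is what allows the feature extractor to be trained with a per-step cross-entropy loss instead of a sequence-level one.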

V. EXPERIMENTS
In this section, the implementation details of the proposed method are initially described. Then, experimental results on three well-known CSLR datasets are presented and discussed.

A. EVALUATION
The proposed method is evaluated on three publicly available datasets, RWTH-Phoenix-Weather-2014 [6], RWTH-Phoenix-Weather-2014T [39] and CSL [24]. To evaluate performance in CSLR datasets, the word error rate (WER) metric has been adopted, which quantifies the mismatch between predicted and ground truth gloss sequences. WER calculates the least number of operations needed to transform the aligned predicted sequence to the ground truth and can be defined as:
$$WER = \frac{S + D + I}{N}$$
where $S$ is the total number of substitutions, $D$ is the total number of deletions, $I$ is the total number of insertions and $N$ is the total number of glosses in the ground truth.
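As an illustration of the metric, the sketch below computes WER with the standard edit-distance recursion, where the minimum number of substitutions, deletions and insertions is found by dynamic programming. The function name and the example gloss sequences are made up for illustration, and a non-empty reference is assumed.

```python
def wer(reference, hypothesis):
    """Word error rate between two gloss sequences given as lists of strings."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: minimum number of edits turning hypothesis[:j] into reference[:i]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m] / n

reference = ["MORGEN", "REGEN", "NORD", "WIND"]
hypothesis = ["MORGEN", "SONNE", "NORD"]
print(wer(reference, hypothesis))  # 0.5 (one substitution and one deletion over 4 glosses)
```

Since insertions are counted in the numerator but not in the denominator, WER can exceed 100% when the prediction is much longer than the ground truth.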

B. IMPLEMENTATION DETAILS
For the proposed video encoder, a 2D-CNN (BN-Inception) network [40] is used that is initialized with weights pretrained on the ImageNet dataset. The kernel and stride sizes of the TCL module are manually tuned to approximately cover the average gloss duration. TCL has two 1D convolutional layers with 1024 filters and two max-pooling layers. For the CSL dataset, the convolutional layers have kernel sizes equal to 7, while the pooling layers have kernel and stride sizes equal to 3 and cover the average gloss duration of 58 frames. For the RWTH-Phoenix-Weather-2014T and RWTH-Phoenix-Weather-2014 datasets, the convolutional layers of the TCL module have kernel sizes equal to 5, while the pooling layers have kernel and stride size equal to 2, resulting in a receptive field of 16 frames. The BLSTM layer consists of 2 LSTMs with 512 hidden units each. The text encoder has 1 LSTM layer with 512 hidden units. It should be noted that a BLSTM layer can also be adopted for modelling the text information. However, in the experiments the performance was similar and an LSTM was chosen due to its smaller computational complexity.
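The receptive field quoted for the RWTH-Phoenix-Weather settings can be checked with the usual recursion for stacked 1D layers. The sketch assumes the layers are ordered convolution, pooling, convolution, pooling, that the convolutions use stride 1, and that no dilation is applied; the function name is hypothetical.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) for stacked 1D layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (kernel - 1) steps of size `jump`
        jump *= stride                # distance in input frames between neighbouring outputs
    return field, jump

# Assumed ordering: conv(5) -> pool(2, stride 2) -> conv(5) -> pool(2, stride 2)
phoenix_tcl = [(5, 1), (2, 2), (5, 1), (2, 2)]
print(receptive_field(phoenix_tcl))  # (16, 4)
```

Under these assumptions each output step of the TCL module sees 16 input frames, and the cumulative stride of 4 is the factor by which the sequence length is reduced. Other kernel and pooling settings can be evaluated by passing a different list of pairs.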
The following data processing techniques are used for all datasets. Each frame is resized to 256 × 256 and cropped at a random position to a fixed size of 224 × 224. Random temporal frame sampling is used up to 80% of video length. Brightness, contrast, saturation and hue values of frames are randomly jittered up to 10%. The full architecture is trained with the Adam optimizer with an initial learning rate $\lambda_0 = 5 \times 10^{-5}$ and a batch size of 1 because of the computational cost and the fact that each video sequence consists of a different number of frames. Long videos are downsampled to a maximum length of 250 frames, if necessary. The learning rate is decreased by a factor of 0.5 when validation loss starts to plateau. The training process lasts 10 epochs for the CSL dataset and 20 epochs for the other two datasets. The proposed method is implemented in PyTorch and the experiments are conducted on an NVIDIA GeForce GTX-1080-Ti GPU.

C. RESULTS
To define the optimal hyperparameters of the network and study the effectiveness of each module, extensive experiments are conducted using the RWTH-Phoenix-Weather-2014 dataset, which is the most popular CSLR dataset.
Initially, the relationship between performance and dimensionality of the joint latent space is investigated. To this end, experimental results with different latent space sizes are presented in Table 1, showing that increasing the size of the joint latent space further reduces the WER. In the experiments, a latent space dimensionality of 1024, instead of 2048, is chosen since further increase leads to a decrease in WERs of only 0.2% and 0.1% on Dev set (i.e., validation set) and Test set, respectively, but at the cost of a higher number of parameters and slower training speed. Subsequently, to evaluate the effectiveness of loss functions $L_{map}$ and $L_{joint}$, a series of experiments are conducted. At first, it is observed that when $L_{map}$ is introduced in the early stages of training, performance degrades by 5% in WER. The main reason is that the network produces unstable probability distributions at that point. Therefore, $L_{map}$ is introduced at a later training stage when CTC has already converged. As shown in Table 2, the overall CSLR performance is improved by 1.5% when $L_{map}$ is introduced at a later training stage. This means that by bringing video and text embeddings closer using $L_{map}$, the intra-gloss dependencies are effectively modelled, decreasing the WER of the network, despite any additional errors that the text encoder may introduce.
After learning the joint latent space, the joint decoder is trained with $L_{joint}$ using video and text latent representations. Finally, the outputs of the joint decoder when it is fed with the video latent representations are used for CSLR. In order to set optimal hyperparameters $a$ and $b$, experiments using different values for the two hyperparameters are conducted. Note that when $a = 0$ or $b = 0$ the joint decoder is trained only using the text or video embeddings, respectively. In the case of training the joint decoder using only text embeddings ($a = 0$, $b = 1$), the CSLR performance was not satisfactory, with a WER of 87.0% on Dev set and 88.1% on Test set. However, when training the joint decoder using only video embeddings ($a = 1$, $b = 0$), a CSLR performance with a WER of 24.5% on Dev set and 24.4% on Test set is reached. After varying the hyperparameters $a$, $b$ in the range $[0, 1]$, the optimal values are set to 0.9 and 0.1, respectively, with WER of 23.9% on Dev set and 24.0% on Test set. Further increase in the contribution of the $L_{LM}$ loss function (e.g., $b = 0.2$) was found to decrease the performance of the proposed method (with WER 24.9% and 24.8% on Test set for $a = 0.9$ and 0.8, respectively).

1) EVALUATION ON THE RWTH-PHOENIX-WEATHER-2014 DATASET
In this section, the proposed method is evaluated on the RWTH-Phoenix-Weather-2014 dataset, which contains recordings of weather forecasts. Videos are recorded with 9 different signers at a frame rate of 25 frames per second. The vocabulary size is 1295 and the dataset contains 5672, 540 and 629 videos for training, validation and testing respectively. In Table 3, the proposed approach is compared to several state-of-the-art approaches. It can be observed that the proposed method outperforms all state-of-the-art approaches, achieving a WER of 24.0% on the Test set. This indicates the advantage of exploring the correlation between sentence semantics and video. More specifically, the proposed method reduces WER by 0.4% with respect to CNN-TEMP-RNN [14], which justifies the importance of modelling intra-gloss dependencies. Furthermore, the proposed method reduces WER by 2% and 2.8% with respect to Re-Sign [20] and CNN-LSTM-HMM [16] methods, respectively, that use HMM frame-state alignments for network training. An example of alignment paths between video and text embeddings is shown in Figure 3. Each video embedding is aligned to its corresponding text embedding. The proposed method minimizes the distance of video and text embeddings using the alignment path. Qualitative recognition results with different model settings are shown in Figure 4. It can be observed that the video encoder without the latent space alignment is more prone to recognition errors. However, when introducing the joint latent space to combine and align video and text embeddings, the network yields better recognition results, while the use of the joint decoder leads to an even better performance.

2) EVALUATION ON THE RWTH-PHOENIX-WEATHER-2014T DATASET
RWTH-Phoenix-Weather-2014T [39] is an extended database of RWTH-Phoenix-Weather-2014, providing spoken language translations and gloss level annotations for German sign language videos of weather broadcasts. It contains 8257 videos from 9 different signers performing 1088 unique signs. The spoken language translations consist of 2887 different words. All videos are recorded with 25 frames per second and resolution of 210 × 260. The dataset is divided into three splits for training, validation and testing and there is no overlap with the previous version of the dataset in any split. As shown in Table 4, the proposed method achieves a WER of 24.1% on Dev set and 24.3% WER on Test set. The proposed method achieves a relative reduction in WER by 9.0% on the Test set compared to the CNN-LSTM-HMM method [16] that adopts HMM for temporal modelling and uses frame-level alignments.

3) EVALUATION ON THE CSL DATASET
The Chinese Sign Language (CSL) dataset [24] is a popular SLR dataset with a smaller vocabulary compared to RWTH-Phoenix-Weather-2014. Videos are recorded in a predefined laboratory environment with Chinese words widely used in daily conversations. It contains 100 sentences performed 5 times from 50 signers with 25000 videos in total. The signer independent split of train and test set in [32] is adopted, meaning that videos from 40 and 10 signers are used for training and testing, respectively. The dataset also provides an isolated version that contains 500 unique words. The proposed method is pretrained on the isolated version of the dataset achieving similar performance to other methods without time-consuming iterations. In Table 5, the proposed method is compared against several state-of-the-art approaches evaluated on the CSL dataset. The proposed method shows again superior performance achieving 2.4% WER, i.e., a 2.3% absolute reduction (approximately 49% relative) compared to the DPD method [28] that uses a deep 3D-CNN architecture.

VI. CONCLUSION
In this paper, a novel deep learning method for continuous sign language recognition was introduced. In contrast to previous state-of-the-art approaches, the proposed method applies a cross-modal alignment between video and text embeddings to better model the intra-gloss dependencies in sign language recognition. Experimental results on the three most widely used CSLR datasets demonstrate the ability of the proposed method to provide highly accurate CSLR results.
Concerning future work, integrating other modalities, such as cropped hands, optical flow and skeletal keypoints can also be explored. The incorporation of additional modalities in a joint latent space could further enhance CSLR performance. Finally, it would be interesting to extend the proposed method for Sign Language Translation and exploit the relationship of video, sign language and spoken language simultaneously.

PETROS DARAS (Senior Member, IEEE) received the Diploma degree in electrical and computer engineering and the M.Sc. and Ph.D. degrees in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 1999, 2002, and 2005, respectively. He is currently a Senior Researcher Grade B (Associate Professor) and the Chair of the Visual Computing Lab. His main research interests include visual content processing, multimedia indexing, search engines, recommendation algorithms, and relevance feedback. His involvement with those research areas has led to the coauthoring of more than 150 articles in refereed journals and international conferences. He has been involved in more than 20 projects, funded by the EC and the Greek Ministry of Research and Technology. Among them, he is the Technical Manager of the EC projects VICTORY, I-SEARCH, and ADVISE. He regularly acts as a Reviewer of the European Commission and the GSRT.