Development of an End-to-End Deep Learning Framework for Sign Language Recognition, Translation, and Video Generation

Recent developments in deep learning have reached new heights across various domains and applications. The recognition, translation, and video generation of Sign Language (SL), however, still pose major challenges from a development perspective. Although earlier approaches have made numerous advances, model performance still falls short in recognition accuracy and visual quality. In this paper, we introduce novel approaches for developing a complete framework that handles SL recognition, translation, and production tasks in real-time cases. To achieve higher recognition accuracy, we use the MediaPipe library and a hybrid Convolutional Neural Network + Bi-directional Long Short Term Memory (CNN + Bi-LSTM) model for pose detail extraction and text generation. The production of sign gesture videos for given spoken sentences is implemented using a hybrid Neural Machine Translation (NMT) + MediaPipe + Dynamic Generative Adversarial Network (GAN) model. The proposed model addresses various complexities present in existing approaches and achieves above 95% classification accuracy. In addition, the model performance is tested in various phases of development, and the evaluation metrics show noticeable improvements. The model has been evaluated on different multilingual benchmark sign corpora and produces strong results in terms of recognition accuracy and visual quality. The proposed model achieves an average Bilingual Evaluation Understudy (BLEU) score of 38.06, remarkable human evaluation scores, an average Fréchet Inception Distance to videos (FID2vid) score of 3.46, an average Structural Similarity Index Measure (SSIM) value of 0.921, an average Inception Score of 8.4, an average Peak Signal-to-Noise Ratio (PSNR) score of 29.73, an average Fréchet Inception Distance (FID) score of 14.06, and an average Temporal Consistency Metric (TCM) score of 0.715, which support the effectiveness of the proposed work.


I. INTRODUCTION
Communication is essential for all humans to express their needs and interact with other people. Based on recent studies, various researchers have found an interesting and unique style of communication in sign language across different countries. Sign languages are visual and coordinate manual and non-manual human components. They greatly support the hard-of-hearing and speech-impaired community in obtaining education, jobs, and societal rights. The governments of various nations have amended multiple acts to standardize sign language for the benefit of the hard-of-hearing and speech-impaired community. Since sign language plays an important role in the communication of hard-of-hearing and speech-impaired people, understanding and responding to it requires additional training and knowledge on the part of normal people. This creates a communication gap between ordinary people and the impaired community. Recent advancements in deep learning techniques handle such tasks efficiently by encompassing numerous mechanisms and mathematical approaches. The development of such systems, however, involves considerable complexities in various phases, such as misclassification, self-occlusion, movement epenthesis, ambiguity, noise, and blurred output. We investigated all these challenges in a novel way to provide a better solution and aimed to build a powerful architecture that delivers greater performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Agostino Forestiero.
The emergence of deep learning techniques has reached all fields and demonstrated their strength in robust model development. Deep learning techniques produce impressive results in areas such as agriculture [1], anomaly detection [2], activity recognition [3], business analysis [4], [5], crop selection [6], defect monitoring [7], DNA systems [8], earth analysis [9], fraud detection [10], genomic prediction [11], human activity recognition [12], image classification [13], job matching [14], kinematic analysis [15], location prediction [16], medical systems [17], [18], [19], [20], network traffic analysis [21], number plate recognition [22], object detection [23], predictive maintenance [24], quality control [25], robotics [26], stock prediction [27], time series data analysis [28], text generation [29], unmanned vehicle path finding [30], vehicle monitoring [31], weather forecasting [32], x-ray imaging [33], YouTube video analysis [34], and zone segmentation [35]. These developments strongly motivate us to pursue research in the deep learning area. Deep learning models are highly powerful and have produced notable achievements in a wide range of applications. However, due to their complex structures and large number of layers, training the models and achieving high accuracy create additional challenges during model development. These factors limit their applicability to building powerful models for complex tasks. We propose a Hybrid Deep Neural Architecture (H-DNA) which integrates the sign language recognition, translation, and video generation tasks into a single application, as shown in Figure 1. The H-DNA is designed to learn the different modalities of sign gestures in a signer-independent environment. This helps the model to understand the underlying complex relationship between the input and output.
The experimental results demonstrate the effectiveness of the proposed work in terms of recognition accuracy and visual quality. To achieve greater flexibility and simultaneous processing of gesture sequences, we use attention mechanisms and mathematical approaches to enhance the performance of the deep model. To support these claims, we show sample output screens and outcomes of the proposed methods in Section 4. The main goal of deep neural networks is to mimic the functions of the human brain and thereby deliver strong performance over a wide range of tasks and diverse domain applications. Research studies on such implementations provide detailed information about the layers, hyperparameters, and advancements.
In this paper, we introduce a method that leverages new advancements in deep neural networks to produce plausible results in translation and video generation tasks. The proposed approach further extends to developing user interface based applications for handling real-time cases and potentially addresses various challenges of existing approaches. Our contributions include the creation of Indian Sign Language (ISL) sentence level video datasets using multiple signers, without involving any specialized components such as color gloves and sensors. We used a digital SLR camera and web camera devices for recording the gesture videos. The proposed H-DNA systems are capable of producing high quality videos from spoken sentence input, and of processing sign gestures and translating them into spoken text. This two-way mechanism is found to be superior to existing developments and produces comparatively better results in recognition and translation tasks. The experimental results have been plotted to showcase the performance of the proposed model on different sign corpora. We enlisted 50 student and staff volunteers to evaluate the model's performance and tabulated their evaluation scores across different parameters. Overall, the proposed H-DNA systems are designed and implemented to handle the various limitations of traditional approaches and yield better results in Sign Language Recognition and Translation (SLRT) tasks.
Although numerous developments have been made for Chinese Sign Language (CSL), American Sign Language (ASL), and German Sign Language (GSL), model performance is still lacking in continuous cases and fails to handle real-time inputs.
The proposed H-DNA systems facilitate real-time and accurate recognition of multimodal and multilingual sign gestures. This opens an opportunity to develop robust applications that handle the sign languages of various countries and provides solutions for the communication gap that exists between the normal and impaired communities. The proposed work has been developed as a User Interface (UI) application for handling multilingual inputs, recognizing multimodal sign gestures, generating sign videos, and providing accurate results in translation and recognition tasks. To meet these expectations, the model went through various stages of development to handle multimodal features and variations across multilingual sign corpora. The proposed model has been trained using 40K videos for continuous recognition and 35K images for word level recognition tasks. The proposed system offers solutions for real-time interactions of hard-of-hearing and speech-impaired people with normal people.
The detailed investigation and various refinements of Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), and Generative Adversarial Network (GAN) models yield better translation results and generate high quality photorealistic videos. Sign language plays a vital role in the communication of the hard-of-hearing and speech-impaired community because of their difficulty in reading and writing the native language. Although various studies have dealt with SLRT research, the earlier developments have their own limitations and still cannot be used for continuous cases. Some research has been successful in recognizing sign language, but it requires an expensive setup and sensor devices. The tracking and recognition of specialized multimodal gesture signs is crucial, especially in recognizing signs of different languages (multilingual). Research on SL recognition focuses on the translation of sign gestures into English sentences and produces the text transcription for the sequence of signs. This is due to the misconception that deaf people are comfortable with reading spoken language and therefore do not require translation into sign language. To facilitate easy and clear communication between the hearing and the impaired community, it is vital to build robust systems that can translate spoken languages into sign languages and vice versa. This two-way process can be facilitated using sign language recognition, translation, and video generation. With this motivation, the proposed approach intends to develop and build a novel H-DNA framework for SL recognition and translation systems and to enhance the interactive communication between the normal and impaired community.
To the best of our knowledge, the proposed H-DNA is the first unified deep learning framework which addresses two different problem dimensions in SL: Sign Language Recognition (SLR) and Sign Language Translation (SLT). Using Neural Machine Translation (NMT), the MediaPipe library, and Dynamic GAN, the proposed H-DNA is developed to generate high resolution videos. The proposed work simplifies the translation of spoken text to subunit signs and then defines the mapping between glosses and sign gesture images using the open pose library. The SL videos are then produced using the Dynamic GAN model. On the other hand, using CNN, LSTM, and the MediaPipe library, the proposed H-DNA recognizes multilingual datasets comprising isolated signs and continuous sign sentences by considering multimodal features. The H-DNA was developed and implemented on GPU-powered workstations. The collection of benchmark datasets and the recording of our own datasets are carried out as the first steps in implementation. To evaluate the performance of the proposed H-DNA, the experimentation is organized in three folds. The first fold deals with SL recognition, and the second fold focuses on SL video generation. The SL recognition model achieves an accuracy of not less than 98% and shows the improved performance of the proposed H-DNA. Criteria such as robustness, flexibility, and scalability are considered in the third fold. We summarize the overall objectives of the proposed work as follows:
• To create and integrate heterogeneous data sources and to build a novel knowledge base consisting of multilingual and multimodal sign sentences with minimal sign glosses and skeletal level annotations by breaking down the signs into dedicated subunits.
• To augment and generate sign videos based on subunits from spoken language sentences to facilitate communication between normal and impaired (hard-of-hearing and speech-impaired) communities.
• To track and recognize the signs consisting of isolated words and continuous sign sentences including manual (one-handed and two-handed signs) and non-manual gestures in real-time scenarios.
• To build a novel application with end-to-end video generation and recognition capabilities by sharing the qualitative and quantitative results of generated sign sequences without using animated avatars or sensors, and to ensure accuracy with minimal cost.
The rest of the paper is organized as follows. Section 2 reviews earlier developments, identifies the research gap in SLRT research, and examines advancements in the various phases of development. Section 3 explains the proposed system and provides details about the model development. Section 4 presents the experimental outcomes of the proposed model, and finally, the conclusion and future work section summarizes the proposed work.

II. RELATED WORK
Sign language communication demonstrates the power of human intelligence through hand actions and movements. Rather than relying on a single component (the hand), it involves numerous components of the upper body, such as head, mouth, and gaze movements, to convey the real meaning of gesture sequences in real time. Sign languages are made up of visual actions and do not have a unique pattern for identifying their motion sequences. They follow different styles based on each country's nature and culture. Understanding and processing such inputs is extremely difficult for traditional machine learning approaches. Sign language mainly supports the hard-of-hearing and speech-impaired community in obtaining benefits such as education and employment and in engaging in societal activities. Numerous research efforts have been made to produce better translation models. The real-time recognition and translation of sign languages requires careful investigation of various features to produce plausible output without misclassification or wrong sign output.
Progress in deep learning approaches has reached new heights and produces excellent results in computer vision and human action recognition applications. The introduction of hybrid models and ensembling techniques advances the capabilities of such models to handle tedious tasks. We investigated recent research on CNN, LSTM, GRU, and GAN techniques related to SL recognition, translation, and video generation tasks, which helps to introduce novel contributions for building a powerful framework.
Barbhuiya et al. [36] proposed CNN+SVM based hand gesture recognition methods for static signs. This approach mainly deals with alphabets and numerals. Aly et al. [37] proposed a system for handling the words of Arabic SL using DeepLabv3+ gesture segmentation techniques and Bi-LSTM. An ASL recognition system for 26 alphabet level sign gesture recognition tasks was proposed by Lee et al. [38]; it uses LSTM with KNN techniques to provide higher recognition results. This work deals with word-level sign language communications. In addition, Xiao et al. [39] introduced continuous SL recognition using NMT approaches. Elakkiya et al. [40] proposed an SL recognition framework using GAN+3D-CNN+LSTM techniques. This approach uses a deep reinforcement learning based evaluation strategy to produce highly accurate results. Details of the earlier literature are shown in Table 1, covering advancements for different sign languages such as American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (GSL). The conventional sensor-based approaches demand extra equipment to be worn by the signer. The use of data gloves, color gloves, depth cameras, and leap motion controllers creates additional overhead for the signer to communicate normally and poses considerable limitations [41]. Although these approaches give good prediction results, they drastically lose scope in real-time applications. In addition, they create discomfort for children and normal people during conversation.
The optimization of hyperparameter values and the imposition of various constraints produce plausible outcomes and attract researchers. An early CNN model introduced by Chen and Koltun [43] produces images from semantic layouts. The model investigates different loss functions and produces photographic results. Its performance becomes a bottleneck when handling large-scale images, which adds various intrinsic challenges. Similarly, Oord et al. [44] discussed the development of gating mechanism based PixelCNN models for image generation. The model has been evaluated using the CIFAR-10 and ImageNet datasets. The model applies different conditions on embedding features to produce quality image generation results, but extending it to videos creates additional overhead. Although numerous advancements were made in the research work [45], the produced sign gesture videos are blurred and their spatial details are incoherent.
Models such as FUNIT [46], StarGAN [47], StarGAN v2 [48], MoCoGAN [49], LPGAN [50], InfoGAN [51], pix2pix [52], and CycleGAN [53] handle image generation and video production tasks efficiently. Since SL communication involves various manual and non-manual cues, including facial, eye, gaze, and mouth expressions, it demands further advancement of these earlier approaches. In addition, the ordering of gesture sequences varies greatly from English sentence order. To address these challenges, we introduce a novel approach for aligning the frame sequences and generating the intermediary frames between the sign gesture images. The proposed model deals with the various nuances of SL gestures and their components and produces plausible outcomes. GAN networks are found to be highly capable of producing plausible results across a wide range of domains, with applications such as security [54], [55], baggage inspection [56], infected leaf identification [57], covid-19 prediction [58], agriculture [59], business process monitoring [60], Brain MRI synthesis [61], flood visualization [62], estimating the standards of gold [63], ECG wave synthesis [64], Internet of Things (IoT) [65], and Dengue Fever sampling [66].
The two major components of GAN networks are the generator and the discriminator, and they play a vital role in image or video generation. The discriminating capability of the discriminator helps to produce high quality videos in diverse domains and is further investigated in the proposed work for the qualitative production of sign gesture videos. The incorporation of CNN models with conditional GAN networks produces substantial improvements in video generation quality and efficiently handles the various details present in an image or video. Based on the discriminator's classification, the generator network undergoes a fine-tuned training process to produce photorealistic results. Mirza and Osindero [67] introduced the conditional GAN network model by applying constraints on label information. We use this approach in our work to produce videos based on the conditioned labeling approach. The advent of DynamicGAN models addresses the existing challenges by using strided mechanisms in convolution operations to produce improved results. The video generation process using GAN networks incorporates additional approaches to produce photorealistic videos and keeps the spatial details coherent and clear.
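For reference, the label-conditioned generator and discriminator game described above is commonly written as the following minimax objective. This is the general conditional GAN formulation rather than the exact loss used in the proposed work; the symbols $x$ (real sample), $z$ (noise vector), $y$ (conditioning label), $G$, and $D$ are notation assumed here for illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big]
```

The discriminator is trained to raise both terms, while the generator is trained to lower the second term by producing samples that the discriminator accepts for the given label.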
Although an enormous amount of research is under way in the field of sign language recognition, translation, and generation systems, the existing systems still face many challenges. The primary challenge for sign language recognition and generation systems is the lack of a large-scale open-source Indian Sign Language dataset recorded under natural conditions. To overcome this issue, we have developed a multi-signer, multimodal sign language dataset and have provided it as an open-source resource for further research purposes. For sign language recognition, we have built the recognition model so that it detects signs irrespective of complex backgrounds, multi-modality, signer skin tone, signer clothing constraints, sign speed, etc., which are the major drawbacks in the existing systems. For sign generation, we have considered the limitation to small vocabularies, model performance improvement, low model complexity, proper alignment of the key points, signs in the spatial domain, etc.

III. THE PROPOSED H-DNA SYSTEM
The proposed hybrid H-DNA framework comprises various phases of development, such as SL recognition, conversion of multilingual sentences into sign words, pose estimation using MediaPipe, and SL video generation. The framework aims to integrate all these modules and provide a real-time solution to SLRT research challenges. Neural Machine Translation (NMT) is the process of translating sentences from one language into another. It uses artificial neural networks to yield accurate translation results. The identification of human poses in images or videos is performed using the MediaPipe library, which helps to predict human poses in various environments. Pose estimation is based on a number of key points on the human body. It uses the Part Affinity Fields approach. The VGG-19 model is used for classifying the different gesture styles. It uses 3 × 3 filters in the convolution layers. The convolution layers provide a feature map by scanning the image features. The role of the pooling layers is to reduce the information generated by the convolution layers. The fully connected layer is used to vectorize the output as a single array. The incorporation of the dynamic GAN [86] provides high quality video generation results by combining approaches such as frame generation and video completion techniques. The LSTM network is used for predicting the text equivalents of the sign gestures and further helps to produce the language sentences. The following subsections present the technical details and summarize the strengths of each technique.
This section explains the implementation details of the proposed H-DNA framework. In the first fold, we developed the SL recognition model using the MediaPipe library and the VGG-19 model. Furthermore, we incorporate the Bi-LSTM network for text generation. In SLR, the input continuous gesture sequences are processed by the MediaPipe library to capture the pose sequences, angles between fingers, hand movements and locations, orientations, mouth expressions, and facial actions. Based on these key points, the VGG-19 model estimates the class of gestures. Combining CNN and LSTM networks in this hybrid way produces higher recognition accuracy and noticeable performance. The temporal details are analyzed sequentially to predict the translation text without misclassification. We trained our model using 40,000 videos for 320 classes to provide wider support for multilingual sign corpora comprising multimodal features. Sample gesture images from our own ISL-CSLTR dataset [42] are shown in Figure 2 and greatly support ISL-related SLRT research. In general, SL video generation is treated as a highly intensive task because it produces sign gesture videos from English sentences. The qualitative production of SL videos for new input sentences poses various levels of difficulty when the manual and non-manual cues of the signers are considered. Such a translation process demands careful attention at each step to produce high quality results. The emergence of various deep generative models has secured new milestones in photorealistic image generation and video production.
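One of the key point features mentioned above, the angle between fingers, can be illustrated with a small geometric sketch. The function name `joint_angle` and the sample coordinates are hypothetical and chosen only for illustration; the sketch assumes that each landmark is available as a 2D (x, y) pair, as a hand tracking library would typically provide, and does not reproduce the feature extraction used in the proposed work.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at vertex b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("landmarks must be distinct from the vertex")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating point drift before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical normalized landmark coordinates (not taken from any dataset):
wrist = (0.50, 0.90)
index_tip = (0.40, 0.30)
middle_tip = (0.55, 0.25)

# Spread between the index and middle fingers, measured at the wrist.
spread = joint_angle(index_tip, wrist, middle_tip)
```

Computing such angles per frame gives a compact, scale-independent descriptor that can accompany the raw key points when a gesture is classified.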

A. SL RECOGNITION
In the first phase, SL recognition is developed using hybrid CNN + Bi-LSTM techniques. The main objective of this hybrid approach is sequence prediction in SL videos. The CNN layers are used for gesture class identification and the LSTM networks for predicting the class sequence. The combination of these two networks processes the spatio-temporal details of SL input videos and produces the text output. The first segment uses CNN layers, whose output is further processed by the Bi-LSTM networks with dense layers to yield plausible results. We used the VGG-19 model [68], which consists of 16 convolution layers and 3 fully connected layers. The CNN network processes images of size 224 × 224, and the first and second layers are convolutional layers. They use 3 × 3 filters with stride 1. The max pooling operation is performed using stride 2 and a window size of 2 × 2. After this process, the dimensions are reduced to 112 × 112 × 64. Further convolution layers of varying sizes (128, 56, 28) are applied, which reduce the size of the image and focus on the important features. The fully connected layers summarize all classes of inputs, and the softmax layer produces the prediction probabilities. The network is trained on 35K images of 192 classes representing different gesture poses for different words. After completion of the preprocessing steps, the high resolution videos (1920 × 1080) are converted into NumPy arrays for easier processing using the skvideo package. Each class of sign gestures is recorded with 50 repetitions to improve the learning and prediction performance of the model. The key point based pose information is captured in parallel to preserve the gesticulation details and to assist the classification and prediction tasks. The incorporation of the pre-trained CNN model VGG-19 helps to automate the SL recognition tasks.
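The feature map sizes quoted above can be checked with simple shape arithmetic. The sketch below assumes the standard VGG-19 configuration: a 224 × 224 input (the size consistent with the 112 × 112 × 64 map after the first pooling step), 3 × 3 convolutions with stride 1 and padding 1, and 2 × 2 max pooling with stride 2. The padding value and the per-block layer counts are assumptions taken from the standard architecture, not from this paper, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - window) // stride + 1


# (number of conv layers, output channels) per block in standard VGG-19.
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def vgg19_feature_shapes(input_size=224):
    """Return the (height, width, channels) shape after each pooling stage."""
    shapes = []
    size = input_size
    for n_convs, channels in VGG19_BLOCKS:
        for _ in range(n_convs):
            size = conv_out(size)   # 3x3, stride 1, padding 1 keeps the size
        size = pool_out(size)       # 2x2, stride 2 halves the size
        shapes.append((size, size, channels))
    return shapes


shapes = vgg19_feature_shapes(224)
```

The five blocks contain 2 + 2 + 4 + 4 + 4 = 16 convolution layers, matching the count given for VGG-19, and the first pooled map is 112 × 112 × 64. Under the same assumptions, a 254 × 254 input would instead give 127 × 127 × 64 after the first pooling step.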
Basic operations such as convolution and max pooling are applied repeatedly to learn the finer details of the images. The VGG-19 model provides better classification results on the multilingual sign language datasets. The results are passed to the Bi-LSTM networks to predict the target sentences that match the video sequences. The intermediary feature map results of the proposed hybrid CNN+Bi-LSTM model are shown in Figure 3. The VGG-19 model produces the vector representation of images and the classification results. Based on this input, the LSTM layers process the information and generate the textual descriptions. In this context, the textual descriptions are language sentences that match the gesture sequences. The entire CNN model is wrapped in time-distributed layers to handle multiple inputs at different time steps. The LSTM units are trained with back propagation, which updates the weight and bias values, while hyperparameters such as the learning rate and batch size are tuned to build a powerful framework. We set the learning rate to 0.01 and the batch size to 64. The LSTM networks [69] are found to be powerful components in text generation, image captioning, and machine translation tasks. The LSTM network has three gates: (i) the input gate, (ii) the output gate, and (iii) the forget gate. A separate memory cell is added, and the network handles a higher number of layers than a GRU. The forget gate decides which information is discarded from memory and uses a sigmoid activation function to squash the values between zero and one. As a result, values multiplied by 0 become zero and are removed. The input gate updates the cell state when processing new inputs. The memory cell retains the information at time step t. The output gate finalizes the information to be output from the model.
The forget gate function is represented using Equation (1). The cell state is key to the LSTM network: it passes through the entire chain of LSTM modules and is governed by the aforementioned three gates. The forget gate decides which information is thrown away from memory. The role of the input gate is to provide the desired inputs and update the cell state values. The output gate produces the text results. We use sigmoid and tanh activation functions to produce plausible outcomes. We use a bi-directional LSTM approach to handle the text generation tasks efficiently. The proposed hybrid CNN + Bi-LSTM based SL recognition system architecture is shown in Figure 4. The LSTM network is regarded as a strong method for sequential tasks. It provides a solution to the vanishing gradient problem. Because it handles longer sequential inputs, it is applied in domains such as image captioning, text generation, and time series based applications. The LSTM network was introduced by Cho et al. [70]. The forget gate ($fr_t$) operations are represented using Equation 1. The gate decides which information is discarded from the cell state by applying the sigmoid activation function: the value 1 represents keeping the information and 0 denotes its removal. The general equations describing the various operations of the LSTM network are stated in Equations 1 to 6.
VOLUME 10, 2022
The next step processes the sequence of inputs and decides the next information to be fed into the cell state. The input gate, represented by Equation 2, determines the next value to be updated in the cell state. Equation 3 then gives the vector of candidate values.
Equation 4 updates the cell state to its new values ($Ca_t$) using multiplication operations that refine the old cell state values.
The output gate operations are given by Equation 5 and Equation 6. They decide the information to be passed as output.
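For reference, a standard LSTM formulation that matches the roles described for Equations 1 to 6 can be written as follows. This is the widely used textbook form and is offered as an assumption about the intended equations, not as the authors' exact notation. The symbols $fr_t$, $Ca_t$, $h_{t-1}$, and $In_t$ follow the text; the input gate $i_t$, the candidate $\widetilde{Ca}_t$, the output gate $o_t$, and the weights $W$ and biases $b$ are names introduced here for illustration.

```latex
\begin{align}
fr_t &= \sigma\left(W_f\,[h_{t-1}, In_t] + b_f\right) \tag{1}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, In_t] + b_i\right) \tag{2}\\
\widetilde{Ca}_t &= \tanh\left(W_c\,[h_{t-1}, In_t] + b_c\right) \tag{3}\\
Ca_t &= fr_t \odot Ca_{t-1} + i_t \odot \widetilde{Ca}_t \tag{4}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, In_t] + b_o\right) \tag{5}\\
h_t &= o_t \odot \tanh\left(Ca_t\right) \tag{6}
\end{align}
```

Here $[h_{t-1}, In_t]$ denotes the concatenation of the previous hidden state and the current input, and $\odot$ denotes elementwise multiplication.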
The proposed hybrid CNN + Bi-LSTM technique is shown in Figure 5, which details our implementation as a step-by-step procedure and gives an overview of the execution of VGG-19 model training. The LSTM network operations that produce the language sentence output are elaborated in the rest of the sections. The LSTM network uses the memory cell component explicitly, and the cell state regulates which information is kept in or discarded from memory. During each iteration, the LSTM network processes the previous hidden state values ($h_{t-1}$), the current input values ($In_t$), and the previous cell state values ($Ca_{t-1}$). The weight and bias vectors are updated regularly during back propagation to produce accurate translation results. We use the Adam optimizer and dropout regularization to obtain better results on the benchmark datasets.
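The way the three inputs of an iteration (previous hidden state, current input, previous cell state) combine can be sketched with a single scalar LSTM step. This is a minimal illustration of the standard LSTM cell, not the trained Bi-LSTM of the proposed work; the function `lstm_step`, the parameter dictionary layout, and all numeric values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step.

    params maps a gate name to a tuple (w_x, w_h, b): the input weight,
    the recurrent weight, and the bias of that gate.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("forget"))        # share of old cell state kept
    i_t = sigmoid(pre_activation("input"))         # share of candidate admitted
    c_hat = math.tanh(pre_activation("candidate"))  # candidate cell value
    c_t = f_t * c_prev + i_t * c_hat               # updated cell state
    o_t = sigmoid(pre_activation("output"))        # share of cell state exposed
    h_t = o_t * math.tanh(c_t)                     # new hidden state
    return h_t, c_t


# Hypothetical parameters: a strongly negative forget bias discards memory.
forgetful = {
    "forget": (0.0, 0.0, -50.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.5, 0.5, 0.0),
}
_, c_from_high = lstm_step(1.0, 0.2, 5.0, forgetful)
_, c_from_low = lstm_step(1.0, 0.2, -5.0, forgetful)
```

With the forget gate driven towards 0, the new cell state is almost the same regardless of the previous cell state, which is the discarding behavior described for the forget gate. Driving it towards 1 while closing the input gate would instead carry the previous cell state forward nearly unchanged.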

B. MULTILINGUAL SENTENCES INTO SIGN WORD CONVERSION
This section explains the translation of language sentences into sign words using NMT and an attention mechanism. Conventional NMT techniques have shown appreciable performance in language translation tasks. We use a hybrid NMT + Attention mechanism for translating the multilingual sentences into sign words. The NMT technique uses RNNs and their variants to process longer sequences and produces better results in different domain applications. We introduce a novel deep-stacked GRU technique in machine translation tasks to achieve better translation results on multilingual input sentences. The translation process is carried out using the following steps. The first step deals with the text preprocessing of the spoken sentences. The spoken sentences are cleaned by removing special characters, punctuation marks, and symbols. We add the <START> token at the beginning of each sentence and the <END> token at the end. This helps the model learn where to start and stop. Word embedding techniques are used to convert the tokens into dense vectors, which are passed to the next level. The proposed deep-stacked GRU technique efficiently handles the translation tasks and produces accurate results. The GRU networks use two gates: (i) the update gate and (ii) the reset gate. The update gate governs the information to be newly added and the reset gate regulates the information to be kept or thrown away. The following equations describe the various operations of GRU units.
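The text preprocessing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `preprocess`, `build_vocab`, and `encode` are hypothetical, whitespace tokenization is assumed, and the integer ids merely stand in for the indices that a word embedding layer would look up.

```python
import re

START, END = "<START>", "<END>"


def preprocess(sentence):
    """Strip special characters and punctuation, then add boundary tokens."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    # \w is Unicode-aware in Python 3, so non-English letters are preserved.
    cleaned = re.sub(r"[^\w\s]|_", " ", sentence)
    return [START] + cleaned.split() + [END]


def build_vocab(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to the integer ids consumed by an embedding layer."""
    return [vocab[token] for token in tokens]


tokens = preprocess("Hello, how are you?")
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
```

The boundary tokens give the decoder an explicit first input and an explicit stopping signal, which is the role the <START> and <END> tokens play in the description above.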
The deep-stacked GRU units are chained modules that are executed iteratively to produce the sequential outputs. The input value at the current step is denoted $x_t$, and the previous hidden state is denoted $h_{t-1}$. The operation of the update gate ($Z_t$) is given by Equation 7.
In the first part, the current input value ($x_t$) is multiplied by its weights (W); in the second part, the previous hidden state values ($h_{t-1}$) are multiplied by their weights (U); the two results are then summed to give the new value of the update gate. The sigmoid ($\sigma$) activation function is applied to the result to squash it into the range zero to one. The update gate determines how much information is passed to the next state. The reset gate decides which information to remove, based on the importance of a particular vector for predicting the next sequences. The operation of the reset gate is given by Equation 8.
The reset gate ($r_t$) combines the product of the input ($x_t$) and its weights (W) with the product of the previous hidden state values ($h_{t-1}$) and their weights (U). The sigmoid activation is applied to the result. The current values ($h_{cur}$) to be held in the memory unit are computed using Equation 9.
$$h_{cur} = \tanh\left(W x_t + r_t \odot U h_{t-1}\right)$$
The current input and previous hidden state values are multiplied by their weights. The Hadamard product (elementwise multiplication) is taken between the reset gate and the weighted previous hidden state values. Finally, the non-linear activation function tanh is applied to the result. The final value recorded in the memory unit ($h_f$) at time step t is computed using Equation 10.
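The gate operations described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each gate is given its own weight matrices (`W_z`, `U_z`, and so on), bias terms are omitted, and the final interpolation between the previous state and the candidate state follows one common GRU convention, since Equation 10 is not reproduced in this text. The function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x_t: Vector, h_prev: Vector, p: Dict[str, Matrix]) -> Vector:
    """One GRU time step; p holds W_z, U_z, W_r, U_r, W_h, U_h."""
    # Update gate: sigmoid(W x_t + U h_{t-1})
    z_t = sigmoid([a + b for a, b in
                   zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))])
    # Reset gate: same form, with its own weights
    r_t = sigmoid([a + b for a, b in
                   zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))])
    # Candidate state: tanh(W x_t + r_t (Hadamard) U h_{t-1})
    uh = matvec(p["U_h"], h_prev)
    h_cur = [math.tanh(a + r * b)
             for a, r, b in zip(matvec(p["W_h"], x_t), r_t, uh)]
    # Assumed final state: z_t * h_{t-1} + (1 - z_t) * h_cur
    return [z * hp + (1.0 - z) * hc for z, hp, hc in zip(z_t, h_prev, h_cur)]

def stacked_gru_step(x_t: Vector, h_prevs: List[Vector],
                     layers: List[Dict[str, Matrix]]) -> List[Vector]:
    """Feed each layer's new hidden state to the layer above it."""
    new_states, layer_input = [], x_t
    for h_prev, params in zip(h_prevs, layers):
        layer_input = gru_step(layer_input, h_prev, params)
        new_states.append(layer_input)
    return new_states
```

With all weights set to zero, both gates evaluate to 0.5 and the candidate state to 0, so each layer simply halves its previous hidden state, which is a convenient sanity check.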
The deep-stacked approach provides better results over a wider range of applications and reduces the computational complexity of the model. The deep-stacked GRU comprises several GRU blocks and performs the model training in parallel. The detailed structure of the deep-stacked GRU units is depicted in Figure 6. Further, we incorporate the attention mechanism proposed by Bahdanau et al. [71]. The attention mechanism focuses on the particular context in the encoder unit that matches the target translation, to yield high-quality results. The cyclic execution of the deep-stacked GRU units is shown in Figure 7.
The GRU units process the spoken sentence input using an encoder-decoder approach. The encoder network processes the source sentences. The attention vector is estimated by concatenating the context vector and the previous output. Finally, the decoder network produces the target sign gloss output. The proposed hybrid NMT + Attention model is evaluated on three benchmark sign corpus datasets, namely the RWTH-PHOENIX-Weather 2014T dataset [72], the How2Sign dataset [73], and the ISL-CSLTR dataset [41], and the results are shown in Section IV. The attention weights are computed using Equation 11.
The context vector is calculated by using Equation 12.
Bahdanau's attention vector is calculated using Equation 13.
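The three attention quantities referred to above (attention weights, context vector, attention vector) can be sketched as follows. Because Equations 11 to 13 are not reproduced in this text, the sketch assumes the standard additive scoring form, $v^\top \tanh(W_s s + W_h h_i)$; the parameter names `W_s`, `W_h`, and `v` and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores: Vector) -> Vector:
    top = max(scores)                      # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(s_prev: Vector, enc_states: List[Vector],
                       W_s: Matrix, W_h: Matrix,
                       v: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoder step."""
    projected_s = matvec(W_s, s_prev)
    scores = []
    for h_i in enc_states:
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_s, matvec(W_h, h_i))]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states
    context = [sum(w * h[k] for w, h in zip(weights, enc_states))
               for k in range(dim)]
    return weights, context

def attention_vector(context: Vector, prev_output: Vector) -> Vector:
    """Concatenate the context vector with the previous output."""
    return context + prev_output
```

When all scores are equal, the weights are uniform and the context vector reduces to the mean of the encoder states.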
The proposed deep-stacked GRU algorithm uses stacked layers of GRU to process the sequential inputs effectively and translate them into the target form. We apply the Bahdanau et al. [71] attention mechanism to compute distinct context vector values. The recursive nature of the GRU processes the entire source sentence and translates it into the target sentence. We use a beam size of 10 and the tanh and sigmoid activation functions. In total, the proposed model processes 40k sentences, combining multilingual sign corpora collected from different sources.
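The text gives a beam size of 10 but does not describe the decoding procedure, so the following is only a generic beam search sketch. The `step_fn` callback, which returns next-token probabilities for a partial sequence, is hypothetical and would be backed by the decoder network in practice; `toy_step` is a hand-made table used purely for illustration.

```python
import math
from typing import Callable, Dict, List

def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                start: str = "<START>", end: str = "<END>",
                beam_size: int = 10, max_len: int = 20) -> List[str]:
    """Keep the beam_size best partial sequences by log-probability."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((seq, score))   # complete hypothesis
            else:
                beams.append((seq, score))      # still being extended
        if not beams:
            break
    pool = finished or beams or [([start], 0.0)]
    return max(pool, key=lambda item: item[1])[0]

# Toy next-token table standing in for a trained decoder.
TOY_TABLE = {
    "<START>": {"HELLO": 0.6, "WORLD": 0.4},
    "HELLO": {"WORLD": 0.9, "<END>": 0.1},
    "WORLD": {"<END>": 1.0},
}

def toy_step(seq: List[str]) -> Dict[str, float]:
    return TOY_TABLE.get(seq[-1], {})

if __name__ == "__main__":
    print(beam_search(toy_step, beam_size=10))
```

This sketch does not apply length normalization, so shorter hypotheses are favoured when probabilities are otherwise similar.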

C. POSE ESTIMATION USING MEDIAPIPE
The MediaPipe library was developed to provide human pose estimation on image and video files. The framework is reported to be effective for tracking human activity in public environments, sign gesture pose recognition, fraud monitoring, and yoga pose analysis. We use the MediaPipe library to estimate the poses and key points of different signers, which are then used to generate new poses with the deep generative networks. Sample results of the MediaPipe library are shown in Figure 8.

D. SL VIDEO GENERATION
The sign gesture video generation tasks are performed using deep generative models. We introduce the novel Dynamic GAN network for producing plausible, photo-realistic, high-quality videos. Video generation involves a series of step-by-step approaches to produce high-quality results. We carefully investigated various mathematical models and deep generative frameworks to develop the novel framework. Advances in GAN networks have made them proficient at generating high-quality images and videos. GAN networks also synthesize medical images efficiently.
We adopt the conditional GAN model [67] as the base framework for our proposed Dynamic GAN model. Furthermore, we use the VGG-19 pre-trained CNN for sign gesture classification. Techniques such as intermediary frame generation, deblurring and image alignment, pixel normalization, and video completion are added to the generator network to produce photo-realistic, high-quality sign gesture videos. The integrated architecture for translating multilingual sentences into sign videos is shown in Figure 9. The GAN consists of two units, the generator and the discriminator. The generator produces new images or videos from a noise distribution; the latent space captures details of the real data, on the basis of which the new images or videos are produced. The conditional GAN model uses conditioning labels to produce sharp images. The generated results are verified by the discriminator, which classifies samples as real or fake, as shown in Figure 10. Depending on the classification results, the generator fine-tunes its performance to produce plausible images and videos. We use a U-Net-like framework [74] to learn the structure of the real data distribution, from which the target pose images are generated. The encoder network applies convolution, batch normalization, and the Leaky ReLU activation function. The decoder network applies transposed convolution, batch normalization, dropout regularization, and finally the ReLU activation function. The loss for the generator network is computed using the sigmoid cross-entropy loss. In addition, the L1 loss, which is the mean absolute error between the real and generated results, aids in producing high-quality results.
The discriminator unit incorporates the PatchGAN [52] classification technique to discriminate between real and fake samples. Convolution, batch normalization, and the Leaky ReLU activation function are applied sequentially. The discriminator network estimates the realness of the generated results, using the sigmoid cross-entropy loss function to measure the quality of the generated results against the real ones. The proposed Dynamic GAN model is implemented in a high-end GPU-based environment. A Dell Precision 7820 Tower workstation is used for the entire development process. It comprises a pair of Intel Xeon Silver 4210 2.2 GHz processors with 10 cores. An Nvidia Quadro RTX 4000 provides GPU support for model training. We use batch normalization and Adam optimization with the values $\alpha$ = 2e-4, $\beta_1$ = 0.5, and $\beta_2$ = 0.999. We set the batch size to 128, the dropout to 0.01, and the initial learning rate to 0.01. The Leaky ReLU parameter is set to 0.1, and ReLU activation functions are further applied. The mini-batch size is set to 100 and the momentum to 0.05. The proposed Dynamic GAN framework is evaluated on multilingual sign corpora, namely the RWTH-PHOENIX-Weather 2014T dataset, the ISL-CSLTR dataset, and the How2Sign dataset. The results are shown in Section IV.
We use the Mean Squared Error (MSE) metric to evaluate the loss of the generator network outputs, as stated in Equation 14.
The sigmoid cross-entropy loss combines the sigmoid activation function with the cross-entropy loss function. Because the loss is computed independently for each output, the result for one output does not affect the result for another.
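The generator and discriminator losses described in this section can be sketched as follows, in the style of conditional image-to-image GANs. The weighting factor `lam` between the adversarial and L1 terms is an assumption (the text does not state one), images are represented as flat lists of pixel values for simplicity, and the function names are hypothetical.

```python
import math
from typing import List

def sigmoid_cross_entropy(logit: float, label: float) -> float:
    """Numerically stable form: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

def generator_loss(fake_logits: List[float], generated: List[float],
                   target: List[float], lam: float = 100.0) -> float:
    """Adversarial term (fool the discriminator) plus weighted L1 term."""
    adversarial = mean([sigmoid_cross_entropy(x, 1.0) for x in fake_logits])
    l1 = mean([abs(g - t) for g, t in zip(generated, target)])
    return adversarial + lam * l1   # lam is an assumed weighting

def discriminator_loss(real_logits: List[float],
                       fake_logits: List[float]) -> float:
    """Real patches should score 1, generated patches 0."""
    real = mean([sigmoid_cross_entropy(x, 1.0) for x in real_logits])
    fake = mean([sigmoid_cross_entropy(x, 0.0) for x in fake_logits])
    return real + fake
```

With a PatchGAN discriminator, `real_logits` and `fake_logits` would hold one logit per image patch, so each patch contributes its own independent cross-entropy term.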

E. DATASET
RWTH-PHOENIX-Weather 2014T dataset: This dataset supports SLRT research for German sign language [72]. It consists of 40k sentence-level videos. The videos were recorded with nine native signers.
ISL-CSLTR dataset: The ISL-CSLTR dataset was published by the researchers [41] to support SLRT research in Indian sign language. It consists of 700 videos covering 100 sentences. The videos were recorded with seven different signers.
How2Sign dataset: The How2Sign dataset [73] supports SLRT research for American Sign Language. It consists of 2,456 sentence-level videos. The videos were recorded with 11 different signers.

IV. EXPERIMENTAL RESULTS
This section presents the experimental results for the various phases of development that were carried out to build a complete framework for SLRT research challenges. The functionality of the proposed H-DNA framework is tested at different stages of the development cycle. In addition, we show the user interface screens of the final application. In the first phase of development, the SL recognition model is implemented using the hybrid CNN+Bi-LSTM technique. We used 25k images for training and 5k images for validation. The model performance is shown in Figure 11. The proposed model achieves significant improvements in classification accuracy and recognition performance. Furthermore, we plot the confusion matrix to assess the classification performance; the results are shown in Figure 12 and demonstrate the improved performance of the proposed hybrid CNN+Bi-LSTM model. The accuracy of the proposed model is compared with existing work, and the comparison results are tabulated in Table 2.
We further investigated the performance of the proposed hybrid CNN+Bi-LSTM model using the following metrics. Precision is computed using Equation 15, recall using Equation 16, the F1 score using Equation 17, and accuracy using Equation 18, where TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative counts. The resulting quality metrics are tabulated in Table 3.
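The four metrics named above can be computed from the confusion counts as in the following sketch, which uses the standard definitions of precision, recall, F1 score, and accuracy. The function name and the example counts are hypothetical and are not results from this work.

```python
from typing import Dict

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1 score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

if __name__ == "__main__":
    # Illustrative counts only.
    print(classification_metrics(tp=8, tn=9, fp=2, fn=1))
```

For a multi-class sign classifier, these counts would be taken per class from the confusion matrix and the per-class scores then averaged.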
The performance of the hybrid NMT + Attention model is evaluated using the BLEU metric, as depicted in Figure 13, which compares the proposed hybrid NMT + Attention model with existing work. Further, the performance of the hybrid NMT + Attention model is analyzed using the attention plots depicted in Figure 14.
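For readers unfamiliar with BLEU, the following is a simplified sentence-level sketch with one reference, uniform weights over 1- to 4-grams, and no smoothing. It is not the evaluation script used in this work; reported BLEU scores are normally computed at corpus level and scaled to 0-100. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List

def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate: List[str], reference: List[str],
                  max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times brevity penalty."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0   # unsmoothed: any empty precision gives zero
        log_precision += math.log(overlap / total) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision)

if __name__ == "__main__":
    ref = "the weather is cold today".split()
    print(sentence_bleu("the weather is cold tomorrow".split(), ref))
```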
The attention plot shows the actual translation behavior of the model by comparing the source and target sentences. The blocks highlighted in white represent the role of the attention mechanism in the translation of a particular word.
We compared the performance of the proposed Dynamic GAN model qualitatively and quantitatively by experimenting with multilingual sign language datasets, and the results are shown below. Figure 15 depicts the video generation results for the RWTH-PHOENIX-Weather 2014T dataset, Figure 16 shows the video generation results for the ISL-CSLTR dataset, and Figure 17 depicts the video generation results for the RWTH-PHOENIX-Weather 2014T dataset. We further compared the proposed Dynamic GAN model with existing deep generative models. The quantitative evaluation is carried out using the benchmark sign corpora. The results show the improved performance of our approach compared with existing models. Table 4 compares the proposed model with existing models in terms of realism, relevance, and coherence, as judged by human evaluators. We validated the generated frame quality and temporal coherence using the FID2vid scores shown in Table 5.
The Structural Similarity Index Measure (SSIM) metric, represented by Equation 19, is used to assess image quality. We use the SSIM metric to compare the model's performance with existing approaches. This metric assesses the degradation of structural information in the generated video frames, and the results are shown in Table 6.
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$$
The proposed Dynamic GAN model is also evaluated using the Inception Score (IS) metric. A high score reflects the model's performance over multiple domains and the generation capability of the generator. The IS is computed using Equation 20.
Let x denote the images generated by the generator network G, let p(y | x) denote the class distribution of the generated samples, and let p(y) denote the marginal probability function. The Inception Score results are shown in Table 7.
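Using the symbols defined above, the Inception Score is conventionally the exponential of the average KL divergence between p(y | x) and p(y). The sketch below assumes that form, since Equation 20 is not reproduced in this text, and takes the class probabilities as precomputed input; in practice they come from a pre-trained Inception classifier. The function name is hypothetical.

```python
import math
from typing import List

def inception_score(cond_probs: List[List[float]]) -> float:
    """exp of the mean KL(p(y|x) || p(y)) over generated samples.

    cond_probs[i][j] is p(y = j | x_i) for generated sample x_i.
    """
    n = len(cond_probs)
    num_classes = len(cond_probs[0])
    # Marginal p(y): average of the conditional distributions
    marginal = [sum(p[j] for p in cond_probs) / n for j in range(num_classes)]
    kl_total = 0.0
    for p in cond_probs:
        kl_total += sum(pj * math.log(pj / marginal[j])
                        for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(kl_total / n)
```

The score is 1 when every sample receives the same class distribution and approaches the number of classes when predictions are confident and evenly spread across classes.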
The PSNR metric compares the real and generated results. A high PSNR indicates improved quality of the generated results. The PSNR metric is compared across the different sign corpora to analyze the performance of the proposed Dynamic GAN model. The results of the PSNR metric are shown in Table 8.
The Fréchet Inception Distance (FID) metric is used to assess the quality of the generated video frames and is computed using Equation 23. The quality of pixels and temporal consistency are measured. Lower FID scores indicate better results. The mean and covariance values are computed to compare the generated results with the real data distribution. The results of the FID metric are shown in Table 9.
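The frame-level PSNR and SSIM metrics discussed above can be sketched as follows. Frames are represented as flat lists of 8-bit pixel intensities. The SSIM here is a simplified single-window (global) version with the commonly used constants $K_1$ = 0.01 and $K_2$ = 0.03, which are assumptions not stated in the text; library implementations average SSIM over local windows. The function names are hypothetical.

```python
import math
from typing import List

def psnr(real: List[float], generated: List[float],
         max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    mse = sum((a - b) ** 2 for a, b in zip(real, generated)) / len(real)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

def global_ssim(x: List[float], y: List[float],
                max_val: float = 255.0) -> float:
    """Single-window SSIM computed over the whole frame."""
    c1 = (0.01 * max_val) ** 2   # assumed constant K1 = 0.01
    c2 = (0.03 * max_val) ** 2   # assumed constant K2 = 0.03
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure
```

The `luminance` factor is the $l(x, y)$ term of SSIM; the second factor combines the contrast and structure comparisons.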
We further evaluated our model using the Temporal Consistency Metric (TCM), which scores the consistency of a video across its temporal sequence rather than comparing at the single-frame level. Table 10 lists the evaluation scores for the TCM and compares them with other benchmark datasets. The user-interface-based H-DNA implementations are shown in Figures 18 and 19, which show sample SL recognition and SL video generation results.

V. CONCLUSION
This paper contributes to the development of a deep learning framework for end-to-end sign language recognition, translation, and generation. We addressed the challenges that persist in earlier SL recognition and video generation approaches using the proposed H-DNA framework. We evaluated the model performance quantitatively and qualitatively using the RWTH-PHOENIX-Weather 2014T dataset, the How2Sign dataset, and the ISL-CSLTR dataset. The proposed H-DNA framework is also evaluated using various quality metrics, and the generated video frames show the quality of the outcome of our work. We achieved a higher recognition rate and better generation performance than earlier approaches. The proposed model achieved above 95% classification accuracy in SL recognition, a 38.56 average BLEU score, remarkable human evaluation scores, a 3.46 average FID2vid score, 0.921 average SSIM values, an 8.4 average Inception Score, a 29.73 average PSNR score, a 14.06 average FID score, and an average TCM score of 0.715. These scores notably improve on those of earlier models. The evaluation of realism, relevance, and coherence is carried out by human evaluators and produces good results in real-time scenarios.
She is currently working as a Project Associate with SASTRA Deemed University. Her research interests include sign language recognition, music emotion recognition, deep neural networks, image processing, and computer vision. She has contributed various articles and chapters to many high-quality Scopus and SCI/SCIE indexed journals, conferences, and books. She is a Lifetime Member of the International Association of Engineers and a member of the Association for Computing Machinery.

FUNDING SUPPORT
R. ELAKKIYA received the Doctor of Philosophy degree from Anna University, Chennai, in 2018. She is currently working as an Assistant Professor with the Department of Computer Science and Engineering, School of Computing, SASTRA University, Thanjavur. She holds three patents. She has published more than 20 research papers in leading journals, conference proceedings, and books, including those of IEEE, Elsevier, and Springer. She is currently an Editor of the Information Engineering and Applied Computing journal and also a Life Time Member of the International Association of Engineers.
KETAN KOTECHA is currently an Administrator and a Teacher of deep learning with the Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune. He has expertise and experience in cutting-edge research and projects in A.I. and deep learning spanning the last 25 years. He has published widely, with more than 100 publications in several peer-reviewed journals on topics ranging from cutting-edge A.I., education policies, and teaching-learning practices to A.I. for all. He has published three patents and delivered keynote speeches at various national and international forums, including the Machine Intelligence Laboratory, USA, IIT Bombay under the World Bank Project, the International Indian Science Festival organized by the Department of Science and Technology, Government of India, and many more. His research interests include artificial intelligence, computer algorithms, machine learning, and deep learning. He was a recipient of two SPARC Projects worth INR 166 lakhs from MHRD, Government of India, in A.I., in collaboration with Arizona State University, USA, and The University of Queensland, Australia. He was also a recipient of numerous prestigious awards, such as the Erasmus+ Faculty Mobility Grant to Poland, the DUO-India Professors Fellowship for research in responsible A.I. in collaboration with Brunel University, U.K., the LEAP Grant at Cambridge University, U.K., the UKIERI Grant with Aston University, U.K., and a Grant from the Royal Academy of Engineering, U.K., under the Newton Bhabha Fund. He is an Associate Editor of the IEEE ACCESS journal.