Helping hearing-impaired in emergency situations: A deep learning-based approach

Hearing-impaired people use sign language to express their thoughts and emotions and to reinforce information delivered in daily conversations. Though they make up a significant percentage of any population, most people cannot interact with them because they have limited or no knowledge of sign languages. Sign language recognition aims to detect the significant motions of the human body, especially the hands, and to analyze and understand them. Such systems may become life-saving when hearing-impaired people are in desperate situations such as heart attacks or accidents. In the present study, deep learning-based hand gesture recognition models are developed to accurately predict the emergency signs of Indian Sign Language (ISL). The dataset used contains videos for eight different emergency situations. Several frames were extracted from the videos and fed to three different models. Two models are designed for classification, while one is an object detection model, applied after annotating the frames. The first model consists of a three-dimensional convolutional neural network (3D CNN), while the second comprises a pre-trained VGG-16 and a recurrent neural network with a long short-term memory (RNN-LSTM) scheme. The last model is based on YOLO (You Only Look Once) v5, an advanced object detection algorithm. The prediction accuracies of the classification models were 82% and 98%, respectively. The YOLO-based model outperformed the rest and achieved a mean average precision of 99.6%.


I. INTRODUCTION
Human-computer interaction is an interdisciplinary research area focusing on designing computational technologies that make interaction between humans and computers possible. Hand gesture recognition is a sub-field in which computer vision and artificial intelligence help provide nonverbal communication between humans and computers by identifying significant movements of the human hands [1]. Though hand gesture recognition has a variety of applications, accurate recognition remains a challenging task [1].
Sign language recognition is a typical application of hand gesture recognition [2]. It is often assumed that only deaf people rely on sign languages for conveying their thoughts. However, particular medical conditions such as Down syndrome, autism, cerebral palsy, trauma, and brain diseases or speech difficulties may also require a nonverbal mode of communication [6]. 6,909 spoken languages and 138 sign languages have been identified, but there is no universal sign language [3]. Each has its own syntactical and grammatical structures to provide a definitive means of communication, primarily for deaf communities worldwide. Sign languages emphasize the movement of the hands, arms, head, and body in a conceptually predetermined manner to construct a gesture language. Indian Sign Language (ISL) is the name given to the sign language used in India. According to the 2011 census, 2.7 million people in India cannot speak, and 1.8 million are deaf [4]. They face difficulty communicating with others because most hearing people are unfamiliar with sign language. However, communication between them becomes inevitable in emergencies. Sign language interpreters are required to convert sign language to spoken language and vice versa, but they are in limited supply and expensive. As a result, automatic sign language recognition systems are needed to translate signs into corresponding text or voice without the assistance of interpreters [7]. Through human-computer interaction, systems can be built to aid the deaf and other communities who rely on sign languages. Numerous applications have made significant progress, including the recognition of sign language [11,12], the control of robots [13,14], and the playing of virtual musical instruments [15,16].
The progress in hand gesture recognition has also attracted much attention from the industrial sector, for human-robot interaction in manufacturing [17,18] and for self-driving cars [19]. The recognition of sign language [20,21,22], recognition of sport-specific sign language [23], human activity recognition [24], stance and posture recognition [25,26], and physical monitoring of exercises [27] are all new and innovative applications of hand gesture recognition.
To address the gesture recognition problem, there are various handcrafted feature techniques [28,29] and deep learning-based approaches [30,31]. Deep learning is a form of machine learning that has gained prominence in the field of sign recognition in recent years. Deep learning has found applications in a variety of areas, including cancer prediction (colon, blood, or lung cancer), tumor detection, medical imaging, Alzheimer's and Parkinson's disease prediction and modeling, skin lesion detection, optical coherence tomography image processing, and detection of abnormalities related to body parts (breast, heart, etc.) [72]. Igor Kononenko [73] gives a machine learning-based review of the evolution of intelligent data analysis in medicine, emphasizing the importance of using machine learning in medical diagnostics. Various deep learning procedures are discussed by Zhang et al. [74] in order to develop computer-aided medical diagnosis tools. Additional work is being done to improve its utilization in the medical area, such as [75]. Deep learning's progress has allowed academics to explore answers to contemporary concerns, including Covid-19 [76]. CNNs (Convolutional Neural Networks) are a type of deep neural network used to process visual data. Researchers have widely employed them for image classification and detection challenges in recent years. CNNs are also being used for video classification due to their effectiveness in classifying and detecting images and their contents [43]. Deep learning models have been shown to perform recognition and classification tasks quickly and correctly, but their implementation in real-time application settings is limited [30]. A general implementation of sign language recognition involves gathering data (images or videos), pre-processing, feature extraction, and classification (Figure 2).
For static gestures, images are required, while a sequence of images or videos is needed to extract the spatio-temporal features of a dynamic gesture. Pre-processing entails selecting several frames from the videos, applying different filters if required, and resizing the frames to match the model's input. After pre-processing, the input is sent to a combination of convolutional and pooling layers for feature extraction. To provide a label for each frame, the extracted feature output is passed to the fully connected layers and subsequently to the SoftMax layer. To extract temporal characteristics and categorize a sequence of frames, an RNN with long short-term memory (LSTM) is often employed.
Emergency communication includes an alert, warning, help, self-protective measures, and other matters involving immediate response and recovery. Emergency signs such as feeling pain or calling for help or a doctor are not easily recognizable by ordinary people [60]. In this study, we developed vision-based gesture identification techniques employing CNNs on a video dataset to recognize ISL-based hand movements in emergency scenarios. Three models have been proposed for this purpose. The first two models classify the frames and extract the spatio-temporal features. Of these two, one model utilizes 3-D Convolutional Neural Networks (3-D CNN) to retrieve spatio-temporal features directly, while the other combines a pre-trained VGG-16 and an LSTM. Since LSTM is best suited for learning long-term temporal information, the short-term spatio-temporal features are learnt using the CNN before applying the LSTM. The third model is a You Only Look Once (YOLO) v5-based object detection model that detects the hand in the frame and returns the label. The video dataset is a publicly available dataset containing videos of eight different emergency signs: accident, call, doctor, help, hot, lose, pain, and thief. The object detection algorithm performs the best of the three models with 99.6% mean average precision. The results are compared with previous works and significantly outperform them, showing the effectiveness of the proposed models. A thorough analysis of results is also presented.
The major contributions of the present work include:
• Implementation of state-of-the-art classification and object detection methods that can be used when hearing-impaired people are in desperate situations and want to convey their thoughts to other people at the earliest.
• A thorough literature review of related studies.
• Use of the first publicly available dataset of hand gestures for emergency ISL words.
• Analysis of videos of eight emergency signs.
• Comparison of the performance of the proposed methods with each other on various metrics and with previous studies.
• Establishing the superiority of the VGG+LSTM model over other classification models.
• Demonstrating the strong performance of the object detection model in identifying dynamic gestures.
The remainder of the paper is organized as follows. The second section briefly examines similar gesture detection algorithms before moving on to different deep learning network approaches. The third section discusses the design concept and structure of the models. The experimental outcomes are presented in the fourth section. The final section contains the conclusions.

II. RELATED WORKS
Due to the constantly increasing need for Hand Gesture Recognition (HGR), a growing number of academics are concentrating their efforts on dynamic HGR based on video data. Researchers have implemented several classification and detection tasks aiming to recognize a gesture.
Gestures in sign languages can be broadly classified as static and dynamic gestures. Static hand gestures can be classified using images, whereas dynamic hand gesture recognition requires a sequence of images or videos, from which the temporal and spatial features of a gesture are learned. The success of deep learning algorithms such as CNNs [32][33][34][35][36] in object detection and classification tasks has led to a growing trend towards using them in computer vision applications such as sign language recognition, games, virtual reality, and human-computer interaction [67]. CNNs are also being used to achieve cutting-edge performance on images and videos [34], [35]. There have been various approaches to classifying hand gestures using CNNs, which have been successfully applied to hand gesture recognition tasks [36], [37]. Rao et al. [68] utilized a neural network classifier for continuous sign language identification from selfie videos. To achieve dynamic sign language recognition, this sort of system primarily employed deep learning technology, such as extracting the discriminative characteristics of hand movements with CNNs in [36].
The inputs for static hand gestures are not time-related and need not be in chronological order. For dynamic motions, however, the previous state must be preserved through a chain-like structure of repeating modules that aid in learning long-term dependencies in sequential data. For hand gestures, video datasets are used to capture the dynamics of the gestures easily. Commonly, a 2D CNN is used for the recognition of static images, while for dynamic gestures, spatio-temporal features need to be extracted. Despite the exemplary performance of 2D CNNs on static gesture tasks, they are unable to model temporal characteristics and motion sequences. To solve this issue, authors have applied Long Short-Term Memory (LSTM) at the end of a 2D CNN for spatio-temporal feature recognition [38,39,40,41,42]. Do et al. [38] proposed a multi-level feature LSTM on a Dynamic Hand Gesture Recognition (DHG) dataset of 14 and 28 classes and achieved 96.07% and 94.40% accuracy, respectively. Cui et al. [69] developed a video-based recurrent convolutional neural network for continuous sign language recognition. Elboushaki et al. [39] used two models: Residual Networks (ResNets) for learning the spatio-temporal information from color images, and Convolutional Long Short-Term Memory Networks (ConvLSTM) to capture their temporal interdependencies. 2D ResNets were then used to extract deep features from the gesture. Yanqiu Liao et al. [70] proposed a multimodal dynamic sign language recognition method based on a deep 3-dimensional residual ConvNet and bi-directional LSTM networks (BLSTM-3D residual network or B3D ResNet) for the recognition of complex hand gestures. The technique consisted of three main parts. First, the hand object was localized in the video frames, utilizing R-CNN to gather hand position information. The second part was the video sequence feature extraction module, which performed long-term spatio-temporal feature extraction on the segmented video frames.
The third part was the dynamic sign language recognition module, which analyzed the long-term temporal dynamics and predicted the hand gesture label. The label with the top prediction score was treated as the video sequence label and output as the recognition result. The results on test datasets show that the proposed method obtained a recognition accuracy of 89.8% on the DEVISIGN_D dataset and 86.9% on SLR_Dataset.
John et al. [40] extracted frames that represent the gestures efficiently from the video dataset before feeding them to long-term recurrent convolution networks (LRCN). This way of extracting frames improved the classification accuracy of the model. Lai and Yanushkevich [41] used both depth and skeleton data for hand gesture recognition by combining convolutional neural networks (CNN) and recurrent neural networks (RNN) on the dynamic hand gesture dataset. They attained an overall accuracy of 85.46%. Fahad Obaid et al. [71] proposed a two-stage model to solve the hand gesture recognition (HGR) problem in video sequences. The first stage is pre-processing, while the second classifies and labels the frames and recognizes the hand gestures through deep learning. For the latter stage, they proposed two models. Model I used two types of neural networks in its architecture: a CNN for spatial feature extraction and a recurrent neural network with long short-term memory (RNN-LSTM) for temporal feature extraction. The second network design consists of a single CNN supplied with grayscale or depth data. The CNNs were fed training and testing data after they had been trained. Individual frames from each gesture were transformed into a sequence and utilized as a dataset for RNN training and testing.
The RNN with long short-term memory (LSTM) was then used for temporal feature extraction and classification of the sequence of frames. The VIVA dataset was used to evaluate the model. The results show that the proposed method obtained a validation accuracy of 82% with color information and 89% with depth information for Model II, and 93% when color and depth information were combined for Model I. Molchanov et al. [42] suggested a combination of 3-D CNN and RNN for gesture recognition.
In addition, 3D CNNs, which take a series of video frames as input, have been utilized to capture discriminative characteristics in both the spatial and temporal dimensions. Saqib et al. [43] proposed a 3D CNN model to classify static and dynamic gestures of Pakistani sign language and attained an accuracy of 90.79%. Camgoz et al. [44] proposed a novel 3D CNN architecture for broad-scale, independent gesture recognition. The architecture uses eight 3D convolutional layers and five 3D max-pooling layers for feature extraction, followed by two fully connected layers and a SoftMax layer for classification. For the hand detection task, [45] proposed a real-time hand gesture recognition model based on YOLO (You Only Look Once) v5 and DarkNet-53 convolutional neural networks. The model was able to successfully detect gestures in low-resolution images and complex environments. They implemented both static and dynamic gestures recorded on video and attained an accuracy of 97.68%. Mambou et al. [46] suggested a sexual assault alert system for various scenes at night. The model was implemented on a combination of YOLO and CNN architecture. Generally, techniques like 3D CNN and a combination of CNN and RNN have been widely used by researchers. Thus, this study also compares both these methods in classifying static as well as dynamic hand gestures. Not surprisingly, YOLO has produced state-of-the-art results in various detection tasks. Therefore, we implemented YOLOv5 on our dataset after annotating the images.
Khari, Garg, Crespo, and Verdú [47] suggested a static gesture recognition method using a pre-trained VGG19 model on the ASL dataset. The model achieved an accuracy of 94.8%. Furthermore, Paul et al. [48] proposed two CNNs to categorize 24 static signs from ASL. The work is based on the ASL Finger Spelling dataset and attained accuracies of 86.52% and 85.88% on RGB images. Masood, Srivastava, Thuwal, and Ahmad [49] used an Inception-v3 pre-trained CNN combined with LSTM to propose three models with various numbers of LSTM units for the Argentinean video dataset (LSA). They attained a best accuracy of 95.2%. Using pre-trained models like VGG-16 or Inception before LSTM has achieved excellent results in the past. In Model II proposed in this work, we took the same approach of using VGG-16 before LSTM on the emergency dataset. In addition, V. Adithya [60], who employed two models on the same emergency dataset, also suggested a pre-trained GoogleNet [65] with LSTM. The other approach by V. Adithya [60] used a multi-class SVM.
ASL is based on single-handed signs, whereas ISL uses both hands, which makes it more difficult to recognize than ASL. In addition, most of the signs are dynamic, so classification and detection tasks require image sequence input. Furthermore, hands involved in gesturing may follow complicated motions in ISL. The non-availability of a benchmark dataset has also hindered developments in automatic sign recognition. Due to these issues, comparatively little research has been carried out on ISL recognition systems [50]. In the present work, we tried to fill this research gap with the help of advanced deep learning methods. The emergency signs studied are shown in Figure 3.

III. METHODOLOGY
This section discusses the dataset used and the methods adopted for sign language recognition.

A. DATASET
Sign language recognition has not been well researched for ISL due to the non-availability of a standard publicly available dataset. For other languages, especially ASL, there are many standard datasets. Three such word-level ASL datasets are the Purdue RVL-SLLL ASL Database [51], Boston ASLLVD [52], and RWTH-BOSTON-50 [53]. LSA64 [54] is an Argentine word-level dataset, PSL Kinect 30 [55] is a Polish word-level dataset, DEVISIGN [56] is a Chinese, GSL [57] a Greek, DGS Kinect [58] a German, and LSE-sign [59] a Spanish Sign Language dataset. In this study, a video-based ISL dataset containing 412 videos is used [60]. Researchers working on vision-based sign language recognition and hand gesture recognition will benefit from this dataset. The dataset's primary goal is to advance the field of sign recognition, since it has several applications in society, such as providing a platform for the deaf to communicate essential messages to authorities. Furthermore, the dataset may be used as a basic benchmark database for a collection of emergency ISL hand gestures.
The dataset includes eight hand gestures representing the ISL words 'accident,' 'call,' 'doctor,' 'help,' 'hot,' 'lose,' 'pain,' and 'thief', which are often used to transmit information or request help in emergency scenarios (Figure 3). Out of a total of 412 videos, each sign is represented in about 50 different videos on average. The complete list is provided in Table-

B. PRE-PROCESSING
The present study utilizes a short subsequence rather than the whole video to recognize the sign words. To efficiently teach the model the dynamics of the movements, we first extracted 20 frames from each video. To shorten the training period, we then reduced the number of extracted frames (1-10) and found that 5 frames taken at equal intervals from each video sequence were sufficient to reflect the dynamics of the gestures without compromising the prediction accuracy. Furthermore, previous research demonstrates that human action recognition may be accomplished with just a few frames (1-7 frames) [77,78].
Images were labeled from 0 to 7 to represent the eight different classes. Table-1 shows the class labels, corresponding hand gestures, and class sizes. For object detection, the dataset was annotated before applying the model. The YOLO format was used to label the data in text files that store information such as the class ID of each annotated object. The extracted frames were resized from (500 by 600) to (150 by 150) pixels. Data normalization was also applied so that the data distribution of each input pixel is uniform, resulting in faster convergence during model training. The overall dataset (100%) is divided into three subsets: training (60%), validation (20%), and testing (20%).
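The equal-interval sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `equally_spaced_frame_indices` and the choice of taking the centre of each segment are assumptions, since the text only states that 5 frames were taken at equal intervals.

```python
from typing import List


def equally_spaced_frame_indices(total_frames: int, num_samples: int = 5) -> List[int]:
    """Return `num_samples` frame indices spread evenly across a video.

    Hypothetical helper: the video is cut into `num_samples` equal-length
    segments and the centre frame of each segment is selected.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # Video is shorter than the requested sample count: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Example: a 100-frame clip sampled at 5 equally spaced positions.
print(equally_spaced_frame_indices(100, 5))
```

The returned indices could then be passed to any video reader to pull out the corresponding frames.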
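A 60/20/20 partition of the kind described above can be sketched as below. The helper name `split_indices`, the fixed random seed, and the use of a simple shuffled index list are assumptions made for illustration; the text does not state how the split was drawn or whether it was made per video or per frame.

```python
import random


def split_indices(num_items, train_pct=60, val_pct=20, seed=0):
    """Shuffle item indices and split them into train/validation/test lists.

    Hypothetical helper: percentages are integers so that subset sizes are
    computed with exact integer arithmetic; the remainder goes to the test set.
    """
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # assumed: reproducible random split
    n_train = num_items * train_pct // 100
    n_val = num_items * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Example with 2060 items (5 frames from each of 412 videos).
train, val, test = split_indices(2060)
print(len(train), len(val), len(test))
```

If frames of the same video should never be shared between subsets, the same function can be applied to video indices instead of frame indices.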

C. PROPOSED METHODS
We utilized two models for the classification task, Model I and Model II. Model I consists of a 3D CNN, while Model II uses a combination of a pre-trained CNN and LSTM. Model

1) CLASSIFICATION
Artificial neural networks such as the convolutional neural network (CNN) are widely employed to analyze pixel input [41]. CNNs are applied to image recognition and processing, object detection, feature learning, and, together with RNNs, sequence prediction. A CNN uses a variation of the multilayer perceptron model and contains at least one convolutional layer, which may be followed by pooling or fully connected layers. The convolutional layers generate feature maps that each record a portion of the image, which is split into boxes and sent on for non-linear processing. The input data is passed through the network, and each layer extracts features and passes them to the subsequent layers [43].
A convolution layer extracts image features using filters that are trained by the network itself. The convolution function is defined in Equation (1):

$$x_k^{l} = f\left(w_k^{l} * x^{l-1} + b_k^{l}\right) \qquad (1)$$

where $k$ indexes the feature map of layer $l$, $w_k^{l}$ is the convolution kernel, $f$ is the activation function of the hidden layer, $x^{l-1}$ is the map of features of the previous layer, and $b_k^{l}$ is the bias. An activation function is added to let the network learn complex patterns. The most commonly used activation function is ReLU (Rectified Linear Unit), which is mathematically defined in Equation (2):

$$f(x) = \max(0, x) \qquad (2)$$

where $x$ is the input of a neuron. The SoftMax activation function is used in the final fully connected layer of a network. It is mathematically defined in Equation (3):

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n_c} e^{z_j}} \qquad (3)$$

where $n_c$ denotes the number of classes and $z$ is the output value for the current patch after passing through the convolutional neural network, which SoftMax converts into a probability. The loss function is a metric for evaluating the model during training and should ideally decrease over iterations. We used categorical cross-entropy loss in our work, defined in Equation (4):

$$L(y, y') = -\sum_{i=1}^{n_c} y_i \log y'_i \qquad (4)$$

where $y'$ is the label predicted by our classifier, and $y$ is the ground truth label.
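As a small worked illustration of the ReLU, SoftMax, and categorical cross-entropy functions referred to above, the following plain-Python sketch evaluates them on toy values. The function names and the `eps` clipping constant are assumptions introduced here for numerical safety; deep learning libraries provide their own implementations.

```python
import math


def relu(x):
    """Rectified Linear Unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    # eps (assumed value) avoids log(0) when a predicted probability is zero.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# Toy example with three classes; the true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)
print(probs, loss)
```

The loss shrinks towards zero as the predicted probability of the true class approaches one, which is why it should decrease during successful training.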
Pooling layers are used to reduce the complexity of the model. They reduce the number of learning parameters and the number of computations performed in the network. Max pooling summarises the strongest activations throughout a region by taking the largest input value within a filter window and dropping the remaining values. Furthermore, normalization and dropout layers are used to avoid over-fitting and to make the units of the model learn independently.
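The max pooling operation described above can be illustrated on a single 2D feature map. This is a minimal sketch with a hypothetical helper `max_pool_2d`, assuming non-overlapping windows (stride equal to the window size); it is not taken from the authors' implementation.

```python
def max_pool_2d(feature_map, size=2):
    """Apply non-overlapping max pooling to a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values inside the current window and keep the largest.
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# A 4x4 map is reduced to 2x2, keeping the strongest activation per window.
example = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [2, 6, 7, 3],
]
print(max_pool_2d(example))
```

Each 2 by 2 window contributes one value, so the number of activations passed to later layers drops by a factor of four.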
Model I makes use of a 3D CNN for feature extraction and dense layers at the end for classification, as shown in Figure 4. The architecture of Model I comprises two convolutional blocks, each consisting of a 3D convolutional layer, a MaxPooling 3D layer, Batch Normalization, and a Dropout layer. The size of all convolution kernels is 3 × 3 × 3 with stride 1 × 1 × 1. The number of convolution kernels is 32 and 64, respectively. The size of all pooling kernels is 1 × 2 × 2. Furthermore, there is a dropout of 50% and 40%, respectively, at the end of each convolutional block. ReLU was used as the activation function in each block. A Flatten layer is placed after the convolutional blocks to convert the extracted features into one dimension, followed by a fully connected layer of 256 neurons. Finally, an output layer of 8 neurons with the SoftMax activation function corresponds to the total number of classes in the hand gesture dataset.
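To make the effect of the stated kernel and pooling sizes concrete, the sketch below traces the feature-map dimensions through the two convolutional blocks. Several details are assumptions not given in the text: an input of 5 frames at 150 × 150 pixels, no padding ("valid" convolutions), and a pooling stride equal to the pooling size. With other padding choices the numbers would differ.

```python
def conv_out(size, kernel=3, stride=1):
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


def trace_model_one(frames=5, height=150, width=150):
    """Trace (depth, height, width) through two conv + pool blocks.

    Assumed: 3x3x3 unpadded convolutions with stride 1, followed by
    1x2x2 max pooling whose stride equals the pool size.
    """
    shape = (frames, height, width)
    trace = []
    for filters in (32, 64):
        shape = tuple(conv_out(s) for s in shape)
        trace.append(("conv3d", shape, filters))
        depth, h, w = shape
        shape = (depth // 1, h // 2, w // 2)
        trace.append(("maxpool3d", shape, filters))
    depth, h, w = shape
    flattened = depth * h * w * 64  # 64 feature maps after the second block
    return trace, flattened


trace, flattened = trace_model_one()
for name, shape, filters in trace:
    print(name, shape, filters)
print("flattened features:", flattened)
```

Under these assumptions the temporal dimension shrinks from 5 to 1 after the second convolution, which shows how quickly a short 5-frame clip is consumed by 3 × 3 × 3 kernels.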
All inputs are regarded as independent in a conventional neural network. This is a limitation in problems where the network has to recall past information, such as predicting the next word in a phrase or the next frame in a video. Recurrent neural networks (RNNs) are built specifically for identifying patterns in data streams. In these networks, the result depends on prior calculations, and the network contains a "memory" that collects information on the calculations made thus far. The most popular RNN is the LSTM [61]. As illustrated in Figure 1, a general approach to sign language recognition on video datasets starts with extracting frames from videos, followed by pre-processing of the frames. Features are then extracted from the processed frames before being classified.
The second model, Model II, places an LSTM on top of a pre-trained CNN (VGG-16) to classify the video sequence, as shown in Figure 6. It uses transfer learning and LSTM for learning spatial and temporal features, respectively. Each frame is passed to the CNN (VGG-16) to extract spatial features. The architecture of VGG-16 is shown in Figure 5. The outputs are then sent to an LSTM to find temporal characteristics in the image stream. Finally, the extracted features are sent to a fully connected layer that predicts the class of the whole input sequence.
The VGG-16 network was originally trained on the ImageNet dataset, which had over 14 million high-resolution images with 1000 distinct classes. In order to utilize it for our work, we removed the classification layers that were trained on the ImageNet dataset and modified the pre-trained model's input shape to meet our needs. As a consequence, our model includes almost 14 million learned parameters and concludes with a max-pooling layer from the network's feature learning section. After that, we used a 256-unit LSTM layer to learn the temporal characteristics, followed by a fully connected layer of 1024 neurons to classify the features. Finally, a SoftMax layer was added to give the output, as shown in Table 3. The training is carried out using the Adam optimizer, with a learning rate of 0.001 and a batch size of 32 for both models.

2) DETECTION
Model III uses YOLO v5 to detect the hand movements and classifies the gesture with a confidence score. YOLO v5 is an advanced real-time object detection algorithm with top performance on two official object detection datasets: Pascal VOC [32] and Microsoft COCO [33]. It is a state-of-the-art detector that predicts the coordinates of objects in the image and the confidence score (probability) of the class label. The architecture of the network is shown in Figure 7. YOLO v5 comprises three segments: the CSPDarknet network, the PANet network, and the YOLO layer, often referred to as the backbone, neck, and head of the architecture, respectively. Initially, the data is sent to CSPDarknet for feature extraction, followed by PANet for feature fusion, before the YOLO layer outputs the class, score, location, and size of the object.
The input data needs to be annotated in the specific YOLO format, in which each label is stored in a text file containing the object class, the object coordinates, and the height and width of the bounding box. For our study, the hands in the images are annotated in the required YOLO format.
We did not employ the YOLO model to compensate for the classification models' shortcomings; instead, we developed it as a detection model that can provide better performance in real-time deployment [62]. With multiple objects in the frame, a classification model fails to classify, while a detection model can efficiently detect and classify the target object thanks to the bounding box. Hence, the classification and detection models are not meant to be compared directly, as they address different tasks.
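The annotation format mentioned above can be illustrated with a small conversion routine. In the commonly used YOLO text format, each line holds a class ID followed by the box centre, width, and height, all normalized by the image dimensions. The helper name `to_yolo_line`, the pixel-corner input convention, and the example box are assumptions for illustration and are not taken from the dataset.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO-format label line.

    Output layout: "<class_id> <x_center> <y_center> <width> <height>",
    with all four box values normalized to the range [0, 1].
    """
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical hand box in a 150 x 150 frame, labelled with class ID 3.
print(to_yolo_line(3, 30, 45, 90, 105, 150, 150))
```

One such line is written per annotated object, in a text file that shares its base name with the image.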

IV. RESULTS AND DISCUSSIONS
In this section, the results of all three models are presented. We used Google Colab, a free cloud environment that provides GPU-based computation, to train the proposed models. The Keras library [63] with the TensorFlow [64] backend was utilized to implement the models.

A. EVALUATION METRICS
To evaluate the models, commonly used measures such as accuracy, precision, recall, and the confusion matrix are considered. Object detection models such as YOLO are additionally assessed with a specific metric, mean average precision (mAP). Average precision is defined in Equation (5):

$$AP = \sum_{n} (R_n - R_{n-1}) \, P_n \qquad (5)$$

where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. In addition, the mean average precision (mAP) is the sum of the average precision of all classes divided by the number of classes and is defined in Equation (6):

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \qquad (6)$$

where $N$ is the number of classes. Two variants are used, mAP@0.5 and mAP@0.5:0.95. The mAP@0.5 is the mean average precision at an intersection-over-union (IoU) threshold of 50%. The mAP@0.5:0.95 is the mean value of the average precision over IoU thresholds in the range of 50% to 95%.
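The two metrics above can be sketched numerically. The code below computes average precision as a recall-weighted sum of precisions and then averages it over classes. The function names and the toy precision-recall points are illustrative assumptions; evaluation toolkits typically also interpolate the precision curve before summing, a step omitted here for brevity.

```python
def average_precision(recalls, precisions):
    """Sum precision weighted by the increase in recall at each threshold.

    Both lists must be ordered so that recall is non-decreasing.
    """
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Average the per-class AP values over the number of classes."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy precision-recall points for two hypothetical classes.
ap_a = average_precision([0.5, 1.0], [1.0, 0.5])
ap_b = average_precision([0.5, 1.0], [1.0, 1.0])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```

A detector that keeps precision at 1.0 across all recall levels, as in the second toy class, reaches the maximum AP of 1.0.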

B. PERFORMANCE OF CLASSIFICATION MODELS
The classification models were trained and tested on 2060 frames from 412 videos. We trained both the models on a common training set and tested them on the same testing set. The models classify various emergency signs from ISL as 'accident', 'call', 'doctor', 'help', 'hot', 'lose', 'pain' and 'thief'. The results for each sign from both models are presented in Table 4.
The accuracy and loss calculated during training for Model I are shown in Figure 8. The architect and parameters of the models were tuned during training to achieve better results. 3D CNN-based model was able to achieve a maximum of 82% accuracy on the test set. The difference between the training and testing accuracy infers that the model was still over-fitting. Furthermore, the fluctuating testing accuracies and testing loss suggest that the model could not generalize, which could be solved if the available dataset was large. The Confusion matrix of Model I is shown in Figure 9. It was also noticed that 3D CNN could not learn hands that had excessive shifting. Since the signs' Doctor' and 'Thief' had the least movement among all, the model achieved relatively better results on them while the other signs such as 'Help', 'Lose' and 'Call' weren't classified with accuracy. The reason could be that the model was not able to learn temporal features in the sequence. Furthermore, it was observed that double-handed signs were classified more accurately than single-handed ones. Some of the misclassified instances of Model-I are shown in Figure 10.
On the other hand, Model II, i.e., VGG-16 combined with LSTM, achieved an accuracy of 98%. The accuracy and loss evaluated during training for Model II are shown in Figure 11. The accuracy gradually improved throughout training. Since the feature extraction block was pre-trained, the model was able to learn very quickly. After around ten epochs, the performance in terms of accuracy and loss became steady. It is evident from the confusion matrix (Figure 12) that the model successfully learned both the spatial and temporal features of each sign and differentiated the signs from one another. Due to the similarity of the hand movements, Model II showed some error in differentiating the signs 'Pain' and 'Call', as shown in Figure 13. The same can be inferred from the corresponding confusion matrix.
Because the LSTM operates on features extracted by a pre-trained CNN, the results achieved using Model II outperformed those obtained using Model I, as seen in the graphs above.
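The confusion matrices and per-sign precision and recall values discussed above can be derived from the true and predicted labels of the test frames. The following is a minimal Python sketch of that computation, assuming labels are given as sign names. The function names and the example predictions are hypothetical; the example merely imitates the kind of 'Pain' versus 'Call' confusion described for Model II and does not reproduce the reported results.

```python
from typing import Dict, List, Sequence, Tuple

SIGNS = ["accident", "call", "doctor", "help", "hot", "lose", "pain", "thief"]


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: Sequence[str]) -> List[List[int]]:
    """Count matrix in which rows are true signs and columns are predicted signs."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_precision_recall(matrix: List[List[int]],
                               labels: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """Return (precision, recall) for every sign; 0.0 when a sign never occurs."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: predicted as this sign
        actual = sum(matrix[i])                    # row sum: truly this sign
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical predictions in which one 'pain' sample is mistaken for 'call'.
y_true = ["pain", "pain", "call", "call", "doctor"]
y_pred = ["pain", "call", "call", "call", "doctor"]
cm = confusion_matrix(y_true, y_pred, SIGNS)
scores = per_class_precision_recall(cm, SIGNS)
```

In this toy example the single confusion lowers the recall of 'pain' and the precision of 'call', while 'doctor' is unaffected.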
The outcomes of both models were compared with the work of V. Adithya [60], who employed two models on the same dataset. One approach used a multi-class SVM, and the other a pre-trained GoogleNet [65] with LSTM. The comparison showed that our Model II outperformed their methods. However, Model I did not produce better results than theirs. Table 5 gives a comparative overview.
Table 6 shows the results obtained from the YOLO model. Since the dataset has a constant background, the bounding box was intentionally left slightly large to cover the maximum area of the gesture, without the risk of the model learning anything from the background. The model returns the class ID and the confidence score of the gesture.

C. PERFORMANCE OF DETECTION MODELS
Model III was trained for 500 epochs with a batch size of 16. The learning rate is 0.01, and the decay is 0.0005. Training took around 4 hours.
Precision and recall are used to evaluate the object detection model, as they provide valuable insight into the model performance at various confidence values. Generally, as the confidence threshold increases, the precision increases while the recall decreases. The F1 score is therefore especially helpful in choosing the optimum confidence threshold, i.e., the one that balances the precision and recall of the model. The metrics were plotted against the confidence score to find the appropriate threshold. Figure 14 illustrates that for confidence values below 0.8, the model predicts everything accurately. In addition, the confusion matrix on the test set is given in Figure 15.
A high precision value indicates that the samples the model classifies as positive are indeed positive, and a high recall value indicates the model's ability to correctly classify positive samples as positive. As the recall increases, the precision tends to decrease: the more samples the model labels as positive (high recall), the less accurately each of them is classified (low precision). To find the optimum values, a precision-recall (PR) curve is plotted, from which the point at which both precision and recall are high can easily be determined. The average precision (AP) summarizes the whole precision-recall curve in a single value. In other words, the AP is calculated as the weighted sum of the precisions at each threshold, where the weight is the increase in recall. The PR graph is shown in Figure 16. The area under the curve determines the AP value of each class. As seen in

D. PERFORMANCE COMPARISON AND DISCUSSION
In the present study, we developed models for classifying and detecting hand gestures of ISL. The dataset includes double- and single-handed gestures as well as dynamic and static ones, which adds realism to the proposed models. We compared the performance of the classification models with related work [60] (Table 5), and YOLO v5 was used as the detection model.
To compare the performance of the models, their respective precision and recall values are shown in Figure 17. It can be seen that the signs 'Accident', 'Doctor', 'Help' and 'Thief' are classified accurately by all the models. These signs require both hands to make the gesture, which suggests that two-handed signs are relatively easier to classify. In contrast, 'Call' and 'Pain' were the most difficult signs for any model to categorize accurately. Both are dynamic gestures, and at one point in their movement they share a common hand position, which might have been challenging for the models to differentiate. Model I could not yield satisfactory results in identifying dynamic signs such as 'Accident', 'Call', 'Hot', and 'Pain', implying that the 3D CNN did not correctly learn all of the temporal characteristics.
In contrast, Model II was able to learn both the spatial and temporal features more effectively because of the presence of a pre-trained VGG-16 feature extractor. Compared to the 3D CNN, the combination of a VGG-16 network with an LSTM architecture achieved considerably higher precision and recall. YOLO v5, in turn, successfully detected each sign in the dataset with an overall precision and recall of 99.5%. YOLO offers several benefits over other techniques, making it an advanced detector. First, YOLO uses a single CNN for both classification and localization instead of a two-step approach. Second, YOLO is fast and processes images at around 40-90 frames per second [66], which corresponds to a per-frame processing delay of roughly 11 to 25 milliseconds. Streaming video can therefore be handled in real time, which suggests that Model III may be used to recognize emergency gestures in real time.

V. CONCLUSION
The present work proposes classification and detection models on an emergency ISL dataset. The best classification model uses a combination of pre-trained VGG-16 and LSTM, while the detection model is based on YOLO v5. The classification model achieved 98% accuracy, and the detection model achieved 99.6% mean average precision. The developed hand gesture recognition system classifies and detects both static and dynamic hand gestures from video frames. Furthermore, we found that even a small set of frames is enough to recognize a dynamic gesture. For deaf people, sign language provides a means of emergency communication that helps them deal with difficult times. Situations such as feeling pain or calling for help or for a doctor may arise anytime and anywhere. The present study opens the door to developing applications based on the proposed methods that can be used when hearing-impaired people are in desperate situations and need to convey their thoughts to other people as quickly as possible.

COMPLIANCE WITH ETHICAL STANDARDS
• Disclosure of potential conflicts of interest: We declare that there is no conflict of interest among the authors.
• Research involving human participants and/or animals (Ethical approval): This article does not contain any studies with human participants or animals performed by any of the authors.
• Informed consent: Informed consent was obtained from all individual participants included in the study.