SIGNFORMER: DeepVision Transformer for Sign Language Recognition

Sign language is the most common form of communication for the hearing impaired. To bridge the communication gap, hearing people should be able to recognize the signs; a sign language recognition system can therefore assist hearing-impaired people. This paper proposes the Transformer encoder as a useful tool for sign language recognition. For the recognition of static Indian signs, the authors have implemented a vision transformer. On static Indian sign language, the proposed methodology achieves noticeably better performance than other state-of-the-art convolutional architectures. The suggested methodology divides the sign image into a series of positionally embedded patches, which are then sent to a transformer block with four self-attention layers and a multilayer perceptron network. Experimental results show satisfactory identification of gestures under various augmentation methods. Moreover, the proposed approach requires only a very small number of training epochs to achieve 99.29 percent accuracy.


I. INTRODUCTION
Hand gestures form a communication medium, and sign language is the most structured and organized language that hearing-impaired people use to communicate effectively. Sign language is a collection of various gesture-generation techniques. It is a more effective method of communication than lip movement identification or writing a message. Sign language is vast and relies entirely on gestures to convey messages. It is not just a matter of gestures made with fingers and palms; it also involves visual cues through the eyes, face, mouth, eyebrows, etc. Additional components, like facial expressions, help express complex meaning. Sign language is much more creative and cultivated than normal verbal language. The artistic spirit of life is given by the hand movement, body, and facial expression. Although sign language can be simple and formal, it can also be an animated way to communicate.
The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.
Sign language recognition is an area of research that involves pattern matching, deep learning, computer vision, and natural language processing, together with a design module or algorithm to identify sign language. It can be extended further to human-computer interaction without a voice interface. Such a system is multidisciplinary, and the approach can be considered a part of the Sign Language System.
There are around 300 sign languages used around the world [1]. This number is uncertain because, day by day, more countries emerge with their own sign language. American, British, and Chinese sign languages are the most widely used worldwide [1]. Sign language is highly diverse and varies from region to region, and it involves several local or conventional subsets of the language. This diversity may lead to differences in the flow of gesture signing, expression, jargon, and the formation of gestures. Just as certain English words are spoken differently in different parts of one country, different accents and distinct dialects may be present in sign language. Indian sign language is the most widely used sign language in the South Asian region [2].
Sign language is classified as static or dynamic, and by whether it involves manual or non-manual body parts. This classification can be helpful to researchers and designers of sign language recognition systems. We must combine both sign components to design a robust, real-time sign language recognition system. As shown in figure 1, sign language can be divided into two basic categories: static and dynamic. Static signs can be generated using one hand or two hands, while dynamic signs are further divided into two subcategories: isolated and continuous. To include emotional content, dynamic signs can be further divided by the involvement of non-manual body parts, such as the eyes, head movements, lips, and eyebrows [3].
The main contributions of this work are: i) a Transformer-based DeepVisionTransformer is proposed to recognize Indian sign language; ii) the proposed model is evaluated with a very small number of learning cycles (5 epochs), which is tiny compared to other state-of-the-art sign recognition models. The rest of the article is organized as follows. Section II contains a literature study relevant to sign language recognition. Section III includes a detailed description of the Vision Transformer and of the Vision Transformer for image recognition. Section IV contains details of the proposed methodology. Section V contains details of the experiments and results. Section VI presents the discussion and conclusion of the proposed work.

II. RELATED WORK
Rokade [4] proposed a methodology for automatically recognising fingerspelling in Indian sign language. The sign image is first segmented based on skin colour to detect the sign's shape, and the detected area is converted into a binary image. A Euclidean distance transformation is then applied to the binary image. After feature extraction using Hu's moments, classification is done with ANN and SVM. The accuracy was 94.37% with SVM and 92.12% with ANN over 13 features [4]. The author found good accuracy with ANN even with a smaller feature set. The author used black-background images of the letters (26 classes) with dimensions of 320 × 240. Video-based Indian signs are also recognized by the proposed system. Katoch et al. [5] present a technique that uses the ''Bag of Visual Words'' model (BOVW) to recognize Indian sign language letters and digits. The proposed methodology uses segmentation based on skin color and background subtraction, together with histogram-based sign mapping. Finally, CNN and SVM were used for classification. The authors also developed a GUI to make access easier. They used a custom dataset of more than 36,000 images to recognize Indian sign language. On this dataset, binary masks and Canny edges are generated, and features are extracted with SURF. The proposed methodology achieved 99.17% and 99.64% accuracy with SVM and CNN, respectively [5]. Shenoy et al. [6] proposed a static hand pose recognition method for Indian sign language. Video frames are captured from a smartphone and transmitted to a server for processing. The authors used skin color segmentation for hand detection and tracking. Feature extraction uses a grid-based technique to represent hand gestures as a feature vector. The authors used KNN for static hand pose (alphabet and number) classification, while HMM was used to classify other gestures of Indian sign language, with accuracies of 99.7% and 97.23%, respectively [6].
The authors used a custom dataset of over 24,624 images for the experiment. De Coster et al. [7] proposed a sign language recognition methodology over the Flemish Sign Language corpus. The authors used OpenPose feature extraction and end-to-end learning with CNN, and applied a multi-head attention approach to isolated sign recognition. Over 100 sign classes, 74.7% accuracy was obtained as a state-of-the-art result on the Flemish Sign Language Corpus. The authors introduce the Multimodal Transformer Network with Pose LSTM and Pose Transformer, relying especially on self-attention for sign language recognition [7]. Mannan et al. [8] proposed a Hypertuned DeepCNN for static American sign language. The authors used data augmentation to create a larger number of learning samples, since deep learning model accuracy increases with more training samples. The proposed architecture follows a conventional CNN with tuned hyperparameters and is able to achieve 99.67% accuracy with 20 epochs. Zakariah et al. [9] proposed a CNN-based architecture for Arabic letter sign recognition. The authors generated 160000 images from 32000 images with data augmentation, which helps to cover different brightness and angular scenarios. The EfficientNetB4 method was used for simulation, and the authors modified the existing EfficientNet by adding one fully connected dense layer. They used the standard ArSL2018 dataset with 32 classes and obtained 95% accuracy in 30 epochs. Kamruzzaman [10] proposed a CNN-based method for sign language detection. The authors proposed a ResNet50 and MobileNetV2 based methodology for Arabic sign language recognition. ResNet50 and MobileNetV2 were simulated separately on the 32 classes of the ArSL2018 dataset. A combination of ResNet50 and MobileNetV2 is able to achieve 98.2% accuracy with 10 epochs. Rathi et al. [11] proposed a deep learning based sign language recognition model.
The authors used a 2-level ResNet50 architecture to recognize sign language, with 36 classes of the American Sign Language dataset from Massey University [12] having approximately 70 RGB images. The proposed 2-level ResNet50 methodology achieves 99.03% accuracy. S. Jiang et al. proposed a skeleton-aware multi-modal approach for sign language recognition [13]. The authors used hand detectors with a pose estimator to extract hand key points, and the methodology introduces the sign language GraphCN (SL-GCN). As a result, the proposed methodology achieves 98.42% accuracy over RGB images. Roman Tongi introduced a transfer learning based methodology for sign language recognition [14]. The methodology shows how transfer learning can be applied to SLR using an inflated 3D convolutional neural network. The American Sign Language (MS-ASL) and German Sign Language (SIGNUM) datasets were used for simulation and achieved comparable results. In 2017, Google Brain researchers proposed an encoder and decoder network architecture based on attention mechanisms, and used this transformer architecture for language translation. The model architecture relies completely on an attention mechanism, forgoing recurrence, to identify global dependencies between input and output. Experiments were run on 8 P100 GPUs, and state-of-the-art results were obtained for English-to-French translation [15].
Following its outstanding performance on language tasks, the transformer opened up a new dimension for computer vision problems, including image classification. In computer vision, attention can be implemented in conjunction with a convolutional network. The vision transformer (ViT), in contrast, takes an image as input and does not perform a separate feature extraction step. Instead, the image is converted into patches, and the sequence of patches serves as the input matrix to the transformer's encoder layer. Classification is then done with an MLP. The authors introduced three variants of ViT (Base, Large, and Huge) having 12, 24, and 32 layers, respectively [16]. Table 1 shows a comparative analysis of static sign language detection with state-of-the-art deep learning models, with parameter details and the number of classes.

III. MATERIALS AND METHODS
The proposed architecture uses a transformer-based method for static Indian sign recognition, with multihead self-attention in the encoder phase of the transformer. The vision transformer and multi-head attention are described in detail as follows.

A. VISION TRANSFORMER
The emergence of the Vision Transformer (ViT) strongly competes with the CNN, the state of the art in computer vision and commonly utilized in several image recognition tasks. ViT models have been reported to surpass convolutional neural networks (CNNs) in terms of computational capabilities, efficiency, and accuracy [17]. In the field of natural language processing, transformer architectures set state-of-the-art performance standards. In computer vision, only a few use cases exist, because attention is typically used in association with convolutional networks or as a substitute for certain convolution components while preserving their overall composition. The transformer encoder removes this dependency on CNNs: the standard transformer architecture can be applied directly to sequences of image patches, and it works surprisingly well on image classification tasks.
Initially, the transformer was introduced for language processing tasks. The transformer uses attention mechanisms rather than convolution layers. The design of a transformer consists of an encoder and a decoder, both of which involve self-attention and feed-forward mechanisms. Owing to its performance, the transformer can be applied to various computer vision tasks, where it performs better than other convolution-based methodologies [18]. The special use of a transformer in a computer vision task is known as a vision transformer (ViT). ViTs (different variants of ViT) achieve promising and remarkable results in computer vision tasks. ViT has two major benefits: 1) the self-attention mechanism, where the model reads a long range of input tokens in a global context; 2) the ability to train on large tasks. Figure 2 depicts the ViT design, in which images are first converted into patches in accordance with the model design. The patches are then fed directly into the linear projection layer. In the second stage, the patch embedding process is performed. A class token is added to the sequence of embedded patches; thus, the length of the patch sequence increases by one. Positional embeddings are also added to the embedded patches to retain the positional sequence of the patches. Finally, the patch embeddings and positional encodings, together with the class token, are fed to the encoder layer as the first transformer layer. The encoder is the most important component of a transformer, especially in ViT. It contains two major components, Multi Head Self-Attention (MHSA) and Multi-Layer Perceptron (MLP). The embedded input is normalized through the normalization layer. The normalized value is used to obtain the Query (q), Key (k), and Value (v) matrices, as shown in equation (1). The MHSA module executes the following equation to achieve the attention operation inside the encoder [19]:
$$\mathrm{Attention}(q, k, v) = \mathrm{softmax}\left(\frac{q k^{T}}{\sqrt{d_k}}\right) v \qquad (1)$$
where d_k is the dimension of the key vectors. Finally, the attention layer's output is fed to the feed-forward layer, which generates the encoder's final output.
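The attention operation described above can be illustrated with a short NumPy sketch. It assumes the standard scaled dot-product form of attention; the token count, the embedding size, and the random projection matrices `w_q`, `w_k`, `w_v` are illustrative placeholders and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k: (tokens, d_k); v: (tokens, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (tokens, tokens) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens, d_model = 5, 8                   # illustrative sizes only
x = rng.normal(size=(tokens, d_model))   # stands in for the normalized embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

Each row of `attn` is a probability distribution over the input tokens, so every output token is a weighted average of the value vectors.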

B. VIT FOR IMAGE CLASSIFICATION
The application of the vision transformer to image recognition was introduced in ''An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale'' by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov et al., who successfully trained it on ImageNet and attained good results compared to convolutional networks [15]. The transformer encoder receives an input image, generates fixed-size non-overlapping patches from it, and then linearly embeds the sequence of patches. A class token is embedded to represent the entire image, which is needed at the classification phase. The authors also add absolute position embeddings to the resulting vectors and feed the sequence to the pure transformer encoder. The original image resolution and patch resolution used in training and fine-tuning are reflected in every transformer checkpoint; however, any pre-trained model can be used for the same purpose. Multiple attention heads are used to improve the accuracy and predictive power of the transformer: because each head has its own internal representation and computation of the input, each head can learn relationships between patches in the sequence (i.e., collective, shared knowledge). Any relationship information among the patches missed by one head is highly likely to be covered by another head [20].
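As a minimal sketch of the input pipeline described above, the following NumPy code projects flattened patches, prepends a class token, and adds position embeddings. The patch count (144) and flattened patch size (6 × 6 × 3 = 108) follow the configuration reported later in the paper, while the embedding width `d_model = 64` and the random initial values are assumptions made only for illustration. In a trained model, the projection, class token, and position embeddings are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, d_model = 144, 108, 64   # d_model is an assumed value

patches = rng.normal(size=(num_patches, patch_dim))      # flattened image patches
w_embed = rng.normal(size=(patch_dim, d_model)) * 0.02   # linear projection weights
cls_token = np.zeros((1, d_model))                       # learnable in practice
pos_embed = rng.normal(size=(num_patches + 1, d_model)) * 0.02

tokens = patches @ w_embed                               # (144, 64) patch embeddings
tokens = np.concatenate([cls_token, tokens], axis=0)     # (145, 64): one extra token
encoder_input = tokens + pos_embed                       # positions added element-wise
```

The sequence length grows from 144 to 145 because of the class token, which matches the statement that the sequence of embedded patches increases by one.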

IV. PROPOSED ARCHITECTURE DESIGN
The proposed multihead self-attention based architecture contains three major parts: patch embedding, feature extraction, and a classification head. Stacked encoders are the core part of the model because they perform the feature extraction. Algorithm 1 presents pseudocode for the proposed methodology. Initially, the 2D images are converted into 1D sequences of embedding tokens. Images (Xi) are represented as X ∈ R^(H × W × C), where C = 3 is the number of channels and (H, W) are the height and width, here (72 × 72). These images are converted into a sequence of flattened 1D patches with shape (N, P^2 · C), where N = HW/P^2 is the total number of patches and (P, P) is the dimension of each patch. A positional embedding tensor (Epos) with shape (N, D) learns the 1D positional information of each patch and generates the spatial representation of the patches.
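The patch extraction step can be sketched in NumPy as follows. The helper name `image_to_patches` is hypothetical, and the input is a synthetic array rather than a real sign image. The 72 × 72 × 3 image size and the 6 × 6 patch size are the values stated in the paper.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (N, P*P*C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(72 * 72 * 3, dtype=np.float32).reshape(72, 72, 3)
patches = image_to_patches(image, patch_size=6)
print(patches.shape)  # (144, 108)
```

With H = W = 72 and P = 6, the sequence has N = (72/6)^2 = 144 patches, each of length P^2 · C = 108.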

B. MULTIHEAD SELF-ATTENTION
The multihead mechanism learns embedding vectors from different aspects. Multihead attention (MHA) is included in each of the 8 layers of the stacked transformer. The hidden state is divided into n = 4 heads to generate n feature tensors. Each self-attention head has three trainable matrices (q, k, v), represented in equation 2. Each of the four heads has an attention tensor that can be calculated as in equation 1; the softmax operation gives the attention score (A) for every attention head. Further, the self-attention matrices are calculated as the dot product of A and v (equation 3) [16], and the concatenated feature tensor is generated with equation 4 [16].
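A minimal NumPy sketch of multihead self-attention with four heads is given below. It follows the common formulation in which the hidden state is split across heads, attention is computed per head, and the head outputs are concatenated. The embedding width of 64, the output projection `w_o`, and the random weights are assumptions for illustration and are not specified in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (tokens, d_model); all weight matrices: (d_model, d_model)."""
    tokens, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (tokens, d_model) -> (heads, tokens, d_head)
        return t.reshape(tokens, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, tokens, tokens)
    a = softmax(scores, axis=-1)                          # attention scores A per head
    heads = a @ v                                         # (heads, tokens, d_head)
    concat = heads.transpose(1, 0, 2).reshape(tokens, d_model)
    return concat @ w_o, a

rng = np.random.default_rng(0)
tokens, d_model, num_heads = 144, 64, 4                   # d_model is assumed
x = rng.normal(size=(tokens, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
y, a = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Each head attends over all 144 patches with its own 16-dimensional slice of the hidden state, which is what allows different heads to capture different relationships between patches.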

C. MLP CLASSIFICATION
The output tensor of the multihead attention is added to the residual connection and then projected by a point-wise feed-forward network consisting of two linear layers with a ReLU activation in between. Each layer uses different weights (W1, W2) and biases (b1, b2), as shown in equation 5.
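The point-wise feed-forward step can be sketched as below, assuming the usual form max(0, xW1 + b1)W2 + b2 for two linear layers with a ReLU in between. The layer widths and the random tensors standing in for the attention output and the skip input are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Point-wise FFN: max(0, x W1 + b1) W2 + b2, applied to every token."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU between the two linear layers
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
tokens, d_model, d_hidden = 144, 64, 128    # d_model and d_hidden are assumed
attn_out = rng.normal(size=(tokens, d_model))   # stands in for the attention output
residual = rng.normal(size=(tokens, d_model))   # stands in for the skip input
w1, b1 = rng.normal(size=(d_model, d_hidden)) * 0.1, np.zeros(d_hidden)
w2, b2 = rng.normal(size=(d_hidden, d_model)) * 0.1, np.zeros(d_model)

h = attn_out + residual                     # residual (skip) connection
out = feed_forward(h, w1, b1, w2, b2)
```

Because the same weights are applied to every token independently, processing one token alone gives the same result as processing it within the full sequence.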
The proposed methodology uses an MLP [21] classifier with four hidden layers of varying size to perform classification on top of the multihead transformer network, as shown in figure 4. The classification network is defined as x_n × a1_n × a2_n × a3_n × a4_n × y_n, where x_n represents the input feature vector, a(x)_n represents the neurons in the respective hidden layers, and y_n represents the output class prediction. The MLP classifier was chosen for capabilities such as i) the ability to learn complex and nonlinear relationships, ii) improved generalization ability, and iii) learning independently of the input variable size [22]. Further, the Adam deep learning optimizer has been used, which inherits features of RMSprop and AdaGrad [23]. The parameters of the classifier were set to improve model performance.
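To illustrate how Adam combines a momentum-like first moment with an RMSprop-like second moment, a minimal scalar sketch is shown below. The learning rate and the toy objective f(θ) = θ² are assumptions chosen for illustration; β1 = 0.9, β2 = 0.999, and ε = 1e-8 are the commonly used default values, not settings reported in the paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta^2 with gradient 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
history = []
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t)
    history.append(theta)
```

On the first step the bias-corrected ratio has magnitude close to one, so the parameter moves by approximately the learning rate regardless of the gradient scale.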
The authors have proposed a transformer-based encoder model to recognize static Indian sign language from an image-based dataset. Initially, the dataset was divided into train and validation splits with a 0.2 splitting factor, and each image was read and resized to 72 × 72. Each resized image is divided into non-overlapping patches of equal size. The proposed methodology creates 144 patches from each input image, as shown in figure 3. The sequential embedding layer creates a patch sequence, which is further combined with the positional vector. The output of the positional embedding layer is fed to multi-head attention. The proposed architecture uses six self-attention layers, and a tensor of the appropriate shape is maintained at the positional embedding layer. The classification is performed in the MLP head, implemented as a single layer at fine-tuning time. The proposed methodology achieves a validation accuracy of 99.29% with only five epochs; a small amount of training can thus result in a high recognition rate. As per table 1, the proposed methodology can recognize static signs with little training and high accuracy.

V. EXPERIMENT SETUP

A. DATASET
The dataset used in the simulation is prepared from a collection of publicly available static Indian sign language datasets [24], [25], which include gestures for the digits (0-9) and the alphabet. The dataset consists of RGB images in a total of 36 classes, with more than 1000 images per class. To improve generalization, data augmentation has been applied. Table 2 shows the characteristics of the dataset used in this experiment.

B. DATA AUGMENTATION
Data augmentation is mainly used for sample balancing and for improving training sample variability. It is also significant for transformer-based frameworks because a huge amount of data is essential for model training. Among augmentation techniques such as flipping, cropping, and rotation, the authors apply horizontal flipping, colour space transformation, random zoom with a factor of 0.2 in width and height, and slight rotation from −1 to −10 to train the model on more generalized data. Figure 5 shows an overview of the dataset after augmentation.
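Two of the augmentations named above, horizontal flipping and random zoom, can be sketched with plain NumPy as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the zoom is implemented as a centred crop followed by nearest-neighbour resizing, and the input is a random array rather than a sign image. The paper's own pipeline was built with TensorFlow-Keras and may differ in detail.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror an (H, W, C) image left to right."""
    return image[:, ::-1, :]

def random_zoom_in(image, max_zoom, rng):
    """Crop a centred region whose size is randomly reduced by up to max_zoom,
    then resize it back to the original size with nearest-neighbour sampling."""
    h, w, _ = image.shape
    scale = 1.0 - rng.uniform(0.0, max_zoom)
    ch, cw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = image[top:top + ch, left:left + cw, :]
    rows = np.arange(h) * ch // h          # source row for each output row
    cols = np.arange(w) * cw // w          # source column for each output column
    return crop[rows][:, cols]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(72, 72, 3), dtype=np.uint8)
augmented = random_zoom_in(horizontal_flip(image), max_zoom=0.2, rng=rng)
```

The augmented image keeps the 72 × 72 × 3 shape expected by the patch extraction stage.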

C. IMPLEMENTATION DETAILS
The authors have worked on a standard static Indian Sign Language dataset [24], [25]. During this study, the authors implemented a modified transformer using TensorFlow-Keras. The proposed methodology has achieved 99.29% accuracy over the 36 image-based Indian sign language classes. The proposed methodology uses a static Indian sign language dataset [22] of images with a size of 480 × 320 pixels and 3 (RGB) channels. The dataset is split into two parts (train and test) with a 0.2 splitting rate. Initially, the 72 × 72 images are divided into patches of size 6 × 6, so that a total of 144 patches, computed as (image_size/patch_size)^2, are created for every image. The position embedding tensor of the patches is used as the encoder input. Further, the tensor passes through two normalization layers with ReLU activation and the MLP classification head. The performance of the proposed methodology was evaluated in three different experiments. The parameters used in training, such as the optimizer, the number of layers, and the activation, are fine-tuned. Precision [26], recall [27], and f1-score [28] were calculated as in equations 6, 7, and 8, respectively, where TP and FP are the numbers of true and false positives, respectively. The proposed methodology has achieved significant accuracy using a smaller number of attention layers in the encoding component of the transformer as well as a very small number of learning cycles. All the results were obtained on a personal computer with an Intel Core i7 and only 16 GB of RAM. Jupyter Lab is used to implement the proposed methodology. Figure 8 presents a heat map for the class-wise recognition of static sign language.
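The three metrics can be computed from per-class counts as sketched below, using the standard definitions precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 = 2 · precision · recall/(precision + recall), where FN denotes the number of false negatives. The counts in the example are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for a single class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9 0.75 0.8182
```

For a 36-class problem, these values would typically be computed per class from the confusion matrix and then averaged.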

D. RESULT OF PROPOSED METHOD
In this simulation, we evaluated different combinations of state-of-the-art classification networks with different classifiers. Furthermore, the authors also simulated the proposed methodology with different classifiers and train-test split ratios. Table 3 compares the results on the Indian sign language dataset with and without a static background; in both cases, the training and testing parameters are kept constant at five epochs and an 80-20 train-test split ratio. Data augmentation was performed in both scenarios. The outcome of this experiment indicates that the proposed methodology can recognize sign gestures over various backgrounds (which vary from sign to sign). The different appearances of the background do not depend on environmental factors or the resolution of the camera.

2) EXPERIMENT WITH DIFFERENT TRAIN-TEST SPLIT RATIO
In this experiment, we simulated the proposed methodology's performance over different train-test split ratios (80-20, 70-30, and 60-40), as shown in table 4. The reduction in the training dataset slightly affects the recognition rate.
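The split ratios used in this experiment can be reproduced with a simple shuffled split, sketched below with the Python standard library. The function name, the fixed seed, and the sample of 1000 integer identifiers are hypothetical and are used only for illustration.

```python
import random

def train_test_split(items, test_fraction, seed=0):
    """Shuffle items reproducibly and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(1000))            # stand-ins for image identifiers
split_sizes = {}
for test_fraction in (0.2, 0.3, 0.4):  # 80-20, 70-30 and 60-40 splits
    train, test = train_test_split(samples, test_fraction)
    split_sizes[test_fraction] = (len(train), len(test))
```

Using the same seed for every ratio keeps the shuffling order identical, so differences in accuracy can be attributed to the amount of training data rather than to a different ordering.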

3) EXPERIMENT WITH DIFFERENT SELF-ATTENTION HEAD
The authors also experimented with different numbers of self-attention heads; figure 6 shows that classification accuracy differs with the number of attention heads. This study aims to determine how classification accuracy depends on the number of self-attention heads. It is observed that from one head to two heads there is no major change in accuracy, whereas from two to four heads the accuracy changes sharply. Figure 7 presents a comparative analysis of different state-of-the-art classifiers over the Indian sign language dataset [24]. VGG16, VGG19, Inception V3, and ResNet-50 were used as convolutional backbones with different classifiers.

E. PERFORMANCE MEASURES
Several standard metrics for performance evaluation, such as accuracy, classification error, and precision, have been considered for measuring model performance. Accuracy can be considered an indicator of the model's performance across all classes. Precision is calculated as the ratio of the number of positive samples identified correctly to the total number of samples classified as positive. Classification error can be defined as the complement of classification accuracy, i.e., the proportion of misclassified instances. Figure 8 shows the classification results of all 36 classes as part of the class-wise performance analysis; the heat map highlights erroneously classified signs. These three performance metrics are used to compare the model with existing models and to identify the significance of the Transformer's performance for sign language recognition. Figure 9 shows a comparative analysis of the results on the Indian Sign Language dataset with augmentation and an MLP classifier with ReLU activation against other state-of-the-art deep learning methodologies for gesture recognition. The augmented ISL dataset has been tested with CNN (core), Fast-R-CNN, and Adaptive CNN. Five training cycles (epochs) were used as a fixed parameter for the comparisons. Figure 9 also gives a graphical representation of the classification report for the tested deep learning models.

F. DISCUSSION
In this article, sign language recognition is considered for static Indian sign language. The proposed methodology may assist hearing-impaired people in communicating with hearing people. The dataset used in the study is a static ISL dataset containing signs for digits and the alphabet in the context of the Indian community. The proposed methodology is able to achieve a very good accuracy of 99.29% with a large number of classes (36) and a very small number of training cycles (five epochs). There is no need for data pre-processing while working with the transformer encoder. Multihead attention helps the model improve its performance compared to other standard transformer models. Table 5 shows the performance of the proposed methodology on other standard static sign language datasets, such as American Sign Language (ASL) with and without augmentation and Bangla Sign Language. The proposed methodology also performs effectively on these datasets with the same number of training cycles (five epochs).

VI. CONCLUSION
Sign language recognition systems have many potential applications in the field of human-computer interaction. A vision-based static sign gesture recognition system is essential to reduce the communication gap between hearing people and hearing-impaired people. The proposed methodology presents transformer-based sign language recognition for static signs. A multi-head attention-based encoding framework can achieve good accuracy with a very small number of layers and training epochs. A framework with such a short training process can also achieve good accuracy over a large set of classes. The multi-head encoding framework in the Transformer boosts the recognition rate of gesture and sign language recognition as a part of human-computer interaction applications. The proposed methodology can also recognize images under augmentation, such as different angular positions and different brightness levels. Multihead-based transformers are successful for static sign recognition; as a further advancement, the proposed model can be modified for isolated and continuous sign language detection. Further, the transformer-based methodology can be used to identify sign gestures from isolated sign language videos, and it can be further extended to recognize continuous sign language.
CHINTAN M. BHATT worked as an Assistant Professor with the CE Department, CSPIT, CHARUSAT, for 11 years. He is currently working as an Assistant Professor with the Department of Computer Science and Engineering (CSE), School of Technology, Pandit Deendayal Energy University (PDEU). He is the author or coauthor of more than 80 publications in the areas of computer vision, the Internet of Things, and fog computing. He was involved in successful organization of few special issues in SCI/Scopus journals. He has won several awards, including the CSI Award and the Best Paper Award for his CSI articles and conference publications. He is also a PI in several funded projects and also completed projects funded from MOHE Malaysia, Saudi Arabia. He is the author of more than 200 ISI journal articles and conferences. His research interests include data mining, health informatics, and pattern recognition. He received the Rector Award for the 2010 Best Student in the university.
SAEED ALI BAHAJ received the Ph.D. degree from Pune University, India, in 2006. He is currently an Associate Professor with the Department of Computer Engineering, Hadramout University, and also an Associate Professor with Prince Sattam bin Abdulaziz University. His research interests include artificial intelligence, information management, forecasting, information engineering, big data, and information security. VOLUME 11, 2023