Generalization of Bangla Sign Language Recognition using Angular Loss Functions

Sign Language provides the means of conveying messages for deaf and mute people. Effective communication with the masses is a great challenge for the deaf and mute community, as Sign Language is not commonly understood. Many researchers have produced numerous works on foreign language datasets such as English, French, Japanese, etc. However, for Bangla, one of the most widely spoken languages, little significant work has been done yet. Most of the works on Bangla Sign Language are executed on small datasets and report satisfactory performance. However, when small datasets are evaluated from the perspective of generalizability, particularly when using deep learning based solutions, these models fail to reproduce the same performance. Therefore, this paper poses inter-dataset evaluation as the main evaluation criterion and evaluates several deep learning based models. This evaluation is done for Bangla by leveraging two popular datasets of Bangla Sign Language. Unsurprisingly, the inter-dataset performance is inferior, and several approaches to improve it are explored and documented, including the use of angular margin based loss functions. The results demonstrate the importance of such an evaluation and also show that one of the proposed approaches achieves promising performance, albeit with significant room for improvement. This raises the need for a standardized dataset to overcome this issue of generalization for real-life applications, as well as the need to encourage future works to concentrate on challenging evaluations instead of pursuing deceptively good intra-dataset performance.


I. INTRODUCTION
Language is the core tool used in expressing one's thoughts, ideas, and increasingly social identity, as it plays an indispensable role in the simple yet powerful act of passing a message to another. Nevertheless, some individuals are deprived of the ability to speak and hear. These individuals may not be able to communicate through common means, yet they most definitely need to communicate. The dominant approach to enabling communication is the use of a special gesture-based method, popularly called Sign Language. Around 3 million hearing-impaired people are living in Bangladesh [1]. Deaf and mute people face major difficulty communicating with hearing people, especially in Bangladesh, where social awareness is lower. Such situations require Sign Language interpretation, as very few people can comprehend it. Currently, the only way of communication is via human interpreters. Still, the availability of such expertise is scarce, which becomes a significant barrier to the effective integration of the deaf and mute communities. As a result, due to this lack of expressiveness, they face critical challenges in education, socialization, medical therapy, and significantly in gaining employment. Therefore, automatic Sign Language recognition can play a vital role in more effective interaction with the special needs community and bridge the gap. Over the past few decades, image-based machine learning and deep learning techniques have stood out in the field of recognition. These techniques can be utilized to translate hand signs into text or audio form so that the hearing majority can understand them easily [2].
The two dominant approaches to Sign Language recognition are distinguished by their inputs, namely vision-based and sensor-based methods [2]. Vision-based techniques obtain images from a camera without any extra sensors or special gloves, whereas, in sensor-based methods, gloves are worn by the signer, or sensors are used to recognize the signs. Besides being inconvenient to carry and use in practice, these devices are not always available or affordable. Generally, gestures are classified into static and dynamic forms in every Sign Language, where static signs are still image forms, and dynamic signs are represented by video capturing motion. Sign Language recognition can be categorized into three major components: finger-spelling, word-level sign vocabulary, and non-manual features such as facial expressions [3]. Most researchers have worked mainly on finger-spelling datasets. In this category, signs are divided into further sub-categories depending on the number of hands used (one-handed and two-handed methods).
There is no widely adopted international Sign Language, and in every country, Sign Languages vary from region to region. Standard benchmark Sign Language datasets are available for high resource languages, such as English, Arabic, Chinese, etc. Shi et al. [4] introduced the largest American Sign Language (ASL) dataset using naturally occurring video data. Kang et al. [5] made a dataset consisting of 1,000 images for each of 31 different hand signs from five subjects, including alphabets and numbers, for ASL recognition. A dataset consisting of over 500 samples of each symbol, a total of 48,000 samples, was recorded from 4 different persons by Pugeault et al. [6] to build a large-scale ASL dataset. Rioux-Maldague et al. [7], Aryanie et al. [8], and Ameen et al. [9] used this dataset introduced by Pugeault et al. for the classification of ASL. Hu et al. [10], in their work, used a large dataset of 120,000 images representing 24 alphabet letters. Apart from ASL, Nakjai et al. [11] made a dataset for Thai Sign Language classification, where data was collected from 11 signers. Pariwat et al. [12] also made a similar Thai dataset from five hand signers. Jiang et al. [13] developed a self-collected dataset containing 1320 images with the help of 44 volunteers, which was later augmented, for Chinese Sign Language. For Arabic Sign Language, a dataset has been developed by Aly et al. [14], where the 28 Arabic alphabets were repeated 50 times by three different users. This dataset has more samples than the one made by Dahmani et al. [15], where the authors made a dataset of 30 hand signs from 30 different subjects who repeated the 30 letters an average of 4 times. Similarly, there exist well-enriched and well-established datasets for other popular languages. However, despite Bangla being the seventh most spoken language, with 265 million speakers [16], there are no large Bangla Sign Language (BSL) datasets that can take full advantage of machine learning and deep learning methods for recognition.
Most of the works that have been done so far on Bangla Sign Language (BSL) have relied on small author-curated datasets. Among these datasets, the Ishara-Lipi dataset introduced by Sanzidul et al. [17] is commonly utilized to identify Bangla hand gestures, as it is pre-processed and publicly available. The collection consists of 2800 samples, of which 1000 are for digits and the remaining 1800 are for alphabets. This amount of data is inadequate for robust and effective deep learning models. Shafiqul et al. [18] constructed a dataset for BSL containing 30,916 samples, of which 23,864 are alphabets and 7,052 are digits. This was the most extensive BSL dataset we could find when we initiated our study. While there are numerous papers reporting works on BSL, the datasets are either too small or not publicly available.
This paper demonstrates that although studies on automatic BSL recognition may seem promising, the results are only impressive on individual collections. The same solutions do not generalize well, as shown in this study using inter-dataset evaluation. The proposed study aims to investigate the current state of the art in BSL recognition and to present the necessity of generalization of BSL recognition in order to deploy models in real-world scenarios where the data can be more diverse. This study conducted multiple experiments in both inter-dataset and intra-dataset testing schemes in order to evaluate the existing hypothesis with more diverse data. Although intra-dataset test performance is promising, inter-dataset test performance, measured across several recent architectures and loss functions, raises questions about generalization capability. Our findings demonstrate that, while the intra-dataset performance of the models is satisfactory, the inter-dataset performance is comparably low. In order to improve inter-dataset performance, we have experimented with different angular margin loss functions, as such losses have shown some success in improving generalization in face recognition [19] and speaker recognition [20] in the last few years. After incorporating these specialist loss functions, the results show relatively better inter-dataset performance compared to other, more widely used methods. However, the improvement does not result in a practically usable outcome, because the two datasets' samples of corresponding classes differ considerably in terms of angle, light contrast, orientation, hand size, and background. This paper also presents an investigation of biases in training samples using Grad-CAM visualization. The findings of this study can serve as an impetus for research on both finding more generalized methods and building a more diverse dataset for BSL recognition. The contributions of this paper are summarized as follows:
• To the best of our knowledge, this is the first study that evaluates automatic finger-spelled BSL recognition in an inter-dataset setting.
The rest of the paper is organized as follows: Section II summarizes the existing literature and presents the different loss functions used in our work; Section III describes the datasets, data processing, and methodology of this work; the different experimental configurations and details are described in Section IV; Section V presents the results obtained from the different configurations, followed by their discussion in Section VI; an ablation study is presented in Section VII; and Section VIII concludes the paper.

II. BACKGROUND STUDY
This section discusses various studies on Sign Language recognition. Many studies employ machine learning or deep learning techniques, and some also incorporate image processing to achieve a better outcome. The last part provides a brief analysis of different loss functions, particularly those applied in this work.

A. MACHINE LEARNING TECHNIQUES
Before deep learning, machine learning algorithms were implemented effectively for recognizing static images. Aryanie et al. [8] proposed a machine learning approach that recognizes finger-spelled ASL characters using the K-Nearest Neighbour (KNN) algorithm along with Principal Component Analysis (PCA) for dimensionality reduction. Mukai et al. [21] presented a Japanese finger-spelling recognition system using classification trees and a Support Vector Machine (SVM) and achieved 86% accuracy. Pariwat et al. [12] also applied SVMs to Thai finger-spelling recognition. They extracted features from input images both locally and globally and compared different types of SVM kernels. In the specific context of BSL, Hasan et al. [22], in their work, proposed a Linear Support Vector Machine (LSVM) for classification on a dataset of 5400 training images. Compared to the KNN algorithm, which gave 90.94% accuracy, their model achieved 96.463% accuracy.
The proposed system provides a low-cost translator that can convert an image to text. Hasan et al. [23] also used the SVM algorithm as a classifier for their work after extracting HOG descriptors. A dataset of 320 sample images was used, where 64 images were used for testing, and the dataset was divided into nine classes, with each category containing 20 samples. This small amount of data is inadequate for robust training and evaluation. Similar to the work of Hasan et al. [23], Uddin et al. [24] also trained an SVM on a dataset of 2400 images, which was then evaluated on another 2400. They reported an accuracy of 97.7% with their method, which involved converting RGB images to the HSV color space, extracting features using a bank of Gabor filters, and using Kernel PCA for dimensionality reduction. Yasir et al. [25], in their work, trained on a relatively small dataset of 330 RGB images using PCA and Linear Discriminant Analysis (LDA) to minimize intra-class scatter and maximize inter-class distance. Yasir et al. [26] used SIFT features on a vocabulary dataset of only 90 images; KNN and SVM were then trained on these features. A recurring theme in these studies is the self-collection of small datasets, most of which are not available for comparative study.

B. DEEP LEARNING TECHNIQUES
Image-based deep learning techniques have been used in many works for Sign Language classification. Aly et al. [14] introduced an Arabic Sign Language recognition system using PCANet, which helps to identify local features from the depth and intensity of images. Ameen et al. [9] developed a convolutional neural network (CNN) for classifying finger-spelling images and reported a precision of 82% and a recall of 80%. Hu et al. [10] presented a Deep Belief Network (DBN) with three Restricted Boltzmann Machines (RBMs) for classifying ASL. Rioux-Maldague et al. [7] also used a DBN for ASL recognition and achieved 99% recall; this work focused on improving real-time ASL recognition while being robust to different light intensities. Kang et al. [5] also applied a CNN architecture to classify finger-spelling in ASL while incorporating information about the position of the thumb. Nakjai et al. [11] combined two approaches, CNN and Histogram of Oriented Gradients (HOG), and reported a precision of 91.26% for Thai finger-spelling recognition.
Most of the works on BSL used CNN-based architectures along with softmax loss for the classification of images. Shafiqul et al. [18] created the most extensive dataset for BSL, consisting of 30,916 samples collected from 25 different students. They developed a CNN-based architecture and reported accuracies of 99.83%, 100%, and 99.80% for basic characters, numerals, and combined usage, respectively. Ahamed et al. [27] used a dataset of 518 image samples with an artificial neural network (ANN) that identifies fingertip position for recognition and achieved a recognition rate of 99%. A dataset consisting of 1000 gestures and 1000 letters was made by Hoque et al. [28] and was used to train a Region-based Convolutional Network; their model achieved an accuracy of 98.20% on their dataset. Rafi et al. [29] gathered a dataset of 12,581 images of hand signs for 38 alphabets. They proposed a VGG19-based network to recognize static hand gestures with an accuracy of 89.6%. Similarly, Urmee et al. [30] employed the Xception architecture on a dataset of 2000 images with 37 different signs, obtaining an accuracy of 98.93%. Yasir et al. [31] used a virtual reality-based hand-tracking controller to capture hand motion with a CNN-based architecture, generating a limited amount of data for their work and reporting an error rate of 2%.
Sanzidul et al. [32] introduced Ishara-Bochon in their work, a processed dataset of BSL digits. This dataset consists of 1000 images of Bangla sign digits from 0 to 9, which were fed into a CNN model, obtaining an accuracy of 92.87%. Sanzidul et al. [33] also built the Ishara-Lipi dataset with 1800 image samples, covering the 36 alphabet classes of BSL. Sanzidul et al. [34] proposed another study that employed this dataset to train a CNN architecture, reporting accuracies of 92.74% and 94.74% on the training and validation sets, respectively. Both the Ishara-Bochon and Ishara-Lipi datasets have been made open source for public contribution. In another work, Sanzidul et al. [35] used the Bangla sign numerals dataset (Ishara-Bochon), converting the images to grayscale before classifying them with a CNN architecture, and reported an accuracy of 95%. This numeral dataset was also utilized by Hossain et al. [36] in their work, where a capsule network was used for recognizing BSL digits. Their proposed model achieved better performance than the previous work of Sanzidul et al. [35], with an accuracy of 98.84%. The Ishara-Lipi dataset was again used by Hasan et al. [37], who achieved an accuracy of 99.22%. Nihal et al. [38] presented a robust model by establishing a standard automated recognition system for seen and unseen data utilizing two approaches: transfer learning and zero-shot learning (ZSL). Furthermore, they contributed a BSL dataset by gathering nine different datasets with variation in background, hand size, capture angle, and skin tone. They trained 378 models on a small dataset and found that DenseNet201 followed by an LDA classifier performed best. They then trained this best model on a large dataset and reported 93.68% accuracy. Most of these researchers built their proposed systems on the VGG16 architecture with some changes to the classification layer, and their systems performed well on their respective datasets. Interestingly, however, we have not found any studies that examine the generalizability of these solutions. We aim to fill this gap by performing an inter-dataset evaluation.

C. LOSS FUNCTIONS
Our work mainly focuses on the generalization of solutions so that Sign Language systems can perform robustly. Prior works on recognizing BSL used deep learning models with the traditional softmax loss. In this work, we have experimented with different loss functions to obtain better performance in inter-dataset evaluations. Experiments are done with SphereFace, CosFace, and ArcFace, as these loss functions have achieved noticeable results in face recognition and have produced robust models in that domain [39], [40], [19]. These angular loss functions have usually been used in face recognition tasks, and very few researchers have used them for other recognition tasks. Chowdhury et al. [20] proposed a joint angular margin loss for speaker recognition and achieved a notable performance improvement in an inter-dataset evaluation setting. This work aims to analyze the generalizability of several deep learning hand sign recognition models when trained on the previously mentioned publicly available datasets. As angular margin losses have proven useful in building better-generalized face recognition models, we use them in this context of BSL recognition.

III. DATASETS AND METHODOLOGY

A. DATASETS

1) Ishara-Lipi Dataset
• The total number of classes in the dataset is 36. However, we removed the Bangla sign character for the alphabet "la", as it was not present in the BdSL dataset. Therefore, we worked only on 35 classes of Bangla alphabets.
• Each class has between 26 and 36 images, with an average of 28 images per class.
• It has a total of 978 images.

2) BdSL Dataset
• There are 35 classes in this dataset as mentioned earlier.
• Each class has around 660-700 images, with an average of 680 images.
• This is a total of 23,786 images.

B. DATA PREPROCESSING
The datasets are arranged so that labels are consistent with the associated sign characters of both datasets, because the Ishara-Lipi dataset contains only characters, whereas BdSL comprises both digits and characters. The BdSL dataset contains images of 100×100 pixels, while the Ishara-Lipi images are 64×64 pixels. Hence, all images are resized to 64×64 pixels. We experimented with multiple pre-trained models, and the images were normalized as per the requirements of each pre-trained model [42]. As the Ishara-Lipi dataset is significantly smaller, its training data has been augmented using two techniques: random translation and brightness changes. To increase the number of samples, the images are shifted horizontally, and the brightness level is varied.
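A minimal sketch of such an augmentation pipeline, assuming torchvision is used alongside the PyTorch framework mentioned later; the exact shift fraction and brightness factor are illustrative assumptions, not the paper's values:

```python
from torchvision import transforms

# Augmentation for the smaller Ishara-Lipi training split: resize to the
# common 64x64 input size, shift horizontally, and jitter brightness.
train_transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.0)),  # horizontal shift only
    transforms.ColorJitter(brightness=0.3),                    # random brightness change
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],           # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),           # expected by the backbones
])
```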

C. PROBLEM FORMULATION
We formulate our evaluation criteria by assuming the training and test sets of the BdSL dataset as $X_{BdSL\text{-}Train} = \{x_1, x_2, x_3, \ldots, x_N\}$ and $X_{BdSL\text{-}Test} = \{x_1, x_2, x_3, \ldots, x_N\}$, respectively. We also consider the training and test sets of the Ishara-Lipi dataset as $X_{Ishara\text{-}Train} = \{x_1, x_2, x_3, \ldots, x_N\}$ and $X_{Ishara\text{-}Test} = \{x_1, x_2, x_3, \ldots, x_N\}$, respectively, where each set has $N$ sign images spanning 35 classes. We denote a label as $Y_c$, where $c \in [1, 35]$. The goal of our research is to compare the performance of different deep learning models in intra- and inter-dataset settings. For intra-dataset evaluation, we train and test the model on the same dataset; for instance, when deep learning models are trained on $X_{BdSL\text{-}Train}$, the $X_{BdSL\text{-}Test}$ images are used for testing. Conversely, we assess inter-dataset performance by training models on one dataset and testing on another. In particular, for a given test set $X_{BdSL\text{-}Test}$ of sign images, we aim to assign a label $Y_c$ while the model is trained on $X_{Ishara\text{-}Train}$.
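A minimal sketch of this evaluation protocol in PyTorch (the framework used later in this paper); the models and data loaders named in the comments are placeholders for whatever was trained on each corpus:

```python
import torch

def accuracy(model, loader, device="cuda"):
    """Top-1 accuracy of a trained model over a test loader (35-way classification)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / total

# Intra-dataset: train and test splits come from the same corpus, e.g.
#   acc = accuracy(model_trained_on_bdsl, bdsl_test_loader)
# Inter-dataset: the test loader comes from the other corpus, e.g.
#   acc = accuracy(model_trained_on_ishara, bdsl_test_loader)
```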

D. MODEL
Recent deep learning approaches, specifically CNNs, have outperformed prior state-of-the-art machine learning techniques, particularly in computer vision tasks. Most commonly, a CNN is used to analyze visual imagery to recognize or classify objects. This study focuses on using such deep learning architectures to classify Bangla finger-spelled signs presented as static images in figures 2 and 3. This work considered several modified pre-trained CNN architectures, including AlexNet [43], ResNet50 [44], VGG16, VGG16 with batch normalization (VGG16-BN), VGG19, and VGG19 with batch normalization (VGG19-BN), for transfer learning [45]. To speed up the learning process and improve the generalizability of the network [46], we fine-tune these pre-trained models on our pre-processed datasets. We discarded the classification layer of each pre-trained model and experimented with different combinations of fully connected layers after the final feature vector to determine the optimal number of layers and the number of neurons in each layer. Our experiments found that a 512-256 combination after the final feature vector showed the best results. Batch normalization layers are also integrated between the fully connected layers to stabilize the learning process [47]. A softmax function is used in the last fully connected layer. The rectified linear unit (ReLU) activation function is used in all models, as it introduces non-linearity into the network, which helps improve the generalization of the classifier [48], [49]. We then continued our further experiments with angular losses using the model that produced the best results among the six networks.
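A minimal sketch of this transfer-learning setup, with the pre-trained classifier replaced by the 512-256 head described above; the exact placement of BatchNorm and ReLU, and the class name `BSLClassifier`, are our assumptions:

```python
import torch.nn as nn
from torchvision import models

class BSLClassifier(nn.Module):
    """Pre-trained VGG19 backbone with a 512-256 fully connected head (a sketch)."""
    def __init__(self, num_classes=35):
        super().__init__()
        # newer torchvision API; older versions use models.vgg19(pretrained=True)
        backbone = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = backbone.features   # pre-trained convolutional stack
        self.avgpool = backbone.avgpool     # adaptive pooling to 7x7
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),    # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.head(self.avgpool(self.features(x)))
```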

E. LOSS FUNCTIONS
In this section, we discuss the loss functions considered in this study. Traditional softmax loss can produce good results in multi-class classification. However, angular margin losses are able to maximize the decision margin in angular space, which reduces intra-class variation while increasing inter-class distance, enhancing discriminative power [40]. Therefore, we have incorporated different angular margin based loss functions in our work. It is worth mentioning that most of the works on hand sign recognition that we have studied so far train CNN models using the softmax loss.

1) Softmax Loss
The softmax loss combines the softmax function and cross-entropy loss [50]. The softmax function processes the output of a fully connected layer, whose values are usually interpreted as scores, and projects them onto a probability distribution. The softmax loss is defined as

$$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^T x_i}}{\sum_{c=1}^{C} e^{W_c^T x_i}} \quad (1)$$

where $x_i$ represents the learned feature vector of the $i$-th sample corresponding to the $y_i$-th class, $W_c$ denotes the weight vector of the last fully connected layer ($c \in [1, C]$, where $C$ is the number of classes), and $N$ is the batch size. Here, we fix the bias to 0 for simplicity.
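As a minimal sanity check (our illustration, not the authors' code), equation (1) with the bias fixed to 0 is exactly what PyTorch's cross-entropy computes on the logits $W^T x_i$:

```python
import torch
import torch.nn.functional as F

N, C, d = 8, 35, 256                  # batch size, classes, feature dimension
x = torch.randn(N, d)                 # learned feature vectors
W = torch.randn(C, d)                 # last-layer weights (bias fixed to 0)
y = torch.randint(0, C, (N,))         # ground-truth labels

logits = x @ W.t()                    # W_c^T x_i for every class c
loss = F.cross_entropy(logits, y)     # -1/N * sum_i log softmax(logits_i)[y_i]
manual = -F.log_softmax(logits, dim=1)[torch.arange(N), y].mean()
assert torch.allclose(loss, manual)   # both compute equation (1)
```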

2) SphereFace Loss
The angular softmax (A-Softmax) loss is derived from the traditional softmax loss by manipulating its decision boundaries to construct an angular margin. The inner product of input feature vectors and weights from equation (1) can be expressed as

$$W_c^T x_i = \|W_c\| \|x_i\| \cos\theta_c \quad (2)$$

where $\theta_c$ is the angle between $W_c$ and $x_i$. From equation (2), it is evident that the posterior probability depends on both the norms and the angle. By normalizing the weight vectors with L2 normalization, only the cosine of the angle contributes to the prediction. The size of the angular margin can be adjusted by a positive integer parameter $m$, which forms the basis of learning an angular margin between distinct classes. The loss proposed in [39] is formulated as

$$L_{SphereFace} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\|x_i\|\cos(m\theta_{y_i})}}{e^{\|x_i\|\cos(m\theta_{y_i})} + \sum_{c \neq y_i} e^{\|x_i\|\cos\theta_c}} \quad (3)$$

3) Large Margin Cosine Loss
Similar to SphereFace [39], the authors of [40] normalize the weights with L2 normalization along with the embedding vector, which is then rescaled by a scaling parameter $s$. The scaling parameter adjusts the radius of the hypersphere manifold. To maximize the decision margin, they introduce the hyper-parameter $m$ in cosine space rather than angular space [40]. The Large Margin Cosine Loss is defined as

$$L_{CosFace} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{c \neq y_i} e^{s\cos\theta_c}} \quad (4)$$

where $\theta_c$ is the angle between $x_i$, the $i$-th feature vector with corresponding ground truth $y_i$, and $W_c$, the weight vector of the $c$-th class, and $N$ is the batch size. In [40], the authors show that the discriminative capability of the learned features can be increased by raising the margin $m$.

4) ArcFace Loss
In this work, the authors of [19] also normalize both the deep feature vector and the weight vector to project the embeddings onto a hypersphere with a scaling constant $s$. They introduce an additive angular margin $m$ to the target angle in angular space to simultaneously enhance intra-class concentration and inter-class divergence [19]. The study also illustrates that ArcFace compares favorably with the other loss functions, SphereFace and CosFace, since it has superior geometric attributes: the angular margin corresponds exactly to geodesic distance on the hypersphere.
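Concretely, ArcFace replaces the target logit $\cos\theta_{y_i}$ in equation (4) with $\cos(\theta_{y_i} + m)$, giving $L_{ArcFace} = -\frac{1}{N}\sum_{i} \log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{c \neq y_i} e^{s\cos\theta_c}}$. Below is a minimal sketch of an ArcFace-style margin layer, not the paper's implementation; the defaults $s = 2$ and $m = 0.1$ mirror the values tuned later in this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    """ArcFace-style layer: normalize features and weights, add an angular
    margin m to the target angle, and scale by s (a sketch, not [19]'s code)."""
    def __init__(self, in_features, num_classes, s=2.0, m=0.1):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x, labels):
        # cos(theta_c) between normalized features and normalized class weights
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return self.s * logits            # feed these logits into F.cross_entropy

# CosFace differs only in the margin line, subtracting m in cosine space:
#   logits = torch.where(target, cos - self.m, cos)
```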

IV. EXPERIMENTS

A. EXPERIMENTAL CONFIGURATION
In [17] and [18], it is observed that the VGG family of networks performs best among deep learning-based architectures for BSL recognition. Of the six models, the VGG19 CNN architecture is selected as the feature extractor for further experiments, including those incorporating angular margin based loss functions. The VGG19 model with SphereFace loss performed well in certain settings and is selected for further analysis. The model is depicted in figure 4. We first follow the settings of [17] and [18] by classifying the 35 hand gestures in intra-dataset configurations on the VGG16 network using cross-entropy loss. Afterward, we explore the phenomenon of generalization by assessing inter-dataset performance while keeping all hyper-parameters constant. For inter-dataset combinations, we found that setting the batch size to 100 yields better results. All networks are trained with the SGD (Stochastic Gradient Descent) optimizer with a 0.01 learning rate for up to 30 epochs. Training is regularized by fixing the weight decay to 5e-4, and the batch norm momentum is set to 0.1. All experiments are conducted using the PyTorch framework [51]. After analyzing the models, we tuned the hyper-parameters of the three angular margin losses for our experimental setting. Through trial and error, we observed that an angular margin of m = 3 achieves the best performance for the SphereFace loss [39], while both the CosFace [40] and ArcFace [19] losses perform best with the margin m set to 0.1 and the scaling constant s set to 2.
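A minimal training-loop sketch matching the configuration above (SGD, learning rate 0.01, weight decay 5e-4, up to 30 epochs); `BSLClassifier` refers to the sketch in the Model section, and `train_loader` stands for a standard DataLoader over the training split:

```python
import torch
import torch.nn as nn

model = BSLClassifier(num_classes=35).cuda()   # sketched model from Section III
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()              # softmax loss baseline

for epoch in range(30):
    model.train()
    for images, labels in train_loader:        # batch size 100 for inter-dataset runs
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()
```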
Hardware: The experiments are conducted on a PC with a Ryzen 7 4.2 GHz CPU, NVIDIA RTX 2060 6GB GPU, and 16 GB RAM.

V. RESULT ANALYSIS

A. PERFORMANCE OF FULLY CONNECTED LAYERS
In this study, we evaluated the inter-dataset performance of the fine-tuned VGG19 model with softmax loss while varying the number of neurons in the fully connected layers to enhance generalization performance; the results are presented in tables 3 and 4. The experiments with hidden layers focus mainly on configurations A and B, as our primary objective is to improve generalization performance; moreover, the accuracy of the remaining two configurations, C and D, is around 99%, as demonstrated in [17] and [18]. Initially, we added four fully connected layers with 1000-512-256-128 neurons on top of the pre-trained network, feeding the last 128-neuron layer to the classification layer. However, we discovered that increasing the number of neurons overfits the model, resulting in poor test accuracy, as demonstrated in table 3. As a result, we steadily reduced the number of layers until we found that training the model with two layers significantly improves performance. The drop in accuracy when adding more neurons is likely because the model memorizes features particular to the training data and cannot generalize to new, unseen data [52]. It is evident that the fully connected layers with 512 and 256 nodes outperform other combinations of hidden layers on both the test and validation sets in both configurations, as reported in tables 3 and 4. The findings also indicate that two layers of 256-128 neurons do not perform any better than 512-256; the accuracy degrades, most likely because the model cannot acquire enough information from the data with fewer neurons, causing underfitting. Hence, all experiments are carried out using our optimal 512-256 fully connected layers after the pre-trained network.

B. PERFORMANCE OF CLASSIFICATION
The datasets of [17] and [18] were created mainly for the classification of BSL. In this subsection, we analyze the performance of several architectures for the classification of the 35 distinct classes. All six models are evaluated using softmax loss, with all hyper-parameters held constant for fairness.

1) Model Evaluation on Intra Dataset
Tables 5 and 6 present a comparative analysis of the six models, along with the approaches proposed in [17] and [18]. A noticeable observation is that the differences in accuracy across all the models are very small, as shown in both tables 5 and 6.

2) Model Evaluation on Inter Dataset
We performed a thorough assessment of the deep learning models trained in the inter-dataset settings (configurations A and B) to determine the best model, as no conclusion could be drawn from the aforementioned findings due to the models' similar results. In our work, the validation set and training set are drawn from the same dataset when evaluating a trained model; we then test the trained model on the other dataset's test set, highlighting the generalization problem. We demonstrate that the validation partition is useful in the intra-dataset setting, as it is drawn from the same dataset as the training set. However, the validation set does not serve its purpose in the inter-dataset setting, further strengthening our argument that these datasets do not generalize.
The results in tables 7 and 8 show that the accuracy of all the networks trained in inter-dataset settings, which had previously performed well on intra-datasets, decreases significantly. According to table 7, VGG16 beats all other models for configuration A, while VGG19 is the second-best model with just a 0.2% difference. Table 8 illustrates that VGG19 achieves the best performance, with an accuracy of 52.32%, outperforming all other models for configuration B. On the other hand, VGG16 performs poorly in configuration B, as shown in table 8, despite achieving the maximum accuracy in configuration A. The 0.2% difference between VGG19 and VGG16 in configuration A is negligible compared to the 6.21% margin achieved by VGG19 in configuration B. Therefore, we conclude that the VGG19 model generalizes better than the other models, though there is still scope for improvement; the following experiments are conducted using VGG19 as the optimal model. Furthermore, we observe that VGG16-BN and VGG19-BN perform well on intra-datasets, whereas their accuracy falls in both configurations A and B, as shown in tables 7 and 8. Moreover, ResNet50 reports the lowest accuracy in configurations A, B, and C. Hence, based on all of these results, ResNet50 is certainly not a suitable model for this analysis.
Although the VGG19 model performed better than the other CNN models, there is a significant performance gap between the validation and test sets. We use Grad-CAM visualization to analyze the failure modes of the VGG19 model on the inter-dataset and identify dataset bias, as shown in figure 5. The Grad-CAM technique generates a visual explanation for the transparency of CNN-based models by localizing class-discriminative regions [53]. Comparing the original dataset images with the inter-dataset test images, it is evident that images representing the same class differ mainly in angle and orientation. To evaluate what wrong decisions the model makes, we conducted inter-dataset and intra-dataset testing, respectively. In both figures 5a and 5b, the Grad-CAM visualization reveals that the model has learned to localize the discriminative region in the intra-dataset test, as it focuses on the center of the important regions. However, the model misclassifies hand gestures of the same class in the inter-dataset test, indicating that it cannot concentrate on the crucial regions that distinguish between classes. These figures provide explicit visual representations of the generalization problem in biased datasets.
The visual representation suggests that the images in these datasets lack diversity and are biased toward a specific orientation, and thus the model cannot highlight the class-discriminative regions needed to classify correctly.
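A compact Grad-CAM sketch for a VGG-style model, assuming the convolutional stack is exposed as `model.features` (as in the earlier `BSLClassifier` sketch); this is our illustration of the technique in [53], not the authors' code:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx):
    """Returns a heatmap localizing the class-discriminative regions [53]."""
    model.eval()
    activations, gradients = [], []
    layer = model.features[-1]  # last block of the convolutional stack
    h1 = layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(image.unsqueeze(0))
    logits[0, class_idx].backward()       # gradients of the class score
    h1.remove(); h2.remove()

    weights = gradients[0].mean(dim=(2, 3), keepdim=True)  # global-average gradients
    cam = F.relu((weights * activations[0]).sum(dim=1))    # weighted activation map
    return F.interpolate(cam.unsqueeze(0), size=image.shape[1:],
                         mode="bilinear", align_corners=False)[0, 0]
```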
We further investigated the reason behind the significant performance gap between intra- and inter-dataset settings by plotting the t-SNE of VGG19 features. As can be observed in figure 6, the cluster structure of the intra-dataset feature vectors is much better defined, with tighter and more separated clusters: clusters of the same class are tightly compacted, and clusters of distinct hand gestures are separated almost perfectly. However, the learned embedding vectors of the inter-dataset are scattered in the embedding space, implying that the model fails to predict the correct class. Therefore, it is clear that VGG19 extracts more distinct features in the intra-dataset setting than in the inter-dataset one.
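A minimal sketch of such a t-SNE plot over penultimate-layer features; scikit-learn and matplotlib are our tooling choices here, not necessarily the authors', and the 256-dimensional feature size assumes the sketched 512-256 head:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """features: (n_samples, 256) array of penultimate-layer activations;
    labels: (n_samples,) integer class ids in [0, 34]."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=4)
    plt.title(title)
    plt.show()
```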

3) Generalization Evaluation
The purpose of this study is to underline the importance of generalization while enhancing inter-dataset classification performance. As mentioned above, inter-dataset evaluation trains a model on one dataset while testing it on another. Therefore, this subsection focuses only on inter-dataset classification rather than intra-dataset classification, because the accuracy of intra-dataset classification is consistently around 99%, leaving little opportunity for improvement. A comparative analysis of loss functions trained on the VGG19 model is presented in tables 9 and 10. As shown in table 9, the performance of VGG19 is enhanced considerably after utilizing angular loss functions.
• It is worth mentioning that configuration B has a comparatively smaller test set than configuration A. SphereFace performs well on the large test set, with a performance improvement of almost 8%, while in configuration B it performs worse than softmax. Given the large test set in configuration A, the results on configuration B's small test set may reflect overfitting.
• For configuration A, ArcFace and CosFace have almost the same performance, which is still better than softmax, indicating that softmax shows the worst performance in this specific setting.
• ArcFace and CosFace both performed worse in configuration B compared to the other loss functions.
Therefore, we conclude that SphereFace appears to be the best loss function for inter-dataset classification, and that VGG19 with the SphereFace loss is the optimal solution for our generalization model. Figures 7 and 8 illustrate the confusion matrices of hand gesture classification using the modified VGG19 network with SphereFace loss for configurations A and B, respectively, for a more detailed interpretation of generalization. The confusion matrix describes the per-class classification performance, where the scores along the diagonal represent predictions correctly assigned to their output class. From the recognition of alphabets for configuration A, it can be deduced that the model is unable to recognize class 5, i.e., "ē", as demonstrated in figure 7. Figure 8 shows the confusion matrix of configuration B, where the "pa" and "i" classes yield the worst results. Here, class "ē" produces better results than in configuration A, whereas class "i" performs poorly. From these findings, we deduce that the signs for "ē" and "i" are quite similar, so the model could not distinguish both at the same time. We also notice that some other classes are misclassified as well. Along with the confusion matrices, we present the per-class accuracy in table 11 for a clearer representation of both configurations. According to table 11, "ya" and "tha" have the best accuracy scores of all the classes, with 98.6% in configuration A and 89.66% in configuration B, respectively, while the prediction accuracy of several other classes is comparatively poor. It can also be observed that 19 alphabets in configuration A and 22 alphabets in configuration B have an accuracy greater than 50%. In conclusion, our results provide evidence that, while intra-dataset results are encouraging, there is a significant performance gap in inter-dataset settings.

VI. DISCUSSION

A. COMPARISON OF DATASETS
After pre-processing the data, we observed that the BdSL dataset is significantly deficient in diversity, since many of its samples are nearly identical. Although the data was obtained from 25 people, it was augmented by rotating, zooming, and shifting, which is likely why the images are so similar. In Ishara-Lipi, by contrast, the collected data exhibits variety, but each class contains only a limited number of samples, which is insufficient to train a complex CNN architecture. Even though the findings of [17] and [18] are satisfactory on their own datasets, the techniques do not perform equally well on other datasets. Furthermore, our findings demonstrate the inferior generalization potential of current Bangla finger-spelled recognition systems. These deep learning models may be useful in some situations, but they are ineffective in real-world applications.

B. COMPARISON OF DIFFERENT LOSS FUNCTIONS
As previously noted, angular loss functions are commonly used on face recognition datasets with large numbers of classes. Three different margin penalties are proposed in SphereFace, CosFace, and ArcFace: multiplicative angular margin, additive cosine margin, and additive angular margin, respectively [19]. All of these losses impose a margin penalty to enforce intra-class concentration, by lowering the angle between samples and their ground truth, and inter-class diversity, by penalizing the target logit while increasing the angle. Therefore, we performed our generalization evaluation based on these loss functions so that the model can learn highly discriminative hand features. Initially, we set the hyper-parameters to s = 64 and m = 0.5, following [19], for ArcFace and CosFace, but the accuracy dropped below that of softmax. After several trials, we found that setting s to 2 and m to 0.1 gives optimal performance; lowering the hyper-parameter values improves performance, most likely due to the small number of classes compared to face recognition datasets. Furthermore, we found that m = 3 produces the optimal result for SphereFace after experimentation. Figures 9 and 10 illustrate the performance of the loss functions, consistent with tables 9 and 10; for both configurations, SphereFace achieves the best performance.

VII. ABLATION STUDY
Apart from the six models described above, we also employed the GoogleNet [54], MobileNet [55], MnasNet [56], and InceptionNet [57] CNN architectures. All of these networks perform comparatively poorly, reporting less than 30% accuracy. On the data side, we tried converting the RGB images to grayscale; however, this performed considerably worse on the VGG16 model, with an accuracy of 38%. We also applied a Laplacian filter, i.e., an edge detector that computes the second derivatives of an image. In addition, we sharpened the images to counteract blurring by enhancing the values of edge pixels, whose gray values tend to be higher. However, the performance of the models suffered after these pre-processing steps; therefore, we used only normalized images in the models (a sketch of these pre-processing variants is given below). In the VGG16 model, we used dropout layers with p = 0.25, which were later removed as they lowered accuracy. We also attempted freezing the convolutional layers of the VGG models, but the models could not learn anything.
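A minimal sketch of the grayscale, Laplacian, and sharpening variants tried in this ablation, using OpenCV; the file path and kernel values are standard illustrative choices, not the paper's:

```python
import cv2
import numpy as np

img = cv2.imread("sign.png")                    # hypothetical sample image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # RGB-to-grayscale variant
lap = cv2.Laplacian(gray, cv2.CV_64F)           # second-derivative edge detection
sharpen_kernel = np.array([[0, -1, 0],
                           [-1, 5, -1],
                           [0, -1, 0]])
sharp = cv2.filter2D(img, -1, sharpen_kernel)   # edge-enhancing sharpening
```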
Furthermore, figure 11 shows the two-dimensional representation of feature distributions in the four different configurations. The cluster structures of the intra-dataset feature vectors are much better defined due to the angular margin, with evident intra-class compactness and inter-class diversity. However, numerous misclassifications can be observed in the inter-dataset cluster structures, resulting in ambiguous decision boundaries, as the feature vectors of different classes are compacted together. We also computed cosine similarity and Euclidean distance metrics over the feature vectors; the resulting accuracy scores for the intra- and inter-dataset settings are presented in table 12 (a sketch of this check is given below). Both techniques show unsatisfactory performance: the intra- and inter-dataset accuracies are in the ranges of 20-23% and 11-14%, respectively.
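A minimal sketch of a nearest-neighbour check over feature vectors, where each test feature is assigned the class of its closest training feature under cosine similarity or Euclidean distance; variable names are illustrative, and the exact protocol behind table 12 is our assumption:

```python
import torch
import torch.nn.functional as F

def nn_accuracy(train_feats, train_labels, test_feats, test_labels, metric="cosine"):
    """Nearest-neighbour classification accuracy over extracted feature vectors."""
    if metric == "cosine":
        sims = F.normalize(test_feats) @ F.normalize(train_feats).t()
        nearest = sims.argmax(dim=1)                   # highest cosine similarity
    else:
        dists = torch.cdist(test_feats, train_feats)   # pairwise Euclidean distances
        nearest = dists.argmin(dim=1)                  # smallest distance
    return (train_labels[nearest] == test_labels).float().mean().item()
```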

VIII. CONCLUSION
Sign Language recognition has been a popular research field for the deaf and mute community. Although Bangla is a major language with a large deaf and mute population, there has been limited research on Bangla Sign Language (BSL) recognition. This paper demonstrates the importance of generalization in finger-spelled BSL recognition by utilizing several deep learning models and angular loss functions to improve inter-dataset performance. Among all the experiments, we found that the VGG19 architecture with the SphereFace loss function shows the best performance, achieving 55.93% and 47.81% accuracy in the two inter-dataset configurations. However, the inter-dataset results fall far short of those reported in the intra-dataset findings. Our research discovered that while the BdSL dataset may be large enough for recognition tasks, it lacks the diversity that is vital for generalization; likewise, deep learning models tend to underfit on a dataset as small as Ishara-Lipi. To summarize, it is vital to develop a large and diversified BSL dataset that is more generic, can be deployed in real-world systems, and can serve the deaf and mute population.